Tutorial: Downloading Sequence Data and Gene Annotation
The page where you download sequence data can be accessed from the main navigation at the top of the screen: from the "Genomic Data" menu, select the last item "Download".

You have now reached the page with links to data files for seven strains of TB and 26 related organisms. At the top of the page are links to raw sequence data in FASTA format, with one row for each organism. Each column represents a different file format. Choose your preferred compression scheme, then click the arrow symbol to start the download.
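Once a sequence file is downloaded, it can be read with a few lines of Python. The following is a minimal sketch, not part of the site itself: it assumes the download is either plain text or gzip-compressed (the compression schemes actually offered may differ), and the helper names read_fasta and open_maybe_gzip are hypothetical. The demonstration uses a small in-memory record rather than a real download.

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if the name ends in .gz.

    Assumption: gzip is the compression scheme; adapt for other schemes.
    """
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# In-memory demonstration with made-up sequences.
example = io.StringIO(">seq1 demo record\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

For a real file, the same generator can be used as `with open_maybe_gzip(path) as fh: records = list(read_fasta(fh))`, where `path` is wherever you saved the download.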
For each genome, files are presented in a number of formats to facilitate various analyses:
- .fasta: a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
- .gtf: The Gene transfer format (GTF) is a file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF), but contains some additional conventions specific to gene information.
- .agp: A file that describes how primary sequences can be assembled to make a non-redundant, contiguous sequence. The sequence being assembled may be a contig or a chromosome. For more information about the file specification, see the format definition page.
- .txt: tab-delimited text file, best viewed in a spreadsheet program to allow easy sorting.
For example, the file "annotation_summary_per_gene.txt" shows one row for each gene, with all associated features grouped together by category (such as PFAM domain or KEGG pathway); multiple features in any category are separated by commas. In contrast, the file "annotation_summary.txt" shows multiple rows for genes that have multiple features in any category. A gene that is associated with four KEGG pathways, for instance, is listed with one row for each pathway, so you can sort by KEGG ID.
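The relationship between the two layouts can be illustrated with a short Python sketch that expands a comma-separated column into one row per value. This is an illustration only: the column names gene_id and kegg_pathway, the gene and pathway identifiers, and the helper expand_rows are hypothetical, and the real files may use different headers.

```python
import csv
import io


def expand_rows(rows, column, sep=","):
    """Yield one copy of each row per value in a separator-delimited column.

    Rows whose column is empty are passed through unchanged.
    """
    for row in rows:
        values = [v.strip() for v in row[column].split(sep) if v.strip()]
        if not values:
            yield dict(row)
            continue
        for value in values:
            new_row = dict(row)
            new_row[column] = value
            yield new_row


# Made-up data in the one-row-per-gene layout (tab-delimited).
per_gene = io.StringIO(
    "gene_id\tkegg_pathway\n"
    "geneA\tpath1,path2,path3,path4\n"
    "geneB\tpath5\n"
    "geneC\t\n"
)

reader = csv.DictReader(per_gene, delimiter="\t")
expanded = list(expand_rows(reader, "kegg_pathway"))

# With one pathway per row, sorting by pathway ID is straightforward.
expanded.sort(key=lambda r: r["kegg_pathway"])
for row in expanded:
    print(row["gene_id"], row["kegg_pathway"], sep="\t")
```

Here the hypothetical geneA, with four pathways in a single cell, becomes four rows, which mirrors the difference between the two summary files described above.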