New User Guide
This guide walks you through the site using a worked example: searching for a particular gene, viewing the annotation information for that gene, and retrieving the corresponding expression data.
Outline
- Finding a Gene with Quicksearch
- What Information is Provided for each Gene?
- Downloading Sequence Data and Annotation
What is TBDB?
The Tuberculosis Database (TBDB) is an integrated database providing access to TB genomic data and resources relevant to the discovery and development of TB drugs, vaccines and biomarkers. The current release of TBDB houses genome sequence data and annotations for 28+ different Mycobacterium tuberculosis strains and related bacteria. TBDB also stores pre- and post-publication gene-expression data from M. tuberculosis and its close relatives, and currently hosts data for nearly 1500 public tuberculosis microarrays and 260 arrays for Streptomyces. In addition, TBDB provides access to a suite of comparative genomics and microarray analysis software.
Finding a Gene with Quicksearch
Suppose you are studying DosR, the transcription factor known to regulate the hypoxic response of Mycobacterium tuberculosis (Park HD et al., Mol Microbiol. 2003 May;48(3):833-43). Simply enter 'DosR' into either search field on the TB Database home page.
You will see the search results as below:
Alternatively, enter the ORF identifier, 'Rv3133c':
Clicking the first item takes you to the returned results from the Mtb strain H37Rv:

Click on the entry with the highest relevance score, and you will see the gene details page for Rv3133c:
The Gene Detail Page
What information is provided for each gene in TBDB?
Downloading Sequence Data and Gene Annotation
The download page can be reached from the main navigation at the top of the screen: from the "Genomic Data" menu, select the last item, "Download".
You have now reached the page with links to data files for seven strains of TB and 26 related organisms. At the top of the page are links to raw sequence data in FASTA format, with one row for each organism. Each column represents a different file format. Choose your preferred compression scheme, then click on the arrow symbol to start the download.
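Once a sequence file has been downloaded, it can be read with a few lines of Python. The sketch below is only an illustration and is not part of TBDB: it assumes the file was downloaded with gzip compression, and the function names read_fasta and read_fasta_gz are hypothetical. It relies only on the general FASTA convention that each record starts with a header line beginning with ">".

```python
import gzip
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open text handle in FASTA format."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_gz(path):
    """Read all records from a gzip-compressed FASTA file (assumed compression)."""
    with gzip.open(path, "rt") as handle:
        return list(read_fasta(handle))


# Small in-memory demonstration with made-up sequences (not TBDB data).
demo = ">seq1 hypothetical example\nATGC\nGGTA\n>seq2\nTTAA\n"
records = list(read_fasta(io.StringIO(demo)))
```

For a real download, call read_fasta_gz with the path of the saved file. If you chose a different compression scheme, open the file with the matching tool instead of gzip.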
For each genome, files are presented in a number of formats to facilitate various analyses:
- .fasta: a text-based format for representing either nucleic acid sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences.
- .gtf: The Gene transfer format (GTF) is a file format used to hold information about gene structure. It is a tab-delimited text format based on the general feature format (GFF), but contains some additional conventions specific to gene information.
- .agp: A file that describes how primary sequences can be assembled to make a non-redundant, contiguous sequence. The sequence being assembled may be a contig or a chromosome. For more information about the file specification, see the format definition page.
- .txt: tab-delimited text file, best viewed in a spreadsheet program to allow easy sorting.
For example, the file "annotation_summary_per_gene.txt" shows one row for each gene, with all associated features grouped together by category (such as PFAM domain or KEGG pathway). Where a gene has multiple features in a category, they are separated by commas. In contrast, the file "annotation_summary.txt" shows multiple rows for genes that have multiple features in any category. For instance, a gene that is associated with four KEGG pathways is listed with one row for each pathway, so you can sort by KEGG ID.
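The relationship between the two layouts can be sketched in Python. This is an illustration under stated assumptions, not a description of how TBDB generates its files: the column names LOCUS and KEGG, the helper explode_column, and the sample values are all hypothetical, so check the header row of the real file before adapting it.

```python
import csv
import io


def explode_column(rows, column, sep=","):
    """Yield one row per value found in a separator-delimited column."""
    for row in rows:
        values = [v.strip() for v in row[column].split(sep) if v.strip()]
        if not values:
            # Keep genes that have no feature in this category.
            yield dict(row)
            continue
        for value in values:
            new_row = dict(row)
            new_row[column] = value
            yield new_row


# Made-up tab-delimited data in the one-row-per-gene layout.
per_gene = "LOCUS\tKEGG\nGeneA\tpath1,path2\nGeneB\t\n"
reader = csv.DictReader(io.StringIO(per_gene), delimiter="\t")
expanded = list(explode_column(reader, "KEGG"))

# With one row per pathway, sorting by the KEGG column is straightforward.
by_kegg = sorted(expanded, key=lambda r: r["KEGG"])
```

Here GeneA, which lists two pathways in a single cell, becomes two rows, while GeneB keeps its single row with an empty KEGG value.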