The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI is located in Bethesda, Maryland, and was founded in 1988.

The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature. Other databases include the NCBI Epigenomics database. All these databases are available online through the Entrez search engine.

Nucleotide annotation

The first step of nucleotide annotation is to find a sequence that has the features of a gene. Many eukaryotic genes contain specific features, such as introns that separate exons, that can serve as markers for the discovery process. Therefore, it is important to develop a software program that properly recognizes such features. A number of programs are available that perform these searches. A key component of each of these programs is a set of sensor algorithms that identify the key structural features. For genes, these would include, for example, introns that are defined by the consensus splice site junctions (GT…AG). The program might also include other sensors that detect a transcriptional start site or recognize specific GC content. Potential genes are discovered by scanning the DNA sequence in all six possible reading frames (three on each strand), so that all possible genes are recognized for further analysis.
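
The six-frame scan and the GC content sensor described above can be illustrated with a short Python sketch. It is a deliberate simplification and not the method of any particular gene-finding program: it treats a candidate gene as a plain open reading frame (an ATG followed in frame by a stop codon), ignores introns and splice sites, and uses hypothetical function names (`reverse_complement`, `gc_content`, `find_orfs`) and a hypothetical `min_codons` threshold chosen only for illustration.

```python
from typing import List, Tuple

# Translation table for complementing DNA bases (assumes only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, str]]:
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns (strand, frame, orf_sequence) tuples, where strand is "+" or "-"
    and frame is the 0-based offset on that strand. min_codons is an
    illustrative length cutoff that counts the start and stop codons.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the current candidate start codon
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if len(orf) // 3 >= min_codons:
                        results.append((strand, frame, orf))
                    start = None  # resume the search after this stop codon
    return results


if __name__ == "__main__":
    for strand, frame, orf in find_orfs("CCATGAAATTTTAGCC"):
        print(strand, frame, orf, round(gc_content(orf), 2))
```

A real gene finder would combine several such sensors, typically within a statistical model, instead of relying on a single open reading frame test.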

Once a sequence has been defined as a gene, the next step is to name it. The naming of genes relies upon the significant amount of research that predated genome projects. Historically, this research took a gene-by-gene approach, cloning and characterizing individual genes that were of interest to a specific research group. For example, many of the proteins involved in the housekeeping processes of a cell have been characterized at the nucleotide and protein levels. This information is stored in large databases such as GenBank and Swiss-Prot. Therefore, with a specific sequence highlighted as a potential gene, the next step is to determine whether that sequence resembles some other known gene or protein.


Naming the genes

The software tool most often used to annotate (or name) a gene is BLAST, which stands for Basic Local Alignment Search Tool. This series of computer programs looks for sequence similarities. Typically, a database such as GenBank or Swiss-Prot is searched to uncover sequences that are similar to the query. The query can be either a protein or a nucleotide sequence, and the database can be either a nucleotide or a protein database.
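
To make the idea of a local alignment search concrete, the sketch below scores a query against each entry of a small in-memory database and ranks the entries by score. It is not BLAST: BLAST uses a faster heuristic, whereas this sketch uses the exhaustive Smith-Waterman dynamic programming recurrence with a linear gap penalty. The function names, the scoring values (`match`, `mismatch`, `gap`) and the example database entries are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b.

    Only the score is returned, so two rows of the matrix are enough.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def rank_database(query: str, database: Dict[str, str]) -> List[Tuple[str, int]]:
    """Score the query against every entry and sort by descending score."""
    scores = [(name, local_alignment_score(query, seq))
              for name, seq in database.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical database entries, invented for this example.
    db = {"entry_1": "TTACGTTT", "entry_2": "CCCCCCCC"}
    print(rank_database("ACGT", db))
```

The highest-scoring entries are the candidates whose names and annotations would then be considered for the query sequence.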

A critical database used to determine whether a sequence is indeed a gene contains expressed sequence tag (EST) sequences. ESTs are DNA sequences of expressed genes that are represented in a cDNA library. The data are collected by end sequencing (usually from the 3’ end) a large collection of clones representing transcripts expressed under a specific developmental or environmental condition. Because these sequences are expressed, they must have been transcribed from a functioning gene. Therefore, the predicted genes are also used as BLAST queries against an EST database.



