Aidan Lakshman
posted 2 months ago
Ph.D. Candidate at University of Pittsburgh, Dept. of Biomedical Informatics. I specialize in phylogenomics on big data, especially on prokaryotes. Core contributor to the SynExtend R package; I also write tutorials for DECIPHER and SynExtend. Check out my website at!

Genomics in R, Part 3: Gene Calling and Annotation

Suppose you've sent out some strains/samples/whatever for sequencing, and you've just received a fresh dump of sequencing data files. How do you go about finding what's in that sequence? Where do the genes start and end? How do we take a string of nucleotides/amino acids and convert it to real gene names like rpoB or BRCA1?

This tutorial is a continuation of my previous posts on genomics in R, and will demonstrate how to answer the above questions. These posts are focused on using the DECIPHER and SynExtend packages; for more details on setup, background, and genome data formats you can check out my first post. I'll also note that our packages are targeted at prokaryotic analyses; other packages may be more suitable for eukaryotic analysis.

Getting Data

You may already have a genome you're interested in, but in case you don't, it's pretty simple to get sequencing data for free online. If you already have sequencing data, feel free to skip to the next section. NCBI Datasets provide an easy interface for downloading whole genome datafiles. It's easier to illustrate the example using a small genome, so I recommend something like Micrococcus luteus or Staphylococcus aureus, both of which have relatively small genomes. I'm going to be using a M. luteus genome for these examples.

To use the browser, simply type in the name of the organism to search for. For this analysis we're looking for complete genomes, so we want to avoid using datafiles with incomplete assemblies. Once you've searched for a genome, you can open additional options by clicking the "Filters" button. I recommend checking "Reference Genomes" and setting the Assembly Level to "Complete".

Once you've found a genome, click on it to view further information. This should take you to a page where you can download the sequences--the file format you're looking for is "Genome sequences (FASTA)".

Gene Calling

The first step of the process is gene calling--this identifies where the actual protein-encoding regions in the genome are. I'll start by loading DECIPHER and the genome file:


# You can use readAAStringSet if using an amino acid sequence
myGenome <- readDNAStringSet('/path/to/genome.fasta')

Finding the coding regions in the genome is super simple:

geneLocations <- FindGenes(myGenome)
genes <- ExtractGenes(geneLocations, myGenome, type="AAStringSet")
## AAStringSet object of length 2275:
##        width seq
##    ...   ... ...

FindGenes() will find locations of the protein coding genes in the genome, and ExtractGenes uses those locations to return nucleotide or amino acid sequences for each gene in the set. If you'd prefer nucleotide sequences, you can set type="DNAStringSet" instead of the above block.

Of note is that FindGenes assumes each individual gene does not have introns or frameshifts, which makes it very well suited to prokaryotic genomes. It can work with eukaryotic genomes, but some genes may be missed. However, you don't need to use FindGenes for the next step--as long as you have gene sequences (from DECIPHER or any other program), you can pass them directly into the analysis below.

Annotating Genes

Now we have coding sequences--the next step is to figure out what they are! DECIPHER implements the IDTAXA algorithm for protein annotation, which has been shown to outperform BLAST and HMMER. You can read more about the algorithm and benchmarks here.

IDTAXA requires a training set, which can be downloaded from the DECIPHER website. These files are found at the bottom under "Training Sets for Functional Classification (amino acids)". Pretrained datasets are available for a wide variety of organisms across the tree of life--I recommend choosing the most specific dataset for your genome (e.g. choosing KEGG Animals will produce better results for human genome annotation than KEGG All [Subsampled]). Each of these are .RData files.

Once you have a training set, classification is pretty simple:

# loads `trainingSet`

annotations <h IdTaxa(genes, trainingSet)
##   A test set of class 'Taxa' with length 10
##      confidence taxon
##  [1]        99% Root; 09130 Environmental Information Processing; 09132 Signa...
##  [2]       100% Root; 09120 Genetic Information Processing; 09124 Replication...
##  [3]        99% Root; 09120 Genetic Information Processing; 09124 Replication...
##  [4]        28% Root; unclassified_Root...                                      
##  [5]        96% Root; 09180 Brite Hierarchies; 09182 Protein families: geneti...
##  [6]       100% Root; 09180 Brite Hierarchies; 09182 Protein families: geneti...
##  [7]        35% Root; unclassified_Root...                                      
##  [8]        89% Root; unclassified_Root...                                      
##  [9]        46% Root; unclassified_Root...                                      
## [10]        28% Root; unclassified_Root...

annotations will hold classifications corresponding to the sequences provided, so the first entry is the annotation for the first gene, the second for the second gene, etc. We can get more information about a single classification by looking at it:

# look at the first annotation
## $taxon
## [1] "Root"                                                   
## [2] "09130 Environmental Information Processing"             
## [3] "09132 Signal transduction"                              
## [4] "02020 Two-component system [PATH:ko02020]"              
## [5] "K02313  dnaA, chromosomal replication initiator protein"
## $confidence
## [1] 100 100 100 100 100

We can also plot the distribution of classifications:

plot(annotations, trainingSet)



Using DECIPHER, you can find and annotate genes with just a few lines of code. This writeup is just a quick summary of the features offered--you can check out my longer tutorial for images, example datafiles, and more. You can also look at the Gene Finding Vignette for an even more in-depth tour of the features and functionality of FindGenes.

Keith Loritz2 months ago

Thanks Aidan; another excellent set of tools for the SynBio student/researcher!

Login or Signup to leave a comment