Genomics in R, Part 3: Gene Calling and Annotation
Suppose you've sent out some strains/samples/whatever for sequencing, and you've just received a fresh dump of sequencing data files. How do you go about finding what's in that sequence? Where do the genes start and end? How do we take a string of nucleotides/amino acids and convert it to real gene names like
This tutorial is a continuation of my previous posts on genomics in R, and will demonstrate how to answer the above questions. These posts are focused on using the
SynExtend packages; for more details on setup, background, and genome data formats you can check out my first post. I'll also note that our packages are targeted at prokaryotic analyses; other packages may be more suitable for eukaryotic analysis.
You may already have a genome you're interested in, but in case you don't, it's pretty simple to get sequencing data for free online. If you already have sequencing data, feel free to skip to the next section. NCBI Datasets provide an easy interface for downloading whole genome datafiles. It's easier to illustrate the example using a small genome, so I recommend something like Micrococcus luteus or Staphylococcus aureus, both of which have relatively small genomes. I'm going to be using a M. luteus genome for these examples.
To use the browser, simply type in the name of the organism to search for. For this analysis we're looking for complete genomes, so we want to avoid using datafiles with incomplete assemblies. Once you've searched for a genome, you can open additional options by clicking the "Filters" button. I recommend checking "Reference Genomes" and setting the Assembly Level to "Complete".
Once you've found a genome, click on it to view further information. This should take you to a page where you can download the sequences--the file format you're looking for is "Genome sequences (FASTA)".
The first step of the process is gene calling--this identifies where the actual protein-encoding regions in the genome are. I'll start by loading
DECIPHER and the genome file:
library(DECIPHER) # You can use readAAStringSet if using an amino acid sequence myGenome <- readDNAStringSet('/path/to/genome.fasta')
Finding the coding regions in the genome is super simple:
geneLocations <- FindGenes(myGenome) genes <- ExtractGenes(geneLocations, myGenome, type="AAStringSet") genes
## AAStringSet object of length 2275: ## width seq ##  516 VVADQAVLSSWRSVVGSLEDDARVSARLMGFV...RKIRELMAERRTIYNQVTELTNEIKRKQRGA* ##  372 VKFTVERDILTDAVSWAARSLSPRPPVPVLSG...SAPKPALLTGVNQEDGVVSDYRYLVMPVRIA* ##  405 VYLSHLTVADFRSYRWADLELTPGSTVLLGAN...RQDAEGSEAVAAAAPVEGDIREPRREGGADG* ##  188 MAETPAPFEPDRPDLALVQLRRVREAARERGE...TRIEVAGPQAPSWRKGPRTVRGGRGPRDTYG* ##  721 VVDAMPENPAEEPTAASAAPNPEAVPDAVGQP...DAVFSMLMGEDVESRRTFIQQNAKDIRFLDV* ## ... ... ... ##  195 MSAENVTPAPEAEDAVVETPAGQGSRVAEQDD...KIVHDVVADAGLTSESEGEGARRHVVISADD* ##  333 MFEAILSPFRWLMSWLLGAFHSILEFAGLPAD...QAGRGGAVLNGEVVRQSGQRVQPQRKNRRRK* ##  123 MTVTPSPFVVLEPSREWGPLRALPSALLAGLL...PPGHRRWPPGRQPRILALNHPPIPPDLPQED* ##  133 VLPRDRRVRTPAEFRHLGRTGTRAGRRTVVVS...AEADYALLRRETVGALGKALKPHLPAASEHA* ##  46 VTKRTFQPNNRRRARKHGFRARMRTRAGRAILSARRGKNRAELSA*
FindGenes() will find locations of the protein coding genes in the genome, and
ExtractGenes uses those locations to return nucleotide or amino acid sequences for each gene in the set. If you'd prefer nucleotide sequences, you can set
type="DNAStringSet" instead of the above block.
Of note is that
FindGenes assumes each individual gene does not have introns or frameshifts, which makes it very well suited to prokaryotic genomes. It can work with eukaryotic genomes, but some genes may be missed. However, you don't need to use
FindGenes for the next step--as long as you have gene sequences (from
DECIPHER or any other program), you can pass them directly into the analysis below.
Now we have coding sequences--the next step is to figure out what they are!
DECIPHER implements the
IDTAXA algorithm for protein annotation, which has been shown to outperform BLAST and HMMER. You can read more about the algorithm and benchmarks here.
IDTAXA requires a training set, which can be downloaded from the DECIPHER website. These files are found at the bottom under "Training Sets for Functional Classification (amino acids)". Pretrained datasets are available for a wide variety of organisms across the tree of life--I recommend choosing the most specific dataset for your genome (e.g. choosing
KEGG Animals will produce better results for human genome annotation than
KEGG All [Subsampled]). Each of these are
Once you have a training set, classification is pretty simple:
load('/path/to/trainingSet.RData') # loads `trainingSet` annotations <h IdTaxa(genes, trainingSet) annotations
## A test set of class 'Taxa' with length 10 ## confidence taxon ##  99% Root; 09130 Environmental Information Processing; 09132 Signa... ##  100% Root; 09120 Genetic Information Processing; 09124 Replication... ##  99% Root; 09120 Genetic Information Processing; 09124 Replication... ##  28% Root; unclassified_Root... ##  96% Root; 09180 Brite Hierarchies; 09182 Protein families: geneti... ##  100% Root; 09180 Brite Hierarchies; 09182 Protein families: geneti... ##  35% Root; unclassified_Root... ##  89% Root; unclassified_Root... ##  46% Root; unclassified_Root... ##  28% Root; unclassified_Root...
annotations will hold classifications corresponding to the sequences provided, so the first entry is the annotation for the first gene, the second for the second gene, etc. We can get more information about a single classification by looking at it:
# look at the first annotation annotations[]
## $taxon ##  "Root" ##  "09130 Environmental Information Processing" ##  "09132 Signal transduction" ##  "02020 Two-component system [PATH:ko02020]" ##  "K02313 dnaA, chromosomal replication initiator protein" ## ## $confidence ##  100 100 100 100 100
We can also plot the distribution of classifications:
DECIPHER, you can find and annotate genes with just a few lines of code. This writeup is just a quick summary of the features offered--you can check out my longer tutorial for images, example datafiles, and more. You can also look at the Gene Finding Vignette for an even more in-depth tour of the features and functionality of
Thanks Aidan; another excellent set of tools for the SynBio student/researcher!