Genomics in R, Part 2: Phylogenetic Reconstruction
My last post covered getting started with R and using the
DECIPHER package for visualizing and aligning sequences. In this post, I'd like to cover a really popular topic in genomics: phylogenetic reconstruction.
What Is Phylogenetic Reconstruction?
Evolutionary biologists are very interested in charting the evolutionary history of an organism or group of organisms. If we could "turn back time" to see how each individual organism has changed over millenia, we could infer what ancient organisms looked like and what selective pressures have led to the diversity of life we currently see.
The goal of phylogenetic reconstruction is to build a phylogenetic tree that accurately shows the evolutionary trajectory and ancestral relationships of a group of things. These things can be organisms, species, or even just gene segments.
Trees can be rooted, meaning the tree has a root corresponding to the most recent common ancestor of all the leaves, or unrooted, meaning the tree shows evolutionary relationships without making assumptions about ancestry between leaves.
How do we build a tree?
Determining a phylogenetic tree is a complex problem. Imagine we have
n genomes corresponding to
n organisms; how could we determine a tree that fits their evolutionary history?
The simplest algorithm would involve looking at all possible trees, and seeing how well the data we have fit the trees that are possible. Then we just return the tree that has the best match. Unforunately, the number of possible unrooted binary trees for a set of
n things is
(2n-5)!!, and possible rooted binary trees is
(2n-3)!!. The double factorial indicates multiplying by all lesser values of equivalent parity (even/odd), so
6!! = 6*4*2 and
7!!=7*5*3*1. This means that for seven species we have 945 unrooted trees to check, and at ten species we have over two million. If we wanted to root the tree, we'd have almost 35 million trees at just ten species. Clearly, checking every possible tree isn't super practical.
To solve this problem, a multitude of algorithms for constructing trees have been created. The most popular algorithms today tend to fall into four broad categories: ultrametric, neighbor-joining (NJ), maximum parsimony (MP), and maximum likelihood (ML). Ultrametric methods typically use a hierarchical clustering strategy to group species into larger and larger clusters, which imply the leaf partitions in a phylogenetic tree. NJ methods perform a similar approach, but cluster from the top-down rather than the bottom-up. MP methods iteratively create trees by trying to create a tree that minimizes evolutionary changes along the branches, and ML methods create trees by finding the topology that is most likely (probabilistically) to have occured for a given evolutionary model.
This description is super broad and I've omitted a lot of details. I've written an in-depth tutorial on the mathematics behind tree building on my website, so check that out if you're interested in the nitty-gritty with some visual examples. I'm still working on the ML description, but all other methods are described in detail.
Currently, there are a lot of programs to construct phylogenetic trees. I'm going to stick with using
DECIPHER since it's the focus of my blog posts, but I'll mention alternative software at the end.
I'm going to use a set of
rpoB sequences I found online, but you can do this with any set of sequences you have. The sequences I'm using come from this link. Downloading the
rpoB_gene_alignment.fasta file will give you 174 sequences, all about 500 nucleotides long.
Just to review, we can read in these sequences and view them using DECIPHER:
library(DECIPHER) seqs <- readDNAStringSet("rpoB_gene_alignment.fasta") BrowseSeqs(seqs) # Seqences are already aligned, but if they weren't: # seqs <- AlignSeqs(seqs)
As mentioned in the codeblock above, these sequences are already aligned. If you're using your own sequences, you need to be sure to align them prior to constructing a tree.
Once we have an alignment, constructing a tree with DECIPHER is very easy:
tree <- TreeLine(seqs) plot(tree)
And that's it! It's also worth mentioning that this returns a
dendrogram object, which is essentially a tree in R. This can be saved as an
RData file, but many other programs expect trees in something called Newick Format. DECIPHER also incorporates a tool to convert dendrograms into Newick format for use with other software:
The default arguments are specified in such a way that the users don't have to think too hard about how to configure the function. By default
TreeLine, creates an ML tree and automatically determines the evolutionary model that works best for the data you provide. If you're interested in constructing trees with different algorithms, you can specify those in the
# ML Tree tree <- TreeLine(seqs, method='ML') # MP Tree tree <- TreeLine(seqs, method='MP') # NJ and ultrametric trees require a distance matrix distMat <- DistanceMatrix(seqs) # NJ Tree tree <- TreeLine(myDistMatrix=distMat, method='NJ') # Ultrametric Tree tree <- TreeLine(myDistMatrix=distMat, method='UPGMA')
It's also possible to specify the exact evolutionary models you want
TreeLine to use. For more information, check out the vignette written about this function.
Other software available
As mentioned before, there are other software available for creating phylogenetic trees. The most popular algorithms currently are RAxML, IQTREE, FastTree, PhyML, and MEGA-X.
Of these, DECIPHER and RAxML show the best performance, whereas FastTree runs the fastest. IQTREE is less accurate than DECIPHER/RAxML but more than FastTree, and slower than FastTree but faster than DECIPHER/RAxML. DECIPHER is notable in that it supports MP trees, whereas many other programs only support ML.
Conclusion and Links
Thanks for following along! My next post will cover gene calling and annotation using DECIPHER and SynExtend, so stay tuned!
Links from this post:
Sorry if anyone saw this multiple times--seems like it posted more than once due to a bug.