Aidan Lakshman
posted 2 months ago
Ph.D. Candidate at University of Pittsburgh, Dept. of Biomedical Informatics. I specialize in phylogenomics on big data, especially on prokaryotes. Core contributor to the SynExtend R package; I also write tutorials for DECIPHER and SynExtend. Check out my website at https://www.ahl27.com!
Method

Genomics in R, Part 1: Intro, Sequence Alignment, and Visualization

Quick Intro

This will be the first of a couple posts I'm planning to make on running some common genomics analyses in R. My advisor develops the DECIPHER package, and I develop the SynExtend package along with a postdoc in our lab. The goal of our tools is to make running common genomics analyses approachable and easy--everything is done entirely within R, so end users don't need to fiddle with bash scripts or lots of extraneous datafiles. Additionally, we develop every aspect of the pipeline ourselves, so no need to worry about whether or not one analysis will produce output in a compatible format for another. The tools in our packages are thoroughly benchmarked, competitive with alternatives, and are the state of the art for several analyses.

This post will cover visualizing and aligning sequences; my next will cover building phylogenetic trees. You can view more in-depth tutorials with example datasets and more at the tutorials page on my website.

Downloading and Installing

All of these analyses will be shown in R, which you can download from the R Webpage. The easiest way to interface with R is using RStudio, downloadable https://posit.co/download/rstudio-desktop/ (includes instructions on installing R).

Once you've installed R, you can download our packages with the following commands:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install('DECIPHER')
BiocManager::install('SynExtend')

I'm going to be working with sequencing data for this tutorial. If you already have FASTA/FASTQ files, feel free to use those! If you don't you can get free sequencing data from NCBI. Here is an example query for sequences of the rpoB gene, which encodes the RNA polymerase beta subunit. rpoB is pretty conserved in prokaryotic species, and is a great sequence to analyze. Clicking on any of the links will give you more information on the sequence, which you can download as a FASTA with the "Download Datasets" button.

If you're unfamiliar with the FASTA file format, it's a common file format for sequencing data. The format is text based and super simple, the content is in the following form:

> IDENTIFIER1
SEQUENCE1

> IDENTIFIER2
SEQUENCE2

Essentially, each sequence is stored in the file in plain text. The line before each sequence contains a > character, a space, and then the identifier for that sequence. A blank line separates the sequence entries. An example of a FASTA file for some made up sequence would look like this:

> sequence I made up
ATGCATGCATGC

> another sequence
ATGGGTTAAAGT

Visualizing and Aligning Sequences

The rest of this tutorial will be fairly short, because the methods are super simple to use! The hardest part is reading in the data. I'm going to assume you've saved a fasta file with sequences somewhere like /path/to/sequences.fasta. We can read this into R with the following:

# Load the library
library(DECIPHER)

path_to_fasta <- "/path/to/sequences.fasta"
seqs <- readDNAStringSet(path_to_fasta)
seqs

That's all there is to it! If you have fastq files, you can use readDNAStringSet(path_to_fasta, format="fastq"). As you may have guessed from the name of the function, this reads in DNA strings (so sequences with ATGC). If you have amino acids or proteins, you can read them in the same way using readAAStringSet(...).

If you have sequences in a bunch of different files, you can also loop over them and add sequences slowly. For example, if I have a bunch of files all_my_files, I can do:

library(DECIPHER)

dnaset <- DNAStringSet()
for(f in all_my_files){
  dnaset <- c(dnaset, readDNAStringSet(f))
}
dnaset

# If we want to translate it to amino acids, we can just use translate()
translate(dnaset)

All of this functionality is provided by the Biostrings package. However, we can use some of the functionality in DECIPHER to do some other analyses. For example, viewing a lot of sequences is difficult in the R console, and reading FASTA files is not much better. DECIPHER incorporates a nice viewing window for sequence sets:

# seqs is our DNAStringSet of sequences
BrowseSeqs(seqs)

This should open a webpage allowing you to view the sequences together, along with a consensus sequence below. However, this consensus sequence isn't worth much--typically we'd want to align our sequences before looking at how positions are similar between them. Fret not, for we can also do this in a one line command:

# seqs is still our set of sequences
alignment <- AlignSeqs(seqs)
BrowseSeqs(alignment)

Now we can visualize our alignment. DECIPHER also includes a function AlignTranslation for aligning DNA/RNA sequences based on their amino acid translation, which is often more accurate for protein coding regions.

Finally, we may want to save our alignment. This is done with the writeXStringSet command, which writes any type of StringSet (DNA, AA, or anything else) to a FASTA file.

writeXStringSet(alignment, filepath="/path/to/newfile.fasta", append=FALSE)

Note here that if we set append=TRUE we could have added sequences to an existing FASTA file.

Conclusion

That's it for this post! Stay tuned for the next one, in which I'll dive a little deeper into more complex functions like gene calling and phylogenetic reconstruction. There are lots of other fun tricks with DECIPHER; it's optimized for handling extremely large datasets. If you are interested in using DECIPHER's database interface for working with big biological data, check out the video tutorials available on our website. You can also check out my website for more tutorials and a more in-depth description of the examples from this post (with pictures!).

6
Keith Loritz2 months ago

Hi Aiden, I am just starting my adventures working with/utilizing MSAs. I have reviewed the "Highly accurate protein structure prediction with Alphafold" (circa 2021JUL15) paper. In the methods section of the paper, the authors discuss the use of "jackhammer", HHBlits/HHSearch, MMseqs2, Linclust, and FAMSA to produce MSAs/alignments/search results. How do these tools compare to the referenced DECIPHER and SynExtend tools?

Aidan Lakshman2 months ago

Hey Keith, thanks for your comment. You can check out the original publication for DECIPHER's alignment algorithm at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0749-z, but in short, our algorithm produces more accurate alignments than MAFFT, CLUSTAL-Omega, MUSCLE, and other alternatives.

HHblits/HHSearch aren't used for MSA, they're more for homology detection. SynExtend incorporates IdTaxa, which is currently state of the art for protein classification (see https://academic.oup.com/nargab/article/3/3/lqab080/6371227). You can use that for homology searches (to see if multiple proteins match to the same annotation), or you can use NucleotideOverlap() from SynExtend to directly look for homology using kmer frequencies. We're working on benchmarking that against other algorithms, but the algorithmic process is very similar to the algorithms you referenced.

Linclust is a sequence clustering algorithm--DECIPHER recently added the Clusterize() function, which scales linearly and significantly outperforms Linclust in accuracy, especially at large numbers of sequences. That paper is currently under review, but should be published shortly.

My next few posts will cover these in more detail!

Login or Signup to leave a comment