Genomics in R, Part 1: Intro, Sequence Alignment, and Visualization
Quick Intro
This will be the first of a couple posts I'm planning to make on running some common genomics analyses in R. My advisor develops the DECIPHER package, and I develop the SynExtend package along with a postdoc in our lab. The goal of our tools is to make running common genomics analyses approachable and easy--everything is done entirely within R, so end users don't need to fiddle with bash scripts or lots of extraneous datafiles. Additionally, we develop every aspect of the pipeline ourselves, so no need to worry about whether or not one analysis will produce output in a compatible format for another. The tools in our packages are thoroughly benchmarked, competitive with alternatives, and are the state of the art for several analyses.
This post will cover visualizing and aligning sequences; my next will cover building phylogenetic trees. You can view more in-depth tutorials with example datasets and more at the tutorials page on my website.
Downloading and Installing
All of these analyses will be shown in R, which you can download from the R Webpage. The easiest way to interface with R is using RStudio, downloadable https://posit.co/download/rstudio-desktop/ (includes instructions on installing R).
Once you've installed R, you can download our packages with the following commands:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install('DECIPHER')
BiocManager::install('SynExtend')
I'm going to be working with sequencing data for this tutorial. If you already have FASTA/FASTQ files, feel free to use those! If you don't you can get free sequencing data from NCBI. Here is an example query for sequences of the rpoB
gene, which encodes the RNA polymerase beta subunit. rpoB is pretty conserved in prokaryotic species, and is a great sequence to analyze. Clicking on any of the links will give you more information on the sequence, which you can download as a FASTA with the "Download Datasets" button.
If you're unfamiliar with the FASTA file format, it's a common file format for sequencing data. The format is text based and super simple, the content is in the following form:
> IDENTIFIER1
SEQUENCE1
> IDENTIFIER2
SEQUENCE2
Essentially, each sequence is stored in the file in plain text. The line before each sequence contains a >
character, a space, and then the identifier for that sequence. A blank line separates the sequence entries. An example of a FASTA file for some made up sequence would look like this:
> sequence I made up
ATGCATGCATGC
> another sequence
ATGGGTTAAAGT
Visualizing and Aligning Sequences
The rest of this tutorial will be fairly short, because the methods are super simple to use! The hardest part is reading in the data. I'm going to assume you've saved a fasta file with sequences somewhere like /path/to/sequences.fasta
. We can read this into R with the following:
# Load the library
library(DECIPHER)
path_to_fasta <- "/path/to/sequences.fasta"
seqs <- readDNAStringSet(path_to_fasta)
seqs
That's all there is to it! If you have fastq files, you can use readDNAStringSet(path_to_fasta, format="fastq")
. As you may have guessed from the name of the function, this reads in DNA strings (so sequences with ATGC
). If you have amino acids or proteins, you can read them in the same way using readAAStringSet(...)
.
If you have sequences in a bunch of different files, you can also loop over them and add sequences slowly. For example, if I have a bunch of files all_my_files
, I can do:
library(DECIPHER)
dnaset <- DNAStringSet()
for(f in all_my_files){
dnaset <- c(dnaset, readDNAStringSet(f))
}
dnaset
# If we want to translate it to amino acids, we can just use translate()
translate(dnaset)
All of this functionality is provided by the Biostrings
package. However, we can use some of the functionality in DECIPHER to do some other analyses. For example, viewing a lot of sequences is difficult in the R console, and reading FASTA files is not much better. DECIPHER incorporates a nice viewing window for sequence sets:
# seqs is our DNAStringSet of sequences
BrowseSeqs(seqs)
This should open a webpage allowing you to view the sequences together, along with a consensus sequence below. However, this consensus sequence isn't worth much--typically we'd want to align our sequences before looking at how positions are similar between them. Fret not, for we can also do this in a one line command:
# seqs is still our set of sequences
alignment <- AlignSeqs(seqs)
BrowseSeqs(alignment)
Now we can visualize our alignment. DECIPHER also includes a function AlignTranslation
for aligning DNA/RNA sequences based on their amino acid translation, which is often more accurate for protein coding regions.
Finally, we may want to save our alignment. This is done with the writeXStringSet
command, which writes any type of StringSet (DNA, AA, or anything else) to a FASTA file.
writeXStringSet(alignment, filepath="/path/to/newfile.fasta", append=FALSE)
Note here that if we set append=TRUE
we could have added sequences to an existing FASTA file.
Conclusion
That's it for this post! Stay tuned for the next one, in which I'll dive a little deeper into more complex functions like gene calling and phylogenetic reconstruction. There are lots of other fun tricks with DECIPHER; it's optimized for handling extremely large datasets. If you are interested in using DECIPHER's database interface for working with big biological data, check out the video tutorials available on our website. You can also check out my website for more tutorials and a more in-depth description of the examples from this post (with pictures!).
Hi Aiden, I am just starting my adventures working with/utilizing MSAs. I have reviewed the "Highly accurate protein structure prediction with Alphafold" (circa 2021JUL15) paper. In the methods section of the paper, the authors discuss the use of "jackhammer", HHBlits/HHSearch, MMseqs2, Linclust, and FAMSA to produce MSAs/alignments/search results. How do these tools compare to the referenced DECIPHER and SynExtend tools?