Zachary Martinez
posted 14 days ago
Pro Lab TipMethod

Top 10 most useful (to me) commands when working with FASTA files

Working with FASTA files is a daily task for many biologists and has been a daily task of mine for a few years now. I wrote this quick post to just provide some examples on how to manipulate FASTAs using SeqKit.

Due to the advent of next-gen sequencing, humanity has amassed petabases of data and it appears that bioinformatics is becoming more useful even to pure experimentalists. However, there are many software packages and methods for handling sequencing data such as BioPython and SAMtools. While I use BioPython often in my software, I often don’t want to write a one-off python script for every little task I have.

Thankfully I heard of SeqKit, a platform agnostic command-line tool that can efficiently manipulate FASTA files (https://bioinf.shenwei.me/seqkit/). I definitely recommend checking out the docs, they are pretty extensive with lots of examples and functionality that I don’t cover here! These following commands I use on a weekly basis, with some I use almost daily. Seqkit is pretty easy to install, with the most straightforward way being conda.

$ conda install -c bioconda seqkit

In the following examples, I’ll be using two fasta files, VP1s.fasta and phix174.fasta. The former is 4,007 major capsid proteins from Microviridae and the latter is the genome of phi-X 174 (also part of Microviridae), the first sequenced DNA genome! You can get the files from https://github.com/martinez-zacharya/seqkit_scifind_top10.

  1. stats I use the “stats” subcommand on a daily basis to quickly inspect FASTA files from the command line. It gives you some simple summary statistics, with options for more info such as N50 and GC% by using -a. You can also pipe output from other bash command to stats, giving it lots of flexibility and utility.

    $ seqkit stats VP1s.fasta

  2. seq If you need to filter FASTA files according to size, you can use the “seq” subcommand. You can also use “seq” to convert DNA to and from RNA, as well as filter FASTQ files by quality. Notice how I piped the output of the “seq” subcommand to “stats”! Also note that in order to save a file with the output of the seqkit subcommand, you need to use “>”.

    $ seqkit seq -m 500 -M 1000 VP1s.fasta | seqkit stats $ seqkit seq -m 500 -M 1000 VP1s.fasta > filtered_VP1s.fasta

  3. split Since my research involves machine learning, I often need to split sequences into training and test sets. You can split by proportion, by number of sequences, by header or even by subsequences! Here I simply split the file into four parts and then save 75% for training and the rest for testing. The “split” subcommand creates a subdirectory “VP1s.fasta.split” that now contains the 4 parts, while leaving the original VP1s.fasta file unchanged.

    $ seqkit split -p 4 VP1s.fasta $ mv VP1s.fasta.split/VP1s.part_001.fasta ./VP1s_train.fasta $ cat VP1s.fasta.split/*.fasta > VP1s_test.fasta $ rm -r VP1s.fasta.split $ seqkit stats *.fasta

    In the lines of code above, the 3 lines in between the seqkit command are for renaming, concatenating and deleting files. Specifically, the “mv” command both moves and renames VP1s.part_001.fasta, while the “cat” command concatenates the rest of the fasta files into VP1s_test.fasta. Lastly, the “rm” command is used to remove (delete) the split directory. By using -r, I can recursively delete a directory and all of its files.

  4. grep If you’ve used the command-line, you’ve probably heard of grep, which is a utility to search through plain-text. Seqkit has its own version of grep specifically for FASTA files, which I definitely recommend over normal grep. You can search through the entire FASTA or specifically the sequences with -s or headers with -n. You can also inverse search by using -v, which would return only lines that don’t match. For instance, if I wanted to know how many proteins I have from a certain family, I could grep search for the family name in the headers and then pipe the output to SeqKit stats. Note that unless the pattern is an exact match, you must use -r to support partly matching. However, keep in mind you might end up with false positives, as shown below!

    $ seqkit grep -n -r -p Family1 VP1s.fasta | seqkit stats $ seqkit grep -n -r -p Family1_ VP1s.fasta | seqkit stats

    Notice how the two lines resulted in different outputs! The first grep search would match Family1, Family10, Family11 etc. The second grep only matched Family1, because it followed with an underscore and that is how the headers are formatted.

  5. fx2tab and tab2fx I often need to convert FASTA files to tabular format and back for machine learning purposes. By default, the FASTA -> TAB only outputs two columns (FASTQ also has a quality column), the header and the sequence, but you can also add GC content with -g.

    $ seqkit fx2tab VP1s.fasta > VP1s.csv

  6. translate While I use this feature less often, it is still very powerful and allows for the usage of non-standard codon tables such as for plastids. In this example, I’m using the default settings to translate the genome of phi-X 174, a model phage.

    $ seqkit translate phix174.fasta > phix174_prots.fasta

  7. rmdup For the sake of this example, I will first add the VP1s_train.fasta to VP1s.fasta to create duplicate entries. The subcommand “rmdup” keeps only the first instance of a duplication by default and you can look for duplicates at either the sequence (-s) or header (-n) level.

    $ cat VP1s.fasta VP1s_train.fasta > cat_VP1s.fasta $ seqkit rmdup -n cat_VP1s.fasta > deduped_VP1s.fasta

  8. rename Sometimes, I will have duplicate headers in FASTA files but with different sequences. The easiest way to rename duplicated entries is with the “rename” subcommand, where it will append “_N”. You can also specify optional flags such as renaming the first instance of the duplicate as well.

    $ seqkit rename cat_VP1s.fasta > renamed_cat_VP1s.fasta

  9. sample The “sample” subcommand randomly samples from fasta files either by proportion (-p) or total number of sequences (-n). You can also set a random seed with -s to ensure reproducibility of the sampling!

    $ seqkit sample -p .1 VP1s.fasta > sampled_VP1s.fasta

  10. sort and range If you wanted to get the “top 100 longest sequences” or “smallest X sequences” from a fasta file, you could first use “sort” to order by length (you can also sort by name) and the “range” to filter. For example, let’s say I wanted the top 100 shortest sequences from VP1s.fasta

    $ seqkit sort -l VP1s.fasta | seqkit range -r 1:100 > shortest_100.fasta

This is by no means the guaranteed definitive/best way to manipulate fasta files nor is this post close to being an exhaustive guide on how to use SeqKit. I just hope that these examples are useful and please let me know how I could improve it!

4
Ruben Ramalho14 days ago

This is really useful, thanks! On 2. did you mean to use the seq subcommand instead of the split subcommand in the examples?

Zachary Martinez14 days ago

Thanks for catching that! It should be fixed now.

Ruben Ramalho14 days ago

My pleasure! Again, incredibly useful llittle piece of software.

Ali Hakimzadeh14 days ago

This is an extremely useful and necessary tool. Good points were made, and thank you for this post!

Login or Signup to leave a comment