Decoding Metabarcoding: A Look at Data Processing Software Suites and Pipelines (Part 2)
Navigating Through Sequencing Platforms
Navigating the ever-changing world of metabarcoding feels a bit like trying to keep up with the latest smartphone tech: it's a constant race. The sequencing platform you choose matters more and more. Second-generation technologies, like Illumina, are the established workhorses of high-throughput sequencing. They're known for churning out large volumes of high-quality, paired-end short reads (up to 300 bp per read), and they do it without breaking the bank.
Amplicon data analysis pipelines have been built primarily around processing this kind of data. But there are also new kids on the block, like MGI-Tech platforms, that produce paired-end reads. While they share similarities with Illumina, they have their own quirks when it comes to data processing. For example, some pipelines only accept paired-end data, leaving single-end data and long-read (third-generation) sequencing data out in the cold.
Speaking of third-generation sequencing platforms, they're growing in popularity thanks to their improved accuracy and throughput. These platforms let researchers generate longer metabarcodes, which can give a more detailed picture of taxonomy and reduce the sequencing bias towards short amplicons. Data from platforms like PacBio can already be handled by software originally designed for short reads. On the flip side, data from Oxford Nanopore Technologies (ONT) often needs a more tailored approach, so if you're thinking of pointing these tools at nanopore reads, it's wise to tread carefully.
There's a lot of variability in sequencing depth (the amount of data produced) between platforms. For instance, the Illumina MiSeq and NovaSeq systems can generate up to 25M 2x300 bp reads and 1600M 2x250 bp reads per flow cell, respectively. Compare that to the PacBio Sequel II(e) system, which generates up to 4M HiFi reads. This disparity can have a big impact on denoising tools, which tend to discard low-abundance sequences as likely errors. So, if you're working with samples with low sequencing depth or complex communities, be on guard for potential false negatives.
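To put those depth numbers in perspective, here's a quick back-of-the-envelope calculation. The 96-sample multiplexing level is a hypothetical assumption for illustration, not a figure from any specific study:

```python
# Reads available per sample when a run is split across 96 samples.
# Per-run totals are the maximums quoted above; the 96-sample
# multiplex is an illustrative assumption.
runs = {
    "MiSeq (2x300 bp)": 25_000_000,
    "NovaSeq (2x250 bp)": 1_600_000_000,
    "Sequel II(e) HiFi": 4_000_000,
}
n_samples = 96
for platform, reads in runs.items():
    print(f"{platform}: ~{reads // n_samples:,} reads per sample")
```

With an order of magnitude fewer reads per sample, rare community members are far more likely to slip below a denoiser's detection threshold.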
Speaking of denoising, algorithms like UNOISE and Deblur, which were created for Illumina reads, might not work as well with data from other platforms. In these cases, an OTU clustering approach might do a better job. You can filter out low-abundance OTUs after clustering, leaving only the most relevant data to analyze.
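As a rough illustration of what post-clustering abundance filtering does, here's a minimal Python sketch. The OTU counts and the 0.01% relative-abundance cutoff are made-up examples, not a recommended threshold:

```python
# Minimal sketch of post-clustering abundance filtering: drop OTUs
# whose total count falls below a relative-abundance cutoff.
# Counts and the 0.01% cutoff are illustrative assumptions.

def filter_low_abundance(otu_counts, min_rel_abundance=0.0001):
    """Keep only OTUs at or above min_rel_abundance of total reads."""
    total = sum(otu_counts.values())
    cutoff = total * min_rel_abundance
    return {otu: n for otu, n in otu_counts.items() if n >= cutoff}

otu_table = {"OTU_1": 54000, "OTU_2": 3200, "OTU_3": 5, "OTU_4": 3}
filtered = filter_low_abundance(otu_table)
print(sorted(filtered))  # OTU_3 and OTU_4 fall below 0.01% of total reads
```

In practice the same effect is achieved with the built-in options of clustering tools, but the principle is just this: a hard floor on abundance that trades a little sensitivity for far fewer spurious OTUs.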
PacBio also has a new long-read sequencing system on the market, the Revio, which promises up to 15 times the throughput of the Sequel II. That sounds great, but we still need to see how denoisers will handle this influx of long-read data.
One more thing to keep in mind is the number of sequencing cycles. When the amplicon is shorter than the read length, platforms like Illumina NovaSeq and NextSeq may pad the read with a poly-G tail, because their two-channel chemistry interprets "no signal" as G. It's generally a good idea to trim primers from amplicon reads by default, and tools like fastp can remove these non-biological poly-G tails as well. Also, since third-generation platforms have a wider Phred score range (0-93), you might need to make some adjustments when using software like VSEARCH or USEARCH, which expect Illumina's default maximum score of 41.
In a nutshell, while high-throughput sequencing technologies are opening up exciting new possibilities for metabarcoding, they also come with their own set of challenges. To get the most out of them, researchers need to stay up-to-date with the latest developments and understand the nuances of these ever-evolving platforms.