Decoding Metabarcoding: A Look at Data Processing Software Suites and Pipelines (Part 1)
In the ever-evolving world of bioinformatics, researchers continually seek new tools to process and analyze metabarcoding data. Two primary categories of software have emerged to cater to these needs: suites hosting multiple algorithms and pre-defined pipelines. This series of blog posts delves into the strengths and limitations of each category and their impact on the field.
Software suites and pre-compiled pipelines

Metabarcoding software suites such as USEARCH, VSEARCH, DADA2, OBITools, mothur, and QIIME 2 offer a diverse range of algorithms for sequence data analysis. These suites enable users to create custom pipelines tailored to their specific needs by combining commands and settings. VSEARCH stands out as a cost-effective alternative to USEARCH, providing comparable functionality without the need for a paid license. Furthermore, mothur and QIIME 2 integrate unique processing algorithms while also incorporating features from VSEARCH and/or DADA2.
In contrast, pre-defined analytical pipelines streamline the metabarcoding data analysis process, catering to users with limited bioinformatics skills. These pipelines consist of workflow steps that have been validated for specific sequencing data, and while some include newly designed algorithms, others combine open-source tools into a cohesive and user-friendly process. Despite their pre-defined nature, many of these pipelines still allow for customization, enabling users to adapt settings based on the characteristics of their sequencing data sets.
Basic structure of a metabarcoding pipeline

Demultiplexing, an essential step in the analysis of metabarcoding data, involves distributing sequences into individual files corresponding to specific samples. While many sequencing providers offer demultiplexed sequences, in cases where this step is not integrated, researchers can employ software such as cutadapt, sdm, or lima. Additionally, sequencing adapters, indexes, and primers should be removed before proceeding to the subsequent analyses.
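The logic of demultiplexing can be sketched in a few lines. This is an illustration only, assuming exact barcode matches at the read start and a fixed primer length; real tools such as cutadapt handle mismatches, paired-end reads, and quality trimming. The barcode-to-sample map and primer length are hypothetical.

```python
# Minimal demultiplexing sketch (illustration only; real tools such as
# cutadapt allow mismatches and handle paired reads). Assumes each read
# begins with an exact 4-bp sample barcode followed by the primer, both
# of which are trimmed off before downstream analysis.

SAMPLE_BARCODES = {          # hypothetical barcode-to-sample map
    "ACGT": "sample_A",
    "TGCA": "sample_B",
}
PRIMER_LEN = 20              # assumed forward-primer length

def demultiplex(reads):
    """Sort reads into per-sample lists, trimming barcode + primer."""
    bins = {name: [] for name in SAMPLE_BARCODES.values()}
    unassigned = []
    for read in reads:
        sample = SAMPLE_BARCODES.get(read[:4])
        if sample is None:
            unassigned.append(read)      # barcode not recognized
        else:
            bins[sample].append(read[4 + PRIMER_LEN:])
    return bins, unassigned
```

Reads whose barcode is not recognized are set aside rather than silently dropped, mirroring the "unassigned reads" output of real demultiplexers.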
The subsequent stages of a standard DNA metabarcoding pipeline include sequence filtration based on read quality scores, removal of putative chimeric/artifactual sequences, defining features (e.g., ASVs, OTUs), and taxonomic annotation of the features. Different strategies may be employed depending on the characteristics of the sequencing data or the study objectives. Notably, filtering on the expected number of errors per read is preferred over filtering on the average quality score, because averaging Phred scores can mask runs of low-quality bases and thereby increase the risk of false-positive features.
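The expected-error calculation follows directly from the definition of the Phred score: each base with quality Q has error probability 10^(-Q/10), and summing these gives the expected number of errors in the read. A minimal sketch, with a max_ee threshold of 1.0 chosen purely for illustration:

```python
# Expected-error filtering sketch: rather than averaging Phred scores,
# sum the per-base error probabilities (E = sum of 10^(-Q/10)) and
# discard reads whose expected error count exceeds a threshold.
# max_ee = 1.0 is an illustrative choice, not a universal default.

def expected_errors(quals):
    """Expected number of sequencing errors given per-base Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, max_ee=1.0):
    return expected_errors(quals) <= max_ee
```

The advantage over mean-quality filtering: a read of ten Q20 bases has an expected 0.1 errors and passes, while a read of ten Q2 bases expects about 6.3 errors and is rejected, even though a handful of very high-quality bases could drag its *average* back above a naive cutoff.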
Chimeric sequences can be removed using de novo methods or, when an appropriate reference database is available, reference-based chimera filtering. Some pipelines, like NextITS and FROGS, have implemented approaches to rescue sequences falsely flagged as chimeric, thereby preserving real community members.
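The intuition behind de novo chimera detection can be shown with a deliberately crude two-parent test: a query is suspicious if its left segment matches the start of one more-abundant sequence and its right segment the end of a different one. This is a toy sketch only; real algorithms (UCHIME-style) use alignments and scoring models, not exact string matching.

```python
# De novo chimera check sketch: flag a query if, at some breakpoint, its
# left part exactly matches the start of one more-abundant "parent" and
# its right part the end of a *different* parent. Toy illustration of
# the idea behind UCHIME-style detection, not a usable implementation.

def is_two_parent_chimera(query, parents, min_seg=5):
    """parents: sequences more abundant than query (candidate parents)."""
    for i in range(min_seg, len(query) - min_seg + 1):
        left, right = query[:i], query[i:]
        left_hits = [p for p in parents if p.startswith(left) and p != query]
        right_hits = [p for p in parents if p.endswith(right) and p != query]
        # A chimera requires two different parents for the two segments.
        if any(l != r for l in left_hits for r in right_hits):
            return True
    return False
```

Abundance ordering matters here: chimeras are formed late in PCR, so a real chimera is expected to be less abundant than both of its parents, which is why only more-abundant sequences are offered as candidate parents.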
The formation of features, such as ASVs and OTUs, occurs through various algorithms across different software. ASVs represent denoised reads with minimal differences between variants, while OTUs are generally formed based on global sequence similarities. The choice between ASVs and OTUs depends on the specific context and research goals.
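The OTU approach above can be sketched as abundance-sorted greedy clustering, in the spirit of VSEARCH's cluster-by-size strategy. This toy version uses position-wise identity on equal-length strings in place of true pairwise alignment, and the 97% threshold is the conventional (but debated) choice:

```python
# Greedy OTU clustering sketch at a fixed identity threshold. Sequences
# are processed from most to least abundant; each either joins the first
# centroid it matches or seeds a new OTU. Real implementations (e.g.
# VSEARCH --cluster_size) use proper alignment, not this toy identity.

def identity(a, b):
    """Fraction of matching positions (toy stand-in for alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs_by_abundance, threshold=0.97):
    """seqs_by_abundance: sequences sorted most- to least-abundant."""
    centroids = {}
    for seq in seqs_by_abundance:
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                centroids[centroid].append(seq)
                break
        else:
            centroids[seq] = [seq]   # no match: seed a new OTU
    return centroids
```

The abundance-sorted order is the key design choice: abundant sequences are more likely to be error-free, so they make better centroids, whereas ASV methods such as DADA2 instead model the error process to resolve variants down to single-nucleotide differences.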
After feature formation, low-abundance features are often discarded, as many are considered artifacts. However, post-clustering curation tools like LULU can help retain rare, potentially real features. Pipelines such as NextITS, LotuS2, and Dadaist2 implement the UNCROSS2 algorithm to address tag-switching errors, which can artificially inflate richness.
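A minimal sketch of the abundance-filtering step, with the cutoff expressed as a fraction of total reads (the 0.1% default here is illustrative, not a recommendation). Note the trade-off the paragraph describes: any blanket cutoff will also remove genuinely rare taxa, which is the gap curation tools like LULU try to close.

```python
# Low-abundance filtering sketch: drop features whose total read count
# falls below a fraction of the library total. min_frac = 0.001 (0.1%)
# is an illustrative choice; real studies tune this per data set, and
# aggressive cutoffs risk discarding genuinely rare community members.

def filter_rare(feature_counts, min_frac=0.001):
    """feature_counts: dict mapping feature ID -> total read count."""
    total = sum(feature_counts.values())
    cutoff = total * min_frac
    return {f: c for f, c in feature_counts.items() if c >= cutoff}
```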
Taxonomy assignment is a crucial step in metabarcoding pipelines, with alignment-based (e.g., BLAST) and sequence composition-based approaches (e.g., the RDP Naïve Bayesian classifier) being the most common methods. The choice of assignment method and reference database significantly impacts classification accuracy, necessitating a trade-off between detecting true positives and admitting false positives.
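The alignment-based strategy, reduced to a toy top-hit form: assign the taxonomy of the most similar reference, provided the similarity clears a minimum identity. Both the reference entries and the 0.8 cutoff are hypothetical; BLAST-style tools score real alignments with E-values rather than this position-wise identity.

```python
# Top-hit taxonomy assignment sketch (BLAST-like idea in toy form):
# assign the taxonomy of the most similar reference sequence, but only
# if it clears a minimum identity; otherwise report "Unassigned".
# Reference entries and the 0.8 cutoff are illustrative assumptions.

REFERENCE_DB = {   # hypothetical reference sequence -> taxonomy string
    "AAAAAAAAAA": "Bacteria;Firmicutes",
    "GGGGGGGGGG": "Bacteria;Proteobacteria",
}

def identity(a, b):
    """Fraction of matching positions (toy stand-in for alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def assign_taxonomy(query, min_identity=0.8):
    best_taxon, best_id = None, 0.0
    for ref_seq, taxon in REFERENCE_DB.items():
        score = identity(query, ref_seq)
        if score > best_id:
            best_taxon, best_id = taxon, score
    return best_taxon if best_id >= min_identity else "Unassigned"
```

The min_identity parameter is where the true-positive/false-positive trade-off in the text becomes concrete: lowering it labels more queries at the cost of more spurious assignments, while raising it leaves more features unclassified.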
In conclusion, DNA metabarcoding pipelines are complex and multifaceted, requiring careful consideration of various strategies, software, and databases. As a bioinformatics enthusiast, understanding these intricacies will enable you to harness the power of this technology and make well-informed decisions in your research endeavors.
Hello Ali, Thank you for this beautiful work you have shared. In my PhD, I used metabarcoding for gut microbiota sequencing in Daphnia (a water flea) and also RNA-Seq for genome-wide transcription analysis, to understand the mechanism of toxicity of persistent pollutants on non-target species in a freshwater environment. I would be very grateful if we could have a chat, especially on the downstream analysis of my microbiome data. Almost there, but I am stuck on functional analysis. Thank you