ITS amplicon computational workflow
Geoff Zahn
Abstract
Many research questions in plant science depend on surveys of the plant microbiome. When the questions depend on answering "who is there" rather than "what are they doing," the most cost-effective method for interrogating microbiomes is often a targeted meta-amplicon survey. There are numerous platforms for processing and analyzing meta-amplicon data sets, but here we will look at a flexible, reproducible, and fully customizable pipeline in the R environment. This approach has the benefit of allowing you to process and analyze your data in the same setting, without moving back and forth between standalone platforms, and further benefits from being completely flexible in terms of the analyses and visualizations you can produce, without being limited to the pre-selected tools available in point-and-click analysis engines such as QIIME, Galaxy, or MG-RAST.
Before start
All scripts for this computational "recipe" can be found at: https://github.com/Bio-protocol/metaamplicon-recipe
Steps
Remove primers
The first step is to remove any PCR primers that may still be present on your reads. Often, sequencing centers remove these by default, but it's a good idea to check. The R script "00_Remove_Primers.R" walks you through this process. In short, you provide the PCR primers used to generate your amplicons and then use the cutadapt tool to search for their presence in all possible orientations. They are then removed, and we can proceed with quality filtration.
This process first removes all reads with ambiguous (N) bases, and then removes primer sequences and adaptors from remaining reads. The newly modified reads are stored in a separate directory (in this case ./data/filtN/ ). When working with bacterial 16S amplicons, this step can generally be accomplished by simply trimming the first N bases from both forward and reverse reads. However, using cutadapt is a more generalized approach that also accounts for mis-oriented reads, does not require you to know ahead of time if your primers are present or not, and is thus recommended.
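As a rough sketch of what "00_Remove_Primers.R" does, the block below follows the general shape of the DADA2 ITS workflow: it first removes N-containing reads, counts primer hits in all four orientations, and then calls cutadapt through system2(). The primer sequences and file-name patterns here are illustrative placeholders (substitute your own), and cutadapt must already be installed and on your PATH.

library(dada2)
library(Biostrings)
library(ShortRead)

path <- "data"
fnFs <- sort(list.files(path, pattern = "pass_1.fastq.gz", full.names = TRUE))
fnRs <- sort(list.files(path, pattern = "pass_2.fastq.gz", full.names = TRUE))

FWD <- "CTTGGTCATTTAGAGGAAGTAA"  # example: ITS1F; replace with your forward primer
REV <- "GCTGCGTTCTTCATCGATGC"    # example: ITS2; replace with your reverse primer

# Remove reads containing ambiguous (N) bases
fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)

# Count primer hits in all four possible orientations
allOrients <- function(primer) {
  dna <- DNAString(primer)
  orients <- c(Forward = dna, Complement = complement(dna),
               Reverse = reverse(dna), RevComp = reverseComplement(dna))
  return(sapply(orients, toString))
}
primerHits <- function(primer, fn) {
  sum(vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE) > 0)
}
sapply(allOrients(FWD), primerHits, fn = fnFs.filtN[[1]])

# Trim primers (and the reverse complement of the opposite primer) with cutadapt
path.cut <- file.path(path, "cutadapt")
if (!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
for (i in seq_along(fnFs)) {
  system2("cutadapt", args = c(
    "-g", FWD, "-a", dada2:::rc(REV),  # forward reads
    "-G", REV, "-A", dada2:::rc(FWD),  # reverse reads
    "-n", 2,                           # remove up to two primers per read
    "-o", fnFs.cut[i], "-p", fnRs.cut[i],
    fnFs.filtN[i], fnRs.filtN[i]))
}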
Extract ITS region
It’s important that we extract the ITS region of interest at this point. The problem is that while ITS1 and ITS2 are good at predicting fungal taxonomy, the primers used to amplify these regions sit inside highly conserved flanking regions (Gardes and Bruns, 1993). Since ITS lengths can vary so greatly (Bengtsson-Palme et al., 2013), keeping these conserved regions in our sequences can greatly bias downstream taxonomic assignment algorithms. A solution exists, though, in the itsxpress software (Rivers et al., 2018), which can identify and trim these conserved regions from the ends of our ITS reads. This is typically run via the command line, but we can do it from within R. The supplementary file “00_Trim_ITS_Region.R” gives a thorough walkthrough example of how to extract the ITS1 region.
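Here is a minimal sketch of calling itsxpress from within R via system2(), assuming itsxpress is installed and on your PATH. It processes only the forward reads (see the note on ITS reverse reads under Quality filtration below); paths and file patterns are illustrative.

# Extract the ITS1 region from primer-trimmed forward reads
path.its <- file.path("data", "ITS1")
if (!dir.exists(path.its)) dir.create(path.its)
fnFs.its <- file.path(path.its, basename(fnFs.cut))

for (i in seq_along(fnFs.cut)) {
  system2("itsxpress", args = c(
    "--fastq", fnFs.cut[i],  # primer-trimmed forward reads from cutadapt
    "--single_end",          # forward reads only
    "--region", "ITS1",      # the ITS region to extract
    "--taxa", "Fungi",
    "--threads", "2",
    "--outfile", fnFs.its[i]))
}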
Quality filtration
Once you are sure your primers aren’t present in your reads (or flanking conserved regions if analyzing fungal ITS data), it’s time to do some quality control work to filter out ambiguous and low-quality reads, and to trim reads where quality begins to suffer. The R script “01_Process_Raw_16S_Reads.R” goes through this step by step, but it’s worth our time to explain what’s going on. The workhorse software package for this and subsequent steps is DADA2 (Callahan et al., 2016).
We first parse our files generated by cutadapt (and itsxpress, in the case of fungal amplicons). The only filtration done on these so far is to remove ambiguous reads and any remaining primer sequences. The file parsing needs to account for separate forward and reverse reads, and it does so by matching patterns in the file names. In our example data forward reads are denoted by “…pass_1” and reverse reads are denoted by “…pass_2” in their names. This can vary between sequencing centers, and your reads may have different patterns such as “R1” and “R2” or “F” and “R.” Be sure to modify the example code to match your forward and reverse reads accordingly where noted.
The plotQualityProfile() function from the DADA2 package takes a look inside these sequence files and allows you to make an informed judgement about how to filter and trim your raw reads. For example, in Fig. 2, we are looking at the separate quality profiles of the forward and reverse reads of two samples. The X-axis shows read length, and the Y-axis shows summary information of quality scores for every read in the file. The green line is the mean, the orange line is the median, and the dashed orange lines are the 25th and 75th quantiles. It’s up to you to determine where good cutoff points might be for the forward and reverse reads. Keep in mind that in order to merge the reads later on, they will need to have sufficient overlap (typically >50 bases).
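Generating these plots is a one-liner per file set; a quick sketch, reusing the trimmed-read file vectors from the primer-removal step:

# Inspect per-base quality to choose truncation points
plotQualityProfile(fnFs.cut[1:2])  # forward reads of the first two samples
plotQualityProfile(fnRs.cut[1:2])  # reverse reads of the first two samples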
If you are working with fungal ITS reads, it is typical that reverse reads aren't of sufficient quality to merge with forward reads. Especially with variable ITS lengths, it is somewhat common practice to forego using reverse reads, and to just process the forward reads output by itsxpress (Darcy et al., 2020; Tipton et al., 2019, 2021).
The trimming and filtration step will discard some reads; how many depends on how stringently you set the parameters. If you notice that the majority of your reads are being discarded, you might try
relaxing the maxEE parameter a bit.
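A sketch of this step with DADA2's filterAndTrim() follows; the truncLen and maxEE values are illustrative starting points for 16S reads and should be chosen from your own quality profiles (for length-variable ITS reads, fixed truncation lengths are generally avoided):

filtFs <- file.path("data", "filtered", basename(fnFs.cut))
filtRs <- file.path("data", "filtered", basename(fnRs.cut))

out <- filterAndTrim(fnFs.cut, filtFs, fnRs.cut, filtRs,
                     truncLen = c(250, 200),  # truncate where quality drops off
                     maxN = 0,                # discard reads with ambiguous bases
                     maxEE = c(2, 2),         # max expected errors; relax if too many reads are lost
                     truncQ = 2,
                     rm.phix = TRUE,          # remove PhiX spike-in reads
                     compress = TRUE, multithread = TRUE)
head(out)  # reads in vs. reads out, per sample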
ASV inference
Now that we have a set of quality-controlled sequence files, we’re ready to perform a “denoising” step that can predict and correct likely sequencing errors. This is an important step when analyzing exact “amplicon sequence variants” (ASVs) as opposed to “operational taxonomic units” (OTUs). There is always a bit of “noise” in sequence data, representing sequencing errors. OTUs were, in part, conceived of as a way to deal with this constant threat of noise: by clustering highly similar sequences together, the OTU approach absorbs small amounts of variation. However, much of that variation is real and informative (Callahan et al., 2017; Tipton et al., 2021). Methods now exist to discover and correct sequence errors, allowing us to use ASVs. The advantages include finer taxonomic resolution, the potential to discover patterns that might be hidden by artificially clustering sequences together, and, most importantly, the ability to easily combine data from multiple studies. Since each ASV is an actual sequence, rather than an undefined cloud of sequences (as with OTUs), they are directly comparable outside of a given study. This facilitates reproducibility and re-use, such as in meta-analyses.
The DADA2 R package gives us a set of tools for inferring these ASVs. First, we build a mathematical profile of error rates along the length of our reads. In basic terms, we can think of this error profile as a mathematical representation of our quality profile plots (Figure 3). The learnErrors() function is a machine-learning tool that builds an error model based on your data. Figure 4 shows this error model. What you’re looking for is a decent fit between observed and estimated errors, and a decline in error as quality scores increase; it doesn’t need to be a perfect fit. If your model looks reasonable (as ours does in Figure 4), you are ready to use that error model to correct sequence errors. A poorly fit error model (indicated by large discrepancies between observed points and the model fit line) may be a result of binned quality scores or a lack of ‘diversity’ of quality scores. In this case, be sure you are using the raw sequence read files for creating the error profile, not reads that have been pre-processed in any way.
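A minimal sketch of building and inspecting the error models:

# Learn error rates separately for forward and reverse reads
errF <- learnErrors(filtFs, multithread = TRUE)
errR <- learnErrors(filtRs, multithread = TRUE)
plotErrors(errF, nominalQ = TRUE)  # points = observed error rates; line = fitted model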
The dada() function performs denoising based on the estimated error model. In essence, it is performing statistical tests that determine whether a given sequence was observed too many times to have been caused by sequencing errors in currently inferred sequences (Rosen et al., 2012). The dada() function is run separately on forward and reverse reads and the output is a set of sequence variants that were inferred from the “noisy” input sequences. From this point, we are ready to merge our forward and reverse reads and construct a sequence table. These tasks are accomplished with the mergePairs() and makeSequenceTable() functions, respectively.
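In code, these steps might look like the following sketch, using the filtered files and error models from above:

# Denoise forward and reverse reads, then merge pairs and tabulate ASVs
dadaFs <- dada(filtFs, err = errF, multithread = TRUE)
dadaRs <- dada(filtRs, err = errR, multithread = TRUE)
mergers <- mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
seqtab <- makeSequenceTable(mergers)
dim(seqtab)  # samples x ASVs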
Remove chimeras
Chimeric sequences result from incomplete extension during PCR. Partial amplicons can serve as "false templates" for primers to bind to, resulting in reads that are composed of more than one true sequence. Chimera detection involves evaluating distinct matches between the left side and right side of reads and potential "parent" source reads. In practice, this detection and removal step can be conducted with the removeBimeraDenovo() function, as shown in the R script "01_Process_Raw_16S_Reads.R".
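For example:

# Identify and remove chimeric ASVs de novo
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus",
                                    multithread = TRUE, verbose = TRUE)
sum(seqtab.nochim) / sum(seqtab)  # fraction of reads surviving chimera removal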
Remove contaminant sequences
While sadly not universally practiced, including and sequencing negative controls is a crucial step when performing meta-amplicon studies. Often, these control samples are blank DNA extractions that are then sequenced. This allows us to discover DNA sequences that were present in the lab environment but do not necessarily represent meaningful diversity in our real samples. The decontam package (Davis et al., 2018) was developed to make use of these negative controls, using them to detect likely contaminant sequences. The way we will use this package is by giving the isContaminant() function our newly created sequence table and a list of which samples are negative controls. Your sample metadata spreadsheet likely contains study-relevant information about each sequenced sample, but should also include a column denoting whether a sample is a negative control (Table 2 shows an example metadata sheet formatted correctly). Sequences with increased prevalence in control samples can then be identified and removed. Once potential contaminants are removed, we can then remove the negative control samples from further downstream steps, since their job is done.
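A sketch of this step is below; "meta" and its logical column "is_neg" are hypothetical names standing in for your own metadata table and its negative-control column:

library(decontam)

# Flag ASVs that are disproportionately prevalent in negative controls
contams <- isContaminant(seqtab.nochim, neg = meta$is_neg, method = "prevalence")
table(contams$contaminant)  # how many ASVs are flagged as contaminants

# Drop contaminant ASVs, then drop the control samples themselves
seqtab.clean <- seqtab.nochim[, !contams$contaminant]
seqtab.clean <- seqtab.clean[!meta$is_neg, ]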
Assign taxonomy
Assigning taxonomy to DNA barcodes is a complicated subject. The success and accuracy of any assignments depend on the algorithms and databases that you use. In fact, with phylogeny-independent taxonomic assignments, the current limiting factors are the completeness and accuracy of the databases to which we match our sequences. With this in mind, it is critical that you carefully consider what methods to use. The RDP Naive Bayesian Classifier (Wang et al., 2007) is one widely used algorithm with several advantages. First, it does not rely on sequence alignments, which makes it suitable for fungal ITS studies. Second, it is incredibly fast compared to algorithms such as BLAST. Third, it provides bootstrapped confidence values for all assignments and allows matching to the “least common ancestor,” reducing the incidence of false positive matches. Its performance for species-level taxonomic assignments of bacteria is lacking, however, and new information has shown that exact 100% matching is the only justifiable method for assigning species to bacterial ASVs (Edgar, 2018). A selection of properly formatted reference files for popular taxonomic databases is available at http://benjjneb.github.io/dada2/training.html. Here, we use two complementary methods to assign taxonomy to our reads: 1) the RDP Classifier using the RDP 16S Training Set database, implemented via the assignTaxonomy() function, and 2) exact 16S matching for species-level assignments, implemented via the addSpecies() function. The “Assign Taxonomy” section of the “01_Process_Raw_16S_Reads.R” script covers these steps. These functions take our sequence table as input and return a data frame matching our ASVs to taxonomy at the deepest levels that could be assigned unambiguously.
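In sketch form, those two steps look like this; the reference file names are examples of DADA2-formatted training sets, so point the functions at whichever versions you have downloaded locally:

# Assign taxonomy with the RDP classifier, then add exact-match species calls
taxa <- assignTaxonomy(seqtab.clean, "taxonomy/rdp_train_set_18.fa.gz",
                       multithread = TRUE)
taxa <- addSpecies(taxa, "taxonomy/rdp_species_assignment_18.fa.gz")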
If you are using fungal ITS reads, taxonomic assignment is essentially the same, but you will need to use a eukaryotic database such as the UNITE Eukaryotic Database (DOI: 10.15156/BIO/2938069) instead of a bacterial 16S database, and the addSpecies() function should be omitted.
The last step in this script is to construct a phyloseq object in which to store your ASV table along with your sample metadata and taxonomic assignments. The phyloseq package (McMurdie and Holmes, 2013) provides a data structure that can combine a sequence abundance table (like our ASV table), taxonomic information, sample metadata, and a phylogenetic tree into one object that makes working with these disparate tables much simpler.
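A minimal sketch of that assembly, assuming "meta" is your sample metadata data frame with row names matching those of the sequence table:

library(phyloseq)

# Bundle the ASV table, sample metadata, and taxonomy into a single object
ps <- phyloseq(otu_table(seqtab.clean, taxa_are_rows = FALSE),
               sample_data(meta),
               tax_table(taxa))
ps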