Date: 2022-04-01

What are the fundamentals of NGS? An initial question in studying the gut microbiota is which microbes are present in a given sample. Subsequent inquiries, addressable by NGS analyses, include determining the relative abundance and predictive functional profiles of the microbes present, as well as understanding intraspecies and population heterogeneity (13). NGS methods address these questions by directly sequencing microbial DNA or RNA, for example, in fecal, blood, and/or tissue samples. With the improving affordability of NGS, the two primary NGS methodologies now in use are amplicon sequencing and shotgun metagenomic sequencing; however, RNA sequencing is also a valid and, in some ways, superior method for microbial characterization, as it allows for determination of the transcriptome, representing a further step to define microbiota function (14, 15).

One of the most common NGS methods for bacterial identification and characterization is amplicon sequencing. Amplicon sequencing involves first amplifying a region of the DNA via PCR, and then sequencing the resultant product. The target for PCR amplification is, most commonly, the bacterial 16S ribosomal RNA (rRNA) gene (Figure 1). For this reason, amplicon sequencing is also referred to as 16S rRNA sequencing or analysis. The use of the 16S rRNA gene to characterize uncultured microbes was first described by Lane et al. in 1985 (16). The 16S rRNA gene is an ideal target because it is highly conserved and ubiquitous among bacteria (without it, bacteria would be unable to translate mRNA into proteins and thus be nonfunctional) and it also contains nine hypervariable regions (V1–V9) that differ between bacterial species and genera (Figure 1). Thus, PCR primers can be designed such that forward and reverse primers bind to conserved regions but amplify an intervening variable region. Typically, only a subset of the variable regions is targeted for sequencing in a given study (e.g., V1–V3, V4–V5) to limit the amount and, thus, time and cost of sequencing. However, it is important to note that no one region adequately differentiates all bacteria (17), and sequencing of select hypervariable regions can yield differing data interpretation (17–19). For example, amplification of certain hypervariable regions may bias results, leading to under- or overrepresentation of taxa (18), but may also be advantageous for distinguishing between certain species within a genus (17). Recently, NGS of the full 16S rRNA gene has emerged and, using increasingly sophisticated analytical methods, may provide both species and strain resolution in microbiota communities (20).
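The primer logic described above can be illustrated with a minimal in silico sketch. It assumes exact primer matching on toy sequences; the sequences, the constant names, and the function `in_silico_amplicon` are hypothetical and are not real 16S rRNA primers or part of any published pipeline. Real primers tolerate mismatches and are often degenerate, which this sketch ignores.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def in_silico_amplicon(template: str, fwd_primer: str, rev_primer: str):
    """Return the stretch of `template` spanned by a primer pair, or None.

    The forward primer must match the template strand exactly; the reverse
    primer anneals to the opposite strand, so its reverse complement is
    searched for downstream of the forward primer site.
    """
    start = template.find(fwd_primer)
    if start == -1:
        return None
    rev_site = reverse_complement(rev_primer)
    end = template.find(rev_site, start + len(fwd_primer))
    if end == -1:
        return None
    return template[start:end + len(rev_site)]


# Toy data (hypothetical): two taxa share the conserved flanks but differ
# in the intervening "hypervariable" stretch.
CONSERVED_LEFT = "ACGTACGT"
CONSERVED_RIGHT = "TTGACCGA"
taxon_a = "GG" + CONSERVED_LEFT + "CCCAAA" + CONSERVED_RIGHT + "GG"
taxon_b = "GG" + CONSERVED_LEFT + "CTCGAA" + CONSERVED_RIGHT + "GG"

fwd = CONSERVED_LEFT
rev = reverse_complement(CONSERVED_RIGHT)

amplicon_a = in_silico_amplicon(taxon_a, fwd, rev)
amplicon_b = in_silico_amplicon(taxon_b, fwd, rev)
print(amplicon_a)
print(amplicon_b)
```

In this toy case the same primer pair amplifies both templates, and the two amplicons differ only in the variable stretch, which is the property that makes the amplified region informative for classification.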

Figure 1 Bacterial 16S rRNA gene. (A) Percentage sequence identity of conserved and hypervariable regions of the bacterial 16S rRNA gene. Adapted with permission from the Journal of Microbiological Methods (17) and Ilona Lehtinen (137). (B) Illustration of conserved and hypervariable regions corresponding to A and PCR amplification of the V1–V3 region of the bacterial 16S rRNA gene. Adapted with permission from Humana Press (148). (C) Schematic of 16S rRNA gene structure with hypervariable regions (V1–V9) labeled.

After PCR amplification of the selected hypervariable regions, the resulting amplicons are sequenced, followed by data “cleaning.” Data cleaning involves multiple steps, such as adapter and primer sequence trimming, removal of low-quality bases and sequences from reads, and removal of sequences matching a control library such as the PhiX Control (Illumina), chimeric sequences, and human contaminant reads, as well as chloroplast and mitochondrial contaminants. Subsequent analyses lead to organization of the sequence data into, most often, operational taxonomic units (OTUs). OTUs are distance-based clusters of sequences, initially constructed without a reference database (21). An OTU sequence identity greater than 97% (or with up to 3% dissimilarity) is typically estimated to define a species, while OTUs with sequence similarities of 95% and 80% are used to define genus and phylum, respectively (21). Taxonomic identification is then inferred by computational alignment to reference 16S rRNA sequence databases such as the Ribosomal Database Project (RDP) (22), SILVA (23), or Greengenes (24). OTUs and identified taxa are then used for downstream analysis. An alternative, less frequently used non-distance-based analytical approach for amplicon sequencing relies on exact nucleotide matching to yield amplicon sequence variants (ASVs). ASV taxon assignments are dependent on the quality of reference databases (25). Additionally, ASVs have the potential to split single genomes into multiple clusters, because most bacterial cells possess more than one rRNA gene copy and these, not infrequently, differ in nucleotide sequence (26). While each method (OTUs versus ASVs) has proponents (26, 27), importantly, both are computational approaches to estimate taxonomy. For unculturable microbes, NGS data alone produce “candidate species,” whereas firmer classification of cultured bacterial species is possible using both phenotypic and genome sequence data (28).
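To make the idea of distance-based clustering concrete, the following is a minimal sketch of greedy, reference-free OTU clustering at a 97% identity threshold. It assumes pre-aligned, equal-length reads and uses a simple per-position identity; established tools use pairwise alignment, abundance sorting, and chimera filtering, none of which is modeled here. The function names are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes pre-aligned, equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otu_clustering(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid founds a new OTU.
    Returns (centroids, assignments), where assignments[i] is the index
    of the OTU that seqs[i] was placed in.
    """
    centroids = []
    assignments = []
    for seq in seqs:
        for index, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignments.append(index)
                break
        else:
            # No centroid was close enough: start a new OTU.
            centroids.append(seq)
            assignments.append(len(centroids) - 1)
    return centroids, assignments


# Toy reads (hypothetical), each 100 bases long.
base_read = "A" * 100
one_mismatch = "A" * 99 + "C"      # 99% identical to base_read
five_mismatches = "C" * 5 + "A" * 95  # 95% identical to base_read

centroids, assignments = greedy_otu_clustering(
    [base_read, one_mismatch, five_mismatches]
)
print(assignments)
```

Here the read with one mismatch joins the first OTU, whereas the read with five mismatches falls below the threshold and founds a second OTU. Because the procedure is greedy, the result can depend on input order, which is one reason OTU tables from different pipelines are not always directly comparable.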

In contrast to amplicon sequencing, shotgun metagenomic sequencing and RNA sequencing analyze all the DNA or RNA in a given sample, respectively. For shotgun metagenomic sequencing, after extraction, the DNA is randomly fragmented, and barcodes and adapters are ligated to the ends of each segment to facilitate sample identification and DNA sequencing. The resultant reads are cleaned and subsequently aligned to a reference database to identify taxa and functional potential. The primary reference databases are usually Reference Sequence (RefSeq; ref. 29) and GenBank (30). These are large databases containing all publicly available genomes. Smaller pathogen-focused databases such as Pathosystems Resource Integration Center (PATRIC; ref. 31) and the Eukaryotic Pathogen Database (EuPathDB; ref. 32) are also used. The RNA sequencing workflow is similar to that for shotgun metagenomic sequencing; however, after fragmentation, the RNA segments are reverse transcribed, using PCR, into complementary DNA (cDNA), which is then processed using the DNA sequencing pipeline. Figure 2 provides an overview of NGS processes. Because of their diverse methodologies, 16S rRNA amplicon, shotgun metagenomic, and RNA sequencing each have advantages and drawbacks. These are discussed below. Choosing one method over the others requires comparison and consideration of study goals. Several recent reviews and books have provided guides to microbiome analysis (21, 33).

Figure 2 NGS implementation. Overview of key steps in 16S rRNA gene sequencing, shotgun metagenomic sequencing, and RNA sequencing processes. ¹Host DNA or RNA depletion can be performed (optional steps). ²PCR amplification is used to amplify bacterial 16S rRNA gene variable regions (16S rRNA amplicon sequencing) or random cDNA fragments resulting from RNA reverse transcription for RNA sequencing. DNA-based shotgun metagenomic sequencing is optimally done without use of PCR amplification to avoid introduction of PCR-associated experimental bias. However, in samples with low DNA quantities, PCR amplification of the DNA library is sometimes used. ³Commonly Illumina-based sequencing chemistry (33). ⁴The taxonomic and functional analyses of NGS data are complex and make use, most often, of software available in the public domain.

Direct comparisons between NGS methods. Although comparisons of 16S rRNA sequencing and shotgun metagenomics exist for a variety of samples, including those from humans (14, 15, 34–40), laboratory model organisms (13, 39), plants (39, 41), soil (42), and water (43), overall, direct method comparisons for human samples are limited. Comparisons with RNA sequencing and across all three sequencing modalities are even more limited (14). Here, we address common considerations in choosing an NGS method (Table 2). Additionally, we review studies within the past five years that have directly compared NGS methods in humans (Table 3).

Table 2 Comparisons of common microbiome sequencing methods

Table 3 Recent comparisons of NGS methods for microbial taxonomic classification and functional profiling in human samples

16S rRNA, shotgun metagenomic, and RNA sequencing can all be used to determine what bacteria are present in a microbiome; however, the latter two also detect members of other domains such as fungi and parasites, as well as viruses. Only RNA sequencing examines RNA viruses. With respect to taxonomic resolution, an overarching finding of the studies that have compared these methods is that phylum designations are comparable (39); however, 16S rRNA sequencing tends to offer less resolution and sensitivity for detecting changes at the species level and cannot detect strain-level changes (13, 34, 43). For example, Jovel and colleagues conducted parallel 16S rRNA and shotgun metagenomic sequencing on mock bacterial populations with defined consortia and found that the 16S rRNA method and software pipelines (Quantitative Insights Into Microbial Ecology [QIIME], refs. 44, 45; and mothur, ref. 46) effectively resolved sequences to the genus level, but shotgun metagenomic sequencing resulted in improved genus- and species-level classification (36). This finding has been replicated in other human studies (Table 3 and refs. 14, 15, 35–40). Interestingly, Drewes et al. compared 16S rRNA analysis pipelines and found that the Resphera Insight high-resolution taxonomic assignment tool (Resphera Biosciences; refs. 47–49) better characterized species-level differences using human colon cancer samples compared with other 16S rRNA sequencing pipelines (50). To our knowledge, no studies, as yet, have directly compared the Resphera Insight (47–49) pipeline species classification with that of shotgun metagenomic sequencing. Very recently, the Kraken pipeline for shotgun metagenomic analysis was expanded to enable 16S rRNA analysis, and results show that it is more accurate and up to 300 times faster than QIIME (51). However, QIIME includes a wealth of other helpful tools, making it more a stand-alone “complete” package. For users sophisticated enough to mix-and-match packages, Kraken could replace the core QIIME step of 16S read assignment.

A functional profile cannot be directly obtained from 16S rRNA sequencing, because the method only characterizes sequences from one essential gene. Methods like PICRUSt (Phylogenetic Investigations of Communities by Reconstruction of Unobserved States; ref. 52) or PICRUSt2 (53) and Tax4Fun (54) or Tax4Fun2 (55) aim to predict functional profiles of bacteria based on 16S rRNA data. However, the success of these methods, when compared with functional potentials obtained via shotgun metagenomics, varies with the 16S gene primers used for amplification (35, 36). Conversely, shotgun metagenomics and RNA sequencing consider all the microbial DNA and RNA; thus it is possible to more comprehensively predict the functional potential. Importantly, a distinct difference between shotgun metagenomics and RNA sequencing is that shotgun metagenomics provides a random selection of all genes encoded by the microbes (predictive functional potential) whereas RNA sequencing identifies which genes are actively being transcribed (active functional profile).

Other considerations in pursuing an NGS method and analyses include host contamination, false positives, bias, and post-sequencing computational requirements. There is less risk of host contamination in 16S rRNA sequencing compared with other NGS methods because the gene being amplified and sequenced (i.e., the 16S rRNA gene) is specific to bacteria. With 16S rRNA sequencing, there is also a lower risk of false positives due to extensive reference databases and computational error correction tools; however, the risk of false positives increases with decreasing sample biomass (33). Conversely, there is a higher risk of bias with 16S rRNA sequencing because of primer-dependent PCR amplification bias and differences between the variable regions, as discussed above (17–19). Importantly, one must also consider the computational expertise and analysis required after sequencing. Currently, 16S rRNA sequencing bioinformatics analysis is less of an undertaking than either shotgun metagenomics or RNA sequencing, as there are fewer data (i.e., sequencing output from one gene versus all genes) as well as several publicly available and user-friendly platforms, like QIIME (44, 45) and mothur (46). This makes 16S rRNA sequencing more accessible to researchers with beginner- and intermediate-level bioinformatics experience (33). For projects directed at detection of specific taxa, pilot data using mock microbial communities can guide experimental choices (e.g., primer sets and/or estimation of read numbers or sequencing depth [see below] needed for taxon identification).
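As a rough aid to the read-number estimation mentioned above, the following sketch computes how many reads would be needed to observe at least one read from a taxon of a given relative abundance. It assumes an idealized model in which each read is an independent random draw from the community, so that the detection probability is 1 − (1 − p)^N; extraction, amplification, and classification biases are ignored, and the function names are hypothetical. Pilot data from mock communities remain the more reliable guide.

```python
import math


def detection_probability(rel_abundance: float, n_reads: int) -> float:
    """Probability of sampling at least one read from a taxon.

    Assumes independent draws, each hitting the taxon with probability
    `rel_abundance` (an idealized model, not a pipeline-specific estimate).
    """
    return 1.0 - (1.0 - rel_abundance) ** n_reads


def reads_needed(rel_abundance: float, confidence: float = 0.95) -> int:
    """Smallest read count giving at least `confidence` detection probability."""
    if not 0.0 < rel_abundance < 1.0:
        raise ValueError("rel_abundance must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rel_abundance))


# Hypothetical planning question: a taxon at 0.1% relative abundance.
n = reads_needed(0.001, confidence=0.95)
print(n, detection_probability(0.001, n))
```

Detecting a single read is a weak criterion; in practice a minimum read count per taxon is usually required, which raises the number of reads needed beyond this lower bound.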

Finally, cost must be considered for any project and is arguably one of the most important factors in what type of NGS to initially perform. The differences in cost between the methods relate to the amount and depth of sequencing. Sequencing depth refers to the number of times a certain nucleotide base is represented in the sequencing reads for a given sample (56). Typically, shotgun metagenomics and RNA sequencing analyses require much more sequence data than 16S rRNA sequencing, resulting in their higher costs. However, a recent study by Laudadio and colleagues suggests that shotgun metagenomics, at lower sequencing depths, is comparable in price to 16S rRNA sequencing and still identifies more species (38). Notably, this study did not consider other inherent NGS costs, including computational burden and data storage.
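The definition of sequencing depth given above can be made concrete with a short sketch. It computes per-base depth from a list of aligned read positions and, separately, the expected mean depth from the standard relation depth = (number of reads × read length) / target length. The coordinates and function names are hypothetical, and for a metagenome the "target length" is itself an assumption, because the combined genome size of the community is rarely known.

```python
def per_base_depth(ref_len: int, alignments):
    """Count how many reads cover each position of a reference.

    `alignments` is an iterable of (start, length) pairs using 0-based,
    non-negative start coordinates; reads running past the end are clipped.
    """
    depth = [0] * ref_len
    for start, length in alignments:
        for pos in range(start, min(start + length, ref_len)):
            depth[pos] += 1
    return depth


def mean_depth(depth) -> float:
    """Average depth across all positions of the reference."""
    return sum(depth) / len(depth)


def expected_mean_depth(n_reads: int, read_len: int, target_len: int) -> float:
    """Expected mean depth assuming reads are spread uniformly over the target."""
    return n_reads * read_len / target_len


# Toy alignment (hypothetical): three 5-base reads on a 10-base reference.
depth = per_base_depth(10, [(0, 5), (3, 5), (8, 5)])
print(depth, mean_depth(depth))

# Hypothetical planning figure: 1 million 150-base reads on a 5 Mb genome.
print(expected_mean_depth(1_000_000, 150, 5_000_000))
```

Because 16S rRNA amplicon sequencing targets a fragment of a single gene, far fewer reads are needed to reach a given depth than when the target is every genome in the sample, which is the basis of the cost difference discussed above.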

In summary, the use of the 16S rRNA gene as a phylogenetic marker is efficient and cost effective (52); however, it is subject to biases that other microbiome characterization methods are not (i.e., choice of hypervariable regions and primer-dependent PCR amplification) and can thus result in significant variance in the determined microbial composition of a sample. Additionally, 16S rRNA sequencing is commonly limited to taxonomic classification at the genus level or above (36), as horizontal transfer of the 16S rRNA locus and the existence of multiple bacterial species and strains that are more than 97% similar can prevent more nuanced classification (35, 43). Finally, 16S rRNA analysis provides limited predicted functional information (14, 52). Conversely, shotgun metagenomics and RNA sequencing are more expensive than 16S rRNA sequencing but offer far broader taxonomic coverage (i.e., species- and strain-level resolution), more accurate functional profiling, and the possibility of detecting previously unknown species and strains of microbes (36). Although shotgun metagenomic and/or RNA sequencing undoubtedly provides more information, determining which approach is appropriate depends on the question(s) being asked. For instance, if you want to identify the dominant bacteria in a sample, 16S rRNA sequencing is likely the better method owing to the lower cost and bioinformatics burden (42). We present comparisons herein not to suggest that one sequencing method or protocol is best for all projects but rather to assist readers in selecting the best protocol for their projects.

Technical and individual laboratory issues: sources of variability. There are multiple parameters to consider regarding sample collection and processing, because variabilities in any of these steps can alter NGS data. First, the investigator must choose the type of sample for NGS sequencing. Although fecal samples and body fluids are easier to collect and permit serial sampling, intraluminal fecal samples or tissue samples may provide representative regional colon or site-specific microbiome characterization. Storage conditions can further impact NGS results, and thus this information should be reported. The gold standard is immediate freezing of samples and storage at –80°C (57); however, samples can also be preserved chemically using solutions such as DNA/RNA Shield (Zymo Research) (58).

The first step in sample processing is DNA or RNA extraction, and this step is responsible for the majority, but not all, of experimental variability in microbiome analysis according to the MicroBiome Quality Control project (59). Numerous commercially available kits exist for DNA extraction, including from Covaris, Qiagen, Zymo Research, and others. Typically samples are homogenized, but protocols vary substantially from laboratory to laboratory (59). Although there is not yet a globally accepted gold-standard protocol for DNA or RNA extraction, it is critical that all samples be processed in the same manner. Furthermore, it is strongly recommended that controls be processed to better assess the comparability of different NGS runs, normalize across separate NGS runs to limit batch effects, identify kit-specific contaminants, and determine whether the detection of low-abundance microbes in a sample is of biologic interest or, more likely, represents contamination. Examples of controls include (a) storage buffer (e.g., DNA/RNA Shield); (b) DNA extraction kit components; and (c) a community standard containing known species at known quantities (e.g., Zymo Microbial Community Standard [D6300] and Zymo Microbial Community Standard II Log Distribution [D6310]).
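How such controls might be used after sequencing can be sketched as follows. The first function flags taxa in a sample that also appear in a blank (buffer or kit) control; the second compares observed relative abundances in a community standard with the expected composition. The read-count threshold, taxon names, and counts are hypothetical, and dedicated statistical tools for contaminant identification are more rigorous than this simple overlap rule.

```python
def flag_possible_contaminants(sample_counts, blank_counts, min_blank_reads=1):
    """List taxa seen in the sample that also reach `min_blank_reads` in a blank.

    Overlap with a blank control does not prove contamination; it only
    marks taxa that deserve closer inspection.
    """
    return sorted(
        taxon
        for taxon, count in sample_counts.items()
        if count > 0 and blank_counts.get(taxon, 0) >= min_blank_reads
    )


def community_standard_deviation(expected_fractions, observed_counts):
    """Observed minus expected relative abundance for each expected taxon."""
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("no reads observed in the community standard")
    return {
        taxon: observed_counts.get(taxon, 0) / total - expected
        for taxon, expected in expected_fractions.items()
    }


# Hypothetical read counts for one sample and one extraction-kit blank.
sample = {"Bacteroides": 5000, "Ralstonia": 12, "Escherichia": 300}
blank = {"Ralstonia": 40, "Bradyrhizobium": 8}
flagged = flag_possible_contaminants(sample, blank)
print(flagged)

# Hypothetical two-member standard mixed in equal proportions.
deviation = community_standard_deviation(
    {"Taxon A": 0.5, "Taxon B": 0.5},
    {"Taxon A": 700, "Taxon B": 300},
)
print(deviation)
```

Large deviations in the community standard point to extraction or amplification bias that would also affect the study samples processed in the same batch.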

For 16S rRNA sequencing, the PCR amplification step is also a source of variability. As discussed, there are nine hypervariable regions in the 16S rRNA gene, and available primer sets typically amplify only a subset of these regions. Thus, the performance characteristics of the primer set chosen will influence the number of analyzable reads (60) as well as the results of the analysis (61). For example, one study reports that the V4 primer set yields significantly more Bacteroides and fewer Firmicutes reads than other primer sets tested; this is particularly notable given that the Bacteroidetes/Firmicutes ratio is a commonly reported metric (60).
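The sensitivity of a ratio metric to primer choice can be shown with a small worked example. The read counts below are invented for illustration and do not come from the cited study; the function names are hypothetical.

```python
def relative_abundance(counts):
    """Convert raw read counts per taxon into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: count / total for taxon, count in counts.items()}


def phylum_ratio(counts, numerator="Bacteroidetes", denominator="Firmicutes"):
    """Ratio of read counts for two phyla (the total cancels out)."""
    denominator_reads = counts.get(denominator, 0)
    if denominator_reads == 0:
        raise ValueError("denominator phylum has no reads")
    return counts.get(numerator, 0) / denominator_reads


# Hypothetical counts for the SAME sample amplified with two primer sets.
primer_set_1 = {"Bacteroidetes": 600, "Firmicutes": 300, "Proteobacteria": 100}
primer_set_2 = {"Bacteroidetes": 400, "Firmicutes": 500, "Proteobacteria": 100}

print(relative_abundance(primer_set_1))
print(phylum_ratio(primer_set_1), phylum_ratio(primer_set_2))
```

In this invented case the ratio falls on opposite sides of 1 depending on the primer set alone, which illustrates why ratios from studies using different primer sets should be compared with caution.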

The results of sequencing itself also vary with different equipment, and thus, ideally, all samples are sequenced using the same sequencing platform (e.g., Illumina MiSeq, NovaSeq). Finally, a wide variety of bioinformatics pipelines are available, for both 16S rRNA and shotgun metagenomics data, and the choice of computational and statistical methods can have a critical effect on outcomes and conclusions (36, 59, 62), including the risk of reporting false associations and of missing true ones. While a full review of computational methods and their relative strengths and weaknesses is beyond the scope of our discussion, Liu et al. (33) provided a recent review covering dozens of methods.

Overall, variability in any of the steps of NGS (e.g., sample type, sample storage, DNA extraction, PCR amplification, sequencing technology, read length, and/or bioinformatics analysis) can lead to data variability. There is generally not a “right” answer as to the best method or approach. The most important principle is that all samples be treated the same to facilitate meaningful comparisons between samples in the same study. However, as discussed in the next section, great care must be taken in comparing results between different studies, as these variables may differ.

Challenges of rigor, reproducibility, and reporting. Microbiome science is complex, cutting across many scientific fields, including microbiology, epidemiology, biology, computational science, genomics, and biostatistics. This complexity and the rapid evolution of approaches within the field have led to the reporting of disparate findings between studies investigating seemingly similar patient populations. Thus, increasing attention is now directed to developing well-curated and validated databases that are critical for accurate analyses, and providing guidance for the consistent conduct and reporting of study design, methods, and results of microbiome research. In 2018, Schloss provided a thoughtful and pragmatic essay for translational researchers to consider the threats to rigor, reproducibility, and generalizability within microbiome research (63). Others have called for a centralized robust curated data repository for microbiome data adherent to FAIR (findable, accessible, interoperable, and reusable) principles (64). Consistent with this need, the FDA has established an evolving quality-controlled and highly curated public microbial reference database (FDA-ARGOS) for microbiome research (65), although this database is still relatively small. Most recently, the STORMS (Strengthening the Organization and Reporting of Microbiome Studies) 17-point Microbiome Reporting Checklist was proposed as a guide for researchers, reviewers, and readers for the presentation, assessment, and understanding of microbiome research across studies (66, 67). Although STORMS was developed through a strong iterative process, it is based on the analysis of only one paper and has been minimally used to date (66). Nonetheless, previous reporting guidelines, e.g., CONSORT (Consolidated Standards of Reporting Trials), improved the quality of clinical trial reporting (66), and such results support calls for more structured microbiome research reporting. Improvement of microbiome science communication and of the ability to cross-compare studies is essential for human microbiome studies to yield progress in applying microbiome science to patient care.

Essential considerations in collection of clinical metadata. Beyond the complexities of designing the laboratory, computational, and statistical approaches to NGS-driven human studies, the investigator must also consider what and how much clinical metadata to collect. Age, sex, and geography are fundamental as each impacts microbiome composition and likely function (68, 69). However, given the interindividual variability in the microbiome (70) and disease-associated data (discussed below), more nuanced considerations of individual exposures, both current and over time, may be needed. These include environmental exposures associated with migration (71), diet and food additives (72), and antibiotic and non-antibiotic medications (73, 74). While genetic impacts on the human microbiome have been downplayed in recent literature (75), this is likely short-sighted, as we do not yet understand how microbial communities function, and data suggest that select members of the microbiome serve as functional drivers that intersect with host genes to regulate clinical outcomes (76–78). This broad field of human exposures that impact health and disease is termed the “exposome” and, while impossible to fully capture in most studies, deserves careful thought in study design, data accrual, and interpretation (79).