Date: 2023-04-03

Patient cohort and LRTI adjudication. We enrolled children with acute respiratory failure requiring mechanical ventilation at 8 hospitals in the United States between February 2015 and December 2017, as previously described (9, 25). TA was collected within 24 hours of intubation and underwent mNGS of RNA to assay host gene expression and detect respiratory microbiota (Figure 1). High-quality host gene expression and microbial data were obtained for 261 patients (Supplemental Data File 1).

Figure 1 Study overview. Pediatric patients with acute respiratory failure requiring mechanical ventilation were clinically adjudicated into 4 LRTI status groups. The patients in the Definite and No Evidence groups, whose LRTI status was presumed to be known, were used to develop an integrated host/microbe mNGS classifier for LRTI and to evaluate its performance by cross-validation. The classifier was then applied to the patients in the Suspected and Indeterminate groups, whose LRTI status was considered uncertain. The integrated mNGS classifier takes into account a host probability of LRTI, derived from the host gene counts, and features of any viral or bacterial/fungal pathogens, derived from the nonhost (microbial) taxon counts.

Adjudication of LRTI status was performed without knowledge of the mNGS results and depended on the combination of two elements: (a) a retrospective clinical diagnosis made by study-site clinicians, who reviewed all clinical, laboratory, and imaging data available at the end of the admission, and (b) any standard-of-care respiratory microbiologic diagnostics (NP swab viral PCR and/or TA culture) performed on specimens collected during the first 48 hours of intubation. Patients were assigned to their LRTI status group as follows: (a) Definite, if clinicians made a diagnosis of LRTI and the patient had clinical microbiologic findings; (b) Suspected, if clinicians made a diagnosis of LRTI, but there were no microbiologic findings; (c) Indeterminate, if no diagnosis of LRTI was made despite some microbiologic findings; and (d) No Evidence, if clinicians identified a clear noninfectious cause of acute respiratory failure and no clinical or microbiologic suspicion of LRTI arose. We note that comprehensive microbiologic testing was not always performed in the No Evidence group in the absence of clinical suspicion.
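The four-way assignment above can be written as a small decision function. This is a minimal sketch, not the study's adjudication procedure: the function name and the two boolean inputs are hypothetical, and it collapses the No Evidence criterion (a clear noninfectious cause with no clinical or microbiologic suspicion) into the case where both inputs are false.

```python
def assign_lrti_group(clinician_dx_lrti: bool, micro_findings: bool) -> str:
    """Map the two adjudication elements to an LRTI status group.

    clinician_dx_lrti: retrospective clinical diagnosis of LRTI (hypothetical flag).
    micro_findings: any positive standard-of-care microbiologic result (hypothetical flag).
    """
    if clinician_dx_lrti and micro_findings:
        return "Definite"
    if clinician_dx_lrti:
        return "Suspected"      # diagnosis without microbiologic support
    if micro_findings:
        return "Indeterminate"  # microbiologic findings without a diagnosis
    return "No Evidence"        # simplification: assumes a clear noninfectious cause


groups = [assign_lrti_group(dx, micro)
          for dx in (True, False) for micro in (True, False)]
```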

The Definite and No Evidence groups were used to develop the metagenomic classifiers and to evaluate their performance by cross-validation due to the high degree of confidence in their clinical diagnoses (Figure 1). The patients in the Definite group were 39% female, with a median age of 0.5 years (IQR, 0.2–1.8), while the patients in the No Evidence group were 50% female, with a median age of 6.5 years (IQR, 1.5–12.9; Table 1 and Supplemental Figure 1; supplemental material available online with this article; https://doi.org/10.1172/JCI165904DS1). The difference in the age distribution of these groups (P < 0.001, Mann-Whitney test) reflected recognized epidemiological distinctions in the conditions that typically lead to respiratory failure in very young versus older children (3, 5).

Table 1 Demographic and clinical cohort characteristics

Within the Definite group, 95% of patients were intubated by 2 days from hospital admission, indicative of community-acquired infection (Table 1). Clinical microbiologic testing identified viral infection alone in 46% of patients, bacterial infection alone in 14% of patients, and viral/bacterial coinfection in 40% of patients. The most common pathogens were respiratory syncytial virus (RSV) and Haemophilus influenzae, which frequently co-occurred (9). Diagnoses in the No Evidence group included trauma, neurological conditions, cardiovascular disease, airway abnormalities, ingestion of drugs/toxins, and sepsis that was clearly unconnected to LRTI. Nevertheless, most patients received antibiotic treatment by the time of TA sample collection in both the Definite (96%) and No Evidence (84%) groups (Table 1).

Classification of LRTI status based on TA host gene expression features. We first compared TA host gene expression between the Definite and No Evidence groups to determine whether it could distinguish patients based on LRTI status, regardless of the underlying cause of infection. We identified 4,718 differentially expressed genes at a Benjamini-Hochberg–adjusted P < 0.05 (Supplemental Figure 2A and Supplemental Data File 2). As expected, gene set enrichment analysis identified elevated expression of pathways involved in the immune response to infection in the Definite group (Supplemental Figure 2B and Supplemental Data File 3). Pathways related to the interferon response, a hallmark of antiviral innate immunity, were most strongly upregulated, consistent with the high prevalence of viral infections in the Definite group. Additional immune pathways upregulated in this group included Toll-like receptor signaling, cytokine signaling, inflammasome activation, neutrophil degranulation, antigen processing, and B cell and T cell receptor signaling. Conversely, pathways with reduced expression in the Definite group included translation, cilium assembly, and lipid metabolism (Supplemental Figure 2B and Supplemental Data File 3).
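For readers unfamiliar with the multiple-testing correction named above, the following is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment. The function name and the toy P values are illustrative; the study's own differential expression software is not shown here.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.005]  # illustrative values, not study data
toy_adjusted = benjamini_hochberg(toy_p)
significant = [p < 0.05 for p in toy_adjusted]
```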

Because we observed a clear host signature of infection, we developed a classification approach to distinguish the patients in the Definite and No Evidence groups based on gene expression and evaluated its performance by 5-fold cross-validation. For each train/test split, we (a) used LASSO logistic regression on the samples in the training folds to select a parsimonious set of informative genes, (b) trained a random forest classifier using the selected genes, and (c) applied it to the samples in the test fold to obtain a host probability of LRTI.
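The key design point in steps (a) through (c) is that gene selection is repeated inside each training split, so the held-out fold never influences which genes are chosen. The sketch below illustrates only that structure. To stay dependency-free it swaps in two simple stand-ins, a mean-difference gene ranker in place of LASSO logistic regression and a nearest-centroid scorer in place of the random forest; all function names and the toy data are hypothetical.

```python
from statistics import mean


def stratified_folds(labels, k=5):
    """Deal indices into k folds class by class (no shuffling, for reproducibility)."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds


def out_of_fold_probabilities(X, y, select_features, fit_model, k=5):
    """Run selection and training on the training folds only, then score the test fold."""
    oof = [None] * len(y)
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        genes = select_features(X_train, y_train)                 # step (a)
        predict = fit_model([[row[g] for g in genes] for row in X_train],
                            y_train)                              # step (b)
        for i in test_idx:
            oof[i] = predict([X[i][g] for g in genes])            # step (c)
    return oof


def top_mean_difference(X, y, n=2):
    """Stand-in selector (NOT LASSO): keep the n genes with the largest class gap."""
    def gap(g):
        pos = mean([row[g] for row, lab in zip(X, y) if lab == 1])
        neg = mean([row[g] for row, lab in zip(X, y) if lab == 0])
        return abs(pos - neg)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:n]


def nearest_centroid(X, y):
    """Stand-in classifier (NOT a random forest): score by relative centroid distance."""
    pos = [mean(col) for col in zip(*[r for r, lab in zip(X, y) if lab == 1])]
    neg = [mean(col) for col in zip(*[r for r, lab in zip(X, y) if lab == 0])]

    def predict(row):
        d_pos = sum((a - b) ** 2 for a, b in zip(row, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(row, neg))
        total = d_pos + d_neg
        return d_neg / total if total else 0.5
    return predict


# Toy data: 20 "LRTI" and 20 "non-LRTI" samples, 3 genes; only gene 0 is informative.
y = [1] * 20 + [0] * 20
X = [[3.0 * lab + 0.1 * (i % 5), (i * 7 % 11) / 10, (i * 3 % 5) / 10]
     for i, lab in enumerate(y)]
oof = out_of_fold_probabilities(X, y, top_mean_difference, nearest_centroid)
```

Each sample receives exactly one out-of-fold score, produced by a model that never saw it during either selection or training.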

Our approach yielded a median area under the receiver operating characteristic curve (AUC) of 0.967 (range, 0.953–0.996), with the number of genes selected for use in the classifier ranging from 11 to 25 across the 5 train/test splits (Figure 2A and Supplemental Table 1). Using a 50% out-of-fold probability threshold to classify a patient as suffering from LRTI (LRTI+), the classifier assigned 92% of patients in the Definite group and 80% of patients in the No Evidence group according to their clinical LRTI adjudication (Figure 2B).
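As a reminder of what these two summary measures compute, the sketch below derives the AUC as the probability that a randomly chosen positive sample outscores a randomly chosen negative one, and applies a 50% threshold to out-of-fold probabilities. The function names and toy values are illustrative, and treating a probability exactly at the threshold as LRTI+ is an assumption.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def call_lrti(probabilities, threshold=0.5):
    """Label each sample LRTI+ (True) when its probability meets the threshold."""
    return [p >= threshold for p in probabilities]


toy_labels = [1, 1, 0, 0]            # illustrative, not study data
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_calls = call_lrti(toy_scores)
```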

Figure 2 Host gene expression classifier for LRTI diagnosis. (A) Receiver operating characteristic (ROC) curve of the host gene expression classifier in each of the test folds. The median and range of the area under the curve (AUC) are indicated. (B) The number and percentage of patients in the Definite and No Evidence groups who were classified according to their clinical adjudication using a 50% out-of-fold probability threshold. (C) Heatmap showing standardized variance-stabilized expression values across all patients (columns) for the 14 final classifier genes (rows) selected from the full Definite and No Evidence data set. Shown are the LRTI adjudication (top colored horizontal bar) and out-of-fold LRTI probability (top dot plot) of each patient and the regression coefficient of each selected gene (side bar plot).

Having validated the performance of our approach by cross-validation, we then applied LASSO logistic regression to all the patients in the Definite and No Evidence groups to select a final set of genes (n = 14) for later classification of patients with Suspected or Indeterminate LRTI status (Figure 2C and Supplemental Table 2). As expected, the genes in the final classifier set that were assigned high absolute regression coefficients were also repeatedly selected in the cross-validation procedure (Supplemental Table 2).

The selected genes with the most positive regression coefficients, corresponding to higher expression in the Definite group, were GNLY, encoding an antibacterial peptide present in cytolytic granules of cytotoxic T cells and natural killer cells (26); SLC38A2, encoding a glutamine transporter upregulated in CD28-stimulated T cells (27, 28); FFAR3, encoding a G protein–coupled receptor activated by short-chain fatty acids that is induced by alveolar macrophages upon infection (29); and the interferon-stimulated genes PSMB8, ISG15, and IRF1 (Figure 2C and Supplemental Table 2).

The selected genes with the most negative regression coefficients, corresponding to lower expression in the Definite group, were FABP4, encoding a fatty acid-binding protein considered a marker of alveolar macrophages, whose expression in the lung decreases in patients with LRTI, including COVID-19 (30–32), and RBP4, encoding a retinol-binding protein, whose expression in the lung also sharply decreases following onset of LRTI (30) and whose expression by macrophages in vitro is depressed by inflammatory stimuli (33) (Figure 2C and Supplemental Table 2).

We examined the expression of the final classifier genes as a function of patient age to confirm that their selection was not influenced by the different age distributions of the Definite and No Evidence groups (Supplemental Figure 3). Reassuringly, we found no significant difference in the expression of the 14 genes when comparing patients in the No Evidence group under the age of 4 (n = 23; median age, 1.3 years) and over the age of 4 (n = 27; median age, 12.5 years) (Supplemental Table 3A). Furthermore, we found that expression of 12 of the genes remained significantly different when comparing only children under the age of 4 in the Definite (n = 100; median age, 0.4 years) and No Evidence (n = 23; median age, 1.3 years) groups (Supplemental Table 3B).

Detection of pathogens by mNGS and definition of microbial classification features. We proceeded to analyze the microbial mNGS data to nominate likely pathogens whose features could be integrated into the LRTI classifier. We processed the TA samples alongside water controls through the Chan Zuckerberg ID (CZ-ID) metagenomic analysis pipeline (https://czid.org/) to obtain a count matrix of microbial taxa. The water controls allowed us to generate a background count distribution for each taxon, which modeled the contribution of contamination by microbes present in the laboratory environment or reagents.

Viruses with known ability to cause LRTI that were present at an abundance statistically exceeding their background distribution were considered probable pathogens. By this criterion, we detected viruses in the lungs of 107 of 117 (91%) patients in the Definite group, with RSV being the most prevalent (Figure 3A). Among patients in the No Evidence group, 8 of 50 (16%) also had viruses detected by mNGS, which were probably missed clinically in the absence of characteristic symptoms. We defined the summed abundance of all pathogenic viruses detected in a patient, measured in reads-per-million (rpM), as the patient’s “viral score” for later use in an integrated host/microbe classifier (Figure 3B).
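The viral score can be sketched as a filtered sum. The background test below is a simple z-score against water controls, used purely as an assumption for illustration; this section does not specify the statistical model, and the cutoff, function names, and toy values are all hypothetical.

```python
from statistics import mean, pstdev


def exceeds_background(sample_rpm, control_rpms, z_cutoff=2.0):
    """Assumed background test: z-score of the sample against water-control rpM."""
    mu = mean(control_rpms)
    sd = pstdev(control_rpms)
    if sd == 0:
        return sample_rpm > mu  # no spread in controls: require a strict excess
    return (sample_rpm - mu) / sd > z_cutoff


def viral_score(sample_rpm, control_rpm, pathogenic_viruses):
    """Sum the rpM of known LRTI viruses that pass the background filter."""
    return sum(
        rpm
        for virus, rpm in sample_rpm.items()
        if virus in pathogenic_viruses
        and exceeds_background(rpm, control_rpm.get(virus, [0.0]))
    )


# Toy example (illustrative values only).
sample = {"RSV": 1200.0, "HRV": 0.5, "phage": 300.0}
controls = {"RSV": [0.0, 0.0, 0.0], "HRV": [0.4, 0.6, 0.5], "phage": [0.0, 0.0, 0.0]}
score = viral_score(sample, controls, {"RSV", "HRV"})
```

In this toy case only RSV contributes: HRV does not exceed its background, and the phage is not on the pathogen list.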

Figure 3 Metagenomic identification of respiratory pathogens. (A) Bar plot showing the distribution of viruses detected by mNGS after background filtering in patients in the Definite and No Evidence groups. RSV, respiratory syncytial virus; HRV, human rhinovirus; PIV, parainfluenza virus; HMPV, human metapneumovirus; HCoV, human coronavirus; IV, influenza virus; ADV, adenovirus; HBoV, human bocavirus; CMV, cytomegalovirus. (B) Box plot showing the log10-transformed summed abundance, measured in reads-per-million (rpM), of all pathogenic viruses detected in each patient, separated by group. Prior to log10-transformation, the minimum non-zero rpM value in the data set was divided by 10 and added to all the samples. Horizontal lines denote the median, box hinges represent the interquartile range (IQR), and whiskers extend to the most extreme value no greater than 1.5 × IQR from the hinges. (C) Analysis steps applied as part of the rules-based model (RBM), a heuristic approach designed to identify potential bacterial/fungal pathogens in the context of LRTI. (D) Graphical illustration of the RBM results in two representative patients from the Definite group. Each dot represents a bacterial/fungal species most abundant in its respective genus. Species above the maximum drop-off in rpM are colored in red; otherwise, the color is white. Species on the list of known respiratory pathogens have black outlines; otherwise, the outline is gray. (E) Bar plot showing the distribution of bacteria/fungi called as potential pathogens by the RBM in patients in the Definite and No Evidence groups. Strep. spp., Streptococcus species other than S. pneumoniae. (F) Box plot showing the proportion of the RBM-identified pathogen(s) out of all nonhost counts in each patient, separated by group. Horizontal lines denote the median, box hinges represent the interquartile range (IQR), and whiskers extend to the most extreme value no greater than 1.5 × IQR from the hinges.
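The pseudocount used for panel B can be illustrated as follows. This is a minimal sketch of the transformation described in the legend; the function name and toy values are hypothetical.

```python
import math


def log10_with_pseudocount(rpm_values):
    """Add (minimum non-zero rpM) / 10 to every sample, then take log10."""
    nonzero = [v for v in rpm_values if v > 0]
    if not nonzero:
        raise ValueError("at least one non-zero rpM value is required")
    pseudocount = min(nonzero) / 10
    return [math.log10(v + pseudocount) for v in rpm_values]


toy_rpm = [0.0, 10.0, 1000.0]               # illustrative values only
toy_log = log10_with_pseudocount(toy_rpm)    # the pseudocount here is 1.0
```

The pseudocount keeps samples with no detected virus on the plot, one log unit below the smallest detected abundance.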

Because most patients in the Definite group had a positive NP swab viral PCR test, we could compare the viruses detected by PCR and mNGS (Supplemental Data File 4). The comparison was complicated, however, by the fact that PCR was performed on upper airway samples, so a virus detected by PCR was not necessarily present in the lower airway. Bearing this in mind, we found that 99 of 101 (98%) patients in the Definite group with a viral PCR hit also had a virus detected by mNGS, and both approaches detected at least 1 virus in common in 91 (92%) of those patients (Supplemental Figure 4A). Most cases in which NP swab PCR detected a virus, but mNGS did not, involved adenovirus (Supplemental Figure 4B). mNGS alone detected viruses in 8 of 16 (50%) patients in the Definite group lacking a viral PCR hit (Supplemental Figure 4A). We additionally performed viral PCR on the same TA samples subjected to mNGS in a subset of patients in the Definite group (n = 21), and 96% of PCR hits were detected by mNGS in this direct comparison (Supplemental Table 4).

Bacterial and fungal taxa in the mNGS data also underwent background filtering to retain only those present at an abundance statistically exceeding their background distribution based on water controls. Because incidental carriage of potentially pathogenic bacteria is common in children, we additionally applied a previously published algorithm to distinguish possible pathogens from commensals, called the rules-based model (RBM) (9, 22). The RBM identifies bacteria and fungi with known pathogenic potential that are relatively dominant in a sample (Figure 3, C and D), based on the principle that uncontrolled growth of a pathogen leads to reduced lung microbiome α-diversity in the context of LRTI (22, 34–36) (Supplemental Figure 4, C and D).
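The dominance logic of the RBM, as illustrated in Figure 3, C and D, can be sketched in a few lines. This is a simplified reading rather than the published implementation (9, 22): it measures the drop-off on raw rpM, treats a lone species as dominant, and uses hypothetical function and variable names throughout.

```python
def rules_based_model(species_rpm, genus_of, known_pathogens):
    """Return known pathogens ranked above the largest drop-off in abundance."""
    # Keep only the most abundant species within each genus.
    top_by_genus = {}
    for species, rpm in species_rpm.items():
        genus = genus_of[species]
        if genus not in top_by_genus or rpm > species_rpm[top_by_genus[genus]]:
            top_by_genus[genus] = species

    ranked = sorted(top_by_genus.values(),
                    key=lambda s: species_rpm[s], reverse=True)
    if len(ranked) < 2:
        cut = len(ranked)  # assumption: a single species counts as dominant
    else:
        # Position of the largest gap between consecutive ranked abundances.
        drops = [species_rpm[ranked[i]] - species_rpm[ranked[i + 1]]
                 for i in range(len(ranked) - 1)]
        cut = drops.index(max(drops)) + 1

    return [s for s in ranked[:cut] if s in known_pathogens]


# Toy sample (illustrative rpM values, not study data).
rpm = {"H. influenzae": 5000, "H. parainfluenzae": 200,
       "S. mitis": 150, "P. melaninogenica": 100}
genus = {"H. influenzae": "Haemophilus", "H. parainfluenzae": "Haemophilus",
         "S. mitis": "Streptococcus", "P. melaninogenica": "Prevotella"}
calls = rules_based_model(rpm, genus, {"H. influenzae", "S. pneumoniae"})
```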

The RBM identified possible bacterial/fungal pathogens in 78 of 117 (66%) patients in the Definite group, with the most common being H. influenzae, Moraxella catarrhalis, and Streptococcus pneumoniae (Figure 3E). The RBM also identified potential bacterial/fungal pathogens in 17 of 50 (34%) patients in the No Evidence group. Patients in the Definite group with an RBM-identified pathogen exhibited markedly lower bacterial α-diversity compared with patients in the Definite group without an RBM-identified pathogen and compared with patients in the No Evidence group (Supplemental Figure 4D). In contrast, patients in the No Evidence group with an RBM-identified pathogen did not typically exhibit a loss of bacterial α-diversity (Supplemental Figure 4D), and in such cases, the RBM-identified species was far less dominant (Figure 3F). We, therefore, defined the patient’s “bacterial score” for use in an integrated host/microbe classifier as the proportion of the nonhost counts assigned to the RBM-identified pathogens, a measure of relative dominance (Figure 3F).
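The bacterial score defined above is a simple proportion, sketched below alongside the Shannon index as one common measure of α-diversity. Whether the study used the Shannon index specifically is an assumption here, and the function names and toy counts are hypothetical.

```python
import math


def bacterial_score(nonhost_counts, rbm_pathogens):
    """Proportion of all nonhost counts assigned to the RBM-identified pathogens."""
    total = sum(nonhost_counts.values())
    if total == 0:
        return 0.0
    return sum(nonhost_counts.get(s, 0) for s in rbm_pathogens) / total


def shannon_diversity(counts):
    """Shannon index (natural log); lower values indicate a more dominated community."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Toy communities (illustrative counts only).
dominated = {"H. influenzae": 90, "S. mitis": 10}
balanced = {"H. influenzae": 50, "S. mitis": 50}
score_dominated = bacterial_score(dominated, ["H. influenzae"])
score_balanced = bacterial_score(balanced, ["H. influenzae"])
```

In the toy data the dominated community has both the higher bacterial score and the lower diversity, mirroring the relationship described for the Definite group.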

We next sought to compare the bacterial and fungal pathogens identified by mNGS with those found by culture of TA samples (Supplemental Data File 4). Importantly, mNGS can detect organisms that are challenging to grow in culture or are inhibited by previous antibiotic treatment, and the RBM selects the likeliest pathogen based on a global view of the microbiome. Despite these inherent differences between culture and the RBM, we found that in 44 of 63 (70%) patients in the Definite group who had a positive culture, at least 1 pathogen identified by the RBM was also found by culture (Supplemental Figure 4E). In the remaining 19 patients, the RBM identified a different species than culture (n = 7) or no pathogen at all (n = 12). Even in these cases, the species grown in culture was usually present in the mNGS data, but other species were more dominant (Supplemental Figure 4E). The RBM also identified a potential pathogen in 27 of 54 (50%) patients in the Definite group lacking a positive culture (Supplemental Figure 4E). Most cases where the species grown in culture was absent from the mNGS data after background filtering involved Staphylococcus aureus, Streptococcus species other than S. pneumoniae, and E. coli (Supplemental Figure 4F).

Host gene expression differences between viral and bacterial LRTI. Overall, mNGS identified viral and/or bacterial pathogens in 114 of 117 (97%) patients in the Definite group. Having established by mNGS which patients had an exclusively bacterial infection (n = 7), an exclusively viral infection (n = 36), or a viral/bacterial coinfection (n = 71), we reexamined how effectively the top host classifier genes captured these different scenarios (Supplemental Figure 5A). As expected, some of the interferon-stimulated genes (e.g., ISG15) provided much more discriminating power for patients with a viral infection as compared with those with a purely bacterial infection. Reassuringly, however, several other classifier genes behaved similarly regardless of the underlying infection type.

We then asked more broadly whether host gene expression differed between patients with any bacterial LRTI (including viral coinfection) and patients with purely viral LRTI. We identified 108 differentially expressed genes at a Benjamini-Hochberg–adjusted P < 0.05 (Supplemental Figure 5B and Supplemental Data File 2) and found that genes related to neutrophil degranulation and cytokine signaling were enriched in patients with any bacterial LRTI (Supplemental Figure 5C and Supplemental Data File 3). These results suggest that future work could develop a rule-out classifier for bacterial infection to limit unnecessary antibiotic use.

Classification of LRTI status based on integration of host and microbial features. Next, we asked whether integrating the host and microbial features could improve the performance of metagenomic LRTI classification. We fit a logistic regression model on the following features: (a) the LRTI probability output of the host classifier, (b) the summed abundance, measured in rpM, of any pathogenic viruses present after background filtering (the viral score), and (c) the proportion of the potentially pathogenic bacteria/fungi identified by the RBM out of all nonhost read counts (the bacterial score) (Figure 4A). As expected, the host and microbial features were correlated across most samples, but some notable exceptions were observed (Supplemental Figure 6).
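A three-feature logistic regression of this kind can be sketched without external libraries. The version below fits the model by plain gradient descent on toy data. The log10(rpM + 1) scaling of the viral score is an assumption made to keep the features on comparable scales, as this section does not state how the inputs were transformed before fitting, and all names and values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    weights = [0.0] * len(X[0])
    intercept = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(X, y):
            error = sigmoid(intercept + sum(w * x for w, x in zip(weights, row))) - label
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        intercept -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, intercept


def predict_lrti(weights, intercept, host_prob, viral_log_rpm, bacterial_score):
    """Integrated LRTI probability from the three input features."""
    features = [host_prob, viral_log_rpm, bacterial_score]
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, features)))


# Toy rows: [host probability, log10(viral rpM + 1), bacterial score]; not study data.
X = [[0.9, 3.0, 0.8], [0.8, 0.0, 0.6], [0.7, 4.0, 0.0],
     [0.2, 0.0, 0.1], [0.1, 0.0, 0.0], [0.3, 0.0, 0.05]]
y = [1, 1, 1, 0, 0, 0]
weights, intercept = fit_logistic(X, y)
p_infected = predict_lrti(weights, intercept, 0.9, 3.0, 0.7)
p_uninfected = predict_lrti(weights, intercept, 0.1, 0.0, 0.0)
```

Because the host probability enters as one feature among three, strong microbial evidence can raise a borderline host call, and its absence can lower one.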

Figure 4 Integrated host/microbe classifier for LRTI diagnosis. (A) Schematic of the integrated host/microbe classifier. (B) Receiver operating characteristic (ROC) curve of the integrated classifier in each of the test folds. The median and range of the area under the curve (AUC) are indicated. (C) Bar plot showing the number and percentage of patients in the Definite and No Evidence groups who were classified according to their clinical adjudication using a 50% out-of-fold probability threshold. (D) The shift in out-of-fold LRTI probability from the host classifier to the integrated classifier for patients in the Definite (left) and No Evidence (right) groups. Dark connecting lines highlight patients whose LRTI probability shifted across the 50% threshold.

The integrated classifier achieved a median AUC of 0.986 (range, 0.953–1.000) when assessed by 5-fold cross-validation (Figure 4B and Supplemental Table 5), applying the same train/test splits from the host-only cross-validation. Using a 50% out-of-fold probability threshold, the integrated classifier assigned 109 of 117 (93%) patients in the Definite group as LRTI+ and 44 of 50 (88%) patients in the No Evidence group as LRTI– (Figure 4C and Supplemental Table 6). Compared with the host-only classifier, a net of 5 additional patients were now classified according to their clinical adjudication, and the confidence of patient classifications increased, as reflected by more extreme output probabilities (Figure 4D). Reassuringly, all patients in the No Evidence group with a diagnosis of nonpulmonary sepsis (n = 6) were classified as LRTI–, despite suffering an infection elsewhere in the body (Supplemental Table 7). We note that at a 15% out-of-fold probability threshold, the integrated classifier’s sensitivity for LRTI in the Definite group rose to more than 98%, suggesting a use case as a rule-out test for LRTI.
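The trade-off between the 50% and 15% thresholds can be made concrete with a short sensitivity and specificity calculation. The sketch below uses toy probabilities rather than study data, counts a probability exactly at the threshold as LRTI+ (an assumption), and then checks the reported percentages from the counts given above.

```python
def sensitivity_specificity(labels, probabilities, threshold=0.5):
    """Return (sensitivity, specificity) when calling LRTI+ at or above the threshold."""
    tp = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probabilities) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probabilities) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


toy_labels = [1, 1, 1, 0, 0]                 # illustrative only
toy_probs = [0.9, 0.3, 0.6, 0.2, 0.4]
at_50 = sensitivity_specificity(toy_labels, toy_probs, threshold=0.5)
at_15 = sensitivity_specificity(toy_labels, toy_probs, threshold=0.15)

# Percentages reported in the text, recomputed from the stated counts.
definite_pct = round(100 * 109 / 117)        # patients called LRTI+ in the Definite group
no_evidence_pct = round(100 * 44 / 50)       # patients called LRTI- in the No Evidence group
```

Lowering the threshold raises sensitivity at the cost of specificity, which is the behavior a rule-out test relies on.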

Finally, we trained the integrated host/microbe classifier on all the patients in the Definite and No Evidence groups and then applied it to the patients in the Suspected and Indeterminate groups, whose clinical diagnosis was less certain. The integrated classifier indicated that 37 of 57 (65%) patients in the Suspected group were LRTI+ compared with 12 of 37 (32%) patients in the Indeterminate group (Figure 5A), consistent with the stronger clinical suspicion of LRTI in the former case. Across all 49 patients classified as LRTI+ in these groups, likely pathogens (viral, bacterial, or fungal) were identified in 48 patients (98%). Pathogens detected included common (e.g., rhinovirus, H. influenzae), uncommon (e.g., bocavirus, parechovirus), and difficult to culture (e.g., Mycoplasma pneumoniae) microbes (Figure 5B). We also designed a visual summary incorporating all 3 inputs of the integrated classifier and its output LRTI probability (Figure 5C).