Microbiome analysis generates extensive datasets composed of unique sequences, such as amplicon sequence variants (ASVs) or metagenome-assembled genomes (MAGs), each representing a distinct type of microbe. These datasets carry a wealth of information about microbial diversity and functionality within microbiome samples. Nevertheless, their high dimensionality and sparsity present significant challenges. When variables far outnumber samples, the “curse of dimensionality” emerges, complicating the identification of authentic health-related microbial signatures (14). Reducing the dimensionality and sparsity of the original microbiome datasets, collectively called data reduction, is therefore imperative for microbiome analysis. However, the data reduction practices currently used in mainstream microbiome analysis can lose or distort information.

Information loss. Database-dependent analysis of microbiome datasets can lose information on novel or understudied microbes and their functions. The primary step in conventional microbiome data analysis is taxonomic assignment and functional annotation, which relies heavily on reference databases such as SILVA (15) (https://www.arb-silva.de/) or KEGG (16) (https://www.genome.jp/kegg/). When unclassified or unannotated sequences are excluded from downstream analysis, the diversity and function of the novel or understudied microbes they represent are ignored in all further analysis. In practice, it is common for 10%–40% of ASVs to remain unclassifiable at the genus level, and up to 50% of genes may lack functional annotations (17). Excluding these sequences can discard a substantial portion of the data and skew the representation of microbial communities and functions.
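
As a rough illustration of how this loss can be quantified before unclassified features are dropped, the sketch below computes the share of ASVs and the share of reads that a genus-level filter would discard. The ASV names, genus labels, and read counts are hypothetical toy values, not data from any study.

```python
# Hypothetical toy data: ASV id -> (genus assignment or None, total read count).
# None marks an ASV that the reference database could not classify at genus level.
asvs = {
    "ASV1": ("Bacteroides", 500),
    "ASV2": (None, 300),
    "ASV3": ("Faecalibacterium", 150),
    "ASV4": (None, 50),
    "ASV5": ("Blautia", 200),
}

# ASVs that a genus-level filter would exclude from downstream analysis.
unclassified = [name for name, (genus, _) in asvs.items() if genus is None]

total_reads = sum(count for _, count in asvs.values())
lost_reads = sum(asvs[name][1] for name in unclassified)

feature_loss = len(unclassified) / len(asvs)  # fraction of ASVs discarded
read_loss = lost_reads / total_reads          # fraction of reads discarded

print(f"ASVs discarded:  {feature_loss:.1%}")
print(f"Reads discarded: {read_loss:.1%}")
```

Reporting both fractions is a deliberate choice here: a filter that removes many rare ASVs may discard few reads, while one abundant unclassified ASV can account for a large share of a sample.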

Information distortion. Information distortion, on the other hand, happens when the process of reducing dataset complexity introduces biases. For instance, lumping ASVs by genus or genes by pathways (14) can conceal strain-level variation. Strains within the same taxon may exhibit different or even opposing correlations with the same disease or intervention, so their summed abundance can appear unchanged while the individual strains shift markedly. Similarly, the same critical pathway gene, such as the but gene for butyrate production, may be harbored by two competing bacterial strains; if one strain increases while the other declines, the total abundance of the gene masks the true change (18). Failure to account for these strain-level variations during dimensionality reduction can lead to information distortion, resulting in inconsistencies across microbiome studies and hindering the establishment of clear associations between microbiome features and diseases.
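
The masking effect of lumping can be made concrete with a small numerical sketch. Two hypothetical strains of the same genus are constructed so that one rises and the other falls with a made-up disease score; the genus-level sum is then correlated with the same score. All values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)

# Hypothetical toy data for six samples.
disease_score = [1, 2, 3, 4, 5, 6]
strain_a = [10, 20, 30, 40, 50, 60]   # increases with the disease score
strain_b = [61, 48, 41, 31, 18, 11]   # decreases with the disease score

# Lumping both strains into one genus-level variable.
genus = [a + b for a, b in zip(strain_a, strain_b)]

r_a = pearson(strain_a, disease_score)
r_b = pearson(strain_b, disease_score)
r_genus = pearson(genus, disease_score)

print(f"strain A vs disease: r = {r_a:+.3f}")
print(f"strain B vs disease: r = {r_b:+.3f}")
print(f"genus    vs disease: r = {r_genus:+.3f}")
```

In this constructed case each strain correlates strongly with the score in opposite directions, while the genus-level variable shows essentially no association. Real datasets will rarely cancel this cleanly, but partial cancellation follows the same logic.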

A model for evaluating information loss and distortion. To address these challenges, we propose using β diversity matrices computed from all ASVs or MAGs as a benchmark for the entire information content of the original datasets. We then advocate the combined use of Procrustes analysis (19) and the Mantel test (20) to evaluate the information loss and distortion that may occur after each attempt at data reduction. Both methods assess the similarity between multivariate datasets. A close match between the β diversity matrices before and after data reduction indicates that the characteristics of the original dataset have been preserved, whereas a pronounced difference may signal a loss or misrepresentation of information. For example, when an ASV dataset is analyzed at the genus level, a new β diversity matrix based on the genus-level variables should be created and compared with the original ASV-level matrix using Procrustes analysis and the Mantel test. If the two matrices are congruent, information loss and distortion are minimal and the reduced dataset accurately represents the original. Conversely, a pronounced disparity between the matrices may indicate information misrepresentation. In addition, when different data reduction methods are compared, the combined use of Procrustes analysis and the Mantel test can help determine which method better preserves the information in the original datasets.
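
The comparison described above can be sketched in a few lines. The example below is a minimal illustration that covers only the Mantel half of the proposed check: it builds Bray-Curtis matrices at the ASV level and at the genus level from a hypothetical toy table and runs a simple permutation-based Mantel test written from scratch. Bray-Curtis is assumed here as the β diversity measure; the table, the genus assignments, and all function names are illustrative. The Procrustes step is omitted because it additionally requires an ordination of each matrix; in practice, established implementations such as `scipy.spatial.procrustes`, scikit-bio's `mantel`, or the `mantel` and `protest` functions of the R package vegan would normally be preferred over hand-written code.

```python
import math
import random

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    den = sum(a + b for a, b in zip(u, v))
    return sum(abs(a - b) for a, b in zip(u, v)) / den if den else 0.0

def distance_matrix(table):
    """Square dissimilarity matrix for a samples x features table."""
    return [[bray_curtis(r1, r2) for r2 in table] for r1 in table]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant distances")
    return cov / (sx * sy)

def upper_triangle(d, order):
    """Upper-triangle entries of d after relabelling samples by `order`."""
    n = len(order)
    return [d[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(d1, d2, permutations=999, seed=0):
    """One-sided permutation Mantel test; returns (r, p-value)."""
    rng = random.Random(seed)
    ids = list(range(len(d1)))
    x = upper_triangle(d1, ids)
    r_obs = pearson(x, upper_triangle(d2, ids))
    hits = 0
    for _ in range(permutations):
        perm = ids[:]
        rng.shuffle(perm)  # permute sample labels of the second matrix
        if pearson(x, upper_triangle(d2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)

# Hypothetical toy table: rows are samples, columns are ASV1..ASV4.
asv_table = [
    [60, 10, 20, 10],
    [55, 15, 10, 30],
    [50, 20, 25, 15],
    [20, 50, 15, 25],
    [15, 55, 20, 10],
    [10, 60, 12, 18],
]
genus_of = ["G1", "G1", "G2", "G2"]  # assumed genus of each ASV column

# Data reduction step: sum ASV columns that share a genus.
genera = sorted(set(genus_of))
genus_table = [
    [sum(c for c, g in zip(row, genus_of) if g == name) for name in genera]
    for row in asv_table
]

d_asv = distance_matrix(asv_table)      # benchmark: full-resolution matrix
d_genus = distance_matrix(genus_table)  # matrix after data reduction

r_genus, p_genus = mantel(d_asv, d_genus)
print(f"Mantel r = {r_genus:.3f}, p = {p_genus:.3f}")
```

The toy table is built so that the two ASVs of genus G1 trade places across samples. The genus-level matrix should therefore track the ASV-level matrix poorly, which is the signature of distortion that the comparison is meant to expose. The same `mantel` call can be repeated for each candidate reduction method to rank them by how well they preserve the original sample-to-sample structure.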

Used together, these methods help verify that dimensionality reduction maintains data integrity. By preserving the essential information in the original datasets while reducing complexity, researchers can generate more reproducible and consistent results for microbiome biomarker discovery.