A global meta-analysis of microarray expression data to predict unknown gene functions and estimate the literature-data divide

JD Wren - Bioinformatics, 2009 - academic.oup.com
Abstract
Motivation: Approximately 9334 (37%) of human genes have no publications documenting their function and, for those that do, the number of publications per gene is highly skewed. Furthermore, for reasons that remain unclear, the entry of new gene names into the literature has slowed in recent years. If we are to better understand human/mammalian biology and complete the catalog of human gene function, it is important to finish predicting putative functions for these genes based upon existing experimental evidence.
Results: A global meta-analysis (GMA) of all publicly available GEO two-channel human microarray datasets (3551 experiments in total) was conducted to identify genes with recurrent, reproducible patterns of co-regulation across different conditions. Patterns of co-expression were divided into parallel (i.e. genes are up- and down-regulated together) and anti-parallel. Several ranking methods for predicting a gene's function from its top 20 co-expressed gene pairs were compared. In the best method, 34% of predicted Gene Ontology (GO) categories matched exactly with the known GO categories for ∼5000 genes analyzed, versus only 3% for random gene sets. Only 2.4% of co-expressed gene pairs were found as co-occurring gene pairs in MEDLINE.
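As a rough illustration of the parallel/anti-parallel split and the top-20 ranking described above, the following minimal sketch may help. It assumes Pearson correlation over log expression ratios as the co-expression score and a simple vote count over partner annotations as the ranking method; the abstract does not specify either choice, and the gene names, GO identifiers and function names (`pearson`, `split_partners`, `predict_go`) are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of log ratios."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def split_partners(query, expr, k=20):
    """Return (parallel, anti_parallel) partners, strongest first, at most k each.

    Assumption: positive correlation means parallel co-expression,
    negative correlation means anti-parallel co-expression.
    """
    scores = [(g, pearson(expr[query], v)) for g, v in expr.items() if g != query]
    parallel = sorted((s for s in scores if s[1] > 0), key=lambda s: -s[1])[:k]
    anti = sorted((s for s in scores if s[1] < 0), key=lambda s: s[1])[:k]
    return [g for g, _ in parallel], [g for g, _ in anti]


def predict_go(partners, go_annotations):
    """Rank GO terms by how many partners carry them (simple vote count)."""
    votes = Counter(t for g in partners for t in go_annotations.get(g, ()))
    return votes.most_common()


# Toy data: hypothetical genes, expression ratios and GO terms.
expr = {
    "Q": [1.0, -1.0, 2.0, -2.0],
    "A": [0.9, -1.1, 2.1, -1.8],
    "B": [0.5, -0.4, 1.0, -1.2],
    "C": [-1.0, 1.0, -2.0, 2.0],
}
go = {"A": {"GO:1"}, "B": {"GO:1", "GO:2"}, "C": {"GO:3"}}

parallel, anti = split_partners("Q", expr)
predictions = predict_go(parallel, go)
```

Here the parallel and anti-parallel partners are kept in separate lists, so each group can be passed to `predict_go` on its own rather than pooled.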
Conclusions: In a GO enrichment analysis, genes co-expressed in parallel with the query gene were frequently associated with the same GO categories, whereas anti-parallel genes were not. Combining parallel and anti-parallel genes for analysis resulted in fewer significant GO categories, suggesting they are best analyzed separately. Expression databases contain much unexpected genetic knowledge that has not yet been reported in the literature. A total of 1642 human genes with unknown function were differentially expressed in at least 30 experiments.
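A GO enrichment analysis of this kind is commonly scored with a one-sided hypergeometric test. The sketch below assumes that test; the abstract does not state which statistic was used, and the function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k): chance of seeing at least k annotated genes among n drawn.

    N: genes in the background set
    K: background genes annotated with the GO category
    n: genes in the co-expressed set (e.g. the top 20 partners)
    k: co-expressed genes annotated with the GO category
    """
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# Hypothetical counts: 5 of 20 partners share a category that
# annotates 50 of 5000 background genes.
p_example = hypergeom_pvalue(N=5000, K=50, n=20, k=5)
```

Running the test separately on the parallel and anti-parallel partner sets, rather than on their union, keeps an enriched category in one group from being diluted by the other.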
Availability: Data matrix available upon request.
Contact: jdwren@gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Oxford University Press