The importance of gene-centring microarray data

T Sørlie, E Borgan, S Myhre, HK Vollan, H Russnes, X Zhao, G Nilsen, OC Lingjærde
The Lancet Oncology, 2010
In the April, 2010, issue of The Lancet Oncology, Weigelt and colleagues¹ investigate the association between three different methods of predicting molecular subtypes of breast cancer, all referred to as single sample predictors (SSPs). Their conclusions, however, are flawed by the use of uncentred microarray data. The original methods all assign subtypes on the basis of a sample's correlation to expression centroids generated from three different gene lists. Weigelt and colleagues¹ apply the methods to four different gene-expression datasets without performing the essential gene-centring procedures before classification. The consistency between these subtype assignments was measured by Cohen’s kappa scores. Several important conclusions were made on the basis of these erroneous classifications: different SSPs produce inconsistent results; none of the microdissected specimens with a tumour-cell content greater than 90% were assigned to the normal breast-like group; and the human epidermal growth factor receptor 2 (HER2; also ERBB2) group, as defined by microarray analysis, does not equate with the clinical subgroup of HER2-positive breast cancer.

The effect and importance of centring are shown in the webappendix for the Sørlie centroids; similar results apply to the Hu and PAM centroids (data not shown). For the two-channel NKI-295 data (webappendix C), the centred and uncentred centroid correlations correspond reasonably well, because comparison with a common reference implies partial centring. For the remaining three one-channel-based datasets (webappendix A, B, D), most of the variation is caused by differences in the general expression level of different genes, and not as much by differences between samples.
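The point about where the variation comes from can be illustrated with a small synthetic sketch. It does not use any of the datasets in the webappendix: the gene count, sample count, and noise scales are assumptions chosen only to mimic one-channel data in which genes differ greatly in their general expression level. Mean-centring is used here instead of median-centring so that the variance decomposition is exact.

```python
import random
from statistics import mean, pvariance

random.seed(1)
N_GENES, N_SAMPLES = 300, 30  # assumed sizes, for illustration only

# Assumed scales: large differences between genes, small ones between samples.
gene_level = [random.gauss(8.0, 3.0) for _ in range(N_GENES)]
data = [[gene_level[g] + random.gauss(0.0, 0.7) for g in range(N_GENES)]
        for _ in range(N_SAMPLES)]  # rows = samples, columns = genes


def gene_means(rows):
    """Mean expression of each gene across all samples."""
    return [mean(r[g] for r in rows) for g in range(len(rows[0]))]


def between_gene_share(rows):
    """Fraction of the total variance explained by gene-level means."""
    total = pvariance([v for r in rows for v in r])
    return pvariance(gene_means(rows)) / total


def centre_genes(rows):
    """Subtract each gene's mean across samples from that gene."""
    means = gene_means(rows)
    return [[r[g] - means[g] for g in range(len(means))] for r in rows]


share_uncentred = between_gene_share(data)
share_centred = between_gene_share(centre_genes(data))
print(f"gene-level share of variance, uncentred: {share_uncentred:.3f}")
print(f"gene-level share of variance, centred:   {share_centred:.3f}")
```

Under these assumptions, almost all of the variance in the uncentred matrix is attributable to gene-level means; after centring, only the between-sample differences remain.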
Hence, for uncentred data, correlating expression levels with the centroids biases the results by raising correlations to the luminal-B centroid and lowering correlations to the normal-like centroid, which explains why many samples are classified as luminal B and few as normal-like. Also, the correlation values vary over a smaller range in the uncentred data, because the sample differences constitute only a small portion of the variance.

The subtype centroids from the original classification²⁻⁴ are based on median-centred two-channel microarray data. For a sample to be correctly assigned to a subtype, it must be centred against an appropriately large and heterogeneous sample set. This is fundamental when applying the classifier to samples profiled on expression platforms other than that of the original dataset⁵ and hence disqualifies the Sørlie approach from being used as an SSP in the sense that Weigelt and colleagues¹ did. Applying this method to uncentred data cannot be expected to give meaningful results. The Hu⁶ and Parker⁷ …
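The bias described above, and the role of centring against a heterogeneous cohort, can be sketched on synthetic data. Everything in this example is hypothetical: the two centroids (`subtype_X`, `subtype_Y`) are not the Sørlie, Hu, or PAM centroids, `subtype_Y` is deliberately constructed to track overall gene expression level, and Pearson correlation is an assumed choice of similarity measure. The sketch assigns each sample to the centroid with the highest correlation, with and without gene median-centring, and compares the two sets of calls with Cohen's kappa.

```python
import random
from statistics import mean, median

random.seed(7)
N_GENES, N_PER_CLASS = 300, 25  # assumed sizes, for illustration only


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def median_centre(rows):
    """Subtract each gene's cohort median (rows = samples, columns = genes)."""
    medians = [median(r[g] for r in rows) for g in range(len(rows[0]))]
    return [[r[g] - medians[g] for g in range(len(medians))] for r in rows]


def nearest_centroid(sample, centroids):
    """Label of the centroid that correlates most strongly with the sample."""
    return max(centroids, key=lambda label: pearson(sample, centroids[label]))


def cohen_kappa(calls_1, calls_2):
    """Chance-corrected agreement between two lists of labels."""
    n = len(calls_1)
    labels = set(calls_1) | set(calls_2)
    observed = sum(a == b for a, b in zip(calls_1, calls_2)) / n
    expected = sum((calls_1.count(l) / n) * (calls_2.count(l) / n) for l in labels)
    return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)


def accuracy(calls, truth):
    return sum(c == t for c, t in zip(calls, truth)) / len(truth)


# Hypothetical gene baselines and centroids (centroids are on a centred scale).
baseline = [random.gauss(8.0, 3.0) for _ in range(N_GENES)]
centroids = {
    "subtype_X": [random.gauss(0.0, 1.0) for _ in range(N_GENES)],
    # Constructed to resemble the general expression level of the genes.
    "subtype_Y": [(baseline[g] - 8.0) / 3.0 + random.gauss(0.0, 1.0)
                  for g in range(N_GENES)],
}
truth = ["subtype_X"] * N_PER_CLASS + ["subtype_Y"] * N_PER_CLASS
cohort = [[baseline[g] + centroids[label][g] + random.gauss(0.0, 0.5)
           for g in range(N_GENES)] for label in truth]

uncentred_calls = [nearest_centroid(s, centroids) for s in cohort]
centred_calls = [nearest_centroid(s, centroids) for s in median_centre(cohort)]

print("accuracy, uncentred:", accuracy(uncentred_calls, truth))
print("accuracy, centred:  ", accuracy(centred_calls, truth))
print("kappa, centred vs uncentred:", cohen_kappa(centred_calls, uncentred_calls))
```

In this construction, uncentred samples are drawn towards the centroid that resembles the general expression level, whereas calls made after centring against the mixed cohort follow the simulated subtype. The sketch also shows why centring needs a heterogeneous sample set: the gene medians are estimated from a cohort that contains both simulated subtypes.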
CMP, MJE, and PSB are co-founders of University Genomics and are major stockholders of University Genomics and Bioclassifier LLC. CMP, MJE, PSB, and JSP have filed a patent for the PAM50 assay from the University of North Carolina and University of Utah. AP declared no conflict of interest.