Molecular and statistical approaches to the detection and correction of errors in genotype databases.

LM Brzustowicz, C Merette, X Xie, L Townsend, TC Gilliam, J Ott
American Journal of Human Genetics, 1993 - ncbi.nlm.nih.gov
Abstract
Errors in genotyping data have been shown to have a significant effect on the estimation of recombination fractions in high-resolution genetic maps. Previous estimates of errors in existing databases have been limited to the analysis of relatively few markers and have suggested rates in the range 0.5%-1.5%. The present study capitalizes on the fact that within the Centre d'Etude du Polymorphisme Humain (CEPH) collection of reference families, 21 individuals are members of more than one family, with separate DNA samples provided by CEPH for each appearance of these individuals. By comparing the genotypes of these individuals in each of the families in which they occur, an estimated error rate of 1.4% was calculated for all loci in the version 4.0 CEPH database. Removing those individuals who were clearly identified by CEPH as appearing in more than one family resulted in a 3.0% error rate for the remaining samples, suggesting that some error checking of the identified repeated individuals may occur prior to data submission. An error rate of 3.0% for version 4.0 data was also obtained for four chromosome 5 markers that were retyped through the entire CEPH collection. The effects of these errors on a multipoint map were significant, with a total sex-averaged length of 36.09 cM with the errors, and 19.47 cM with the errors corrected. Several statistical approaches to detect and allow for errors during linkage analysis are presented. One method, which identified families containing possible errors on the basis of the impact on the maximum lod score, showed particular promise, especially when combined with the limited retyping of the identified families. The impact of the demonstrated error rate in an established genotype database on high-resolution mapping is significant, raising the question of the overall value of incorporating such existing data into new genetic maps.
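The duplicate-sample comparison described in the abstract can be illustrated with a minimal sketch. It assumes each typing of an individual is stored as a mapping from marker name to an unordered pair of allele labels; the marker names, allele labels, and the function `discordance_rate` are hypothetical and do not come from the paper or the CEPH database. The sketch computes only the fraction of markers on which two typings of the same individual disagree. The abstract does not say how the authors converted such comparisons into the reported error rates, so this should not be read as their exact procedure.

```python
from typing import Dict, Optional, Tuple

# A genotype is an unordered pair of allele labels; None marks a missing typing.
Genotype = Optional[Tuple[str, str]]


def discordance_rate(first: Dict[str, Genotype],
                     second: Dict[str, Genotype]) -> Optional[float]:
    """Fraction of markers typed in both samples whose genotypes disagree.

    Returns None when no marker was typed in both samples.
    """
    compared = 0
    discordant = 0
    for marker, g1 in first.items():
        g2 = second.get(marker)
        if g1 is None or g2 is None:
            continue  # skip markers missing from either typing
        compared += 1
        # Allele order carries no meaning, so compare sorted pairs.
        if sorted(g1) != sorted(g2):
            discordant += 1
    if compared == 0:
        return None
    return discordant / compared


# Hypothetical data: one individual typed independently in two families.
sample_a = {
    "M1": ("1", "2"),
    "M2": ("3", "3"),
    "M3": ("1", "1"),
    "M4": None,
    "M5": ("2", "5"),
}
sample_b = {
    "M1": ("2", "1"),  # same genotype, alleles listed in the other order
    "M2": ("3", "4"),  # disagrees with sample_a
    "M3": ("1", "1"),
    "M4": ("2", "2"),  # not compared: missing in sample_a
    "M5": ("2", "5"),
}

print(discordance_rate(sample_a, sample_b))  # 0.25
```

A discordance between two typings shows only that at least one of them is wrong, so the discordance rate is not itself a per-genotype error rate. Relating the two requires further assumptions, for example that errors in the two typings are independent.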