Mary Sara McPeek

Professor, Departments of Statistics and Human Genetics, and the College; Member, Committee on Genetics; Senior Fellow, Computation Institute

University of Chicago
Department of Statistics
5734 S. University Avenue
Eckhart 129
Chicago, IL 60637 USA
Phone: 773.702.7554
Fax: 773.702.9810
Web Page: http://galton.uchicago.edu/~mcpeek/


Research Interests

My research focuses on applications of probability and statistics to genetics and molecular biology. Following are some of my recent and ongoing projects:

I. Case-control association testing with related individuals (Bourgain et al. (2003) Am J Hum Genet 73:612-626; Thornton and McPeek (2007) Am J Hum Genet 81:321-337; CC-QLS and MQLS software available on this site)

In Bourgain et al. (2003), we developed a quasi-likelihood score (QLS) method for case-control association testing in samples that contain related individuals. The test statistic is constructed from the null and alternative means and the null covariance matrix of a function of genotype indicators. The choice of alternative mean model affects the power, but not the validity, of the test. The alternative mean model used in the WQLS of Bourgain et al. (2003) is based on a simple case-control allele frequency difference. We implemented our method in a computationally efficient algorithm and applied it to a Hutterite sample (from an isolated population with a large, inbred pedigree), in which we detected a highly significant novel association between atopy (an asthma-related phenotype) and an amino-acid polymorphism in the P-selectin gene. We demonstrated that, for the chosen alternative mean model, our QLS test is asymptotically locally most powerful in a general class of linear tests.

We followed up on this work with Thornton and McPeek (2007), in which the major development was a novel construction of an alternative mean model with a direct connection to genetic models (a reversal of conditioning under the assumption of a very general mode of inheritance with a small effect of the locus on the trait). In the resulting alternative mean model for the genotype indicators, the expected frequency of a predisposing allele in an individual depends not only on the individual's phenotype, but on the phenotypes of relatives as well. This is a desirable property because complex genetic models imply an enrichment for predisposing variants in affected individuals with affected relatives compared to affected individuals without affected relatives. Our resulting MQLS case-control association test has optimality properties similar to those of the test in Bourgain et al. (2003), but for the improved alternative model, leading to a substantial power improvement in simulations under various multilocus models. At the same time, the MQLS test retains the appealing computational simplicity of the method of Bourgain et al. (2003). Other properties of the MQLS include: (1) it is applicable to completely general combinations of family and case-control designs, including samples from isolated founder populations; (2) it can incorporate both unaffected controls and controls of unknown phenotype into the same analysis; and (3) it can incorporate phenotype information on relatives with missing genotype data. Using the method to reanalyze the GAW 14 COGA data, we detected highly significant association with an alcoholism-related phenotype for four different SNPs. Three of these four significant associations were not detected in previous studies. Our software, including source code, is freely available on my website: the CC-QLS package implements the methods of Bourgain et al. (2003), and the MQLS package is an expanded version of CC-QLS that incorporates the methods of Thornton and McPeek (2007).

II. Multipoint linkage disequilibrium mapping with block haplotype structure (Zheng and McPeek (2004) Springer Lecture Notes in Computer Science 2983:113-123; Zheng and McPeek (2007) Am J Hum Genet 80:112-125)

In Zheng and McPeek (2004), we developed a class of hidden Markov models for background LD based on the block structure of haplotypes, and we fit these models to dense SNP data from an outbred Caucasian population. Our class of models allows a fairly general graph structure of preferred and non-preferred transitions based on the haplotype block structure. It allows for common haplotypes and uncommon haplotypes in each block, and it captures the idea of ancestral haplotypes. We used a parametric bootstrap approach to assess goodness of fit, which allows wide latitude in the choice of test statistic. We implemented an additional layer of Monte Carlo to assess the type I error of the parametric bootstrap procedure for assessment of goodness of fit.

In Zheng and McPeek (2007), we followed up on this work by applying our models of background LD based on block haplotype structure to the problem of multipoint LD mapping from dense SNP data in case-control samples from an outbred population. We developed a virtual variant approach that characterizes untyped SNPs by various partitions of the set of haplotypes within a block into two disjoint subsets, corresponding to two alleles. We demonstrated that the virtual variant method greatly increases power for detection of untyped common variants associated with a trait. Because full multipoint LD mapping can be slow, we exploited the haplotype block information to develop a fast single-block multipoint mapping method. Our methods are appropriate for genotype data and take into account the uncertainty in phase. Our simulations indicate that the most important gains from taking into account the haplotype block structure at the analysis stage of multipoint LD mapping come from (1) greatly increased power to detect association with untyped variants, and (2) greatly improved localization of untyped variants associated with the trait.

III. Multipoint linkage disequilibrium mapping by the decay of haplotype sharing (McPeek and Strahs (1999) Am J Hum Genet 65:858-875; Strahs and McPeek (2003) Festschrift for Terry Speed pp. 343-366; Zhang, Schneider, Ober, McPeek (2005) Genet Epi 29:128-140; DHSMAP and DHSMAP_PVM software available on this site)

In McPeek and Strahs (1999), we proposed a multipoint approach to linkage disequilibrium mapping. For each individual, a likelihood for multilocus data is calculated, while incorporation of dependence of recombinational history among related individuals is based on estimating equations that can be thought of as generalizations of both quasi-likelihood and composite likelihood. McPeek and Strahs (1999) assumed a conditional coalescent model for the relationships among individuals.

In Strahs and McPeek (2003), we addressed the problems of (1) modeling background LD in an outbred population and (2) incorporating the background LD model into our decay of haplotype sharing method in outbred samples. We developed a Markov model of order 2 for background LD in haplotypes of moderately dense SNPs, and we developed a hidden Markov implementation of the model for use with unphased genotype data in our decay of haplotype sharing method. We used the AIC and BIC model selection criteria to compare models of background LD and found that the Markov(2) model provided a major improvement over a Markov(1) model. Within the context of the decay of haplotype sharing method, we demonstrated the importance of appropriate modeling of background LD, and we developed a mapping-in-controls diagnostic to detect the possibility that lack of fit of the background model is influencing the analysis. Software for the method, including source code, is freely available on this site.

In Zhang et al. (2005), we addressed the problem of multilocus linkage disequilibrium (LD) mapping of a trait-associated variant from case-control samples in which some individuals may be related, with special attention to the extreme case of an isolated founder population. Our method, which we call DHS-R, is an extension of our previous decay of haplotype sharing (DHS) method. The DHS-R method shares the main features of the DHS method: (i) it allows construction of a confidence interval for the location of a trait-associated variant; (ii) it allows for missing observations and unphased genotype data, with the uncertainty in the haplotypes taken into account in the analysis; (iii) it allows for heterogeneity, mutation, recombination, and background LD. The main advances of the DHS-R are (i) the ability to include individuals of arbitrary known relationship (including inbreeding) in the case and control samples; (ii) an extension to allow partially-phased haplotypes derived from case-parent trio genotype data; and (iii) an extension to allow for genotyping error in the model. Our method, which uses a hidden Markov model for likelihood calculation and maximization, has the advantage of being computationally feasible even in a large, complex pedigree. Simulations based on a 13-generation, 1623-member Hutterite pedigree demonstrated accurate coverage of the confidence intervals for location of the variant. We applied the method to fine-mapping of a susceptibility locus for the asthma-associated phenotype, bronchial hyperresponsiveness (BHR), in the Hutterites, on a region of chromosome 19.

IV. Application of quasi-likelihood to testing for Hardy-Weinberg in samples with related individuals (Bourgain, Abney, Schneider, Ober, McPeek (2004) Genetics 168:2349-2361)

In Bourgain et al. (2004), we demonstrated that when the classical chi-square goodness-of-fit test for Hardy-Weinberg equilibrium (HWE) is used on samples with related individuals, the type I error can be greatly inflated. In particular, the test is inappropriate in population isolates where the individuals are related through multiple lines of descent. We therefore proposed a novel quasi-likelihood score (QLS) test of HWE suitable for any sample with related individuals. Performed conditional on the pedigree structure, our test detects departures from HWE that are not due to the genealogy.
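For reference, the classical goodness-of-fit statistic discussed above can be sketched as follows for a single biallelic marker. This is the standard textbook test that assumes unrelated individuals, not the QLS test of Bourgain et al. (2004); the function name and argument names are illustrative choices, not part of any software on this site.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Classical 1-df chi-square statistic for HWE at a biallelic marker.

    Assumes the sampled individuals are unrelated; with related
    individuals the nominal chi-square reference distribution
    no longer applies.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    # Sample frequency of allele A from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    # Expected genotype counts under Hardy-Weinberg proportions.
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        raise ValueError("marker is monomorphic in this sample")
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

The statistic is compared with a chi-square distribution on 1 degree of freedom, and it is that reference distribution which fails when the sample contains relatives.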

V. Best linear unbiased estimation of allele frequencies (McPeek, Wu, Ober (2004) Biometrics 60:359-367)

In McPeek et al. (2004), we addressed the problem of efficient allele frequency estimation in an isolated founder population in which all individuals are related by a large, complex pedigree with multiple inbreeding loops. We developed a quasi-likelihood (QL) estimator, which for this problem is also the best linear unbiased estimator, where the QL estimator weights the individuals based on their kinship to all the other individuals in the sample. We developed and implemented an efficient algorithm for computing the estimate and its variance, and we applied our method to allele frequency estimation in (1) a Hutterite data set containing over 800 individuals related by a 13-generation, 1623-person pedigree and (2) an outbred sample of 996 individuals drawn from 85 moderate-size pedigrees. Notably, our QL estimator performs very similarly to the maximum likelihood estimator (when it is feasible to calculate the latter), but is substantially easier to calculate, making it feasible to use for large numbers of markers even in large, complex pedigrees. In the context of high-density scans, its accuracy and computational efficiency make it a valuable tool in samples composed of moderate-size pedigrees as well. Our software, including source code, is freely available on this site as part of the CC-QLS package.
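A minimal sketch of a kinship-weighted estimator of this kind is given below. It assumes a biallelic marker, complete genotype data, and a known matrix K with diagonal entries 1 + h_i (h_i the inbreeding coefficient) and off-diagonal entries 2 * phi_ij (phi_ij the kinship coefficient), so that the estimate is (1' K^-1 1)^-1 1' K^-1 Y with Y_i half the allele count of individual i. The function names and the plain Gaussian-elimination solver are illustrative assumptions; this is not the efficient algorithm implemented in the CC-QLS package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def blue_allele_frequency(allele_counts, kinship_matrix):
    """Return (estimate, plug-in variance) for the frequency of one allele.

    allele_counts[i] is 0, 1, or 2 copies of the allele in individual i.
    kinship_matrix is K as described in the lead-in (assumed known).
    """
    y = [count / 2 for count in allele_counts]
    weights = solve(kinship_matrix, [1.0] * len(y))  # K^-1 1
    total = sum(weights)                             # 1' K^-1 1
    p_hat = sum(w * yi for w, yi in zip(weights, y)) / total
    var_hat = 0.5 * p_hat * (1 - p_hat) / total
    return p_hat, var_hat
```

With K equal to the identity matrix (unrelated, non-inbred individuals) the estimate reduces to the usual sample allele frequency. With relatives present, closely related individuals are down-weighted because they carry partly redundant information.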

VI. Identification of polymorphisms that explain a linkage result (Sun, Cox, McPeek (2002) Am J Hum Genet 70:399-411; STEPC software freely available on the web)

In Sun et al. (2002), we developed a statistical method for identification of polymorphisms that explain a linkage result. Given many polymorphic sites genotyped in a region showing strong linkage with a trait, our goal is to determine which site or combination of sites in the region influences susceptibility to the trait. Our approach is to use linkage data to identify the polymorphisms whose genotypes could fully explain the observed linkage to the region. The information provided by this analysis is different from that provided by either linkage or association studies. Our approach is based on the observation that if a particular site is the only site in the region that influences the trait, then conditional on the genotypes at that site for the affected relatives, there should be no unexplained over-sharing among the affecteds in the region. Our method is applicable to sibships and allows for a very general model for how the site influences the trait, including epistasis with unlinked loci, environmental effects and gene-environment interaction. We perform hypothesis tests and derive a confidence set for the true causal polymorphic site, under the assumption that there is only one site in the region influencing the trait. Future work will initially focus on the problem of multiple causal sites present in the region.

VII. Analysis of quantitative trait loci in the Hutterites (Abney, McPeek, Ober (2000) Am J Hum Genet 66:629-650; Abney, McPeek, Ober (2001) Am J Hum Genet 68:1302-1307; Ober, Abney, McPeek (2001) Am J Hum Genet 69:1068-1079; Newman et al. (2001) Am J Hum Genet 69:1146-1148; Abney, Ober, McPeek (2002) Am J Hum Genet 70:920-934; Newman et al. 2003, Newman et al. 2004, Weiss et al. 2004)

In Abney et al. (2000; 2001; 2002), we developed statistical methods for analysis of quantitative traits in founder populations. We have applied the methods to genetic analysis in a Hutterite population. The complexity of this large inbred pedigree poses special challenges and makes many standard types of analyses computationally onerous or completely infeasible. At the same time, certain features of this population make it extremely promising for genetic analysis of complex traits: a small number of founders, presumably leading to reduced genetic heterogeneity, and a close-knit social structure and communal living, which are expected to reduce environmental heterogeneity. Methods of analysis must generally be tailor-made for application to founder populations, and major computational problems must often be overcome. We have developed and implemented variance component methods and linkage disequilibrium mapping methods designed especially for founder populations. We have also developed a novel permutation-based assessment of significance that is applicable to data on related individuals, based on a general class of matrix decompositions, of which the Cholesky decomposition is a special case.
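One way a Cholesky decomposition can be used to permute correlated data is sketched below: the residuals are transformed to be uncorrelated, shuffled, and then transformed back so that each permuted vector has the original covariance structure. The sketch assumes mean-zero residuals with a known positive-definite covariance matrix sigma. All function names are hypothetical, and the details of the published procedure may differ.

```python
import math
import random


def cholesky(sigma):
    """Return lower-triangular c with c c' = sigma (sigma positive definite)."""
    n = len(sigma)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(c[i][k] * c[j][k] for k in range(j))
            if i == j:
                d = sigma[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                c[i][j] = math.sqrt(d)
            else:
                c[i][j] = (sigma[i][j] - s) / c[j][j]
    return c


def forward_solve(c, v):
    """Solve c z = v for z, where c is lower triangular."""
    z = []
    for i, row in enumerate(c):
        z.append((v[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def lower_multiply(c, z):
    """Return the product c z, where c is lower triangular."""
    return [sum(c[i][k] * z[k] for k in range(i + 1)) for i in range(len(z))]


def permute_correlated(residuals, sigma, rng):
    """Return one permutation replicate with covariance structure sigma."""
    c = cholesky(sigma)
    z = forward_solve(c, residuals)  # decorrelated residuals, c^-1 r
    rng.shuffle(z)                   # exchangeable after decorrelation
    return lower_multiply(c, z)      # restore the correlation structure
```

Repeating `permute_correlated(residuals, sigma, random.Random(seed))` over many seeds and recomputing the test statistic each time gives an empirical null distribution. When sigma is the identity matrix, the procedure reduces to an ordinary permutation of the residuals.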

VIII. Relationship inference (McPeek and Sun 2000; Sun, Abney, McPeek 2001; Sun, Wilder, McPeek 2002; McPeek 2002; PREST software freely available on the web)

Lei Sun and I have developed several approaches for the problem of detecting relationship errors in pedigrees on the basis of genome screen data collected for linkage studies. We have developed methods for simple outbred pedigrees as well as for the much more difficult situation of a large, complex, inbred pedigree. Part of this work is related to identifiability of hidden Markov models and efficient methods for determination of the orbits of the group of symmetries on the hypercube that leave certain sets invariant.

IX. Optical mapping (Tong, Mets, McPeek (2007))

Multi-color optical mapping is a new technique being developed in the Mets lab at the University of Chicago to obtain detailed physical maps (indicating relative positions of various recognition sites) of DNA molecules. We consider a study design in which the data consist of noisy observations of multiple copies of a DNA molecule marked with colors at recognition sites. The primary goal is to estimate a physical map. A secondary goal is to estimate error rates associated with the experiment, which are potentially useful for analysis and refinement of the biochemical steps in the mapping procedure. We propose statistical models for various sources of error and use maximum likelihood estimation (MLE) to construct a physical map and estimate error rates. To overcome difficulties arising in the maximization process, a latent-variable Markov chain version of the model is proposed, and the EM algorithm is used for maximization. In addition, a simulated annealing procedure is applied to maximize the profile likelihood over the discrete space of sequences of colors. We apply the methods to simulated data on the bacteriophage lambda genome.

X. Other work includes

A. Statistical models for recombination and interference (Speed, McPeek, Evans (1992) PNAS 89:3103-3106; Evans, McPeek, Speed (1993) Theor Pop Biol 43:80-90; McPeek and Speed (1995) Genetics 139:1031-1044; Zhao, Speed, McPeek (1995) Genetics 139:1045-1056; Zhao, McPeek, Speed (1995) Genetics 139:1057-1065; Armstrong, McPeek, Speed (2006) Biostatistics 7:374-386)

B. Optimal allele-sharing statistics for genetic mapping of affected pedigree members (McPeek 1999)

C. Statistical inference for sperm-typing data (Leeflang, McPeek, Arnheim 1996; Grewal et al. 1999; McPeek 1999; Girardet et al. 2000)

Last update: 7/24/07