Research Interests
My research focuses on applications of probability and statistics to genetics and molecular biology. Following
are some of my recent and ongoing projects:
I. Case-control association testing with related individuals (Bourgain et al. (2003) Am J Hum Genet 73: 612-626;
Thornton and McPeek (2007) Am J Hum Genet 81:321-337; CC-QLS and MQLS software available on this site)
In Bourgain et al. (2003), we developed a QLS method for case-control association testing in samples that contain
related individuals. The test statistic is constructed based on null and alternative means and the null covariance
matrix of a function of genotype indicators. Choice of an alternative mean model affects the power, but not the validity
of the test. The alternative mean model used in the WQLS of Bourgain et al. (2003) is based on a simple case-control
allele frequency difference. We implemented our method in a computationally efficient algorithm, and we applied it
to a Hutterite sample (from an isolated population with a large, inbred pedigree) in which we detected a highly significant
novel association between atopy (an asthma-related phenotype) and an amino-acid polymorphism in the P-selectin gene.
We demonstrated that, for the chosen alternative mean model, our QLS test is asymptotically locally most powerful
in a general class of linear tests.
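A minimal sketch of a score statistic of this general form follows. It is an illustration only, not the CC-QLS code. It assumes that Y_i is half the count of a reference allele in individual i, that the null mean is p times a vector of ones, that the null covariance is proportional to a known kinship-based matrix Phi (diagonal 1 + inbreeding coefficient, off-diagonal twice the kinship coefficient), and that the alternative mean shifts the allele frequency in cases by a constant. The function name and arguments are hypothetical.

```python
import numpy as np

def qls_statistic(y, is_case, phi):
    """Quadratic-form score statistic for a case-control frequency shift.

    y       : length-n array, half the reference-allele count (0, 0.5, or 1)
    is_case : length-n 0/1 array, 1 for cases
    phi     : n x n kinship-based matrix (assumed known and positive definite)
    """
    y = np.asarray(y, dtype=float)
    c = np.asarray(is_case, dtype=float)
    phi = np.asarray(phi, dtype=float)
    ones = np.ones_like(y)

    phi_inv_1 = np.linalg.solve(phi, ones)  # Phi^{-1} 1
    phi_inv_c = np.linalg.solve(phi, c)     # Phi^{-1} 1_c

    # Generalized least squares estimate of the allele frequency under the null.
    p_hat = (phi_inv_1 @ y) / (phi_inv_1 @ ones)

    # Weight vector: case indicator adjusted for the estimated mean,
    # so that v sums to zero.
    v = phi_inv_c - phi_inv_1 * (phi_inv_1 @ c) / (phi_inv_1 @ ones)

    # Null variance scale for Y under Hardy-Weinberg proportions (assumption).
    sigma2 = 0.5 * p_hat * (1.0 - p_hat)

    # Compare to a chi-square distribution with 1 degree of freedom.
    return (v @ y) ** 2 / (sigma2 * (v @ phi @ v))
```

With unrelated, non-inbred individuals, Phi is the identity matrix and the weights reduce to a centered case indicator, so the statistic compares the case allele frequency with the overall frequency.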
We followed up on this work with Thornton and McPeek (2007), in which the major development was a novel construction
of an alternative mean model with a direct connection to genetic models (a reversal of conditioning under the assumption
of a very general mode of inheritance with a small effect of the locus on the trait). In the resulting alternative
mean model for the genotype indicators, the expected frequency of a predisposing allele in an individual depends
not only on the individual's phenotype, but on the phenotypes of relatives as well. This is a desirable property
because complex genetic models imply an enrichment for predisposing variants in affected individuals with affected
relatives compared to affected individuals without affected relatives. Our resulting MQLS case-control association
test has optimality properties similar to those of the test in Bourgain et al. (2003), but for the improved alternative model, leading
to a substantial power improvement in simulations under various multilocus models. At the same time, the MQLS test
retains the appealing computational simplicity of the method of Bourgain et al. (2003). Other properties of the MQLS
include: (1) it is applicable to completely general combinations of family and case-control designs, including samples
from isolated founder populations; (2) it can incorporate both unaffected controls and controls of unknown phenotype
into the same analysis; and (3) it can incorporate phenotype information on relatives with missing genotype data.
Using the method to reanalyze the GAW 14 COGA data, we detected highly significant association to an alcoholism-related
phenotype for four different SNPs. Three of these four significant associations were not detected in previous studies.
Our software, including source code, is freely available on my website: the CC-QLS package implements the methods
of Bourgain et al. (2003), and the MQLS package is an expanded version of CC-QLS which incorporates the methods of
Thornton and McPeek (2007).
II. Multipoint linkage disequilibrium mapping with block haplotype structure (Zheng and McPeek (2004) Springer
Lecture Notes in Computer Science 2983:113-123; Zheng and McPeek (2007) Am J Hum Genet 80:112-125)
In Zheng and McPeek (2004), we developed a class of hidden Markov models for background LD based on the block structure
of haplotypes, and we fit these models to dense SNP data from an outbred Caucasian population. Our class of models
allows a fairly general graph structure of preferred and non-preferred transitions based on the haplotype block structure.
It allows for common haplotypes and uncommon haplotypes in each block, and it captures the idea of ancestral haplotypes.
We use a parametric bootstrap approach to assess goodness of fit, which allows a wide latitude in choice of test
statistic. We implemented an additional layer of Monte Carlo to assess the type I error of the parametric bootstrap
procedure for assessment of goodness of fit.
In Zheng and McPeek (2007), we followed up on this work by applying our models of background LD based on block
haplotype structure to the problem of multipoint LD mapping from dense SNP data in case-control samples from an outbred
population. We developed a virtual variant approach that characterizes untyped SNPs by various partitions of the
set of haplotypes within a block into two disjoint subsets, corresponding to two alleles. We demonstrated that the
virtual variant method greatly increases power for detection of untyped common variants associated with a trait.
Because full multipoint LD mapping can be slow, we exploited the haplotype block information to develop a fast single-block
multipoint mapping method. Our methods are appropriate for genotype data and take into account the uncertainty in
phase. Our simulations indicate that the most important gains from taking into account the haplotype block structure
at the analysis stage of multipoint LD mapping come from (1) greatly increased power to detect association with untyped
variants, and (2) greatly improved localization of untyped variants associated with the trait.
III. Multipoint linkage disequilibrium mapping by the decay of haplotype sharing (McPeek and Strahs (1999) Am J
Hum Genet 65:858-875; Strahs and McPeek (2003) Festschrift for Terry Speed pp. 343-366; Zhang, Schneider, Ober, McPeek
(2005) Genet Epi 29:128-140; DHSMAP and DHSMAP_PVM software available on this site)
In McPeek and Strahs (1999), we proposed a multipoint approach to linkage disequilibrium mapping. For each individual,
a likelihood for multilocus data is calculated, while incorporation of dependence of recombinational history among
related individuals is based on estimating equations that can be thought of as generalizations of both quasi-likelihood
and composite likelihood. McPeek and Strahs (1999) assumed a conditional coalescent model for the relationships among
individuals.
In Strahs and McPeek (2003), we addressed the problems of (1) modeling background LD in an outbred population and
(2) incorporating the background LD model into our decay of haplotype sharing method in outbred samples. We developed
a Markov model of order 2 for background LD in haplotypes of moderately dense SNPs, and we developed a hidden Markov
implementation of the model for use with unphased genotype data in our decay of haplotype sharing method. We used the
AIC and BIC model selection criteria to compare models of background LD and found that the Markov(2) model provided
a major improvement over a Markov(1) model. Within the context of the decay of haplotype sharing method, we demonstrated
the importance of appropriate modeling of background LD, and we developed a mapping-in-controls diagnostic to detect
whether lack of fit of the background model might be influencing the analysis. Software for
the method, including source code, is freely available on this site.
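The kind of Markov(1) versus Markov(2) comparison described above can be sketched as follows. This is a deliberately simplified stand-in, not the DHSMAP implementation: it assumes phased haplotypes coded as 0/1 sequences and a single set of transition probabilities shared across all positions, and it conditions both models on the first two markers so that they score the same observations. All names are hypothetical.

```python
import math
from collections import Counter

def markov_loglik(haplotypes, order, start=2):
    """Maximized conditional log-likelihood of a homogeneous Markov chain.

    Positions start..L-1 of each haplotype are scored given the preceding
    `order` alleles; transition probabilities are estimated by their
    maximum likelihood values (observed proportions).
    """
    context_counts = Counter()
    transition_counts = Counter()
    for h in haplotypes:
        for t in range(start, len(h)):
            context = tuple(h[t - order:t])
            context_counts[context] += 1
            transition_counts[(context, h[t])] += 1
    return sum(n * math.log(n / context_counts[context])
               for (context, _), n in transition_counts.items())

def aic_bic(loglik, n_params, n_obs):
    """Return (AIC, BIC); smaller values indicate a preferred model."""
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic

def compare_orders(haplotypes):
    """AIC and BIC for order-1 and order-2 chains on biallelic data."""
    n_obs = sum(max(len(h) - 2, 0) for h in haplotypes)
    results = {}
    for order in (1, 2):
        n_params = 2 ** order  # one free probability per context
        loglik = markov_loglik(haplotypes, order)
        results[order] = aic_bic(loglik, n_params, n_obs)
    return results
```

Calling `compare_orders` on a list of 0/1 haplotypes returns a dictionary mapping each order to its (AIC, BIC) pair; a position-specific model of the kind needed for real SNP data would have many more parameters per order.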
In Zhang et al. (2005), we addressed the problem of multilocus linkage disequilibrium (LD) mapping of a trait-associated
variant from case-control samples in which some individuals may be related, with special attention to the extreme
case of an isolated founder population. Our method, which we call DHS-R, is an extension of our previous decay of
haplotype sharing (DHS) method. The DHS-R method shares the main features of the DHS method: (i) it allows construction
of a confidence interval for the location of a trait-associated variant; (ii) it allows for missing observations
and unphased genotype data, with the uncertainty in the haplotypes taken into account in the analysis; (iii) it allows
for heterogeneity, mutation, recombination, and background LD. The main advances of the DHS-R are (i) the ability
to include individuals of arbitrary known relationship (including inbreeding) in the case and control samples; (ii)
an extension to allow partially-phased haplotypes derived from case-parent trio genotype data; and (iii) an extension
to allow for genotyping error in the model. Our method, which uses a hidden Markov model for likelihood calculation
and maximization, has the advantage of being computationally feasible even in a large, complex pedigree. Simulations
based on a 13-generation, 1623-member Hutterite pedigree demonstrated accurate coverage of the confidence intervals
for location of the variant. We applied the method to fine-mapping of a susceptibility locus for the asthma-associated
phenotype, bronchial hyperresponsiveness (BHR), in the Hutterites, on a region of chromosome 19.
IV. Application of quasi-likelihood to testing for Hardy-Weinberg in samples with related individuals (Bourgain,
Abney, Schneider, Ober, McPeek (2004) Genetics 168:2349-2361)
In Bourgain et al. (2004), we demonstrated that when the classical chi^2 goodness-of-fit test for Hardy-Weinberg
equilibrium (HWE) is used on samples with related individuals, the type I error can be greatly inflated. In particular,
the test is inappropriate in population isolates where the individuals are related through multiple lines of descent.
In Bourgain et al. (2004), we proposed a novel quasi-likelihood score (QLS) test of HWE suitable for any sample with
related individuals. Performed conditional on the pedigree structure, our test detects departures from HWE that are
not due to the genealogy.
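For reference, the classical goodness-of-fit test mentioned above can be sketched as follows for a biallelic marker. This is the baseline test that treats individuals as independent, which is the assumption that fails in samples with relatives; it is not the QLS test of Bourgain et al. (2004). The function name is hypothetical, and the sketch assumes both alleles are observed so that no expected count is zero.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Classical chi-square statistic for Hardy-Weinberg proportions.

    Arguments are the observed genotype counts for AA, AB, and BB.
    Under HWE with independent individuals, the statistic is compared
    to a chi-square distribution with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # sample frequency of allele A
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```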
V. Best linear unbiased estimation of allele frequencies (McPeek, Wu, Ober (2004) Biometrics 60:359-367)
In McPeek et al. (2004), we addressed the problem of efficient allele frequency estimation in an isolated founder
population in which all individuals are related by a large, complex pedigree with multiple inbreeding loops. We
developed a quasi-likelihood (QL) estimator, which for this problem is also the best linear unbiased estimator, where
the QL estimator weights the individuals based on their kinship to all the other individuals in the sample. We developed
and implemented an efficient algorithm for computing the estimate and its variance, and we applied our method to
allele frequency estimation in (1) a Hutterite data set containing over 800 individuals related by a 13-generation
1623-person pedigree as well as in (2) an outbred sample of 996 individuals drawn from 85 moderate-size pedigrees.
Notably, our QL estimator performs very similarly to the maximum likelihood estimator (when it is feasible to
calculate the latter), but is substantially easier to calculate, making it feasible to use for large numbers of markers
even in large, complex pedigrees. In the context of high-density scans, its accuracy and computational efficiency
make it a valuable tool in samples composed of moderate-size pedigrees as well. Our software, including source code,
is freely available on this site as part of the CC-QLS package.
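A minimal sketch of a best linear unbiased estimator of this type follows; it is not the CC-QLS implementation. It assumes that Y_i is half the count of the allele of interest in individual i and that the covariance of Y is 0.5 p (1 - p) times a known kinship-based matrix Phi (diagonal 1 + inbreeding coefficient, off-diagonal twice the kinship coefficient). The function name and return values are hypothetical.

```python
import numpy as np

def blue_allele_frequency(y, phi):
    """Generalized least squares estimate of an allele frequency.

    Returns the estimate, its estimated variance, and the normalized
    weight given to each individual.
    """
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    ones = np.ones(len(y))

    phi_inv_1 = np.linalg.solve(phi, ones)  # Phi^{-1} 1
    total = phi_inv_1 @ ones                # 1' Phi^{-1} 1
    weights = phi_inv_1 / total             # weights sum to 1

    p_hat = weights @ y
    var_hat = 0.5 * p_hat * (1.0 - p_hat) / total
    return p_hat, var_hat, weights
```

In a sample of two full siblings and one unrelated individual, the unrelated individual receives a larger weight than either sibling, because the siblings carry partly redundant information. Solving one linear system rather than forming the full inverse keeps the computation manageable for large pedigrees.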
VI. Identification of polymorphisms that explain a linkage result (Sun, Cox, McPeek (2002) Am J Hum Genet 70:399-411;
STEPC software freely available on the web)
In Sun et al. (2002), we developed a statistical method for identification of polymorphisms that explain a linkage
result. Given many polymorphic sites genotyped in a region showing strong linkage with a trait, our goal is to determine
which site or combination of sites in the region influences susceptibility to the trait. Our approach is to use linkage
data to identify the polymorphisms whose genotypes could fully explain the observed linkage to the region. The information
provided by this analysis is different from that provided by either linkage or association studies. Our approach
is based on the observation that if a particular site is the only site in the region that influences the trait, then
conditional on the genotypes at that site for the affected relatives, there should be no unexplained over-sharing
among the affecteds in the region. Our method is applicable to sibships and allows for a very general model for how
the site influences the trait, including epistasis with unlinked loci, environmental effects and gene-environment
interaction. We perform hypothesis tests and derive a confidence set for the true causal polymorphic site, under
the assumption that there is only one site in the region influencing the trait. Future work will initially focus
on the problem of multiple causal sites present in the region.
VII. Analysis of quantitative trait loci in the Hutterites (Abney, McPeek, Ober (2000) Am J Hum Genet 66:629-650;
Abney, McPeek, Ober (2001) Am J Hum Genet 68:1302-1307; Ober, Abney, McPeek (2001) Am J Hum Genet 69:1068-1079; Newman
et al. (2001) Am J Hum Genet 69:1146-1148; Abney, Ober, McPeek (2002) Am J Hum Genet 70:920-934; Newman et al. 2003,
Newman et al. 2004, Weiss et al. 2004)
In Abney et al. (2000; 2001; 2002), we developed statistical methods for analysis of quantitative traits in founder
populations. We have applied the methods to genetic analysis in a Hutterite population. The complexity of this large
inbred pedigree poses special challenges and makes many standard types of analyses computationally onerous or completely
infeasible. At the same time, certain features of this population make it extremely promising for genetic analysis
of complex traits: a small number of founders, presumably leading to reduced genetic heterogeneity, and a close-knit social
structure and communal living, which are expected to reduce environmental heterogeneity. Methods of analysis must
generally be tailor-made for application to founder populations, and major computational problems must often be overcome.
We have developed and implemented variance component methods and linkage disequilibrium mapping methods designed
especially for founder populations. We have also developed a novel permutation-based assessment of significance that
is applicable to data on related individuals, based on a general class of matrix decompositions, of which the Cholesky
decomposition is a special case.
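The idea behind a decomposition-based permutation can be sketched for the Cholesky special case as follows. This is an illustration under stated assumptions, not the procedure as published: it assumes the covariance matrix of the trait residuals is known, and that the decorrelated residuals can be treated as exchangeable. The function name is hypothetical.

```python
import numpy as np

def permute_correlated(residuals, cov, rng):
    """Permute residuals of related individuals while keeping their covariance.

    The residuals are decorrelated with the Cholesky factor of `cov`,
    shuffled, and then mapped back to the original correlation structure.
    """
    residuals = np.asarray(residuals, dtype=float)
    chol = np.linalg.cholesky(np.asarray(cov, dtype=float))  # cov = chol @ chol.T
    z = np.linalg.solve(chol, residuals)                     # decorrelated residuals
    return chol @ rng.permutation(z)

# Example usage with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
permuted = permute_correlated([1.0, 2.0], cov, rng)
```

Repeating this step many times and recomputing the test statistic on each permuted data set gives an empirical null distribution against which the observed statistic can be compared.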
VIII. Relationship inference (McPeek and Sun 2000; Sun, Abney, McPeek 2001; Sun, Wilder, McPeek 2002; McPeek 2002;
PREST software freely available on the web)
Lei Sun and I have developed several approaches for the problem of detecting relationship errors in pedigrees on
the basis of genome screen data collected for linkage studies. We have developed methods for simple outbred pedigrees
as well as for the much more difficult situation of a large, complex, inbred pedigree. Part of this work is related
to identifiability of hidden Markov models and efficient methods for determination of the orbits of the group of
symmetries on the hypercube that leave certain sets invariant.
IX. Optical mapping (Tong, Mets, McPeek (2007))
Multi-color optical mapping is a new technique being developed in the Mets lab at U. of C. to obtain detailed
physical maps (indicating relative positions of various recognition sites) of DNA molecules. We consider a study
design in which the data consist of noisy observations of multiple copies of a DNA molecule marked with colors at
recognition sites. The primary goal is to estimate a physical map. A secondary goal is to estimate error rates associated
with the experiment, which are potentially useful for analysis and refinement of the biochemical steps in the mapping
procedure. We propose statistical models for various sources of error and use maximum likelihood estimation (MLE)
to construct a physical map and estimate error rates. To overcome difficulties arising in the maximization process,
a latent-variable Markov chain version of the model is proposed, and the EM algorithm is used for maximization. In
addition, a simulated annealing procedure is applied to maximize the profile likelihood over the discrete space of
sequences of colors. We apply the methods to simulated data on the bacteriophage lambda genome.
X. Other work includes
A. Statistical models for recombination and interference (Speed, McPeek, Evans (1992) PNAS 89:3103-3106; Evans,
McPeek, Speed (1993) Theor Pop Biol 43:80-90; McPeek and Speed (1995) Genetics 139:1031-1044; Zhao, Speed, McPeek
(1995) Genetics 139:1045-1056; Zhao, McPeek, Speed (1995) Genetics 139:1057-1065; Armstrong, McPeek, Speed (2006)
Biostatistics 7:374-386)
B. Optimal allele-sharing statistics for genetic mapping of affected pedigree members (McPeek 1999)
C. Statistical inference for sperm-typing data (Leeflang, McPeek, Arnheim 1996; Grewal et al. 1999; McPeek 1999;
Girardet et al. 2000)
Last update: 5/13/09