Research Interests
My research focuses on applications of probability and
statistics to genetics and molecular biology.
The following are some of my recent and ongoing projects:
I. Case-control association testing with related individuals (Bourgain
et al. (2003) Am J Hum Genet 73: 612-626; Thornton and McPeek (2007) Am
J Hum Genet 81:321-337; CC-QLS and MQLS software available on this site)
In Bourgain et al. (2003), we developed a QLS method for case-control
association testing in samples that contain related individuals.
The test statistic is constructed based on null and
alternative means and the null covariance matrix
of a function of genotype indicators. Choice of an alternative mean
model affects the power, but not the validity of the test.
The alternative mean model used in the WQLS of Bourgain et al.
(2003) is based on a simple case-control allele frequency difference. We
implemented our method in a computationally efficient algorithm, and we
applied it to a Hutterite sample (from
an isolated population with a large, inbred
pedigree) in which we detected a highly significant
novel association between atopy (an asthma-related phenotype) and an
amino-acid polymorphism in the P-selectin gene. We demonstrated that,
for the chosen alternative mean model,
our QLS test is asymptotically locally most powerful in a general class
of linear tests.
We followed up on this work with Thornton and McPeek
(2007), in which the major development was a novel construction of
an alternative mean model with a direct connection to genetic
models (a reversal of conditioning under the assumption of a very
general mode of inheritance with a small
effect of the locus on the trait). In the resulting alternative mean
model for the genotype indicators, the expected
frequency of a predisposing allele in an individual depends not only
on the individual's phenotype, but on the phenotypes of relatives as
well. This is a desirable property
because complex genetic models imply an enrichment for predisposing
variants in affected individuals with affected relatives compared to
affected individuals without affected relatives. Our resulting
MQLS case-control association test has optimality
properties similar to those in Bourgain et al. (2003) but for the improved
alternative model, leading to a substantial power improvement in
simulations under various multilocus models. At the same time,
the MQLS test retains
the appealing computational simplicity of the method of Bourgain et al.
(2003). Other properties of the
MQLS include: (1) it is applicable to completely general
combinations of family and case-control designs, including samples from
isolated founder populations; (2) it can incorporate
both unaffected controls and controls of unknown phenotype into the same
analysis; and (3) it can incorporate phenotype information on relatives
with missing genotype data. Using the method to reanalyze the GAW 14
COGA data, we detected highly significant association to an
alcoholism-related phenotype for four different SNPs. Three of these
four significant associations were not detected in previous studies.
Our software, including source code, is freely available on my
website: the CC-QLS package implements the methods of Bourgain et al.
(2003), and the MQLS package is an expanded version of CC-QLS which
incorporates the methods of Thornton and McPeek
(2007).
II. Multipoint linkage disequilibrium mapping with block haplotype
structure (Zheng and McPeek (2004) Springer Lecture Notes in Computer
Science 2983:113-123; Zheng and McPeek (2007) Am
J Hum Genet 80:112-125)
In Zheng and McPeek (2004), we
developed a class of hidden Markov models for background LD based on the
block structure of haplotypes, and we fit these models to dense SNP
data from an outbred Caucasian population. Our class of
models allows a fairly general graph structure of preferred and
non-preferred transitions based on the haplotype block structure.
It allows for common haplotypes and uncommon haplotypes in each block, and it
captures the idea of ancestral haplotypes. We use a parametric
bootstrap approach to assess goodness of fit, which allows a
wide latitude in choice of test statistic. We implemented an
additional layer of Monte Carlo to assess the type I error of the
parametric bootstrap procedure for assessment of
goodness of fit.
In Zheng and McPeek (2007), we followed up on this work by applying
our models of background LD based on block haplotype structure to the
problem of multipoint LD mapping from dense SNP data in
case-control samples from an outbred
population. We developed a
virtual variant approach that characterizes untyped SNPs by
various partitions of the set of haplotypes within a block into two
disjoint subsets, corresponding to two alleles. We demonstrated that
the virtual variant method greatly increases power for detection of
untyped common variants associated with a trait. Because full multipoint
LD mapping can be slow, we exploited the haplotype block information to
develop a fast single-block multipoint mapping method. Our methods are
appropriate for genotype data and take into account the uncertainty in
phase. Our simulations indicate that the most important gains from taking into account the
haplotype block structure at the analysis stage of multipoint LD mapping
come from (1) greatly increased power to detect association with untyped
variants, and (2) greatly improved localization of untyped variants
associated with the trait.
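As a rough illustration of the virtual variant idea, the sketch below lists
every way to split the haplotypes of one block into two non-empty disjoint
subsets, each split playing the role of a two-allele untyped SNP. The
haplotype labels and the function name are hypothetical, and enumerating all
splits is an assumption made for the example; the text above says only that
"various partitions" are used.

```python
from itertools import combinations


def virtual_variants(haplotypes):
    """Yield each unordered split of a block's haplotypes into two
    non-empty disjoint subsets (one subset per virtual allele)."""
    haps = sorted(haplotypes)
    first, rest = haps[0], haps[1:]
    # Fixing the first haplotype on one side avoids listing each split twice.
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            allele_1 = {first, *combo}
            allele_2 = set(haps) - allele_1
            if allele_2:  # skip the trivial split with an empty side
                yield allele_1, allele_2


# Hypothetical block with four common haplotypes.
splits = list(virtual_variants(["h1", "h2", "h3", "h4"]))
```

A block with k haplotypes has 2^(k-1) - 1 such splits, so the count grows
quickly with k; in practice one would presumably restrict attention to
a subset of them.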
III. Multipoint linkage disequilibrium mapping by the decay of
haplotype sharing (McPeek and Strahs (1999) Am J Hum Genet 65:858-875;
Strahs and McPeek (2003) Festschrift for Terry Speed pp. 343-366;
Zhang, Schneider, Ober, McPeek (2005) Genet Epi 29:128-140;
DHSMAP and DHSMAP_PVM software available on this site)
In McPeek and Strahs (1999), we proposed a multipoint
approach to linkage disequilibrium mapping. For each individual,
a likelihood for multilocus data is calculated, while incorporation of
dependence of recombinational history among related individuals is based
on estimating equations that can be thought of as generalizations of
both quasi-likelihood and composite likelihood. McPeek and Strahs
(1999) assumed a conditional coalescent model
for the relationships among individuals.
In Strahs and McPeek (2003), we addressed the problems of (1) modeling
background LD in an outbred population and (2) incorporating the background
LD model into our decay of haplotype sharing method in outbred samples.
We developed a Markov model of order 2 for background LD in haplotypes
of moderately dense SNPs, and we developed a hidden Markov implementation of the model
for use with unphased genotype data in our decay of haplotype sharing
method. We used the AIC and BIC model selection
criteria to compare models of background LD and found that the Markov(2)
model provided a major improvement over a Markov(1) model. Within the
context of the decay of haplotype sharing method, we demonstrated the
importance of appropriate modeling of background LD, and we developed a
mapping-in-controls diagnostic to detect the possibility that lack of
fit of the background model is influencing the analysis.
Software for the method, including source code, is freely available on
this site.
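The kind of AIC/BIC comparison described above can be illustrated with a toy
example. The sketch below fits homogeneous Markov chains of order 1 and
order 2 to binary sequences by maximum likelihood and computes both criteria.
It is a simplification under stated assumptions: the two-letter alphabet, the
position-independent transition probabilities, and all function names are
choices made for the example, not details of the models in Strahs and McPeek
(2003).

```python
import math
from collections import Counter


def markov_loglik(seqs, order, skip):
    """Maximized log-likelihood of an order-`order` chain, conditioning on
    the first `skip` symbols of each sequence (skip >= order)."""
    context_counts = Counter()
    transition_counts = Counter()
    for s in seqs:
        for t in range(skip, len(s)):
            context = tuple(s[t - order:t])
            transition_counts[(context, s[t])] += 1
            context_counts[context] += 1
    loglik = sum(n * math.log(n / context_counts[context])
                 for (context, _), n in transition_counts.items())
    return loglik, sum(context_counts.values())


def aic_bic(seqs, order, skip=2, alphabet_size=2):
    """AIC and BIC; both orders condition on the same `skip` symbols so
    that their likelihoods are comparable."""
    loglik, n_obs = markov_loglik(seqs, order, skip)
    n_params = (alphabet_size ** order) * (alphabet_size - 1)
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic


# Toy sequence whose next symbol is determined by the previous two.
toy = [[0, 0, 1] * 10]
aic1, bic1 = aic_bic(toy, order=1)
aic2, bic2 = aic_bic(toy, order=2)
```

On this toy sequence both criteria favor the order-2 chain, since the
order-1 chain cannot tell whether a 0 should be followed by 0 or by 1.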
In Zhang et al. (2005), we addressed the problem of
multilocus linkage disequilibrium (LD) mapping of a trait-associated
variant from case-control samples in which some individuals may be
related, with special attention to the extreme case of an isolated
founder population.
Our method, which we call DHS-R, is an extension of our previous
decay of haplotype sharing (DHS) method. The DHS-R method shares
the main features of the DHS method: (i) it allows construction of a
confidence interval for the location of a trait-associated variant;
(ii) it allows for missing observations and unphased genotype data,
with the uncertainty in the haplotypes taken into account in the
analysis; (iii) it allows for heterogeneity, mutation, recombination,
and background LD. The main advances of the DHS-R are (i) the
ability to include individuals of arbitrary known relationship
(including inbreeding) in the case and control samples; (ii) an
extension to allow partially-phased haplotypes derived from
case-parent trio genotype data; and (iii) an extension to allow for
genotyping error in the model. Our method, which uses a hidden
Markov model for likelihood calculation and maximization, has the
advantage of being computationally feasible even in a large, complex
pedigree. Simulations based on a 13-generation,
1623-member Hutterite pedigree demonstrated accurate coverage of the
confidence intervals for location of the variant. We applied the
method to fine-mapping of a susceptibility locus for the
asthma-associated phenotype, bronchial hyperresponsiveness (BHR), in
the Hutterites, on a region of chromosome
19.
IV. Application of quasi-likelihood to testing for
Hardy-Weinberg in samples with related individuals (Bourgain, Abney,
Schneider, Ober, McPeek (2004) Genetics
168:2349-2361)
In Bourgain et al. (2004), we demonstrated that
when the classical chi^2 goodness-of-fit test for
Hardy-Weinberg equilibrium (HWE) is used on samples with related
individuals, the type I error can be greatly inflated. In
particular, the test is inappropriate in population isolates where
the individuals are related through multiple lines of descent.
In Bourgain et al. (2004), we proposed a novel quasi-likelihood score
(QLS) test of HWE suitable for
any sample with related individuals. Performed conditional on the
pedigree structure, our test detects
departures from HWE that are not due to the
genealogy.
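For reference, the classical chi^2 goodness-of-fit test mentioned above can
be sketched as follows for a biallelic marker. This is the standard textbook
test, not the QLS test of Bourgain et al. (2004); it treats the sampled
individuals as independent, which is the assumption that fails when
individuals are related. The function name is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Classical 1-df chi-square test of HWE from genotype counts.
    Assumes independent individuals and non-zero expected counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # sample frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


stat_fit, p_fit = hwe_chi_square(25, 50, 25)   # counts exactly at HWE
stat_bad, p_bad = hwe_chi_square(50, 0, 50)    # no heterozygotes at all
```

The p-value here is valid only under independence; with related individuals
the true null distribution of the statistic differs, which is how the type I
error becomes inflated.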
V. Best linear unbiased estimation of allele frequencies (McPeek, Wu,
Ober (2004) Biometrics 60:359-367)
In McPeek et al. (2004), we addressed the problem of efficient
allele frequency
estimation in an isolated founder population in which all individuals
are related by a large, complex pedigree with multiple inbreeding
loops. We developed a quasi-likelihood (QL)
estimator, which for this problem is also the best linear unbiased
estimator, where the QL estimator
weights the individuals based on their kinship to all
the other individuals in the sample. We developed and implemented
an efficient
algorithm for computing the estimate and its variance, and we applied
our method to allele frequency
estimation in (1) a Hutterite data set containing over 800 individuals
related by a 13-generation 1623-person pedigree as well as in (2) an
outbred sample of 996 individuals drawn from 85 moderate-size
pedigrees. Notably, the performance of our QL estimator is very close to that of the
maximum likelihood estimator (when it is feasible to calculate the
latter), but is substantially easier to calculate,
making it feasible to use for large numbers
of markers even in large, complex pedigrees. In the context of
high-density scans, its accuracy and computational efficiency make it a
valuable tool in samples composed of
moderate-size pedigrees as well. Our software, including source code,
is freely available on this site as part of the
CC-QLS package.
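A minimal sketch of a kinship-weighted linear estimator of this kind is given
below. It assumes, for illustration, that Y_i is half the number of copies of
the allele carried by individual i and that the covariance of Y is
proportional to a matrix K with diagonal entries 1 + h_i (h_i the inbreeding
coefficient) and off-diagonal entries 2 * phi_ij (phi_ij the kinship
coefficient). Under those assumptions the best linear unbiased estimator is
(1'K^-1 1)^-1 1'K^-1 Y. The function names and the toy data are hypothetical,
and the plain Gaussian elimination stands in for the efficient algorithm
mentioned above, which is not reproduced here.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def blue_allele_frequency(y, k):
    """Return the estimate and the weight given to each individual."""
    w = solve(k, [1.0] * len(y))        # w = K^-1 1 (K is symmetric)
    total = sum(w)                      # 1' K^-1 1
    weights = [wi / total for wi in w]
    p_hat = sum(wi * yi for wi, yi in zip(weights, y))
    return p_hat, weights


# Hypothetical sample: two outbred full siblings (kinship 1/4, so the
# off-diagonal entry is 0.5) and one unrelated individual.
K = [[1.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
Y = [1.0, 1.0, 0.0]
p_hat, weights = blue_allele_frequency(Y, K)
```

In this toy sample each sibling receives weight 2/7 and the unrelated
individual 3/7, so the two siblings, who carry partly redundant information,
are down-weighted relative to the simple average.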
VI. Identification of polymorphisms that explain a linkage result (Sun,
Cox, McPeek (2002) Am J Hum Genet 70:399-411;
STEPC software freely available on the web)
In Sun et al. (2002), we developed a statistical
method for identification of polymorphisms that explain a linkage
result. Given many polymorphic sites genotyped in a region showing
strong linkage with a trait, our goal is to determine
which site or combination of sites in the region
influences susceptibility to the trait.
Our approach is to use linkage data to
identify the polymorphisms whose genotypes could
fully explain the observed linkage to the region.
The information provided by this analysis is different from
that provided by either linkage or association studies.
Our approach is based on the observation that
if a particular site is the only site in the region
that influences the trait, then conditional on the genotypes
at that site for the affected relatives,
there should be no unexplained over-sharing
among the affecteds in the region.
Our method is applicable to sibships and
allows for a very general model
for how the site influences the trait, including
epistasis with unlinked loci, environmental effects
and gene-environment interaction.
We perform hypothesis tests and derive a confidence set
for the true causal polymorphic site,
under the assumption that there
is only one site in the region influencing the trait.
Future work will initially focus on the problem of multiple causal sites
present in the region.
VII. Analysis of quantitative trait loci in the Hutterites
(Abney, McPeek, Ober (2000) Am J Hum Genet 66:629-650;
Abney, McPeek, Ober (2001) Am J Hum Genet 68:1302-1307; Ober, Abney,
McPeek (2001) Am J Hum Genet 69:1068-1079; Newman et al. (2001)
Am J Hum Genet 69:1146-1148; Abney, Ober, McPeek (2002) Am J Hum Genet
70:920-934; Newman et al. 2003, Newman et al.
2004, Weiss et al. 2004)
In Abney et al. (2000; 2001; 2002), we
developed statistical methods for analysis of quantitative traits
in founder populations. We have applied the methods to genetic analysis
in a Hutterite population. The complexity of this large inbred pedigree
poses special challenges and makes many standard types of analyses
computationally onerous or completely infeasible. At the same time,
certain features of this population make it extremely promising for
genetic analysis of complex traits: a small number of founders,
presumably leading
to reduced genetic heterogeneity, and a close-knit social structure with
communal living, which is expected to reduce environmental
heterogeneity. Methods of analysis must generally be tailor-made for
application to founder populations, and major computational problems must
often be overcome. We have developed and implemented
variance component methods and linkage disequilibrium mapping methods
designed especially for founder populations. We have also
developed a novel
permutation-based assessment of significance that is applicable to data
on related individuals, based on a general class of matrix
decompositions, of which the Cholesky decomposition is a special case.
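One way to read the Cholesky special case is sketched below, as an
illustration under assumptions rather than the published procedure: given
trait residuals with a known covariance matrix, factor the covariance as
L L', transform the residuals to uncorrelated values, permute those, and map
them back so that each replicate keeps the original correlation structure.
The covariance matrix, residuals, and function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L' = a, for symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def forward_solve(low, b):
    """Solve low x = b for lower-triangular low."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(low[i][k] * x[k] for k in range(i))) / low[i][i])
    return x


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]


def permuted_replicate(resid, cov, rng):
    """Decorrelate the residuals, shuffle them, and re-correlate."""
    low = cholesky(cov)
    white = forward_solve(low, resid)   # uncorrelated under the model
    rng.shuffle(white)
    return matvec(low, white)


# Hypothetical covariance for three individuals, two of them related.
cov = [[1.0, 0.5, 0.0],
       [0.5, 1.0, 0.0],
       [0.0, 0.0, 1.0]]
resid = [0.3, -1.2, 0.8]
replicate = permuted_replicate(resid, cov, random.Random(0))
```

Permuting the raw residuals directly would break the dependence among
relatives; permuting after decorrelation is what makes the replicates
comparable to the observed data under the null hypothesis.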
VIII. Relationship inference (McPeek and Sun 2000; Sun, Abney, McPeek
2001; Sun, Wilder, McPeek 2002; McPeek 2002; PREST software freely
available on the web)
Lei Sun and I have developed several approaches for
the problem of detecting
relationship errors in pedigrees on the basis of genome screen data
collected for linkage studies.
We have
developed methods for simple outbred pedigrees as well as for the much
more difficult situation of a large, complex, inbred pedigree.
Part of this work is related to identifiability
of hidden Markov models and efficient methods for determination of the
orbits of the group of symmetries on the hypercube that leave certain
sets invariant.
IX. Optical mapping (Tong, Mets, McPeek
(2007))
Multi-color optical mapping is a new
technique being developed in the Mets lab at U. of C.
to obtain detailed physical maps
(indicating relative positions of various recognition sites)
of DNA molecules. We consider a study
design in which the data consist of noisy observations of multiple
copies of a DNA molecule marked with colors at recognition sites.
The primary goal is to estimate a physical map. A secondary goal
is to estimate error rates associated with the experiment, which
are potentially useful for analysis and refinement of the biochemical
steps in the mapping procedure. We propose statistical models for various
sources of error and use maximum likelihood estimation (MLE) to
construct a physical map and estimate error rates. To overcome
difficulties arising in the maximization process, a latent-variable
Markov chain version of the
model is proposed, and the EM algorithm is used for
maximization. In addition, a simulated annealing procedure is
applied to maximize the profile likelihood over the discrete space
of sequences of colors. We apply the methods to simulated data on
the bacteriophage lambda genome.
X. Other work includes:
A. Statistical models for recombination and interference (Speed,
McPeek, Evans (1992) PNAS 89:3103-3106; Evans, McPeek, Speed (1993)
Theor Pop Biol 43:80-90; McPeek and Speed (1995) Genetics 139:1031-1044;
Zhao, Speed, McPeek (1995) Genetics 139:1045-1056; Zhao, McPeek, Speed
(1995) Genetics 139:1057-1065; Armstrong,
McPeek, Speed (2006) Biostatistics
7:374-386)
B. Optimal allele-sharing statistics for genetic mapping of affected
pedigree members (McPeek 1999)
C. Statistical inference for sperm-typing data (Leeflang, McPeek,
Arnheim 1996; Grewal et al. 1999; McPeek 1999;
Girardet et al. 2000)
Last update: 7/24/07