Course description

This course examines the problems of multiple testing and statistical inference from a modern point of view. High-dimensional data is now common in many applications across the biological, physical, and social sciences. With this increased capacity to generate and analyze data, classical statistical methods may no longer ensure the reliability or replicability of scientific discoveries. We will examine a range of recent methods that provide tools for statistical inference in large-scale data analysis. The course will have weekly assignments and a final project, each with theoretical and computational components.


Prerequisites

Stat 24400 or equivalent. Undergraduates may enroll with permission of the instructor.


Course syllabus.

Course materials 2016

Links to resources & references (updated periodically)
R code for gene expression data (day 1): COPD_statin_gene_expr.R
R code for Benjamini-Hochberg simulations with dependent p-values (week 2): BH_simulations.R
Matlab code for visualizing the Benjamini-Hochberg procedure with n=3 (week 2): BH_worst_case.m
Gene expression data / z-scores / two-groups model (weeks 3/4): R code COPD_statin_gene_expr_mixture_model.R
Online testing demo comparing various methods: R code online_testing_methods.R & online_testing_demo.R.
Regression tutorial - code: regression_tutorial.R
Debiasing for the lasso - simulation: debiasing.R


Problem set 1: assignment ProbSet1.pdf ; code COPD_statin_gene_expr_for_HW.R
P-hacking challenge: assignment p-hacking_challenge.pdf ; data set p-hacking_data_set.txt
Results: p-hacking_pvals.txt, p-hacking_responses.txt, p-hacking_plot_results.R
Problem set 2: assignment ProbSet2.pdf
Real data critique: assignment real_data_critique.pdf
Problem set 3: assignment ProbSet3.pdf
Problem set 4: assignment ProbSet4.pdf. Code: conditional_affine.R

Final Project

Final project topic suggestions: final_project_ideas.pdf