---
title: "STAT 224 Chapter 12"
output:
  html_document:
    code_folding: hide
    df_print: kable
    number_sections: yes
    self_contained: yes
    toc: FALSE
    toc_depth: 4
    toc_float: FALSE
params:
  dotabs: TRUE
---

```{r test-main, child = 'preamble.Rmd'}
```

# Chapter 12 Logistic Regression (12.1-12.7) `r if(params$dotabs) paste('{.tabset}')`

## Lecture 1

### Regression Models for the Probability of Response Outcome

So far, the response variable $Y$ has been continuous and (approximately) normally distributed. What if $Y$ is a binary discrete variable, taking on values $0$ or $1$? For example,

bank failure: $Y = 1$ if failed (bankruptcy), $Y = 0$ if not

disease event: $Y = 1$ if with disease (case), $0$ otherwise (unaffected/control).

Each individual observation is like a "trial" in a binary outcome experiment where $P(Y=1)$ is some value $p$.

$Y$ is a Bernoulli random variable, with probability of "success" $\pi$.

| $Y$ | $P(Y)$ |
| ------------- |-------------|
| $0$ (failure) | $P(Y=0) \;=\; 1-\pi$ |
| $1$ (success) | $P(Y = 1) \;=\; \pi$ |

For each observation, $Y_i \sim {\rm Bernoulli}(\pi)$, for $i=1, 2, \ldots, n$

$E(Y_i) = \pi = P(Y_i=1)$

$\var(Y_i) = \pi(1-\pi)$

So, the variance of $Y$ depends on its mean!

+ Want to identify factors related to the probability of an event.
+ Want to predict the probability of an event.

If we try to model $\pi = \E(Y)$ as a linear function of predictors $X$:
\[ \E(Y \,|\, X) = \P(Y = 1\, |\, X) = \pi_x = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p \]
several problems arise:

+ The model can readily (and incorrectly) estimate a probability below $0$ or above $1$
    + $0 < \pi < 1$ is a restriction not properly handled by a linear model
+ $Y$ is not even approximately normally distributed (not even continuous)
+ The variance of $Y$ (thus, the error term) is not constant over $X$
    + since $\E(Y)=\pi$ and $\var(Y) = \pi(1-\pi)$,
    + if $\pi$ varies conditional on $X$, so does the variance.
The following transform does permit us to formulate $\P(Y=1)$ (actually, a function of it) as a linear function of predictor variables $X$:
\[ \log_e \left[\frac{\pi}{1-\pi}\right] = \log_e \left[\frac{\P(Y=1\, |\, X)}{1-\P(Y=1\, |\, X)} \right] = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p . \]

$$-\infty < \log_e \left[\frac{\pi}{1-\pi}\right] < \infty$$

So, this transformed response is a continuous variable with no restrictions, unlike $0 < \pi < 1$.

From here, after some algebra,
\[ \pi_x = \P(Y=1\, |\, X) = \frac{\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}{1+\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)} \]

For this model, observations with the same values of the $X$s have the same probability of "success" -- $\pi_x = \P(Y=1)$.

#### The response is now the "odds of success"

$$ O = \mbox{odds of success} = \frac{\mbox{probability of success}}{\mbox{probability of failure}} = \frac{\pi}{1-\pi} $$

Odds is just another way of expressing probabilities.
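A quick numerical sketch of the two transforms (the value $p = 0.8$ is made up for illustration; `qlogis` and `plogis` are base R's logit and logistic functions):

```{r}
# a probability, its log odds, and back again (p is made up for illustration)
p <- 0.8
log_odds <- log(p / (1 - p))   # the logit transform, computed by hand
log_odds
qlogis(p)                      # base R's logit gives the same value
plogis(log_odds)               # the logistic (inverse logit) recovers p
```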
**CAUTION:** I don't know why, but the textbook calls $\displaystyle \frac{\pi}{1-\pi}$ an "odds ratio". This is incorrect. This is the "odds", not an "odds ratio" (a ratio of two odds).
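To make the distinction concrete, here is a small sketch (the two probabilities are made up for illustration): each group has an *odds*, and only the ratio of the two is an *odds ratio*.

```{r}
# two made-up probabilities, one per group
p1 <- 0.6
p2 <- 0.3
odds1 <- p1 / (1 - p1)   # an odds (group 1)
odds2 <- p2 / (1 - p2)   # another odds (group 2)
odds1 / odds2            # an odds RATIO: the ratio of two odds
```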
### Regression Models for Log Odds of Response

The **logit transform** of a Bernoulli($\pi$) response variable $Y$ is
\[ \logit(\pi) =\log_e \left(\frac{\pi}{1-\pi}\right) = \log_e(\mbox{odds}) = \mbox{"log odds"} \]

**Logistic regression** models predict the logit as a linear function of $X$s.

The name derives from the fact that we can write
\[ \pi = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \]
and the right-hand side is an algebraic form called a **logistic function**.

+ A logistic function is the inverse of a logit function.
+ The logit function "links" the mean of $Y$ to a linear function of $X$s.
+ The logit function is also called the **link function** for logistic regression.
+ This method produces valid probability values ($\pi_x$) for any values of $X$.
+ The estimates of $\pi$ are always values between $0$ and $1$.

### A Review (or new to you): Odds and Odds Ratios

How do we typically summarize and analyze data with a binary response? We should have analogous basic methods in logistic regression, just as in linear regression, where we can reproduce the two-sample t-test for comparison of means via SLR. (See Chapter 1-2 notes and textbook Section 2.11)

If the question is whether the probability of a binary event differs between, say, two groups, we have

+ Test for difference of proportions between two groups
    + normal-distribution-based test
    + exact binomial test
+ Test for homogeneity in a 2x2 contingency table
    + Chi-square test ($\chi^2$)

We also have associated measures of difference or effect

+ difference of proportions,
+ ratio of proportions

...and another summary that we will use here, the **odds ratio**.
#### 2 x 2 tables review

Recall the basic set-up for the **2 $\times$ 2 Table**. For a retrospective, case-control epidemiological study of some exposure:

| Disease status | Exposed | Not exposed | Total |
| ---: | --- | --- | --- |
| case | a | b | a + b |
| non-case | c | d | c + d |
| Total | a + c | b + d | a + b + c + d = N |

To study the association between outcome and exposure based on a 2x2 table, generally we may consider the following measures:

#### Difference of Proportions (or Risk Difference)

\[ \mbox{risk difference} = \frac{a}{a + c} - \frac{b}{b + d} \]

However, in a case-control study, the proportion of cases **is not** the proportion (estimated risk) of disease in the population. The samples are collected based on outcome (diseased versus non-diseased). It is generally not possible to calculate and compare the absolute risk of disease (incidence) in a case-control study. (Although, the risk proportions would be valid in a cohort study.)

#### Why do a retrospective, case-control study vs. a cohort study?

In a cohort study,

+ Determine exposed or not status at the start.
+ Follow participants to see if they get the disease (case) or not.
+ Could take years and years to get the data.
+ ...and if the disease is rare, it is tough to get enough data on cases.

#### The Odds Ratio

Ratio of odds of event ($Y=1$) in each exposure group.

Note that in the *exposed* group
\[ \mbox{odds(case, diseased)} = \frac{\pi_e}{1-\pi_e } = \frac{a\,/\,(a+c)}{c\,/\,(a+c) } = a\,/\,c \]
and in the *unexposed* group
\[ \mbox{odds(case, diseased)} = \frac{\pi_u}{1-\pi_u } = \frac{b\,/\,(b+d)}{d\,/\,(b+d) } = b\,/\,d \]

Then
\[ OR = \mbox{odds ratio} = \frac{\mbox{odds of disease in exposed group}}{\mbox{odds of disease in unexposed group}} = \frac{a\,/\, c}{b\,/\,d} = \frac{ad}{bc} \]

This is the **odds ratio** or relative odds.

#### Odds Ratio interpretable for case-control, retrospective studies

Why use an odds ratio?
In case-control studies:

+ The samples are collected based on outcome (diseased versus non-diseased).
+ Odds(disease for exposed group) **cannot** be estimated from the data.
+ Odds(disease for unexposed group) **cannot** be estimated from the data.

However, the odds ratio, a relative measure, is still valid.

| Disease status | Exposed | Not exposed | Total |
| ---: | --- | --- | --- |
| case | a | b | a + b |
| non-case | c | d | c + d |
| Total | a + c | b + d | a + b + c + d = N |

Odds(exposure) **can** be estimated for both diseased and not.

The odds of exposure among cases:
\[ \mbox{odds(exposure)} = \frac{a\,/\,(a+b)}{b\,/\,(a+b) } = a\,/\,b \]

The odds of exposure among non-cases:
\[ \mbox{odds(exposure)} = \frac{c\,/\,(c+d)}{d\,/\,(c+d) } = c\,/\,d \]

$\displaystyle{ OR = \mbox{odds ratio} = \frac{a\,/\, b}{c\,/\,d} = \frac{ad}{bc} }$ = same as the OR of disease for exposed vs. not!!!

### Logistic Regression

#### Example: Hospital admissions and ICU deaths

We are interested in whether there is any difference in ICU mortality rates (outcome) for different initial hospital admission types (exposure, X variable).

| Admission Status | Emergency | Elective | Total |
| ---: | --- | --- | --- |
| Death in ICU | a | b | a + b |
| Survived after ICU | c | d | c + d |
| Total | a + c | b + d | a + b + c + d = N |

Let $\pi_0=$ risk of death in the population of elective admission patients.

Let $\pi_1=$ risk of death in the population of emergency admission patients.

The study considers a window of time and classifies ICU stays by response (death or not) and predictor (elective or emergency hospital admission).

#### The odds

For the elective admission population, the "odds" of death is:
\[ \frac{\pi_0}{1-\pi_0} = \frac{b\,/\,(b+d)}{d\,/\,(b+d)} = \frac{b}{d} \]

$\displaystyle{ \log \left[ \frac{\pi_0}{1-\pi_0} \right] }$ is the "log odds" of death, or the "logit" of $\pi_0$.
For the emergency admission population, the "odds" of death is:
\[ \frac{\pi_1}{1-\pi_1} = \frac{a\,/\,(a+c)}{c\,/\,(a+c)} = \frac{a}{c} \]

**REMINDER:** For a case-control study like this one, these are **NOT** estimates of the population odds.

#### The log odds

\[ \log \left\{ \frac{\pi_0}{1-\pi_0} \right\} \]
is the log odds of death, or the logit of $\pi_0$, for the elective admission group.

\[ \log \left\{ \frac{\pi_1}{1-\pi_1} \right\} \]
is the log odds of death, or the logit of $\pi_1$, for the emergency admission group.

Why log odds (logit)? Rather than model on either the probability or odds ratio scale (both of which have a restricted valid range), we need another metric that is equivalent but not constrained.

$0 < \pi < 1\qquad 0 < OR < \infty \qquad -\infty < \log(\textrm{odds}) < \infty$

#### The log odds ratio

We can take the difference of log odds (in the same way that we work with a difference of means):
\[ \log \left\{ \frac{\pi_1}{1-\pi_1} \right\} - \log \left\{ \frac{\pi_0}{1-\pi_0} \right\} \;=\; \log \left\{ \frac{ \pi_1 / (1 - \pi_1) }{ \pi_0 / (1-\pi_0) } \right\} \;=\; \log(\mbox{odds ratio}) \;=\; \log(OR) \]
by the basic rule of logarithms: $\log(v) - \log(w) = \log(v/w)$.

$\log(OR)$ is the population "log odds ratio" of death, comparing the emergency admissions to the elective admissions. We label it $\log(OR) = \beta_1$ and want our model to estimate it.

### Logistic Regression Model

Define an indicator variable for admission type:
\[ x = \mbox{type} = \begin{cases} 0 & \mbox{elective}\\ 1 & \mbox{emergency} \end{cases} \]

Algebraically, the log odds can be expressed as a linear function of predictors (the log odds has range $(-\infty, \infty)$).
\[ \log(\mbox{odds of event}) = \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x \]

**Note:** There is no $\epsilon$ term written because the left side is a function of the mean of $Y$ ($\pi$), not just $Y$ or a transformation of $Y$.

Since the log odds is simply a function of the probability of interest ($\pi = \P(Y=1)$), a model that predicts the log odds is equivalent to a model that predicts the probability.

#### Example: Hospital admissions and ICU deaths

Here is the logistic regression model for the risk of death in ICU as a function of admission type:
\[ \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x \]

When $x=0$,
\[ \log \left( \frac{\pi}{1-\pi} \right) \;=\; \beta_0 + \beta_1 (0) = \beta_0 \;=\; \log \left( \frac{\pi_0}{1-\pi_0} \right) \]

When $x=1$,
\[ \log \left( \frac{\pi}{1-\pi} \right) \;=\; \beta_0 + \beta_1 (1) = \beta_0 + \beta_1 \;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) \]

Then,
$$
\begin{align}
\log(OR) &\;=\; \log \left( \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \right)\\
&\;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) - \log \left( \frac{\pi_0}{1-\pi_0} \right) \\
&\;=\; (\beta_0 + \beta_1) - \beta_0\\
&\;=\; \beta_1
\end{align}
$$

Thus, the odds ratio is $OR = e^{\beta_1}$.

### More About the Logistic Regression Model

A linear regression model describes the mean of a random variable ($Y$) as a linear combination of some other variables ($X$'s).

A logistic regression model describes the log odds of a probability (that a binary variable $Y$ takes on the value $1$) as a linear combination of some other variables ($X$'s).

Because the error structure in the two models is different (Bernoulli vs. assumed normal), a different model estimation method from least squares is used: **maximum likelihood estimation**.

### Logistic Regression - Estimation and Model Interpretation

In R, we use the `glm` function (GLM stands for Generalized Linear Models).
For the logit model, we specify the distribution family as binomial.

```{r}
icuData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/icu.csv")
head(icuData)
addmargins( tally(status ~ type, data=icuData) )
glmfit <- glm(status ~ type, data = icuData, family = "binomial")
tidy(glmfit)
```

Interpretation of regression coefficients:

Here, $\hat{\beta}_0 = -3.24$ and $\hat{\beta}_1 = 2.18$.

The model $\displaystyle{\mbox{log odds of status being death} = \log \left\{ \frac{\pi}{1-\pi} \right\} = -3.24 + 2.18 (\mbox{type})}$ predicts the log odds (and odds) of death for each exposure group:

when $X=0$ (elective group): log odds or logit = $\beta_0 = -3.24$, then
$$\widehat{O}_0 = \mbox{odds of death}_0 = e^{\b0} = e^{-3.24} = 0.039 $$

when $X=1$ (emergency group):
$$\widehat{O}_1 = \mbox{odds of death}_1 = e^{\b0 + \b1} = e^{-3.24 + 2.18} = 0.346 $$

\[ \widehat{OR} = \textrm{odds ratio} = \frac{\mbox{odds}_1}{\mbox{odds}_0} = e^{2.18} \approx 8.89 = e^{\widehat{\beta}_1} \]

Back to the data for a minute:

| Status | Elective | Emergency | Total |
| --- | --- | --- | --- |
| 0 | 51 | 109 | 160 |
| 1 | 2 | 38 | 40 |
| Total | 53 | 147 | 200 |

From this table, note that the odds are

elective: 2/51 = 0.039

emergency: 38/109 = 0.349 (the model's 0.346, up to rounding of the coefficients)

Then, from the table $\qquad \widehat{OR} = \frac{51 \times 38}{2 \times 109} = 8.89$

From the model, the ratio of predicted odds (i.e., the odds ratio) is
$$\frac{0.346}{0.039} \approx 8.89 = \widehat{OR}$$

But also note that the difference of log odds $\equiv$ log odds ratio is **directly given** by the model ($\b1$). An estimate of the odds ratio is therefore:
\[\widehat{OR} = e^{\widehat{\beta}_1} = e^{2.18} \approx 8.89 \]

So, the $\widehat{\beta}$ coefficient for `type` gives the log odds ratio for type of admission. $e^{\b1}$ gives the OR estimate.

**Note:** Unless predicting probabilities, we don't need $\beta_0$. It might not be interpretable in some study designs.

#### Predicting probabilities

The model was intended to predict probabilities. Does it do this?
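A quick numerical check, using the rounded coefficients reported above and base R's `plogis` (the logistic function):

```{r}
# rounded estimates from the ICU model fit above
b0 <- -3.24
b1 <- 2.18
plogis(b0)        # P(death | elective),  approx 0.038
plogis(b0 + b1)   # P(death | emergency), approx 0.257
```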
Use the logistic function to obtain
$$\Pr(Y = 1 \,|\, X=0) = \frac{e^{-3.24}}{1 + e^{-3.24}} = 0.0377 \approx 2/53$$
$$\Pr(Y = 1 \,|\, X=1) = \frac{e^{-3.24+2.18}}{1 + e^{-3.24+2.18}} = 0.2573 \approx 38/147$$
(the small differences are due to rounding of the coefficients).

Check these probabilities against the 2x2 data table.

```{r}
tally(status ~ type, data=icuData, margins=TRUE)
```

The model predicts $\P(Y=1)$ by admission type, equaling the proportion of deaths in each group.

If both the OR and the probabilities are valid (prospective study), why would we still care about the probability when we have the odds ratio?

+ The OR is on a relative scale, while the probability tells you how large the actual risk is.
+ There are cases where the probability is very low but the OR is high.
+ Although a high OR is concerning, if the probability is very small, people may still be willing to take the risk.

Test of the hypothesis that ICU death is associated with type of hospital admission:
\[ H_0: OR = e^{\beta_1} = 1 \]
is the same as
\[ H_0: \log( OR ) = \beta_1 = 0 \]

```{r}
tidy(glmfit)
```

The test statistic reported in the output is:
\[ Z = \frac{\widehat{\beta}_1}{\widehat{\se}( \hat{\beta}_1 )} = \frac{2.18}{ 0.745 } = 2.93 \]
and the p-value is
\[ \P(|Z| > 2.93 ) = 0.003 \]

From the model estimation run, a confidence interval for $\beta_1$ is constructed as:
\[ \left( \b1 - \textrm{CriticalValue}\times se(\b1),\quad \b1 + \textrm{CriticalValue}\times se(\b1) \right) \]

Exponentiating the boundaries of the CI gives a CI for the $OR$:
\[ \left(\exp\left( \b1 - \textrm{CriticalValue}\times se(\b1) \right),\quad \exp\left(\b1 + \textrm{CriticalValue}\times se(\b1) \right)\right) \]

```{r}
tidy(glmfit)
```

Overall conclusion: ICU death is significantly and positively associated with emergency (versus elective) hospital admission

p-value $= 0.003$

$\widehat{OR} = 8.89$

95\% CI: $[2.06, 38.3]$

Note: the test on $\beta_1$ in this model is equivalent to this test:
\[ H_0: \mbox{ row frequencies same for both columns in 2x2 table} \]
vs
\[ H_A:
\mbox{ row frequencies differ between the two columns in the 2x2 table} \]
via the $\chi^2$ test.

An advantage of logistic regression is that we have an effect measure estimate with an associated CI.

Another advantage: logistic regression can also be readily used with predictors that are not binary or categorical, where tables cannot be readily formed or would result in loss of information.

### Logistic Regression: Multiple and Continuous Covariates

We can easily expand the analysis to accommodate multiple predictors.

Suppose that we were interested in the association between both age and admission type, and ICU mortality.

Then as before we have a (0,1) indicator for admission type
\[ X_1 = \mbox{type} = \begin{cases} 1 & \mbox{if emergency}\\ 0 & \mbox{if elective} \end{cases} \]
and define
\[ X_2 = \mbox{age (yrs)} \]
and propose a logistic regression model for death as a function of these two covariates:
\[ \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

What is the interpretation of $\beta_2$?
Let's compare the log odds of death of a 50 year old to a 49 year old subject, both admitted on an emergency basis:
\[ \log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\right)=\beta_0+\beta_1+\beta_2\times 50\]
\[ \log\left(\frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)=\beta_0+\beta_1+\beta_2\times 49\]

Subtracting,
$$\log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\right) - \log\left(\frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)$$
$$= \log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)} \bigg/ \frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right) = \beta_2$$

Therefore: $\beta_2$ is the log odds ratio for death for a 50 year old compared to a 49 year old, both of whom had emergency admissions to the hospital (or both of whom had elective admissions to the hospital).

Some important notes about $\beta_2$:

As in linear regression, $\beta_2$ represents the increment in log odds for a one unit change in $X$, holding other predictors constant.

Here, $\beta_2$ is the log odds ratio for death, and $\exp(\beta_2)$ is the odds ratio for death comparing two subjects who differ in age at admission by one year **and** who are of the same admission type (be it emergency or elective).

This is what we mean by an **adjusted effect**: $\beta_2$ is the log odds ratio and $\exp(\beta_2)$ is the odds ratio comparing two subjects who differ in age at admission by one year, **adjusted for admission type**.

What is the interpretation of $\beta_1$? The log odds ratio (emergency vs. elective) for two subjects of a given age (i.e., adjusted for age), and $e^{\beta_1}$ is the odds ratio.

### Hospital admissions example in R

First few observations:

```{r eval=TRUE}
icuData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/icu.csv")
glimpse(icuData)
head(icuData)
```

The 2x2 table of icu survival status (`status`) vs.
predictor (hospital admission type, `type`):

```{r}
addmargins(tally(status ~ type, data=icuData))
```

#### Logistic regression model fit

This is a **generalized linear model** (glm). The "link function" is the logit. The errors have a Bernoulli = binomial($n=1$) distribution.

R syntax:

`glmfit <- glm(y ~ x, data = mydata, family = "binomial")`

`tidy(glmfit)`

```{r}
glmfit <- glm(status ~ type, data = icuData, family = "binomial")
tidy(glmfit)
```

Transform to the odds ratio scale: $e^{\b1}$

```{r}
b1 <- tidy(glmfit)$estimate[2]
exp(b1)
```

This should match the odds ratio right from the 2x2 table.

```{r}
theTable <- tally(status ~ type, data = icuData)
oddsRatio(theTable)
```

Hmmmm.... The odds ratio does not match. After looking at the documentation for `oddsRatio` in the `mosaic` package, I can see it calculates OR = (odds row 1) / (odds row 2). That's the odds of elective admission among those who survived divided by the odds of elective admission among those who died.

Instead, we wanted the odds of death among those with emergency admission divided by the odds of death among those with elective admission. A little care shows that what we want is the reciprocal of the OR that R provided.

```{r}
theTable <- tally(status ~ type, data = icuData)
1 / oddsRatio(theTable)
```

Now, the OR from the table matches the OR from the model, as expected.

Another way to do this is to swap the order of the response and predictor variables and get the "successes" (deaths) into the first column, as suggested in the documentation.

```{r}
theTable <- tally(type ~ -status, data = icuData)
theTable
oddsRatio(theTable)
summary(oddsRatio(theTable))
```

#### Example continued: Add another predictor

```{r}
glmfit2 <- glm(status ~ age + type, data = icuData, family = "binomial")
tidy(glmfit2)
```

The coefficients on the odds ratio scale:

```{r}
coef(glmfit2)
exp( coef(glmfit2) )
```

Note that the admission `type` effect is even larger taking age into account (OR = 11.6).
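The "one-unit change in $X$, holding others constant" interpretation can be verified numerically. In this sketch the coefficient values are made up for illustration; the point is that the ratio of predicted odds for two subjects one year apart in age (same admission type) is exactly $e^{\beta_{age}}$:

```{r}
# made-up coefficients for illustration (not the fitted values)
b0 <- -5.0; b_type <- 2.45; b_age <- 0.034

# predicted odds for two emergency admissions, ages 50 and 49
odds50 <- exp(b0 + b_type * 1 + b_age * 50)
odds49 <- exp(b0 + b_type * 1 + b_age * 49)

odds50 / odds49   # ratio of predicted odds
exp(b_age)        # the same number: the age OR, adjusted for admission type
```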
And age is a potential **confounder** here. Age can also **modify** the effect of admission type on risk of death.

Older individuals are more likely to die under either admission type (since they are older), whereas among younger individuals, emergency admissions may be more strongly associated with death than elective admissions.

At a given age, emergency admissions have much greater (11.6-fold) odds of death.

\medskip

The age effect is to increase the odds of death by a factor of 1.035, or about 3.5\%, per year of age.

**SKIP OVER TO SLIDES PAGE 40**

## Lecture 2

Let's consider the example of an auto insurance company which is trying to measure the chances of a customer filing a claim in a given year. The insurance company knows that age is strongly related to the probability of filing a claim, so `age` is the predictor variable here. I'll simulate some data:

```{r}
# Code taken from
# https://www.kaggle.com/captcalculator/logistic-regression-and-roc-curve-primer/code

# generate fake insurance dataset
set.seed(12)
insurance <- data.frame(
    claim = c(rep(1, 20), rep(0, 20)),
    age = c(rnorm(20, mean = 25, sd = 5), rnorm(20, mean = 37, sd = 7)))

# plot the insurance data
ggplot(data = insurance, aes(x = age, y = claim)) +
    geom_point(color = 'grey50') +
    theme_bw()
```

As we can see, young people here are far more likely to file claims. (Hmm, in my simulation I have some drivers as young as 15... that's a little scary!)
A reminder of what happens if we try a linear fit on this data:

```{r}
# regress claim on age
lmmod <- lm(claim ~ age, data = insurance)

# plot the data and the regression line
ggplot(data = insurance, aes(x = age, y = claim)) +
    geom_point(color = 'grey50') +
    geom_abline(intercept = coef(lmmod)[1], slope = coef(lmmod)[2], color = 'red') +
    theme_bw()
```

It's obvious from the diagnostic plots that the data don't meet the assumptions for linear regression:

```{r}
plot(lmmod)
```

We choose instead to do a logistic regression on this data:

```{r}
# estimate the logistic regression model
logmod <- glm(claim ~ age, data = insurance, family = binomial(link = 'logit'))

# make predictions on the training data
insurance$preds <- predict(logmod, type = 'response')
tidy(logmod)

# plot results
p1 <- ggplot(data = insurance, aes(x = age, y = claim)) +
    geom_point(color = 'grey50') +
    geom_line(aes(x = age, y = preds), color = 'red', size = 0.3) +
    theme_bw()
p1
```

The coefficient for `age` is `r coef(logmod)[2]`. We exponentiate this to get `r exp(coef(logmod)[2])`, so the odds of a claim decrease by about 50% per year.

```{r}
exp( coef(logmod)[2] )
```

Here is the relationship between odds and probability:

```{r}
probability <- seq(0.01, 0.99, by = 0.01)
the_odds <- probability / (1 - probability)
gf_line(probability ~ the_odds)
```

### Logistic Regression - Model Significance and Goodness of Fit Measures

For the model overall, a likelihood ratio test is performed, contrasting the model in question with a null model containing no predictors. This is analogous to the F-test in SLR/MLR.

The test statistic
$$\Lambda = -2 (\mbox{log-likelihood}_{null} - \mbox{log-likelihood}_{full})$$
is $\chi^2_{df}$ where the degrees of freedom df = number of predictors.
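The statistic above can be computed "by hand" from two fitted models. A sketch on simulated data (all names and values here are made up for illustration):

```{r}
# simulate a binary outcome with one predictor
set.seed(3)
x <- rnorm(80)
y <- rbinom(80, 1, plogis(-0.5 + 1.2 * x))

full <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

# LR statistic: -2 * (log-lik of null minus log-lik of full)
Lambda <- -2 * (as.numeric(logLik(null)) - as.numeric(logLik(full)))
df_diff <- length(coef(full)) - length(coef(null))   # here, 1

pchisq(Lambda, df = df_diff, lower.tail = FALSE)     # p-value
```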
This same type of test is used for contrasting two nested models, say dropping 3 predictors out of 7:
$$LR\; Test = -2 (\mbox{log-likelihood}_{smaller \;model} - \mbox{log-likelihood}_{bigger \;model})$$
is $\chi^2_3$ (3 degrees of freedom) and tests whether the three parameters considered for dropping simultaneously have $\beta_j = 0$.

We can see the results of the LR test for our insurance example:

```{r}
library(lmtest)
fit_null <- glm(claim ~ 1, data = insurance, family = binomial(link = 'logit'))
lrtest(logmod, fit_null)
```

### Multiple Logistic Regression

Inference, uses, etc.

\vspace{1ex}

From a multiple logistic regression analysis, we obtain several quantities:

The likelihood ratio (LR) test - an overall test for any "significant" predictors $X$ of the log odds (analogous to the $F$-test in the linear model).

\medskip

We have $Z$-tests for individual $\beta$ coefficients ($H_0: \beta = 0$) for the $X$'s, which are in fact tests for the corresponding ORs ($H_0: \mbox{OR} = 1.0$).

Pseudo-$R^2$ measures the proportion reduction in log-likelihood over the null model. This is a useful measure, but more like another F-test than a measure of model explanatory power.

\vspace{1ex}

Predicted probabilities for individuals:
\[ \Pr(Y=1 | X )= \frac{\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_qX_q)}{1+\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_qX_q)} \]

Here, what we have plotted in red are really our predictions:

```{r}
p1
```

These can be used to develop classification algorithms, predict the probability of response prospectively, etc.

Suppose we want to pick an age below which we will not offer insurance. In other words, we want to find a threshold probability and only offer policies to people who, given their age, have lower than the threshold probability of filing a claim. We want to maximize the number of customers we get while minimizing the number of claims.
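The pseudo-$R^2$ mentioned above (McFadden's version, one of several) is itself just a ratio of log-likelihoods. A sketch on simulated data (all names and values are made up for illustration):

```{r}
# simulate a binary outcome and fit full and null models
set.seed(5)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-0.2 + x))

full <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

# McFadden's pseudo-R^2: proportion reduction in log-likelihood
pseudo_r2 <- 1 - as.numeric(logLik(full)) / as.numeric(logLik(null))
pseudo_r2
```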
We can plot the data versus the probabilities to see where we could best draw the line:

```{r}
p2 <- ggplot(data = insurance, aes(x = preds, y = claim)) +
    geom_point(color = 'grey50') +
    xlab("Predicted probability") +
    theme_bw()
p2
```

At each possible threshold location, we will get some predictions wrong. Where we choose to draw the line depends on how much we weigh the risk of claims against the desire for more customers.
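To count the errors at one candidate threshold, we can cross-tabulate actual outcomes against thresholded predictions. A self-contained sketch mimicking the insurance simulation above (the threshold 0.5 is an arbitrary choice for illustration):

```{r}
# re-simulate data in the style of the insurance example
set.seed(4)
age   <- c(rnorm(20, 25, 5), rnorm(20, 37, 7))
claim <- rep(c(1, 0), each = 20)

m <- glm(claim ~ age, family = binomial)
pred_probs <- predict(m, type = "response")

# classify each person at an arbitrary threshold of 0.5
threshold <- 0.5
predicted_claim <- as.numeric(pred_probs > threshold)

# confusion counts: off-diagonal cells are our mistakes
table(actual = claim, predicted = predicted_claim)
```

Moving the threshold trades one kind of error for the other, which is exactly the choice discussed above.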































We can make a "receiver operating characteristics" (ROC) curve to summarize the effects of the different choices:

```{r}
# roc.plot(insurance$claim, insurance$preds, 30)
```

### Assumptions and Model Diagnostics

\vspace{1ex}

#### The assumptions:

* No systematic bias in measurement: *still required*
* Uncorrelated observations: *still required*
* No strong multicollinearity: *still required* (and you can still use VIF to check for violation of this assumption)
* Linearity (w.r.t. the link function): *still required*
* No influential observations: *still required*
* Normality and constant variance assumptions: **no longer required**

The error term is Bernoulli distributed.

Many of the approaches we discussed before for addressing assumption violations in MLR may still apply for logistic regression.

Confounders should still always be included. Interaction terms should not be included without their main effects. Categorical variables should be considered as a whole set. ...and so on.

Several diagnostic quantities, aiming to detect outliers and influential points, are defined (C\&H 12.5). These borrow concepts from linear regression.

*Pearson residuals* and *deviance residuals*, plotted against the predicted probabilities or an index measure, are similarly assessed for large deviations.

A similar *leverage* measure can be derived to identify the effect of particular observations.

The *DBETA* measure determines the change in regression coefficients when observation $i$ is omitted, implicitly evaluating influence.

A commonly used goodness of fit measure in logistic regression is the Hosmer-Lemeshow test. The test groups the $n$ observations into groups (according to their estimated probability of event) and calculates the corresponding generalized Pearson $\chi^2$ statistic. Usually deciles (10 groups) are used.
$$H = \sum_{g=1}^G \frac{(O_g - E_g)^2 }{ n_g \hat p_g (1 - \hat p_g) },$$
where $O_g$ is the number of events in group $g$ and $E_g = n_g \times \hat p_g$ is the expected number of events.

In this type of test, a *large* p-value ($>$ 0.05) indicates good correspondence between observed and predicted outcomes.

```{r}
library(generalhoslem)
logitgof(icuData$status, fitted(glmfit2))
```

```
Logistic model for status, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)
+--------------------------------------------------------+
| Group |  Prob  | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+-------+-------|
|     1 | 0.0362 |     0 |   0.5 |    20 |  19.5 |    20 |
|     2 | 0.0478 |     1 |   0.9 |    20 |  20.1 |    21 |
|     3 | 0.0812 |     1 |   1.2 |    18 |  17.8 |    19 |
|     4 | 0.1209 |     2 |   1.8 |    18 |  18.2 |    20 |
|     5 | 0.1916 |     2 |   3.1 |    18 |  16.9 |    20 |
|-------+--------+-------+-------+-------+-------+-------|
|     6 | 0.2531 |     7 |   4.6 |    14 |  16.4 |    21 |
|     7 | 0.2936 |     6 |   6.1 |    16 |  15.9 |    22 |
|     8 | 0.3376 |     5 |   5.8 |    13 |  12.2 |    18 |
|     9 | 0.3846 |     8 |   7.3 |    12 |  12.7 |    20 |
|    10 | 0.5186 |     8 |   8.7 |    11 |  10.3 |    19 |
+--------------------------------------------------------+

number of observations = 200
number of groups = 10
Hosmer-Lemeshow chi2(8) = 2.95
Prob > chi2 = 0.9372
```

This result looks very positive - a good fit. However, Pseudo-$R^2 = 13\%$, which is not very high. Pseudo-$R^2$ measures the proportion reduction in log-likelihood over the null model. This is a useful measure, but more like another F-test than a measure of model explanatory power.

The Hosmer-Lemeshow g.o.f. test is more valuable as a means to identify major systematic variation that is not explained. A large p-value does not assure that prediction will be highly accurate, etc.

AIC and BIC measures can also be used to select among models.
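A sketch of AIC/BIC comparison between two nested logistic models, on simulated data (all names and values made up for illustration):

```{r}
# simulate a binary outcome with two predictors
set.seed(1)
n  <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n, 50, 10)
y  <- rbinom(n, 1, plogis(-4 + 1.5 * x1 + 0.05 * x2))

m1 <- glm(y ~ x1,      family = binomial)
m2 <- glm(y ~ x1 + x2, family = binomial)

AIC(m1, m2)   # lower values indicate a better fit/complexity trade-off
BIC(m1, m2)   # BIC penalizes extra parameters more heavily
```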
### Summary -- Logistic Regression Models

* Logistic regression can be extended to a multinomial outcome variable, where $Y$ equals one of several mutually exclusive nominal categories (see C\&H).
* Similarly, ordinal logistic regression permits an ordered categorical response variable.
* These models fit into a general class of models in which a "link" function of the mean of $Y$ is modeled by a linear combination of predictors.

## Lecture 3

### A Generalized Approach for Many Model Types

* Noting and taking advantage of commonalities among linear models for different response variable types, Nelder and Wedderburn and later McCullagh (UChicago) and Nelder developed **Generalized Linear Models**.
* This approach generalizes many types of models into one framework, unifying theory and estimation methods.
* For each model relating $Y$ to predictors $X$, one specifies
    * the *link function* $h( \cdot)$, which indicates the relationship between the linear prediction equation and $\E(Y)$;
    * the distribution for the error term $\epsilon$ of the model.
* Then, a unified theory and single estimation approach subsumes a wide variety of models.
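This unification is visible directly in R's `glm()` interface: the same function fits different GLMs once a `family` (error distribution plus default link) is specified. A sketch on simulated data (names and values made up for illustration):

```{r}
# one interface, different link/error specifications
set.seed(2)
x <- rnorm(100)
y_bin <- rbinom(100, 1, plogis(-0.5 + x))        # binary outcome
y_cnt <- rpois(100, lambda = exp(0.3 + 0.5 * x)) # count outcome

fit_logit   <- glm(y_bin ~ x, family = binomial) # logit link by default
fit_poisson <- glm(y_cnt ~ x, family = poisson)  # log link by default

c(fit_logit$family$link, fit_poisson$family$link)
```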
A Few of the Several Types of GLMs:

| Response | Link Function | Error Term | Model |
| --- | --- | --- | --- |
| Continuous ($\approx$ normal) | identity | normal | linear |
| 0/1 discrete | logit | binomial | logistic |
| polychotomous discrete | logit | multinomial | multinomial logistic |
| Integer counts | natural log | Poisson | Poisson regression |
| real valued, non-negative | inverse | Gamma | survival |

### Poisson Regression

* **Poisson regression** is used to model **count variables** as the outcome. The outcome (i.e., the count variable) in a Poisson regression cannot take on negative values (it can equal 0).

\medskip

* A Poisson regression model is sometimes known as a **log-linear model**, and it takes the form:
\[ \log \big(\E(Y|X) \big) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p. \]
**Note that** $\log \big(\E(Y|X) \big) \neq \E\big( \log(Y) \,|\, X \big)$; the latter is OLS using a log transformation on $Y$. The predicted mean of the Poisson model is $\E(Y|X) = \exp \big(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \big).$

* Recall that the mean and variance of the Poisson distribution are the same. Therefore, a very strong assumption in Poisson regression is: **conditional on the predictors, the conditional mean and variance of the outcome are equal.**

### Examples of Poisson observations

1. The number of persons killed by mule or horse kicks in the Prussian army per year. Ladislaus Bortkiewicz collected data from 20 volumes of Preussischen Statistik. These data were collected on 10 corps of the Prussian army in the late 1800s over 20 years.

2. The number of people in line in front of you at the grocery store. Predictors may include the number of items currently offered at a special discounted price and whether a special event (e.g., a holiday, a big sporting event) is three or fewer days away. Problems involving queueing frequently involve the Poisson distribution.

3. The number of awards earned by students at one high school.
Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., vocational, general, or academic) and the score on a mathematics exam.

We illustrate Poisson regression using Example 3 above:

* `num_awards` is the outcome variable and indicates the number of awards earned by students at a high school in a given year,
* `math` is a continuous predictor variable and represents students' scores on their math final exam, and
* `prog` is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled.

\bigskip
For Poisson regression, we assume that the outcome variable, number of awards, conditioned on the predictor variables, will have roughly equal mean and variance.

### Poisson regression assumptions

Examining the mean number of awards by program type seems to suggest that program type is a good candidate for predicting the number of awards. Additionally, the means and variances within each level of program -- the conditional means and variances -- are similar.

```{r}
library(foreign)
awards <- read.dta('http://statistics.uchicago.edu/~collins/data/s224/poisson_sim.dta')
library(dplyr)
grouped <- group_by(awards, prog)
summarise(grouped, mean = mean(num_awards), sd = sd(num_awards), var = var(num_awards), N = length(num_awards))
hist(awards$num_awards)
```

Can we use OLS here? The normality assumption is violated. Count outcome variables are sometimes log-transformed and analyzed using OLS regression. Some issues arise with this approach; for example, more than half of the data (124 students) have **zero** awards, and the log of zero is undefined.

### Poisson regression model and coefficients

We do the fit using the `glm` function:

```{r}
fit <- glm(num_awards ~ prog + math, family = 'poisson', data = awards)
tidy(fit)
glance(fit)
```

* Results (the $\beta$s) are increases/decreases in log(counts) on an additive scale.
* To interpret the coefficients, one takes $\exp(\beta)$ and interprets it as the expected fold change (relative change) in counts per one-unit change in $X$.

### Poisson regression model fit

To help assess the fit of the model, we can do the **goodness-of-fit** $\chi^2$ test. This is **not** a test of the model coefficients, but rather \underline{a test of the model form}: does the Poisson model form fit our data? Thus, a large goodness-of-fit p-value indicates the observed and the predicted data are not too different from each other, i.e., a good fit.

```{r}
# Deviance goodness-of-fit
pchisq(fit$deviance, df = fit$df.residual, lower.tail = FALSE)
```

A statistically significant result (small p-value) here would indicate that the model does not fit the data well. In that situation, we may try to determine whether there are omitted predictor variables, whether our linearity assumption holds, and/or whether the conditional mean and variance of the outcome are very different.

### Summary - Poisson regression and GLMs

* Poisson regression is a useful model for many phenomena, but it has a strong theoretical assumption: that the conditional mean and variance of the outcome variable are equal.
* When there seems to be an issue of bad fit, we should first check whether our model is appropriately specified, e.g., for omitted variables and incorrect functional forms.
* The assumption that the conditional variance equals the conditional mean should be checked. There are alternative variations on Poisson regression that may work better.

### Linear mixed-effects models

* In certain data, a group of individuals is measured repeatedly over a time period. For example, we measure the blood pressures of ICU patients on the day of admission, then on day 1, day 2, and so on, until discharge or death.
* Samples can be naturally clustered. E.g., we took the income and education data of 357 individuals from 31 extended families.
* Other examples of clustered data: students in multiple schools/classes/sections, patients from multiple clinics/hospitals, samples processed in multiple experiments, workers at several work sites, people from different cities, etc.
* When analyzing $N$ samples from $k$ clusters, observations are not completely independent. Samples from the same cluster can be more similar to each other, and thus are correlated.
* Example: Consider a longitudinal dataset used in Ruppert, Wand, and Carroll (2003) and Diggle et al. (2002). Weight measurements were taken on 48 pigs over 9 successive weeks. Each pig has a unique id. We are interested in the rate of growth of pig weight over the 9-week period. Each pig experiences a linear trend in growth, but each pig has a different starting weight (i.e., intercept).

```{r}
library(readstata13)
pig <- read.dta13('http://www.stata-press.com/data/r13/pig.dta')
p1 <- gf_point(weight ~ week, color = ~id, group = ~id, data = pig) %>%
  gf_line(weight ~ week, color = ~id, group = ~id, data = pig)
p1
```

* You may fit a simple linear regression with weight as the response and week as the predictor. However, the weight measurements for the same pig are correlated with each other, so the error terms are correlated -- violating regression assumptions. Note that $s.e.(\hat\beta_1)_{SLR}=0.0818409$.

```{r}
fit_pig <- lm(weight ~ week, data = pig)
tidy(fit_pig)
glance(fit_pig)
```

* You may also fit a linear regression with each pig having a different intercept. This model implicitly says that, other than each pig possibly starting at a different baseline weight, the weight growth rate is the same for all pigs. The $\hat\beta_1$ estimate is similar to the OLS estimate; however, $s.e.(\hat\beta_{week})_{fixed}=0.0390633$, much smaller than before.
```{r}
pig <- mutate(pig, id_factor = as.factor(id))
fit_pig2 <- lm(weight ~ week + id_factor, data = pig)
summary(fit_pig2)
```

* However, we are not really interested in estimating the baseline weights of these particular 48 pigs per se, and estimating them consumed 47 additional degrees of freedom in the model.
* Another idea is to consider those 48 pigs as a random sample from a larger population, with their baseline weights coming from a common distribution with mean $\beta_0$; each pig's baseline weight may differ from the population average by $u_j$, with $u_j \sim N(0, \sigma_u^2)$.
* The weight measurements for the same pig are correlated with each other because they come from the same pig, sharing the same baseline. (This is a simplifying assumption that can be relaxed later.)
* In summary, we treat the pigs as a random sample from a larger population and model the between-pig variability as a random effect -- more specifically, as a random intercept for each pig. For each pig $j$, the beginning weight may differ from the population average $\beta_0$ by $u_j$, with $u_j \sim N(0, \sigma_u^2)$.
* In comparison with estimating the intercept for each of the 48 pigs, in the mixed-effects model here we only need to estimate one additional parameter, $\sigma_u$; the estimated $u_j$'s can be derived given the other parameter estimates.
* We thus wish to fit the model
\[
\textrm{weight}_{ij} = \beta_0+\beta_1\textrm{week}_{ij}+u_j+\epsilon_{ij},
\]
for $i=1,\ldots,9$ weeks and $j=1,\ldots,48$ pigs.
* Here $\beta_0+\beta_1\textrm{week}_{ij}$ is called the ``fixed-effects" term. The coefficients $\beta_0$ and $\beta_1$ are fixed for all individuals and represent population average effects, not individual-specific effects.
* In contrast, $u_j$ is a random effect (here, a random intercept) that shifts the regression line up or down for each pig. That is, the random effect is an individual-specific effect, with $u_j\sim N(0, \sigma_u^2)$.
* The model above contains both fixed effects (population effects common to all clusters/individuals) and random effects (individual-specific or cluster-specific effects), and as such is called a linear mixed-effects model.
* Based on the linear mixed-effects model,
\[
\textrm{weight}_{ij} \sim N( \beta_0+\beta_1\textrm{week}_{ij},\ \sigma^2_u+\sigma_\epsilon^2).
\]
* In R, you may load the R library ``lme4".

```{r}
library(lme4)
fit_pig3 <- lmer(weight ~ week + (1|id), data = pig)
summary(fit_pig3)
```

* In the mixed-effects model here, $s.e.(\hat\beta_1)_{mixed}=0.0390124$. Correlations among samples are accounted for, and we have valid inference.
* Now, look at the scatter plot again.

```{r}
p1
```

* It also seems that there is a general average growth rate for the entire pig population ($\beta_1$), but individual pigs may have slightly faster or slower rates of growth.
* Instead of fitting 48 interaction effects and estimating 48 slopes ($\beta_{1j}$) for individual pigs, we may assume that $\beta_{1j} = \beta_{1}+ u_{1j}$ with $u_{1j}\sim N(0, \sigma_{u_1}^2)$. We can then fit the following model with both random intercepts and random slopes for week:
\[
\textrm{weight}_{ij} = \beta_0+u_{0j}+\beta_1\textrm{week}_{ij}+u_{1j}\textrm{week}_{ij}+\epsilon_{ij}.
\]

```{r}
fit_pig4 <- lmer(weight ~ week + (week|id), data = pig)
fit_pig4
```

### Summary -- linear mixed-effects models

* When data are clustered and samples within each cluster are correlated, OLS does not provide the right s.e. estimates or valid inferences. One may consider using linear mixed-effects models.
* The linear mixed-effects model estimates the population average ``fixed effects" and the individual-specific ``random effects."
* The fixed effects are the population average effects of the predictor variables. The random effects are the individual-specific effects, assumed uncorrelated with the predictors.
* In estimating the parameters, no closed-form solutions exist, and the Expectation-Maximization (EM) algorithm is commonly used.

### Summary -- Beyond Linear Regression

* There are additional models for other data types, studied in courses such as:
    * Categorical Data Analysis - all types of discrete outcome variables
    * Generalized Linear Models - all GLMs
    * Biostatistical Methods - logistic, Poisson, hazard (time-to-event data)
    * Applied Survival Analysis - time-to-event data
    * Applied Longitudinal Data Analysis - mixed-effects models for repeated measures or clustered data