---
title: "STAT 224 Chapter 12"
output:
html_document:
code_folding: hide
df_print: kable
number_sections: yes
self_contained: yes
toc: FALSE
toc_depth: 4
toc_float: FALSE
params:
dotabs: FALSE
---
```{r test-main, child = 'preamble.Rmd'}
```
# Chapter 12 Logistic Regression `r if(params$dotabs) paste('{.tabset}')`
## Part 1
### Regression Models for the Probability of Response Outcome
So far, response variable $Y$ continuous and (approximately) normally distributed.
What if $Y$ is a binary discrete variable, taking on values $0$ or $1$?
For example,
bank failure:
$Y = 1$ if failed (bankcruptcy), $Y=0$ if not
disease event:
$Y = 1$ if with disease (case), 0 otherwise (unaffected/control).
Each individual observation is like a "trial" in a binary outcome experiment
where $P(Y=1)$ is some value $p$.
$Y$ is a Bernoulli random variable, with probability of "success" $\pi$.
| $Y$ | $P(Y)$ |
| ------------- |-------------|
| $0$ (failure) | $P(Y=0) \;=\; 1-\pi$ |
| $1$ (success) | $P(Y = 1) \;=\; \pi$ |
For each observtion, $Y_i \sim {\rm Bernoulli}(\pi)$, for $i=1, 2, \ldots, n$
$E(Y_i) = \pi = P(Y_i=1)$
$\var(Y_i) = \pi(1-\pi)$
So, the variance of $Y$ depends on its mean!
+ Want to identify factors related to the probability of an event.
+ Want to predict the probability of an event.
If we try to model $\pi = \E(Y)$ as a linear function of predictors $X$:
\[
\E(Y \,|\, X) = \P(Y = 1\, |\, X) = \pi_x = \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p
\]
Several problems arise:
+ Model can readily (and incorrectly) estimate a probability below $0$ or above $1$
+ $0 < \pi < 1$ is a restriction not properly handled by a linear model
+ $Y$ not even approx.\ normally distributed (not even continuous)
+ The variance of $Y$ (thus, the error term) is not constant over $X$
+ since $\E(Y)=\pi$ and $\var(Y) = \pi(1-\pi)$.
+ So if $\pi$ varies conditional on $X$, so does the variance.
The following transform does permit us to formulate $\P(Y=1)$
(actually, a function of it)
as a linear function of predictor variables $X$
\[
\log_e \left[\frac{\pi}{1-\pi}\right]
= \log_e \left[\frac{\P(Y=1\, |\, X)}{1-\P(Y=1\, |\, X)} \right]
= \beta_0 + \beta_1X_1 + \ldots + \beta_pX_p .
\]
$$-\infty < \log_e \left[\frac{\pi}{1-\pi}\right] < \infty$$
So, this transformed response is a continuous variable with no restrictions,
unlike $0 < \pi < 1$.
From here, after some algebra
\[
\pi_x = \P(Y=1\, |\, X)
= \frac{\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}{1+\exp(\beta_0 + \beta_1X_1 + \ldots + \beta_pX_p)}
\]
For this model, observations with same values of the $X$s
have the same probability of "success" -- $\pi_x = \P(Y=1)$.
#### The response is now the "odds of success"
$\displaystyle O = \mbox{odds of success} = \frac{\mbox{probability of success}}{\mbox{probability of failure}}=\frac{\pi}{1-\pi} $$
Odds is just another way of expressing probabilities.
**CAUTION:**
I don't know why, but the textbook calls $\displaystyle \frac{\pi}{1-\pi}$ an "odds ratio".
This is incorrect. This is the "odds", not an "odds ratio" (a ratio of two odds).
### Regression Models for Log Odds of Response
The **logit transform** of a Bernoulli($\pi$) response variable $Y$ is
\[
\logit(\pi) =\log_e \left(\frac{\pi}{1-\pi}\right)
= \log_e(\mbox{odds}) = \mbox{"log odds"}
\]
**Logistic regression** models predict the logit as a linear function of $X$s
The name derives from the fact that we can write
\[ \pi = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}} \]
and the right-hand side of an algebraic form called a **logistic function**.
+ A logistic function is the inverse of a logit function.
+ Logit function "links" the mean of $Y$ to a linear function of $X$s.
+ Logit function is also called the **link function** for logistic regression.
+ This method produces valid probability values ($\pi_x$) for any values of $X$.
+ The estimates of $\pi$ are always values between $0$ and $1$.
### A Review (or new to you): Odds and Odds Ratios
How do we typically summarize and analyze data with binary response?
We should have analogous basic methods in logistic regression,
just as in linear regression,
where we can reproduce the two-sample t-test for comparison of means via SLR.
(See Chapter 1-2 notes and textbook Section 2.11)
If the question is whether the probability of a binary event differs
between, say, two groups, we have
+ Test for difference of proportions between two groups
+ normal-distribution-based test
+ exact binomial test
+ Test for homogeneity in a 2x2 contingency table
+ Chi-square test ($\chi^2$)
We also have associated measures of difference or effect
+ difference of proportions,
+ ratio of proportions
...and another summary that we will usehere, the **odds ratio**
**CAUTION:**
I don't know why, but the textbook calls $\displaystyle \frac{\pi}{1-\pi}$ an "odds ratio".
This is incorrect. This is the "odds", not an "odds ratio" (a ratio of two odds).
#### 2 x 2 tables review
Recall the basic set-up for the **2 $\times$ 2 Table**.
For a retrospective, case-control epidemiological study of some exposure:
| Disease status | Exposed | Not exposed| Total |
| ---: | --- | --- |--- |
case | a | b | a+ b
non-case | c | d | c+d
Total | a+c | b+d | a + b + c + d = N
To study the association between outcome and exposure based on a 2x2 table,
generally we may consider the following measures:
#### Difference of Proportions (or Risk Difference)
\[
\mbox{risk difference} = \frac{a}{a + c} - \frac{b}{b + d}
\]
However, in a case-control study, the proportion of cases **is not**
the proportion (estimated risk) of disease in the population. i
The samples are collected based on outcome (diseased versus non-diseased).
It is generally not possible to calculate and compare
the absolute risk of disease (incidence) in a case-control study.
(Although, the risk proportions would be valid in a cohort study.)
#### Why do a retrospective, case-control study vs. a cohort study?
In a cohort study,
+ Determine exposed or not status at the start.
+ Follow participants to see if get the disease (case) or not.
+ Could take years and years to get the data.
+ ...and if the disease is rare, tough to get enough data on cases
#### The Odds Ratio
Ratio of odds of event ($Y=1$) in each exposure group
Note that in the *exposed* group
\[ \mbox{odds(case, diseased)} = \frac{\pi_e}{1-\pi_e } = \frac{a\,/\,(a+c)}{c\,/\,(a+c) } = a\,/\,c \]
and in the *unexposed* group
\[ \mbox{odds(case, diseased)} = \frac{\pi_u}{1-\pi_u } = \frac{b\,/\,(b+d)}{d\,/\,(b+d) } = b\,/\,d \]
Then
\[
OR = \mbox{odds ratio} = \frac{\mbox{ odds of disease in exposed group}}{\mbox{ odds of disease in unexposed group}}
= { \frac{a\,/\, c}{b\,/\,d} } = \frac{ad}{bc}
\]
This is the **odds ratio** or relative odds
#### Odds Ratio interpretable for case-control, retrospective studies
Why use an odds ratio?
In case-control studies:
+ The samples are collected based on outcome (diseased versus non-diseased).
+ Odds(disease for exposure group) **cannot** be estimated from the data.
+ Odds(disease for unexposed group) **cannot** be estimated from the data.
However, odds ratio, a relative measure, is still valid.
| Disease status | Exposed | Not exposed| Total |
| ---: | --- | --- |--- |
case | a | b | a+ b
non-case | c | d | c+d
Total | a+c | b+d | a + b + c + d = N
Odds(exposure) **can** be estimated for both diseased and not.
The odds of exposure among cases:
\[ \mbox{odds(exposure)} = \frac{a\,/\,(a+b)}{b\,/\,(a+b) } = a\,/\,b \]
The odds of exposure among non-cases:
\[ \mbox{odds(exposure)} = \frac{c\,/\,(c+d)}{d\,/\,(c+d) } = c\,/\,d \]
$\displaystyle{
OR = \mbox{odds ratio} = { \frac{a\,/\, b}{c\,/\,d} } = \frac{ad}{bc}
}$
= same as OR of diseased for exposed vs. not!!!
### Logistic Regression
#### Example: Hospital admissions and ICU deaths
We are interested in whether there is any difference in ICU mortality rates (outcome)
for different initial hospital admission types (exposure, X variable).
| Admission Status | Emergency | Elective | Total |
| ---: | --- | --- |--- |
Death in ICU | a | b | a+ b
Survived after ICU | c | d | c+d
Total | a+c | b+d | a + b + c + d = N
Let $\pi_0=$ risk of death in he population of elective admission patients
Let $\pi_1=$ risk of death in population of emergency admission patients
The study considers a window of time and classifies ICU stays by
response (death or not)
and
predictor (elective or emergency hospital admission)
#### The odds
For the elective admission population, the "odds" of death is:
\[
\frac{\pi_0}{1-\pi_0} = \frac{b\,/\,(b+d)}{d\,/\,(b+d)} = \frac{b}{d}
\]
$\displaystyle{ \log \left[ \frac{\pi_0}{1-\pi_0} \right] }$
is the "log odds" of death, or the "logit" of $\pi_0$.
For the emergency admission population, the "odds" of death is:
\[
\frac{\pi_1}{1-\pi_1} = \frac{a\,/\,(a+c)}{c\,/\,(a+c)} = \frac{a}{c}\]
**REMINDER:**
For a case-control study like this one
these are **NOT** estimates of the population odds.
#### The log odds
\[ \log \left\{ \frac{\pi_1}{1-\pi_1} \right\}\]
is the log odds of death, or the logit of $\pi_0$ for the elective admission group.
\[ \log \left\{ \frac{\pi_1}{1-\pi_1} \right\}\]
is the log odds of death, or the logit of $\pi_1$ for the emergency admission group.
Why log odds (logit)?
Rather than model on either the probability or odds ratio scale
(both of which have a restricted valid range),
we need another metric that is equivalent but not constrained.
$0 < \pi < 1\qquad 0 < OR < \infty \qquad -\infty < \log(\textrm{odds}) < \infty$
#### The log odds ratio
We can take the difference of log odds
(in the same way that we work with a difference of means)
\[ \log \left\{ \frac{\pi_1}{1-\pi_1} \right\} - \log \left\{ \frac{\pi_0}{1-\pi_0} \right\}
\;=\; \log \left\{ \frac{ \pi_1 / (1 - \pi_1) }{ \pi_0 / (1-\pi_0) } \right\}
\;=\; \log(\mbox{odds ratio}) \;=\; \log(OR)\]
by basic rules of logarithms: $\log(v - w) = \log(v/w)$.
$\log(OR)$ is the population "log odds ratio" of death,
comparing the emergency admissions to the elective admissions.
We label it $\log(OR) = \beta_1$ and want our model to estimate it.
### Logistic Regression Model
Define an indicator variable for admission type:
\[
x = \mbox{type } =
\begin{cases}
0 & \mbox{elective}\\
1 & \mbox{emergency}
\end{cases}
\]
Algebraically, the log odds can be expressed as a linear function of predictors
(log odds has range $(-\infty, \infty)$).
\[ \log(\mbox{odds of event}) = \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x \]
**Note:** There is no $\epsilon$ term written
because the left side is a function of the mean of $Y$ ($\pi$),
not just $Y$ or a transformation of $Y$
Since the log odds is simply a function of the probability of interest (\pi=\pr(Y=1)),
a model that predicts the log odds is equivalent to a model that predicts the probability.
#### Example: Hospital admissions and ICU deaths
Here is the logistic regression model for the risk of death in ICU as a function of admission type:
\[ \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x \]
When $x=0$,
\[
\log \left( \frac{\pi}{1-\pi} \right)
\;=\; \beta_0 + \beta_1 (0) = \beta_0
\;=\; \log \left( \frac{\pi_0}{1-\pi_0} \right)
\]
When $x=1$,
\[
\log \left( \frac{\pi}{1-\pi} \right)
\;=\; \beta_0 + \beta_1 (1) = \beta_0 + \beta_1
\;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right)
\]
Then,
$$
\begin{align}
\log(OR)
&\;=\; \log \left( \frac{\pi_1/(1-\pi_1)}{\pi_0(1-\pi_0)} \right)\\
&\;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) - \log \left( \frac{\pi_0}{1-\pi_0} \right) \\
&\;=\; (\beta_0 + \beta_1) - \beta_0\\
&\;=\; \beta_1
\end{align}
$$
Thus, odds ratio = $OR = e^{\beta_1}$
### More About the Logistic Regression Model
A linear regression model describes the mean of a random variable ($Y$)
as a linear combination of some other variables ($X$'s)
A logistic regression model describes the log odds of a probability
(that a binary variable $Y$ takes on the value $1$)
as a linear combination of some other variables ($X$'s)
Because error structure in the two models is different
(Bernoulli vs. assumed normal),
a different model estimation method from least squares is used:
"**maximum likelihood estimation.**"
### Logistic Regression - Estimation and Model Interpretation
In R, we use the GLM function
(GLM stands for Generalized Linear Models).
For the logit model, we specify the distribution family as binomial
```{r}
icuData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/icu.csv")
head(icuData)
addmargins( tally(status ~ type, data=icuData) )
glmfit <- glm(status ~ type, data = icuData, family = "binomial")
tidy(glmfit)
```
Interpretation of regression coefficients:
Here, $\hat{\beta}_0 = -3.24$ and $\hat{\beta}_1 = 2.18$.
The model
$\displaystyle{\mbox{log odds of status being death} = \log \left\{ \frac{\pi}{1-\pi} \right\} = -3.24 + 2.18 (\mbox{type})}$
predicts the log odds (and odds) of death for each exposure group:
when $X=0$ (elective group): log odds or logit = $\beta_0 = -3.24$, then
$$\widehat{O}_0 = \mbox{odds of death}_0 = e^{\b0} = e^{-3.24} = 0.039 $$
when $X=1$ (emergency group):
$$\widehat{O}_1 = \mbox{odds of death}_1 = e^{\b0 + \b1} = e^{-3.24 + 2.18} = 0.346 $$
\[
\widehat{OR} = \textrm{odds ratio} = \frac{\mbox{odds}_1}{\mbox{odds}_0} = e^{2.18} = 8.89 = e^{\widehat{\beta}_1}
\]
Back to the data for a minute
| Status | Elective | Emergency | Total |
| --- | --- | --- | --- |
0 | 51 | 109 | 160
1 | 2 | 38 | 40
Total | 53 | 147 | 200
From this table, note that odds are
elective: 2/51 = 0.039
emergency 38/109 = 0.346
Then, from the table
$\qquad \widehat{OR} = \frac{51 \times 38}{2 \times 109} = 8.89$
From the model, the ratio of predicted odds (i.e.' odds ratio) is
$$\frac{0.346}{0.039} = 8.89 = \widehat{OR}$$
But also note that the difference of log odds $\equiv$ log odds ratio
is **directly given** by the model ($\b1$).
An estimate of the odds ratio is therefore:
\[\widehat{OR} = e^{\widehat{\beta}_1} = e^{2.18} = 8.89 \]
So, the $\widehat{\beta}$ coefficient for `type` gives the log odds ratio for type of admission.
$e^{\b1}$ gives the OR estimate.
**Note:**
Unless predicting probabilities, we don't need $\beta_0$.
It might not be interpretable in some study designs.
#### Predicting probabilities
The model was intended to predict probabilities.
Does it do this?
Use the logistic function to obtain
$$\Pr(Y = 1 \,|\, X=0) = \frac{e^{-3.24}}{1 + e^{-3.24}} = 0.0377 = 2/53$$
$$\Pr(Y = 1 \,|\, X=1) = \frac{e^{-3.24+2.18}}{1 + e^{-3.24+2.18}} = 0.2573 = 38/147$$
Check these probabilities against the 2x2 data table.
```{r}
tally(status ~ type, data=icuData, margins=TRUE)
```
The model predicts $\P(Y=1)$ by admission type,
equaling the proportion of deaths in each group.
If both OR and probabilities are valid (prospective study),
why would we still care about the probability while we have odds ratio?
+ OR is on a relative scale while the probability tells you how large the actual risk is.
+ There are certain cases when the probability is very very low but the OR is high.
+ Although it is concerning that OR is high, but since the probability is so small.
+ People may still be willing to take the risk.
Test of the hypothesis that ICU death is associated with type of hospital admission:
\[ H_0: OR = e^{\beta_1} = 1 \]
is same as
\[ H_0: \log( OR ) = \beta_1 = 0 \]
```{r}
tidy(glmfit)
```
The test statistic reported in the output is:
\[ Z = \frac{\widehat{\beta}_1}{\widehat{\se}( \hat{\beta}_1 )}
= \frac{2.18}{ 0.745 } = 2.93 \]
and the p-value is
\[ = \P(|Z| > 2.93 ) = 0.003 \]
From the model estimation run, a confidence interval for $\beta_1$ is constructed as:
\[
\left( \b1- \textrm{CriticalValue}\times se(\b1),\quad \b1 + \textrm{CriticalValue}\times se(\b1) \right)
\]
Exponentiating the boundaries of the CI gives a CI for the $OR$:
\[
\left(\exp\left( \b1- \textrm{CriticalValue}\times se(\b1) \right),\quad
\exp\left(\b1 + \textrm{CriticalValue}\times se(\b1) \right)\right)
\]
```{r}
tidy(glmfit)
```
Overall conclusion:
ICU death is significantly and positively associated with
emergency (versus elective) hospital admission
p-value $=0.003$
$\widehat{OR} = 8.89$
95\% CI: $[2.06, 38.3]$
Note: test on $\beta_1$ in this model is equivalent to this test:
\[ H_0: \mbox{ row frequencies same for both columns in 2x2 table} \]
vs
\[ H_A: \mbox{ row frequencies same for both columns in 2x2 table} \]
via the $\chi^2$ test.
Advantage of logistic regression is that we have an effect measure estimate with associated CI
Another advantage:
Logistic regression also can be readily used with predictors that are not binary or categorical,
where tables cannot be readily formed or would result in loss of information.
### Logistic Regression: Multiple and Continuous Covariates
We can easily expand the analysis to also accommodate multiple predictors. Suppose that we were interested in the association between both
age and admission type, and ICU mortality
Then as before we have a (0,1) indicator for admission type
\[
X_1 = \mbox{type } = \left\{ \begin{array}{l}
1 \mbox{ if emergency}
0 \mbox{ if elective}
\end{array}
\right.
\]
and define
\[
X_2 = \mbox{age (yrs)}
\]
and propose a logistic regression model for death as a function of these two covariates:
\[
\log \left\{ \frac{p}{1-p} \right\} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]
What is the interpretation of $\beta_2$?
Let's compare the log odds of death of a 50 year
old to a 49 year old subject, both admitted on an emergency basis
\[
\log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\right)=\beta_0+\beta_1+\beta_2\times 50\]
\[
\log\left(\frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)=\beta_0+\beta_1+\beta_2\times 49\]
$$\log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}\right) - \log\left(\frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)$$
$$\log\left(\frac{P(Y=1|X_1=1,X_2=50)}{1-P(Y=1|X_1=1,X_2=50)}/ \frac{P(Y=1|X_1=1,X_2=49)}{1-P(Y=1|X_1=1,X_2=49)}\right)$$
Therefore: $\beta_2$ is the log odds ratio for death
for a 50 year old compared to a 49 year old,
both of whom had emergency admissions to the hospital
(or both of whom had elective admissions to the hospital)
Some important notes about $\beta_2$:
As in linear regression, $\beta_2$ represents
the increment in log odds for a one unit change in $X$,
holding other predictors constant.
Here, $\beta_2$ is the log odds ratio for death,
and $\exp(\beta_2)$ is the odds ratio for death
comparing two subjects who differ in age at admission by one year
**and** who are of the same admission type (be it emergency or elective).
This is what we mean by an **adjusted effect**:
$\beta_2$ is the log odds ratio and $\exp(\beta_2)$ is the odds ratio
comparing two subjects who differ in age at admission by one year,
**adjusted for admission type**.
What is the interpretation of $\beta_1$?
The log odds ratio (emergency vs. elective) for two subjects of a given age,
(i.e., adjusted for age)
and $e^{\beta_1}$ is the odds ratio.
### Hospital admissions example in R
First few observations:
```{r eval=TRUE}
icuData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/icu.csv")
glimpse(icuData)
head(icuData)
```
The 2x2 table of icu survival status (`status`) vs. predictor (hospital admission type, `type`)
```{r}
addmargins(tally(status ~ type, data=icuData))
```
#### Logistic regression model fit
This is a **generalized linear model** (glm).
The "link function" is the logit.
The errors have a Bernoulli = binomial($n=1$) distribution.
R syntax:
`glmfit <- glm(y ~ x, data = mydata, family = "binomial")`
`tidy(glmfit)`
```{r}
glmfit <- glm(status ~ type, data = icuData, family = "binomial")
tidy(glmfit)
```
Transform to the odds ratio scale: $e^{\b1}$
```{r}
b1 <- tidy(glmfit)$estimate[2]
exp(b1)
```
This should match the odds ratio right from the 2x2 table.
```{r}
theTable <- tally(status ~ type, data = icuData)
oddsRatio(theTable)
```
Hmmmm.... The odds ratio does not match.
After looking at the documentation for `oddsRatio` in the `mosaic` package,
I can see it calculates OR = (odds row 1) / (odds row 2).
That's odds of elective admission among those who survied
divided by odds of elective admission among those who died.
Instead, we wanted odds of death among those with emergency admission
divided by odds of death among those with elective admission.
A little care shows that what we want is the reciprocal of the OR that R provided.
```{r}
theTable <- tally(status ~ type, data = icuData)
1 / oddsRatio(theTable)
```
Now, the OR from the table matches the OR from the model, as expected.
Another way to do this is to swap the order of the response and predictor variables and get the "successes" (deaths) into the firs column as suggested in the documentation.
```{r}
theTable <- tally(type ~ -status, data = icuData)
theTable
oddsRatio(theTable)
summary(oddsRatio(theTable))
```
#### Example continued: Add another predictor
```{r}
glmfit2 <- glm(status ~ age + type, data = icuData, family = "binomial")
tidy(glmfit2)
```
The coefficients on the odds ratio scale:
```{r}
coef(glmfit2)
exp( coef(glmfit2) )
```
Note that admission `type` effect is even larger taking age into account (OR = 11.6). And age is a potential **confounder** here.
Age can also **modify** the effect of admission type on risk of death.
Older individuals are more likely to die under either admission type (since they are older),
whereas among younger individuals,
emergency admissions may be more strongly associated with death than elective admissions.
At a given age, emergency admissions have much higher (11.6-fold) greater risk of death.
\medskip
The age effect is to increase odds of death by 1.035, or about 3.5\%, per year of age increase.
**SKIP OVER TO SLIDES PAGE 40**