---
title: "STAT 224 Chapter 11"
output:
  html_document:
    theme: cerulean
    css: styles.css
    code_folding: show
    df_print: kable
    number_sections: yes
    self_contained: yes
    toc: FALSE
    toc_depth: 4
    toc_float: FALSE
params:
  dotabs: TRUE
---
```{r test-main, child = 'preamble.Rmd'}
```
# Chapter 11: Model Selection `r if(params$dotabs) paste('{.tabset}')`
### Intro
#### Variable (model) selection
##### Thus far...
+ predictors identified in advance
+ most predictors had some value for constructing a linear model
##### Reality...
+ Many predictors
+ Several candidate models
+ all may pass the usual diagnostics and tests
+ How do we pick the best model?
##### What is variable (model) selection?
Variable selection is the process of choosing a "best" subset of all available predictors.
Well, there is no single "best" subset.
We do want a model we can interpret or justify with respect to the questions of interest.
#### Model comparison statistics
Nested models
+ F-test
Any two models with the same response
+ MSE (Mean Squared Error)
+ AIC (Akaike Information Criterion)
+ BIC (Bayesian Information Criterion)
+ Mallows's $C_p$ (has fallen out of favor)
Any two models with the same response (but possibly differently transformed)
+ adjusted $R^2$
##### Information criterion
**CAUTION:**
For AIC, BIC, and $C_p$,
$p$ = number of parameters (including the constant/intercept).
This is different from our usual use of the letter $p$ (the number of predictors).
Both Akaike and Bayesian Information Criteria
reward small variance ($SSE_p / n$ small) and penalize larger models ($p$ large).
$$AIC = n \log_e(SSE_p / n) + 2p$$
$$BIC = n \log_e(SSE_p / n) + p\log_e(n)$$
+ Smaller is better for both criteria
+ Models with AIC difference $\le 2$ should be treated as equally adequate
+ Similarly, models with BIC difference $\le 2$ should be treated as equally adequate
+ BIC penalty for larger models is more severe
+ $p\log_e(n) > 2p$ (whenever $n \ge 8$)
+ Controls tendency of "overfitting" from AIC
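As a quick illustration of the formulas above, both criteria can be computed directly from a fitted model. This is a sketch using the built-in `mtcars` data (not the textbook data); note that R's own `AIC()` and `BIC()` functions use the full log-likelihood, which differs from these formulas by a constant, so either version should only be used to compare models fit to the same response.
```{r}
# Compute AIC and BIC "by hand" from the SSE-based formulas above
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
p   <- length(coef(fit))   # number of parameters, including the intercept
SSE <- sum(resid(fit)^2)
c(AIC = n * log(SSE / n) + 2 * p,
  BIC = n * log(SSE / n) + p * log(n))
```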
#### Which models to compare?
If there are $p$ predictor variables, then there are $2^p$ possible models.
If $p=10$, then more than 1,000 candidate models to choose from!
...and that's not even including the possibility of interaction effects
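A quick sanity check of that count in R:
```{r}
# Number of possible predictor subsets (ignoring interactions) for p = 1, ..., 10
setNames(2^(1:10), paste0("p=", 1:10))
```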
So, how to choose models in an intelligent and efficient way?
### Consequences
#### Consequences of excluding necessary variables
Consider a world where there is a "correct" model with $q$ predictors.
$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_q x_{iq} + \epsilon_i$$
with least squares estimate
$$\yhat_i^*=\b0^*+\b1^* x_{i1} + ... + \bq^* x_{iq}$$
Let $p < q$ and consider the model
$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i$$
which excludes $\beta_{p+1}, \beta_{p+2}, \ldots, \beta_q$ (all non-zero coefficients).
This model is estimated by
$$\yhat_i=\b0+\b1 x_{i1} + ... + \bp x_{ip}$$
###### What do we gain?
+ Decreased variance of coefficients and predictions
+ $\var(\bj) \le \var(\bj^*)$
+ $\var(\yhat_i) \le \var(\yhat_i^*)$
###### What do we lose?
+ Bias of coefficients and predictions
+ tend to over-estimate or tend to under-estimate on average
+ coefficient bias = $E(\bj) - \beta_j$
+ prediction bias = $E(\yhat_i) - \mu_i$
+ ...and we don't know the direction or the size of the bias
+ However, the bias can be considered negligible if $|\bj| < se(\bj)$.
###### None of these consequences occur if the predictors are uncorrelated
So, check for collinearity first!
For example, check variance inflation factors (VIF)
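To make the omitted-variable bias concrete, here is a toy simulation (my own sketch, not from the textbook) in which $x_2$ belongs in the model but is omitted, and is correlated with $x_1$:
```{r}
set.seed(224)
b1 <- replicate(2000, {
  x1 <- rnorm(50)
  x2 <- 0.7 * x1 + rnorm(50)              # x2 correlated with x1
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(50)   # "correct" model uses both
  coef(lm(y ~ x1))[["x1"]]                # fit the under-specified model
})
mean(b1)   # about 4.1 on average, well above the true beta_1 = 2
```
Here the omitted $x_2$ has a positive coefficient and is positively correlated with $x_1$, so $\b1$ is biased upward; with a different sign structure it could just as easily be biased downward.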
#### Consequences of including unnecessary variables
Consider a world where there is a "correct" model with $p$ predictors.
$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i$$
with least squares estimate
$$\yhat_i=\b0+\b1 x_{i1} + ... + \bp x_{ip}$$
Let $q > p$ and consider the model
$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_q x_{iq} + \epsilon_i$$
which includes $\beta_{p+1}, \beta_{p+2}, \ldots, \beta_q$ (but all these coefficients = 0).
This model is estimated by
$$\yhat_i^*=\b0^*+\b1^*x_{i1} + ... + \bq^*x_{iq}$$
###### What do we gain?
+ Nothing, really
###### What do we lose?
+ Increased variance of coefficients and predictions
+ Decreased degrees of freedom (less information for estimating variance)
+ $\var(\bj) \le \var(\bj^*)$
+ $\var(\yhat_i) \le \var(\yhat_i^*)$
###### None of these consequences occur if the predictors are uncorrelated
So, check for collinearity first!
For example, check variance inflation factors (VIF)
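A companion toy simulation (again my own sketch, not from the textbook): now $x_2$ is unnecessary (its true coefficient is 0) but correlated with $x_1$, and including it inflates the variance of $\b1$:
```{r}
set.seed(11)
sims <- replicate(2000, {
  x1 <- rnorm(40)
  x2 <- 0.8 * x1 + rnorm(40)       # unnecessary, but correlated with x1
  y  <- 1 + 2 * x1 + rnorm(40)     # "correct" model uses x1 only
  c(small = coef(lm(y ~ x1))[["x1"]],
    big   = coef(lm(y ~ x1 + x2))[["x1"]])
})
apply(sims, 1, sd)   # b1 varies more in the over-specified ("big") model
```
The ratio of the two standard deviations is approximately $\sqrt{VIF_1}$ from the larger model, which ties this back to the collinearity checks above.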
### Modeling Purpose
#### What is your modeling purpose?
An overarching consideration in model selection
relates to the purpose and intended use of the model.
+ **Descriptive** (exploratory, understanding relationships)
+ Try to account for as much response variability as possible, but keep simple.
+ Search for fundamental relationships
+ Usually start with a few essential variables
+ Requires consideration of why variables should be in the model
+ Then choose variables (and combinations of variables) to build forward
+ Useful with big data
+ The descriptive/exploratory model might not be the final model
+ **Predictive** (getting good predictions)
+ Minimize the MSE of prediction: $MSE(\yhat_i) = \var(\yhat_i) + \text{bias}^2$
+ Want realistic predictions and close to the data
+ Less consideration of which variables are required or how they enter the model
+ "black box" modeling to some extent
+ **Explanatory** (describe the process, interpretability)
+ Lots of thinking required about which variables are important
+ Parsimony important (smallest model that is "complete")
+ Thoroughly address confounding and possible effect modification
+ Do not omit important confounders
+ Explore the need for interaction terms
###### The three goals are not mutually exclusive
Ideally, we would like a model with a little of all three properties.
### Example
#### Supervisor performance example
Let's look at an example: the supervisor performance data from textbook Table 3.3.
$Y=$ Overall rating of job being done by supervisor
$x_1=$ Handles employee complaints
$x_2=$ Does not allow special privileges
$x_3=$ Opportunity to learn new things
$x_4=$ Raises based on performance
$x_5=$ Too critical of poor performance
$x_6=$ Rate of advancing to better jobs
```{r}
superData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P060.txt")
glimpse(superData)
head(superData)
```
##### Is multicollinearity a potential problem?
###### High correlations between predictors?
```{r}
ggpairs(superData)
```
###### Variance inflation factors larger than 10?
```{r message=FALSE, warning=FALSE}
library(car)  # provides vif()
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
```
```{r}
VIFvalues <- as.data.frame(vif(lmfit))
colnames(VIFvalues) <- "VIF"
VIFvalues
```
No strong evidence of multicollinearity: none of the VIFs approach 10.
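As an aside, we can verify the definition $VIF_j = 1 / (1 - R_j^2)$ by hand, where $R_j^2$ comes from regressing predictor $j$ on the other predictors (shown here for $x_1$):
```{r}
# R^2 from regressing x1 on the remaining predictors
r2.x1 <- summary(lm(x1 ~ x2 + x3 + x4 + x5 + x6, data = superData))$r.squared
1 / (1 - r2.x1)   # should match the VIF for x1 in the table above
```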
#### Three models to compare
###### The full model
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \epsilon$$
```{r}
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
tidy(lmfit, conf.int=TRUE)
glance(summary(lmfit))
```
###### A model based on advancement and raises (positive stuff)
$$Y = \beta_0 + \beta_1 x_1 + \beta_2x_3 + \beta_3 x_4 + \beta_4 x_6 + \epsilon$$
$Y=$ Overall rating of job being done by supervisor
$x_1=$ Handles employee complaints
$x_3=$ Opportunity to learn new things
$x_4=$ Raises based on performance
$x_6=$ Rate of advancing to better jobs
```{r}
lmfit.positive <- lm(y ~ x1 + x3 + x4 + x6, data = superData)
tidy(lmfit.positive, conf.int=TRUE)
glance(summary(lmfit.positive))
```
###### A model focusing on the negative
$$Y = \beta_0 + \beta_1 x_2 + \beta_2 x_5 + \epsilon$$
$Y=$ Overall rating of job being done by supervisor
$x_2=$ Does not allow special privileges
$x_5=$ Too critical of poor performance
```{r}
lmfit.negative <- lm(y ~ x2 + x5, data = superData)
tidy(lmfit.negative, conf.int=TRUE)
glance(summary(lmfit.negative))
```
#### Comparing non-nested models
##### Criteria
+ Adjusted $R^2\quad$ (larger is better)
+ MSE = $\widehat{\sigma}^2\quad$ (smaller is better)
+ AIC = $n\log_e(\SSE_p/n) + 2p\quad$ (smaller is better)
+ BIC = $n\log_e(\SSE_p/n) + p\log_e(n)\quad$ (smaller is better)
###### The three models we are comparing:
$Y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3
+ \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \epsilon$
(the full model)
$Y = \beta_0 + \beta_1x_1 + \beta_2x_3 + \beta_3 x_4 + \beta_4 x_6 + \epsilon$
(positive stuff)
$Y = \beta_0 + \beta_1 x_2 + \beta_2 x_5 + \epsilon$
(negative stuff)
The full model:
```{r}
# adjusted R^2, sigma (note: MSE = sigma^2), AIC, BIC
glance(lmfit)[c("adj.r.squared", "sigma", "AIC", "BIC")]
```
A model based on advancement and raises (positive stuff)
```{r}
glance(lmfit.positive)[c("adj.r.squared", "sigma", "AIC", "BIC")]
```
A model focusing on the negative
```{r}
glance(lmfit.negative)[c("adj.r.squared", "sigma", "AIC", "BIC")]
```
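To see all three models side by side, we can also stack the `glance()` output into a single table (a sketch assuming `dplyr` is available, as in the preamble):
```{r}
library(dplyr)
# One row per model, keeping only the comparison statistics
bind_rows(full     = glance(lmfit),
          positive = glance(lmfit.positive),
          negative = glance(lmfit.negative),
          .id = "model") %>%
  select(model, adj.r.squared, sigma, AIC, BIC)
```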
The full model and the "positive" model are fairly close.
Both are clearly better than the "negative" model.
Perhaps choose the "positive" model for parsimony.
$Y=$ Overall rating of job being done by supervisor
$x_1=$ Handles employee complaints
$x_3=$ Opportunity to learn new things
$x_4=$ Raises based on performance
$x_6=$ Rate of advancing to better jobs
...and how shall we interpret this result with regard to rating supervisors?
### Automation
#### Automated variable selection strategies
Thinking through the problem carefully is a strategy that should always be employed!
But sometimes we may want to use one of these objective approaches:
+ **Forward Selection**
+ **Backward Elimination**
+ **Stepwise Selection** (a little of both)
##### Forward Selection
(1) Begin with an "empty" model (no predictors)
(2) Add the predictor with highest correlation with response
+ or largest t-statistic (smallest p-value)
+ $H_0: \text{correlation}_j = 0$ is equivalent to $H_0: \beta_j=0$
+ decide here on a significance level required for entry at each step
(3) Continue on to add next predictor with highest partial correlation with response
+ that is, after adjusting for the predictor added in step (2)
+ or largest t-statistic meeting the significance level criterion
+ $H_0: \text{partial correlation}_j = 0$ is equivalent to $H_0: \beta_j=0$
(4) With new model from (3), go back to step (3) again
+ add another predictor if one meets the criterion (a p-value threshold)
+ repeat (3) until no other predictors meet the criterion
##### Backward Elimination
(1) Begin with the "full" model (all predictors)
(2) Remove the predictor with the smallest t-statistic (largest p-value)
(3) Continue on to remove the next predictor with smallest t-statistic
+ that is, after adjusting for the remaining predictors after step (2)
(4) With the new model from (3), go back to step (3) again
+ remove another predictor if it meets the criterion (a p-value threshold)
+ repeat (3) until no other predictors meet the criterion
##### Stepwise Selection
Stepwise selection combines forward and backward steps,
usually beginning from the empty model (forward stepwise).
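Base R's `step()` function implements AIC-based stepwise selection without any extra packages. A minimal sketch of forward stepwise selection, reusing `superData` and the full-model formula from the earlier example:
```{r}
# Start from the empty model; direction = "both" allows a removal
# step after each addition (forward stepwise)
null.fit <- lm(y ~ 1, data = superData)
step(null.fit, scope = y ~ x1 + x2 + x3 + x4 + x5 + x6,
     direction = "both", trace = 0)
```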
### Example V2
I found an R package (`olsrr`) that does variable selection much as described in the textbook.
I'm sure there are other packages.
`install.packages("olsrr")`
`library(olsrr)`
#### Forward selection
`lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)`
###### Entry criterion: p-value of t-test for coefficient
`ols_step_forward_p(lmfit, penter = 0.99)`
Using `penter = 0.99`, we add **all** predictors one by one;
whichever variable has the lowest p-value enters next.
```{r}
library(olsrr)
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
forward.sequence <- ols_step_forward_p(lmfit, penter = 0.99)
forward.sequence
```
The `olsrr` package includes a `plot` method for the output
from `ols_step_forward_p`
`forward.sequence <- ols_step_forward_p(lmfit, penter = 0.99)`
`plot(forward.sequence)`
You can see how the various model summaries change as variables are added one-by-one.
SBC (Schwarz Bayesian Criterion) is just another name for BIC.
SBIC (Sawa's Bayesian Information Criterion) is a slight modification of BIC.
C(p) is Mallows's $C_p$, discussed in the textbook but not in lecture.
```{r}
plot(forward.sequence)
```
We could instead use a rule that adds a variable only if its p-value is below a specified threshold.
Set the criterion to, say, p-value $= 0.25$:
```{r}
forward.output <- ols_step_forward_p(lmfit, penter = 0.25)
forward.output
```
$Y=$ Overall rating of job being done by supervisor
$x_1=$ Handles employee complaints
$x_3=$ Opportunity to learn new things
$x_6=$ Rate of advancing to better jobs
###### Entry criterion: AIC no longer decreases
Alternatively, we can use a criterion like AIC to decide when to stop.
We will stop when AIC no longer decreases.
`ols_step_forward_aic(lmfit)`
```{r}
ols_step_forward_aic(lmfit)
```
#### Backward elimination
`lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)`
###### Exit criterion: p-value of t-test for coefficient
Start with all variables in the model and remove according to the p-value.
Remove predictor with largest p-value.
`ols_step_backward_p(lmfit, prem = 0.33)`
If we set the p-value for removal = 0.33, that corresponds to a t-statistic $\approx 1$.
```{r}
# two-sided p-value for t = 1 with n - p - 1 = 30 - 6 - 1 = 23 df
2 * (1 - pt(1, df = 30 - 6 - 1))
ols_step_backward_p(lmfit, prem = 0.33)
```
###### Exit criterion: AIC no longer decreases
Alternatively, we can use a criterion like AIC to decide when to stop.
We will stop when AIC no longer decreases.
`ols_step_backward_aic(lmfit)`
```{r}
ols_step_backward_aic(lmfit)
```
#### Stepwise (forward) selection
###### Criterion: Use the t-statistic p-value for both entry and exit
As variables come into the model,
you can also check to see if any could later be removed.
Both criteria can be the p-values for coefficient t-tests.
`ols_step_both_p(lmfit, pent = 0.25, prem = 0.33)`
```{r}
ols_step_both_p(lmfit, pent = 0.25, prem = 0.33)
```
###### Criterion: Or use AIC for both entry and exit
`ols_step_both_aic(lmfit)`
```{r}
ols_step_both_aic(lmfit)
```
#### Stepwise (backward) selection
Start with the full model (all predictors).
Remove according to p-value (large)
...and possibly later re-enter a variable according to p-value (small).
In the `olsrr` package, the function `ols_step_both_p` does not appear
to support backward stepwise selection -- only forward (demonstrated above).
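As a workaround, base R's `step()` can start from the full model with `direction = "both"`, which amounts to backward stepwise selection with AIC as the criterion (a sketch, reusing the full-model `lmfit` from above):
```{r}
# Backward stepwise with AIC: start from the full model and allow
# a removed variable to re-enter at any later step
step(lmfit, direction = "both", trace = 0)
```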