--- title: "STAT 224 Chapter 11" output: html_document: theme: cerulean css: styles.css code_folding: show df_print: kable number_sections: yes self_contained: yes toc: FALSE toc_depth: 4 toc_float: FALSE params: dotabs: TRUE --- ```{r test-main, child = 'preamble.Rmd'} ``` # Chapter 11: Model Selection `r if(params$dotabs) paste('{.tabset}')` ### Intro #### Variable (model) selection ##### Thus far... + predictors identified in advance + most predictors had some value for constructing a linear model ##### Reality... + Many predictors + Several candidate models + all may pass the usual diagnostics and tests + How do we pick the best model? ##### What is variable (model) selection Variable selection is the process of choosing a "best" subset of all available predictors. Well, there is no single "best" subset. We do want a model we can interpret or justify with respect to the questions of interest. #### Model comparison statistics Nested models + F-test Any two models with the same response + MSE (Mean Squared Error) + AIC (Akaike Information Criterion) + BIC (Bayesian Information Criterion) + Mallow's $C_p$ (has fallen out of favor) Any two models with the same response (but possibly differently transformed) + adjusted $R^2$ ##### Information criterion **CAUTION:** For AIC, BIC, and $C_p$, $p$ = number of parameters (inluding the constant/intercept). This is different than our usual of the letter $p$ (number of predictors). Both Akaike and Bayesian Information Criteria reward small variance ($SSE_p / n$ small) and penalize larger models ($p$ large). $$AIC = n \log_e(SSE_p / n) + 2p$$ $$BIC = n \log_e(SSE_p / n) + p\log_e(n)$$ + Smaller is better for both criterion + Models with AIC difference $\le 2$ should be treated as equally adequate + Similarly, models with BIC difference $\le 2$ should be treated as equally adequate + BIC penalty for larger models is more severe + $p\log_e(n) > 2p$ (whenever $n > 8$) + Controls tendency of "overfitting" from AIC #### Which models to compare? If there are $p$ predictor variables, then $2^p$ possible models. If $p=10$, then more than 1,000 candidate models to choose from! ...and that's not even including the possibility of interaction effects So, how to choose models in an intelligent and efficient way? ### Consequences #### Consequences of excluding necessary variables Consider a world where there is a "correct" model with $q$ predictors. $$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_q x_{iq} + \epsilon_i$$ with least squares estimate $$\yhat_i^*=\b0^*+\b1^* x_{i1} + ... + \bq^* x_{iq}$$ Let $p < q$ and consider the model $$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i$$ which excludes $\beta_{p+1}, \beta_{p+2}, \ldots, \beta_q$ (all non-zero coefficients). This model is estimated by $$\yhat_i=\b0+\b1 x_{i1} + ... + \bp x_{ip}$$ ###### What do we gain? + Decreased variance of coefficients and predictions + $\var(\bj) \le \var(\bj^*)$ + $\var(\yhat_i) \le \var(\yhat_i^*)$ ###### What do we lose? + Bias of coefficients and predictions + tend to over-estimate or tend to under-estimate on average + coefficient bias = $E(\bj) - \beta_j$ + prediction bias = $E(\yhat_i) - \mu_i$ + ...and we don't know which direction or how large the bias + However, the bias can be considered negligible if $\bj < se(\bj)$. ###### None of these consequence occur if predictors uncorrelated So, check for collinearity first! 
#### Consequences of including unnecessary variables

Consider a world where there is a "correct" model with $p$ predictors.

$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i$$

with least squares estimate

$$\yhat_i=\b0+\b1 x_{i1} + ... + \bp x_{ip}$$

Let $q > p$ and consider the model

$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_q x_{iq} + \epsilon_i$$

which includes $\beta_{p+1}, \beta_{p+2}, \ldots, \beta_q$ (but all of these coefficients $= 0$).
This model is estimated by

$$\yhat_i^*=\b0^*+\b1^*x_{i1} + ... + \bq^*x_{iq}$$

###### What do we gain?

+ Nothing, really

###### What do we lose?

+ Increased variance of coefficients and predictions
+ Decreased degrees of freedom (amount of information for estimating variance)
    + $\var(\bj) \le \var(\bj^*)$
    + $\var(\yhat_i) \le \var(\yhat_i^*)$

###### None of these consequences occur if the predictors are uncorrelated

So, check for collinearity first!
For example, check variance inflation factors (VIF).

### Modeling Purpose

#### What is your modeling purpose?

An overarching consideration in model selection relates to the purpose and intended use of the model.

+ **Descriptive** (exploratory, understanding relationships)
    + Try to account for as much response variability as possible, but keep it simple.
    + Search for fundamental relationships
    + Usually start with a few essential variables
    + Requires consideration of why variables should be in the model
    + Then choose variables (and combinations of variables) to build forward
    + Useful with big data
    + The descriptive/exploratory model might not be the final model
+ **Predictive** (getting good predictions)
    + Minimize the MSE of prediction: $MSE(\yhat_i) = \var(\yhat_i) + \text{bias}^2$
    + Want predictions that are realistic and close to the data
    + Less consideration of which variables are required or included
    + "black box" modeling to some extent
+ **Explanatory** (describe the process, interpretability)
    + Lots of thinking required about which variables are important
    + Parsimony is important (smallest model that is "complete")
    + Thoroughly address confounding and possible effect modification
    + Do not omit important confounders
    + Explore the need for interaction terms

###### The three goals are not mutually exclusive

Ideally, we would like a little bit of all three properties.

### Example

#### Supervisor performance example

Let's look at an example: the supervisor performance data from textbook Table 3.3.

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_2=$ Does not allow special privileges

$x_3=$ Opportunity to learn new things

$x_4=$ Raises based on performance

$x_5=$ Too critical of poor performance

$x_6=$ Rate of advancing to better jobs

```{r}
superData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P060.txt")
glimpse(superData)
head(superData)
```

##### Is multicollinearity a potential problem?

###### High correlations between predictors?

```{r}
ggpairs(superData)
```

##### Variance inflation factors larger than 10?
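Recall that $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing predictor $j$ on all the other predictors. Before using `car::vif()` below, here is a minimal sketch computing one VIF by hand (the choice of `x1` is arbitrary):

```{r}
# VIF for x1 "by hand": regress x1 on the other five predictors,
# then VIF_1 = 1 / (1 - R^2) from that regression
r2.x1 <- summary(lm(x1 ~ x2 + x3 + x4 + x5 + x6, data = superData))$r.squared
1 / (1 - r2.x1)  # compare to the car::vif() output below
```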
```{r message=FALSE, warning=FALSE}
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
library(car)
```

```{r}
VIFvalues <- as.data.frame(vif(lmfit))
colnames(VIFvalues) <- "VIF"
VIFvalues
```

No strong evidence of multicollinearity.

#### Three models to compare

###### The full model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \epsilon$$

```{r}
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
tidy(lmfit, conf.int=TRUE)
glance(summary(lmfit))
```

###### A model based on advancement and raises (positive stuff)

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_4 + \beta_4 x_6 + \epsilon$$

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_3=$ Opportunity to learn new things

$x_4=$ Raises based on performance

$x_6=$ Rate of advancing to better jobs

```{r}
lmfit.positive <- lm(y ~ x1 + x3 + x4 + x6, data = superData)
tidy(lmfit.positive, conf.int=TRUE)
glance(summary(lmfit.positive))
```

###### A model focusing on the negative

$$Y = \beta_0 + \beta_1 x_2 + \beta_2 x_5 + \epsilon$$

$Y=$ Overall rating of job being done by supervisor

$x_2=$ Does not allow special privileges

$x_5=$ Too critical of poor performance

```{r}
lmfit.negative <- lm(y ~ x2 + x5, data = superData)
tidy(lmfit.negative, conf.int=TRUE)
glance(summary(lmfit.negative))
```

#### Comparing non-nested models

##### Criteria

+ Adjusted $R^2\quad$ (larger is better)
+ MSE = $\widehat{\sigma}^2\quad$ (smaller is better)
+ AIC = $n\log_e(\SSE_p/n) + 2p\quad$ (smaller is better)
+ BIC = $n\log_e(\SSE_p/n) + p\log_e(n)\quad$ (smaller is better)

###### The three models we are comparing:

$Y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \epsilon$ (the full model)

$Y = \beta_0 + \beta_1x_1 + \beta_2x_3 + \beta_3 x_4 + \beta_4 x_6 + \epsilon$ (positive stuff)

$Y = \beta_0 + \beta_1 x_2 + \beta_2 x_5 + \epsilon$ (negative stuff)

The full model:

```{r}
glance(lmfit)[c(2, 3, 8, 9)]
```

A model based on advancement and raises (positive stuff):

```{r}
glance(lmfit.positive)[c(2, 3, 8, 9)]
```

A model focusing on the negative:

```{r}
glance(lmfit.negative)[c(2, 3, 8, 9)]
```

The full model and the "positive" model are fairly close.
Both are clearly better than the "negative" model.
Perhaps choose the "positive" model for parsimony.

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_3=$ Opportunity to learn new things

$x_4=$ Raises based on performance

$x_6=$ Rate of advancing to better jobs

...and how shall we interpret this result with regard to rating supervisors?
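Before moving on, a quick sanity check on the criteria above: here is a sketch computing the textbook AIC and BIC directly from $\SSE_p$ for the "positive" model. Note that the values reported by `glance()` come from R's `AIC()` and `BIC()`, which include extra constants from the full normal log-likelihood, so the absolute numbers differ from the textbook formulas; differences *between* models, which are what we compare, agree.

```{r}
# Textbook AIC/BIC computed directly from SSE for the "positive" model;
# here p counts all estimated coefficients, including the intercept
n   <- nrow(superData)               # 30 supervisors
p   <- length(coef(lmfit.positive))  # intercept + 4 slopes = 5
SSE <- sum(resid(lmfit.positive)^2)
c(AIC = n * log(SSE / n) + 2 * p,
  BIC = n * log(SSE / n) + p * log(n))
```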
### Automation

#### Automated variable selection strategies

Thinking through the problem carefully is always a strategy that should be employed!!!

But sometimes we may want to use one of these objective approaches:

+ **Forward Selection**
+ **Backward Elimination**
+ **Stepwise Selection** (a little of both)

##### Forward Selection

(1) Begin with an "empty" model (no predictors)
(2) Add the predictor with the highest correlation with the response
    + or largest t-statistic (smallest p-value)
    + $H_0: \text{correlation}_j = 0$ is equivalent to $H_0: \beta_j=0$
    + decide here on a significance level required for entry at each step
(3) Continue on to add the next predictor with the highest partial correlation with the response
    + that is, after adjusting for the predictor added in step (2)
    + or largest t-statistic meeting the significance level criterion
    + $H_0: \text{partial correlation}_j = 0$ is equivalent to $H_0: \beta_j=0$
(4) With the new model from (3), go back to step (3) again
    + add another predictor if one meets the criterion (a p-value threshold)
    + repeat (3) until no other predictors meet the criterion

##### Backward Elimination

(1) Begin with the "full" model (all predictors)
(2) Remove the predictor with the smallest t-statistic (largest p-value)
(3) Continue on to remove the next predictor with the smallest t-statistic
    + that is, after adjusting for the remaining predictors after step (2)
(4) With the new model from (3), go back to step (3) again
    + remove another predictor if it meets the criterion (a p-value threshold)
    + repeat (3) until no other predictors meet the criterion

##### Stepwise Selection

Combines forward and backward steps, usually beginning from the empty model (forward stepwise).

### Example V2

I found an R package (`olsrr`) that will do variable selection much like described in the textbook.
I'm sure there are other packages.

`install.packages("olsrr")`

`library(olsrr)`

#### Forward selection

`lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)`

###### Entry criterion: p-value of t-test for coefficient

`ols_step_forward_p(lmfit, penter = 0.99)`

Using `penter = 0.99` we are adding in **all** predictors, one by one; whichever variable has the lowest p-value comes in next.

```{r}
library(olsrr)
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
forward.sequence <- ols_step_forward_p(lmfit, penter = 0.99)
forward.sequence
```

The `olsrr` package includes a `plot` method for the output from `ols_step_forward_p`:

`forward.sequence <- ols_step_forward_p(lmfit, penter = 0.99)`

`plot(forward.sequence)`

You can see how the various model summaries change as variables are added one by one.

SBC is BIC (SBC stands for Schwarz Bayesian Criterion, another name for the same quantity).

SBIC is a slight modification of BIC.

C(p) is Mallows's $C_p$, discussed in the textbook but not in lecture.

```{r}
plot(forward.sequence)
```

Could use a rule to only add variables if the p-value is less than a specified value.
Then set the criterion to, say, p-value $= 0.25$:

```{r}
forward.output <- ols_step_forward_p(lmfit, penter = 0.25)
forward.output
```

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_3=$ Opportunity to learn new things

$x_6=$ Rate of advancing to better jobs

###### Entry criterion: AIC no longer decreases

Alternatively, we can use a criterion like AIC to decide when to stop.
We will stop when AIC no longer decreases.

`ols_step_forward_aic(lmfit)`

```{r}
ols_step_forward_aic(lmfit)
```

#### Backward elimination

`lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)`

###### Exit criterion: p-value of t-test for coefficient

Start with all variables in the model and remove according to the p-value.
Remove the predictor with the largest p-value.
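For intuition, one elimination step can be sketched by hand with base R's `drop1()` before turning to `olsrr`. Its F-test for deleting a single term is the square of the usual coefficient t-test, so the p-values agree.

```{r}
# One backward step "by hand": drop1() tests each single-term
# deletion from the full model
drop1(lmfit, test = "F")
# The term with the largest p-value would be dropped, the smaller
# model refit (e.g., with update(lmfit, . ~ . - x2)), and drop1()
# applied again until every remaining p-value is below the threshold
```

The `olsrr` package automates this loop: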
`ols_step_backward_p(lmfit, prem = 0.33)`

If we set the p-value for removal to 0.33, that corresponds to a t-statistic $\approx 1$.

```{r}
2 * (1 - pt(1, df = 30 - 6 - 1))
ols_step_backward_p(lmfit, prem = 0.33)
```

###### Exit criterion: AIC no longer decreases

Alternatively, we can use a criterion like AIC to decide when to stop.
We will stop when AIC no longer decreases.

`ols_step_backward_aic(lmfit)`

```{r}
ols_step_backward_aic(lmfit)
```

#### Stepwise (forward) selection

###### Criterion: use the t-statistic p-value for both entry and exit

As variables come into the model, you can also check to see whether any could later be removed.
Both criteria can be the p-values for coefficient t-tests.

`ols_step_both_p(lmfit, pent = 0.25, prem = 0.33)`

```{r}
ols_step_both_p(lmfit, pent = 0.25, prem = 0.33)
```

###### Criterion: or use AIC for both entry and exit

`ols_step_both_aic(lmfit)`

```{r}
ols_step_both_aic(lmfit)
```

#### Stepwise (backward) selection

Start with the full model (all predictors).
Remove according to p-value (large)
...and possibly later enter a variable back in according to p-value (small).

In the `olsrr` package, the function `ols_step_both_p` does not appear to support backward stepwise selection -- only forward (demonstrated above).
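If backward stepwise is wanted anyway, base R's `step()` offers it, though with AIC rather than p-values as the criterion: starting the search from the full model with `direction = "both"` allows terms to be dropped and later re-enter. A minimal sketch:

```{r}
# Backward stepwise via AIC with base R's step():
# begin at the full model; terms may be removed and re-entered
step(lmfit,
     scope = list(lower = y ~ 1,
                  upper = y ~ x1 + x2 + x3 + x4 + x5 + x6),
     direction = "both", trace = FALSE)
```

Because `step()` starts from whatever model it is given, passing the full `lmfit` here makes the search backward stepwise; passing the empty model `lm(y ~ 1, data = superData)` with the same `scope` would give forward stepwise instead.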