--- title: "STAT 224 Chapter 11" output: html_document: theme: cerulean css: styles.css code_folding: show df_print: kable number_sections: yes self_contained: yes toc: FALSE toc_depth: 4 toc_float: FALSE params: dotabs: TRUE --- ```{r test-main, child = 'preamble.Rmd'} ``` # Chapter 11: Model Selection `r if(params$dotabs) paste('{.tabset}')` ### Intro #### Variable (model) selection ##### Thus far... + predictors identified in advance + most predictors had some value for constructing a linear model ##### Reality... + Many predictors + Several candidate models + all may pass the usual diagnostics and tests + How do we pick the best model? ##### What is variable (model) selection Variable selection is the process of choosing a "best" subset of all available predictors. Well, there is no single "best" subset. We do want a model we can interpret or justify with respect to the questions of interest. #### Model comparison statistics Nested models + F-test Any two models with the same response + MSE (Mean Squared Error) + AIC (Akaike Information Criterion) + BIC (Bayesian Information Criterion) + Mallow's $C_p$ (has fallen out of favor) Any two models with the same response (but possibly differently transformed) + adjusted $R^2$ ##### Information criterion **CAUTION:** For AIC, BIC, and $C_p$, $p$ = number of parameters (inluding the constant/intercept). This is different than our usual of the letter $p$ (number of predictors). Both Akaike and Bayesian Information Criteria reward small variance ($SSE_p / n$ small) and penalize larger models ($p$ large). $$AIC = n \log_e(SSE_p / n) + 2p$$ $$BIC = n \log_e(SSE_p / n) + p\log_e(n)$$ + Smaller is better for both criterion + Models with AIC difference $\le 2$ should be treated as equally adequate + Similarly, models with BIC difference $\le 2$ should be treated as equally adequate + BIC penalty for larger models is more severe + $p\log_e(n) > 2p$ (whenever $n > 8$) + Controls tendency of "overfitting" from AIC #### Which models to compare? If there are $p$ predictor variables, then $2^p$ possible models. If $p=10$, then more than 1,000 candidate models to choose from! ...and that's not even including the possibility of interaction effects So, how to choose models in an intelligent and efficient way? ### Consequences #### Consequences of excluding necessary variables Consider a world where there is a "correct" model with $q$ predictors. $$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_q x_{iq} + \epsilon_i$$ with least squares estimate $$\yhat_i^*=\b0^*+\b1^* x_{i1} + ... + \bq^* x_{iq}$$ Let $p < q$ and consider the model $$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i$$ which excludes $\beta_{p+1}, \beta_{p+2}, \ldots, \beta_q$ (all non-zero coefficients). This model is estimated by $$\yhat_i=\b0+\b1 x_{i1} + ... + \bp x_{ip}$$ ###### What do we gain? + Decreased variance of coefficients and predictions + $\var(\bj) \le \var(\bj^*)$ + $\var(\yhat_i) \le \var(\yhat_i^*)$ ###### What do we lose? + Bias of coefficients and predictions + tend to over-estimate or tend to under-estimate on average + coefficient bias = $E(\bj) - \beta_j$ + prediction bias = $E(\yhat_i) - \mu_i$ + ...and we don't know which direction or how large the bias + However, the bias can be considered negligible if $\bj < se(\bj)$. ###### None of these consequence occur if predictors uncorrelated So, check for collinearity first! 
#### Consequences of including unnecessary variables

Consider a world where there is a "correct" model with $p$ predictors.

$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_p x_{ip} + \epsilon_i$$

with least squares estimate

$$\yhat_i=\b0+\b1 x_{i1} + ... + \bp x_{ip}$$

Let $q > p$ and consider the model

$$y_i=\beta_0+\beta_1 x_{i1} + ... + \beta_q x_{iq} + \epsilon_i$$

which includes $\beta_{p+1}, \beta_{p+2}, \ldots, \beta_q$ (but all of these coefficients $= 0$).
This model is estimated by

$$\yhat_i^*=\b0^*+\b1^*x_{i1} + ... + \bq^*x_{iq}$$

###### What do we gain?

+ Nothing, really

###### What do we lose?

+ Increased variance of coefficients and predictions
+ Decreased degrees of freedom (amount of information for estimating variance)
    + $\var(\bj) \le \var(\bj^*)$
    + $\var(\yhat_i) \le \var(\yhat_i^*)$

###### None of these consequences occur if the predictors are uncorrelated

So, check for collinearity first!
For example, check variance inflation factors (VIF).

### Modeling Purpose

#### What is your modeling purpose?

An overarching consideration in model selection relates to the purpose and intended use of the model.

+ **Descriptive** (exploratory, understanding relationships)
    + Try to account for as much response variability as possible, but keep it simple.
    + Search for fundamental relationships
    + Usually start with a few essential variables
    + Requires consideration of why variables should be in the model
    + Then choose variables (and combinations of variables) to build forward
    + Useful with big data
    + The descriptive/exploratory model might not be the final model
+ **Predictive** (getting good predictions)
    + Minimize the MSE of prediction: $MSE(\yhat_i) = \var(\yhat_i) + \text{bias}^2$
    + Want predictions that are realistic and close to the data
    + Less consideration of which variables are required or included
    + "black box" modeling to some extent
+ **Explanatory** (describe the process, interpretability)
    + Lots of thinking required about which variables are important
    + Parsimony is important (smallest model that is "complete")
    + Thoroughly address confounding and possible effect modification
    + Do not omit important confounders
    + Explore the need for interaction terms

###### The three goals are not mutually exclusive

Ideally, we would like a little bit of all three properties.

### Example

#### Supervisor performance example

Let's look at an example: the supervisor performance data from textbook Table 3.3.

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_2=$ Does not allow special privileges

$x_3=$ Opportunity to learn new things

$x_4=$ Raises based on performance

$x_5=$ Too critical of poor performance

$x_6=$ Rate of advancing to better jobs

```{r}
superData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P060.txt")
glimpse(superData)
head(superData)
```

##### Is multicollinearity a potential problem?

###### High correlations between predictors?

```{r}
ggpairs(superData)
```

##### Variance inflation factors larger than 10?
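Recall that $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing predictor $j$ on all the other predictors. Before using `car::vif()` below, here is a minimal sketch computing one VIF by hand (the choice of `x1` is arbitrary):

```{r}
# VIF for x1 "by hand": regress x1 on the other five predictors,
# then VIF_1 = 1 / (1 - R^2) from that regression
r2.x1 <- summary(lm(x1 ~ x2 + x3 + x4 + x5 + x6, data = superData))$r.squared
1 / (1 - r2.x1)  # compare to the car::vif() output below
```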
```{r message=FALSE, warning=FALSE}
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
library(car)
```

```{r}
VIFvalues <- as.data.frame(vif(lmfit))
colnames(VIFvalues) <- "VIF"
VIFvalues
```

No strong evidence of multicollinearity.

#### Three models to compare

###### The full model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \epsilon$$

```{r}
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
tidy(lmfit, conf.int=TRUE)
glance(summary(lmfit))
```

###### A model based on advancement and raises (positive stuff)

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \beta_3 x_4 + \beta_4 x_6 + \epsilon$$

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_3=$ Opportunity to learn new things

$x_4=$ Raises based on performance

$x_6=$ Rate of advancing to better jobs

```{r}
lmfit.positive <- lm(y ~ x1 + x3 + x4 + x6, data = superData)
tidy(lmfit.positive, conf.int=TRUE)
glance(summary(lmfit.positive))
```

###### A model focusing on the negative

$$Y = \beta_0 + \beta_1 x_2 + \beta_2 x_5 + \epsilon$$

$Y=$ Overall rating of job being done by supervisor

$x_2=$ Does not allow special privileges

$x_5=$ Too critical of poor performance

```{r}
lmfit.negative <- lm(y ~ x2 + x5, data = superData)
tidy(lmfit.negative, conf.int=TRUE)
glance(summary(lmfit.negative))
```

#### Comparing non-nested models

##### Criteria

+ Adjusted $R^2\quad$ (larger is better)
+ MSE = $\widehat{\sigma}^2\quad$ (smaller is better)
+ AIC = $n\log_e(\SSE_p/n) + 2p\quad$ (smaller is better)
+ BIC = $n\log_e(\SSE_p/n) + p\log_e(n)\quad$ (smaller is better)

###### The three models we are comparing:

$Y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \epsilon$ (the full model)

$Y = \beta_0 + \beta_1x_1 + \beta_2x_3 + \beta_3 x_4 + \beta_4 x_6 + \epsilon$ (positive stuff)

$Y = \beta_0 + \beta_1 x_2 + \beta_2 x_5 + \epsilon$ (negative stuff)

The full model:

```{r}
glance(lmfit)[c(2, 3, 8, 9)]
```

A model based on advancement and raises (positive stuff):

```{r}
glance(lmfit.positive)[c(2, 3, 8, 9)]
```

A model focusing on the negative:

```{r}
glance(lmfit.negative)[c(2, 3, 8, 9)]
```

The full model and the "positive" model are fairly close.
Both are clearly better than the "negative" model.
Perhaps choose the "positive" model for parsimony.

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_3=$ Opportunity to learn new things

$x_4=$ Raises based on performance

$x_6=$ Rate of advancing to better jobs

...and how shall we interpret this result with regard to rating supervisors?
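Before moving on, a quick sanity check on the criteria above: here is a sketch computing the textbook AIC and BIC directly from $\SSE_p$ for the "positive" model. Note that the values reported by `glance()` come from R's `AIC()` and `BIC()`, which include extra constants from the full normal log-likelihood, so the absolute numbers differ from the textbook formulas; differences *between* models, which are what we compare, agree.

```{r}
# Textbook AIC/BIC computed directly from SSE for the "positive" model;
# here p counts all estimated coefficients, including the intercept
n   <- nrow(superData)               # 30 supervisors
p   <- length(coef(lmfit.positive))  # intercept + 4 slopes = 5
SSE <- sum(resid(lmfit.positive)^2)
c(AIC = n * log(SSE / n) + 2 * p,
  BIC = n * log(SSE / n) + p * log(n))
```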
### Automation

#### Automated variable selection strategies

Thinking through the problem carefully is always a strategy that should be employed!!!

But sometimes we may want to use one of these objective approaches:

+ **Forward Selection**
+ **Backward Elimination**
+ **Stepwise Selection** (a little of both)

##### Forward Selection

(1) Begin with an "empty" model (no predictors)
(2) Add the predictor with the highest correlation with the response
    + or largest t-statistic (smallest p-value)
    + $H_0: \text{correlation}_j = 0$ is equivalent to $H_0: \beta_j=0$
    + decide here on a significance level required for entry at each step
(3) Continue on to add the next predictor with the highest partial correlation with the response
    + that is, after adjusting for the predictor added in step (2)
    + or largest t-statistic meeting the significance level criterion
    + $H_0: \text{partial correlation}_j = 0$ is equivalent to $H_0: \beta_j=0$
(4) With the new model from (3), go back to step (3) again
    + add another predictor if one meets the criterion (a p-value threshold)
    + repeat (3) until no other predictors meet the criterion

##### Backward Elimination

(1) Begin with the "full" model (all predictors)
(2) Remove the predictor with the smallest t-statistic (largest p-value)
(3) Continue on to remove the next predictor with the smallest t-statistic
    + that is, after adjusting for the remaining predictors after step (2)
(4) With the new model from (3), go back to step (3) again
    + remove another predictor if it meets the criterion (a p-value threshold)
    + repeat (3) until no other predictors meet the criterion

##### Stepwise Selection

Combines forward and backward steps, usually beginning from the empty model (forward stepwise).

### Example V2

I found an R package (`olsrr`) that will do variable selection much like described in the textbook.
I'm sure there are other packages.

`install.packages("olsrr")`

`library(olsrr)`

#### Forward selection

`lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)`

###### Entry criterion: p-value of t-test for coefficient

`ols_step_forward_p(lmfit, penter = 0.99)`

Using `penter = 0.99` we are adding in **all** predictors, one by one; whichever variable has the lowest p-value comes in next.

```{r}
library(olsrr)
lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)
forward.sequence <- ols_step_forward_p(lmfit, penter = 0.99)
forward.sequence
```

The `olsrr` package includes a `plot` method for the output from `ols_step_forward_p`:

`forward.sequence <- ols_step_forward_p(lmfit, penter = 0.99)`

`plot(forward.sequence)`

You can see how the various model summaries change as variables are added one by one.

SBC is BIC (SBC stands for Schwarz Bayesian Criterion, another name for the same quantity).

SBIC is a slight modification of BIC.

C(p) is Mallows's $C_p$, discussed in the textbook but not in lecture.

```{r}
plot(forward.sequence)
```

Could use a rule to only add variables if the p-value is less than a specified value.
Then set the criterion to, say, p-value $= 0.25$:

```{r}
forward.output <- ols_step_forward_p(lmfit, penter = 0.25)
forward.output
```

$Y=$ Overall rating of job being done by supervisor

$x_1=$ Handles employee complaints

$x_3=$ Opportunity to learn new things

$x_6=$ Rate of advancing to better jobs

###### Entry criterion: AIC no longer decreases

Alternatively, we can use a criterion like AIC to decide when to stop.
We will stop when AIC no longer decreases.

`ols_step_forward_aic(lmfit)`

```{r}
ols_step_forward_aic(lmfit)
```

#### Backward elimination

`lmfit <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = superData)`

###### Exit criterion: p-value of t-test for coefficient

Start with all variables in the model and remove according to the p-value.
Remove the predictor with the largest p-value.
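For intuition, one elimination step can be sketched by hand with base R's `drop1()` before turning to `olsrr`. Its F-test for deleting a single term is the square of the usual coefficient t-test, so the p-values agree.

```{r}
# One backward step "by hand": drop1() tests each single-term
# deletion from the full model
drop1(lmfit, test = "F")
# The term with the largest p-value would be dropped, the smaller
# model refit (e.g., with update(lmfit, . ~ . - x2)), and drop1()
# applied again until every remaining p-value is below the threshold
```

The `olsrr` package automates this loop: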
`ols_step_backward_p(lmfit, prem = 0.33)`

If we set the p-value for removal to 0.33, that corresponds to a t-statistic $\approx 1$.

```{r}
2 * (1 - pt(1, df = 30 - 6 - 1))
ols_step_backward_p(lmfit, prem = 0.33)
```

###### Exit criterion: AIC no longer decreases

Alternatively, we can use a criterion like AIC to decide when to stop.
We will stop when AIC no longer decreases.

`ols_step_backward_aic(lmfit)`

```{r}
ols_step_backward_aic(lmfit)
```

#### Stepwise (forward) selection

###### Criterion: use the t-statistic p-value for both entry and exit

As variables come into the model, you can also check to see whether any could later be removed.
Both criteria can be the p-values for coefficient t-tests.

`ols_step_both_p(lmfit, pent = 0.25, prem = 0.33)`

```{r}
ols_step_both_p(lmfit, pent = 0.25, prem = 0.33)
```

###### Criterion: or use AIC for both entry and exit

`ols_step_both_aic(lmfit)`

```{r}
ols_step_both_aic(lmfit)
```

#### Stepwise (backward) selection

Start with the full model (all predictors).
Remove according to p-value (large)
...and possibly later enter a variable back in according to p-value (small).

In the `olsrr` package, the function `ols_step_both_p` does not appear to support backward stepwise selection -- only forward (demonstrated above).
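If backward stepwise is wanted anyway, base R's `step()` offers it, though with AIC rather than p-values as the criterion: starting the search from the full model with `direction = "both"` allows terms to be dropped and later re-enter. A minimal sketch:

```{r}
# Backward stepwise via AIC with base R's step():
# begin at the full model; terms may be removed and re-entered
step(lmfit,
     scope = list(lower = y ~ 1,
                  upper = y ~ x1 + x2 + x3 + x4 + x5 + x6),
     direction = "both", trace = FALSE)
```

Because `step()` starts from whatever model it is given, passing the full `lmfit` here makes the search backward stepwise; passing the empty model `lm(y ~ 1, data = superData)` with the same `scope` would give forward stepwise instead.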