--- title: "STAT 224 Regularization" output: html_document: code_folding: hide df_print: kable number_sections: yes self_contained: yes toc: FALSE toc_depth: 4 toc_float: FALSE params: dotabs: TRUE --- ```{r test-main, child = 'preamblenocuttof.Rmd'} ``` ```{r} library(car) ``` # Regularization `r if(params$dotabs) paste('{.tabset}')` ## Lecture 1 Lecture modified from Chapter 24 in ['R for Statistical Learning'](https://daviddalpiaz.github.io/r4sl/index.html), by David Dalpiaz ```{r reg_opts, include = FALSE} knitr::opts_chunk$set(cache = TRUE, autodep = TRUE, fig.align = "center") ``` ### Setting up the data We will use the `Hitters` dataset from the `ISLR` package to explore two shrinkage methods: **ridge regression** and **lasso**. These are otherwise known as **penalized regression** methods. ```{r} data(Hitters, package = "ISLR") ``` This dataset has some missing data in the response `Salaray`. We use the `na.omit()` function the clean the dataset. ```{r} sum(is.na(Hitters)) sum(is.na(Hitters$Salary)) Hitters = na.omit(Hitters) sum(is.na(Hitters)) ``` The predictors variables are offensive and defensive statistics for a number of baseball players. ```{r} names(Hitters) ``` Let's look at just a few of the variables: ```{r} library(GGally) ggpairs(dplyr::select(Hitters,Salary,'Hits','CHits',HmRun,CHmRun,Runs,CRuns)) ``` We use the `glmnet()` and `cv.glmnet()` functions from the `glmnet` package to fit penalized regressions. ```{r, message = FALSE, warning = FALSE} library(glmnet) ``` Unfortunately, the `glmnet` function does not allow the use of model formulas, so we setup the data for ease of use with `glmnet`. Eventually we will use `train()` from `caret` which does allow for fitting penalized regression with the formula syntax, but to explore some of the details, we first work with the functions from `glmnet` directly. ```{r} X = model.matrix(Salary ~ ., Hitters)[, -1] y = Hitters$Salary ``` ### Linear regression First, we fit an ordinary linear regression, and note the size of the predictors' coefficients, and predictors' coefficients squared. (The two penalties we will use.) ```{r} fit = lm(Salary ~ ., Hitters) coef(fit) sum(abs(coef(fit)[-1])) sum(coef(fit)[-1] ^ 2) vif(fit) ``` Note that we have a lot of coefficients with large values. The VIFs indicate that we have a high degree of colinearity. The sum of the absolute values of the coefficients is `r sum(abs(coef(fit)[-1]))`, and the sum of the coefficients squared is `r sum(coef(fit)[-1] ^ 2)`. These two sums are what are penalized by ridge regression and lasso, respectively, so we should see them decrease when we use those methods. ### Ridge Regression We first illustrate **ridge regression**, which can be fit using `glmnet()` with `alpha = 0` and seeks to minimize $$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . $$ Notice that the intercept is **not** penalized. Also, note that that ridge regression is **not** scale invariant like the usual unpenalized regression. Thankfully, `glmnet()` takes care of this internally. It automatically standardizes predictors for fitting, then reports fitted coefficient using the original scale. The two plots illustrate how much the coefficients are penalized for different values of $\lambda$. Notice none of the coefficients are forced to be zero. ```{r ridge, fig.height = 4, fig.width = 8} library(glmnet) par(mfrow = c(1, 2)) fit_ridge = glmnet(X, y, alpha = 0) plot(fit_ridge, xvar = "lambda", label = TRUE) plot(fit_ridge) ``` We use cross-validation to select a good $\lambda$ value. The `cv.glmnet()`function uses 10 folds by default. The plot illustrates the MSE for the $\lambda$s considered. Two lines are drawn. The first is the $\lambda$ that gives the smallest MSE. The second is the $\lambda$ that gives an MSE within one standard error of the smallest. ```{r} fit_ridge_cv = cv.glmnet(X, y, alpha = 0) plot(fit_ridge_cv) ``` The `cv.glmnet()` function returns several details of the fit for both $\lambda$ values in the plot. Notice the penalty terms are smaller than the full linear regression. (As we would expect.) ```{r} # fitted coefficients, using 1-SE rule lambda, default behavior coef(fit_ridge_cv) ``` ```{r} # fitted coefficients, using minimum lambda coef(fit_ridge_cv, s = "lambda.min") ``` ```{r} # penalty term using minimum lambda sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2) ``` ```{r} # fitted coefficients, using 1-SE rule lambda coef(fit_ridge_cv, s = "lambda.1se") ``` ```{r} # penalty term using 1-SE rule lambda sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2) ``` ```{r, eval = FALSE} # predict using minimum lambda predict(fit_ridge_cv, X, s = "lambda.min") ``` ```{r, eval = FALSE} # predict using 1-SE rule lambda, default behavior predict(fit_ridge_cv, X) ``` ```{r} # calcualte "train error" mean((y - predict(fit_ridge_cv, X)) ^ 2) ``` ```{r} # CV-RMSE using minimum lambda sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min]) ``` ```{r} # CV-RMSE using 1-SE rule lambda sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se]) ``` ### Lasso We now illustrate **lasso**, which can be fit using `glmnet()` with `alpha = 1` and seeks to minimize $$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . $$ Like ridge, lasso is not scale invariant. The two plots illustrate how much the coefficients are penalized for different values of $\lambda$. Notice some of the coefficients are forced to be zero. ```{r lasso, fig.height = 4, fig.width = 8} par(mfrow = c(1, 2)) fit_lasso = glmnet(X, y, alpha = 1) plot(fit_lasso, xvar = "lambda", label = TRUE) plot(fit_lasso) ``` Again, to actually pick a $\lambda$, we will use cross-validation. The plot is similar to the ridge plot. Notice along the top is the number of features in the model. (Which changed in this plot.) ```{r} fit_lasso_cv = cv.glmnet(X, y, alpha = 1) plot(fit_lasso_cv) ``` `cv.glmnet()` returns several details of the fit for both $\lambda$ values in the plot. Notice the penalty terms are again smaller than the full linear regression. (As we would expect.) Some coefficients are 0. ```{r} # fitted coefficients, using 1-SE rule lambda, default behavior coef(fit_lasso_cv) ``` ```{r} # fitted coefficients, using minimum lambda coef(fit_lasso_cv, s = "lambda.min") ``` ```{r} # penalty term using minimum lambda sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2) ``` ```{r} # fitted coefficients, using 1-SE rule lambda coef(fit_lasso_cv, s = "lambda.1se") ``` ```{r} # penalty term using 1-SE rule lambda sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2) ``` ```{r, eval = FALSE} # predict using minimum lambda predict(fit_lasso_cv, X, s = "lambda.min") ``` ```{r, eval = FALSE} # predict using 1-SE rule lambda, default behavior predict(fit_lasso_cv, X) ``` ```{r} # calcualte "train error" mean((y - predict(fit_lasso_cv, X)) ^ 2) ``` ### `broom` Sometimes, the output from `glmnet()` can be overwhelming. The `broom` package can help with that. ```{r, message = FALSE, warning = FALSE} library(broom) # the output from the commented line would be immense # fit_lasso_cv tidy(fit_lasso_cv) # the two lambda values of interest glance(fit_lasso_cv) ``` ## Lecture 2 ### Simulated Data, $p > n$ Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the $p > n$ case. ```{r} set.seed(1234) n = 1000 p = 5500 X = replicate(p, rnorm(n = n)) beta = c(1, 1, 1, rep(0, 5497)) z = X %*% beta prob = exp(z) / (1 + exp(z)) y = as.factor(rbinom(length(z), size = 1, prob = prob)) ``` We first simulate a classification example where $p > n$. ```{r} # glm(y ~ X, family = "binomial") # will not converge ``` We then use a lasso penalty to fit penalized logistic regression. This minimizes $$ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| $$ where $L$ is the appropriate *negative* **log**-likelihood. ```{r} library(glmnet) fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1) plot(fit_cv) ``` ```{r} head(coef(fit_cv), n = 10) ``` ```{r} fit_cv$nzero ``` Notice, only the first three predictors generated are truly significant, and that is exactly what the suggested model finds. ```{r} fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se) which(as.vector(as.matrix(fit_1se$beta)) != 0) ``` We can also see in the following plots, the three features entering the model well ahead of the irrelevant features. ```{r, fig.height = 4, fig.width = 8} par(mfrow = c(1, 2)) plot(glmnet(X, y, family = "binomial")) plot(glmnet(X, y, family = "binomial"), xvar = "lambda") ``` We can extract the two relevant $\lambda$ values. ```{r} fit_cv$lambda.min fit_cv$lambda.1se ``` Since `cv.glmnet()` does not calculate prediction accuracy for classification, we take the $\lambda$ values and create a grid for `caret` to search in order to obtain prediction accuracy with `train()`. We set $\alpha = 1$ in this grid, as `glmnet` can actually tune over the $\alpha = 1$ parameter. (More on that later.) Note that we have to force `y` to be a factor, so that `train()` recognizes we want to have a binomial response. The `train()` function in `caret` use the type of variable in `y` to determine if you want to use `family = "binomial"` or `family = "gaussian"`. ```{r, message = FALSE, warning = FALSE} library(caret) cv_5 = trainControl(method = "cv", number = 5) lasso_grid = expand.grid(alpha = 1, lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se)) lasso_grid ``` ```{r} sim_data = data.frame(y, X) fit_lasso = train( y ~ ., data = sim_data, method = "glmnet", trControl = cv_5, tuneGrid = lasso_grid ) fit_lasso$results ``` The interaction between the `glmnet` and `caret` packages is sometimes frustrating, but for obtaining results for particular values of $\lambda$, we see it can be easily used. ### Feature selection example Here we analyze data from a microarray study of Singh et al. (2002). This data set contains measurements of the gene expression of 6033 genes for 102 observations: 52 prostate cancer patients and 50 healty men. The goal is to see which genes were most associated with the cancer diagnosis. ```{r} library(sda) data("singh2002") dim(singh2002$y) # number of observations dim(singh2002$x) fit <- glmnet(singh2002$x, singh2002$y, family = "binomial",pmax=20,alpha=1) plot(fit,label=TRUE,xvar="lambda") ``` ```{r} fit_lasso_cv = cv.glmnet(singh2002$x, singh2002$y, family = "binomial", nfolds=5,alpha = 0.99,type.measure='class') coefs<-coef(fit_lasso_cv) summary(coefs) ``` ## Lecture 3 ### More complete analysis of Hitters data Let's look more carefully at the results of the LASSO regression for the Hitters data. The `plotres` function gives some useful graphs. ```{r} library(plotmo) X = model.matrix(Salary ~ ., Hitters)[, -1] y = Hitters$Salary fit_lasso_cv = cv.glmnet(X, y, alpha = 1) plot(fit_lasso_cv) plotres(fit_lasso_cv) ``` I see some definite heteroskedasticity, which is not surprising for a variable like `Salary`; income data is often very right-skewed and often has non-constant variance. Let's check to be sure: ```{r} hist(Hitters$Salary) ``` Yup! Let's try a `log` transformation to make the response more normal, and see if this fixes the heteroscedasticity. ```{r} hist(log(Hitters$Salary)) ``` Looks better already. Let's do the fit... ```{r} logHitters<-mutate(Hitters,logSalary=log(Salary)) #Xlog = model.matrix(logSalary ~ . - Salary, logHitters)[, -1] ylog = logHitters$logSalary log_fit_lasso_cv = cv.glmnet(X, ylog, alpha = 1) plotres(log_fit_lasso_cv) ``` Did using the `log` transformation change the predictors found by the LASSO fit? Let's see... ```{r} coef(fit_lasso_cv, s = "lambda.1se") coef(log_fit_lasso_cv, s = "lambda.1se") ``` Yes, the chosen coefficients aren't exactly the same. Now I'm concerned about outliers: ```{r} plotres(log_fit_lasso_cv,which=3) ``` What is going on with Mike Schmidt? He is observation #173. Let's look at a few observations around him: ```{r} Hitters[173:175,c(1:8,19)] # Just show a few predictors for space reasons ``` He's had only 1 hit, 0 home runs, 0 runs, and has only been in the league 2 years, yet his salary seems very high. Let's confirm that: ```{r} hist(Hitters$Salary) ``` He is almost the highest payed player, according to this data. What is going on? Let's Google Mike Schmidt. ![Mike Schmidt](https://i.pinimg.com/originals/aa/e6/02/aae602ff58a81bacd4add7bad921d1d6.jpg) Mike Schmidt played with the Phillies starting in 1972, so he'd been playing for 14 years by 1986. According to Wikipedia: "Schmidt was a twelve-time All-Star and a three-time winner of the National League (NL) Most Valuable Player award (MVP), and he was known for his combination of power hitting and strong defense: as a hitter, he compiled 548 home runs and 1,595 runs batted in (RBIs), and led the NL in home runs eight times and in RBIs four times."" This all explains why Schmidt was payed so much, but not why he's listed as having 0 runs and 0 home runs. I'm going to conclude that there's something wrong with this data, and run the fit without Schmidt. ```{r} #Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1] ysans = logHitters[-173,]$logSalary log_fit_lasso_cv_sans = cv.glmnet(X[-173,], ysans, alpha = 1) plotres(log_fit_lasso_cv_sans) ``` Terry Kennedy seems to be a similar situation. ```{r} #Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1] ysans2 = logHitters[-c(173,241),]$logSalary log_fit_lasso_cv_sans2 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 1) plotres(log_fit_lasso_cv_sans2) ``` Steve Sax is a bit of an outlier here, but I don't see anything wrong with his data. Also, I doubt he is very influential, since he's near the middle of the fitted value chart. Let's look at the final coefficients: ```{r} coef(log_fit_lasso_cv_sans2) ``` Would we get similar values if we did a non-regularized fit for the same predictors? (Plus an advantage of the `lm` -- we get nicer diagnostic plots by default:) ```{r} linfit2<-lm(log(Salary)~Hits+RBI+Walks+Years+CHits+CRuns+Division,data=Hitters[-c(173,241),]) summary(linfit2) plot(linfit2) ``` Are the coefficients with the linear fit versus the LASSO found broadly similar? ```{r} coef(linfit2) lassocoefs<-as.matrix(coef(log_fit_lasso_cv_sans2)) lassocoefs[lassocoefs!=0] plot(coef(linfit2)[2:9]~lassocoefs[lassocoefs!=0][2:9]) ``` Lastly, let's try an elastic net fit to see if it will bring in both the career averages and the 1986 numbers for RBI and Walks. ```{r} log_fit_lasso_cv_sans3 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 0.5) plotres(log_fit_lasso_cv_sans3) coef(log_fit_lasso_cv_sans3) ``` ### External Links - [`glmnet` Web Vingette](https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) - Details from the package developers.