library(car)
Lecture modified from Chapter 24 in ‘R for Statistical Learning’, by David Dalpiaz
We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.
data(Hitters, package = "ISLR")
This dataset has some missing data in the response, Salary. We use the na.omit() function to clean the dataset.
sum(is.na(Hitters))
[1] 0
sum(is.na(Hitters$Salary))
[1] 0
Hitters = na.omit(Hitters)
sum(is.na(Hitters))
[1] 0
The predictor variables are offensive and defensive statistics for a number of baseball players.
names(Hitters)
[1] "AtBat" "Hits" "HmRun" "Runs" "RBI" "Walks"
[7] "Years" "CAtBat" "CHits" "CHmRun" "CRuns" "CRBI"
[13] "CWalks" "League" "Division" "PutOuts" "Assists" "Errors"
[19] "Salary" "NewLeague"
Let’s look at just a few of the variables:
library(GGally)
ggpairs(dplyr::select(Hitters, Salary, Hits, CHits, HmRun, CHmRun, Runs, CRuns))
We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions.
library(glmnet)
Unfortunately, the glmnet function does not allow the use of model formulas, so we set up the data for ease of use with glmnet. Eventually we will use train() from caret, which does allow fitting penalized regressions with the formula syntax, but to explore some of the details, we first work with the functions from glmnet directly.
X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
First, we fit an ordinary linear regression and note the sum of the absolute values of the coefficients and the sum of the squared coefficients. (These are the two penalty terms we will use.)
fit = lm(Salary ~ ., Hitters)
coef(fit)
(Intercept) AtBat Hits HmRun Runs RBI
163.1036 -1.9799 7.5008 4.3309 -2.3762 -1.0450
Walks Years CAtBat CHits CHmRun CRuns
6.2313 -3.4891 -0.1713 0.1340 -0.1729 1.4543
CRBI CWalks LeagueN DivisionW PutOuts Assists
0.8077 -0.8116 62.5994 -116.8492 0.2819 0.3711
Errors NewLeagueN
-3.3608 -24.7623
sum(abs(coef(fit)[-1]))
[1] 238.7
sum(coef(fit)[-1] ^ 2)
[1] 18337
vif(fit)
AtBat Hits HmRun Runs RBI Walks Years CAtBat
22.944 30.281 7.759 15.246 11.922 4.149 9.313 251.561
CHits CHmRun CRuns CRBI CWalks League Division PutOuts
502.954 46.488 162.521 131.966 19.744 4.134 1.075 1.236
Assists Errors NewLeague
2.709 2.215 4.099
Note that we have a lot of coefficients with large values. The VIFs indicate that we have a high degree of collinearity.
The sum of the absolute values of the coefficients is 238.7, and the sum of the squared coefficients is 18,337. These two sums are what are penalized by lasso and ridge regression, respectively, so we should see them decrease when we use those methods.
We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]
Notice that the intercept is not penalized. Also, note that ridge regression, unlike ordinary unpenalized regression, is not scale invariant. Thankfully, glmnet() takes care of this internally: it automatically standardizes the predictors for fitting, then reports the fitted coefficients on the original scale.
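To see the effect of this internal standardization, here is a minimal sketch (the rescaled copy X_resc and the fixed \(\lambda = 10\) are purely illustrative): rescaling a predictor leaves the fit unchanged and only rescales its reported coefficient.
X_resc = X
X_resc[, "AtBat"] = X_resc[, "AtBat"] / 100   # AtBat now measured in hundreds of at-bats
fit_orig = glmnet(X, y, alpha = 0, lambda = 10)
fit_resc = glmnet(X_resc, y, alpha = 0, lambda = 10)
# the AtBat coefficient should simply scale up by a factor of 100; the others are unchanged
c(coef(fit_orig)["AtBat", 1], coef(fit_resc)["AtBat", 1])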
The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice none of the coefficients are forced to be zero.
library(glmnet)
par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge, xvar = "lambda", label = TRUE)
plot(fit_ridge)
We use cross-validation to select a good \(\lambda\) value. The cv.glmnet() function uses 10 folds by default. The plot illustrates the cross-validated MSE for the \(\lambda\) values considered. Two vertical lines are drawn. The first is at the \(\lambda\) that gives the smallest MSE. The second is at the largest \(\lambda\) whose MSE is within one standard error of the smallest.
fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)
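The two \(\lambda\) values marked by the vertical lines can be pulled directly from the fitted object:
# lambda minimizing the CV error, and the largest lambda within one SE of that minimum
fit_ridge_cv$lambda.min
fit_ridge_cv$lambda.1se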
The cv.glmnet() function returns several details of the fit for both \(\lambda\) values in the plot. Notice the penalty terms are smaller than the full linear regression. (As we would expect.)
# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 199.418114
AtBat 0.093427
Hits 0.389767
HmRun 1.212875
Runs 0.623229
RBI 0.618548
Walks 0.810468
Years 2.544171
CAtBat 0.007897
CHits 0.030555
CHmRun 0.226546
CRuns 0.061266
CRBI 0.063385
CWalks 0.060720
LeagueN 3.743295
DivisionW -23.545192
PutOuts 0.056202
Assists 0.007879
Errors -0.164203
NewLeagueN 3.313773
# fitted coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 76.4582387
AtBat -0.6315180
Hits 2.6421596
HmRun -1.3882331
Runs 1.0457295
RBI 0.7315713
Walks 3.2780010
Years -8.7237336
CAtBat 0.0001256
CHits 0.1318975
CHmRun 0.6895578
CRuns 0.2830055
CRBI 0.2514905
CWalks -0.2599851
LeagueN 52.3371962
DivisionW -122.4169978
PutOuts 0.2623667
Assists 0.1629044
Errors -3.6440019
NewLeagueN -17.0259792
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
[1] 18127
# fitted coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 199.418114
AtBat 0.093427
Hits 0.389767
HmRun 1.212875
Runs 0.623229
RBI 0.618548
Walks 0.810468
Years 2.544171
CAtBat 0.007897
CHits 0.030555
CHmRun 0.226546
CRuns 0.061266
CRBI 0.063385
CWalks 0.060720
LeagueN 3.743295
DivisionW -23.545192
PutOuts 0.056202
Assists 0.007879
Errors -0.164203
NewLeagueN 3.313773
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
[1] 589
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
[1] 130405
# CV-RMSE using minimum lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min])
[1] 343.3
# CV-RMSE using 1-SE rule lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se])
[1] 367.9
We now illustrate lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize
\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]
Like ridge, lasso is not scale invariant.
The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice some of the coefficients are forced to be zero.
par(mfrow = c(1, 2))
fit_lasso = glmnet(X, y, alpha = 1)
plot(fit_lasso, xvar = "lambda", label = TRUE)
plot(fit_lasso)
Again, to actually pick a \(\lambda\), we will use cross-validation. The plot is similar to the ridge plot. Notice the numbers along the top: the number of nonzero coefficients in the model, which now changes with \(\lambda\).
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
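The counts along the top of the plot are also stored in the fitted object; for instance, the number of nonzero coefficients at the two highlighted \(\lambda\) values:
# number of nonzero coefficients at lambda.min and at lambda.1se
fit_lasso_cv$nzero[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min]
fit_lasso_cv$nzero[fit_lasso_cv$lambda == fit_lasso_cv$lambda.1se]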
cv.glmnet() returns several details of the fit for both \(\lambda\) values in the plot. Notice the penalty terms are again smaller than the full linear regression. (As we would expect.) Some coefficients are 0.
# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_lasso_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 167.91203
AtBat .
Hits 1.29270
HmRun .
Runs .
RBI .
Walks 1.39818
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.14168
CRBI 0.32193
CWalks .
LeagueN .
DivisionW .
PutOuts 0.04675
Assists .
Errors .
NewLeagueN .
# fitted coefficients, using minimum lambda
coef(fit_lasso_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 139.50053399
AtBat -1.73412453
Hits 6.05986118
HmRun 0.17081018
Runs .
RBI .
Walks 5.05382556
Years -10.61379198
CAtBat -0.00002484
CHits .
CHmRun 0.57214145
CRuns 0.72181056
CRBI 0.38563389
CWalks -0.60336657
LeagueN 33.39735185
DivisionW -119.39235545
PutOuts 0.27725600
Assists 0.21048279
Errors -2.37146800
NewLeagueN .
# penalty term using minimum lambda
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2)
[1] 15555
# fitted coefficients, using 1-SE rule lambda
coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 167.91203
AtBat .
Hits 1.29270
HmRun .
Runs .
RBI .
Walks 1.39818
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.14168
CRBI 0.32193
CWalks .
LeagueN .
DivisionW .
PutOuts 0.04675
Assists .
Errors .
NewLeagueN .
# penalty term using 1-SE rule lambda
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2)
[1] 3.752
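Note that the sums of squares above are really the ridge-style penalty; the lasso's own penalty is the sum of absolute values, which we can compute the same way:
# the lasso penalty term (sum of absolute coefficients) at both lambda values
sum(abs(coef(fit_lasso_cv, s = "lambda.min")[-1]))
sum(abs(coef(fit_lasso_cv, s = "lambda.1se")[-1]))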
# predict using minimum lambda
predict(fit_lasso_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_lasso_cv, X)
# calculate "train error"
mean((y - predict(fit_lasso_cv, X)) ^ 2)
[1] 123931
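For comparison with the ridge fit, the cross-validated RMSE at the two \(\lambda\) values can be extracted the same way:
# CV-RMSE using minimum lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.min])
# CV-RMSE using 1-SE rule lambda
sqrt(fit_lasso_cv$cvm[fit_lasso_cv$lambda == fit_lasso_cv$lambda.1se])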
Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that.
library(broom)
# the output from the commented line would be immense
# fit_lasso_cv
tidy(fit_lasso_cv)
lambda estimate std.error conf.high conf.low nzero
1 255.2821 203503 24725 228228 178778 0
2 232.6035 196613 24396 221009 172217 1
3 211.9397 187747 23166 210914 164581 2
4 193.1115 180470 22173 202642 158297 2
5 175.9560 174208 21479 195687 152729 3
6 160.3246 167461 21072 188532 146389 4
7 146.0818 160814 20875 181689 139940 4
8 133.1043 154965 20774 175739 134191 4
9 121.2797 149748 20652 170399 129096 4
10 110.5055 145352 20579 165931 124773 4
11 100.6885 141690 20589 162279 121101 5
12 91.7436 138396 20679 159075 117717 5
13 83.5934 135528 20770 156298 114758 5
14 76.1672 132988 20858 153846 112130 5
15 69.4007 130716 20990 151707 109726 6
16 63.2353 128572 21138 149709 107434 6
17 57.6177 126671 21257 147928 105414 6
18 52.4991 125000 21312 146313 103688 6
19 47.8352 123547 21321 144868 102226 6
20 43.5857 122295 21329 143625 100966 6
21 39.7136 121248 21345 142593 99903 6
22 36.1856 120372 21366 141738 99006 6
23 32.9710 119655 21396 141050 98259 6
24 30.0419 119086 21435 140521 97651 6
25 27.3731 118650 21484 140134 97166 6
26 24.9413 118320 21536 139856 96784 6
27 22.7256 118095 21591 139685 96504 6
28 20.7067 117934 21638 139572 96295 6
29 18.8672 117820 21682 139502 96138 6
30 17.1911 117783 21739 139522 96044 7
31 15.6639 117767 21794 139562 95973 7
32 14.2723 117771 21844 139615 95927 7
33 13.0044 117793 21888 139681 95905 9
34 11.8491 117788 21932 139720 95856 9
35 10.7965 117765 21976 139741 95789 9
36 9.8374 117833 22064 139897 95770 9
37 8.9634 118080 22174 140253 95906 9
38 8.1672 118425 22259 140684 96167 11
39 7.4416 118467 22291 140759 96176 11
40 6.7805 118291 22307 140599 95984 12
41 6.1782 118044 22330 140374 95714 12
42 5.6293 117671 22378 140048 95293 13
43 5.1292 117054 22409 139463 94646 13
44 4.6735 116338 22421 138759 93917 13
45 4.2584 115665 22431 138096 93235 13
46 3.8801 115111 22442 137554 92669 13
47 3.5354 114672 22458 137130 92213 13
48 3.2213 114335 22477 136811 91858 13
49 2.9351 114134 22489 136623 91645 13
50 2.6744 114015 22499 136513 91516 13
51 2.4368 113912 22483 136395 91429 13
52 2.2203 113829 22443 136272 91385 14
53 2.0231 113803 22410 136213 91393 15
54 1.8433 113894 22380 136275 91514 15
55 1.6796 114042 22335 136377 91707 17
56 1.5304 114217 22306 136523 91910 17
57 1.3944 114354 22293 136646 92061 17
58 1.2705 114440 22292 136732 92148 17
59 1.1577 114486 22297 136783 92189 17
60 1.0548 114525 22308 136832 92217 17
61 0.9611 114586 22320 136906 92267 17
62 0.8757 114608 22317 136925 92291 17
63 0.7979 114636 22318 136954 92318 17
64 0.7271 114667 22317 136984 92350 17
65 0.6625 114695 22316 137012 92379 18
66 0.6036 114726 22317 137043 92410 18
67 0.5500 114766 22317 137082 92449 18
68 0.5011 114807 22317 137124 92490 17
69 0.4566 114858 22314 137172 92543 18
70 0.4160 114894 22314 137208 92581 18
71 0.3791 114949 22324 137272 92625 18
72 0.3454 115001 22336 137337 92665 18
73 0.3147 115047 22347 137394 92699 18
74 0.2868 115104 22355 137459 92749 18
75 0.2613 115137 22365 137502 92772 18
# the two lambda values of interest
glance(fit_lasso_cv)
lambda.min lambda.1se
1 2.023 83.59
Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the \(p > n\) case. We first simulate a classification example where \(p > n\):
set.seed(1234)
n = 1000
p = 5500
X = replicate(p, rnorm(n = n))
beta = c(1, 1, 1, rep(0, 5497))
z = X %*% beta
prob = exp(z) / (1 + exp(z))
y = as.factor(rbinom(length(z), size = 1, prob = prob))
With \(p > n\), an ordinary (unpenalized) logistic regression will not converge:
# glm(y ~ X, family = "binomial")
# will not converge
We then use a lasso penalty to fit penalized logistic regression. This minimizes
\[ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \]
where \(L\) is the appropriate negative log-likelihood.
library(glmnet)
fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_cv)
head(coef(fit_cv), n = 10)
10 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.02397
V1 0.59675
V2 0.56252
V3 0.60065
V4 .
V5 .
V6 .
V7 .
V8 .
V9 .
fit_cv$nzero
s0 s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19
0 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39
3 3 3 3 3 3 3 3 3 3 4 6 7 10 18 24 35 54 65 75
s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 s58 s59
86 100 110 129 147 168 187 202 221 241 254 269 283 298 310 324 333 350 364 375
s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 s72 s73 s74 s75 s76 s77 s78 s79
387 400 411 429 435 445 453 455 462 466 475 481 487 491 496 498 502 504 512 518
s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 s90 s91 s92 s93 s94 s95 s96 s97 s98
523 526 528 536 543 550 559 561 563 566 570 571 576 582 586 590 596 596 600
Notice that only the first three predictors were generated with nonzero coefficients, and that is exactly what the suggested (1-SE) model finds.
fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se)
which(as.vector(as.matrix(fit_1se$beta)) != 0)
[1] 1 2 3
We can also see in the following plots that these three features enter the model well ahead of the irrelevant features.
par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")
We can extract the two relevant \(\lambda\) values.
fit_cv$lambda.min
[1] 0.03087
fit_cv$lambda.1se
[1] 0.0515
Since cv.glmnet() does not calculate prediction accuracy for classification by default, we take the \(\lambda\) values found above and create a grid for caret to search in order to obtain prediction accuracy with train(). We fix \(\alpha = 1\) in this grid; glmnet can actually be tuned over the \(\alpha\) parameter as well. (More on that later.)
Note that we have to force y to be a factor, so that train() recognizes we want a binomial response. The train() function in caret uses the type of y to determine whether to use family = "binomial" or family = "gaussian".
library(caret)
cv_5 = trainControl(method = "cv", number = 5)
lasso_grid = expand.grid(alpha = 1,
lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
lasso_grid
alpha lambda
1 1 0.03087
2 1 0.05150
sim_data = data.frame(y, X)
fit_lasso = train(
y ~ ., data = sim_data,
method = "glmnet",
trControl = cv_5,
tuneGrid = lasso_grid
)
fit_lasso$results
alpha lambda Accuracy Kappa AccuracySD KappaSD
1 1 0.03087 0.7679 0.5358 0.03430 0.06845
2 1 0.05150 0.7689 0.5378 0.02807 0.05596
The interaction between the glmnet and caret packages is sometimes frustrating, but, as we see here, it is easy to obtain results for particular values of \(\lambda\).
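As promised, we can also let caret tune over \(\alpha\). A minimal sketch, reusing sim_data and cv_5 from above; the \(\alpha\) grid values (and the names enet_grid and fit_enet) are only illustrative:
# search over both the mixing parameter alpha and the penalty lambda
enet_grid = expand.grid(alpha  = c(0, 0.25, 0.5, 0.75, 1),
                        lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
fit_enet = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneGrid = enet_grid
)
fit_enet$bestTune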
Here we analyze data from a microarray study of Singh et al. (2002). This dataset contains gene expression measurements for 6,033 genes on 102 observations: 52 prostate cancer patients and 50 healthy men. The goal is to see which genes are most associated with the cancer diagnosis.
library(sda)
data("singh2002")
dim(singh2002$y) # y is a factor, so dim() returns NULL; length() gives the number of observations
NULL
dim(singh2002$x)
[1] 102 6033
fit <- glmnet(singh2002$x, singh2002$y, family = "binomial", pmax = 20, alpha = 1)
Warning: from glmnet Fortran code (error code -10015); Number of nonzero
coefficients along the path exceeds pmax=20 at 15th lambda value; solutions for
larger lambdas returned
plot(fit, label = TRUE, xvar = "lambda")
fit_lasso_cv = cv.glmnet(singh2002$x, singh2002$y, family = "binomial",
                         nfolds = 5, alpha = 0.99, type.measure = "class")
coefs <- coef(fit_lasso_cv)
summary(coefs)
6034 x 1 sparse Matrix of class "dgCMatrix", with 44 entries
i j x
1 1 1 -0.493161
2 299 1 0.027846
3 333 1 -0.307869
4 365 1 0.040395
5 378 1 -0.053895
6 453 1 -0.110068
7 580 1 -0.168770
8 611 1 -0.373868
9 703 1 0.009255
10 722 1 -0.015974
11 740 1 0.001608
12 806 1 0.023213
13 915 1 -0.207116
14 922 1 0.154351
15 1069 1 -0.250180
16 1078 1 -0.140912
17 1090 1 -0.207007
18 1114 1 -0.064859
19 1558 1 -0.030528
20 1590 1 0.092322
21 1721 1 -0.414631
22 1967 1 -0.032548
23 2392 1 0.003964
24 2853 1 -0.038024
25 2869 1 -0.060466
26 3018 1 0.101644
27 3188 1 0.001800
28 3283 1 -0.015787
29 3293 1 0.020324
30 3376 1 -0.120177
31 3648 1 -0.121972
32 3697 1 0.003471
33 3880 1 -0.021968
34 3941 1 0.096150
35 4074 1 0.056335
36 4089 1 0.037156
37 4105 1 0.025749
38 4155 1 0.048707
39 4317 1 0.090442
40 4332 1 0.158554
41 4397 1 0.088418
42 4519 1 -0.235190
43 4547 1 0.064363
44 4982 1 -0.089703
Let’s look more carefully at the results of the LASSO regression for the Hitters data. The plotres function gives some useful graphs.
library(plotmo)
X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
plotres(fit_lasso_cv)
I see some definite heteroskedasticity, which is not surprising for a variable like Salary; income data is typically right-skewed and tends to have non-constant variance. Let's check to be sure:
hist(Hitters$Salary)
Yup! Let's try a log transformation to make the response more normal, and see if this fixes the heteroskedasticity.
hist(log(Hitters$Salary))
Looks better already. Let’s do the fit…
logHitters <- dplyr::mutate(Hitters, logSalary = log(Salary))
#Xlog = model.matrix(logSalary ~ . - Salary, logHitters)[, -1]
ylog = logHitters$logSalary
log_fit_lasso_cv = cv.glmnet(X, ylog, alpha = 1)
plotres(log_fit_lasso_cv)
Did using the log transformation change the predictors found by the LASSO fit? Let’s see…
coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 193.74264
AtBat .
Hits 1.21471
HmRun .
Runs .
RBI .
Walks 1.28958
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.12924
CRBI 0.31516
CWalks .
LeagueN .
DivisionW .
PutOuts 0.02533
Assists .
Errors .
NewLeagueN .
coef(log_fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.9970914
AtBat .
Hits 0.0033360
HmRun .
Runs .
RBI 0.0003002
Walks 0.0029740
Years 0.0080495
CAtBat .
CHits 0.0003775
CHmRun .
CRuns 0.0002244
CRBI 0.0000610
CWalks .
LeagueN .
DivisionW .
PutOuts .
Assists .
Errors .
NewLeagueN .
Yes, the selected predictors and their coefficients aren't exactly the same.
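A quick way to see exactly which variables each fit selects at its 1-SE \(\lambda\) (the helper vectors sel_raw and sel_log are introduced just for this comparison):
# variables with nonzero coefficients under each response scale
sel_raw = rownames(coef(fit_lasso_cv, s = "lambda.1se"))[
  as.vector(coef(fit_lasso_cv, s = "lambda.1se")) != 0]
sel_log = rownames(coef(log_fit_lasso_cv, s = "lambda.1se"))[
  as.vector(coef(log_fit_lasso_cv, s = "lambda.1se")) != 0]
setdiff(sel_log, sel_raw)  # selected only for log(Salary)
setdiff(sel_raw, sel_log)  # selected only for Salary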
Now I’m concerned about outliers:
plotres(log_fit_lasso_cv,which=3)
What is going on with Mike Schmidt? He is observation #173. Let’s look at a few observations around him:
Hitters[173:175,c(1:8,19)] # Just show a few predictors for space reasons
AtBat Hits HmRun Runs RBI Walks Years CAtBat Salary
-Mike Schmidt 20 1 0 0 0 0 2 41 2127
-Mike Scioscia 374 94 5 36 26 62 7 1968 875
-Mickey Tettleton 211 43 10 26 35 39 3 498 120
He’s had only 1 hit, 0 home runs, 0 runs, and has only been in the league 2 years, yet his salary seems very high. Let’s confirm that:
hist(Hitters$Salary)
He is almost the highest-paid player, according to this data. What is going on? Let's Google Mike Schmidt.
Mike Schmidt played with the Phillies starting in 1972, so he’d been playing for 14 years by 1986. According to Wikipedia:
“Schmidt was a twelve-time All-Star and a three-time winner of the National League (NL) Most Valuable Player award (MVP), and he was known for his combination of power hitting and strong defense: as a hitter, he compiled 548 home runs and 1,595 runs batted in (RBIs), and led the NL in home runs eight times and in RBIs four times.”
This all explains why Schmidt was paid so much, but not why he's listed as having 0 runs and 0 home runs. I'm going to conclude that there's something wrong with this data, and run the fit without Schmidt.
#Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1]
ysans = logHitters[-173,]$logSalary
log_fit_lasso_cv_sans = cv.glmnet(X[-173,], ysans, alpha = 1)
plotres(log_fit_lasso_cv_sans)
Terry Kennedy seems to be in a similar situation, so we remove him (observation #241) as well.
#Xsans2 = model.matrix(logSalary ~ . - Salary, logHitters[-c(173,241),])[, -1]
ysans2 = logHitters[-c(173,241),]$logSalary
log_fit_lasso_cv_sans2 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 1)
plotres(log_fit_lasso_cv_sans2)
Steve Sax is a bit of an outlier here, but I don’t see anything wrong with his data. Also, I doubt he is very influential, since he’s near the middle of the fitted value chart.
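To double-check, we can pull up his row the same way we did for Mike Schmidt (assuming his row name follows the same "-First Last" convention used above):
# locate Steve Sax by row name and show a few of his statistics
sax = grep("Steve Sax", rownames(Hitters))
Hitters[sax, c(1:8, 19)]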
Let’s look at the final coefficients:
coef(log_fit_lasso_cv_sans2)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.60250178
AtBat .
Hits 0.00509315
HmRun .
Runs .
RBI 0.00055799
Walks 0.00463658
Years 0.03839075
CAtBat .
CHits 0.00031499
CHmRun .
CRuns 0.00009165
CRBI .
CWalks .
LeagueN .
DivisionW -0.00915693
PutOuts .
Assists .
Errors .
NewLeagueN .
Would we get similar values if we did a non-regularized fit with the same predictors? (Plus, an advantage of lm() is that we get nicer diagnostic plots by default:)
linfit2 <- lm(log(Salary) ~ Hits + RBI + Walks + Years + CHits + CRuns + Division, data = Hitters[-c(173, 241), ])
summary(linfit2)
Call:
lm(formula = log(Salary) ~ Hits + RBI + Walks + Years + CHits +
CRuns + Division, data = Hitters[-c(173, 241), ])
Residuals:
Min 1Q Median 3Q Max
-2.064 -0.446 0.068 0.416 1.335
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.276127 0.142734 29.96 < 2e-16 ***
Hits 0.006435 0.001464 4.40 0.000016 ***
RBI 0.001019 0.002317 0.44 0.66058
Walks 0.007025 0.002321 3.03 0.00273 **
Years 0.067816 0.019197 3.53 0.00049 ***
CHits 0.000399 0.000386 1.03 0.30206
CRuns -0.000288 0.000696 -0.41 0.67941
DivisionW -0.169181 0.072177 -2.34 0.01985 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.574 on 253 degrees of freedom
Multiple R-squared: 0.589, Adjusted R-squared: 0.578
F-statistic: 51.9 on 7 and 253 DF, p-value: <2e-16
plot(linfit2)
Are the coefficients from the linear fit broadly similar to those found by the LASSO?
coef(linfit2)
(Intercept) Hits RBI Walks Years CHits
4.2761274 0.0064352 0.0010187 0.0070254 0.0678155 0.0003987
CRuns DivisionW
-0.0002878 -0.1691809
lassocoefs <- as.matrix(coef(log_fit_lasso_cv_sans2))
lassocoefs[lassocoefs!=0]
[1] 4.60250178 0.00509315 0.00055799 0.00463658 0.03839075 0.00031499
[7] 0.00009165 -0.00915693
# drop the intercepts and plot the lm coefficients against the lasso coefficients
plot(coef(linfit2)[2:8] ~ lassocoefs[lassocoefs != 0][2:8])
Lastly, let's try an elastic net fit to see if it will bring in both the career totals and the 1986 numbers for RBI and Walks.
log_fit_lasso_cv_sans3 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 0.5)
plotres(log_fit_lasso_cv_sans3)
coef(log_fit_lasso_cv_sans3)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.78008050
AtBat .
Hits 0.00337053
HmRun .
Runs 0.00160678
RBI 0.00104338
Walks 0.00338272
Years 0.02248827
CAtBat 0.00003187
CHits 0.00015698
CHmRun .
CRuns 0.00022782
CRBI 0.00012046
CWalks .
LeagueN .
DivisionW .
PutOuts .
Assists .
Errors .
NewLeagueN .
glmnet Web Vignette - Details from the package developers.