All lectures

library(car)

Regularization

Lecture 1

Lecture modified from Chapter 24 in ‘R for Statistical Learning’, by David Dalpiaz

Setting up the data

We will use the Hitters dataset from the ISLR package to explore two shrinkage methods: ridge regression and lasso. These are otherwise known as penalized regression methods.

data(Hitters, package = "ISLR")

This dataset has some missing data in the response Salary. We use the na.omit() function to clean the dataset.

sum(is.na(Hitters))
[1] 0
sum(is.na(Hitters$Salary))
[1] 0
Hitters = na.omit(Hitters)
sum(is.na(Hitters))
[1] 0

The predictor variables are offensive and defensive statistics for a number of baseball players.

names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"    
 [7] "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"     "CRBI"     
[13] "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"   
[19] "Salary"    "NewLeague"

Let’s look at just a few of the variables:

library(GGally)
ggpairs(dplyr::select(Hitters, Salary, Hits, CHits, HmRun, CHmRun, Runs, CRuns))

We use the glmnet() and cv.glmnet() functions from the glmnet package to fit penalized regressions.

library(glmnet)

Unfortunately, the glmnet() function does not allow the use of model formulas, so we set up the data for ease of use with glmnet. Eventually we will use train() from caret, which does allow fitting penalized regression with the formula syntax, but to explore some of the details, we first work with the functions from glmnet directly.

X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary

Linear regression

First, we fit an ordinary linear regression and note the sum of the absolute values of the coefficients and the sum of the squared coefficients. (These are the two penalty terms we will use.)

fit = lm(Salary ~ ., Hitters)
coef(fit)
(Intercept)       AtBat        Hits       HmRun        Runs         RBI 
   163.1036     -1.9799      7.5008      4.3309     -2.3762     -1.0450 
      Walks       Years      CAtBat       CHits      CHmRun       CRuns 
     6.2313     -3.4891     -0.1713      0.1340     -0.1729      1.4543 
       CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
     0.8077     -0.8116     62.5994   -116.8492      0.2819      0.3711 
     Errors  NewLeagueN 
    -3.3608    -24.7623 
sum(abs(coef(fit)[-1]))
[1] 238.7
sum(coef(fit)[-1] ^ 2)
[1] 18337
vif(fit)
    AtBat      Hits     HmRun      Runs       RBI     Walks     Years    CAtBat 
   22.944    30.281     7.759    15.246    11.922     4.149     9.313   251.561 
    CHits    CHmRun     CRuns      CRBI    CWalks    League  Division   PutOuts 
  502.954    46.488   162.521   131.966    19.744     4.134     1.075     1.236 
  Assists    Errors NewLeague 
    2.709     2.215     4.099 

Note that we have a lot of coefficients with large values. The VIFs indicate that we have a high degree of collinearity.

The sum of the absolute values of the coefficients is 238.7, and the sum of the coefficients squared is 18,337. These two sums are what are penalized by the lasso and by ridge regression, respectively, so we should see them decrease when we use those methods.
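Since we will recompute these two sums for each fit below, a small helper keeps the comparison compact. penalty_sums() is a hypothetical convenience function, not part of the original lecture.

# hypothetical helper: L1 and L2 penalty terms for a coefficient vector (pass it the coefficients without the intercept)
penalty_sums = function(b) c(l1 = sum(abs(b)), l2 = sum(b ^ 2))
penalty_sums(coef(fit)[-1])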

Ridge Regression

We first illustrate ridge regression, which can be fit using glmnet() with alpha = 0 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} \beta_j^2 . \]

Notice that the intercept is not penalized. Also, note that ridge regression is not scale invariant, unlike the usual unpenalized regression. Thankfully, glmnet() takes care of this internally: it automatically standardizes the predictors for fitting, then reports the fitted coefficients on the original scale.
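A minimal sketch making that default explicit; fit_ridge_std is just an illustrative name, and the fit is the same as the default call used below.

# standardize = TRUE is already the default; set it to FALSE only if the columns of X are already on a common scale
fit_ridge_std = glmnet(X, y, alpha = 0, standardize = TRUE)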

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice none of the coefficients are forced to be zero.

library(glmnet)
par(mfrow = c(1, 2))
fit_ridge = glmnet(X, y, alpha = 0)
plot(fit_ridge, xvar = "lambda", label = TRUE)
plot(fit_ridge)

We use cross-validation to select a good \(\lambda\) value. The cv.glmnet() function uses 10 folds by default. The plot illustrates the MSE for the \(\lambda\)s considered. Two vertical lines are drawn. The first is at the \(\lambda\) that gives the smallest MSE. The second is at the largest \(\lambda\) whose MSE is within one standard error of the smallest.

fit_ridge_cv = cv.glmnet(X, y, alpha = 0)
plot(fit_ridge_cv)
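The folds are assigned at random, so the selected \(\lambda\) values vary slightly from run to run. A sketch of making the choice reproducible; the seed is arbitrary and fit_ridge_cv_seeded is an illustrative name.

set.seed(42)                                                    # arbitrary seed, for reproducible fold assignment
fit_ridge_cv_seeded = cv.glmnet(X, y, alpha = 0, nfolds = 10)   # nfolds = 10 is the default
c(fit_ridge_cv_seeded$lambda.min, fit_ridge_cv_seeded$lambda.1se)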

The cv.glmnet() function returns several details of the fit for both \(\lambda\) values in the plot. Notice the penalty terms are smaller than for the full linear regression. (As we would expect.)

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_ridge_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 199.418114
AtBat         0.093427
Hits          0.389767
HmRun         1.212875
Runs          0.623229
RBI           0.618548
Walks         0.810468
Years         2.544171
CAtBat        0.007897
CHits         0.030555
CHmRun        0.226546
CRuns         0.061266
CRBI          0.063385
CWalks        0.060720
LeagueN       3.743295
DivisionW   -23.545192
PutOuts       0.056202
Assists       0.007879
Errors       -0.164203
NewLeagueN    3.313773
# fitted coefficients, using minimum lambda
coef(fit_ridge_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)   76.4582387
AtBat         -0.6315180
Hits           2.6421596
HmRun         -1.3882331
Runs           1.0457295
RBI            0.7315713
Walks          3.2780010
Years         -8.7237336
CAtBat         0.0001256
CHits          0.1318975
CHmRun         0.6895578
CRuns          0.2830055
CRBI           0.2514905
CWalks        -0.2599851
LeagueN       52.3371962
DivisionW   -122.4169978
PutOuts        0.2623667
Assists        0.1629044
Errors        -3.6440019
NewLeagueN   -17.0259792
# penalty term using minimum lambda
sum(coef(fit_ridge_cv, s = "lambda.min")[-1] ^ 2)
[1] 18127
# fitted coefficients, using 1-SE rule lambda
coef(fit_ridge_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 199.418114
AtBat         0.093427
Hits          0.389767
HmRun         1.212875
Runs          0.623229
RBI           0.618548
Walks         0.810468
Years         2.544171
CAtBat        0.007897
CHits         0.030555
CHmRun        0.226546
CRuns         0.061266
CRBI          0.063385
CWalks        0.060720
LeagueN       3.743295
DivisionW   -23.545192
PutOuts       0.056202
Assists       0.007879
Errors       -0.164203
NewLeagueN    3.313773
# penalty term using 1-SE rule lambda
sum(coef(fit_ridge_cv, s = "lambda.1se")[-1] ^ 2)
[1] 589
# predict using minimum lambda
predict(fit_ridge_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_ridge_cv, X)
# calculate "train error"
mean((y - predict(fit_ridge_cv, X)) ^ 2)
[1] 130405
# CV-RMSE using minimum lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.min])
[1] 343.3
# CV-RMSE using 1-SE rule lambda
sqrt(fit_ridge_cv$cvm[fit_ridge_cv$lambda == fit_ridge_cv$lambda.1se]) 
[1] 367.9

Lasso

We now illustrate lasso, which can be fit using glmnet() with alpha = 1 and seeks to minimize

\[ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right) ^ 2 + \lambda \sum_{j=1}^{p} |\beta_j| . \]

Like ridge, lasso is not scale invariant.

The two plots illustrate how much the coefficients are penalized for different values of \(\lambda\). Notice some of the coefficients are forced to be zero.

par(mfrow = c(1, 2))
fit_lasso = glmnet(X, y, alpha = 1)
plot(fit_lasso, xvar = "lambda", label = TRUE)
plot(fit_lasso)

Again, to actually pick a \(\lambda\), we will use cross-validation. The plot is similar to the ridge plot. Notice that the numbers along the top give the number of features in the model, which, unlike in the ridge plot, changes along the path.

fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)

cv.glmnet() returns several details of the fit for both \(\lambda\) values in the plot. Notice the penalty terms are again smaller than for the full linear regression. (As we would expect.) Some coefficients are now exactly 0.

# fitted coefficients, using 1-SE rule lambda, default behavior
coef(fit_lasso_cv)
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 167.91203
AtBat         .      
Hits          1.29270
HmRun         .      
Runs          .      
RBI           .      
Walks         1.39818
Years         .      
CAtBat        .      
CHits         .      
CHmRun        .      
CRuns         0.14168
CRBI          0.32193
CWalks        .      
LeagueN       .      
DivisionW     .      
PutOuts       0.04675
Assists       .      
Errors        .      
NewLeagueN    .      
# fitted coefficients, using minimum lambda
coef(fit_lasso_cv, s = "lambda.min")
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  139.50053399
AtBat         -1.73412453
Hits           6.05986118
HmRun          0.17081018
Runs           .         
RBI            .         
Walks          5.05382556
Years        -10.61379198
CAtBat        -0.00002484
CHits          .         
CHmRun         0.57214145
CRuns          0.72181056
CRBI           0.38563389
CWalks        -0.60336657
LeagueN       33.39735185
DivisionW   -119.39235545
PutOuts        0.27725600
Assists        0.21048279
Errors        -2.37146800
NewLeagueN     .         
# penalty term using minimum lambda
sum(coef(fit_lasso_cv, s = "lambda.min")[-1] ^ 2)
[1] 15555
# fitted coefficients, using 1-SE rule lambda
coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 167.91203
AtBat         .      
Hits          1.29270
HmRun         .      
Runs          .      
RBI           .      
Walks         1.39818
Years         .      
CAtBat        .      
CHits         .      
CHmRun        .      
CRuns         0.14168
CRBI          0.32193
CWalks        .      
LeagueN       .      
DivisionW     .      
PutOuts       0.04675
Assists       .      
Errors        .      
NewLeagueN    .      
# penalty term using 1-SE rule lambda
sum(coef(fit_lasso_cv, s = "lambda.1se")[-1] ^ 2)
[1] 3.752
# predict using minimum lambda
predict(fit_lasso_cv, X, s = "lambda.min")
# predict using 1-SE rule lambda, default behavior
predict(fit_lasso_cv, X)
# calculate "train error"
mean((y - predict(fit_lasso_cv, X)) ^ 2)
[1] 123931

broom

Sometimes, the output from glmnet() can be overwhelming. The broom package can help with that.

library(broom)
# the output from the commented line would be immense
# fit_lasso_cv
tidy(fit_lasso_cv)
     lambda estimate std.error conf.high conf.low nzero
1  255.2821   203503     24725    228228   178778     0
2  232.6035   196613     24396    221009   172217     1
3  211.9397   187747     23166    210914   164581     2
4  193.1115   180470     22173    202642   158297     2
5  175.9560   174208     21479    195687   152729     3
6  160.3246   167461     21072    188532   146389     4
7  146.0818   160814     20875    181689   139940     4
8  133.1043   154965     20774    175739   134191     4
9  121.2797   149748     20652    170399   129096     4
10 110.5055   145352     20579    165931   124773     4
11 100.6885   141690     20589    162279   121101     5
12  91.7436   138396     20679    159075   117717     5
13  83.5934   135528     20770    156298   114758     5
14  76.1672   132988     20858    153846   112130     5
15  69.4007   130716     20990    151707   109726     6
16  63.2353   128572     21138    149709   107434     6
17  57.6177   126671     21257    147928   105414     6
18  52.4991   125000     21312    146313   103688     6
19  47.8352   123547     21321    144868   102226     6
20  43.5857   122295     21329    143625   100966     6
21  39.7136   121248     21345    142593    99903     6
22  36.1856   120372     21366    141738    99006     6
23  32.9710   119655     21396    141050    98259     6
24  30.0419   119086     21435    140521    97651     6
25  27.3731   118650     21484    140134    97166     6
26  24.9413   118320     21536    139856    96784     6
27  22.7256   118095     21591    139685    96504     6
28  20.7067   117934     21638    139572    96295     6
29  18.8672   117820     21682    139502    96138     6
30  17.1911   117783     21739    139522    96044     7
31  15.6639   117767     21794    139562    95973     7
32  14.2723   117771     21844    139615    95927     7
33  13.0044   117793     21888    139681    95905     9
34  11.8491   117788     21932    139720    95856     9
35  10.7965   117765     21976    139741    95789     9
36   9.8374   117833     22064    139897    95770     9
37   8.9634   118080     22174    140253    95906     9
38   8.1672   118425     22259    140684    96167    11
39   7.4416   118467     22291    140759    96176    11
40   6.7805   118291     22307    140599    95984    12
41   6.1782   118044     22330    140374    95714    12
42   5.6293   117671     22378    140048    95293    13
43   5.1292   117054     22409    139463    94646    13
44   4.6735   116338     22421    138759    93917    13
45   4.2584   115665     22431    138096    93235    13
46   3.8801   115111     22442    137554    92669    13
47   3.5354   114672     22458    137130    92213    13
48   3.2213   114335     22477    136811    91858    13
49   2.9351   114134     22489    136623    91645    13
50   2.6744   114015     22499    136513    91516    13
51   2.4368   113912     22483    136395    91429    13
52   2.2203   113829     22443    136272    91385    14
53   2.0231   113803     22410    136213    91393    15
54   1.8433   113894     22380    136275    91514    15
55   1.6796   114042     22335    136377    91707    17
56   1.5304   114217     22306    136523    91910    17
57   1.3944   114354     22293    136646    92061    17
58   1.2705   114440     22292    136732    92148    17
59   1.1577   114486     22297    136783    92189    17
60   1.0548   114525     22308    136832    92217    17
61   0.9611   114586     22320    136906    92267    17
62   0.8757   114608     22317    136925    92291    17
63   0.7979   114636     22318    136954    92318    17
64   0.7271   114667     22317    136984    92350    17
65   0.6625   114695     22316    137012    92379    18
66   0.6036   114726     22317    137043    92410    18
67   0.5500   114766     22317    137082    92449    18
68   0.5011   114807     22317    137124    92490    17
69   0.4566   114858     22314    137172    92543    18
70   0.4160   114894     22314    137208    92581    18
71   0.3791   114949     22324    137272    92625    18
72   0.3454   115001     22336    137337    92665    18
73   0.3147   115047     22347    137394    92699    18
74   0.2868   115104     22355    137459    92749    18
75   0.2613   115137     22365    137502    92772    18
# the two lambda values of interest
glance(fit_lasso_cv) 
  lambda.min lambda.1se
1      2.023      83.59
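Because tidy() returns an ordinary data frame, we can re-plot the CV curve ourselves with ggplot2. A sketch; cv_df is an illustrative name.

library(ggplot2)
cv_df = tidy(fit_lasso_cv)
ggplot(cv_df, aes(x = log(lambda), y = estimate)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), alpha = 0.3) +   # CV error bars
  geom_point() +                                                         # CV MSE estimates
  geom_vline(xintercept = log(glance(fit_lasso_cv)$lambda.min), linetype = "dashed")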

Lecture 2

Simulated Data, \(p > n\)

Aside from simply shrinking coefficients (ridge) and setting some coefficients to 0 (lasso), penalized regression also has the advantage of being able to handle the \(p > n\) case.

set.seed(1234)
n = 1000
p = 5500
X = replicate(p, rnorm(n = n))
beta = c(1, 1, 1, rep(0, 5497))
z = X %*% beta
prob = exp(z) / (1 + exp(z))
y = as.factor(rbinom(length(z), size = 1, prob = prob))

Above we simulated a classification example where \(p > n\). An ordinary (unpenalized) logistic regression will not converge here:

# glm(y ~ X, family = "binomial")
# will not converge

We then use a lasso penalty to fit penalized logistic regression. This minimizes

\[ \sum_{i=1}^{n} L\left(y_i, \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) + \lambda \sum_{j=1}^{p} |\beta_j| \]

where \(L\) is the appropriate negative log-likelihood.

library(glmnet)
fit_cv = cv.glmnet(X, y, family = "binomial", alpha = 1)
plot(fit_cv)

head(coef(fit_cv), n = 10)
10 x 1 sparse Matrix of class "dgCMatrix"
                  1
(Intercept) 0.02397
V1          0.59675
V2          0.56252
V3          0.60065
V4          .      
V5          .      
V6          .      
V7          .      
V8          .      
V9          .      
fit_cv$nzero
 s0  s1  s2  s3  s4  s5  s6  s7  s8  s9 s10 s11 s12 s13 s14 s15 s16 s17 s18 s19 
  0   2   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
s20 s21 s22 s23 s24 s25 s26 s27 s28 s29 s30 s31 s32 s33 s34 s35 s36 s37 s38 s39 
  3   3   3   3   3   3   3   3   3   3   4   6   7  10  18  24  35  54  65  75 
s40 s41 s42 s43 s44 s45 s46 s47 s48 s49 s50 s51 s52 s53 s54 s55 s56 s57 s58 s59 
 86 100 110 129 147 168 187 202 221 241 254 269 283 298 310 324 333 350 364 375 
s60 s61 s62 s63 s64 s65 s66 s67 s68 s69 s70 s71 s72 s73 s74 s75 s76 s77 s78 s79 
387 400 411 429 435 445 453 455 462 466 475 481 487 491 496 498 502 504 512 518 
s80 s81 s82 s83 s84 s85 s86 s87 s88 s89 s90 s91 s92 s93 s94 s95 s96 s97 s98 
523 526 528 536 543 550 559 561 563 566 570 571 576 582 586 590 596 596 600 

Notice that only the first three predictors we generated are truly related to the response, and that is exactly what the suggested (1-SE) model finds.

fit_1se = glmnet(X, y, family = "binomial", lambda = fit_cv$lambda.1se)
which(as.vector(as.matrix(fit_1se$beta)) != 0)
[1] 1 2 3
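Equivalently, we can read the nonzero coefficients straight off the cross-validation object. A sketch; coef() uses s = "lambda.1se" by default.

# indices of the nonzero (non-intercept) coefficients at lambda.1se
which(as.vector(coef(fit_cv, s = "lambda.1se"))[-1] != 0)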

We can also see in the following plots that the three relevant features enter the model well ahead of the irrelevant features.

par(mfrow = c(1, 2))
plot(glmnet(X, y, family = "binomial"))
plot(glmnet(X, y, family = "binomial"), xvar = "lambda")

We can extract the two relevant \(\lambda\) values.

fit_cv$lambda.min
[1] 0.03087
fit_cv$lambda.1se
[1] 0.0515

Since cv.glmnet() does not calculate prediction accuracy for classification, we take the \(\lambda\) values and create a grid for caret to search in order to obtain prediction accuracy with train(). We set \(\alpha = 1\) in this grid, although glmnet can actually tune over the \(\alpha\) parameter as well. (More on that later.)

Note that we have to force y to be a factor so that train() recognizes we want a binomial response. The train() function in caret uses the type of y to determine whether to use family = "binomial" or family = "gaussian".

library(caret)
cv_5 = trainControl(method = "cv", number = 5)
lasso_grid = expand.grid(alpha = 1, 
                         lambda = c(fit_cv$lambda.min, fit_cv$lambda.1se))
lasso_grid
  alpha  lambda
1     1 0.03087
2     1 0.05150
sim_data = data.frame(y, X)
fit_lasso = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneGrid = lasso_grid
)
fit_lasso$results
  alpha  lambda Accuracy  Kappa AccuracySD KappaSD
1     1 0.03087   0.7679 0.5358    0.03430 0.06845
2     1 0.05150   0.7689 0.5378    0.02807 0.05596

The interaction between the glmnet and caret packages is sometimes frustrating, but for obtaining results at particular values of \(\lambda\), we see it can be used easily.
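As promised above, caret can also tune over \(\alpha\) rather than fixing it. A sketch in which we let caret build its own grid; fit_enet is an illustrative name, and with p = 5500 predictors this is slow.

# hypothetical follow-up: let caret pick both alpha and lambda from an automatic grid
fit_enet = train(
  y ~ ., data = sim_data,
  method = "glmnet",
  trControl = cv_5,
  tuneLength = 5
)
# fit_enet$bestTune then reports the chosen (alpha, lambda) pair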

Feature selection example

Here we analyze data from a microarray study of Singh et al. (2002). This dataset contains gene expression measurements for 6033 genes on 102 observations: 52 prostate cancer patients and 50 healthy men. The goal is to see which genes are most associated with the cancer diagnosis.

library(sda)
data("singh2002")
dim(singh2002$y) # NULL: y is a factor, so it has no dim; length() would give the number of observations
NULL
dim(singh2002$x) 
[1]  102 6033

# pmax caps the number of nonzero coefficients along the path at 20
fit <- glmnet(singh2002$x, singh2002$y, family = "binomial", pmax = 20, alpha = 1)
Warning: from glmnet Fortran code (error code -10015); Number of nonzero
coefficients along the path exceeds pmax=20 at 15th lambda value; solutions for
larger lambdas returned
plot(fit,label=TRUE,xvar="lambda")

# alpha = 0.99: essentially a lasso penalty, with a small ridge component to help with correlated genes
fit_lasso_cv = cv.glmnet(singh2002$x, singh2002$y, family = "binomial",
                         nfolds = 5, alpha = 0.99, type.measure = "class")
coefs <- coef(fit_lasso_cv)
summary(coefs)
6034 x 1 sparse Matrix of class "dgCMatrix", with 44 entries 
      i j         x
1     1 1 -0.493161
2   299 1  0.027846
3   333 1 -0.307869
4   365 1  0.040395
5   378 1 -0.053895
6   453 1 -0.110068
7   580 1 -0.168770
8   611 1 -0.373868
9   703 1  0.009255
10  722 1 -0.015974
11  740 1  0.001608
12  806 1  0.023213
13  915 1 -0.207116
14  922 1  0.154351
15 1069 1 -0.250180
16 1078 1 -0.140912
17 1090 1 -0.207007
18 1114 1 -0.064859
19 1558 1 -0.030528
20 1590 1  0.092322
21 1721 1 -0.414631
22 1967 1 -0.032548
23 2392 1  0.003964
24 2853 1 -0.038024
25 2869 1 -0.060466
26 3018 1  0.101644
27 3188 1  0.001800
28 3283 1 -0.015787
29 3293 1  0.020324
30 3376 1 -0.120177
31 3648 1 -0.121972
32 3697 1  0.003471
33 3880 1 -0.021968
34 3941 1  0.096150
35 4074 1  0.056335
36 4089 1  0.037156
37 4105 1  0.025749
38 4155 1  0.048707
39 4317 1  0.090442
40 4332 1  0.158554
41 4397 1  0.088418
42 4519 1 -0.235190
43 4547 1  0.064363
44 4982 1 -0.089703
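To get back to the original question of which genes were selected, a quick sketch of pulling out the nonzero (non-intercept) entries; selected_genes is an illustrative name.

# column indices of the genes with nonzero coefficients (the summary above showed 44 entries including the intercept)
selected_genes = which(as.vector(coefs)[-1] != 0)
length(selected_genes)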

Lecture 3

More complete analysis of Hitters data

Let’s look more carefully at the results of the LASSO regression for the Hitters data. The plotres() function from the plotmo package gives some useful diagnostic graphs.

library(plotmo)
X = model.matrix(Salary ~ ., Hitters)[, -1]
y = Hitters$Salary
fit_lasso_cv = cv.glmnet(X, y, alpha = 1)
plot(fit_lasso_cv)
plotres(fit_lasso_cv)

I see some definite heteroscedasticity, which is not surprising for a variable like Salary; income data is often very right-skewed and tends to have non-constant variance. Let’s check to be sure:

hist(Hitters$Salary)

Yup! Let’s try a log transformation to make the response more normal, and see if this fixes the heteroscedasticity.

hist(log(Hitters$Salary))

Looks better already. Let’s do the fit…

logHitters <- dplyr::mutate(Hitters, logSalary = log(Salary))
#Xlog = model.matrix(logSalary ~ . - Salary, logHitters)[, -1]
ylog = logHitters$logSalary

log_fit_lasso_cv = cv.glmnet(X, ylog, alpha = 1)
plotres(log_fit_lasso_cv)

Did using the log transformation change the predictors found by the LASSO fit? Let’s see…

coef(fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 193.74264
AtBat         .      
Hits          1.21471
HmRun         .      
Runs          .      
RBI           .      
Walks         1.28958
Years         .      
CAtBat        .      
CHits         .      
CHmRun        .      
CRuns         0.12924
CRBI          0.31516
CWalks        .      
LeagueN       .      
DivisionW     .      
PutOuts       0.02533
Assists       .      
Errors        .      
NewLeagueN    .      

coef(log_fit_lasso_cv, s = "lambda.1se")
20 x 1 sparse Matrix of class "dgCMatrix"
                    1
(Intercept) 4.9970914
AtBat       .        
Hits        0.0033360
HmRun       .        
Runs        .        
RBI         0.0003002
Walks       0.0029740
Years       0.0080495
CAtBat      .        
CHits       0.0003775
CHmRun      .        
CRuns       0.0002244
CRBI        0.0000610
CWalks      .        
LeagueN     .        
DivisionW   .        
PutOuts     .        
Assists     .        
Errors      .        
NewLeagueN  .        

Yes, the selected predictors and their coefficients aren’t exactly the same.
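A sketch of checking this directly rather than eyeballing the two printouts; sel_raw and sel_log are illustrative names.

# predictor names with nonzero coefficients at lambda.1se, for each fit
sel_raw = rownames(coef(fit_lasso_cv, s = "lambda.1se"))[as.vector(coef(fit_lasso_cv, s = "lambda.1se")) != 0]
sel_log = rownames(coef(log_fit_lasso_cv, s = "lambda.1se"))[as.vector(coef(log_fit_lasso_cv, s = "lambda.1se")) != 0]
setdiff(sel_raw, sel_log)   # kept for Salary but not for log(Salary)
setdiff(sel_log, sel_raw)   # kept for log(Salary) but not for Salary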

Now I’m concerned about outliers:

plotres(log_fit_lasso_cv,which=3)

What is going on with Mike Schmidt? He is observation #173. Let’s look at a few observations around him:

Hitters[173:175,c(1:8,19)] # Just show a few predictors for space reasons
                  AtBat Hits HmRun Runs RBI Walks Years CAtBat Salary
-Mike Schmidt        20    1     0    0   0     0     2     41   2127
-Mike Scioscia      374   94     5   36  26    62     7   1968    875
-Mickey Tettleton   211   43    10   26  35    39     3    498    120
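If we wanted to locate him programmatically instead of counting rows, a sketch (this assumes the row names carry the player names exactly as printed above):

which(rownames(Hitters) == "-Mike Schmidt")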

He’s had only 1 hit, 0 home runs, 0 runs, and has only been in the league 2 years, yet his salary seems very high. Let’s confirm that:

hist(Hitters$Salary)

He is almost the highest-paid player, according to this data. What is going on? Let’s Google Mike Schmidt.

[Photo: Mike Schmidt]

Mike Schmidt played with the Phillies starting in 1972, so he’d been playing for 14 years by 1986. According to Wikipedia:

“Schmidt was a twelve-time All-Star and a three-time winner of the National League (NL) Most Valuable Player award (MVP), and he was known for his combination of power hitting and strong defense: as a hitter, he compiled 548 home runs and 1,595 runs batted in (RBIs), and led the NL in home runs eight times and in RBIs four times.”

This all explains why Schmidt was paid so much, but not why he’s listed as having 0 runs and 0 home runs. I’m going to conclude that there’s something wrong with this data, and run the fit without Schmidt.

#Xsans = model.matrix(logSalary ~ . - Salary, logHitters[-173,])[, -1]
ysans = logHitters[-173,]$logSalary

log_fit_lasso_cv_sans = cv.glmnet(X[-173,], ysans, alpha = 1)
plotres(log_fit_lasso_cv_sans)

Terry Kennedy (observation #241) seems to be a similar situation, so let’s remove him as well.

#Xsans2 = model.matrix(logSalary ~ . - Salary, logHitters[-c(173,241),])[, -1]
ysans2 = logHitters[-c(173,241),]$logSalary

log_fit_lasso_cv_sans2 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 1)
plotres(log_fit_lasso_cv_sans2)

Steve Sax is a bit of an outlier here, but I don’t see anything wrong with his data. Also, I doubt he is very influential, since he’s near the middle of the fitted value chart.

Let’s look at the final coefficients:

coef(log_fit_lasso_cv_sans2)
20 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  4.60250178
AtBat        .         
Hits         0.00509315
HmRun        .         
Runs         .         
RBI          0.00055799
Walks        0.00463658
Years        0.03839075
CAtBat       .         
CHits        0.00031499
CHmRun       .         
CRuns        0.00009165
CRBI         .         
CWalks       .         
LeagueN      .         
DivisionW   -0.00915693
PutOuts      .         
Assists      .         
Errors       .         
NewLeagueN   .         

Would we get similar values if we did a non-regularized fit with the same predictors? (A bonus of lm(): we get nicer diagnostic plots by default.)


linfit2 <- lm(log(Salary) ~ Hits + RBI + Walks + Years + CHits + CRuns + Division, data = Hitters[-c(173, 241), ])
summary(linfit2)

Call:
lm(formula = log(Salary) ~ Hits + RBI + Walks + Years + CHits + 
    CRuns + Division, data = Hitters[-c(173, 241), ])

Residuals:
   Min     1Q Median     3Q    Max 
-2.064 -0.446  0.068  0.416  1.335 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.276127   0.142734   29.96  < 2e-16 ***
Hits         0.006435   0.001464    4.40 0.000016 ***
RBI          0.001019   0.002317    0.44  0.66058    
Walks        0.007025   0.002321    3.03  0.00273 ** 
Years        0.067816   0.019197    3.53  0.00049 ***
CHits        0.000399   0.000386    1.03  0.30206    
CRuns       -0.000288   0.000696   -0.41  0.67941    
DivisionW   -0.169181   0.072177   -2.34  0.01985 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.574 on 253 degrees of freedom
Multiple R-squared:  0.589, Adjusted R-squared:  0.578 
F-statistic: 51.9 on 7 and 253 DF,  p-value: <2e-16
plot(linfit2)

Are the coefficients from the linear fit and from the LASSO broadly similar?

coef(linfit2)
(Intercept)        Hits         RBI       Walks       Years       CHits 
  4.2761274   0.0064352   0.0010187   0.0070254   0.0678155   0.0003987 
      CRuns   DivisionW 
 -0.0002878  -0.1691809 
lassocoefs<-as.matrix(coef(log_fit_lasso_cv_sans2))
lassocoefs[lassocoefs!=0]
[1]  4.60250178  0.00509315  0.00055799  0.00463658  0.03839075  0.00031499
[7]  0.00009165 -0.00915693
# element 1 of each is the intercept; elements 2:8 are the seven shared predictors
plot(coef(linfit2)[2:8] ~ lassocoefs[lassocoefs != 0][2:8])
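A named side-by-side table may be easier to read than the scatterplot. A sketch; it relies on the lasso having kept exactly the predictors used in linfit2, as it did here, and lasso_nonzero is an illustrative name.

# line up the two sets of coefficients by predictor name
lasso_nonzero = lassocoefs[, 1][lassocoefs[, 1] != 0]
cbind(lm = coef(linfit2), lasso = lasso_nonzero[names(coef(linfit2))])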

Lastly, let’s try an elastic net fit to see if it will bring in both the career totals and the 1986 numbers for RBI and Walks.

log_fit_lasso_cv_sans3 = cv.glmnet(X[-c(173,241),], ysans2, alpha = 0.5)
plotres(log_fit_lasso_cv_sans3)

coef(log_fit_lasso_cv_sans3)
20 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 4.78008050
AtBat       .         
Hits        0.00337053
HmRun       .         
Runs        0.00160678
RBI         0.00104338
Walks       0.00338272
Years       0.02248827
CAtBat      0.00003187
CHits       0.00015698
CHmRun      .         
CRuns       0.00022782
CRBI        0.00012046
CWalks      .         
LeagueN     .         
DivisionW   .         
PutOuts     .         
Assists     .         
Errors      .         
NewLeagueN  .