The problem of overfitting

library(ggformula)    # provides gf_point() and gf_line()

N<-10
X<-runif(N)*10        # ten predictor values, uniform on (0, 10)
err<-rnorm(N)*20      # large noise (the exact output below depends on this random draw)
Y<-2+3*X-4*X^2+err    # true model: quadratic in X
out<-data.frame(X,Y)
gf_point(Y~X,data=out)

First we can try a simple linear fit, using the command fit0<-lm(Y~X)
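
To see how well it does, we can run that command and overlay the least-squares line on the scatterplot. A quick sketch (gf_lm() from ggformula adds the fitted line to the chained plot):

fit0<-lm(Y~X)
gf_point(Y~X,data=out) %>%
  gf_lm()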

That works OK, but we can do a better job if we include the additional predictor \(X^2\):

fit1<-lm(Y~X+I(X^2))

But what if we could get a perfect fit? We can! We only need to use even more predictors, like this:

fit2<-lm(Y~X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2))

gf_point(Y~X,data=out) %>%
  gf_line(predict(fit2)~X,data=out)
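
We can confirm the fit really is exact (up to numerical precision) by checking the residuals:

max(abs(resid(fit2)))   # essentially zero: the curve passes through every point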

Which of these three is the best fit? Let’s look at them individually:

summary(fit0)
## 
## Call:
## lm(formula = Y ~ X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.045 -14.538  -9.751  12.408  48.171 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    75.29      18.27   4.122  0.00333 ** 
## X             -38.54       2.91 -13.242 1.01e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.56 on 8 degrees of freedom
## Multiple R-squared:  0.9564, Adjusted R-squared:  0.9509 
## F-statistic: 175.3 on 1 and 8 DF,  p-value: 1.009e-06

This one looks OK: the slope is highly significant and \(R^2\) is about 0.96, even though we have left out the quadratic term.

summary(fit1)
## 
## Call:
## lm(formula = Y ~ X + I(X^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.060  -9.457  -2.537  14.995  26.721 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  29.1147    21.7957   1.336    0.223  
## X           -12.8141     9.7215  -1.318    0.229  
## I(X^2)       -2.4300     0.8951  -2.715    0.030 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.56 on 7 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9727 
## F-statistic: 161.2 on 2 and 7 DF,  p-value: 1.4e-06

Here the coefficients are quite far from the true values (2, 3, and −4), and neither the intercept nor the coefficient on X is significant, even though both belong in the true model. This is classic evidence of multicollinearity (why?). Let's check the variance inflation factors:

library(car)   # vif() comes from the car package
vif(fit1)
##       X  I(X^2) 
## 20.0416 20.0416
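
The VIFs are huge because X and \(X^2\) are nearly collinear over the observed range. With only two predictors, each VIF equals \(1/(1-r^2)\), where \(r\) is their correlation, so we can reproduce the value above directly:

r<-cor(out$X,out$X^2)   # X and X^2 are almost perfectly correlated on (0, 10)
1/(1-r^2)               # with two predictors this equals the VIF of each term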

Now let’s look at the last fit:

summary(fit2)
## 
## Call:
## lm(formula = Y ~ X + I(X^2) + I(X^3) + I(X^4) + I(X^5) + I(X^6) + 
##     I(X^7) + I(X^8) + I(X^-2))
## 
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.501e+05         NA      NA       NA
## X            2.368e+05         NA      NA       NA
## I(X^2)      -1.636e+05         NA      NA       NA
## I(X^3)       6.275e+04         NA      NA       NA
## I(X^4)      -1.448e+04         NA      NA       NA
## I(X^5)       2.055e+03         NA      NA       NA
## I(X^6)      -1.755e+02         NA      NA       NA
## I(X^7)       8.268e+00         NA      NA       NA
## I(X^8)      -1.649e-01         NA      NA       NA
## I(X^-2)      2.040e+04         NA      NA       NA
## 
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 9 and 0 DF,  p-value: NA
vif(fit2)
## Warning in cov2cor(v): diag(.) had 0 or NA entries; non-finite result is
## doubtful
##       X  I(X^2)  I(X^3)  I(X^4)  I(X^5)  I(X^6)  I(X^7)  I(X^8) I(X^-2) 
##     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
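
The NAs are not a numerical accident: fit2 estimates ten coefficients from only ten observations, so the model is saturated, leaving no residual degrees of freedom for standard errors (or VIFs). The design matrix is literally square:

dim(model.matrix(fit2))   # 10 rows and 10 columns: one parameter per observation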

A model that merely memorizes the training data like this is useless for prediction: it chases the noise and will generalize badly to new observations. We can use regularization instead. First we do ridge regression, which adds a penalty proportional to \(\sum_j \beta_j^2\) to the least-squares criterion:

library(glmnet)
mm = model.matrix(Y ~ X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2), out)[, -1]
fit_ridge_cv = cv.glmnet(mm, Y, alpha = 0)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
coef(fit_ridge_cv, s="lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -9.049835e+01
## X           -2.252878e+00
## I(X^2)      -2.035648e-01
## I(X^3)      -1.996791e-02
## I(X^4)      -2.005036e-03
## I(X^5)      -2.046299e-04
## I(X^6)      -2.116564e-05
## I(X^7)      -2.212112e-06
## I(X^8)      -2.329337e-07
## I(X^-2)      4.919327e+00
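
Ridge keeps every term but shrinks the higher powers toward zero. The amount of shrinkage is set by the cross-validated \(\lambda\); if we want to see which value was used, or the whole CV curve, we can ask glmnet directly:

fit_ridge_cv$lambda.1se   # the lambda used above: the largest lambda within 1 SE of the CV minimum
fit_ridge_cv$lambda.min   # the lambda with the smallest cross-validated error
plot(fit_ridge_cv)        # CV error as a function of log(lambda)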

Now we do LASSO regression, which instead uses a \(\sum_j |\beta_j|\) penalty and can shrink coefficients all the way to zero:

mm = model.matrix(Y ~ X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2), out)[, -1]
fit_lasso_cv = cv.glmnet(mm, Y, alpha = 1)
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
coef(fit_lasso_cv,s="lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)  10.254622
## X           -10.924940
## I(X^2)       -2.215631
## I(X^3)        .       
## I(X^4)        .       
## I(X^5)        .       
## I(X^6)        .       
## I(X^7)        .       
## I(X^8)        .       
## I(X^-2)       .
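
The LASSO drops everything except X and \(X^2\), which matches the form of the true model. To watch the coefficients enter as the penalty is relaxed, we can plot the coefficient path of the underlying glmnet fit:

plot(fit_lasso_cv$glmnet.fit,xvar="lambda",label=TRUE)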

Finally, let's repeat the whole exercise with far more data (N = 1000) and much less noise:

N<-1000
X<-runif(N)*10
err<-rnorm(N)*2
Y<-2+3*X-4*X^2+err
out<-data.frame(X,Y)
mm = model.matrix(Y ~ X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2), out)[, -1]
fit_ridge_cv = cv.glmnet(mm, Y, alpha = 0)
coef(fit_ridge_cv, s = "lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  9.810829e+00
## X           -1.085430e+01
## I(X^2)      -9.560176e-01
## I(X^3)      -7.731467e-02
## I(X^4)      -5.726577e-03
## I(X^5)      -3.793373e-04
## I(X^6)      -2.028736e-05
## I(X^7)      -4.589448e-07
## I(X^8)       9.198727e-08
## I(X^-2)     -2.313281e-03
fit_lasso_cv = cv.glmnet(mm, Y, alpha = 1)
coef(fit_lasso_cv, s = "lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  2.28388581
## X            .         
## I(X^2)      -3.30996587
## I(X^3)      -0.03305665
## I(X^4)       .         
## I(X^5)       .         
## I(X^6)       .         
## I(X^7)       .         
## I(X^8)       .         
## I(X^-2)      .
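
With plenty of data, the LASSO again keeps only an intercept, \(X^2\), and a small \(X^3\) term, close to the true quadratic model \(Y = 2 + 3X - 4X^2 + \varepsilon\). As a last check we could overlay the regularized fit on the data. A quick sketch, where pred_lasso is a column added only for plotting:

out$pred_lasso<-as.numeric(predict(fit_lasso_cv,newx=mm,s="lambda.1se"))
gf_point(Y~X,data=out) %>%
  gf_line(pred_lasso~X,data=out[order(out$X),])   # sort by X so the line runs left to right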