library(ggformula)              # provides gf_point() and gf_line()
N <- 10                         # deliberately tiny sample
X <- runif(N) * 10              # predictor on (0, 10)
err <- rnorm(N) * 20            # large noise
Y <- 2 + 3*X - 4*X^2 + err      # the true model is quadratic in X
out <- data.frame(X, Y)
gf_point(Y ~ X, data = out)
First we try a simple linear fit:
fit0 <- lm(Y ~ X)
That works reasonably well, but we can do better by including \(X^2\) as an additional predictor:
fit1<-lm(Y~X+I(X^2))
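Since fit0 is nested inside fit1, we can also compare them head-to-head; a quick check (not shown in the original output):
anova(fit0, fit1)   # F-test: does adding X^2 significantly improve the fit?
AIC(fit0, fit1)     # information criterion: lower is better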
But what if we could get a perfect fit? We can! We only need to use even more predictors, like this:
fit2<-lm(Y~X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2))
gf_point(Y ~ X, data = out) %>%
  gf_line(predict(fit2) ~ X, data = out)
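Why is a perfect fit possible? Counting the intercept, fit2 has ten parameters for ten observations, so it can interpolate the data exactly. A quick check (not in the original output):
length(coef(fit2))   # 10 coefficients ...
nrow(out)            # ... for 10 observations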
Which of these three is the best fit? Let’s look at them individually:
summary(fit0)
##
## Call:
## lm(formula = Y ~ X)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.045 -14.538 -9.751 12.408 48.171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.29 18.27 4.122 0.00333 **
## X -38.54 2.91 -13.242 1.01e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.56 on 8 degrees of freedom
## Multiple R-squared: 0.9564, Adjusted R-squared: 0.9509
## F-statistic: 175.3 on 1 and 8 DF, p-value: 1.009e-06
This fit looks reasonable at first glance: both coefficients are highly significant and \(R^2 \approx 0.96\), even though we know the true relationship is quadratic.
summary(fit1)
##
## Call:
## lm(formula = Y ~ X + I(X^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.060 -9.457 -2.537 14.995 26.721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.1147 21.7957 1.336 0.223
## X -12.8141 9.7215 -1.318 0.229
## I(X^2) -2.4300 0.8951 -2.715 0.030 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.56 on 7 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9727
## F-statistic: 161.2 on 2 and 7 DF, p-value: 1.4e-06
Here the coefficients are far from the true values (2, 3, and \(-4\)), and two of the three aren't even significant. Evidence of multicollinearity! (Why?) Let's check the variance inflation factors:
library(car)   # provides vif()
vif(fit1)
## X I(X^2)
## 20.0416 20.0416
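The inflated VIFs arise because, over the range \((0, 10)\), X and \(X^2\) move almost in lockstep. A quick check (not part of the original output):
cor(out$X, out$X^2)   # nearly 1, which is what inflates the standard errors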
Now let’s look at the last fit:
summary(fit2)
##
## Call:
## lm(formula = Y ~ X + I(X^2) + I(X^3) + I(X^4) + I(X^5) + I(X^6) +
## I(X^7) + I(X^8) + I(X^-2))
##
## Residuals:
## ALL 10 residuals are 0: no residual degrees of freedom!
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.501e+05 NA NA NA
## X 2.368e+05 NA NA NA
## I(X^2) -1.636e+05 NA NA NA
## I(X^3) 6.275e+04 NA NA NA
## I(X^4) -1.448e+04 NA NA NA
## I(X^5) 2.055e+03 NA NA NA
## I(X^6) -1.755e+02 NA NA NA
## I(X^7) 8.268e+00 NA NA NA
## I(X^8) -1.649e-01 NA NA NA
## I(X^-2) 2.040e+04 NA NA NA
##
## Residual standard error: NaN on 0 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: NaN
## F-statistic: NaN on 9 and 0 DF, p-value: NA
vif(fit2)
## Warning in cov2cor(v): diag(.) had 0 or NA entries; non-finite result is
## doubtful
## X I(X^2) I(X^3) I(X^4) I(X^5) I(X^6) I(X^7) I(X^8) I(X^-2)
## NaN NaN NaN NaN NaN NaN NaN NaN NaN
With ten coefficients estimated from ten observations, fit2 interpolates the data exactly: there are no residual degrees of freedom left, so standard errors, \(R^2\), and VIFs are all undefined. Instead of throwing predictors away by hand, we can use regularization. First we do ridge regression, which penalizes the sum of squared coefficients \(\sum_j \beta_j^2\) (an \(L_2\) penalty):
library(glmnet)
# Build the predictor matrix; drop the intercept column, since glmnet adds its own
mm <- model.matrix(Y ~ X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2), out)[, -1]
fit_ridge_cv <- cv.glmnet(mm, Y, alpha = 0)   # alpha = 0 selects the ridge penalty
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
coef(fit_ridge_cv, s="lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -9.049835e+01
## X -2.252878e+00
## I(X^2) -2.035648e-01
## I(X^3) -1.996791e-02
## I(X^4) -2.005036e-03
## I(X^5) -2.046299e-04
## I(X^6) -2.116564e-05
## I(X^7) -2.212112e-06
## I(X^8) -2.329337e-07
## I(X^-2) 4.919327e+00
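It is worth checking how strong a penalty cross-validation actually picked; a quick sketch (not in the original output):
fit_ridge_cv$lambda.1se   # largest lambda whose CV error is within 1 SE of the minimum
plot(fit_ridge_cv)        # CV error as a function of log(lambda)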
Now we do LASSO regression, which penalizes the sum of absolute coefficients \(\sum_j |\beta_j|\) (an \(L_1\) penalty). The model matrix mm is unchanged, so we can reuse it:
fit_lasso_cv <- cv.glmnet(mm, Y, alpha = 1)   # alpha = 1 selects the LASSO penalty
## Warning: Option grouped=FALSE enforced in cv.glmnet, since < 3 observations
## per fold
coef(fit_lasso_cv,s="lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 10.254622
## X -10.924940
## I(X^2) -2.215631
## I(X^3) .
## I(X^4) .
## I(X^5) .
## I(X^6) .
## I(X^7) .
## I(X^8) .
## I(X^-2) .
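To see what these shrunken coefficient vectors imply as fitted curves, we can predict both models on a grid and overlay them on the data. This is a sketch (none of it appears in the original output, and the names Xg, newdat, mmg, and curves are ours); the grid starts above zero because the model matrix includes \(X^{-2}\):
Xg     <- seq(0.5, 10, length.out = 200)   # avoid X = 0 because of the X^-2 term
newdat <- data.frame(X = Xg, Y = 0)        # Y is a dummy column for model.matrix()
mmg    <- model.matrix(Y ~ X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2), newdat)[, -1]
curves <- data.frame(
  X     = Xg,
  ridge = as.numeric(predict(fit_ridge_cv, newx = mmg, s = "lambda.1se")),
  lasso = as.numeric(predict(fit_lasso_cv, newx = mmg, s = "lambda.1se"))
)
gf_point(Y ~ X, data = out) %>%
  gf_line(ridge ~ X, data = curves, color = "blue") %>%
  gf_line(lasso ~ X, data = curves, color = "red")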
With only ten noisy observations, neither penalty can recover the true coefficients. Let's repeat the experiment with far more data and much less noise:
N <- 1000                       # large sample this time
X <- runif(N) * 10
err <- rnorm(N) * 2             # a tenth of the previous noise level
Y <- 2 + 3*X - 4*X^2 + err      # same true quadratic model
out <- data.frame(X, Y)
mm <- model.matrix(Y ~ X+I(X^2)+I(X^3)+I(X^4)+I(X^5)+I(X^6)+I(X^7)+I(X^8)+I(X^-2), out)[, -1]
fit_ridge_cv <- cv.glmnet(mm, Y, alpha = 0)   # ridge again, on the new data
coef(fit_ridge_cv, s = "lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 9.810829e+00
## X -1.085430e+01
## I(X^2) -9.560176e-01
## I(X^3) -7.731467e-02
## I(X^4) -5.726577e-03
## I(X^5) -3.793373e-04
## I(X^6) -2.028736e-05
## I(X^7) -4.589448e-07
## I(X^8) 9.198727e-08
## I(X^-2) -2.313281e-03
Ridge shrinks all nine coefficients but sets none of them to zero; only the low-order terms keep any appreciable size. Now the LASSO on the same data:
fit_lasso_cv <- cv.glmnet(mm, Y, alpha = 1)
coef(fit_lasso_cv, s = "lambda.1se")
## 10 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 2.28388581
## X .
## I(X^2) -3.30996587
## I(X^3) -0.03305665
## I(X^4) .
## I(X^5) .
## I(X^6) .
## I(X^7) .
## I(X^8) .
## I(X^-2) .
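With plenty of data and little noise, the LASSO now does roughly the right thing: it zeroes out nearly all of the spurious terms, and the surviving quadratic coefficient (\(-3.31\)) is close to the true \(-4\), though the penalty still shrinks it toward zero (and drops X entirely, despite its true coefficient of 3). Ridge, by contrast, keeps every term nonzero but drives the high-order coefficients toward zero.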