Collinearity (also called multicollinearity) refers to the situation where
some or all predictors in the model are substantially correlated with each other.
Linear relationships among predictors might not be only pairwise.
Perhaps one of the predictors is linearly related to some subset of the other predictors.
Separate (and allegedly equal) education still leads to differences in educational achievement.
Question: How is student achievement related to school resources?
A sample of 70 schools and their political/social context resulted in data on many variables.
Data were combined into indices/measures of achievement and school characteristics.
For achievement, high values are good.
The predictors are expected to be positively correlated with achievement.
\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)
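These notes use functions from several packages; a minimal setup chunk (assuming the packages are installed) would be:

```r
# Packages used throughout these notes
library(dplyr)    # glimpse()
library(broom)    # tidy(), glance()
library(GGally)   # ggpairs()
library(knitr)    # kable()
library(car)      # vif() (also loaded later, where the VIF is introduced)
```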
achvData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P236.txt")
glimpse(achvData)
Observations: 70
Variables: 4
$ ACHV <dbl> -0.4315, 0.7997, -0.9247, -2.1908, -2.8482, -0.6623, 2.6367, 2…
$ FAM <dbl> 0.6081, 0.7937, -0.8263, -1.2531, 0.1740, 0.2025, 0.2418, 0.59…
$ PEER <dbl> 0.03509, 0.47924, -0.61951, -1.21675, -0.18517, 0.12764, -0.09…
$ SCHOOL <dbl> 0.16607, 0.53356, -0.78635, -1.04076, 0.14229, 0.27311, 0.0496…
head(achvData)
ACHV | FAM | PEER | SCHOOL |
---|---|---|---|
-0.4315 | 0.6081 | 0.0351 | 0.1661 |
0.7997 | 0.7937 | 0.4792 | 0.5336 |
-0.9247 | -0.8263 | -0.6195 | -0.7864 |
-2.1908 | -1.2531 | -1.2168 | -1.0408 |
-2.8482 | 0.1740 | -0.1852 | 0.1423 |
-0.6623 | 0.2025 | 0.1276 | 0.2731 |
lmfit.full <- lm(ACHV ~ FAM + PEER + SCHOOL, data = achvData)
tidy(lmfit.full)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.070 | 0.2506 | -0.2791 | 0.7810 |
FAM | 1.101 | 1.4106 | 0.7807 | 0.4378 |
PEER | 2.322 | 1.4813 | 1.5676 | 0.1218 |
SCHOOL | -2.281 | 2.2204 | -1.0273 | 0.3080 |
glance(summary(lmfit.full))
r.squared | adj.r.squared | sigma | statistic | p.value | df |
---|---|---|---|---|---|
0.2063 | 0.1702 | 2.07 | 5.717 | 0.0015 | 4 |
SEE ANYTHING ODD IN THE P-VALUES?
The overall F-test has a p-value of 0.0015.
\(H_0: \beta_1 = \beta_2 = \beta_3 = 0\)
vs.
\(H_1:\) at least one \(\beta_j \ne 0\)
We reject \(H_0\) and conclude that at least one predictor is significant.
But, none of the individual t-tests shows a significant predictor!!!
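A quick side-by-side check with the broom functions (a sketch using the fit above):

```r
# The overall F-test rejects, yet no individual t-test does
glance(lmfit.full)$p.value     # F-test p-value: 0.0015
tidy(lmfit.full)$p.value[-1]   # t-test p-values for FAM, PEER, SCHOOL: all > 0.10
```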
SEE ANYTHING ODD ABOUT THE SIGNS OF THE REGRESSION COEFFICIENTS?
tidy(lmfit.full)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.070 | 0.2506 | -0.2791 | 0.7810 |
FAM | 1.101 | 1.4106 | 0.7807 | 0.4378 |
PEER | 2.322 | 1.4813 | 1.5676 | 0.1218 |
SCHOOL | -2.281 | 2.2204 | -1.0273 | 0.3080 |
The coefficient for SCHOOL is negative!
Hopefully, having more school resources is associated with higher achievement!
Usually, these types of oddities are the result of multicollinearity:
Linear relationships among the predictor variables.
Are the predictors correlated?
ggpairs(achvData)
The predictors are strongly correlated with each other.
We lose the simple interpretation of a regression coefficient
“while all other predictors are held constant”
In fact, multicollinearity cannot be detected by examining the residuals.
Almost any set of observed variables (data) will have at least a little correlation with each other.
In principle, multicollinearity can be completely avoided in carefully designed experiments:
the factors we expect to affect the response can be collected at values
that make the predictor variables statistically independent (“orthogonal”).
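A toy illustration (hypothetical factors A and B, not from the achievement study):

```r
# In a balanced 2x2 factorial design, every combination of levels appears,
# so the two factors are orthogonal by construction
design <- expand.grid(A = c(-1, 1), B = c(-1, 1))
cor(design$A, design$B)   # exactly 0
```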
Check out STAT 22200 Linear Models and Experimental Design (offered each Spring)
In practice, we select several predictors believed to be related to the outcome,
and thus… likely to be associated with each other as well.
REMINDER: If we proceed in the presence of strong multicollinearity, we lose the simple interpretation of a regression coefficient: “while all other predictors are held constant.”
To see how the collinearity plays out in the achievement data, start with models with just one predictor.
lmfit.only.SCHOOL <- lm(ACHV ~ SCHOOL, data = achvData)
lmfit.only.FAM <- lm(ACHV ~ FAM, data = achvData)
lmfit.only.PEER <- lm(ACHV ~ PEER, data = achvData)
# lmfit.not.SCHOOL <- lm(ACHV ~ FAM + PEER, data = achvData)
# lmfit.not.PEER <- lm(ACHV ~ FAM + SCHOOL, data = achvData)
Adding PEER to a model with only SCHOOL:
lmfit.not.FAM <- lm(ACHV ~ PEER + SCHOOL, data = achvData)
tidy(lmfit.only.SCHOOL)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0104 | 0.2487 | -0.0419 | 0.9667 |
SCHOOL | 0.9283 | 0.2446 | 3.7954 | 0.0003 |
tidy(lmfit.not.FAM)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0470 | 0.2482 | -0.1892 | 0.8505 |
PEER | 2.0296 | 1.4289 | 1.4203 | 0.1601 |
SCHOOL | -0.8727 | 1.2911 | -0.6760 | 0.5014 |
tidy(lmfit.only.PEER)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -0.0309 | 0.2460 | -0.1255 | 0.9005 |
PEER | 1.0809 | 0.2676 | 4.0387 | 0.0001 |
Adding PEER to the model changes the sign of the SCHOOL coefficient.
The size of the SCHOOL coefficient is about the same,
but the SE is 5 times larger!
At the same time, the PEER t-statistic is more than cut in half.
The overall fit of the models \(\left(R^2\right)\) is essentially unchanged.
\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)
glance(summary(lmfit.full))
r.squared | adj.r.squared | sigma | statistic | p.value | df |
---|---|---|---|---|---|
0.2063 | 0.1702 | 2.07 | 5.717 | 0.0015 | 4 |
\(ACHV = \beta_0 + \beta_3 SCHOOL + \epsilon\)
glance(summary(lmfit.only.SCHOOL))
r.squared | adj.r.squared | sigma | statistic | p.value | df |
---|---|---|---|---|---|
0.1748 | 0.1627 | 2.08 | 14.41 | 0.0003 | 2 |
\(ACHV = \beta_0 + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)
glance(summary(lmfit.not.FAM))
r.squared | adj.r.squared | sigma | statistic | p.value | df |
---|---|---|---|---|---|
0.1989 | 0.175 | 2.064 | 8.319 | 0.0006 | 3 |
\(ACHV = \beta_0 + \beta_2 PEER + \epsilon\)
glance(summary(lmfit.only.PEER))
r.squared | adj.r.squared | sigma | statistic | p.value | df |
---|---|---|---|---|---|
0.1935 | 0.1816 | 2.056 | 16.31 | 0.0001 | 2 |
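A small convenience sketch to line up the four \(R^2\) values just shown:

```r
# Gather R^2 from the four models fit above
models <- list(full        = lmfit.full,
               only.SCHOOL = lmfit.only.SCHOOL,
               not.FAM     = lmfit.not.FAM,
               only.PEER   = lmfit.only.PEER)
sapply(models, function(m) summary(m)$r.squared)
#   full: 0.2063   only.SCHOOL: 0.1748   not.FAM: 0.1989   only.PEER: 0.1935
```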
If we were to keep just one variable, which would it be?
kable(cor(achvData))
  | ACHV | FAM | PEER | SCHOOL |
---|---|---|---|---|
ACHV | 1.0000 | 0.4195 | 0.4398 | 0.4181 |
FAM | 0.4195 | 1.0000 | 0.9601 | 0.9857 |
PEER | 0.4398 | 0.9601 | 1.0000 | 0.9822 |
SCHOOL | 0.4181 | 0.9857 | 0.9822 | 1.0000 |
Technically, the PEER predictor is best (highest correlation with response),
but for the purposes of the study,
I suspect researchers would prefer to keep either FAM or SCHOOL.
Look for a significant overall (model) F-statistic (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)),
but ALL individual \(t\)-tests nonsignificant.
However, a significant F-statistic (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)) and SOME nonsignificant \(t\)-tests for individual \(\beta\)s
does NOT necessarily imply multicollinearity.
There simply could be unimportant variables in the model.
Look for a \(\beta\) coefficient whose sign is opposite
to what you (and everyone else) would expect.
(This may also be a result of omitting confounders: November 29 lecture.)
ggpairs(achvData)
Indeed, each predictor is positively correlated with the response, as expected.
But, each predictor is also strongly correlated with the other predictors.
We can’t even answer questions like:
If faculty credentials are high, but school resources are low, what is the effect of peer influence on achievement?
There is no such data. Could we obtain it?
Probably not, because it appears that in this sample of public schools,
no such situations exist.
These variables, perhaps by their design and practicality, are inextricably linked.
Schools with more money have more money for resources
and more money to hire higher-credentialed teachers.
Not every multicollinearity situation can be detected through pairwise relationships
(correlations, scatterplots); see the sketch below.
Other measures include the partial \(R^2\) and its companion:
the variance inflation factor (VIF)
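A small simulated example (hypothetical variables x1, x2, x3, not the achievement data) shows collinearity that pairwise summaries understate; it previews the partial \(R^2\) defined next:

```r
# x3 is nearly a linear combination of x1 and x2,
# yet no single pairwise correlation looks alarming
set.seed(2)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, sd = 0.05)
cor(x1, x3)                           # only about 0.7
cor(x2, x3)                           # only about 0.7
summary(lm(x3 ~ x1 + x2))$r.squared   # but nearly 1
```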
\(R_j^2\) is the multiple correlation coefficient that results from predicting predictor \(x_j\) from a linear model with the other predictor variables.
For example, suppose there are 3 predictors (as in the achievement data)
If \(j=2\), we let \(x_j = x_2\) temporarily be a response variable and consider the model
\(x_2 = \beta_0 + \beta_1 x_1 + \beta_2 x_3 + \epsilon\)
Find the \(R^2\) from this model.
Call it \(R_j^2 = R_2^2\): the partial \(R^2\) for \(x_2\).
lmfit.PEER <- lm(PEER ~ FAM + SCHOOL, data = achvData)
tidy(lmfit.PEER)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0219 | 0.0205 | 1.067 | 0.2897 |
FAM | -0.2408 | 0.1126 | -2.140 | 0.0360 |
SCHOOL | 1.1386 | 0.1191 | 9.559 | 0.0000 |
glance(summary(lmfit.PEER))
r.squared | adj.r.squared | sigma | statistic | p.value | df |
---|---|---|---|---|---|
0.9669 | 0.9659 | 0.1707 | 978.6 | 0 | 3 |
R2.PEER <- summary(lmfit.PEER)$r.squared   # partial R^2 for PEER
b0 <- tidy(lmfit.PEER)$estimate[1]         # intercept
b1 <- tidy(lmfit.PEER)$estimate[2]         # FAM coefficient
b2 <- tidy(lmfit.PEER)$estimate[3]         # SCHOOL coefficient
The partial \(R^2\) is \(R_2^2 = 0.9669\).
\(96.69\)% of the variability in PEER is explained by regression on FAM and SCHOOL:
\(PEER \approx 0.02188 - 0.2408\, FAM + 1.139\, SCHOOL\)
So, the data doesn’t contain much information about how PEER varies
when FAM and SCHOOL are held constant.
This complicates interpretation of the regression coefficient for PEER in the proposed model
\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)
REMINDER: If we proceed in the presence of multicollinearity…
We lose the simple interpretation of a regression coefficient
“while all other predictors are held constant”
Let \(\displaystyle{\mathrm{VIF}_j = \frac{1}{1 - R_j^2}}\).
We call \(\mathrm{VIF}_j\) the variance inflation factor due to \(x_j\).
The variance inflation factor for PEER: \(\;\displaystyle{\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \;=\; \frac{1}{1 - 0.9669} \;=\; 30.21}\)
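Using the R2.PEER value extracted above, the computation is one line:

```r
# VIF for PEER from its partial R^2
VIF.PEER <- 1 / (1 - R2.PEER)
VIF.PEER   # about 30.21, matching the car::vif() output below
```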
R will calculate the VIF for you. I used the vif function in the car package.
library(car)
# lmfit.full (ACHV ~ FAM + PEER + SCHOOL) was fit earlier
kable(vif(lmfit.full))
  | x |
---|---|
FAM | 37.58 |
PEER | 30.21 |
SCHOOL | 83.16 |
When the partial \(R^2\), \(R_j^2\), is large, the VIF is large.
…and if \(R_j^2 = 0\) (orthogonal predictors), VIF = 1.
Rough guide: VIF \(> 10\) indicates moderate multicollinearity is present.
REMINDER: If we proceed in the presence of even moderate multicollinearity…
+ **Variance estimates are typically overstated**
+ So, standard errors typically overstated
+ Confidence intervals wider than needed
+ t-statistics smaller than they should be
+ These are missed opportunities to notice important predictors
The VIF attempts to measure whether overstated standard errors are a problem.
More specifically, the VIFs can be combined to measure the expected squared distance (squared error) of the OLS estimators from their true values:
\(\displaystyle D^2 = \sigma^2 \sum_{j=1}^p \mathrm{VIF}_j\)
That’s nice, but we can’t calculate this from the data
since multicollinearity causes \(\widehat{\sigma}^2\) to overstate the variance.
It’s a circular reasoning issue.
We can’t estimate how much the variance is inflated by using an inflated variance estimate.
We can measure what we would expect if the predictors were orthogonal (a null hypothesis)
and see how far away we are from that hypothesis using the VIFs.
Suppose the predictor variables are all orthogonal to each other.
Then, \(\mathrm{VIF}_j = 1\) for \(j = 1, 2, \ldots, p\)
and \(D^2 = \sigma^2 \sum_{j=1}^p \mathrm{VIF}_j = \sigma^2 \sum_{j=1}^p 1 = \sigma^2 p\)
Then, the ratio \(\displaystyle \frac{D^2}{D^2 \textrm{ if predictors orthogonal}} \;=\; \frac{\sigma^2 \sum_{j=1}^p \mathrm{VIF}_j}{\sigma^2 p} \;=\; \frac{1}{p} \sum_{j=1}^p \mathrm{VIF}_j \;=\; \overline{\mathrm{VIF}}\)
So, the average of the VIFs measures the squared error of the OLS estimators from their true values
relative to the size of that error if the predictors were orthogonal.
For example, when the predictors are orthogonal and the regression model is valid,
software can calculate \(\mathrm{se}(\widehat{\beta}_j)\) and we would consider it accurate.
If multicollinearity is present,
we estimate that the reported \(\mathrm{se}(\widehat{\beta}_j)\) overstates the true standard error by a factor of \(\sqrt{\overline{\mathrm{VIF}}}\).
VIFs <- vif(lmfit.full)
p <- length(VIFs)
p
[1] 3
avgVIF <- sum(VIFs) / p   # average VIF
For the achievement data, \(\overline{\mathrm{VIF}} = 50.32\).
So, the squared error in the OLS estimates is \(50\) times larger
than it would be if the predictor variables were orthogonal.
…and the reported standard errors are \(\sqrt{50} = 7.1\) times larger than they should be.
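In code, continuing from the avgVIF computed above:

```r
sqrt(avgVIF)   # about 7.1: rough factor by which reported SEs are overstated
```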
Interpretation is qualitative.
We don’t know the distribution of \(\overline{\mathrm{VIF}}\) (even if the predictors are orthogonal).
It’s an open problem in statistics, but not one we are clamoring to solve.
The analyst’s judgement should suffice.
The bad news: We cannot “fix” or eliminate multicollinearity for a given model.
For example, no transformation will help.
Some ideas…
Get more and/or better data with predictors that lie in the areas about which the current data are not informative.
This is not always feasible.
Stratified sampling to assure representativeness may aid in maximizing information in the data.
We can’t even answer questions like:
If faculty credentials are high, but school resources are low, what is the effect of peer influence on achievement?
There is no such data. Could we obtain it?
Probably not, because it appears that in this sample of public schools,
no such situations exist.
Highly collinear predictors contribute little additional information,
while using up degrees of freedom and creating other problems.
There may be extra-statistical reasons (such as cost, reliability, etc.) to simply omit some predictors.
Consequences can be evaluated.
Some degree of collinearity can be introduced when quadratic or interaction terms are added to a model.
These can often be reduced by centering.
Center predictors before constructing powers (squares, etc.) and interaction terms involving them.
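A quick simulated check (hypothetical predictor x, not from the achievement data):

```r
# Centering x before squaring sharply reduces collinearity
# between the linear and quadratic terms
set.seed(1)
x  <- runif(100, min = 10, max = 20)
xc <- x - mean(x)        # centered version
cor(x, x^2)              # close to 1
cor(xc, xc^2)            # close to 0
```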
If predictors are functionally related, we can construct new variables that are
functions of the predictors and include the new variables in the model instead.
This is equivalent to defining constraints on the coefficients.
Omitting \(x_j\) completely is a special case equivalent to fitting the model with the constraint that \(\beta_j=0\).
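For instance (an illustrative choice, not from the study), averaging the three achievement-data indices imposes the constraint \(\beta_1 = \beta_2 = \beta_3\); the combined variable RESOURCES below is a made-up name:

```r
# Sketch: replace three collinear indices with one combined index
# (equal weights correspond to constraining the three slopes to be equal)
achvData$RESOURCES <- with(achvData, (FAM + PEER + SCHOOL) / 3)
lmfit.combined <- lm(ACHV ~ RESOURCES, data = achvData)
tidy(lmfit.combined)
```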
Principal components analysis is performed on the \(X\) matrix to discern possible structural effects.
The method involves finding functions of the \(x\)’s that are orthogonal and “best” capture the variability in the predictors.
(Somewhat outside of scope of this course, see Chapter 10).
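As a quick sketch of the idea on these data (using base R’s prcomp; details in Chapter 10):

```r
# Principal components of the three (standardized) predictors
pc <- prcomp(achvData[, c("FAM", "PEER", "SCHOOL")], scale. = TRUE)
summary(pc)   # with predictors this correlated, PC1 carries nearly all the variance
```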
Explanation/interpretation of coefficients is pretty much impossible in the presence of strong multicollinearity, but predictions (within the scope of the data) are probably just fine.
Of course, the standard errors of those predictions will be inflated.