
1 Chapter 9: Working with Collinear Data

1.0.1 Example

1.0.1.1 Another Problem to Solve: (Multi)Collinearity

Collinearity (also called multicollinearity) refers to the situation where
some or all predictors in the model are substantially correlated with each other.

Linear relationships among predictors need not be only pairwise.
One predictor may be linearly related to some subset of the other predictors.
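
For instance, here is a minimal simulated sketch (my own toy example, not part of the textbook data): a predictor can be almost an exact linear function of several others even though none of its pairwise correlations looks alarming.

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- (x1 + x2 + x3) / 3 + rnorm(n, sd = 0.05)   # nearly a linear combination of the others

round(cor(cbind(x1, x2, x3, x4)), 2)             # pairwise correlations with x4 are only about 0.6
summary(lm(x4 ~ x1 + x2 + x3))$r.squared         # yet this R^2 is essentially 1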

1.0.1.2 Example: Student achievement and school facilities

1.0.1.2.1 The scenario: Civil Rights Act of 1964

Separate (and allegedly equal) education still leads to differences in educational achievement.

Question: How is student achievement related to school resources?

A sample of 70 schools and their political/social context resulted in data on many variables.
Data were combined into indices/measures of achievement and school characteristics:

  • ACHV: An index of student achievement (the response variable)
  • SCHOOL: A measure of school resources
  • FAM: A measure of faculty credentials in each school
  • PEER: A measure of the influence of peer groups in each school

For achievement, high values are good.

The predictors are expected to be positively correlated with achievement.

1.0.1.2.2 The proposed model and estimate: A cautionary tale

\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)

library(tidyverse)   # loads dplyr, tibble, ggplot2, etc.; provides glimpse()
library(broom)       # tidy() and glance() for model summaries
achvData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P236.txt")
glimpse(achvData)
Observations: 70
Variables: 4
$ ACHV   <dbl> -0.4315, 0.7997, -0.9247, -2.1908, -2.8482, -0.6623, 2.6367, 2…
$ FAM    <dbl> 0.6081, 0.7937, -0.8263, -1.2531, 0.1740, 0.2025, 0.2418, 0.59…
$ PEER   <dbl> 0.03509, 0.47924, -0.61951, -1.21675, -0.18517, 0.12764, -0.09…
$ SCHOOL <dbl> 0.16607, 0.53356, -0.78635, -1.04076, 0.14229, 0.27311, 0.0496…
head(achvData)
ACHV FAM PEER SCHOOL
-0.4315 0.6081 0.0351 0.1661
0.7997 0.7937 0.4792 0.5336
-0.9247 -0.8263 -0.6195 -0.7864
-2.1908 -1.2531 -1.2168 -1.0408
-2.8482 0.1740 -0.1852 0.1423
-0.6623 0.2025 0.1276 0.2731
lmfit.full <- lm(ACHV ~ FAM + PEER + SCHOOL, data = achvData)
tidy(lmfit.full)
term estimate std.error statistic p.value
(Intercept) -0.070 0.2506 -0.2791 0.7810
FAM 1.101 1.4106 0.7807 0.4378
PEER 2.322 1.4813 1.5676 0.1218
SCHOOL -2.281 2.2204 -1.0273 0.3080
glance(summary(lmfit.full))
r.squared adj.r.squared sigma statistic p.value df
0.2063 0.1702 2.07 5.717 0.0015 4

SEE ANYTHING ODD IN THE P-VALUES?

The overall F-test has a p-value of 0.0015.

\(H_0: \beta_1 = \beta_2 = \beta_3 = 0\)
vs.
\(H_1:\) at least one \(\beta_j \ne 0\)

We reject \(H_0\) and conclude that at least one predictor is significant.

But, none of the individual t-tests shows a significant predictor!!!

SEE ANYTHING ODD ABOUT THE SIGNS OF THE REGRESSION COEFFICIENTS?

tidy(lmfit.full)
term estimate std.error statistic p.value
(Intercept) -0.070 0.2506 -0.2791 0.7810
FAM 1.101 1.4106 0.7807 0.4378
PEER 2.322 1.4813 1.5676 0.1218
SCHOOL -2.281 2.2204 -1.0273 0.3080

The coefficient for SCHOOL is negative!

Hopefully, having more school resources is associated with higher achievement!

1.0.1.3 What’s the problem then?

Usually, these types of oddities are the result of multicollinearity:
Linear relationships among the predictor variables.

Are the predictors correlated?

library(GGally)   # ggpairs() scatterplot matrix
ggpairs(achvData)

The predictors are strongly correlated with each other.

1.0.2 Issues

1.0.2.1 What’s the big deal about multicollinearity?

  1. Since the predictors provide “overlapping” information, results may be ambiguous
    • A portion of the information from each predictor is redundant, but not all
    • Predictor variables might each serve as a proxy for the others in the
      regression equation without affecting the total explanatory power.
    • Do we need all the predictors in the model?
    • If not, which ones should we drop?
    • How does this choice affect inference, prediction, or forecasting?
      (for prediction/forecasting, we are less concerned)
  2. We lose the simple interpretation of a regression coefficient
    “while all other predictors are held constant”

  3. If we proceed in the presence of moderate multicollinearity
    • Variance estimates are typically overstated
      • So, standard errors typically overstated
      • Confidence intervals wider than needed
      • t-statistics smaller than they should be
    • These are missed opportunities to notice important predictors
  4. If we proceed in the presence of strong multicollinearity
    • The estimates of \(\beta\)s can be very unstable
    • \(\widehat\beta\)s may change substantially when other predictors are added/removed
      or when observations are added/deleted, even if they are not outliers (see the sketch after this list)
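
A rough sketch of this instability (my own illustration using the full model fit above): refit the achievement model after dropping a few arbitrary schools and compare the coefficients.

set.seed(2)
drop.rows   <- sample(nrow(achvData), 5)        # 5 arbitrary schools to remove
lmfit.drop5 <- lm(ACHV ~ FAM + PEER + SCHOOL, data = achvData[-drop.rows, ])
cbind(full = coef(lmfit.full), drop5 = coef(lmfit.drop5))   # the estimates can shift noticeably
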
1.0.2.1.1 Multicollinearity is not a modeling error

In fact, it cannot be detected by examination of residuals.

  • It is a condition of deficient data.
  • It arises from a lack of independent information (about the response) coming from predictors

Almost any set of observed variables (data) will have at least a little correlation among them.

  • How much multicollinearity can we permit and still have a useful, valid model?
  • What do we do if multicollinearity is so serious that we cannot ignore it?

In principle, multicollinearity can be completely avoided in carefully designed experiments

  • The factors we expect to affect the response can be set at values
    that make the predictor variables statistically independent (“orthogonal”).

  • Check out STAT 22200 Linear Models and Experimental Design (offered each Spring)

In practice, we select several predictors believed to be related to the outcome,
and thus… likely to be associated with each other as well.
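
As a tiny sketch of what an orthogonal design looks like (a toy example of my own, assuming a balanced 2×2 factorial with 5 replicates): the coded factors are exactly uncorrelated, so each variance inflation factor (introduced later in these notes) equals 1.

design <- expand.grid(A = c(-1, 1), B = c(-1, 1), rep = 1:5)   # balanced 2x2 factorial, 5 replicates
cor(design$A, design$B)                                        # exactly 0: orthogonal factors
set.seed(3)
design$y <- 1 + 2 * design$A - design$B + rnorm(nrow(design))  # a made-up response
car::vif(lm(y ~ A + B, data = design))                         # both VIFs equal 1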

1.0.3 Example

1.0.3.1 Example Again: Student achievement and school facilities

1.0.3.1.1 The scenario: Civil Rights Act of 1964

Question: How is student achievement related to school resources?

A sample of 70 schools and their political/social context resulted in data on many variables.
Data were combined into indices/measures of achievement and school characteristics:

  • ACHV: An index of student achievement (the response variable)
  • SCHOOL: A measure of school resources
  • FAM: A measure of faculty credentials in each school
  • PEER: A measure of the influence of peer groups in each school

For achievement, high values are good.

The predictors are expected to be positively correlated with achievement.

1.0.3.1.2 The proposed model and estimate

\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)

tidy(lmfit.full)
term estimate std.error statistic p.value
(Intercept) -0.070 0.2506 -0.2791 0.7810
FAM 1.101 1.4106 0.7807 0.4378
PEER 2.322 1.4813 1.5676 0.1218
SCHOOL -2.281 2.2204 -1.0273 0.3080

REMINDERS:

  • If the predictors provide “overlapping” information, results may be ambiguous
    A portion of the information from each predictor may be redundant, but not all
  • Predictor variables might each serve as a proxy for the others in the
    regression equation without affecting the total explanatory power.

If we proceed in the presence of strong multicollinearity

  • The estimates of \(\beta\)’s can be very unstable
  • They may change substantially when other predictors are added/removed
1.0.3.1.3 \(\beta\)s change when predictors added/removed

Start with models with just one predictor.

lmfit.only.SCHOOL <- lm(ACHV ~ SCHOOL, data = achvData)
lmfit.only.FAM <- lm(ACHV ~ FAM, data = achvData)
lmfit.only.PEER <- lm(ACHV ~ PEER, data = achvData)
# lmfit.not.SCHOOL <- lm(ACHV ~ FAM + PEER, data = achvData)
# lmfit.not.PEER <- lm(ACHV ~ FAM + SCHOOL, data = achvData)

Adding PEER to a model with only SCHOOL:

lmfit.not.FAM <- lm(ACHV ~ PEER + SCHOOL, data = achvData)
tidy(lmfit.only.SCHOOL)
term estimate std.error statistic p.value
(Intercept) -0.0104 0.2487 -0.0419 0.9667
SCHOOL 0.9283 0.2446 3.7954 0.0003
tidy(lmfit.not.FAM)
term estimate std.error statistic p.value
(Intercept) -0.0470 0.2482 -0.1892 0.8505
PEER 2.0296 1.4289 1.4203 0.1601
SCHOOL -0.8727 1.2911 -0.6760 0.5014
tidy(lmfit.only.PEER)
term estimate std.error statistic p.value
(Intercept) -0.0309 0.2460 -0.1255 0.9005
PEER 1.0809 0.2676 4.0387 0.0001

Adding PEER to the model changes the sign of the SCHOOL coefficient.
The magnitude of the SCHOOL coefficient is about the same,
but its standard error is about 5 times larger!
At the same time, the PEER t-statistic is less than half of what it was in the PEER-only model.

The overall fit of the models \(\left(R^2\right)\) is essentially unchanged.

\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)

glance(summary(lmfit.full))
r.squared adj.r.squared sigma statistic p.value df
0.2063 0.1702 2.07 5.717 0.0015 4

\(ACHV = \beta_0 + \beta_3 SCHOOL + \epsilon\)

glance(summary(lmfit.only.SCHOOL))
r.squared adj.r.squared sigma statistic p.value df
0.1748 0.1627 2.08 14.41 0.0003 2

\(ACHV = \beta_0 + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)

glance(summary(lmfit.not.FAM))
r.squared adj.r.squared sigma statistic p.value df
0.1989 0.175 2.064 8.319 0.0006 3

\(ACHV = \beta_0 + \beta_2 PEER + \epsilon\)

glance(summary(lmfit.only.PEER))
r.squared adj.r.squared sigma statistic p.value df
0.1935 0.1816 2.056 16.31 0.0001 2


If we were to keep just one variable, which would it be?

library(knitr)   # kable() for formatted tables
kable(cor(achvData))
ACHV FAM PEER SCHOOL
ACHV 1.0000 0.4195 0.4398 0.4181
FAM 0.4195 1.0000 0.9601 0.9857
PEER 0.4398 0.9601 1.0000 0.9822
SCHOOL 0.4181 0.9857 0.9822 1.0000

Technically, the PEER predictor is best (highest correlation with response),
but for the purposes of the study,
I suspect researchers would prefer to keep either FAM or SCHOOL.

1.0.4 Detection

1.0.4.1 Detecting multicollinearity

1.0.4.1.1 Detection: Oddities in model estimates/tests

Look for overall (model) F-statistic significant (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)),
but ALL individual \(t\)-tests nonsignificant.

However, a significant F-statistic (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)) and SOME nonsignificant \(t\)-tests for individual \(\beta\)s
does NOT necessarily imply multicollinearity.

There simply could be non-important variables in the model.

Look for a \(\beta\) coefficient whose sign is the opposite
of what you (and everyone else) would expect.
(This may also be a result of omitting confounders: November 29 lecture.)

1.0.4.1.2 Detection: pairwise linear relationships (correlation, scatterplots)
ggpairs(achvData)

Indeed, each predictor is positively correlated with the response, as expected.


But, each predictor is also strongly correlated with the other predictors.

We can’t even answer questions like:
If faculty credentials are high, but school resources are low, what is the effect of peer influence on achievement?

There is no such data. Could we obtain it?
Probably not, because it appears that in this sample of public schools,
no such situations exist.

These variables, perhaps by their design and practicality, are inextricably linked.
Schools with more money have more money for resources
and more money to hire higher-credentialed teachers.

1.0.4.1.3 Detection: more complex linear relationships (partial \(R^2\))

Not every multicollinearity situation can be detected by pairwise relationships:
correlation, scatterplots.

Other measures include the partial \(R^2\) and its companion:
the variance inflation factor (VIF)

1.0.4.1.3.1 Calculation and Interpretation:

\(R_j^2\) is the coefficient of determination (\(R^2\)) obtained from regressing the predictor \(x_j\) on all of the other predictor variables.

For example, suppose there are 3 predictors (as in the achievement data).
If \(j=2\), we temporarily treat \(x_j = x_2\) as the response variable and consider the model

\(x_2 = \alpha_0 + \alpha_1 x_1 + \alpha_3 x_3 + \epsilon\)

(The \(\alpha\)s are new coefficients, not the \(\beta\)s of the achievement model.)
Find the \(R^2\) from this model.
Call it \(R_j^2 = R_2^2\): the partial \(R^2\) for \(x_2\).

lmfit.PEER <- lm(PEER ~ FAM + SCHOOL, data = achvData)
tidy(lmfit.PEER)
term estimate std.error statistic p.value
(Intercept) 0.0219 0.0205 1.067 0.2897
FAM -0.2408 0.1126 -2.140 0.0360
SCHOOL 1.1386 0.1191 9.559 0.0000
glance(summary(lmfit.PEER))
r.squared adj.r.squared sigma statistic p.value df
0.9669 0.9659 0.1707 978.6 0 3

R2.PEER <- summary(lmfit.PEER)$r.squared
b0 <- tidy(lmfit.PEER)$estimate[1]
b1 <- tidy(lmfit.PEER)$estimate[2]
b2 <- tidy(lmfit.PEER)$estimate[3]

The partial \(R^2\) for PEER is \(R_2^2 = 0.9669\):
\(96.69\)% of the variability in PEER is explained by the regression on FAM and SCHOOL:

\(PEER \approx 0.02188 - 0.2408\, FAM + 1.139\, SCHOOL\)

So, the data doesn’t contain much information about how PEER varies
when FAM and SCHOOL are not changed.

This complicates interpretation of the regression coefficient for PEER in the proposed model

\(ACHV = \beta_0 + \beta_1 FAM + \beta_2 PEER + \beta_3 SCHOOL + \epsilon\)

REMINDER: If we proceed in the presence of multicollinearity…
We lose the simple interpretation of a regression coefficient
“while all other predictors are held constant”

1.0.4.1.4 Detection: Variance Inflation Factor (VIF)

Let \(\displaystyle{\mathrm{VIF}_j = \frac{1}{1 - R_j^2}}\).

We call \(\mathrm{VIF}_j\) the variance inflation factor due to \(x_j\).

The variance inflation factor for PEER: \(\;\displaystyle{\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \;=\; \frac{1}{1 - 0.9669} \;=\; 30.21}\)
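
As a check on the formula, here is a sketch that computes all three VIFs directly from their partial \(R^2\) values (these should agree with the output of the vif function shown below):

r2.FAM    <- summary(lm(FAM    ~ PEER + SCHOOL, data = achvData))$r.squared
r2.PEER   <- summary(lm(PEER   ~ FAM  + SCHOOL, data = achvData))$r.squared
r2.SCHOOL <- summary(lm(SCHOOL ~ FAM  + PEER,   data = achvData))$r.squared
1 / (1 - c(FAM = r2.FAM, PEER = r2.PEER, SCHOOL = r2.SCHOOL))   # VIF_j = 1 / (1 - R_j^2)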

R will calculate the VIF for you. I used the vif function in the car package.

library(car)      # provides vif()
vif(lmfit.full)   # same model as lm(ACHV ~ FAM + PEER + SCHOOL, data = achvData)

kable(vif(lmfit.full))
x
FAM 37.58
PEER 30.21
SCHOOL 83.16


When the partial \(R_j^2\) is large, the VIF is large.

…and if \(R_j^2 = 0\) (orthogonal predictors), VIF = 1.

Rough guide: VIF \(> 10\) indicates moderate multicollinearity is present.


REMINDER: If we proceed in the presence of even moderate multicollinearity…

  • Variance estimates are typically overstated
    • So, standard errors typically overstated
    • Confidence intervals wider than needed
    • t-statistics smaller than they should be
  • These are missed opportunities to notice important predictors


The VIF attempts to quantify how much of a problem these overstated standard errors are.

More specifically, the VIFs can be combined to measure the expected squared distance (squared error) of the OLS estimators from their true values:

\(\displaystyle D^2 = \sigma^2 \sum_{j=1}^p \mathrm{VIF}_j\)

That’s nice, but we can’t calculate this from the data
since multicollinearity causes \(\widehat{\sigma}^2\) to overstate the variance.
It’s a circular reasoning issue.
We can’t estimate how much the variance is inflated by using an inflated variance estimate.

We can measure what we would expect if the predictors were orthogonal (a null hypothesis)
and see how far away we are from that hypothesis using the VIFs.


Suppose the predictor variables are all orthogonal to each other.

Then, \(\mathrm{VIF}_j = 1\) for \(j = 1, 2, \ldots, p\)

and \(D^2 = \sigma^2 \sum_{j=1}^p \mathrm{VIF}_j = \sigma^2 \sum_{j=1}^p 1 = \sigma^2 p\)

Then, the ratio \(\displaystyle \frac{D^2}{D^2 \textrm{ if predictors orthogonal}} \;=\; \frac{\sigma^2 \sum_{j=1}^p \mathrm{VIF}_j}{\sigma^2 p} \;=\; \frac{1}{p} \sum_{j=1}^p \mathrm{VIF}_j \;=\; \overline{\mathrm{VIF}}\)

So, the average of the VIFs measures the squared error of the OLS estimators from their true values
relative to the size of that error if the predictors were orthogonal.

For example, when the predictors are orthogonal and the regression model valid,
then software can calculate \(\mathrm{se}(\widehat{\beta}_j)\) and we would consider it to be accurate.

If multicollinearity is present,
we estimate that the reported \(\mathrm{se}(\widehat{\beta}_j)\) overstates the true standard error by a factor of roughly \(\sqrt{\overline{\mathrm{VIF}}}\).

VIFs <- vif(lmfit.full)
p <- length(VIFs)
p
[1] 3
avgVIF <- sum(VIFs) / p   # equivalently, mean(VIFs)

For the achievement data, \(\overline{\mathrm{VIF}} = 50.32\).

So, the squared error in the OLS estimates is \(50\) times larger
than it would be if the predictor variables were orthogonal.

…and the reported standard errors are \(\sqrt{50} = 7.1\) times larger than they should be.

Interpretation is qualitative.

We don’t know the distribution of \(\overline{\mathrm{VIF}}\) (even if the predictors are orthogonal).
It’s an open problem in statistics, but not one we are clamoring to solve.
The analyst’s judgement should suffice.

1.0.5 Remedies

1.0.5.1 What to do about multicollinearity?

The bad news: We cannot “fix” or eliminate multicollinearity for a given model.

For example, no transformation will help.

Some ideas…

  1. Get more and better data
  2. Omit redundant variables
  3. Centering
  4. Try “constrained regression” (Section 10.4)
    This is the only section in Chapter 10 covered in this course.
  5. Principal components (not covered in this course, see Sections 10.1-10.3)
1.0.5.1.1 Get more and better data

Get more and/or better data, with predictor values in the regions about which the current data are not informative.

This is not always feasible.

Stratified sampling to assure representativeness may aid in maximizing information in the data.

1.0.5.1.1.1 For the achievement data

We can’t even answer questions like:
If faculty credentials are high, but school resources are low, what is the effect of peer influence on achievement?

There is no such data. Could we obtain it?
Probably not, because it appears that in this sample of public schools,
no such situations exist.

1.0.5.1.2 Omit redundant variables

Highly collinear predictors contribute little additional information
while using up degrees of freedom and creating other problems.

There may be extra-statistical reasons (such as cost, reliability, etc) to just omit some predictors.

The consequences of omitting a predictor can be evaluated, for example by comparing the reduced and full models (a sketch follows).
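
For instance, using models already fit above, a partial F-test compares the model with only PEER to the full model (a minimal sketch):

anova(lmfit.only.PEER, lmfit.full)                      # partial F-test: do FAM and SCHOOL add anything?
c(full      = summary(lmfit.full)$adj.r.squared,        # also compare adjusted R^2
  PEER.only = summary(lmfit.only.PEER)$adj.r.squared)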

1.0.5.1.3 Centering (Chapter 3)

Some degree of collinearity can be introduced when quadratic or interaction terms are added to a model.

These can often be reduced by centering:
center the predictors before constructing powers (squares, etc.) and interaction terms involving them.
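
A sketch of the idea with a made-up predictor (not the achievement data): a variable whose values are far from zero is almost perfectly correlated with its own square, but centering first removes most of that correlation.

set.seed(4)
x  <- runif(100, min = 10, max = 20)   # hypothetical predictor far from zero
cor(x, x^2)                            # nearly 1: x and x^2 are collinear
xc <- x - mean(x)                      # center first...
cor(xc, xc^2)                          # ...then square: correlation near 0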

1.0.5.1.4 Try “constrained regression” (Section 10.4)

If predictors are functionally related, we can construct new variables that are
functions of the predictors and include the new variables in the model instead.

This is equivalent to defining constraints on the coefficients.

Omitting \(x_j\) completely is a special case equivalent to fitting the model with the constraint that \(\beta_j=0\).
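
For the achievement data, one possible constraint (my choice, just for illustration) is to force the three coefficients to be equal, which is the same as regressing ACHV on the average of the three predictors:

achvConstrained <- mutate(achvData, RESOURCES = (FAM + PEER + SCHOOL) / 3)   # combined index
lmfit.constrained <- lm(ACHV ~ RESOURCES, data = achvConstrained)            # imposes beta_1 = beta_2 = beta_3
tidy(lmfit.constrained)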

1.0.5.1.5 Principal components (not in this course)

Principal Components is an analysis technique performed on the \(X\) matrix to discern possible structural effects.

The method involves finding linear combinations of the \(x\)’s that are orthogonal to each other and “best” capture the variability in the predictors; the response is then regressed on a few of these components.
(Somewhat outside the scope of this course; see Chapter 10.)
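
A brief sketch of the flavor (my own, outside the scope of these notes): compute principal components of the three predictors and regress ACHV on the first component only.

pc <- prcomp(achvData[, c("FAM", "PEER", "SCHOOL")], scale. = TRUE)   # orthogonal combinations of the predictors
summary(pc)                                   # the first component captures most of the predictor variability
lmfit.pc1 <- lm(achvData$ACHV ~ pc$x[, 1])    # regress achievement on the first component
tidy(lmfit.pc1)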

1.0.5.2 Prediction vs. explanation when multicollinearity

Explanation/interpretation of coefficients is pretty much impossible in the presence of strong multicollinearity, but predictions (within the scope of the data) are probably just fine.

Of course, the standard errors of those predictions will be inflated.

1.0.5.3 Other examples

  • Textbook example on French import and domestic consumption.
  • Other examples in Prof. Chen’s lecture notes