
Transformations

We use transformations to achieve linearity and constant variance.

There are a few that we use very frequently:

Which transformation? Exponential growth/decay

Consider the context of the data
Is exponential growth/decay expected?

  • $Y = \beta_0 e^{\beta_1 x} e^{\epsilon}$

  • Non-linear model and non-constant spread (multiplicative errors)

  • Suggested transformation: log(Y)

    • Take the logarithm of both sides of the equation to linearize
  • $\log(Y) = \log(\beta_0) + \beta_1 x + \epsilon$ (see the R sketch below)
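
A minimal R sketch, assuming a hypothetical data frame `dat` with a positive response `y` and a predictor `x` (names not from the original notes); the transformation can be applied directly inside `lm()`:

# Fit log(Y) = beta0* + beta1 * x + error for the exponential growth/decay model
fit_exp <- lm(log(y) ~ x, data = dat)
summary(fit_exp)
# For the multiplicative-in-x model of the next section, use lm(log(y) ~ log(x), data = dat)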

Which transformation? Multiplicative effect of x (predictor)

Consider the context of the data
Is multiplicative growth/shrinkage expected?

  • $Y = \beta_0 x^{\beta_1} e^{\epsilon}$

  • Non-linear model and non-constant spread (multiplicative errors)

  • Suggested transformation: log(Y) and log(x)

    • Take the logarithm of both sides of the equation to linearize
  • $\log(Y) = \log(\beta_0) + \beta_1 \log(x) + \epsilon$

Which transformation? Responses are proportions.

Consider the context of the data
Is the response a proportion (binomial model)?

  • $E(Y|x) = \pi(x)$ = some function of x with range between 0 and 1

  • $\mathrm{var}(Y|x) = \dfrac{\pi(x)\,[1 - \pi(x)]}{n}$ …which depends on x and the mean $E(Y|x)$

  • restricted, non-normal response, non-constant spread (varies with the mean)

  • Suggested transformation: $\log\!\left(\dfrac{Y}{1-Y}\right)$

Or, use logistic regression

Which transformation? Responses are counts of a rare event

Consider the context of the data
Is the response a count from a fairly rare event (Poisson model)?

  • $E(Y|x) = \lambda(x)$ = some function of x ($\lambda(x) > 0$)

  • $\mathrm{var}(Y|x) = \lambda(x)$ …which depends on x and the mean $E(Y|x)$

  • right-skewed response and non-constant spread (increases as the mean increases)

  • Suggested model: $\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon$

Or, use Poisson regression
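
A minimal R sketch, assuming a hypothetical data frame `dat` with a count response `y` and predictor `x` (assumed names):

# Variance-stabilizing square-root transformation followed by ordinary least squares
fit_sqrt <- lm(sqrt(y) ~ x, data = dat)

# Or Poisson regression (a generalized linear model with a log link)
fit_pois <- glm(y ~ x, family = poisson, data = dat)
summary(fit_pois)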

Which transformation? Try “ladder of transformation”

Power:  −2      −1      −1/2      log       1/2     1      2
Y:      1/Y²    1/Y     1/√Y      log(Y)    √Y      Y      Y²
x:      1/x²    1/x     1/√x      log(x)    √x      x      x²

Which transformation? Let “boxcox” suggest

An algorithm that suggests the “boxcox transformation”
tries to find an “optimal” transformation of Y of the form Yλ

This method does not suggest transformations for the predictors, only the response.

If the method suggests λ = −1.78, either use $Y^{-1.78}$
or a close, but easier-to-interpret transformation, like $Y^{-2} = \dfrac{1}{Y^2}$

If λ=0 is suggested by the boxcox method,
then use the log(Y) transformation.

library(MASS)   # boxcox() is in the MASS package

boxcox(lm(y ~ x))   # profile log-likelihood over the default grid of lambda values

boxcox(lm(y ~ x), lambda = seq(0, 1, 0.1))   # restrict the search to a chosen grid of lambda values

Transformations for non-linearity

  • If you know (or can guess) the form of the true relationship between response and predictor,
    • let that guide your transformation.
  • Write down the equation for what you believe may be the true relationship,
    • then do algebra (on both sides of the equation) until you reach a linear form.
    • Then define new variables Y′, X′ as needed.
  • Or, you can try different transformations of the predictor and/or response variables and see what works.

Transformations to achieve constant variance

When the variance is proportional to x²

Dividing both sides of $Y = \beta_0 + \beta_1 X + \epsilon$ by X gives a new linear model with

$Y' = \dfrac{Y}{X}, \quad X' = \dfrac{1}{X}, \quad \beta_0' = \beta_1, \quad \beta_1' = \beta_0, \quad \epsilon' = \dfrac{\epsilon}{X}$

$y_i' = \beta_0' + \beta_1' x_i' + \epsilon_i'$

Weighted least squares

$\displaystyle\sum_i w_i (y_i - \hat{y}_i)^2 = \sum_i w_i e_i^2, \quad \text{where } w_i \propto \dfrac{1}{\mathrm{var}(\epsilon_i)}$
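
A minimal weighted least squares sketch in R, assuming a hypothetical data frame `dat` with response `y` and predictor `x`; when the variance is proportional to x², the weights are $w_i = 1/x_i^2$:

# Weighted least squares with weights proportional to 1/var(eps_i)
fit_wls <- lm(y ~ x, data = dat, weights = 1 / x^2)
summary(fit_wls)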

Autocorrelation

So, what’s the problem with autocorrelation?

Autocorrelation can affect OLS analysis in several ways:

  • OLS estimates are still unbiased but are not efficient (no longer have minimum variance)
  • $\sigma^2$ and the $se(\beta)$s may be seriously under- (or over-) estimated
  • positive correlation: often under-estimation of $\sigma^2$ and the $se(\beta)$s
  • negative correlation (less common): often over-estimation

Underestimating the variability leads to more false positive decisions:
rejecting the null hypothesis $\beta = 0$ when one should not.

Whether positive or negative correlation,
confidence intervals for βs may no longer be valid.

Tests for autocorrelation

Runs test

library(tseries)   # runs.test() is in the tseries package

runs.test(as.factor(rstandard(lmfit) > 0))   # runs test on the signs of the standardized residuals

Durbin-Watson approach

If the errors are correlated, perhaps each error ($\epsilon_t$):

  • depends linearly on the size and sign of the error before it in time ($\epsilon_{t-1}$),
  • has correlation $\rho = \mathrm{cor}(\epsilon_t, \epsilon_{t-1})$, and
  • is not correlated with any earlier errors ($\epsilon_{t-2}, \epsilon_{t-3}, \ldots$).

If this model is reasonable, then $\epsilon_t = \gamma_0 + \gamma_1 \epsilon_{t-1} + \omega_t = \rho\,\epsilon_{t-1} + \omega_t$, with $|\rho| < 1$, where $\rho$ is the correlation coefficient between successive errors
and the $\omega_t$ are independent, normally distributed, with constant variance across the errors ($\epsilon$s).

The Durbin-Watson statistic

The DW statistic (often labeled d) is defined as: $d = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{SSE}$, where $e_t$ is the t-th observed residual with the data arranged in time order.

When $\rho = 0$ (no autocorrelation), $d \approx 2$.
When $\rho = 1$ (perfect positive correlation), $d \approx 0$.
When $\rho = -1$ (perfect negative correlation), $d \approx 4$.

The range of d is approximately $0 \le d \le 4$.

So, when d is close to 2, there is not much autocorrelation.
When d is small, it suggests evidence of positive autocorrelation ($\rho > 0$).
When $(4 - d)$ is small, it suggests evidence of negative autocorrelation ($\rho < 0$).
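
A minimal R sketch, assuming `lmfit` is a fitted lm() object whose rows are in time order; the lmtest package provides a Durbin-Watson test, and d can also be computed directly from the residuals:

library(lmtest)
dwtest(lmfit)   # Durbin-Watson test; the default alternative is positive autocorrelation

# Computing d by hand from the residuals
e <- residuals(lmfit)
d <- sum(diff(e)^2) / sum(e^2)
d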

Cochrane-Orcutt transformation

The Cochrane-Orcutt transformation (textbook Section 8.4) is one method.

To see how it works, write the errors for two adjacent observations:

$\epsilon_t = y_t - \beta_0 - \beta_1 x_t$
$\epsilon_{t-1} = y_{t-1} - \beta_0 - \beta_1 x_{t-1}$

Since $\epsilon_t = \rho\,\epsilon_{t-1} + \omega_t$, subtracting $\rho$ times the second equation from the first gives

$(y_t - \rho y_{t-1}) = \beta_0(1-\rho) + \beta_1(x_t - \rho x_{t-1}) + \omega_t.$

This is equivalent to the linear model $y_t' = \beta_0' + \beta_1' x_t' + \omega_t$

with
$y_t' = (y_t - \rho y_{t-1})$,
$x_t' = (x_t - \rho x_{t-1})$,
$\beta_0' = \beta_0(1-\rho)$, and
$\beta_1' = \beta_1$

This linear model is then fit to the data.
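
A minimal sketch of one Cochrane-Orcutt step in R, assuming `lmfit` was fit as lm(y ~ x, data = dat) with the rows of `dat` in time order (packages such as orcutt iterate this procedure automatically):

e   <- residuals(lmfit)
rho <- sum(e[-1] * e[-length(e)]) / sum(e^2)    # estimate of the lag-1 correlation rho

y_star <- dat$y[-1] - rho * dat$y[-nrow(dat)]   # y'_t = y_t - rho * y_{t-1}
x_star <- dat$x[-1] - rho * dat$x[-nrow(dat)]   # x'_t = x_t - rho * x_{t-1}

fit_co <- lm(y_star ~ x_star)
coef(fit_co)[1] / (1 - rho)   # recovers beta0, since beta0' = beta0 * (1 - rho)
coef(fit_co)[2]               # beta1' = beta1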

Omitted Variables

Correlated Errors due to Omitted Variables

Correlated errors can occur when a variable not in the model is related
both to the response and to the time or order in which the data were collected.

Another Problem to Solve: (Multi)Collinearity

Collinearity (also called multicollinearity) refers to the situation where
some or all predictors in the model are substantially correlated with each other.

Linear relationships among predictors might not be only pairwise.
Perhaps one of the predictors is linearly related to some subset of the other predictors.

Issues

What’s the big deal about multicollinearity?

  1. Since the predictors provide “overlapping” information, results may be ambiguous
    • A portion of the information from each predictor is redundant, but not all
    • Predictor variables might each serve as a proxy for the others in the
      regression equation without affecting the total explanatory power.
    • Do we need all the predictors in the model?
    • If not, which ones should we drop?
    • How does this choice affect inference, prediction, or forecasting?
      (for prediction/forecasting, we are less concerned)
  2. We lose the simple interpretation of a regression coefficient
    “while all other predictors are held constant”

  3. If we proceed in the presence of moderate multicollinearity
    • Variance estimates are typically overstated
    • So, standard errors typically overstated
    • Confidence intervals wider than needed
    • t-statistics smaller than they should be
    • These are missed opportunities to notice important predictors
  4. If we proceed in the presence of strong multicollinearity
    • The estimates of the $\beta$s can be very unstable
    • The $\hat{\beta}$s may change substantially when other predictors are added/removed
      or observations are added/deleted (even if not outliers)

Detecting multicollinearity

Detection: Oddities in model estimates/tests

Look for a significant overall (model) F-statistic ($H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$)
with ALL individual t-tests nonsignificant.

However, a significant F-statistic ($H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$) with SOME nonsignificant t-tests for individual $\beta$s
does NOT necessarily imply multicollinearity.

There simply could be unimportant variables in the model.

Look for a $\beta$ coefficient that has the opposite sign
from what you (and everyone else) would expect.
(This may also be a result of omitting confounders: November 29 lecture.)

Detection: pairwise linear relationships (correlation, scatterplots)

Detection: more complex linear relationships (partial $R^2$)

Not every multicollinearity situation can be detected by pairwise relationships:
correlation, scatterplots.

Other measures include the partial $R^2$ and its companion:
the variance inflation factor (VIF).

Detection: Variance Inflation Factor (VIF)

Let $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing $x_j$ on all of the other predictors.

We call $VIF_j$ the variance inflation factor due to $x_j$.

When $R_j^2$ is large (that is, when $x_j$ is well explained by the other predictors), $VIF_j$ is large.

…and if $R_j^2 = 0$ (orthogonal predictors), $VIF_j = 1$.

Rough guide: VIF >10 indicates moderate multicollinearity is present.
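
A minimal R sketch, assuming `lmfit` is a fitted multiple regression such as lm(y ~ x1 + x2 + x3, data = dat) (hypothetical names); the car package computes VIFs, and the VIF for one predictor can also be computed by hand:

library(car)
vif(lmfit)   # one variance inflation factor per predictor

# VIF for x1 "by hand": R^2 from regressing x1 on the other predictors
r2_1 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
1 / (1 - r2_1)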

What to do about multicollinearity?

The bad news: We cannot “fix” or eliminate multicollinearity for a given model.

For example, no transformation will help.

Some ideas…

  1. Get more and better data
  2. Omit redundant variables
  3. Centering
  4. Try “constrained regression” (Section 10.4)
    This is the only section in Chapter 10 covered in this course.
  5. Principal components (not covered in this course, see Sections 10.1-10.3)

Variable (model) selection

Variable selection is the process of choosing a “best” subset of all available predictors.

Well, there is no single “best” subset.

We do want a model we can interpret or justify with respect to the questions of interest.

Model comparison statistics

Nested models

  • F-test

Any two models with the same response

  • MSE (Mean Squared Error)

  • AIC (Akaike Information Criterion)

    $AIC = n \log_e(SSE_p/n) + 2p$

  • BIC (Bayesian Information Criterion)

    $BIC = n \log_e(SSE_p/n) + p \log_e(n)$

  • Smaller is better for both criteria (see the R sketch after this list)

  • Models with an AIC difference ≤ 2 should be treated as equally adequate

  • Similarly, models with a BIC difference ≤ 2 should be treated as equally adequate

  • The BIC penalty for larger models is more severe

    • $p \log_e(n) > 2p$ (whenever $n > 8$)

    • Controls the tendency of AIC toward overfitting

  • Mallows' $C_p$ (has fallen out of favor)
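
A minimal R sketch comparing two candidate models (assuming a hypothetical data frame `dat` with columns y, x1, x2); R's built-in AIC() and BIC() use the log-likelihood form, which differs from the SSE form above by a constant that cancels when comparing models with the same response:

fit1 <- lm(y ~ x1, data = dat)
fit2 <- lm(y ~ x1 + x2, data = dat)

AIC(fit1, fit2)   # smaller is better
BIC(fit1, fit2)   # BIC penalizes the larger model more heavily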

Any two models with the same response (but possibly differently transformed)

  • adjusted $R^2$

Automation

Automated variable selection strategies

Thinking through the problem carefully is always a strategy that should be employed!!!

But sometimes we may want to use one of these objective approaches:

  • Forward Selection

  • Backward Elimination

  • Stepwise Selection (a little of both)

Forward Selection
  1. Begin with an “empty” model (no predictors)

  2. Add the predictor with highest correlation with response
    • or largest t-statistic (smallest p-value)
    • $H_0: \mathrm{correlation}_j = 0$ is equivalent to $H_0: \beta_j = 0$
    • decide here on a significance level required for entry at each step
  3. Continue on to add next predictor with highest partial correlation with response
    • that is, after adjusting for the predictor added in step (2)
    • or largest t-statistic meeting the significance level criterion
    • $H_0: \text{partial correlation}_j = 0$ is equivalent to $H_0: \beta_j = 0$
  4. With new model from (3), go back to step (3) again
    • add another predictor if one meets the criterion (a p-value threshold)
    • repeat (3) until no other predictors meet the criterion
Backward Elimination
  1. Begin with the “full” model (all predictors)

  2. Remove the predictor with the smallest t-statistic (largest p-value)

  3. Continue on to remove the next predictor with smallest t-statistic
    • that is, after adjusting for the remaining predictors after step (2)
  4. With the new model from (3), go back to step (3) again
    • remove another predictor if it meets the criterion (a p-value threshold)
    • repeat (3) until no other predictors meet the criterion
Stepwise Selection

It combines forward and backward steps,
usually beginning from an empty model (forward stepwise).
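
A minimal R sketch with the built-in step() function, assuming a hypothetical data frame `dat` with response y and candidate predictors x1, x2, x3; note that step() adds and drops terms by AIC rather than by the p-value thresholds described above:

full  <- lm(y ~ x1 + x2 + x3, data = dat)
empty <- lm(y ~ 1, data = dat)

step(empty, scope = formula(full), direction = "forward")   # forward selection
step(full, direction = "backward")                          # backward elimination
step(empty, scope = formula(full), direction = "both")      # stepwise selection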

Logistic regression

A logistic regression model describes the log odds of a probability (that a binary variable Y takes on the value 1) as a linear combination of other variables (the X's: covariates or predictors).

Because the error structure is different (Bernoulli rather than assumed normal), a different model estimation method from least squares is used: maximum likelihood estimation.

Here is the logistic regression model for the probability of an event as a function of a binary predictor variable x: $\log\!\left(\dfrac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x$

When $x = 0$, $\log\!\left(\dfrac{\pi}{1-\pi}\right) = \beta_0 + \beta_1(0) = \beta_0 = \log\!\left(\dfrac{\pi_0}{1-\pi_0}\right)$

When $x = 1$, $\log\!\left(\dfrac{\pi}{1-\pi}\right) = \beta_0 + \beta_1(1) = \beta_0 + \beta_1 = \log\!\left(\dfrac{\pi_1}{1-\pi_1}\right)$

Then, $\log(OR) = \log\!\left(\dfrac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)}\right) = \log\!\left(\dfrac{\pi_1}{1-\pi_1}\right) - \log\!\left(\dfrac{\pi_0}{1-\pi_0}\right) = (\beta_0 + \beta_1) - \beta_0 = \beta_1$

Of course, we can have continuous predictor variables, or multiple predictors, as well.
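
A minimal R sketch, assuming a hypothetical data frame `dat` with a 0/1 response y and a predictor x; logistic regression is fit with glm() and the binomial family:

fit_logit <- glm(y ~ x, family = binomial, data = dat)
summary(fit_logit)

exp(coef(fit_logit)["x"])   # exp(beta1) estimates the odds ratio for x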

Assumptions and Model Diagnostics

The assumptions:

  • No systematic bias in measurement assumption: still required
  • Uncorrelated observations assumption still required
  • No strong multicollinearity still required (and you can still use VIF to check for violations of this assumption).
  • Linearity (w.r.t link function) still required
  • No influential observations still required.
  • Normality and constant variance assumptions are no longer required; the error term is Bernoulli distributed.

Regularization: beyond OLS

To prevent overfitting, to find simple models, and to deal with multicollinearity, we have two additional techniques: LASSO and ridge regression.

Ridge regression

Minimize $\displaystyle\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \hat{\beta}_j^2$

Optimize using Maximum Likelihood Estimation.

Find λ through a cross-validation procedure.

  • Can find good models even in the presence of multicollinearity
  • Keeps regression coefficients from getting too big
  • Doesn’t tend to make any coefficients exactly 0

LASSO

Minimize $\displaystyle\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\hat{\beta}_j|$

Optimize using Maximum Likelihood Estimation.

  • Optimization is harder (slower) than for ridge because the penalty is not differentiable
  • Tends to make some coefficients identically zero
  • Good for finding simple models!
  • An alternative to stepwise methods for model selection (see the glmnet sketch below)
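
A minimal R sketch with the glmnet package, assuming a numeric predictor matrix `X` and response vector `y` (hypothetical names); alpha selects the penalty, and cv.glmnet() chooses λ by cross-validation:

library(glmnet)

ridge_cv <- cv.glmnet(X, y, alpha = 0)   # alpha = 0: ridge penalty
lasso_cv <- cv.glmnet(X, y, alpha = 1)   # alpha = 1: LASSO penalty

coef(ridge_cv, s = "lambda.min")   # ridge: coefficients shrunk, none exactly zero
coef(lasso_cv, s = "lambda.min")   # LASSO: some coefficients exactly zero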