
Transformations

We use transformations to achieve linearity and constant variance.

There are a few that we use very frequently:

Which transformation? Exponential growth/decay

Consider the context of the data
Is exponential growth/decay expected?

  • \(\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}\)

  • Non-linear model and non-constant spread (multiplicative errors)

  • Suggested transformation: \(\log(Y)\)

    • Take the logarithm of both sides of the equation to linearize
  • \(\log(Y) = \log(\beta_0) + \beta_1 x + \epsilon\), which is linear in \(x\) with intercept \(\log(\beta_0)\)
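A small simulation can illustrate the linearization. The sketch below uses made-up values (a multiplicative constant of 2, a growth rate of 0.3, and an error SD of 0.2) that are not from the lecture. It only shows that regressing \(\log(Y)\) on \(x\) approximately recovers the growth rate, and that exponentiating the fitted intercept returns to the original scale.

```r
# Simulated exponential growth with multiplicative errors (hypothetical values)
set.seed(1)
n  <- 200
x  <- runif(n, 0, 10)
b0 <- 2      # assumed multiplicative constant
b1 <- 0.3    # assumed growth rate
y  <- b0 * exp(b1 * x) * exp(rnorm(n, sd = 0.2))

# After taking logs the model is linear: log(y) = log(b0) + b1 * x + error
fit <- lm(log(y) ~ x)
est <- coef(fit)

# Back-transform the intercept to the original (multiplicative) scale
b0_hat <- exp(est[["(Intercept)"]])
b1_hat <- est[["x"]]
print(c(b0_hat = b0_hat, b1_hat = b1_hat))
```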

Which transformation? Multiplicative effect of x (predictor)

Consider the context of the data
Is multiplicative growth/shrinkage expected?

  • \(\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}\)

  • Non-linear model and non-constant spread (multiplicative errors)

  • Suggested transformation: \(\log(Y)\) and \(\log(x)\)

    • Take the logarithm of both sides of the equation to linearize
  • \(\log(Y) = \log(\beta_0) + \beta_1\log(x) + \epsilon\), which is linear in \(\log(x)\) with intercept \(\log(\beta_0)\)

Which transformation? Responses are proportions.

Consider the context of the data
Is the response a proportion (binomial model)?

  • \(E(Y\, |\, x) = \pi(x)\) = some function of \(x\) with range between 0 and 1

  • \({\mathrm{var}}(Y\, |\, x) = \displaystyle{\frac{\pi(x)[1-\pi(x)]}{n}}\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)

  • restricted, non-normal response, non-constant spread (varies with the mean)

  • Suggested transformation: \(\;\displaystyle \log\left(\frac{Y}{1-Y}\right)\)

Or, use logistic regression

Which transformation? Responses are counts of a rare event

Consider the context of the data
Is the response a count from a fairly rare event (Poisson model)?

  • \(E(Y\, |\, x) = \lambda(x)\) = some function of \(x\) (\(\lambda(x) > 0\))

  • \({\mathrm{var}}(Y\, |\, x) = \lambda(x)\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)

  • right-skewed response and non-constant spread (increases as the mean increases)

  • Suggested model: \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)

Or, use Poisson regression

Which transformation? Try “ladder of transformation”

Power: \(-2\) \(-1\) \(-1/2\) \(0\;(\log)\) \(1/2\) \(1\) \(2\)
Response: \(\frac{1}{Y^2}\) \(\frac{1}{Y}\) \(\frac{1}{\sqrt{Y}}\) \(\log(Y)\) \(\sqrt{Y}\) \(Y\) \(Y^2\)
Predictor: \(\frac{1}{x^2}\) \(\frac{1}{x}\) \(\frac{1}{\sqrt{x}}\) \(\log(x)\) \(\sqrt{x}\) \(x\) \(x^2\)

Which transformation? Let “boxcox” suggest

The “boxcox” algorithm
tries to find an “optimal” transformation of \(Y\) of the form \(Y^{\lambda}\)

This method does not suggest transformations for the predictors, only the response.

If the method suggests \(\lambda = -1.78\), either use \(Y^{-1.78}\)
or a close, but easier-to-interpret transformation, like \(Y^{-2} = \displaystyle \frac{1}{Y^2}\)

If \(\lambda=0\) is suggested by the boxcox method,
then use the \(\log(Y)\) transformation.

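library(MASS)  # provides boxcox()
boxcox(lm(y ~ x))  # plot the profile log-likelihood over the default range of lambda

boxcox(lm(y ~ x), lambda = seq(0, 1, 0.1))  # restrict the search to 0 <= lambda <= 1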

Transformations for non-linearity

  • If you know (or can guess) the form of the true relationship between response and predictor,
    • let that guide your transformation.
  • Write down the equation for what you believe may be the true relationship,
    • then do algebra (on both sides of the equation) until you reach a linear form.
    • Then define new variables \(Y^\prime\), \(X^\prime\) as needed.
  • Or try different transformations of the predictor and/or response variables and see what works.

Transformations to achieve constant variance

When the variance is proportional to \(x^2\)

\[Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}\]

\[y^\prime_i = {\beta_{0}}^\prime + {\beta_{1}}^\prime x^\prime_i + \epsilon^\prime_i\]
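To see why dividing by \(X\) helps, a short derivation may be useful. It assumes the original model is \(Y = \beta_0 + \beta_1 X + \epsilon\) with \({\mathrm{var}}(\epsilon) = k^2 X^2\) for some constant \(k\) (the symbol \(k\) is introduced here only for illustration):

\[\frac{Y}{X} \;=\; \beta_0\,\frac{1}{X} + \beta_1 + \frac{\epsilon}{X}, \qquad {\mathrm{var}}\left(\frac{\epsilon}{X}\right) \;=\; \frac{k^2 X^2}{X^2} \;=\; k^2\]

So the transformed errors have constant variance, and the intercept and slope swap roles: the new intercept is \(\beta_1\) and the new slope is \(\beta_0\).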

Weighted least squares

\[\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{{\mathrm{var}}(\epsilon_i)}\]
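As a rough illustration, weighted least squares can be run with the `weights` argument of `lm()`. The data below are simulated with made-up values, and the weights \(1/x^2\) assume the error variance is proportional to \(x^2\); none of this comes from the lecture's data.

```r
# Simulated data whose error SD grows in proportion to x (hypothetical values)
set.seed(2)
n <- 300
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # var(eps_i) proportional to x_i^2

# Weights proportional to 1 / var(eps_i)
w <- 1 / x^2

ols <- lm(y ~ x)
wls <- lm(y ~ x, weights = w)

# WLS minimizes sum(w * e^2); compare the two sets of estimates
wss <- sum(w * resid(wls)^2)
print(rbind(OLS = coef(ols), WLS = coef(wls)))
```

With these particular weights, the weighted fit should give the same estimates as regressing \(Y/X\) on \(1/X\), with intercept and slope swapped.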

Autocorrelation

So, what’s the problem with autocorrelation?

Autocorrelation can affect OLS analysis in several ways:

  • OLS estimates are still unbiased but are not efficient (no longer have minimum variance)
  • \(\sigma^2\) and \({\mathrm{se}}(\beta)\)s may be seriously under- (or over-) estimated.
    • with positive correlation, \(\sigma^2\) and \({\mathrm{se}}(\beta)\)s are often underestimated
    • with negative correlation (less common), they are often overestimated

Underestimating the variability leads to more false positive decisions:
rejecting the null hypothesis \(\beta = 0\) when one should not.

Whether positive or negative correlation,
confidence intervals for \(\beta\)s may no longer be valid.

Tests for autocorrelation

Runs test

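library(tseries)  # assumed source of runs.test(), which expects a two-level factor
runs.test( as.factor( rstandard(lmfit) > 0) )  # lmfit: a fitted lm object, residuals in time order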

Durbin-Watson approach

If the errors are correlated, perhaps each error (\(\epsilon_t\)):

  • depends linearly on the size and sign of the error before it in time (\(\epsilon_{t-1}\)),
  • has correlation \(\rho = {\mathrm{cor}}(\epsilon_t, \epsilon_{t-1})\), and
  • does not depend directly on any earlier errors (\(\epsilon_{t-2}, \epsilon_{t-3}, \ldots\)) once \(\epsilon_{t-1}\) is accounted for.

If this model is reasonable, then \[ \epsilon_t=\gamma_0 + \gamma_1\epsilon_{t-1}+\omega_t = \rho\epsilon_{t-1}+\omega_t, \qquad |\rho|<1, \] where \(\rho\) is the correlation coefficient between successive errors
and \(\omega_t\) are independent, normally distributed, with constant variance across the errors (\(\epsilon\)s).

The Durbin-Watson statistic

The DW statistic (often labeled \(d\)) is defined as: \[ d\;=\; \frac{\sum_{t=2}^n (e_t-e_{t-1})^2}{\sum_{t=1}^n e_t^2} \;=\; \frac{\sum_{t=2}^n (e_t-e_{t-1})^2}{{\mathrm{SSE}}}, \] where \(e_t\) is the \(t^{th}\) observed residual with the data arranged in time order.

When \(\rho = 0\) (no autocorrelation), \(d \approx 2\).
When \(\rho = 1\) (perfect positive correlation), \(d \approx 0\).
When \(\rho = -1\) (perfect negative correlation), \(d \approx 4\).

The range of \(d\) is approximately \(0 \le d \le 4\).

So, when \(d\) is close to 2, there is not much autocorrelation.
When \(d\) is small, it suggests evidence of positive autocorrelation (\(\rho > 0\)).
When \((4-d)\) is small, it suggests evidence of negative autocorrelation (\(\rho < 0\)).
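The statistic is easy to compute by hand from the time-ordered residuals. The sketch below simulates errors that follow the model above with a made-up \(\rho = 0.8\) (not a value from the lecture) and also checks the standard approximation \(d \approx 2(1 - r_1)\), where \(r_1\) is the lag-1 sample autocorrelation of the residuals.

```r
# Simulate a regression with AR(1) errors (hypothetical rho = 0.8)
set.seed(3)
n   <- 200
rho <- 0.8
omega <- rnorm(n)
eps <- numeric(n)
eps[1] <- omega[1]
for (i in 2:n) eps[i] <- rho * eps[i - 1] + omega[i]

x <- seq_len(n) / 10
y <- 1 + 0.5 * x + eps

# Residuals in time order
e <- resid(lm(y ~ x))

# Durbin-Watson statistic: sum of squared successive differences over SSE
d <- sum(diff(e)^2) / sum(e^2)

# Lag-1 sample autocorrelation of the residuals; d is roughly 2 * (1 - r1)
r1 <- sum(e[-1] * e[-n]) / sum(e^2)
print(c(d = d, approx = 2 * (1 - r1)))
```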

Cochrane-Orcutt transformation

The Cochrane-Orcutt transformation (textbook Section 8.4) is one method for removing first-order autocorrelation.

To see how it works, write the errors for two adjacent observations:

\[\begin{align} \epsilon_t & \;=\; y_t-\beta_0-\beta_1 x_t \\ \epsilon_{t-1} & \;=\; y_{t-1}-\beta_0-\beta_1 x_{t-1} \end{align}\]

Since \(\epsilon_t - \rho\,\epsilon_{t-1} = \omega_t\), subtracting \(\rho\) times the second equation from the first gives

\[ (y_t-\rho y_{t-1}) \;=\; \beta_0(1-\rho) + \beta_1( x_t-\rho x_{t-1}) + \omega_t. \]

This is equivalent to the linear model \[ y_t^\star=\beta_0^\star+\beta_1^\star x_t^\star +\omega_t \]

with
\(y_t^\star \;=\; (y_t-\rho y_{t-1})\),
\(x_t^\star \;=\; (x_t-\rho x_{t-1})\),
\(\beta_0^\star \;=\; \beta_0(1-\rho)\), and
\(\beta_1^\star \;=\; \beta_1\)

This linear model is then fit to the data.
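In practice \(\rho\) is unknown, so it has to be estimated first. The sketch below is a single pass of the idea on simulated data with made-up values; it estimates \(\rho\) from the lagged OLS residuals, which is one common choice and not necessarily the textbook's exact recipe (the full procedure is usually iterated).

```r
# Simulated regression with AR(1) errors (hypothetical values)
set.seed(4)
n   <- 300
rho <- 0.7
omega <- rnorm(n)
eps <- numeric(n)
eps[1] <- omega[1]
for (i in 2:n) eps[i] <- rho * eps[i - 1] + omega[i]
x <- cumsum(rnorm(n))            # a slowly varying predictor
y <- 2 + 1.5 * x + eps

# Step 1: OLS fit and an estimate of rho from the lagged residuals
ols     <- lm(y ~ x)
e       <- resid(ols)
rho_hat <- sum(e[-1] * e[-n]) / sum(e[-n]^2)

# Step 2: transform both variables (the first observation is lost)
y_star <- y[-1] - rho_hat * y[-n]
x_star <- x[-1] - rho_hat * x[-n]

# Step 3: fit the transformed model and map the intercept back
co    <- lm(y_star ~ x_star)
beta1 <- coef(co)[["x_star"]]
beta0 <- coef(co)[["(Intercept)"]] / (1 - rho_hat)
print(c(rho_hat = rho_hat, beta0 = beta0, beta1 = beta1))
```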

Omitted Variables

Correlated Errors due to Omitted Variables

Correlated errors can occur when a variable not in the model is related
to the response and related to the time or order in which the data were collected

Another Problem to Solve: (Multi)Collinearity

Collinearity (also called multicollinearity) refers to the situation where
some or all predictors in the model are substantially correlated with each other

Linear relationships among predictors might not be only pairwise.
Perhaps one of the predictors is linearly related to some subset of other predictors.

Issues

What’s the big deal about multicollinearity?

  1. Since the predictors provide “overlapping” information, results may be ambiguous
    • A portion of the information from each predictor is redundant, but not all
    • Predictor variables might each serve as a proxy for the others in the
      regression equation without affecting the total explanatory power.
    • Do we need all the predictors in the model?
    • If not, which ones should we drop?
    • How does this choice affect inference, prediction, or forecasting?
      (for prediction/forecasting, we are less concerned)
  2. We lose the simple interpretation of a regression coefficient
    “while all other predictors are held constant”

  3. If we proceed in the presence of moderate multicollinearity
    • Variance estimates are typically overstated
    • So, standard errors typically overstated
    • Confidence intervals wider than needed
    • t-statistics smaller than they should be
    • These are missed opportunities to notice important predictors
  4. If we proceed in the presence of strong multicollinearity
    • The estimates of \(\beta\)s can be very unstable
    • \(\widehat\beta\)s may change substantially when other predictors are added/removed
      or observations are added/deleted (even if not outliers)

Detecting multicollinearity

Detection: Oddities in model estimates/tests

Look for overall (model) F-statistic significant (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)),
but ALL individual \(t\)-tests nonsignificant.

However, a significant F-statistic (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)) and SOME nonsignificant \(t\)-tests for individual \(\beta\)s
does NOT necessarily imply multicollinearity.

There simply could be non-important variables in the model.

Look for a \(\beta\) coefficient whose sign is the opposite
of what you (and everyone else) would expect.
(This may also be a result of omitting confounders: November 29 lecture.)

Detection: pairwise linear relationships (correlation, scatterplots)

Detection: more complex linear relationships (partial \(R^2\))

Not every multicollinearity situation can be detected by pairwise relationships:
correlation, scatterplots.

Other measures include the partial \(R^2\) and its companion:
the variance inflation factor (VIF)

Detection: Variance Inflation Factor (VIF)

Let \(\displaystyle{{\mathrm{VIF}}_j = \frac{1}{1 - R_j^2}}\),
where \(R_j^2\) is the \(R^2\) from regressing \(x_j\) on all the other predictors.

We call \({\mathrm{VIF}}_j\) the variance inflation factor due to \(x_j\).

When \(R_j^2\) is large (close to 1), VIF is large.

…and if \(R_j^2 = 0\) (orthogonal predictors), VIF = 1.

Rough guide: VIF \(> 10\) indicates moderate multicollinearity is present.
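The definition can be applied directly by regressing each predictor on the others. The sketch below does this on simulated predictors (made-up data, with `x2` built to be nearly a copy of `x1`); the name `vif_by_hand` is introduced here only for illustration.

```r
# Simulated predictors: x2 is nearly a linear function of x1 (hypothetical data)
set.seed(5)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # strongly related to x1
x3 <- rnorm(n)                  # unrelated to the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing x_j on the other predictors
vif_by_hand <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))
```

In this setup `x1` and `x2` should show large VIFs while `x3` stays near 1.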

What to do about multicollinearity?

The bad news: We cannot “fix” or eliminate multicollinearity for a given model.

For example, no transformation will help.

Some ideas…

  1. Get more and better data
  2. Omit redundant variables
  3. Centering
  4. Try “constrained regression” (Section 10.4)
    This is the only section in Chapter 10 covered in this course.
  5. Principal components (not covered in this course, see Sections 10.1-10.3)

Variable (model) selection

Variable selection is the process of choosing a “best” subset of all available predictors.

Well, there is no single “best” subset.

We do want a model we can interpret or justify with respect to the questions of interest.

Model comparison statistics

Nested models

  • F-test

Any two models with the same response

  • MSE (Mean Squared Error)

  • AIC (Akaike Information Criterion)

\[AIC = n \log_e(SSE_p / n) + 2p\]

  • BIC (Bayesian Information Criterion)

\[BIC = n \log_e(SSE_p / n) + p\log_e(n)\]

  • Smaller is better for both criteria

  • Models with AIC difference \(\le 2\) should be treated as equally adequate

  • Similarly, models with BIC difference \(\le 2\) should be treated as equally adequate

  • BIC penalty for larger models is more severe

    • \(p\log_e(n) > 2p\) (whenever \(n \ge 8\))

    • Controls the tendency of AIC toward “overfitting”

  • Mallows’ \(C_p\) (has fallen out of favor)

Any two models with the same response (but possibly differently transformed)

  • adjusted \(R^2\)
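The AIC and BIC formulas above can be computed directly from a fitted model. In the sketch below the data are simulated with made-up values, the helper `ic()` is hypothetical, and \(p\) is taken to be the number of estimated coefficients including the intercept (an assumption; the textbook may count \(p\) differently, which shifts every model by the same amount only if the convention is applied consistently).

```r
# Simulated data: y depends on x1 only (hypothetical values)
set.seed(6)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

# Criteria from the formulas above; p = number of estimated coefficients (assumed)
ic <- function(fit) {
  n   <- length(resid(fit))
  p   <- length(coef(fit))
  sse <- sum(resid(fit)^2)
  c(AIC = n * log(sse / n) + 2 * p,
    BIC = n * log(sse / n) + p * log(n))
}

small <- lm(y ~ x1)
big   <- lm(y ~ x1 + x2)
print(rbind(small = ic(small), big = ic(big)))

# extractAIC() uses the same SSE-based form; k = log(n) gives BIC
print(extractAIC(small))
```

R's `AIC()` and `BIC()` functions include extra terms that are the same for every model fit to the same data, so differences between models agree with the formulas here even though the raw values do not.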

Automation

Automated variable selection strategies

Thinking through the problem carefully should always be part of the strategy!

But sometimes we may want to use one of these objective approaches:

  • Forward Selection

  • Backward Elimination

  • Stepwise Selection (a little of both)

Forward Selection
  1. Begin with an “empty” model (no predictors)

  2. Add the predictor with highest correlation with response
    • or largest t-statistic (smallest p-value)
    • \(H_0: \text{correlation}_j = 0\) is equivalent to \(H_0: \beta_j=0\)
    • decide here on a significance level required for entry at each step
  3. Continue on to add next predictor with highest partial correlation with response
    • that is, after adjusting for the predictor added in step (2)
    • or largest t-statistic meeting the significance level criterion
    • \(H_0: \text{partial correlation}_j = 0\) is equivalent to \(H_0: \beta_j=0\)
  4. With the new model from (3), go back to step (3) again
    • add another predictor if one meets the criterion (a p-value threshold)
    • repeat (3) until no other predictors meet the criterion

Backward Elimination
  1. Begin with the “full” model (all predictors)

  2. Remove the predictor with the smallest t-statistic (largest p-value)

  3. Continue on to remove the next predictor with smallest t-statistic
    • that is, after adjusting for the remaining predictors after step (2)
  4. With the new model from (3), go back to step (3) again
    • remove another predictor if it meets the removal criterion (a p-value above the threshold)
    • repeat (3) until no remaining predictor meets the removal criterion

Stepwise Selection

It combines forward and backward steps,
usually beginning from the empty model (forward stepwise).
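Base R's `step()` automates these searches, with one important difference from the steps described above: it adds or drops terms by AIC, not by a p-value threshold. The sketch below uses simulated data with made-up coefficients, so it is only an illustration of the mechanics.

```r
# Simulated data: y depends on x1 and x2; x3 and x4 are noise (hypothetical)
set.seed(7)
n   <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

empty <- lm(y ~ 1, data = dat)                    # no predictors
full  <- lm(y ~ x1 + x2 + x3 + x4, data = dat)    # all predictors

# Forward selection: start empty, add one predictor at a time
fwd <- step(empty, scope = formula(full), direction = "forward", trace = 0)

# Backward elimination: start full, drop one predictor at a time
bwd <- step(full, direction = "backward", trace = 0)

print(formula(fwd))
print(formula(bwd))
```

Setting `direction = "both"` gives the stepwise variant.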

Logistic regression

A logistic regression model describes the log odds of a probability (that a binary variable \(Y\) takes on the value \(1\)) as a linear combination of some other variables (the \(X\)’s: covariates or predictors).

Because the error structure is different (Bernoulli rather than the assumed normal), the model is estimated by maximum likelihood instead of least squares.

Here is the logistic regression model for the probability of an event as a function of a binary predictor variable x: \[ \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x \]

When \(x=0\), \[ \log \left( \frac{\pi}{1-\pi} \right) \;=\; \beta_0 + \beta_1 (0) = \beta_0 \;=\; \log \left( \frac{\pi_0}{1-\pi_0} \right) \]

When \(x=1\), \[ \log \left( \frac{\pi}{1-\pi} \right) \;=\; \beta_0 + \beta_1 (1) = \beta_0 + \beta_1 \;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) \]

Then, \[ \begin{align} \log(OR) &\;=\; \log \left( \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \right)\\ &\;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) - \log \left( \frac{\pi_0}{1-\pi_0} \right) \\ &\;=\; (\beta_0 + \beta_1) - \beta_0\\ &\;=\; \beta_1 \end{align} \]

Of course, we can have continuous predictor variables, or multiple predictors, as well.
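The result \(\log(OR) = \beta_1\) for a binary predictor can be checked numerically. The sketch below simulates a binary outcome with made-up event probabilities (0.3 when \(x=0\), 0.6 when \(x=1\)) and compares \(e^{\widehat\beta_1}\) from `glm()` with the odds ratio computed from the sample proportions.

```r
# Simulated binary outcome with a binary predictor (hypothetical probabilities)
set.seed(8)
n <- 1000
x <- rbinom(n, 1, 0.5)
y <- rbinom(n, 1, ifelse(x == 1, 0.6, 0.3))

fit <- glm(y ~ x, family = binomial)

# Odds ratio from the model: exp(beta_1)
or_model <- exp(coef(fit)[["x"]])

# Odds ratio directly from the sample proportions
p0 <- mean(y[x == 0])
p1 <- mean(y[x == 1])
or_table <- (p1 / (1 - p1)) / (p0 / (1 - p0))

print(c(or_model = or_model, or_table = or_table))
```

The two values should agree up to the numerical tolerance of the fitting algorithm.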

Assumptions and Model Diagnostics

The assumptions:

  • No systematic bias in measurement: still required.
  • Uncorrelated observations: still required.
  • No strong multicollinearity: still required (you can still use VIF to check for violations).
  • Linearity (with respect to the link function): still required.
  • No influential observations: still required.
  • Normality and constant variance: no longer required. The error structure is Bernoulli rather than normal.

Regularization: beyond OLS

To prevent overfitting, to find simple models, to deal with multicollinearity, we have two additional techniques: LASSO and ridge regression

Ridge regression

Minimize \[\sum_i (y_i-\hat y_i)^2 + \lambda \sum_{j=1}^p \hat \beta_j^2\]

Optimize by penalized least squares (equivalent to penalized maximum likelihood estimation under normal errors).

Find \(\lambda\) through a cross-validation procedure.

  • Can find good models even in the presence of multicollinearity
  • Keeps regression coefficients from getting too big
  • Doesn’t tend to make any coefficients exactly 0
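Ridge regression has a closed-form solution, which makes the shrinkage easy to see. The sketch below uses simulated, nearly collinear predictors and an arbitrary \(\lambda = 10\) (in practice \(\lambda\) would be chosen by cross-validation); the helper `ridge()` is hypothetical, and predictors are standardized and the response centered so that the intercept is not penalized.

```r
# Simulated collinear predictors (hypothetical data)
set.seed(9)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)     # nearly collinear with x1
y  <- 1 + x1 + x2 + rnorm(n)

# Standardize predictors and center the response so no intercept is penalized
X  <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Closed-form ridge solution: (X'X + lambda * I)^(-1) X'y
ridge <- function(lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, yc)))
}

b_ols   <- ridge(0)     # lambda = 0 reproduces least squares
b_ridge <- ridge(10)    # lambda = 10 is an arbitrary illustrative value
print(rbind(OLS = b_ols, ridge = b_ridge))
```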

LASSO

Minimize \[\sum_i (y_i-\hat y_i)^2 + \lambda \sum_{j=1}^p |\hat\beta_j|\]

Optimize by penalized least squares (equivalent to penalized maximum likelihood estimation under normal errors).

  • Optimization is harder (slower) than for ridge because the penalty is not differentiable at zero
  • Tends to make some coefficients identically zero
  • Good for finding simple models!
  • An alternative to stepwise models for model selection