We use transformations to achieve linearity and constant variance.
There are a few that we use very frequently:
Consider the context of the data
Is exponential growth/decay expected?
\(\displaystyle Y = \beta_0\; e^{\beta_1 x}\; e^{\epsilon}\)
Non-linear model and non-constant spread (multiplicative errors)
Suggested transformation: \(\log(Y)\)
\(\log(Y) = \log(\beta_0) + \beta_1 x + \epsilon\), which is linear in \(x\) with intercept \(\log(\beta_0)\)
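A minimal sketch of the exponential case in R, using simulated data. The parameter values (2, 0.3, error sd 0.1) and the object names are made up for illustration and are not from the course data.

```r
# Simulated data (hypothetical values): Y = b0 * exp(b1 * x) * exp(eps)
set.seed(1)
x   <- seq(1, 10, length.out = 50)
eps <- rnorm(50, sd = 0.1)
y   <- 2 * exp(0.3 * x) * exp(eps)

# After the log transformation the model is linear in x
fit <- lm(log(y) ~ x)
coef(fit)           # intercept estimates log(2), slope estimates 0.3
exp(coef(fit)[1])   # back-transform the intercept to the original scale
```

A residual plot of `fit` should show roughly constant spread, whereas the residuals from `lm(y ~ x)` would fan out as `x` grows.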
Consider the context of the data
Is multiplicative growth/shrinkage expected?
\(\displaystyle Y = \beta_0 x^{\beta_1}e^{\epsilon}\)
Non-linear model and non-constant spread (multiplicative errors)
Suggested transformation: \(\log(Y)\) and \(\log(x)\)
\(\log(Y) = \log(\beta_0) + \beta_1\log(x) + \epsilon\), which is linear in \(\log(x)\) with intercept \(\log(\beta_0)\)
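One way to read the slope in the power model (a short worked step, with an illustrative value of \(\beta_1\) that is not from the course data): for a fixed error \(\epsilon\), multiplying \(x\) by a constant \(c\) gives
\[ \frac{\beta_0\,(cx)^{\beta_1}e^{\epsilon}}{\beta_0\,x^{\beta_1}e^{\epsilon}} \;=\; c^{\beta_1}, \]
so \(Y\) is multiplied by \(c^{\beta_1}\) regardless of the starting value of \(x\). For instance, if \(\beta_1 = 0.5\), doubling \(x\) multiplies \(Y\) by \(2^{0.5} \approx 1.41\).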
Consider the context of the data
Is the response a proportion (binomial model)?
\(E(Y\, |\, x) = \pi(x)\) = some function of \(x\) with range between 0 and 1
\({\mathrm{var}}(Y\, |\, x) = \displaystyle{\frac{\pi(x)[1-\pi(x)]}{n}}\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)
restricted, non-normal response, non-constant spread (varies with the mean)
Suggested transformation: \(\;\displaystyle \log\left(\frac{Y}{1-Y}\right)\)
Or, use logistic regression
Consider the context of the data
Is the response a count from a fairly rare event (Poisson model)?
\(E(Y\, |\, x) = \lambda(x)\) = some function of \(x\) (\(\lambda(x) > 0\))
\({\mathrm{var}}(Y\, |\, x) = \lambda(x)\) …which depends on \(x\) and the mean \(E(Y\, |\, x)\)
right-skewed response and non-constant spread (increases as the mean increases)
Suggested model: \(\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon\)
Or, use Poisson regression
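Both options for count data can be sketched in R as follows. The data are simulated, and the rate function \(\lambda(x) = e^{0.5 + 0.8x}\) is an assumption chosen only for illustration.

```r
set.seed(2)
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))   # hypothetical rates

# Option 1: square-root transformation, then ordinary least squares
fit_sqrt <- lm(sqrt(y) ~ x)

# Option 2: Poisson regression (log link), fit by maximum likelihood
fit_pois <- glm(y ~ x, family = poisson)
coef(fit_pois)   # estimates of 0.5 and 0.8 on the log scale
```

The two fits are on different scales (square root versus log), so their coefficients are not directly comparable.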
| \(-2\) | \(-1\) | \(-1/2\) | \(\log\) | \(1/2\) | \(1\) | \(2\) |
|---|---|---|---|---|---|---|
| \(\frac{1}{Y^2}\) | \(\frac{1}{Y}\) | \(\frac{1}{\sqrt{Y}}\) | \(\log(Y)\) | \(\sqrt{Y}\) | \(Y\) | \(Y^2\) |
| \(\frac{1}{x^2}\) | \(\frac{1}{x}\) | \(\frac{1}{\sqrt{x}}\) | \(\log(x)\) | \(\sqrt{x}\) | \(x\) | \(x^2\) |
The Box-Cox method is an algorithm that tries to find an “optimal” transformation of \(Y\) of the form \(Y^{\lambda}\).
This method does not suggest transformations for the predictors, only the response.
If the method suggests \(\lambda = -1.78\), either use \(Y^{-1.78}\)
or a close, but easier-to-interpret transformation, like \(Y^{-2} = \displaystyle \frac{1}{Y^2}\)
If \(\lambda=0\) is suggested by the Box-Cox method,
then use the \(\log(Y)\) transformation.
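library(MASS)  # provides boxcox()
boxcox(lm(y ~ x))  # profile log-likelihood over the default grid of lambda from -2 to 2
boxcox(lm(y ~ x), lambda = seq(0, 1, 0.1))  # zoom in on lambda between 0 and 1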
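A small sketch of reading the suggested \(\lambda\) off the `boxcox()` output rather than the plot. The data are simulated so that \(\log(Y)\) is linear in \(x\); the coefficients and the name `lambda_hat` are illustrative assumptions.

```r
library(MASS)   # boxcox() lives in MASS, which ships with R

set.seed(3)
x <- seq(1, 10, length.out = 100)
y <- exp(1 + 0.4 * x + rnorm(100, sd = 0.2))   # log(y) is linear in x

# plotit = FALSE returns the grid (bc$x) and the profile log-likelihood (bc$y)
bc <- boxcox(lm(y ~ x), lambda = seq(-2, 2, 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]   # lambda with the largest log-likelihood
lambda_hat                            # expected to be near 0, pointing to log(y)
```

In practice one would round `lambda_hat` to a nearby interpretable power from the table of common transformations.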
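If the standard deviation of \(\epsilon\) is proportional to \(X\), divide the model \(Y = \beta_0 + \beta_1 X + \epsilon\) through by \(X\) to obtain errors with constant variance:
\[Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}\]
\[y^\prime_i = {\beta_{0}}^\prime + {\beta_{1}}^\prime x^\prime_i + \epsilon^\prime_i\]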
Weighted least squares minimizes the weighted sum of squared residuals
\[\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{{\mathrm{var}}(\epsilon_i)}\]
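The sketch below checks, on simulated data, that regressing \(Y/X\) on \(1/X\) and running weighted least squares with \(w_i = 1/x_i^2\) give the same estimates. It assumes the error standard deviation is proportional to \(x\); the data and object names are hypothetical.

```r
set.seed(4)
x <- runif(60, 1, 10)
y <- 3 + 2 * x + rnorm(60, sd = 0.5 * x)   # error sd grows in proportion to x

# Transformed regression: Y' = Y/X on X' = 1/X
fit_trans <- lm(I(y / x) ~ I(1 / x))

# Weighted least squares with weights proportional to 1/var(eps_i) = 1/x^2
fit_wls <- lm(y ~ x, weights = 1 / x^2)

coef(fit_trans)   # (intercept, slope) estimate (beta_1, beta_0)
coef(fit_wls)     # (intercept, slope) estimate (beta_0, beta_1)
```

The two coefficient vectors agree once the roles of intercept and slope are swapped, because both fits minimize \(\sum (y_i - b_0 - b_1 x_i)^2 / x_i^2\).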
Regression assumption:
error terms \(\epsilon_i\) and \(\epsilon_j\) for observations \(i\) and \(j\) are uncorrelated.
Autocorrelation can affect OLS analysis in several ways:
Underestimate of variability which leads to increased false positive decisions:
rejecting the null hypothesis \(\beta = 0\) when one should not
Whether positive or negative correlation,
confidence intervals for \(\beta\)s may no longer be valid.
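library(tseries)  # one package that provides runs.test() for a two-level factor
runs.test(as.factor(rstandard(lmfit) > 0))  # runs test on the signs of the standardized residuals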
If the errors are correlated, perhaps each error (\(\epsilon_t\)) depends linearly on the one before it (\(\epsilon_{t-1}\)).
If this model is reasonable, then \[
\epsilon_t=\gamma_0 + \gamma_1\epsilon_{t-1}+\omega_t = \rho\epsilon_{t-1}+\omega_t, \qquad |\rho|<1,
\] taking \(\gamma_0 = 0\) and \(\gamma_1 = \rho\), where \(\rho\) is the correlation coefficient between successive errors
and the \(\omega_t\) are independent and normally distributed with constant variance.
The DW statistic (often labeled \(d\)) is defined as: \[ d\;=\; \frac{\sum_{t=2}^n (e_t-e_{t-1})^2}{\sum_{t=1}^n e_t^2} \;=\; \frac{\sum_{t=2}^n (e_t-e_{t-1})^2}{{\mathrm{SSE}}}, \] where \(e_t\) is the \(t^{th}\) observed residual with the data arranged in time order.
When \(\rho = 0\) (no autocorrelation), \(d \approx 2\).
When \(\rho = 1\) (perfect positive correlation), \(d \approx 0\).
When \(\rho = -1\) (perfect negative correlation), \(d \approx 4\).
The range of \(d\) is approximately \(0 \le d \le 4\).
So, when \(d\) is close to 2, there is not much autocorrelation.
When \(d\) is small, it suggests evidence of positive autocorrelation (\(\rho > 0\)).
When \((4-d)\) is small, it suggests evidence of negative autocorrelation (\(\rho < 0\)).
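The statistic is easy to compute by hand from the residuals. The sketch below simulates AR(1) errors with \(\rho = 0.7\) (an arbitrary illustrative value) and compares \(d\) with \(2(1 - r)\), where \(r\) is the lag-1 autocorrelation of the residuals; all object names are hypothetical.

```r
set.seed(5)
n <- 200
x <- seq_len(n)

# AR(1) errors: eps_t = rho * eps_{t-1} + omega_t with rho = 0.7
omega <- rnorm(n)
eps <- numeric(n)
eps[1] <- omega[1]
for (t in 2:n) eps[t] <- 0.7 * eps[t - 1] + omega[t]
y <- 1 + 0.05 * x + eps

e <- resid(lm(y ~ x))                # residuals in time order
d <- sum(diff(e)^2) / sum(e^2)       # Durbin-Watson statistic
r <- sum(e[-1] * e[-n]) / sum(e^2)   # lag-1 autocorrelation of the residuals
c(d = d, approx = 2 * (1 - r))       # d is close to 2 * (1 - r)
```

Expanding the numerator shows \(d = 2(1 - r) - (e_1^2 + e_n^2)/\mathrm{SSE}\), which is why \(d \approx 2\) when \(r \approx 0\), \(d \approx 0\) when \(r \approx 1\), and \(d \approx 4\) when \(r \approx -1\).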
The Cochrane-Orcutt transformation (textbook Section 8.4) is one method.
To see how it works, we can write the errors for two adjacent observations:
\[\begin{align} \epsilon_t & \;=\; y_t-\beta_0-\beta_1 x_t \\ \epsilon_{t-1} & \;=\; y_{t-1}-\beta_0-\beta_1 x_{t-1} \end{align}\]
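Since \(\epsilon_t - \rho\,\epsilon_{t-1} = \omega_t\), subtracting \(\rho\) times the second equation from the first gives
\[ (y_t-\rho y_{t-1}) \;=\; \beta_0(1-\rho) + \beta_1( x_t-\rho x_{t-1}) + \omega_t. \]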
This is equivalent to the linear model \[ y_t^\star=\beta_0^\star+\beta_1^\star x_t^\star +\omega_t \]
with
\(y_t^\star \;=\; (y_t-\rho y_{t-1})\),
\(x_t^\star \;=\; (x_t-\rho x_{t-1})\),
\(\beta_0^\star \;=\; \beta_0(1-\rho)\), and
\(\beta_1^\star \;=\; \beta_1\)
This linear model is then fit to the data.
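A one-pass sketch of the transformation in R on simulated data. In practice \(\rho\) is unknown; estimating it from the lag-1 relationship among the OLS residuals, as done here, is one common choice and is an assumption of this sketch, as are the simulated values and object names.

```r
set.seed(6)
n <- 200
x <- runif(n, 0, 10)
omega <- rnorm(n)
eps <- numeric(n)
eps[1] <- omega[1]
for (t in 2:n) eps[t] <- 0.7 * eps[t - 1] + omega[t]   # AR(1) errors, rho = 0.7
y <- 1 + 2 * x + eps

# Step 1: OLS fit, then estimate rho by regressing e_t on e_{t-1} (no intercept)
e <- resid(lm(y ~ x))
rho_hat <- sum(e[-1] * e[-n]) / sum(e[-n]^2)

# Step 2: transform the response and predictor, then refit
y_star <- y[-1] - rho_hat * y[-n]
x_star <- x[-1] - rho_hat * x[-n]
fit_star <- lm(y_star ~ x_star)

# Step 3: back-transform the intercept; the slope needs no change
b0 <- unname(coef(fit_star)[1]) / (1 - rho_hat)
b1 <- unname(coef(fit_star)[2])
c(rho_hat = rho_hat, b0 = b0, b1 = b1)
```

The residuals of `fit_star` can be checked again with the Durbin-Watson statistic; if autocorrelation remains, the steps can be repeated.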
Collinearity (also called multicollinearity) refers to the situation where
some or all predictors in the model are substantially correlated with each other
Linear relationships among predictors might not be only pairwise.
Perhaps one of the predictors is linearly related to some subset of the other predictors.
We lose the simple interpretation of a regression coefficient
“while all other predictors are held constant”
Look for a significant overall (model) F-statistic (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)),
but ALL individual \(t\)-tests nonsignificant.
However, a significant F-statistic (\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)) and SOME nonsignificant \(t\)-tests for individual \(\beta\)s
does NOT necessarily imply multicollinearity.
There simply could be non-important variables in the model.
Look for a \(\beta\) coefficient whose sign is the opposite
of what you (and everyone else) would expect.
(This may also be a result of omitting confounders: November 29 lecture.)
pairwise linear relationships (correlation, scatterplots)
Not every multicollinearity situation can be detected by pairwise relationships:
correlation, scatterplots.
Other measures include the partial \(R^2\) and its companion:
the variance inflation factor (VIF)
Let \(\displaystyle{{\mathrm{VIF}}_j = \frac{1}{1 - R_j^2}}\).
We call \({\mathrm{VIF}}_j\) the variance inflation factor due to \(x_j\).
The variance inflation factor: \(\;\displaystyle{{\mathrm{VIF}}_j = \frac{1}{1 - R_j^2}}\)
When \(R_j^2\) (the \(R^2\) from regressing \(x_j\) on all the other predictors) is large, VIF is large.
…and if \(R_j^2 = 0\) (orthogonal predictors), VIF = 1.
Rough guide: VIF \(> 10\) indicates moderate multicollinearity is present.
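The definition can be applied directly with `lm()`. In this sketch the predictors are simulated so that `x3` is nearly a linear combination of `x1` and `x2`; the data and names are hypothetical.

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)   # nearly a linear combination of x1 and x2
X  <- data.frame(x1, x2, x3)

vif <- sapply(names(X), function(j) {
  # R_j^2: regress x_j on all the other predictors
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})
vif
cor(x1, x2)   # the pairwise correlation between x1 and x2 is small
```

All three VIFs are large even though `x1` and `x2` are nearly uncorrelated with each other, which illustrates why pairwise correlations alone can miss collinearity.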
The bad news: We cannot “fix” or eliminate multicollinearity for a given model.
For example, no transformation will help.
Some ideas…
Variable selection is the process of choosing a “best” subset of all available predictors.
Well, there is no single “best” subset.
We do want a model we can interpret or justify with respect to the questions of interest.
Nested models
Any two models with the same response
MSE (Mean Squared Error)
AIC (Akaike Information Criterion)
\[AIC = n \log_e(SSE_p / n) + 2p\]
\[BIC = n \log_e(SSE_p / n) + p\log_e(n)\]
Smaller is better for both criteria
Models with AIC difference \(\le 2\) should be treated as equally adequate
Similarly, models with BIC difference \(\le 2\) should be treated as equally adequate
BIC penalty for larger models is more severe
\(p\log_e(n) > 2p\) (whenever \(n \ge 8\), since \(e^2 \approx 7.39\))
Controls tendency of “overfitting” from AIC
BIC (Bayesian Information Criterion)
Mallows’ \(C_p\) (has fallen out of favor)
Any two models with the same response (but possibly differently transformed)
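The two criteria can be computed straight from the formulas. This sketch uses R's built-in `mtcars` data purely as an example and assumes that \(p\) counts all regression coefficients including the intercept; the helper `ic()` is hypothetical.

```r
# Criteria from the formulas: n * log(SSE_p / n) + penalty
ic <- function(fit) {
  n   <- length(resid(fit))
  p   <- length(coef(fit))   # assumption: p includes the intercept
  sse <- sum(resid(fit)^2)
  c(AIC = n * log(sse / n) + 2 * p,
    BIC = n * log(sse / n) + p * log(n))
}

fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
rbind(fit1 = ic(fit1), fit2 = ic(fit2))   # smaller is better
```

R's own `AIC()` and `BIC()` include additive terms that do not depend on the model, so their values differ from these, but differences between models fit to the same response agree.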
Thinking through the problem carefully is always a strategy that should be employed!!!
But sometimes we may want to use one of these objective approaches:
Forward Selection
Backward Elimination
Stepwise Selection (a little of both)
Begin with an “empty” model (no predictors)
Begin with the “full” model (all predictors)
Remove the predictor with the smallest \(|t|\)-statistic (largest p-value)
Stepwise selection combines forward and backward steps,
usually beginning from the empty model (forward stepwise).
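The three procedures can be sketched with R's built-in `step()`. Note that `step()` adds or drops predictors based on AIC rather than on t-statistics or p-values, so it is a swapped-in criterion relative to the description above; the `mtcars` predictors are chosen only for illustration.

```r
full  <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
empty <- lm(mpg ~ 1, data = mtcars)

# Backward elimination from the full model
back <- step(full, direction = "backward", trace = 0)

# Forward selection from the empty model, considering the predictors in `full`
fwd <- step(empty, scope = formula(full), direction = "forward", trace = 0)

# Stepwise: forward steps with the option to drop predictors again
both <- step(empty, scope = formula(full), direction = "both", trace = 0)

formula(back)
formula(fwd)
formula(both)
```

The three procedures need not end at the same model, which is one reason to treat their output as a suggestion rather than a final answer.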
A logistic regression model describes the log odds of a probability (that a binary variable \(Y\) takes on the value \(1\)) as a linear combination of some other variables (the \(X\)’s: covariates or predictors)
Because the error structure is different (Bernoulli rather than the normal errors assumed in least squares), the model is estimated by maximum likelihood instead of least squares
Here is the logistic regression model for the probability of an event as a function of a binary predictor variable x: \[ \log \left\{ \frac{\pi}{1-\pi} \right\} = \beta_0 + \beta_1 x \]
When \(x=0\), \[ \log \left( \frac{\pi}{1-\pi} \right) \;=\; \beta_0 + \beta_1 (0) = \beta_0 \;=\; \log \left( \frac{\pi_0}{1-\pi_0} \right) \]
When \(x=1\), \[ \log \left( \frac{\pi}{1-\pi} \right) \;=\; \beta_0 + \beta_1 (1) = \beta_0 + \beta_1 \;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) \]
Then, \[ \begin{align} \log(OR) &\;=\; \log \left( \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \right)\\ &\;=\; \log \left( \frac{\pi_1}{1-\pi_1} \right) - \log \left( \frac{\pi_0}{1-\pi_0} \right) \\ &\;=\; (\beta_0 + \beta_1) - \beta_0\\ &\;=\; \beta_1 \end{align} \]
Of course, we can have continuous predictor variables, or multiple predictors, as well.
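The identity \(\beta_1 = \log(OR)\) for a binary predictor can be checked numerically. The counts below (20 of 100 events when \(x = 0\), 40 of 100 when \(x = 1\)) are made up for illustration.

```r
# Hypothetical 2x2 data: 20 of 100 events when x = 0, 40 of 100 when x = 1
x <- rep(c(0, 1), each = 100)
y <- c(rep(c(1, 0), times = c(20, 80)), rep(c(1, 0), times = c(40, 60)))

fit <- glm(y ~ x, family = binomial)   # logistic regression by maximum likelihood

# Odds ratio computed directly from the sample proportions
p0 <- mean(y[x == 0])
p1 <- mean(y[x == 1])
or_direct <- (p1 / (1 - p1)) / (p0 / (1 - p0))

c(exp_beta1 = unname(exp(coef(fit)["x"])), or_direct = or_direct)
```

Both numbers equal \((0.4/0.6)/(0.2/0.8) = 8/3 \approx 2.67\), up to the numerical tolerance of the fitting algorithm.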
To prevent overfitting, to find simple models, to deal with multicollinearity, we have two additional techniques: LASSO and ridge regression
Ridge regression: minimize \[\sum_i (y_i-\hat y_i)^2 + \lambda \sum_{j=1}^p \hat \beta_j^2\]
Optimize using maximum likelihood estimation.
Find \(\lambda\) through a cross-validation procedure.
LASSO: minimize \[\sum_i (y_i-\hat y_i)^2 + \lambda \sum_{j=1}^p |\hat\beta_j|\]
Optimize using maximum likelihood estimation.
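For the squared penalty the minimizer has a closed form, which makes the shrinkage easy to see. The sketch below is a minimal base-R illustration, not a full implementation: it assumes standardized predictors and a centered response so that no intercept is penalized, fixes \(\lambda = 10\) arbitrarily instead of choosing it by cross-validation, and uses a hypothetical helper `ridge()` on the built-in `mtcars` data.

```r
# Standardize predictors and center the response so the intercept is not penalized
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

ridge <- function(X, y, lambda) {
  # Minimizer of sum((y - X b)^2) + lambda * sum(b^2)
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

b_ols   <- ridge(X, y, lambda = 0)    # lambda = 0 reproduces least squares
b_ridge <- ridge(X, y, lambda = 10)   # lambda = 10 is an arbitrary illustrative value
cbind(b_ols, b_ridge)
```

The ridge coefficients have a smaller sum of squares than the least squares ones. The absolute-value penalty has no closed form in general and is fit iteratively, and it can set some coefficients exactly to zero.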