We use transformations to achieve linearity and constant variance.
There are a few that we use very frequently:
Consider the context of the data
Is exponential growth/decay expected?
$Y = \beta_0 e^{\beta_1 x} e^{\epsilon}$
Non-linear model and non-constant spread (multiplicative errors)
Suggested transformation: log(Y)
$\log(Y) = \beta_0 + \beta_1 x + \epsilon$
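In R, this transformed model is fit by logging the response; a minimal sketch, assuming a data frame `dat` with columns `y` (positive) and `x`:

```r
fit_log <- lm(log(y) ~ x, data = dat)   # fit log(Y) = b0 + b1*x + error
summary(fit_log)
```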
Consider the context of the data
Is multiplicative growth/shrinkage expected?
$Y = \beta_0 x^{\beta_1} e^{\epsilon}$
Non-linear model and non-constant spread (multiplicative errors)
Suggested transformation: log(Y) and log(x)
$\log(Y) = \beta_0 + \beta_1 \log(x) + \epsilon$
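The log-log model is fit the same way; a minimal sketch, again assuming a data frame `dat` with positive `y` and `x`:

```r
fit_loglog <- lm(log(y) ~ log(x), data = dat)   # fit log(Y) = b0 + b1*log(x) + error
summary(fit_loglog)
```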
Consider the context of the data
Is the response a proportion (binomial model)?
$E(Y|x) = \pi(x)$ = some function of $x$ with range between 0 and 1
$\text{var}(Y|x) = \dfrac{\pi(x)[1-\pi(x)]}{n}$ …which depends on $x$ and the mean $E(Y|x)$
restricted, non-normal response, non-constant spread (varies with the mean)
Suggested transformation: $\log\left(\dfrac{Y}{1-Y}\right)$
Or, use logistic regression
Consider the context of the data
Is the response a count from a fairly rare event (Poisson model)?
$E(Y|x) = \lambda(x)$ = some function of $x$ ($\lambda(x) > 0$)
$\text{var}(Y|x) = \lambda(x)$ …which depends on $x$ and the mean $E(Y|x)$
right-skewed response and non-constant spread (increases as the mean increases)
Suggested model: $\sqrt{Y} = \beta_0 + \beta_1 x + \epsilon$
Or, use Poisson regression
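For the count case, either approach is a one-liner in R; a minimal sketch, assuming a data frame `dat` with a count response `y` and predictor `x`:

```r
fit_sqrt <- lm(sqrt(y) ~ x, data = dat)                # square-root transformation of the count response
fit_pois <- glm(y ~ x, data = dat, family = poisson)   # or Poisson regression (log link)
```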
| Power | $-2$ | $-1$ | $-1/2$ | $\log$ | $1/2$ | $1$ | $2$ |
|---|---|---|---|---|---|---|---|
| $Y$ | $1/Y^2$ | $1/Y$ | $1/\sqrt{Y}$ | $\log(Y)$ | $\sqrt{Y}$ | $Y$ | $Y^2$ |
| $x$ | $1/x^2$ | $1/x$ | $1/\sqrt{x}$ | $\log(x)$ | $\sqrt{x}$ | $x$ | $x^2$ |
The “Box-Cox transformation” algorithm
tries to find an “optimal” transformation of $Y$ of the form $Y^\lambda$.
This method does not suggest transformations for the predictors, only the response.
If the method suggests $\lambda = -1.78$, either use $Y^{-1.78}$
or a close but easier-to-interpret transformation, like $Y^{-2} = \dfrac{1}{Y^2}$.
If $\lambda = 0$ is suggested by the Box-Cox method,
then use the $\log(Y)$ transformation.
```r
library(MASS)   # boxcox() is in the MASS package
boxcox(lm(y ~ x))                             # profile the log-likelihood over the default lambda grid
boxcox(lm(y ~ x), lambda = seq(0, 1, 0.1))    # or restrict the search to a narrower lambda grid
```
$Y' = \dfrac{Y}{X}, \quad X' = \dfrac{1}{X}, \quad \beta_0' = \beta_1, \quad \beta_1' = \beta_0, \quad \epsilon' = \dfrac{\epsilon}{X}$
$y_i' = \beta_0' + \beta_1' x_i' + \epsilon_i'$
$\sum w_i (y_i - \hat{y}_i)^2 = \sum w_i e_i^2, \quad \text{where } w_i \propto \dfrac{1}{\text{var}(\epsilon_i)}$
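In R, weighted least squares is available through the `weights` argument of `lm()`; a minimal sketch, assuming vectors `y` and `x` and (hypothetically) an error standard deviation proportional to $x$, so that $w_i = 1/x_i^2$:

```r
w     <- 1 / x^2                  # hypothetical weights: error sd assumed proportional to x
fit_w <- lm(y ~ x, weights = w)   # weighted least squares minimizes sum(w * residual^2)
summary(fit_w)
```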
Regression assumption:
error terms ϵi and ϵj for observations i and j are uncorrelated.
Autocorrelation can affect OLS analysis in several ways:
Underestimation of variability, which leads to more false positive decisions:
rejecting the null hypothesis $\beta = 0$ when one should not
Whether the correlation is positive or negative,
confidence intervals for the $\beta$s may no longer be valid.
```r
library(tseries)   # runs.test() is assumed here to come from the tseries package
runs.test(as.factor(rstandard(lmfit) > 0))   # runs test on the signs of the standardized residuals
```
If the errors are correlated, perhaps each error ($\epsilon_t$) depends linearly on the previous error ($\epsilon_{t-1}$).
If this model is reasonable, then $\epsilon_t = \gamma_0 + \gamma_1 \epsilon_{t-1} + \omega_t = \rho\,\epsilon_{t-1} + \omega_t$, with $|\rho| < 1$, where $\rho$ is the correlation coefficient between successive errors,
and the $\omega_t$ are independent and normally distributed, with constant variance across the errors ($\epsilon$s).
The DW statistic (often labeled $d$) is defined as: $d = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{SSE}$, where $e_t$ is the $t$th observed residual with the data arranged in time order.
When $\rho = 0$ (no autocorrelation), $d \approx 2$.
When $\rho = 1$ (perfect positive correlation), $d \approx 0$.
When $\rho = -1$ (perfect negative correlation), $d \approx 4$.
The range of $d$ is approximately $0 \le d \le 4$.
So, when $d$ is close to 2, there is not much autocorrelation.
When $d$ is small, it suggests evidence of positive autocorrelation ($\rho > 0$).
When $(4 - d)$ is small, it suggests evidence of negative autocorrelation ($\rho < 0$).
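In R, the Durbin-Watson test can be run with the `dwtest()` function from the `lmtest` package; a minimal sketch, assuming `lmfit` is an `lm()` object fit to data arranged in time order:

```r
library(lmtest)   # dwtest() computes the Durbin-Watson statistic d and a p-value
dwtest(lmfit)     # H0: no first-order autocorrelation in the errors
```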
The Cochrane-Orcutt transformation (textbook Section 8.4) is one method.
To see how it works, we can write the errors for two adjacent observations:
$\epsilon_t = y_t - \beta_0 - \beta_1 x_t \qquad \epsilon_{t-1} = y_{t-1} - \beta_0 - \beta_1 x_{t-1}$
Since $\epsilon_t = \rho\,\epsilon_{t-1} + \omega_t$, subtracting $\rho$ times the second equation from the first gives
$(y_t - \rho y_{t-1}) = \beta_0(1-\rho) + \beta_1(x_t - \rho x_{t-1}) + \omega_t.$
This is equivalent to the linear model $y_t^\star = \beta_0^\star + \beta_1^\star x_t^\star + \omega_t$
with
$y_t^\star = (y_t - \rho y_{t-1})$,
$x_t^\star = (x_t - \rho x_{t-1})$,
$\beta_0^\star = \beta_0(1-\rho)$, and
$\beta_1^\star = \beta_1$
This linear model is then fit to the data.
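As a rough illustration of a single Cochrane-Orcutt step (a sketch only, assuming time-ordered vectors `y` and `x` and a fitted simple regression `lmfit`; the `orcutt` package, if installed, automates the iterated version):

```r
e     <- resid(lmfit)
rho   <- sum(e[-1] * e[-length(e)]) / sum(e^2)   # estimate rho from lag-1 residuals
ystar <- y[-1] - rho * y[-length(y)]             # y*_t = y_t - rho * y_{t-1}
xstar <- x[-1] - rho * x[-length(x)]             # x*_t = x_t - rho * x_{t-1}
fit_star <- lm(ystar ~ xstar)                    # slope estimates beta1; intercept estimates beta0 * (1 - rho)
```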
Collinearity (also called multicollinearity) refers to the situation where
some or all predictors in the model are substantially correlated with each other
Linear relationships among predictors might not be only pairwise.
Perhaps one of the predictors is linearly related to some subset of the other predictors.
We lose the simple interpretation of a regression coefficient
“while all other predictors are held constant”
Look for a significant overall (model) F-statistic ($H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$),
but ALL individual t-tests nonsignificant.
However, a significant F-statistic ($H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$) and SOME nonsignificant t-tests for individual $\beta$s
does NOT necessarily imply multicollinearity.
There simply could be non-important variables in the model.
Look for a $\beta$ coefficient for a variable having the opposite sign
from what you (and everyone else) would expect.
(This may also be a result of omitting confounders: November 29 lecture.)
pairwise linear relationships (correlation, scatterplots)
Not every multicollinearity situation can be detected by pairwise relationships (correlations, scatterplots).
Other measures include the partial $R^2$ and its companion:
the variance inflation factor (VIF)
Let $\text{VIF}_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the $R^2$ from regressing $x_j$ on the other predictors.
We call $\text{VIF}_j$ the variance inflation factor due to $x_j$.
When the partial $R_j^2$ is large, $\text{VIF}_j$ is large…
…and if $R_j^2 = 0$ (orthogonal predictors), $\text{VIF}_j = 1$.
Rough guide: $\text{VIF}_j > 10$ indicates moderate multicollinearity is present.
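In R, VIFs can be computed with the `vif()` function from the `car` package; a minimal sketch with a hypothetical three-predictor model and data frame `dat`:

```r
library(car)                              # vif() is in the car package
fit <- lm(y ~ x1 + x2 + x3, data = dat)   # hypothetical multiple regression
vif(fit)                                  # one VIF per predictor; compare with the rough guide above
```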
The bad news: We cannot “fix” or eliminate multicollinearity for a given model.
For example, no transformation will help.
Some ideas…
Variable selection is the process of choosing a “best” subset of all available predictors.
Well, there is no single “best” subset.
We do want a model we can interpret or justify with respect to the questions of interest.
Nested models
Any two models with the same response
MSE (Mean Squared Error)
AIC (Akaike Information Criterion)
$AIC = n\log_e(SSE_p/n) + 2p$
$BIC = n\log_e(SSE_p/n) + p\log_e(n)$
Smaller is better for both criteria
Models with AIC difference ≤2 should be treated as equally adequate
Similarly, models with BIC difference ≤2 should be treated as equally adequate
BIC penalty for larger models is more severe
$p\log_e(n) > 2p$ (whenever $n > 8$)
This controls AIC’s tendency toward “overfitting”
BIC (Bayesian Information Criterion)
Mallows’ $C_p$ (has fallen out of favor)
Any two models with the same response (but possibly differently transformed)
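For example (a sketch, with hypothetical nested models for the same response in a data frame `dat`):

```r
fit1 <- lm(y ~ x1, data = dat)        # smaller model
fit2 <- lm(y ~ x1 + x2, data = dat)   # larger model, same response
AIC(fit1, fit2)                       # smaller is better; a difference of 2 or less is roughly a tie
BIC(fit1, fit2)                       # BIC penalizes the extra parameter more heavily
```

R’s built-in `AIC()` and `BIC()` are computed from the log-likelihood, so they differ from the $SSE$-based formulas above by an additive constant for models fit to the same data; the comparisons lead to the same conclusions.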
Thinking through the problem carefully is always a strategy that should be employed!!!
But sometimes we may want to use one of these objective approaches:
Forward Selection
Backward Elimination
Stepwise Selection (a little of both)
Begin with an “empty” model (no predictors)
Begin with the “full” model (all predictors)
Remove the predictor with the smallest t-statistic (largest p-value)
It combines forward and backward steps,
usually beginning from the empty model (forward stepwise).
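In R, these procedures can be automated with the built-in `step()` function; note that `step()` chooses variables by AIC rather than by the individual t-statistics described above. A minimal sketch, with hypothetical full and empty models:

```r
full  <- lm(y ~ x1 + x2 + x3, data = dat)   # hypothetical full model
empty <- lm(y ~ 1, data = dat)              # intercept-only ("empty") model
step(empty, scope = formula(full), direction = "forward")   # forward selection
step(full, direction = "backward")                          # backward elimination
step(empty, scope = formula(full), direction = "both")      # stepwise selection
```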
A logistic regression model describes the log odds of a probability (that a binary variable $Y$ takes the value 1) as a linear combination of other variables (the $X$s: covariates or predictors).
Because the error structure is different (Bernoulli rather than assumed normal), a different model estimation method from least squares is used: maximum likelihood estimation.
Here is the logistic regression model for the probability of an event as a function of a binary predictor variable $x$: $\log\left\{\dfrac{\pi}{1-\pi}\right\} = \beta_0 + \beta_1 x$
When $x = 0$, $\log\left(\dfrac{\pi}{1-\pi}\right) = \beta_0 + \beta_1(0) = \beta_0 = \log\left(\dfrac{\pi_0}{1-\pi_0}\right)$
When $x = 1$, $\log\left(\dfrac{\pi}{1-\pi}\right) = \beta_0 + \beta_1(1) = \beta_0 + \beta_1 = \log\left(\dfrac{\pi_1}{1-\pi_1}\right)$
Then, $\log(OR) = \log\left(\dfrac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)}\right) = \log\left(\dfrac{\pi_1}{1-\pi_1}\right) - \log\left(\dfrac{\pi_0}{1-\pi_0}\right) = (\beta_0+\beta_1) - \beta_0 = \beta_1$
Of course, we can have continuous predictor variables, or multiple predictors, as well.
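In R, a logistic regression is fit with `glm()`; a minimal sketch, assuming a data frame `dat` with a 0/1 response `y` and a predictor `x`:

```r
fit_logit <- glm(y ~ x, data = dat, family = binomial)   # maximum likelihood estimation
summary(fit_logit)                                       # beta1 is the log odds ratio
exp(coef(fit_logit))                                     # exponentiate to get odds ratios
```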
To prevent overfitting, to find simpler models, and to deal with multicollinearity, we have two additional techniques: LASSO and ridge regression.
Ridge regression: minimize $\displaystyle\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \hat{\beta}_j^2$
Optimize using maximum likelihood estimation.
Find λ through a cross-validation procedure.
LASSO: minimize $\displaystyle\sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\hat{\beta}_j|$
Optimize using maximum likelihood estimation.
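Both penalties are implemented in the `glmnet` package; a minimal sketch, assuming a numeric predictor matrix `X` and a response vector `y` (all object names here are hypothetical):

```r
library(glmnet)
cv_ridge <- cv.glmnet(X, y, alpha = 0)   # alpha = 0 gives the ridge (squared) penalty
cv_lasso <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 gives the LASSO (absolute value) penalty
coef(cv_lasso, s = "lambda.min")         # coefficients at the lambda chosen by cross-validation
```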