So far: the basics of linear regression.
Now: we explore how to critique our previous work.
Examine whether the data are consistent with the linear regression model specification.
Use mostly graphical methods.
If the regression model specifications (assumptions) are not true…
then all inference based on the fitted regression model is called into question.
For example, if the errors (residuals) are not even reasonably close to normally distributed,
then the LS estimators do not have the expected sampling distribution.
So, t-tests, F-tests, and confidence intervals may be incorrect
…unless the sample size is large (for t-tests and F-tests).
Some small deviations from the model specifications are not serious,
but large deviations could invalidate our results.
We should always plot the data before we get started.
For example, a scatterplot (or scatterplot matrix).
Let’s ignore that advice for the moment and see what kind of trouble that can lead to.
The regression results are identical for all four datasets
But the relationships of the responses to the same predictors are very different!
Can only discover this by graphing the data.
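Before graphing, a quick numeric check makes the point concrete (a sketch, assuming lmfit1 through lmfit4 are the simple regressions of Y1 on X1, …, Y4 on X4 fit earlier on AnscombeData):
# The four fits have (nearly) identical coefficients and R-squared
sapply(list(lmfit1, lmfit2, lmfit3, lmfit4), coef)
sapply(list(lmfit1, lmfit2, lmfit3, lmfit4), function(fit) summary(fit)$r.squared)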
p1Anscombe <- gf_point(Y1 ~ X1, data=AnscombeData) %>%
gf_coefline(model = lmfit1)
p2Anscombe <- gf_point(Y2 ~ X2, data=AnscombeData) %>%
gf_coefline(model = lmfit2)
p3Anscombe <- gf_point(Y3 ~ X3, data=AnscombeData) %>%
gf_coefline(model = lmfit3)
p4Anscombe <- gf_point(Y4 ~ X4, data=AnscombeData) %>%
gf_coefline(model = lmfit4)
grid.arrange(p1Anscombe, p2Anscombe, p3Anscombe, p4Anscombe, ncol=2)
Linearity assumption… \(y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i\)
Check using “residual plots”
Fact: No matter what pattern you might see, it is algebraically true that the observed residuals are uncorrelated with the fitted \(\widehat{y}\) values.
That is \(\mathrm{corr}(\widehat{y}, e) = 0\).
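This is easy to confirm numerically for any fitted model (a sketch using lmfit1 from the Anscombe fits above):
# Observed residuals are uncorrelated with fitted values (zero up to rounding)
cor(fitted(lmfit1), residuals(lmfit1))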
Look at the distribution of the standardized residuals to see whether it is consistent with data from a standard normal distribution.
No need to check this assumption.
In a previous lecture, we showed that for data, \(\sum_i e_i \;=\; \sum_i (y_i - \widehat{y}_i) \;=\; 0\)
That is, the observed residuals always have average zero.
If violated, we have a “heteroscedasticity” problem
If violated, and if the dependence is related to the order in which the data were collected,
then we have an “autocorrelation” problem
To check: plot the residuals vs. the order in which the data were collected.
What if you don’t know the order in which the data were collected?
Then consider how the data were collected and whether that might lead to dependence among observations.
Rarely true, but we rarely have the information needed to check.
Probably not true, but it is doubtful we would ever have enough information to assess it.
If violated, we have collinearity (or multi-collinearity)
To check: scatter plot of predictors against one another.
More on the collinearity problem later on in the course.
All have roughly equal role in determining results and influencing conclusions.
To check:
AnscombeDataResids <- AnscombeData %>%
mutate(e1 = residuals(lmfit1), yhat1 = fitted(lmfit1),
e2 = residuals(lmfit2), yhat2 = fitted(lmfit2),
e3 = residuals(lmfit3), yhat3 = fitted(lmfit3),
e4 = residuals(lmfit4), yhat4 = fitted(lmfit4)
)
p5Anscombe <- gf_point(e1 ~ yhat1, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y1 on X1)") %>%
gf_hline(yintercept = 0, linetype=2)
p6Anscombe <- gf_point(e2 ~ yhat2, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y2 on X2)") %>%
gf_hline(yintercept = 0, linetype=2)
p7Anscombe <- gf_point(e3 ~ yhat3, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y3 on X3)") %>%
gf_hline(yintercept = 0, linetype=2)
p8Anscombe <- gf_point(e4 ~ yhat4, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y4 on X4)") %>%
gf_hline(yintercept = 0, linetype=2)
grid.arrange(p1Anscombe, p5Anscombe, p2Anscombe, p6Anscombe, ncol=2)
grid.arrange(p3Anscombe, p7Anscombe, p4Anscombe, p8Anscombe, ncol=2)
variable | description |
---|---|
state | State (excluding AK and HI) |
pop | Population in each state |
fuelc | Fuel consumption, millions of gallons |
fuel | Motor fuel consumption, gal/person |
nlic | Licensed drivers in 1971, thousands |
dlic | Percent of population with driver’s license |
tax | 1972 motor fuel tax rate, cents/gal |
inc | 1972 per capita income, thousands of dollars |
road | 1971 federal-aid highways, thousands of miles |
fuel = f(dlic, tax, inc, road)
We are hoping that residuals have similar spread across the fitted values
fuelData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/fuel.txt")
fuelData <- fuelData %>%
dplyr::select(state, fuel, dlic, tax, inc, road) %>%
arrange(state)
n <- dim(fuelData)[1]
p <- 4
First, fit the LS regression model and save it:
lmfit <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
Then, add the residuals and fitted values to the dataset:
fuelData <- mutate(fuelData, e = residuals(lmfit), yhat = fitted(lmfit))
We can now make plots of predictor variables and/or fitted values vs the residuals:
lmfit <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
fuelData <- fuelData %>%
mutate(e = residuals(lmfit), yhat = fitted (lmfit))
gf_point(e ~ yhat, data=fuelData) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals") %>%
gf_hline(yintercept = 0, linetype = "dashed")
gf_point(e ~ dlic, data=fuelData) %>%
gf_labs(x = "dlic", y = "residuals") %>%
gf_hline(yintercept = 0, linetype = "dashed")
A faster, but slightly ugly, way to make the residuals vs. fitted values plot (and several of the other diagnostic plots that we use here):
plot(lmfit)
CAUTION: Least squares (observed) residuals do NOT all have same variance
Model errors: \(\mathrm{var}(\epsilon_i) = \sigma^2\)
Observed residuals: \(\mathrm{var}(e_i) = \sigma^2 (1 - h_{ii})\)
\((1/n) \le h_{ii} \le 1\)
Large \(h_{ii}\) indicates case \(i\) is far from the center of x-space.
So, the residuals \(e_i\) vary less for cases far from the center of x-space.
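A small simulation illustrates this (a sketch; all object names here are invented for the demonstration): residuals at the x value far from the center vary noticeably less than \(\sigma\).
# Simulate many datasets from a known line; compare sd of each residual to sigma*sqrt(1 - h)
set.seed(1)
x.sim <- c(1:10, 25)                                     # last x is far from the center
X.sim <- cbind(1, x.sim)
h.sim <- diag(X.sim %*% solve(t(X.sim) %*% X.sim) %*% t(X.sim))
sigma.true <- 2
res.sim <- replicate(2000,
  residuals(lm(y ~ x.sim,
               data = data.frame(x.sim = x.sim,
                                 y = 5 + 3 * x.sim + rnorm(11, 0, sigma.true)))))
cbind(h = round(h.sim, 3),
      sd.observed = round(apply(res.sim, 1, sd), 3),
      sd.theory = round(sigma.true * sqrt(1 - h.sim), 3))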
Applet: Distribution of slope and intercept
So, we will standardize \(e_i\)’s to solve this problem.
\[z_i = \frac{e_i - \mathrm{E}(e_i)}{\mathrm{sd}(e_i)} \;=\; \frac{e_i - 0}{\sigma\sqrt{1 - h_{ii}}} \;=\; \frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\]
Properties: \(\mathrm{E}(z_i)=0\) and \(\mathrm{sd}(z_i)=1\), but for observed data, \(\sum z_i\) is not necessarily zero.
Well, you can’t compute \(z_i\) with any software, because we don’t know \(\sigma\).
However, we can estimate \(\sigma:\;\; \widehat{\sigma} = \displaystyle{ \sqrt{\frac{SSE}{n - p -1}}}\)
\[r_i \;=\; \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}\]
The observable standardized residuals are technically
called “internally studentized” residuals.
The textbook and R/Stata refer to the “internally studentized residuals” as “standardized” residuals. We will do the same.
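As a sanity check, the standardized residuals can be computed directly from this formula and compared with R’s rstandard() (a sketch using the fuel model lmfit fit earlier):
# Standardized (internally studentized) residuals by hand
sigma.hat <- summary(lmfit)$sigma               # estimate of sigma
hii <- hatvalues(lmfit)                         # leverages h_ii
r.byhand <- residuals(lmfit) / (sigma.hat * sqrt(1 - hii))
all.equal(r.byhand, rstandard(lmfit))           # TRUE (up to rounding)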
Let’s add the standardized residuals to the fuel consumption dataset:
fuelData <- mutate(fuelData, r = rstandard(lmfit))
Standardized residual for Wyoming:
# Value of variables for Wyoming
fuelData %>%
filter(state == "WY")
state | fuel | dlic | tax | inc | road | e | yhat | r |
---|---|---|---|---|---|---|---|---|
WY | 968 | 67.2 | 7 | 4.345 | 3.905 | 234.9 | 733.1 | 3.735 |
Here is a summary of the variables for the other states:
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
variable | min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|---|
fuel | 344.000 | 509.000 | 566.0000 | 631.5000 | 865.000 | 568.4468 | 96.9143 | 47 |
dlic | 45.100 | 52.950 | 56.3000 | 58.9500 | 72.400 | 56.8170 | 5.3985 | 47 |
tax | 5.000 | 7.000 | 7.5000 | 8.2500 | 10.000 | 7.6826 | 0.9559 | 47 |
inc | 3.063 | 3.733 | 4.2960 | 4.5835 | 5.342 | 4.2396 | 0.5796 | 47 |
road | 0.431 | 2.946 | 4.7460 | 7.3820 | 17.782 | 5.6007 | 3.5206 | 47 |
e | -122.029 | -45.709 | -11.6488 | 29.7206 | 117.597 | -4.9989 | 53.7052 | 47 |
yhat | 318.733 | 513.353 | 569.4385 | 639.9999 | 772.968 | 573.4457 | 90.2138 | 47 |
r | -1.932 | -0.723 | -0.1818 | 0.4864 | 1.843 | -0.0743 | 0.8559 | 47 |
The Wyoming standardized residual is almost twice the size of the largest magnitude standardized residual from among the other states.
standardized residuals have mean \(\mathrm{E}(r_i) = 0\) and \(\mathrm{var}(r_i) = 1\),
but for observed data, \(\sum r_i\) is not necessarily zero.
standardized residuals “are not strictly independent, but with a large
number of observations, the lack of independence may be ignored.”
So, if the errors \(\epsilon\) are \(N(0,\sigma^2)\),
then the \(r_i\) are approximately \(N(0,1)\) when sample size (\(n\)) is large.
Thus, we take a careful look at standardized residuals \(|\; r_i\; | \ge 2\) (outliers).
For simple linear regression (one predictor), \[h_{ii} = \frac{1}{n} + \frac{(x_i -\overline{x})^2}{\sum(x_i - \overline{x})^2}\]
\((1/n) \le h_{ii} \le 1\)
Recall that for linear regression \(\widehat{Y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} X_{1i} + \widehat{\beta}_{2} X_{2i} + \cdots + \widehat{\beta}_pX_{pi}\),
and we obtain the \(\widehat{\beta}\)’s by least squares estimation:
\[\begin{align} \min \mathrm{SSE}&\;=\; \min \sum_{i=1}^n e_i^2 \;=\; \min \sum(y_i-\widehat{y}_i)^2 \\ \\ &\;=\; \min \sum(y_i- \widehat{\beta}_{0} - \widehat{\beta}_{1} X_{1i} - \widehat{\beta}_{2} X_{2i} - \cdots - \widehat{\beta}_pX_{pi})^2 \end{align}\]
Using matrix algebra notation, this can be re-written as \[\min \mathrm{SSE}(\widehat{\boldsymbol{\beta}}) \;=\; \min (\mathbf{Y}-\mathbf{X}\widehat{\boldsymbol{\beta}})^T(\mathbf{Y}-\mathbf{X}\widehat{\boldsymbol{\beta}}). \] By taking the first derivative of the SSE w.r.t. \(\widehat{\boldsymbol{\beta}}\), setting equal to zero and solving,
we can obtain \[\widehat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\]
This is a compact way of writing (and programming) least squares solutions for multiple regression with any number of predictors.
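A quick check that this matrix formula reproduces lm()’s estimates (a sketch using the fuel model lmfit):
# beta-hat = (X'X)^{-1} X'y matches coef() from lm()
X <- model.matrix(lmfit)                                  # design matrix (intercept + predictors)
b <- solve(t(X) %*% X) %*% t(X) %*% fuelData$fuel
all.equal(unname(as.vector(b)), unname(coef(lmfit)))      # TRUE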
This form of writing the solution illustrates that \(\widehat{\boldsymbol{\beta}}\) is in fact a linear function of \(Y\)s. Thus the fitted value
\[ \widehat{\mathbf{Y}} \;=\; \mathbf{X}\widehat{\boldsymbol{\beta}} \;=\; \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}= \mathbf{H}\mathbf{Y},\]
where \(\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).
\(\mathbf{H}\) is sometimes called the “hat matrix”
Note that \(\mathbf{H}\) only depends on the predictors \(\mathbf{X}\), not the responses \(\mathbf{Y}\).
The entries in the hat matrix are labeled \(h_{ij}\),
\(i = 1, 2, \ldots, n\), \(\quad j = 1, 2, \ldots, n\).
\(\mathbf{H}\) transforms \(y\) values into their corresponding \(\widehat{y}\) values.
\[\widehat{y}_i \;=\; h_{i1}y_1 + h_{i2}y_2 + \ldots + h_{in}y_n, \]
NOTE: The textbook refers to \(\mathbf{H}\) as a projection matrix (\(\mathbf{P}\))
and labels the matrix entries as \(p_{ij}\) instead of \(h_{ij}\).
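Similarly, the hat matrix can be built directly and checked against R’s hatvalues() and fitted() (a sketch, continuing with the fuel model lmfit):
# H = X (X'X)^{-1} X'; its diagonal gives the leverages, and H y gives yhat
X <- model.matrix(lmfit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(lmfit)))                      # TRUE
all.equal(unname(as.vector(H %*% fuelData$fuel)), unname(fitted(lmfit)))  # TRUE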
END OF MATRIX ALGEBRA DETOUR
We can write \(\widehat{y}_i \;=\; h_{i1}y_1 + h_{i2}y_2 + \ldots + h_{in}y_n,\)
\(h_{ij}\) is the “contribution” of \(y_j\) to \(\widehat{y}_i\)
That is, \(h_{ij}\) is the “weight” given to \(y_j\) in predicting \(\widehat{y}_i\).
For simple linear regression (one predictor): \[h_{ij} = \frac{1}{n} + \frac{(x_i -\overline{x})(x_j-\overline{x})}{\sum(x_i - \overline{x})^2}\]
So, if \(x_j\) is far from \(\overline{x}\) in the same direction as \(x_i\),
then \(y_j\) contributes a lot to the estimate \(\widehat{y}_i\).
We call \(h_{ii}\) the leverage of the ith observation
since \(h_{ii}\) is the weight given to \(y_i\) in predicting \(\widehat{y}_i\).
For simple linear regression (one predictor), \[h_{ii} = \frac{1}{n} + \frac{(x_i -\overline{x})^2}{\sum(x_i - \overline{x})^2}\]
So, if \(x_i\) is far from \(\overline{x}\), then \(y_i\) contributes a lot to the estimate \(\widehat{y}_i\).
The last item follows from the fact that \(\displaystyle{h_{ii} + \frac{e_i^2}{\mathrm{SSE}} \le 1}\)
If \(h_{ii}\) is large (close to 1), then \(y_i\) has a lot of “leverage” on \(\widehat{y}_i\),
the residual \(e_i\) is small,
and the prediction \(\widehat{y}_i\) depends relatively less on the other observations (\(y_j\)’s)
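The simple-regression leverage formula above is easy to verify numerically (a sketch, assuming lmfit1 is the Anscombe regression of Y1 on X1):
# h_ii = 1/n + (x_i - xbar)^2 / sum((x - xbar)^2) matches hatvalues()
x1 <- AnscombeData$X1
h.byhand <- 1/length(x1) + (x1 - mean(x1))^2 / sum((x1 - mean(x1))^2)
all.equal(unname(h.byhand), unname(hatvalues(lmfit1)))   # TRUE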
Graphical methods of leverage detection are preferred.
However, from item 2 above, the average of the \(h_{ii}\)’s is \((p+1) / n\),
so a rough screening for “high leverage” observations is \(h_{ii} \;>\; 2(p+1) / n \;\;=\;\; 2\;\times\) (average of the \(h_{ii}\)’s)
Add the “hat values” (\(h_{ii}\)) to the fuel consumption dataset:
fuelData <- mutate(fuelData, h = hatvalues(lmfit))
Hat value for Wyoming:
# Value of variables for Wyoming
fuelData %>%
filter(state == "WY")
state | fuel | dlic | tax | inc | road | e | yhat | r | h |
---|---|---|---|---|---|---|---|---|---|
WY | 968 | 67.2 | 7 | 4.345 | 3.905 | 234.9 | 733.1 | 3.735 | 0.0997 |
Here is a summary of the other states:
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
variable | min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|---|
fuel | 344.0000 | 509.0000 | 566.0000 | 631.5000 | 865.0000 | 568.4468 | 96.9143 | 47 |
dlic | 45.1000 | 52.9500 | 56.3000 | 58.9500 | 72.4000 | 56.8170 | 5.3985 | 47 |
tax | 5.0000 | 7.0000 | 7.5000 | 8.2500 | 10.0000 | 7.6826 | 0.9559 | 47 |
inc | 3.0630 | 3.7330 | 4.2960 | 4.5835 | 5.3420 | 4.2396 | 0.5796 | 47 |
road | 0.4310 | 2.9465 | 4.7460 | 7.3820 | 17.7820 | 5.6007 | 3.5206 | 47 |
e | -122.0289 | -45.7090 | -11.6488 | 29.7206 | 117.5965 | -4.9989 | 53.7052 | 47 |
yhat | 318.7326 | 513.3532 | 569.4385 | 639.9999 | 772.9678 | 573.4457 | 90.2138 | 47 |
r | -1.9317 | -0.7230 | -0.1818 | 0.4864 | 1.8425 | -0.0743 | 0.8559 | 47 |
h | 0.0265 | 0.0589 | 0.0810 | 0.1117 | 0.3151 | 0.1043 | 0.0680 | 47 |
The average of the hat values is exactly
\[\frac{(p+1)}{n} \;=\; \frac{ (4 + 1)}{48} \;=\; 0.1042\]
Indeed, the average of the hat values (with Wyoming in the data) is \(0.1042\)
The threshold for “high leverage” for the fuel data full model is
\[\frac{2 (p+1)}{n} \;=\; \frac{2 (4 + 1)}{48} \;=\; 0.2083\]
Which observations have “high leverage”?
Do they have large standardized residuals?
filter(fuelData, h > 2*(p+1)/n )
state | fuel | dlic | tax | inc | road | e | yhat | r | h |
---|---|---|---|---|---|---|---|---|---|
CN | 457 | 57.1 | 10.0 | 5.342 | 1.333 | 23.449 | 433.6 | 0.4027 | 0.2288 |
IL | 471 | 52.5 | 7.5 | 5.126 | 14.186 | 28.747 | 442.3 | 0.4942 | 0.2305 |
NV | 782 | 67.2 | 6.0 | 5.215 | 2.302 | 68.201 | 713.8 | 1.2005 | 0.2660 |
NY | 344 | 45.1 | 8.0 | 5.319 | 11.868 | 25.267 | 318.7 | 0.4501 | 0.2832 |
TX | 640 | 56.6 | 5.0 | 4.045 | 17.782 | -7.282 | 647.3 | -0.1327 | 0.3151 |
Leverage plots:
boxplot(fuelData$h, horizontal = TRUE, pch=20)
abline(v = 2 * (p+1) / n, lty=2)
mtext("hat values (leverage)", side=1, line=2)
gf_histogram(~ h, data=fuelData, color="white") %>%
gf_vline(xintercept = 2 * (p+1) / n, linetype=2) %>%
gf_labs(x = "hat values (leverage)")
# gf_text(h ~ 1, data=fuelData, label = ~ state) %>%
# gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
Observations with high leverage do not necessarily have large residuals (TX, NY, NV, CN, IL)
Observations with large residuals do not necessarily have high leverage (WY)
gf_text(h ~ r, data=fuelData, label = ~ state) %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2) %>%
gf_vline(xintercept = c(-2,2), linetype=2) %>%
gf_labs(x = "standardized residuals", y = "hat values (leverage)")
An observation with high leverage can have significant
“influence” on the regression coefficients.
Anscombe data: Extreme leverage, residual = 0
Cases with high leverage do not necessarily have large residuals,
… but can still be influential values.
AnscombeDataResids <- AnscombeDataResids %>%
mutate(h4 = hatvalues(lmfit4), r4 = rstandard(lmfit4))
AnscombeDataResids
Y1 | X1 | Y2 | X2 | Y3 | X3 | Y4 | X4 | e1 | yhat1 | e2 | yhat2 | e3 | yhat3 | e4 | yhat4 | h4 | r4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8.04 | 10 | 9.14 | 10 | 7.46 | 10 | 6.58 | 8 | 0.0390 | 8.001 | 1.1391 | 8.001 | -0.5397 | 8.000 | -0.421 | 7.001 | 0.1 | -0.3591 |
6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 | 8 | -0.0508 | 7.001 | 1.1391 | 7.001 | -0.2303 | 7.000 | -1.241 | 7.001 | 0.1 | -1.0586 |
7.58 | 13 | 8.74 | 13 | 12.74 | 13 | 7.71 | 8 | -1.9213 | 9.501 | -0.7609 | 9.501 | 3.2411 | 9.499 | 0.709 | 7.001 | 0.1 | 0.6048 |
8.81 | 9 | 8.77 | 9 | 7.11 | 9 | 8.84 | 8 | 1.3091 | 7.501 | 1.2691 | 7.501 | -0.3900 | 7.500 | 1.839 | 7.001 | 0.1 | 1.5687 |
8.33 | 11 | 9.26 | 11 | 7.81 | 11 | 8.47 | 8 | -0.1711 | 8.501 | 0.7591 | 8.501 | -0.6895 | 8.499 | 1.469 | 7.001 | 0.1 | 1.2531 |
9.96 | 14 | 8.10 | 14 | 8.84 | 14 | 7.04 | 8 | -0.0414 | 10.001 | -1.9009 | 10.001 | -1.1586 | 9.999 | 0.039 | 7.001 | 0.1 | 0.0333 |
7.24 | 6 | 6.13 | 6 | 6.08 | 6 | 5.25 | 8 | 1.2394 | 6.001 | 0.1291 | 6.001 | 0.0792 | 6.001 | -1.751 | 7.001 | 0.1 | -1.4937 |
4.26 | 4 | 3.10 | 4 | 5.39 | 4 | 12.50 | 19 | -0.7405 | 5.000 | -1.9009 | 5.001 | 0.3886 | 5.001 | 0.000 | 12.500 | 1.0 | NaN |
10.84 | 12 | 9.13 | 12 | 8.15 | 12 | 5.56 | 8 | 1.8388 | 9.001 | 0.1291 | 9.001 | -0.8492 | 8.999 | -1.441 | 7.001 | 0.1 | -1.2292 |
4.82 | 7 | 7.26 | 7 | 6.42 | 7 | 7.91 | 8 | -1.6807 | 6.501 | 0.7591 | 6.501 | -0.0805 | 6.500 | 0.909 | 7.001 | 0.1 | 0.7754 |
5.68 | 5 | 4.74 | 5 | 5.73 | 5 | 6.89 | 8 | 0.1795 | 5.500 | -0.7609 | 5.501 | 0.2289 | 5.501 | -0.111 | 7.001 | 0.1 | -0.0947 |
# Note that the standardized residual is not defined for
# the outlier because the hat value is exactly = 1!!!
p9Anscombe <- gf_point(h4 ~ c(1:11), data=AnscombeDataResids) %>%
gf_labs(x = "observation number", y = "hat values (leverage)") %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
# Plot raw residuals here since there is a missing value in the standardized residuals
p10Anscombe <- gf_point(h4 ~ e4, data=AnscombeDataResids) %>%
gf_labs(x = "raw residuals", y = "hat values (leverage)") %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
grid.arrange(p4Anscombe, p8Anscombe, p9Anscombe, p10Anscombe, ncol=2)
Anscombe data: Example of influence without leverage
Cases with large residuals do not necessarily have high leverage
…but can still be influential values.
AnscombeDataResids <- AnscombeDataResids %>%
mutate(h3 = hatvalues(lmfit3), r3 = rstandard(lmfit3))
p7bAnscombe <- gf_point(r3 ~ yhat3, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "std residuals (Y3 on X3)") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p11Anscombe <- gf_point(h3 ~ c(1:11), data=AnscombeDataResids) %>%
gf_labs(x = "observation number", y = "hat values") %>%
gf_lims(y = c(0, 0.40)) %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
p12Anscombe <- gf_point(h3 ~ r3, data=AnscombeDataResids) %>%
gf_labs(x = "standardized residuals", y = "hat values") %>%
gf_lims(y = c(0, 0.40)) %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
grid.arrange(p3Anscombe, p7bAnscombe, p11Anscombe, p12Anscombe, ncol=2)
Regress Y3 on X3 using all data values
tidy(lmfit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.0025 | 1.1245 | 2.670 | 0.0256 |
X3 | 0.4997 | 0.1179 | 4.239 | 0.0022 |
Regress Y3 on X3 after removing the outlier
tidy(lm(Y3 ~ X3, data=AnscombeData[-3, ]))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.0056 | 0.0029 | 1370 | 0 |
X3 | 0.3454 | 0.0003 | 1077 | 0 |
Regress Y3 on X3 using all data values
glance(lmfit3)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6663 | 0.6292 | 1.236 | 17.97 | 0.0022 | 2 | 9 |
Regress Y3 on X3 after removing the outlier
glance(lm(Y3 ~ X3, data=AnscombeData[-3, ]))[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
1 | 1 | 0.0031 | 1160688 | 0 | 2 | 8 |
The outlier influences the results even though leverage is low.
For the fuel data, Wyoming has a large (outlier) residual.
What if this observation were removed?
Does the regression fit change much?
fuelDataWY <- fuelData[fuelData$state != "WY", ]
lmfitWY <- lm(fuel ~ dlic + tax + inc + road, data=fuelDataWY)
Regression coefficients using all observations:
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
Regression coefficients using all observations …except WY:
tidy(lmfitWY)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 430.97 | 154.781 | 2.7844 | 0.0080 |
dlic | 11.74 | 1.640 | 7.1572 | 0.0000 |
tax | -31.36 | 10.815 | -2.8998 | 0.0059 |
inc | -66.26 | 14.324 | -4.6257 | 0.0000 |
road | -1.35 | 2.829 | -0.4773 | 0.6356 |
glance(lmfitWY)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7044 | 0.6762 | 55.15 | 25.02 | 0 | 5 | 42 |
Coefficients for the significant predictors (`dlic`, `tax`, `inc`) change a little,
but the p-values are similar either way (except for `tax`).
The estimate of \(\sigma\) (\(\widehat{\sigma}\)) gets much smaller,
as expected after removing an observation with a large residual.
So, the predictions may not change much (need to check),
but \(\widehat{\sigma}\) is sensitive to outliers.
An observation with high leverage and a large residual
can have significant “influence” on the regression coefficients.
For the fuel consumption data, NV has a high leverage value
…and a moderate standardized residual.
# Value of variables for Nevada
fuelData %>%
filter(state == "NV")
state | fuel | dlic | tax | inc | road | e | yhat | r | h |
---|---|---|---|---|---|---|---|---|---|
NV | 782 | 67.2 | 6 | 5.215 | 2.302 | 68.2 | 713.8 | 1.2 | 0.266 |
gf_text(h ~ r, data=fuelData, label = ~ state) %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2) %>%
gf_vline(xintercept = c(-2,2), linetype=2)
What if this observation were removed?
Does the regression fit change significantly?
fuelDataNV <- fuelData[fuelData$state != "NV", ]
lmfitNV <- lm(fuel ~ dlic + tax + inc + road, data=fuelDataNV)
Regression coefficients using all observations:
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
Regression coefficients using all observations …except NV:
tidy(lmfitNV)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 357.637 | 185.281 | 1.9302 | 0.0603 |
dlic | 13.170 | 1.920 | 6.8605 | 0.0000 |
tax | -28.684 | 13.858 | -2.0698 | 0.0447 |
inc | -72.704 | 17.865 | -4.0697 | 0.0002 |
road | -0.998 | 3.573 | -0.2793 | 0.7814 |
glance(lmfitNV)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.665 | 0.6331 | 65.96 | 20.84 | 0 | 5 | 42 |
Coefficients for the significant predictors (`dlic`, `tax`, `inc`) change a little,
and the p-value for `tax` changes a lot.
The estimate of \(\sigma\) (\(\widehat{\sigma}\)) is essentially the same
with or without NV in the data.
More importantly, we would want to check on the influence of NV on any predictions we make.
For observation \(i\), Cook’s distance measures the difference in the fit
when all observations are used (\(\widehat{y}_j\)) vs. all observations except \(i\) (\(\widehat{y}_{j(i)}\))
\[C_i = \frac{\sum_{j=1}^n (\widehat{y}_j - \widehat{y}_{j(i)})^2}{\widehat{\sigma}^2 (p+1)}\] This is a measure of the influence of observation \(i\) on predictions.
A rough guide: Pay attention to observations with \(C_i \ge 1\).
A much better guide: A dotplot or index plot of the \(C_i\)’s.
Look for unusual observations.
Add Cook’s distance to the dataset:
fuelData <- mutate(fuelData, C = cooks.distance(lmfit))
pfuelPotential1 <- gf_point(C ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "Cook's distance") %>%
gf_hline(yintercept = 1, linetype=2)
pfuelPotential2 <- gf_dotplot(~ C, data=fuelData) %>%
gf_labs(x = "Cook's distance")
grid.arrange(pfuelPotential1, pfuelPotential2, ncol=2)
gf_text(C ~ c(1:n), data=fuelData, label = ~state) %>%
gf_labs(x = "observation number", y = "Cook's distance")
Why does Wyoming have a high Cook’s distance when we know leverage is small?
Cook’s distance can be rewritten as \(\quad \displaystyle{C_i = \left(\frac{r_i^2}{p+1}\right) \left(\frac{h_{ii}}{1 - h_{ii}}\right)}\)
Cook’s distance combines the size of the residual with the amount of leverage.
\(\displaystyle \frac{h_{ii}}{1 - h_{ii}}\;\) is called the “potential function”
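This decomposition is easy to confirm from the columns already added to fuelData (a sketch; r, h, and C are the standardized residuals, hat values, and Cook's distances computed above, and p is the number of predictors defined earlier):
# Cook's distance recomputed from standardized residuals and leverage
C.byhand <- with(fuelData, (r^2 / (p + 1)) * (h / (1 - h)))
all.equal(unname(C.byhand), unname(fuelData$C))          # TRUE (up to rounding)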
So, the high residual for Wyoming once again alerts us
that this observation may have influence on the regression fit.
But, we already ascertained that leaving out Wyoming may not change things much
except for inflating the SSE and \(\widehat{\sigma}\).
The textbook mentions another type of residual sometimes used.
The idea is this: If \(e_i\) is an outlier (a large raw residual), then it inflates the SSE and hence \(\widehat{\sigma}\).
Thus, the standardized residual \(r_i\) will be smaller than it should be (since \(\widehat{\sigma}\) is in the denominator).
\[r_i \;=\; \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}\]
So… we might not notice some standardized residuals that should have been flagged as outliers.
A solution: For each observation \(i\), re-estimate \(\sigma\) using all observations except case \(i\); call this estimate \(\widehat{\sigma}_{(i)}\).
The result is an “externally studentized” or, more simply, “studentized” residual.
\[r_i^* = \frac{e_i}{\widehat{\sigma}_{(i)} \sqrt{1 - h_{ii}}}\] When the raw residual \(e_i\) is an outlier, we can expect that \(|\; r_i^*\; | > |\; r_i\; |\)
Studentized residuals follow a t-distribution with \((n-p-2)\) df
… if the model assumptions are satisfied.
This distribution is a bit wider than a \(N(0,1)\) distribution since the tails of the distribution of the \(r_i^*\) are longer.
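The leave-one-out estimates \(\widehat{\sigma}_{(i)}\) are available from influence(), so \(r_i^*\) can be computed by hand and compared with rstudent() (a sketch using the fuel model lmfit):
# Externally studentized residuals by hand
sigma.i <- influence(lmfit)$sigma                    # sigma-hat_(i), leaving out case i
rstar.byhand <- residuals(lmfit) / (sigma.i * sqrt(1 - hatvalues(lmfit)))
all.equal(rstar.byhand, rstudent(lmfit))             # TRUE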
If \(|\; r_i\; | > 1\), then \(|\; r_i^*\; | > |\; r_i\; |\)
Otherwise, \(r_i^*\) is just a tad smaller than \(r_i\).
So, as expected, for large standardized residuals,
the studentized residual is even larger.
The goal is to reduce false negatives
where we fail to notice outliers after standardization.
The textbook argues that there isn’t that much difference
between standardized and studentized residuals.
They choose to use standardized residuals in their graphs.
Suggestion:
Take a careful look at standardized residuals \(|\; r_i\; | \ge 2\) (outliers).
Add the studentized residuals (\(r_i^*\)) to the fuel consumption dataset:
fuelData <- mutate(fuelData, rstar = rstudent(lmfit))
Studentized residual for Wyoming:
# Value of variables for Wyoming
fuelData %>%
filter(state == "WY")
state | fuel | dlic | tax | inc | road | e | yhat | r | h | C | rstar |
---|---|---|---|---|---|---|---|---|---|---|---|
WY | 968 | 67.2 | 7 | 4.345 | 3.905 | 234.9 | 733.1 | 3.735 | 0.0997 | 0.309 | 4.49 |
Here is a summary of the other states:
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
variable | min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|---|
fuel | 344.0000 | 509.0000 | 566.0000 | 631.5000 | 865.0000 | 568.4468 | 96.9143 | 47 |
dlic | 45.1000 | 52.9500 | 56.3000 | 58.9500 | 72.4000 | 56.8170 | 5.3985 | 47 |
tax | 5.0000 | 7.0000 | 7.5000 | 8.2500 | 10.0000 | 7.6826 | 0.9559 | 47 |
inc | 3.0630 | 3.7330 | 4.2960 | 4.5835 | 5.3420 | 4.2396 | 0.5796 | 47 |
road | 0.4310 | 2.9465 | 4.7460 | 7.3820 | 17.7820 | 5.6007 | 3.5206 | 47 |
e | -122.0289 | -45.7090 | -11.6488 | 29.7206 | 117.5965 | -4.9989 | 53.7052 | 47 |
yhat | 318.7326 | 513.3532 | 569.4385 | 639.9999 | 772.9678 | 573.4457 | 90.2138 | 47 |
r | -1.9317 | -0.7230 | -0.1818 | 0.4864 | 1.8425 | -0.0743 | 0.8559 | 47 |
h | 0.0265 | 0.0589 | 0.0810 | 0.1117 | 0.3151 | 0.1043 | 0.0680 | 47 |
C | 0.0001 | 0.0021 | 0.0075 | 0.0166 | 0.1119 | 0.0174 | 0.0256 | 47 |
rstar | -1.9978 | -0.7189 | -0.1797 | 0.4820 | 1.8973 | -0.0737 | 0.8635 | 47 |
The studentized residual for Wyoming is more than twice the size of the largest-magnitude studentized residual among the other states.
pfuelStud <- gf_point(r ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_lims(y = c(-2.5, 5)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
pfuelExtStud <- gf_point(rstar ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_lims(y = c(-2.5, 5)) %>%
gf_labs(x = "fitted values (yhat)", y = "studentized residuals")
grid.arrange(pfuelStud, pfuelExtStud, ncol=2)
Check using a scatterplot matrix (`ggpairs`).
What plot should look like: Points spread out in a rough football shape for all plots.
Problems you are looking for:
(You will look more closely for all of these problems using residuals after a fit, but it’s a good idea to do a quick check first so that you know what you are working with.)
Make an added-variable plot for each \(x_j\) you expect to use in the model
“The stronger the linear relationship in the added-variable plot is, the more important the additional contribution of \(x_j\) to the regression equation already containing the other predictors.”
"If the scatterplot shows [a flat] slope, the variable is unlikely to be useful in the model.
Use this as a rough guide for decisions about which variables to include in the model, …or not.
Perhaps examine this more later with F-tests.
Make a residual plus component plot for each \(x_j\) you expect to use in the model.
This plot indicates if any nonlinearity is present in the relationship between \(Y\) and \(x_j\) after adjusting for the other predictors in the model.
What plot should look like:
Symmetric and mound-shaped (like a Normal distribution).
What plot should look like:
Points on plot should be lined up mostly along the diagonal reference line, with a few off-diagonal at either end.
Problems you are looking for:
Non-normality of residuals, seen as a Normal probability plot that doesn’t follow the diagonal.
Is the distribution of residuals skewed?
Bimodal distribution?
Are there large outliers?
What plot should look like:
Points should be spread in a horizontal band of roughly even width, without noticeable outliers. There should be no apparent pattern to the points.
Problems you are looking for:
# R `plot(fit)` creates this plot (among several others) with the square root of abs(standardized residuals), rather than the standardized residuals themselves. This is to make it easier to visualize non-constant variance.
Analysis is the same as for (6) above. In some cases, problems can appear here which weren’t visible with the residuals vs. fitted values plot (or vice versa).
Problems you are looking for:
This would be evidence of non-independence of errors.
Can make this plot even for data that are not obviously time-series data, in case something systematically changes with observations later in the dataset.
Problems you are looking for:
1a: Linearity
Scatterplot: standardized residuals vs. fitted values
…and more scatterplots: standardized residuals vs. each predictor
Look for random scatter across all values of \(\widehat{y}\) and each \(x_j\)
2a: Errors are normally distributed
Histogram, boxplot, dotplot, normal quantile plot of standardized residuals
Look for symmetry, mound-shape, not too skewed.
2c: Errors have constant variance \(\sigma^2\) over all values of \(x\)’s
Scatterplots:
standardized residuals vs. fitted values
standardized residuals vs. each predictor variable
Look for similar spread in residuals across all values of \(\widehat{y}\) or \(x_j\).
2d and 4a: Independence of errors
Mostly, this depends on how the data were collected.
But, can plot residuals vs. observation number
looking for a pattern over time (autocorrelation, Chapter 8)
4b: Equally reliable and informative observations
This depends on how the data were collected.
But, we will look at residuals for outliers and “influential points”
that may influence the results much more than other observations
Look for random scatter across all values of \(\widehat{y}\) or \(x_j\)
gf_point(r ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
Now, plot standardized residuals vs. each predictor in the model
pfuel.dlic <- gf_point(r ~ dlic, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "dlic", y = "standardized residuals")
pfuel.tax <- gf_point(r ~ tax, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "tax", y = "standardized residuals")
pfuel.inc <- gf_point(r ~ inc, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "inc", y = "standardized residuals")
pfuel.road <- gf_point(r ~ road, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "road", y = "standardized residuals")
grid.arrange(pfuel.dlic, pfuel.tax, pfuel.inc, pfuel.road, ncol=2)
Look for symmetry, mound-shape, not too skewed.
boxplot(fuelData$r, horizontal = TRUE, pch=20)
abline(v = c(-2, 2), lty=2)
mtext("standardized residuals", side=1, line=2)
p1 <- gf_histogram(~ r, data=fuelData, color="white", bins=15) %>%
gf_labs(x = "standardized residuals")
p2 <- gf_dotplot(~ r, data=fuelData) %>%
gf_labs(x = "standardized residuals")
grid.arrange(p1, p2, ncol=2)
gf_qq(~ r, data=fuelData, distribution=qnorm) %>%
gf_qqline(distribution=qnorm) %>%
gf_labs(x = "Standard Normal Quantiles", y = "Standardized Residuals")
Look for similar spread in residuals across all values of \(\widehat{y}\).
gf_point(r ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
First, find out what you can about how the data were collected.
But, can plot residuals vs. observation number
looking for a pattern over time (autocorrelation, Chapter 8)
This plot is not likely relevant for the fuel consumption data, which are in alphabetical order by state.
n <- dim(fuelData)[1]
gf_point(r ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2)
4b: Equally reliable and informative observations
Again, first find out what you can about how the data were collected.
But, we should look at residuals for outliers and “influential points”
that may influence the results much more than other observations
Leverage plots:
n <- dim(fuelData)[1]
p <- 4
pfuelHat1 <- gf_point(h ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
pfuelHat2 <- gf_point(h ~ r, data=fuelData) %>%
gf_labs(x = "standardized residuals", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
grid.arrange(pfuelHat1, pfuelHat2, ncol=2)
pfuelHat3 <- gf_text(h ~ c(1:n), data=fuelData, label = ~state) %>%
gf_labs(x = "observation number", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
pfuelHat4 <- gf_text(h ~ r, data=fuelData, label = ~state) %>%
gf_labs(x = "standardized residuals", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
grid.arrange(pfuelHat3, pfuelHat4, ncol=2)
Cook’s Distance plots:
A dotplot or index plot of the \(C_i\)’s.
Look for unusual observations.
Add Cook’s distance to the dataset:
fuelData <- mutate(fuelData, C = cooks.distance(lmfit))
pfuelPotential1 <- gf_point(C ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "Cook's distance") %>%
gf_hline(yintercept = 1, linetype=2)
pfuelPotential2 <- gf_dotplot(~ C, data=fuelData) %>%
gf_labs(x = "Cook's distance")
grid.arrange(pfuelPotential1, pfuelPotential2, ncol=2)
gf_text(C ~ c(1:n), data=fuelData, label = ~state) %>%
gf_labs(x = "observation number", y = "Cook's distance")
Keep
If we have a “story” that explains why the outlier exists
and is not a data collection/recording error.
Odd points may strongly influence model fit
Identifying outliers might be a goal in its own right.
Outliers may be left in but may require explanation in the analysis.
Drop
Usually we choose to keep the outlier
unless there are very sound reasons for dropping observations
Adjust
If outliers are also influential and you want to keep them,
then you might want to adjust for them.
Sometimes outliers may occur because important information
is contained in another variable not currently in the model.
Besides t-tests and such, an “added variable plot”
(“partial regression plot”) can illuminate whether another predictor will help.
Suppose we left the important predictor `dlic` out of the fuel consumption model.
lmfit.no.dlic <- lm(fuel ~ tax + inc + road, data=fuelData)
p1.no.dlic <- gf_point(rstandard(lmfit.no.dlic) ~ fitted(lmfit.no.dlic)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2)
p2.no.dlic <- gf_text(rstandard(lmfit.no.dlic) ~ fitted(lmfit.no.dlic), data=fuelData, label = ~state) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2)
grid.arrange(p1.no.dlic, p2.no.dlic, ncol=2)
Without `dlic` in the model, we have two outliers instead of one.
The added variable plot is a plot of the residuals from a model without
the predictor of interest vs. the predictor of interest.
In this case, we plot the residuals from fuel = f(tax, inc, road) against the predictor `dlic`.
gf_point(rstandard(lmfit.no.dlic) ~ dlic, data=fuelData) %>%
gf_labs(y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2) %>%
gf_coefline(model = lm(rstandard(lmfit.no.dlic) ~ dlic, data=fuelData))
The residuals have a (linear) pattern.
This indicates that `dlic` should be added to the linear model.
We know that with `dlic` in the model, we only have one outlier remaining.
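As an aside, many texts draw the “partial regression” form of this plot, with the residuals of dlic regressed on the other predictors on the horizontal axis instead of dlic itself. A sketch (the helper name lmfit.dlic is just for illustration):
# Partial-regression (classic added-variable) form of the plot for dlic
lmfit.dlic <- lm(dlic ~ tax + inc + road, data=fuelData)
gf_point(residuals(lmfit.no.dlic) ~ residuals(lmfit.dlic)) %>%
  gf_labs(x = "residuals of dlic on tax, inc, road",
          y = "residuals of fuel on tax, inc, road") %>%
  gf_coefline(model = lm(residuals(lmfit.no.dlic) ~ residuals(lmfit.dlic)))
In this version, the slope of the fitted line equals the `dlic` coefficient from the full model.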
Create an indicator variable (0’s and 1’s)
fuelData <- mutate(fuelData, WY = 1 * (state == "WY") )
tail(fuelData)
state | fuel | dlic | tax | inc | road | e | yhat | r | h | C | rstar | WY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | VA | 547 | 51.7 | 9.0 | 4.258 | 4.686 | 86.778 | 460.2 | 1.3598 | 0.0737 | 0.0294 | 1.3738 | 0 |
44 | VT | 561 | 58.0 | 9.0 | 3.865 | 1.586 | -17.107 | 578.1 | -0.2697 | 0.0847 | 0.0013 | -0.2668 | 0 |
45 | WI | 508 | 54.5 | 7.0 | 4.207 | 6.580 | -58.024 | 566.0 | -0.8947 | 0.0433 | 0.0072 | -0.8926 | 0 |
46 | WN | 510 | 57.1 | 9.0 | 4.476 | 3.942 | -9.678 | 519.7 | -0.1515 | 0.0723 | 0.0004 | -0.1498 | 0 |
47 | WV | 460 | 55.1 | 8.5 | 4.574 | 2.619 | -47.028 | 507.0 | -0.7284 | 0.0518 | 0.0058 | -0.7243 | 0 |
48 | WY | 968 | 67.2 | 7.0 | 4.345 | 3.905 | 234.947 | 733.1 | 3.7345 | 0.0997 | 0.3090 | 4.4901 | 1 |
Then, fit the model with this variable added as a predictor.
lmfitWYindicator <- lm(fuel ~ dlic + tax + inc + road + WY, data=fuelData)
The original model (without the indicator variable for WY)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
The new model (includes the indicator variable for WY)
tidy(lmfitWYindicator)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 430.97 | 154.781 | 2.7844 | 0.0080 |
dlic | 11.74 | 1.640 | 7.1572 | 0.0000 |
tax | -31.36 | 10.815 | -2.8998 | 0.0059 |
inc | -66.26 | 14.324 | -4.6257 | 0.0000 |
road | -1.35 | 2.829 | -0.4773 | 0.6356 |
WY | 260.97 | 58.122 | 4.4901 | 0.0001 |
glance(lmfitWYindicator)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7829 | 0.7571 | 55.15 | 30.29 | 0 | 6 | 42 |
The new model incorporates a separate intercept for WY, which effectively fits that observation exactly.
This ensures that the WY outlier does not inflate the SSE and \(\widehat{\sigma}\).
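A quick check confirms this (a sketch): with its own indicator, Wyoming is fit exactly, so it no longer contributes to the SSE.
# WY's residual is (numerically) 0 and its leverage is 1 in the indicator model
residuals(lmfitWYindicator)[fuelData$state == "WY"]
hatvalues(lmfitWYindicator)[fuelData$state == "WY"]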
Do we still have outliers?
gf_point(rstandard(lmfitWYindicator) ~ fitted(lmfitWYindicator)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2) %>%
gf_lims(x = c(325, 750))
You can see that we now have 3 outliers!
This is pretty common: you drop or adjust one outlier, and new ones crop up.
So, use this adjustment method with caution.