So far: the basics of linear regression.
Now: we explore how to critique our previous work.
Examine whether the data are consistent with the linear regression model specification.
Use mostly graphical methods.
If the regression model specifications (assumptions) are not true…
then all inference based on the fitted regression model is called into question.
For example, if the errors (residuals) are not even reasonably close to normally distributed,
then the LS estimators do not have the expected sampling distribution.
So, t-tests, F-tests, and confidence intervals may be incorrect
…unless the sample size is large (for t-tests and F-tests).
Some small deviations from the model specifications are not serious,
but large deviations could invalidate our results.
We should always plot the data before we get started.
For example, a scatterplot (or scatterplot matrix).
Let’s ignore that advice for the moment and see what kind of trouble that can lead to.
The regression results are identical for all four datasets
But the relationships of the responses to the same predictors are very different!
Can only discover this by graphing the data.
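Before graphing, a quick numeric check makes the point concrete (a sketch, assuming lmfit1 through lmfit4 are the simple regressions of Y1 on X1, …, Y4 on X4 fit earlier on AnscombeData):
# The four fits have (nearly) identical coefficients and R-squared
sapply(list(lmfit1, lmfit2, lmfit3, lmfit4), coef)
sapply(list(lmfit1, lmfit2, lmfit3, lmfit4), function(fit) summary(fit)$r.squared)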
p1Anscombe <- gf_point(Y1 ~ X1, data=AnscombeData) %>%
gf_coefline(model = lmfit1)
p2Anscombe <- gf_point(Y2 ~ X2, data=AnscombeData) %>%
gf_coefline(model = lmfit2)
p3Anscombe <- gf_point(Y3 ~ X3, data=AnscombeData) %>%
gf_coefline(model = lmfit3)
p4Anscombe <- gf_point(Y4 ~ X4, data=AnscombeData) %>%
gf_coefline(model = lmfit4)
grid.arrange(p1Anscombe, p2Anscombe, p3Anscombe, p4Anscombe, ncol=2)
Linearity assumption… \(y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i\)
Check using “residual plots”
Fact: No matter what pattern you might see, it is algebraically true that the observed residuals are uncorrelated with the fitted \(\widehat{y}\) values.
That is \(\mathrm{corr}(\widehat{y}, e) = 0\).
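This is easy to confirm numerically for any fitted model (a sketch using lmfit1 from the Anscombe fits above):
# Observed residuals are uncorrelated with fitted values (zero up to rounding)
cor(fitted(lmfit1), residuals(lmfit1))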
Look at the distribution of the standardized residuals to see whether it is consistent with data from a standard normal distribution.
No need to check this assumption.
In a previous lecture, we showed that for data, \(\sum_i e_i \;=\; \sum_i (y_i - \widehat{y}_i) \;=\; 0\)
That is, the observed residuals always have average zero.
If violated, we have a “heteroscedasticity” problem
If violated, and if the dependence is related to the order in which the data were collected,
then we have an “autocorrelation” problem
To check: plot the residuals vs. the order in which the data were collected.
What if you don’t know the order in which the data were collected?
Then consider how the data were collected and whether that might lead to dependence among observations.
Rarely true, but we rarely have the information needed to check.
Probably not true, but it is doubtful we would ever have enough information to assess it.
If violated, we have collinearity (or multi-collinearity)
To check: scatter plot of predictors against one another.
More on the collinearity problem later on in the course.
All have roughly equal role in determining results and influencing conclusions.
To check:
AnscombeDataResids <- AnscombeData %>%
mutate(e1 = residuals(lmfit1), yhat1 = fitted(lmfit1),
e2 = residuals(lmfit2), yhat2 = fitted(lmfit2),
e3 = residuals(lmfit3), yhat3 = fitted(lmfit3),
e4 = residuals(lmfit4), yhat4 = fitted(lmfit4)
)
p5Anscombe <- gf_point(e1 ~ yhat1, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y1 on X1)") %>%
gf_hline(yintercept = 0, linetype=2)
p6Anscombe <- gf_point(e2 ~ yhat2, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y2 on X2)") %>%
gf_hline(yintercept = 0, linetype=2)
p7Anscombe <- gf_point(e3 ~ yhat3, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y3 on X3)") %>%
gf_hline(yintercept = 0, linetype=2)
p8Anscombe <- gf_point(e4 ~ yhat4, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals (Y4 on X4)") %>%
gf_hline(yintercept = 0, linetype=2)
grid.arrange(p1Anscombe, p5Anscombe, p2Anscombe, p6Anscombe, ncol=2)
grid.arrange(p3Anscombe, p7Anscombe, p4Anscombe, p8Anscombe, ncol=2)
variable | description |
---|---|
state | State (excluding AK and HI) |
pop | Population in each state |
fuelc | Fuel consumption, millions of gallons |
fuel | Motor fuel consumption, gal/person |
nlic | Licensed drivers in 1971, thousands |
dlic | Percent of population with driver’s license |
tax | 1972 motor fuel tax rate, cents/gal |
inc | 1972 per capita income, thousands of dollars |
road | 1971 federal-aid highways, thousands of miles |
fuel = f(dlic, tax, inc, road)
We are hoping that residuals have similar spread across the fitted values
fuelData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/fuel.txt")
fuelData <- fuelData %>%
dplyr::select(state, fuel, dlic, tax, inc, road) %>%
arrange(state)
n <- dim(fuelData)[1]
p <- 4
First, fit the LS regression model and save it:
lmfit <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
Then, add the residuals and fitted values to the dataset:
fuelData <- mutate(fuelData, e = residuals(lmfit), yhat = fitted(lmfit))
We can now make plots of predictor variables and/or fitted values vs the residuals:
lmfit <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
fuelData <- fuelData %>%
mutate(e = residuals(lmfit), yhat = fitted (lmfit))
gf_point(e ~ yhat, data=fuelData) %>%
gf_labs(x = "yhat (fitted values)", y = "residuals") %>%
gf_hline(yintercept = 0, linetype = "dashed")
gf_point(e ~ dlic, data=fuelData) %>%
gf_labs(x = "dlic", y = "residuals") %>%
gf_hline(yintercept = 0, linetype = "dashed")
A faster, but slightly ugly, way to make the residuals vs. fitted values plot (and several of the other diagnostic plots that we use here):
plot(lmfit)
CAUTION: Least squares (observed) residuals do NOT all have same variance
Model errors: \(\mathrm{var}(\epsilon_i) = \sigma^2\)
Observed residuals: \(\mathrm{var}(e_i) = \sigma^2 (1 - h_{ii})\)
\((1/n) \le h_{ii} \le 1\)
Large \(h_{ii}\) indicates case \(i\) is far from the center of x-space.
So, the residuals \(e_i\) vary less for cases far from the center of x-space.
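A small simulation illustrates this (a sketch; all object names here are invented for the demonstration): residuals at the x value far from the center vary noticeably less than \(\sigma\).
# Simulate many datasets from a known line; compare sd of each residual to sigma*sqrt(1 - h)
set.seed(1)
x.sim <- c(1:10, 25)                                     # last x is far from the center
X.sim <- cbind(1, x.sim)
h.sim <- diag(X.sim %*% solve(t(X.sim) %*% X.sim) %*% t(X.sim))
sigma.true <- 2
res.sim <- replicate(2000,
  residuals(lm(y ~ x.sim,
               data = data.frame(x.sim = x.sim,
                                 y = 5 + 3 * x.sim + rnorm(11, 0, sigma.true)))))
cbind(h = round(h.sim, 3),
      sd.observed = round(apply(res.sim, 1, sd), 3),
      sd.theory = round(sigma.true * sqrt(1 - h.sim), 3))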
Applet: Distribution of slope and intercept
So, we will standardize \(e_i\)’s to solve this problem.
\[z_i = \frac{e_i - \mathrm{E}(e_i)}{\mathrm{sd}(e_i)} \;=\; \frac{e_i - 0}{\sigma\sqrt{1 - h_{ii}}} \;=\; \frac{e_i}{\sigma\sqrt{1 - h_{ii}}}\]
Properties: \(\mathrm{E}(z_i)=0\) and \(\mathrm{sd}(z_i)=1\), but for observed data, \(\sum z_i\) is not necessarily zero.
Well, you can’t compute \(z_i\) with any software, because we don’t know \(\sigma\).
However, we can estimate \(\sigma:\;\; \widehat{\sigma} = \displaystyle{ \sqrt{\frac{SSE}{n - p -1}}}\)
\[r_i \;=\; \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}\]
The observable standardized residuals are technically
called “internally studentized” residuals.
The textbook and R/Stata refer to the “internally studentized residuals” as “standardized” residuals. We will do the same.
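As a sanity check, the standardized residuals can be computed directly from this formula and compared with R’s rstandard() (a sketch using the fuel model lmfit fit earlier):
# Standardized (internally studentized) residuals by hand
sigma.hat <- summary(lmfit)$sigma               # estimate of sigma
hii <- hatvalues(lmfit)                         # leverages h_ii
r.byhand <- residuals(lmfit) / (sigma.hat * sqrt(1 - hii))
all.equal(r.byhand, rstandard(lmfit))           # TRUE (up to rounding)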
Let’s add the standardized residuals to the fuel consumption dataset:
fuelData <- mutate(fuelData, r = rstandard(lmfit))
Standardized residual for Wyoming:
# Value of variables for Wyoming
fuelData %>%
filter(state == "WY")
state | fuel | dlic | tax | inc | road | e | yhat | r |
---|---|---|---|---|---|---|---|---|
WY | 968 | 67.2 | 7 | 4.345 | 3.905 | 234.9 | 733.1 | 3.735 |
Here is a summary of the variables for the other states:
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
variable | min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|---|
fuel | 344.000 | 509.000 | 566.0000 | 631.5000 | 865.000 | 568.4468 | 96.9143 | 47 |
dlic | 45.100 | 52.950 | 56.3000 | 58.9500 | 72.400 | 56.8170 | 5.3985 | 47 |
tax | 5.000 | 7.000 | 7.5000 | 8.2500 | 10.000 | 7.6826 | 0.9559 | 47 |
inc | 3.063 | 3.733 | 4.2960 | 4.5835 | 5.342 | 4.2396 | 0.5796 | 47 |
road | 0.431 | 2.946 | 4.7460 | 7.3820 | 17.782 | 5.6007 | 3.5206 | 47 |
e | -122.029 | -45.709 | -11.6488 | 29.7206 | 117.597 | -4.9989 | 53.7052 | 47 |
yhat | 318.733 | 513.353 | 569.4385 | 639.9999 | 772.968 | 573.4457 | 90.2138 | 47 |
r | -1.932 | -0.723 | -0.1818 | 0.4864 | 1.843 | -0.0743 | 0.8559 | 47 |
The Wyoming standardized residual is almost twice the size of the largest magnitude standardized residual from among the other states.
standardized residuals have mean \(\mathrm{E}(r_i) = 0\) and \(\mathrm{var}(r_i) = 1\),
but for observed data, \(\sum r_i\) is not necessarily zero.
standardized residuals “are not strictly independent, but with a large
number of observations, the lack of independence may be ignored.”
So, if the errors \(\epsilon\) are \(N(0,\sigma^2)\),
then the \(r_i\) are approximately \(N(0,1)\) when sample size (\(n\)) is large.
Thus, we take a careful look at standardized residuals \(|\; r_i\; | \ge 2\) (outliers).
For simple linear regression (one predictor), \[h_{ii} = \frac{1}{n} + \frac{(x_i -\overline{x})^2}{\sum(x_i - \overline{x})^2}\]
\((1/n) \le h_{ii} \le 1\)
Recall that for linear regression \(\widehat{Y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} X_{1i} + \widehat{\beta}_{2} X_{2i} + \cdots + \widehat{\beta}_pX_{pi}\),
and we obtain the \(\widehat{\beta}\)’s by least squares estimation:
\[\begin{align} \min \mathrm{SSE}&\;=\; \min \sum_{i=1}^n e_i^2 \;=\; \min \sum(y_i-\widehat{y}_i)^2 \\ \\ &\;=\; \min \sum(y_i- \widehat{\beta}_{0} - \widehat{\beta}_{1} X_{1i} - \widehat{\beta}_{2} X_{2i} - \cdots - \widehat{\beta}_pX_{pi})^2 \end{align}\]
Using matrix algebra notation, this can be re-written as \[\min \mathrm{SSE}(\widehat{\boldsymbol{\beta}}) \;=\; \min (\mathbf{Y}-\mathbf{X}\widehat{\boldsymbol{\beta}})^T(\mathbf{Y}-\mathbf{X}\widehat{\boldsymbol{\beta}}). \] By taking the first derivative of the SSE w.r.t. \(\widehat{\boldsymbol{\beta}}\), setting equal to zero and solving,
we can obtain \[\widehat{\boldsymbol{\beta}}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}\]
This is a compact way of writing (and programming) least squares solutions for multiple regression with any number of predictors.
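A quick check that this matrix formula reproduces lm()’s estimates (a sketch using the fuel model lmfit):
# beta-hat = (X'X)^{-1} X'y matches coef() from lm()
X <- model.matrix(lmfit)                                  # design matrix (intercept + predictors)
b <- solve(t(X) %*% X) %*% t(X) %*% fuelData$fuel
all.equal(unname(as.vector(b)), unname(coef(lmfit)))      # TRUE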
This form of writing the solution illustrates that \(\widehat{\boldsymbol{\beta}}\) is in fact a linear function of \(Y\)s. Thus the fitted value
\[ \widehat{\mathbf{Y}} \;=\; \mathbf{X}\widehat{\boldsymbol{\beta}} \;=\; \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}= \mathbf{H}\mathbf{Y},\]
where \(\mathbf{H}=\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\).
\(\mathbf{H}\) is sometimes called the “hat matrix”
Note that \(\mathbf{H}\) only depends on the predictors \(\mathbf{X}\), not the responses \(\mathbf{Y}\).
The entries in the hat matrix are labeled \(h_{ij}\),
\(i = 1, 2, \ldots, n\), \(\quad j = 1, 2, \ldots, n\).
\(\mathbf{H}\) transforms \(y\) values into their corresponding \(\widehat{y}\) values.
\[\widehat{y}_i \;=\; h_{i1}y_1 + h_{i2}y_2 + \ldots + h_{in}y_n, \]
NOTE: The textbook refers to \(\mathbf{H}\) as a projection matrix (\(\mathbf{P}\))
and labels the matrix entries as \(p_{ij}\) instead of \(h_{ij}\).
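Similarly, the hat matrix can be built directly and checked against R’s hatvalues() and fitted() (a sketch, continuing with the fuel model lmfit):
# H = X (X'X)^{-1} X'; its diagonal gives the leverages, and H y gives yhat
X <- model.matrix(lmfit)
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(unname(diag(H)), unname(hatvalues(lmfit)))                      # TRUE
all.equal(unname(as.vector(H %*% fuelData$fuel)), unname(fitted(lmfit)))  # TRUE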
END OF MATRIX ALGEBRA DETOUR
We can write \(\widehat{y}_i \;=\; h_{i1}y_1 + h_{i2}y_2 + \ldots + h_{in}y_n,\)
\(h_{ij}\) is the “contribution” of \(y_j\) to \(\widehat{y}_i\)
That is, \(h_{ij}\) is the “weight” given to \(y_j\) in predicting \(\widehat{y}_i\).
For simple linear regression (one predictor): \[h_{ij} = \frac{1}{n} + \frac{(x_i -\overline{x})(x_j-\overline{x})}{\sum(x_i - \overline{x})^2}\]
So, if \(x_j\) is far from \(\overline{x}\) in the same direction as \(x_i\),
then \(y_j\) contributes a lot to the estimate \(\widehat{y}_i\).
We call \(h_{ii}\) the leverage of the ith observation
since \(h_{ii}\) is the weight given to \(y_i\) in predicting \(\widehat{y}_i\).
For simple linear regression (one predictor), \[h_{ii} = \frac{1}{n} + \frac{(x_i -\overline{x})^2}{\sum(x_i - \overline{x})^2}\]
So, if \(x_i\) is far from \(\overline{x}\), then \(y_i\) contributes a lot to the estimate \(\widehat{y}_i\).
The last item follows from the fact that \(\displaystyle{h_{ii} + \frac{e_i^2}{\mathrm{SSE}} \le 1}\)
If \(h_{ii}\) is large (close to 1), then \(y_i\) has a lot of “leverage” on \(\widehat{y}_i\),
the residual \(e_i\) is small,
and the prediction \(\widehat{y}_i\) depends relatively less on the other observations (\(y_j\)’s)
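The simple-regression leverage formula above is easy to verify numerically (a sketch, assuming lmfit1 is the Anscombe regression of Y1 on X1):
# h_ii = 1/n + (x_i - xbar)^2 / sum((x - xbar)^2) matches hatvalues()
x1 <- AnscombeData$X1
h.byhand <- 1/length(x1) + (x1 - mean(x1))^2 / sum((x1 - mean(x1))^2)
all.equal(unname(h.byhand), unname(hatvalues(lmfit1)))   # TRUE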
Graphical methods of leverage detection are preferred.
However, from item 2 above, the average of the \(h_{ii}\)’s is \((p+1) / n\),
so a rough screening for “high leverage” observations is \(h_{ii} \;>\; 2(p+1) / n \;\;=\;\; 2\;\times\) (average of the \(h_{ii}\)’s)
Add the “hat values” (\(h_{ii}\)) to the fuel consumption dataset:
fuelData <- mutate(fuelData, h = hatvalues(lmfit))
Hat value for Wyoming:
# Value of variables for Wyoming
fuelData %>%
filter(state == "WY")
state | fuel | dlic | tax | inc | road | e | yhat | r | h |
---|---|---|---|---|---|---|---|---|---|
WY | 968 | 67.2 | 7 | 4.345 | 3.905 | 234.9 | 733.1 | 3.735 | 0.0997 |
Here is a summary of the other states:
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
variable | min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|---|
fuel | 344.0000 | 509.0000 | 566.0000 | 631.5000 | 865.0000 | 568.4468 | 96.9143 | 47 |
dlic | 45.1000 | 52.9500 | 56.3000 | 58.9500 | 72.4000 | 56.8170 | 5.3985 | 47 |
tax | 5.0000 | 7.0000 | 7.5000 | 8.2500 | 10.0000 | 7.6826 | 0.9559 | 47 |
inc | 3.0630 | 3.7330 | 4.2960 | 4.5835 | 5.3420 | 4.2396 | 0.5796 | 47 |
road | 0.4310 | 2.9465 | 4.7460 | 7.3820 | 17.7820 | 5.6007 | 3.5206 | 47 |
e | -122.0289 | -45.7090 | -11.6488 | 29.7206 | 117.5965 | -4.9989 | 53.7052 | 47 |
yhat | 318.7326 | 513.3532 | 569.4385 | 639.9999 | 772.9678 | 573.4457 | 90.2138 | 47 |
r | -1.9317 | -0.7230 | -0.1818 | 0.4864 | 1.8425 | -0.0743 | 0.8559 | 47 |
h | 0.0265 | 0.0589 | 0.0810 | 0.1117 | 0.3151 | 0.1043 | 0.0680 | 47 |
The average of the hat values is exactly
\[\frac{(p+1)}{n} \;=\; \frac{ (4 + 1)}{48} \;=\; 0.1042\]
Indeed, the average of the hat values (with Wyoming in the data) is \(0.1042\)
The threshold for “high leverage” for the fuel data full model is
\[\frac{2 (p+1)}{n} \;=\; \frac{2 (4 + 1)}{48} \;=\; 0.2083\]
Which observations have “high leverage”?
Do they have large standardized residuals?
filter(fuelData, h > 2*(p+1)/n )
state | fuel | dlic | tax | inc | road | e | yhat | r | h |
---|---|---|---|---|---|---|---|---|---|
CN | 457 | 57.1 | 10.0 | 5.342 | 1.333 | 23.449 | 433.6 | 0.4027 | 0.2288 |
IL | 471 | 52.5 | 7.5 | 5.126 | 14.186 | 28.747 | 442.3 | 0.4942 | 0.2305 |
NV | 782 | 67.2 | 6.0 | 5.215 | 2.302 | 68.201 | 713.8 | 1.2005 | 0.2660 |
NY | 344 | 45.1 | 8.0 | 5.319 | 11.868 | 25.267 | 318.7 | 0.4501 | 0.2832 |
TX | 640 | 56.6 | 5.0 | 4.045 | 17.782 | -7.282 | 647.3 | -0.1327 | 0.3151 |
Leverage plots:
boxplot(fuelData$h, horizontal = TRUE, pch=20)
abline(v = 2 * (p+1) / n, lty=2)
mtext("hat values (leverage)", side=1, line=2)
gf_histogram(~ h, data=fuelData, color="white") %>%
gf_vline(xintercept = 2 * (p+1) / n, linetype=2) %>%
gf_labs(x = "hat values (leverage)")
# gf_text(h ~ 1, data=fuelData, label = ~ state) %>%
# gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
Observations with high leverage do not necessarily have large residuals (TX, NY, NV, CN, IL)
Observations with large residuals do not necessarily have high leverage (WY)
gf_text(h ~ r, data=fuelData, label = ~ state) %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2) %>%
gf_vline(xintercept = c(-2,2), linetype=2) %>%
gf_labs(x = "standardized residuals", y = "hat values (leverage)")
An observation with high leverage can have significant
“influence” on the regression coefficients.
Anscombe data: Extreme leverage, residual = 0
Cases with high leverage do not necessarily have large residuals,
… but can still be influential values.
AnscombeDataResids <- AnscombeDataResids %>%
mutate(h4 = hatvalues(lmfit4), r4 = rstandard(lmfit4))
AnscombeDataResids
Y1 | X1 | Y2 | X2 | Y3 | X3 | Y4 | X4 | e1 | yhat1 | e2 | yhat2 | e3 | yhat3 | e4 | yhat4 | h4 | r4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8.04 | 10 | 9.14 | 10 | 7.46 | 10 | 6.58 | 8 | 0.0390 | 8.001 | 1.1391 | 8.001 | -0.5397 | 8.000 | -0.421 | 7.001 | 0.1 | -0.3591 |
6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 | 8 | -0.0508 | 7.001 | 1.1391 | 7.001 | -0.2303 | 7.000 | -1.241 | 7.001 | 0.1 | -1.0586 |
7.58 | 13 | 8.74 | 13 | 12.74 | 13 | 7.71 | 8 | -1.9213 | 9.501 | -0.7609 | 9.501 | 3.2411 | 9.499 | 0.709 | 7.001 | 0.1 | 0.6048 |
8.81 | 9 | 8.77 | 9 | 7.11 | 9 | 8.84 | 8 | 1.3091 | 7.501 | 1.2691 | 7.501 | -0.3900 | 7.500 | 1.839 | 7.001 | 0.1 | 1.5687 |
8.33 | 11 | 9.26 | 11 | 7.81 | 11 | 8.47 | 8 | -0.1711 | 8.501 | 0.7591 | 8.501 | -0.6895 | 8.499 | 1.469 | 7.001 | 0.1 | 1.2531 |
9.96 | 14 | 8.10 | 14 | 8.84 | 14 | 7.04 | 8 | -0.0414 | 10.001 | -1.9009 | 10.001 | -1.1586 | 9.999 | 0.039 | 7.001 | 0.1 | 0.0333 |
7.24 | 6 | 6.13 | 6 | 6.08 | 6 | 5.25 | 8 | 1.2394 | 6.001 | 0.1291 | 6.001 | 0.0792 | 6.001 | -1.751 | 7.001 | 0.1 | -1.4937 |
4.26 | 4 | 3.10 | 4 | 5.39 | 4 | 12.50 | 19 | -0.7405 | 5.000 | -1.9009 | 5.001 | 0.3886 | 5.001 | 0.000 | 12.500 | 1.0 | NaN |
10.84 | 12 | 9.13 | 12 | 8.15 | 12 | 5.56 | 8 | 1.8388 | 9.001 | 0.1291 | 9.001 | -0.8492 | 8.999 | -1.441 | 7.001 | 0.1 | -1.2292 |
4.82 | 7 | 7.26 | 7 | 6.42 | 7 | 7.91 | 8 | -1.6807 | 6.501 | 0.7591 | 6.501 | -0.0805 | 6.500 | 0.909 | 7.001 | 0.1 | 0.7754 |
5.68 | 5 | 4.74 | 5 | 5.73 | 5 | 6.89 | 8 | 0.1795 | 5.500 | -0.7609 | 5.501 | 0.2289 | 5.501 | -0.111 | 7.001 | 0.1 | -0.0947 |
# Note that the standardized residual is not defined for
# the outlier because the hat value is exactly = 1!!!
p9Anscombe <- gf_point(h4 ~ c(1:11), data=AnscombeDataResids) %>%
gf_labs(x = "observation number", y = "hat values (leverage)") %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
# Plot raw residuals here since there is a missing value in the standardized residuals
p10Anscombe <- gf_point(h4 ~ e4, data=AnscombeDataResids) %>%
gf_labs(x = "raw residuals", y = "hat values (leverage)") %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
grid.arrange(p4Anscombe, p8Anscombe, p9Anscombe, p10Anscombe, ncol=2)
Anscombe data: Example of influence without leverage
Cases with large residuals do not necessarily have high leverage
…but can still be influential values.
AnscombeDataResids <- AnscombeDataResids %>%
mutate(h3 = hatvalues(lmfit3), r3 = rstandard(lmfit3))
p7bAnscombe <- gf_point(r3 ~ yhat3, data=AnscombeDataResids) %>%
gf_labs(x = "yhat (fitted values)", y = "std residuals (Y3 on X3)") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p11Anscombe <- gf_point(h3 ~ c(1:11), data=AnscombeDataResids) %>%
gf_labs(x = "observation number", y = "hat values") %>%
gf_lims(y = c(0, 0.40)) %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
p12Anscombe <- gf_point(h3 ~ r3, data=AnscombeDataResids) %>%
gf_labs(x = "standardized residuals", y = "hat values") %>%
gf_lims(y = c(0, 0.40)) %>%
gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
grid.arrange(p3Anscombe, p7bAnscombe, p11Anscombe, p12Anscombe, ncol=2)
Regress Y3 on X3 using all data values
tidy(lmfit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 3.0025 | 1.1245 | 2.670 | 0.0256 |
X3 | 0.4997 | 0.1179 | 4.239 | 0.0022 |
Regress Y3 on X3 after removing the outlier
tidy(lm(Y3 ~ X3, data=AnscombeData[-3, ]))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.0056 | 0.0029 | 1370 | 0 |
X3 | 0.3454 | 0.0003 | 1077 | 0 |
Regress Y3 on X3 using all data values
glance(lmfit3)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6663 | 0.6292 | 1.236 | 17.97 | 0.0022 | 2 | 9 |
Regress Y3 on X3 after removing the outlier
glance(lm(Y3 ~ X3, data=AnscombeData[-3, ]))[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
1 | 1 | 0.0031 | 1160688 | 0 | 2 | 8 |
The outlier influences the results even though leverage is low.
For the fuel data, Wyoming has a large (outlier) residual.
What if this observation were removed?
Does the regression fit change much?
fuelDataWY <- fuelData[fuelData$state != "WY", ]
lmfitWY <- lm(fuel ~ dlic + tax + inc + road, data=fuelDataWY)
Regression coefficients using all observations:
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
Regression coefficients using all observations …except WY:
tidy(lmfitWY)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 430.97 | 154.781 | 2.7844 | 0.0080 |
dlic | 11.74 | 1.640 | 7.1572 | 0.0000 |
tax | -31.36 | 10.815 | -2.8998 | 0.0059 |
inc | -66.26 | 14.324 | -4.6257 | 0.0000 |
road | -1.35 | 2.829 | -0.4773 | 0.6356 |
glance(lmfitWY)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7044 | 0.6762 | 55.15 | 25.02 | 0 | 5 | 42 |
Coefficients for the significant predictors (`dlic`, `tax`, `inc`) change a little,
but the p-values are similar either way (except for `tax`).
The estimate of \(\sigma\) (\(\widehat{\sigma}\)) gets much smaller,
as expected after removing an observation with a large residual.
So, the predictions may not change much (need to check),
but \(\widehat{\sigma}\) is sensitive to outliers.
An observation with high leverage and a large residual
can have significant “influence” on the regression coefficients.
For the fuel consumption data, NV has a high leverage value
…and a moderate standardized residual.
# Value of variables for Nevada
fuelData %>%
filter(state == "NV")
state | fuel | dlic | tax | inc | road | e | yhat | r | h |
---|---|---|---|---|---|---|---|---|---|
NV | 782 | 67.2 | 6 | 5.215 | 2.302 | 68.2 | 713.8 | 1.2 | 0.266 |
gf_text(h ~ r, data=fuelData, label = ~ state) %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2) %>%
gf_vline(xintercept = c(-2,2), linetype=2)
What if this observation were removed?
Does the regression fit change significantly?
fuelDataNV <- fuelData[fuelData$state != "NV", ]
lmfitNV <- lm(fuel ~ dlic + tax + inc + road, data=fuelDataNV)
Regression coefficients using all observations:
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
Regression coefficients using all observations …except NV:
tidy(lmfitNV)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 357.637 | 185.281 | 1.9302 | 0.0603 |
dlic | 13.170 | 1.920 | 6.8605 | 0.0000 |
tax | -28.684 | 13.858 | -2.0698 | 0.0447 |
inc | -72.704 | 17.865 | -4.0697 | 0.0002 |
road | -0.998 | 3.573 | -0.2793 | 0.7814 |
glance(lmfitNV)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.665 | 0.6331 | 65.96 | 20.84 | 0 | 5 | 42 |
Coefficients for the significant predictors (`dlic`, `tax`, `inc`) change a little,
and the p-value for `tax` changes a lot.
The estimate of \(\sigma\) (\(\widehat{\sigma}\)) is essentially the same
with or without NV in the data.
More importantly, we would want to check on the influence of NV on any predictions we make.
For observation \(i\), Cook’s distance measures the difference in the fit
when all observations are used (\(\widehat{y}_j\)) vs. all observations except \(i\) (\(\widehat{y}_{j(i)}\))
\[C_i = \frac{\sum_{j=1}^n (\widehat{y}_j - \widehat{y}_{j(i)})^2}{\widehat{\sigma}^2 (p+1)}\] This is a measure of the influence of observation \(i\) on predictions.
A rough guide: Pay attention to observations with \(C_i \ge 1\).
A much better guide: A dotplot or index plot of the \(C_i\)’s.
Look for unusual observations.
Add Cook’s distance to the dataset:
fuelData <- mutate(fuelData, C = cooks.distance(lmfit))
pfuelPotential1 <- gf_point(C ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "Cook's distance") %>%
gf_hline(yintercept = 1, linetype=2)
pfuelPotential2 <- gf_dotplot(~ C, data=fuelData) %>%
gf_labs(x = "Cook's distance")
grid.arrange(pfuelPotential1, pfuelPotential2, ncol=2)
gf_text(C ~ c(1:n), data=fuelData, label = ~state) %>%
gf_labs(x = "observation number", y = "Cook's distance")
Why does Wyoming have a high Cook’s distance when we know leverage is small?
Cook’s distance can be rewritten as \(\quad \displaystyle{C_i = \left(\frac{r_i^2}{p+1}\right) \left(\frac{h_{ii}}{1 - h_{ii}}\right)}\)
Cook’s distance combines the size of the residual with the amount of leverage.
\(\displaystyle \frac{h_{ii}}{1 - h_{ii}}\;\) is called the “potential function”
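This decomposition is easy to confirm from the columns already added to fuelData (a sketch; r, h, and C are the standardized residuals, hat values, and Cook's distances computed above, and p is the number of predictors defined earlier):
# Cook's distance recomputed from standardized residuals and leverage
C.byhand <- with(fuelData, (r^2 / (p + 1)) * (h / (1 - h)))
all.equal(unname(C.byhand), unname(fuelData$C))          # TRUE (up to rounding)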
So, the high residual for Wyoming once again alerts us
that this observation may have influence on the regression fit.
But, we already ascertained that leaving out Wyoming may not change things much
except for inflating the SSE and \(\widehat{\sigma}\).
The textbook mentions another type of residual sometimes used.
The idea is this: If \(e_i\) is an outlier (a large raw residual), then it inflates the SSE and hence \(\widehat{\sigma}\).
Thus, the standardized residual \(r_i\) will be smaller than it should be (since \(\widehat{\sigma}\) is in the denominator).
\[r_i \;=\; \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}\]
So… we might not notice some standardized residuals that should have been flagged as outliers.
A solution: For each observation \(i\), re-estimate \(\sigma\) using all observations except case \(i\); call this estimate \(\widehat{\sigma}_{(i)}\).
The result is an “externally studentized” or, more simply, “studentized” residual.
\[r_i^* = \frac{e_i}{\widehat{\sigma}_{(i)} \sqrt{1 - h_{ii}}}\] When the raw residual \(e_i\) is an outlier, we can expect that \(|\; r_i^*\; | > |\; r_i\; |\)
Studentized residuals follow a t-distribution with \((n-p-2)\) df
… if the model assumptions are satisfied.
This distribution is a bit wider than a \(N(0,1)\) distribution since the tails of the distribution of the \(r_i^*\) are longer.
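The leave-one-out estimates \(\widehat{\sigma}_{(i)}\) are available from influence(), so \(r_i^*\) can be computed by hand and compared with rstudent() (a sketch using the fuel model lmfit):
# Externally studentized residuals by hand
sigma.i <- influence(lmfit)$sigma                    # sigma-hat_(i), leaving out case i
rstar.byhand <- residuals(lmfit) / (sigma.i * sqrt(1 - hatvalues(lmfit)))
all.equal(rstar.byhand, rstudent(lmfit))             # TRUE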
If \(|\; r_i\; | > 1\), then \(|\; r_i^*\; | > |\; r_i\; |\)
Otherwise, \(r_i^*\) is just a tad smaller than \(r_i\).
So, as expected, for large standardized residuals,
the studentized residual is even larger.
The goal is to reduce false negatives
where we fail to notice outliers after standardization.
The textbook argues that there isn’t that much difference
between standardized and studentized residuals.
They choose to use standardized residuals in their graphs.
Suggestion:
Take a careful look at standardized residuals \(|\; r_i\; | \ge 2\) (outliers).
Add the studentized residuals (\(r_i^*\)) to the fuel consumption dataset:
fuelData <- mutate(fuelData, rstar = rstudent(lmfit))
Studentized residual for Wyoming:
# Value of variables for Wyoming
fuelData %>%
filter(state == "WY")
state | fuel | dlic | tax | inc | road | e | yhat | r | h | C | rstar |
---|---|---|---|---|---|---|---|---|---|---|---|
WY | 968 | 67.2 | 7 | 4.345 | 3.905 | 234.9 | 733.1 | 3.735 | 0.0997 | 0.309 | 4.49 |
Here is a summary of the other states:
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
variable | min | Q1 | median | Q3 | max | mean | sd | n |
---|---|---|---|---|---|---|---|---|
fuel | 344.0000 | 509.0000 | 566.0000 | 631.5000 | 865.0000 | 568.4468 | 96.9143 | 47 |
dlic | 45.1000 | 52.9500 | 56.3000 | 58.9500 | 72.4000 | 56.8170 | 5.3985 | 47 |
tax | 5.0000 | 7.0000 | 7.5000 | 8.2500 | 10.0000 | 7.6826 | 0.9559 | 47 |
inc | 3.0630 | 3.7330 | 4.2960 | 4.5835 | 5.3420 | 4.2396 | 0.5796 | 47 |
road | 0.4310 | 2.9465 | 4.7460 | 7.3820 | 17.7820 | 5.6007 | 3.5206 | 47 |
e | -122.0289 | -45.7090 | -11.6488 | 29.7206 | 117.5965 | -4.9989 | 53.7052 | 47 |
yhat | 318.7326 | 513.3532 | 569.4385 | 639.9999 | 772.9678 | 573.4457 | 90.2138 | 47 |
r | -1.9317 | -0.7230 | -0.1818 | 0.4864 | 1.8425 | -0.0743 | 0.8559 | 47 |
h | 0.0265 | 0.0589 | 0.0810 | 0.1117 | 0.3151 | 0.1043 | 0.0680 | 47 |
C | 0.0001 | 0.0021 | 0.0075 | 0.0166 | 0.1119 | 0.0174 | 0.0256 | 47 |
rstar | -1.9978 | -0.7189 | -0.1797 | 0.4820 | 1.8973 | -0.0737 | 0.8635 | 47 |
The studentized residual for Wyoming is more than twice the size of the largest-magnitude studentized residual among the other states.
pfuelStud <- gf_point(r ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_lims(y = c(-2.5, 5)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
pfuelExtStud <- gf_point(rstar ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_lims(y = c(-2.5, 5)) %>%
gf_labs(x = "fitted values (yhat)", y = "studentized residuals")
grid.arrange(pfuelStud, pfuelExtStud, ncol=2)
Check using a scatterplot matrix (`ggpairs`).
What plot should look like: Points spread out in a rough football shape for all plots.
Problems you are looking for:
(You will look more closely for all of these problems using residuals after a fit, but it’s a good idea to do a quick check first so that you know what you are working with.)
Make an added-variable plot for each \(x_j\) you expect to use in the model
“The stronger the linear relationship in the added-variable plot is, the more important the additional contribution of \(x_j\) to the regression equation already containing the other predictors.”
"If the scatterplot shows [a flat] slope, the variable is unlikely to be useful in the model.
Use this as a rough guide for decisions about which variables to include in the model, …or not.
Perhaps examine this more later with F-tests.
Make a residual plus component plot for each \(x_j\) you expect to use in the model.
This plot indicates if any nonlinearity is present in the relationship between \(Y\) and \(x_j\) after adjusting for the other predictors in the model.
What plot should look like:
Symmetric and mound-shaped (like a Normal distribution).
What plot should look like:
Points on plot should be lined up mostly along the diagonal reference line, with a few off-diagonal at either end.
Problems you are looking for:
Non-normality of residuals, seen as a Normal probability plot that doesn’t follow the diagonal.
Is the distribution of residuals skewed?
Bimodal distribution?
Are there large outliers?
What plot should look like:
Points should be spread in a horizontal band of roughly even width, without noticeable outliers. There should be no apparent pattern to the points.
Problems you are looking for:
# R `plot(fit)` creates this plot (among several others) with the square root of abs(standardized residuals), rather than the standardized residuals themselves. This is to make it easier to visualize non-constant variance.
Analysis is the same as for (6) above. In some cases, problems can appear here which weren’t visible with the residuals vs. fitted values plot (or vice versa).
Problems you are looking for:
This would be evidence of non-independence of errors.
Can make this plot even for data that are not obviously time-series data, in case something systematically changes with observations later in the dataset.
Problems you are looking for:
1a: Linearity
Scatterplot: standardized residuals vs. fitted values
…and more scatterplots: standardized residuals vs. each predictor
Look for random scatter across all values of \(\widehat{y}\) and each \(x_j\)
2a: Errors are normally distributed
Histogram, boxplot, dotplot, normal quantile plot of standardized residuals
Look for symmetry, mound-shape, not too skewed.
2c: Errors have constant variance \(\sigma^2\) over all values of \(x\)’s
Scatterplots:
standardized residuals vs. fitted values
standardized residuals vs. each predictor variable
Look for similar spread in residuals across all values of \(\widehat{y}\) or \(x_j\).
2d and 4a: Independence of errors
Mostly, this depends on how the data were collected.
But, can plot residuals vs. observation number
looking for a pattern over time (autocorrelation, Chapter 8)
4b: Equally reliable and informative observations
This depends on how the data were collected.
But, we will look at residuals for outliers and “influential points”
that may influence the results much more than other observations
Look for random scatter across all values of \(\widehat{y}\) or \(x_j\)
gf_point(r ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
Now, plot standardized residuals vs. each predictor in the model
pfuel.dlic <- gf_point(r ~ dlic, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "dlic", y = "standardized residuals")
pfuel.tax <- gf_point(r ~ tax, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "tax", y = "standardized residuals")
pfuel.inc <- gf_point(r ~ inc, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "inc", y = "standardized residuals")
pfuel.road <- gf_point(r ~ road, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "road", y = "standardized residuals")
grid.arrange(pfuel.dlic, pfuel.tax, pfuel.inc, pfuel.road, ncol=2)
Look for symmetry, mound-shape, not too skewed.
boxplot(fuelData$r, horizontal = TRUE, pch=20)
abline(v = c(-2, 2), lty=2)
mtext("standardized residuals", side=1, line=2)
p1 <- gf_histogram(~ r, data=fuelData, color="white", bins=15) %>%
gf_labs(x = "standardized residuals")
p2 <- gf_dotplot(~ r, data=fuelData) %>%
gf_labs(x = "standardized residuals")
grid.arrange(p1, p2, ncol=2)
gf_qq(~ r, data=fuelData, distribution=qnorm) %>%
gf_qqline(distribution=qnorm) %>%
gf_labs(x = "Standard Normal Quantiles", y = "Standardized Residuals")
Look for similar spread in residuals across all values of \(\widehat{y}\).
gf_point(r ~ yhat, data=fuelData) %>%
gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
First, find out what you can about how the data were collected.
But, can plot residuals vs. observation number
looking for a pattern over time (autocorrelation, Chapter 8)
This plot is not likely relevant for the fuel consumption data, which are in alphabetical order by state.
n <- dim(fuelData)[1]
gf_point(r ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2)
4b: Equally reliable and informative observations
Again, first find out what you can about how the data were collected.
But, we should look at residuals for outliers and “influential points”
that may influence the results much more than other observations
Leverage plots:
n <- dim(fuelData)[1]
p <- 4
pfuelHat1 <- gf_point(h ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
pfuelHat2 <- gf_point(h ~ r, data=fuelData) %>%
gf_labs(x = "standardized residuals", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
grid.arrange(pfuelHat1, pfuelHat2, ncol=2)
pfuelHat3 <- gf_text(h ~ c(1:n), data=fuelData, label = ~state) %>%
gf_labs(x = "observation number", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
pfuelHat4 <- gf_text(h ~ r, data=fuelData, label = ~state) %>%
gf_labs(x = "standardized residuals", y = "hat values") %>%
gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
grid.arrange(pfuelHat3, pfuelHat4, ncol=2)
Cook’s Distance plots:
A dotplot or index plot of the \(C_i\)’s.
Look for unusual observations.
Add Cook’s distance to the dataset:
fuelData <- mutate(fuelData, C = cooks.distance(lmfit))
pfuelPotential1 <- gf_point(C ~ c(1:n), data=fuelData) %>%
gf_labs(x = "observation number", y = "Cook's distance") %>%
gf_hline(yintercept = 1, linetype=2)
pfuelPotential2 <- gf_dotplot(~ C, data=fuelData) %>%
gf_labs(x = "Cook's distance")
grid.arrange(pfuelPotential1, pfuelPotential2, ncol=2)
gf_text(C ~ c(1:n), data=fuelData, label = ~state) %>%
gf_labs(x = "observation number", y = "Cook's distance")
Keep
If we have a “story” that explains why the outlier exists
and is not a data collection/recording error.
Odd points may strongly influence model fit
Identifying outliers might be a goal in its own right.
Outliers may be left in but may require explanation in the analysis.
Drop
Usually we choose to keep the outlier
unless there are very sound reasons for dropping observations
Adjust
If outliers are also influential and you want to keep them,
then you might want to adjust for them.
Sometimes outliers may occur because important information
is contained in another variable not currently in the model.
Besides t-tests and such, an “added variable plot”
(“partial regression plot”) can illuminate whether another predictor will help.
Suppose we left the important predictor `dlic` out of the fuel consumption model.
lmfit.no.dlic <- lm(fuel ~ tax + inc + road, data=fuelData)
p1.no.dlic <- gf_point(rstandard(lmfit.no.dlic) ~ fitted(lmfit.no.dlic)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2)
p2.no.dlic <- gf_text(rstandard(lmfit.no.dlic) ~ fitted(lmfit.no.dlic), data=fuelData, label = ~state) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2)
grid.arrange(p1.no.dlic, p2.no.dlic, ncol=2)
Without `dlic` in the model, we have two outliers instead of one.
The added variable plot is a plot of the residuals from a model without
the predictor of interest vs. the predictor of interest.
In this case, we plot the residuals from fuel = f(tax, inc, road) against the predictor `dlic`.
gf_point(rstandard(lmfit.no.dlic) ~ dlic, data=fuelData) %>%
gf_labs(y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2) %>%
gf_coefline(model = lm(rstandard(lmfit.no.dlic) ~ dlic, data=fuelData))
The residuals have a (linear) pattern.
This indicates that `dlic` should be added to the linear model.
We know that with `dlic` in the model, we only have one outlier remaining.
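As an aside, many texts draw the “partial regression” form of this plot, with the residuals of dlic regressed on the other predictors on the horizontal axis instead of dlic itself. A sketch (the helper name lmfit.dlic is just for illustration):
# Partial-regression (classic added-variable) form of the plot for dlic
lmfit.dlic <- lm(dlic ~ tax + inc + road, data=fuelData)
gf_point(residuals(lmfit.no.dlic) ~ residuals(lmfit.dlic)) %>%
  gf_labs(x = "residuals of dlic on tax, inc, road",
          y = "residuals of fuel on tax, inc, road") %>%
  gf_coefline(model = lm(residuals(lmfit.no.dlic) ~ residuals(lmfit.dlic)))
In this version, the slope of the fitted line equals the `dlic` coefficient from the full model.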
Create an indicator variable (0’s and 1’s)
fuelData <- mutate(fuelData, WY = 1 * (state == "WY") )
tail(fuelData)
state | fuel | dlic | tax | inc | road | e | yhat | r | h | C | rstar | WY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
43 | VA | 547 | 51.7 | 9.0 | 4.258 | 4.686 | 86.778 | 460.2 | 1.3598 | 0.0737 | 0.0294 | 1.3738 | 0 |
44 | VT | 561 | 58.0 | 9.0 | 3.865 | 1.586 | -17.107 | 578.1 | -0.2697 | 0.0847 | 0.0013 | -0.2668 | 0 |
45 | WI | 508 | 54.5 | 7.0 | 4.207 | 6.580 | -58.024 | 566.0 | -0.8947 | 0.0433 | 0.0072 | -0.8926 | 0 |
46 | WN | 510 | 57.1 | 9.0 | 4.476 | 3.942 | -9.678 | 519.7 | -0.1515 | 0.0723 | 0.0004 | -0.1498 | 0 |
47 | WV | 460 | 55.1 | 8.5 | 4.574 | 2.619 | -47.028 | 507.0 | -0.7284 | 0.0518 | 0.0058 | -0.7243 | 0 |
48 | WY | 968 | 67.2 | 7.0 | 4.345 | 3.905 | 234.947 | 733.1 | 3.7345 | 0.0997 | 0.3090 | 4.4901 | 1 |
Then, fit the model with this variable added as a predictor.
lmfitWYindicator <- lm(fuel ~ dlic + tax + inc + road + WY, data=fuelData)
The original model (without the indicator variable for WY)
tidy(lmfit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
glance(lmfit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
The new model (includes the indicator variable for WY)
tidy(lmfitWYindicator)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 430.97 | 154.781 | 2.7844 | 0.0080 |
dlic | 11.74 | 1.640 | 7.1572 | 0.0000 |
tax | -31.36 | 10.815 | -2.8998 | 0.0059 |
inc | -66.26 | 14.324 | -4.6257 | 0.0000 |
road | -1.35 | 2.829 | -0.4773 | 0.6356 |
WY | 260.97 | 58.122 | 4.4901 | 0.0001 |
glance(lmfitWYindicator)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.7829 | 0.7571 | 55.15 | 30.29 | 0 | 6 | 42 |
The new model incorporates a separate intercept for WY, which effectively fits that observation exactly.
This ensures that the WY outlier does not inflate the SSE and \(\widehat{\sigma}\).
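A quick check confirms this (a sketch): with its own indicator, Wyoming is fit exactly, so it no longer contributes to the SSE.
# WY's residual is (numerically) 0 and its leverage is 1 in the indicator model
residuals(lmfitWYindicator)[fuelData$state == "WY"]
hatvalues(lmfitWYindicator)[fuelData$state == "WY"]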
Do we still have outliers?
gf_point(rstandard(lmfitWYindicator) ~ fitted(lmfitWYindicator)) %>%
gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
gf_hline(yintercept = c(-2,0,2), linetype=2) %>%
gf_lims(x = c(325, 750))
You can see that we now have 3 outliers!
This is pretty common: you drop or adjust one outlier, and new ones crop up.
So, use this adjustment method with caution.