---
title: "STAT 224 Chapter 4"
params:
  dotabs: no
output:
  html_document:
    code_folding: hide
    css: styles.css
    df_print: kable
    number_sections: yes
    self_contained: yes
    theme: cerulean
---

```{r test-main, child = 'preamble.Rmd'}
```


# Chapter 4 `r if(params$dotabs) paste('{.tabset}')`

## Lecture 7

### So far... estimation and inference

So far, we have covered the basics of linear regression:

  + Fit simple and multiple linear regression models
  + Interpret the coefficients
  + Test hypotheses about the models and coefficients
  + Produce confidence and prediction intervals
  
### What's new...  model diagnostics

Now, we explore how to critique our previous work

  + model checking
  + model criticism
  + model diagnostics

Examine whether the data are consistent with the linear regression model specification.

Use mostly graphical methods.

#### Why diagnostics?

If the regression model specifications (assumptions) are not true...

All inference based on the fitted regression model becomes

  + less reliable
  + suspicious
  + ... even invalid

For example, if the errors (residuals) are not even approximately normally distributed,    
the LS estimators do not have the expected sampling distribution.

So, t-tests, F-tests, and confidence intervals may be incorrect   
...unless the sample size is large (for t-tests and F-tests).

Some small deviations from the model specifications are not serious,   
but large deviations could invalidate our results.

### Anscombe datasets: Why graphs are important

We should always plot the data before we get started.

For example, a scatterplot (or scatterplot matrix).

Let's ignore that advice for the moment and see what kind of trouble that can lead to.

#### Anscombe datasets: LS regression fit

The regression results are identical for all four datasets

```{r,include=FALSE}
AnscombeData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P029b.txt")
lmfit1 <- lm(Y1 ~ X1, data=AnscombeData)
lmfit2 <- lm(Y2 ~ X2, data=AnscombeData)
lmfit3 <- lm(Y3 ~ X3, data=AnscombeData)
lmfit4 <- lm(Y4 ~ X4, data=AnscombeData)
```

<!-- Other statistics are the same for each dataset too... -->
<!-- ```{r} -->
<!-- glance(lmfit1)[-c(7:10)] -->
<!-- glance(lmfit2)[-c(7:10)] -->
<!-- glance(lmfit3)[-c(7:10)] -->
<!-- glance(lmfit4)[-c(7:10)] -->
<!-- ``` -->


#### Anscombe datasets: scatterplots

But the relationships of the responses to the same predictors are very different!

Can only discover this by graphing the data.
```{r}
p1Anscombe <- gf_point(Y1 ~ X1, data=AnscombeData) %>%
  gf_coefline(model = lmfit1)
p2Anscombe <- gf_point(Y2 ~ X2, data=AnscombeData) %>%
  gf_coefline(model = lmfit2)
p3Anscombe <- gf_point(Y3 ~ X3, data=AnscombeData) %>%
  gf_coefline(model = lmfit3)
p4Anscombe <- gf_point(Y4 ~ X4, data=AnscombeData) %>%
  gf_coefline(model = lmfit4)
grid.arrange(p1Anscombe, p2Anscombe, p3Anscombe, p4Anscombe, ncol=2)
```


### Revisiting the model assumptions

  1. Assumption about the form of the model
     a. Linearity: The mean of $Y$ is a linear function of the $x$'s
  2. Assumptions about the errors ($\epsilon$'s)
     a. Normally distributed (and thus so are the $Y$'s)
     b. Mean zero: No systematic mis-prediction
     c. Constant variance $\sigma^2$ over all values $X$'s
     d. Independent 
     e. Uncorrelated with predictors ($x$'s)
  3. Assumptions about the predictors ($x$'s)
     a. Non-random, "fixed"
        + True for many designed experiments
        + For observational studies, inferences are conditional on $x$'s
     b. Measured without error
        + Probably not true, but doubtful ever have enough information to assess
     c. Linearly independent
        + No predictor can be expressed as linear combination of others
        + Not collinear (predictor variables not interrelated)
        + Rarely true, but minor collinearity OK
     d. Uncorrelated with errors ($\epsilon$'s)
  4. Assumptions about the observations: $y_i, x_{1i}, x_{2i}, \ldots, x_{pi}$
     a. Independent (values of observation $i$ not dependent on values of observation $j$)
     b. Equally reliable and informative
    
    
### Checking regression assumptions


####  Assumptions about the form of the model

Linearity assumption... 
	 $y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$
	 
Check using "residual plots" 
	 
  + Scatter plots of standardized residuals against each of the predictor variables. 
    + Should see: no pattern.
    + Could see:
      + clear curve (nonlinearity)
      + outlying points (could influence results, could lead to interesting discovery)
      + spread changing with $x_j$ (a non-constant variance problem, not a linearity problem)
  + Scatter plot of standardized residuals against fitted $\yhat$ values. 
    + Sometimes called a "scale-location" plot
    + Also used to check constant variance assumption
    + Should see: no pattern.
    + Could see:
      + clear curve (nonlinearity)
      + outlying points (could influence results, could lead to interesting discovery)
      + spread changing with $\yhat$ (a non-constant variance problem, not a linearity problem)
	      
Fact: No matter what pattern you might see, it is algebraically true
that the observed residuals are uncorrelated with the fitted $\yhat$ values.

That is $\corr(\yhat, e) = 0$.
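
A quick numerical check of this fact, using the Anscombe fit `lmfit1` defined above:

```{r}
# For any least squares fit with an intercept, the observed correlation between
# fitted values and residuals is zero (up to floating-point rounding error).
cor(fitted(lmfit1), residuals(lmfit1))
```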
	 
![](https://data.library.virginia.edu/files/diagnostics1.jpeg)
	
####	 Assumptions about errors (epsilon)

##### The errors have a normal distribution ("normality assumption")

Look at the distribution of the standardized residuals to see if it is consistent with 
a standard normal distribution.
  
Check using:
  
  + Normal probability (quantile) plot of standardized residuals.
    + Should see: most points fall along the reference line
    + Could see: 
      + a lot of points tailing off the line
      + indicating that the residuals are quite non-normally distributed in the tails. 
      + maybe because the errors themselves aren't normally distributed
      + ...or maybe because of other model problems like non-linearity.
      
![](https://data.library.virginia.edu/files/diagnostics2.jpeg)      

      
##### The errors $\epsilon$ have mean zero

No need to check this assumption.
  
In a previous lecture, we showed that for the observed data,
$\sum_i e_i \;=\; \sum_i (y_i - \yhat_i) \;=\; 0$

That is, the observed residuals always average to zero.
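
A one-line check, using the Anscombe fit `lmfit1` from above:

```{r}
# For a least squares fit that includes an intercept,
# the residuals sum to zero (up to rounding error).
sum(residuals(lmfit1))
```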
  
##### The errors have the same variance $\sigma^2$ ("constant variance assumption")

If violated, we have a "**heteroscedasticity**" problem

  + Scatter plot of standardized residuals against fitted $\yhat$ values. 
    + Sometimes called a "scale-location" plot
    + Also used to check linearity assumption
    + Should see: no pattern.
    + Could see:
      + clear curve (nonlinearity)
      + outlying points (could influence results, could lead to interesting discovery)
      + spread changing with $\yhat$ (a non-constant variance problem, not a linearity problem)
      + Common patterns: points spreading out in a fan or butterfly shape.

![](https://data.library.virginia.edu/files/diagnostics3.jpeg)
   
      
##### Errors are independent of each other "independent-errors assumption" 

If violated, and if the dependence is related to the order in which the data were collected,   
then it is called an "**autocorrelation**" problem

To check: 

  + Plot standardized residuals vs. the observation number 
    + The observation number must equal the index/order in which the data were collected
    + Called an "index plot"
    + Should see: no clear pattern.
    + Could see: 
      + Data collected later have higher or lower residuals. 
      + A zig zag pattern
      
What if you don't know the order in which the data were collected?

Then consider how the data were collected and whether that might lead to dependence
among observations.
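
For intuition, here is a minimal sketch with simulated data (not the course data) showing the kind of pattern an index plot can reveal when the errors are autocorrelated:

```{r}
# Simulated AR(1) errors: an index plot shows long runs above/below zero
# instead of random scatter.
set.seed(224)
e_ar <- as.numeric(arima.sim(model = list(ar = 0.8), n = 60))
gf_point(e_ar ~ seq_along(e_ar)) %>%
  gf_labs(x = "observation number", y = "simulated (autocorrelated) errors") %>%
  gf_hline(yintercept = 0, linetype = 2)
```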

##### Errors are uncorrelated with predictors

No need to check this assumption.
  
In a previous lecture, we showed that for the observed data,
$\corr(e, x_j) \;=\; 0$ for each predictor $x_j$

That is, the observed residuals are always uncorrelated with the predictor variables.

#### Assumptions about the predictors

##### Predictors are nonrandom

  + True for many designed experiments
  + For observational studies, inferences are conditional on $x$'s
  
Rarely true, but we rarely have the information needed to check.

##### Predictors are measured without error

Probably not true, but we rarely have enough information to assess this.

##### Predictor variables are linearly independent 

If violated, we have collinearity (or multi-collinearity)

  + What we want:
    + No predictor can be expressed as linear combination of others
    + Not collinear (predictor variables not interrelated)
  + But, in practice...
    + Rarely true, but minor collinearity OK
    
To check: scatter plot of predictors against one another. 

More on the collinearity problem later on in the course.

#### Assumptions about the observations

##### All observations are equally reliable and informative

All have roughly equal role in determining results and influencing conclusions.

To check:
  
  + Look for outliers and "influential points" in residual plots
    + scatterplot of standardized residuals vs fitted $\yhat$ values
  + Residuals vs leverage plot
    + Examine points with high leverage (to be discussed in this chapter)
  + Measure "Cook's Distance" and "DFITS" (to be discussed in this chapter)

![](https://data.library.virginia.edu/files/diagnostics5.jpeg)
    

### The basic residual plot (vs. fitted yhat values)

#### Anscombe data: raw residuals vs. fitted values

```{r}
AnscombeDataResids <- AnscombeData %>%
  mutate(e1 = residuals(lmfit1), yhat1 = fitted(lmfit1),
         e2 = residuals(lmfit2), yhat2 = fitted(lmfit2),
         e3 = residuals(lmfit3), yhat3 = fitted(lmfit3),
         e4 = residuals(lmfit4), yhat4 = fitted(lmfit4)
         )
p5Anscombe <- gf_point(e1 ~ yhat1, data=AnscombeDataResids) %>%
  gf_labs(x = "yhat (fitted values)", y = "residuals (Y1 on X1)") %>%
  gf_hline(yintercept = 0, linetype=2)
p6Anscombe <- gf_point(e2 ~ yhat2, data=AnscombeDataResids) %>%
  gf_labs(x = "yhat (fitted values)", y = "residuals (Y2 on X2)") %>%
  gf_hline(yintercept = 0, linetype=2)
p7Anscombe <- gf_point(e3 ~ yhat3, data=AnscombeDataResids) %>%
  gf_labs(x = "yhat (fitted values)", y = "residuals (Y3 on X3)") %>%
  gf_hline(yintercept = 0, linetype=2)
p8Anscombe <- gf_point(e4 ~ yhat4, data=AnscombeDataResids) %>%
  gf_labs(x = "yhat (fitted values)", y = "residuals (Y4 on X4)") %>%
  gf_hline(yintercept = 0, linetype=2)
grid.arrange(p1Anscombe, p5Anscombe, p2Anscombe, p6Anscombe, ncol=2)
grid.arrange(p3Anscombe, p7Anscombe, p4Anscombe, p8Anscombe, ncol=2)
```

#### Fuel data: plot raw residuals vs. fitted values

| variable | description |
|:---------|:------------|
| state    | State (excluding AK and HI) |
| pop      | Population in each state |
| fuelc    | Fuel consumption, millions of gallons |
| fuel     | Motor fuel consumption, gal/person |
| nlic     | Thousands of licensed drivers (1971) |
| dlic     | Percent of population with driver's license |
| tax      | Motor fuel tax rate, cents/gal (1972) |
| inc      | Per capita income, thousands of dollars (1972) |
| road     | Federal-aid highways, thousands of miles (1971) |


fuel = f(dlic, tax, inc, road)

We are hoping that the residuals have similar spread across the fitted values.
```{r}
fuelData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/fuel.txt")
fuelData <- fuelData %>%
  dplyr::select(state, fuel, dlic, tax, inc, road) %>%
  arrange(state)
n <- dim(fuelData)[1]
p <- 4
```





#### Residuals and fitted values in R

First, fit the LS regression model and save it:

`lmfit <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)`

Then, add the residuals and fitted values to the dataset:

`fuelData <- mutate(fuelData, e = residuals(lmfit), yhat = fitted(lmfit))`

We can now plot the residuals against the fitted values and/or each of the predictor variables:

```{r}
lmfit <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
fuelData <- fuelData %>%
  mutate(e = residuals(lmfit), yhat = fitted (lmfit))

gf_point(e ~ yhat, data=fuelData) %>%
  gf_labs(x = "yhat (fitted values)", y = "residuals") %>%
  gf_hline(yintercept = 0, linetype = "dashed")


```

```{r}

gf_point(e ~ dlic, data=fuelData) %>%
  gf_labs(x = "dlic", y = "residuals") %>%
  gf_hline(yintercept = 0, linetype = "dashed")



```


A faster, but slightly ugly, way to make the residuals vs. fitted values plot (and several of the other diagnostic plots that we use here):
  
```{r}
plot(lmfit)
```
  
  
#### What we are checking with the residuals vs fitted values plot
  
  + Average of residuals is zero (a fact from least squares).
  + Is the pattern linear?
    + Model assumption: Y is a linear function of predictors
  + Is the spread around the line similar across values of $\yhat$?
    + Model assumption: Errors have constant variance for all values of predictors






## Lecture 8

### Standardizing residuals

**CAUTION:** Least squares (observed) residuals do NOT all have the same variance

Model errors: $\var(\epsilon_i) = \sigma^2$

Observed residuals: $\var(e_i) = \sigma^2 (1 - h_{ii})$

$(1/n) \le h_{ii} < 1$

Large $h_{ii}$ indicates case $i$ is far from the center of the x-space.   

So, the residuals $e_i$ vary less for cases far from the center of the x-space.   


<br>
[Applet: Distribution of slope and intercept](http://www.rossmanchance.com/applets/RegSim/RegCoeff.html)
<br>

So, we will standardize $e_i$'s to solve this problem.  

$$z_i = \frac{e_i - \E(e_i)}{\sd(e_i)}
\;=\; \frac{e_i - 0}{\sigma\sqrt{1 - h_{ii}}}
\;=\; \frac{e_i}{\sigma\sqrt{1 - h_{ii}}}$$

Properties: $\E(z_i)=0$ and $\sd(z_i)=1$,
but for observed data, $\sum z_i$ is not necessarily zero.



#### Calculating standardized residuals

Well, you can't compute the $z_i$ exactly with any software, 
because we don't know $\sigma$.

However, we can estimate $\sigma:\;\; \widehat{\sigma} = \displaystyle{ \sqrt{\frac{SSE}{n - p -1}}}$



$$r_i \;=\; \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}$$
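
A minimal sketch checking this formula against R's built-in `rstandard()` (introduced just below), using the fuel-model fit `lmfit` from Lecture 7:

```{r}
# Standardized residuals "by hand" vs. rstandard()
sigma.hat <- summary(lmfit)$sigma          # estimate of sigma
hii <- hatvalues(lmfit)                    # leverages (h_ii)
r.manual <- residuals(lmfit) / (sigma.hat * sqrt(1 - hii))
all.equal(unname(r.manual), unname(rstandard(lmfit)))
```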

#### A note about terminology" "internally studentized residuals"?

The observable standardized residuals are technically    
called "internally studentized" residuals.   

The textbook and R/Stata refer to the "internally studentized residuals"
as "standardized" residuals.  We will do the same.

#### How to get standardized residuals in R

Let's add the standardized residuals to the fuel consumption dataset:

`fuelData <- mutate(fuelData, r = rstandard(lmfit))`

```{r}
fuelData <- mutate(fuelData, r = rstandard(lmfit))
```

Standardized residual for Wyoming:
```{r}
# Value of variables for Wyoming
fuelData %>%
  filter(state == "WY")
```

Here is a summary of the variables for the other states:
```{r}
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
```

The Wyoming standardized residual is almost twice the size
of the largest magnitude standardized residual from among the other states.

#### Properties of standardized residuals

Standardized residuals have mean $\E(r_i) = 0$ and $\var(r_i) = 1$,  
but for observed data, $\sum r_i$ is not necessarily zero.

standardized residuals "are not strictly independent, but with a large   
number of observations, the lack of independence may be ignored."

So, if the errors $\epsilon$ are $N(0,\sigma^2)$,   
then the $r_i$ are approximately $N(0,1)$ when sample size ($n$) is large.

Thus, we take a careful look at standardized residuals $|\; r_i\; | \ge 2$ (outliers).

#### What is h_ii?

  + $h_{ii}$ is a function of the predictors only
  + Large $h_{ii}$ indicates case $i$ is far from the center of the $x$ "space"
  
For simple linear regression (one predictor), 
$$h_{ii} = \frac{1}{n} + \frac{(x_i -\xbar)^2}{\sum(x_i - \xbar)^2}$$
  
$(1/n) \le h_{ii} < 1$
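
A quick sketch verifying this formula for the simple regression of `Y1` on `X1` (the Anscombe fit `lmfit1` above):

```{r}
# Leverage formula for simple linear regression vs. R's hatvalues()
x <- AnscombeData$X1
h.formula <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(lmfit1)), h.formula)
```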

#### Detour: The "hat matrix" (not part of STAT 224)

Recall that for linear regression
$\Yhat_i \;=\; \b0 + \b1 X_{1i} + \b2 X_{2i} + \cdots + \bp X_{pi}$,   
and we obtain the $\widehat{\beta}$'s by least squares estimation:

$$\begin{align}
\min \SSE &\;=\; \min \sum_{i=1}^n e_i^2 \;=\; \min \sum(y_i-\yhat_i)^2  \\ \\
&\;=\; \min \sum(y_i- \b0 - \b1 X_{1i} - \b2 X_{2i} - \cdots - \bp X_{pi})^2
\end{align}$$

Using matrix algebra notation, this can be re-written as
\[\min \SSE(\widehat{\bbeta}) \;=\; \min (\bY-\bX\widehat{\bbeta})^T(\bY-\bX\widehat{\bbeta}). \] 
By taking the first derivative of the SSE w.r.t. $\widehat{\bbeta}$, setting equal to zero and solving,    
we can obtain $$\widehat{\bbeta}=(\bX^T\bX)^{-1}\bX^T\bY$$

This is a compact way of writing (and programming) least squares solutions for multiple regression with any number of predictors.

This form of writing the solution illustrates that  $\widehat{\bbeta}$ is in fact a linear function of $Y$s. Thus the fitted value 

\[ \widehat{\bY} \;=\; \bX \widehat{\bbeta} \;=\; \bX(\bX^T\bX)^{-1}\bX^T\bY = \bH \bY ,\] 

where $\bH=\bX(\bX^T\bX)^{-1}\bX^T$.

$\bH$ is sometimes called the "hat matrix"   
Note that $\bH$ only depends on the predictors $\bX$, not the responses $\bY$.

The entries in the hat matrix are labeled $h_{ij}$,   
$i = 1, 2, \ldots, n$, $\quad j = 1, 2, \ldots, n$.

#### Fitted value is a linear combination of all the y-values

$\bH$ transforms $y$ values into their corresponding $\yhat$ values. 

\[\yhat_i \;=\; h_{i1}y_1 + h_{i2}y_2 + \ldots + h_{in}y_n, \] 

**NOTE:** The textbook refers to $\bH$ as a projection matrix ($\mathbf{P}$)   
and labels the matrix entries as $p_{ij}$ instead of $h_{ij}$.
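
A small sketch of this detour in code, using the Anscombe fit `lmfit1` from above: build $\bH$ directly and confirm it reproduces the fitted values and the hat values.

```{r}
# Hat matrix H = X (X'X)^{-1} X' for the regression of Y1 on X1
X <- model.matrix(lmfit1)
H <- X %*% solve(t(X) %*% X) %*% t(X)
# H %*% y gives the fitted values, and diag(H) gives the leverages h_ii
all.equal(as.numeric(H %*% AnscombeData$Y1), unname(fitted(lmfit1)))
all.equal(unname(diag(H)), unname(hatvalues(lmfit1)))
```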

END OF MATRIX ALGEBRA DETOUR

### Leverage

#### Fitted value can be re-written as a linear combination of all the y-values

We can write 
$\yhat_i \;=\; h_{i1}y_1 + h_{i2}y_2 + \ldots + h_{in}y_n,$

$h_{ij}$ is the "contribution" of $y_j$ to $\yhat_i$

That is, $h_{ij}$ is the "weight" given to $y_j$ in predicting $\yhat_i$.

For simple linear regression (one predictor): 
$$h_{ij} = \frac{1}{n} + \frac{(x_i -\xbar)(x_j-\xbar)}{\sum(x_i - \xbar)^2}$$

So, if $x_j$ is far from $\xbar$ in the same direction as $x_i$,   
then $y_j$ contributes a lot to the estimate $\yhat_i$.

#### Leverage (h_ii)

We call $h_{ii}$ the **leverage** of the ith observation    
since $h_{ii}$ is the weight given to $y_i$ in predicting $\yhat_i$.

For simple linear regression (one predictor), 
$$h_{ii} = \frac{1}{n} + \frac{(x_i -\xbar)^2}{\sum(x_i - \xbar)^2}$$

So, if $x_i$ is far from $\xbar$,
then $y_i$ contributes a lot to the estimate $\yhat_i$.

#### Properties of the leverage: h_ii

  1. $(1/n) \le h_{ii} < 1$
  2. $\sum h_{ii} = (p + 1)$
  3. Large $h_{ii}$ indicates case $i$ is far from center of the $x$ space
  4. Large $h_{ii}$ indicates $y_i$ plays an important role in prediction: $\yhat_i$   
  5. Observations with very high leverage often have small residuals
  
The last item follows from the fact that 
$\displaystyle{h_{ii} + \frac{e_i^2}{\SSE} \le 1}$

If $h_{ii}$ is large (close to 1), then $y_i$ has a lot of "leverage" on $\yhat_i$,   
the residual $e_i$ is small,    
and the prediction $\yhat_i$ depends relatively less on the other observations ($y_j$'s)
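
Property 2 is easy to verify numerically for the fuel model (`lmfit`) fit earlier:

```{r}
# The hat values sum to the number of coefficients, p + 1
sum(hatvalues(lmfit))   # here p = 4 predictors, so this should be 5
```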

#### Detecting "high leverage" values numerically

Graphical methods of leverage detection are preferred, but a numerical screen is also useful.

From item 2 above, the average of the $h_{ii}$'s is $(p+1) / n$.

A rough screening rule for "high leverage" observations
is $h_{ii} \;>\; 2(p+1) / n \;\;=\;\; 2\;\times$ (average of $h_{ii}$'s)

#### Getting leverage (h_ii) values in R

Add the "hat values" ($h_{ii}$) to the fuel consumption dataset:

`fuelData <- mutate(fuelData, h = hatvalues(lmfit))`

```{r}
fuelData <- mutate(fuelData, h = hatvalues(lmfit))
```

Hat value for Wyoming:
```{r}
# Value of variables for Wyoming
fuelData %>%
  filter(state == "WY")
```

Here is a summary of the other states:
```{r}
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
```

The average of the hat values is exactly 

$$\frac{(p+1)}{n} \;=\; \frac{ (`r p` + 1)}{`r n`} \;=\; `r (p+1)/n`$$

Indeed, we observed that the average of the hat values (with Wyoming in the data) was 
$`r mean(~ h, data = fuelData)`$

The threshold for "high leverage" for the fuel data full model is

$$\frac{2 (p+1)}{n} \;=\; \frac{2 (`r p` + 1)}{`r n`} \;=\; `r 2*(p+1)/n`$$

Which observations have "high leverage"?    
Do they have large standardized residuals?

```{r}
filter(fuelData, h > 2*(p+1)/n )
```

#### Detecting "high leverage" values graphically

Leverage plots:

  + boxplot of $h_{ii}$ 
  + histogram of $h_{ii}$ 
  + $h_{ii}$ vs. index (the order in which the data were collected)
  + $h_{ii}$ vs. standardized residuals
  + $h_{ii}$ vs. externally studentized residuals (to be introduced momentarily)
  
```{r fig.height=3}
boxplot(fuelData$h, horizontal = TRUE, pch=20)
abline(v = 2 * (p+1) / n, lty=2)
mtext("hat values (leverage)", side=1, line=2)
  
gf_histogram(~ h, data=fuelData, color="white") %>%
  gf_vline(xintercept = 2 * (p+1) / n, linetype=2) %>%
  gf_labs(x = "hat values (leverage)")
```

```{r}
# gf_text(h ~ 1, data=fuelData, label = ~ state) %>%
#   gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
```

Observations with high leverage do not necessarily have large residuals
(TX, NY, NV, CO, IL)

Observations with large residuals do not necessarily have high leverage (WY)
```{r}
gf_text(h ~ r, data=fuelData, label = ~ state) %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype=2) %>%
  gf_vline(xintercept = c(-2,2), linetype=2) %>%
  gf_labs(x = "standardized residuals", y = "hat values (leverage)")
```

#### Why worry about leverage?

An observation with high leverage can have significant    
"influence" on the regression coefficients.

#### Why worry about high leverage if residual not large?

Anscombe data: Extreme leverage, residual = 0

Cases with high leverage do not necessarily have large residuals,   
... but can still be influential values.

```{r}
AnscombeDataResids <- AnscombeDataResids %>%
  mutate(h4 = hatvalues(lmfit4), r4 = rstandard(lmfit4))
AnscombeDataResids
# Note that the standardized residual is not defined for
# the outlier because the hat value is exactly = 1!!!
p9Anscombe <- gf_point(h4 ~ c(1:11), data=AnscombeDataResids) %>%
  gf_labs(x = "observation number", y = "hat values (leverage)") %>%
  gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
# Plotted raw residuals here since a missing value in standardized residuals
p10Anscombe <- gf_point(h4 ~ e4, data=AnscombeDataResids) %>%
  gf_labs(x = "raw residuals", y = "hat values (leverage)") %>%
  gf_hline(yintercept = 2 * (1+1) / 11, linetype=2) 
grid.arrange(p4Anscombe, p8Anscombe, p9Anscombe, p10Anscombe, ncol=2)
```

#### Why worry about large residuals with low leverage?

Anscombe data: Example of influence without leverage

Cases with large residuals do not necessarily have high leverage   
...but can still be influential values.

```{r}
AnscombeDataResids <- AnscombeDataResids %>%
  mutate(h3 = hatvalues(lmfit3), r3 = rstandard(lmfit3))
p7bAnscombe <- gf_point(r3 ~ yhat3, data=AnscombeDataResids) %>%
  gf_labs(x = "yhat (fitted values)", y = "std residuals (Y3 on X3)") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype=2)
p11Anscombe <- gf_point(h3 ~ c(1:11), data=AnscombeDataResids) %>%
  gf_labs(x = "observation number", y = "hat values") %>%
  gf_lims(y = c(0, 0.40)) %>%
  gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
p12Anscombe <- gf_point(h3 ~ r3, data=AnscombeDataResids) %>%
  gf_labs(x = "standardized residuals", y = "hat values") %>%
  gf_lims(y = c(0, 0.40)) %>%
  gf_hline(yintercept = 2 * (1+1) / 11, linetype=2)
grid.arrange(p3Anscombe, p7bAnscombe, p11Anscombe, p12Anscombe, ncol=2)
```

Regress Y3 on X3 using all data values
```{r}
tidy(lmfit3)
```

Regress Y3 on X3 after removing the outlier
```{r}
tidy(lm(Y3 ~ X3, data=AnscombeData[-3, ]))
```

Regress Y3 on X3 using all data values
```{r}
glance(lmfit3)[-c(7:10)]
```

Regress Y3 on X3 after removing the outlier
```{r}
glance(lm(Y3 ~ X3, data=AnscombeData[-3, ]))[-c(7:10)]
```

The outlier influences the results even though leverage is low.

#### Influence of outlier with low leverage in the fuel data

For the fuel data, Wyoming has a large standardized residual but low leverage.

What if this observation were removed?   
Does the regression fit change significantly?

```{r} 
fuelDataWY <- fuelData[fuelData$state != "WY", ]
lmfitWY <- lm(fuel ~ dlic + tax + inc + road, data=fuelDataWY) 
```
Regression coefficients using all observations:
```{r}
tidy(lmfit)
glance(lmfit)[-c(7:10)]
```

Regression coefficients using all observations ...except WY:
```{r}
tidy(lmfitWY)
glance(lmfitWY)[-c(7:10)]
```

Coefficients for significant predictors (`dlic`, `tax`, `inc`) change a little,   
but the p-values are similar either way (except for `tax`).

The estimate of $\sigma$ ($\widehat{\sigma}$) gets much smaller,   
as expected after removing an observation with a large residual.

So, the predictions may not change much (need to check),    
but $\widehat{\sigma}$ is sensitive to outliers.

#### What if high leverage and residual large?

An observation with high leverage and a large residual   
can have significant "influence" on the regression coefficients.

For the fuel consumption data, NV has a high leverage value   
...and a moderate standardized residual.

```{r}
# Value of variables for Nevada
fuelData %>%
  filter(state == "NV")
```

```{r}
gf_text(h ~ r, data=fuelData, label = ~ state) %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype=2) %>%
  gf_vline(xintercept = c(-2,2), linetype=2)
```


What if this observation were removed?   
Does the regression fit change significantly?

```{r} 
fuelDataNV <- fuelData[fuelData$state != "NV", ]
lmfitNV <- lm(fuel ~ dlic + tax + inc + road, data=fuelDataNV) 
```
Regression coefficients using all observations:
```{r}
tidy(lmfit)
glance(lmfit)[-c(7:10)]
```

Regression coefficients using all observations ...except NV:
```{r}
tidy(lmfitNV)
glance(lmfitNV)[-c(7:10)]
```

Coefficients for significant predictors (`dlic`, `tax`, `inc`) change a little    
and the p-value for `tax` changes a lot.

The estimate of $\sigma$ ($\widehat{\sigma}$) is essentially the same   
with or without NV in the data.

More importantly, we would want to check on the influence of NV on
any predictions we make.

#### Cook's distance

For observation $i$, Cook's distance measures the difference in the fit   
when all observations are used ($\yhat_j$) vs. all observations except $i$ ($\yhat_{j(i)}$)

$$C_i = \frac{\sum_{j=1}^n (\yhat_j - \yhat_{j(i)})^2}{\widehat{\sigma}^2 (p+1)}$$
This is a measure of the influence of observation $i$ on predictions.

A rough guide: Pay attention to observations with $C_i \ge 1$.

A much better guide: A dotplot or index plot of the $C_i$'s.   
Look for unusual observations.

Add Cook's distance to the dataset:

`fuelData <- mutate(fuelData, C = cooks.distance(lmfit))`

```{r fig.height=3}
fuelData <- mutate(fuelData, C = cooks.distance(lmfit))

pfuelPotential1 <- gf_point(C ~ c(1:n), data=fuelData) %>%
  gf_labs(x = "observation number", y = "Cook's distance") %>%
  gf_hline(yintercept = 1, linetype=2)
pfuelPotential2 <- gf_dotplot(~ C, data=fuelData) %>%
  gf_labs(x = "Cook's distance")
grid.arrange(pfuelPotential1, pfuelPotential2, ncol=2)
```

```{r}
gf_text(C ~ c(1:n), data=fuelData, label = ~state) %>%
  gf_labs(x = "observation number", y = "Cook's distance")
```

Why does Wyoming have a high Cook's distance when we know leverage is small?

Cook's distance can be rewritten as
$\quad \displaystyle{C_i = \left(\frac{r_i^2}{p+1}\right)  \left(\frac{h_{ii}}{1 - h_{ii}}\right)}$

Cook's distance combines the size of the residual with the amount of leverage.

$\displaystyle \frac{h_{ii}}{1 - h_{ii}}\;$ is called the "potential function"

So, the high residual for Wyoming once again alerts us    
that this observation may have influence on the regression fit.

But, we already ascertained that leaving out Wyoming may not change things much   
except for inflating the SSE and $\widehat{\sigma}$.
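
As a quick numerical check of the rewritten formula, a sketch using the columns already added to `fuelData` (`r`, `h`, `C`):

```{r}
# Cook's distance from its residual-times-potential decomposition
C.check <- with(fuelData, (r^2 / (p + 1)) * (h / (1 - h)))
all.equal(unname(fuelData$C), unname(C.check))
```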


### Externally studentized residuals

The textbook mentions another type of residual sometimes used.

The idea is this:
If $e_i$ is an outlier (large unstandardized residual), then

  + It contributes a large value to the SSE
  + It will increase the SSE 
  + It may cause an overestimate of $\sigma$
    
Thus, the standardized residual $r_i$ will be smaller (since $\widehat{\sigma}$ appears in the denominator).

$$r_i \;=\; \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}$$

So... we might not notice some standardized residuals
that should have been flagged as outliers.

A solution: For each observation $i$

  + Remove that observation from the data
  + Fit the linear model again to the $n-1$ observations
  + Calculate an estimate of $\sigma$
  + Call this estimate $\widehat{\sigma}_{(i)}$
  + Then, standardize (studentize) with this new estimate
  
The result is an "externally studentized" or, more simply "studentized" residual.

$$r_i^* = \frac{e_i}{\widehat{\sigma}_{(i)} \sqrt{1 - h_{ii}}}$$
When the raw residual $e_i$ is an outlier, 
we can expect that $|\; r_i^*\; | > |\; r_i\; |$

Studentized residuals follow a t-distribution with $(n-p-2)$ df    
... if the model assumptions are satisfied.

This distribution is a bit wider than a $N(0,1)$ distribution
since the tails of the distribution of the $r_i^*$ are longer.
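
A sketch verifying this definition for Wyoming, using `lmfitWY` (the fit without WY, from earlier) as the source of $\widehat{\sigma}_{(i)}$:

```{r}
# Externally studentized residual for WY "by hand" vs. rstudent()
i <- which(fuelData$state == "WY")
sigma.WY <- summary(lmfitWY)$sigma         # sigma estimated without WY
residuals(lmfit)[i] / (sigma.WY * sqrt(1 - hatvalues(lmfit)[i]))
rstudent(lmfit)[i]
```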

#### Standardized vs. studentized residuals

If $|\; r_i\; | > 1$, then $|\; r_i^*\; | > |\; r_i\; |$

Otherwise, $r_i^*$ is just a tad smaller than $r_i$.

So, as expected, for large standardized residuals,   
the studentized residual is even larger.

The goal is to reduce false negatives   
where we fail to notice outliers after standardization.


The textbook argues that there isn't that much difference    
between standardized and studentized residuals.   
They choose to use standardized residuals in their graphs.

**Suggestion:**     
Take a careful look at standardized residuals $|\; r_i\; | \ge 2$ (outliers).

#### How to get studentized residuals in R

Add the studentized residuals ($r_i^*$) to the fuel consumption dataset:

`fuelData <- mutate(fuelData, rstar = rstudent(lmfit))`

```{r}
fuelData <- mutate(fuelData, rstar = rstudent(lmfit))
```

Studentized residual for Wyoming:
```{r}
# Value of variables for Wyoming
fuelData %>%
  filter(state == "WY")
```

Here is a summary of the other states:
```{r}
# Summary of variables for the other states
df <- plyr::ldply(apply(fuelData[fuelData$state != "WY",-1], FUN=favstats, MARGIN=2))
df <- dplyr::rename(df, variable = .id)
kable(df[, -10])
```

The standardized residual for Wyoming is well over twice
the size of the second-largest magnitude standardized residual.

```{r fig.height=3}
pfuelStud <- gf_point(r ~ yhat, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_lims(y = c(-2.5, 5)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
pfuelExtStud <- gf_point(rstar ~ yhat, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_lims(y = c(-2.5, 5)) %>%
  gf_labs(x = "fitted values (yhat)", y = "studentized residuals")
grid.arrange(pfuelStud, pfuelExtStud, ncol=2)
```

### Plots to check linear regression model assumptions, step-by-step

#### Before  estimating the model parameters:

1.  Scatterplot of the predictor variables versus each other and versus the response variable using a command like `ggpairs`. 

What the plot should look like:
Points spread out in a rough football shape for all plots.

Problems you are looking for: 

  + nonlinearities between the predictors and the response variable    
  + obvious outliers  
  + non-constant variance in the response or predictors
  + extreme correlation between predictors

(You will look more closely for all of these problems using residuals after a fit, but it's a good idea to do a quick check first so that you know what you are working with.)
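
A sketch of such a plot for the fuel data, assuming the `GGally` package is installed (the notes mention `ggpairs` but do not show it):

```{r, message=FALSE}
# Scatterplot matrix of the response and the predictors (requires GGally)
GGally::ggpairs(fuelData, columns = c("fuel", "dlic", "tax", "inc", "road"))
```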

2. Added-Variable plots

Make an added-variable plot for each $x_j$ you expect to use in the model

"The stronger the linear relationship in the added-variable plot is,
the more important the additional contribution of $x_j$ to the regression
equation already containing the other predictors."

"If the scatterplot shows [a flat] slope,
the variable is unlikely to be useful in the model.

Use this as a rough guide for decisions 
about which variables to include in the model, ...or not.   
Perhaps examine this more later with F-tests. 
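
One hedged way to draw standard added-variable plots, assuming the `car` package is available (an alternative to the hand-built version later in these notes):

```{r}
# Added-variable (partial regression) plots for each predictor in lmfit (requires car)
car::avPlots(lmfit)
```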

3. Residual Plus Component plot

Make a residual plus component plot for each 
$x_j$ you expect to use in the model.

This plot indicates if any nonlinearity is present in the 
relationship between $Y$ and $x_j$ after adjusting for the
other predictors in the model.
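
A sketch, again assuming the `car` package, which calls these component-plus-residual (partial residual) plots:

```{r}
# Component + residual plots for each predictor in lmfit (requires car)
car::crPlots(lmfit)
```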

#### After estimating the model parameters:

4. Histogram of the standardized residuals

What the plot should look like:   
Symmetric and mound-shaped (like a Normal distribution).

5.  Normal quantile plot of the standardized residuals, plus histogram and/or box or dot plots if something is off.

What the plot should look like:   
Points should lie mostly along the diagonal reference
line, with perhaps a few off the line at either end. 

Problems you are looking for:

Non-normality of residuals, seen in a Normal probability plot that doesn't follow the diagonal. 
  
 Is the distribution of residuals skewed?   
 Bimodal distribution?    
 Are there large outliers?


6. Scatterplot of residuals vs. fitted values.

What the plot should look like:    
Points should be spread in a horizontal band of roughly even width, without noticeable outliers. There should be no apparent pattern to the points.

Problems you are looking for: 

  + non-linearity, as seen if the band of points bends.
  + non-constant variance (heteroscedasticity), if the band widens to one side or the other (or has a bow-tie shape)
  + other visible patterns
  + large outliers 
  
  

```{r}
# R `plot(fit)` creates this plot (among several others) with the square root of abs(standardized residuals), rather than the standardized residuals themselves. This is to make it easier to visualize non-constant variance.
```

7. Scatterplots of standardized residuals vs. each predictor variable

Analysis is the same as for (6) above. In some cases, problems can appear here which weren't visible with the residuals vs. fitted values plot (or vice versa).

8. For time-series data: scatterplot of residuals vs time ("index")

Problems you are looking for:

  + zig-zags
  + cyclical patterns
  
This would be evidence of non-independence of errors.

Can make this plot even for data that are not obviously time-series data, in case something systematically changes for observations later in the dataset.

9. Scatterplot of standardized residuals vs. leverage values 

Problems you are looking for: 

  + leverage outliers 
  + leverage values $> 2(p+1)/n$
  
10. Plot of Cook's distance vs. observation number

Problems you are looking for: 

  + Cook's distance outliers 
  + Cook's distance values $> 1$
  




<br><br><br><br><br><br><br><br><br><br><br><br><br><br>
 
 <br><br><br><br><br><br><br><br><br><br><br><br><br><br>

 
 
 
 
 
 
 
 
 
 
 
 
 
#### Some of the information another way: plots to check each assumption

**1a:** Linearity

Scatterplot: standardized residuals vs. fitted values   
...and more scatterplots: standardized residuals vs. each predictor

Look for random scatter across all values of $\yhat$ and each $x_j$

**2a:** Errors are normally distributed

Histogram, boxplot, dotplot, normal quantile plot of standardized residuals

Look for symmetry, mound-shape, not too skewed.

**2c:** Errors have constant variance $\sigma^2$ over all values of $x$'s

Scatterplots:    
standardized residuals vs. fitted values    
standardized residuals vs. each predictor variable

Look for similar spread in residuals across all values of $\yhat$ or $x_j$.
    
**2d and 4a:** Independence of errors

Mostly, this depends on how the data were collected.

But, can plot residuals vs. observation number  
looking for a pattern over time (autocorrelation, Chapter 8)

**4b:** Equally reliable and informative observations

This depends on how the data were collected.

But, we will look at residuals for outliers and "influential points"   
that may influence the results much more than other observations

#### Checking linearity using standardized residuals

Look for random scatter across all values of $\yhat$ or $x_j$

```{r}
gf_point(r ~ yhat, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
```

Now, plot standardized residuals vs. each predictor in the model
```{r}
pfuel.dlic <- gf_point(r ~ dlic, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_labs(x = "dlic", y = "standardized residuals")
pfuel.tax <- gf_point(r ~ tax, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_labs(x = "tax", y = "standardized residuals")
pfuel.inc <- gf_point(r ~ inc, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_labs(x = "inc", y = "standardized residuals")
pfuel.road <- gf_point(r ~ road, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_labs(x = "road", y = "standardized residuals")
grid.arrange(pfuel.dlic, pfuel.tax, pfuel.inc, pfuel.road, ncol=2)
```
  
#### Checking normality of errors using standardized residuals

Look for symmetry, mound-shape, not too skewed.

```{r, fig.height=3}
boxplot(fuelData$r, horizontal = TRUE, pch=20)
abline(v = 2, lty=2)
mtext("standardized residuals", side=1, line=2)

  
p1 <- gf_histogram(~ r, data=fuelData, color="white", bins=15) %>%
  gf_labs(x = "standardized residuals")
p2 <- gf_dotplot(~ r, data=fuelData) %>%
  gf_labs(x = "standardized residuals")
grid.arrange(p1, p2, ncol=2)
```

```{r, fig.height=4, fig.width=4}
gf_qq(~ r, data=fuelData, distribution=qnorm) %>%
  gf_qqline(distribution=qnorm) %>%
  gf_labs(x = "Standard Normal Quantiles", y = "Standardized Residuals")

```
  
#### Checking constant variance of errors using standardized residuals

Look for similar spread in residuals across all values of $\yhat$.

```{r}
gf_point(r ~ yhat, data=fuelData) %>%
  gf_hline(yintercept = c(-2,0,2), linetype = 2) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals")
```

#### Checking for independence of errors

First, find out what you can about how the data were collected.

But, can plot residuals vs. observation number  
looking for a pattern over time (autocorrelation, Chapter 8)

This plot is not likely relevant for the fuel consumption data,
which are in alphabetical order by state.

```{r}
n <- dim(fuelData)[1]

gf_point(r ~ c(1:n), data=fuelData) %>%
  gf_labs(x = "observation number", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2,0,2), linetype=2)
```

**4b:** Equally reliable and informative observations

Again, first find out what you can about how the data were collected.

But, we should look at residuals for outliers and "influential points"   
that may influence the results much more than other observations

Leverage plots:

  + leverage vs. observation number
  + leverage vs. standardized residuals
  + ...and others mentioned, but not demonstrated in the book
  
```{r fig.height=3}
n <- dim(fuelData)[1]
p <- 4

pfuelHat1 <- gf_point(h ~ c(1:n), data=fuelData) %>%
  gf_labs(x = "observation number", y = "hat values") %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
pfuelHat2 <- gf_point(h ~ r, data=fuelData) %>%
  gf_labs(x = "standardized residuals", y = "hat values") %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
grid.arrange(pfuelHat1, pfuelHat2, ncol=2)

pfuelHat3 <- gf_text(h ~ c(1:n), data=fuelData, label = ~state) %>%
  gf_labs(x = "observation number", y = "hat values") %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
pfuelHat4 <- gf_text(h ~ r, data=fuelData, label = ~state) %>%
  gf_labs(x = "standardized residuals", y = "hat values") %>%
  gf_hline(yintercept = 2 * (p+1) / n, linetype=2)
grid.arrange(pfuelHat3, pfuelHat4, ncol=2)
```

Cook's Distance plots:

A dotplot or index plot of the $C_i$'s.   

Look for unusual observations.

Add Cook's distance to the dataset:

`fuelData <- mutate(fuelData, C = cooks.distance(lmfit))`

```{r fig.height=3}
fuelData <- mutate(fuelData, C = cooks.distance(lmfit))

pfuelPotential1 <- gf_point(C ~ c(1:n), data=fuelData) %>%
  gf_labs(x = "observation number", y = "Cook's distance") %>%
  gf_hline(yintercept = 1, linetype=2)
pfuelPotential2 <- gf_dotplot(~ C, data=fuelData) %>%
  gf_labs(x = "Cook's distance")
grid.arrange(pfuelPotential1, pfuelPotential2, ncol=2)
```

```{r}
gf_text(C ~ c(1:n), data=fuelData, label = ~state) %>%
  gf_labs(x = "observation number", y = "Cook's distance")
```

### What to do with outliers?

  + Keep
  + Drop
  + Adjust

**Keep**

If we have a "story" that explains why the outlier exists   
and is not a data collection/recording error.

Odd points may strongly influence model fit

  +  regression surface is "tilted" to accommodate outlier(s)
  +  $\widehat{\sigma}$ is inflated
  +  Extra noisiness in the data can obscure important predictors
  +  Large outlier can "mask" other outliers from being detected
    
Identifying outliers might be a goal in its own right.

Outliers may be left in but may require explanation in the analysis.

**Drop**

  + Results then relevant to smaller data range
  + But, may lead to bias in conclusions
  + Outliers may represent data errors
  
Usually we choose to keep the outlier    
unless there are very sound reasons for dropping observations

**Adjust**

If outliers are also influential and you want to keep them,   
then you might want to adjust for them.

  + Create an "indicator variable" for the outlier (see below)
  + or transform the data (Chapter 6)
  + or look for additional variables not already included in the model
    + use added variable plots (see below)

### Added variable plot

Sometimes outliers may occur because important information   
is contained in another variable not currently in the model.

Besides t-tests and such, an "added variable plot"   
("partial regression plot") can illuminate whether another predictor will help.

Suppose we left the important predictor `dlic` out of the fuel consumption model.

```{r}
lmfit.no.dlic <- lm(fuel ~ tax + inc + road, data=fuelData)
p1.no.dlic <- gf_point(rstandard(lmfit.no.dlic) ~ fitted(lmfit.no.dlic)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2,0,2), linetype=2)
p2.no.dlic <- gf_text(rstandard(lmfit.no.dlic) ~ fitted(lmfit.no.dlic), data=fuelData, label = ~state) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2,0,2), linetype=2)
grid.arrange(p1.no.dlic, p2.no.dlic, ncol=2)
```

Without `dlic` in the model, we have two outliers instead of one.

The added variable plot is a plot of the residuals from a model **without**  
the predictor of interest vs. the predictor of interest.

In this case, we plot the residuals from fuel = f(tax, inc, road)   
against the predictor `dlic`

```{r}
gf_point(rstandard(lmfit.no.dlic) ~ dlic, data=fuelData) %>%
  gf_labs(y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2,0,2), linetype=2) %>%
  gf_coefline(model = lm(rstandard(lmfit.no.dlic) ~ dlic, data=fuelData))
```

The residuals have a (linear) pattern.   
This indicates that `dlic` should be added to the linear model.

We know that with `dlic` in the model, we only have one outlier remaining.

### Adjusting for an outlier

Create an indicator variable (0's and 1's) 

  + value = 1 (TRUE) if the observation is Wyoming
  + value = 0 (FALSE) for all other observations
  
`fuelData <- mutate(fuelData, WY = 1 * (state == "WY") )`

```{r}
fuelData <- mutate(fuelData, WY = 1 * (state == "WY") )
tail(fuelData)
```



Then, fit the model with this variable added as a predictor.

```{r}
lmfitWYindicator <- lm(fuel ~ dlic + tax + inc + road + WY, data=fuelData)
```

The original model (without the indicator variable for WY)
```{r}
tidy(lmfit)
glance(lmfit)[-c(7:10)]
```

The new model (includes the indicator variable for WY)
```{r}
tidy(lmfitWYindicator)
glance(lmfitWYindicator)[-c(7:10)]
```

The new model incorporates a separate intercept for WY.

This ensures that SSE and $\widehat{\sigma}$ are not inflated by the WY outlier.

Do we still have outliers?

```{r}
gf_point(rstandard(lmfitWYindicator) ~ fitted(lmfitWYindicator)) %>%
  gf_labs(x = "fitted values (yhat)", y = "standardized residuals") %>%
  gf_hline(yintercept = c(-2,0,2), linetype=2) %>%
  gf_lims(x = c(325, 750))
```

You can see that we now have 3 outliers!

This is pretty common with outliers.   
You drop or adjust and new outliers crop up.

So, use this adjustment method with caution.