---
title: "STAT 224 Chapter 3"
params:
  dotabs: no
output:
  html_document:
    code_folding: hide
    css: styles.css
    df_print: kable
    number_sections: yes
    self_contained: yes
    theme: cerulean
---

```{r test-main, child = 'preamble.Rmd'}
```

# Chapter 3 `r if(params$dotabs) paste('{.tabset}')`

## Lecture #5

### Model

#### Multiple linear regression model

If there are other variables that may help us predict $Y$, we could extend our simple model to include them.

$$Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p$$

...and then, allow for error (statistical model)

$$Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon$$

The following are also multiple linear regression models

$$
\begin{aligned}
Y &\;=\; \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \\
\\
Y &\;=\; \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \epsilon
\end{aligned}
$$

$Y$ is approximately a linear function of the $x$'s, and $\epsilon$ measures the discrepancy of the approximation.

As before, $\epsilon$ contains no systematic information about $Y$ that is not already captured by the $x$'s.

The $\beta$'s are called partial regression coefficients, or just regression coefficients.

#### Reminder: Model specifications

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$ not already captured by the $x$'s
    a. $E(\epsilon_i)=0$ for all units $i$ (for all values of all predictors $x_j$)
    b. The errors are not correlated with the predictor values
    c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all units $i$ (for all predictors $x_j$)
3. The mean of $Y$ depends linearly on $x_1, x_2, \ldots, x_p$
    a. $E(Y\; |\; X_1=x_1, X_2=x_2, \ldots, X_p=x_p) \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

### Data

#### Data format

We will endeavor to provide you with **tidy** data:

+ Each row corresponds to one observational unit (case)
+ Each variable is one column

| observation | response $y$ | predictor $x_1$ | $x_2$ | ... | $x_p$ |
|:-----------:|:------------:|:---------------:|:-----:|:--------:|:-----:|
| 1 | $y_1$ | $x_{11}$ | $x_{12}$ | ... | $x_{1p}$ |
| 2 | $y_2$ | $x_{21}$ | $x_{22}$ | ... | $x_{2p}$ |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| $n$ | $y_n$ | $x_{n1}$ | $x_{n2}$ | ... | $x_{np}$ |

$x_{ij}$ = observation for unit $i$ on variable $x_j$

### Example: Heights of parents and their sons

First, I cleaned up the data. Click the `Code` button to see my R code.

```{r data-reduction-one-son-per-family, results='hide'}
### For these data, I added some categorical variables (for plotting)
### and reduced the data to one (random) son per family.
### It is not expected that you would need to do code like in this "chunk" ### Add categorical variables for heights of moms and dads cuts <- c(61, 66, 68, 70, 72, 74, 80) midpoints <- c(65, 67, 69, 71, 73, 75) heightDataAll <- Galton %>% mutate(father2grp = factor(father > median(father), labels = c("Father Short", "Father Tall"))) %>% mutate(mother2grp = factor(mother > median(mother), labels = c("Mother Short", "Mother Tall"))) %>% mutate(father4grp = factor(cut(father, 4), labels = c("Father Short", "Father Average", "Father Tall", "Father Very Tall"))) %>% mutate(mother4grp = factor(cut(mother, 4), labels = c("Mother Short", "Mother Average", "Mother Tall", "Mother Very Tall"))) %>% mutate(fathergrp = factor(cut(father, cuts), labels = midpoints)) ### Reduce the data to one (random) son per family ### If you set the "seed" for the random number generator first, ### then you get the same "random" results each time you compile the doc. ### This keeps you sane. ### The identical dataset is generated every time I compile this document. set.seed(12345) heightDataSonsGroups <- heightDataAll %>% filter(sex == "M") %>% group_by(family) %>% sample_n(1) %>% ungroup() %>% dplyr::select(-family, -sex) heightDataSons <- heightDataSonsGroups %>% dplyr::select(height, father, mother, nkids) ### Now each family (row) is one "case" ### and each case has 3 observed variables: heights of the father, mother, and one son, plus number of children in the family (nkids) ### ...plus some categorical values I created for heights of fathers and mothers n <- dim(heightDataSons)[1] ``` #### How are sons' heights related to their father's height? The $`r n`$ points are shrunk and "jittered" so you can see more of them. ```{r, fig.height=3} p1 <- gf_point(height ~ father, data=heightDataSons) %>% gf_labs(x="Father's Height (inches)", y="Son's Height (inches)") ### If you set the "seed" for the random number generator first, ### then you get the same "random" results each time you compile the doc. ### This keeps you sane when you want to later comment on a plot. ### The plot doesn't change in appearance each time you compile! set.seed(12345) p2 <- gf_jitter(height ~ father, data=heightDataSons, width=0.75, height=0.75, cex=0.75) %>% gf_labs(x="Father's Height (inches)", y="Son's Height (inches)") grid.arrange(p1, p2, ncol=2) ``` #### Linear relationship between average son's height and father's height? ```{r} ### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg ggplot(heightDataSonsGroups, aes(x=fathergrp, y=height)) + geom_boxplot() + labs(x="Father's Height (inches)", y="Son's Height (inches)") + stat_summary(fun.y=mean, geom="line", aes(group=1)) + stat_summary(fun.y=mean, geom="point", size=3, color="red") ``` ```{r} ### Fit a LS regression line heightFit <- lm(height ~ father, data=heightDataSons) ### Add the fitted values and residuals to the data heightDataFitSons <- mutate(heightDataSons, yhat = fitted(heightFit), resid = residuals(heightFit) ) ``` A look at the residuals: ```{r} ### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg ggplot(heightDataFitSons, aes(x = father, y = height)) + geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + ### Plot regression slope geom_segment(aes(xend = father, yend = yhat), alpha = .2) + ### alpha to fade lines geom_point() + geom_point(aes(y = yhat), shape = 1) + theme_bw() ### Add theme for cleaner look ``` ... 
and a residual plot:

```{r}
gf_point(resid ~ father, data=heightDataFitSons) %>%
  gf_hline(yintercept = 0)
```

#### Reading the output

Output from the `tidy` function in the `broom` package.

```{r}
tidy(heightFit)
```

What is the value of the least squares regression intercept?
What is the meaning of the intercept for these data?

What is the value of the least squares regression slope?
What is the meaning of the slope for these data?

Output from the `glance` function in the `broom` package.

```{r}
glance(heightFit)
```

In this course, we will not likely discuss four of the statistics that `glance` provides: `logLik`, `AIC`, `BIC`, and `deviance`. So, I will restrict the output for lectures:

```{r}
glance(heightFit)[-c(7:10)]
```

Click the `Code` button to see a summary of the regression output using base R.

```{r}
heightFit <- lm(height ~ father, data=heightDataSons)
### Here is the base R summary of a linear model result
summary(heightFit)
```

What is the value of $R^2$ and what does it mean?

What hypothesis is tested using this p-value?
Hint: Compare the t-statistics above to the F-statistic.

```{r eval=FALSE}
### Plot not displayed.
### Just playing around with the full dataset (out of curiosity)
gf_point(height ~ father, color = ~ sex, size = ~ nkids, data = heightDataAll)
```

```{r practice-boxplots, eval=FALSE}
### Plots not displayed.
### I was just playing around with various ggplot2 graphics functions.
### Mostly drawing means (and lines between means) with boxplots
### Who knows, someday you might want to show something similar at your summer internship!
### (The grouping variables father2grp and father4grp live in heightDataSonsGroups,
###  not in heightDataSons, so that dataset is used below.)
x <- factor(rep(1:10, 100))
y <- rnorm(1000)
df <- data.frame(x=x, y=y)
ggplot(df, aes(x=x, y=y)) + geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point")
qplot(father2grp, height, data=heightDataSonsGroups, geom="boxplot",
      xlab="Father's Height", ylab="Son's Height") +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3)
gf_boxplot(height ~ father2grp, data=heightDataSonsGroups) +
  labs(x="Father's Height", y="Son's Height") +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3)
ggplot(heightDataSonsGroups, aes(x=father2grp, y=height)) + geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3, color="red")
ggplot(heightDataSonsGroups, aes(x=father4grp, y=height)) + geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3, color="red")
gf_histogram(~ height | father2grp, bins=12, data=heightDataSonsGroups, color="white") +
  labs(x="Son's Height", y="Number of Sons")
ggplot(heightDataSonsGroups, aes(x = height)) +
  geom_histogram(bins=12, color = "white") +
  facet_wrap(~ father2grp, ncol=1)
gf_boxplot(height ~ father4grp, data=heightDataSonsGroups) +
  labs(x="Father's Height", y="Son's Height")
gf_violin(height ~ father4grp, draw_quantiles=c(0.50), data=heightDataSonsGroups) +
  labs(x="Father's Height", y="Son's Height")
```

#### What about the other predictors in the data?

Could there be other information in the data that can help predict a son's height from family characteristics?

```{r}
### Look at the first 10 observations in the data
head(heightDataSons, 10)
```

Start with a scatterplot matrix.
Using base R:

```{r}
### Scatterplot matrix in base R
pairs(heightDataSons, pch=20)
### Table of correlations between all variables in the data
cor(heightDataSons)
```

Using the `GGally` package:

```{r}
### Scatterplot matrix using GGally package
ggpairs(heightDataSons)
```

...and again getting fancier with the `GGally` package:
This adds simple linear regression lines and confidence bands to each plot.

```{r}
### I stumbled across this while Googling around about ggpairs function. Nifty.
ggpairs(heightDataSons, lower = list(continuous = wrap("smooth", method = "lm")))
```

Fathers' heights are more strongly correlated with sons' heights than mothers' heights are.

The number of children in a family doesn't seem to be related to anyone's height ...not surprising?

Fathers and mothers have fairly low height correlation ...a bit surprising?

Then, perhaps the mothers bring "significant" new information about the sons' heights that is not already captured by the fathers' heights.

#### Fit the multiple linear regression model

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$

where

$x_1$ = father's height

$x_2$ = mother's height

$x_3$ = number of children in the family

```{r}
### Fit a multiple linear regression model with least squares
heightFit.all <- lm(height ~ father + mother + nkids, data=heightDataSons)
### Look at the results using functions in the broom package
tidy(heightFit.all)
glance(heightFit.all)[-c(7:10)]
```

Surprisingly, the number of children in the family is a moderately significant predictor. For now, I will leave it out of the model.

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

where

$x_1$ = father's height

$x_2$ = mother's height

```{r}
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
```

### Estimation

#### Least squares estimates of coefficients

As we did with one predictor, we estimate the regression coefficients by "least squares":
minimize the sum of squared residuals.

The estimated error (estimate of $\epsilon_i$) is called a **residual** ($e_i$)

$$\begin{aligned}
{\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\
&\;=\; {\rm observed - expected} \\
&\;=\; {\rm observed - predicted} \\
\\
&\;=\; y_i - \yhat_i\\
\\
&\;=\; y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2} + \cdots + \bp x_{ip})
\end{aligned}$$

$$SSE \;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{\beta_0} - \widehat{\beta_1} x_{i1} - \widehat{\beta_2} x_{i2} - \cdots - \bp x_{ip})^2$$

We can minimize this function with respect to $\widehat{\beta_0}, \widehat{\beta_1}, \widehat{\beta_2}, \ldots, \bp$ to find the least squares estimates of the model coefficients.

The formulas get a bit crazy for $p > 1$ predictors. After taking partial derivatives wrt each $\widehat\beta_j$ we need to solve $(p+1)$ linear equations for $(p+1)$ unknowns. This is all best represented in matrix form (...linear algebra).

At least the intercept has a simple formula.
With one predictor: $Y \;=\; \beta_0 + \beta_1 x + \epsilon$, recall that $\widehat{\beta_0} \;=\; \ybar - \widehat{\beta_1} \xbar$.
For multiple linear regression,

$$\widehat{\beta_0} \;=\; \ybar - \widehat{\beta_1} \xbar_1 - \widehat{\beta_2} \xbar_2 - \cdots - \bp \xbar_p \;=\; \ybar - \sum_{j=1}^p \widehat\beta_j \xbar_j$$

#### Estimate of the standard deviation

As with all least squares methods, we estimate $\var(\epsilon)$ as

$$\widehat\sigma^2 \;=\; \frac{\SSE}{\df} \;=\; \frac{\sum_i (y_i - \yhat_i)^2}{\df}$$

degrees of freedom (df)
= sample size $-$ number of parameters estimated for the mean ($\beta_0, \beta_1, \beta_2, \ldots, \beta_p$)
= $n-(p+1) \;=\; (n - p - 1)$

$$\widehat\sigma^2 \;=\; \frac{\sum_i (y_i - \yhat_i)^2}{n - p - 1}$$

#### Finding the least squares estimates using R

`lm` = "linear model" function

`lm(y ~ x1 + x2 + x3 + x4, data = mydata)`

### Interpreting (partial) regression coefficients

```{r}
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
### Grab the coefficients (column 2) for 1st two predictors (rows 2 & 3)
### (The intercept information is in row 1)
b.both.fathers <- heightFit.both.coef$estimate[2]
b.both.mothers <- heightFit.both.coef$estimate[3]
```

**One interpretation:**

Among mothers of a particular height, if they each had a child with a man 1 inch taller instead, we expect their sons would have been, on average, `r b.both.fathers` inches taller.

Among fathers of a particular height, if they each had a child with a woman 1 inch taller instead, we expect their sons would have been, on average, `r b.both.mothers` inches taller.


**Another interpretation:** Fathers 1 inch taller than others tend to have sons about `r b.both.fathers` inches taller, on average (after adjusting for the height of the mother). Mothers 1 inch taller than others tend to have sons about `r b.both.mothers` inches taller, on average (after adjusting for the height of the father).


**Yet another interpretation (much like the first):** Holding the height of the mother constant, fathers 1 inch taller than others tend to have sons about `r b.both.fathers` inches taller, on average. Holding the height of the father constant, mothers 1 inch taller than others tend to have sons about `r b.both.mothers` inches taller, on average. ### Simple vs. multiple linear regression coefficients What if we only use father's height as a predictor? (Simple linear regression) ```{r} ### Estimate a simple linear regression model heightFit.fathers <- lm(height ~ father, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.fathers.coef <- tidy(heightFit.fathers) heightFit.fathers.coef ``` ```{r} ### Estimate a simple linear regression model heightFit.fathers <- lm(height ~ father, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.fathers.coef <- tidy(heightFit.fathers) ### Grab the coefficient (column 2) for the predictor (row 2) ### (The intercept information is in row 1) b.fathers <- heightFit.fathers.coef$estimate[2] ``` Why did the coefficient for father's height change from `r b.both.fathers` to `r b.fathers`? For the simple linear regression, the interpretation is different. Men 1 inch taller than others tend to have sons about `r b.fathers` inches taller, on average (without taking into account the height of the mother). #### A closer look at $\beta_1$ (the slope) $\beta_1$ is the slope of the simple linear regression function. Slope is the change in average $Y$ associated with a one-unit increase of $x$. \[ \E(Y^{old})={\beta_0}+\beta_1x^{old} \quad \textrm{ and} \] \[ \E(Y^{new})=\beta_0+\beta_1(x^{old}+1), \] then \[ \E(Y^{new})-\E(Y^{old}) \;=\; \beta_0+\beta_1(x^{old}+1)-(\beta_0+\beta_1x^{old}) \;=\; \beta_1 \] #### A closer look at $\beta_1$ (the partial coeff for $x_1$) Consider a multiple regression with two predictors: \[ \E(Y^{old}) \;=\; {\beta_0}+\beta_1x_1^{old}+\beta_2x_2 \quad \textrm{ and} \] \[ \E(Y^{new}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2, \] then \[ \E(Y^{new})-\E(Y^{old}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2-(\beta_0+\beta_1x_1^{old}+\beta_2x_2) \;=\; \beta_1. \] ...and this math only works if $x_2$ does not change when $x_1$ increases by one unit. **Interpretation:** $\beta_1$ is the expected change in $Y$ when $x_1$ increase by 1-unit, with all other predictors ($x_2$ in this example) *unchanged* (but present in the model). **Another interpretation:** The expected change in $Y$ when $x_1$ increases by 1 unit, *adjusting for* all other predictors. ### Mathematical understanding of partial coefficients **NOTE:** The following discussion is just for a pedagogical purpose -- another way to view things. This is **NOT** how we calculate coefficients for multiple regression. For a model with two predictors $$\yhat_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}$$ $\beta_1$ represents the effect of $x_1$ on the mean of $Y$ after taking out the (linear) effects of $x_2$ from both $Y$ and $x_1$ #### A 3-step process for finding $\widehat{\beta_1}$. 1. Regress $Y$ on $x_2$ 2. Regress $x_1$ on $x_2$ 3. Regress the residuals from 1 onto the residuals from 2 The estimated slope for Model #3 = $\widehat{\beta_1}$ for the multiple regression. What? This is **NOT obvious.** Let's investigate. 
##### Estimate Model #1: Regress $Y$ on $x_2$ $$y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}$$ This model estimate "grabs" all the (linear) information about $Y$ that $x_2$ has to offer. The residuals leftover ($e_{i; y, x_2}$) are not correlated with $x_2$. ##### Estimate Model #2: Regress $x_1$ on $x_2$ $$x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}$$ This model estimate grabs all the (linear) information about $x_1$ that $x_2$ has to offer. The residuals leftover ($e_{i; x_1, x_2}$) are not correlated with $x_2$. ##### Estimate Model #3: Regress residuals: Model #1 on Model #2 $$e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}$$ How do we know the intercept is exactly zero? Think on it. This model grabs all the linear information about $y$ that $x_1$ has to offer **after** we already removed all linear information $x_2$ contained about either $y$ or $x_1$. Assuming this works, then $\widehat{\alpha_1}$ in the estimate of Model #3 should be exactly $\widehat{\beta_1}$ in the full, two-predictor model estimate: $$y_i \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}$$ So, $\beta_1$ represents the effect of $x_1$ on the mean of $Y$ after taking out the (linear) effects of $x_2$ from both $Y$ and $x_1$ #### Confirming the 3-step process for finding $\widehat{\beta_1}$ Recall the Model #2 estimate above: $x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}$ Substituting the Model #2 estimate into the full model estimate, we have $$ \begin{aligned} y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\\ \\ y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} (\widehat{\gamma_0} + \widehat{\gamma_1} x_{i2} + e_{i; x_1, x_2}) + \widehat{\beta_2} x_{i2}\\ \\ &= (\widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0}) + (\widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2}) x_{i2} + \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} $$ Then, by comparing the above with the Model #1 estimate: $y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}$ we see that $$ \begin{aligned} \widehat{\delta_0} & \;=\; \widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0} \\ \widehat{\delta_1} & \;=\; \widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2} \\ e_{i; y, x_2} & \;=\; \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} $$ The last equation shows $\,\widehat{\beta_1}$ is exactly the coefficient in the Model #3 estimate: $e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}$ #### Example: Seeing the 3-step process for finding $\widehat{\beta_1}$ For the height data $x_1$ = father's height $x_2$ = mother's height $y$ = son's height ##### Model #1 estimate is $$y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}$$ Regress sons' heights ($Y$) on mothers' heights ($x_2$): ```{r} ### Estimate a simple linear regression model heightFit.mothers <- lm(height ~ mother, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.mothers.coef <- tidy(heightFit.mothers) heightFit.mothers.coef ### Save the residuals y.on.x2.resids <- residuals(heightFit.mothers) ``` ##### Model #2 estimate is $$x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}$$ Regress fathers' heights ($x_1$) on mothers' heights ($x_2$): ```{r} ### Estimate a simple linear regression model heightFit.fathersOnMothers <- lm(father ~ mother, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.fathersOnMothers.coef <- 
tidy(heightFit.fathersOnMothers)
heightFit.fathersOnMothers.coef
### Save the residuals
x1.on.x2.resids <- residuals(heightFit.fathersOnMothers)
```

Some of the residuals:

```{r}
head(mutate(heightDataSons, y.on.x2.resids, x1.on.x2.resids), 10)
```

##### Model #3 estimate is

the regression of the residuals, used to find $\widehat{\alpha_1}$:

$$e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}$$

```{r}
### Estimate a simple linear regression model
heightFit.residsModel <- lm(y.on.x2.resids ~ x1.on.x2.resids)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.residsModel)
```

...and finding $\widehat{\beta_1}$ directly from the full model (both father and mother heights as predictors)

```{r}
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
### for the multiple linear regression on both x1 and x2
tidy(lm(height ~ father + mother, data=heightDataSons))
```

**It works!**

#### Reminder: The 3-step process is just for pedagogical illustration

We don't actually use the 3-step process to find $\widehat{\beta_1}$ in practice. We just compute the two-predictor model directly in R.

### Geometric interpretation of regression coefficients

#### One predictor (regression line)

+ The fitted values ($\Yhat$'s) will all lie on the regression line
+ $\beta_0$ is the intercept of that line
+ $\beta_1$ is the slope of that line
+ $Y$ values may be above, below, or on the line within the ($x,Y$) plane of the data

#### Two predictors (regression plane)

+ The fitted values ($\Yhat$'s) will all lie on the regression plane
+ $\beta_0$ is the $Y$ value when both $x$ variables $= 0$
+ $\beta_1$ is the slope of the plane along the $x_1$ direction
+ $\beta_2$ is the slope of the same plane along the $x_2$ direction
+ $Y$ values may be above, below, or on the plane within the full space of the data

Recall: $\beta_1$ is the effect of $x_1$ on expected $Y$ but only after we have "adjusted for" $x_2$ in the model.

A graph may give a sense of what is meant by "adjusted for":
[http://statistics.uchicago.edu/~burbank/s224/graphs/regress_plane2.pdf](http://statistics.uchicago.edu/~burbank/s224/graphs/regress_plane2.pdf)

The plane is not parallel to either $x$ axis, but changes in both directions (if we have two meaningful predictors of $Y$). Given any $x_2$, the incremental change due to $x_1$ is the same, but the resulting predicted $Y$ is not the same ...due to the contribution of $x_2$.

#### Multiple predictors (regression surface)

The fitted values ($\Yhat$'s) will all lie on the same regression surface.

The $Y$ values may be above, below, or on the surface.

## Lecture #6

### R^2 (From Chapter 2 but I skipped past this part when giving the lecture)

#### Coefficient of determination: $R^2$ ("R squared")

One measure of the "goodness of fit" of the regression model to the data:

$R^2$ = proportion of variability in the response ($Y$) explained by regression on the predictor ($x$)

$$R^2 \;=\; \frac{\SSR}{\SST} \;=\; 1 - \frac{\SSE}{\SST} \;=\; 1 - \frac{\sum (y_i - \yhat_i)^2}{\sum (y_i - \ybar)^2}$$

If the predictors don't include much information about the response, then $\sum (y_i - \yhat_i)^2$ is close to $\sum (y_i - \ybar)^2$ since $\ybar$ is the least squares estimate for the mean without predictors. In that case, $R^2$ is near $1 - 1 = 0$.

On the other hand, if the predictors have so much information that the residuals are very, very small, then $R^2$ is near $1 - 0 = 1$.
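As a quick check of this formula, here is a minimal sketch that computes $R^2$ directly from the sums of squares, reusing the `heightFit` fit and the `heightDataFitSons` data frame (with its `yhat` column) created earlier, and compares it with the value `glance()` reports:

```{r}
### Sketch: compute R-squared "by hand" for the regression of
### sons' heights on fathers' heights (heightFit, fit earlier)
y    <- heightDataFitSons$height
yhat <- heightDataFitSons$yhat
SSE  <- sum((y - yhat)^2)       ### error sum of squares
SST  <- sum((y - mean(y))^2)    ### total sum of squares
c(byHand = 1 - SSE/SST, fromGlance = glance(heightFit)$r.squared)
```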
```{r}
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
summary(repairFit)
```

#### Why do we call this number $R^2$?

When there is only one predictor ($x$), $R^2 = \corr(y,x)^2 = r^2$

```{r,include=FALSE}
covyx = cov(Minutes ~ Units, data=repairData)
covyx
corryx = cor(Minutes ~ Units, data=repairData)
corryx
```

```{r}
cor(Minutes ~ Units, data=repairData)
cor(Minutes ~ Units, data=repairData)^2
```

$r = `r corryx`$

$r^2 = `r corryx^2`$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
```

So, 98.7\% of the variability in repair times can be attributed to the regression on the number of units. The rest of the variability (residuals) is not explained by the number of units repaired.

#### Another way to view $R^2$

Claim: $\corr(y, \yhat) = |\, \corr(y, x)\, |$

Recall:

+ $\cov(a+by, c+dx) = bd\cov(y,x)$
+ $\sd(a+by) = |b|\sd(y)$ and $\sd(c + dx) = |d|\sd(x)$

$$\corr(a+by, c+dx) = \frac{bd\cov(y,x)}{|b|\sd(y)\, |d|\sd(x)} = {\rm sign}(bd)\corr(y,x)$$

$\begin{align} \corr(y, \yhat) = \corr(y, \b0 + \b1 x) = \textrm{sign}(\b1) \corr(y,x) \end{align}$

$\b1$ and $\corr(y,x)$ always have the same sign: $\b1 = \corr(y,x) (s_y/s_x)$

Thus, $0 \le \corr(y, \yhat) \le 1$ and $\corr(y, \yhat) = |\, \corr(y, x)\, |$

So, $R^2 = \left[\corr(y, x)\right]^2 = \left[\corr(y, \yhat)\right]^2$

This will be handy when we have models with more than one predictor. But we will have to be careful: when we add more predictors to a model, $R^2$ will *always* increase or stay constant. Therefore we won't be able to reliably use $R^2$ as a way of comparing models to see which is better.

#### Adjusted R^2

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
```

Adjusted $R^2$ includes the degrees of freedom:

$$\text{adjusted } R^2 \;=\; 1 - \frac{\SSE \;/\; \SSE(\df)}{\SST \;/\; \SST(\df)} \;=\; 1 - \frac{\sum (y_i - \yhat_i)^2 \;/\; (n-p-1)}{\sum (y_i - \ybar)^2 \;/\; (n-1)}$$

Adjusted $R^2$ does *not* always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.

### R$^2$s are not additive

**Terminology:** For multiple linear regression, $R$ (the positive square root of $R^2$) is sometimes called the **multiple correlation coefficient.**

Model with both father's ($x_1$) and mother's height ($x_2$) included:

```{r}
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.both)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.both)[-c(7:10)]
### Save the R-sqr value as a number (not a data frame)
heightFit.both.Rsqr <- as.numeric(glance(heightFit.both)[1])
```

Regression on both the fathers' and mothers' heights explains `r round(100*heightFit.both.Rsqr, 1)`% of the variability in sons' heights.

Model with only father's height ($x_1$) included:

```{r}
### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.fathers)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.fathers)[-c(7:10)] ### Save the R-sqr value as a number (not a data frame) heightFit.fathers.Rsqr <- as.numeric(glance(heightFit.fathers)[1]) ``` Regression on the fathers' heights explains `r round(100*heightFit.fathers.Rsqr, 1)`% of the variability in sons' heights. Model with only mother's height ($x_2$) included: ```{r} ### Estimate a simple linear regression model heightFit.mothers <- lm(height ~ mother, data=heightDataSons) ### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values tidy(heightFit.mothers) ### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc. glance(heightFit.mothers)[-c(7:10)] ### Save the R-sqr value as a number (not a data frame) heightFit.mothers.Rsqr <- as.numeric(glance(heightFit.mothers)[1]) ``` Regression on the mothers' heights explains `r round(100*heightFit.mothers.Rsqr, 1)`% of the variability in sons' heights. We see that `r round(100*heightFit.both.Rsqr, 1)`% $<$ `r round(100*heightFit.fathers.Rsqr, 1)`% $+$ `r round(100*heightFit.mothers.Rsqr, 1)`% because, for example, adding mothers' heights to the model only brings in their (marginal) information about sons' heights exclusive of information they also hold about fathers' heights. Of course, $R^2$ for the combined, two-predictor model must be larger than either of the $R^2$ for one-predictor models. Why? **Caution:** This is not always true for the adjusted $R^2$. ### Example: Fuel consumption data As an illustration of MLR modeling, we look at a dataset on U.S. fuel consumption by state (C. Bingham, American Almanac). We will investigate fuel consumption as a linear function of demographics, fuel tax policy, and other factors. ```{r} ### Read in the data on fuel taxes and demographics for 48 states ### (excludes Alaska and Hawaii) fuelData <- read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fuel.txt") ### Rearrange the order of the variables (in the order selected) ### then sort by the state code (AL, AR, AZ, CA, ...) fuelData <- fuelData %>% dplyr::select(state, pop, fuelc, fuel, nlic, dlic, tax, inc, road) %>% arrange(state) ### Display the data as a "tibble" (a "nice" format for a data frame) head(as.tibble(fuelData)) ``` variable | description |:-------|:-------------| state | State (excluding AK and HI) pop | Population in each state fuelc | fuel consumption, millions of gallons fuel | Motor fuel consumption, gal/person nlic | 1971 thousands of drivers dlic | Percent of population with driver's license tax | 1972 motor fuel tax rate, cents/gal inc | 1972 per capita income, thousands of dollars road | 1971 federal-aid highways, thousands of miles #### A first model: fuel = dlic + tax + inc + road We will only use per capita versions of the data: `dlic` vs. `nlic` (percent of licenced drivers vs. count) `fuel` vs. `fuelc` (gallons/person rather than gallons) `road` is a little problematic (more $ for states with large land area) `pop` could be included, but the variables are per capita already. #### Scatterplot matrix ```{r} ### Use select (from dplyr) to get the variables in order ### (response variable first), so scatterplot matrix better arranged. temp <- dplyr::select(fuelData, fuel, dlic, tax, inc, road) ### scatterplot matrix using ggpairs from GGally package ggpairs(temp) ``` `tax` and `dlic` appear to have a linear relationship with `fuel` ...and maybe `income`? 
`road` doesn't look to have much relationship to `fuel`.

#### The full model: fuel = dlic + tax + inc + road

```{r}
### Estimate various regression models we will use in these notes
fuelFit0 <- lm(fuel ~ 1, data=fuelData)
fuelFit1 <- lm(fuel ~ dlic, data=fuelData)
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
fuelFit4 <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
tidy(fuelFit4)
```

### Comparing models: R^2

#### One predictor: $R^2$

Recall from simple linear regression (one predictor)

$$R^2 = \frac{\SSR}{\SST} = 1 - \frac{\SSE}{\SST}$$

SST = $\sum (y_i - \ybar)^2$ = total sum of squares

SSE = $\sum (y_i - \yhat_i)^2$ = error sum of squares

SSR = $\sum (\yhat_i - \ybar)^2$ = regression sum of squares

$\SST = \SSR + \SSE$

Even the degrees of freedom for each sum of squares add up!

SST(df) = SSE(df) + SSR(df)

$(n-1) = (n-2) + 1$

Finally, $r^2 = [\corr(y, x)]^2 = R^2$

#### Multiple predictors: $R^2$

All is the same for multiple linear regression except...

$r_1^2 = [\corr(y, x_1)]^2 \ne R^2$

Instead, recall that for simple linear regression, we showed

$$\corr(y, \yhat) = |\; \corr(y, x) \;|$$

For simple and multiple linear regression,

$$ R^2 = [ \corr(y, \yhat) ]^2 = \frac{\SSR}{\SST} = 1 - \frac{\SSE}{\SST}$$

#### Fuel example: $R^2$

```{r}
### We estimated the full model "fuelFit4" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
```

#### Every additional predictor increases $R^2$

Since more information reduces error (SSE), each additional predictor

+ decreases SSE
+ increases $R^2$

(Technically, adding a predictor might have *no effect* on SSE or $R^2$, if the estimated regression coefficient for that predictor was *exactly* 0. But this will almost never happen in practice.)

Here is $R^2$ for the full model: `dlic`, `tax`, `inc`, and `road` versus a 3-predictor model (`road` removed)

fuel = dlic + tax + inc + road vs. fuel = dlic + tax + inc

```{r}
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
```

```{r}
### We estimated the reduced model "fuelFit3" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
```

$R^2$ went down even though we know `road` is not a significant predictor.

What about the adjusted $R^2$? $\quad$ It went up!

...which implies `road` does not add enough to the model to offset the additional "complexity" of the model.

### Adjusted $R^2$

$${\rm Adjusted}\; R^2 \;=\; 1 - \frac{\SSE \;/\; \SSE(df)}{\SST \;/\; \SST(df)} \;=\; 1 - \frac{\SSE \;/\; (n-p-1)}{\SST \;/\; (n-1)}$$

Adjusted $R^2$ sets up a "penalty" for model complexity.

Can use adjusted $R^2$ to compare "nested" models. Can even compare models that are not nested.

For example, compare the following two models with adjusted $R^2$, but do **NOT** compare using $R^2$:

fuel = dlic + inc $\qquad$ vs. $\qquad$ fuel = dlic + tax

```{r}
### Estimate a multiple linear regression
fuelFit2a <- lm(fuel ~ dlic + inc, data=fuelData)
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2a)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2a)[-c(7:10)]
### We estimated the fuelFit2 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2)[-c(7:10)]
```

#### Adjusted $R^2$ is smaller than $R^2$

$${\rm Adjusted}\; R^2 \;=\; 1 - \frac{n-1}{n-p-1} (1 - R^2) \;=\; \frac{(n-1)R^2 - p}{n-1-p}$$

Example: $n=100$, $p=20$, $R^2=0.90$

Then, adjusted $R^2 = 0.87$.

Adjusted $R^2$ is essentially "handicapped" or "penalized" by the number of parameters ($p+1$) relative to $n$.

NOTE: Adjusted $R^2$ can be negative (not common in practice)

### Remove another predictor? fuel = tax + dlic

We already decided `road` did not add enough information to the model to justify keeping it in.

Should we also remove `inc`?

fuel = dlic + tax + inc vs. fuel = dlic + tax

```{r}
### We estimated the fuelFit2 and fuelFit3 models far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
tidy(fuelFit2)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
glance(fuelFit2)[-c(7:10)]
```

Note that the SEs got larger and the t-statistics got smaller. So, it can happen that a significant predictor becomes not significant when another significant predictor is removed.

Also, important predictors can become insignificant if too many inconsequential variables are in the model.

Not all of modeling can be based on statistical tests!

#### What if the intercept is not significant?

The model we have settled upon is

fuel = dlic + tax + inc

but the intercept is (barely) not significant:

```{r}
### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
```

Even so, we keep the intercept in the model. The intercept doesn't really add complexity ...and keeping it guarantees all the nice properties of least squares estimates.

#### Questions about including predictors ...or not

What are the consequences of... including non-significant variables in the model?
After all, $R^2$ always improves, so why omit variables?

What are the consequences of... omitting important variables?
After all, fewer parameters means

+ more df for variance estimate
+ less model complexity

### Model validation: fuel = dlic + tax + inc

#### Reminder: Model specifications

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$
    a. $E(\epsilon_i)=0$ for all $i$ (for all values of all predictors $x_j$)
    b. The errors are not correlated with the predictor values
    c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all $i$ (for all $x_j$)
3. The mean of $Y$ depends linearly on $x_1, x_2, \ldots, x_p$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

#### Checking the residuals against the model specifications

```{r fig.height=4}
### We estimated the 5 models fuelFitX far above
### Here, we are adding the fitted values and residuals
### for each model to the dataset
fuelDataResids <- fuelData %>%
  mutate(resids0 = residuals(fuelFit0), fitted0 = predict(fuelFit0),
         resids1 = residuals(fuelFit1), fitted1 = predict(fuelFit1),
         resids2 = residuals(fuelFit2), fitted2 = predict(fuelFit2),
         resids3 = residuals(fuelFit3), fitted3 = predict(fuelFit3),
         resids4 = residuals(fuelFit4), fitted4 = predict(fuelFit4)
         )
### Plot residuals vs.
fitted values with dashed horizontal line at 0 gf_point(resids3 ~ fitted3, data = fuelDataResids) %>% gf_hline(yintercept = 0, linetype = "dashed") %>% gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals") ``` ```{r fig.height=4} ### Draw a histogram of the residuals p1 <- gf_histogram(~ resids3, data = fuelDataResids, color="white", bins=12) %>% gf_labs(x = "Residuals (fuel = dlic + tax + inc)") ### Draw a normal quantile plot of the residuals p2 <- gf_qq(~ resids3, data = fuelDataResids) %>% gf_qqline() grid.arrange(p1, p2, ncol=2) ``` #### Does outlier influence our decision about normality? Here are the cases with the largest residuals: ```{r} ### select variables and arrange (sort) the data temp <- fuelDataResids %>% dplyr::select(state, fuel, dlic, tax, inc, resids3) %>% arrange(resids3) ### Look at the last few observations (where sorted values are largest) tail(temp, 8) ``` The largest residual happens to be for the state of Wyoming. Here are the residual plots again, with the Wyoming residual removed. ```{r fig.height=4} ### Use the filter function from dplyr package to filter out Wyoming temp <- filter(fuelDataResids, state != "WY") ### Plot residuals vs. fitted values with dashed horizontal line at 0 gf_point(resids3 ~ fitted3, data = temp) %>% gf_hline(yintercept = 0, linetype = "dashed") %>% gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals") ``` ```{r fig.height=4} ### Remember, Wyoming is removed from data, so that residual is not displayed ### Draw a histogram of the residuals p1 <- gf_histogram(~ resids3, data = temp, color="white", bins=12) %>% gf_labs(x = "Residuals (fuel = dlic + tax + inc)") ### Draw a normal quantile plot of the residuals p2 <- gf_qq(~ resids3, data = temp) %>% gf_qqline() grid.arrange(p1, p2, ncol=2) ``` ### Inference for multiple linear regression #### Several types of tests Can test for specific values of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ 1. Test one single coefficient = 0 (or test another value) 2. Test if all coefficients $\beta_1, \beta_2, \ldots, \beta_p$ simultaneously = 0 + That is, predictors don't have much information about response 3. Test whether some subset of coefficients simultaneously = 0 4. Other constraints: Test whether... + some coefficients equal specific values + some coefficients are equal to the each other + some coefficients sum to a specific value + ...and so on Tests OK if model specifications reasonable and residuals plausibly from a normal distribution. #### 1. Test one single coefficient $$\frac{\bj - \beta_j}{\se(\bj)} \;\sim\; t_{n-p-1}$$ ```{r} ### We estimated the model "fuelFit3" far above ### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values tidy(fuelFit3) ``` #### 2. Test if all coefficients simultaneously = 0 To test multiple coefficients, we use the F-test. $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ $H_1:$ At least one of the $\beta_j \ne 0$. **NOTE:** Here, we are not testing $\beta_0=0$. Here is the process. Start with the **full model** (FM): $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$ (no restrictions on the coefficients) Compare to the **reduced model** (RM): $Y = \beta_0 + \epsilon$ (all other $\beta_j=0$ as stated in $H_0$) If the model specifications are valid (including Normality), Then the following comparison statistic has an $F$ distribution. 
$$F \;=\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)}$$

Let's try it with the fuel data: fuel = dlic + tax + inc

I encourage you to look at the R code.

```{r,collapse=TRUE}
### We estimated the models "fuelFit0" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (fuelFit0) and full (fuelFit3) models
anova(fuelFit0, fuelFit3)

### Now, I'll do things a little more explicitly (manually)
### Use tidy() from the broom package to put the ANOVA output into a table
anovaTable3 <- tidy(anova(fuelFit0, fuelFit3))
### Grab the various pieces of the F-statistic from the ANOVA table
SSE.RM <- anovaTable3$rss[1]
SSE.FM <- anovaTable3$rss[2]
df.RM <- anovaTable3$res.df[1]
df.FM <- anovaTable3$res.df[2]
### Calculate the F-statistic
F.stat <- ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / ( SSE.FM / df.FM)
### Calculate the p-value of the F-test
p.value <- 1 - pf(F.stat, df.RM - df.FM, df.FM)
```

$$F \;=\; \frac{(`r SSE.RM` - `r SSE.FM`) \; / \; (`r df.RM` - `r df.FM`)}
{`r SSE.FM` \;/\; `r df.FM`}
\;=\; `r F.stat`$$

p-value = `r p.value`

#### Standard F-test from software

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

$H_1:$ At least one of the $\beta_j \ne 0$.

is the test (and F-statistic) that R automatically calculates for regression.

```{r}
### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
```

#### The ANOVA table and the F test for all $\beta_j=0$ (except not testing $\beta_0=0$)

**ANOVA = ANalysis Of VAriance**

Source | SS (Sum of Squares) | df | MS (Mean Squares)
|:-------------|:------------------------|:-------|:---------------------|
Model (Regression) | SSR = $\sum_i (\yhat_i-\ybar)^2$ | $p$ | MSR = SSR$\,/\,p$
Residual (Error) | SSE = $\sum_i (y_i-\yhat_i)^2$ | $n-p-1$ | MSE = SSE$\,/\,(n-p-1)$ = $\widehat{\sigma}{^2}$
Total | SST = $\sum_i (y_i-\ybar)^2$ = SSR + SSE | $n-1$ | MST = SST$\,/\,(n-1) = \var(y)$

When $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$,

Reduced model (RM): $Y = \beta_0 + \epsilon$ has SSE = $\sum_i (y_i-\ybar)^2$ on $n-1$ df

Full model (FM): $Y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$ has SSE = $\sum_i (y_i-\yhat_i)^2$ on $n-p-1$ df

and the F-statistic is

$$\begin{aligned}
F &= \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)} \\
\\
&=\; \frac{\left[\sum (y_i-\ybar)^2 - \sum(y_i - \yhat_i)^2 \right] \;/\; [(n-1) - (n-p-1)]}{ \sum(y_i - \yhat_i)^2 \;/\; (n-p-1) } \\
\\
&=\; \frac{(\SST - \SSE) \;/\; p}{\SSE \;/\; (n-p-1)}
\;=\; \frac{\SSR \;/\; p}{\SSE \;/\; (n-p-1)}
\;=\; \frac{\MSR}{\MSE}
\end{aligned}$$

Since the textbook mentions it, here is another way to compute the F-statistic:

$$\begin{aligned}
F &=\; \frac{\MSR}{\MSE} \;=\; \frac{\SSR \;/\; p}{\SSE \;/\; (n-p-1)} \\
\\
&=\; \frac{1 \;/\; \SST}{1 \;/\; \SST} \times \frac{\SSR \;/\; p}{\SSE \;/\; (n-p-1)} \\
\\
&=\; \frac{(\SSR / \SST) \;/\; p}{(\SSE / \SST) \;/\; (n-p-1)}
\;=\; \frac{R^2_p \;/\; p} {(1 - R^2_p) \;/\; (n-p-1)}
\end{aligned}$$

...but there is nothing illuminating (to me) about this formula.

#### Test whether some set of coefficients simultaneously = 0

To test statements about multiple coefficients, we continue to use the F-test.
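With the fuel data, for example, we might ask whether `tax` and `inc` can both be dropped once `dlic` is in the model. One way such a comparison is usually run in R is to hand the reduced and full fits to `anova()`; here is a minimal sketch, assuming the `fuelFit1` (fuel ~ dlic) and `fuelFit3` (fuel ~ dlic + tax + inc) fits from above:

```{r}
### Sketch: F-test of H0: the coefficients of tax and inc are both 0
### Reduced model: fuel ~ dlic              (fuelFit1, estimated earlier)
### Full model:    fuel ~ dlic + tax + inc  (fuelFit3, estimated earlier)
anova(fuelFit1, fuelFit3)
```

The subsections below spell out the full-model versus reduced-model reasoning behind this comparison.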
Let's use the fuel data: fuel = dlic + tax + inc

$H_0: \beta_2 = \beta_3 = 0$ (`tax` and `inc` not needed in the model)

Here is the process.

Start with the **full model** (FM): $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$
(no restrictions on the coefficients)

Compare to the **reduced model** (RM): $Y = \beta_0 + \beta_1 x_1 + \epsilon$
($\beta_2=0$ and $\beta_3=0$ as stated in $H_0$)

If the model specifications are valid (including Normality),
then the following comparison statistic has an $F$ distribution.

df for F-statistic: [df(RM) $-$ df(FM)] and df(FM)

$$\begin{aligned}
F &=\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)} \\
\\
&=\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; (p-q)}{\SSE(\FM) \;/\; (n-p-1)} \\
\\
&=\; \frac{(R^2_p - R^2_q) \;/\; (p-q)}{(1 - R^2_p) \;/\; (n-p-1)}
\end{aligned}$$

where $p$ is the number of predictors in the full model (FM) and $q < p$ is the number of predictors in the reduced model (RM).

### Effects of changing units of measurement (scaling)

What happens to the regression coefficients if we change the units of the response?
Suppose we record the sons' heights in centimeters rather than inches.

```{r}
### Convert the sons' heights from inches to centimeters
cm <- 2.54
heightDataSons.cm <- heightDataSons %>%
  mutate(height.cm = height*cm)
### Fit the new linear model (response in cm instead of inches)
heightFit.both.cm <- lm(height.cm ~ father + mother, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.coef <- tidy(heightFit.both.cm)
```

Regression with original data:

```{r}
### This table of coefficients was saved earlier
heightFit.both.coef
```

Sons' heights changed to centimeters:

```{r}
### This table of coefficients was saved just above
heightFit.both.cm.coef
```

What is the ratio of the new coefficients and SEs compared to the old?

```{r}
### These tables of coefficients were saved earlier
### Coefficients in column 2, SEs in column 3
### Intercept in row 1
### Predictors in rows 2 ... to p
heightFit.both.coef[1:3, 2:3]
heightFit.both.cm.coef[1:3, 2:3]
heightFit.both.cm.coef[1:3, 2:3] / heightFit.both.coef[1:3, 2:3]
```

Rather confusingly, the mothers' and fathers' heights are still in inches. So, the coefficients are in units of cm/inch.

Of course, we can change all units to cm (and this better not affect our analysis at all!)
```{r}
### Add fathers' and mothers' heights (in cm) to the dataset
heightDataSons.cm <- heightDataSons.cm %>%
  mutate(father.cm = father*cm,
         mother.cm = mother*cm)
### Fit a new linear model with all variables in cm (not inches)
heightFit.both.cm.cm <- lm(height.cm ~ father.cm + mother.cm, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.cm.coef <- tidy(heightFit.both.cm.cm)
heightFit.both.cm.cm.coef
```

#### A general solution to scaling: Standardize the variables

$$\begin{aligned}
\widetilde{y} \;=\; \frac{y - \ybar}{s_y}
\quad {\rm and} \quad
\widetilde{x}_j \;=\; \frac{x_j - \xbar_j}{s_j}
\quad {\rm for}\; j=1,2,\ldots, p
\end{aligned}$$

#### First, let's just center the data

$$Y - \ybar \;=\; \beta_0 + \beta_1 (x_1 - \xbar_1) + \beta_2 (x_2 - \xbar_2) + \cdots + \beta_p (x_p - \xbar_p) + \epsilon $$

We are still using the modified data: sons' heights in cm, fathers' and mothers' heights in inches.

```{r}
### Add centered variables to the predictors (sons' in cm, others in inches)
heightDataSons.ctr <- heightDataSons.cm %>%
  mutate(height.ctr = height.cm - mean(height.cm),
         father.ctr = father - mean(father),
         mother.ctr = mother - mean(mother)
         )
### Estimate the multiple linear model with centered data
heightFit.both.ctr <- lm(height.ctr ~ father.ctr + mother.ctr, data=heightDataSons.ctr)
```

Regression with uncentered data:

```{r}
### This model was estimated earlier
tidy(heightFit.both.cm)
```

Regression after all variables centered:

```{r}
### This model was estimated just above
tidy(heightFit.both.ctr)
```

The intercept for the regression with centered data is zero. Why?

**Hint:** When we use least squares to estimate $\beta_0$ for a linear model like this

$$Y \;=\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

$$ {\rm then,}\; \widehat{\beta_0} \;=\; \ybar - \sum_{j=1}^p \bj \xbar_j$$

The other regression coefficients (and their SEs) are the same as when not centered.

#### Now, let's standardize the data

```{r}
### Add standardized variables to the dataset
heightDataSons.cm.std <- heightDataSons.cm %>%
  mutate(height.std = (height.cm - mean(height.cm)) / sd(height.cm),
         father.std = (father - mean(father)) / sd(father),
         mother.std = (mother - mean(mother)) / sd(mother)
         )
### Estimate linear model with standardized response and predictor variables
heightFit.both.cm.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.cm.std)
```

Regression with unstandardized data:

```{r}
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm)
```

Regression after all variables standardized:

```{r}
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm.std)
```

This last regression is the same whether we measured in cm or inches. Why?
Regression with unstandardized original data (all in inches):

```{r}
### Tidy up the coefficient output for the original (all-inches) model estimated earlier
tidy(heightFit.both)
```

Regression after all variables standardized:

```{r}
### Add standardized variables to the dataset
heightDataSons.std <- heightDataSons %>%
  mutate(height.std = (height - mean(height)) / sd(height),
         father.std = (father - mean(father)) / sd(father),
         mother.std = (mother - mean(mother)) / sd(mother)
         )
### Estimate the linear model using standardized response and predictor vars
heightFit.both.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.std)
### View the coef, SEs, t-stats, p-values from the linear regression
tidy(heightFit.both.std)
```

We don't really gain anything from standardizing the original data since all variables are already in the same units (inches) for the same characteristic (height).

Plus, it seems that the standard deviations of the original heights were nearly the same already for each variable (before standardizing).

$s_y = `r sd(~ height, data=heightDataSons)`$

$s_{x1} = `r sd(~ father, data=heightDataSons)`$

$s_{x2} = `r sd(~ mother, data=heightDataSons)`$

So, the coefficients changed a little, but their interpretation is very different. Now, for fathers one standard deviation taller, their sons are expected to be $`r round(tidy(heightFit.both.std)$estimate[2], 4)`$ standard deviations taller (adjusted for the mothers' heights).

### Properties of LS estimators

#### Reminder: Model specifications

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$
    a. $E(\epsilon_i)=0$ for all $i$ (for all values of all predictors $x_j$)
    b. The errors are not correlated with the predictor values
    c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all $i$ (for all $x_j$)
3. The mean of $Y$ depends linearly on $x_1, x_2, \ldots, x_p$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

#### Facts about LS estimators

+ $\bj$s unbiased: $\E(\bj) = \beta_j$ for $j=0,1,2,\ldots,p$
+ $\var(\bj) = c_j \sigma^2$
    + let R compute the $c_j$s (requires matrix algebra)
+ $\bj \sim$ Normal($\,\beta_j, \sqrt{c_j}\sigma$)
+ $W = \SSE/\sigma^2$ is chi-square ($\chi^2$) with $\df = n - p - 1$
+ $\bj$s are independent of $\widehat{\sigma}$ (the spread of the residuals does not affect the slope estimates)
+ BLUE: Best Linear Unbiased Estimators (minimum variance)

### Imputing a missing value

```{r}
i <- 5
n <- dim(heightDataSons)[1]
```

Suppose for the $`r 5`th$ family in the data, the mother's height was not recorded.

We could just throw out case $i = `r 5`$.

...but what if a mother's or father's height is missing from several families?
Should we just throw out the information on the other parent so easily?

One solution: **imputation** of missing values. This can be done many ways. We will only discuss two.

#### Option 1: Replace missing value with average

For example, replace the missing $`r 5`th$ mother's height with the average height of the other $(n-1) = `r n-1`$ mothers' heights.

```{r message=FALSE}
kable(mean(~ mother[-5], data=heightDataSons))
```

#### Option 2: Replace missing value with regression prediction

+ Temporarily let mother's height be the response variable.
+ Let son's and father's heights be the predictors.
+ Remove case $i = `r 5`$ from the data (missing the mother's height).
+ Regress mother's height on both son's and father's height + Predict mother's height for case $i = `r 5`$ using the son's and father's heights from case $i = `r 5`$ ```{r} i temp <- heightDataSons[-i, ] lmfit <- lm(mother ~ father + height, data = temp) newdata <- data.frame(father = heightDataSons$father[i], height = heightDataSons$height[i]) newdata <- mutate(newdata, mother = predict(lmfit, newdata)) kable(newdata) ```
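Since the mothers' heights actually were recorded for every family in these data, we can compare the two imputed values with the observed one. A minimal sketch, reusing the objects `i`, `newdata`, and `heightDataSons` from the chunks above:

```{r}
### Sketch: compare the two imputation options with the recorded value
option1 <- mean(heightDataSons$mother[-i])  ### Option 1: average of the other mothers
option2 <- newdata$mother                   ### Option 2: regression prediction from above
actual  <- heightDataSons$mother[i]         ### the height that was actually recorded
kable(data.frame(option1, option2, actual))
```

The regression-based prediction (Option 2) uses more of the available information about the family than the overall average does.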