All lectures

1 Chapter 3

1.1 Lecture #5

1.1.1 Model

1.1.1.1 Multiple linear regression model

If there are other variables that may help us predict \(Y\),
we could extend our simple model to include them.

\[Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p\] …and then, allow for error (statistical model): \[Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p + \epsilon\]

The following are also multiple linear regression models

\[ \begin{aligned} Y &\;=\; \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 log(x_1) + \beta_2 x_2 + \epsilon \end{aligned} \]
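
Both are still linear models because they are linear in the \(\beta\)’s. As a sketch (the data frame mydata and its columns y, x1, and x2 are made up for illustration), models like these can be fit with lm():

### Sketch only (not run): "mydata" with columns y, x1, x2 is a made-up example
fitQuadratic <- lm(y ~ x1 + I(x1^2), data = mydata)   ### Y = b0 + b1*x1 + b2*x1^2 + error
fitLog       <- lm(y ~ log(x1) + x2, data = mydata)   ### Y = b0 + b1*log(x1) + b2*x2 + error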

\(Y\) is approximately a linear function of \(x\)’s
and \(\epsilon\) measures the discrepancy of the approximation

As before, \(\epsilon\) contains no systematic information about \(Y\)
that is not already captured by the \(x\)’s

The \(\beta\)’s are called partial regression coefficients,
or just regression coefficients

1.1.1.2 Reminder: Model specifications

  1. The errors are uncorrelated
  2. The errors have no systematic info about \(Y\) not already captured by the \(x\)’s
    1. \(E(\epsilon_i)=0\) for all units \(i\) (for all values of all predictors \(x_j\))
    2. The errors are not correlated with the predictor values
    3. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all units \(i\) (for all predictors \(x_j\))
  3. The mean of \(Y\) depends linearly on \(x_1, x_2, \ldots, x_p\)
    1. \(E(Y\; |\; X_1=x_1, X_2=x_2, \ldots, X_p=x_p) \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p\)
  4. The errors have a normal distribution: \(\epsilon \;\sim\; N(0, \sigma^2)\)
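
Here is a small simulation sketch (all numbers below are made up) of data that satisfies these specifications; the fitted coefficients should land near the true values:

### Simulate data meeting the model specifications (made-up parameter values)
set.seed(1)
n.sim <- 100
x1 <- runif(n.sim, 60, 75)
x2 <- runif(n.sim, 55, 70)
eps <- rnorm(n.sim, mean = 0, sd = 2)   ### uncorrelated N(0, sigma^2) errors, mean 0
y <- 10 + 0.5*x1 + 0.3*x2 + eps         ### mean of Y is linear in x1 and x2
tidy(lm(y ~ x1 + x2))                   ### estimates should be near 10, 0.5, 0.3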

1.1.2 Data

1.1.2.1 Data format

We will endeavor to provide you with tidy data:

  • Each row corresponds to one observational unit (case)
  • Each variable is one column
observation   response \(y\)   predictor \(x_1\)   \(x_2\)      \(\cdots\)   \(x_p\)
1             \(y_1\)          \(x_{11}\)          \(x_{12}\)   \(\cdots\)   \(x_{1p}\)
2             \(y_2\)          \(x_{21}\)          \(x_{22}\)   \(\cdots\)   \(x_{2p}\)
\(\vdots\)    \(\vdots\)       \(\vdots\)          \(\vdots\)                \(\vdots\)
\(n\)         \(y_n\)          \(x_{n1}\)          \(x_{n2}\)   \(\cdots\)   \(x_{np}\)

\(x_{ij}\) = observation for unit \(i\) on variable \(x_j\)

1.1.3 Example: Heights of parents and their sons

First, I cleaned up the data. Click the Code button to see my R code.

### For these data, I added some categorical variables (for plotting)
### and reduced the data to one (random) son per family.
### It is not expected that you would need to do code like in this "chunk"

### Add categorical variables for heights of moms and dads
cuts <- c(61, 66, 68, 70, 72, 74, 80)
midpoints <- c(65, 67, 69, 71, 73, 75)

heightDataAll <- Galton %>%
  mutate(father2grp = factor(father > median(father), 
    labels = c("Father Short", "Father Tall"))) %>%
  mutate(mother2grp = factor(mother > median(mother), 
    labels = c("Mother Short", "Mother Tall"))) %>%
  mutate(father4grp = factor(cut(father, 4), 
    labels = c("Father Short", "Father Average", "Father Tall", "Father Very Tall"))) %>%
  mutate(mother4grp = factor(cut(mother, 4), 
    labels = c("Mother Short", "Mother Average", "Mother Tall", "Mother Very Tall"))) %>%
  mutate(fathergrp = factor(cut(father, cuts),  labels = midpoints))

### Reduce the data to one (random) son per family

### If you set the "seed" for the random number generator first,
### then you get the same "random" results each time you compile the doc.
### This keeps you sane.  
### The identical dataset is generated every time I compile this document.
set.seed(12345)

heightDataSonsGroups <- heightDataAll %>%
  filter(sex == "M") %>%
  group_by(family) %>%
  sample_n(1) %>%
  ungroup() %>%
  dplyr::select(-family, -sex)
heightDataSons <- heightDataSonsGroups %>%
  dplyr::select(height, father, mother, nkids)
### Now each family (row) is one "case"
### and each case has 3 observed variables: heights of the father, mother, and one son, plus number of children in the family (nkids)
### ...plus some categorical values I created for heights of fathers and mothers

n <- dim(heightDataSons)[1]

1.1.3.2 Linear relationship between average son’s height and father’s height?

### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg
ggplot(heightDataSonsGroups, aes(x=fathergrp, y=height)) +
  geom_boxplot() +
  labs(x="Father's Height (inches)", y="Son's Height (inches)") +
  stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
  stat_summary(fun.y=mean, geom="point", size=3, color="red")

### Fit a LS regression line
heightFit <- lm(height ~ father, data=heightDataSons)
### Add the fitted values and residuals to the data
heightDataFitSons <- mutate(heightDataSons, 
                            yhat = fitted(heightFit),  
                            resid = residuals(heightFit) 
                            )

A look at the residuals:

### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg
ggplot(heightDataFitSons, aes(x = father, y = height)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  ### Plot regression slope
  geom_segment(aes(xend = father, yend = yhat), alpha = .2) +  ### alpha to fade lines
  geom_point() +
  geom_point(aes(y = yhat), shape = 1) +
  theme_bw()  ### Add theme for cleaner look

… and a residual plot:

gf_point(resid ~ father, data=heightDataFitSons) %>%
  gf_hline(yintercept = 0)

1.1.3.3 Reading the output

Output from the tidy function in the broom package.

tidy(heightFit)
term estimate std.error statistic p.value
(Intercept) 38.4008 4.6882 8.191 0
father 0.4432 0.0678 6.535 0

What is the value of the least squares regression intercept?

What is the meaning of the intercept for these data?

What is the value of the least squares regression slope?

What is the meaning of the slope for these data?

Output from the glance function in the broom package.

glance(heightFit)
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual
0.1998 0.1952 2.261 42.71 0 2 -385.6 777.3 786.7 874.5 171

In this course, we likely will not discuss four of the statistics that glance provides: logLik, AIC, BIC, and deviance.

So, I will restrict the output for lectures:

glance(heightFit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.1998 0.1952 2.261 42.71 0 2 171

Click the Code button to see a summary of the regression output using base R.

heightFit <- lm(height ~ father, data=heightDataSons)
### Here is the base R summary of a linear model result
summary(heightFit)

Call:
lm(formula = height ~ father, data = heightDataSons)

Residuals:
   Min     1Q Median     3Q    Max 
-5.978 -1.535  0.022  1.522  6.022 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  38.4008     4.6882    8.19  5.8e-14 ***
father        0.4432     0.0678    6.54  7.0e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.26 on 171 degrees of freedom
Multiple R-squared:   0.2,  Adjusted R-squared:  0.195 
F-statistic: 42.7 on 1 and 171 DF,  p-value: 0.000000000702

What is the value of \(R^2\) and what does it mean?

What hypothesis is tested using this p-value?

Hint: Compare the t-statistics above to the F-statistic.

### Plot not displayed.
### Just playing around with the full dataset (out of curiosity)
gf_point(height ~ father, color = ~ sex, size = ~ nkids, data = heightDataAll)
### Plots not displayed.
### I was just playing around with various ggplot2 graphics functions.
### Mostly drawing means (and lines between means) with boxplots
### Who knows, someday you might want to show something similar at your summer internship!
x <- factor(rep(1:10, 100))
y <- rnorm(1000)
df <- data.frame(x=x, y=y)

ggplot(df, aes(x=x, y=y)) + 
geom_boxplot() + 
stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
stat_summary(fun.y=mean, geom="point")

qplot(father2grp, height, data=heightDataSons, geom="boxplot",
  xlab="Father's Height", ylab="Son's Height") +
  stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
  stat_summary(fun.y=mean, geom="point", size=3)

gf_boxplot(height ~ father2grp, data=heightDataSons) +
  labs(x="Father's Height", y="Son's Height") +
  stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
  stat_summary(fun.y=mean, geom="point", size=3)

ggplot(heightDataSons, aes(x=father2grp, y=height)) +
  geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
  stat_summary(fun.y=mean, geom="point", size=3, color="red")

ggplot(heightDataSons, aes(x=father4grp, y=height)) +
  geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1))  + 
  stat_summary(fun.y=mean, geom="point", size=3, color="red")
  
gf_histogram(~ height | father2grp, bins=12, data=heightDataSons, color="white") +
           labs(x="Son's Height", y="Number of Sons")

ggplot(heightDataSons, aes(x = height)) +
  geom_histogram(bins=12, color = "white") +
  facet_wrap(~ father2grp, ncol=1)

gf_boxplot(height ~ father4grp, data=heightDataSons) +
           labs(x="Father's Height", y="Son's Height")
gf_violin(height ~ father4grp, draw_quantiles=c(0.50), data=heightDataSons) +
           labs(x="Father's Height", y="Son's Height")

1.1.3.4 What about the other predictors in the data?

Could there be other information in the data
that can help predict a son’s height from family characteristics?

### Look at the first 10 observations in the data
head(heightDataSons, 10)
height father mother nkids
73.2 78.5 67.0 4
70.0 69.0 66.0 3
67.0 69.0 66.7 6
68.0 69.0 66.0 6
71.0 69.0 66.5 5
70.5 69.5 66.5 4
71.0 69.0 66.5 6
70.5 69.5 66.0 7
69.0 69.0 66.0 9
67.0 69.0 65.0 7

Start with a scatterplot matrix.

Using base R:

### Scatterplot matrix in base R
pairs(heightDataSons, pch=20)

### Table of correlations between all variables in the data
cor(heightDataSons)
         height  father mother    nkids
height  1.00000 0.44704 0.2830 -0.07661
father  0.44704 1.00000 0.1162  0.02625
mother  0.28299 0.11623 1.0000  0.09500
nkids  -0.07661 0.02625 0.0950  1.00000

Using the GGally package:

### Scatterplot matrix using GGally package
ggpairs(heightDataSons)

…and again getting fancier with the GGally package:

This adds simple linear regression lines and confidence bands to each plot.

### I stumbled across this while Googling around about ggpairs function.  Nifty.
ggpairs(heightDataSons, lower = list(continuous = wrap("smooth", method = "lm")))

Father’s height is more strongly correlated with
son’s height than mother’s height is.

Number of children in family doesn’t seem to be related to anyone’s height
… not surprising?

Fathers and mothers have fairly low height correlation
…a bit surprising?

Then, perhaps the mothers bring “significant” new information about the Son’s height
that is not already captured by father’s height.

1.1.3.5 Fit the multiple linear regression model

\(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)

where
\(x_1\) = father’s height
\(x_2\) = mother’s height
\(x_3\) = number of children in the family

### Fit a multiple linear regression model with least squares
heightFit.all <- lm(height ~ father + mother + nkids, data=heightDataSons)
### Look at the results using functions in the broom package
tidy(heightFit.all)
term estimate std.error statistic p.value
(Intercept) 24.1166 6.0128 4.011 0.0001
father 0.4179 0.0658 6.353 0.0000
mother 0.2589 0.0706 3.670 0.0003
nkids -0.1077 0.0643 -1.675 0.0957
glance(heightFit.all)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.2661 0.2531 2.179 20.43 0 4 169

Surprisingly, the number of children in the family
is a moderately significant predictor.

For now, I will leave it out of the model.

\(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)

where
\(x_1\) = father’s height
\(x_2\) = mother’s height

### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006

1.1.4 Estimation

1.1.4.1 Least squares estimates of coefficients

As we did with one predictor,
we estimate the regression coefficients by “least squares”

Minimize the sum of squared residuals.

The estimated error (estimate of \(\epsilon_i\)) is called a residual (\(e_i\))

\[\begin{aligned} {\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\ &\;=\; {\rm observed - expected} \\ &\;=\; {\rm observed - predicted} \\ \\ &\;=\; y_i - \widehat{y}_i\\ \\ &\;=\; y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2} + \cdots + \widehat{\beta}_px_{ip}) \end{aligned}\]

\[SSE \;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{\beta_0} - \widehat{\beta_1} x_{i1} - \widehat{\beta_2} x_{i2} - \cdots - \widehat{\beta}_px_{ip})^2\] We can minimize this function with respect to \(\widehat{\beta_0}, \widehat{\beta_1}, \widehat{\beta_2}, \ldots, \widehat{\beta}_p\)
to find the least squares estimates of the model coefficients.

The formulas get a bit crazy for \(p > 1\) predictors.
After taking partial derivatives wrt each \(\widehat\beta_j\)
we need to solve \((p+1)\) linear equations for \((p+1)\) unknowns.

This is all best represented in matrix form (…linear algebra).

At least the intercept has a simple formula.

With one predictor: \(Y \;=\; \beta_0 + \beta_1 x + \epsilon\),
recall that \(\widehat{\beta_0} \;=\; \overline{y}- \widehat{\beta_1} \overline{x}\).

For multiple linear regression,

\[\widehat{\beta_0} \;=\; \overline{y}- \widehat{\beta_1} \overline{x}_1 - \widehat{\beta_2} \overline{x}_2 - \cdots - \widehat{\beta}_p\overline{x}_p \;=\; \overline{y}- \sum_{j=1}^p \widehat\beta_j \overline{x}_j\]
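
A quick check of this formula using the two-predictor height model fit above (heightFit.both):

### Check: intercept = ybar - b1*x1bar - b2*x2bar  (two-predictor height model)
b <- coef(heightFit.both)
with(heightDataSons, mean(height) - b[2]*mean(father) - b[3]*mean(mother))
b[1]   ### the fitted intercept, same value as the line above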

1.1.4.2 Estimate of the standard deviation

As with all least squares methods, we estimate \(\mathrm{var}(\epsilon)\) as

\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{\mathrm{df}}\]

degrees of freedom (df)

= sample size \(-\) number of parameters estimated for the mean (\(\beta_0, \beta_1, \beta_2, \ldots, \beta_p\))

= \(n-(p+1) \;=\; (n - p - 1)\)

\[\widehat\sigma^2 \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n - p - 1}\]
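
For example, with the two-predictor height model above (\(p = 2\), and n was computed earlier as the number of rows), something like this should reproduce the sigma reported by glance:

### Check: estimate of sigma for the two-predictor height model (df = n - 2 - 1)
sqrt( sum(residuals(heightFit.both)^2) / (n - 2 - 1) )
glance(heightFit.both)$sigma   ### same value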

1.1.4.3 Finding the least squares estimates using R

lm = “linear model” function

lm(y ~ x1 + x2 + x3 + x4, data = mydata)

1.1.5 Interpreting (partial) regression coefficients

### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006
### Grab the coefficients (column 2) for 1st two predictors (rows 2 & 3)
### (The intercept information is in row 1)
b.both.fathers <- heightFit.both.coef$estimate[2]
b.both.mothers <- heightFit.both.coef$estimate[3]

One interpretation:

Among mothers of a particular height, if they each had a child with a man 1 inch taller instead,
we expect their sons would have been, on average, 0.4162 inches taller.

Among fathers of a particular height, if they each had a child with a woman 1 inch taller instead,
we expect their sons would have been, on average, 0.248 inches taller.




Another interpretation:

Fathers 1 inch taller than others tend to have sons about 0.4162 inches taller,
on average (after adjusting for the height of the mother).

Mothers 1 inch taller than others tend to have sons about 0.248 inches taller,
on average (after adjusting for the height of the father).




Yet another interpretation (much like the first):

Holding the height of the mother constant, fathers 1 inch taller than others tend to have sons about 0.4162 inches taller,
on average.

Holding the height of the father constant, mothers 1 inch taller than others tend to have sons about 0.248 inches taller,
on average.

1.1.6 Simple vs. multiple linear regression coefficients

What if we only use father’s height as a predictor?

(Simple linear regression)

### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.fathers.coef <- tidy(heightFit.fathers)
heightFit.fathers.coef
term estimate std.error statistic p.value
(Intercept) 38.4008 4.6882 8.191 0
father 0.4432 0.0678 6.535 0
### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.fathers.coef <- tidy(heightFit.fathers)
### Grab the coefficient (column 2) for the predictor (row 2)
### (The intercept information is in row 1)
b.fathers <- heightFit.fathers.coef$estimate[2]

Why did the coefficient for father’s height change from 0.4162 to 0.4432?

For the simple linear regression, the interpretation is different.

Men 1 inch taller than others tend to have sons about 0.4432 inches taller,
on average (without taking into account the height of the mother).

1.1.6.1 A closer look at \(\beta_1\) (the slope)

\(\beta_1\) is the slope of the simple linear regression function.

Slope is the change in average \(Y\) associated with
a one-unit increase of \(x\).

\[ \mathrm{E}(Y^{old})={\beta_0}+\beta_1x^{old} \quad \textrm{ and} \] \[ \mathrm{E}(Y^{new})=\beta_0+\beta_1(x^{old}+1), \] then \[ \mathrm{E}(Y^{new})-\mathrm{E}(Y^{old}) \;=\; \beta_0+\beta_1(x^{old}+1)-(\beta_0+\beta_1x^{old}) \;=\; \beta_1 \]

1.1.6.2 A closer look at \(\beta_1\) (the partial coeff for \(x_1\))

Consider a multiple regression with two predictors:

\[ \mathrm{E}(Y^{old}) \;=\; {\beta_0}+\beta_1x_1^{old}+\beta_2x_2 \quad \textrm{ and} \] \[ \mathrm{E}(Y^{new}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2, \] then \[ \mathrm{E}(Y^{new})-\mathrm{E}(Y^{old}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2-(\beta_0+\beta_1x_1^{old}+\beta_2x_2) \;=\; \beta_1. \]

…and this math only works if \(x_2\) does not change when \(x_1\) increases by one unit.

Interpretation:

\(\beta_1\) is the expected change in \(Y\) when \(x_1\) increases by 1 unit,
with all other predictors (\(x_2\) in this example) unchanged (but present in the model).

Another interpretation:

The expected change in \(Y\) when \(x_1\) increases by 1 unit,
adjusting for all other predictors.

1.1.7 Mathematical understanding of partial coefficients

NOTE: The following discussion is just for a pedagogical purpose – another way to view things.
This is NOT how we calculate coefficients for multiple regression.

For a model with two predictors

\[\widehat{y}_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\] \(\beta_1\) represents the effect of \(x_1\) on the mean of \(Y\)
after taking out the (linear) effects of \(x_2\) from both \(Y\) and \(x_1\)

1.1.7.1 A 3-step process for finding \(\widehat{\beta_1}\).

  1. Regress \(Y\) on \(x_2\)
  2. Regress \(x_1\) on \(x_2\)
  3. Regress the residuals from 1 onto the residuals from 2

The estimated slope for Model #3 = \(\widehat{\beta_1}\) for the multiple regression.

What? This is NOT obvious. Let’s investigate.

1.1.7.1.1 Estimate Model #1:

Regress \(Y\) on \(x_2\)

\[y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}\] This model estimate “grabs” all the (linear) information about \(Y\) that \(x_2\) has to offer.

The residuals leftover (\(e_{i; y, x_2}\)) are not correlated with \(x_2\).

1.1.7.1.2 Estimate Model #2:

Regress \(x_1\) on \(x_2\)

\[x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}\] This model estimate grabs all the (linear) information about \(x_1\) that \(x_2\) has to offer.

The residuals leftover (\(e_{i; x_1, x_2}\)) are not correlated with \(x_2\).

1.1.7.1.3 Estimate Model #3:

Regress residuals: Model #1 on Model #2

\[e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}\] How do we know the intercept is exactly zero? Think on it.

This model grabs all the linear information about \(y\) that \(x_1\) has to offer
after we already removed all linear information \(x_2\) contained about either \(y\) or \(x_1\).

Assuming this works, then \(\widehat{\alpha_1}\) in the estimate of Model #3
should be exactly \(\widehat{\beta_1}\) in the full, two-predictor model estimate:

\[y_i \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\]

So, \(\beta_1\) represents the effect of \(x_1\) on the mean of \(Y\)
after taking out the (linear) effects of \(x_2\) from both \(Y\) and \(x_1\)

1.1.7.2 Confirming the 3-step process for finding \(\widehat{\beta_1}\)

Recall the Model #2 estimate above: \(x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}\)

Substituting the Model #2 estimate into the full model estimate, we have

\[ \begin{aligned} y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\\ \\ y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} (\widehat{\gamma_0} + \widehat{\gamma_1} x_{i2} + e_{i; x_1, x_2}) + \widehat{\beta_2} x_{i2}\\ \\ &= (\widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0}) + (\widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2}) x_{i2} + \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} \]

Then, by comparing the above with the Model #1 estimate: \(y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}\) we see that

\[ \begin{aligned} \widehat{\delta_0} & \;=\; \widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0} \\ \widehat{\delta_1} & \;=\; \widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2} \\ e_{i; y, x_2} & \;=\; \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} \] The last equation shows \(\,\widehat{\beta_1}\) is exactly the coefficient in the Model #3 estimate: \(e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}\)

1.1.7.3 Example: Seeing the 3-step process for finding \(\widehat{\beta_1}\)

For the height data
\(x_1\) = father’s height
\(x_2\) = mother’s height
\(y\) = son’s height

1.1.7.3.1 Model #1 estimate is

\[y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}\]

Regress sons’ heights (\(Y\)) on mothers’ heights (\(x_2\)):

### Estimate a simple linear regression model
heightFit.mothers <- lm(height ~ mother, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.mothers.coef <- tidy(heightFit.mothers)
heightFit.mothers.coef
term estimate std.error statistic p.value
(Intercept) 49.8503 4.9715 10.027 0.0000
mother 0.2997 0.0777 3.858 0.0002
### Save the residuals
y.on.x2.resids <- residuals(heightFit.mothers)
1.1.7.3.2 Model #2 estimate is

\[x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}\]

Regress fathers’ heights (\(x_1\)) on mothers’ heights (\(x_2\)):

### Estimate a simple linear regression model
heightFit.fathersOnMothers <- lm(father ~ mother, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.fathersOnMothers.coef <- tidy(heightFit.fathersOnMothers)
heightFit.fathersOnMothers.coef
term estimate std.error statistic p.value
(Intercept) 61.1486 5.1934 11.77 0.0000
mother 0.1242 0.0811 1.53 0.1278
### Save the residuals
x1.on.x2.resids <- residuals(heightFit.fathersOnMothers)

Some of the residuals:

head(mutate(heightDataSons, y.on.x2.resids, x1.on.x2.resids), 10)
height father mother nkids y.on.x2.resids x1.on.x2.resids
73.2 78.5 67.0 4 3.2728 9.0328
70.0 69.0 66.0 3 0.3725 -0.3431
67.0 69.0 66.7 6 -2.8373 -0.4300
68.0 69.0 66.0 6 -1.6275 -0.3431
71.0 69.0 66.5 5 1.2227 -0.4052
70.5 69.5 66.5 4 0.7227 0.0948
71.0 69.0 66.5 6 1.2227 -0.4052
70.5 69.5 66.0 7 0.8725 0.1569
69.0 69.0 66.0 9 -0.6275 -0.3431
67.0 69.0 65.0 7 -2.3279 -0.2189
1.1.7.3.3 Model #3 estimate is

The regression of the residuals to find \(\widehat{\alpha_1}\):

\[e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}\]

### Estimate a simple linear regression model
heightFit.residsModel <- lm(y.on.x2.resids ~ x1.on.x2.resids)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.residsModel)
term estimate std.error statistic p.value
(Intercept) 0.0000 0.1660 0.000 1
x1.on.x2.resids 0.4162 0.0659 6.313 0

…and finding \(\widehat{\beta_1}\) directly from the full model
(both father and mother heights as predictors)

### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
### for the multiple linear regression on both x1 and x2
tidy(lm(height ~ father + mother, data=heightDataSons))
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006

It works!

1.1.7.4 Reminder: The 3-step process is just for pedagogical illustration

We don’t actually use the 3-step process to find \(\widehat{\beta_1}\) in practice.

We just compute the two-predictor model directly in R.

1.1.8 Geometric interpretation of regression coefficients

1.1.8.1 One predictor (regression line)

  • The fitted values (\(\widehat{Y}\)’s) will all lie on the regression line
  • \(\beta_0\) is the intercept of that line
  • \(\beta_1\) is the slope of that line
  • \(Y\) values may be above, below, or on the line within the (\(x,Y\)) plane of the data

1.1.8.2 Two predictors (regression plane)

  • The fitted values (\(\widehat{Y}\)’s) will all lie on the regression plane
  • \(\beta_0\) is the \(Y\) value with the \(x\) variables both = 0
  • \(\beta_1\) is the slope of the plane along the \(x_1\) direction
  • \(\beta_2\) is the slope of the same plane along the \(x_2\) direction
  • \(Y\) values may be above, below, or on the plane within the full space of the data

Recall: \(\beta_1\) is the effect of \(x_1\) on expected \(Y\)
but only after we have “adjusted for” \(x_2\) in the model.

A graph may give a sense of what is meant by “adjusted for”

http://statistics.uchicago.edu/~burbank/s224/graphs/regress_plane2.pdf

The plane is not parallel to either \(x\) axis,
but changes in both directions
(if we have two meaningful predictors of \(Y\)).

Given any \(x_2\), the incremental change due to \(x_1\) is the same,
but the resultant predicted \(Y\) is not the same
…due to the contribution of \(x_2\).

1.1.8.3 Multiple predictors (regression surface)

The fitted values (\(\widehat{Y}\)’s) will all lie on the same regression surface

The \(Y\) values may be above, below, or on the surface

1.2 Lecture #6

1.2.1 R^2 (From Chapter 2 but I skipped past this part when giving the lecture)

1.2.1.1 Coefficient of determination: \(R^2\) (“R squared”)

One measure of the “goodness of fit” of the regression model to the data:

\(R^2\) = proportion of variability in the response (\(Y\))
explained by regression on the predictor (\(x\))

\[R^2 \;=\; \frac{\mathrm{SSR}}{\mathrm{SST}} \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2}{\sum (y_i - \overline{y})^2}\]

If the predictors don’t include much information about the response,
then \(\sum (y_i - \widehat{y}_i)^2\) is close to \(\sum (y_i - \overline{y})^2\)
since \(\overline{y}\) is the least squares estimate for the mean without predictors.

In that case, \(R^2\) is near \(1 - 1 = 0\).

On the other hand, if the predictors have so much information that
residuals are very, very small, then \(R^2\) is near \(1 - 0 = 1\).

repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9874 0.9864 5.392 943.2 0 2 12
summary(repairFit)

Call:
lm(formula = Minutes ~ Units, data = repairData)

Residuals:
   Min     1Q Median     3Q    Max 
-9.232 -3.341 -0.714  4.777  7.803 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.162      3.355    1.24     0.24    
Units         15.509      0.505   30.71  8.9e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.39 on 12 degrees of freedom
Multiple R-squared:  0.987, Adjusted R-squared:  0.986 
F-statistic:  943 on 1 and 12 DF,  p-value: 8.92e-13
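
As a quick check, \(R^2\) can be computed “by hand” from the residuals (using the repairFit object above):

### Check: R-squared by hand = 1 - SSE/SST
SSE <- sum(residuals(repairFit)^2)
SST <- sum((repairData$Minutes - mean(repairData$Minutes))^2)
1 - SSE/SST   ### matches r.squared = 0.9874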

1.2.1.2 Why do we call this number \(R^2\)?

When there is only one predictor (\(x\)), \(R^2 = \mathrm{corr}(y,x)^2 = r^2\)

cor(Minutes ~ Units, data=repairData)
[1] 0.9937
cor(Minutes ~ Units, data=repairData)^2
[1] 0.9874

\(r = 0.9937\)
\(r^2 = 0.9874\)

repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9874 0.9864 5.392 943.2 0 2 12

So, 98.7% of the variability in repair times can be attributed to the regression on the number of units.

The rest of the variability (residuals) is not explained by the number of units repaired.

1.2.1.3 Another way to view \(R^2\)

Claim: \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)

Recall:

  • \(\mathrm{cov}(a+by, c+dx) = bd\,\mathrm{cov}(y,x)\)
  • \(\mathrm{sd}(a+by) = |b|\,\mathrm{sd}(y)\) and \(\mathrm{sd}(c + dx) = |d|\,\mathrm{sd}(x)\)

\[\mathrm{corr}(a+by, c+dx) = \frac{bd\,\mathrm{cov}(y,x)}{|b|\,\mathrm{sd}(y)\, |d|\,\mathrm{sd}(x)} = {\rm sign}(bd)\,\mathrm{corr}(y,x)\]

\(\begin{align} \mathrm{corr}(y, \widehat{y}) = \mathrm{corr}(y, \widehat{\beta}_{0} + \widehat{\beta}_{1} x) = \textrm{sign}(\widehat{\beta}_{1}) \mathrm{corr}(y,x) \end{align}\)

\(\widehat{\beta}_{1}\) and \(\mathrm{corr}(y,x)\) always have the same sign: \(\widehat{\beta}_{1} = \mathrm{corr}(y,x) (s_y/s_x)\)

Thus, \(0 \le \mathrm{corr}(y, \widehat{y}) \le 1\) and \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)

So, \(R^2 = \left[\mathrm{corr}(y, x)\right]^2 = \left[\mathrm{corr}(y, \widehat{y})\right]^2\)
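
A quick check with the repair data (both lines should give the same \(R^2\)):

### Check: squared correlation of y with x, and of y with yhat
cor(Minutes ~ Units, data = repairData)^2
cor(repairData$Minutes, fitted(repairFit))^2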

This will be handy when we have models with more than one predictor.

But we will have to be careful: when we add more predictors to a model, \(R^2\) will always increase or stay constant. Therefore we won’t be able to reliably use \(R^2\) as a way of comparing models to see which is better.

1.2.1.4 Adjusted R^2

repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9874 0.9864 5.392 943.2 0 2 12

Adjusted \(R^2\) includes the degrees of freedom:

\[\text{adjusted}\; R^2 \;=\; 1 - \frac{\mathrm{SSE}\;/\; \mathrm{SSE}(\mathrm{df})}{\mathrm{SST}\;/\; \mathrm{SST}(\mathrm{df})} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2/(n-p-1)}{\sum (y_i - \overline{y})^2 / (n-1)}\]

Adjusted \(R^2\) does not always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.
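
A quick check of the adjusted \(R^2\) formula for the repair data (here \(n = 14\) and \(p = 1\)):

### Check: adjusted R-squared by hand for the repair-time model
n.repair <- nrow(repairData)
SSE <- sum(residuals(repairFit)^2)
SST <- sum((repairData$Minutes - mean(repairData$Minutes))^2)
1 - (SSE/(n.repair - 1 - 1)) / (SST/(n.repair - 1))   ### matches adj.r.squared = 0.9864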

1.2.2 R\(^2\)s are not additive

Terminology: For multiple linear regression, \(R = \mathrm{corr}(y, \widehat{y})\) is sometimes called
the multiple correlation coefficient (and \(R^2\) is its square).

Model with both father’s (\(x_1\)) and mother’s height (\(x_2\)) included:

### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.both)
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.both)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.254 0.2452 2.19 28.93 0 3 170
### Save the R-sqr value as a number (not a data frame)
heightFit.both.Rsqr <- as.numeric(glance(heightFit.both)[1])

Regression on both the fathers’ and mothers’ heights explains 25.4% of the variability in sons’ heights.

Model with only father’s height (\(x_1\)) included:

### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.fathers)
term estimate std.error statistic p.value
(Intercept) 38.4008 4.6882 8.191 0
father 0.4432 0.0678 6.535 0
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.fathers)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.1998 0.1952 2.261 42.71 0 2 171
### Save the R-sqr value as a number (not a data frame)
heightFit.fathers.Rsqr <- as.numeric(glance(heightFit.fathers)[1])

Regression on the fathers’ heights explains 20% of the variability in sons’ heights.

Model with only mother’s height (\(x_2\)) included:

### Estimate a simple linear regression model
heightFit.mothers <- lm(height ~ mother, data=heightDataSons)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.mothers)
term estimate std.error statistic p.value
(Intercept) 49.8503 4.9715 10.027 0.0000
mother 0.2997 0.0777 3.858 0.0002
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.mothers)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.0801 0.0747 2.425 14.89 0.0002 2 171
### Save the R-sqr value as a number (not a data frame)
heightFit.mothers.Rsqr <- as.numeric(glance(heightFit.mothers)[1])

Regression on the mothers’ heights explains 8% of the variability in sons’ heights.

We see that 25.4% \(<\) 20% \(+\) 8%

because, for example, adding mothers’ heights to the model only brings in their (marginal) information
about sons’ heights exclusive of information they also hold about fathers’ heights.

Of course, \(R^2\) for the combined, two-predictor model must be
larger than either of the \(R^2\) for one-predictor models. Why?

Caution: This is not always true for the adjusted \(R^2\).

1.2.3 Example: Fuel consumption data

As an illustration of MLR modeling, we look at a dataset on U.S. fuel consumption by state (C. Bingham, American Almanac).

We will investigate fuel consumption as a linear function of demographics, fuel tax policy, and other factors.

### Read in the data on fuel taxes and demographics for 48 states
### (excludes Alaska and Hawaii)
fuelData <- read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fuel.txt")

### Rearrange the order of the variables (in the order selected)
### then sort by the state code (AL, AR, AZ, CA, ...)
fuelData <- fuelData %>%
  dplyr::select(state, pop, fuelc, fuel, nlic, dlic, tax, inc, road) %>%
  arrange(state)
### Display the data as a "tibble" (a "nice" format for a data frame)
head(as.tibble(fuelData))
state pop fuelc fuel nlic dlic tax inc road
AL 3510 1946 554 1801 51.3 7.0 3.333 6.594
AR 1978 1242 628 1081 54.7 7.5 3.357 4.121
AZ 1945 1230 632 1173 60.3 7.0 4.300 3.635
CA 20468 10730 524 12130 59.3 7.0 5.002 9.794
CN 3082 1408 457 1760 57.1 10.0 5.342 1.333
CO 2357 1384 587 1475 62.6 7.0 4.449 4.639
variable   description
state      State (excluding AK and HI)
pop        Population in each state
fuelc      Fuel consumption, millions of gallons
fuel       Motor fuel consumption, gal/person
nlic       1971 thousands of drivers
dlic       Percent of population with driver’s license
tax        1972 motor fuel tax rate, cents/gal
inc        1972 per capita income, thousands of dollars
road       1971 federal-aid highways, thousands of miles

1.2.3.1 A first model: fuel = dlic + tax + inc + road

We will only use per capita versions of the data:

dlic vs. nlic (percent of licensed drivers vs. count)

fuel vs. fuelc (gallons/person rather than gallons)

road is a little problematic (more federal-aid road miles for states with large land area)

pop could be included, but the variables are per capita already.

1.2.3.2 Scatterplot matrix

### Use select (from dplyr) to get the variables in order 
### (response variable first), so scatterplot matrix better arranged.
temp <- dplyr::select(fuelData, fuel, dlic, tax, inc, road)
### scatterplot matrix using ggpairs from GGally package
ggpairs(temp)

tax and dlic appear to have a linear relationship with fuel

…and maybe income?

road doesn’t look to have much relationship to fuel

1.2.3.3 The full model: fuel = dlic + tax + inc + road

### Estimate various regression models we will use in these notes
fuelFit0 <- lm(fuel ~ 1, data=fuelData)
fuelFit1 <- lm(fuel ~ dlic, data=fuelData)
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
fuelFit4 <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
tidy(fuelFit4)
term estimate std.error statistic p.value
(Intercept) 377.291 185.541 2.0335 0.0482
dlic 13.364 1.923 6.9499 0.0000
tax -34.790 12.970 -2.6823 0.0103
inc -66.589 17.222 -3.8665 0.0004
road -2.426 3.389 -0.7158 0.4780

1.2.4 Comparing models: R^2

1.2.4.1 One predictor: \(R^2\)

Recall from simple linear regression (one predictor)

\[R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\]

SST = \(\sum (y_i - \overline{y})^2\) = total sum of squares

SSE = \(\sum (y_i - \widehat{y}_i)^2\) = error sum of squares

SSR = \(\sum (\widehat{y}_i - \overline{y})^2\) = regression sum of squares

\(\mathrm{SST}= \mathrm{SSR}+ \mathrm{SSE}\)

Even the degrees of freedom for each sum of squares add up!

SST(df) = SSE(df) + SSR(df)

\((n-1) = (n-2) + 1\)

Finally, \(r^2 = [\mathrm{corr}(y, x)]^2 = R^2\)

1.2.4.2 Multiple predictors: \(R^2\)

All is the same for multiple linear regression except…

\(r_1^2 = [\mathrm{corr}(y, x_1)]^2 \ne R^2\)

Instead, recall that for simple linear regression, we showed

\[\mathrm{corr}(y, \widehat{y}) = |\; \mathrm{corr}(y, x) \;|\]

For simple and multiple linear regression

\[ R^2 = [ \mathrm{corr}(y, \widehat{y}) ]^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\]
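
A quick check with the full fuel model fit above (fuelFit4, four predictors):

### Check: R-squared = squared correlation between y and yhat
cor(fuelData$fuel, fitted(fuelFit4))^2
glance(fuelFit4)$r.squared   ### same value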

1.2.4.3 Fuel example: \(R^2\)

### We estimated the full model "fuelFit4" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
term estimate std.error statistic p.value
(Intercept) 377.291 185.541 2.0335 0.0482
dlic 13.364 1.923 6.9499 0.0000
tax -34.790 12.970 -2.6823 0.0103
inc -66.589 17.222 -3.8665 0.0004
road -2.426 3.389 -0.7158 0.4780
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.6787 0.6488 66.31 22.71 0 5 43

1.2.4.4 Every additional predictor increases \(R^2\)

Since more information reduces error (SSE),

each additional predictor

  • decreases SSE
  • increases \(R^2\)

(Technically, adding a predictor might have no effect on SSE or \(R^2\), if the estimated regression coefficient for that predictor was exactly 0. But this will almost never happen in practice.)

Here is \(R^2\) for the full model: dlic, tax, inc, and road
versus a 3-predictor model (road removed)

fuel = dlic + tax + inc + road

vs. fuel = dlic + tax + inc

### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
term estimate std.error statistic p.value
(Intercept) 377.291 185.541 2.0335 0.0482
dlic 13.364 1.923 6.9499 0.0000
tax -34.790 12.970 -2.6823 0.0103
inc -66.589 17.222 -3.8665 0.0004
road -2.426 3.389 -0.7158 0.4780
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.6787 0.6488 66.31 22.71 0 5 43
### We estimated the reduced model "fuelFit3" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term estimate std.error statistic p.value
(Intercept) 307.33 156.831 1.960 0.0564
dlic 13.75 1.837 7.485 0.0000
tax -29.48 10.584 -2.786 0.0078
inc -68.02 17.010 -3.999 0.0002
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.6749 0.6527 65.94 30.44 0 4 44

\(R^2\) went down even though we know road is not a significant predictor

What about the adjusted \(R^2\)? \(\quad\) It went up!

…which implies road does not add enough to the model to offset the additional “complexity” of the model

1.2.5 Adjusted \(R^2\)

\[{\rm Adjusted}\; R^2 \;=\; 1 - \frac{\mathrm{SSE}\;/\; \mathrm{SSE}(df)}{\mathrm{SST}\;/\; \mathrm{SST}(df)} \;=\; 1 - \frac{\mathrm{SSE}\;/\; (n-p-1)}{\mathrm{SST}\;/\; (n-1)}\]

Adjusted \(R^2\) sets up a “penalty” for model complexity.

Can use adjusted \(R^2\) to compare “nested” models.

Can even compare models that are not nested.
For example, compare the following two models with adjusted \(R^2\),
but do NOT compare using \(R^2\):

fuel = dlic + inc \(\qquad\) vs. \(\qquad\) fuel = dlic + tax

### Estimate a multiple linear regression
fuelFit2a <- lm(fuel ~ dlic + inc, data=fuelData)
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2a)
term estimate std.error statistic p.value
(Intercept) 7.837 122.463 0.064 0.9493
dlic 15.250 1.883 8.099 0.0000
inc -70.924 18.209 -3.895 0.0003
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2a)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.6175 0.6005 70.72 36.33 0 3 45
### We estimated the fuelFit2 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2)
term estimate std.error statistic p.value
(Intercept) 108.97 171.786 0.6343 0.5291
dlic 12.51 2.091 5.9862 0.0000
tax -32.08 12.197 -2.6297 0.0117
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.5567 0.537 76.13 28.25 0 3 45

1.2.5.1 Adjusted \(R^2\) is smaller than \(R^2\)

\[{\rm Adjusted}\; R^2 \;=\; 1 - \frac{n-1}{n-p-1} (1 - R^2) \;=\; \frac{(n-1)R^2 - p}{n-1-p}\]

Example: \(n=100\), \(p=20\), \(R^2=0.90\)
Then, adjusted \(R^2 = 0.87\).
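
The arithmetic for this example in R:

### Adjusted R-squared for n = 100, p = 20, R^2 = 0.90
n.ex <- 100; p.ex <- 20; R2.ex <- 0.90
1 - ((n.ex - 1)/(n.ex - p.ex - 1)) * (1 - R2.ex)   ### about 0.87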

Adjusted \(R^2\) is essentially “handicapped” or “penalized”
by the number of parameters (\(p+1\)) relative to \(n\)

NOTE: Adjusted \(R^2\) can be negative (not common in practice)

1.2.6 Remove another predictor? fuel = tax + dlic

We already decided road did not add enough information to the model to justify keeping it in.

Should we also remove inc?

fuel = dlic + tax + inc

vs. fuel = dlic + tax

### We estimated the fuelFit2 and fuelFit3 models far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term estimate std.error statistic p.value
(Intercept) 307.33 156.831 1.960 0.0564
dlic 13.75 1.837 7.485 0.0000
tax -29.48 10.584 -2.786 0.0078
inc -68.02 17.010 -3.999 0.0002
tidy(fuelFit2)
term estimate std.error statistic p.value
(Intercept) 108.97 171.786 0.6343 0.5291
dlic 12.51 2.091 5.9862 0.0000
tax -32.08 12.197 -2.6297 0.0117
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.6749 0.6527 65.94 30.44 0 4 44
glance(fuelFit2)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.5567 0.537 76.13 28.25 0 3 45

Note that the SEs got larger and the t-statistics got smaller.

So, it can happen that a significant predictor becomes
not significant when another significant predictor is removed.

Also, important predictors can become insignificant if
too many inconsequential variables are in the model.

Not all of modeling can be based on statistical tests!

1.2.6.1 What if the intercept is not significant?

The model we have settled upon is

fuel = dlic + tax + inc

but the intercept is (barely) not significant:

### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term estimate std.error statistic p.value
(Intercept) 307.33 156.831 1.960 0.0564
dlic 13.75 1.837 7.485 0.0000
tax -29.48 10.584 -2.786 0.0078
inc -68.02 17.010 -3.999 0.0002

Even so, we keep the intercept in the model.

The intercept doesn’t really add complexity.
…and keeping it guarantees all the nice properties of least squares estimates.

1.2.6.2 Questions about including predictors …or not

What are the consequences of…

including non-significant variables in the model?
After all, \(R^2\) always improves, so why omit variables?

What are the consequences of…

omitting important variables?
After all, fewer parameters are better for

  • more df for variance estimate
  • less model complexity

1.2.7 Model validation: fuel = dlic + tax + inc

1.2.7.1 Reminder: Model specifications

  1. The errors are uncorrelated
  2. The errors have no systematic info about \(Y\)
    1. \(E(\epsilon_i)=0\) for all \(i\) (for all values of all predictors \(x_j\))
    2. The errors are not correlated with the predictor values
    3. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x_j\))
  3. The mean of \(Y\) depends linearly on \(x_1, x_2, \ldots, x_p\)
  4. The errors have a normal distribution: \(\epsilon \;\sim\; N(0, \sigma^2)\)

1.2.7.2 Checking the residuals against the model specifications

### We estimated the 5 models fuelFitX far above
### Here, we are adding the fitted values and residuals
### for each model to the dataset
fuelDataResids <- fuelData %>%
  mutate(resids0 = residuals(fuelFit0), fitted0 = predict(fuelFit0),
         resids1 = residuals(fuelFit1), fitted1 = predict(fuelFit1),
         resids2 = residuals(fuelFit2), fitted2 = predict(fuelFit2),
         resids3 = residuals(fuelFit3), fitted3 = predict(fuelFit3),
         resids4 = residuals(fuelFit4), fitted4 = predict(fuelFit4)
         )

### Plot residuals vs. fitted values with dashed horizontal line at 0
gf_point(resids3 ~ fitted3, data = fuelDataResids) %>%
  gf_hline(yintercept = 0, linetype = "dashed") %>%
  gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals")

### Draw a histogram of the residuals
p1 <- gf_histogram(~ resids3, data = fuelDataResids, color="white", bins=12) %>%
  gf_labs(x = "Residuals (fuel = dlic + tax + inc)")

### Draw a normal quantile plot of the residuals
p2 <- gf_qq(~ resids3, data = fuelDataResids) %>% gf_qqline()

grid.arrange(p1, p2, ncol=2)

1.2.7.3 Does outlier influence our decision about normality?

Here are the cases with the largest residuals:

### select variables and arrange (sort) the data
temp <- fuelDataResids %>%
  dplyr::select(state, fuel, dlic, tax, inc, resids3) %>%
  arrange(resids3)
### Look at the last few observations (where sorted values are largest)
tail(temp, 8)
state fuel dlic tax inc resids3
41 MT 704 58.6 7 3.897 62.53
42 NM 699 56.3 7 3.656 72.76
43 IN 580 53.0 8 4.391 78.60
44 NV 782 67.2 6 5.215 82.47
45 VA 547 51.7 9 4.258 83.91
46 SD 865 72.4 7 4.716 89.52
47 ND 714 54.0 7 3.718 123.59
48 WY 968 67.2 7 4.345 238.77

The largest residual happens to be for the state of Wyoming.

Here are the residual plots again, with the Wyoming residual removed.

### Use the filter function from dplyr package to filter out Wyoming
temp <- filter(fuelDataResids, state != "WY")

### Plot residuals vs. fitted values with dashed horizontal line at 0
gf_point(resids3 ~ fitted3, data = temp) %>%
  gf_hline(yintercept = 0, linetype = "dashed") %>%
  gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals")

### Remember, Wyoming is removed from data, so that residual is not displayed
### Draw a histogram of the residuals
p1 <- gf_histogram(~ resids3, data = temp, color="white", bins=12) %>%
  gf_labs(x = "Residuals (fuel = dlic + tax + inc)")

### Draw a normal quantile plot of the residuals
p2 <- gf_qq(~ resids3, data = temp) %>% gf_qqline()

grid.arrange(p1, p2, ncol=2)

1.2.8 Inference for multiple linear regression

1.2.8.1 Several types of tests

Can test for specific values of \(\beta_0, \beta_1, \beta_2, \ldots, \beta_p\)

  1. Test one single coefficient = 0 (or test another value)
  2. Test if all coefficients \(\beta_1, \beta_2, \ldots, \beta_p\) simultaneously = 0
    • That is, predictors don’t have much information about response
  3. Test whether some subset of coefficients simultaneously = 0
  4. Other constraints: Test whether…
    • some coefficients equal specific values
    • some coefficients are equal to each other
    • some coefficients sum to a specific value
    • …and so on

These tests are OK if the model specifications are reasonable
and the residuals are plausibly from a normal distribution.

1.2.8.2 1. Test one single coefficient

\[\frac{\widehat{\beta}_j- \beta_j}{\mathrm{se}(\widehat{\beta}_j)} \;\sim\; t_{n-p-1}\]

### We estimated the model "fuelFit3" far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term estimate std.error statistic p.value
(Intercept) 307.33 156.831 1.960 0.0564
dlic 13.75 1.837 7.485 0.0000
tax -29.48 10.584 -2.786 0.0078
inc -68.02 17.010 -3.999 0.0002
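
As a sketch, the p-value for a single coefficient (say, tax) can be reproduced by hand from its t-statistic and the \(t_{44}\) distribution:

### Reproduce the p-value for the tax coefficient (df = n - p - 1 = 44)
t.tax <- tidy(fuelFit3)$estimate[3] / tidy(fuelFit3)$std.error[3]
2 * pt(-abs(t.tax), df = 44)   ### matches p.value = 0.0078 above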

1.2.8.3 2. Test if all coefficients simultaneously = 0

To test multiple coefficients, we use the F-test.

\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
\(H_1:\) At least one of the \(\beta_j \ne 0\).

NOTE: Here, we are not testing \(\beta_0=0\).

Here is the process.

Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\)
(no restrictions on the coefficients)

Compare to the reduced model (RM): \(Y = \beta_0 + \epsilon\)
(all other \(\beta_j=0\) as stated in \(H_0\))

If the model specifications are valid (including Normality),
Then the following comparison statistic has an \(F\) distribution.

\[F \;=\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})}\]

Let’s try it with the fuel data: fuel = dlic + tax + inc

I encourage you to look at the R code.

### We estimated the models "fuelFit0" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you

### This compares reduced (fuelFit0) and full (fuelFit3) models
anova(fuelFit0, fuelFit3)
Res.Df RSS Df Sum of Sq F Pr(>F)
47 588366 NA NA NA NA
44 191302 3 397064 30.44 0

### Now, I'll do things a little more explicitly (manually)
### Use tidy() from dplyr package to put ANOVA output into a table
anovaTable3 <- tidy(anova(fuelFit0, fuelFit3))
Warning: Unknown or uninitialised column: 'term'.

### Grab the various pieces of the F-statistic from the ANOVA table
SSE.RM <- anovaTable3$rss[1]
SSE.FM <- anovaTable3$rss[2]
df.RM <- anovaTable3$res.df[1]
df.FM <- anovaTable3$res.df[2]
### Calculate the F-statistic
F.stat <- ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / ( SSE.FM / df.FM)
### Calculate the p-value of the F-test
p.value <- 1 - pf(F.stat, df.RM - df.FM, df.FM)

\[F \;=\; \frac{(588,366 - 191,302) \; / \; (47 - 44)} {191,302 \;/\; 44} \;=\; 30.44\]

p-value = 0.00000000008235

1.2.8.4 Standard F-test from software

\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
\(H_1:\) At least one of the \(\beta_j \ne 0\).

is the test (and F-statistic) that R automatically calculates for regression.

### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term estimate std.error statistic p.value
(Intercept) 307.33 156.831 1.960 0.0564
dlic 13.75 1.837 7.485 0.0000
tax -29.48 10.584 -2.786 0.0078
inc -68.02 17.010 -3.999 0.0002
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.6749 0.6527 65.94 30.44 0 4 44

1.2.8.5 The ANOVA table and the F test for all \(\beta_j=0\)

(except not testing \(\beta_0=0\))

ANOVA = ANalysis Of VAriance

Source               SS (Sum of Squares)                                 df          MS (Mean Squares)
Model (Regression)   SSR = \(\sum_i (\widehat{y}_i-\overline{y})^2\)     \(p\)       MSR = SSR\(\,/\,p\)
Residual (Error)     SSE = \(\sum_i (y_i-\widehat{y}_i)^2\)              \(n-p-1\)   MSE = SSE\(\,/\,(n-p-1)\) = \(\widehat{\sigma}{^2}\)
Total                SST = \(\sum_i (y_i-\overline{y})^2\) = SSR + SSE   \(n-1\)     MST = SST\(\,/\,(n-1) = \mathrm{var}(y)\)

When \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\),

Reduced model (RM): \(Y = \beta_0 + \epsilon\)
has SSE = \(\sum_i (y_i-\overline{y})^2\) on \(n-1\) df

Full model(FM): \(Y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\)
has SSE = \(\sum_i (y_i-\widehat{y}_i)^2\) on \(n-p-1\) df

and the F-statistic is

\[\begin{aligned} F &= \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})} \\ \\ &=\; \frac{\left[\sum (y_i-\overline{y})^2 - \sum(y_i - \widehat{y}_i)^2 \right] \;/\; [(n-1) - (n-p-1)]}{ \sum(y_i - \widehat{y}_i)^2 \;/\; (n-p-1) } \\ \\ &=\; \frac{(\mathrm{SST}- \mathrm{SSE}) \;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \;=\; \frac{\mathrm{SSR}\;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \;=\; \frac{\mathrm{MSR}}{\mathrm{MSE}} \end{aligned}\]
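
A quick check of \(F = \mathrm{MSR}/\mathrm{MSE}\) for the model fuel = dlic + tax + inc:

### Build the F-statistic from the sums of squares
SSE <- sum(residuals(fuelFit3)^2)
SST <- sum((fuelData$fuel - mean(fuelData$fuel))^2)
SSR <- SST - SSE
(SSR/3) / (SSE/44)   ### matches the F-statistic 30.44 from glance()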

Since the textbook mentions it, here is another way to compute the F-statistic:

\[\begin{aligned} F &=\; \frac{\mathrm{MSR}}{\mathrm{MSE}} \;=\; \frac{\mathrm{SSR}\;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \\ \\ &=\; \frac{1 \;/\; \mathrm{SST}}{1 \;/\; \mathrm{SST}} \times \frac{\mathrm{SSR}\;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \\ \\ &=\; \frac{(\mathrm{SSR}\;/\; \mathrm{SST}) \;/\; p}{(\mathrm{SSE}\;/\; \mathrm{SST}) \;/\; (n-p-1)} \;=\; \frac{R^2_p \;/\; p} {(1 - R^2_p) \;/\; (n-p-1)} \end{aligned}\]

…but there is nothing illuminating (to me) about this formula.

1.2.8.6 Test whether some set of coefficients simultaneously = 0

To test statements about multiple coefficients, we continue to use the F-test.

Let’s use the fuel data: fuel = dlic + tax + inc

\(H_0: \beta_2 = \beta_3 = 0\) (tax and inc not needed in the model)

Here is the process.

Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
(no restrictions on the coefficients)

Compare to the reduced model (RM): \(Y = \beta_0 + \beta_1 x_1 + \epsilon\)
(\(\beta_2=0\) and \(\beta_3=0\) as stated in \(H_0\))

If the model specifications are valid (including Normality),
Then the following comparison statistic has an \(F\) distribution.

df for F-statistic: [df(RM) \(-\) df(FM)] and df(FM)

\[\begin{aligned} F &=\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})} \\ \\ &=\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; (p-q)}{\mathrm{SSE}(\mathrm{FM}) \;/\; (n-p-1)} \\ \\ &=\; \frac{(R^2_p - R^2_q) \;/\; (p-q)}{(1 - R^2_p) \;/\; (n-p-1)} \end{aligned}\]

where \(p\) is the number of predictors in the full model (FM)
and \(q<p\) is the number of predictors in the reduced model (RM)
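
A sketch of this \(R^2\) version of the F-statistic for the test below (full model fuelFit3 with \(p=3\), reduced model fuelFit1 with \(q=1\), and \(n=48\) states):

### Partial F-statistic built from the two R-squared values
R2.full    <- glance(fuelFit3)$r.squared
R2.reduced <- glance(fuelFit1)$r.squared
((R2.full - R2.reduced)/(3 - 1)) / ((1 - R2.full)/(48 - 3 - 1))   ### should match the anova() F below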

OK. Let’s test \(H_0: \beta_2 = \beta_3 = 0\) (tax and inc not needed in the model)


### We estimated the models "fuelFit1" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you

### This compares reduced (fuelFit1) and full (fuelFit3) models
fuelFit1 <- lm(fuel ~ dlic, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
anova(fuelFit1, fuelFit3)
Res.Df RSS Df Sum of Sq F Pr(>F)
46 300919 NA NA NA NA
44 191302 2 109616 12.61 0

1.2.8.7 If test just one \(\beta_j=0\): \(F = t^2\)

fuel = dlic + tax + inc

\(H_0: \beta_3 = 0\) (fuel = dlic + tax)

Here is the process.

Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
(no restrictions on the coefficients)

Compare to the reduced model (RM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
(\(\beta_3=0\) as stated in \(H_0\))

Here is the F-statistic:

### We estimated the models "fuelFit2" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you

### This compares reduced (fuelFit2) and full (fuelFit3) models
### ...and these models only differ by one coefficient (beta_3)
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
anova(fuelFit2, fuelFit3)
Res.Df RSS Df Sum of Sq F Pr(>F)
45 260834 NA NA NA NA
44 191302 1 69532 15.99 0.0002
F.stat <- anova(fuelFit2, fuelFit3)[2,5]

F-statistic = 15.99

…and the t-statistic for the same test

### We estimated the model "fuelFit3" far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term estimate std.error statistic p.value
(Intercept) 307.33 156.831 1.960 0.0564
dlic 13.75 1.837 7.485 0.0000
tax -29.48 10.584 -2.786 0.0078
inc -68.02 17.010 -3.999 0.0002
### t-statistics are in the "statistic" variable of the tidy table
### There are p=3 predictors, so the (p+1) = 4th holds info for X_3 
t.stat <- tidy(fuelFit3)$statistic[4]

t-statistic = -3.999

As we have seen before, \(F_{1,df} = (t_{df})^2\).
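
A quick check using the F.stat and t.stat objects saved above:

### F = t^2 when the test involves a single coefficient
c(F.stat, t.stat^2)   ### both about 15.99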

1.2.8.8 Constraints: Some coefficients equal specific values

With dlic, tax, and inc in the fuel model, could the
marginal effect of tax be 24 fewer gallons per person per year
while the marginal effect of dlic is an increase of 12 gallons per person per year?

\(H_0: \beta_1=12,\; \textrm{and}\;\; \beta_2=-24\)

Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
(no restrictions on the coefficients)

Compare to the reduced model (RM): \(Y = \beta_0 + 12 x_1 -24 x_2 + \beta_3 x_3 + \epsilon\)
(\(\beta_1=12\) and \(\beta_2=-24\) as stated in \(H_0\))

Fit both the reduced and full models and compare them using the anova() function.


### The offset() function lets you bring in specific values for coef
lmfit.RM <- lm(fuel ~ offset(12*dlic) + offset(-24*tax) + inc, data=fuelData)
### This model has no constraints (full model)
lmfit.FM <- lm(fuel ~ dlic + tax + inc, data=fuelData)

### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you

### This compares reduced (lmfit.RM) and full (lmfit.FM) models
anova(lmfit.RM, lmfit.FM)
Res.Df RSS Df Sum of Sq F Pr(>F)
46 198266 NA NA NA NA
44 191302 2 6964 0.8008 0.4554


Or use the linearHypothesis function from the car package:

### The kable function is technically not needed (but makes output look nice)
kable(linearHypothesis(fuelFit3, c("dlic = 12", "tax = -24")))
Res.Df RSS Df Sum of Sq F Pr(>F)
46 198266 NA NA NA NA
44 191302 2 6964 0.8008 0.4554
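
An equivalent way to fit the reduced model (a sketch, not in the original code) is to move the constrained terms to the left-hand side and regress the adjusted response on the remaining predictor; it should reproduce the RSS of lmfit.RM above (198266):

### Move the constrained terms to the left-hand side:
### fuel - 12*dlic + 24*tax = beta0 + beta3*inc + error
lmfit.RM2 <- lm(I(fuel - 12*dlic + 24*tax) ~ inc, data = fuelData)
deviance(lmfit.RM2)     ### same residual sum of squares as lmfit.RM (198266)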

1.2.8.9 Constraints: Some coefficients equal each other

In the height data, perhaps the coefficients for fathers and for mothers are the same.

### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006

\(H_0: \beta_1=\beta_2\)

Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
(no restrictions on the coefficients)

Compare to the reduced model (RM): \(Y = \beta_0 + \beta_1 x_1 + \beta_1 x_2 + \epsilon = \beta_0 + \beta_1 (x_1 + x_2) + \epsilon\)
(\(\beta_1=\beta_2\) as stated in \(H_0\))

Fit both the reduced and full models and compare them using the anova() function.


### The I() function lets you redefine a predictor 
### In this case, the new predictor is the sum of two other predictors
lmfit.RM <- lm(height ~ I(father + mother), data=heightDataSons)
### This model has no constraints (full model)
lmfit.FM <- lm(height ~ father + mother, data=heightDataSons)

### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you

### This compares reduced (lmfit.RM) and full (lmfit.FM) models
anova(lmfit.RM, lmfit.FM)
Res.Df RSS Df Sum of Sq F Pr(>F)
171 828.3 NA NA NA NA
170 815.3 1 12.99 2.708 0.1017


Or use the linearHypothesis function from the car package:

### The kable function is technically not needed (but makes output look nice)
kable(linearHypothesis(lmfit.FM, c("father = mother")))
Res.Df RSS Df Sum of Sq F Pr(>F)
171 828.3 NA NA NA NA
170 815.3 1 12.99 2.708 0.1017

So (p-value \(\approx\) 0.10), it is plausible that the two coefficients are equal.
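
One more way to get the same test (a sketch, not in the original notes): reparametrize the full model so that one coefficient directly measures the difference \(\beta_1 - \beta_2\); the squared t-statistic for that coefficient equals the F-statistic above (2.708).

### Reparametrize: beta1*x1 + beta2*x2 = beta2*(x1 + x2) + (beta1 - beta2)*x1
### The t-test on "father" below tests H0: beta1 = beta2 (its t^2 equals the F above)
heightFit.reparam <- lm(height ~ I(father + mother) + father, data = heightDataSons)
tidy(heightFit.reparam)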

1.2.9 Which measure of overall model fit can/should be used when?

| number of predictors | nested models | non-nested models |
|----------------------|---------------|-------------------|
| different number | F-test (see note 1), adjusted \(R^2\), or \(R^2\) (see note 2) | adjusted \(R^2\) or \(R^2\) (see note 2) |
| same number | not applicable | adjusted \(R^2\) (see note 3) or \(R^2\) |

Note 1: The F-test has the advantage of providing a measure of uncertainty (a p-value)

Note 2: \(R^2\) always increases as more predictors are added to the model, so adjusted \(R^2\) is preferred for model comparison. On the other hand, \(R^2\) is an interpretable quantity: the amount of variation in the response explained by linear regression on the predictor(s).

Note 3: When two models with the same number of predictors \(p\) (and the same \(n\)) are compared, adjusted \(R^2 = 1 - C\,(1 - R^2)\) with the same constant \(C = (n-1)/(n-p-1)\) for both models. Adjusted \(R^2\) is therefore an increasing function of \(R^2\), so comparing either measure provides equivalent information.
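
For a concrete look at these measures (a sketch, not in the original code; glance() is the broom companion to tidy() and returns one row of overall fit summaries per model):

### Overall fit measures for the nested fuel models compared earlier
glance(fuelFit1)[, c("r.squared", "adj.r.squared")]
glance(fuelFit3)[, c("r.squared", "adj.r.squared")]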

1.2.10 Predictions

For families with a father 70 inches tall and mother 66 inches tall,
estimate the average of their sons’ heights (and a confidence interval for the true average).

Using the full model: height = father + mother,

### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.both)
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006

### Set up a little dataset to hold the predictor values for an estimate
newdata <- data.frame(father = 70, mother = 66)
### The predict() function gives you the estimate and a confidence interval
### The kable function is technically not needed (but makes output look nice)
kable(predict(heightFit.both, newdata, interval="confidence"))
fit lwr upr
69.9 69.46 70.34
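
The point estimate is just the fitted equation evaluated at father = 70 and mother = 66; a quick hand-check (not part of the original code):

### 24.4022 + 0.4162*70 + 0.2480*66 = 69.90, matching "fit" above
sum(coef(heightFit.both) * c(1, 70, 66))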

For one family with a father 70 inches tall and mother 66 inches tall,
estimate their son’s height (and a prediction interval for the actual height).

Using the full model: height = father + mother,

### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)

### Set up a little dataset to hold the predictor values for a prediction
newdata <- data.frame(father = 70, mother = 66)
### The predict() function gives you the estimate and a prediction interval
### The kable function is technically not needed (but makes output look nice)
kable(predict(heightFit.both, newdata, interval="prediction"))
fit lwr upr
69.9 65.56 74.25

Why is the prediction interval wider than the confidence interval?

1.2.11 Centering and Scaling

How do the coefficients change if we change the units of observation?

1.2.11.1 Height data: Change sons’ units to centimeters

### Add sons' heights in cm to the dataset
cm <- 2.54
heightDataSons.cm <- heightDataSons %>%
  mutate(height.cm = height*cm)

### Fit the new linear model (response in cm instead of inches)
heightFit.both.cm <- lm(height.cm ~ father + mother, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.coef <- tidy(heightFit.both.cm)

Regression with original data:

### This table of coefficients was saved earlier
heightFit.both.coef
term estimate std.error statistic p.value
(Intercept) 24.4022 6.0422 4.039 0.0001
father 0.4162 0.0661 6.294 0.0000
mother 0.2480 0.0706 3.511 0.0006

Son’s heights changed to centimeters:

### This table of coefficients was saved earlier
heightFit.both.cm.coef
term estimate std.error statistic p.value
(Intercept) 61.9816 15.3472 4.039 0.0001
father 1.0571 0.1679 6.294 0.0000
mother 0.6299 0.1794 3.511 0.0006

What is the ratio of the new coefficients and SEs compared to the old ones?

### These tables of coefficients were saved earlier
### Coefficients in column 2, SEs in column 3
### Intercept in row 1
### Predictors in rows 2 ... to p
heightFit.both.coef[1:3, 2:3]
estimate std.error
24.4022 6.0422
0.4162 0.0661
0.2480 0.0706
heightFit.both.cm.coef[1:3, 2:3]
estimate std.error
61.9816 15.3472
1.0571 0.1679
0.6299 0.1794
heightFit.both.cm.coef[1:3, 2:3] /  heightFit.both.coef[1:3, 2:3]
estimate std.error
2.54 2.54
2.54 2.54
2.54 2.54

Rather confusingly, the mothers’ and fathers’ heights are still in inches.

So, the coefficients are in units of cm/inch.

Of course, we can change all units to cm (and this better not affect our analysis at all!)

### Add fathers' and mothers' heights (in cm) to the dataset
heightDataSons.cm <- heightDataSons.cm %>%
  mutate(father.cm = father*cm, mother.cm = mother*cm)

### Fit a new linear model with all variables in cm (not inches)
heightFit.both.cm.cm <- lm(height.cm ~ father.cm + mother.cm, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.cm.coef <- tidy(heightFit.both.cm.cm)
heightFit.both.cm.cm.coef
term estimate std.error statistic p.value
(Intercept) 61.9816 15.3472 4.039 0.0001
father.cm 0.4162 0.0661 6.294 0.0000
mother.cm 0.2480 0.0706 3.511 0.0006
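
In general, rescaling the response \(y \to c\,y\) multiplies every coefficient (and SE) by \(c\), while rescaling a predictor \(x_j \to c\,x_j\) divides \(\widehat{\beta}_j\) (and its SE) by \(c\). With everything in cm the two effects cancel: \(1.0571 / 2.54 = 0.4162\), exactly the coefficient from the all-inches fit.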

1.2.11.2 A general solution to scaling: Standardize the variables

\[\begin{aligned} \widetilde{y} \;=\; \frac{y - \overline{y}}{s_y} \quad {\rm and} \quad \widetilde{x}_j \;=\; \frac{x_j - \overline{x}_j}{s_j} \quad {\rm for}\; j=1,2,\ldots, p \end{aligned}\]
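
As an aside (a sketch, not in the original code), base R’s scale() function performs exactly this centering-and-dividing-by-SD operation:

### scale() subtracts the mean and divides by the SD, just like the formula above
### Equivalent to (height - mean(height)) / sd(height)
height.std.check <- as.numeric(scale(heightDataSons$height))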

1.2.11.3 First, let’s just center the data

\[Y - \overline{y}\;=\; \beta_0 + \beta_1 (x_1 - \overline{x}_1) + \beta_2 (x_2 - \overline{x}_2) + \cdots + \beta_p (x_p - \overline{x}_p) + \epsilon \]

We are still using the modified data:
sons’ heights in cm, fathers’ and mothers’ heights in inches

### Add centered variables to the predictors (sons' in cm, others in inches)
heightDataSons.ctr <- heightDataSons.cm %>%
  mutate(height.ctr = height.cm - mean(height.cm), 
         father.ctr = father - mean(father),
         mother.ctr = mother - mean(mother) 
        )

### Estimate the multiple linear model with centered data
heightFit.both.ctr <- lm(height.ctr ~ father.ctr + mother.ctr, data=heightDataSons.ctr)

Regression with uncentered data:

### This table of coefficients was saved earlier
tidy(heightFit.both.cm)
term estimate std.error statistic p.value
(Intercept) 61.9816 15.3472 4.039 0.0001
father 1.0571 0.1679 6.294 0.0000
mother 0.6299 0.1794 3.511 0.0006

Regression after all variables centered:

### This table of coefficients was saved earlier
tidy(heightFit.both.ctr)
term estimate std.error statistic p.value
(Intercept) 0.0000 0.4229 0.000 1.0000
father.ctr 1.0571 0.1679 6.294 0.0000
mother.ctr 0.6299 0.1794 3.511 0.0006

The intercept for the regression with centered data is zero. Why?

Hint: When we use least squares to estimate \(\beta_0\) for a linear model like this

\[Y \;=\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\] \[ {\rm then,}\; \widehat{\beta}_0 \;=\; \overline{y}- \sum_{j=1}^p \widehat{\beta}_j\overline{x}_j\]

The other regression coefficients (and their SEs) are the same as when not centered.

1.2.11.4 Now, let’s standardize the data

### Add standardized variables to the dataset
heightDataSons.cm.std <- heightDataSons.cm %>%
  mutate(height.std = (height.cm - mean(height.cm)) / sd(height.cm), 
         father.std = (father - mean(father)) / sd(father),
         mother.std = (mother - mean(mother)) / sd(mother) 
        )

### Estimate linear model with standardized response and predictor variables
heightFit.both.cm.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.cm.std)

Regression with unstandardized data:

### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm)
term estimate std.error statistic p.value
(Intercept) 61.9816 15.3472 4.039 0.0001
father 1.0571 0.1679 6.294 0.0000
mother 0.6299 0.1794 3.511 0.0006

Regression after all variables standardized:

### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm.std)
term estimate std.error statistic p.value
(Intercept) 0.0000 0.0661 0.000 1.0000
father.std 0.4198 0.0667 6.294 0.0000
mother.std 0.2342 0.0667 3.511 0.0006

This last regression is the same whether we measured in cm or inches. Why?

Regression with unstandardized data (sons’ heights in cm, parents’ heights in inches):

### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm)
term estimate std.error statistic p.value
(Intercept) 61.9816 15.3472 4.039 0.0001
father 1.0571 0.1679 6.294 0.0000
mother 0.6299 0.1794 3.511 0.0006

Regression after all variables standardized:

### Add standardized variables to the dataset
heightDataSons.std <- heightDataSons %>%
  mutate(height.std = (height - mean(height)) / sd(height), 
         father.std = (father - mean(father)) / sd(father),
         mother.std = (mother - mean(mother)) / sd(mother) 
        )
### Estimate the linear model using standardized response and predictor vars
heightFit.both.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.std)
### View the coef, SEs, t-stats, p-values from the linear regression
tidy(heightFit.both.std)
term estimate std.error statistic p.value
(Intercept) 0.0000 0.0661 0.000 1.0000
father.std 0.4198 0.0667 6.294 0.0000
mother.std 0.2342 0.0667 3.511 0.0006

We don’t really gain anything from standardizing the original data since all variables
are already in the same units (inches) for the same characteristic (height).

Plus, it seems that the standard deviations of the original heights
were nearly the same already for each variable (before standardizing).

\(s_y = 2.521\)
\(s_{x1} = 2.543\)
\(s_{x2} = 2.381\)

So, the coefficients changed a little, but their interpretation is very different.

Now, for fathers one std deviation taller, their sons are expected
to be \(0.4198\) standard deviations taller (adjusted for the mothers’ heights).
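
The standardized coefficients are related to the original ones by \(\widehat{\beta}^{\,\mathrm{std}}_j \;=\; \widehat{\beta}_j \, s_{x_j} / s_y\). A quick check with the standard deviations listed above: \(0.4162 \times 2.543 / 2.521 = 0.4198\) for fathers and \(0.2480 \times 2.381 / 2.521 = 0.2342\) for mothers, matching the standardized output.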

1.2.12 Properties of LS estimators

1.2.12.1 Reminder: Model specifications

  1. The errors are uncorrelated
  2. The errors have no systematic info about \(Y\)
     a. \(E(\epsilon_i)=0\) for all \(i\) (for all values of all predictors \(x_j\))
     b. The errors are not correlated with the predictor values
     c. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x_j\))
  3. The mean of \(Y\) depends linearly on \(x_1, x_2, \ldots, x_p\)
  4. The errors have a normal distribution: \(\epsilon \;\sim\; N(0, \sigma^2)\)

1.2.12.2 Facts about LS estimators

  • \(\widehat{\beta}_j\)s unbiased: \(\mathrm{E}(\widehat{\beta}_j) = \beta_j\) for \(j=0,1,2,\ldots,p\)
  • \(\mathrm{var}(\widehat{\beta}_j) = c_j \sigma^2\)
    • let R compute the \(c_j\)s (requires matrix algebra)
  • \(\widehat{\beta}_j\sim\) Normal(\(\,\beta_j, \sqrt{c_j}\sigma\))
  • \(W = \mathrm{SSE}/\sigma^2\) is chi-square (\(\chi^2\)) with \(\mathrm{df}= n - p - 1\)
  • \(\widehat{\beta}_j\)s are independent of \(\widehat{\sigma}\) (spread of residuals does not affect slope)
  • BLUE: Best Linear Unbiased Estimators (minimum variance)
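
As an aside (a sketch, not in the original code), R’s vcov() makes the \(c_j\)s concrete: for an lm fit, vcov() returns \(\widehat{\sigma}^2 (X'X)^{-1}\), whose diagonal entries are \(c_j \widehat{\sigma}^2\).

### The c_j's in practice, using the height model fit earlier
sigma2.hat <- summary(heightFit.both)$sigma^2
c.j <- diag(vcov(heightFit.both)) / sigma2.hat       ### the c_j's
sqrt(c.j * sigma2.hat)                               ### = the SEs reported by tidy()/summary()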

1.2.13 Imputing a missing value

### Pick a case to treat as missing, and record the sample size
i <- 5
n <- dim(heightDataSons)[1]

Suppose that for the 5th family in the data, the mother’s height was not recorded.

We could just throw out case \(i = 5\).

…but what if a mother’s or father’s height is missing for several families?
Should we just throw out the information on the other parent so easily?

One solution: imputation of missing values.

This can be done many ways.
We will only discuss two.

1.2.13.1 Option 1: Replace missing value with average

For example, replace the missing 5th mother’s height with the average height of the other \((n-1) = 172\) mothers’ heights.

### Average of the other 172 mothers' heights (excluding case 5)
kable(mean(~ mother[-5], data=heightDataSons))
x
63.95
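
A sketch of actually carrying out the replacement (hypothetical code, not in the original; it pretends the 5th mother’s height is missing and fills it in with this average):

### Replace the "missing" value with the mean of the other mothers' heights
heightDataSons.imp <- heightDataSons
heightDataSons.imp$mother[i] <- mean(heightDataSons$mother[-i])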

1.2.13.2 Option 2: Replace missing value with regression prediction

  • Temporarily let mother’s height be the response variable.
  • Let son’s and father’s heights be the predictors.
  • Remove case \(i = 5\) from the data (missing the mother’s height).
  • Regress mother’s height on both son’s and father’s height
  • Predict mother’s height for case \(i = 5\) using the son’s and father’s heights from case \(i = 5\)
### The case index whose mother's height is treated as missing
i
[1] 5
### Fit the regression of mother's height on father's and son's heights, excluding case i
temp <- heightDataSons[-i, ]
lmfit <- lm(mother ~ father + height, data = temp)
### Predict the missing mother's height from case i's father and son heights
newdata <- data.frame(father = heightDataSons$father[i], height = heightDataSons$height[i])
newdata <- mutate(newdata, mother = predict(lmfit, newdata))
kable(newdata)
father height mother
69 71 64.49