If there are other variables that may help us predict \(Y\),
we could extend our simple model to include them.
\[Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p\] …and then, allow for error (statistical model) \[Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_px_p + \epsilon\]
The following are also multiple linear regression models
\[ \begin{aligned} Y &\;=\; \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \epsilon \end{aligned} \]
\(Y\) is approximately a linear function of \(x\)’s
and \(\epsilon\) measures the discrepancy of the approximation
As before, \(\epsilon\) contains no systematic information about \(Y\)
that is not already captured by the \(x\)’s
The \(\beta\)’s are called partial regression coefficients,
or just regression coefficients
We will endeavor to provide you with tidy data:
observation | response \(y\) | predictor \(x_1\) | \(x_2\) | … | \(x_p\) |
---|---|---|---|---|---|
1 | \(y_1\) | \(x_{11}\) | \(x_{12}\) | … | \(x_{1p}\) |
2 | \(y_2\) | \(x_{21}\) | \(x_{22}\) | … | \(x_{2p}\) |
… | … | … | … | … | … |
… | … | … | … | … | … |
\(n\) | \(y_n\) | \(x_{n1}\) | \(x_{n2}\) | … | \(x_{np}\) |
\(x_{ij}\) = observation for unit \(i\) on variable \(x_j\)
First, I cleaned up the data. Click the Code
button to see my R code.
### For these data, I added some categorical variables (for plotting)
### and reduced the data to one (random) son per family.
### It is not expected that you would need to write code like what's in this "chunk"
### Add categorical variables for heights of moms and dads
cuts <- c(61, 66, 68, 70, 72, 74, 80)
midpoints <- c(65, 67, 69, 71, 73, 75)
heightDataAll <- Galton %>%
mutate(father2grp = factor(father > median(father),
labels = c("Father Short", "Father Tall"))) %>%
mutate(mother2grp = factor(mother > median(mother),
labels = c("Mother Short", "Mother Tall"))) %>%
mutate(father4grp = factor(cut(father, 4),
labels = c("Father Short", "Father Average", "Father Tall", "Father Very Tall"))) %>%
mutate(mother4grp = factor(cut(mother, 4),
labels = c("Mother Short", "Mother Average", "Mother Tall", "Mother Very Tall"))) %>%
mutate(fathergrp = factor(cut(father, cuts), labels = midpoints))
### Reduce the data to one (random) son per family
### If you set the "seed" for the random number generator first,
### then you get the same "random" results each time you compile the doc.
### This keeps you sane.
### The identical dataset is generated every time I compile this document.
set.seed(12345)
heightDataSonsGroups <- heightDataAll %>%
filter(sex == "M") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
dplyr::select(-family, -sex)
heightDataSons <- heightDataSonsGroups %>%
dplyr::select(height, father, mother, nkids)
### Now each family (row) is one "case"
### and each case has 3 observed variables: heights of the father, mother, and one son, plus number of children in the family (nkids)
### ...plus some categorical values I created for heights of fathers and mothers
n <- dim(heightDataSons)[1]
### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg
ggplot(heightDataSonsGroups, aes(x=fathergrp, y=height)) +
geom_boxplot() +
labs(x="Father's Height (inches)", y="Son's Height (inches)") +
stat_summary(fun.y=mean, geom="line", aes(group=1)) +
stat_summary(fun.y=mean, geom="point", size=3, color="red")
### Fit a LS regression line
heightFit <- lm(height ~ father, data=heightDataSons)
### Add the fitted values and residuals to the data
heightDataFitSons <- mutate(heightDataSons,
yhat = fitted(heightFit),
resid = residuals(heightFit)
)
A look at the residuals:
### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg
ggplot(heightDataFitSons, aes(x = father, y = height)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + ### Plot regression slope
geom_segment(aes(xend = father, yend = yhat), alpha = .2) + ### alpha to fade lines
geom_point() +
geom_point(aes(y = yhat), shape = 1) +
theme_bw() ### Add theme for cleaner look
… and a residual plot:
gf_point(resid ~ father, data=heightDataFitSons) %>%
gf_hline(yintercept = 0)
Output from the tidy
function in the broom
package.
tidy(heightFit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 38.4008 | 4.6882 | 8.191 | 0 |
father | 0.4432 | 0.0678 | 6.535 | 0 |
What is the value of the least squares regression intercept?
What is the meaning of the intercept for these data?
What is the value of the least squares regression slope?
What is the meaning of the slope for these data?
Output from the glance
function in the broom
package.
glance(heightFit)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual |
---|---|---|---|---|---|---|---|---|---|---|
0.1998 | 0.1952 | 2.261 | 42.71 | 0 | 2 | -385.6 | 777.3 | 786.7 | 874.5 | 171 |
In this course, we will not likely discuss four of the statistics that glance
provides: logLik
, AIC
, BIC
, and deviance
.
So, I will restrict the output for lectures:
glance(heightFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.1998 | 0.1952 | 2.261 | 42.71 | 0 | 2 | 171 |
Click the Code
button to see a summary of the regression output using base R.
heightFit <- lm(height ~ father, data=heightDataSons)
### Here is the base R summary of a linear model result
summary(heightFit)
Call:
lm(formula = height ~ father, data = heightDataSons)
Residuals:
Min 1Q Median 3Q Max
-5.978 -1.535 0.022 1.522 6.022
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.4008 4.6882 8.19 5.8e-14 ***
father 0.4432 0.0678 6.54 7.0e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.26 on 171 degrees of freedom
Multiple R-squared: 0.2, Adjusted R-squared: 0.195
F-statistic: 42.7 on 1 and 171 DF, p-value: 0.000000000702
What is the value of \(R^2\) and what does it mean?
What hypothesis is tested using this p-value?
Hint: Compare the t-statistics above to the F-statistic.
### Plot not displayed.
### Just playing around with the full dataset (out of curiosity)
gf_point(height ~ father, color = ~ sex, size = ~ nkids, data = heightDataAll)
### Plots not displayed.
### I was just playing around with various ggplot2 graphics functions.
### Mostly drawing means (and lines between means) with boxplots
### Who knows, someday you might want to show something similar at your summer internship!
x <- factor(rep(1:10, 100))
y <- rnorm(1000)
df <- data.frame(x=x, y=y)
ggplot(df, aes(x=x, y=y)) +
geom_boxplot() +
stat_summary(fun.y=mean, geom="line", aes(group=1)) +
stat_summary(fun.y=mean, geom="point")
qplot(father2grp, height, data=heightDataSons, geom="boxplot",
xlab="Father's Height", ylab="Son's Height") +
stat_summary(fun.y=mean, geom="line", aes(group=1)) +
stat_summary(fun.y=mean, geom="point", size=3)
gf_boxplot(height ~ father2grp, data=heightDataSons) +
labs(x="Father's Height", y="Son's Height") +
stat_summary(fun.y=mean, geom="line", aes(group=1)) +
stat_summary(fun.y=mean, geom="point", size=3)
ggplot(heightDataSons, aes(x=father2grp, y=height)) +
geom_boxplot() +
stat_summary(fun.y=mean, geom="line", aes(group=1)) +
stat_summary(fun.y=mean, geom="point", size=3, color="red")
ggplot(heightDataSons, aes(x=father4grp, y=height)) +
geom_boxplot() +
stat_summary(fun.y=mean, geom="line", aes(group=1)) +
stat_summary(fun.y=mean, geom="point", size=3, color="red")
gf_histogram(~ height | father2grp, bins=12, data=heightDataSons, color="white") +
labs(x="Son's Height", y="Number of Sons")
ggplot(heightDataSons, aes(x = height)) +
geom_histogram(bins=12, color = "white") +
facet_wrap(~ father2grp, ncol=1)
gf_boxplot(height ~ father4grp, data=heightDataSons) +
labs(x="Father's Height", y="Son's Height")
gf_violin(height ~ father4grp, draw_quantiles=c(0.50), data=heightDataSons) +
labs(x="Father's Height", y="Son's Height")
Could there be other information in the data
that can help predict a son’s height from family characteristics?
### Look at the first 10 observations in the data
head(heightDataSons, 10)
height | father | mother | nkids |
---|---|---|---|
73.2 | 78.5 | 67.0 | 4 |
70.0 | 69.0 | 66.0 | 3 |
67.0 | 69.0 | 66.7 | 6 |
68.0 | 69.0 | 66.0 | 6 |
71.0 | 69.0 | 66.5 | 5 |
70.5 | 69.5 | 66.5 | 4 |
71.0 | 69.0 | 66.5 | 6 |
70.5 | 69.5 | 66.0 | 7 |
69.0 | 69.0 | 66.0 | 9 |
67.0 | 69.0 | 65.0 | 7 |
Start with a scatterplot matrix.
Using base R:
### Scatterplot matrix in base R
pairs(heightDataSons, pch=20)
### Table of correlations between all variables in the data
cor(heightDataSons)
height father mother nkids
height 1.00000 0.44704 0.2830 -0.07661
father 0.44704 1.00000 0.1162 0.02625
mother 0.28299 0.11623 1.0000 0.09500
nkids -0.07661 0.02625 0.0950 1.00000
Using the GGally
package:
### Scatterplot matrix using GGally package
ggpairs(heightDataSons)
…and again getting fancier with the GGally
package:
This adds simple linear regression lines and confidence bands to each plot.
### I stumbled across this while Googling around about ggpairs function. Nifty.
ggpairs(heightDataSons, lower = list(continuous = wrap("smooth", method = "lm")))
Father’s height is more strongly correlated with
son’s height than mother’s height is.
Number of children in family doesn’t seem to be related to anyone’s height
… not surprising?
Fathers and mothers have fairly low height correlation
…a bit surprising?
Then, perhaps the mothers bring “significant” new information about the son’s height
that is not already captured by father’s height.
\(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
where
\(x_1\) = father’s height
\(x_2\) = mother’s height
\(x_3\) = number of children in the family
### Fit a multiple linear regression model with least squares
heightFit.all <- lm(height ~ father + mother + nkids, data=heightDataSons)
### Look at the results using functions in the broom package
tidy(heightFit.all)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.1166 | 6.0128 | 4.011 | 0.0001 |
father | 0.4179 | 0.0658 | 6.353 | 0.0000 |
mother | 0.2589 | 0.0706 | 3.670 | 0.0003 |
nkids | -0.1077 | 0.0643 | -1.675 | 0.0957 |
glance(heightFit.all)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.2661 | 0.2531 | 2.179 | 20.43 | 0 | 4 | 169 |
Surprisingly, the number of children in the family
is a moderately significant predictor.
For now, I will leave it out of the model.
\(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
where
\(x_1\) = father’s height
\(x_2\) = mother’s height
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
As we did with one predictor,
we estimate the regression coefficients by “least squares”
Minimize the sum of squared residuals.
The estimated error (estimate of \(\epsilon_i\)) is called a residual (\(e_i\))
\[\begin{aligned} {\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\ &\;=\; {\rm observed - expected} \\ &\;=\; {\rm observed - predicted} \\ \\ &\;=\; y_i - \widehat{y}_i\\ \\ &\;=\; y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2} + \cdots + \widehat{\beta}_px_{ip}) \end{aligned}\]
\[SSE \;=\; \sum_i e_i^2
\;=\; \sum_i (y_i - \widehat{\beta_0} - \widehat{\beta_1} x_{i1} - \widehat{\beta_2} x_{i2} - \cdots - \widehat{\beta}_px_{ip})^2\] We can minimize this function with respect to \(\widehat{\beta_0}, \widehat{\beta_1}, \widehat{\beta_2}, \ldots, \widehat{\beta}_p\)
to find the least squares estimates of the model coefficients.
The formulas get a bit crazy for \(p > 1\) predictors.
After taking partial derivatives wrt each \(\widehat\beta_j\)
we need to solve \((p+1)\) linear equations for \((p+1)\) unknowns.
This is all best represented in matrix form (…linear algebra).
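This is beyond what we need for the course, but here is a minimal sketch of that matrix solution (the "normal equations"), applied to the two-predictor height model fit above:
### Solve the (p+1) normal equations directly and compare to lm()
### (a sketch only; lm() uses a more numerically stable method)
X <- model.matrix(heightFit.both)   ### n-by-(p+1) design matrix: (1, father, mother)
y <- heightDataSons$height
solve(t(X) %*% X, t(X) %*% y)       ### ...should match coef(heightFit.both)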
At least the intercept has a simple formula.
With one predictor: \(Y \;=\; \beta_0 + \beta_1 x + \epsilon\),
recall that \(\widehat{\beta_0} \;=\; \overline{y}- \widehat{\beta_1} \overline{x}\).
For multiple linear regression,
\[\widehat{\beta_0} \;=\; \overline{y}- \widehat{\beta_1} \overline{x}_1 - \widehat{\beta_2} \overline{x}_2 - \cdots - \widehat{\beta}_p\overline{x}_p \;=\; \overline{y}- \sum_{j=1}^p \widehat\beta_j \overline{x}_j\]
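A quick numerical check of this formula, using the two-predictor height model (heightFit.both) fit above:
### Intercept = ybar - b1*x1bar - b2*x2bar
b <- coef(heightFit.both)
with(heightDataSons, mean(height) - b["father"]*mean(father) - b["mother"]*mean(mother))
### ...should equal b["(Intercept)"]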
As with all least squares methods, we estimate \(\mathrm{var}(\epsilon)\) as
\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{\mathrm{df}}\]
degrees of freedom (df)
= sample size \(-\) number of parameters estimated for the mean (\(\beta_0, \beta_1, \beta_2, \ldots, \beta_p\))
= \(n-(p+1) \;=\; (n - p - 1)\)
\[\widehat\sigma^2 \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n - p - 1}\]
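Here is a small check of this estimate for the two-predictor height model, using the n defined earlier and p = 2 predictors:
### sigma-hat = sqrt( SSE / (n - p - 1) )
SSE <- sum(residuals(heightFit.both)^2)
sqrt(SSE / (n - 2 - 1))
### ...should match sigma(heightFit.both) and the "sigma" column of glance(heightFit.both)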
lm
= “linear model” function
lm(y ~ x1 + x2 + x3 + x4, data = mydata)
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
### Grab the coefficients (column 2) for 1st two predictors (rows 2 & 3)
### (The intercept information is in row 1)
b.both.fathers <- heightFit.both.coef$estimate[2]
b.both.mothers <- heightFit.both.coef$estimate[3]
One interpretation:
Among mothers of a particular height, if they each had a child with a man 1 inch taller instead,
we expect their sons would have been, on average, 0.4162 inches taller.
Among fathers of a particular height, if they each had a child with a woman 1 inch taller instead,
we expect their sons would have been, on average, 0.248 inches taller.
Another interpretation:
Fathers 1 inch taller than others tend to have sons about 0.4162 inches taller,
on average (after adjusting for the height of the mother).
Mothers 1 inch taller than others tend to have sons about 0.248 inches taller,
on average (after adjusting for the height of the father).
Yet another interpretation (much like the first):
Holding the height of the mother constant, fathers 1 inch taller than others tend to have sons about 0.4162 inches taller,
on average.
Holding the height of the father constant, mothers 1 inch taller than others tend to have sons about 0.248 inches taller,
on average.
What if we only use father’s height as a predictor?
(Simple linear regression)
### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.fathers.coef <- tidy(heightFit.fathers)
heightFit.fathers.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 38.4008 | 4.6882 | 8.191 | 0 |
father | 0.4432 | 0.0678 | 6.535 | 0 |
### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.fathers.coef <- tidy(heightFit.fathers)
### Grab the coefficient (column 2) for the predictor (row 2)
### (The intercept information is in row 1)
b.fathers <- heightFit.fathers.coef$estimate[2]
Why did the coefficient for father’s height change from 0.4162 to 0.4432?
For the simple linear regression, the interpretation is different.
Men 1 inch taller than others tend to have sons about 0.4432 inches taller,
on average (without taking into account the height of the mother).
\(\beta_1\) is the slope of the simple linear regression function.
Slope is the change in average \(Y\) associated with
a one-unit increase of \(x\).
\[ \mathrm{E}(Y^{old})={\beta_0}+\beta_1x^{old} \quad \textrm{ and} \] \[ \mathrm{E}(Y^{new})=\beta_0+\beta_1(x^{old}+1), \] then \[ \mathrm{E}(Y^{new})-\mathrm{E}(Y^{old}) \;=\; \beta_0+\beta_1(x^{old}+1)-(\beta_0+\beta_1x^{old}) \;=\; \beta_1 \]
Consider a multiple regression with two predictors:
\[ \mathrm{E}(Y^{old}) \;=\; {\beta_0}+\beta_1x_1^{old}+\beta_2x_2 \quad \textrm{ and} \] \[ \mathrm{E}(Y^{new}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2, \] then \[ \mathrm{E}(Y^{new})-\mathrm{E}(Y^{old}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2-(\beta_0+\beta_1x_1^{old}+\beta_2x_2) \;=\; \beta_1. \]
…and this math only works if \(x_2\) does not change when \(x_1\) increases by one unit.
Interpretation:
\(\beta_1\) is the expected change in \(Y\) when \(x_1\) increases by 1 unit,
with all other predictors (\(x_2\) in this example) unchanged (but present in the model).
Another interpretation:
The expected change in \(Y\) when \(x_1\) increases by 1 unit,
adjusting for all other predictors.
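A small illustration of this interpretation with the height model fit above: bump father's height by 1 inch while holding mother's height fixed, and the fitted value changes by exactly the father coefficient.
### Two hypothetical families: same mother height, fathers differ by 1 inch
two.families <- data.frame(father = c(70, 71), mother = c(66, 66))
diff(predict(heightFit.both, newdata = two.families))
### ...equals coef(heightFit.both)["father"]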
NOTE: The following discussion is just for a pedagogical purpose – another way to view things.
This is NOT how we calculate coefficients for multiple regression.
For a model with two predictors
\[\widehat{y}_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\] \(\beta_1\) represents the effect of \(x_1\) on the mean of \(Y\)
after taking out the (linear) effects of \(x_2\) from both \(Y\) and \(x_1\)
The estimated slope for Model #3 = \(\widehat{\beta_1}\) for the multiple regression.
What? This is NOT obvious. Let’s investigate.
Regress \(Y\) on \(x_2\)
\[y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}\] This model estimate “grabs” all the (linear) information about \(Y\) that \(x_2\) has to offer.
The residuals leftover (\(e_{i; y, x_2}\)) are not correlated with \(x_2\).
Regress \(x_1\) on \(x_2\)
\[x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}\] This model estimate grabs all the (linear) information about \(x_1\) that \(x_2\) has to offer.
The residuals leftover (\(e_{i; x_1, x_2}\)) are not correlated with \(x_2\).
Regress residuals: Model #1 on Model #2
\[e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}\] How do we know the intercept is exactly zero? Think on it.
This model grabs all the linear information about \(y\) that \(x_1\) has to offer
after we already removed all linear information \(x_2\) contained about either \(y\) or \(x_1\).
Assuming this works, then \(\widehat{\alpha_1}\) in the estimate of Model #3
should be exactly \(\widehat{\beta_1}\) in the full, two-predictor model estimate:
\[y_i \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\]
So, \(\beta_1\) represents the effect of \(x_1\) on the mean of \(Y\)
after taking out the (linear) effects of \(x_2\) from both \(Y\) and \(x_1\)
Recall the Model #2 estimate above: \(x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}\)
Substituting the Model #2 estimate into the full model estimate, we have
\[ \begin{aligned} y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\\ \\ y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} (\widehat{\gamma_0} + \widehat{\gamma_1} x_{i2} + e_{i; x_1, x_2}) + \widehat{\beta_2} x_{i2}\\ \\ &= (\widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0}) + (\widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2}) x_{i2} + \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} \]
Then, by comparing the above with the Model #1 estimate: \(y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}\) we see that
\[ \begin{aligned} \widehat{\delta_0} & \;=\; \widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0} \\ \widehat{\delta_1} & \;=\; \widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2} \\ e_{i; y, x_2} & \;=\; \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} \] The last equation shows \(\,\widehat{\beta_1}\) is exactly the coefficient in the Model #3 estimate: \(e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}\)
For the height data
\(x_1\) = father’s height
\(x_2\) = mother’s height
\(y\) = son’s height
\[y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}\]
Regress sons’ heights (\(Y\)) on mothers’ heights (\(x_2\)):
### Estimate a simple linear regression model
heightFit.mothers <- lm(height ~ mother, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.mothers.coef <- tidy(heightFit.mothers)
heightFit.mothers.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 49.8503 | 4.9715 | 10.027 | 0.0000 |
mother | 0.2997 | 0.0777 | 3.858 | 0.0002 |
### Save the residuals
y.on.x2.resids <- residuals(heightFit.mothers)
\[x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}\]
Regress fathers’ heights (\(x_1\)) on mothers’ heights (\(x_2\)):
### Estimate a simple linear regression model
heightFit.fathersOnMothers <- lm(father ~ mother, data=heightDataSons)
### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.fathersOnMothers.coef <- tidy(heightFit.fathersOnMothers)
heightFit.fathersOnMothers.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 61.1486 | 5.1934 | 11.77 | 0.0000 |
mother | 0.1242 | 0.0811 | 1.53 | 0.1278 |
### Save the residuals
x1.on.x2.resids <- residuals(heightFit.fathersOnMothers)
Some of the residuals:
head(mutate(heightDataSons, y.on.x2.resids, x1.on.x2.resids), 10)
height | father | mother | nkids | y.on.x2.resids | x1.on.x2.resids |
---|---|---|---|---|---|
73.2 | 78.5 | 67.0 | 4 | 3.2728 | 9.0328 |
70.0 | 69.0 | 66.0 | 3 | 0.3725 | -0.3431 |
67.0 | 69.0 | 66.7 | 6 | -2.8373 | -0.4300 |
68.0 | 69.0 | 66.0 | 6 | -1.6275 | -0.3431 |
71.0 | 69.0 | 66.5 | 5 | 1.2227 | -0.4052 |
70.5 | 69.5 | 66.5 | 4 | 0.7227 | 0.0948 |
71.0 | 69.0 | 66.5 | 6 | 1.2227 | -0.4052 |
70.5 | 69.5 | 66.0 | 7 | 0.8725 | 0.1569 |
69.0 | 69.0 | 66.0 | 9 | -0.6275 | -0.3431 |
67.0 | 69.0 | 65.0 | 7 | -2.3279 | -0.2189 |
The regression of the residuals to find \(\widehat{\alpha_1}\):
\[e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}\]
### Estimate a simple linear regression model
heightFit.residsModel <- lm(y.on.x2.resids ~ x1.on.x2.resids)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.residsModel)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000 | 0.1660 | 0.000 | 1 |
x1.on.x2.resids | 0.4162 | 0.0659 | 6.313 | 0 |
…and finding \(\widehat{\beta_1}\) directly from the full model
(both father and mother heights as predictors)
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
### for the multiple linear regression on both x1 and x2
tidy(lm(height ~ father + mother, data=heightDataSons))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
It works!
We don’t actually use the 3-step process to find \(\widehat{\beta_1}\) in practice.
We just compute the two-predictor model directly in R.
Recall: \(\beta_1\) is the effect of \(x_1\) on expected \(Y\)
but only after we have “adjusted for” \(x_2\) in the model.
A graph may give a sense of what is meant by “adjusted for”
http://statistics.uchicago.edu/~burbank/s224/graphs/regress_plane2.pdf
The plane is not parallel to either \(x\) axis,
but changes in both directions
(if we have two meaningful predictors of \(Y\)).
Given any \(x_2\), the incremental change due to \(x_1\) is the same,
but the resultant predicted \(Y\) is not the same
…due to the contribution of \(x_2\).
The fitted values (\(\widehat{Y}\)’s) will all lie on the same regression surface
The \(Y\) values may be above, below, or on the surface
One measure of the “goodness of fit” of the regression model to the data:
\(R^2\) = proportion of variability in the response (\(Y\))
explained by regression on the predictor (\(x\))
\[R^2 \;=\; \frac{\mathrm{SSR}}{\mathrm{SST}} \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2}{\sum (y_i - \overline{y})^2}\]
If the predictors don’t include much information about the response,
then \(\sum (y_i - \widehat{y}_i)^2\) is close to \(\sum (y_i - \overline{y})^2\)
since \(\overline{y}\) is the least squares estimate for the mean without predictors.
In that case, \(R^2\) is near \(1 - 1 = 0\).
On the other hand, if the predictors have so much information that
residuals are very, very small, then \(R^2\) is near \(1 - 0 = 1\).
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
summary(repairFit)
Call:
lm(formula = Minutes ~ Units, data = repairData)
Residuals:
Min 1Q Median 3Q Max
-9.232 -3.341 -0.714 4.777 7.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.162 3.355 1.24 0.24
Units 15.509 0.505 30.71 8.9e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.39 on 12 degrees of freedom
Multiple R-squared: 0.987, Adjusted R-squared: 0.986
F-statistic: 943 on 1 and 12 DF, p-value: 8.92e-13
When there is only one predictor (\(x\)), \(R^2 = \mathrm{corr}(y,x)^2 = r^2\)
cor(Minutes ~ Units, data=repairData)
[1] 0.9937
cor(Minutes ~ Units, data=repairData)^2
[1] 0.9874
\(r = 0.9937\)
\(r^2 = 0.9874\)
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
So, 98.7% of the variability in repair times can be attributed to the regression on the number of units.
The rest of the variability (residuals) is not explained by the number of units repaired.
Claim: \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)
Recall:
\[\mathrm{corr}(a+by, c+dx) = \frac{bd\,\mathrm{cov}(y,x)}{|b|\,\mathrm{sd}(y)\; |d|\,\mathrm{sd}(x)} = {\rm sign}(bd)\,\mathrm{corr}(y,x)\]
\(\begin{align} \mathrm{corr}(y, \widehat{y}) = \mathrm{corr}(y, \widehat{\beta}_{0} + \widehat{\beta}_{1} x) = \textrm{sign}(\widehat{\beta}_{1}) \mathrm{corr}(y,x) \end{align}\)
\(\widehat{\beta}_{1}\) and \(\mathrm{corr}(y,x)\) always have the same sign: \(\widehat{\beta}_{1} = \mathrm{corr}(y,x) (s_y/s_x)\)
Thus, \(0 \le \mathrm{corr}(y, \widehat{y}) \le 1\) and \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)
So, \(R^2 = \left[\mathrm{corr}(y, x)\right]^2 = \left[\mathrm{corr}(y, \widehat{y})\right]^2\)
This will be handy when we have models with more than one predictor.
But we will have to be careful: when we add more predictors to a model, \(R^2\) will always increase or stay constant. Therefore we won’t be able to reliably use \(R^2\) as a way of comparing models to see which is better.
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
Adjusted \(R^2\) includes the degrees of freedom:
\[\text{adjusted } R^2 \;=\; 1 - \frac{\mathrm{SSE}\,/\, \mathrm{SSE}(\mathrm{df})}{\mathrm{SST}\,/\, \mathrm{SST}(\mathrm{df})} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2/(n-p-1)}{\sum (y_i - \overline{y})^2 / (n-1)}\]
Adjusted \(R^2\) does not always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.
Terminology: For multiple linear regression, \(R = \mathrm{corr}(y, \widehat{y})\) is called
the multiple correlation coefficient, so \(R^2\) is sometimes called the squared multiple correlation.
Model with both father’s (\(x_1\)) and mother’s height (\(x_2\)) included:
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.both)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.both)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.254 | 0.2452 | 2.19 | 28.93 | 0 | 3 | 170 |
### Save the R-sqr value as a number (not a data frame)
heightFit.both.Rsqr <- as.numeric(glance(heightFit.both)[1])
Regression on both the fathers’ and mothers’ heights explains 25.4% of the variability in sons’ heights.
Model with only father’s height (\(x_1\)) included:
### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.fathers)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 38.4008 | 4.6882 | 8.191 | 0 |
father | 0.4432 | 0.0678 | 6.535 | 0 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.fathers)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.1998 | 0.1952 | 2.261 | 42.71 | 0 | 2 | 171 |
### Save the R-sqr value as a number (not a data frame)
heightFit.fathers.Rsqr <- as.numeric(glance(heightFit.fathers)[1])
Regression on the fathers’ heights explains 20% of the variability in sons’ heights.
Model with only mother’s height (\(x_2\)) included:
### Estimate a simple linear regression model
heightFit.mothers <- lm(height ~ mother, data=heightDataSons)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.mothers)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 49.8503 | 4.9715 | 10.027 | 0.0000 |
mother | 0.2997 | 0.0777 | 3.858 | 0.0002 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.mothers)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.0801 | 0.0747 | 2.425 | 14.89 | 0.0002 | 2 | 171 |
### Save the R-sqr value as a number (not a data frame)
heightFit.mothers.Rsqr <- as.numeric(glance(heightFit.mothers)[1])
Regression on the mothers’ heights explains 8% of the variability in sons’ heights.
We see that 25.4% \(<\) 20% \(+\) 8%
because, for example, adding mothers’ heights to the model only brings in their (marginal) information
about sons’ heights exclusive of information they also hold about fathers’ heights.
Of course, \(R^2\) for the combined, two-predictor model must be
larger than either of the \(R^2\) for one-predictor models. Why?
Caution: This is not always true for the adjusted \(R^2\).
As an illustration of MLR modeling, we look at a dataset on U.S. fuel consumption by state (C. Bingham, American Almanac).
We will investigate fuel consumption as a linear function of demographics, fuel tax policy, and other factors.
### Read in the data on fuel taxes and demographics for 48 states
### (excludes Alaska and Hawaii)
fuelData <- read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fuel.txt")
### Rearrange the order of the variables (in the order selected)
### then sort by the state code (AL, AR, AZ, CA, ...)
fuelData <- fuelData %>%
dplyr::select(state, pop, fuelc, fuel, nlic, dlic, tax, inc, road) %>%
arrange(state)
### Display the data as a "tibble" (a "nice" format for a data frame)
head(as.tibble(fuelData))
state | pop | fuelc | fuel | nlic | dlic | tax | inc | road |
---|---|---|---|---|---|---|---|---|
AL | 3510 | 1946 | 554 | 1801 | 51.3 | 7.0 | 3.333 | 6.594 |
AR | 1978 | 1242 | 628 | 1081 | 54.7 | 7.5 | 3.357 | 4.121 |
AZ | 1945 | 1230 | 632 | 1173 | 60.3 | 7.0 | 4.300 | 3.635 |
CA | 20468 | 10730 | 524 | 12130 | 59.3 | 7.0 | 5.002 | 9.794 |
CN | 3082 | 1408 | 457 | 1760 | 57.1 | 10.0 | 5.342 | 1.333 |
CO | 2357 | 1384 | 587 | 1475 | 62.6 | 7.0 | 4.449 | 4.639 |
variable | description |
---|---|
state | State (excluding AK and HI) |
pop | Population in each state |
fuelc | fuel consumption, millions of gallons |
fuel | Motor fuel consumption, gal/person |
nlic | 1971 thousands of drivers |
dlic | Percent of population with driver’s license |
tax | 1972 motor fuel tax rate, cents/gal |
inc | 1972 per capita income, thousands of dollars |
road | 1971 federal-aid highways, thousands of miles |
We will only use per capita versions of the data:
dlic
vs. nlic
(percent of licensed drivers vs. count)
fuel
vs. fuelc
(gallons/person rather than gallons)
road
is a little problematic (more $ for states with large land area)
pop
could be included, but the variables are per capita already.
### Use select (from dplyr) to get the variables in order
### (response variable first), so scatterplot matrix better arranged.
temp <- dplyr::select(fuelData, fuel, dlic, tax, inc, road)
### scatterplot matrix using ggpairs from GGally package
ggpairs(temp)
tax
and dlic
appear to have a linear relationship with fuel
…and maybe income
?
road
doesn’t look to have much relationship to fuel
### Estimate various regression models we will use in these notes
fuelFit0 <- lm(fuel ~ 1, data=fuelData)
fuelFit1 <- lm(fuel ~ dlic, data=fuelData)
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
fuelFit4 <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
tidy(fuelFit4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
Recall from simple linear regression (one predictor)
\[R^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\]
SST = \(\sum (y_i - \overline{y})^2\) = total sum of squares
SSE = \(\sum (y_i - \widehat{y}_i)^2\) = error sum of squares
SSR = \(\sum (\widehat{y}_i - \overline{y})^2\) = regression sum of squares
\(\mathrm{SST}= \mathrm{SSR}+ \mathrm{SSE}\)
Even the degrees of freedom for each sum of squares add up!
SST(df) = SSE(df) + SSR(df)
\((n-1) = (n-2) + 1\)
Finally, \(r^2 = [\mathrm{corr}(y, x)]^2 = R^2\)
All is the same for multiple linear regression except…
\(r_1^2 = [\mathrm{corr}(y, x_1)]^2 \ne R^2\)
Instead, recall that for simple linear regression, we showed
\[\mathrm{corr}(y, \widehat{y}) \;=\; |\; \mathrm{corr}(y, x) \;|\]
For simple and multiple linear regression
\[ R^2 = [ \mathrm{corr}(y, \widehat{y}) ]^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}\]
### We estimated the full model "fuelFit4" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
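As a check (a sketch using the fuelFit4 object estimated above), both forms of \(R^2\) reproduce the r.squared value shown by glance:
### R-squared two ways for the full fuel model
SSE <- sum(residuals(fuelFit4)^2)
SST <- sum((fuelData$fuel - mean(fuelData$fuel))^2)
1 - SSE/SST                             ### 1 - SSE/SST
cor(fuelData$fuel, fitted(fuelFit4))^2  ### [corr(y, yhat)]^2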
Since more information reduces error (SSE),
each additional predictor increases \(R^2\) (or at least never decreases it).
(Technically, adding a predictor might have no effect on SSE or \(R^2\), if the estimated regression coefficient for that predictor were exactly 0. But this will almost never happen in practice.)
Here is \(R^2\) for the full model: dlic
, tax
, inc
, and road
versus a 3-predictor model (road
removed)
fuel = dlic + tax + inc + road
vs. fuel = dlic + tax + inc
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 377.291 | 185.541 | 2.0335 | 0.0482 |
dlic | 13.364 | 1.923 | 6.9499 | 0.0000 |
tax | -34.790 | 12.970 | -2.6823 | 0.0103 |
inc | -66.589 | 17.222 | -3.8665 | 0.0004 |
road | -2.426 | 3.389 | -0.7158 | 0.4780 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6787 | 0.6488 | 66.31 | 22.71 | 0 | 5 | 43 |
### We estimated the reduced model "fuelFit3" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 307.33 | 156.831 | 1.960 | 0.0564 |
dlic | 13.75 | 1.837 | 7.485 | 0.0000 |
tax | -29.48 | 10.584 | -2.786 | 0.0078 |
inc | -68.02 | 17.010 | -3.999 | 0.0002 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6749 | 0.6527 | 65.94 | 30.44 | 0 | 4 | 44 |
\(R^2\) went down even though we know road
is not a significant predictor.
What about the adjusted \(R^2\)? \(\quad\) It went up!
…which implies road
does not add enough to the model to offset the additional “complexity” of the model
\[{\rm Adjusted}\; R^2 \;=\; 1 - \frac{\mathrm{SSE}\;/\; \mathrm{SSE}(df)}{\mathrm{SST}\;/\; \mathrm{SST}(df)} \;=\; 1 - \frac{\mathrm{SSE}\;/\; (n-p-1)}{\mathrm{SST}\;/\; (n-1)}\]
Adjusted \(R^2\) sets up a “penalty” for model complexity.
Can use adjusted \(R^2\) to compare “nested” models.
Can even compare models that are not nested.
For example, compare the following two models with adjusted \(R^2\),
but do NOT compare using \(R^2\):
fuel = dlic + inc \(\qquad\) vs. \(\qquad\) fuel = dlic + tax
### Estimate a multiple linear regression
fuelFit2a <- lm(fuel ~ dlic + inc, data=fuelData)
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2a)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 7.837 | 122.463 | 0.064 | 0.9493 |
dlic | 15.250 | 1.883 | 8.099 | 0.0000 |
inc | -70.924 | 18.209 | -3.895 | 0.0003 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2a)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6175 | 0.6005 | 70.72 | 36.33 | 0 | 3 | 45 |
### We estimated the fuelFit2 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 108.97 | 171.786 | 0.6343 | 0.5291 |
dlic | 12.51 | 2.091 | 5.9862 | 0.0000 |
tax | -32.08 | 12.197 | -2.6297 | 0.0117 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.5567 | 0.537 | 76.13 | 28.25 | 0 | 3 | 45 |
\[{\rm Adjusted}\; R^2 \;=\; 1 - \frac{n-1}{n-p-1} (1 - R^2) \;=\; \frac{(n-1)R^2 - p}{n-1-p}\]
Example: \(n=100\), \(p=20\), \(R^2=0.90\)
Then, adjusted \(R^2 = 0.87\).
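The arithmetic, as a one-line check:
### adjusted R-sqr = 1 - (n-1)/(n-p-1) * (1 - R-sqr)
1 - (100 - 1)/(100 - 20 - 1) * (1 - 0.90)   ### about 0.87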
Adjusted \(R^2\) is essentially “handicapped” or “penalized”
by the number of parameters (\(p+1\)) relative to \(n\)
NOTE: Adjusted \(R^2\) can be negative (not common in practice)
We already decided road
did not add enough information to the model to justify keeping it in.
Should we also remove inc
?
fuel = dlic + tax + inc
vs. fuel = dlic + tax
### We estimated the fuelFit2 and fuelFit3 models far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 307.33 | 156.831 | 1.960 | 0.0564 |
dlic | 13.75 | 1.837 | 7.485 | 0.0000 |
tax | -29.48 | 10.584 | -2.786 | 0.0078 |
inc | -68.02 | 17.010 | -3.999 | 0.0002 |
tidy(fuelFit2)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 108.97 | 171.786 | 0.6343 | 0.5291 |
dlic | 12.51 | 2.091 | 5.9862 | 0.0000 |
tax | -32.08 | 12.197 | -2.6297 | 0.0117 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6749 | 0.6527 | 65.94 | 30.44 | 0 | 4 | 44 |
glance(fuelFit2)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.5567 | 0.537 | 76.13 | 28.25 | 0 | 3 | 45 |
Note that the SEs got larger and the t-statistics got smaller.
So, it can happen that a significant predictor becomes
not significant when another significant predictor is removed.
Also, important predictors can become insignificant if
too many inconsequential variables are included in the model.
Not all of modeling can be based on statistical tests!
The model we have settled upon is
fuel = dlic + tax + inc
but the intercept is (barely) not significant:
### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 307.33 | 156.831 | 1.960 | 0.0564 |
dlic | 13.75 | 1.837 | 7.485 | 0.0000 |
tax | -29.48 | 10.584 | -2.786 | 0.0078 |
inc | -68.02 | 17.010 | -3.999 | 0.0002 |
Even so, we keep the intercept in the model.
The intercept doesn’t really add complexity.
…and keeping it guarantees all the nice properties of least squares estimates.
What are the consequences of…
including non-significant variables in the model?
After all, \(R^2\) always improves, so why omit variables?
What are the consequences of…
omitting important variables?
After all, fewer parameters is better for…
### We estimated the 5 models fuelFitX far above
### Here, we are adding the fitted values and residuals
### for each model to the dataset
fuelDataResids <- fuelData %>%
mutate(resids0 = residuals(fuelFit0), fitted0 = predict(fuelFit0),
resids1 = residuals(fuelFit1), fitted1 = predict(fuelFit1),
resids2 = residuals(fuelFit2), fitted2 = predict(fuelFit2),
resids3 = residuals(fuelFit3), fitted3 = predict(fuelFit3),
resids4 = residuals(fuelFit4), fitted4 = predict(fuelFit4)
)
### Plot residuals vs. fitted values with dashed horizontal line at 0
gf_point(resids3 ~ fitted3, data = fuelDataResids) %>%
gf_hline(yintercept = 0, linetype = "dashed") %>%
gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals")
### Draw a histogram of the residuals
p1 <- gf_histogram(~ resids3, data = fuelDataResids, color="white", bins=12) %>%
gf_labs(x = "Residuals (fuel = dlic + tax + inc)")
### Draw a normal quantile plot of the residuals
p2 <- gf_qq(~ resids3, data = fuelDataResids) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)
Here are the cases with the largest residuals:
### select variables and arrange (sort) the data
temp <- fuelDataResids %>%
dplyr::select(state, fuel, dlic, tax, inc, resids3) %>%
arrange(resids3)
### Look at the last few observations (where sorted values are largest)
tail(temp, 8)
 | state | fuel | dlic | tax | inc | resids3 |
---|---|---|---|---|---|---|
41 | MT | 704 | 58.6 | 7 | 3.897 | 62.53 |
42 | NM | 699 | 56.3 | 7 | 3.656 | 72.76 |
43 | IN | 580 | 53.0 | 8 | 4.391 | 78.60 |
44 | NV | 782 | 67.2 | 6 | 5.215 | 82.47 |
45 | VA | 547 | 51.7 | 9 | 4.258 | 83.91 |
46 | SD | 865 | 72.4 | 7 | 4.716 | 89.52 |
47 | ND | 714 | 54.0 | 7 | 3.718 | 123.59 |
48 | WY | 968 | 67.2 | 7 | 4.345 | 238.77 |
The largest residual happens to be for the state of Wyoming.
Here are the residual plots again, with the Wyoming residual removed.
### Use the filter function from dplyr package to filter out Wyoming
temp <- filter(fuelDataResids, state != "WY")
### Plot residuals vs. fitted values with dashed horizontal line at 0
gf_point(resids3 ~ fitted3, data = temp) %>%
gf_hline(yintercept = 0, linetype = "dashed") %>%
gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals")
### Remember, Wyoming is removed from data, so that residual is not displayed
### Draw a histogram of the residuals
p1 <- gf_histogram(~ resids3, data = temp, color="white", bins=12) %>%
gf_labs(x = "Residuals (fuel = dlic + tax + inc)")
### Draw a normal quantile plot of the residuals
p2 <- gf_qq(~ resids3, data = temp) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)
Can test for specific values of \(\beta_0, \beta_1, \beta_2, \ldots, \beta_p\)
Tests OK if model specifications reasonable
and residuals plausibly from a normal distribution.
\[\frac{\widehat{\beta}_j- \beta_j}{\mathrm{se}(\widehat{\beta}_j)} \;\sim\; t_{n-p-1}\]
### We estimated the model "fuelFit3" far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 307.33 | 156.831 | 1.960 | 0.0564 |
dlic | 13.75 | 1.837 | 7.485 | 0.0000 |
tax | -29.48 | 10.584 | -2.786 | 0.0078 |
inc | -68.02 | 17.010 | -3.999 | 0.0002 |
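To see where the table's t-statistics and p-values come from, here is a sketch for one coefficient (tax), using only the numbers above and the t distribution with \(n-p-1\) degrees of freedom:
### Reproduce the t-statistic and two-sided p-value for the tax coefficient
co <- tidy(fuelFit3)
t.tax <- co$estimate[3] / co$std.error[3]         ### tests H0: beta_2 = 0
2 * pt(-abs(t.tax), df = df.residual(fuelFit3))   ### df.residual = n - p - 1 = 44
### Confidence intervals for all coefficients use the same t distribution:
confint(fuelFit3)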
To test multiple coefficients, we use the F-test.
\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
\(H_1:\) At least one of the \(\beta_j \ne 0\).
NOTE: Here, we are not testing \(\beta_0=0\).
Here is the process.
Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\)
(no restrictions on the coefficients)
Compare to the reduced model (RM): \(Y = \beta_0 + \epsilon\)
(all other \(\beta_j=0\) as stated in \(H_0\))
If the model specifications are valid (including Normality),
Then the following comparison statistic has an \(F\) distribution.
\[F \;=\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})}\]
Let’s try it with the fuel data: fuel = dlic + tax + inc
I encourage you to look at the R code.
### We estimated the models "fuelFit0" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (fuelFit0) and full (fuelFit3) models
anova(fuelFit0, fuelFit3)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
47 | 588366 | NA | NA | NA | NA |
44 | 191302 | 3 | 397064 | 30.44 | 0 |
### Now, I'll do things a little more explicitly (manually)
### Use tidy() from dplyr package to put ANOVA output into a table
anovaTable3 <- tidy(anova(fuelFit0, fuelFit3))
Warning: Unknown or uninitialised column: 'term'.
### Grab the various pieces of the F-statistic from the ANOVA table
SSE.RM <- anovaTable3$rss[1]
SSE.FM <- anovaTable3$rss[2]
df.RM <- anovaTable3$res.df[1]
df.FM <- anovaTable3$res.df[2]
### Calculate the F-statistic
F.stat <- ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / ( SSE.FM / df.FM)
### Calculate the p-value of the F-test
p.value <- 1 - pf(F.stat, df.RM - df.FM, df.FM)
\[F \;=\; \frac{(588,366 - 191,302) \; / \; (47 - 44)} {191,302 \;/\; 44} \;=\; 30.44\]
p-value = 0.00000000008235
\(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
\(H_1:\) At least one of the \(\beta_j \ne 0\).
is the test (and F-statistic) that R automatically calculates for regression.
### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 307.33 | 156.831 | 1.960 | 0.0564 |
dlic | 13.75 | 1.837 | 7.485 | 0.0000 |
tax | -29.48 | 10.584 | -2.786 | 0.0078 |
inc | -68.02 | 17.010 | -3.999 | 0.0002 |
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.6749 | 0.6527 | 65.94 | 30.44 | 0 | 4 | 44 |
(except not testing \(\beta_0=0\))
ANOVA = ANalysis Of VAriance
Source | SS (Sum of Squares) | df | MS (Mean Squares) |
---|---|---|---|
Model (Regression) | SSR = \(\sum_i (\widehat{y}_i-\overline{y})^2\) | \(p\) | MSR = SSR\(\,/\,p\) |
Residual (Error) | SSE = \(\sum_i (y_i-\widehat{y}_i)^2\) | \(n-p-1\) | MSE = SSE\(\,/\,(n-p-1)\) = \(\widehat{\sigma}{^2}\) |
Total | SST = \(\sum_i (y_i-\overline{y})^2\) = SSR + SSE | \(n-1\) | MST = SST\(\,/\,(n-1) = \mathrm{var}(y)\) |
When \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\),
Reduced model (RM): \(Y = \beta_0 + \epsilon\)
has SSE = \(\sum_i (y_i-\overline{y})^2\) on \(n-1\) df
Full model(FM): \(Y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\)
has SSE = \(\sum_i (y_i-\widehat{y}_i)^2\) on \(n-p-1\) df
and the F-statistic is
\[\begin{aligned} F &= \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})} \\ \\ &=\; \frac{\left[\sum (y_i-\overline{y})^2 - \sum(y_i - \widehat{y}_i)^2 \right] \;/\; [(n-1) - (n-p-1)]}{ \sum(y_i - \widehat{y}_i)^2 \;/\; (n-p-1) } \\ \\ &=\; \frac{(\mathrm{SST}- \mathrm{SSE}) \;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \;=\; \frac{\mathrm{SSR}\;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \;=\; \frac{\mathrm{MSR}}{\mathrm{MSE}} \end{aligned}\]
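A sketch of the MSR/MSE form for the fuel model (fuelFit3, with p = 3 predictors), which should reproduce the F-statistic of 30.44 computed above:
### F = MSR / MSE = (SSR/p) / (SSE/(n-p-1))
SSE <- sum(residuals(fuelFit3)^2)
SST <- sum((fuelData$fuel - mean(fuelData$fuel))^2)
SSR <- SST - SSE
(SSR / 3) / (SSE / df.residual(fuelFit3))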
Since the textbook mentions it, here is another way to compute the F-statistic:
\[\begin{aligned} F &=\; \frac{\mathrm{MSR}}{\mathrm{MSE}} \;=\; \frac{\mathrm{SSR}\;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \\ \\ &=\; \frac{1 \;/\; \mathrm{SST}}{1 \;/\; \mathrm{SST}} \times \frac{\mathrm{SSR}\;/\; p}{\mathrm{SSE}\;/\; (n-p-1)} \\ \\ &=\; \frac{(\mathrm{SSR}\;/\;\mathrm{SST}) \;/\; p}{(\mathrm{SSE}\;/\;\mathrm{SST}) \;/\; (n-p-1)} \;=\; \frac{R^2_p \;/\; p} {(1 - R^2_p) \;/\; (n-p-1)} \end{aligned}\]
…but there is nothing illuminating (to me) about this formula.
To test statements about multiple coefficients, we continue to use the F-test.
Let’s use the fuel data: fuel = dlic + tax + inc
\(H_0: \beta_2 = \beta_3 = 0\) (tax
and inc
not needed in the model)
Here is the process.
Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
(no restrictions on the coefficients)
Compare to the reduced model (RM): \(Y = \beta_0 + \beta_1 x_1 + \epsilon\)
(\(\beta_2=0\) and \(\beta_3=0\) as stated in \(H_0\))
If the model specifications are valid (including Normality),
Then the following comparison statistic has an \(F\) distribution.
df for F-statistic: [df(RM) \(-\) df(FM)] and df(FM)
\[\begin{aligned} F &=\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})} \\ \\ &=\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; (p-q)}{\mathrm{SSE}(\mathrm{FM}) \;/\; (n-p-1)} \\ \\ &=\; \frac{(R^2_p - R^2_q) \;/\; (p-q)}{(1 - R^2_p) \;/\; (n-p-1)} \end{aligned}\]
where \(p\) is the number of predictors in the full model (FM)
and \(q<p\) is the number of predictors in the reduced model (RM)
OK. Let’s test \(H_0: \beta_2 = \beta_3 = 0\) (tax
and inc
not needed in the model)
### We estimated the models "fuelFit1" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (fuelFit1) and full (fuelFit3) models
fuelFit1 <- lm(fuel ~ dlic, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
anova(fuelFit1, fuelFit3)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
46 | 300919 | NA | NA | NA | NA |
44 | 191302 | 2 | 109616 | 12.61 | 0 |
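As a check, the \(R^2\) form of this F-statistic (with p = 3, q = 1, and n = 48) reproduces the F value of 12.61 in the anova output above. A sketch:
### F = [(R2_p - R2_q)/(p - q)] / [(1 - R2_p)/(n - p - 1)]
R2.full <- glance(fuelFit3)$r.squared   ### p = 3 predictors
R2.red  <- glance(fuelFit1)$r.squared   ### q = 1 predictor
((R2.full - R2.red) / (3 - 1)) / ((1 - R2.full) / df.residual(fuelFit3))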
fuel = dlic + tax + inc
\(H_0: \beta_3 = 0\) (fuel = dlic + tax)
Here is the process.
Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
(no restrictions on the coefficients)
Compare to the reduced model (RM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
(\(\beta_3=0\) as stated in \(H_0\))
Here is the F-statistic:
### We estimated the models "fuelFit2" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (fuelFit2) and full (fuelFit3) models
### ...and these models only differ by one coefficient (beta_3)
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
anova(fuelFit2, fuelFit3)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
45 | 260834 | NA | NA | NA | NA |
44 | 191302 | 1 | 69532 | 15.99 | 0.0002 |
F.stat <- anova(fuelFit2, fuelFit3)[2,5]
F-statistic = 15.99
…and the t-statistic for the same test
### We estimated the model "fuelFit3" far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 307.33 | 156.831 | 1.960 | 0.0564 |
dlic | 13.75 | 1.837 | 7.485 | 0.0000 |
tax | -29.48 | 10.584 | -2.786 | 0.0078 |
inc | -68.02 | 17.010 | -3.999 | 0.0002 |
### t-statistics are in the "statistic" variable of the tidy table
### There are p=3 predictors, so the (p+1) = 4th holds info for X_3
t.stat <- tidy(fuelFit3)$statistic[4]
t-statistic = -3.999
As we have seen before, \(F_{1,df} = (t_{df})^2\).
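A one-line check with the statistics computed above:
### The squared t-statistic equals the F-statistic
c(F.stat, t.stat^2)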
With dlic
, tax
, and inc
in the fuel model, could the
marginal effect of tax
be 24 fewer gallons/year per person
while the marginal effect of dlic
increases average fuel by 12 gallons/year?
\(H_0: \beta_1=12,\; \textrm{and}\;\; \beta_2=-24\)
Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\)
(no restrictions on the coefficients)
Compare to the reduced model (RM): \(Y = \beta_0 + 12 x_1 -24 x_2 + \beta_3 x_3 + \epsilon\)
(\(\beta_1=12\) and \(\beta_2=-24\) as stated in \(H_0\))
Fit both the reduced and full models and compare using anova
function
### The offset() function lets you bring in specific values for coef
lmfit.RM <- lm(fuel ~ offset(12*dlic) + offset(-24*tax) + inc, data=fuelData)
### This model has no constraints (full model)
lmfit.FM <- lm(fuel ~ dlic + tax + inc, data=fuelData)
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (lmfit.RM) and full (lmfit.FM) models
anova(lmfit.RM, lmfit.FM)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
46 | 198266 | NA | NA | NA | NA |
44 | 191302 | 2 | 6964 | 0.8008 | 0.4554 |
Or use the linearHypothesis
function from the car
package:
### The kable function is technically not needed (but makes output look nice)
kable(linearHypothesis(fuelFit3, c("dlic = 12", "tax = -24")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
46 | 198266 | NA | NA | NA | NA |
44 | 191302 | 2 | 6964 | 0.8008 | 0.4554 |
In the height data, perhaps the coefficients for fathers and for mothers are the same.
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
\(H_0: \beta_1=\beta_2\)
Start with the full model (FM): \(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\)
(no restrictions on the coefficients)
Compare to the reduced model (RM): \(Y = \beta_0 + \beta_1 x_1 + \beta_1 x_2 + \epsilon = \beta_0 + \beta_1 (x_1 + x_2) + \epsilon\)
(\(\beta_1=\beta_2\) as stated in \(H_0\))
Fit both the reduced and full models and compare using anova
function
### The I() function lets you redefine a predictor
### In this case, the new predictor is the sum of two other predictors
lmfit.RM <- lm(height ~ I(father + mother), data=heightDataSons)
### This model has no constraints (full model)
lmfit.FM <- lm(height ~ father + mother, data=heightDataSons)
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (lmfit.RM) and full (lmfit.FM) models
anova(lmfit.RM, lmfit.FM)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
171 | 828.3 | NA | NA | NA | NA |
170 | 815.3 | 1 | 12.99 | 2.708 | 0.1017 |
Or use the linearHypothesis
function from the car
package:
### The kable function is technically not needed (but makes output look nice)
kable(linearHypothesis(lmfit.FM, c("father = mother")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
171 | 828.3 | NA | NA | NA | NA |
170 | 815.3 | 1 | 12.99 | 2.708 | 0.1017 |
So, it is plausible that the coefficients are equal.
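An equivalent check, as a sketch: since \(\beta_1 x_1 + \beta_2 x_2 = \beta_1(x_1 + x_2) + (\beta_2 - \beta_1)x_2\), the t-test on the mother term in the reparametrized model below tests \(H_0: \beta_1 = \beta_2\) directly, and its square should match (up to rounding) the \(F = 2.708\) above. (lmfit.FM2 is just a name made up here.)
### Reparametrized full model: the coefficient on mother is (beta_2 - beta_1)
lmfit.FM2 <- lm(height ~ I(father + mother) + mother, data=heightDataSons)
### The t-statistic for the mother term tests beta_1 = beta_2
tidy(lmfit.FM2)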
number of predictors | nested models | non-nested models |
---|---|---|
different number | F-test (see note 1), adjusted \(R^2\), \(R^2\) (see note 2) | adjusted \(R^2\) |
same number | not applicable | adjusted \(R^2\) (see note 3), \(R^2\) |
Note 1: The F-test has the advantage of providing a measure of uncertainty (a p-value).
Note 2: \(R^2\) always increases as more predictors are added to the model, so adjusted \(R^2\) is preferred for model comparison. On the other hand, \(R^2\) is an interpretable quantity: the amount of variation in the response explained by linear regression on the predictor(s).
Note 3: When two models with the same number of predictors (and the same \(n\)) are compared, adjusted \(R^2 = 1 - C\,(1 - R^2)\) for a constant \(C\), so ranking the models by either measure gives the same answer.
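As an example of the adjusted \(R^2\) comparisons, here is a small sketch using broom's glance() function on the fuel models (fuelFit2 is again the hypothetical reduced model without inc):
### R^2 and adjusted R^2 for the full fuel model and a smaller nested model
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
glance(fuelFit3)[, c("r.squared", "adj.r.squared")]
glance(fuelFit2)[, c("r.squared", "adj.r.squared")]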
For families with a father 70 inches tall and mother 66 inches tall,
estimate the average of their sons’ heights (and a confidence interval for the true average).
Using the full model height ~ father + mother,
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.both)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
### Set up a little dataset to hold the predictor values for an estimate
newdata <- data.frame(father = 70, mother = 66)
### The predict() function gives you the estimate and a confidence interval
### The kable function is technically not needed (but makes output look nice)
kable(predict(heightFit.both, newdata, interval="confidence"))
fit | lwr | upr |
---|---|---|
69.9 | 69.46 | 70.34 |
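The fitted value is just the estimated regression equation evaluated at the new predictor values:
\[\widehat{y} \;=\; 24.4022 + 0.4162\,(70) + 0.2480\,(66) \;=\; 24.4022 + 29.134 + 16.368 \;\approx\; 69.9 \]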
For one family with a father 70 inches tall and mother 66 inches tall,
estimate their son’s height (and a prediction interval for his actual height).
Using the full model height ~ father + mother,
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Set up a little dataset to hold the predictor values for a prediction
newdata <- data.frame(father = 70, mother = 66)
### The predict() function gives you the estimate and a prediction interval
### The kable function is technically not needed (but makes output look nice)
kable(predict(heightFit.both, newdata, interval="prediction"))
fit | lwr | upr |
---|---|---|
69.9 | 65.56 | 74.25 |
Why is the prediction interval wider than the confidence interval?
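Hint: the prediction interval has to account for both the uncertainty in the estimated average height and the extra person-to-person variability of one son around that average, roughly
\[ \widehat{\textrm{se}}_{\textrm{pred}} \;=\; \sqrt{\widehat{\textrm{se}}_{\textrm{fit}}^{\,2} + \widehat{\sigma}^2}, \]
so it is always at least as wide as the corresponding confidence interval.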
How do the coefficients change if we change the units of observation?
### Add sons' heights in cm to the dataset
cm <- 2.54
heightDataSons.cm <- heightDataSons %>%
mutate(height.cm = height*cm)
### Fit the new linear model (response in cm instead of inches)
heightFit.both.cm <- lm(height.cm ~ father + mother, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.coef <- tidy(heightFit.both.cm)
Regression with original data:
### This table of coefficients was saved earlier
heightFit.both.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
Son’s heights changed to centimeters:
### This table of coefficients was saved earlier
heightFit.both.cm.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 61.9816 | 15.3472 | 4.039 | 0.0001 |
father | 1.0571 | 0.1679 | 6.294 | 0.0000 |
mother | 0.6299 | 0.1794 | 3.511 | 0.0006 |
What is the ratio of the new coefficients and SEs to the old ones?
### These tables of coefficients were saved earlier
### Coefficients in column 2, SEs in column 3
### Intercept in row 1
### Predictors in rows 2 ... to p
heightFit.both.coef[1:3, 2:3]
estimate | std.error |
---|---|
24.4022 | 6.0422 |
0.4162 | 0.0661 |
0.2480 | 0.0706 |
heightFit.both.cm.coef[1:3, 2:3]
estimate | std.error |
---|---|
61.9816 | 15.3472 |
1.0571 | 0.1679 |
0.6299 | 0.1794 |
heightFit.both.cm.coef[1:3, 2:3] / heightFit.both.coef[1:3, 2:3]
estimate | std.error |
---|---|
2.54 | 2.54 |
2.54 | 2.54 |
2.54 | 2.54 |
Rather confusingly, the mothers’ and fathers’ heights are still in inches.
So, the coefficients are in units of cm/inch.
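This factor of 2.54 is no accident. A small worked step (just rewriting the least squares criterion): if the response is rescaled as \(y^* = c\,y\) with \(c = 2.54\) cm/inch, then
\[\sum_{i=1}^n \bigl(c\,y_i - b_0 - b_1 x_{i1} - b_2 x_{i2}\bigr)^2 \;=\; c^2 \sum_{i=1}^n \Bigl(y_i - \tfrac{b_0}{c} - \tfrac{b_1}{c}\,x_{i1} - \tfrac{b_2}{c}\,x_{i2}\Bigr)^2,\]
so the minimizing coefficients for the cm response are exactly \(c\) times those for the inch response. The standard errors scale by the same factor, which is why the t-statistics and p-values are identical in the two tables.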
Of course, we can change all units to cm (and this better not affect our analysis at all!)
### Add fathers' and mothers' heights (in cm) to the dataset
heightDataSons.cm <- heightDataSons.cm %>%
mutate(father.cm = father*cm, mother.cm = mother*cm)
### Fit a new linear model with all variables in cm (not inches)
heightFit.both.cm.cm <- lm(height.cm ~ father.cm + mother.cm, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.cm.coef <- tidy(heightFit.both.cm.cm)
heightFit.both.cm.cm.coef
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 61.9816 | 15.3472 | 4.039 | 0.0001 |
father.cm | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother.cm | 0.2480 | 0.0706 | 3.511 | 0.0006 |
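A quick unit check explains why the slopes revert to their inch-scale values while the intercept does not: each slope is in response units per predictor unit, so converting both the response and a predictor by the same factor cancels. For the fathers' coefficient, for example,
\[ 1.0571 \;\textrm{(cm of son per inch of father)} \;\div\; 2.54 \;\textrm{(cm per inch)} \;=\; 0.4162 \;\textrm{(cm of son per cm of father)}, \]
while the intercept is in response units only, so it stays multiplied by \(2.54\).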
Beyond changing units, two common rescalings of the data are standardizing and centering. Standardizing subtracts the mean and divides by the standard deviation:
\[\begin{aligned} \widetilde{y} \;=\; \frac{y - \overline{y}}{s_y} \quad {\rm and} \quad \widetilde{x}_j \;=\; \frac{x_j - \overline{x}_j}{s_j} \quad {\rm for}\; j=1,2,\ldots, p \end{aligned}\]
Centering only subtracts the means, giving the model
\[Y - \overline{y}\;=\; \beta_0 + \beta_1 (x_1 - \overline{x}_1) + \beta_2 (x_2 - \overline{x}_2) + \cdots + \beta_p (x_p - \overline{x}_p) + \epsilon \]
We look at centering first.
We are still using the modified data:
sons’ heights in cm, fathers’ and mothers’ heights in inches
### Add centered variables to the predictors (sons' in cm, others in inches)
heightDataSons.ctr <- heightDataSons.cm %>%
mutate(height.ctr = height.cm - mean(height.cm),
father.ctr = father - mean(father),
mother.ctr = mother - mean(mother)
)
### Estimate the multiple linear model with centered data
heightFit.both.ctr <- lm(height.ctr ~ father.ctr + mother.ctr, data=heightDataSons.ctr)
Regression with uncentered data:
### This table of coefficients was saved earlier
tidy(heightFit.both.cm)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 61.9816 | 15.3472 | 4.039 | 0.0001 |
father | 1.0571 | 0.1679 | 6.294 | 0.0000 |
mother | 0.6299 | 0.1794 | 3.511 | 0.0006 |
Regression after all variables centered:
### This table of coefficients was saved earlier
tidy(heightFit.both.ctr)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000 | 0.4229 | 0.000 | 1.0000 |
father.ctr | 1.0571 | 0.1679 | 6.294 | 0.0000 |
mother.ctr | 0.6299 | 0.1794 | 3.511 | 0.0006 |
The intercept for the regression with centered data is zero. Why?
Hint: When we use least squares to estimate \(\beta_0\) for a linear model like this
\[Y \;=\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon\] \[ {\rm then,}\; \widehat{\beta}_0 \;=\; \overline{y}- \sum_{j=1}^p \widehat{\beta}_j\overline{x}_j\]
The other regression coefficients (and their SEs) are the same as when not centered.
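Working the hint through with the centered data: after centering, every sample mean is zero, so
\[\widehat{\beta}_0 \;=\; \overline{y} - \sum_{j=1}^p \widehat{\beta}_j\,\overline{x}_j \;=\; 0 - \sum_{j=1}^p \widehat{\beta}_j \cdot 0 \;=\; 0.\]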
### Add standardized variables to the dataset
heightDataSons.cm.std <- heightDataSons.cm %>%
mutate(height.std = (height.cm - mean(height.cm)) / sd(height.cm),
father.std = (father - mean(father)) / sd(father),
mother.std = (mother - mean(mother)) / sd(mother)
)
### Estimate linear model with standardized response and predictor variables
heightFit.both.cm.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.cm.std)
Regression with unstandardized data:
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 61.9816 | 15.3472 | 4.039 | 0.0001 |
father | 1.0571 | 0.1679 | 6.294 | 0.0000 |
mother | 0.6299 | 0.1794 | 3.511 | 0.0006 |
Regression after all variables standardized:
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm.std)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000 | 0.0661 | 0.000 | 1.0000 |
father.std | 0.4198 | 0.0667 | 6.294 | 0.0000 |
mother.std | 0.2342 | 0.0667 | 3.511 | 0.0006 |
This last regression is the same whether we measured in cm or inches. Why?
Regression with unstandardized original data (all in inches):
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 24.4022 | 6.0422 | 4.039 | 0.0001 |
father | 0.4162 | 0.0661 | 6.294 | 0.0000 |
mother | 0.2480 | 0.0706 | 3.511 | 0.0006 |
Regression after all variables standardized:
### Add standardized variables to the dataset
heightDataSons.std <- heightDataSons %>%
mutate(height.std = (height - mean(height)) / sd(height),
father.std = (father - mean(father)) / sd(father),
mother.std = (mother - mean(mother)) / sd(mother)
)
### Estimate the linear model using standardized response and predictor vars
heightFit.both.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.std)
### View the coef, SEs, t-stats, p-values from the linear regression
tidy(heightFit.both.std)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 0.0000 | 0.0661 | 0.000 | 1.0000 |
father.std | 0.4198 | 0.0667 | 6.294 | 0.0000 |
mother.std | 0.2342 | 0.0667 | 3.511 | 0.0006 |
We don’t really gain anything from standardizing the original data since all variables
are already in the same units (inches) for the same characteristic (height).
Plus, it seems that the standard deviations of the original heights
were nearly the same already for each variable (before standardizing).
\(s_y = 2.521\)
\(s_{x1} = 2.543\)
\(s_{x2} = 2.381\)
So, the coefficients changed a little, but their interpretation is very different.
Now, for fathers who are one standard deviation taller, their sons are expected
to be \(0.4198\) standard deviations taller (adjusting for the mothers’ heights).
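In fact, each standardized coefficient is just the original coefficient rescaled by a ratio of standard deviations: \(\widehat{\beta}_j^{\,\textrm{std}} = \widehat{\beta}_j \, s_{x_j}/s_y\). As a check with the fathers’ coefficient from the inch-scale fit and the standard deviations above:
\[ 0.4162 \times \frac{2.543}{2.521} \;\approx\; 0.4198 .\]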
### Index of the family with the "missing" value, and the number of families
i <- 5
n <- dim(heightDataSons)[1]
Suppose that for the \(5^{\rm th}\) family in the data, the mother’s height was not recorded.
We could just throw out case \(i = 5\).
…but what if a mother’s or father’s height is missing from several families?
Should we just throw out the information on the other parent so easily?
One solution: imputation of missing values.
This can be done many ways.
We will only discuss two.
For example, replace the missing \(5^{\rm th}\) mother’s height with the average height of the other \((n-1) = 172\) mothers’ heights.
### Mean of the other mothers' heights (excluding family i = 5)
kable(mean(~ mother[-5], data=heightDataSons))
x |
---|
63.95 |
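If the mother’s height really were recorded as NA in the data, a minimal sketch of mean imputation (using the same dplyr tools as above; heightDataSonsImputed is a name made up here) would be:
### Hypothetical sketch: replace any missing mothers' heights with the mean of the observed ones
heightDataSonsImputed <- heightDataSons %>%
  mutate(mother = ifelse(is.na(mother), mean(mother, na.rm = TRUE), mother))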
A second approach is regression imputation: use the other \(n-1\) families to regress mothers’ heights on the fathers’ and sons’ heights, then predict the missing value for family \(i = 5\).
### Drop the family with the "missing" mother's height
temp <- heightDataSons[-i, ]
### Regress mothers' heights on fathers' and sons' heights (other families)
lmfit <- lm(mother ~ father + height, data = temp)
### Predict the missing mother's height from this family's father and son
newdata <- data.frame(father = heightDataSons$father[i], height = heightDataSons$height[i])
newdata <- mutate(newdata, mother = predict(lmfit, newdata))
kable(newdata)
father | height | mother |
---|---|---|
69 | 71 | 64.49 |
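The regression-based value (64.49 inches) differs a bit from the overall mean (63.95 inches) because it uses this particular family’s father’s and son’s heights. To actually use it, we could copy it back into the data (a sketch; heightDataSonsImputed is again a made-up name):
### Hypothetical sketch: fill in family i's mother with the regression-based prediction
heightDataSonsImputed <- heightDataSons
heightDataSonsImputed$mother[i] <- newdata$mother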