---
title: "STAT 224 Chapter 3"
params:
  dotabs: no
output:
  html_document:
    code_folding: hide
    css: styles.css
    df_print: kable
    number_sections: yes
    self_contained: yes
    theme: cerulean
---

```{r test-main, child = 'preamble.Rmd'}
```

# Chapter 3 `r if(params$dotabs) paste('{.tabset}')`

## Lecture #5

### Model

#### Multiple linear regression model

If there are other variables that may help us predict $Y$, we could extend our simple model to include them.

$$Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p$$

...and then, allow for error (statistical model)

$$Y \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon$$

The following are also multiple linear regression models

$$
\begin{aligned}
Y &\;=\; \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \\
\\
Y &\;=\; \beta_0 + \beta_1 \log(x_1) + \beta_2 x_2 + \epsilon
\end{aligned}
$$

$Y$ is approximately a linear function of the $x$'s, and $\epsilon$ measures the discrepancy of the approximation.

As before, $\epsilon$ contains no systematic information about $Y$ that is not already captured by the $x$'s.

The $\beta$'s are called partial regression coefficients, or just regression coefficients.

#### Reminder: Model specifications

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$ not already captured by the $x$'s
    a. $E(\epsilon_i)=0$ for all units $i$ (for all values of all predictors $x_j$)
    b. The errors are not correlated with the predictor values
    c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all units $i$ (for all predictors $x_j$)
3. The mean of $Y$ depends linearly on $x_1, x_2, \ldots, x_p$
    a. $E(Y\; |\; X_1=x_1, X_2=x_2, \ldots, X_p=x_p) \;=\; \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

### Data

#### Data format

We will endeavor to provide you with **tidy** data:

+ Each row corresponds to one observational unit (case)
+ Each variable is one column

| observation | response $y$ | predictor $x_1$ | $x_2$ | ... | $x_p$ |
|:-----------:|:------------:|:---------------:|:-----:|:--------:|:-----:|
| 1 | $y_1$ | $x_{11}$ | $x_{12}$ | ... | $x_{1p}$ |
| 2 | $y_2$ | $x_{21}$ | $x_{22}$ | ... | $x_{2p}$ |
| ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| $n$ | $y_n$ | $x_{n1}$ | $x_{n2}$ | ... | $x_{np}$ |

$x_{ij}$ = observation for unit $i$ on variable $x_j$

### Example: Heights of parents and their sons

First, I cleaned up the data. Click the `Code` button to see my R code.

```{r data-reduction-one-son-per-family, results='hide'}
### For these data, I added some categorical variables (for plotting)
### and reduced the data to one (random) son per family.
### It is not expected that you would need to do code like in this "chunk" ### Add categorical variables for heights of moms and dads cuts <- c(61, 66, 68, 70, 72, 74, 80) midpoints <- c(65, 67, 69, 71, 73, 75) heightDataAll <- Galton %>% mutate(father2grp = factor(father > median(father), labels = c("Father Short", "Father Tall"))) %>% mutate(mother2grp = factor(mother > median(mother), labels = c("Mother Short", "Mother Tall"))) %>% mutate(father4grp = factor(cut(father, 4), labels = c("Father Short", "Father Average", "Father Tall", "Father Very Tall"))) %>% mutate(mother4grp = factor(cut(mother, 4), labels = c("Mother Short", "Mother Average", "Mother Tall", "Mother Very Tall"))) %>% mutate(fathergrp = factor(cut(father, cuts), labels = midpoints)) ### Reduce the data to one (random) son per family ### If you set the "seed" for the random number generator first, ### then you get the same "random" results each time you compile the doc. ### This keeps you sane. ### The identical dataset is generated every time I compile this document. set.seed(12345) heightDataSonsGroups <- heightDataAll %>% filter(sex == "M") %>% group_by(family) %>% sample_n(1) %>% ungroup() %>% dplyr::select(-family, -sex) heightDataSons <- heightDataSonsGroups %>% dplyr::select(height, father, mother, nkids) ### Now each family (row) is one "case" ### and each case has 3 observed variables: heights of the father, mother, and one son, plus number of children in the family (nkids) ### ...plus some categorical values I created for heights of fathers and mothers n <- dim(heightDataSons)[1] ``` #### How are sons' heights related to their father's height? The $`r n`$ points are shrunk and "jittered" so you can see more of them. ```{r, fig.height=3} p1 <- gf_point(height ~ father, data=heightDataSons) %>% gf_labs(x="Father's Height (inches)", y="Son's Height (inches)") ### If you set the "seed" for the random number generator first, ### then you get the same "random" results each time you compile the doc. ### This keeps you sane when you want to later comment on a plot. ### The plot doesn't change in appearance each time you compile! set.seed(12345) p2 <- gf_jitter(height ~ father, data=heightDataSons, width=0.75, height=0.75, cex=0.75) %>% gf_labs(x="Father's Height (inches)", y="Son's Height (inches)") grid.arrange(p1, p2, ncol=2) ``` #### Linear relationship between average son's height and father's height? ```{r} ### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg ggplot(heightDataSonsGroups, aes(x=fathergrp, y=height)) + geom_boxplot() + labs(x="Father's Height (inches)", y="Son's Height (inches)") + stat_summary(fun.y=mean, geom="line", aes(group=1)) + stat_summary(fun.y=mean, geom="point", size=3, color="red") ``` ```{r} ### Fit a LS regression line heightFit <- lm(height ~ father, data=heightDataSons) ### Add the fitted values and residuals to the data heightDataFitSons <- mutate(heightDataSons, yhat = fitted(heightFit), resid = residuals(heightFit) ) ``` A look at the residuals: ```{r} ### I Googled around quite a bit to figure out how to make this plot with ggplot2 pkg ggplot(heightDataFitSons, aes(x = father, y = height)) + geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + ### Plot regression slope geom_segment(aes(xend = father, yend = yhat), alpha = .2) + ### alpha to fade lines geom_point() + geom_point(aes(y = yhat), shape = 1) + theme_bw() ### Add theme for cleaner look ``` ... 
and a residual plot:

```{r}
gf_point(resid ~ father, data=heightDataFitSons) %>%
  gf_hline(yintercept = 0)
```

#### Reading the output

Output from the `tidy` function in the `broom` package.

```{r}
tidy(heightFit)
```

What is the value of the least squares regression intercept?
What is the meaning of the intercept for these data?

What is the value of the least squares regression slope?
What is the meaning of the slope for these data?

Output from the `glance` function in the `broom` package.

```{r}
glance(heightFit)
```

In this course, we will not likely discuss four of the statistics that `glance` provides: `logLik`, `AIC`, `BIC`, and `deviance`. So, I will restrict the output for lectures:

```{r}
glance(heightFit)[-c(7:10)]
```

Click the `Code` button to see a summary of the regression output using base R.

```{r}
heightFit <- lm(height ~ father, data=heightDataSons)
### Here is the base R summary of a linear model result
summary(heightFit)
```

What is the value of $R^2$ and what does it mean?

What hypothesis is tested using this p-value?
Hint: Compare the t-statistics above to the F-statistic.

```{r eval=FALSE}
### Plot not displayed.
### Just playing around with the full dataset (out of curiosity)
gf_point(height ~ father, color = ~ sex, size = ~ nkids, data = heightDataAll)
```

```{r practice-boxplots, eval=FALSE}
### Plots not displayed.
### I was just playing around with various ggplot2 graphics functions.
### Mostly drawing means (and lines between means) with boxplots
### Who knows, someday you might want to show something similar at your summer internship!
### (The grouping variables father2grp and father4grp live in heightDataSonsGroups,
###  not in heightDataSons, so that dataset is used below.)
x <- factor(rep(1:10, 100))
y <- rnorm(1000)
df <- data.frame(x=x, y=y)
ggplot(df, aes(x=x, y=y)) + geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point")
qplot(father2grp, height, data=heightDataSonsGroups, geom="boxplot",
      xlab="Father's Height", ylab="Son's Height") +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3)
gf_boxplot(height ~ father2grp, data=heightDataSonsGroups) +
  labs(x="Father's Height", y="Son's Height") +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3)
ggplot(heightDataSonsGroups, aes(x=father2grp, y=height)) + geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3, color="red")
ggplot(heightDataSonsGroups, aes(x=father4grp, y=height)) + geom_boxplot() +
  stat_summary(fun.y=mean, geom="line", aes(group=1)) +
  stat_summary(fun.y=mean, geom="point", size=3, color="red")
gf_histogram(~ height | father2grp, bins=12, data=heightDataSonsGroups, color="white") +
  labs(x="Son's Height", y="Number of Sons")
ggplot(heightDataSonsGroups, aes(x = height)) +
  geom_histogram(bins=12, color = "white") +
  facet_wrap(~ father2grp, ncol=1)
gf_boxplot(height ~ father4grp, data=heightDataSonsGroups) +
  labs(x="Father's Height", y="Son's Height")
gf_violin(height ~ father4grp, draw_quantiles=c(0.50), data=heightDataSonsGroups) +
  labs(x="Father's Height", y="Son's Height")
```

#### What about the other predictors in the data?

Could there be other information in the data that can help predict a son's height from family characteristics?

```{r}
### Look at the first 10 observations in the data
head(heightDataSons, 10)
```

Start with a scatterplot matrix.
Using base R:

```{r}
### Scatterplot matrix in base R
pairs(heightDataSons, pch=20)
### Table of correlations between all variables in the data
cor(heightDataSons)
```

Using the `GGally` package:

```{r}
### Scatterplot matrix using GGally package
ggpairs(heightDataSons)
```

...and again getting fancier with the `GGally` package:
This adds simple linear regression lines and confidence bands to each plot.

```{r}
### I stumbled across this while Googling around about ggpairs function. Nifty.
ggpairs(heightDataSons, lower = list(continuous = wrap("smooth", method = "lm")))
```

Fathers' heights are more strongly correlated with sons' heights than mothers' heights are.

The number of children in a family doesn't seem to be related to anyone's height ...not surprising?

Fathers and mothers have fairly low height correlation ...a bit surprising?

Then, perhaps the mothers bring "significant" new information about the sons' heights that is not already captured by the fathers' heights.

#### Fit the multiple linear regression model

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$

where

$x_1$ = father's height

$x_2$ = mother's height

$x_3$ = number of children in the family

```{r}
### Fit a multiple linear regression model with least squares
heightFit.all <- lm(height ~ father + mother + nkids, data=heightDataSons)
### Look at the results using functions in the broom package
tidy(heightFit.all)
glance(heightFit.all)[-c(7:10)]
```

Surprisingly, the number of children in the family is a moderately significant predictor. For now, I will leave it out of the model.

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

where

$x_1$ = father's height

$x_2$ = mother's height

```{r}
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
```

### Estimation

#### Least squares estimates of coefficients

As we did with one predictor, we estimate the regression coefficients by "least squares":
minimize the sum of squared residuals.

The estimated error (estimate of $\epsilon_i$) is called a **residual** ($e_i$)

$$\begin{aligned}
{\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\
&\;=\; {\rm observed - expected} \\
&\;=\; {\rm observed - predicted} \\
\\
&\;=\; y_i - \yhat_i\\
\\
&\;=\; y_i - (\widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2} + \cdots + \bp x_{ip})
\end{aligned}$$

$$SSE \;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{\beta_0} - \widehat{\beta_1} x_{i1} - \widehat{\beta_2} x_{i2} - \cdots - \bp x_{ip})^2$$

We can minimize this function with respect to $\widehat{\beta_0}, \widehat{\beta_1}, \widehat{\beta_2}, \ldots, \bp$ to find the least squares estimates of the model coefficients.

The formulas get a bit crazy for $p > 1$ predictors. After taking partial derivatives wrt each $\widehat\beta_j$ we need to solve $(p+1)$ linear equations for $(p+1)$ unknowns. This is all best represented in matrix form (...linear algebra).

At least the intercept has a simple formula.
With one predictor: $Y \;=\; \beta_0 + \beta_1 x + \epsilon$, recall that $\widehat{\beta_0} \;=\; \ybar - \widehat{\beta_1} \xbar$.
For multiple linear regression,

$$\widehat{\beta_0} \;=\; \ybar - \widehat{\beta_1} \xbar_1 - \widehat{\beta_2} \xbar_2 - \cdots - \bp \xbar_p \;=\; \ybar - \sum_{j=1}^p \widehat\beta_j \xbar_j$$

#### Estimate of the standard deviation

As with all least squares methods, we estimate $\var(\epsilon)$ as

$$\widehat\sigma^2 \;=\; \frac{\SSE}{\df} \;=\; \frac{\sum_i (y_i - \yhat_i)^2}{\df}$$

degrees of freedom (df)
= sample size $-$ number of parameters estimated for the mean ($\beta_0, \beta_1, \beta_2, \ldots, \beta_p$)
= $n-(p+1) \;=\; (n - p - 1)$

$$\widehat\sigma^2 \;=\; \frac{\sum_i (y_i - \yhat_i)^2}{n - p - 1}$$

#### Finding the least squares estimates using R

`lm` = "linear model" function

`lm(y ~ x1 + x2 + x3 + x4, data = mydata)`

### Interpreting (partial) regression coefficients

```{r}
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.coef <- tidy(heightFit.both)
heightFit.both.coef
### Grab the coefficients (column 2) for 1st two predictors (rows 2 & 3)
### (The intercept information is in row 1)
b.both.fathers <- heightFit.both.coef$estimate[2]
b.both.mothers <- heightFit.both.coef$estimate[3]
```

**One interpretation:**

Among mothers of a particular height, if they each had a child with a man 1 inch taller instead, we expect their sons would have been, on average, `r b.both.fathers` inches taller.

Among fathers of a particular height, if they each had a child with a woman 1 inch taller instead, we expect their sons would have been, on average, `r b.both.mothers` inches taller.


**Another interpretation:** Fathers 1 inch taller than others tend to have sons about `r b.both.fathers` inches taller, on average (after adjusting for the height of the mother). Mothers 1 inch taller than others tend to have sons about `r b.both.mothers` inches taller, on average (after adjusting for the height of the father).


**Yet another interpretation (much like the first):** Holding the height of the mother constant, fathers 1 inch taller than others tend to have sons about `r b.both.fathers` inches taller, on average. Holding the height of the father constant, mothers 1 inch taller than others tend to have sons about `r b.both.mothers` inches taller, on average. ### Simple vs. multiple linear regression coefficients What if we only use father's height as a predictor? (Simple linear regression) ```{r} ### Estimate a simple linear regression model heightFit.fathers <- lm(height ~ father, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.fathers.coef <- tidy(heightFit.fathers) heightFit.fathers.coef ``` ```{r} ### Estimate a simple linear regression model heightFit.fathers <- lm(height ~ father, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.fathers.coef <- tidy(heightFit.fathers) ### Grab the coefficient (column 2) for the predictor (row 2) ### (The intercept information is in row 1) b.fathers <- heightFit.fathers.coef$estimate[2] ``` Why did the coefficient for father's height change from `r b.both.fathers` to `r b.fathers`? For the simple linear regression, the interpretation is different. Men 1 inch taller than others tend to have sons about `r b.fathers` inches taller, on average (without taking into account the height of the mother). #### A closer look at $\beta_1$ (the slope) $\beta_1$ is the slope of the simple linear regression function. Slope is the change in average $Y$ associated with a one-unit increase of $x$. \[ \E(Y^{old})={\beta_0}+\beta_1x^{old} \quad \textrm{ and} \] \[ \E(Y^{new})=\beta_0+\beta_1(x^{old}+1), \] then \[ \E(Y^{new})-\E(Y^{old}) \;=\; \beta_0+\beta_1(x^{old}+1)-(\beta_0+\beta_1x^{old}) \;=\; \beta_1 \] #### A closer look at $\beta_1$ (the partial coeff for $x_1$) Consider a multiple regression with two predictors: \[ \E(Y^{old}) \;=\; {\beta_0}+\beta_1x_1^{old}+\beta_2x_2 \quad \textrm{ and} \] \[ \E(Y^{new}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2, \] then \[ \E(Y^{new})-\E(Y^{old}) \;=\; \beta_0+\beta_1(x_1^{old}+1)+\beta_2x_2-(\beta_0+\beta_1x_1^{old}+\beta_2x_2) \;=\; \beta_1. \] ...and this math only works if $x_2$ does not change when $x_1$ increases by one unit. **Interpretation:** $\beta_1$ is the expected change in $Y$ when $x_1$ increase by 1-unit, with all other predictors ($x_2$ in this example) *unchanged* (but present in the model). **Another interpretation:** The expected change in $Y$ when $x_1$ increases by 1 unit, *adjusting for* all other predictors. ### Mathematical understanding of partial coefficients **NOTE:** The following discussion is just for a pedagogical purpose -- another way to view things. This is **NOT** how we calculate coefficients for multiple regression. For a model with two predictors $$\yhat_i = \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}$$ $\beta_1$ represents the effect of $x_1$ on the mean of $Y$ after taking out the (linear) effects of $x_2$ from both $Y$ and $x_1$ #### A 3-step process for finding $\widehat{\beta_1}$. 1. Regress $Y$ on $x_2$ 2. Regress $x_1$ on $x_2$ 3. Regress the residuals from 1 onto the residuals from 2 The estimated slope for Model #3 = $\widehat{\beta_1}$ for the multiple regression. What? This is **NOT obvious.** Let's investigate. 
##### Estimate Model #1: Regress $Y$ on $x_2$ $$y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}$$ This model estimate "grabs" all the (linear) information about $Y$ that $x_2$ has to offer. The residuals leftover ($e_{i; y, x_2}$) are not correlated with $x_2$. ##### Estimate Model #2: Regress $x_1$ on $x_2$ $$x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}$$ This model estimate grabs all the (linear) information about $x_1$ that $x_2$ has to offer. The residuals leftover ($e_{i; x_1, x_2}$) are not correlated with $x_2$. ##### Estimate Model #3: Regress residuals: Model #1 on Model #2 $$e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}$$ How do we know the intercept is exactly zero? Think on it. This model grabs all the linear information about $y$ that $x_1$ has to offer **after** we already removed all linear information $x_2$ contained about either $y$ or $x_1$. Assuming this works, then $\widehat{\alpha_1}$ in the estimate of Model #3 should be exactly $\widehat{\beta_1}$ in the full, two-predictor model estimate: $$y_i \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}$$ So, $\beta_1$ represents the effect of $x_1$ on the mean of $Y$ after taking out the (linear) effects of $x_2$ from both $Y$ and $x_1$ #### Confirming the 3-step process for finding $\widehat{\beta_1}$ Recall the Model #2 estimate above: $x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}$ Substituting the Model #2 estimate into the full model estimate, we have $$ \begin{aligned} y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} x_{i1} + \widehat{\beta_2} x_{i2}\\ \\ y_i & \;=\; \widehat{\beta_0} + \widehat{\beta_1} (\widehat{\gamma_0} + \widehat{\gamma_1} x_{i2} + e_{i; x_1, x_2}) + \widehat{\beta_2} x_{i2}\\ \\ &= (\widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0}) + (\widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2}) x_{i2} + \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} $$ Then, by comparing the above with the Model #1 estimate: $y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}$ we see that $$ \begin{aligned} \widehat{\delta_0} & \;=\; \widehat{\beta_0} + \widehat{\beta_1}\widehat{\gamma_0} \\ \widehat{\delta_1} & \;=\; \widehat{\beta_1}\widehat{\gamma_1} + \widehat{\beta_2} \\ e_{i; y, x_2} & \;=\; \widehat{\beta_1} e_{i; x_1, x_2} \end{aligned} $$ The last equation shows $\,\widehat{\beta_1}$ is exactly the coefficient in the Model #3 estimate: $e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}$ #### Example: Seeing the 3-step process for finding $\widehat{\beta_1}$ For the height data $x_1$ = father's height $x_2$ = mother's height $y$ = son's height ##### Model #1 estimate is $$y_i \;=\; \widehat{\delta_0} + \widehat{\delta_1} x_{i2}$$ Regress sons' heights ($Y$) on mothers' heights ($x_2$): ```{r} ### Estimate a simple linear regression model heightFit.mothers <- lm(height ~ mother, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.mothers.coef <- tidy(heightFit.mothers) heightFit.mothers.coef ### Save the residuals y.on.x2.resids <- residuals(heightFit.mothers) ``` ##### Model #2 estimate is $$x_{i1} \;=\; \widehat{\gamma_0} + \widehat{\gamma_1} x_{i2}$$ Regress fathers' heights ($x_1$) on mothers' heights ($x_2$): ```{r} ### Estimate a simple linear regression model heightFit.fathersOnMothers <- lm(father ~ mother, data=heightDataSons) ### Save the 2-by-4 table of coefficients, SEs, t-stats, and p-values heightFit.fathersOnMothers.coef <- 
tidy(heightFit.fathersOnMothers)
heightFit.fathersOnMothers.coef
### Save the residuals
x1.on.x2.resids <- residuals(heightFit.fathersOnMothers)
```

Some of the residuals:

```{r}
head(mutate(heightDataSons, y.on.x2.resids, x1.on.x2.resids), 10)
```

##### Model #3 estimate is

the regression of the residuals, used to find $\widehat{\alpha_1}$:

$$e_{i; y, x_2} \;=\; 0 + \widehat{\alpha_1} e_{i; x_1, x_2}$$

```{r}
### Estimate a simple linear regression model
heightFit.residsModel <- lm(y.on.x2.resids ~ x1.on.x2.resids)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.residsModel)
```

...and finding $\widehat{\beta_1}$ directly from the full model (both father and mother heights as predictors)

```{r}
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
### for the multiple linear regression on both x1 and x2
tidy(lm(height ~ father + mother, data=heightDataSons))
```

**It works!**

#### Reminder: The 3-step process is just for pedagogical illustration

We don't actually use the 3-step process to find $\widehat{\beta_1}$ in practice. We just compute the two-predictor model directly in R.

### Geometric interpretation of regression coefficients

#### One predictor (regression line)

+ The fitted values ($\Yhat$'s) will all lie on the regression line
+ $\beta_0$ is the intercept of that line
+ $\beta_1$ is the slope of that line
+ $Y$ values may be above, below, or on the line within the ($x,Y$) plane of the data

#### Two predictors (regression plane)

+ The fitted values ($\Yhat$'s) will all lie on the regression plane
+ $\beta_0$ is the $Y$ value when both $x$ variables $= 0$
+ $\beta_1$ is the slope of the plane along the $x_1$ direction
+ $\beta_2$ is the slope of the same plane along the $x_2$ direction
+ $Y$ values may be above, below, or on the plane within the full space of the data

Recall: $\beta_1$ is the effect of $x_1$ on expected $Y$ but only after we have "adjusted for" $x_2$ in the model.

A graph may give a sense of what is meant by "adjusted for":
[http://statistics.uchicago.edu/~burbank/s224/graphs/regress_plane2.pdf](http://statistics.uchicago.edu/~burbank/s224/graphs/regress_plane2.pdf)

The plane is not parallel to either $x$ axis, but changes in both directions (if we have two meaningful predictors of $Y$). Given any $x_2$, the incremental change due to $x_1$ is the same, but the resulting predicted $Y$ is not the same ...due to the contribution of $x_2$.

#### Multiple predictors (regression surface)

The fitted values ($\Yhat$'s) will all lie on the same regression surface.

The $Y$ values may be above, below, or on the surface.

## Lecture #6

### R^2 (From Chapter 2 but I skipped past this part when giving the lecture)

#### Coefficient of determination: $R^2$ ("R squared")

One measure of the "goodness of fit" of the regression model to the data:

$R^2$ = proportion of variability in the response ($Y$) explained by regression on the predictor ($x$)

$$R^2 \;=\; \frac{\SSR}{\SST} \;=\; 1 - \frac{\SSE}{\SST} \;=\; 1 - \frac{\sum (y_i - \yhat_i)^2}{\sum (y_i - \ybar)^2}$$

If the predictors don't include much information about the response, then $\sum (y_i - \yhat_i)^2$ is close to $\sum (y_i - \ybar)^2$ since $\ybar$ is the least squares estimate for the mean without predictors. In that case, $R^2$ is near $1 - 1 = 0$.

On the other hand, if the predictors have so much information that the residuals are very, very small, then $R^2$ is near $1 - 0 = 1$.
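As a quick check of this formula, here is a minimal sketch that computes $R^2$ directly from the sums of squares, reusing the `heightFit` fit and the `heightDataFitSons` data frame (with its `yhat` column) created earlier, and compares it with the value `glance()` reports:

```{r}
### Sketch: compute R-squared "by hand" for the regression of
### sons' heights on fathers' heights (heightFit, fit earlier)
y    <- heightDataFitSons$height
yhat <- heightDataFitSons$yhat
SSE  <- sum((y - yhat)^2)       ### error sum of squares
SST  <- sum((y - mean(y))^2)    ### total sum of squares
c(byHand = 1 - SSE/SST, fromGlance = glance(heightFit)$r.squared)
```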
```{r}
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
summary(repairFit)
```

#### Why do we call this number $R^2$?

When there is only one predictor ($x$), $R^2 = \corr(y,x)^2 = r^2$

```{r,include=FALSE}
covyx = cov(Minutes ~ Units, data=repairData)
covyx
corryx = cor(Minutes ~ Units, data=repairData)
corryx
```

```{r}
cor(Minutes ~ Units, data=repairData)
cor(Minutes ~ Units, data=repairData)^2
```

$r = `r corryx`$

$r^2 = `r corryx^2`$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
```

So, 98.7\% of the variability in repair times can be attributed to the regression on the number of units. The rest of the variability (residuals) is not explained by the number of units repaired.

#### Another way to view $R^2$

Claim: $\corr(y, \yhat) = |\, \corr(y, x)\, |$

Recall:

+ $\cov(a+by, c+dx) = bd\cov(y,x)$
+ $\sd(a+by) = |b|\sd(y)$ and $\sd(c + dx) = |d|\sd(x)$

$$\corr(a+by, c+dx) = \frac{bd\cov(y,x)}{|b|\sd(y)\, |d|\sd(x)} = {\rm sign}(bd)\corr(y,x)$$

$\begin{align} \corr(y, \yhat) = \corr(y, \b0 + \b1 x) = \textrm{sign}(\b1) \corr(y,x) \end{align}$

$\b1$ and $\corr(y,x)$ always have the same sign: $\b1 = \corr(y,x) (s_y/s_x)$

Thus, $0 \le \corr(y, \yhat) \le 1$ and $\corr(y, \yhat) = |\, \corr(y, x)\, |$

So, $R^2 = \left[\corr(y, x)\right]^2 = \left[\corr(y, \yhat)\right]^2$

This will be handy when we have models with more than one predictor. But we will have to be careful: when we add more predictors to a model, $R^2$ will *always* increase or stay constant. Therefore we won't be able to reliably use $R^2$ as a way of comparing models to see which is better.

#### Adjusted R^2

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
```

Adjusted $R^2$ includes the degrees of freedom:

$$\text{adjusted } R^2 \;=\; 1 - \frac{\SSE \;/\; \SSE(\df)}{\SST \;/\; \SST(\df)} \;=\; 1 - \frac{\sum (y_i - \yhat_i)^2 \;/\; (n-p-1)}{\sum (y_i - \ybar)^2 \;/\; (n-1)}$$

Adjusted $R^2$ does *not* always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.

### R$^2$s are not additive

**Terminology:** For multiple linear regression, $R$ (the positive square root of $R^2$) is sometimes called the **multiple correlation coefficient.**

Model with both father's ($x_1$) and mother's height ($x_2$) included:

```{r}
### Estimate a multiple linear regression model
heightFit.both <- lm(height ~ father + mother, data=heightDataSons)
### View the 3-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.both)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.both)[-c(7:10)]
### Save the R-sqr value as a number (not a data frame)
heightFit.both.Rsqr <- as.numeric(glance(heightFit.both)[1])
```

Regression on both the fathers' and mothers' heights explains `r round(100*heightFit.both.Rsqr, 1)`% of the variability in sons' heights.

Model with only father's height ($x_1$) included:

```{r}
### Estimate a simple linear regression model
heightFit.fathers <- lm(height ~ father, data=heightDataSons)
### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(heightFit.fathers)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(heightFit.fathers)[-c(7:10)] ### Save the R-sqr value as a number (not a data frame) heightFit.fathers.Rsqr <- as.numeric(glance(heightFit.fathers)[1]) ``` Regression on the fathers' heights explains `r round(100*heightFit.fathers.Rsqr, 1)`% of the variability in sons' heights. Model with only mother's height ($x_2$) included: ```{r} ### Estimate a simple linear regression model heightFit.mothers <- lm(height ~ mother, data=heightDataSons) ### View the 2-by-4 table of coefficients, SEs, t-stats, and p-values tidy(heightFit.mothers) ### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc. glance(heightFit.mothers)[-c(7:10)] ### Save the R-sqr value as a number (not a data frame) heightFit.mothers.Rsqr <- as.numeric(glance(heightFit.mothers)[1]) ``` Regression on the mothers' heights explains `r round(100*heightFit.mothers.Rsqr, 1)`% of the variability in sons' heights. We see that `r round(100*heightFit.both.Rsqr, 1)`% $<$ `r round(100*heightFit.fathers.Rsqr, 1)`% $+$ `r round(100*heightFit.mothers.Rsqr, 1)`% because, for example, adding mothers' heights to the model only brings in their (marginal) information about sons' heights exclusive of information they also hold about fathers' heights. Of course, $R^2$ for the combined, two-predictor model must be larger than either of the $R^2$ for one-predictor models. Why? **Caution:** This is not always true for the adjusted $R^2$. ### Example: Fuel consumption data As an illustration of MLR modeling, we look at a dataset on U.S. fuel consumption by state (C. Bingham, American Almanac). We will investigate fuel consumption as a linear function of demographics, fuel tax policy, and other factors. ```{r} ### Read in the data on fuel taxes and demographics for 48 states ### (excludes Alaska and Hawaii) fuelData <- read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fuel.txt") ### Rearrange the order of the variables (in the order selected) ### then sort by the state code (AL, AR, AZ, CA, ...) fuelData <- fuelData %>% dplyr::select(state, pop, fuelc, fuel, nlic, dlic, tax, inc, road) %>% arrange(state) ### Display the data as a "tibble" (a "nice" format for a data frame) head(as.tibble(fuelData)) ``` variable | description |:-------|:-------------| state | State (excluding AK and HI) pop | Population in each state fuelc | fuel consumption, millions of gallons fuel | Motor fuel consumption, gal/person nlic | 1971 thousands of drivers dlic | Percent of population with driver's license tax | 1972 motor fuel tax rate, cents/gal inc | 1972 per capita income, thousands of dollars road | 1971 federal-aid highways, thousands of miles #### A first model: fuel = dlic + tax + inc + road We will only use per capita versions of the data: `dlic` vs. `nlic` (percent of licenced drivers vs. count) `fuel` vs. `fuelc` (gallons/person rather than gallons) `road` is a little problematic (more $ for states with large land area) `pop` could be included, but the variables are per capita already. #### Scatterplot matrix ```{r} ### Use select (from dplyr) to get the variables in order ### (response variable first), so scatterplot matrix better arranged. temp <- dplyr::select(fuelData, fuel, dlic, tax, inc, road) ### scatterplot matrix using ggpairs from GGally package ggpairs(temp) ``` `tax` and `dlic` appear to have a linear relationship with `fuel` ...and maybe `income`? 
`road` doesn't look to have much relationship to `fuel`.

#### The full model: fuel = dlic + tax + inc + road

```{r}
### Estimate various regression models we will use in these notes
fuelFit0 <- lm(fuel ~ 1, data=fuelData)
fuelFit1 <- lm(fuel ~ dlic, data=fuelData)
fuelFit2 <- lm(fuel ~ dlic + tax, data=fuelData)
fuelFit3 <- lm(fuel ~ dlic + tax + inc, data=fuelData)
fuelFit4 <- lm(fuel ~ dlic + tax + inc + road, data=fuelData)
tidy(fuelFit4)
```

### Comparing models: R^2

#### One predictor: $R^2$

Recall from simple linear regression (one predictor)

$$R^2 = \frac{\SSR}{\SST} = 1 - \frac{\SSE}{\SST}$$

SST = $\sum (y_i - \ybar)^2$ = total sum of squares

SSE = $\sum (y_i - \yhat_i)^2$ = error sum of squares

SSR = $\sum (\yhat_i - \ybar)^2$ = regression sum of squares

$\SST = \SSR + \SSE$

Even the degrees of freedom for each sum of squares add up!

SST(df) = SSE(df) + SSR(df)

$(n-1) = (n-2) + 1$

Finally, $r^2 = [\corr(y, x)]^2 = R^2$

#### Multiple predictors: $R^2$

All is the same for multiple linear regression except...

$r_1^2 = [\corr(y, x_1)]^2 \ne R^2$

Instead, recall that for simple linear regression, we showed

$$\corr(y, \yhat) = |\; \corr(y, x) \;|$$

For simple and multiple linear regression,

$$ R^2 = [ \corr(y, \yhat) ]^2 = \frac{\SSR}{\SST} = 1 - \frac{\SSE}{\SST}$$

#### Fuel example: $R^2$

```{r}
### We estimated the full model "fuelFit4" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
```

#### Every additional predictor increases $R^2$

Since more information reduces error (SSE), each additional predictor

+ decreases SSE
+ increases $R^2$

(Technically, adding a predictor might have *no effect* on SSE or $R^2$, if the estimated regression coefficient for that predictor was *exactly* 0. But this will almost never happen in practice.)

Here is $R^2$ for the full model: `dlic`, `tax`, `inc`, and `road` versus a 3-predictor model (`road` removed)

fuel = dlic + tax + inc + road vs. fuel = dlic + tax + inc

```{r}
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit4)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit4)[-c(7:10)]
```

```{r}
### We estimated the reduced model "fuelFit3" above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
```

$R^2$ went down even though we know `road` is not a significant predictor.

What about the adjusted $R^2$? $\quad$ It went up!

...which implies `road` does not add enough to the model to offset the additional "complexity" of the model.

### Adjusted $R^2$

$${\rm Adjusted}\; R^2 \;=\; 1 - \frac{\SSE \;/\; \SSE(df)}{\SST \;/\; \SST(df)} \;=\; 1 - \frac{\SSE \;/\; (n-p-1)}{\SST \;/\; (n-1)}$$

Adjusted $R^2$ sets up a "penalty" for model complexity.

Can use adjusted $R^2$ to compare "nested" models. Can even compare models that are not nested.

For example, compare the following two models with adjusted $R^2$, but do **NOT** compare using $R^2$:

fuel = dlic + inc $\qquad$ vs. $\qquad$ fuel = dlic + tax

```{r}
### Estimate a multiple linear regression
fuelFit2a <- lm(fuel ~ dlic + inc, data=fuelData)
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2a)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2a)[-c(7:10)]
### We estimated the fuelFit2 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit2)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit2)[-c(7:10)]
```

#### Adjusted $R^2$ is smaller than $R^2$

$${\rm Adjusted}\; R^2 \;=\; 1 - \frac{n-1}{n-p-1} (1 - R^2) \;=\; \frac{(n-1)R^2 - p}{n-1-p}$$

Example: $n=100$, $p=20$, $R^2=0.90$

Then, adjusted $R^2 = 0.87$.

Adjusted $R^2$ is essentially "handicapped" or "penalized" by the number of parameters ($p+1$) relative to $n$.

NOTE: Adjusted $R^2$ can be negative (not common in practice)

### Remove another predictor? fuel = tax + dlic

We already decided `road` did not add enough information to the model to justify keeping it in.

Should we also remove `inc`?

fuel = dlic + tax + inc vs. fuel = dlic + tax

```{r}
### We estimated the fuelFit2 and fuelFit3 models far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
tidy(fuelFit2)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
glance(fuelFit2)[-c(7:10)]
```

Note that the SEs got larger and the t-statistics got smaller. So, it can happen that a significant predictor becomes not significant when another significant predictor is removed.

Also, important predictors can become insignificant if too many inconsequential variables are in the model.

Not all of modeling can be based on statistical tests!

#### What if the intercept is not significant?

The model we have settled upon is

fuel = dlic + tax + inc

but the intercept is (barely) not significant:

```{r}
### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
```

Even so, we keep the intercept in the model. The intercept doesn't really add complexity ...and keeping it guarantees all the nice properties of least squares estimates.

#### Questions about including predictors ...or not

What are the consequences of... including non-significant variables in the model?
After all, $R^2$ always improves, so why omit variables?

What are the consequences of... omitting important variables?
After all, fewer parameters means

+ more df for variance estimate
+ less model complexity

### Model validation: fuel = dlic + tax + inc

#### Reminder: Model specifications

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$
    a. $E(\epsilon_i)=0$ for all $i$ (for all values of all predictors $x_j$)
    b. The errors are not correlated with the predictor values
    c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all $i$ (for all $x_j$)
3. The mean of $Y$ depends linearly on $x_1, x_2, \ldots, x_p$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

#### Checking the residuals against the model specifications

```{r fig.height=4}
### We estimated the 5 models fuelFitX far above
### Here, we are adding the fitted values and residuals
### for each model to the dataset
fuelDataResids <- fuelData %>%
  mutate(resids0 = residuals(fuelFit0), fitted0 = predict(fuelFit0),
         resids1 = residuals(fuelFit1), fitted1 = predict(fuelFit1),
         resids2 = residuals(fuelFit2), fitted2 = predict(fuelFit2),
         resids3 = residuals(fuelFit3), fitted3 = predict(fuelFit3),
         resids4 = residuals(fuelFit4), fitted4 = predict(fuelFit4)
         )
### Plot residuals vs.
fitted values with dashed horizontal line at 0 gf_point(resids3 ~ fitted3, data = fuelDataResids) %>% gf_hline(yintercept = 0, linetype = "dashed") %>% gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals") ``` ```{r fig.height=4} ### Draw a histogram of the residuals p1 <- gf_histogram(~ resids3, data = fuelDataResids, color="white", bins=12) %>% gf_labs(x = "Residuals (fuel = dlic + tax + inc)") ### Draw a normal quantile plot of the residuals p2 <- gf_qq(~ resids3, data = fuelDataResids) %>% gf_qqline() grid.arrange(p1, p2, ncol=2) ``` #### Does outlier influence our decision about normality? Here are the cases with the largest residuals: ```{r} ### select variables and arrange (sort) the data temp <- fuelDataResids %>% dplyr::select(state, fuel, dlic, tax, inc, resids3) %>% arrange(resids3) ### Look at the last few observations (where sorted values are largest) tail(temp, 8) ``` The largest residual happens to be for the state of Wyoming. Here are the residual plots again, with the Wyoming residual removed. ```{r fig.height=4} ### Use the filter function from dplyr package to filter out Wyoming temp <- filter(fuelDataResids, state != "WY") ### Plot residuals vs. fitted values with dashed horizontal line at 0 gf_point(resids3 ~ fitted3, data = temp) %>% gf_hline(yintercept = 0, linetype = "dashed") %>% gf_labs(x = "Fitted values (yhat = dlic + tax + inc)", y = "Residuals") ``` ```{r fig.height=4} ### Remember, Wyoming is removed from data, so that residual is not displayed ### Draw a histogram of the residuals p1 <- gf_histogram(~ resids3, data = temp, color="white", bins=12) %>% gf_labs(x = "Residuals (fuel = dlic + tax + inc)") ### Draw a normal quantile plot of the residuals p2 <- gf_qq(~ resids3, data = temp) %>% gf_qqline() grid.arrange(p1, p2, ncol=2) ``` ### Inference for multiple linear regression #### Several types of tests Can test for specific values of $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ 1. Test one single coefficient = 0 (or test another value) 2. Test if all coefficients $\beta_1, \beta_2, \ldots, \beta_p$ simultaneously = 0 + That is, predictors don't have much information about response 3. Test whether some subset of coefficients simultaneously = 0 4. Other constraints: Test whether... + some coefficients equal specific values + some coefficients are equal to the each other + some coefficients sum to a specific value + ...and so on Tests OK if model specifications reasonable and residuals plausibly from a normal distribution. #### 1. Test one single coefficient $$\frac{\bj - \beta_j}{\se(\bj)} \;\sim\; t_{n-p-1}$$ ```{r} ### We estimated the model "fuelFit3" far above ### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values tidy(fuelFit3) ``` #### 2. Test if all coefficients simultaneously = 0 To test multiple coefficients, we use the F-test. $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ $H_1:$ At least one of the $\beta_j \ne 0$. **NOTE:** Here, we are not testing $\beta_0=0$. Here is the process. Start with the **full model** (FM): $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$ (no restrictions on the coefficients) Compare to the **reduced model** (RM): $Y = \beta_0 + \epsilon$ (all other $\beta_j=0$ as stated in $H_0$) If the model specifications are valid (including Normality), Then the following comparison statistic has an $F$ distribution. 
$$F \;=\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)}$$

Let's try it with the fuel data: fuel = dlic + tax + inc

I encourage you to look at the R code.

```{r,collapse=TRUE}
### We estimated the models "fuelFit0" and "fuelFit3" far above
### ANOVA = ANalysis Of VAriance
### Basically, using the partitioning of SS to build an F-stat for you
### This compares reduced (fuelFit0) and full (fuelFit3) models
anova(fuelFit0, fuelFit3)

### Now, I'll do things a little more explicitly (manually)
### Use tidy() from the broom package to put the ANOVA output into a table
anovaTable3 <- tidy(anova(fuelFit0, fuelFit3))
### Grab the various pieces of the F-statistic from the ANOVA table
SSE.RM <- anovaTable3$rss[1]
SSE.FM <- anovaTable3$rss[2]
df.RM <- anovaTable3$res.df[1]
df.FM <- anovaTable3$res.df[2]
### Calculate the F-statistic
F.stat <- ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / ( SSE.FM / df.FM)
### Calculate the p-value of the F-test
p.value <- 1 - pf(F.stat, df.RM - df.FM, df.FM)
```

$$F \;=\; \frac{(`r SSE.RM` - `r SSE.FM`) \; / \; (`r df.RM` - `r df.FM`)}
{`r SSE.FM` \;/\; `r df.FM`}
\;=\; `r F.stat`$$

p-value = `r p.value`

#### Standard F-test from software

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$

$H_1:$ At least one of the $\beta_j \ne 0$.

is the test (and F-statistic) that R automatically calculates for regression.

```{r}
### We estimated the fuelFit3 model far above
### View the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
tidy(fuelFit3)
### View the F-statistic, R-sqr, adj-R-sqr, SD estimate, etc.
glance(fuelFit3)[-c(7:10)]
```

#### The ANOVA table and the F test for all $\beta_j=0$ (except not testing $\beta_0=0$)

**ANOVA = ANalysis Of VAriance**

Source | SS (Sum of Squares) | df | MS (Mean Squares)
|:-------------|:------------------------|:-------|:---------------------|
Model (Regression) | SSR = $\sum_i (\yhat_i-\ybar)^2$ | $p$ | MSR = SSR$\,/\,p$
Residual (Error) | SSE = $\sum_i (y_i-\yhat_i)^2$ | $n-p-1$ | MSE = SSE$\,/\,(n-p-1)$ = $\widehat{\sigma}{^2}$
Total | SST = $\sum_i (y_i-\ybar)^2$ = SSR + SSE | $n-1$ | MST = SST$\,/\,(n-1) = \var(y)$

When $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$,

Reduced model (RM): $Y = \beta_0 + \epsilon$ has SSE = $\sum_i (y_i-\ybar)^2$ on $n-1$ df

Full model (FM): $Y= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$ has SSE = $\sum_i (y_i-\yhat_i)^2$ on $n-p-1$ df

and the F-statistic is

$$\begin{aligned}
F &= \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)} \\
\\
&=\; \frac{\left[\sum (y_i-\ybar)^2 - \sum(y_i - \yhat_i)^2 \right] \;/\; [(n-1) - (n-p-1)]}{ \sum(y_i - \yhat_i)^2 \;/\; (n-p-1) } \\
\\
&=\; \frac{(\SST - \SSE) \;/\; p}{\SSE \;/\; (n-p-1)}
\;=\; \frac{\SSR \;/\; p}{\SSE \;/\; (n-p-1)}
\;=\; \frac{\MSR}{\MSE}
\end{aligned}$$

Since the textbook mentions it, here is another way to compute the F-statistic:

$$\begin{aligned}
F &=\; \frac{\MSR}{\MSE} \;=\; \frac{\SSR \;/\; p}{\SSE \;/\; (n-p-1)} \\
\\
&=\; \frac{1 \;/\; \SST}{1 \;/\; \SST} \times \frac{\SSR \;/\; p}{\SSE \;/\; (n-p-1)} \\
\\
&=\; \frac{(\SSR / \SST) \;/\; p}{(\SSE / \SST) \;/\; (n-p-1)}
\;=\; \frac{R^2_p \;/\; p} {(1 - R^2_p) \;/\; (n-p-1)}
\end{aligned}$$

...but there is nothing illuminating (to me) about this formula.

#### Test whether some set of coefficients simultaneously = 0

To test statements about multiple coefficients, we continue to use the F-test.
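With the fuel data, for example, we might ask whether `tax` and `inc` can both be dropped once `dlic` is in the model. One way such a comparison is usually run in R is to hand the reduced and full fits to `anova()`; here is a minimal sketch, assuming the `fuelFit1` (fuel ~ dlic) and `fuelFit3` (fuel ~ dlic + tax + inc) fits from above:

```{r}
### Sketch: F-test of H0: the coefficients of tax and inc are both 0
### Reduced model: fuel ~ dlic              (fuelFit1, estimated earlier)
### Full model:    fuel ~ dlic + tax + inc  (fuelFit3, estimated earlier)
anova(fuelFit1, fuelFit3)
```

The subsections below spell out the full-model versus reduced-model reasoning behind this comparison.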
Let's use the fuel data: fuel = dlic + tax + inc

$H_0: \beta_2 = \beta_3 = 0$ (`tax` and `inc` not needed in the model)

Here is the process.

Start with the **full model** (FM): $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$
(no restrictions on the coefficients)

Compare to the **reduced model** (RM): $Y = \beta_0 + \beta_1 x_1 + \epsilon$
($\beta_2=0$ and $\beta_3=0$ as stated in $H_0$)

If the model specifications are valid (including Normality),
then the following comparison statistic has an $F$ distribution.

df for F-statistic: [df(RM) $-$ df(FM)] and df(FM)

$$\begin{aligned}
F &=\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)} \\
\\
&=\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; (p-q)}{\SSE(\FM) \;/\; (n-p-1)} \\
\\
&=\; \frac{(R^2_p - R^2_q) \;/\; (p-q)}{(1 - R^2_p) \;/\; (n-p-1)}
\end{aligned}$$

where $p$ is the number of predictors in the full model (FM) and $q < p$ is the number of predictors in the reduced model (RM).

### Effects of changing units of measurement (scaling)

What happens to the regression coefficients if we change the units of the response?
Suppose we record the sons' heights in centimeters rather than inches.

```{r}
### Convert the sons' heights from inches to centimeters
cm <- 2.54
heightDataSons.cm <- heightDataSons %>%
  mutate(height.cm = height*cm)
### Fit the new linear model (response in cm instead of inches)
heightFit.both.cm <- lm(height.cm ~ father + mother, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.coef <- tidy(heightFit.both.cm)
```

Regression with original data:

```{r}
### This table of coefficients was saved earlier
heightFit.both.coef
```

Sons' heights changed to centimeters:

```{r}
### This table of coefficients was saved just above
heightFit.both.cm.coef
```

What is the ratio of the new coefficients and SEs compared to the old?

```{r}
### These tables of coefficients were saved earlier
### Coefficients in column 2, SEs in column 3
### Intercept in row 1
### Predictors in rows 2 ... to p
heightFit.both.coef[1:3, 2:3]
heightFit.both.cm.coef[1:3, 2:3]
heightFit.both.cm.coef[1:3, 2:3] / heightFit.both.coef[1:3, 2:3]
```

Rather confusingly, the mothers' and fathers' heights are still in inches. So, the coefficients are in units of cm/inch.

Of course, we can change all units to cm (and this better not affect our analysis at all!)
```{r}
### Add fathers' and mothers' heights (in cm) to the dataset
heightDataSons.cm <- heightDataSons.cm %>%
  mutate(father.cm = father*cm,
         mother.cm = mother*cm)
### Fit a new linear model with all variables in cm (not inches)
heightFit.both.cm.cm <- lm(height.cm ~ father.cm + mother.cm, data=heightDataSons.cm)
### Save the (p+1)-by-4 table of coefficients, SEs, t-stats, and p-values
heightFit.both.cm.cm.coef <- tidy(heightFit.both.cm.cm)
heightFit.both.cm.cm.coef
```

#### A general solution to scaling: Standardize the variables

$$\begin{aligned}
\widetilde{y} \;=\; \frac{y - \ybar}{s_y}
\quad {\rm and} \quad
\widetilde{x}_j \;=\; \frac{x_j - \xbar_j}{s_j}
\quad {\rm for}\; j=1,2,\ldots, p
\end{aligned}$$

#### First, let's just center the data

$$Y - \ybar \;=\; \beta_0 + \beta_1 (x_1 - \xbar_1) + \beta_2 (x_2 - \xbar_2) + \cdots + \beta_p (x_p - \xbar_p) + \epsilon $$

We are still using the modified data: sons' heights in cm, fathers' and mothers' heights in inches.

```{r}
### Add centered variables to the predictors (sons' in cm, others in inches)
heightDataSons.ctr <- heightDataSons.cm %>%
  mutate(height.ctr = height.cm - mean(height.cm),
         father.ctr = father - mean(father),
         mother.ctr = mother - mean(mother)
         )
### Estimate the multiple linear model with centered data
heightFit.both.ctr <- lm(height.ctr ~ father.ctr + mother.ctr, data=heightDataSons.ctr)
```

Regression with uncentered data:

```{r}
### This model was estimated earlier
tidy(heightFit.both.cm)
```

Regression after all variables centered:

```{r}
### This model was estimated just above
tidy(heightFit.both.ctr)
```

The intercept for the regression with centered data is zero. Why?

**Hint:** When we use least squares to estimate $\beta_0$ for a linear model like this

$$Y \;=\; \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

$$ {\rm then,}\; \widehat{\beta_0} \;=\; \ybar - \sum_{j=1}^p \bj \xbar_j$$

The other regression coefficients (and their SEs) are the same as when not centered.

#### Now, let's standardize the data

```{r}
### Add standardized variables to the dataset
heightDataSons.cm.std <- heightDataSons.cm %>%
  mutate(height.std = (height.cm - mean(height.cm)) / sd(height.cm),
         father.std = (father - mean(father)) / sd(father),
         mother.std = (mother - mean(mother)) / sd(mother)
         )
### Estimate linear model with standardized response and predictor variables
heightFit.both.cm.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.cm.std)
```

Regression with unstandardized data:

```{r}
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm)
```

Regression after all variables standardized:

```{r}
### Tidy up the coefficient output for a model estimated earlier
tidy(heightFit.both.cm.std)
```

This last regression is the same whether we measured in cm or inches. Why?
Regression with unstandardized original data (all in inches):

```{r}
### Tidy up the coefficient output for the original (all-inches) model estimated earlier
tidy(heightFit.both)
```

Regression after all variables standardized:

```{r}
### Add standardized variables to the dataset
heightDataSons.std <- heightDataSons %>%
  mutate(height.std = (height - mean(height)) / sd(height),
         father.std = (father - mean(father)) / sd(father),
         mother.std = (mother - mean(mother)) / sd(mother)
         )
### Estimate the linear model using standardized response and predictor vars
heightFit.both.std <- lm(height.std ~ father.std + mother.std, data=heightDataSons.std)
### View the coef, SEs, t-stats, p-values from the linear regression
tidy(heightFit.both.std)
```

We don't really gain anything from standardizing the original data since all variables are already in the same units (inches) for the same characteristic (height).

Plus, it seems that the standard deviations of the original heights were nearly the same already for each variable (before standardizing).

$s_y = `r sd(~ height, data=heightDataSons)`$

$s_{x1} = `r sd(~ father, data=heightDataSons)`$

$s_{x2} = `r sd(~ mother, data=heightDataSons)`$

So, the coefficients changed a little, but their interpretation is very different. Now, for fathers one standard deviation taller, their sons are expected to be $`r round(tidy(heightFit.both.std)$estimate[2], 4)`$ standard deviations taller (adjusted for the mothers' heights).

### Properties of LS estimators

#### Reminder: Model specifications

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$
    a. $E(\epsilon_i)=0$ for all $i$ (for all values of all predictors $x_j$)
    b. The errors are not correlated with the predictor values
    c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all $i$ (for all $x_j$)
3. The mean of $Y$ depends linearly on $x_1, x_2, \ldots, x_p$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

#### Facts about LS estimators

+ $\bj$s unbiased: $\E(\bj) = \beta_j$ for $j=0,1,2,\ldots,p$
+ $\var(\bj) = c_j \sigma^2$
    + let R compute the $c_j$s (requires matrix algebra)
+ $\bj \sim$ Normal($\,\beta_j, \sqrt{c_j}\sigma$)
+ $W = \SSE/\sigma^2$ is chi-square ($\chi^2$) with $\df = n - p - 1$
+ $\bj$s are independent of $\widehat{\sigma}$ (the spread of the residuals does not affect the slope estimates)
+ BLUE: Best Linear Unbiased Estimators (minimum variance)

### Imputing a missing value

```{r}
i <- 5
n <- dim(heightDataSons)[1]
```

Suppose for the $`r 5`th$ family in the data, the mother's height was not recorded.

We could just throw out case $i = `r 5`$.

...but what if a mother's or father's height is missing from several families?
Should we just throw out the information on the other parent so easily?

One solution: **imputation** of missing values. This can be done many ways. We will only discuss two.

#### Option 1: Replace missing value with average

For example, replace the missing $`r 5`th$ mother's height with the average height of the other $(n-1) = `r n-1`$ mothers' heights.

```{r message=FALSE}
kable(mean(~ mother[-5], data=heightDataSons))
```

#### Option 2: Replace missing value with regression prediction

+ Temporarily let mother's height be the response variable.
+ Let son's and father's heights be the predictors.
+ Remove case $i = `r 5`$ from the data (missing the mother's height).
+ Regress mother's height on both son's and father's height + Predict mother's height for case $i = `r 5`$ using the son's and father's heights from case $i = `r 5`$ ```{r} i temp <- heightDataSons[-i, ] lmfit <- lm(mother ~ father + height, data = temp) newdata <- data.frame(father = heightDataSons$father[i], height = heightDataSons$height[i]) newdata <- mutate(newdata, mother = predict(lmfit, newdata)) kable(newdata) ```
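Since the mothers' heights actually were recorded for every family in these data, we can compare the two imputed values with the observed one. A minimal sketch, reusing the objects `i`, `newdata`, and `heightDataSons` from the chunks above:

```{r}
### Sketch: compare the two imputation options with the recorded value
option1 <- mean(heightDataSons$mother[-i])  ### Option 1: average of the other mothers
option2 <- newdata$mother                   ### Option 2: regression prediction from above
actual  <- heightDataSons$mother[i]         ### the height that was actually recorded
kable(data.frame(option1, option2, actual))
```

The regression-based prediction (Option 2) uses more of the available information about the family than the overall average does.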