--- title: "STAT 224 Chapters 1-2" params: dotabs: no output: html_document: code_folding: hide css: styles.css df_print: kable number_sections: yes self_contained: yes theme: cerulean pdf_document: default --- ```{r test-main, child = 'preamble.Rmd'} ``` # Chapters 1 & 2 `r if(params$dotabs) paste('{.tabset}')` ## Lecture 1 First, I loaded some R packages needed for the code in this document. Click the `Code` button to see my R code. ```{r load-packages2, message=FALSE} library(tidyverse) library(knitr) library(mosaic) library(gridExtra) library(broom) library(modelr) library(car) ``` For any package, you install just once, then use `library` to load the package when needed. For example: `install.packages("mosaic")` only once, ever. `library(mosaic)` when you start up RStudio. ### Models #### Deterministic (Functional) Models * Some variables are related in nature. * Example: Equation from the CRC handbook describing a physical law. * $F \;=\; ma$ * For some output variable $Y$ and input $X$, * $Y \;=\; f(X)$ * A functional model perfectly relates $Y$ to $X$ (well... in theory). #### Statistical Models * A Statistical model is a simple, low-dimensional summary of * the relationship present in the data * the data-generation process * the relationship present in the population * We may **not** know the functional form of relationship * The model allows errors (**uncertainty**) * $Y \;=\; f(X) + \epsilon$ * random variable = deterministic function + random error * The model allows for deviations (**variability**) from the "true" function when estimating from data * $\widehat{Y} \;=\; \widehat{f}(x) + e_i$ * estimate of variable = estimate of deterministic function + observed error #### Linear Regression Models Collect data on + $Y$ = response variable + $X_1, X_2, \ldots, X_p$ = predictor variables General deterministic part of model $\quad Y \;=\; f(X_1,X_2, \ldots, X_p)$ We might opt for a linear relationship $$Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p$$ ...and then, allow for error (a statistical model) $$Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p + \epsilon$$ This is called a **linear regression model** #### "All models are wrong, but some are useful" (George Box) Few processes are actually linear. Still, the following _**are**_ linear regression models $$\begin{align} Y &\;=\; \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 \log(X) + \epsilon \end{align}$$ because the "coefficients" $\beta_0$ and $\beta_1$ enter the model linearly. #### Some non-linear relationships can be _linearized_. For example, the non-linear model $Y \;=\; \beta_0 e^{\beta_1 X} \epsilon$ can be **transformed** to a linear model (take the log of both sides) $$\begin{align} log(Y) &\;=\; \log(\beta_0) + \beta_1 X + \log(\epsilon) \\ \\ Y' & \;=\; \beta_0' + \beta_1 X + \epsilon' \end{align}$$
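To see the transformation in action, here is a minimal simulated sketch (the values $\beta_0 = 2$, $\beta_1 = 0.5$, the sample size, and the error SD are made-up choices for illustration, not course data): generate data from the multiplicative model above, then fit a straight line to $\log(Y)$.

```{r, message=FALSE, warning=FALSE}
# Minimal sketch: simulate Y = beta0 * exp(beta1 * X) * epsilon, then fit on the log scale.
# beta0 = 2, beta1 = 0.5, n = 100, and sd = 0.2 are assumptions chosen only for illustration.
set.seed(2024)
x_lin   = runif(100, min = 1, max = 10)          # made-up predictor values
eps_lin = exp(rnorm(100, mean = 0, sd = 0.2))    # multiplicative (log-normal) error
y_lin   = 2 * exp(0.5 * x_lin) * eps_lin

fit_lin = lm(log(y_lin) ~ x_lin)                 # linear model for the transformed response
coef(fit_lin)   # intercept estimates log(beta0) = log(2); slope estimates beta1 = 0.5
```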

**STOP**

#### Linear or not? Can be linearized or not? (a) $Y \;=\; \beta_0 + \beta_1^X + \epsilon$ (b) $Y \;=\; \beta_0 \beta_1^X \epsilon$ (c) $Y \;=\; \beta_0 + \beta_1 e^{X^2} + \epsilon$ (d) $Y \;=\; \beta_0 + \beta_1 X^2 + \beta_2 \log(X) + \epsilon$


Once we propose a linear model, the method is intuitive: Fit a linear function through the data points using some objective criterion for a good "fit". ### Data #### The Data Format We will endevour to provide you with **tidy** data: + Each row corresponds to one observational unit (case) + Each variable is one column | observation | response $Y$ | predictor $x$ | |:-----------:|:------------:|:-------------:| | 1 | $y_1$ | $x_1$ | | 2 | $y_2$ | $x_2$ | | ... | ... | ... | | ... | ... | ... | | ... | ... | ... | | $n$ | $y_n$ | $x_n$ | #### Example: Computer Repair A sample of records on computer repair service calls was collected. Two variables were observed: * Time to complete work (minutes) * Number of units repaired Which variable is response ($Y$)? Which variable is explanatory or "predictor" variable ($X$)? It depends... on the question of interest. What if we know how long the repairer worked. Did they repair as many units as we would have expected? What if we knew how many units needed repair. How much time should we schedule for the repairer? #### Example: Computer Repairs (response = time, predictor = units) $Y$ = Time to complete work (minutes) $X$ = Number of units repaired ```{r message=FALSE, warning=FALSE} # Read in the data from the course data repository repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt") repairData # view the data (OK, since small sample) glimpse(repairData) # view a summary of the data (OK, for larger samples too) ``` 1. Scatterplot using `ggplot2` package (with `ggformula` package to help out) 2. Scatterplot using `lattice` package 3. Scatterplot using `base` R (defaults not pretty) 4. Scatterplot using `base` R (make it prettier) ```{r, message=FALSE, warning=FALSE, fig.height=3} p1 = gf_point(Minutes ~ Units, data=repairData) # ggplot2 graphics made easier by ggformula package p2 = xyplot(Minutes ~ Units, data=repairData) # lattice graphics package grid.arrange(p1, p2, ncol=2) # arrange the plots side by side (2 columns) par(mfrow=c(1,2)) # prepare to arrange base R plots with 1 row, 2 cols plot(repairData$Units, repairData$Minutes) plot(repairData$Units, repairData$Minutes, pch=20, xlab="Units", ylab="Minutes") par(mfrow=c(1,1)) # return the base R setting to 1 plot (1 row, 1 col) ``` I usually use the `ggplot2` package (with the `ggformula` package to help out). **STOP** Demonstrate `ggformula` syntax.
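One possible version of that demo (a sketch, not necessarily what we will do in class): every `gf_` function takes a `y ~ x` formula plus a `data =` argument, and layers chain together with `%>%`.

```{r, fig.width=5, fig.height=5, message=FALSE, warning=FALSE}
# The ggformula pattern: gf_plotkind(y ~ x, data = ...), chained with %>%
gf_point(Minutes ~ Units, data = repairData) %>%    # scatterplot layer
  gf_smooth(method = "lm", se = FALSE) %>%          # add a fitted line (arguments passed to geom_smooth)
  gf_labs(x = "Units repaired", y = "Minutes to complete")
```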




Fancy add-ons you can "chain" together with `ggplot2` ```{r, message=FALSE, warning=FALSE, fig.width=5, fig.height=5} gf_point(Minutes ~ Units, data=repairData) %>% gf_rug() %>% gf_smooth(se = TRUE) ``` ### Summary Statistics #### Reminder: Summary Statistics Sample averages: $\;\overline{y} \;=\; \frac{1}{n}\sum_{i=1}^n y_i \qquad \overline{x} \;=\; \frac{1}{n}\sum_{i=1}^n x_i$ Sample variances: $$\var(y) \;=\; s_y^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2 \qquad \var(x) \;=\; s_x^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2$$ Sample standard deviations: $$\sd(y) \;=\; s_y \;=\; \sqrt{s_y^2}\;, \quad \sd(x) \;=\; s_x \;=\; \sqrt{s_x^2}$$ Standardized data: $$z_i \;=\; \frac{y_i - \overline{y}}{s_y} \qquad w_i \;=\; \frac{x_i - \overline{x}}{s_x}$$ Can show that $\overline{z} \;=\; 0$ and $\sd(z) \;=\; s_z \;=\; 1$ (same for $w$) **STOP** Show that $\overline{z} \;=\; 0$




$$\overline{z} \;=\; \frac{1}{n}\sum_{i=1}^n z_i \;=\; \frac{1}{n}\sum_{i=1}^n \frac{y_i - \overline{y}}{s_y} \;=\; \frac{1}{n(s_y)} \sum_{i=1}^n (y_i - \overline{y}) \;=\; \frac{1}{n(s_y)} \left[ \sum_{i=1}^n y_i - \sum_{i=1}^n \overline{y}\right] \;=\; \frac{1}{n(s_y)} [ n\overline{y} - n\overline{y}] \quad\;=\;\quad 0 $$ #### Example: Computer Repair Two variables: * $Y$ = Time to complete work (minutes) * $X$ = Number of units repaired ```{r, message=FALSE, warning=FALSE} ybar = mean(~ Minutes, data=repairData) ybar xbar = mean(~ Units, data=repairData) xbar sy = sd(~ Minutes, data=repairData) sy sx = sd(~ Units, data=repairData) sx ``` $\bar y = `r ybar` \quad \xbar = `r xbar`$ $s_y = `r sy` \quad s_x = `r sx`$ ```{r} # Use the "mutate" verb in the dplyr package to add variables # to an existing dataset (called a "data frame" in R) # There is a DataCamp.com course on this: "Data Manipulation in R with dplyr" # Here we add standardized versions of the original variables repairData = mutate(repairData, stdMinutes = (Minutes - mean(Minutes)) / sd(Minutes), stdUnits = (Units - mean(Units)) / sd(Units) ) repairData # Confirm that the average of standardized data is 0, with sd=1 zbar = mean(~ stdMinutes, data=repairData) zbar sz = sd(~ stdMinutes, data=repairData) sz ``` For the standardized response (Minutes): $\zbar = `r round(zbar, 1)` \quad s_z = `r sz`,$ as expected. Standardizing the data does **not** change the "relationship" ```{r, message=FALSE, warning=FALSE, fig.height=3} p1 = gf_point(Minutes ~ Units, data=repairData) p2 = gf_point(stdMinutes ~ stdUnits, data=repairData) grid.arrange(p1, p2, ncol=2) ``` ### Correlation #### Reminder: Covariance and Correlation * variance of $y$ is $$\var(y) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y})^2 \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (y_i - \overline{y})$$ * **covariance** of $y$ and $x$: $$\cov(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})$$ * variance of $z$ is $$\var(z) \;=\; \frac{1}{n-1} \sum_i (z_i - \overline{z})^2 \;=\; \frac{1}{n-1} \sum_i (z_i - 0)^2 \;=\; \frac{1}{n-1} \sum_i z_i^2 \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) $$ From here, it's simple to show that $\var(z) = 1$: $$\var(z) \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \quad\;=\;\quad \frac{1}{s_y^2} \frac{1}{n-1} \sum_i \left(y_i - \overline{y}\right) \left(y_i - \overline{y}\right) \quad\;=\;\quad \frac{1}{s_y^2} var(y) = \frac{1}{s_y^2} s_y^2 = 1 $$ * **correlation** of $y$ and $x$ is $cov(z,w)$ of standardized data: $$\begin{align} r \;=\; \corr(y,x) \quad &\;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right) \\ \\ \quad &\;=\; \quad \frac{\cov(y,x)}{s_y s_x} \quad \;=\; \quad \frac{\cov(y,x)}{\sqrt{\var(y) \var(x)}} \end{align}$$ #### How covariance measures "relationship" Consider the following variables * $V$ = hours of sleep two nights before midterm exam * $W$ = hours of sleep the night before midterm exam * $Z$ = hours of sleep the night after the midterm exam Consider $V$ and $W$. Which is reponse? Which is predictor? Would you expect a positive or negative correlation? Consider $W$ and $Z$. Which is reponse? Which is predictor? Would you expect a positive or negative correlation? **STOP** Ponder the questions above.








#### How covariance measures "relationship" * `sleep2Before` = $V$ = hours of sleep two nights before midterm exam * `sleepBefore` = $W$ = hours of sleep the night before midterm exam * `sleepAfter` = $Z$ = hours of sleep the night after the midterm exam Here is a little data: ```{r, fig.height=3} v = c(5, 7, 4, 8, 6) w = c(3, 8, 2, 5, 6) z = c(8, 7, 9, 7, 8) df = tibble(sleep2Before=v, sleepBefore=w, sleepAfter=z) df # gf_vline adds a vertical line at "xintercept" # gf_hline adds a horizontal line at "yintercept" p1 = gf_point(sleepBefore ~ sleep2Before, data=df) %>% gf_vline(xintercept=mean(~ sleep2Before, data=df), linetype=2) %>% gf_hline(yintercept=mean(~ sleepBefore, data=df), linetype=2) p2 = gf_point(sleepAfter ~ sleepBefore, data=df) %>% gf_vline(xintercept=mean(~ sleepBefore, data=df), linetype=2) %>% gf_hline(yintercept=mean(~ sleepAfter, data=df), linetype=2) grid.arrange(p1, p2, ncol=2) ``` $$\cov(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})$$ ```{r} covwv = cov(sleepBefore ~ sleep2Before, data=df) covwv covzw = cov(sleepAfter ~ sleepBefore, data=df) covzw ``` $\cov(w,v) = `r covwv`$ $\cov(z,w) = `r covzw`$ What if we recorded the data in minutes instead of hours? Then, the relationships are still the same ```{r} df2 = df*60 p3 = gf_point(sleepBefore ~ sleep2Before, data=df2) %>% gf_vline(xintercept=mean(~ sleep2Before, data=df2), linetype=2) %>% gf_hline(yintercept=mean(~ sleepBefore, data=df2), linetype=2) p4 = gf_point(sleepAfter ~ sleepBefore, data=df2) %>% gf_vline(xintercept=mean(~ sleepBefore, data=df2), linetype=2) %>% gf_hline(yintercept=mean(~ sleepAfter, data=df2), linetype=2) grid.arrange(p1, p2, p3, p4, ncol=2) ``` ...but, the covariance changes: ```{r} cov(60*sleepBefore ~ 60*sleep2Before, data=df) cov(60*sleepAfter ~ 60*sleepBefore, data=df) ``` $\cov(60w,60v) = `r 60*60*covwv`$ $\cov(60z,60w) = `r 60*60*covzw`$ Standardization (correlation) takes care of the units problem. $$r \;=\; \corr(y,x) \quad \;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right)$$ ```{r} corrwv = cor(sleepBefore ~ sleep2Before, data=df) corrwv cor(60*sleepBefore ~ 60*sleep2Before, data=df) corrzw = cor(sleepAfter ~ sleepBefore, data=df) corrzw cor(60*sleepAfter ~ 60*sleepBefore, data=df) ``` ```{r} covwv = cov(sleepBefore ~ sleep2Before, data=df) covzw = cov(sleepAfter ~ sleepBefore, data=df) ``` $\corr(w,v) = `r corrwv`$ $\corr(60w,60v) = `r corrwv`$ $\corr(z,w) = `r corrzw`$ $\corr(60z,60w) = `r corrzw`$ #### Facts about correlation and covariance * Covariance and correlation are **measures of association** (relationship) * Positive or negative, according to the direction of the association * Covariance is **not** scale invariant $$\cov(a + by, c + dx) \;=\; bd\cov(y,x)$$ * Correlation is scale invariant (unitless) $$\corr(a + by, c + dx) \;=\; {\rm sign}(bd)\corr(y,x)$$ * The sample **correlation coefficient** between $y$ and $x$ is $r \;=\; \corr(y,x)$ * $-1 \le r \le 1$ * $r = -1$ or $r = 1$ means perfect correlation: * All the points are on a straight line. * Strength of association:$\;\; r$ closer to $-1$ or $1$ is strongest * $r$ is a sample estimate for population correlation coefficient: $\;\rho \;=\; \corr(Y,X)$. 
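A quick numerical check of the scale-invariance facts above, using the sleep data and made-up constants $a = 3$, $b = -2$, $c = 10$, $d = 5$ (a sketch; any constants would do):

```{r}
# Check cov(a + b*y, c + d*x) = b*d*cov(y,x) and corr(a + b*y, c + d*x) = sign(b*d)*corr(y,x)
dfCheck = mutate(df,
                 yNew = 3 - 2 * sleepBefore,    # a + b*y  with a = 3,  b = -2
                 xNew = 10 + 5 * sleep2Before)  # c + d*x  with c = 10, d = 5

cov(yNew ~ xNew, data = dfCheck)                       # equals b*d*cov(y, x) ...
(-2) * 5 * cov(sleepBefore ~ sleep2Before, data = df)  # ... = -10 * cov(y, x)

cor(yNew ~ xNew, data = dfCheck)                       # equals sign(b*d)*corr(y, x) ...
-cor(sleepBefore ~ sleep2Before, data = df)            # ... = -corr(y, x)
```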
#### Guess the correlation

#### Always make a scatterplot when correlation is calculated

Let $Y = 50 - X^2$, then

* $Y$ and $X$ have a **perfect** relationship (no deviations from the curve)
* ...but the relationship is quadratic, not linear
* By symmetry, the correlation in this case is exactly zero!

```{r, fig.width=5, fig.height=5}
mydata = tibble(x = c(-7:7), y = 50 - x^2)
mydata
gf_point(y ~ x, data=mydata) +
  geom_vline(data=mydata, aes(xintercept=mean(x)), linetype=2) +
  geom_hline(data=mydata, aes(yintercept=mean(y)), linetype=2)
cor(y ~ x, data=mydata)
```

Believe it or not (HW), all four relationships below have exactly the same correlation coefficient.

```{r}
AnscombeData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P029b.txt")
p1 = gf_point(Y1 ~ X1, data=AnscombeData)
p2 = gf_point(Y2 ~ X2, data=AnscombeData)
p3 = gf_point(Y3 ~ X3, data=AnscombeData)
p4 = gf_point(Y4 ~ X4, data=AnscombeData)
grid.arrange(p1, p2, p3, p4, ncol=2)
```

We definitely should make a plot to interpret correlation!

#### Example: Computer Repair

Two variables:

* $Y$ = Time to complete work (minutes)
* $X$ = Number of units repaired

The relationship appears linear.

```{r, fig.width=5, fig.height=5}
repairData
gf_point(Minutes ~ Units, data=repairData)
```

Correlation is quite high.

```{r}
covyx = cov(Minutes ~ Units, data=repairData)
covyx
corryx = cor(Minutes ~ Units, data=repairData)
corryx
```

Covariance and correlation in original units:

$\cov(y,x) = `r covyx`$

$\corr(y,x) = `r corryx`$

```{r}
covzw = cov(stdMinutes ~ stdUnits, data=repairData)
covzw
corrzw = cor(stdMinutes ~ stdUnits, data=repairData)
corrzw
```

Covariance and correlation in standardized units:

$\cov(z,w) = `r covzw`$

$\corr(z,w) = `r corrzw`$

### Simple Linear Regression (one predictor)

#### The linear regression model

$$\begin{align}
Y & \;=\; f(X) + \epsilon \\ \\
& \;=\; \beta_0 + \beta_1 X + \epsilon
\end{align}$$

This is called a **simple** linear regression model since there is only one predictor: $X$

random variable = deterministic linear function + random error

**Illustration of data generation**





##### Why model the data-generating process If we make a detailed model of how the data are generated, we can: * Generate samples from the model -- simulated data * Find fitted regression coefficients for these simulated data, and then... * Plot sampling distributions for these fitted coefficients * See how much the fitted coefficients differ from the true population coefficients, and then... * Find sampling distributions for the differences between fitted and true coefficients, and then... * Do hypothesis testing, calculate confidence intervals, etc. **Or**, we can do all of the above analytically. This gives us results in nice algebraic form, and lets us see which exact assumptions about the data generating process were necessary to our conclusions. #### Keeping it simple: assumptions for Simple Linear Regression If the real data generation process is not well-described by a model meeting the following criteria, then we need a different model. ##### Keeping it simple: Treat $x$ values as "fixed" Throughout the course, we will consider the $x$ values as fixed. That is, all models are built **conditional** on the $x$ values observed. Adding in a model for the distribution of $X$ is beyond the scope of the course. Still, the model conditional on the observed $x$ values (although incorrect as all models are) is very useful for summarizing and describing relationships and making predictions for $Y$ based on new observations on $x$. ##### Keeping it simple: A probability model for the errors 1. The errors are **uncorrelated** with each other * $\corr(Y_i, Y_j) \;=\; \corr(\epsilon_i, \epsilon_j) \;=\; 0$ for all $i\ne j$ 2. The errors have no systematic info about $Y$ a. $E(\epsilon_i)=0$ for all $i$ (for all $x$) * The distribution of errors has mean zero * No systematic bias b. The errors are not correlated with the $x$ values c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all $i$ (for all $x$) ##### Keeping it simple: The mean of $Y$ is a linear function of $x$ The mean of $Y$ must depend linearly on $x$. $$\begin{align} E(Y\; |\; X=x) &\;=\; E(\beta_0 +\beta_1 x + \epsilon ) \\ \\ & \;=\; \beta_0 +\beta_1 x + E(\epsilon) \\ \\ & \;=\; \beta_0 +\beta_1 x + 0 \;=\; \beta_0 +\beta_1 x \end{align}$$ So, if the errors have no relationship to the values of $x$, it follows that the mean of $Y$ is a linear function of $x$. (Or a better way to put it: if the mean of $Y$ is *not* a linear function of $x$, the errors won't meet criterion 2b.)





## Lecture 2 ### The model for individuals in the population For individual $i$ in the population with predictor value $X=x=x_i$ $$Y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i$$ This model says that the data generation process can be described as fixing values for $x$, then finding corresponding $y=\beta_0 + \beta_1 x$, and finally adding a random error $\epsilon$ for each individual. The average response for such individuals is $E(Y\; |\; X=x_i) \;=\; \beta_0 +\beta_1 x_i$ Let's try generating a sample from a population where we have fixed values for $\beta_0$, $\beta_1$, and $x$ values. ```{r, echo=TRUE} beta0_sim1 = 3 beta1_sim1 = 4 n_sim1 = 50 mx_sim1 = 2 sx_sim1 = 1 ``` I set $\B0 = `r beta0_sim1`$, $\B1 = `r beta1_sim1`$, and $n= `r n_sim1`$. First we fix the values and pick a set of $n = `r n_sim1`$ $x$ values. I'll pick them from a Normal($\mu = `r mx_sim1`, \sigma = `r sx_sim1`$) distribution. ```{r} set.seed(1234567) ``` ```{r, echo=TRUE, fig.width=5, fig.height=5} # Generate x values for individuals in population X = rnorm(n_sim1, m = mx_sim1, s = sx_sim1) gf_histogram(~ X, bins=12, color = "white") ``` If the relationship is deterministic, we could generate $Y$-values with a simple linear relationship: $Y = `r beta0_sim1` + `r beta1_sim1` x$ ```{r, fig.width=5, fig.height=5} Y_without_error = beta0_sim1 + beta1_sim1 * X # Y values without errors gf_point(Y_without_error ~ X, color='red') ``` What is the intercept of this line? What is its slope?


```{r, echo=TRUE, fig.width=5, fig.height=5} gf_point(Y_without_error ~ X, color = 'red') %>% gf_abline(slope = beta1_sim1, intercept = beta0_sim1) ``` Of course, for a statistical model, we only require the **mean** of $Y$ to be a linear function of $x$ ```{r} sd_error_sim1 = 2 ``` So, let's generate a set of errors $\epsilon$, one for each observation. I'll pick them from a normal distribution: Normal($\mu = 0, \sigma = `r sd_error_sim1`$) . **Reminder:** According to our simple statistical model, the errors have mean = 0 ```{r} set.seed(123466) ``` ```{r, echo=TRUE, fig.width=5, fig.height=5} # Generate error values for individuals in population with given x-values error = rnorm(n_sim1, m = 0, s = sd_error_sim1) gf_histogram(~ error, bins=12, color = "white") ``` Finally, we can generate $Y$ values from the $x$ and error values by adding the errors to the linear function of $x$ values. ```{r, echo=TRUE, fig.width=5, fig.height=5} Y = beta0_sim1 + beta1_sim1 * X + error # Y values, with errors generated_data = tibble(X, error, Y_without_error, Y) head(generated_data, 10) gf_point(Y ~ X, data = generated_data, color = 'blue', show.legend = TRUE) %>% gf_point(Y_without_error ~ X, data = generated_data, color = 'red') %>% gf_abline(intercept = beta0_sim1, slope = beta1_sim1) ``` ##### What does $\beta_0$ represent? We have already seen that $\beta_0$ is the intercept of the above graphs. Let's make this more precise. When $x=0$, we expect $Y$ to be $\beta_0$. $$E(Y\; |\; X=0) \;=\; \beta_0 +\beta_1 (0) \;=\; \beta_0$$ **Computer repair data**: What does $\B0$ represent? ```{r, fig.width=5, fig.height=5} gf_point(Minutes ~ Units, data=repairData) + xlim(-1, max(~ Units, data=repairData)) + ylim(10, max(~ Minutes, data=repairData)) # gf_smooth(method = "lm", se = FALSE) + ```
**STOP ...and think about it.**





When $x=0$, we estimate the mean of $y$ to be $\yhat = \b0 + \b1 (0) = \b0$ minutes. For the computer example, this means that when no units are serviced, (guessing from the graph) we expect the work to take about 5 minutes. Hmmmm.... That doesn't make sense. Perhaps interpret this a "start up" time before working on the 1st unit? ##### What does $\beta_1$ represent? We have already seen that $\beta_1$ is the slope of the above graphs. Let's make this more precise, too -- and in a way that will carry forward when we go to multiple predictor variables, later in the course. For individuals with predictor value $x+1$ versus those with $x$, we expect the reponse to be $\beta_1$ units higher (or lower). $$\begin{align} E(Y\; |\; X=x+1) &\;=\; \beta_0 +\beta_1 (x+1) \\ \\ &\;=\; \beta_0 +\beta_1 x + \beta_1 \;=\; E(Y\; |\; X=x) + \beta_1 \end{align}$$ That is, $\; E(Y\; |\; X=x+1) - E(Y\; |\; X=x) \;=\; \beta_1$ **Computer repair data**: What does $\beta_1$ represent? ```{r, fig.width=5, fig.height=5} gf_point(Minutes ~ Units, data=repairData) + xlim(-1, max(~ Units, data=repairData)) + ylim(10, max(~ Minutes, data=repairData)) # gf_smooth(method = "lm", se = FALSE) + ```
**STOP ...and think about it.**





For each additional unit to be serviced, we estimate the mean time to completion (guessing from the graph) to be about 12 minutes longer.

Estimates of $\B0$ and $\B1$ from data:

```{r}
tidy(lm(Minutes ~ Units, data=repairData))
```

We will soon learn how to "best" calculate a slope and intercept from data.

### Estimating $Y$ For Simple Linear Regression

##### Some terminology for estimates of $Y$ or its average

For individual $i$ in the sample

$$y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i$$

Suppose we know $x_i$ but not $y_i$. The true value is $y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i$, but we don't know $\epsilon_i$.

The **fitted** (estimated) value of $Y$ for individual $i$ **already in the sample** is

$$\yhat_i \;=\; \b0 + \b1 x_i$$

The estimated **average** (expected) value for **all individuals in the population with $X=x_0$** is

$$\widehat{\mu}_0 \;=\; \b0 + \b1 x_0$$

The **predicted** (estimated) value for **a yet-to-be-observed or just now observed individual** in the population with $X=x_0$ is

$$\yhat_0 \;=\; \b0 + \b1 x_0$$

##### We are actually estimating model parameters

Most of the time, we don't know anything about the underlying population, including the true values of the parameters $\beta_0$ and $\beta_1$, and $\var(\epsilon)=\sigma^2$. All we have are observations in a sample. We will use these observations to calculate *estimates* of $\beta_0$ and $\beta_1$, as well as of the errors $\epsilon_i$ and their variance.

#### Residuals

Suppose we have estimates for $\beta_0$ and $\beta_1$. We can then use these estimates to calculate the fitted response $\yhat_i$ for each observation. The difference between the observed response and the fitted response is an estimate of $\epsilon_i$.

The estimated error (estimate of $\epsilon_i$) is called a **residual** ($e_i$)

$$\begin{align}
{\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\
&\;=\; {\rm observed - expected} \\
&\;=\; {\rm observed - predicted} \\ \\
&\;=\; y_i - \yhat_i\\ \\
&\;=\; y_i - (\b0 + \b1 x_i)
\end{align}$$

```{r, fig.height=4}
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
a = 10 b = 4 mx = 15 sx = 5 s = 20 nsim2 = 100 set.seed(123456) x = rnorm(nsim2, m=mx, sd=sx) y = a + b*x + rnorm(nsim2, m=0, sd=s) ybar = mean(y) xbar = mean(x) # plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y))) # abline(a=a, b=b, lwd=3) # abline(h = ybar) # identify(y ~ x) xi = xbar + sx yi = ybar + 3*s yhati = a + b*xi plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l") abline(a=a, b=b, lwd=3) points(xi, yi, pch=20, cex=2) points(xi, yhati, cex=2) lines(c(xi,xi), c(0, ybar)) lines(c(xi,xi), c(ybar, yhati), lty=2) lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3) lines(c(0,xi), c(yi, yi)) lines(c(0,xi), c(ybar, ybar)) lines(c(xbar - sx, xbar - sx), c(ybar, yi)) text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2) text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4) text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2) text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)") text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1) text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1) ``` **Residuals for the computer repair data**: ```{r} # Find the least squares estimates for intercept and slope (lm function) repairFit = lm(Minutes ~ Units, data=repairData) # Once you have an "lm object", you can grab all kinds of summaries from it # save the fitted values: predict function # save the residuals: residuals function # Use the "mutate" verb in the dplyr package to add variables # to an existing dataset (called a "data frame" in R) # There is a DataCamp.com course on this: "Data Manipulation in R with dplyr" repairData = mutate(repairData, fitted = predict(repairFit), resids = residuals(repairFit) ) # This is a pretty fancy plot that I Googled and borrowed from here: # https://www.r-bloggers.com/visualising-residuals/ ggplot(repairData, aes(x = Units, y = Minutes)) + geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope geom_segment(aes(xend = Units, yend = fitted), alpha = 0.2) + # alpha to fade lines geom_point() + geom_point(aes(y = fitted), shape = 1) + theme_bw() # Add theme for cleaner look ``` ### Model Estimation #### So, how do we estimate $\beta_0$ and $\beta_1$? What line should we choose to "fit" the model to data? (This is just computer simulated data for illustration.) ```{r message=FALSE, warning=FALSE} # This entire section is borrowed from Hadley Wickham: R for Data Science # http://r4ds.had.co.nz/model-basics.html hadleyData <- modelr::sim1 ggplot(hadleyData, aes(x, y)) + geom_point() ``` #### How do we estimate $\beta_0$ and $\beta_1$? ```{r} m = 250 ``` Here are $`r m`$ random possible models, but most are terrible! ```{r} set.seed(12345) models = tibble( b0 = runif(m, -20, 40), b1 = runif(m, -5, 5) ) ggplot(hadleyData, aes(x, y)) + geom_abline(aes(intercept = b0, slope = b1), data = models, alpha = 1/4) + geom_point() ``` ### Loss functions No matter which two number we might choose to measure slope and intercept, we are summarizing an entire distribution with two numbers. + There is a cost. + We lose information. + We should measure that loss and be aware of its magnitude. + Statisticians measure loss numerically with a "loss function" + measures the distance of the data from the line given with a two-number summary (the "slope" and "intercept") and + is a measure of distance (spread) from the line. 
#### One criterion: Minimize the sum of squared residuals

The **least squares** estimates $\b0$ and $\b1$ minimize the sum of squared residuals.

In our textbook, SSE is called the "Sum of Squared Residuals (Errors)" and MSE is called the "Mean Squared Residual (Error)".

**NOTATION CAUTION:** This is a bit confusing since "errors" ($\epsilon_i$) are not observed. Only residuals are observed, but "SSR" is reserved for another quantity to be discussed later.

$$\SSE \;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \yhat_i)^2 \;=\; \sum_i (y_i - \b0 - \b1 x_i)^2$$

```{r}
hadleyFit = lm(y ~ x, data=hadleyData)
mydata = hadleyData
mydata$fitted = predict(hadleyFit)        # Save the fitted values
mydata$residuals = residuals(hadleyFit)   # Save the residual values

ggplot(mydata, aes(x, y)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
  geom_segment(aes(xend = x, yend = fitted), alpha = 0.2) +      # alpha to fade lines
  geom_point() +
  geom_point(aes(y = fitted), shape = 1) +
  theme_bw()  # Add theme for cleaner look
```

#### Least squares estimates: $\; \b0$ and $\; \b1$

Which models have the smallest sum of squared residuals (SSE)?

```{r}
model1 = function(b, data) {
  b[1] + data$x * b[2]
}

# Sum of squared residuals (SSE) -- changed from Hadley's RMSE
measure_distance = function(mod, data) {
  diff = data$y - model1(mod, data)
  sum(diff ^ 2)
}

hadley_SSE = function(b0, b1) {
  measure_distance(c(b0, b1), hadleyData)
}

models = models %>%
  mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))

ggplot(hadleyData, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = b0, slope = b1, colour = SSE),
    data = filter(models, rank(SSE) <= 10)
  )
```

Let's try to find the smallest SSE on our own.

Among the $`r m`$ models, we have isolated the 10 "closest" to the data

```{r}
ggplot(models, aes(b0, b1)) +
  geom_point(data = filter(models, rank(SSE) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = SSE)) +
  labs(x = "Intercepts", y = "Slopes")
```

Here is a more organized approach: A grid search

```{r}
grid = expand.grid(
  b0 = seq(-5, 20, length = 25),
  b1 = seq(1, 3, length = 25)
) %>%
  mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))

grid %>%
  ggplot(aes(b0, b1)) +
  geom_point(data = filter(grid, rank(SSE) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = SSE))
```

```{r}
ggplot(hadleyData, aes(x, y)) +
  geom_point(size = 2, colour = "grey30") +
  geom_abline(
    aes(intercept = b0, slope = b1, colour = SSE),
    data = filter(grid, rank(SSE) <= 10)
  )
# End of section borrowed from Hadley Wickham: R for Data Science
```

#### Finally, let's find the least squares estimates for $\b0$ and $\b1$

Of course, we can make the grid finer and finer to get closer to the "best fit" line

...or just use calculus to minimize the SSE.

We obtain the least squares estimators by minimizing the sum of squared residuals. You do **NOT** need to be able to **do** the math, just understand it.
Find $\b0$ and $\b1$ that lead to: $$\min(\SSE) \;=\; \min \sum_i e_i^2 \;=\; \min \sum_i (y_i-\yhat_i)^2 \;=\; \min \sum_i (y_i-\b0-\b1 x_i)^2$$ * Differentiate the SSE with respect to $\b0$ and $\b1$, * Set the two equations equal to zero, and * solve for $\b0$ and $\b1$ $$\begin{align} \textrm{w.r.t } \b0 \Rightarrow -2\sum (y_i - \b0 - \b1 x_i) \;=\; 0 \hskip1in (1) \\ \\ \textrm{w.r.t } \b1 \Rightarrow -2\sum x_i(y_i - \b0 - \b1 x_i) \;=\; 0 \hskip1in (2) \end{align}$$ #### The least squares estimates for $\b0$ and $\b1$ From equation ($1$): $\;\b0 \;=\; \ybar - \b1 \xbar$ Plug this into equation ($2$): $$\begin{align} & \sum x_i(y_i - \ybar - \b1\xbar-\b1x_i)=0 \\ \\ \implies\; & \sum x_i(y_i - \ybar)\;=\; \b1 \sum x_i(x_i - \xbar) \\ \\ \implies\; & \b1\;=\; \frac{\sum x_i(y_i - \ybar)}{\sum x_i(x_i - \xbar)} \hskip2in (3) \end{align}$$ #### Rewriting $\b1$ Note that $$\begin{align} & \sum (x_i-\xbar)(y_i - \ybar) \\ \\ &\;=\; \sum x_i (y_i-\ybar) -\sum \xbar (y_i-\ybar) \\ \\ &\;=\; \sum x_i (y_i-\ybar) - \xbar \sum (y_i-\ybar) \\ \\ &\;=\; \sum x_i (y_i-\ybar) - \xbar (0) \\ \\ &\;=\; \sum x_i (y_i-\ybar) \end{align}$$ Similarly, we have * $\sum (x_i-\xbar)(y_i - \ybar)\;=\; \sum y_i(x_i-\xbar)$ * $\sum (x_i-\xbar)^2\;=\; \sum x_i(x_i-\xbar)$ Therefore, according to ($3$) we have other ways to write the estimator $$\b1 \;=\; \frac{\sum x_i(y_i-\ybar) }{\sum(x_i-\xbar)^2} \;=\; \frac{\sum (x_i-\xbar)y_i}{\sum(x_i-\xbar)^2} \;=\; \frac{\sum (x_i-\xbar)(y_i-\ybar) }{\sum(x_i-\xbar)^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}$$ #### Summary: The simplest formulas for estimators $\b0$ and $\b1$ $\b0 \;=\; \ybar - \b1 \xbar$ $\b1 \;=\; r \frac{s_y}{s_x}$ ### Side note: Simpler Model from Back in STAT 101 (No Predictor): Why did we choose $\ybar$ to estimate $\mu$? When sampling from one population of quantitative values $Y$, we estimated the mean of $Y$: $E(Y) = \mu$. Using our new linear regression notation, we actually estimated the mean of $Y$ for the linear model: $\qquad Y = \beta_0 + \epsilon \;\;=\;\; \mu + \epsilon$ **Now, we want to estimate $\widehat{\mu} = \yhat = \b0$** ##### A "loss function" We will again use as our loss function the sum (or mean) of squared residuals. $$ SSE(a) = \sum_{i=1}^n\, (y_i-a)^2 \qquad\qquad MSE(a) = \frac{1}{n-1}\sum_{i=1}^n\, (y_i-a)^2 $$ What value of $a$ should we choose if loss measured by SSE? It seems reasonable that $a$ should be in the "center" of the data. But which value in the middle would be best? One optimality criteria: Choose a number that minimizes SSE. Using the least squares estimation method, $\b0$ is the value that minimizes $$\SSE \;=\; \sum (y_i - \widehat{\mu})^2 \;=\; \sum (y_i - \yhat)^2 \;=\; \sum(y_i - \b0)^2$$ So, we want to minimize a function of the form $f(a) = \sum (y_i -a)^2$ with respect to $a$. $$\frac{d}{da} f(a) = 2 \sum (y_i - a) (-1) = -2 \sum(y_i - a)$$ Set the derivative = 0 and solve for $a$. $$ \begin{align} -2 \sum(y_i - a) = 0 \implies\; & \sum(y_i - a) = 0 \\ \implies\; & \sum y_i - \sum a = 0 \\ \implies\; & \sum y_i - n a = 0 \\ \implies\; & \sum y_i = n a \\ \implies\; & a = \frac{1}{n} \sum y_i \\ \implies\; & a = \ybar \end{align} $$ The 2nd derivative of $f(a)$ is $$\frac{d}{da} (-2)\sum (y_i -a) = \frac{d}{da} (-2)\sum y_i + \frac{d}{da} (-2)\sum (-a) = 0 + 2n > 0$$ So, $\b0 = \ybar$ is a minimum, called the "least squares" estimate for $\mu$. 
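A quick numerical sanity check (a sketch, not part of the derivation): let `optimize()` search for the constant $a$ that minimizes $\sum_i (y_i - a)^2$ for the repair times; it should land on $\ybar$.

```{r}
# Minimize SSE(a) = sum((y - a)^2) numerically and compare the minimizer to ybar
sse_a = function(a, y) sum((y - a)^2)
optimize(sse_a, interval = range(repairData$Minutes), y = repairData$Minutes)$minimum
mean(~ Minutes, data = repairData)
```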
So, $\; \b0 \;=\; \widehat{\mu} \;=\; \yhat \;=\; \ybar$ #### Finding the least squares estimates using R `lm` = "linear model" function `lm(y ~ x, data = mydata)` Back to the computer repair data: * $Y$ = Time to complete work (minutes) * $X$ = Number of units repaired ```{r, fig.width=5, fig.height=5} # Find the least squares estimates for slope and intercept (lm function) lmfit = lm(Minutes ~ Units, data=repairData) # gf_coefline will draw the regression line on the scatterplot gf_point(Minutes ~ Units, data=repairData) %>% gf_coefline(model=lmfit) ``` ```{r} repairFit = lm(Minutes ~ Units, data=repairData) # tidy (from the broom package) gives a summary of the slope and intercept # and accompanying estimate of variance in the estimates tidy(repairFit) b0 = tidy(repairFit)$estimate[1] b0 b1 = tidy(repairFit)$estimate[2] b1 ``` Let's try the least squares estimate on the simulated data we generated earlier where we set $\B0 = `r beta0_sim1`$ and $\B1 = `r beta1_sim1`$ and generated random $Y$ values. Since we simulated the data from the linear regression model, then the least squares estimates should be accurate. ```{r, fig.width=5, fig.height=5} # Find the least squares estimates for slope and intercept (lm function) lmfit_sim = lm(Y ~ X, data = generated_data) # gf_coefline will draw the regression line on the scatterplot gf_point(Y ~ X, data = generated_data) %>% gf_coefline(model = lmfit_sim) %>% gf_abline(intercept = beta0_sim1, slope = beta1_sim1, color = 'red',alpha=1) ``` Remember, our true population parameters were $\beta_0= `r beta0_sim1`$ and $\beta_1= `r beta1_sim1`$. ```{r} lmfit_sim = lm(Y ~ X, data = generated_data) tidy(lmfit_sim) b0_sim = tidy(lmfit_sim)$estimate[1] b1_sim = tidy(lmfit_sim)$estimate[2] ``` Indeed $\b0=`r round(b0_sim, 3)` \approx `r beta0_sim1` = \B0 \qquad \text{ and } \qquad \b1=`r round(b1_sim, 3)` \approx `r beta1_sim1` = \B1$ ### Checking the specifications of a linear regression model $Y \;=\; \beta_0 + \beta_1 x + \epsilon$ random variable = deterministic linear function + random error This is called a **simple** linear regression model since only one predictor: $x$ 1. The errors are **uncorrelated** + $\corr(Y_i, Y_j) \;=\; \corr(\epsilon_i, \epsilon_j) \;=\; 0$ for all $i\ne j$ 2. The errors have no systematic info about $Y$ a. The average error = $E(\epsilon_i)=0$ for all $i$ (for all $x$) + The distribution of errors has mean zero + No systematic bias b. The errors are not correlated with the $x$ values c. $\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is same for all $i$ (for all $x$) 3. The mean of $Y$ depends linearly on $x$ + $E(Y\; |\; X=x) \;=\; \beta_0 +\beta_1 x$ #### Item 1: The errors are uncorrelated In order to determine if this is reasonable in light of the data typically, you have to know how the data were collected Example: Computer repair * if the computer repair person was new to the job and * we knew the order of the repair jobs Then, perhaps at the start, residuals tend to be larger (more variability in the minutes per computer) ...and later on, the residuals tend to be smaller (less variability in the minutes per computer) That is, the residuals could be correlated over time We'll take up this topic in Chapter 8: The Problem of Correlated Errors #### Item 2a: The errors have mean zero There is actually nothing we need to check in the data because least squares residuals always add up to zero! 
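Before the algebra, here is a quick numerical check (a sketch) on the repair data: the closed-form formulas $\b1 = r\, s_y/s_x$ and $\b0 = \ybar - \b1 \xbar$ reproduce what `lm()` reports, and the residuals do sum to (essentially) zero. The algebra below shows the residual result holds for *any* data.

```{r}
# Check the closed-form formulas against lm(), and confirm the residuals sum to ~0
r_yx = cor(Minutes ~ Units, data = repairData)
b1_formula = r_yx * sd(~ Minutes, data = repairData) / sd(~ Units, data = repairData)
b0_formula = mean(~ Minutes, data = repairData) - b1_formula * mean(~ Units, data = repairData)
c(b0_formula, b1_formula)
coef(lm(Minutes ~ Units, data = repairData))            # should match (up to rounding)

sum(residuals(lm(Minutes ~ Units, data = repairData)))  # essentially zero (floating-point error)
```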
For any data, $$\begin{align} \sum_i e_i &\;=\; \sum_i (y_i - \yhat_i) \\ \\ &\;=\; \sum_i (y_i - \b0 - \b1x_i) \\ \\ &\;=\; \sum_i (y_i - (\ybar - \b1\xbar) - \b1x_i) \\ \\ &\;=\; \sum_i (y_i - \ybar) - \b1 \sum_i (x_i - \xbar) \\ \\ &\;=\; 0 - \b1 (0) \;=\; 0 \end{align}$$ So, the observed regression line $\yhat = \b0 + \b1 x$ is the observed average and is the balancing point of the observed $y$-values. This should look familiar. We showed that for data from a population $Y$: $\displaystyle{\sum_i (y_i - \ybar) = 0}$. The observed average is the balancing point of the observed $y$-values. #### Item 2b: The errors are not correlated with the $x$ values There is actually nothing we need to check in the data because least squared residuals are always uncorrelated with the $x$ values $$\cov(x, e) \;=\; \frac{1}{n-1} \sum_i (x_i - \xbar) (e_i - \ebar)$$ We know that $\sum_i e_i \;=\; 0$, so $\ebar \;=\; 0$. So, $$\begin{align} \sum_i (x_i - \xbar) (e_i - \ebar) &= \sum_i (x_i - \xbar) e_i \\ &\;=\; \sum_i x_ie_i - \xbar \sum_i e_i \\ &\;=\; \sum_i x_ie_i - \xbar (0) \\ &\;=\; \sum_i x_ie_i \\ \end{align}$$ Finally, you can show that $\sum_i x_i e_i \;=\; 0$ for any data. Thus, $\cov(x, e) \;=\; 0$. Therefore, $\corr(x, e) \;=\; 0$ #### Item 2c: $\var(\epsilon_i) = \sigma^2$ is same for all $i$ (for all $x$) Later, we will look at graphical diagnostics to check for constant spread around the line for all $x$ values. We estimate $\var(\epsilon)$ as $$\widehat\sigma^2 \;=\; \frac{\SSE}{\df} \;=\; \frac{\sum_i (y_i - \yhat_i)^2}{n-2}$$ degrees of freedom (df) = sample size $-$ # parameters estimated for the mean ($\beta_0$ and $\beta_1$) = $n-2$ #### Item 3: The mean of $Y$ depends linearly on $x$ Again, we will look at useful graphical diagnostics later. ## Lecture 3 ### Statistical Properties of LS Estimates of Model Parameters #### Variability of $\b0$ and $\b1$ We know that $\b0$ and $\b1$ will not exactly match the true values $\beta_0$ and $\beta_1$. Here, we are going to explore this variability. We'll start with a quick visualization. I generate several samples. The first is large, with 10,000 observations, and it's meant to represent the population. The other three are small. 
```{r} library(reshape2) set.seed(1234) sd_error_sim1=5 # Simulate the population (really just a big sample with 10,000 observations) n_simbig=10000 X_big = runif(n_simbig) error = rnorm(n_simbig, m=0, sd=sd_error_sim1) # Generate errors for sample i Y_big = beta0_sim1 + beta1_sim1 * X_big + error # Now draw 3 smaller samples n_sim1=10 X = runif(n_sim1) # First sample error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i Y1 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i # Second sample error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i Y2 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i # Third sample error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i Y3 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i # Reshape data for plotting this_data<-data.frame(X,Y1,Y2,Y3) this_data_long <- melt(this_data, id="X") # Plot population & sample 2 gf_point(Y_big~X_big,color="grey",opacity=0.2) %>% gf_point(Y2~X,color="green")%>% gf_smooth(Y2~X,method='lm',se=FALSE,color="green") # Plot population & all samples gf_point(Y_big~X_big,color="grey",opacity=0.2) %>% gf_point(value~X,data=this_data_long,color=~variable)%>% gf_smooth(value~X,data=this_data_long,color=~variable,method='lm',se=FALSE) ``` As we can see, the estimated slopes $\b1$ and intercepts $\b0$ differ for each of our three samples. What we'll discuss now is what distributions of $\b0$ and $\b1$ we'd expect to see as we increase the number of our samples: the *sampling distributions*. #### Sampling distributions for $\b0$ and $\b1$ If the model is correct: $Y \;=\; \beta_0 + \beta_1 x + \epsilon$ then the least squares estimates are unbiased. $$E(\b0) \;=\; \beta_0 \qquad {\rm and} \qquad E(\b1) \;=\; \beta_1$$ To show this, let's ponder the formula for $\b1$ first. #### Useful algebraic facts about $\b1$ Recall the various expressions for $\b1$: $$\b1 \;=\; \frac{\sum x_i(y_i-\ybar) }{\sum(x_i-\xbar)^2} \;=\; \frac{\sum (x_i-\xbar)y_i}{\sum(x_i-\xbar)^2} \;=\; \frac{\sum (x_i-\xbar)(y_i-\ybar) }{\sum(x_i-\xbar)^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}$$ In particular, $$\b1 \;=\; \frac{\sum (x_i-\xbar)y_i}{\sum(x_j-\xbar)^2} \;=\; \sum{k_i y_i}$$ where, $$k_i \; =\; \frac{(x_i-\xbar)}{\sum(x_i-\xbar)^2}$$ We'll soon need these facts about the $k_i$: $$\begin{align} \sum_i k_i &\;=\; \sum_i \frac{(x_i-\xbar)}{\sum(x_j-\xbar)^2} \;=\; \frac{\sum_i (x_i-\xbar)} {\sum(x_j-\xbar)^2} \;=\; 0 \\ \\ \sum_i k_ix_i &\;=\; \sum_i \frac{(x_i-\xbar)x_i}{\sum(x_j-\xbar)^2} \;=\; \frac{\sum_i (x_i-\xbar)^2} {\sum(x_j-\xbar)^2} \;=\; 1 \\ \\ \sum_i k_i^2 &\;=\; \sum_i \frac{(x_i-\xbar)^2}{\left[\sum(x_j-\xbar)^2\right]^2} \;=\; \frac{\sum_i(x_i-\xbar)^2}{\left[\sum(x_j-\xbar)^2\right]^2} \;=\; \frac{1}{\sum(x_i-\xbar)^2} \end{align}$$ #### OK. Now we can see $\b1$ is an unbiased estimator for $\beta_1$ $$\begin{align} E(\b1) &\;=\; E\left(\sum k_iY_i\right) \;=\; \sum E(k_i Y_i) \;=\; \sum k_i E(Y_i) \\ \\ &\;=\; \sum k_i (\beta_0 + \beta_1 x_i) \;=\; \beta_0 \sum k_i + \beta_1 \sum k_i x_i \\ \\ &\;=\; \beta_0 (0) + \beta_1 (1) \;=\; \beta_1 \end{align}$$ ...and we can find the variance of $\b1$: $$\begin{align} \var(\b1) &\;=\; \var\left(\sum k_iY_i\right) \;=\; \sum \var(k_i Y_i) \;=\; \sum k_i^2 \var(Y_i) \\ \\ &\;=\; \sum k_i^2 \sigma^2 \;=\; \sigma^2 \sum k_i^2 \;=\; \frac{\sigma^2}{\sum(x_i-\xbar)^2} \end{align}$$ #### OK. 
Now we can see $\b0$ is an unbiased estimator for $\beta_0$

$$\begin{align}
E(\b0) &\;=\; E(\Ybar - \b1 \xbar) \;=\; E(\Ybar) - \xbar E(\b1) \\ \\
&\;=\; (\beta_0 + \beta_1 \xbar) - \xbar \beta_1 \;=\; \beta_0
\end{align}$$

...and we can find the variance of $\b0$:

$$\begin{align}
\var(\b0) &\;=\; \var(\Ybar - \b1 \xbar) \;=\; \var(\Ybar) + \xbar^2\var(\b1) - 2\xbar\cov(\Ybar, \b1)\\ \\
&(*) \;=\; \frac{\sigma^2}{n} + \frac{\sigma^2 \xbar^2}{\sum(x_i-\xbar)^2} \;=\; \sigma^2 \left[ \frac{1}{n} + \frac{\xbar^2}{\sum(x_i-\xbar)^2}\right]
\end{align}$$

$(*)$ Can show that $\Ybar$ and $\b1$ are uncorrelated, so the covariance term drops out.

### Estimating the spread around the line: $\sigma^2$

Of course, we don't know $\sigma^2$ (the population variance). So, we use an estimate from the data:

$$\widehat\sigma^2 \;=\; \frac{\SSE}{\df} \;=\; \frac{\sum_i e_i^2}{n-2} \;=\; \frac{\sum_i (y_i - \yhat_i)^2}{n-2}$$

Then, the standard errors (estimated standard deviations) are

$$\se(\b1) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\xbar)^2}}$$

$$\se(\b0) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\xbar^2}{\sum(x_i-\xbar)^2}}$$

### The Normal Distribution

#### Useful choice of a probability distribution for the errors: Normal

...and if we add one more model specification

4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

That is, $Y \;\sim\; N(\beta_0 + \beta_1 x, \sigma^2)$

Then,

$$\b0 \sim N(\beta_0, \sd(\b0)) \qquad {\rm and} \qquad \b1 \sim N(\beta_1, \sd(\b1))$$
Also,
$$\frac{\b0 - \beta_0}{\se(\b0)} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\b1 - \beta_1}{\se(\b1)} \;\sim\; t_{n-2}$$ ### Similar Concepts in "STAT 101" #### This is not so novel. Recall the "STAT 101" one-sample t-test. Observe data on a response variable: $y_1, y_2, \ldots, y_n$ Estimate the population mean ($\mu$) with the sample average ($\ybar$). And $\ybar$ is the least squares (LS) estimate for $\mu$. If the population $Y \sim N(\mu, \sigma^2)$ then $$\frac{\ybar - \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}$$ In STAT 101, we estimated $\sigma^2$ just as for all least squares estimates: $$\widehat\sigma^2 \;=\; s^2 \;=\; \frac{\SSE}{\df} \;=\; \frac{\sum_i (y_i - \ybar)^2}{n-1}$$ degrees of freedom (df) = sample size $-$ # parameters estimated for the mean ($\mu$) = $n-1$ #### Back in STAT 101, we found LS estimate unbiased too. $$E(\Ybar) \;=\; E\left[\frac{1}{n}\sum_i Y_i\right] \;=\; \frac{1}{n}\sum E(Y_i) \;=\; \frac{1}{n}\sum \mu \;=\; \frac{1}{n} (n \mu) = \mu$$ and for a random sample: $\;\var(\Ybar) = \sigma^2/n$ #### So, back in STAT 101, we also used a t-distribution If the population is $N(\mu, \sigma^2)$ then, $\; \epsilon \;=\; (Y - \mu) \;\sim\; N(0, \sigma^2)$ and $$\frac{\ybar - \yhat}{\se(\,\ybar\,)} = \frac{\ybar - \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}$$ ### Demo Sampling Distributions We can generate a set of $Y$-values from a known population model we used earlier: ```{r} kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1)) ``` $\B0 = `r beta0_sim1`$ $\B1 = `r beta1_sim1`$ $\sigma = `r sd_error_sim1`$ $n = `r n_sim1`$ 1. The mean of $Y$ is linear: $E(Y) = `r beta0_sim1` + `r beta1_sim1` x$ 2. The $x$ values come from a Normal($\mu = `r mx_sim1`, \sigma = `r sx_sim1`$) distribution. 3. The errors ($\epsilon$) from a Normal($\mu = 0, \sigma = `r sd_error_sim1`$) distribution. 4. Finally, we can generate $Y$ values from the $x$ and error values by adding the errors to the linear function of $x$ values. $Y_i = `r beta0_sim1` + `r beta1_sim1` x_i + \epsilon_i$ #### Simulate sampling distributions of $\b0$ and $\b1$ ```{r, echo=TRUE} number_samples = 1000 kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1, number_samples)) ``` **Using An Applet:** Inputs needed for the applet: $\B1 = `r beta1_sim1`$ $\B0 = `r beta0_sim1`$ $\mu_x = `r mx_sim1`$ $\sigma_x = `r sx_sim1`$ $\sigma = `r sd_error_sim1`$ Range of $y:\; (-1, 20)$ Range of $x:\; (-1, 5)$ **Using R:** We can simulate `r number_samples` samples of $Y$-values as follows: First, complete steps 1 and 2 (just once) to generate $x$ values. Then, repeat steps 3 and 4 `r number_samples` times, and each time calculate $\b0$ and $\b1$ and save those values. Then, look at the distributions of the `r number_samples` $\b0$ and $\b1$ values. 
```{r}
set.seed(1234)
```

**View the code:**

```{r, echo=TRUE}
# Set up places to store the beta_0 and beta_1 estimates
beta0hats = numeric(number_samples)
beta1hats = numeric(number_samples)

# The code below will repeat number_samples times
# i represents the sample number
for (i in 1:number_samples) {
  error = rnorm(n_sim1, m=0, sd=sd_error_sim1)    # Generate errors for sample i
  Y = beta0_sim1 + beta1_sim1 * X + error         # Generate y-values for sample i
  fit = lm(Y ~ X)   # Calculate LS estimates for intercept and slope for sample i
  # assign the current beta_0 estimate to the ith position in vector beta0hats
  beta0hats[i] = tidy(fit)$estimate[1]
  # assign the current beta_1 estimate to the ith position in vector beta1hats
  beta1hats[i] = tidy(fit)$estimate[2]
}  # end of the loop that gets repeated number_samples times
```

#### Analyze simulated sampling distributions of $\b0$ and $\b1$

What would you expect center, spread, and shape of these distributions to be?

**STOP**



```{r, fig.width=4.5, fig.height=4, fig.show="hold"} gf_dhistogram(~ beta0hats, bins = 20, color = "white") gf_dhistogram(~ beta1hats, bins = 20, color = "white") ``` ##### The means of the sampling distributions If we enter `mean(beta0hats)` in R, what do you expect for the numerical result? If we enter `mean(beta1hats)` in R, what do you expect for the numerical result?

**STOP to ponder that**





The results are `r mean(beta0hats)` and `r mean(beta1hats)` which should be close to $\B0 = `r beta0_sim1`$ and $\B1 = `r beta1_sim1`$ because both estimators are unbiased: $$E(\b0) = \B0 \qquad \text{ and } \qquad E(\b1) = \B1$$ ##### The standard deviations of the sampling distributions According to earlier work, $$SD(\b0) \;=\; \sigma \;\sqrt{\frac{1}{n} + \frac{\xbar^2}{\sum(x_i-\xbar)^2}}$$ $$SD(\b1) \;=\; \frac{\sigma}{\sqrt{\sum(x_i-\xbar)^2}}$$ The standard deviations of the statistics according to theory: ```{r} xbar_sim1 = mean(X) SSx_sim1 = sum((X - xbar_sim1)^2) SDx_sim1 = sqrt( SSx_sim1 / (n_sim1-1) ) SDb0_sim1 = sd_error_sim1 * sqrt(1/n_sim1 + xbar_sim1^2 / SSx_sim1) SDb1_sim1 = sd_error_sim1 / sqrt(SSx_sim1) kable( tibble( sd_error=sd_error_sim1, n=n_sim1, xbar = xbar_sim1, SSx = SSx_sim1, SDx = SDx_sim1, SDb0 = SDb0_sim1, SDb1 = SDb1_sim1 ) ) ``` The sample standard deviations of `r number_samples` simulated statistics: ```{r} kable( tibble( mean(beta0hats), sd(beta0hats), mean(beta1hats), sd(beta1hats) ) ) ``` ##### The shape of the sampling distributions Claim: Since the errors come from a Normal distribution, then 1. The $Y$-values come from a Normal distribution 2. The $\b0$-values come from a Normal distribution 3. The $\b1$-values come from a Normal distribution The simulated sampling distribution of $\b0$ values: ```{r, fig.width=4.5, fig.height=4, fig.show="hold"} gf_dhistogram(~ beta0hats, bins = 20, color = "white") %>% gf_dist("norm", params = list(beta0_sim1, SDb0_sim1)) gf_dhistogram(~ beta1hats, bins = 20, color = "white") %>% gf_dist("norm", params = list(beta1_sim1, SDb1_sim1)) ``` #### Back to the computer repair data... ```{r, message=FALSE, warning=FALSE} # glance and tidy (from the broom package) give nicely formatted summaries of regression estimates repairFit = lm(Minutes ~ Units, data=repairData) tidy(repairFit) glance(repairFit)[-c(7:10)] # base R has a summary function summary(repairFit) mean(~ Units, data=repairData) sd(~ Units, data=repairData) range(~ Minutes, data=repairData) range(~ Units, data=repairData) ``` Inputs needed for the applet: $\b1 = `r tidy(repairFit)$estimate[2]`$ $\b0 = `r tidy(repairFit)$estimate[1]`$ $\widehat{\sigma}^2 = `r as.numeric(glance(repairFit)[3])`$ $\xbar = `r mean(~ Units, data=repairData)`$ $s_x = `r sd(~ Units, data=repairData)`$ Range of $y:\; (`r min(~ Minutes, data=repairData)`,\; `r max(~ Minutes, data=repairData)`)$ Range of $x:\; (`r min(~ Units, data=repairData)`,\; `r max(~ Units, data=repairData)`)$ ### Reminder: checking model assumptions **1. The errors are uncorrelated** Data collected as random sample? Effect of time? Chapter 8: The Problem of correlated errors. **2.a. No systematic bias: $E(\epsilon_i) \;=\; 0$ for all $i$ (for all $x$)** Actually nothing we need to check. For any data and least squares regression, $\sum_i e_i = 0$ **2.b. The errors are not correlated with the $x$ values** Actually nothing we need to check. For any data and least squares regression: $\cov(x, e) = 0$ **2.c. $\var(\epsilon_i) = \sigma^2$ is same for all $i$ (for all $x$)** Graphical diagnostics. Chapter 4: Regression Diagnostics. Check for constant spread around the line for all $x$ values. **3. The mean of $Y$ depends linearly on $x$** Again, we will look at useful graphs later. 
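Before the summary, here is a quick numerical check (a sketch) that the standard-error formulas from earlier reproduce the `std.error` column that `tidy()` reports for the repair data.

```{r}
# Compare hand-computed standard errors with the std.error column from tidy()
repairFit = lm(Minutes ~ Units, data = repairData)
n_repair = nrow(repairData)
sigma_hat = sqrt(sum(residuals(repairFit)^2) / (n_repair - 2))   # sqrt(SSE / (n - 2))
SSx = sum((repairData$Units - mean(repairData$Units))^2)

se_b1 = sigma_hat / sqrt(SSx)
se_b0 = sigma_hat * sqrt(1/n_repair + mean(repairData$Units)^2 / SSx)
c(se_b0 = se_b0, se_b1 = se_b1)
tidy(repairFit)$std.error   # should match
```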
### Summary: Statistical Properties of LS Estimates **The least squares estimates** $$ \b0 = \ybar -\b1\xbar \qquad \textrm{and}\qquad \b1 \;=\; \frac{\sum (x_i-\xbar)y_i}{\sum(x_i-\xbar)^2} \;=\; \frac{\sum (x_i-\xbar)(y_i-\ybar) }{\sum(x_i-\xbar)^2} \cdots {\rm algebra} \cdots \;=\; r \frac{s_y}{s_x}$$ **Unbiased estimators** Both $\b0$ and $\b1$ are unbiased estimators of their population counterparts: $E(\b0) = \beta_0$ and $E(\b1) = \beta_1$. **Standard errors** $$\se(\b1) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\xbar)^2}}\qquad \textrm{and}\qquad \se(\b0) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\xbar^2}{\sum(x_i-\xbar)^2}}$$ where $$\widehat\sigma \;=\; \sqrt{\frac{\SSE}{\df}} \;=\; \sqrt{\frac{e_i^2}{n-2}} \;=\; \sqrt{\frac{\sum_i (y_i - \yhat_i)^2}{n-2}}$$ If $\epsilon \sim N(0,\sigma^2)$ then (or sample size large), then $$\b0 \sim N(\beta_0, \sd(\b0)) \qquad {\rm and} \qquad \b1 \sim N(\beta_1, \sd(\b1))$$
Also,
$$\frac{\b0 - \beta_0}{\se(\b0)} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\b1 - \beta_1}{\se(\b1)} \;\sim\; t_{n-2}$$

### Hypothesis Tests for Model Parameters

Reminder: Steps for hypothesis testing:

1. Specify the Null Hypothesis
2. Specify the Alternative Hypothesis
3. Set the significance level ($\alpha$)
4. Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic and the corresponding P-value
5. Draw a conclusion

**Testing $H_0:\beta_0=0\;$ or $\;H_0:\beta_1=0$ using a t-test**

Let's work the first case: our null hypothesis is $H_0:\beta_0=0$. *If the null hypothesis is true*, then the statistic

$$\frac{\b0 - \beta_0}{\se(\b0)} = \frac{\b0 - 0}{\se(\b0)}=\frac{\b0}{\se(\b0)}$$

will follow a t-distribution with $n-2$ degrees of freedom.

Reminder of the intuition here: If the null hypothesis is true, most of the time this statistic will be close to zero. If the statistic is large, either we got very (un)lucky, or the null hypothesis wasn't true to begin with.

We can test the null hypothesis by calculating this statistic and comparing it with a critical value $t^\star$ for our chosen significance level. (Remember to use $\alpha/2$ for a two-sided hypothesis.) Alternatively, we can calculate the p-value corresponding to our statistic, and compare it to $\alpha$.

Now let's work the second case: our null hypothesis is $H_0:\beta_1=0$. *If the null hypothesis is true*, then the statistic

$$\frac{\b1 - \beta_1}{\se(\b1)} = \frac{\b1 - 0}{\se(\b1)}=\frac{\b1}{\se(\b1)}$$

also follows a t-distribution with $n-2$ degrees of freedom. We can test this hypothesis by calculating this statistic and the corresponding P-value.

#### Testing equality of model coefficients to 0 in R: it's a special case!

*In general*, when we want to test a hypothesis $H_0: \beta_0=k$, our test statistic will be of the form $t=\frac{\b0 - \beta_0}{\se(\b0)} = \frac{\b0 - k}{\se(\b0)}$ (and likewise for $\beta_1$). Our test statistic only simplifies to the form $\frac{\b0}{\se(\b0)}$ for the specific hypothesis $H_0: \beta_0=0$.

However, this is a very common hypothesis to test. Because we so frequently want to test this hypothesis, R will compute the corresponding statistic and p-value for us automatically. They are given in the output of the `lm` command. The important thing to remember is that these outputs *only apply* to the specific null hypothesis that the coefficients equal zero. If you want to test a different hypothesis, you'll need to calculate the test statistic and p-value manually, using the formula $t=\frac{\b1 - \beta_1}{\se(\b1)}$. We'll discuss how to do this later.

#### Let's test $H_0: \beta_0 = 0$ vs.\ $H_1: \beta_0 \ne 0$

For computer repairs, it seems reasonable that it should take no time to complete no work (zero units repaired).

Our prediction is $\yhat_i \;=\; \b0 + \b1 x_i = \b0 + \b1 (0) \;=\; \b0 = `r b0`$ minutes.

If $H_0$ is true, then $\beta_0=0$ and

$$\frac{\b0 - \beta_0}{\se(\b0)} \;=\; \frac{\b0 - 0}{\se(\b0)} \;=\; \frac{\b0}{\se(\b0)} \;\sim\; t_{n-2}$$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
```

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
tidyRepairFit = tidy(repairFit)
b0.se = tidyRepairFit$std.error[1]
b0.tstat = tidyRepairFit$statistic[1]
b0.pvalue = tidyRepairFit$p.value[1]
```

| Summaries | Test |
|:-----------------------|:----------------------------------|
| $\b0 = `r b0`$ | $t_{\rm observed} = `r b0.tstat`$ |
| $\se(\b0) = `r b0.se`$ | pvalue = $`r b0.pvalue`$ |

Using a significance level of $\alpha=0.05$, we fail to reject $H_0:\beta_0=0$.

#### Oops!
We forgot to check whether a Normal model for $Y$ is plausible: $Y \sim N(\beta_0 + \beta_1 x, \sigma^2)$ when $\epsilon \sim N(0, \sigma^2)$.

So, let's graph the residuals.

```{r, fig.height=3}
repairData
# dotplot (since only 14 data values, too small for boxplot or histogram)
gf_dotplot(~ resids, data=repairData)
```

Skew is not extreme. No obvious outliers. The t-method should be valid here.

#### We can strongly reject $H_0: \beta_1=0$

If $\beta_1=0$ then $x$ has no (linear) information about $Y$.

In fact, $\beta_1=0$ is equivalent to $\rho = \corr(Y,x) = 0$ since
$$\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}$$

So, $H_0: \beta_1=0$ and $H_0: \rho=0$ both test for linear relationship.

If $H_0$ is true, then $\beta_1=0$ and
$$\frac{\b1 - \beta_1}{\se(\b1)} \;=\; \frac{\b1 - 0}{\se(\b1)} \;=\; \frac{\b1}{\se(\b1)} \;\sim\; t_{n-2}$$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
(tidyRepairFit <- tidy(repairFit))
b1.se = tidyRepairFit$std.error[2]
b1.tstat = tidyRepairFit$statistic[2]
b1.pvalue = tidyRepairFit$p.value[2]
b1.estimate = tidyRepairFit$estimate[2]
```

| Summaries | Test |
|:-----------------------|:----------------------------------|
| $\b1 = `r b1`$ | $t_{\rm observed} = `r b1.tstat`$ |
| $\se(\b1) = `r b1.se`$ | pvalue = $`r b1.pvalue`$ |

We strongly reject $H_0: \beta_1=0$.
There is a strong positive relationship between repair time and number of units serviced.

### Testing a specific other value

Consider testing whether or not one additional unit needing repair leads to an average of 14.5 more minutes of work.

If $H_0$ is true, then $\beta_1=14.5$ and
$$t = \frac{\b1 - \beta_1}{\se(\b1)} \;=\; \frac{\b1 - 14.5}{\se(\b1)} \;\sim\; t_{n-2}$$

Because our null hypothesis concerns a value other than zero, we need to calculate $t$ manually.

```{r}
# The long way, but straightforward
b1.12.tstat = (b1.estimate - 14.5) / b1.se
b1.12.tstat
```

Once we do this, we can calculate the p-value using the function `pt`.
We take 1 minus its output, then multiply by two since we are testing a two-sided hypothesis.
(This works here because our t-statistic is positive; in general, use the absolute value of $t$.)

```{r}
n <- nrow(repairData)
b1.12.pvalue = 2 * (1 - pt(b1.12.tstat, df=n-2))
b1.12.pvalue
```

| Summaries | Test |
|:-----------------------|:-----------------------------------------|
| $\b1 = `r b1`$ | $t_{\rm observed} = `r b1.12.tstat`$ |
| $\se(\b1) = `r b1.se`$ | $t_{\rm observed}^2 = `r b1.12.tstat^2`$ |
| | pvalue = $`r b1.12.pvalue`$ |

There is a shortcut to test this hypothesis, using the `linearHypothesis` function in the `car` package.
This function doesn't report the t-statistic -- it reports the F-statistic, which we'll discuss shortly -- but the p-values are the same as if we'd done the t-test.

$H_0: \beta_1 = 14.5$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
```

### Partitioning the Sum of Squares

#### Partitioning and residuals

$$(y_i - \ybar) \;=\; (y_i - \yhat_i + \yhat_i - \ybar) \;=\; (y_i - \yhat_i) + (\yhat_i - \ybar)$$

```{r, fig.height=4}
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
draw_residual_schematic <- function() { a = 10 b = 4 mx = 15 sx = 5 s = 20 n = 100 set.seed(123456) x = rnorm(n, m=mx, sd=sx) y = a + b*x + rnorm(n, m=0, sd=s) ybar = mean(y) xbar = mean(x) # plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y))) # abline(a=a, b=b, lwd=3) # abline(h = ybar) # identify(y ~ x) xi = xbar + sx yi = ybar + 3*s yhati = a + b*xi myplot = plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l") abline(a=a, b=b, lwd=3) points(xi, yi, pch=20, cex=2) points(xi, yhati, cex=2) lines(c(xi,xi), c(0, ybar)) lines(c(xi,xi), c(ybar, yhati), lty=2) lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3) lines(c(0,xi), c(yi, yi)) lines(c(0,xi), c(ybar, ybar)) lines(c(xbar - sx, xbar - sx), c(ybar, yi)) text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2) text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4) text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2) text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)") text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1) text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1) } draw_residual_schematic() ``` #### Partitioning the sum of squares It's not obvious, but is true (for LS regression): $$\sum (y_i - \ybar)^2 \;=\; \sum (y_i - \yhat_i + \yhat_i - \ybar)^2 \;=\; \sum (y_i - \yhat_i)^2 \;+\; \sum (\yhat_i - \ybar)^2$$ SST = $\sum (y_i - \ybar)^2$ = total sum of squares SSE = $\sum (y_i - \yhat_i)^2$ = error sum of squares SSR = $\sum (\yhat_i - \ybar)^2$ = regression sum of squares So, SST = SSE + SSR #### Coefficient of determination: $R^2$ ("R squared") One measure of the "goodness of fit" of the regression model to the data: $R^2$ = proportion of variability in the response ($Y$) explained by regression on the predictor ($x$) $$R^2 \;=\; \frac{\SSR}{\SST} \;=\; 1 - \frac{\SSE}{\SST} \;=\; 1 - \frac{\sum (y_i - \yhat_i)^2}{\sum (y_i - \ybar)^2}$$ If the predictors don't include much information about the response, then $\sum (y_i - \yhat_i)^2$ is close to $\sum (y_i - \ybar)^2$ since $\ybar$ is the least squares estimate for the mean without predictors. In that case, $R^2$ is near $1 - 1 = 0$. On the other hand, if the predictors have so much information that residuals are very, very small, then $R^2$ is near $1 - 0 = 1$. ```{r} repairFit = lm(Minutes ~ Units, data=repairData) glance(repairFit)[-c(7:10)] summary(repairFit) ``` #### Why do we call this number $R^2$? When there is only one predictor ($x$), $R^2 = \corr(y,x)^2 = r^2$ ```{r} cor(Minutes ~ Units, data=repairData) cor(Minutes ~ Units, data=repairData)^2 ``` $r = `r corryx`$ $r^2 = `r corryx^2`$ ```{r} repairFit = lm(Minutes ~ Units, data=repairData) glance(repairFit)[-c(7:10)] ``` So, 98.7\% of the variability in repair times can be attributed to the regression on the number of units. The rest of the variability (residuals) is not explained by the number of units repaired. #### Another way to view $R^2$ Claim: $\corr(y, \yhat) = |\, \corr(y, x)\, |$ Recall: + $\cov(a+by, c+dx) = bd\cov(y,x)$ + $\sd(a+bY) = |b|\sd(y)$ and $\sd(c + dx) = |d|\sd(x)$ $$\corr(a+bY, c+dx) = \frac{bd\cov(y,x)}{|b|\sd(Y)\, |d|\sd(x)} = {\rm sign}(bd)\corr(y,x)$$ $\begin{align} \corr(y, \yhat) = \corr(y, \b0 + \b1 x) = \textrm{sign}(\b1) \corr(y,x) \end{align}$ $\b1$ and $\corr(y,x)$ always have the same sign: $\b1 = \corr(y,x) (s_y/s_x)$ Thus, $0 \le \corr(y, \yhat) \le 1$ and $\corr(y, \yhat) = |\, \corr(y, x)\, |$ So, $R^2 = \left[\corr(y, x)\right]^2 = \left[\corr(y, \yhat)\right]^2$ This will be handy when we have models with more than one predictor. 
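As a quick numerical check of this partition (and of the claim above that $R^2 = r^2$ with a single predictor), here is a small sketch using the same `repairData` fit; base R's `anova()` prints the same partition along with its degrees of freedom.

```{r}
# Verify SST = SSE + SSR and R^2 = r^2 numerically for the repair data (a sketch).
fit = lm(Minutes ~ Units, data = repairData)
y = repairData$Minutes

SST = sum((y - mean(y))^2)
SSE = sum(resid(fit)^2)
SSR = sum((fitted(fit) - mean(y))^2)

c(SST = SST, SSE.plus.SSR = SSE + SSR)       # these two should agree
c(R.squared = SSR / SST,
  r.squared = cor(repairData$Minutes, repairData$Units)^2)

anova(fit)   # same partition (Sum Sq) along with the degrees of freedom
```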
But we will have to be careful: when we add more predictors to a model, $R^2$ will *always* increase or stay constant. Therefore we won't be able to reliably use $R^2$ as a way of comparing models to see which is better.

#### Adjusted $R^2$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
```

Adjusted $R^2$ includes the degrees of freedom:

$$\text{adjusted } R^2 = 1 - \frac{\SSE / \SSE(\df)}{\SST / \SST(\df)} = 1 - \frac{\sum (y_i - \yhat_i)^2/(n-p)}{\sum (y_i - \ybar)^2 / (n-1)}$$

where $p$ is the number of parameters estimated for the mean ($p = 2$ for simple linear regression).

Adjusted $R^2$ does *not* always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.

## Lecture 4

### Partitioning the sum of squares

$$(y_i - \ybar) \;=\; (y_i - \yhat_i + \yhat_i - \ybar) \;=\; (y_i - \yhat_i) + (\yhat_i - \ybar)$$

```{r, fig.height=4}
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
draw_residual_schematic <- function() {
  a = 10
  b = 4
  mx = 15
  sx = 5
  s = 20
  n = 100
  set.seed(123456)
  x = rnorm(n, m=mx, sd=sx)
  y = a + b*x + rnorm(n, m=0, sd=s)
  ybar = mean(y)
  xbar = mean(x)
  # plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
  # abline(a=a, b=b, lwd=3)
  # abline(h = ybar)
  # identify(y ~ x)
  xi = xbar + sx
  yi = ybar + 3*s
  yhati = a + b*xi
  myplot = plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
  abline(a=a, b=b, lwd=3)
  points(xi, yi, pch=20, cex=2)
  points(xi, yhati, cex=2)
  lines(c(xi,xi), c(0, ybar))
  lines(c(xi,xi), c(ybar, yhati), lty=2)
  lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
  lines(c(0,xi), c(yi, yi))
  lines(c(0,xi), c(ybar, ybar))
  lines(c(xbar - sx, xbar - sx), c(ybar, yi))
  text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2)
  text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
  text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
  text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
  text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
  text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
}
draw_residual_schematic()
```

#### Partitioning the sum of squares

It's not obvious, but is true (for LS regression):

$$\sum (y_i - \ybar)^2 \;=\; \sum (y_i - \yhat_i + \yhat_i - \ybar)^2 \;=\; \sum (y_i - \yhat_i)^2 \;+\; \sum (\yhat_i - \ybar)^2$$

SST = $\sum (y_i - \ybar)^2$ = total sum of squares

SSE = $\sum (y_i - \yhat_i)^2$ = error sum of squares

SSR = $\sum (\yhat_i - \ybar)^2$ = regression sum of squares

So, SST = SSE + SSR

#### Partitioning of the degrees of freedom

Even the degrees of freedom for each sum of squares add up!

SST(df) = SSE(df) + SSR(df)

$(n-1) = (n-2) + 1$

#### A little more about degrees of freedom

For a linear model with mean estimated by least squares:

degrees of freedom of the sum of squared errors = sample size $-$ # parameters estimated for the mean

For SST = $\sum (y_i - \ybar)^2$, the linear model we have in mind is $Y = \mu + \epsilon$ (one mean parameter). Thus, SST has $(n-1)$ df.

For SSE = $\sum (y_i - \yhat_i)^2 = \sum (y_i - (\b0 + \b1 x_i))^2$, the linear model we have in mind is $Y = \beta_0 + \beta_1 x + \epsilon$ (two mean parameters). Thus, SSE has $(n-2)$ df.

For SSR = SST - SSE = $\sum (\yhat_i - \ybar)^2$ the degrees of freedom is the difference in the number of mean parameters for the two models. Thus, SSR has $(2 - 1) = 1$ df.

#### And a little bit more about degrees of freedom

Degrees of freedom can be thought of in many ways.
Let's consider the concept of constraints or "subspaces".

SST = $\sum (y_i - \ybar)^2$, but $\sum (y_i - \ybar) = 0$ (a constraint).

From this constraint, we can write $\displaystyle y_n = n\ybar - \sum_{i=1}^{n-1} y_i$.

Thus, once we know the estimate for the mean ($\ybar$) and $y_1, y_2, \ldots,$ and $y_{n-1}$, then we can solve for $y_n$.
We say that $(n-1)$ y-values are "free" to move, but the last value is constrained by $\ybar$.
That's $(n-1)$ "degrees of freedom".

Similarly, SSE = $\sum (y_i - \yhat_i)^2$, but $\sum (y_i - \yhat_i) = 0$ (a constraint).
It is also true (mentioned far above) that $\sum((y_i - \yhat_i)x_i)=0$, for any dataset.

From these constraints, we can write both $y_n$ and $y_{n-1}$ as functions of $\b0, \b1, y_1, y_2, \ldots, y_{n-2}$ and all of the $x$-values.

Thus, once we know the estimate for the mean ($\b0$ and $\b1$) and $y_1, y_2, \ldots, y_{n-2}$ (and all of the $x$-values), then we can solve for $y_n$ and $y_{n-1}$.
We say that $(n-2)$ y-values are "free" to move, but the last two values are constrained by $\b0, \b1,$ and the x-values.
That's $(n-2)$ "degrees of freedom".

Finally, SSR = $\sum (\yhat_i - \ybar)^2$, but $\sum (\yhat_i - \ybar) = 0$ (a constraint).

We can show this is true:
$$\begin{align}
\sum (\yhat_i - \ybar) &\;=\; \sum \yhat_i - \sum \ybar \\ \\
&\;=\; \sum \left(\b0 + \b1 x_i\right) - n\ybar \\ \\
&\;=\; \sum \left[(\ybar - \b1 \xbar) + \b1 x_i\right] - n\ybar \\ \\
&\;=\; \sum \ybar + \b1 \sum (x_i - \xbar) - n\ybar \\ \\
&\;=\; n \ybar + \b1 (0) - n\ybar \\ \\
&\;=\; 0
\end{align}$$

Thus, once we know $\ybar$, all of the $x$-values, and just one $\yhat_i$, then we can solve for all $(n-1)$ other $\yhat_j$ values.
We say that just one $\yhat$-value is "free" to move, and the remaining $\yhat$-values are constrained by $\ybar$ and the $x$-values.
That's $1$ (one) "degree of freedom".

Let me try to convince you of this constraint.
Suppose we know all $n$ of the $x$-values, but only $\yhat_5$.
Then, we know two points on the regression line:

(1) $(x_5, \yhat_5)$

(2) $(\xbar, \ybar)$, since every simple linear regression line passes through this point.

In that case, we know $\displaystyle \frac{\yhat_5 - \ybar}{x_5 - \xbar} = \frac{ \textrm{rise} }{ \textrm{run} } = \b1$ (the slope)
and we can find $\b0 = \ybar - \b1 \xbar$ (the intercept).
From here, we can find all $(n-1)$ other values of $\yhat_i = \b0 + \b1 x_i$.

### An F-test for Multiple Parameters

#### What if we want a single ("omnibus") test: $H_0: \beta_0 = \beta_1 = 0$?

Here is the process.

Start with the **full model** (FM): $Y = \beta_0 + \beta_1 x + \epsilon$ (no restrictions on $\beta_0$ or $\beta_1$)

Compare to the **reduced model** (RM): $Y = \epsilon \;=\; 0 + \epsilon\;\;$ (assumes $E(Y)=0$).
Here, both $\beta_0=0$ and $\beta_1=0$, as stated in $H_0$.

Let's get an intuition for what the FM and the RM look like.
We'll draw the regression lines for both models here, for two different examples.
In the first example, the true values $\beta_0$ and $\beta_1$ are 3 and 4, respectively.
In the second example, the true values are 0 and 0 -- that is, the null hypothesis $H_0$ is actually true.
```{r}
library(reshape2)
set.seed(1234)
sd_error_sim1 = 5
beta0_sim2 = 3     # intercept for the first example
beta1_sim2 = 4     # slope for the first example
beta0_sim3 = 0     # intercept for the second example (H0 true)
beta1_sim3 = 0     # slope for the second example (H0 true)
n_sim1 = 30

# First example
X1 = runif(n_sim1)
error = rnorm(n_sim1, m=0, sd=sd_error_sim1)   # generate errors for this sample
Y1 = beta0_sim2 + beta1_sim2 * X1 + error      # generate y-values for this sample

# Second example
X2 = runif(n_sim1)
error = rnorm(n_sim1, m=0, sd=sd_error_sim1)   # generate errors for this sample
Y2 = beta0_sim3 + beta1_sim3 * X2 + error      # generate y-values for this sample

# Plot sample 1: fitted FM line (red) and the RM line y = 0 (blue)
gf_point(Y1 ~ X1, color="grey") %>%
  gf_smooth(Y1 ~ X1, method='lm', se=FALSE, color="red") %>%
  gf_abline(intercept=0, slope=0, color="blue")
```

```{r}
# Plot sample 2: fitted FM line (red) and the RM line y = 0 (blue)
gf_point(Y2 ~ X2, color="grey") %>%
  gf_smooth(Y2 ~ X2, method='lm', se=FALSE, color="red") %>%
  gf_abline(intercept=0, slope=0, color="blue")
```

**STOP**

Discuss intuition about the performance of the FM and the RM in each of the two cases.



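To make the discussion concrete, here is a minimal sketch (using the two simulated samples above) comparing the error sum of squares of the reduced model, which always predicts $\yhat_i = 0$, with that of the full least squares model.

```{r}
# SSE under the reduced model (yhat = 0) vs. the full LS model, for each example.
# A sketch added for intuition; it reuses X1, Y1, X2, Y2 from the chunk above.
SSE.RM.1 = sum(Y1^2)                       # reduced model: predict 0 for every case
SSE.FM.1 = sum(resid(lm(Y1 ~ X1))^2)       # full model: least squares fit
SSE.RM.2 = sum(Y2^2)
SSE.FM.2 = sum(resid(lm(Y2 ~ X2))^2)

c(example1 = SSE.RM.1 - SSE.FM.1, example2 = SSE.RM.2 - SSE.FM.2)
```

You should find that the drop in SSE is large (relative to the leftover error) in the first example and modest in the second, where $H_0$ is true; the F-statistic below turns exactly this comparison into a formal test.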
If the model specifications are valid (including Normal errors) and $H_0$ (the reduced model) is true, then the following comparison statistic has an $F$ distribution.

$$F = \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)}$$

The degrees of freedom for the distribution of the F-statistic are the numerator df, $\df(\RM) - \df(\FM)$, and the denominator df, $\df(\FM)$.

#### The F-statistic is always positive. Why?

#### F-test for $H_0: \beta_0 = \beta_1 = 0$

For the reduced model: $Y = \epsilon\quad$ and $\quad \yhat_i = 0$ for all $i$

Thus, $\SSE(\RM) = \sum (y_i - \yhat_i)^2 = \sum y_i^2$ with degrees of freedom $n-0=n$ (no parameters estimated for the mean).

For the full model: $Y = \beta_0 + \beta_1 x + \epsilon$ and $\yhat_i = \b0 + \b1 x_i$ for all $i$

Thus, $\SSE(\FM) = \sum (y_i - \yhat_i)^2 = \sum (y_i - \b0 - \b1 x_i)^2$ with degrees of freedom $n-2$ (two parameters estimated for the mean).

$$F \;=\; \frac{\left[ \sum y_i^2 - \sum (y_i - \b0 - \b1 x_i)^2 \right] \;/\; [n - (n-2)]}{\sum (y_i - \b0 - \b1 x_i)^2 \;/\; (n-2)} \;\sim\; F_{2, n-2}$$

The long (but illustrative) calculation: **SHOW CODE**

```{r results='asis'}
# The long (but illustrative) way
y = repairData$Minutes
x = repairData$Units
n = length(y)
b0
b1
SSE.RM = sum(y^2)
SSE.FM = sum((y - b0 - b1*x)^2)
df.RM = n
df.FM = n - 2
F.stat = ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / (SSE.FM / df.FM)
F.stat
1 - pf(F.stat, df.RM-df.FM, df.FM)   # p-value
```

The short way ...find a function in a package: **SHOW CODE**

```{r}
# The short way ...find a function in a package.
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
# ...and so can make things look nice on your HW if you are using R Markdown
repairFit = lm(Minutes ~ Units, data=repairData)
library(car)
kable(linearHypothesis(repairFit, c("(Intercept) = 0", "Units = 0")))
```

...and another method using the base R `anova()` function **SHOW CODE**

```{r}
# ...and another method using base R anova() function
# anova(reducedModelFit, fullModelFit)
# full model
repairFit = lm(Minutes ~ Units, data=repairData)
# reduced model (beta0 = 0, beta1 = 0)
reducedFit = lm(Minutes ~ -1 + I(0*Units), data=repairData)
# compare the two (nested) models with an F-test
anova(reducedFit, repairFit)
```

#### Just demonstrating that the F-test protects against false positives

When $\alpha=0.05$, we expect 5\% false positives.

P(Type I Error) = P(reject $H_0$ when it was actually true) = $\alpha$

So, if we do two tests, the false positive rate is higher.
(The more tests you do, the more likely you are to get a false positive.)

Test of $H_0: \beta_0 = 11.8$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8")))
```

Test of $H_0: \beta_1 = 14.5$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
```

Test of $H_0: \beta_0 = 11.8$ and $\beta_1 = 14.5$ using an R package formula: **SHOW CODE**

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8", "Units = 14.5")))
```

And another method comparing the reduced (restricted) model with the full (unrestricted) model.
Use the base R `anova()` function: **SHOW CODE**

```{r}
# ...and another method using base R anova() function
# anova(reducedModelFit, fullModelFit)
# full model
repairFit = lm(Minutes ~ Units, data=repairData)
# reduced model (beta0 = 11.8, beta1 = 14.5)
reducedFit = lm(Minutes ~ -1, offset = 11.8 + 14.5*Units, data=repairData)
# compare the two (nested) models with an F-test
kable(anova(reducedFit, repairFit))
```

#### Bonferroni method: Also protects against false positives

This method is indeed simpler than the F-test, but... it's too conservative (more protection against false positives than needed).

Bonferroni correction: If $k$ tests, then compare p-value to $\displaystyle \frac{\alpha}{k}$ for each test.

Since we have two tests (slope and intercept) and $\alpha=0.05$, we should not reject either null hypothesis unless its p-value $< 0.05/2 = 0.025.$

Test of $H_0: \beta_0 = 11.8$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8")))
```

Test of $H_0: \beta_1 = 14.5$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
```

Using the Bonferroni correction, we would not reject either $H_0$.

#### FYI. When $H_0: \beta_1 =$ number, $t^2=F$ (equivalent tests)

t-test of $H_0: \beta_1=0$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
```

F-test of $H_0: \beta_1=0$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 0")))
```

The relationship holds when testing for $\beta_1$ other than 0.
For example, consider testing whether or not one additional unit needing repair leads to an average of 14.5 more minutes of work.

If $H_0$ is true, then $\beta_1=14.5$ and
$$\frac{\b1 - \beta_1}{\se(\b1)} \;=\; \frac{\b1 - 14.5}{\se(\b1)} \;\sim\; t_{n-2}$$

```{r}
# The long way, but straightforward
b1.12.tstat = (b1 - 14.5) / b1.se
b1.12.tstat
b1.12.tstat^2
b1.12.pvalue = 2 * (1 - pt(b1.12.tstat, df=n-2))
b1.12.pvalue
```

| Summaries | Test |
|:-----------------------|:-----------------------------------------|
| $\b1 = `r b1`$ | $t_{\rm observed} = `r b1.12.tstat`$ |
| $\se(\b1) = `r b1.se`$ | $t_{\rm observed}^2 = `r b1.12.tstat^2`$ |
| | pvalue = $`r b1.12.pvalue`$ |

...and here is the equivalent F-test for $H_0: \beta_1 = 14.5$

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
```

### Similar Concepts in "STAT 101"

#### This is not so novel.

Recall the "STAT 101" two-sample t-test.
We have two random samples (from two populations) For population #1: $Y = \mu_1 + \epsilon$ For population #2: $Y = \mu_2 + \epsilon$ We can write this as a simple linear regression model. Let $\displaystyle x = \begin{cases} 0 & \textrm{data from population #1} \\ 1 & \textrm{data from population #2} \end{cases}$ Then, write $Y = \beta_0 + \beta_1 x + \epsilon$ where $E(\epsilon)=0$, $\;\;\var(\epsilon) = \sigma^2\;\;$ and $\;\;\epsilon\;\;$ follows a normal distn What does this mean about variances for the two populations? #### Back in "STAT 101" we found LS estimates For data from population #1, $\;\;x=0\;\;$ so, $\;\;Y = \beta_0 + \epsilon\;\; \implies \; \beta_0 = \mu_1$ For data from population #2, $\;\;x=1\;\;$ so, $\;\;Y = \beta_0 + \beta_1 + \epsilon\;\; \implies \; \beta_1 = \mu_2 - \mu_1$ $\b0 = \widehat{\mu}_1 = \ybar_1$ $\b1 = \widehat{\mu}_2 - \widehat{\mu}_1 = \ybar_2 - \ybar_1$ #### Back in "STAT 101" we estimated the variance Before reminding you of that variance estimate, let's continue with the LS regression approach:
For any simple linear regression model, we use $\displaystyle \widehat{\sigma}^2 = \frac{\SSE}{\df}$

$$ \widehat{\sigma}^2 \;\;=\;\; \frac{\SSE}{\df} \;\;=\;\; \frac{\sum (y_i - \b0 - \b1 x_i)^2}{n_1 + n_2 - 2} $$

Let's work on the numerator

$$\begin{align}
\sum (y_i - \b0 - \b1 x_i)^2
&\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \b0)^2 \;\;+\;\; \sum_{j=1}^{n_2} (y_j - \b0 - \b1)^2 \\
&\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \ybar_1)^2 \;\;+\;\; \sum_{j=1}^{n_2} (y_j - \ybar_1 - (\ybar_2 - \ybar_1))^2 \\
&\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \ybar_1)^2 \;\;+\;\; \sum_{j=1}^{n_2} (y_j - \ybar_2)^2 \\
&\;\;=\;\; (n_1 - 1) s_1^2 \;\;+\;\; (n_2 -1 ) s_2^2
\end{align}$$

Therefore, $\displaystyle \widehat{\sigma}^2 = \frac{(n_1 - 1) s_1^2 \;\;+\;\; (n_2 -1 ) s_2^2}{n_1 + n_2 -2}$

**Does this expression look familiar?**




$\widehat{\sigma}^2 = s_p^2$ = pooled estimate of the population variance

#### "STAT 101" t-test for $\beta_1$

$H_0: \beta_1 = 0$ is equivalent to $H_0: (\mu_2 - \mu_1) = 0$, which is the same as $H_0: \mu_2 = \mu_1$

So, we have the same t-statistic as in "STAT 101"

$$\begin{align}
t &\;\;=\;\; \frac{\b1 - 0}{\se(\b1)} \\ \\
&\;\;=\;\; \frac{\ybar_2 - \ybar_1}{\se(\ybar_2 - \ybar_1)}
\;\;=\;\; \frac{\ybar_2 - \ybar_1}{{\sqrt{\displaystyle \frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} }
\end{align}$$

#### "STAT 101" t-statistic squared = linear models F-statistic

A lot of algebra is required (not shown), but I'll set up the scenario for you.

F-test for $H_0: \mu_1 = \mu_2 = \mu$

Reduced model: $Y = \mu + \epsilon\quad \df = n_1+n_2-1$

Full model: $Y = \beta_0 + \beta_1 x + \epsilon\quad \df = n_1+n_2-2$

$$\begin{align}
F &\;\;=\;\; \frac{[ \SSE(\RM) - \SSE(\FM) ] \;/\; [ \df(\RM) - \df(\FM) ]}{\SSE(\FM) \;/\; \df(\FM)} \\ \\
&\;\;=\;\; \frac{\sum_{i=1}^{n_1+n_2} (y_i - \overline{\ybar})^2 - \sum_{i=1}^{n_1} (y_i - \ybar_1)^2 - \sum_{j=1}^{n_2} (y_j - \ybar_2)^2}{s_p^2} \\ \\
&\;\;=\;\; \textrm{after lots of algebra...} \\ \\
&\;\;=\;\; \frac{(\ybar_2 - \ybar_1)^2}{\displaystyle{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}}
\;\;=\;\; t^2
\end{align}$$

### Confidence Intervals

#### Confidence interval for $\beta_0$

Recall the least squares regression results for the computer repair data

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
glance(repairFit)[-c(7:10)]
```

$$\b0 \;\pm\; t^* \se(\b0)$$

With $n-2 = `r n-2`$ degrees of freedom and $\alpha=0.05$

```{r}
n
t.star = - qt(0.025, df=n-2)
t.star
```

Make an educated guess: Will zero be inside the confidence interval?

**STOP**









Yes, we expect 0 to be in this confidence interval since we failed to reject $H_0: \beta_0 = 0$ earlier. ```{r} # The long way, but illustrative n t.star = - qt(0.025, df=n-2) t.star pm = c(-1, 1) b0 + pm * t.star * b0.se # The short way ...I Googled for a package with a nice function repairFit = lm(Minutes ~ Units, data=repairData) # The "kable" function is not technically necessary. # It just makes things look better on my slides. kable(confint(repairFit, "(Intercept)", level=0.95)) ``` #### Confidence interval for $\beta_1$ $$\b1 \;\pm\; t^* \se(\b1)$$ With $n-2 = `r n-2`$ degrees of freedom and $\alpha=0.05$ ```{r} n t.star = - qt(0.025, df=n-2) t.star ``` Make an educated guess: Will zero be inside the confidence interval? **STOP**









No, we do not expect 0 to be in this confidence interval since we rejected $H_0: \beta_1 = 0$ earlier.

```{r}
# The long way, but illustrative
n
t.star = - qt(0.025, df=n-2)
t.star
pm = c(-1, 1)
b1 + pm * t.star * b1.se

# The short way ...I Googled for a package with a nice function
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(confint(repairFit, "Units", level=0.95))
```

Or use code to find both confidence intervals at the same time:

```{r}
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(confint(repairFit, level=0.95))
```

#### Using the model to estimate the mean response

The estimated **average** (expected) value for all individuals in the sub-population with $X=x_0$ is
$$\widehat{\mu}_0 \;\;=\;\; \b0 + \b1 x_0$$

A confidence interval for the mean response at some $x_0$ is
$$\widehat{\mu}_0 \;\pm\; t^*\; \widehat{\sigma}\; \sqrt{\frac{1}{n} + \frac{(x_0 - \xbar)^2}{\sum (x_i-\xbar)^2}}$$

```{r}
a = 2
b = 7
```

Among the jobs where $`r a`$ or $`r b`$ units need repair, how long do we expect the work to take?

```{r}
# I had defined new data values earlier
a
b
# Set up a data frame (dataset) to store the x-values for input to prediction
x.values = data.frame(Units = c(a,b))
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
# The short way ...I Googled and found "predict"
# ...which is just in the "stats" package that comes with R by default
kable(predict(repairFit, x.values, interval="confidence"))
```

Which interval is narrower? Why?

#### Confidence bands: computer repair data

```{r, fig.height=4}
# After Googling, I initially only found ggplot2 functions
# rather than the ggformula interface (gf_point(Minutes ~ Units, data = repairData) ....)
# (Although you will see in two plots later that I did successfully Google
# how to do this using the ggformula function gf_lm)
ggplot(repairData, aes(x = Units, y = Minutes)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE)
```

#### Confidence bands: simulated data

```{r, fig.height=4}
# After Googling, I initially only found ggplot2 functions
# rather than the ggformula interface (gf_point(Minutes ~ Units, data = repairData) ....)
# (Although you will see in the next graph that I did successfully Google
# how to do this using the ggformula function gf_lm)
data_sim1 = data.frame(Y, X)
ggplot(data_sim1, aes(x = X, y = Y)) +
  geom_point() +
  geom_smooth(method="lm", se=TRUE)
```

### Prediction Intervals

#### Using the model to predict an individual outcome

The **predicted** value for one yet-to-be-observed (or just observed) individual in the population with $X=x_0$ is
$$\yhat_0 \;=\; \b0 + \b1 x_0$$

A prediction interval for the individual response at some $x_0$ is
$$\yhat_0 \;\pm\; t^*\; \widehat{\sigma}\; \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \xbar)^2}{\sum (x_i-\xbar)^2}}$$

We just got two new jobs where $`r a`$ and $`r b`$ units need repair.
How long should it take to complete either job?

```{r}
# I had defined new data values earlier
a
b
# Set up a data frame (dataset) to store the x-values for input to prediction
x.values = data.frame(Units = c(a,b))
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
# The short way ...I Googled and found "predict"
# ...which is just in the "stats" package that comes with R by default
kable(predict(repairFit, x.values, interval="prediction"))
```

#### Why are prediction intervals wider than confidence intervals?

Discuss...

#### Prediction limits: computer repair data

```{r, fig.height=4}
# I found out how to get the confidence bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
gf_lm(Minutes ~ Units, data = repairData,
      interval = "prediction", fill = "skyblue") %>%
  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()
```

#### Prediction limits: simulated data

```{r, fig.height=4}
# I found out how to get the confidence bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
gf_lm(Y ~ X, data = data_sim1,
      interval = "prediction", fill = "skyblue") %>%
  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()
```

### Example: Mercury in Fish

These data concern mercury levels in large-mouth bass in North Carolina rivers.

Mercury levels in river water are typically undetectable, yet fish might contain unsafe amounts to eat (over 1 part per million).

In principle, mercury concentration in fish can be measured, but that's expensive ...and involves waiting days for lab results.
So, that doesn't help you decide whether or not to eat the fish you caught today.

So, can we eat large-mouth bass from North Carolina rivers?

### Mercury in Fish: A first data analysis

```{r}
fishData = read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fish.txt")

# rescale weight from grams to kg to make units of y and x similar in size,
# so the slope won't be huge (and thus harder to interpret, at least for me anyway)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fishData = mutate(fishData, weight = weight / 1000)

# sample size
n = dim(fishData)[1]
```

+ Older fish contain more mercury.
+ Older fish weigh more.
+ Expensive and lengthy: determine mercury level in a fish
+ Not feasible: determine the age of a fish
+ Easy and quick: weigh the fish

Data were collected on `r n` large-mouth bass in North Carolina rivers.

`mercury` = concentration of mercury in fish tissue (ppm = mcg/g)

`weight` = weight in kilograms (kg)

Could you weigh a fish and then give a recommendation as to whether it is safe to eat it?

"The Food and Drug Administration has set an "action level" of 1.0 ppm mercury for fish in interstate commerce."
[https://www.greenleft.org.au/content/mercury-how-much-safe](https://www.greenleft.org.au/content/mercury-how-much-safe)

A concentration over 1 part per million is considered unsafe for human consumption.
In light of this, what recommendations can you make for fish caught from these rivers?

#### Mercury: Linear relationship?

Is there a linear relationship between `mercury` and `weight`?

```{r, fig.height=4}
fishFit = lm(mercury ~ weight, data=fishData)
gf_point(mercury ~ weight, data=fishData) %>%
  gf_coefline(model=fishFit)
```

Do the specifications of a linear regression model seem reasonable for these data?

1. The errors are **uncorrelated**
2. The errors have no systematic info about $Y$
    a. $E(\epsilon_i)=0$ for all $i$ (for all $x$)
    b. The errors are not correlated with the $x$ values
    c.
$\var(Y_i) \;=\; \var(\epsilon_i) \;=\; \sigma^2$ is the same for all $i$ (for all $x$)
3. The mean of $Y$ depends linearly on $x$
4. The errors have a normal distribution: $\epsilon \;\sim\; N(0, \sigma^2)$

What is the distribution of the response and predictor?

```{r, fig.width=10, fig.height=4, fig.show="hold"}
# histograms of the response and predictor variables
p1 = gf_histogram(~ mercury, data = fishData, bins=20, color="white")
p2 = gf_histogram(~ weight, data = fishData, bins=20, color="white")
grid.arrange(p1, p2, ncol=2)
```

#### Mercury: Plot of residuals vs. fitted values

```{r, fig.height=4}
# add fitted values and residuals to the data frame
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fishFit = lm(mercury ~ weight, data=fishData)
fishData = fishData %>%
  mutate(fitted = predict(fishFit), resids = residuals(fishFit))

# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fishData) %>%
  gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
  gf_hline(yintercept = 0, linetype = 2)
```

#### Mercury: Could residuals come from $N(0,\sigma^2)$?

```{r fig.width=10, fig.height=4, fig.show="hold"}
p1 = gf_histogram(~ resids, bins = 20, data = fishData, color="white")
p2 = gf_qq(~ resids, data = fishData) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)

# NOT USED: I was playing around writing a function to draw a normal quantile plot
# with reference line using ggplot2, but the above is easier using ggformula package
qqplot.data = function (sample) # argument: vector of numbers
{
  # following four lines from base R's qqline()
  y = quantile(sample[!is.na(sample)], c(0.25, 0.75))
  x = qnorm(c(0.25, 0.75))
  slope = diff(y)/diff(x)
  intercept = y[1L] - slope * x[1L]
  df = data.frame(sample)
  ggplot(df, aes(sample = sample)) +
    stat_qq() +
    geom_abline(slope = slope, intercept = intercept)
}
p2 = qqplot.data(fishData$resids)
```

#### Mercury: Is the model useful for predictions?

```{r, fig.height=4}
# I found out how to get the confidence/prediction bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
# LS regression line with confidence and prediction bands
gf_lm(mercury ~ weight, data = fishData,
      formula = y ~ x, backtrans = identity,
      interval = "prediction", fill = "skyblue") %>%
#  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()
```

### Mercury in Fish: A second data analysis

#### Mercury: Transformation of the data

```{r}
# scatterplot with LS regression line added
fishFit = lm(mercury ~ weight, data=fishData)
p1 = gf_point(mercury ~ weight, data = fishData) %>%
  gf_coefline(model = fishFit)

# residual plot: plot of residuals vs. fitted values
p2 = gf_point(resids ~ fitted, data = fishData) %>%
  gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
  gf_hline(yintercept = 0, linetype = 2)

grid.arrange(p1, p2, ncol=2)
```

The y-values and x-values are skewed right.
The larger values are twice the size of middle values.
The errors get larger with the predictor (multiplicative errors, not additive).
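One way to motivate the transformation used in the next section (a sketch under an assumed power-law form, which the lecture does not state explicitly): if mercury builds up multiplicatively with weight, say

$$Y \;=\; \alpha\, x^{\beta}\, \epsilon$$

with a multiplicative error $\epsilon > 0$, then taking base-2 logs of both sides gives

$$\log_2(Y) \;=\; \log_2(\alpha) \;+\; \beta \log_2(x) \;+\; \log_2(\epsilon),$$

a linear model with an additive error, which is the form we fit next.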
Let's make a new response: `logmercury = log(mercury, base=2)`

and a new predictor: `logweight = log(weight, base=2)`

```{r}
# Create new data frame adding log(weight) and log(mercury) to data frame (base 2 log)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fish2Data = fishData %>%
  mutate(logweight = log(weight, base=2), logmercury = log(mercury, base=2))
```

This helps with the skewness problems.

```{r, fig.width=10, fig.height=4, fig.show="hold"}
# histograms of response and predictor variables
p1 = gf_histogram(~ logmercury, data = fish2Data, bins = 15, color = "white")
p2 = gf_histogram(~ logweight, data = fish2Data, bins = 15, color = "white")
grid.arrange(p1, p2, ncol=2)
```

#### Mercury Transformed: Linear relationship?

Is the relationship still linear? Did we fix the non-constant variance problem?

```{r, fig.height=4}
# Fit LS model to transformed variables
fish2Fit = lm(logmercury ~ logweight, data = fish2Data)

# Scatterplot of transformed data with LS regression line
gf_point(logmercury ~ logweight, data = fish2Data) %>%
  gf_coefline(model = fish2Fit)
```

#### Do transformed data better support model requirements?

Plot of residuals vs. fitted values (yhats)

```{r, fig.height=4}
# add fitted values and residuals to the data frame
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fish2Fit = lm(logmercury ~ logweight, data = fish2Data)
fish2Data = fish2Data %>%
  mutate(fitted = predict(fish2Fit), resids = residuals(fish2Fit))

# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fish2Data) %>%
  gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
  gf_hline(yintercept = 0, linetype = 2)
```

#### Mercury Transformed: Could residuals come from $N(0,\sigma^2)$?

```{r fig.width=10, fig.height=4, fig.show="hold"}
p1 = gf_histogram(~ resids, data = fish2Data, bins=15, color="white")
p2 = gf_qq(~ resids, data = fish2Data) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)
```

#### Mercury Transformed: Is the model useful for predictions?

```{r, fig.height=4}
# I found out how to get the confidence/prediction bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
# LS regression line with confidence and prediction bands
gf_lm(logmercury ~ logweight, data = fish2Data,
      interval = "prediction", fill = "skyblue") %>%
#  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()
```

### So, is it safe to eat small fish?

From the graph, it appears that, **on average**, a fish has below 1 ppm of mercury (0 on the y-axis above) as long as it weighs less than $2^0=1$ kg.

However, for the particular fish you just caught (prediction), you might be more comfortable if it weighed less than $2^{-1.25}=`r 2^(-1.25)`$ kg (`r 1000*2^(-1.25)` grams)!

Confidence and prediction bands transformed back to the original scale. A bit easier to read.
```{r, fig.height=4} # I found out how to get the confidence/prediction bands on the plot using a ggformula # function: gf_lm (ggformula is just a formula-based interface to ggplot2) # confidence and prediction bands on the original scale # but data fit using the log(y) and log(x) transformation gf_lm(mercury ~ weight, data = fish2Data, formula = log(y) ~ log(x), backtrans = exp, interval = "prediction", fill = "skyblue") %>% gf_lm( formula = log(y) ~ log(x), backtrans = exp, interval = "confidence", color = "blue") %>% gf_point() ```
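Here is a final sketch using the fitted `fish2Fit` model: a prediction interval on the original ppm scale for a single fish of a given weight (the 0.75 kg below is just an illustrative value, not from the analysis above).

```{r}
# Back-transform a prediction interval from the log2 scale to ppm (a sketch).
# The 0.75 kg weight is just an illustrative value.
new.fish = data.frame(logweight = log(0.75, base = 2))
pred.log2 = predict(fish2Fit, new.fish, interval = "prediction")
2^pred.log2   # convert from log2(mercury) back to mercury in ppm
```

If the upper prediction limit comes out below 1 ppm, a fish of that weight would be below the FDA action level quoted above; otherwise, you might think twice.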