First, I loaded some R packages needed for the code in this document. Click the Code
button to see my R code.
library(tidyverse)
library(knitr)
library(mosaic)
library(gridExtra)
library(broom)
library(modelr)
library(car)
You install a package just once, then use library() to load it each time you need it. For example, run
install.packages("mosaic")
only once, ever, but run
library(mosaic)
each time you start up RStudio.
Collect data on
General deterministic part of model \(\quad Y \;=\; f(X_1,X_2, \ldots, X_p)\)
We might opt for a linear relationship \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p\] …and then, allow for error (a statistical model) \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p + \epsilon\] This is called a linear regression model
“All models are wrong, but some are useful.” (George Box)
Few processes are actually linear.
Still, the following are linear regression models
\[\begin{align} Y &\;=\; \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 \log(X) + \epsilon \end{align}\] because the “coefficients” \(\beta_0\), \(\beta_1\), and \(\beta_2\) enter the model linearly.
For example, the non-linear model \(Y \;=\; \beta_0 e^{\beta_1 X} \epsilon\)
can be transformed to a linear model (take the log of both sides) \[\begin{align}
\log(Y) &\;=\; \log(\beta_0) + \beta_1 X + \log(\epsilon) \\ \\
Y' & \;=\; \beta_0' + \beta_1 X + \epsilon'
\end{align}\]
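In R, each of these linear-in-the-coefficients models can be fit with lm(); here is a minimal sketch, assuming a hypothetical data frame dat with numeric columns y and x (names chosen just for illustration):
# Hypothetical data frame `dat` with columns y and x (for illustration only)
fit_quad = lm(y ~ x + I(x^2), data = dat)   # Y = beta0 + beta1*x + beta2*x^2 + error
fit_log  = lm(y ~ log(x), data = dat)       # Y = beta0 + beta1*log(x) + error
fit_exp  = lm(log(y) ~ x, data = dat)       # log(Y) = log(beta0) + beta1*x + log(error)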
STOP
Once we propose a linear model, the method is intuitive:
Fit a linear function through the data points using some objective criterion for a good “fit”.
We will endeavor to provide you with tidy data:
observation | response \(Y\) | predictor \(x\) |
---|---|---|
1 | \(y_1\) | \(x_1\) |
2 | \(y_2\) | \(x_2\) |
… | … | … |
… | … | … |
… | … | … |
\(n\) | \(y_n\) | \(x_n\) |
A sample of records on computer repair service calls was collected.
Two variables were observed: the length of the service call in minutes and the number of units repaired.
Which variable is response (\(Y\))?
Which variable is explanatory or “predictor” variable (\(X\))?
It depends… on the question of interest.
What if we knew how long the repairer worked?
Did they repair as many units as we would have expected?
What if we knew how many units needed repair?
How much time should we schedule for the repairer?
\(Y\) = Time to complete work (minutes)
\(X\) = Number of units repaired
# Read in the data from the course data repository
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")
repairData # view the data (OK, since small sample)
Minutes | Units |
---|---|
23 | 1 |
29 | 2 |
49 | 3 |
64 | 4 |
74 | 4 |
87 | 5 |
96 | 6 |
97 | 6 |
109 | 7 |
119 | 8 |
149 | 9 |
145 | 9 |
154 | 10 |
166 | 10 |
glimpse(repairData) # view a summary of the data (OK, for larger samples too)
Observations: 14
Variables: 2
$ Minutes <dbl> 23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166
$ Units <dbl> 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10
- ggplot2 package (with the ggformula package to help out)
- lattice package
- base R (defaults not pretty)
- base R (make it prettier)
R (make it prettier)p1 = gf_point(Minutes ~ Units, data=repairData) # ggplot2 graphics made easier by ggformula package
p2 = xyplot(Minutes ~ Units, data=repairData) # lattice graphics package
grid.arrange(p1, p2, ncol=2) # arrange the plots side by side (2 columns)
par(mfrow=c(1,2)) # prepare to arrange base R plots with 1 row, 2 cols
plot(repairData$Units, repairData$Minutes)
plot(repairData$Units, repairData$Minutes, pch=20, xlab="Units", ylab="Minutes")
par(mfrow=c(1,1)) # return the base R setting to 1 plot (1 row, 1 col)
I usually use the ggplot2
package (with the ggformula
package to help out).
STOP
Demonstrate ggformula
syntax.
Fancy add-ons you can “chain” together with ggplot2
gf_point(Minutes ~ Units, data=repairData) %>%
gf_rug() %>%
gf_smooth(se = TRUE)
Sample averages: \(\;\overline{y} \;=\; \frac{1}{n}\sum_{i=1}^n y_i \qquad \overline{x} \;=\; \frac{1}{n}\sum_{i=1}^n x_i\)
Sample variances: \[\mathrm{var}(y) \;=\; s_y^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2 \qquad \mathrm{var}(x) \;=\; s_x^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2\] Sample standard deviations: \[\mathrm{sd}(y) \;=\; s_y \;=\; \sqrt{s_y^2}\;, \quad \mathrm{sd}(x) \;=\; s_x \;=\; \sqrt{s_x^2}\]
Standardized data: \[z_i \;=\; \frac{y_i - \overline{y}}{s_y} \qquad w_i \;=\; \frac{x_i - \overline{x}}{s_x}\] Can show that \(\overline{z} \;=\; 0\) and \(\mathrm{sd}(z) \;=\; s_z \;=\; 1\) (same for \(w\))
STOP
Show that \(\overline{z} \;=\; 0\)
\[\overline{z} \;=\; \frac{1}{n}\sum_{i=1}^n z_i \;=\; \frac{1}{n}\sum_{i=1}^n \frac{y_i - \overline{y}}{s_y} \;=\; \frac{1}{n(s_y)} \sum_{i=1}^n (y_i - \overline{y}) \;=\; \frac{1}{n(s_y)} \left[ \sum_{i=1}^n y_i - \sum_{i=1}^n \overline{y}\right] \;=\; \frac{1}{n(s_y)} [ n\overline{y} - n\overline{y}] \quad\;=\;\quad 0 \]
Two variables: Minutes (the response \(Y\)) and Units (the predictor \(X\)).
ybar = mean(~ Minutes, data=repairData)
ybar
[1] 97.21
xbar = mean(~ Units, data=repairData)
xbar
[1] 6
sy = sd(~ Minutes, data=repairData)
sy
[1] 46.22
sx = sd(~ Units, data=repairData)
sx
[1] 2.961
\(\bar y = 97.21 \quad \overline{x}= 6\)
\(s_y = 46.22 \quad s_x = 2.961\)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
# Here we add standardized versions of the original variables
repairData = mutate(repairData,
stdMinutes = (Minutes - mean(Minutes)) / sd(Minutes),
stdUnits = (Units - mean(Units)) / sd(Units)
)
repairData
Minutes | Units | stdMinutes | stdUnits |
---|---|---|---|
23 | 1 | -1.6058 | -1.6885 |
29 | 2 | -1.4760 | -1.3508 |
49 | 3 | -1.0432 | -1.0131 |
64 | 4 | -0.7187 | -0.6754 |
74 | 4 | -0.5023 | -0.6754 |
87 | 5 | -0.2210 | -0.3377 |
96 | 6 | -0.0263 | 0.0000 |
97 | 6 | -0.0046 | 0.0000 |
109 | 7 | 0.2550 | 0.3377 |
119 | 8 | 0.4714 | 0.6754 |
149 | 9 | 1.1205 | 1.0131 |
145 | 9 | 1.0339 | 1.0131 |
154 | 10 | 1.2287 | 1.3508 |
166 | 10 | 1.4883 | 1.3508 |
# Confirm that the average of standardized data is 0, with sd=1
zbar = mean(~ stdMinutes, data=repairData)
zbar
[1] 1.192e-16
sz = sd(~ stdMinutes, data=repairData)
sz
[1] 1
For the standardized response (Minutes): \(\overline{z}= 0 \quad s_z = 1,\) as expected.
Standardizing the data does not change the “relationship”
p1 = gf_point(Minutes ~ Units, data=repairData)
p2 = gf_point(stdMinutes ~ stdUnits, data=repairData)
grid.arrange(p1, p2, ncol=2)
\[\mathrm{var}(y) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y})^2 \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (y_i - \overline{y})\]
\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]
\[\mathrm{var}(z) \;=\; \frac{1}{n-1} \sum_i (z_i - \overline{z})^2 \;=\; \frac{1}{n-1} \sum_i (z_i - 0)^2 \;=\; \frac{1}{n-1} \sum_i z_i^2 \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \] From here, it’s simple to show that \(\mathrm{var}(z) = 1\):
\[\mathrm{var}(z) \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \quad\;=\;\quad \frac{1}{s_y^2} \frac{1}{n-1} \sum_i \left(y_i - \overline{y}\right) \left(y_i - \overline{y}\right) \quad\;=\;\quad \frac{1}{s_y^2} var(y) = \frac{1}{s_y^2} s_y^2 = 1 \]
\[\begin{align} r \;=\; \mathrm{corr}(y,x) \quad &\;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right) \\ \\ \quad &\;=\; \quad \frac{\mathrm{cov}(y,x)}{s_y s_x} \quad \;=\; \quad \frac{\mathrm{cov}(y,x)}{\sqrt{\mathrm{var}(y) \mathrm{var}(x)}} \end{align}\]
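As a quick check of the last identity, one could compute the correlation for the repair data loaded above both ways (this sketch uses the mosaic formula interface already loaded):
# r = cov(y, x) / (s_y * s_x)
cov(Minutes ~ Units, data = repairData) / (sd(~ Minutes, data = repairData) * sd(~ Units, data = repairData))
cor(Minutes ~ Units, data = repairData)   # should give the same value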
Consider the following variables
Consider \(V\) and \(W\).
Which is response?
Which is predictor?
Would you expect a positive or negative correlation?
Consider \(W\) and \(Z\).
Which is response?
Which is predictor?
Would you expect a positive or negative correlation?
STOP
Ponder the questions above.
- sleep2Before = \(V\) = hours of sleep two nights before the midterm exam
- sleepBefore = \(W\) = hours of sleep the night before the midterm exam
- sleepAfter = \(Z\) = hours of sleep the night after the midterm exam

Here is a little data:
v = c(5, 7, 4, 8, 6)
w = c(3, 8, 2, 5, 6)
z = c(8, 7, 9, 7, 8)
df = tibble(sleep2Before=v, sleepBefore=w, sleepAfter=z)
df
sleep2Before | sleepBefore | sleepAfter |
---|---|---|
5 | 3 | 8 |
7 | 8 | 7 |
4 | 2 | 9 |
8 | 5 | 7 |
6 | 6 | 8 |
# gf_vline adds a vertical line at "xintercept"
# gf_hline adds a horizontal line at "yintercept"
p1 = gf_point(sleepBefore ~ sleep2Before, data=df) %>%
gf_vline(xintercept=mean(~ sleep2Before, data=df), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepBefore, data=df), linetype=2)
p2 = gf_point(sleepAfter ~ sleepBefore, data=df) %>%
gf_vline(xintercept=mean(~ sleepBefore, data=df), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepAfter, data=df), linetype=2)
grid.arrange(p1, p2, ncol=2)
\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]
covwv = cov(sleepBefore ~ sleep2Before, data=df)
covwv
[1] 2.75
covzw = cov(sleepAfter ~ sleepBefore, data=df)
covzw
[1] -1.55
\(\mathrm{cov}(w,v) = 2.75\)
\(\mathrm{cov}(z,w) = -1.55\)
What if we recorded the data in minutes instead of hours?
Then, the relationships are still the same
df2 = df*60
p3 = gf_point(sleepBefore ~ sleep2Before, data=df2) %>%
gf_vline(xintercept=mean(~ sleep2Before, data=df2), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepBefore, data=df2), linetype=2)
p4 = gf_point(sleepAfter ~ sleepBefore, data=df2) %>%
gf_vline(xintercept=mean(~ sleepBefore, data=df2), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepAfter, data=df2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)
…but, the covariance changes:
cov(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 9900
cov(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -5580
\(\mathrm{cov}(60w,60v) = 9,900\)
\(\mathrm{cov}(60z,60w) = -5,580\)
Standardization (correlation) takes care of the units problem. \[r \;=\; \mathrm{corr}(y,x) \quad \;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right)\]
corrwv = cor(sleepBefore ~ sleep2Before, data=df)
corrwv
[1] 0.7285
cor(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 0.7285
corrzw = cor(sleepAfter ~ sleepBefore, data=df)
corrzw
[1] -0.776
cor(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -0.776
covwv = cov(sleepBefore ~ sleep2Before, data=df)
covzw = cov(sleepAfter ~ sleepBefore, data=df)
\(\mathrm{corr}(w,v) = 0.7285\)
\(\mathrm{corr}(60w,60v) = 0.7285\)
\(\mathrm{corr}(z,w) = -0.776\)
\(\mathrm{corr}(60z,60w) = -0.776\)
Let \(Y = 50 - X^2\), then
mydata = tibble(x = c(-7:7), y = 50 - x^2)
mydata
x | y |
---|---|
-7 | 1 |
-6 | 14 |
-5 | 25 |
-4 | 34 |
-3 | 41 |
-2 | 46 |
-1 | 49 |
0 | 50 |
1 | 49 |
2 | 46 |
3 | 41 |
4 | 34 |
5 | 25 |
6 | 14 |
7 | 1 |
gf_point(y ~ x, data=mydata) +
geom_vline(data=mydata, aes(xintercept=mean(x)), linetype=2) +
geom_hline(data=mydata, aes(yintercept=mean(y)), linetype=2)
cor(y ~ x, data=mydata)
[1] 0
Believe it or not (HW),
all four relationships below have exactly the same correlation coefficient.
AnscombeData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P029b.txt")
p1 = gf_point(Y1 ~ X1, data=AnscombeData)
p2 = gf_point(Y2 ~ X2, data=AnscombeData)
p3 = gf_point(Y3 ~ X3, data=AnscombeData)
p4 = gf_point(Y4 ~ X4, data=AnscombeData)
grid.arrange(p1, p2, p3, p4, ncol=2)
We definitely should make a plot to interpret correlation!
Two variables: Minutes (the response \(Y\)) and Units (the predictor \(X\)).
The relationship appears linear.
repairData
Minutes | Units | stdMinutes | stdUnits |
---|---|---|---|
23 | 1 | -1.6058 | -1.6885 |
29 | 2 | -1.4760 | -1.3508 |
49 | 3 | -1.0432 | -1.0131 |
64 | 4 | -0.7187 | -0.6754 |
74 | 4 | -0.5023 | -0.6754 |
87 | 5 | -0.2210 | -0.3377 |
96 | 6 | -0.0263 | 0.0000 |
97 | 6 | -0.0046 | 0.0000 |
109 | 7 | 0.2550 | 0.3377 |
119 | 8 | 0.4714 | 0.6754 |
149 | 9 | 1.1205 | 1.0131 |
145 | 9 | 1.0339 | 1.0131 |
154 | 10 | 1.2287 | 1.3508 |
166 | 10 | 1.4883 | 1.3508 |
gf_point(Minutes ~ Units, data=repairData)
Correlation is quite high.
covyx = cov(Minutes ~ Units, data=repairData)
covyx
[1] 136
corryx = cor(Minutes ~ Units, data=repairData)
corryx
[1] 0.9937
Covariance and correlation in original units:
\(\mathrm{cov}(y,x) = 136\)
\(\mathrm{corr}(y,x) = 0.9937\)
covzw = cov(stdMinutes ~ stdUnits, data=repairData)
covzw
[1] 0.9937
corrzw = cor(stdMinutes ~ stdUnits, data=repairData)
corrzw
[1] 0.9937
Covariance and correlation in standardized units:
\(\mathrm{cov}(z,w) = 0.9937\)
\(\mathrm{corr}(z,w) = 0.9937\)
\[\begin{align} Y & \;=\; f(X) + \epsilon \\ \\ & \;=\; \beta_0 + \beta_1 X + \epsilon \end{align}\]
This is called a simple linear regression model
since there’s only one predictor: \(X\)
random variable = deterministic linear function + random error
**Illustration of data generation**
If we make a detailed model of how the data are generated, we can simulate data from that model, fit our estimators to each simulated sample, and see how the estimates behave over repeated samples.
Or, we can do all of the above analytically. This gives us results in nice algebraic form, and lets us see which exact assumptions about the data generating process were necessary to our conclusions.
If the real data generation process is not well-described by a model meeting the following criteria, then we need a different model.
Throughout the course, we will consider the \(x\) values as fixed.
That is, all models are built conditional on the \(x\) values observed.
Adding in a model for the distribution of \(X\) is beyond the scope of the course.
Still, the model conditional on the observed \(x\) values (although incorrect as all models are) is very useful for summarizing and describing relationships and making predictions for \(Y\) based on new observations on \(x\).
The mean of \(Y\) must depend linearly on \(x\).
\[\begin{align} E(Y\; |\; X=x) &\;=\; E(\beta_0 +\beta_1 x + \epsilon ) \\ \\ & \;=\; \beta_0 +\beta_1 x + E(\epsilon) \\ \\ & \;=\; \beta_0 +\beta_1 x + 0 \;=\; \beta_0 +\beta_1 x \end{align}\]
So, if the errors have no relationship to the values of \(x\), it follows that the mean of \(Y\) is a linear function of \(x\). (Or a better way to put it: if the mean of \(Y\) is not a linear function of \(x\), the errors won’t meet criterion 2b.)
For individual \(i\) in the population with predictor value \(X=x=x_i\)
\[Y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\] This model says that the data generation process can be described as fixing values for \(x\), then finding corresponding \(y=\beta_0 + \beta_1 x\), and finally adding a random error \(\epsilon\) for each individual.
The average response for such individuals is \(E(Y\; |\; X=x_i) \;=\; \beta_0 +\beta_1 x_i\)
Let’s try generating a sample from a population
where we have fixed values for \(\beta_0\), \(\beta_1\), and \(x\) values.
beta0_sim1 = 3
beta1_sim1 = 4
n_sim1 = 50
mx_sim1 = 2
sx_sim1 = 1
I set \(\beta_{0} = 3\), \(\beta_{1} = 4\), and \(n= 50\).
First we fix the values and pick a set of \(n = 50\) \(x\) values.
I’ll pick them from a Normal(\(\mu = 2, \sigma = 1\)) distribution.
set.seed(1234567)
# Generate x values for individuals in population
X = rnorm(n_sim1, m = mx_sim1, s = sx_sim1)
gf_histogram(~ X, bins=12, color = "white")
If the relationship is deterministic,
we could generate \(Y\)-values with a simple linear relationship: \(Y = 3 + 4 x\)
Y_without_error = beta0_sim1 + beta1_sim1 * X # Y values without errors
gf_point(Y_without_error ~ X, color='red')
What is the intercept of this line? What is its slope?
gf_point(Y_without_error ~ X, color = 'red') %>%
gf_abline(slope = beta1_sim1, intercept = beta0_sim1)
Of course, for a statistical model,
we only require the mean of \(Y\) to be a linear function of \(x\)
sd_error_sim1 = 2
So, let’s generate a set of errors \(\epsilon\), one for each observation.
I’ll pick them from a normal distribution: Normal(\(\mu = 0, \sigma = 2\)) .
Reminder: According to our simple statistical model, the errors have mean = 0
set.seed(123466)
# Generate error values for individuals in population with given x-values
error = rnorm(n_sim1, m = 0, s = sd_error_sim1)
gf_histogram(~ error, bins=12, color = "white")
Finally, we can generate \(Y\) values from the \(x\) and error values
by adding the errors to the linear function of \(x\) values.
Y = beta0_sim1 + beta1_sim1 * X + error # Y values, with errors
generated_data = tibble(X, error, Y_without_error, Y)
head(generated_data, 10)
X | error | Y_without_error | Y |
---|---|---|---|
2.1567 | -2.5555 | 11.627 | 9.0713 |
3.3738 | -0.2737 | 16.495 | 16.2216 |
2.7307 | -2.0445 | 13.923 | 11.8782 |
0.6492 | 0.8448 | 5.597 | 6.4416 |
1.9915 | -0.8568 | 10.966 | 10.1092 |
2.3210 | 0.9133 | 12.284 | 13.1972 |
0.2219 | -3.0707 | 3.887 | 0.8167 |
2.9095 | 1.4770 | 14.638 | 16.1150 |
1.0806 | 2.6609 | 7.322 | 9.9832 |
1.8423 | 0.4994 | 10.369 | 10.8685 |
gf_point(Y ~ X, data = generated_data, color = 'blue', show.legend = TRUE) %>%
gf_point(Y_without_error ~ X, data = generated_data, color = 'red') %>%
gf_abline(intercept = beta0_sim1, slope = beta1_sim1)
We have already seen that \(\beta_0\) is the intercept of the above graphs. Let’s make this more precise.
When \(x=0\), we expect \(Y\) to be \(\beta_0\).
\[E(Y\; |\; X=0) \;=\; \beta_0 +\beta_1 (0) \;=\; \beta_0\]
Computer repair data: What does \(\beta_{0}\) represent?
gf_point(Minutes ~ Units, data=repairData) +
xlim(-1, max(~ Units, data=repairData)) +
ylim(10, max(~ Minutes, data=repairData))
# gf_smooth(method = "lm", se = FALSE) +
STOP …and think about it.
When \(x=0\), we estimate the mean of \(y\) to be \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} (0) = \widehat{\beta}_{0}\) minutes.
For the computer example, this means that when no units are serviced,
(guessing from the graph) we expect the work to take about 5 minutes.
Hmmmm…. That doesn’t make sense.
Perhaps interpret this as a “start-up” time before working on the 1st unit?
We have already seen that \(\beta_1\) is the slope of the above graphs. Let’s make this more precise, too – and in a way that will carry forward when we go to multiple predictor variables, later in the course.
For individuals with predictor value \(x+1\) versus those with \(x\),
we expect the response to be \(\beta_1\) units higher (or lower).
\[\begin{align} E(Y\; |\; X=x+1) &\;=\; \beta_0 +\beta_1 (x+1) \\ \\ &\;=\; \beta_0 +\beta_1 x + \beta_1 \;=\; E(Y\; |\; X=x) + \beta_1 \end{align}\]
That is, \(\; E(Y\; |\; X=x+1) - E(Y\; |\; X=x) \;=\; \beta_1\)
Computer repair data: What does \(\beta_1\) represent?
gf_point(Minutes ~ Units, data=repairData) +
xlim(-1, max(~ Units, data=repairData)) +
ylim(10, max(~ Minutes, data=repairData))
# gf_smooth(method = "lm", se = FALSE) +
STOP …and think about it.
For each additional unit to be serviced, we estimate the mean time to completion (guessing from the graph) to be about 12 minutes longer.
Estimates of \(\beta_{0}\) and \(\beta_{1}\) from data:
tidy(lm(Minutes ~ Units, data=repairData))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
We will soon learn how to “best” calculate a slope and intercept from data.
For individual \(i\) in the sample
\[y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\]
Suppose we know \(x_i\) but not \(y_i\). The true value is \(y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\), but we don’t know \(\epsilon_i\).
The fitted (estimated) value of \(Y\)
for individual \(i\) already in the sample is \[\widehat{y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\]
The estimated average (expected) value
for all individuals in the population with \(X=x_0\) is \[\widehat{\mu}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]
The predicted (estimated) value
for a yet-to-be-observed or just now observed individual in the population with \(X=x_0\) is \[\widehat{y}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]
Most of the time, we don’t know anything about the underlying population, including the true values of the parameters \(\beta_0\) and \(\beta_1\), and \(\mathrm{var}(\epsilon)=\sigma^2\).
All we have are observations in a sample.
We will use these observations to calculate estimates of \(\beta_0\), \(\beta_1\), as well as errors \(\epsilon_i\) and their variance.
Suppose we have estimates for \(\beta_0\) and \(\beta_1\).
We can then use these estimates to calculate the fitted response \(\widehat{y}_i\) for each observation.
The difference between the fitted response and the observed response is an estimate of \(\epsilon_i\).
The estimated error (estimate of \(\epsilon_i\)) is called a residual (\(e_i\))
\[\begin{align} {\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\ &\;=\; {\rm observed - expected} \\ &\;=\; {\rm observed - predicted} \\ \\ &\;=\; y_i - \widehat{y}_i\\ \\ &\;=\; y_i - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i) \end{align}\]
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
a = 10
b = 4
mx = 15
sx = 5
s = 20
nsim2 = 100
set.seed(123456)
x = rnorm(nsim2, m=mx, sd=sx)
y = a + b*x + rnorm(nsim2, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)
# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)
xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi
plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)
points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)
lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))
text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
Residuals for the computer repair data:
# Find the least squares estimates for intercept and slope (lm function)
repairFit = lm(Minutes ~ Units, data=repairData)
# Once you have an "lm object", you can grab all kinds of summaries from it
# save the fitted values: predict function
# save the residuals: residuals function
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
repairData = mutate(repairData,
fitted = predict(repairFit),
resids = residuals(repairFit)
)
# This is a pretty fancy plot that I Googled and borrowed from here:
# https://www.r-bloggers.com/visualising-residuals/
ggplot(repairData, aes(x = Units, y = Minutes)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope
geom_segment(aes(xend = Units, yend = fitted), alpha = 0.2) + # alpha to fade lines
geom_point() +
geom_point(aes(y = fitted), shape = 1) +
theme_bw() # Add theme for cleaner look
What line should we choose to “fit” the model to data?
(This is just computer simulated data for illustration.)
# This entire section is borrowed from Hadley Wickham: R for Data Science
# http://r4ds.had.co.nz/model-basics.html
hadleyData <- modelr::sim1
ggplot(hadleyData, aes(x, y)) +
geom_point()
m = 250
Here are \(250\) random possible models, but most are terrible!
set.seed(12345)
models = tibble(
b0 = runif(m, -20, 40),
b1 = runif(m, -5, 5)
)
ggplot(hadleyData, aes(x, y)) +
geom_abline(aes(intercept = b0, slope = b1), data = models, alpha = 1/4) +
geom_point()
### Loss functions
No matter which two numbers we choose for the slope and intercept,
we are summarizing an entire distribution with just two numbers.
The least squares estimates \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) minimize the sum of squared residuals
In our textbook,
SSE is called the “Sum of Squared Residuals (Errors)”.
MSE is called the “Mean Squared Residual (Error)”.
NOTATION CAUTION:
This is a bit confusing since “errors” (\(\epsilon_i\)) are not observed.
Only residuals are observed, but “SSR” is reserved for another quantity to be discussed later.
\[\mathrm{SSE}\;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{y}_i)^2 \;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2\]
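For instance, with the repair-data fit repairFit from above, the SSE can be computed directly from the residuals (deviance() returns the same quantity for an lm fit):
# SSE = sum of squared residuals from the least squares fit
sum(residuals(repairFit)^2)
deviance(repairFit)   # same quantity, reported by base R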
hadleyFit = lm(y ~ x, data=hadleyData)
mydata = hadleyData
mydata$fitted = predict(hadleyFit) # Save the fitted values
mydata$residuals = residuals(hadleyFit) # Save the residual values
ggplot(mydata, aes(x, y)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope
geom_segment(aes(xend = x, yend = fitted), alpha = 0.2) + # alpha to fade lines
geom_point() +
geom_point(aes(y = fitted), shape = 1) +
theme_bw() # Add theme for cleaner look
Which models have the smallest sum of squared residuals (SSE)?
model1 = function(b, data) {
b[1] + data$x * b[2]
}
measure_distance = function(mod, data) {
diff = data$y - model1(mod, data)
sum(diff ^ 2) # changed to SSE from Hadley's RMSE sqrt(mean(diff ^ 2))
}
hadley_SSE = function(b0, b1) {
measure_distance(c(b0, b1), hadleyData)
}
models = models %>%
mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))
ggplot(hadleyData, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = b0, slope = b1, colour = SSE),
data = filter(models, rank(SSE) <= 10)
)
Let’s try to find the smallest SSE on our own.
http://www.rossmanchance.com/applets/RegShuffle.htm
Among the \(250\) models, we have isolated the 10 “closest” to the data
ggplot(models, aes(b0, b1)) +
geom_point(data = filter(models, rank(SSE) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = SSE)) +
labs(x = "Intercepts", y = "Slopes")
Here is a more organized approach: A grid search
grid = expand.grid(
b0 = seq(-5, 20, length = 25),
b1 = seq(1, 3, length = 25)
) %>%
mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))
grid %>%
ggplot(aes(b0, b1)) +
geom_point(data = filter(grid, rank(SSE) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = SSE))
ggplot(hadleyData, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = b0, slope = b1, colour = SSE),
data = filter(grid, rank(SSE) <= 10)
)
# End of section borrowed from Hadley Wickham: R for Data Science
Of course, we can make the grid finer and finer to get closer to the “best fit” line
…or just use calculus to minimize the SSE
We obtain the least squares estimators by minimizing the sum of squared residuals.
You do NOT need to be able to do the math, just understand it.
Find \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) that lead to:
\[\min(\mathrm{SSE}) \;=\; \min \sum_i e_i^2 \;=\; \min \sum_i (y_i-\widehat{y}_i)^2 \;=\; \min \sum_i (y_i-\widehat{\beta}_{0}-\widehat{\beta}_{1} x_i)^2\]
\[\begin{align} \textrm{w.r.t } \widehat{\beta}_{0} \Rightarrow -2\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (1) \\ \\ \textrm{w.r.t } \widehat{\beta}_{1} \Rightarrow -2\sum x_i(y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (2) \end{align}\]
From equation (\(1\)): \(\;\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)
Plug this into equation (\(2\)): \[\begin{align} & \sum x_i(y_i - \overline{y}- \widehat{\beta}_{1}\overline{x}-\widehat{\beta}_{1}x_i)=0 \\ \\ \implies\; & \sum x_i(y_i - \overline{y})\;=\; \widehat{\beta}_{1} \sum x_i(x_i - \overline{x}) \\ \\ \implies\; & \widehat{\beta}_{1}\;=\; \frac{\sum x_i(y_i - \overline{y})}{\sum x_i(x_i - \overline{x})} \hskip2in (3) \end{align}\]
Note that \[\begin{align} & \sum (x_i-\overline{x})(y_i - \overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) -\sum \overline{x}(y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}\sum (y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}(0) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) \end{align}\]
Similarly, we have
\(\sum (x_i-\overline{x})(y_i - \overline{y})\;=\; \sum y_i(x_i-\overline{x})\)
\(\sum (x_i-\overline{x})^2\;=\; \sum x_i(x_i-\overline{x})\)
Therefore, according to (\(3\)) we have other ways to write the estimator \[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]
\(\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)
\(\widehat{\beta}_{1} \;=\; r \frac{s_y}{s_x}\)
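A quick numerical check of these formulas on the repair data, recomputing the summary statistics directly so the check is self-contained:
# closed-form least squares estimates: b1 = r * s_y / s_x, b0 = ybar - b1 * xbar
r_yx = cor(Minutes ~ Units, data = repairData)
b1_check = r_yx * sd(~ Minutes, data = repairData) / sd(~ Units, data = repairData)
b0_check = mean(~ Minutes, data = repairData) - b1_check * mean(~ Units, data = repairData)
c(b0_check, b1_check)                            # about 4.16 and 15.51
coef(lm(Minutes ~ Units, data = repairData))     # should match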
When sampling from one population of quantitative values \(Y\), we estimated the mean of \(Y\): \(E(Y) = \mu\).
Using our new linear regression notation, we actually estimated the mean of \(Y\) for the linear model: \(\qquad Y = \beta_0 + \epsilon \;\;=\;\; \mu + \epsilon\)
Now, we want to estimate \(\widehat{\mu} = \widehat{y}= \widehat{\beta}_{0}\)
We will again use as our loss function the sum (or mean) of squared residuals.
\[
SSE(a) = \sum_{i=1}^n\, (y_i-a)^2
\qquad\qquad
MSE(a) = \frac{1}{n-1}\sum_{i=1}^n\, (y_i-a)^2
\]
What value of \(a\) should we choose if loss is measured by SSE?
It seems reasonable that \(a\) should be in the “center” of the data.
But which value in the middle would be best?
One optimality criterion: Choose a number that minimizes SSE.
Using the least squares estimation method, \(\widehat{\beta}_{0}\) is the value that minimizes
\[\mathrm{SSE}\;=\; \sum (y_i - \widehat{\mu})^2 \;=\; \sum (y_i - \widehat{y})^2 \;=\; \sum(y_i - \widehat{\beta}_{0})^2\]
So, we want to minimize a function of the form \(f(a) = \sum (y_i -a)^2\) with respect to \(a\).
\[\frac{d}{da} f(a) = 2 \sum (y_i - a) (-1) = -2 \sum(y_i - a)\]
Set the derivative = 0 and solve for \(a\).
\[ \begin{align} -2 \sum(y_i - a) = 0 \implies\; & \sum(y_i - a) = 0 \\ \implies\; & \sum y_i - \sum a = 0 \\ \implies\; & \sum y_i - n a = 0 \\ \implies\; & \sum y_i = n a \\ \implies\; & a = \frac{1}{n} \sum y_i \\ \implies\; & a = \overline{y} \end{align} \]
The 2nd derivative of \(f(a)\) is
\[\frac{d}{da} (-2)\sum (y_i -a) = \frac{d}{da} (-2)\sum y_i + \frac{d}{da} (-2)\sum (-a) = 0 + 2n > 0\]
So, \(\widehat{\beta}_{0} = \overline{y}\) is a minimum, called the “least squares” estimate for \(\mu\).
So, \(\; \widehat{\beta}_{0} \;=\; \widehat{\mu} \;=\; \widehat{y}\;=\; \overline{y}\)
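A small numerical illustration with the repair times: minimizing \(f(a) = \sum (y_i - a)^2\) over \(a\) lands on the sample average.
# SSE as a function of a single candidate center a
sse = function(a) sum((repairData$Minutes - a)^2)
optimize(sse, interval = range(repairData$Minutes))$minimum   # numerical minimizer
mean(~ Minutes, data = repairData)                            # should agree (97.21)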
lm = the “linear model” function:
lm(y ~ x, data = mydata)
Back to the computer repair data:
# Find the least squares estimates for slope and intercept (lm function)
lmfit = lm(Minutes ~ Units, data=repairData)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Minutes ~ Units, data=repairData) %>%
gf_coefline(model=lmfit)
repairFit = lm(Minutes ~ Units, data=repairData)
# tidy (from the broom package) gives a summary of the slope and intercept
# and accompanying estimate of variance in the estimates
tidy(repairFit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
b0 = tidy(repairFit)$estimate[1]
b0
[1] 4.162
b1 = tidy(repairFit)$estimate[2]
b1
[1] 15.51
Let’s try the least squares estimate on the simulated data we generated earlier
where we set \(\beta_{0} = 3\) and \(\beta_{1} = 4\) and generated random \(Y\) values.
Since we simulated the data from the linear regression model,
the least squares estimates should be accurate.
# Find the least squares estimates for slope and intercept (lm function)
lmfit_sim = lm(Y ~ X, data = generated_data)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Y ~ X, data = generated_data) %>%
gf_coefline(model = lmfit_sim) %>%
gf_abline(intercept = beta0_sim1, slope = beta1_sim1, color = 'red',alpha=1)
Remember, our true population parameters were \(\beta_0= 3\) and \(\beta_1= 4\).
lmfit_sim = lm(Y ~ X, data = generated_data)
tidy(lmfit_sim)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 2.937 | 0.7552 | 3.889 | 0.0003 |
X | 3.983 | 0.3592 | 11.088 | 0.0000 |
b0_sim = tidy(lmfit_sim)$estimate[1]
b1_sim = tidy(lmfit_sim)$estimate[2]
Indeed \(\widehat{\beta}_{0}=2.937 \approx 3 = \beta_{0} \qquad \text{ and } \qquad \widehat{\beta}_{1}=3.983 \approx 4 = \beta_{1}\)
\(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)
random variable = deterministic linear function + random error
This is called a simple linear regression model since only one predictor: \(x\)
- The errors are uncorrelated.
- The errors have no systematic info about \(Y\):
  - The average error \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\)): the distribution of errors has mean zero (no systematic bias).
  - The errors are not correlated with the \(x\) values.
  - \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is the same for all \(i\) (for all \(x\)).
- The mean of \(Y\) depends linearly on \(x\).
There is actually nothing we need to check in the data
because least squares residuals always add up to zero!
For any data, \[\begin{align} \sum_i e_i &\;=\; \sum_i (y_i - \widehat{y}_i) \\ \\ &\;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - (\overline{y}- \widehat{\beta}_{1}\overline{x}) - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - \overline{y}) - \widehat{\beta}_{1} \sum_i (x_i - \overline{x}) \\ \\ &\;=\; 0 - \widehat{\beta}_{1} (0) \;=\; 0 \end{align}\]
So, the observed regression line \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} x\) plays the role of the
observed average: it is the balancing point of the observed \(y\)-values.
This should look familiar.
We showed that for data from a population \(Y\): \(\displaystyle{\sum_i (y_i - \overline{y}) = 0}\).
The observed average is the balancing point of the observed \(y\)-values.
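We can verify this numerically for the repair-data fit (the resids column was added to repairData earlier); the residuals also turn out to be uncorrelated with the predictor, a fact we return to later:
# residuals from a least squares fit sum to zero (up to rounding error)...
sum(residuals(repairFit))
# ...and have zero covariance with the predictor values
cov(resids ~ Units, data = repairData)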
Later, we will look at graphical diagnostics
to check for constant spread around the line for all \(x\) values.
We estimate \(\mathrm{var}(\epsilon)\) as
\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]
degrees of freedom (df)
= sample size \(-\) # parameters estimated for the mean (\(\beta_0\) and \(\beta_1\))
= \(n-2\)
Again, we will look at useful graphical diagnostics later.
We know that \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) will not exactly match the true values \(\beta_0\) and \(\beta_1\). Here, we are going to explore this variability.
We’ll start with a quick visualization. I generate several samples. The first is large, with 10,000 observations, and it’s meant to represent the population. The other three are small.
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
set.seed(1234)
sd_error_sim1=5
# Simulate the population (really just a big sample with 10,000 observations)
n_simbig=10000
X_big = runif(n_simbig)
error = rnorm(n_simbig, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y_big = beta0_sim1 + beta1_sim1 * X_big + error
# Now draw 3 smaller samples
n_sim1=10
X = runif(n_sim1)
# First sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y1 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
# Second sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y2 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
# Third sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y3 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
# Reshape data for plotting
this_data<-data.frame(X,Y1,Y2,Y3)
this_data_long <- melt(this_data, id="X")
# Plot population & sample 2
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>%
gf_point(Y2~X,color="green")%>% gf_smooth(Y2~X,method='lm',se=FALSE,color="green")
# Plot population & all samples
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>%
gf_point(value~X,data=this_data_long,color=~variable)%>% gf_smooth(value~X,data=this_data_long,color=~variable,method='lm',se=FALSE)
As we can see, the estimated slopes \(\widehat{\beta}_{1}\) and intercepts \(\widehat{\beta}_{0}\) differ for each of our three samples. What we’ll discuss now is what distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) we’d expect to see as we increase the number of our samples: the sampling distributions.
If the model is correct: \(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)
then the least squares estimates are unbiased.
\[E(\widehat{\beta}_{0}) \;=\; \beta_0 \qquad {\rm and} \qquad E(\widehat{\beta}_{1}) \;=\; \beta_1\]
To show this, let’s ponder the formula for \(\widehat{\beta}_{1}\) first.
Recall the various expressions for \(\widehat{\beta}_{1}\):
\[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]
In particular, \[\widehat{\beta}_{1} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_j-\overline{x})^2} \;=\; \sum{k_i y_i}\] where,
\[k_i \; =\; \frac{(x_i-\overline{x})}{\sum(x_i-\overline{x})^2}\] We’ll soon need these facts about the \(k_i\):
\[\begin{align} \sum_i k_i &\;=\; \sum_i \frac{(x_i-\overline{x})}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})} {\sum(x_j-\overline{x})^2} \;=\; 0 \\ \\ \sum_i k_ix_i &\;=\; \sum_i \frac{(x_i-\overline{x})x_i}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})^2} {\sum(x_j-\overline{x})^2} \;=\; 1 \\ \\ \sum_i k_i^2 &\;=\; \sum_i \frac{(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{\sum_i(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{1}{\sum(x_i-\overline{x})^2} \end{align}\]
\[\begin{align} E(\widehat{\beta}_{1}) &\;=\; E\left(\sum k_iY_i\right) \;=\; \sum E(k_i Y_i) \;=\; \sum k_i E(Y_i) \\ \\ &\;=\; \sum k_i (\beta_0 + \beta_1 x_i) \;=\; \beta_0 \sum k_i + \beta_1 \sum k_i x_i \\ \\ &\;=\; \beta_0 (0) + \beta_1 (1) \;=\; \beta_1 \end{align}\]
…and we can find the variance of \(\widehat{\beta}_{1}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{1}) &\;=\; \mathrm{var}\left(\sum k_iY_i\right) \;=\; \sum \mathrm{var}(k_i Y_i) \;=\; \sum k_i^2 \mathrm{var}(Y_i) \\ \\ &\;=\; \sum k_i^2 \sigma^2 \;=\; \sigma^2 \sum k_i^2 \;=\; \frac{\sigma^2}{\sum(x_i-\overline{x})^2} \end{align}\]
\[\begin{align} E(\widehat{\beta}_{0}) &\;=\; E(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; E(\overline{Y}) - \overline{x}E(\widehat{\beta}_{1}) \\ \\ &\;=\; (\beta_0 + \beta_1 \overline{x}) - \overline{x}\beta_1 \;=\; \beta_0 \end{align}\]
…and we can find the variance of \(\widehat{\beta}_{0}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{0}) &\;=\; \mathrm{var}(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; \mathrm{var}(\overline{Y}) + \overline{x}^2\mathrm{var}(\widehat{\beta}_{1}) - 2\overline{x}\,\mathrm{cov}(\overline{Y}, \widehat{\beta}_{1})\\ \\ &(*) \;=\; \frac{\sigma^2}{n} + \frac{\sigma^2 \overline{x}^2}{\sum(x_i-\overline{x})^2} \;=\; \sigma^2 \left[ \frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}\right] \end{align}\] \((*)\) Can show that \(\overline{Y}\) and \(\widehat{\beta}_{1}\) are uncorrelated.
Of course, we don’t know \(\sigma^2\) (population variance).
So, we use an estimate from data:
\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i e_i^2}{n-2} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]
Then, the standard errors (estimated standard deviations) are
\[\mathrm{se}(\widehat{\beta}_{1}) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\overline{x})^2}}\] \[\mathrm{se}(\widehat{\beta}_{0}) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\]
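Here is a sketch of these formulas applied by hand to the repair data; the results should match the std.error column of the tidy(repairFit) output shown elsewhere in these notes (about 3.355 and 0.505):
n_repair  = nrow(repairData)
Sxx       = sum((repairData$Units - mean(repairData$Units))^2)   # sum of (x_i - xbar)^2
sigma_hat = sqrt(sum(residuals(repairFit)^2) / (n_repair - 2))   # sqrt(SSE / df)
se_b1 = sigma_hat / sqrt(Sxx)
se_b0 = sigma_hat * sqrt(1/n_repair + mean(repairData$Units)^2 / Sxx)
c(se_b0 = se_b0, se_b1 = se_b1)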
…and if we add one more model specification: the errors are normally distributed, \(\epsilon \;\sim\; N(0, \sigma^2)\).
That is \(Y \;\sim\; N(\beta_0 + \beta_1 x, \sigma^2)\)
Then, \[\widehat{\beta}_{0} \sim N(\beta_0, \mathrm{sd}(\widehat{\beta}_{0})) \qquad {\rm and} \qquad \widehat{\beta}_{1} \sim N(\beta_1, \mathrm{sd}(\widehat{\beta}_{1}))\]
Also,
\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]
Observe data on a response variable: \(y_1, y_2, \ldots, y_n\)
Estimate the population mean (\(\mu\)) with the sample average (\(\overline{y}\)).
And \(\overline{y}\) is the least squares (LS) estimate for \(\mu\).
If the population \(Y \sim N(\mu, \sigma^2)\) then
\[\frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]
In STAT 101, we estimated \(\sigma^2\) just as for all least squares estimates: \[\widehat\sigma^2 \;=\; s^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \overline{y})^2}{n-1}\]
degrees of freedom (df)
= sample size \(-\) # parameters estimated for the mean (\(\mu\))
= \(n-1\)
\[E(\overline{Y}) \;=\; E\left[\frac{1}{n}\sum_i Y_i\right] \;=\; \frac{1}{n}\sum E(Y_i) \;=\; \frac{1}{n}\sum \mu \;=\; \frac{1}{n} (n \mu) = \mu\]
and for a random sample: \(\;\mathrm{var}(\overline{Y}) = \sigma^2/n\)
If the population is \(N(\mu, \sigma^2)\) then, \(\; \epsilon \;=\; (Y - \mu) \;\sim\; N(0, \sigma^2)\)
and
\[\frac{\overline{y}- \mu}{\mathrm{se}(\,\overline{y}\,)} \;=\; \frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]
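In R this special case is just an intercept-only regression; a minimal sketch with the repair times (the intercept estimate is simply \(\overline{y}\)):
# "intercept-only" linear model: Y = beta0 + error = mu + error
tidy(lm(Minutes ~ 1, data = repairData))
mean(~ Minutes, data = repairData)   # the intercept estimate equals ybar = 97.21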
We can generate a set of \(Y\)-values from a known population model we used earlier:
kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1))
beta0 | beta1 | sd_error | n |
---|---|---|---|
3 | 4 | 5 | 10 |
\(\beta_{0} = 3\)
\(\beta_{1} = 4\)
\(\sigma = 5\)
\(n = 10\)
The mean of \(Y\) is linear: \(E(Y) = 3 + 4 x\)
The \(x\) values come from a Normal(\(\mu = 2, \sigma = 1\)) distribution.
The errors (\(\epsilon\)) from a Normal(\(\mu = 0, \sigma = 5\)) distribution.
Finally, we can generate \(Y\) values from the \(x\) and error values by adding the errors to the linear function of \(x\) values.
\(Y_i = 3 + 4 x_i + \epsilon_i\)
number_samples = 1000
kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1, number_samples))
beta0 | beta1 | sd_error | n | number_samples |
---|---|---|---|---|
3 | 4 | 5 | 10 | 1000 |
Using An Applet:
http://www.rossmanchance.com/applets/RegSim/RegCoeff.html
Inputs needed for the applet:
\(\beta_{1} = 4\)
\(\beta_{0} = 3\)
\(\mu_x = 2\)
\(\sigma_x = 1\)
\(\sigma = 5\)
Range of \(y:\; (-1, 20)\) Range of \(x:\; (-1, 5)\)
Using R:
We can simulate 1,000 samples of \(Y\)-values as follows:
First, complete steps 1 and 2 (just once) to generate \(x\) values.
Then, repeat steps 3 and 4 1,000 times,
and each time calculate \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) and save those values.
Then, look at the distributions of the 1,000 \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) values.
set.seed(1234)
View the code:
# Set up places to store the beta_0 and beta_1 estimates
beta0hats = 0
beta1hats = 0
plots=NULL
# The code below will repeat number_samples times
# i represents the sample number
for (i in 1:number_samples) {
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
fit = lm(Y ~ X) # Calculate LS estimates for intercept and slope for sample i
# assign the current beta_0 estimate to the ith position in vector beta0hats
beta0hats[i] = tidy(fit)$estimate[1]
# assign the current beta_1 estimate to the ith position in vector beta1hats
beta1hats[i] = tidy(fit)$estimate[2]
}
# end of the loop that gets repeated number_samples times
What would you expect center, spread, and shape of these distributions to be?
STOP
gf_dhistogram(~ beta0hats, bins = 20, color = "white")
gf_dhistogram(~ beta1hats, bins = 20, color = "white")
If we enter mean(beta0hats)
in R, what do you expect for the numerical result?
If we enter mean(beta1hats)
in R, what do you expect for the numerical result?
STOP to ponder that
The results are 2.987 and 4.064
which should be close to \(\beta_{0} = 3\) and \(\beta_{1} = 4\)
because both estimators are unbiased:
\[E(\widehat{\beta}_{0}) = \beta_{0} \qquad \text{ and } \qquad E(\widehat{\beta}_{1}) = \beta_{1}\]
According to earlier work,
\[SD(\widehat{\beta}_{0}) \;=\; \sigma \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\]
\[SD(\widehat{\beta}_{1}) \;=\; \frac{\sigma}{\sqrt{\sum(x_i-\overline{x})^2}}\]
The standard deviations of the statistics according to theory:
xbar_sim1 = mean(X)
SSx_sim1 = sum((X - xbar_sim1)^2)
SDx_sim1 = sqrt( SSx_sim1 / (n_sim1-1) )
SDb0_sim1 = sd_error_sim1 * sqrt(1/n_sim1 + xbar_sim1^2 / SSx_sim1)
SDb1_sim1 = sd_error_sim1 / sqrt(SSx_sim1)
kable( tibble( sd_error=sd_error_sim1, n=n_sim1, xbar = xbar_sim1,
SSx = SSx_sim1, SDx = SDx_sim1, SDb0 = SDb0_sim1, SDb1 = SDb1_sim1 ) )
sd_error | n | xbar | SSx | SDx | SDb0 | SDb1 |
---|---|---|---|---|---|---|
5 | 10 | 0.6909 | 0.6483 | 0.2684 | 4.572 | 6.21 |
The sample standard deviations of 1,000 simulated statistics:
kable( tibble( mean(beta0hats), sd(beta0hats), mean(beta1hats), sd(beta1hats) ) )
mean(beta0hats) | sd(beta0hats) | mean(beta1hats) | sd(beta1hats) |
---|---|---|---|
2.987 | 4.418 | 4.064 | 6.031 |
Claim: Since the errors come from a Normal distribution, then
The \(Y\)-values come from a Normal distribution
The \(\widehat{\beta}_{0}\)-values come from a Normal distribution
The \(\widehat{\beta}_{1}\)-values come from a Normal distribution
The simulated sampling distributions of the \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) values, with the theoretical Normal curves overlaid:
gf_dhistogram(~ beta0hats, bins = 20, color = "white") %>%
gf_dist("norm", params = list(beta0_sim1, SDb0_sim1))
gf_dhistogram(~ beta1hats, bins = 20, color = "white") %>%
gf_dist("norm", params = list(beta1_sim1, SDb1_sim1))
# glance and tidy (from the broom package) give nicely formatted summaries of regression estimates
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
# base R has a summary function
summary(repairFit)
Call:
lm(formula = Minutes ~ Units, data = repairData)
Residuals:
Min 1Q Median 3Q Max
-9.232 -3.341 -0.714 4.777 7.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.162 3.355 1.24 0.24
Units 15.509 0.505 30.71 8.9e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.39 on 12 degrees of freedom
Multiple R-squared: 0.987, Adjusted R-squared: 0.986
F-statistic: 943 on 1 and 12 DF, p-value: 8.92e-13
mean(~ Units, data=repairData)
[1] 6
sd(~ Units, data=repairData)
[1] 2.961
range(~ Minutes, data=repairData)
[1] 23 166
range(~ Units, data=repairData)
[1] 1 10
http://www.rossmanchance.com/applets/RegSim/RegCoeff.html
Inputs needed for the applet:
\(\widehat{\beta}_{1} = 15.51\)
\(\widehat{\beta}_{0} = 4.162\)
\(\widehat{\sigma} = 5.392\)
\(\overline{x}= 6\)
\(s_x = 2.961\)
Range of \(y:\; (23,\; 166)\)
Range of \(x:\; (1,\; 10)\)
1. The errors are uncorrelated
Data collected as random sample? Effect of time? Chapter 8: The Problem of correlated errors.
2.a. No systematic bias: \(E(\epsilon_i) \;=\; 0\) for all \(i\) (for all \(x\))
Actually nothing we need to check. For any data and least squares regression, \(\sum_i e_i = 0\)
2.b. The errors are not correlated with the \(x\) values
Actually nothing we need to check. For any data and least squares regression: \(\mathrm{cov}(x, e) = 0\)
2.c. \(\mathrm{var}(\epsilon_i) = \sigma^2\) is same for all \(i\) (for all \(x\))
Graphical diagnostics. Chapter 4: Regression Diagnostics.
Check for constant spread around the line for all \(x\) values.
3. The mean of \(Y\) depends linearly on \(x\)
Again, we will look at useful graphs later.
The least squares estimates
\[ \widehat{\beta}_{0} = \overline{y}-\widehat{\beta}_{1}\overline{x} \qquad \textrm{and}\qquad \widehat{\beta}_{1} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm algebra} \cdots \;=\; r \frac{s_y}{s_x}\]
Unbiased estimators
Both \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) are unbiased estimators of their population counterparts:
\(E(\widehat{\beta}_{0}) = \beta_0\) and \(E(\widehat{\beta}_{1}) = \beta_1\).
Standard errors
\[\mathrm{se}(\widehat{\beta}_{1}) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\overline{x})^2}}\qquad \textrm{and}\qquad \mathrm{se}(\widehat{\beta}_{0}) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\] where \[\widehat\sigma \;=\; \sqrt{\frac{\mathrm{SSE}}{\mathrm{df}}} \;=\; \sqrt{\frac{\sum_i e_i^2}{n-2}} \;=\; \sqrt{\frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}}\] If \(\epsilon \sim N(0,\sigma^2)\) (or the sample size is large), then
\[\widehat{\beta}_{0} \sim N(\beta_0, \mathrm{sd}(\widehat{\beta}_{0})) \qquad {\rm and} \qquad \widehat{\beta}_{1} \sim N(\beta_1, \mathrm{sd}(\widehat{\beta}_{1}))\]
Also,
\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]
Reminder: Steps for hypothesis testing:
Testing \(H_0:\beta_0=0\;\) or \(\;H_0:\beta_1=0\) using t-test
Let’s work the first case: our null hypothesis is \(H_0:\beta_0=0\;\). If the null hypothesis is true, then the statistic
\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} = \frac{\widehat{\beta}_{0} - 0}{\mathrm{se}(\widehat{\beta}_{0})}=\frac{\widehat{\beta}_{0}}{\mathrm{se}(\widehat{\beta}_{0})}\]
will follow a t-distribution with \(n-2\) degrees of freedom.
Reminder of the intuition here: If the null hypothesis is true, most of the time, this statistic will be close to zero. If the statistic is large, either we got very (un)lucky, or the null hypothesis wasn’t true to begin with.
We can test the null hypothesis by calculating this statistic and comparing it with a critical value \(t^\star\) for our chosen significance level. (Remember to use \(\alpha/2\) for a two-sided hypothesis.) Alternatively, we can calculate the p-value corresponding to our statistic, and compare it to \(\alpha\).
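A sketch of that comparison in R, applied to the intercept test for the repair data discussed below (tidy() and repairFit are as defined earlier):
alpha  = 0.05
df_fit = nrow(repairData) - 2                 # n - 2 degrees of freedom
tstar  = qt(1 - alpha/2, df = df_fit)         # two-sided critical value
tstat  = tidy(repairFit)$statistic[1]         # observed t for the intercept
abs(tstat) > tstar                            # reject H0 only if TRUE
2 * (1 - pt(abs(tstat), df = df_fit))         # two-sided p-value (about 0.24)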
Now let’s work the second case: our null hypothesis is \(H_0:\beta_1=0\;\). If the null hypothesis is true, then the statistic
\[\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} = \frac{\widehat{\beta}_{1} - 0}{\mathrm{se}(\widehat{\beta}_{1})}=\frac{\widehat{\beta}_{1}}{\mathrm{se}(\widehat{\beta}_{1})}\]
will follow a t-distribution with \(n-2\) degrees of freedom. We can test this hypothesis by calculating this statistic and the corresponding p-value.
In general, when we want to test a hypothesis \(H_0: \beta_1=k\), our test statistic will be of the form \(t=\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} = \frac{\widehat{\beta}_{1} - k}{\mathrm{se}(\widehat{\beta}_{1})}\). Our test statistic only simplifies to the form \(\frac{\widehat{\beta}_{1}}{\mathrm{se}(\widehat{\beta}_{1})}\) for the specific hypothesis \(H_0: \beta_1=0\).
However, this is a very common hypothesis to test. Because we so frequently want to test this hypothesis, R will compute the corresponding statistic and p-value for us automatically. They are given in the outputs of the lm
command.
The important thing to remember is that these outputs only apply to the specific null hypothesis that the coefficients equal zero. If you want to test a different hypothesis, you’ll need to calculate the test statistic and p-value manually, using the formula \(t=\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})}\). We’ll discuss how to do this later.
For computer repairs, it seems reasonable
that it should take no time to complete no work (zero units repaired).
Our prediction is \(\widehat{y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} (0) \;=\; \widehat{\beta}_{0} = 4.162\) minutes.
If \(H_0\) is true, then \(\beta_0=0\) and \[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;=\; \frac{\widehat{\beta}_{0} - 0}{\mathrm{se}(\widehat{\beta}_{0})} \;=\; \frac{\widehat{\beta}_{0}}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2}\]
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
repairFit = lm(Minutes ~ Units, data=repairData)
tidyRepairFit = tidy(repairFit)
b0.se = tidyRepairFit$std.error[1]
b0.tstat = tidyRepairFit$statistic[1]
b0.pvalue = tidyRepairFit$p.value[1]
Summaries | Test |
---|---|
\(\widehat{\beta}_{0} = 4.162\) | \(t_{\rm observed} = 1.24\) |
\(\mathrm{se}(\widehat{\beta}_{0}) = 3.355\) | pvalue = \(0.2385\) |
Using a significance level of \(\alpha=0.05\),
we fail to reject \(H_0:\beta_0=0\).
\(Y \sim N(\beta_0 + \beta_1 x, \sigma^2)\) when \(\epsilon \sim N(0, \sigma^2)\).
So, let’s graph the residuals.
repairData
Minutes | Units | stdMinutes | stdUnits | fitted | resids |
---|---|---|---|---|---|
23 | 1 | -1.6058 | -1.6885 | 19.67 | 3.3296 |
29 | 2 | -1.4760 | -1.3508 | 35.18 | -6.1792 |
49 | 3 | -1.0432 | -1.0131 | 50.69 | -1.6880 |
64 | 4 | -0.7187 | -0.6754 | 66.20 | -2.1967 |
74 | 4 | -0.5023 | -0.6754 | 66.20 | 7.8033 |
87 | 5 | -0.2210 | -0.3377 | 81.71 | 5.2945 |
96 | 6 | -0.0263 | 0.0000 | 97.21 | -1.2143 |
97 | 6 | -0.0046 | 0.0000 | 97.21 | -0.2143 |
109 | 7 | 0.2550 | 0.3377 | 112.72 | -3.7231 |
119 | 8 | 0.4714 | 0.6754 | 128.23 | -9.2318 |
149 | 9 | 1.1205 | 1.0131 | 143.74 | 5.2594 |
145 | 9 | 1.0339 | 1.0131 | 143.74 | 1.2594 |
154 | 10 | 1.2287 | 1.3508 | 159.25 | -5.2494 |
166 | 10 | 1.4883 | 1.3508 | 159.25 | 6.7506 |
# dotplot (since only 14 data values, too small for boxplot or histogram)
gf_dotplot(~ resids, data=repairData)
`stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
Skew is not extreme. No obvious outliers.
The t-method should be valid here.
If \(\beta_1=0\) then \(x\) has no (linear) information about \(Y\).
In fact, \(\beta_1=0\) is equivalent to \(\rho = \mathrm{corr}(Y,x) = 0\) since \[\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}\] So, \(H_0: \beta_1=0\) and \(H_0: \rho=0\) both test for linear relationship.
If \(H_0\) is true, then \(\beta_1=0\) and \[\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1} - 0}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1}}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]
repairFit = lm(Minutes ~ Units, data=repairData)
(tidyRepairFit<-tidy(repairFit))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
b1.se = tidyRepairFit$std.error[2]
b1.tstat = tidyRepairFit$statistic[2]
b1.pvalue = tidyRepairFit$p.value[2]
b1.estimate = tidyRepairFit$estimate[2]
Summaries | Test |
---|---|
\(\widehat{\beta}_{1} = 15.51\) | \(t_{\rm observed} = 30.71\) |
\(\mathrm{se}(\widehat{\beta}_{1}) = 0.505\) | pvalue = \(8.916e-13\) |
We strongly reject \(H_0: \beta_1=0\).
There is a strong positive relationship between repair time and number of units serviced.
Consider testing whether or not one additional unit needing repair leads to an average of 14.5 more minutes of work.
If \(H_0\) is true, then \(\beta_1=14.5\) and \[t = \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1} - 14.5}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]
Because our null hypothesis concerns a value other than zero, we need to calculate \(t\) manually.
# The long way, but straightforward
b1.12.tstat = (b1.estimate - 14.5) / b1.se
b1.12.tstat
[1] 1.998
Once we do this, we can calculate the p-value using the function pt
. We take 1 minus its output, then multiply by two since we are testing a two-sided hypothesis.
n<-nrow(repairData)
b1.12.pvalue = 2 * (1 - pt(b1.12.tstat, df=n-2))
b1.12.pvalue
[1] 0.06894
Summaries | Test |
---|---|
\(\widehat{\beta}_{1} = 15.51\) | \(t_{\rm observed} = 1.998\) |
\(\mathrm{se}(\widehat{\beta}_{1}) = 0.505\) | \(t_{\rm observed}^2 = 3.991\) |
| pvalue = \(0.06894\) |
There is a shortcut to test this hypothesis, using the linearHypothesis
function in the car
package. This function doesn’t report out the t-statistic – it reports the F-statistic, which we’ll discuss shortly – but the p-values are the same as if we’d done the t-test.
\(H_0: \beta_1 = 14.5\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 464.9 | NA | NA | NA | NA |
12 | 348.8 | 1 | 116 | 3.991 | 0.0689 |
\[(y_i - \overline{y}) \;=\; (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y}) \;=\; (y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})\]
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
draw_residual_schematic <- function() {
a = 10
b = 4
mx = 15
sx = 5
s = 20
n = 100
set.seed(123456)
x = rnorm(n, m=mx, sd=sx)
y = a + b*x + rnorm(n, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)
# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)
xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi
myplot = plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)
points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)
lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))
text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
}
draw_residual_schematic()
It’s not obvious, but is true (for LS regression):
\[\sum (y_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i)^2 \;+\; \sum (\widehat{y}_i - \overline{y})^2\]
SST = \(\sum (y_i - \overline{y})^2\) = total sum of squares
SSE = \(\sum (y_i - \widehat{y}_i)^2\) = error sum of squares
SSR = \(\sum (\widehat{y}_i - \overline{y})^2\) = regression sum of squares
So, SST = SSE + SSR
One measure of the “goodness of fit” of the regression model to the data:
\(R^2\) = proportion of variability in the response (\(Y\))
explained by regression on the predictor (\(x\))
\[R^2 \;=\; \frac{\mathrm{SSR}}{\mathrm{SST}} \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2}{\sum (y_i - \overline{y})^2}\]
If the predictors don’t include much information about the response,
then \(\sum (y_i - \widehat{y}_i)^2\) is close to \(\sum (y_i - \overline{y})^2\)
since \(\overline{y}\) is the least squares estimate for the mean without predictors.
In that case, \(R^2\) is near \(1 - 1 = 0\).
On the other hand, if the predictors have so much information that
residuals are very, very small, then \(R^2\) is near \(1 - 0 = 1\).
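Before looking at R’s summaries, here is a quick check by hand (a minimal sketch reusing repairFit and repairData from above; SST, SSE, and SSR are just the quantities defined above):

y = repairData$Minutes
yhat = fitted(repairFit)
SST = sum((y - mean(y))^2)
SSE = sum((y - yhat)^2)
SSR = sum((yhat - mean(y))^2)
c(SST, SSE + SSR)             # the decomposition: these two should agree
c(SSR / SST, 1 - SSE / SST)   # two equivalent ways to compute R^2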
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
summary(repairFit)
Call:
lm(formula = Minutes ~ Units, data = repairData)
Residuals:
Min 1Q Median 3Q Max
-9.232 -3.341 -0.714 4.777 7.803
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.162 3.355 1.24 0.24
Units 15.509 0.505 30.71 8.9e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.39 on 12 degrees of freedom
Multiple R-squared: 0.987, Adjusted R-squared: 0.986
F-statistic: 943 on 1 and 12 DF, p-value: 8.92e-13
When there is only one predictor (\(x\)), \(R^2 = \mathrm{corr}(y,x)^2 = r^2\)
cor(Minutes ~ Units, data=repairData)
[1] 0.9937
cor(Minutes ~ Units, data=repairData)^2
[1] 0.9874
\(r = 0.9937\)
\(r^2 = 0.9874\)
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
So, 98.7% of the variability in repair times can be attributed to the regression on the number of units.
The rest of the variability (residuals) is not explained by the number of units repaired.
Claim: \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)
Recall:
\[\mathrm{corr}(a+by, c+dx) \;=\; \frac{bd\,\mathrm{cov}(y,x)}{|b|\,\mathrm{sd}(y)\; |d|\,\mathrm{sd}(x)} \;=\; {\rm sign}(bd)\,\mathrm{corr}(y,x)\]
\(\begin{align} \mathrm{corr}(y, \widehat{y}) = \mathrm{corr}(y, \widehat{\beta}_{0} + \widehat{\beta}_{1} x) = \textrm{sign}(\widehat{\beta}_{1}) \mathrm{corr}(y,x) \end{align}\)
\(\widehat{\beta}_{1}\) and \(\mathrm{corr}(y,x)\) always have the same sign: \(\widehat{\beta}_{1} = \mathrm{corr}(y,x) (s_y/s_x)\)
Thus, \(0 \le \mathrm{corr}(y, \widehat{y}) \le 1\) and \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)
So, \(R^2 = \left[\mathrm{corr}(y, x)\right]^2 = \left[\mathrm{corr}(y, \widehat{y})\right]^2\)
This will be handy when we have models with more than one predictor.
But we will have to be careful: when we add more predictors to a model, \(R^2\) will always increase or stay constant. Therefore we won’t be able to reliably use \(R^2\) as a way of comparing models to see which is better.
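To see this behavior (a small simulated sketch; the noise variable below is made up and has no real relationship with Minutes), we can add a pure-noise predictor to the repair model and watch \(R^2\) creep up anyway:

set.seed(1)
# add a pure-noise predictor to the repair data
repairNoise = mutate(repairData, noise = rnorm(nrow(repairData)))
fit1 = lm(Minutes ~ Units, data = repairNoise)
fit2 = lm(Minutes ~ Units + noise, data = repairNoise)
glance(fit1)$r.squared
glance(fit2)$r.squared   # never smaller than the R^2 without the noise predictor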
repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
Adjusted \(R^2\) includes the degrees of freedom:
\[\text{adjusted } R^2 \;=\; 1 - \frac{\mathrm{SSE}/ \mathrm{SSE}(\mathrm{df})}{\mathrm{SST}/ \mathrm{SST}(\mathrm{df})} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2/(n-p)}{\sum (y_i - \overline{y})^2 / (n-1)}\]
Adjusted \(R^2\) does not always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.
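Here is the adjusted \(R^2\) computed directly from that definition (a minimal sketch reusing repairFit and n from earlier; for this model \(p = 2\) mean parameters):

y = repairData$Minutes
SSE = sum(residuals(repairFit)^2)
SST = sum((y - mean(y))^2)
1 - (SSE / (n - 2)) / (SST / (n - 1))   # should match adj.r.squared = 0.9864 above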
\[(y_i - \overline{y}) \;=\; (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y}) \;=\; (y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})\]
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
draw_residual_schematic <- function() {
a = 10
b = 4
mx = 15
sx = 5
s = 20
n = 100
set.seed(123456)
x = rnorm(n, m=mx, sd=sx)
y = a + b*x + rnorm(n, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)
# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)
xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi
myplot = plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)
points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)
lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))
text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
}
draw_residual_schematic()
It’s not obvious, but is true (for LS regression):
\[\sum (y_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i)^2 \;+\; \sum (\widehat{y}_i - \overline{y})^2\]
SST = \(\sum (y_i - \overline{y})^2\) = total sum of squares
SSE = \(\sum (y_i - \widehat{y}_i)^2\) = error sum of squares
SSR = \(\sum (\widehat{y}_i - \overline{y})^2\) = regression sum of squares
So, SST = SSE + SSR
Even the degrees of freedom for each sum of squares add up!
SST(df) = SSE(df) + SSR(df)
\((n-1) = (n-2) + 1\)
For a linear model with mean estimated by least squares:
degrees of freedom of the sum of squared errors
= sample size \(-\) # parameters estimated for the mean
For SST = \(\sum (y_i - \overline{y})^2\),
the linear model we have in mind is \(Y = \mu + \epsilon\) (one mean parameter).
Thus, SST has \((n-1)\) df.
For SSE = \(\sum (y_i - \widehat{y}_i)^2 = \sum (y_i - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i))^2\),
the linear model we have in mind is \(Y = \beta_0 + \beta_1 x + \epsilon\) (two mean parameters).
Thus, SSE has \((n-2)\) df.
For SSR = SST - SSE = \(\sum (\widehat{y}_i - \overline{y})^2\)
the degrees of freedom is the difference in the number of mean parameters for the two models.
Thus, SSR has \((2 - 1) = 1\) df.
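These sums of squares and their degrees of freedom show up directly in the ANOVA table for the fitted model (a short sketch using base R’s anova() on repairFit, with kable again just for display): SSR (about 27420) with 1 df on the Units row, and SSE (about 348.8) with \(n-2 = 12\) df on the Residuals row.

kable(anova(repairFit))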
Degrees of freedom can be thought of in many ways.
Let’s consider the concept of constraints or “subspaces”.
SST = \(\sum (y_i - \overline{y})^2\), but \(\sum (y_i - \overline{y}) = 0\) (a constraint).
From this constraint, we can write \(\displaystyle y_n = n\overline{y}- \sum_{i=1}^{n-1} y_i\).
Thus, once we know the estimate for the mean \(\overline{y}\)
and \(y_1, y_2, \ldots,\) and \(y_{n-1}\),
then we can solve for \(y_n\).
We say that \((n-1)\) y-values are “free” to move,
but the last value is constrained by \(\overline{y}\).
That’s \((n-1)\) “degrees of freedom”.
Similarly,
SSE = \(\sum (y_i - \widehat{y}_i)^2\), but \(\sum (y_i - \widehat{y}_i) = 0\) (a constraint).
It is also true (mentioned far above) that
\(\sum((y_i - \widehat{y}_i)x_i)=0\), for any dataset.
From these constraints, we can write both \(y_n\) and \(y_{n-1}\) as functions of \(\widehat{\beta}_{0}, \widehat{\beta}_{1}, y_1, y_2, \ldots, y_{n-2}\) and all of the \(x\)-values.
Thus, once we know the estimate for the mean (\(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\))
and \(y_1, y_2, \ldots, y_{n-2}\) (and all of the \(x\)-values),
then we can solve for \(y_n\) and \(y_{n-1}\).
We say that \((n-2)\) y-values are “free” to move,
but the last two values are constrained by \(\widehat{\beta}_{0}, \widehat{\beta}_{1},\) and the x-values.
That’s \((n-2)\) “degrees of freedom”.
Finally,
SSR = \(\sum (\widehat{y}_i - \overline{y})^2\), but \(\sum (\widehat{y}_i - \overline{y}) = 0\) (a constraint).
We can show this is true: \[\begin{align} \sum (\widehat{y}_i - \overline{y}) &\;=\; \sum \widehat{y}_i - \sum \overline{y}\\ \\ &\;=\; \sum \left(\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\right) - n\overline{y}\\ \\ &\;=\; \sum \left[(\overline{y}- \widehat{\beta}_{1} \overline{x}) + \widehat{\beta}_{1} x_i\right] - n\overline{y}\\ \\ &\;=\; \sum \overline{y}+ \widehat{\beta}_{1} \sum (x_i - \overline{x}) - n\overline{y}\\ \\ &\;=\; n \overline{y}+ \widehat{\beta}_{1} (0) - n\overline{y}\\ \\ &\;=\; 0 \end{align}\]
Thus, once we know the estimate for the mean (\(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\))
and just one \(\widehat{y}_i\) (and all of the \(x\)-values),
then we can solve for all \((n-1)\) other \(\widehat{y}_j\) values.
We say that just one \(\widehat{y}\)-value is “free” to move,
and the remaining \(\widehat{y}\)-values are constrained by \(\overline{y}\) and the \(x\)-values.
That’s \(1\) (one) “degree of freedom”.
Let me try to convince you of this constraint.
Suppose we know all \(n\) of the \(x\)-values, but only \(\widehat{y}_5\).
Then, we know two points on the regression line:
\((x_5, \widehat{y}_5)\)
\((\overline{x}, \overline{y})\) since every simple linear regression line passes through this point
in which case, we know \(\displaystyle \frac{\widehat{y}_5 - \overline{y}}{x_5 - \overline{x}} = \frac{ \textrm{rise} }{ \textrm{run} } = \widehat{\beta}_{1}\) (the slope)
and we can find \(\widehat{\beta}_{0} = \overline{y}- \widehat{\beta}_{1} \overline{x}\) (the intercept).
From here, we can find all \((n-1)\) other values of \(\widehat{y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\).
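Here is that argument carried out numerically (a small sketch with the repair data; the 5th observation is an arbitrary choice, and b1.check and b0.check are names I made up here):

x = repairData$Units
y = repairData$Minutes
yhat5 = fitted(repairFit)[5]                      # a single fitted value
b1.check = (yhat5 - mean(y)) / (x[5] - mean(x))   # rise over run through (xbar, ybar)
b0.check = mean(y) - b1.check * mean(x)
c(b0.check, b1.check)                             # should match coef(repairFit)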
Here is the process.
Start with the full model (FM): \(Y = \beta_0 + \beta_1 x + \epsilon\)
(no restrictions on \(\beta_0\) or \(\beta_1\))
Compare to the reduced model (RM): \(Y = \epsilon \;=\; 0 + \epsilon\;\;\) (assumes \(E(Y)=0\))
Both \(\beta_0=0\) and \(\beta_1=0\) as stated in \(H_0\).
Let’s get an intuition for what the FM and the RM look like. We’ll draw the regression lines for both models here, for two different examples. In the first example, the true values \(\beta_0\) and \(\beta_1\) are 3 and 4, respectively. In the second example, the true values are 0 and 0 – that is, the null hypothesis \(H_0\) is actually true.
library(reshape2)
set.seed(1234)
sd_error_sim1=5
beta0_sim2=3
beta1_sim2=4
beta0_sim3=0
beta1_sim3=0
n_sim1=30
# First example
X1 = runif(n_sim1)
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for example 1
Y1 = beta0_sim2 + beta1_sim2 * X1 + error # Generate y-values for example 1
# Second example
X2 = runif(n_sim1)
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for example 2
Y2 = beta0_sim3 + beta1_sim3 * X2 + error # Generate y-values for example 2
# Plot sample 1
gf_point(Y1~X1,color="grey")%>% gf_smooth(Y1~X1,method='lm',se=FALSE,color="red") %>% gf_abline(intercept=0,slope=0,color="blue")
# Plot sample 2
gf_point(Y2~X2,color="grey") %>%
gf_abline(color="red", slope = -0.10, intercept = 5) %>%
gf_smooth(Y2~X2,method='lm',se=FALSE,color="red")
STOP Discuss intuition about the performance of the FM and the RM in each of the two cases.
If the model specifications are valid (including Normal errors),
Then the following comparison statistic has an \(F\) distribution.
\[F = \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})}\]
The degrees of freedom for the distribution of the F-statistic
are the numerator df and the denominator df.
For the reduced model: \(Y = \epsilon\quad\) and \(\quad \widehat{y}_i = 0\) for all \(i\)
Thus, \(\mathrm{SSE}(\mathrm{RM}) = \sum (y_i - \widehat{y}_i)^2 = \sum y_i^2\)
with degrees of freedom \(n-0=n\) (no parameters estimated for the mean).
For the full model: \(Y = \beta_0 + \beta_1 x + \epsilon\)
and \(\widehat{y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\) for all \(i\)
Thus, \(\mathrm{SSE}(\mathrm{FM}) = \sum (y_i - \widehat{y}_i)^2 = \sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2\)
with degrees of freedom \(n-2\) (two parameters estimated for the mean).
\[F \;=\; \frac{\left[ \sum y_i^2 - \sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2 \right] \;/\; [n - (n-2)]}{\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2 \;/\; (n-2)} \;\sim\; F_{2, n-2}\]
The long (but illustrative) calculation:
# The long (but illustrative) way
y = repairData$Minutes
x = repairData$Units
n = length(y)
b0
[1] 4.162
b1
[1] 15.51
SSE.RM = sum(y^2)
SSE.FM = sum((y - b0 - b1*x)^2)
df.RM = n
df.FM = n - 2
F.stat = ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / (SSE.FM / df.FM)
F.stat
[1] 2747
1 - pf(F.stat, df.RM-df.FM, df.FM)
[1] 1.11e-16
The short way …find a function in a package:
# The short, way ...find a function in a package.
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
# ...and so can make things look nice on your HW if you are using R Markdown
repairFit = lm(Minutes ~ Units, data=repairData)
library(car)
kable(linearHypothesis(repairFit, c("(Intercept) = 0", "Units = 0")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
14 | 160077.0 | NA | NA | NA | NA |
12 | 348.8 | 2 | 159728 | 2747 | 0 |
…and another method using the base R anova()
function
# ...and another method using base R anova() function
# anova(reducedModelFit, fullModelFit)
# full model
repairFit = lm(Minutes ~ Units, data=repairData)
# reduced model (beta0 = 0, beta1 = 0)
reducedFit = lm(Minutes ~ -1 + I(0*Units), data=repairData)
# compare the two (nested) models with an F-test
anova(reducedFit, repairFit)
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
14 | 160077.0 | NA | NA | NA | NA |
12 | 348.8 | 2 | 159728 | 2747 | 0 |
When \(\alpha=0.05\), we expect 5% false positives.
P(Type I Error) = P(reject \(H_0\) when it was actually true) = \(\alpha\)
So, if we do two tests, the false positive rate is higher.
(The more tests you do, the more likely you’ll get a false positive)
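A quick simulation sketch illustrates this (the parameter values, sample size, and number of simulations below are arbitrary choices of mine, not taken from the repair data; both null hypotheses are set up to be true, so any rejection is a false positive):

set.seed(2024)
alpha = 0.05
n.sims = 2000
n.obs = 14
one.sim = function() {
  x = runif(n.obs, 1, 10)
  y = 11.8 + 14.5 * x + rnorm(n.obs, sd = 5)   # H0: beta0 = 11.8 and H0: beta1 = 14.5 are both true
  fit = tidy(lm(y ~ x))
  t0 = (fit$estimate[1] - 11.8) / fit$std.error[1]
  t1 = (fit$estimate[2] - 14.5) / fit$std.error[2]
  p0 = 2 * (1 - pt(abs(t0), df = n.obs - 2))
  p1 = 2 * (1 - pt(abs(t1), df = n.obs - 2))
  (p0 < alpha) | (p1 < alpha)                  # did at least one test reject?
}
mean(replicate(n.sims, one.sim()))   # larger than 0.05 (how much larger depends on how correlated the two tests are)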
Test of \(H_0: \beta_0 = 11.8\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 499.5 | NA | NA | NA | NA |
12 | 348.8 | 1 | 150.7 | 5.183 | 0.0419 |
Test of \(H_0: \beta_1 = 14.5\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 464.9 | NA | NA | NA | NA |
12 | 348.8 | 1 | 116 | 3.991 | 0.0689 |
Test of \(H_0: \beta_0 = 11.8\) and \(\beta_1 = 14.5\) jointly, using the linearHypothesis function in the car package:
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8", "Units = 14.5")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
14 | 500.1 | NA | NA | NA | NA |
12 | 348.8 | 2 | 151.2 | 2.601 | 0.1153 |
And another method comparing the reduced (restricted) model with the full (unrestricted) model.
Use the base R anova()
function:
# ...and another method using base R anova() function
# anova(reducedModelFit, fullModelFit)
# full model
repairFit = lm(Minutes ~ Units, data=repairData)
# reduced model (beta0 = 11.8 and beta1 = 14.5, imposed via an offset)
reducedFit = lm(Minutes ~ -1, offset = 11.8 + 14.5*Units, data=repairData)
# compare the two (nested) models with an F-test
kable(anova(reducedFit, repairFit))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
14 | 500.1 | NA | NA | NA | NA |
12 | 348.8 | 2 | 151.2 | 2.601 | 0.1153 |
Testing each coefficient separately (with a correction for doing multiple tests) is indeed simpler than the joint F-test, but…
it’s too conservative
(more protection against false positives than needed)
Bonferroni correction:
If \(k\) tests, then compare p-value to \(\displaystyle \frac{\alpha}{k}\) for each test.
Since we have two tests (slope and intercept) and \(\alpha=0.05\), we should not reject either null unless p-value \(< 0.05/2 = 0.025.\)
Test of \(H_0: \beta_0 = 11.8\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 499.5 | NA | NA | NA | NA |
12 | 348.8 | 1 | 150.7 | 5.183 | 0.0419 |
Test of \(H_0: \beta_1 = 14.5\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 464.9 | NA | NA | NA | NA |
12 | 348.8 | 1 | 116 | 3.991 | 0.0689 |
Using the Bonferroni correction, we would not reject either \(H_0\).
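Equivalently (a small sketch using base R’s p.adjust with the two p-values reported above), we can adjust the p-values instead of the threshold and compare them to \(\alpha = 0.05\):

# Bonferroni: multiply each p-value by the number of tests (k = 2), capped at 1
p.adjust(c(0.0419, 0.0689), method = "bonferroni")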
t-test of \(H_0: \beta_1=0\)
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
F-test of \(H_0: \beta_1=0\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 0")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 27768.4 | NA | NA | NA | NA |
12 | 348.8 | 1 | 27420 | 943.2 | 0 |
The relationship holds when testing for \(\beta_1\) other than 0.
For example, consider testing whether or not one additional unit needing repair leads to an average of 14.5 more minutes of work.
If \(H_0\) is true, then \(\beta_1=14.5\) and \[\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1} - 14.5}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]
# The long way, but straightforward
b1.12.tstat = (b1 - 14.5) / b1.se
b1.12.tstat
[1] 1.998
b1.12.tstat^2
[1] 3.991
b1.12.pvalue = 2 * (1 - pt(b1.12.tstat, df=n-2))
b1.12.pvalue
[1] 0.06894
Summaries | Test |
---|---|
\(\widehat{\beta}_{1} = 15.51\) | \(t_{\rm observed} = 1.998\) |
\(\mathrm{se}(\widehat{\beta}_{1}) = 0.505\) | \(t_{\rm observed}^2 = 3.991\) |
| pvalue = \(0.06894\) |
…and here is the equivalent F-test for \(H_0: \beta_1 = 14.5\)
repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))
Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
---|---|---|---|---|---|
13 | 464.9 | NA | NA | NA | NA |
12 | 348.8 | 1 | 116 | 3.991 | 0.0689 |
We have two random samples (from two populations)
For population #1: \(Y = \mu_1 + \epsilon\)
For population #2: \(Y = \mu_2 + \epsilon\)
We can write this as a simple linear regression model.
Let \(\displaystyle x = \begin{cases} 0 & \textrm{data from population #1} \\ 1 & \textrm{data from population #2} \end{cases}\)
Then, write \(Y = \beta_0 + \beta_1 x + \epsilon\)
where \(E(\epsilon)=0\), \(\;\;\mathrm{var}(\epsilon) = \sigma^2\;\;\) and \(\;\;\epsilon\;\;\) follows a normal distribution
What does this mean about variances for the two populations?
For data from population #1, \(\;\;x=0\;\;\) so, \(\;\;Y = \beta_0 + \epsilon\;\; \implies \; \beta_0 = \mu_1\)
For data from population #2, \(\;\;x=1\;\;\) so, \(\;\;Y = \beta_0 + \beta_1 + \epsilon\;\; \implies \; \beta_1 = \mu_2 - \mu_1\)
\(\widehat{\beta}_{0} = \widehat{\mu}_1 = \overline{y}_1\)
\(\widehat{\beta}_{1} = \widehat{\mu}_2 - \widehat{\mu}_1 = \overline{y}_2 - \overline{y}_1\)
Before reminding you of that variance estimate (the pooled estimate from “STAT 101”), let’s continue with the LS regression approach:
For any simple linear regression model, we use \(\displaystyle \widehat{\sigma}^2 = \frac{\mathrm{SSE}}{\mathrm{df}}\)
\[ \widehat{\sigma}^2 \;\;=\;\; \frac{\mathrm{SSE}}{\mathrm{df}} \;\;=\;\; \frac{\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2}{n_1 + n_2 - 2} \]
Let’s work on the numerator \[\begin{align} \sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2 &\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \widehat{\beta}_{0})^2 \;\;+\;\; \sum_{j=1}^{n_2} (y_j - \widehat{\beta}_{0} - \widehat{\beta}_{1})^2 \\ &\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \overline{y}_1)^2 \;\;+\;\; \sum_{j=1}^{n_2} (y_j - \overline{y}_1 - (\overline{y}_2 - \overline{y}_1))^2 \\ &\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \overline{y}_1)^2 \;\;+\;\; \sum_{j=1}^{n_2} (y_j - \overline{y}_2)^2 \\ &\;\;=\;\; (n_1 - 1) s_1^2 \;\;+\;\; (n_2 -1 ) s_2^2 \end{align}\]
Therefore, \(\displaystyle \widehat{\sigma}^2 = \frac{(n_1 - 1) s_1^2 \;\;+\;\; (n_2 -1 ) s_2^2}{n_1 + n_2 -2}\)
Does this expression look familiar?
\(\widehat{\sigma}^2 = s_p^2\) = pooled estimate of the population variance
\(H_0: \beta_1 = 0\)
is equivalent to
\(H_0: (\mu_2 - \mu_1) = 0\)
which is the same as \(H_0: \mu_2 = \mu_1\)
So, we have the same t-statistic as in “STAT 101”
\[\begin{align} t &\;\;=\;\; \frac{\widehat{\beta}_{1} - 0}{\mathrm{se}(\widehat{\beta}_{1})} \\ \\ &\;\;=\;\; \frac{\overline{y}_2 - \overline{y}_1}{\mathrm{se}(\overline{y}_2 - \overline{y}_1)} \;\;=\;\; \frac{\overline{y}_2 - \overline{y}_1}{{\sqrt{\displaystyle \frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} } \end{align}\]
A lot of algebra is required (not shown), but I’ll set up the scenario for you.
F-test for \(H_0: \mu_1 = \mu_2 = \mu\)
Reduced model: \(Y = \mu + \epsilon\quad \mathrm{df}= n_1+n_2-1\)
Full model: \(Y = \beta_0 + \beta_1 x + \epsilon\quad \mathrm{df}= n_1+n_2-2\)
\[\begin{align} F &\;\;=\;\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})} \\ \\ &\;\;=\;\; \frac{\sum_{i=1}^{n_1+n_2} (y_i - \overline{\overline{y}})^2 - \sum_{i=1}^{n_1} (y_i - \overline{y}_1)^2 - \sum_{j=1}^{n_2} (y_j - \overline{y}_2)^2}{s_p^2} \\ \\ &\;\;=\;\; \textrm{after lots of algebra...} \\ \\ &\;\;=\;\; \frac{(\overline{y}_2 - \overline{y}_1)^2}{\displaystyle{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} \;\;=\;\; t^2 \end{align}\]
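A quick numerical check of these claims (a sketch with simulated two-group data; the group sizes, means, and error SD are arbitrary choices, and the object names are mine):

set.seed(99)
x = rep(c(0, 1), times = c(10, 12))            # group indicator: 10 in group 1, 12 in group 2
y = 5 + 2 * x + rnorm(length(x), sd = 3)
twoGroupFit = lm(y ~ x)
# pooled variance estimate vs. the regression estimate of sigma^2
s1sq = var(y[x == 0])
s2sq = var(y[x == 1])
sp2 = ((10 - 1) * s1sq + (12 - 1) * s2sq) / (10 + 12 - 2)
c(sp2, glance(twoGroupFit)$sigma^2)            # these agree
# squared slope t-statistic, regression F-statistic, and squared pooled two-sample t all agree
c(tidy(twoGroupFit)$statistic[2]^2,
  glance(twoGroupFit)$statistic,
  t.test(y ~ factor(x), var.equal = TRUE)$statistic^2)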
Recall the least squares regression results for the computer repair data
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
Units | 15.509 | 0.505 | 30.71 | 0.0000 |
glance(repairFit)[-c(7:10)]
r.squared | adj.r.squared | sigma | statistic | p.value | df | df.residual |
---|---|---|---|---|---|---|
0.9874 | 0.9864 | 5.392 | 943.2 | 0 | 2 | 12 |
\[\widehat{\beta}_{0} \;\pm\; t^* \mathrm{se}(\widehat{\beta}_{0})\]
With \(n-2 = 12\) degrees of freedom and \(\alpha=0.05\)
n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179
Make an educated guess:
Will zero be inside the confidence interval?
STOP
Yes, we expect 0 to be in this confidence interval since we failed to reject \(H_0: \beta_0 = 0\) earlier.
# The long way, but illustrative
n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179
pm = c(-1, 1)
b0 + pm * t.star * b0.se
[1] -3.148 11.472
# The short way ...I Googled for a package with a nice function
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(confint(repairFit, "(Intercept)", level=0.95))
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -3.148 | 11.47 |
\[\widehat{\beta}_{1} \;\pm\; t^* \mathrm{se}(\widehat{\beta}_{1})\]
With \(n-2 = 12\) degrees of freedom and \(\alpha=0.05\)
n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179
Make an educated guess:
Will zero be inside the confidence interval?
STOP
No, we do not expect 0 to be in this confidence interval since we rejected \(H_0: \beta_1 = 0\) earlier.
# The long way, but illustrative
n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179
pm = c(-1, 1)
b1 + pm * t.star * b1.se
[1] 14.41 16.61
# The short way ...I Googled for a package with a nice function
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(confint(repairFit, "Units", level=0.95))
2.5 % | 97.5 % | |
---|---|---|
Units | 14.41 | 16.61 |
Or use code to find both confidence intervals at the same time
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
kable(confint(repairFit, level=0.95))
2.5 % | 97.5 % | |
---|---|---|
(Intercept) | -3.148 | 11.47 |
Units | 14.409 | 16.61 |
The estimated average (expected) value
for all individuals in the sub-population with \(X=x_0\) is \[\widehat{\mu}_0 \;\;=\;\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\] A confidence interval for the mean response at some \(x_0\) is \[\widehat{\mu}_0 \;\pm\; t^*\; \widehat{\sigma}\; \sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{\sum (x_i-\overline{x})^2}}\]
a = 2
b = 7
Among the jobs where \(2\) or \(7\) units need repair, how long do we expect the work to take?
# I had defined new data values earlier
a
[1] 2
b
[1] 7
# Set up a data frame (dataset) to store the x-values for input to prediction
x.values = data.frame(Units = c(a,b))
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
# The short way ...I Googled and found "predict"
# ...which is just in the "stats" package that comes with R by default
kable(predict(repairFit, x.values, interval="confidence"))
fit | lwr | upr |
---|---|---|
35.18 | 29.77 | 40.59 |
112.72 | 109.40 | 116.05 |
Which interval is narrower? Why?
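For comparison, here is the “long way” for the first interval (a sketch that just plugs into the formula above; x0, sigma.hat, mu0.hat, and se.mu0 are names I introduce here, while b0, b1, and n come from earlier code):

x = repairData$Units
x0 = 2
sigma.hat = glance(repairFit)$sigma
t.star = -qt(0.025, df = n - 2)
mu0.hat = b0 + b1 * x0
se.mu0 = sigma.hat * sqrt(1/n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
mu0.hat + c(-1, 1) * t.star * se.mu0   # should match the first row of the table above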
# After Googling, I initially only found ggplot2 functions
# rather than the ggformula interface: gf_point(Minutes ~ Units, data = repairData) ....
# (Although you will see in two plots later that I did successfully Google
# how to do this using the ggformula function gf_lm)
ggplot(repairData, aes(x = Units, y = Minutes)) +
geom_point() +
geom_smooth(method="lm", se=TRUE)
# After Googling, I initially only found ggplot2 functions
# rather than the ggformula interface: gf_point(Minutes ~ Units, data = repairData) ....
# (Although you will see in the next graph that I did successfully Google
# how to do this using the ggformula function gf_lm)
data_sim1 = data.frame(Y, X)
ggplot(data_sim1, aes(x = X, y = Y)) +
geom_point() +
geom_smooth(method="lm", se=TRUE)
The predicted value
for one yet-to-be-observed (or just observed) individual in the population with \(X=x_0\) is \[\widehat{y}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]
A prediction interval for the individual response at some \(x_0\) is \[\widehat{y}_0 \;\pm\; t^*\; \widehat{\sigma}\; \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{\sum (x_i-\overline{x})^2}}\] We just got two new jobs where \(2\) and \(7\) units need repair.
How long should it take to complete either job?
# I had defined new data values earlier
a
[1] 2
b
[1] 7
# Set up a data frame (dataset) to store the x-values for input to prediction
x.values = data.frame(Units = c(a,b))
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.
# It just makes things look better on my slides.
# The short way ...I Googled and found "predict"
# ...which is just in the "stats" package that comes with R by default
kable(predict(repairFit, x.values, interval="prediction"))
fit | lwr | upr |
---|---|---|
35.18 | 22.25 | 48.11 |
112.72 | 100.51 | 124.93 |
Discuss…
# I found out how to get the confidence bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
gf_lm(Minutes ~ Units, data = repairData,
interval = "prediction", fill = "skyblue") %>%
gf_lm(interval = "confidence", fill="blue", color="black") %>%
gf_point()
# I found out how to get the confidence bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
gf_lm(Y ~ X, data = data_sim1,
interval = "prediction", fill = "skyblue") %>%
gf_lm(interval = "confidence", fill="blue", color="black") %>%
gf_point()
These data concern mercury levels in large-mouthed bass
in North Carolina rivers.
Mercury levels in river water are typically undetectable,
yet fish might contain an amount unsafe to eat (1 part per million or more).
In principle, mercury concentration in fish can be measured,
but that’s expensive …and involves waiting days for lab results.
So, that doesn’t help you decide whether or not
to eat the fish you caught today.
So, can we eat large-mouthed bass from North Carolina rivers?
fishData = read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fish.txt")
# rescale weight from grams to kg to make units of y and x similar in size,
# so slope won't be huge (and thus, a little harder to interpret, at least for me anyway)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fishData = mutate(fishData, weight = weight / 1000)
# sample size
n = dim(fishData)[1]
Older fish weigh more.
Easy and quick: weigh the fish
Data were collected on 171 large-mouth bass in North Carolina rivers.
mercury
= concentration of mercury in fish tissue (ppm = mcg/g)
weight
= weight in kilograms (kg)
Could you weigh a fish and then give a recommendation
as to whether it is safe to eat it?
“The Food and Drug Administration has set an ‘action level’ of 1.0 ppm mercury for fish in interstate commerce.”
https://www.greenleft.org.au/content/mercury-how-much-safe
A concentration over 1 part per million is considered unsafe for human consumption. In light of this, what recommendations can you make for fish caught from these rivers?
Is there a linear relationship between mercury
and weight
?
fishFit = lm(mercury ~ weight, data=fishData)
gf_point(mercury ~ weight, data=fishData) %>%
gf_coefline(model=fishFit)
Do the specifications of a linear regression model seem reasonable for these data?
What is the distribution of the response and predictor?
# histograms of the response and predictor variables
p1 = gf_histogram(~ mercury, data = fishData, bins=20, color="white")
p2 = gf_histogram(~ weight, data = fishData, bins=20, color="white")
grid.arrange(p1, p2, ncol=2)
# add fitted values and residuals to the data frame
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fishFit = lm(mercury ~ weight, data=fishData)
fishData = fishData %>%
mutate(fitted = predict(fishFit), resids = residuals(fishFit))
# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fishData) %>%
gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
gf_hline(yintercept = 0, linetype = 2)
p1 = gf_histogram(~ resids, bins = 20, data = fishData, color="white")
p2 = gf_qq(~ resids, data = fishData) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)
# NOT USED: I was playing around writing a function to draw a normal quantile plot
# with reference line using ggplot2, but the above is easier using ggformula package
qqplot.data = function (sample) # argument: vector of numbers
{
# following four lines from base R's qqline()
y = quantile(sample[!is.na(sample)], c(0.25, 0.75))
x = qnorm(c(0.25, 0.75))
slope = diff(y)/diff(x)
intercept = y[1L] - slope * x[1L]
df = data.frame(sample)
ggplot(df, aes(sample = sample)) +
stat_qq() +
geom_abline(slope = slope, intercept = intercept)
}
p2 = qqplot.data(fishData$resids)
# I found out how to get the confidence/prediction bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
# LS regression line with confidence and prediction bands
gf_lm(mercury ~ weight, data = fishData,
formula = y ~ x, backtrans= identity,
interval = "prediction", fill = "skyblue") %>%
# gf_lm(interval = "confidence", fill="blue", color="black") %>%
gf_point()
# scatterplot with LS regression line added
fishFit = lm(mercury ~ weight, data=fishData)
p1 = gf_point(mercury ~ weight, data = fishData) %>%
gf_coefline(model = fishFit)
# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fishData) %>%
gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
gf_hline(yintercept = 0, linetype = 2)
grid.arrange(p1, p2, ncol=2)
The y-values and x-values are skewed right.
The larger values are twice the size of middle values.
The errors get larger with the predictor (multiplicative errors, not additive).
Let’s make a new response: logmercury = log(mercury, base=2)
and a new predictor: logweight = log(weight, base=2)
# Create new data frame adding log(weight) and log(mercury) to data frame (base 2 log)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fish2Data = fishData %>%
mutate(logweight = log(weight, base=2),
logmercury = log(mercury, base=2))
This helps with the skewness problems.
# histograms of response and predictor variables
p1 = gf_histogram(~ logmercury, data = fish2Data, bins = 15, color = "white")
p2 = gf_histogram(~ logweight, data = fish2Data, bins = 15, color = "white")
grid.arrange(p1, p2, ncol=2)
Is the relationship still linear?
Did we fix the non-constant variance problem?
# Fit LS model to transformed variables
fish2Fit = lm(logmercury ~ logweight, data = fish2Data)
# Scatterplot of transformed data with LS regression line
gf_point(logmercury ~ logweight, data = fish2Data) %>%
gf_coefline(model = fish2Fit)
Plot of residuals vs. fitted values (yhats)
# add fitted values and residuals to the data frame
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fish2Fit = lm(logmercury ~ logweight, data = fish2Data)
fish2Data = fish2Data %>%
mutate(fitted = predict(fish2Fit), resids = residuals(fish2Fit))
# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fish2Data) %>%
gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
gf_hline(yintercept = 0, linetype = 2)
p1 = gf_histogram(~ resids, data = fish2Data, bins=15, color="white")
p2 = gf_qq(~ resids, data = fish2Data) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)
# I found out how to get the confidence/prediction bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
# LS regression line with confidence and prediction bands
gf_lm(logmercury ~ logweight, data = fish2Data,
interval = "prediction", fill = "skyblue") %>%
# gf_lm(interval = "confidence", fill="blue", color="black") %>%
gf_point()
From the graph, it appears that, on average, fish have
below 1 ppm mercury (0 on the log2 scale of the y-axis above)
as long as they weigh less than \(2^0=1\) kg.
However, for the particular fish you just caught (prediction),
you might be more comfortable if it weighed less than \(2^{-1.25}=0.4204\) kg (420.4 grams)!
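We can check that reading of the graph with predict() (a sketch reusing fish2Fit from above; the cutoffs 0 and \(-1.25\) on the \(\log_2\)(weight) scale are just the values read off the plot, and new.logweights is a name I made up). If the reading is about right, the upper confidence limit near logweight = 0 and the upper prediction limit near logweight = \(-1.25\) should both be close to 0, that is, \(2^0 = 1\) ppm.

new.logweights = data.frame(logweight = c(0, -1.25))
predict(fish2Fit, new.logweights, interval = "confidence")   # mean log2(mercury)
predict(fish2Fit, new.logweights, interval = "prediction")   # a single fish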
Confidence and prediction bands transformed back to the original scale.
A bit easier to read.
# I found out how to get the confidence/prediction bands on the plot using a ggformula
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
# confidence and prediction bands on the original scale
# but data fit using the log(y) and log(x) transformation
gf_lm(mercury ~ weight, data = fish2Data,
formula = log(y) ~ log(x), backtrans = exp,
interval = "prediction", fill = "skyblue") %>%
gf_lm(
formula = log(y) ~ log(x), backtrans = exp,
interval = "confidence", color = "blue") %>%
gf_point()