First, I loaded some R packages needed for the code in this document. Click the Code button to see my R code.
library(tidyverse)
library(knitr)
library(mosaic)
library(gridExtra)
library(broom)
library(modelr)
library(car)
For any package, you install just once, then use library to load the package when needed. For example:
install.packages("mosaic") only once, ever.
library(mosaic) when you start up RStudio.
Collect data on
General deterministic part of model \(\quad Y \;=\; f(X_1,X_2, \ldots, X_p)\)
We might opt for a linear relationship \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p\] …and then, allow for error (a statistical model) \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p + \epsilon\] This is called a linear regression model
(George Box)
Few processes are actually linear.
Still, the following are linear regression models
\[\begin{align} Y &\;=\; \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 \log(X) + \epsilon \end{align}\] because the “coefficients” \(\beta_0\) and \(\beta_1\) enter the model linearly.
For example, the non-linear model \(Y \;=\; \beta_0 e^{\beta_1 X} \epsilon\)
can be transformed to a linear model (take the log of both sides) \[\begin{align}
log(Y) &\;=\; \log(\beta_0) + \beta_1 X + \log(\epsilon) \\ \\
Y' & \;=\; \beta_0' + \beta_1 X + \epsilon'
\end{align}\]
STOP
Once we propose a linear model, the method is intuitive:
Fit a linear function through the data points using some objective criterion for a good “fit”.
We will endevour to provide you with tidy data:
| observation | response \(Y\) | predictor \(x\) |
|---|---|---|
| 1 | \(y_1\) | \(x_1\) |
| 2 | \(y_2\) | \(x_2\) |
| … | … | … |
| … | … | … |
| … | … | … |
| \(n\) | \(y_n\) | \(x_n\) |
A sample of records on computer repair service calls was collected.
Two variables were observed:
Which variable is response (\(Y\))?
Which variable is explanatory or “predictor” variable (\(X\))?
It depends… on the question of interest.
What if we know how long the repairer worked.
Did they repair as many units as we would have expected?
What if we knew how many units needed repair.
How much time should we schedule for the repairer?
\(Y\) = Time to complete work (minutes)
\(X\) = Number of units repaired
# Read in the data from the course data repository
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")
repairData # view the data (OK, since small sample)
| Minutes | Units |
|---|---|
| 23 | 1 |
| 29 | 2 |
| 49 | 3 |
| 64 | 4 |
| 74 | 4 |
| 87 | 5 |
| 96 | 6 |
| 97 | 6 |
| 109 | 7 |
| 119 | 8 |
| 149 | 9 |
| 145 | 9 |
| 154 | 10 |
| 166 | 10 |
glimpse(repairData) # view a summary of the data (OK, for larger samples too)
Observations: 14
Variables: 2
$ Minutes <dbl> 23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166
$ Units <dbl> 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10
ggplot2 package (with ggformula package to help out)lattice packagebase R (defaults not pretty)base R (make it prettier)p1 = gf_point(Minutes ~ Units, data=repairData) # ggplot2 graphics made easier by ggformula package
p2 = xyplot(Minutes ~ Units, data=repairData) # lattice graphics package
grid.arrange(p1, p2, ncol=2) # arrange the plots side by side (2 columns)
par(mfrow=c(1,2)) # prepare to arrange base R plots with 1 row, 2 cols
plot(repairData$Units, repairData$Minutes)
plot(repairData$Units, repairData$Minutes, pch=20, xlab="Units", ylab="Minutes")
par(mfrow=c(1,1)) # return the base R setting to 1 plot (1 row, 1 col)
I usually use the ggplot2 package (with the ggformula package to help out).
STOP
Demonstrate ggformula syntax.
Fancy add-ons you can “chain” together with ggplot2
gf_point(Minutes ~ Units, data=repairData) %>%
gf_rug() %>%
gf_smooth(se = TRUE)
Sample averages: \(\;\overline{y} \;=\; \frac{1}{n}\sum_{i=1}^n y_i \qquad \overline{x} \;=\; \frac{1}{n}\sum_{i=1}^n x_i\)
Sample variances: \[\mathrm{var}(y) \;=\; s_y^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2 \qquad \mathrm{var}(x) \;=\; s_x^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2\] Sample standard deviations: \[\mathrm{sd}(y) \;=\; s_y \;=\; \sqrt{s_y^2}\;, \quad \mathrm{sd}(x) \;=\; s_x \;=\; \sqrt{s_x^2}\]
Standardized data: \[z_i \;=\; \frac{y_i - \overline{y}}{s_y} \qquad w_i \;=\; \frac{x_i - \overline{x}}{s_x}\] Can show that \(\overline{z} \;=\; 0\) and \(\mathrm{sd}(z) \;=\; s_z \;=\; 1\) (same for \(w\))
STOP
Show that \(\overline{z} \;=\; 0\)
\[\overline{z} \;=\; \frac{1}{n}\sum_{i=1}^n z_i \;=\; \frac{1}{n}\sum_{i=1}^n \frac{y_i - \overline{y}}{s_y} \;=\; \frac{1}{n(s_y)} \sum_{i=1}^n (y_i - \overline{y}) \;=\; \frac{1}{n(s_y)} \left[ \sum_{i=1}^n y_i - \sum_{i=1}^n \overline{y}\right] \;=\; \frac{1}{n(s_y)} [ n\overline{y} - n\overline{y}] \quad\;=\;\quad 0 \]
Two variables:
ybar = mean(~ Minutes, data=repairData)
ybar
[1] 97.21
xbar = mean(~ Units, data=repairData)
xbar
[1] 6
sy = sd(~ Minutes, data=repairData)
sy
[1] 46.22
sx = sd(~ Units, data=repairData)
sx
[1] 2.961
\(\bar y = 97.21 \quad \overline{x}= 6\)
\(s_y = 46.22 \quad s_x = 2.961\)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
# Here we add standardized versions of the original variables
repairData = mutate(repairData,
stdMinutes = (Minutes - mean(Minutes)) / sd(Minutes),
stdUnits = (Units - mean(Units)) / sd(Units)
)
repairData
| Minutes | Units | stdMinutes | stdUnits |
|---|---|---|---|
| 23 | 1 | -1.6058 | -1.6885 |
| 29 | 2 | -1.4760 | -1.3508 |
| 49 | 3 | -1.0432 | -1.0131 |
| 64 | 4 | -0.7187 | -0.6754 |
| 74 | 4 | -0.5023 | -0.6754 |
| 87 | 5 | -0.2210 | -0.3377 |
| 96 | 6 | -0.0263 | 0.0000 |
| 97 | 6 | -0.0046 | 0.0000 |
| 109 | 7 | 0.2550 | 0.3377 |
| 119 | 8 | 0.4714 | 0.6754 |
| 149 | 9 | 1.1205 | 1.0131 |
| 145 | 9 | 1.0339 | 1.0131 |
| 154 | 10 | 1.2287 | 1.3508 |
| 166 | 10 | 1.4883 | 1.3508 |
# Confirm that the average of standardized data is 0, with sd=1
zbar = mean(~ stdMinutes, data=repairData)
zbar
[1] 1.192e-16
sz = sd(~ stdMinutes, data=repairData)
sz
[1] 1
For the standardized response (Minutes): \(\overline{z}= 0 \quad s_z = 1,\) as expected.
Standardizing the data does not change the “relationship”
p1 = gf_point(Minutes ~ Units, data=repairData)
p2 = gf_point(stdMinutes ~ stdUnits, data=repairData)
grid.arrange(p1, p2, ncol=2)
\[\mathrm{var}(y) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y})^2 \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (y_i - \overline{y})\]
\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]
\[\mathrm{var}(z) \;=\; \frac{1}{n-1} \sum_i (z_i - \overline{z})^2 \;=\; \frac{1}{n-1} \sum_i (z_i - 0)^2 \;=\; \frac{1}{n-1} \sum_i z_i^2 \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \] From here, it’s simple to show that \(\mathrm{var}(z) = 1\):
\[\mathrm{var}(z) \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \quad\;=\;\quad \frac{1}{s_y^2} \frac{1}{n-1} \sum_i \left(y_i - \overline{y}\right) \left(y_i - \overline{y}\right) \quad\;=\;\quad \frac{1}{s_y^2} var(y) = \frac{1}{s_y^2} s_y^2 = 1 \]
\[\begin{align} r \;=\; \mathrm{corr}(y,x) \quad &\;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right) \\ \\ \quad &\;=\; \quad \frac{\mathrm{cov}(y,x)}{s_y s_x} \quad \;=\; \quad \frac{\mathrm{cov}(y,x)}{\sqrt{\mathrm{var}(y) \mathrm{var}(x)}} \end{align}\]
Consider the following variables
Consider \(V\) and \(W\).
Which is reponse?
Which is predictor?
Would you expect a positive or negative correlation?
Consider \(W\) and \(Z\).
Which is reponse?
Which is predictor?
Would you expect a positive or negative correlation?
STOP
Ponder the questions above.
sleep2Before = \(V\) = hours of sleep two nights before midterm examsleepBefore = \(W\) = hours of sleep the night before midterm examsleepAfter = \(Z\) = hours of sleep the night after the midterm examHere is a little data:
v = c(5, 7, 4, 8, 6)
w = c(3, 8, 2, 5, 6)
z = c(8, 7, 9, 7, 8)
df = tibble(sleep2Before=v, sleepBefore=w, sleepAfter=z)
df
| sleep2Before | sleepBefore | sleepAfter |
|---|---|---|
| 5 | 3 | 8 |
| 7 | 8 | 7 |
| 4 | 2 | 9 |
| 8 | 5 | 7 |
| 6 | 6 | 8 |
# gf_vline adds a vertical line at "xintercept"
# gf_hline adds a horizontal line at "yintercept"
p1 = gf_point(sleepBefore ~ sleep2Before, data=df) %>%
gf_vline(xintercept=mean(~ sleep2Before, data=df), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepBefore, data=df), linetype=2)
p2 = gf_point(sleepAfter ~ sleepBefore, data=df) %>%
gf_vline(xintercept=mean(~ sleepBefore, data=df), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepAfter, data=df), linetype=2)
grid.arrange(p1, p2, ncol=2)
\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]
covwv = cov(sleepBefore ~ sleep2Before, data=df)
covwv
[1] 2.75
covzw = cov(sleepAfter ~ sleepBefore, data=df)
covzw
[1] -1.55
\(\mathrm{cov}(w,v) = 2.75\)
\(\mathrm{cov}(z,w) = -1.55\)
What if we recorded the data in minutes instead of hours?
Then, the relationships are still the same
df2 = df*60
p3 = gf_point(sleepBefore ~ sleep2Before, data=df2) %>%
gf_vline(xintercept=mean(~ sleep2Before, data=df2), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepBefore, data=df2), linetype=2)
p4 = gf_point(sleepAfter ~ sleepBefore, data=df2) %>%
gf_vline(xintercept=mean(~ sleepBefore, data=df2), linetype=2) %>%
gf_hline(yintercept=mean(~ sleepAfter, data=df2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)
…but, the covariance changes:
cov(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 9900
cov(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -5580
\(\mathrm{cov}(60w,60v) = 9,900\)
\(\mathrm{cov}(60z,60w) = -5,580\)
Standardization (correlation) takes care of the units problem. \[r \;=\; \mathrm{corr}(y,x) \quad \;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right)\]
corrwv = cor(sleepBefore ~ sleep2Before, data=df)
corrwv
[1] 0.7285
cor(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 0.7285
corrzw = cor(sleepAfter ~ sleepBefore, data=df)
corrzw
[1] -0.776
cor(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -0.776
covwv = cov(sleepBefore ~ sleep2Before, data=df)
covzw = cov(sleepAfter ~ sleepBefore, data=df)
\(\mathrm{corr}(w,v) = 0.7285\)
\(\mathrm{corr}(60w,60v) = 0.7285\)
\(\mathrm{corr}(z,w) = -0.776\)
\(\mathrm{corr}(60z,60w) = -0.776\)
Let \(Y = 50 - X^2\), then
mydata = tibble(x = c(-7:7), y = 50 - x^2)
mydata
| x | y |
|---|---|
| -7 | 1 |
| -6 | 14 |
| -5 | 25 |
| -4 | 34 |
| -3 | 41 |
| -2 | 46 |
| -1 | 49 |
| 0 | 50 |
| 1 | 49 |
| 2 | 46 |
| 3 | 41 |
| 4 | 34 |
| 5 | 25 |
| 6 | 14 |
| 7 | 1 |
gf_point(y ~ x, data=mydata) +
geom_vline(data=mydata, aes(xintercept=mean(x)), linetype=2) +
geom_hline(data=mydata, aes(yintercept=mean(y)), linetype=2)
cor(y ~ x, data=mydata)
[1] 0
Believe it or not (HW),
all four relationships below have exactly the same correlation coefficient.
AnscombeData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P029b.txt")
p1 = gf_point(Y1 ~ X1, data=AnscombeData)
p2 = gf_point(Y2 ~ X2, data=AnscombeData)
p3 = gf_point(Y3 ~ X3, data=AnscombeData)
p4 = gf_point(Y4 ~ X4, data=AnscombeData)
grid.arrange(p1, p2, p3, p4, ncol=2)
We definitely should make a plot to interpret correlation!
Two variables:
The relationship appears linear.
repairData
| Minutes | Units | stdMinutes | stdUnits |
|---|---|---|---|
| 23 | 1 | -1.6058 | -1.6885 |
| 29 | 2 | -1.4760 | -1.3508 |
| 49 | 3 | -1.0432 | -1.0131 |
| 64 | 4 | -0.7187 | -0.6754 |
| 74 | 4 | -0.5023 | -0.6754 |
| 87 | 5 | -0.2210 | -0.3377 |
| 96 | 6 | -0.0263 | 0.0000 |
| 97 | 6 | -0.0046 | 0.0000 |
| 109 | 7 | 0.2550 | 0.3377 |
| 119 | 8 | 0.4714 | 0.6754 |
| 149 | 9 | 1.1205 | 1.0131 |
| 145 | 9 | 1.0339 | 1.0131 |
| 154 | 10 | 1.2287 | 1.3508 |
| 166 | 10 | 1.4883 | 1.3508 |
gf_point(Minutes ~ Units, data=repairData)
Correlation is quite high.
covyx = cov(Minutes ~ Units, data=repairData)
covyx
[1] 136
corryx = cor(Minutes ~ Units, data=repairData)
corryx
[1] 0.9937
Covariance and correlation in original units:
\(\mathrm{cov}(y,x) = 136\)
\(\mathrm{corr}(y,x) = 0.9937\)
covzw = cov(stdMinutes ~ stdUnits, data=repairData)
covzw
[1] 0.9937
corrzw = cor(stdMinutes ~ stdUnits, data=repairData)
corrzw
[1] 0.9937
Covariance and correlation in standardized units:
\(\mathrm{cov}(z,w) = 0.9937\)
\(\mathrm{corr}(z,w) = 0.9937\)
\[\begin{align} Y & \;=\; f(X) + \epsilon \\ \\ & \;=\; \beta_0 + \beta_1 X + \epsilon \end{align}\]
This is called a simple linear regression model
since there’s only one predictor: \(X\)
random variable = deterministic linear function + random error
** Illustration of data generation **
If we make a detailed model of how the data are generated, we can:
Or, we can do all of the above analytically. This gives us results in nice algebraic form, and lets us see which exact assumptions about the data generating process were necessary to our conclusions.
If the real data generation process is not well-described by a model meeting the following criteria, then we need a different model.
Throughout the course, we will consider the \(x\) values as fixed.
That is, all models are built conditional on the \(x\) values observed.
Adding in a model for the distribution of \(X\) is beyond the scope of the course.
Still, the model conditional on the observed \(x\) values (although incorrect as all models are) is very useful for summarizing and describing relationships and making predictions for \(Y\) based on new observations on \(x\).
The mean of \(Y\) must depend linearly on \(x\).
\[\begin{align} E(Y\; |\; X=x) &\;=\; E(\beta_0 +\beta_1 x + \epsilon ) \\ \\ & \;=\; \beta_0 +\beta_1 x + E(\epsilon) \\ \\ & \;=\; \beta_0 +\beta_1 x + 0 \;=\; \beta_0 +\beta_1 x \end{align}\]
So, if the errors have no relationship to the values of \(x\), it follows that the mean of \(Y\) is a linear function of \(x\). (Or a better way to put it: if the mean of \(Y\) is not a linear function of \(x\), the errors won’t meet criterion 2b.)
For individual \(i\) in the population with predictor value \(X=x=x_i\)
\[Y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\] This model says that the data generation process can be described as fixing values for \(x\), then finding corresponding \(y=\beta_0 + \beta_1 x\), and finally adding a random error \(\epsilon\) for each individual.
The average response for such individuals is \(E(Y\; |\; X=x_i) \;=\; \beta_0 +\beta_1 x_i\)
Let’s try generating a sample from a population
where we have fixed values for \(\beta_0\), \(\beta_1\), and \(x\) values.
beta0_sim1 = 3
beta1_sim1 = 4
n_sim1 = 50
mx_sim1 = 2
sx_sim1 = 1
I set \(\beta_{0} = 3\), \(\beta_{1} = 4\), and \(n= 50\).
First we fix the values and pick a set of \(n = 50\) \(x\) values.
I’ll pick them from a Normal(\(\mu = 2, \sigma = 1\)) distribution.
set.seed(1234567)
# Generate x values for individuals in population
X = rnorm(n_sim1, m = mx_sim1, s = sx_sim1)
gf_histogram(~ X, bins=12, color = "white")
If the relationship is deterministic,
we could generate \(Y\)-values with a simple linear relationship: \(Y = 3 + 4 x\)
Y_without_error = beta0_sim1 + beta1_sim1 * X # Y values without errors
gf_point(Y_without_error ~ X, color='red')
What is the intercept of this line? What is its slope?
gf_point(Y_without_error ~ X, color = 'red') %>%
gf_abline(slope = beta1_sim1, intercept = beta0_sim1)
Of course, for a statistical model,
we only require the mean of \(Y\) to be a linear function of \(x\)
sd_error_sim1 = 2
So, let’s generate a set of errors \(\epsilon\), one for each observation.
I’ll pick them from a normal distribution: Normal(\(\mu = 0, \sigma = 2\)) .
Reminder: According to our simple statistical model, the errors have mean = 0
set.seed(123466)
# Generate error values for individuals in population with given x-values
error = rnorm(n_sim1, m = 0, s = sd_error_sim1)
gf_histogram(~ error, bins=12, color = "white")
Finally, we can generate \(Y\) values from the \(x\) and error values
by adding the errors to the linear function of \(x\) values.
Y = beta0_sim1 + beta1_sim1 * X + error # Y values, with errors
generated_data = tibble(X, error, Y_without_error, Y)
head(generated_data, 10)
| X | error | Y_without_error | Y |
|---|---|---|---|
| 2.1567 | -2.5555 | 11.627 | 9.0713 |
| 3.3738 | -0.2737 | 16.495 | 16.2216 |
| 2.7307 | -2.0445 | 13.923 | 11.8782 |
| 0.6492 | 0.8448 | 5.597 | 6.4416 |
| 1.9915 | -0.8568 | 10.966 | 10.1092 |
| 2.3210 | 0.9133 | 12.284 | 13.1972 |
| 0.2219 | -3.0707 | 3.887 | 0.8167 |
| 2.9095 | 1.4770 | 14.638 | 16.1150 |
| 1.0806 | 2.6609 | 7.322 | 9.9832 |
| 1.8423 | 0.4994 | 10.369 | 10.8685 |
gf_point(Y ~ X, data = generated_data, color = 'blue', show.legend = TRUE) %>%
gf_point(Y_without_error ~ X, data = generated_data, color = 'red') %>%
gf_abline(intercept = beta0_sim1, slope = beta1_sim1)
We have already seen that \(\beta_0\) is the intercept of the above graphs. Let’s make this more precise.
When \(x=0\), we expect \(Y\) to be \(\beta_0\).
\[E(Y\; |\; X=0) \;=\; \beta_0 +\beta_1 (0) \;=\; \beta_0\]
Computer repair data: What does \(\beta_{0}\) represent?
gf_point(Minutes ~ Units, data=repairData) +
xlim(-1, max(~ Units, data=repairData)) +
ylim(10, max(~ Minutes, data=repairData))
# gf_smooth(method = "lm", se = FALSE) +
STOP …and think about it.
When \(x=0\), we estimate the mean of \(y\) to be \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} (0) = \widehat{\beta}_{0}\) minutes.
For the computer example, this means that when no units are serviced,
(guessing from the graph) we expect the work to take about 5 minutes.
Hmmmm…. That doesn’t make sense.
Perhaps interpret this a “start up” time before working on the 1st unit?
We have already seen that \(\beta_1\) is the slope of the above graphs. Let’s make this more precise, too – and in a way that will carry forward when we go to multiple predictor variables, later in the course.
For individuals with predictor value \(x+1\) versus those with \(x\),
we expect the reponse to be \(\beta_1\) units higher (or lower).
\[\begin{align} E(Y\; |\; X=x+1) &\;=\; \beta_0 +\beta_1 (x+1) \\ \\ &\;=\; \beta_0 +\beta_1 x + \beta_1 \;=\; E(Y\; |\; X=x) + \beta_1 \end{align}\]
That is, \(\; E(Y\; |\; X=x+1) - E(Y\; |\; X=x) \;=\; \beta_1\)
Computer repair data: What does \(\beta_1\) represent?
gf_point(Minutes ~ Units, data=repairData) +
xlim(-1, max(~ Units, data=repairData)) +
ylim(10, max(~ Minutes, data=repairData))
# gf_smooth(method = "lm", se = FALSE) +
STOP …and think about it.
For each additional unit to be serviced, we estimate the mean time to completion (guessing from the graph) to be about 12 minutes longer.
Estimates of \(\beta_{0}\) and \(\beta_{1}\) from data:
tidy(lm(Minutes ~ Units, data=repairData))
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
| Units | 15.509 | 0.505 | 30.71 | 0.0000 |
We will soon learn how to “best” calculate a slope and intercept from data.
For individual \(i\) in the sample
\[y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\]
Suppose we know \(x_i\) but not \(y_i\). The true value is \(y_i \;=\; \beta_0 +\beta_1 X + \epsilon_i\), but we don’t know \(\epsilon_i\)
The fitted (estimated) value of \(Y\)
for individual \(i\) already in the sample is \[\widehat{y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\]
The estimated average (expected) value
for all individuals in the population with \(X=x_0\) is \[\widehat{\mu}_0 = \;=\;\widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]
The predicted (estimated) value
for a yet-to-be-observed or just now observed individual in the population with \(X=x_0\) is \[\widehat{y}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]
Most of the time, we don’t know anything about the underlying population, including the true values of the parameters \(\beta_0\) and \(\beta_1\), and \(\mathrm{var}(\epsilon)=\sigma^2\).
All we have are observations in a sample.
We will use these observations to calculate estimates of \(\beta_0\), \(\beta_1\), as well as errors \(\epsilon_i\) and their variance.
Suppose we have estimates for \(\beta_0\) and \(\beta_1\).
We can then use these estimates to calculate the fitted response \(\widehat{y}_i\) for each observation.
The difference between the fitted response and the observed response is an estimate of \(\epsilon_i\).
The estimated error (estimate of \(\epsilon_i\)) is called a residual (\(e_i\))
\[\begin{align} {\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\ &\;=\; {\rm observed - expected} \\ &\;=\; {\rm observed - predicted} \\ \\ &\;=\; y_i - \widehat{y}_i\\ \\ &\;=\; y_i - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i) \end{align}\]
# I'm going to use Base R graphics here since I happen to know how to
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
a = 10
b = 4
mx = 15
sx = 5
s = 20
nsim2 = 100
set.seed(123456)
x = rnorm(nsim2, m=mx, sd=sx)
y = a + b*x + rnorm(nsim2, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)
# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)
xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi
plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)
points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)
lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))
text(x=xi, y=ybar + 2*s, labels="residual = (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
Residuals for the computer repair data:
# Find the least squares estimates for intercept and slope (lm function)
repairFit = lm(Minutes ~ Units, data=repairData)
# Once you have an "lm object", you can grab all kinds of summaries from it
# save the fitted values: predict function
# save the residuals: residuals function
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
repairData = mutate(repairData,
fitted = predict(repairFit),
resids = residuals(repairFit)
)
# This is a pretty fancy plot that I Googled and borrowed from here:
# https://www.r-bloggers.com/visualising-residuals/
ggplot(repairData, aes(x = Units, y = Minutes)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope
geom_segment(aes(xend = Units, yend = fitted), alpha = 0.2) + # alpha to fade lines
geom_point() +
geom_point(aes(y = fitted), shape = 1) +
theme_bw() # Add theme for cleaner look
What line should we choose to “fit” the model to data?
(This is just computer simulated data for illustration.)
# This entire section is borrowed from Hadley Wickham: R for Data Science
# http://r4ds.had.co.nz/model-basics.html
hadleyData <- modelr::sim1
ggplot(hadleyData, aes(x, y)) +
geom_point()
m = 250
Here are \(250\) random possible models, but most are terrible!
set.seed(12345)
models = tibble(
b0 = runif(m, -20, 40),
b1 = runif(m, -5, 5)
)
ggplot(hadleyData, aes(x, y)) +
geom_abline(aes(intercept = b0, slope = b1), data = models, alpha = 1/4) +
geom_point()
### Loss functions No matter which two number we might choose to measure slope and intercept,
we are summarizing an entire distribution with two numbers.
The least squares estimates \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) minimize the sum of squared residuals
In our textbook,
SSE is called the “Sum of Squared Residuals (Errors)”.
MSE is called the “Mean Squared Residual (Error)”.
NOTATION CAUTION:
This is a bit confusing since “errors” (\(\epsilon_i\)) are not observed.
Only residuals are observed, but “SSR” is reserved for another quantity to be discussed later.
\[\mathrm{SSE}\;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{y}_i)^2 \;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2\]
hadleyFit = lm(y ~ x, data=hadleyData)
mydata = hadleyData
mydata$fitted = predict(hadleyFit) # Save the fitted values
mydata$residuals = residuals(hadleyFit) # Save the residual values
ggplot(mydata, aes(x, y)) +
geom_smooth(method = "lm", se = FALSE, color = "lightgrey") + # Plot regression slope
geom_segment(aes(xend = x, yend = fitted), alpha = 0.2) + # alpha to fade lines
geom_point() +
geom_point(aes(y = fitted), shape = 1) +
theme_bw() # Add theme for cleaner look
Which models have the smallest sum of squared residuals (SSE)?
model1 = function(b, data) {
b[1] + data$x * b[2]
}
measure_distance = function(mod, data) {
diff = data$y - model1(mod, data)
sum(diff ^ 2) # changed to SSE from Hadley's RMSE sqrt(mean(diff ^ 2))
}
hadley_SSE = function(b0, b1) {
measure_distance(c(b0, b1), hadleyData)
}
models = models %>%
mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))
ggplot(hadleyData, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = b0, slope = b1, colour = SSE),
data = filter(models, rank(SSE) <= 10)
)
Let’s try to find the smallest SSE on our own.
http://www.rossmanchance.com/applets/RegShuffle.htm
Among the \(250\) models, we have isolated the 10 “closest” to the data
ggplot(models, aes(b0, b1)) +
geom_point(data = filter(models, rank(SSE) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = SSE)) +
labs(x = "Intercepts", y = "Slopes")
Here is a more organized approach: A grid search
grid = expand.grid(
b0 = seq(-5, 20, length = 25),
b1 = seq(1, 3, length = 25)
) %>%
mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))
grid %>%
ggplot(aes(b0, b1)) +
geom_point(data = filter(grid, rank(SSE) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = SSE))
ggplot(hadleyData, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = b0, slope = b1, colour = SSE),
data = filter(grid, rank(SSE) <= 10)
)
# End of section borrowed from Hadley Wickham: R for Data Science
Of course, we can make the grid finer and finer to get closer to the “best fit” line
…or just use calculus to minimize the SSE
We obtain the least squares estimators by minimizing the sum of squared residuals.
You do NOT need to be able to do the math, just understand it.
Find \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) that lead to:
\[\min(\mathrm{SSE}) \;=\; \min \sum_i e_i^2 \;=\; \min \sum_i (y_i-\widehat{y}_i)^2 \;=\; \min \sum_i (y_i-\widehat{\beta}_{0}-\widehat{\beta}_{1} x_i)^2\]
\[\begin{align} \textrm{w.r.t } \widehat{\beta}_{0} \Rightarrow -2\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (1) \\ \\ \textrm{w.r.t } \widehat{\beta}_{1} \Rightarrow -2\sum x_i(y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (2) \end{align}\]
From equation (\(1\)): \(\;\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)
Plug this into equation (\(2\)): \[\begin{align} & \sum x_i(y_i - \overline{y}- \widehat{\beta}_{1}\overline{x}-\widehat{\beta}_{1}x_i)=0 \\ \\ \implies\; & \sum x_i(y_i - \overline{y})\;=\; \widehat{\beta}_{1} \sum x_i(x_i - \overline{x}) \\ \\ \implies\; & \widehat{\beta}_{1}\;=\; \frac{\sum x_i(y_i - \overline{y})}{\sum x_i(x_i - \overline{x})} \hskip2in (3) \end{align}\]
Note that \[\begin{align} & \sum (x_i-\overline{x})(y_i - \overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) -\sum \overline{x}(y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}\sum (y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}(0) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) \end{align}\]
Similarly, we have
\(\sum (x_i-\overline{x})(y_i - \overline{y})\;=\; \sum y_i(x_i-\overline{x})\)
\(\sum (x_i-\overline{x})^2\;=\; \sum x_i(x_i-\overline{x})\)
Therefore, according to (\(3\)) we have other ways to write the estimator \[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]
\(\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)
\(\widehat{\beta}_{1} \;=\; r \frac{s_y}{s_x}\)
When sampling from one population of quantitative values \(Y\), we estimated the mean of \(Y\): \(E(Y) = \mu\).
Using our new linear regression notation, we actually estimated the mean of \(Y\) for the linear model: \(\qquad Y = \beta_0 + \epsilon \;\;=\;\; \mu + \epsilon\)
Now, we want to estimate \(\widehat{\mu} = \widehat{y}= \widehat{\beta}_{0}\)
We will again use as our loss function the sum (or mean) of squared residuals.
\[
SSE(a) = \sum_{i=1}^n\, (y_i-a)^2
\qquad\qquad
MSE(a) = \frac{1}{n-1}\sum_{i=1}^n\, (y_i-a)^2
\]
What value of \(a\) should we choose if loss measured by SSE?
It seems reasonable that \(a\) should be in the “center” of the data.
But which value in the middle would be best?
One optimality criteria: Choose a number that minimizes SSE.
Using the least squares estimation method, \(\widehat{\beta}_{0}\) is the value that minimizes
\[\mathrm{SSE}\;=\; \sum (y_i - \widehat{\mu})^2 \;=\; \sum (y_i - \widehat{y})^2 \;=\; \sum(y_i - \widehat{\beta}_{0})^2\]
So, we want to minimize a function of the form \(f(a) = \sum (y_i -a)^2\) with respect to \(a\).
\[\frac{d}{da} f(a) = 2 \sum (y_i - a) (-1) = -2 \sum(y_i - a)\]
Set the derivative = 0 and solve for \(a\).
\[ \begin{align} -2 \sum(y_i - a) = 0 \implies\; & \sum(y_i - a) = 0 \\ \implies\; & \sum y_i - \sum a = 0 \\ \implies\; & \sum y_i - n a = 0 \\ \implies\; & \sum y_i = n a \\ \implies\; & a = \frac{1}{n} \sum y_i \\ \implies\; & a = \overline{y} \end{align} \]
The 2nd derivative of \(f(a)\) is
\[\frac{d}{da} (-2)\sum (y_i -a) = \frac{d}{da} (-2)\sum y_i + \frac{d}{da} (-2)\sum (-a) = 0 + 2n > 0\]
So, \(\widehat{\beta}_{0} = \overline{y}\) is a minimum, called the “least squares” estimate for \(\mu\).
So, \(\; \widehat{\beta}_{0} \;=\; \widehat{\mu} \;=\; \widehat{y}\;=\; \overline{y}\)
lm = “linear model” function
lm(y ~ x, data = mydata)
Back to the computer repair data:
# Find the least squares estimates for slope and intercept (lm function)
lmfit = lm(Minutes ~ Units, data=repairData)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Minutes ~ Units, data=repairData) %>%
gf_coefline(model=lmfit)
repairFit = lm(Minutes ~ Units, data=repairData)
# tidy (from the broom package) gives a summary of the slope and intercept
# and accompanying estimate of variance in the estimates
tidy(repairFit)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 4.162 | 3.355 | 1.24 | 0.2385 |
| Units | 15.509 | 0.505 | 30.71 | 0.0000 |
b0 = tidy(repairFit)$estimate[1]
b0
[1] 4.162
b1 = tidy(repairFit)$estimate[2]
b1
[1] 15.51
Let’s try the least squares estimate on the simulated data we generated earlier
where we set \(\beta_{0} = 3\) and \(\beta_{1} = 4\) and generated random \(Y\) values.
Since we simulated the data from the linear regression model,
then the least squares estimates should be accurate.
# Find the least squares estimates for slope and intercept (lm function)
lmfit_sim = lm(Y ~ X, data = generated_data)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Y ~ X, data = generated_data) %>%
gf_coefline(model = lmfit_sim) %>%
gf_abline(intercept = beta0_sim1, slope = beta1_sim1, color = 'red',alpha=1)
Remember, our true population parameters were \(\beta_0= 3\) and \(\beta_1= 4\).
lmfit_sim = lm(Y ~ X, data = generated_data)
tidy(lmfit_sim)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 2.937 | 0.7552 | 3.889 | 0.0003 |
| X | 3.983 | 0.3592 | 11.088 | 0.0000 |
b0_sim = tidy(lmfit_sim)$estimate[1]
b1_sim = tidy(lmfit_sim)$estimate[2]
Indeed \(\widehat{\beta}_{0}=2.937 \approx 3 = \beta_{0} \qquad \text{ and } \qquad \widehat{\beta}_{1}=3.983 \approx 4 = \beta_{1}\)
\(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)
random variable = deterministic linear function + random error
This is called a simple linear regression model since only one predictor: \(x\)
The errors are uncorrelated
The errors have no systematic info about \(Y\)
The average error = \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\))
The distribution of errors has mean zero
No systematic bias
The errors are not correlated with the \(x\) values
\(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x\))
The mean of \(Y\) depends linearly on \(x\)
There is actually nothing we need to check in the data
because least squares residuals always add up to zero!
For any data, \[\begin{align} \sum_i e_i &\;=\; \sum_i (y_i - \widehat{y}_i) \\ \\ &\;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - (\overline{y}- \widehat{\beta}_{1}\overline{x}) - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - \overline{y}) - \widehat{\beta}_{1} \sum_i (x_i - \overline{x}) \\ \\ &\;=\; 0 - \widehat{\beta}_{1} (0) \;=\; 0 \end{align}\]
So, the observed regression line \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} x\) is the
observed average and is the balancing point of the observed \(y\)-values.
This should look familiar.
We showed that for data from a population \(Y\): \(\displaystyle{\sum_i (y_i - \overline{y}) = 0}\).
The observed average is the balancing point of the observed \(y\)-values.
Later, we will look at graphical diagnostics
to check for constant spread around the line for all \(x\) values.
We estimate \(\mathrm{var}(\epsilon)\) as
\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]
degrees of freedom (df)
= sample size \(-\) # parameters estimated for the mean (\(\beta_0\) and \(\beta_1\))
= \(n-2\)
Again, we will look at useful graphical diagnostics later.
We know that \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) will not exactly match the true values \(\beta_0\) and \(\beta_1\). Here, we are going to explore this variability.
We’ll start with a quick visualization. I generate several samples. The first is large, with 10,000 observations, and it’s meant to represent the population. The other three are small.
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
set.seed(1234)
sd_error_sim1=5
# Simulate the population (really just a big sample with 10,000 observations)
n_simbig=10000
X_big = runif(n_simbig)
error = rnorm(n_simbig, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y_big = beta0_sim1 + beta1_sim1 * X_big + error
# Now draw 3 smaller samples
n_sim1=10
X = runif(n_sim1)
# First sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y1 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
# Second sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y2 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
# Third sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y3 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
# Reshape data for plotting
this_data<-data.frame(X,Y1,Y2,Y3)
this_data_long <- melt(this_data, id="X")
# Plot population & sample 2
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>%
gf_point(Y2~X,color="green")%>% gf_smooth(Y2~X,method='lm',se=FALSE,color="green")
# Plot population & all samples
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>%
gf_point(value~X,data=this_data_long,color=~variable)%>% gf_smooth(value~X,data=this_data_long,color=~variable,method='lm',se=FALSE)
As we can see, the estimated slopes \(\widehat{\beta}_{1}\) and intercepts \(\widehat{\beta}_{0}\) differ for each of our three samples. What we’ll discuss now is what distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) we’d expect to see as we increase the number of our samples: the sampling distributions.
If the model is correct: \(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)
then the least squares estimates are unbiased.
\[E(\widehat{\beta}_{0}) \;=\; \beta_0 \qquad {\rm and} \qquad E(\widehat{\beta}_{1}) \;=\; \beta_1\]
To show this, let’s ponder the formula for \(\widehat{\beta}_{1}\) first.
Recall the various expressions for \(\widehat{\beta}_{1}\):
\[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]
In particular, \[\widehat{\beta}_{1} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_j-\overline{x})^2} \;=\; \sum{k_i y_i}\] where,
\[k_i \; =\; \frac{(x_i-\overline{x})}{\sum(x_i-\overline{x})^2}\] We’ll soon need these facts about the \(k_i\):
\[\begin{align} \sum_i k_i &\;=\; \sum_i \frac{(x_i-\overline{x})}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})} {\sum(x_j-\overline{x})^2} \;=\; 0 \\ \\ \sum_i k_ix_i &\;=\; \sum_i \frac{(x_i-\overline{x})x_i}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})^2} {\sum(x_j-\overline{x})^2} \;=\; 1 \\ \\ \sum_i k_i^2 &\;=\; \sum_i \frac{(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{\sum_i(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{1}{\sum(x_i-\overline{x})^2} \end{align}\]
\[\begin{align} E(\widehat{\beta}_{1}) &\;=\; E\left(\sum k_iY_i\right) \;=\; \sum E(k_i Y_i) \;=\; \sum k_i E(Y_i) \\ \\ &\;=\; \sum k_i (\beta_0 + \beta_1 x_i) \;=\; \beta_0 \sum k_i + \beta_1 \sum k_i x_i \\ \\ &\;=\; \beta_0 (0) + \beta_1 (1) \;=\; \beta_1 \end{align}\]
…and we can find the variance of \(\widehat{\beta}_{1}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{1}) &\;=\; \mathrm{var}\left(\sum k_iY_i\right) \;=\; \sum \mathrm{var}(k_i Y_i) \;=\; \sum k_i^2 \mathrm{var}(Y_i) \\ \\ &\;=\; \sum k_i^2 \sigma^2 \;=\; \sigma^2 \sum k_i^2 \;=\; \frac{\sigma^2}{\sum(x_i-\overline{x})^2} \end{align}\]
\[\begin{align} E(\widehat{\beta}_{0}) &\;=\; E(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; E(\overline{Y}) - \overline{x}E(\widehat{\beta}_{1}) \\ \\ &\;=\; (\beta_0 + \beta_1 \overline{x}) - \overline{x}\beta_1 \;=\; \beta_0 \end{align}\]
…and we can find the variance of \(\widehat{\beta}_{0}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{0}) &\;=\; \mathrm{var}(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; \mathrm{var}(\overline{Y}) + \overline{x}^2\mathrm{var}(\widehat{\beta}_{1}) -\overline{x}\mathrm{cov}(\overline{Y}, \widehat{\beta}_{1})\\ \\ &(*) \;=\; \frac{\sigma^2}{n} + \frac{\sigma^2 \overline{x}^2}{\sum(x_i-\overline{x})^2} \;=\; \sigma^2 \left[ \frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}\right] \end{align}\] \((*)\) Can show that \(\overline{Y}\) and \(\widehat{\beta}_{1}\) are uncorrelated.
Of course, we don’t know \(\sigma^2\) (population variance).
So, we use an estimate from data:
\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{e_i^2}{n-2} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]
Then, the standard errors (estimated standard deviations) are
\[\mathrm{se}(\widehat{\beta}_{1}) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\overline{x})^2}}\] \[\mathrm{se}(\widehat{\beta}_{0}) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\]
…and if we add one more model specification
That is \(Y \;\sim\; N(\beta_0 + \beta_1 x, \sigma^2)\)
Then, \[\widehat{\beta}_{0} \sim N(\beta_0, \mathrm{sd}(\widehat{\beta}_{0})) \qquad {\rm and} \qquad \widehat{\beta}_{1} \sim N(\beta_1, \mathrm{sd}(\widehat{\beta}_{1}))\]
Also,
\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]
Observe data on a response variable: \(y_1, y_2, \ldots, y_n\)
Estimate the population mean (\(\mu\)) with the sample average (\(\overline{y}\)).
And \(\overline{y}\) is the least squares (LS) estimate for \(\mu\).
If the population \(Y \sim N(\mu, \sigma^2)\) then
\[\frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]
In STAT 101, we estimated \(\sigma^2\) just as for all least squares estimates: \[\widehat\sigma^2 \;=\; s^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \overline{y})^2}{n-1}\]
degrees of freedom (df)
= sample size \(-\) # parameters estimated for the mean (\(\mu\))
= \(n-1\)
\[E(\overline{Y}) \;=\; E\left[\frac{1}{n}\sum_i Y_i\right] \;=\; \frac{1}{n}\sum E(Y_i) \;=\; \frac{1}{n}\sum \mu \;=\; \frac{1}{n} (n \mu) = \mu\]
and for a random sample: \(\;\mathrm{var}(\overline{Y}) = \sigma^2/n\)
If the population is \(N(\mu, \sigma^2)\) then, \(\; \epsilon \;=\; (Y - \mu) \;\sim\; N(0, \sigma^2)\)
and
\[\frac{\overline{y}- \widehat{y}}{\mathrm{se}(\,\overline{y}\,)} = \frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]
We can generate a set of \(Y\)-values from a known population model we used earlier:
kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1))
| beta0 | beta1 | sd_error | n |
|---|---|---|---|
| 3 | 4 | 5 | 10 |
\(\beta_{0} = 3\)
\(\beta_{1} = 4\)
\(\sigma = 5\)
\(n = 10\)
The mean of \(Y\) is linear: \(E(Y) = 3 + 4 x\)
The \(x\) values come from a Normal(\(\mu = 2, \sigma = 1\)) distribution.
The errors (\(\epsilon\)) from a Normal(\(\mu = 0, \sigma = 5\)) distribution.
Finally, we can generate \(Y\) values from the \(x\) and error values by adding the errors to the linear function of \(x\) values.
\(Y_i = 3 + 4 x_i + \epsilon_i\)
number_samples = 1000
kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1, number_samples))
| beta0 | beta1 | sd_error | n | number_samples |
|---|---|---|---|---|
| 3 | 4 | 5 | 10 | 1000 |
Using An Applet:
http://www.rossmanchance.com/applets/RegSim/RegCoeff.html
Inputs needed for the applet:
\(\beta_{1} = 4\)
\(\beta_{0} = 3\)
\(\mu_x = 2\)
\(\sigma_x = 1\)
\(\sigma = 5\)
Range of \(y:\; (-1, 20)\) Range of \(x:\; (-1, 5)\)
Using R:
We can simulate 1,000 samples of \(Y\)-values as follows:
First, complete steps 1 and 2 (just once) to generate \(x\) values.
Then, repeat steps 3 and 4 1,000 times,
and each time calculate \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) and save those values.
Then, look at the distributions of the 1,000 \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) values.
set.seed(1234)
View the code:
# Set up places to store the beta_0 and beta_1 estimates
beta0hats = 0
beta1hats = 0
plots=NULL
# The code below will repeat number_samples times
# i represents the sample number
for (i in 1:1000) {
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i
fit = lm(Y ~ X) # Calculate LS estimates for intercept and slope for sample i
# assign the current beta_0 estimate to the ith position in vector beta0hats
beta0hats[i] = tidy(fit)$estimate[1]
# assign the current beta_1 estimate to the ith position in vector beta1hats
beta1hats[i] = tidy(fit)$estimate[2]
}
# end of the loop that gets repeated number_sample times
What would you expect center, spread, and shape of these distributions to be?
STOP
gf_dhistogram(~ beta0hats, bins = 20, color = "white")
gf_dhistogram(~ beta1hats, bins = 20, color = "white")