1 Chapters 1 & 2

1.1 Lecture 1

First, I loaded some R packages needed for the code in this document. Click the Code button to see my R code.

library(tidyverse)
library(knitr)
library(mosaic)
library(gridExtra)
library(broom)
library(modelr)
library(car)

For any package, you install just once, then use library to load the package when needed. For example:

install.packages("mosaic") only once, ever.

library(mosaic) when you start up RStudio.

1.1.1 Models

1.1.1.1 Deterministic (Functional) Models

Some variables are related in nature.
Example: Equation from the CRC handbook describing a physical law.
- \(F \;=\; ma\)
For some output variable \(Y\) and input \(X\),
- \(Y \;=\; f(X)\)
A functional model perfectly relates \(Y\) to \(X\) (well… in theory).

1.1.1.2 Statistical Models

A Statistical model is a simple, low-dimensional summary of
- the relationship present in the data
- the data-generation process
- the relationship present in the population
We may not know the functional form of relationship
The model allows errors (uncertainty)
- \(Y \;=\; f(X) + \epsilon\)
- random variable = deterministic function + random error
The model allows for deviations (variability) from the “true” function when estimating from data
- \(\widehat{Y} \;=\; \widehat{f}(x) + e_i\)
- estimate of variable = estimate of deterministic function + observed error

1.1.1.3 Linear Regression Models

Collect data on

\(Y\) = response variable
\(X_1, X_2, \ldots, X_p\) = predictor variables

General deterministic part of model \(\quad Y \;=\; f(X_1,X_2, \ldots, X_p)\)

We might opt for a linear relationship \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p\] …and then, allow for error (a statistical model) \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p + \epsilon\] This is called a linear regression model

1.1.1.4 “All models are wrong, but some are useful”

(George Box)

Few processes are actually linear.

Still, the following are linear regression models

\[\begin{align} Y &\;=\; \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 \log(X) + \epsilon \end{align}\] because the “coefficients” \(\beta_0\) and \(\beta_1\) enter the model linearly.

1.1.1.5 Some non-linear relationships can be linearized.

For example, the non-linear model \(Y \;=\; \beta_0 e^{\beta_1 X} \epsilon\)

can be transformed to a linear model (take the log of both sides) \[\begin{align} log(Y) &\;=\; \log(\beta_0) + \beta_1 X + \log(\epsilon) \\ \\ Y' & \;=\; \beta_0' + \beta_1 X + \epsilon' \end{align}\]

STOP

1.1.1.6 Linear or not? Can be linearized or not?

\(Y \;=\; \beta_0 + \beta_1^X + \epsilon\)
\(Y \;=\; \beta_0 \beta_1^X \epsilon\)
\(Y \;=\; \beta_0 + \beta_1 e^{X^2} + \epsilon\)
\(Y \;=\; \beta_0 + \beta_1 X^2 + \beta_2 \log(X) + \epsilon\)

Once we propose a linear model, the method is intuitive:
Fit a linear function through the data points using some objective criterion for a good “fit”.

1.1.2 Data

1.1.2.1 The Data Format

We will endevour to provide you with tidy data:

Each row corresponds to one observational unit (case)
Each variable is one column

observation	response \(Y\)	predictor \(x\)
1	\(y_1\)	\(x_1\)
2	\(y_2\)	\(x_2\)
…	…	…
…	…	…
…	…	…
\(n\)	\(y_n\)	\(x_n\)

1.1.2.2 Example: Computer Repair

A sample of records on computer repair service calls was collected.

Two variables were observed:

Time to complete work (minutes)
Number of units repaired

Which variable is response (\(Y\))?

Which variable is explanatory or “predictor” variable (\(X\))?

It depends… on the question of interest.

What if we know how long the repairer worked.
Did they repair as many units as we would have expected?

What if we knew how many units needed repair.
How much time should we schedule for the repairer?

1.1.2.3 Example: Computer Repairs (response = time, predictor = units)

\(Y\) = Time to complete work (minutes)

\(X\) = Number of units repaired

# Read in the data from the course data repository
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")

repairData # view the data (OK, since small sample)

Minutes	Units
23	1
29	2
49	3
64	4
74	4
87	5
96	6
97	6
109	7
119	8
149	9
145	9
154	10
166	10

glimpse(repairData) # view a summary of the data (OK, for larger samples too)
Observations: 14
Variables: 2
$ Minutes <dbl> 23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166
$ Units   <dbl> 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10

Scatterplot using ggplot2 package (with ggformula package to help out)
Scatterplot using lattice package
Scatterplot using base R (defaults not pretty)
Scatterplot using base R (make it prettier)

p1 = gf_point(Minutes ~ Units, data=repairData) # ggplot2 graphics made easier by ggformula package
p2 = xyplot(Minutes ~ Units, data=repairData) # lattice graphics package
grid.arrange(p1, p2, ncol=2) # arrange the plots side by side (2 columns)


par(mfrow=c(1,2)) # prepare to arrange base R plots with 1 row, 2 cols
plot(repairData$Units, repairData$Minutes)
plot(repairData$Units, repairData$Minutes, pch=20, xlab="Units", ylab="Minutes")

par(mfrow=c(1,1)) # return the base R setting to 1 plot (1 row, 1 col)

I usually use the ggplot2 package (with the ggformula package to help out).

STOP

Demonstrate ggformula syntax.

Fancy add-ons you can “chain” together with ggplot2

gf_point(Minutes ~ Units, data=repairData) %>%
  gf_rug() %>%
  gf_smooth(se = TRUE)

1.1.3 Summary Statistics

1.1.3.1 Reminder: Summary Statistics

Sample averages: \(\;\overline{y} \;=\; \frac{1}{n}\sum_{i=1}^n y_i \qquad \overline{x} \;=\; \frac{1}{n}\sum_{i=1}^n x_i\)

Sample variances: \[\mathrm{var}(y) \;=\; s_y^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2 \qquad \mathrm{var}(x) \;=\; s_x^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2\] Sample standard deviations: \[\mathrm{sd}(y) \;=\; s_y \;=\; \sqrt{s_y^2}\;, \quad \mathrm{sd}(x) \;=\; s_x \;=\; \sqrt{s_x^2}\]

Standardized data: \[z_i \;=\; \frac{y_i - \overline{y}}{s_y} \qquad w_i \;=\; \frac{x_i - \overline{x}}{s_x}\] Can show that \(\overline{z} \;=\; 0\) and \(\mathrm{sd}(z) \;=\; s_z \;=\; 1\) (same for \(w\))

STOP

Show that \(\overline{z} \;=\; 0\)

\[\overline{z} \;=\; \frac{1}{n}\sum_{i=1}^n z_i \;=\; \frac{1}{n}\sum_{i=1}^n \frac{y_i - \overline{y}}{s_y} \;=\; \frac{1}{n(s_y)} \sum_{i=1}^n (y_i - \overline{y}) \;=\; \frac{1}{n(s_y)} \left[ \sum_{i=1}^n y_i - \sum_{i=1}^n \overline{y}\right] \;=\; \frac{1}{n(s_y)} [ n\overline{y} - n\overline{y}] \quad\;=\;\quad 0 \]

1.1.3.2 Example: Computer Repair

Two variables:

\(Y\) = Time to complete work (minutes)
\(X\) = Number of units repaired

ybar = mean(~ Minutes, data=repairData)
ybar
[1] 97.21
xbar = mean(~ Units, data=repairData)
xbar
[1] 6
sy = sd(~ Minutes, data=repairData)
sy
[1] 46.22
sx = sd(~ Units, data=repairData)
sx
[1] 2.961

\(\bar y = 97.21 \quad \overline{x}= 6\)
\(s_y = 46.22 \quad s_x = 2.961\)

# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"

# Here we add standardized versions of the original variables
repairData = mutate(repairData, 
  stdMinutes = (Minutes - mean(Minutes)) / sd(Minutes),
  stdUnits = (Units - mean(Units)) / sd(Units)
)
repairData

Minutes	Units	stdMinutes	stdUnits
23	1	-1.6058	-1.6885
29	2	-1.4760	-1.3508
49	3	-1.0432	-1.0131
64	4	-0.7187	-0.6754
74	4	-0.5023	-0.6754
87	5	-0.2210	-0.3377
96	6	-0.0263	0.0000
97	6	-0.0046	0.0000
109	7	0.2550	0.3377
119	8	0.4714	0.6754
149	9	1.1205	1.0131
145	9	1.0339	1.0131
154	10	1.2287	1.3508
166	10	1.4883	1.3508


# Confirm that the average of standardized data is 0, with sd=1
zbar = mean(~ stdMinutes, data=repairData)
zbar
[1] 1.192e-16
sz = sd(~ stdMinutes, data=repairData)
sz
[1] 1

For the standardized response (Minutes): \(\overline{z}= 0 \quad s_z = 1,\) as expected.

Standardizing the data does not change the “relationship”

p1 = gf_point(Minutes ~ Units, data=repairData)
p2 = gf_point(stdMinutes ~ stdUnits, data=repairData)
grid.arrange(p1, p2, ncol=2)

1.1.4 Correlation

1.1.4.1 Reminder: Covariance and Correlation

variance of \(y\) is

\[\mathrm{var}(y) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y})^2 \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (y_i - \overline{y})\]

covariance of \(y\) and \(x\):

\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]

variance of \(z\) is

\[\mathrm{var}(z) \;=\; \frac{1}{n-1} \sum_i (z_i - \overline{z})^2 \;=\; \frac{1}{n-1} \sum_i (z_i - 0)^2 \;=\; \frac{1}{n-1} \sum_i z_i^2 \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \] From here, it’s simple to show that \(\mathrm{var}(z) = 1\):

\[\mathrm{var}(z) \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \quad\;=\;\quad \frac{1}{s_y^2} \frac{1}{n-1} \sum_i \left(y_i - \overline{y}\right) \left(y_i - \overline{y}\right) \quad\;=\;\quad \frac{1}{s_y^2} var(y) = \frac{1}{s_y^2} s_y^2 = 1 \]

correlation of \(y\) and \(x\) is \(cov(z,w)\) of standardized data:

\[\begin{align} r \;=\; \mathrm{corr}(y,x) \quad &\;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right) \\ \\ \quad &\;=\; \quad \frac{\mathrm{cov}(y,x)}{s_y s_x} \quad \;=\; \quad \frac{\mathrm{cov}(y,x)}{\sqrt{\mathrm{var}(y) \mathrm{var}(x)}} \end{align}\]

1.1.4.2 How covariance measures “relationship”

Consider the following variables

\(V\) = hours of sleep two nights before midterm exam
\(W\) = hours of sleep the night before midterm exam
\(Z\) = hours of sleep the night after the midterm exam

Consider \(V\) and \(W\).
Which is reponse?
Which is predictor?
Would you expect a positive or negative correlation?

Consider \(W\) and \(Z\).
Which is reponse?
Which is predictor?
Would you expect a positive or negative correlation?

STOP

Ponder the questions above.

1.1.4.3 How covariance measures “relationship”

sleep2Before = \(V\) = hours of sleep two nights before midterm exam
sleepBefore = \(W\) = hours of sleep the night before midterm exam
sleepAfter = \(Z\) = hours of sleep the night after the midterm exam

Here is a little data:

v = c(5, 7, 4, 8, 6)
w = c(3, 8, 2, 5, 6)
z = c(8, 7, 9, 7, 8)
df = tibble(sleep2Before=v, sleepBefore=w, sleepAfter=z)
df

sleep2Before	sleepBefore	sleepAfter
5	3	8
7	8	7
4	2	9
8	5	7
6	6	8

# gf_vline adds a vertical line at "xintercept"
# gf_hline adds a horizontal line at "yintercept"
p1 = gf_point(sleepBefore ~ sleep2Before, data=df) %>%
  gf_vline(xintercept=mean(~ sleep2Before, data=df), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepBefore, data=df), linetype=2)
p2 = gf_point(sleepAfter ~ sleepBefore, data=df) %>%
  gf_vline(xintercept=mean(~ sleepBefore, data=df), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepAfter, data=df), linetype=2)
grid.arrange(p1, p2, ncol=2)

\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]

covwv = cov(sleepBefore ~ sleep2Before, data=df)
covwv
[1] 2.75
covzw = cov(sleepAfter ~ sleepBefore, data=df)
covzw
[1] -1.55

\(\mathrm{cov}(w,v) = 2.75\)
\(\mathrm{cov}(z,w) = -1.55\)

What if we recorded the data in minutes instead of hours?
Then, the relationships are still the same

df2 = df*60
p3 = gf_point(sleepBefore ~ sleep2Before, data=df2) %>%
  gf_vline(xintercept=mean(~ sleep2Before, data=df2), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepBefore, data=df2), linetype=2)
p4 = gf_point(sleepAfter ~ sleepBefore, data=df2) %>%
  gf_vline(xintercept=mean(~ sleepBefore, data=df2), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepAfter, data=df2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)

…but, the covariance changes:

cov(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 9900
cov(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -5580

\(\mathrm{cov}(60w,60v) = 9,900\)
\(\mathrm{cov}(60z,60w) = -5,580\)

Standardization (correlation) takes care of the units problem. \[r \;=\; \mathrm{corr}(y,x) \quad \;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right)\]

corrwv = cor(sleepBefore ~ sleep2Before, data=df)
corrwv
[1] 0.7285
cor(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 0.7285

corrzw = cor(sleepAfter ~ sleepBefore, data=df)
corrzw
[1] -0.776
cor(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -0.776

covwv = cov(sleepBefore ~ sleep2Before, data=df)
covzw = cov(sleepAfter ~ sleepBefore, data=df)

\(\mathrm{corr}(w,v) = 0.7285\)
\(\mathrm{corr}(60w,60v) = 0.7285\)

\(\mathrm{corr}(z,w) = -0.776\)
\(\mathrm{corr}(60z,60w) = -0.776\)

1.1.4.4 Facts about correlation and covariance

Covariance and correlation are measures of association (relationship)
Positive or negative, according to the direction of the association
Covariance is not scale invariant \[\mathrm{cov}(a + by, c + dx) \;=\; bd\mathrm{cov}(y,x)\]
Correlation is scale invariant (unitless) \[\mathrm{corr}(a + by, c + dx) \;=\; {\rm sign}(bd)\mathrm{corr}(y,x)\]
The sample correlation coefficient between \(y\) and \(x\) is \(r \;=\; \mathrm{corr}(y,x)\)
\(-1 \le r \le 1\)
\(r = -1\) or \(r = 1\) means perfect correlation:
- All the points are on a straight line.
Strength of association:\(\;\; r\) closer to \(-1\) or \(1\) is strongest
\(r\) is a sample estimate for population correlation coefficient: \(\;\rho \;=\; \mathrm{corr}(Y,X)\).

1.1.4.5 Guess the correlation

http://www.rossmanchance.com/applets/GuessCorrelation.html

1.1.4.6 Always make a scatterplot when correlation calculated

Let \(Y = 50 - X^2\), then

\(Y\) and \(X\) are have a perfect relationship (no deviations from the curve)
…but the relationship is quadratic, not linear
By symmetry, the correlation in this case is exactly zero!

mydata = tibble(x = c(-7:7), y = 50 - x^2)
mydata

x	y
-7	1
-6	14
-5	25
-4	34
-3	41
-2	46
-1	49
0	50
1	49
2	46
3	41
4	34
5	25
6	14
7	1

gf_point(y ~ x, data=mydata) +
  geom_vline(data=mydata, aes(xintercept=mean(x)), linetype=2) +
  geom_hline(data=mydata, aes(yintercept=mean(y)), linetype=2)

cor(y ~ x, data=mydata)
[1] 0

Believe it or not (HW),
all four relationships below have exactly the same correlation coefficient.

AnscombeData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P029b.txt")
p1 = gf_point(Y1 ~ X1, data=AnscombeData)
p2 = gf_point(Y2 ~ X2, data=AnscombeData)
p3 = gf_point(Y3 ~ X3, data=AnscombeData)
p4 = gf_point(Y4 ~ X4, data=AnscombeData)
grid.arrange(p1, p2, p3, p4, ncol=2)

We definitely should make a plot to interpret correlation!

1.1.4.7 Example: Computer Repair

Two variables:

\(Y\) = Time to complete work (minutes)
\(X\) = Number of units repaired

The relationship appears linear.

repairData

Minutes	Units	stdMinutes	stdUnits
23	1	-1.6058	-1.6885
29	2	-1.4760	-1.3508
49	3	-1.0432	-1.0131
64	4	-0.7187	-0.6754
74	4	-0.5023	-0.6754
87	5	-0.2210	-0.3377
96	6	-0.0263	0.0000
97	6	-0.0046	0.0000
109	7	0.2550	0.3377
119	8	0.4714	0.6754
149	9	1.1205	1.0131
145	9	1.0339	1.0131
154	10	1.2287	1.3508
166	10	1.4883	1.3508

gf_point(Minutes ~ Units, data=repairData)

Correlation is quite high.

covyx = cov(Minutes ~ Units, data=repairData)
covyx
[1] 136
corryx = cor(Minutes ~ Units, data=repairData)
corryx
[1] 0.9937

Covariance and correlation in original units:
\(\mathrm{cov}(y,x) = 136\)
\(\mathrm{corr}(y,x) = 0.9937\)

covzw = cov(stdMinutes ~ stdUnits, data=repairData)
covzw
[1] 0.9937
corrzw = cor(stdMinutes ~ stdUnits, data=repairData)
corrzw
[1] 0.9937

Covariance and correlation in standardized units:
\(\mathrm{cov}(z,w) = 0.9937\)
\(\mathrm{corr}(z,w) = 0.9937\)

1.1.5 Simple Linear Regression (one predictor)

1.1.5.1 The linear regression model

\[\begin{align} Y & \;=\; f(X) + \epsilon \\ \\ & \;=\; \beta_0 + \beta_1 X + \epsilon \end{align}\]

This is called a simple linear regression model
since there’s only one predictor: \(X\)

random variable = deterministic linear function + random error

** Illustration of data generation **

1.1.5.1.1 Why model the data-generating process

If we make a detailed model of how the data are generated, we can:

Generate samples from the model – simulated data
Find fitted regression coefficients for these simulated data, and then…
Plot sampling distributions for these fitted coefficients
See how much the fitted coefficients differ from the true population coefficients, and then…
Find sampling distributions for the differences between fitted and true coefficients, and then…
Do hypothesis testing, calculate confidence intervals, etc.

Or, we can do all of the above analytically. This gives us results in nice algebraic form, and lets us see which exact assumptions about the data generating process were necessary to our conclusions.

1.1.5.2 Keeping it simple: assumptions for Simple Linear Regression

If the real data generation process is not well-described by a model meeting the following criteria, then we need a different model.

1.1.5.2.1 Keeping it simple: Treat \(x\) values as “fixed”

Throughout the course, we will consider the \(x\) values as fixed.
That is, all models are built conditional on the \(x\) values observed.
Adding in a model for the distribution of \(X\) is beyond the scope of the course.

Still, the model conditional on the observed \(x\) values (although incorrect as all models are) is very useful for summarizing and describing relationships and making predictions for \(Y\) based on new observations on \(x\).

1.1.5.2.2 Keeping it simple: A probability model for the errors

The errors are uncorrelated with each other
- \(\mathrm{corr}(Y_i, Y_j) \;=\; \mathrm{corr}(\epsilon_i, \epsilon_j) \;=\; 0\) for all \(i\ne j\)
The errors have no systematic info about \(Y\)
1. \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\))
  - The distribution of errors has mean zero
  - No systematic bias
2. The errors are not correlated with the \(x\) values
3. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x\))

1.1.5.2.3 Keeping it simple: The mean of \(Y\) is a linear function of \(x\)

The mean of \(Y\) must depend linearly on \(x\).

\[\begin{align} E(Y\; |\; X=x) &\;=\; E(\beta_0 +\beta_1 x + \epsilon ) \\ \\ & \;=\; \beta_0 +\beta_1 x + E(\epsilon) \\ \\ & \;=\; \beta_0 +\beta_1 x + 0 \;=\; \beta_0 +\beta_1 x \end{align}\]

So, if the errors have no relationship to the values of \(x\), it follows that the mean of \(Y\) is a linear function of \(x\). (Or a better way to put it: if the mean of \(Y\) is not a linear function of \(x\), the errors won’t meet criterion 2b.)

1.2 Lecture 2

1.2.1 The model for individuals in the population

For individual \(i\) in the population with predictor value \(X=x=x_i\)

\[Y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\] This model says that the data generation process can be described as fixing values for \(x\), then finding corresponding \(y=\beta_0 + \beta_1 x\), and finally adding a random error \(\epsilon\) for each individual.

The average response for such individuals is \(E(Y\; |\; X=x_i) \;=\; \beta_0 +\beta_1 x_i\)

Let’s try generating a sample from a population
where we have fixed values for \(\beta_0\), \(\beta_1\), and \(x\) values.

beta0_sim1 = 3
beta1_sim1 = 4
n_sim1 = 50
mx_sim1 = 2
sx_sim1 = 1

I set \(\beta_{0} = 3\), \(\beta_{1} = 4\), and \(n= 50\).

First we fix the values and pick a set of \(n = 50\) \(x\) values.
I’ll pick them from a Normal(\(\mu = 2, \sigma = 1\)) distribution.

set.seed(1234567)

# Generate x values for individuals in population
X = rnorm(n_sim1, m = mx_sim1, s = sx_sim1) 
gf_histogram(~ X, bins=12, color = "white")

If the relationship is deterministic,
we could generate \(Y\)-values with a simple linear relationship: \(Y = 3 + 4 x\)

Y_without_error = beta0_sim1 + beta1_sim1 * X        # Y values without errors
gf_point(Y_without_error ~ X, color='red')

What is the intercept of this line? What is its slope?

gf_point(Y_without_error ~ X, color = 'red') %>%
  gf_abline(slope = beta1_sim1, intercept = beta0_sim1)

Of course, for a statistical model,
we only require the mean of \(Y\) to be a linear function of \(x\)

sd_error_sim1 = 2

So, let’s generate a set of errors \(\epsilon\), one for each observation.
I’ll pick them from a normal distribution: Normal(\(\mu = 0, \sigma = 2\)) .

Reminder: According to our simple statistical model, the errors have mean = 0

set.seed(123466)

# Generate error values for individuals in population with given x-values
error = rnorm(n_sim1, m = 0, s = sd_error_sim1) 
gf_histogram(~ error, bins=12, color = "white")

Finally, we can generate \(Y\) values from the \(x\) and error values
by adding the errors to the linear function of \(x\) values.

Y = beta0_sim1 + beta1_sim1 * X + error # Y values, with errors
generated_data = tibble(X, error, Y_without_error, Y)
head(generated_data, 10)

X	error	Y_without_error	Y
2.1567	-2.5555	11.627	9.0713
3.3738	-0.2737	16.495	16.2216
2.7307	-2.0445	13.923	11.8782
0.6492	0.8448	5.597	6.4416
1.9915	-0.8568	10.966	10.1092
2.3210	0.9133	12.284	13.1972
0.2219	-3.0707	3.887	0.8167
2.9095	1.4770	14.638	16.1150
1.0806	2.6609	7.322	9.9832
1.8423	0.4994	10.369	10.8685


gf_point(Y ~ X, data = generated_data, color = 'blue', show.legend = TRUE) %>%
  gf_point(Y_without_error ~ X, data = generated_data, color = 'red') %>%
  gf_abline(intercept = beta0_sim1, slope = beta1_sim1)

1.2.1.0.1 What does \(\beta_0\) represent?

We have already seen that \(\beta_0\) is the intercept of the above graphs. Let’s make this more precise.

When \(x=0\), we expect \(Y\) to be \(\beta_0\).

\[E(Y\; |\; X=0) \;=\; \beta_0 +\beta_1 (0) \;=\; \beta_0\]

Computer repair data: What does \(\beta_{0}\) represent?

gf_point(Minutes ~ Units, data=repairData) +
  xlim(-1, max(~ Units, data=repairData)) + 
  ylim(10, max(~ Minutes, data=repairData))

#  gf_smooth(method = "lm", se = FALSE) +

STOP …and think about it.

When \(x=0\), we estimate the mean of \(y\) to be \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} (0) = \widehat{\beta}_{0}\) minutes.

For the computer example, this means that when no units are serviced,
(guessing from the graph) we expect the work to take about 5 minutes.

Hmmmm…. That doesn’t make sense.

Perhaps interpret this a “start up” time before working on the 1st unit?

1.2.1.0.2 What does \(\beta_1\) represent?

We have already seen that \(\beta_1\) is the slope of the above graphs. Let’s make this more precise, too – and in a way that will carry forward when we go to multiple predictor variables, later in the course.

For individuals with predictor value \(x+1\) versus those with \(x\),
we expect the reponse to be \(\beta_1\) units higher (or lower).

\[\begin{align} E(Y\; |\; X=x+1) &\;=\; \beta_0 +\beta_1 (x+1) \\ \\ &\;=\; \beta_0 +\beta_1 x + \beta_1 \;=\; E(Y\; |\; X=x) + \beta_1 \end{align}\]

That is, \(\; E(Y\; |\; X=x+1) - E(Y\; |\; X=x) \;=\; \beta_1\)

Computer repair data: What does \(\beta_1\) represent?

gf_point(Minutes ~ Units, data=repairData) +
  xlim(-1, max(~ Units, data=repairData)) + 
  ylim(10, max(~ Minutes, data=repairData))

#  gf_smooth(method = "lm", se = FALSE) +

STOP …and think about it.

For each additional unit to be serviced, we estimate the mean time to completion (guessing from the graph) to be about 12 minutes longer.

Estimates of \(\beta_{0}\) and \(\beta_{1}\) from data:

tidy(lm(Minutes ~ Units, data=repairData))

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

We will soon learn how to “best” calculate a slope and intercept from data.

1.2.2 Estimating \(Y\) For Simple Linear Regression

1.2.2.0.1 Some terminology for estimates of \(Y\) or its average

For individual \(i\) in the sample

\[y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\]

Suppose we know \(x_i\) but not \(y_i\). The true value is \(y_i \;=\; \beta_0 +\beta_1 X + \epsilon_i\), but we don’t know \(\epsilon_i\)

The fitted (estimated) value of \(Y\)
for individual \(i\) already in the sample is \[\widehat{y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\]

The estimated average (expected) value
for all individuals in the population with \(X=x_0\) is \[\widehat{\mu}_0 = \;=\;\widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]

The predicted (estimated) value
for a yet-to-be-observed or just now observed individual in the population with \(X=x_0\) is \[\widehat{y}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]

1.2.2.0.2 We are actually estimating model parameters

Most of the time, we don’t know anything about the underlying population, including the true values of the parameters \(\beta_0\) and \(\beta_1\), and \(\mathrm{var}(\epsilon)=\sigma^2\).

All we have are observations in a sample.

We will use these observations to calculate estimates of \(\beta_0\), \(\beta_1\), as well as errors \(\epsilon_i\) and their variance.

1.2.2.1 Residuals

Suppose we have estimates for \(\beta_0\) and \(\beta_1\).

We can then use these estimates to calculate the fitted response \(\widehat{y}_i\) for each observation.

The difference between the fitted response and the observed response is an estimate of \(\epsilon_i\).

The estimated error (estimate of \(\epsilon_i\)) is called a residual (\(e_i\))

\[\begin{align} {\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\ &\;=\; {\rm observed - expected} \\ &\;=\; {\rm observed - predicted} \\ \\ &\;=\; y_i - \widehat{y}_i\\ \\ &\;=\; y_i - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i) \end{align}\]

# I'm going to use Base R graphics here since I happen to know how to 
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
a = 10
b = 4
mx = 15
sx = 5
s = 20
nsim2 = 100
set.seed(123456)
x = rnorm(nsim2, m=mx, sd=sx)
y = a + b*x + rnorm(nsim2, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)

# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)

xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi

plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)

points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)

lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))

text(x=xi, y=ybar + 2*s, labels="residual  =  (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)

Residuals for the computer repair data:

# Find the least squares estimates for intercept and slope (lm function)
repairFit = lm(Minutes ~ Units, data=repairData)
# Once you have an "lm object", you can grab all kinds of summaries from it
# save the fitted values: predict function
# save the residuals: residuals function

# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
repairData = mutate(repairData,
                     fitted = predict(repairFit),
                     resids = residuals(repairFit)
                     )
# This is a pretty fancy plot that I Googled and borrowed from here:
# https://www.r-bloggers.com/visualising-residuals/
ggplot(repairData, aes(x = Units, y = Minutes)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
  geom_segment(aes(xend = Units, yend = fitted), alpha = 0.2) +  # alpha to fade lines
  geom_point() +
  geom_point(aes(y = fitted), shape = 1) +
  theme_bw()  # Add theme for cleaner look

1.2.3 Model Estimation

1.2.3.1 So, how do we estimate \(\beta_0\) and \(\beta_1\)?

What line should we choose to “fit” the model to data?

(This is just computer simulated data for illustration.)

# This entire section is borrowed from Hadley Wickham: R for Data Science
# http://r4ds.had.co.nz/model-basics.html
hadleyData <- modelr::sim1
ggplot(hadleyData, aes(x, y)) + 
  geom_point()

1.2.3.2 How do we estimate \(\beta_0\) and \(\beta_1\)?

m = 250

Here are \(250\) random possible models, but most are terrible!

set.seed(12345)
models = tibble(
  b0 = runif(m, -20, 40),
  b1 = runif(m, -5, 5)
)

ggplot(hadleyData, aes(x, y)) + 
  geom_abline(aes(intercept = b0, slope = b1), data = models, alpha = 1/4) +
  geom_point()

### Loss functions No matter which two number we might choose to measure slope and intercept,
we are summarizing an entire distribution with two numbers.

There is a cost.
We lose information.
We should measure that loss and be aware of its magnitude.
Statisticians measure loss numerically with a “loss function”
- measures the distance of the data from the line given with a two-number summary (the “slope” and “intercept”) and
- is a measure of distance (spread) from the line.

1.2.3.3 One criteria: Minimize the sum of squared residuals

The least squares estimates \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) minimize the sum of squared residuals

In our textbook,
SSE is called the “Sum of Squared Residuals (Errors)”.
MSE is called the “Mean Squared Residual (Error)”.

NOTATION CAUTION:
This is a bit confusing since “errors” (\(\epsilon_i\)) are not observed.
Only residuals are observed, but “SSR” is reserved for another quantity to be discussed later.

\[\mathrm{SSE}\;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{y}_i)^2 \;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2\]

hadleyFit = lm(y ~ x, data=hadleyData)
mydata = hadleyData
mydata$fitted = predict(hadleyFit)   # Save the fitted values
mydata$residuals = residuals(hadleyFit) # Save the residual values
ggplot(mydata, aes(x, y)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
  geom_segment(aes(xend = x, yend = fitted), alpha = 0.2) +  # alpha to fade lines
  geom_point() +
  geom_point(aes(y = fitted), shape = 1) +
  theme_bw()  # Add theme for cleaner look

1.2.3.4 Least squares estimates: \(\; \widehat{\beta}_{0}\) and \(\; \widehat{\beta}_{1}\)

Which models have the smallest sum of squared residuals (SSE)?

model1 = function(b, data) {
  b[1] + data$x * b[2]
}
measure_distance = function(mod, data) {
  diff = data$y - model1(mod, data)
  sum(diff ^ 2) # changed to SSE from Hadley's RMSE sqrt(mean(diff ^ 2))
}
hadley_SSE = function(b0, b1) {
  measure_distance(c(b0, b1), hadleyData)
}
models = models %>% 
  mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))
ggplot(hadleyData, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = b0, slope = b1, colour = SSE), 
    data = filter(models, rank(SSE) <= 10)
  )

Let’s try to find the smallest SSE on our own.

http://www.rossmanchance.com/applets/RegShuffle.htm

Among the \(250\) models, we have isolated the 10 “closest” to the data

ggplot(models, aes(b0, b1)) +
  geom_point(data = filter(models, rank(SSE) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = SSE)) +
  labs(x = "Intercepts", y = "Slopes")

Here is a more organized approach: A grid search

grid = expand.grid(
  b0 = seq(-5, 20, length = 25),
  b1 = seq(1, 3, length = 25)
  ) %>% 
  mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))

grid %>% 
  ggplot(aes(b0, b1)) +
  geom_point(data = filter(grid, rank(SSE) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = SSE))

ggplot(hadleyData, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = b0, slope = b1, colour = SSE), 
    data = filter(grid, rank(SSE) <= 10)
  )

# End of section borrowed from Hadley Wickham: R for Data Science

1.2.3.5 Finally, let’s find the least squares estimates for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

Of course, we can make the grid finer and finer to get closer to the “best fit” line

…or just use calculus to minimize the SSE

We obtain the least squares estimators by minimizing the sum of squared residuals.

You do NOT need to be able to do the math, just understand it.

Find \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) that lead to:

\[\min(\mathrm{SSE}) \;=\; \min \sum_i e_i^2 \;=\; \min \sum_i (y_i-\widehat{y}_i)^2 \;=\; \min \sum_i (y_i-\widehat{\beta}_{0}-\widehat{\beta}_{1} x_i)^2\]

Differentiate the SSE with respect to \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\),
Set the two equations equal to zero, and
solve for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

\[\begin{align} \textrm{w.r.t } \widehat{\beta}_{0} \Rightarrow -2\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (1) \\ \\ \textrm{w.r.t } \widehat{\beta}_{1} \Rightarrow -2\sum x_i(y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (2) \end{align}\]

1.2.3.6 The least squares estimates for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

From equation (\(1\)): \(\;\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)

Plug this into equation (\(2\)): \[\begin{align} & \sum x_i(y_i - \overline{y}- \widehat{\beta}_{1}\overline{x}-\widehat{\beta}_{1}x_i)=0 \\ \\ \implies\; & \sum x_i(y_i - \overline{y})\;=\; \widehat{\beta}_{1} \sum x_i(x_i - \overline{x}) \\ \\ \implies\; & \widehat{\beta}_{1}\;=\; \frac{\sum x_i(y_i - \overline{y})}{\sum x_i(x_i - \overline{x})} \hskip2in (3) \end{align}\]

1.2.3.7 Rewriting \(\widehat{\beta}_{1}\)

Note that \[\begin{align} & \sum (x_i-\overline{x})(y_i - \overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) -\sum \overline{x}(y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}\sum (y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}(0) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) \end{align}\]

Similarly, we have

\(\sum (x_i-\overline{x})(y_i - \overline{y})\;=\; \sum y_i(x_i-\overline{x})\)
\(\sum (x_i-\overline{x})^2\;=\; \sum x_i(x_i-\overline{x})\)

Therefore, according to (\(3\)) we have other ways to write the estimator \[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]

1.2.3.8 Summary: The simplest formulas for estimators \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

\(\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)

\(\widehat{\beta}_{1} \;=\; r \frac{s_y}{s_x}\)

1.2.4 Side note: Simpler Model from Back in STAT 101 (No Predictor): Why did we choose \(\overline{y}\) to estimate \(\mu\)?

When sampling from one population of quantitative values \(Y\), we estimated the mean of \(Y\): \(E(Y) = \mu\).

Using our new linear regression notation, we actually estimated the mean of \(Y\) for the linear model: \(\qquad Y = \beta_0 + \epsilon \;\;=\;\; \mu + \epsilon\)

Now, we want to estimate \(\widehat{\mu} = \widehat{y}= \widehat{\beta}_{0}\)

1.2.4.0.1 A “loss function”

We will again use as our loss function the sum (or mean) of squared residuals.
\[ SSE(a) = \sum_{i=1}^n\, (y_i-a)^2 \qquad\qquad MSE(a) = \frac{1}{n-1}\sum_{i=1}^n\, (y_i-a)^2 \]

What value of \(a\) should we choose if loss measured by SSE?

It seems reasonable that \(a\) should be in the “center” of the data.
But which value in the middle would be best?

One optimality criteria: Choose a number that minimizes SSE.

Using the least squares estimation method, \(\widehat{\beta}_{0}\) is the value that minimizes

\[\mathrm{SSE}\;=\; \sum (y_i - \widehat{\mu})^2 \;=\; \sum (y_i - \widehat{y})^2 \;=\; \sum(y_i - \widehat{\beta}_{0})^2\]

So, we want to minimize a function of the form \(f(a) = \sum (y_i -a)^2\) with respect to \(a\).

\[\frac{d}{da} f(a) = 2 \sum (y_i - a) (-1) = -2 \sum(y_i - a)\]

Set the derivative = 0 and solve for \(a\).

\[ \begin{align} -2 \sum(y_i - a) = 0 \implies\; & \sum(y_i - a) = 0 \\ \implies\; & \sum y_i - \sum a = 0 \\ \implies\; & \sum y_i - n a = 0 \\ \implies\; & \sum y_i = n a \\ \implies\; & a = \frac{1}{n} \sum y_i \\ \implies\; & a = \overline{y} \end{align} \]

The 2nd derivative of \(f(a)\) is

\[\frac{d}{da} (-2)\sum (y_i -a) = \frac{d}{da} (-2)\sum y_i + \frac{d}{da} (-2)\sum (-a) = 0 + 2n > 0\]

So, \(\widehat{\beta}_{0} = \overline{y}\) is a minimum, called the “least squares” estimate for \(\mu\).

So, \(\; \widehat{\beta}_{0} \;=\; \widehat{\mu} \;=\; \widehat{y}\;=\; \overline{y}\)

1.2.4.1 Finding the least squares estimates using R

lm = “linear model” function

lm(y ~ x, data = mydata)

Back to the computer repair data:

\(Y\) = Time to complete work (minutes)
\(X\) = Number of units repaired

# Find the least squares estimates for slope and intercept (lm function)
lmfit = lm(Minutes ~ Units, data=repairData)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Minutes ~ Units, data=repairData) %>%
  gf_coefline(model=lmfit)

repairFit = lm(Minutes ~ Units, data=repairData)
# tidy (from the broom package) gives a summary of the slope and intercept
# and accompanying estimate of variance in the estimates
tidy(repairFit)

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

b0 = tidy(repairFit)$estimate[1]
b0
[1] 4.162
b1 = tidy(repairFit)$estimate[2]
b1
[1] 15.51

Let’s try the least squares estimate on the simulated data we generated earlier
where we set \(\beta_{0} = 3\) and \(\beta_{1} = 4\) and generated random \(Y\) values.

Since we simulated the data from the linear regression model,
then the least squares estimates should be accurate.

# Find the least squares estimates for slope and intercept (lm function)
lmfit_sim = lm(Y ~ X, data = generated_data)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Y ~ X, data = generated_data) %>%
  gf_coefline(model = lmfit_sim) %>%
  gf_abline(intercept = beta0_sim1, slope = beta1_sim1, color = 'red',alpha=1)

Remember, our true population parameters were \(\beta_0= 3\) and \(\beta_1= 4\).

lmfit_sim = lm(Y ~ X, data = generated_data)
tidy(lmfit_sim)

term	estimate	std.error	statistic	p.value
(Intercept)	2.937	0.7552	3.889	0.0003
X	3.983	0.3592	11.088	0.0000

b0_sim = tidy(lmfit_sim)$estimate[1]
b1_sim = tidy(lmfit_sim)$estimate[2]

Indeed \(\widehat{\beta}_{0}=2.937 \approx 3 = \beta_{0} \qquad \text{ and } \qquad \widehat{\beta}_{1}=3.983 \approx 4 = \beta_{1}\)

1.2.5 Checking the specifications of a linear regression model

\(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)

random variable = deterministic linear function + random error

This is called a simple linear regression model since only one predictor: \(x\)

The errors are uncorrelated
- \(\mathrm{corr}(Y_i, Y_j) \;=\; \mathrm{corr}(\epsilon_i, \epsilon_j) \;=\; 0\) for all \(i\ne j\)
The errors have no systematic info about \(Y\)
1. The average error = \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\))
  - The distribution of errors has mean zero
  - No systematic bias
2. The errors are not correlated with the \(x\) values
3. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x\))
The mean of \(Y\) depends linearly on \(x\)
- \(E(Y\; |\; X=x) \;=\; \beta_0 +\beta_1 x\)

1.2.5.1 Item 1: The errors are uncorrelated

In order to determine if this is reasonable in light of the data
typically, you have to know how the data were collected

Example: Computer repair

if the computer repair person was new to the job and
we knew the order of the repair jobs

Then, perhaps at the start, residuals tend to be larger
(more variability in the minutes per computer)

…and later on, the residuals tend to be smaller
(less variability in the minutes per computer)

That is, the residuals could be correlated over time

We’ll take up this topic in Chapter 8: The Problem of Correlated Errors

1.2.5.2 Item 2a: The errors have mean zero

There is actually nothing we need to check in the data
because least squares residuals always add up to zero!

For any data, \[\begin{align} \sum_i e_i &\;=\; \sum_i (y_i - \widehat{y}_i) \\ \\ &\;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - (\overline{y}- \widehat{\beta}_{1}\overline{x}) - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - \overline{y}) - \widehat{\beta}_{1} \sum_i (x_i - \overline{x}) \\ \\ &\;=\; 0 - \widehat{\beta}_{1} (0) \;=\; 0 \end{align}\]

So, the observed regression line \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} x\) is the
observed average and is the balancing point of the observed \(y\)-values.

This should look familiar.
We showed that for data from a population \(Y\): \(\displaystyle{\sum_i (y_i - \overline{y}) = 0}\).
The observed average is the balancing point of the observed \(y\)-values.

1.2.5.3 Item 2b: The errors are not correlated with the \(x\) values

There is actually nothing we need to check in the data
because least squared residuals are always uncorrelated with the \(x\) values

\[\mathrm{cov}(x, e) \;=\; \frac{1}{n-1} \sum_i (x_i - \overline{x}) (e_i - \overline{e})\] We know that \(\sum_i e_i \;=\; 0\), so \(\overline{e}\;=\; 0\). So, \[\begin{align} \sum_i (x_i - \overline{x}) (e_i - \overline{e}) &= \sum_i (x_i - \overline{x}) e_i \\ &\;=\; \sum_i x_ie_i - \overline{x}\sum_i e_i \\ &\;=\; \sum_i x_ie_i - \overline{x}(0) \\ &\;=\; \sum_i x_ie_i \\ \end{align}\]

Finally, you can show that \(\sum_i x_i e_i \;=\; 0\) for any data.

Thus, \(\mathrm{cov}(x, e) \;=\; 0\). Therefore, \(\mathrm{corr}(x, e) \;=\; 0\)

1.2.5.4 Item 2c: \(\mathrm{var}(\epsilon_i) = \sigma^2\) is same for all \(i\) (for all \(x\))

Later, we will look at graphical diagnostics
to check for constant spread around the line for all \(x\) values.

We estimate \(\mathrm{var}(\epsilon)\) as

\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]

degrees of freedom (df)

= sample size \(-\) # parameters estimated for the mean (\(\beta_0\) and \(\beta_1\))

= \(n-2\)

1.2.5.5 Item 3: The mean of \(Y\) depends linearly on \(x\)

Again, we will look at useful graphical diagnostics later.

1.3 Lecture 3

1.3.1 Statistical Properties of LS Estimates of Model Parameters

1.3.1.1 Variability of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

We know that \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) will not exactly match the true values \(\beta_0\) and \(\beta_1\). Here, we are going to explore this variability.

We’ll start with a quick visualization. I generate several samples. The first is large, with 10,000 observations, and it’s meant to represent the population. The other three are small.

library(reshape2)

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths
set.seed(1234)

sd_error_sim1=5


# Simulate the population (really just a big sample with 10,000 observations)
n_simbig=10000
X_big = runif(n_simbig)
error = rnorm(n_simbig, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y_big = beta0_sim1 + beta1_sim1 * X_big + error

# Now draw 3 smaller samples
n_sim1=10
X = runif(n_sim1)

# First sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y1 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

# Second sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y2 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

# Third sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y3 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

# Reshape data for plotting
this_data<-data.frame(X,Y1,Y2,Y3)
this_data_long <- melt(this_data, id="X") 

# Plot population & sample 2
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>% 
  gf_point(Y2~X,color="green")%>% gf_smooth(Y2~X,method='lm',se=FALSE,color="green")


# Plot population & all samples
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>% 
gf_point(value~X,data=this_data_long,color=~variable)%>% gf_smooth(value~X,data=this_data_long,color=~variable,method='lm',se=FALSE)

As we can see, the estimated slopes \(\widehat{\beta}_{1}\) and intercepts \(\widehat{\beta}_{0}\) differ for each of our three samples. What we’ll discuss now is what distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) we’d expect to see as we increase the number of our samples: the sampling distributions.

1.3.1.2 Sampling distributions for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

If the model is correct: \(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)

then the least squares estimates are unbiased.

\[E(\widehat{\beta}_{0}) \;=\; \beta_0 \qquad {\rm and} \qquad E(\widehat{\beta}_{1}) \;=\; \beta_1\]

To show this, let’s ponder the formula for \(\widehat{\beta}_{1}\) first.

1.3.1.3 Useful algebraic facts about \(\widehat{\beta}_{1}\)

Recall the various expressions for \(\widehat{\beta}_{1}\):

\[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]

In particular, \[\widehat{\beta}_{1} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_j-\overline{x})^2} \;=\; \sum{k_i y_i}\] where,

\[k_i \; =\; \frac{(x_i-\overline{x})}{\sum(x_i-\overline{x})^2}\] We’ll soon need these facts about the \(k_i\):

\[\begin{align} \sum_i k_i &\;=\; \sum_i \frac{(x_i-\overline{x})}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})} {\sum(x_j-\overline{x})^2} \;=\; 0 \\ \\ \sum_i k_ix_i &\;=\; \sum_i \frac{(x_i-\overline{x})x_i}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})^2} {\sum(x_j-\overline{x})^2} \;=\; 1 \\ \\ \sum_i k_i^2 &\;=\; \sum_i \frac{(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{\sum_i(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{1}{\sum(x_i-\overline{x})^2} \end{align}\]

1.3.1.4 OK. Now we can see \(\widehat{\beta}_{1}\) is an unbiased estimator for \(\beta_1\)

\[\begin{align} E(\widehat{\beta}_{1}) &\;=\; E\left(\sum k_iY_i\right) \;=\; \sum E(k_i Y_i) \;=\; \sum k_i E(Y_i) \\ \\ &\;=\; \sum k_i (\beta_0 + \beta_1 x_i) \;=\; \beta_0 \sum k_i + \beta_1 \sum k_i x_i \\ \\ &\;=\; \beta_0 (0) + \beta_1 (1) \;=\; \beta_1 \end{align}\]

…and we can find the variance of \(\widehat{\beta}_{1}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{1}) &\;=\; \mathrm{var}\left(\sum k_iY_i\right) \;=\; \sum \mathrm{var}(k_i Y_i) \;=\; \sum k_i^2 \mathrm{var}(Y_i) \\ \\ &\;=\; \sum k_i^2 \sigma^2 \;=\; \sigma^2 \sum k_i^2 \;=\; \frac{\sigma^2}{\sum(x_i-\overline{x})^2} \end{align}\]

1.3.1.5 OK. Now we can see \(\widehat{\beta}_{0}\) is an unbiased estimator for \(\beta_0\)

\[\begin{align} E(\widehat{\beta}_{0}) &\;=\; E(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; E(\overline{Y}) - \overline{x}E(\widehat{\beta}_{1}) \\ \\ &\;=\; (\beta_0 + \beta_1 \overline{x}) - \overline{x}\beta_1 \;=\; \beta_0 \end{align}\]

…and we can find the variance of \(\widehat{\beta}_{0}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{0}) &\;=\; \mathrm{var}(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; \mathrm{var}(\overline{Y}) + \overline{x}^2\mathrm{var}(\widehat{\beta}_{1}) -\overline{x}\mathrm{cov}(\overline{Y}, \widehat{\beta}_{1})\\ \\ &(*) \;=\; \frac{\sigma^2}{n} + \frac{\sigma^2 \overline{x}^2}{\sum(x_i-\overline{x})^2} \;=\; \sigma^2 \left[ \frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}\right] \end{align}\] \((*)\) Can show that \(\overline{Y}\) and \(\widehat{\beta}_{1}\) are uncorrelated.

1.3.2 Estimating the spread around the line: \(\sigma^2\)

Of course, we don’t know \(\sigma^2\) (population variance).

So, we use an estimate from data:

\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{e_i^2}{n-2} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]

Then, the standard errors (estimated standard deviations) are

\[\mathrm{se}(\widehat{\beta}_{1}) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\overline{x})^2}}\] \[\mathrm{se}(\widehat{\beta}_{0}) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\]

1.3.3 The Normal Distribution

1.3.3.1 Useful choice of a probability distribution for the errors: Normal

…and if we add one more model specification

The errors have a normal distribution: \(\epsilon \;\sim\; N(0, \sigma^2)\)

That is \(Y \;\sim\; N(\beta_0 + \beta_1 x, \sigma^2)\)

Then, \[\widehat{\beta}_{0} \sim N(\beta_0, \mathrm{sd}(\widehat{\beta}_{0})) \qquad {\rm and} \qquad \widehat{\beta}_{1} \sim N(\beta_1, \mathrm{sd}(\widehat{\beta}_{1}))\]

Also,

\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]

1.3.4 Similar Concepts in “STAT 101”

1.3.4.1 This is not so novel. Recall the “STAT 101” one-sample t-test.

Observe data on a response variable: \(y_1, y_2, \ldots, y_n\)

Estimate the population mean (\(\mu\)) with the sample average (\(\overline{y}\)).

And \(\overline{y}\) is the least squares (LS) estimate for \(\mu\).

If the population \(Y \sim N(\mu, \sigma^2)\) then

\[\frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]

In STAT 101, we estimated \(\sigma^2\) just as for all least squares estimates: \[\widehat\sigma^2 \;=\; s^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \overline{y})^2}{n-1}\]

degrees of freedom (df)

= sample size \(-\) # parameters estimated for the mean (\(\mu\))

= \(n-1\)

1.3.4.2 Back in STAT 101, we found LS estimate unbiased too.

\[E(\overline{Y}) \;=\; E\left[\frac{1}{n}\sum_i Y_i\right] \;=\; \frac{1}{n}\sum E(Y_i) \;=\; \frac{1}{n}\sum \mu \;=\; \frac{1}{n} (n \mu) = \mu\]

and for a random sample: \(\;\mathrm{var}(\overline{Y}) = \sigma^2/n\)

1.3.4.3 So, back in STAT 101, we also used a t-distribution

If the population is \(N(\mu, \sigma^2)\) then, \(\; \epsilon \;=\; (Y - \mu) \;\sim\; N(0, \sigma^2)\)

and

\[\frac{\overline{y}- \widehat{y}}{\mathrm{se}(\,\overline{y}\,)} = \frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]

1.3.5 Demo Sampling Distributions

We can generate a set of \(Y\)-values from a known population model we used earlier:

kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1))

beta0	beta1	sd_error	n
3	4	5	10

\(\beta_{0} = 3\)
\(\beta_{1} = 4\)
\(\sigma = 5\)
\(n = 10\)

The mean of \(Y\) is linear: \(E(Y) = 3 + 4 x\)
The \(x\) values come from a Normal(\(\mu = 2, \sigma = 1\)) distribution.
The errors (\(\epsilon\)) from a Normal(\(\mu = 0, \sigma = 5\)) distribution.
Finally, we can generate \(Y\) values from the \(x\) and error values by adding the errors to the linear function of \(x\) values.

\(Y_i = 3 + 4 x_i + \epsilon_i\)

1.3.5.1 Simulate sampling distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

number_samples = 1000
kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1, number_samples))

beta0	beta1	sd_error	n	number_samples
3	4	5	10	1000

Using An Applet:

http://www.rossmanchance.com/applets/RegSim/RegCoeff.html

Inputs needed for the applet:

\(\beta_{1} = 4\)
\(\beta_{0} = 3\)
\(\mu_x = 2\)
\(\sigma_x = 1\)
\(\sigma = 5\)

Range of \(y:\; (-1, 20)\) Range of \(x:\; (-1, 5)\)

Using R:

We can simulate 1,000 samples of \(Y\)-values as follows:

First, complete steps 1 and 2 (just once) to generate \(x\) values.

Then, repeat steps 3 and 4 1,000 times,
and each time calculate \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) and save those values.

Then, look at the distributions of the 1,000 \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) values.

set.seed(1234)

View the code:

# Set up places to store the beta_0 and beta_1 estimates
beta0hats = 0
beta1hats = 0
plots=NULL
# The code below will repeat number_samples times
# i represents the sample number
for (i in 1:1000) {  

  error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i

  Y = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

  fit = lm(Y ~ X) # Calculate LS estimates for intercept and slope for sample i

  # assign the current beta_0 estimate to the ith position in vector beta0hats
  beta0hats[i] = tidy(fit)$estimate[1]

  # assign the current beta_1 estimate to the ith position in vector beta1hats
  beta1hats[i] = tidy(fit)$estimate[2]
}
# end of the loop that gets repeated number_sample times

1.3.5.2 Analyze simulated sampling distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

What would you expect center, spread, and shape of these distributions to be?

STOP

gf_dhistogram(~ beta0hats, bins = 20, color = "white")
  
gf_dhistogram(~ beta1hats, bins = 20, color = "white")

1.3.5.2.1 The means of the sampling distributions

If we enter mean(beta0hats) in R, what do you expect for the numerical result?

If we enter mean(beta1hats) in R, what do you expect for the numerical result?

STOP to ponder that

The results are 2.987 and 4.064
which should be close to \(\beta_{0} = 3\) and \(\beta_{1} = 4\)
because both estimators are unbiased:

\[E(\widehat{\beta}_{0}) = \beta_{0} \qquad \text{ and } \qquad E(\widehat{\beta}_{1}) = \beta_{1}\]

1.3.5.2.2 The standard deviations of the sampling distributions

According to earlier work,

\[SD(\widehat{\beta}_{0}) \;=\; \sigma \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\]

\[SD(\widehat{\beta}_{1}) \;=\; \frac{\sigma}{\sqrt{\sum(x_i-\overline{x})^2}}\]

The standard deviations of the statistics according to theory:

xbar_sim1 = mean(X)
SSx_sim1 = sum((X - xbar_sim1)^2)
SDx_sim1 = sqrt( SSx_sim1 / (n_sim1-1) )
SDb0_sim1 = sd_error_sim1 * sqrt(1/n_sim1 + xbar_sim1^2 / SSx_sim1)
SDb1_sim1 = sd_error_sim1 / sqrt(SSx_sim1)
kable( tibble( sd_error=sd_error_sim1, n=n_sim1, xbar = xbar_sim1, 
               SSx = SSx_sim1, SDx = SDx_sim1, SDb0 = SDb0_sim1, SDb1 = SDb1_sim1 ) )

sd_error	n	xbar	SSx	SDx	SDb0	SDb1
5	10	0.6909	0.6483	0.2684	4.572	6.21

The sample standard deviations of 1,000 simulated statistics:

kable( tibble( mean(beta0hats), sd(beta0hats), mean(beta1hats), sd(beta1hats) ) )

mean(beta0hats)	sd(beta0hats)	mean(beta1hats)	sd(beta1hats)
2.987	4.418	4.064	6.031

1.3.5.2.3 The shape of the sampling distributions

Claim: Since the errors come from a Normal distribution, then

The \(Y\)-values come from a Normal distribution
The \(\widehat{\beta}_{0}\)-values come from a Normal distribution
The \(\widehat{\beta}_{1}\)-values come from a Normal distribution

The simulated sampling distribution of \(\widehat{\beta}_{0}\) values:

gf_dhistogram(~ beta0hats, bins = 20, color = "white") %>%
  gf_dist("norm", params = list(beta0_sim1, SDb0_sim1))
  
gf_dhistogram(~ beta1hats, bins = 20, color = "white") %>%
  gf_dist("norm", params = list(beta1_sim1, SDb1_sim1))

1.3.5.3 Back to the computer repair data…

# glance and tidy (from the broom package) give nicely formatted summaries of regression estimates
repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

glance(repairFit)[-c(7:10)]

r.squared	adj.r.squared	sigma	statistic	p.value	df	df.residual
0.9874	0.9864	5.392	943.2	0	2	12

# base R has a summary function
summary(repairFit)

Call:
lm(formula = Minutes ~ Units, data = repairData)

Residuals:
   Min     1Q Median     3Q    Max 
-9.232 -3.341 -0.714  4.777  7.803 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.162      3.355    1.24     0.24    
Units         15.509      0.505   30.71  8.9e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.39 on 12 degrees of freedom
Multiple R-squared:  0.987, Adjusted R-squared:  0.986 
F-statistic:  943 on 1 and 12 DF,  p-value: 8.92e-13

mean(~ Units, data=repairData)
[1] 6
sd(~ Units, data=repairData)
[1] 2.961

range(~ Minutes, data=repairData)
[1]  23 166
range(~ Units, data=repairData)
[1]  1 10

http://www.rossmanchance.com/applets/RegSim/RegCoeff.html

Inputs needed for the applet:

\(\widehat{\beta}_{1} = 15.51\)
\(\widehat{\beta}_{0} = 4.162\)

\(\widehat{\sigma}^2 = 5.392\)

\(\overline{x}= 6\)
\(s_x = 2.961\)

Range of \(y:\; (23,\; 166)\)
Range of \(x:\; (1,\; 10)\)

1.3.6 Reminder: checking model assumptions

1. The errors are uncorrelated

Data collected as random sample? Effect of time? Chapter 8: The Problem of correlated errors.

2.a. No systematic bias: \(E(\epsilon_i) \;=\; 0\) for all \(i\) (for all \(x\))

Actually nothing we need to check. For any data and least squares regression, \(\sum_i e_i = 0\)

2.b. The errors are not correlated with the \(x\) values

Actually nothing we need to check. For any data and least squares regression: \(\mathrm{cov}(x, e) = 0\)

2.c. \(\mathrm{var}(\epsilon_i) = \sigma^2\) is same for all \(i\) (for all \(x\))

Graphical diagnostics. Chapter 4: Regression Diagnostics.
Check for constant spread around the line for all \(x\) values.

3. The mean of \(Y\) depends linearly on \(x\)

Again, we will look at useful graphs later.

1.3.7 Summary: Statistical Properties of LS Estimates

The least squares estimates

\[ \widehat{\beta}_{0} = \overline{y}-\widehat{\beta}_{1}\overline{x} \qquad \textrm{and}\qquad \widehat{\beta}_{1} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm algebra} \cdots \;=\; r \frac{s_y}{s_x}\]

Unbiased estimators

Both \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) are unbiased estimators of their population counterparts:
\(E(\widehat{\beta}_{0}) = \beta_0\) and \(E(\widehat{\beta}_{1}) = \beta_1\).

Standard errors

\[\mathrm{se}(\widehat{\beta}_{1}) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\overline{x})^2}}\qquad \textrm{and}\qquad \mathrm{se}(\widehat{\beta}_{0}) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\] where \[\widehat\sigma \;=\; \sqrt{\frac{\mathrm{SSE}}{\mathrm{df}}} \;=\; \sqrt{\frac{e_i^2}{n-2}} \;=\; \sqrt{\frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}}\] If \(\epsilon \sim N(0,\sigma^2)\) then (or sample size large), then

\[\widehat{\beta}_{0} \sim N(\beta_0, \mathrm{sd}(\widehat{\beta}_{0})) \qquad {\rm and} \qquad \widehat{\beta}_{1} \sim N(\beta_1, \mathrm{sd}(\widehat{\beta}_{1}))\]

Also,

1.3.8 Hypothesis Tests for Model Parameters

Reminder: Steps for hypothesis testing:

Specify the Null Hypothesis
Specify the Alternative Hypothesis
Set the significance level (\(\alpha\))
Using the sample data and assuming the null hypothesis is true, calculate the value of the test statistic and the corresponding P-Value
Draw a conclusion

Testing \(H_0:\beta_0=0\;\) or \(\;H_0:\beta_1=0\) using t-test

Let’s work the first case: our null hypothesis is \(H_0:\beta_0=0\;\). If the null hypothesis is true, then the statistic

\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} = \frac{\widehat{\beta}_{0} - 0}{\mathrm{se}(\widehat{\beta}_{0})}=\frac{\widehat{\beta}_{0}}{\mathrm{se}(\widehat{\beta}_{0})}\]

will follow a t-distribution with \(n-2\) degrees of freedom.

Reminder of the intuition here: If the null hypothesis is true, most of the time, this statistic will be close to zero. If the statistic is large, either we got very (un)lucky, or the null hypothesis wasn’t true to begin with.

We can test the null hypothesis by calculating this statistic and comparing it with a critical value \(t^\star\) for our chosen significance value. (Remember to use \(\alpha/2\) for a two-sided hypotheis.) Alternatively, we can calculate the p-value corresponding to our statistic, and compare it to \(\alpha\).

Now let’s work the second case: our null hypothesis is \(H_1:\beta_1=0\;\). If the null hypothesis is true, then the statistic

\[\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} = \frac{\widehat{\beta}_{1} - 1}{\mathrm{se}(\widehat{\beta}_{1})}=\frac{\widehat{\beta}_{1}}{\mathrm{se}(\widehat{\beta}_{1})}\]

We can test this hypothesis by colculating this statistic and the corresponding P-value.

1.3.8.1 Testing equality of model coefficients to 0 in R: it’s a special case!

In general, when we want to test a hypothesis \(H_0: \beta_0=k\) our test statistic will be of the form \(t=\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} = \frac{\widehat{\beta}_{1} - k}{\mathrm{se}(\widehat{\beta}_{1})}\). Our test statistic only simplifies to the form \(\frac{\widehat{\beta}_{1}}{\mathrm{se}(\widehat{\beta}_{1})}\) for the specific hypothesis \(H_0: \beta_0=0\).

However, this is a very common hypothesis to test. Because we so frequently want to test this hypothesis, R will computer the corresponding statistic and p-value for us automatically. They are given in the outputs of the lm command.

The important thing to remember is that these outputs only apply to the specific null hypothesis that the coefficients equal zero. If you want to test a different hypothesis, you’ll need to calculate the test statistic and p-value manually, using the formula \(t=\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})}\). We’ll discuss how to do this later.

1.3.8.2 Let’s test \(H_0: \beta_0 = 0\) vs. \(H_1: \beta_0 \ne 0\)

For computer repairs, it seems reasonable
that it should take no time to complete no work (zero units repaired).

Our prediction is \(\widehat{y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} (0) \;=\; \widehat{\beta}_{0} = 4.162\) minutes.

If \(H_0\) is true, then \(\beta_0=0\) and \[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;=\; \frac{\widehat{\beta}_{0} - 0}{\mathrm{se}(\widehat{\beta}_{0})} \;=\; \frac{\widehat{\beta}_{0}}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2}\]

repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

repairFit = lm(Minutes ~ Units, data=repairData)
tidyRepairFit = tidy(repairFit)
b0.se = tidyRepairFit$std.error[1]
b0.tstat = tidyRepairFit$statistic[1]
b0.pvalue = tidyRepairFit$p.value[1]

Summaries	Test
\(\widehat{\beta}_{0} = 4.162\)	\(t_{\rm observed} = 1.24\)
\(\mathrm{se}(\widehat{\beta}_{0}) = 3.355\)	pvalue = \(0.2385\)

Using a significance level of \(\alpha=0.05\),
we fail to reject \(H_0:\beta_0=0\).

1.3.8.3 Oops! We forgot to check if \(Y \sim\) Normal is plausible

\(Y \sim N(\beta_0 + \beta_1 x, \sigma^2)\) when \(\epsilon \sim N(0, \sigma^2)\).

So, let’s graph the residuals.

repairData

Minutes	Units	stdMinutes	stdUnits	fitted	resids
23	1	-1.6058	-1.6885	19.67	3.3296
29	2	-1.4760	-1.3508	35.18	-6.1792
49	3	-1.0432	-1.0131	50.69	-1.6880
64	4	-0.7187	-0.6754	66.20	-2.1967
74	4	-0.5023	-0.6754	66.20	7.8033
87	5	-0.2210	-0.3377	81.71	5.2945
96	6	-0.0263	0.0000	97.21	-1.2143
97	6	-0.0046	0.0000	97.21	-0.2143
109	7	0.2550	0.3377	112.72	-3.7231
119	8	0.4714	0.6754	128.23	-9.2318
149	9	1.1205	1.0131	143.74	5.2594
145	9	1.0339	1.0131	143.74	1.2594
154	10	1.2287	1.3508	159.25	-5.2494
166	10	1.4883	1.3508	159.25	6.7506

# dotplot (since only 14 data values, too small for boxplot or histogram)
gf_dotplot(~ resids, data=repairData)
`stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Skew is not extreme. No obvious outliers.

The t-method should be valid here.

1.3.8.4 We can strongly reject \(H_0: \beta_1=0\)

If \(\beta_1=0\) then \(x\) has no (linear) information about \(Y\).

In fact, \(\beta_1=0\) is equivalent to \(\rho = \mathrm{corr}(Y,x) = 0\) since \[\beta_1 = \rho \frac{\sigma_Y}{\sigma_X}\] So, \(H_0: \beta_1=0\) and \(H_0: \rho=0\) both test for linear relationship.

If \(H_0\) is true, then \(\beta_1=0\) and \[\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1} - 0}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1}}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]

repairFit = lm(Minutes ~ Units, data=repairData)
(tidyRepairFit<-tidy(repairFit))

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

b1.se = tidyRepairFit$std.error[2]
b1.tstat = tidyRepairFit$statistic[2]
b1.pvalue = tidyRepairFit$p.value[2]
b1.estimate = tidyRepairFit$estimate[2]

Summaries	Test
\(\widehat{\beta}_{1} = 15.51\)	\(t_{\rm observed} = 30.71\)
\(\mathrm{se}(\widehat{\beta}_{1}) = 0.505\)	pvalue = \(8.916e-13\)

We strongly reject \(H_0: \beta_1=0\).
There is a strong positive relationship between repair time and number of units serviced.

1.3.9 Testing a specific other value

Consider testing whether or not one additional unit needing repair leads to an average of 14.5 more minutes of work.

If \(H_0\) is true, then \(\beta_1=14.5\) and \[t = \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1} - 14.5}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]

Because our null hypothesis concerns a value other than zero, need to calculate \(t\) manually.

# The long way, but straightforward
b1.12.tstat = (b1.estimate - 14.5) / b1.se
b1.12.tstat
[1] 1.998

Once we do this, we can calculate the p-value using the function pt. We take 1 minus its output, then multiply by two since we are testing a two-sided hypothesis.

n<-nrow(repairData)
b1.12.pvalue =  2 * (1 - pt(b1.12.tstat, df=n-2))
b1.12.pvalue
[1] 0.06894

Summaries	Test
\(\widehat{\beta}_{1} = 15.51\)	\(t_{\rm observed} = 1.998\)
\(\mathrm{se}(\widehat{\beta}_{1}) = 0.505\)	\(t_{\rm observed}^2 = 3.991\)
	pvalue = \(0.06894\)

There is a shortcut to test this hypothesis, using the linearHypothesis function in the car package. This function doesn’t report out the t-statistic – it reports the F-statistic, which we’ll discuss shortly – but the p-values are the same as if we’d done the t-test.

\(H_0: \beta_1 = 14.5\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	464.9	NA	NA	NA	NA
12	348.8	1	116	3.991	0.0689

1.3.10 Partitioning the Sum of Squares

1.3.10.1 Partitioning and residuals

\[(y_i - \overline{y}) \;=\; (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y}) \;=\; (y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})\]

# I'm going to use Base R graphics here since I happen to know how to 
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
draw_residual_schematic <- function() {
  
a = 10
b = 4
mx = 15
sx = 5
s = 20
n = 100
set.seed(123456)
x = rnorm(n, m=mx, sd=sx)
y = a + b*x + rnorm(n, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)

# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)

xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi

myplot = plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)

points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)

lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))

text(x=xi, y=ybar + 2*s, labels="residual  =  (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
}
draw_residual_schematic()

1.3.10.2 Partitioning the sum of squares

It’s not obvious, but is true (for LS regression):

\[\sum (y_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i)^2 \;+\; \sum (\widehat{y}_i - \overline{y})^2\]

SST = \(\sum (y_i - \overline{y})^2\) = total sum of squares

SSE = \(\sum (y_i - \widehat{y}_i)^2\) = error sum of squares

SSR = \(\sum (\widehat{y}_i - \overline{y})^2\) = regression sum of squares

So, SST = SSE + SSR

1.3.10.3 Coefficient of determination: \(R^2\) (“R squared”)

One measure of the “goodness of fit” of the regression model to the data:

\(R^2\) = proportion of variability in the response (\(Y\))
explained by regression on the predictor (\(x\))

\[R^2 \;=\; \frac{\mathrm{SSR}}{\mathrm{SST}} \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \;=\; 1 - \frac{\sum (y_i - \widehat{y}_i)^2}{\sum (y_i - \overline{y})^2}\]

If the predictors don’t include much information about the response,
then \(\sum (y_i - \widehat{y}_i)^2\) is close to \(\sum (y_i - \overline{y})^2\)
since \(\overline{y}\) is the least squares estimate for the mean without predictors.

In that case, \(R^2\) is near \(1 - 1 = 0\).

On the other hand, if the predictors have so much information that
residuals are very, very small, then \(R^2\) is near \(1 - 0 = 1\).

repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]

r.squared	adj.r.squared	sigma	statistic	p.value	df	df.residual
0.9874	0.9864	5.392	943.2	0	2	12

summary(repairFit)

Call:
lm(formula = Minutes ~ Units, data = repairData)

Residuals:
   Min     1Q Median     3Q    Max 
-9.232 -3.341 -0.714  4.777  7.803 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    4.162      3.355    1.24     0.24    
Units         15.509      0.505   30.71  8.9e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.39 on 12 degrees of freedom
Multiple R-squared:  0.987, Adjusted R-squared:  0.986 
F-statistic:  943 on 1 and 12 DF,  p-value: 8.92e-13

1.3.10.4 Why do we call this number \(R^2\)?

When there is only one predictor (\(x\)), \(R^2 = \mathrm{corr}(y,x)^2 = r^2\)

cor(Minutes ~ Units, data=repairData)
[1] 0.9937
cor(Minutes ~ Units, data=repairData)^2
[1] 0.9874

\(r = 0.9937\)
\(r^2 = 0.9874\)

repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]

r.squared	adj.r.squared	sigma	statistic	p.value	df	df.residual
0.9874	0.9864	5.392	943.2	0	2	12

So, 98.7% of the variability in repair times can be attributed to the regression on the number of units.

The rest of the variability (residuals) is not explained by the number of units repaired.

1.3.10.5 Another way to view \(R^2\)

Claim: \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)

Recall:

\(\mathrm{cov}(a+by, c+dx) = bd\mathrm{cov}(y,x)\)
\(\mathrm{sd}(a+bY) = |b|\mathrm{sd}(y)\) and \(\mathrm{sd}(c + dx) = |d|\mathrm{sd}(x)\)

\[\mathrm{corr}(a+bY, c+dx) = \frac{bd\mathrm{cov}(y,x)}{|b|\mathrm{sd}(Y)\, |d|\mathrm{sd}(x)} = {\rm sign}(bd)\mathrm{corr}(y,x)\]

\(\begin{align} \mathrm{corr}(y, \widehat{y}) = \mathrm{corr}(y, \widehat{\beta}_{0} + \widehat{\beta}_{1} x) = \textrm{sign}(\widehat{\beta}_{1}) \mathrm{corr}(y,x) \end{align}\)

\(\widehat{\beta}_{1}\) and \(\mathrm{corr}(y,x)\) always have the same sign: \(\widehat{\beta}_{1} = \mathrm{corr}(y,x) (s_y/s_x)\)

Thus, \(0 \le \mathrm{corr}(y, \widehat{y}) \le 1\) and \(\mathrm{corr}(y, \widehat{y}) = |\, \mathrm{corr}(y, x)\, |\)

So, \(R^2 = \left[\mathrm{corr}(y, x)\right]^2 = \left[\mathrm{corr}(y, \widehat{y})\right]^2\)

This will be handy when we have models with more than one predictor.

But we will have to be careful: when we add more predictors to a model, \(R^2\) will always increase or stay constant. Therefore we won’t be able to reliably use \(R^2\) as a way of comparing models to see which is better.

1.3.10.6 Adjusted R^2

repairFit = lm(Minutes ~ Units, data=repairData)
glance(repairFit)[-c(7:10)]

r.squared	adj.r.squared	sigma	statistic	p.value	df	df.residual
0.9874	0.9864	5.392	943.2	0	2	12

Adjusted \(R^2\) includes the degrees of freedom:

\[\text{adjusted} R^2 = 1 - \frac{\mathrm{SSE}/ \mathrm{SSE}(\mathrm{df})}{\mathrm{SST}/ \mathrm{SST}(\mathrm{df})} = \frac{\sum (y_i - \widehat{y}_i)^2/(n-p)}{\sum (y_i - \overline{y})^2 / (n-1)}\]

Adjusted \(R^2\) does not always increase as we add more predictors to a model, so we will end up using it more often when we are comparing models to one another.

1.4 Lecture 4

1.4.1 Partitioning the sum of squares

\[(y_i - \overline{y}) \;=\; (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y}) \;=\; (y_i - \widehat{y}_i) + (\widehat{y}_i - \overline{y})\]

# I'm going to use Base R graphics here since I happen to know how to 
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
draw_residual_schematic <- function() {
  
a = 10
b = 4
mx = 15
sx = 5
s = 20
n = 100
set.seed(123456)
x = rnorm(n, m=mx, sd=sx)
y = a + b*x + rnorm(n, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)

# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)

xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi

myplot = plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)

points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)

lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))

text(x=xi, y=ybar + 2*s, labels="residual  =  (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)
}
draw_residual_schematic()

1.4.1.1 Partitioning the sum of squares

It’s not obvious, but is true (for LS regression):

\[\sum (y_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i + \widehat{y}_i - \overline{y})^2 \;=\; \sum (y_i - \widehat{y}_i)^2 \;+\; \sum (\widehat{y}_i - \overline{y})^2\]

SST = \(\sum (y_i - \overline{y})^2\) = total sum of squares

SSE = \(\sum (y_i - \widehat{y}_i)^2\) = error sum of squares

SSR = \(\sum (\widehat{y}_i - \overline{y})^2\) = regression sum of squares

So, SST = SSE + SSR

1.4.1.2 Partitioning of the degrees of freedom

Even the degrees of freedom for each sum of squares add up!

SST(df) = SSE(df) + SSR(df)

\((n-1) = (n-2) + 1\)

1.4.1.3 A little more about degrees of freedom

For a linear model with mean estimated by least squares:

degrees of freedom of the sum of squared errors
= sample size \(-\) # parameters estimated for the mean

For SST = \(\sum (y_i - \overline{y})^2\),
the linear model we have in mind is \(Y = \mu + \epsilon\) (one mean parameter).
Thus, SST has \((n-1)\) df.

For SSE = \(\sum (y_i - \widehat{y}_i)^2 = \sum (y_i - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i))^2\),
the linear model we have in mind is \(Y = \beta_0 + \beta_1 x + \epsilon\) (two mean paramters).
Thus, SSE has \((n-2)\) df.

For SSR = SST - SSE = \(\sum (\widehat{y}_i - \overline{y})^2\)
the degrees of freedom is the difference in the number of mean parameters for the two models.
Thus, SSR has \((2 - 1) = 1\) df.

1.4.1.4 And a little bit more about degrees of freedom

Degrees of freedom can be thought of in many ways.
Let’s consider the concept of constraints or “subspaces”.

SST = \(\sum (y_i - \overline{y})^2\), but \(\sum (y_i - \overline{y}) = 0\) (a constraint).

From this constraint, we can write \(\displaystyle y_n = n\overline{y}- \sum_{i=1}^{n-1} y_i\).

Thus, once we know estimate for the mean \(\overline{y}\)
and \(y_1, y_2, \ldots,\) and \(y_{n-1}\),
then we can solve for \(y_n\).

We say that \((n-1)\) y-values are “free” to move,
but the last value is constrained by \(\overline{y}\).
That’s \((n-1)\) “degrees of freedom”.

Similarly,

SSE = \(\sum (y_i - \widehat{y}_i)^2\), but \(\sum (y_i - \widehat{y}_i) = 0\) (a constraint).

It is also true (mentioned far above) that

\(\sum((y_i - \widehat{y}_i)x_i)=0\), for any dataset.

From these constraints, we can write both \(y_n\) and \(y_{n-1}\) as functions of \(\widehat{\beta}_{0}, \widehat{\beta}_{1}, y_1, y_2, \ldots, y_{n-2}\) and all of the \(x\)-values.

Thus, once we know the estimate for the mean (\(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\))
and \(y_1, y_2, \ldots, y_{n-2}\) (and all of the \(x\)-values),
then we can solve for \(y_n\) and \(y_{n-1}\).

We say that \((n-2)\) y-values are “free” to move,
but the last two values are constrained by \(\widehat{\beta}_{0}, \widehat{\beta}_{1},\) and the x-values.
That’s \((n-2)\) “degrees of freedom”.

Finally,

SSR = \(\sum (\widehat{y}_i - \overline{y})^2\), but \(\sum (\widehat{y}_i - \overline{y}) = 0\) (a constraint).

We can show this is true: \[\begin{align} \sum (\widehat{y}_i - \overline{y}) &\;=\; \sum \widehat{y}_i - \sum \overline{y}\\ \\ &\;=\; \sum \left(\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\right) - n\overline{y}\\ \\ &\;=\; \sum \left[(\overline{y}- \widehat{\beta}_{1} \overline{x}) + \widehat{\beta}_{1} x_i\right] - n\overline{y}\\ \\ &\;=\; \sum \overline{y}+ \widehat{\beta}_{1} \sum (x_i - \overline{x}) - n\overline{y}\\ \\ &\;=\; n \overline{y}+ \widehat{\beta}_{1} (0) - n\overline{y}\\ \\ &\;=\; 0 \end{align}\]

Thus, once we know the estimate for the mean (\(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\))
and just one \(\widehat{y}_i\) (and all of the \(x\)-values),
then we can solve for all \((n-1)\) other \(\widehat{y}_j\) values.

We say that just one \(\widehat{y}\)-value is “free” to move,
and the remaining \(\widehat{y}\)-values are constrained by \(\overline{y}\) and the \(x\)-values.
That’s \(1\) (one) “degree of freedom”.

Let me try to convince you of this constraint.

Suppose we know all \(n\) of the \(x\)-values, but only \(\widehat{y}_5\).

Then, we know two points on the regression line:

\((x_5, \widehat{y}_5)\)
\((\overline{x}, \overline{y})\) since every simple linear regression line passes through this point

in which case, we know \(\displaystyle \frac{\widehat{y}_5 - \overline{y}}{x_5 - \overline{x}} = \frac{ \textrm{rise} }{ \textrm{run} } = \widehat{\beta}_{1}\) (the slope)

and we can find \(\widehat{\beta}_{0} = \overline{y}- \widehat{\beta}_{1} \overline{x}\) (the intercept).

From here, we can find all \((n-1)\) other values of \(\widehat{y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\).

1.4.2 An F-test for Multiple Parameters

1.4.2.1 What if we want a single (“omnibus”) test: \(H_0: \beta_0 = \beta_1 = 0\)?

Here is the process.

Start with the full model (FM): \(Y = \beta_0 + \beta_1 x + \epsilon\)
(no restrictions on \(\beta_0\) or \(\beta_1\))

Compare to the reduced model (RM): \(Y = \epsilon \;=\; 0 + \epsilon\;\;\) (assumes \(E(Y)=0\))
Both \(\beta_0=0\) and \(\beta_1=0\) as stated in \(H_0\).

Let’s get an intuition for what the FM and the RM look like. We’ll draw the regression lines for both models here, for two different examples. In the first example, the true values \(\beta_0\) and \(\beta_1\) are 3 and 4, respectively. In the second example, the true values are 0 and 0 – that is, the null hypothesis \(H_0\) is actually true.

library(reshape2)
set.seed(1234)

sd_error_sim1=5

beta0_sim2=3
beta1_sim2=4

beta0_sim3=0
beta1_sim3=0

n_sim1=30

# First example
X1 = runif(n_sim1)

error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y1 = beta0_sim2 + beta1_sim2 * X1 + error # Generate y-values for sample i

# Second example
X2 = runif(n_sim1)
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y2 = beta0_sim3 + beta1_sim3 * X2 + error # Generate y-values for sample i



# Plot  sample 1
  gf_point(Y1~X1,color="grey")%>% gf_smooth(Y1~X1,method='lm',se=FALSE,color="red") %>% gf_abline(intercept=0,slope=0,color="blue")


# Plot sample 2

 gf_point(Y2~X2,color="grey") %>% 
  gf_abline(color="red", slope = -0.10, intercept = 5) %>%
  gf_smooth(Y2~X2,method='lm',se=FALSE,color="red")

STOP Discuss intuition about the performance of the FM and the RM in each of the two cases.

If the model specifications are valid (including Normal errors),
Then the following comparison statistic has an \(F\) distribution.

\[F = \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})}\]

The degrees of freedom for the distribution of the F-statistic
are the numerator df and the denominator df.

1.4.2.2 The F-statistic is always positive. Why?

1.4.2.3 F-test for \(H_0: \beta_0 = \beta_1 = 0\)

For the reduced model: \(Y = \epsilon\quad\) and \(\quad \widehat{y}_i = 0\) for all \(i\)

Thus, \(\mathrm{SSE}(\mathrm{RM}) = \sum (y_i - \widehat{y}_i)^2 = \sum y_i^2\)
with degrees of freedom \(n-0=n\) (no parameters estimated for the mean).

For the full model: \(Y = \beta_0 + \beta_1 + \epsilon\)
and \(\widehat{y}_i = \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\) for all \(i\)

Thus, \(\mathrm{SSE}(\mathrm{FM}) = \sum (y_i - \widehat{y}_i)^2 = \sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2\)
with degrees of freedom \(n-2\) (two parameters estimated for the mean).

\[F \;=\; \frac{\left[ \sum y_i^2 - \sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2 \right] \;/\; [n - (n-2)]}{\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2 \;/\; (n-2)} \;\sim\; F_{2, n-2}\]

The long (but illustrative) calculation:

SHOW CODE

# The long (but illustrative) way
y = repairData$Minutes
x = repairData$Units
n = length(y)
b0

[1] 4.162

b1

[1] 15.51

SSE.RM = sum(y^2)
SSE.FM = sum((y - b0 - b1*x)^2)
df.RM = n
df.FM = n - 2
F.stat = ( (SSE.RM - SSE.FM) / (df.RM - df.FM) ) / (SSE.FM / df.FM)
F.stat

[1] 2747

1 - pf(F.stat, df.RM-df.FM, df.FM)

[1] 1.11e-16

The short way …find a function in a package:

SHOW CODE

# The short, way ...find a function in a package.
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
# ...and so can make things look nice on your HW if you are using R Markdown
repairFit = lm(Minutes ~ Units, data=repairData)
library(car)
kable(linearHypothesis(repairFit, c("(Intercept) = 0", "Units = 0")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
14	160077.0	NA	NA	NA	NA
12	348.8	2	159728	2747	0

…and another method using the base R anova() function

SHOW CODE

# ...and another method using base R anova() function
# anova(reducedModelFit, fullModelFit)
# full model
repairFit = lm(Minutes ~ Units, data=repairData)
# reduced model (beta0 = 0, beta1 = 0)
reducedFit = lm(Minutes ~ -1 + I(0*Units), data=repairData)
# compare the two (nested) models with an F-test
anova(reducedFit, repairFit)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
14	160077.0	NA	NA	NA	NA
12	348.8	2	159728	2747	0

1.4.2.4 Just demonstrating that F-test protects against false positives

When \(\alpha=0.05\), we expect 5% false positives.
P(Type I Error) = P(reject \(H_0\) when it was actually true) = \(\alpha\)

So, if we do two tests, the false positive rate is higher.
(The more tests you do, the more likely you’ll get a false positive)

Test of \(H_0: \beta_0 = 11.8\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	499.5	NA	NA	NA	NA
12	348.8	1	150.7	5.183	0.0419

Test of \(H_0: \beta_1 = 14.5\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	464.9	NA	NA	NA	NA
12	348.8	1	116	3.991	0.0689

Test of \(H_0: \beta_0 = 11.8\) and \(\beta_1 = 14.5\) using an R package formula:

SHOW CODE

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8", "Units = 14.5")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
14	500.1	NA	NA	NA	NA
12	348.8	2	151.2	2.601	0.1153

And another method comparing the reduced (restricted) model with the full (unrestricted) model.

Use the base R anova() function:

SHOW CODE

# ...and another method using base R anova() function
# anova(reducedModelFit, fullModelFit)
# full model
repairFit = lm(Minutes ~ Units, data=repairData)
# reduced model (beta0 = 0, beta1 = 0)
reducedFit = lm(Minutes ~ -1, offset = 11.8 + 14.5*Units, data=repairData)
# compare the two (nested) models with an F-test
kable(anova(reducedFit, repairFit))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
14	500.1	NA	NA	NA	NA
12	348.8	2	151.2	2.601	0.1153

1.4.2.5 Bonferroni method: Also protects against false positives

This method is indeed simpler than the F-test, but…
it’s too conservative
(more protection against false positives than needed)

Bonferroni correction:
If \(k\) tests, then compare p-value to \(\displaystyle \frac{\alpha}{k}\) for each test.

Since we have two tests (slope and intercept) and \(\alpha=0.05\), we should not reject either null unless p-value \(< 0.05/2 = 0.025.\)

Test of \(H_0: \beta_0 = 11.8\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("(Intercept) = 11.8")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	499.5	NA	NA	NA	NA
12	348.8	1	150.7	5.183	0.0419

Test of \(H_0: \beta_1 = 14.5\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	464.9	NA	NA	NA	NA
12	348.8	1	116	3.991	0.0689

Using the Bonferroni correction, we would not reject either \(H_0\).

1.4.2.6 FYI. When \(H_0: \beta_1 =\) number, \(t^2=F\) (equivalent tests)

t-test of \(H_0: \beta_1=0\)

repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

F-test of \(H_0: \beta_1=0\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 0")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	27768.4	NA	NA	NA	NA
12	348.8	1	27420	943.2	0

The relationship holds when testing for \(\beta_1\) other than 0.

For example, consider testing whether or not one additional unit needing repair leads to an average of 14.5 more minutes of work.

If \(H_0\) is true, then \(\beta_1=14.5\) and \[\frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;=\; \frac{\widehat{\beta}_{1} - 14.5}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]

# The long way, but straightforward
b1.12.tstat = (b1 - 14.5) / b1.se
b1.12.tstat
[1] 1.998
b1.12.tstat^2
[1] 3.991
b1.12.pvalue =  2 * (1 - pt(b1.12.tstat, df=n-2))
b1.12.pvalue
[1] 0.06894

Summaries	Test
\(\widehat{\beta}_{1} = 15.51\)	\(t_{\rm observed} = 1.998\)
\(\mathrm{se}(\widehat{\beta}_{1}) = 0.505\)	\(t_{\rm observed}^2 = 3.991\)
	pvalue = \(0.06894\)

..and here is the equivalent F-test for \(H_0: \beta_1 = 14.5\)

repairFit = lm(Minutes ~ Units, data=repairData)
# linearHypothesis is in the "car" package.
# I had to Google for this one!
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(linearHypothesis(repairFit, c("Units = 14.5")))

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
13	464.9	NA	NA	NA	NA
12	348.8	1	116	3.991	0.0689

1.4.3 Similar Concepts in “STAT 101”

1.4.3.1 This is not so novel. Recall the “STAT 101” two-sample t-test.

We have two random samples (from two populations)

For population #1: \(Y = \mu_1 + \epsilon\)

For population #2: \(Y = \mu_2 + \epsilon\)

We can write this as a simple linear regression model.

Let \(\displaystyle x = \begin{cases} 0 & \textrm{data from population #1} \\ 1 & \textrm{data from population #2} \end{cases}\)

Then, write \(Y = \beta_0 + \beta_1 x + \epsilon\)

where \(E(\epsilon)=0\), \(\;\;\mathrm{var}(\epsilon) = \sigma^2\;\;\) and \(\;\;\epsilon\;\;\) follows a normal distn

What does this mean about variances for the two populations?

1.4.3.2 Back in “STAT 101” we found LS estimates

For data from population #1, \(\;\;x=0\;\;\) so, \(\;\;Y = \beta_0 + \epsilon\;\; \implies \; \beta_0 = \mu_1\)

For data from population #2, \(\;\;x=1\;\;\) so, \(\;\;Y = \beta_0 + \beta_1 + \epsilon\;\; \implies \; \beta_1 = \mu_2 - \mu_1\)

\(\widehat{\beta}_{0} = \widehat{\mu}_1 = \overline{y}_1\)

\(\widehat{\beta}_{1} = \widehat{\mu}_2 - \widehat{\mu}_1 = \overline{y}_2 - \overline{y}_1\)

1.4.3.3 Back in “STAT 101” we estimated the variance

Before reminding you of that variance estimate, let’s continue with the LS regression approach:

For any simple linear regression model, we use \(\displaystyle \widehat{\sigma}^2 = \frac{\mathrm{SSE}}{\mathrm{df}}\)

\[ \widehat{\sigma}^2 \;\;=\;\; \frac{\mathrm{SSE}}{\mathrm{df}} \;\;=\;\; \frac{\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x)^2}{n_1 + n_2 - 2} \]

Let’s work on the numerator \[\begin{align} \sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x)^2 &\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \widehat{\beta}_{0})^2 \;\;+\;\; \sum_{i=j}^{n_2} (y_j - \widehat{\beta}_{0} - \widehat{\beta}_{1})^2 \\ &\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \overline{y}_1)^2 \;\;+\;\; \sum_{i=j}^{n_2} (y_j - \overline{y}_1 - (\overline{y}_2 - \overline{y}_1))^2 \\ &\;\;=\;\; \sum_{i=1}^{n_1} (y_i - \overline{y}_1)^2 \;\;+\;\; \sum_{i=j}^{n_2} (y_j - \overline{y}_2)^2 \\ &\;\;=\;\; (n_1 - 1) s_1^2 \;\;+\;\; (n_2 -1 ) s_2^2 \end{align}\]

Therefore, \(\displaystyle \widehat{\sigma}^2 = \frac{(n_1 - 1) s_1^2 \;\;+\;\; (n_2 -1 ) s_2^2}{n_1 + n_2 -2}\)

Does this expression look familiar?

\(\widehat{\sigma}^2 = s_p^2\) = pooled estimate of the population variance

1.4.3.4 “STAT 101” t-test for \(\beta_1\)

\(H_0: \beta_1 = 0\)
is equivalent to
\(H_0: (\mu_2 - \mu_1) = 0\)
which is the same as \(H_0: \mu_2 = \mu_1)\)

So, we have the same t-statistic as in “STAT 101”

\[\begin{align} t &\;\;=\;\; \frac{\widehat{\beta}_{1} - 0}{\mathrm{se}(\widehat{\beta}_{1})} \\ \\ &\;\;=\;\; \frac{\overline{y}_2 - \overline{y}_1}{\mathrm{se}(\overline{y}_2 - \overline{y}_1)} \;\;=\;\; \frac{\overline{y}_2 - \overline{y}_1}{{\sqrt{\displaystyle \frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} } \end{align}\]

1.4.3.5 “STAT 101” t-statistic squared = linear models F-statistic

A lot of algebra is required (not shown), but I’ll set up the scenario for you.

F-test for \(H_0: \mu_1 = \mu_2 = \mu\)

Reduced model: \(Y = \mu + \epsilon\quad \mathrm{df}= n_1+n_2-1\)
Full model: \(Y = \beta_0 + \beta_1 x + \epsilon\quad \mathrm{df}= n_1+n_2-2\)

\[\begin{align} F &\;\;=\;\; \frac{[ \mathrm{SSE}(\mathrm{RM}) - \mathrm{SSE}(\mathrm{FM}) ] \;/\; [ \mathrm{df}(\mathrm{RM}) - \mathrm{df}(\mathrm{FM}) ]}{\mathrm{SSE}(\mathrm{FM}) \;/\; \mathrm{df}(\mathrm{FM})} \\ \\ &\;\;=\;\; \frac{\sum_{i=1}^{n_1+n_2} (y_i - \overline{\overline{y}})^2 - \sum_{i=1}^{n_1} (y_i - \overline{y}_1)^2 - \sum_{j=1}^{n_2} (y_j - \overline{y}_2)^2}{s_p^2} \\ \\ &\;\;=\;\; \textrm{after lots of algebra...} \\ \\ &\;\;=\;\; \frac{(\overline{y}_2 - \overline{y}_1)^2}{\displaystyle{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} \;\;=\;\; t^2 \end{align}\]

1.4.4 Confidence Intervals

1.4.4.1 Confidence interval for \(\beta_0\)

Recall the least squares regression results for the computer repair data

repairFit = lm(Minutes ~ Units, data=repairData)
tidy(repairFit)

term	estimate	std.error	statistic	p.value
(Intercept)	4.162	3.355	1.24	0.2385
Units	15.509	0.505	30.71	0.0000

glance(repairFit)[-c(7:10)]

r.squared	adj.r.squared	sigma	statistic	p.value	df	df.residual
0.9874	0.9864	5.392	943.2	0	2	12

\[\widehat{\beta}_{0} \;\pm\; t^* \mathrm{se}(\widehat{\beta}_{0})\]

With \(n-2 = 12\) degrees of freedom and \(\alpha=0.05\)

n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179

Make an educated guess:
Will zero be inside the confidence interval?

STOP

Yes, we expect 0 to be in this confidence interval since we failed to reject \(H_0: \beta_0 = 0\) earlier.

# The long way, but illustrative
n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179
pm = c(-1, 1)
b0 + pm * t.star * b0.se
[1] -3.148 11.472

# The short way ...I Googled for a package with a nice function
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(confint(repairFit, "(Intercept)", level=0.95))

	2.5 %	97.5 %
(Intercept)	-3.148	11.47

1.4.4.2 Confidence interval for \(\beta_1\)

\[\widehat{\beta}_{1} \;\pm\; t^* \mathrm{se}(\widehat{\beta}_{1})\]

With \(n-2 = 12\) degrees of freedom and \(\alpha=0.05\)

n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179

Make an educated guess:
Will zero be inside the confidence interval?

STOP

No, we do not expect 0 to be in this confidence interval since we rejected \(H_0: \beta_1 = 0\) earlier.

# The long way, but illustrative
n
[1] 14
t.star = - qt(0.025, df=n-2)
t.star
[1] 2.179
pm = c(-1, 1)
b1 + pm * t.star * b1.se
[1] 14.41 16.61

# The short way ...I Googled for a package with a nice function
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(confint(repairFit, "Units", level=0.95))

	2.5 %	97.5 %
Units	14.41	16.61

Or use code to find both confidence intervals at the same time

repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
kable(confint(repairFit, level=0.95))

	2.5 %	97.5 %
(Intercept)	-3.148	11.47
Units	14.409	16.61

1.4.4.3 Using the model to estimate the mean response

The estimated average (expected) value
for all individuals in the sub-population with \(X=x_0\) is \[\widehat{\mu}_0 \;\;=\;\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\] A confidence interval for the mean response at some \(x_0\) is \[\widehat{\mu}_0 \;\pm\; t^*\; \widehat{\sigma}\; \sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{\sum (x_i-\overline{x})^2}}\]

a = 2
b = 7

Among the jobs where \(2\) or \(7\) units need repair, how long do we expect the work to take?

# I had defined new data values earlier
a
[1] 2
b
[1] 7
# Set up a data frame (dataset) to store the x-values for input to prediction
x.values = data.frame(Units = c(a,b))
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
# The short way ...I Googled and found "predict" 
# ...which is just in the "stats" package that comes with R by default
kable(predict(repairFit, x.values, interval="confidence"))

fit	lwr	upr
35.18	29.77	40.59
112.72	109.40	116.05

Which interval is narrower? Why?

1.4.4.4 Confidence bands: computer repair data

# After Googling, I initally only found ggplot2 functions 
# ratherthan the ggformula interface: gg_point(Minutes ~ Units, data = repairData) ....

# (Although you will see in two plots later that I did successfully Google
# how to do this using the ggformula function gg_lm)
ggplot(repairData, aes(x = Units, y = Minutes)) +
  geom_point() + 
  geom_smooth(method="lm", se=TRUE)

1.4.4.5 Confidence bands: simulated data

# After Googling, I initally only found ggplot2 functions 
# ratherthan the ggformula interface: gg_point(Minutes ~ Units, data = repairData) ....

# (Although you will see in the next graph that I did successfully Google
# how to do this using the ggformula function gg_lm)
data_sim1 = data.frame(Y, X)
ggplot(data_sim1, aes(x = X, y = Y)) +
  geom_point() + 
  geom_smooth(method="lm", se=TRUE)

1.4.5 Prediction Intervals

1.4.5.1 Using the model to predict an individual outcome

The predicted value
for one yet-to-be-observed (or just observed) individual in the population with \(X=x_0\) is \[\widehat{y}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]

A prediction interval for the individual response at some \(x_0\) is \[\widehat{y}_0 \;\pm\; t^*\; \widehat{\sigma}\; \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \overline{x})^2}{\sum (x_i-\overline{x})^2}}\] We just got two new jobs where \(2\) and \(7\) units need repair.

How long do should it take to complete either job?

# I had defined new data values earlier
a
[1] 2
b
[1] 7
# Set up a data frame (dataset) to store the x-values for input to prediction
x.values = data.frame(Units = c(a,b))
repairFit = lm(Minutes ~ Units, data=repairData)
# The "kable" function is not technically necessary.  
# It just makes things look better on my slides.
# The short way ...I Googled and found "predict" 
# ...which is just in the "stats" package that comes with R by default
kable(predict(repairFit, x.values, interval="prediction"))

fit	lwr	upr
35.18	22.25	48.11
112.72	100.51	124.93

1.4.5.2 Why are prediction intervals wider than confidence intervals?

Discuss…

1.4.5.3 Prediction limits: computer repair data

# I found out how to get the confidence bands on the plot using a ggformula 
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
gf_lm(Minutes ~ Units, data = repairData,
     interval = "prediction", fill = "skyblue") %>%
  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()

1.4.5.4 Prediction limits: simulated data

# I found out how to get the confidence bands on the plot using a ggformula 
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)
gf_lm(Y ~ X, data = data_sim1,
     interval = "prediction", fill = "skyblue") %>%
  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()

1.4.6 Example: Mercury in Fish

These data concern mercury levels in large-mouthed bass
in North Carolina rivers.

Mercury levels in river water typically at undetectable levels,
yet fish might contain unsafe amount to eat (1 part per million)

In principle, mercury concentration in fish can be measured,
but that’s expensive …and involves waiting days for lab results.

So, that doesn’t help you decide whether or not
to eat the fish you caught today.

So, can we eat large-mouthed bass from North Carolina rivers?

1.4.7 Mercury in Fish: A first data analysis

fishData = read.csv("http://statistics.uchicago.edu/~burbank/data/s224/fish.txt")
# rescale weight from grams to kg to make units of y and x similar in size, 
# so slope won't be huge (and thus, a little harder to interpret, at least for me anyway)

# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fishData = mutate(fishData, weight = weight / 1000)
# sample size
n = dim(fishData)[1]

Older fish contain more mercury.
Older fish weight more.
Expensive and lengthy: determine mercury level in a fish
Not feasible: determine the age of a fish
Easy and quick: weigh the fish

Data were collected on 171 large-mouth bass in North Carolina rivers.

mercury = concentration of mercury in fish tissue (ppm = mcg/g)
weight = weight in kilograms (kg)

Could you weigh a fish and then give a recommendation
as to whether it is safe to eat it?

“The Food and Drug Administration has set an”action level" of 1.0 ppm mercury for fish in interstate commerce."
https://www.greenleft.org.au/content/mercury-how-much-safe

A concentration over 1 part per million is considered unsafe for human consumption. In light of this, what recommendations can you make for fish caught from these rivers?

1.4.7.1 Mercury: Linear relationship?

Is there a linear relationship between mercury and weight?

fishFit = lm(mercury ~ weight, data=fishData)
gf_point(mercury ~ weight, data=fishData) %>%
  gf_coefline(model=fishFit)

Do the specifications of a linear regression model seem reasonable for these data?

The errors are uncorrelated
The errors have no systematic info about \(Y\) a. \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\)) b. The errors are not correlated with the \(x\) values c. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x\))
The mean of \(Y\) depends linearly on \(x\)
The errors have a normal distribution: \(\epsilon \;\sim\; N(0, \sigma^2)\)

What is the distribution of the response and predictor?

# histograms of the response and predictor variables
p1 = gf_histogram(~ mercury, data = fishData, bins=20, color="white")
p2 = gf_histogram(~ weight, data = fishData, bins=20, color="white")
grid.arrange(p1, p2, ncol=2)

1.4.7.2 Mercury: Plot of residuals vs. fitted values

# add fitted values and residuals to the data frame
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fishFit = lm(mercury ~ weight, data=fishData)
fishData = fishData %>%
  mutate(fitted = predict(fishFit), resids = residuals(fishFit))
# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fishData) %>%
  gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
  gf_hline(yintercept = 0, linetype = 2)

1.4.7.3 Mercury: Could residuls come from \(N(0,\sigma^2)\)?

p1 = gf_histogram(~ resids, bins = 20, data = fishData, color="white")
p2 = gf_qq(~ resids, data = fishData) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)

# NOT USED:  I was playing around writing a function to draw a normal quantile plot 
# with reference line using ggplot2, but the above is easier using ggformula package
qqplot.data = function (sample) # argument: vector of numbers
{
  # following four lines from base R's qqline()
  y = quantile(sample[!is.na(sample)], c(0.25, 0.75))
  x = qnorm(c(0.25, 0.75))
  slope = diff(y)/diff(x)
  intercept = y[1L] - slope * x[1L]
  
  df = data.frame(sample)

  ggplot(df, aes(sample = sample)) + 
    stat_qq() + 
    geom_abline(slope = slope, intercept = intercept)

}

p2 = qqplot.data(fishData$resids)

1.4.7.4 Mercury: Is the model useful for predictions?

# I found out how to get the confidence/prediction bands on the plot using a ggformula 
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)

# LS regression line with confidence and prediction bands
gf_lm(mercury ~ weight, data = fishData,
      formula = y ~ x, backtrans= identity, 
      interval = "prediction", fill = "skyblue") %>%
#  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()

1.4.8 Mercury in Fish: A second data analysis

1.4.8.1 Mercury: Transformation of the data

# scatterplot with LS regression line added
fishFit = lm(mercury ~ weight, data=fishData)
p1 = gf_point(mercury ~ weight, data = fishData) %>%
        gf_coefline(model = fishFit)
# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fishData) %>%
  gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
  gf_hline(yintercept = 0, linetype = 2)

grid.arrange(p1, p2, ncol=2)

The y-values and x-values are skewed right.
The larger values are twice the size of middle values.

The errors get larger with the predictor (multiplicative errors, not additive).

Let’s make a new response: logmercury = log(mercury, base=2)
and a new predictor: logweight = log(weight, base=2)

# Create new data frame adding log(weight) and log(mercury) to data frame (base 2 log)
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fish2Data = fishData %>%
  mutate(logweight = log(weight, base=2),
         logmercury = log(mercury, base=2))

This helps the skewness problems.

# histograms of response and predictor variables
p1 = gf_histogram(~ logmercury, data = fish2Data, bins = 15, color = "white")
p2 = gf_histogram(~ logweight, data = fish2Data, bins = 15, color = "white")
grid.arrange(p1, p2, ncol=2)

1.4.8.2 Mercury Transformed: Linear relationship?

Is the relationship still linear?

Did we fix the non-constant variance problem?

# Fit LS model to transformed variables
fish2Fit = lm(logmercury ~ logweight, data = fish2Data)
# Scatterplot of transformed data with LS regression line
gf_point(logmercury ~ logweight, data = fish2Data) %>%
  gf_coefline(model = fish2Fit)

1.4.8.3 Do transformed data better support model requirements?

Plot of residuals vs. fitted values (yhats)

# add fitted values and residuals to the data frame
# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
fish2Fit = lm(logmercury ~ logweight, data = fish2Data)
fish2Data = fish2Data %>%
  mutate(fitted = predict(fish2Fit), resids = residuals(fish2Fit))

# residual plot: plot of residuals vs. fitted values
gf_point(resids ~ fitted, data = fish2Data) %>%
  gf_labs(x = "Fitted values (yhats)", y = "Residuals") %>%
  gf_hline(yintercept = 0, linetype = 2)

1.4.8.4 Mercury Transformed: Could residuls come from \(N(0,\sigma^2)\)?

p1 = gf_histogram(~ resids, data = fish2Data, bins=15, color="white")

p2 = gf_qq(~ resids, data = fish2Data) %>% gf_qqline()
grid.arrange(p1, p2, ncol=2)

1.4.8.5 Mercury Transformed: Is the model useful for predictions?

# I found out how to get the confidence/prediction bands on the plot using a ggformula 
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)

# LS regression line with confidence and prediction bands
gf_lm(logmercury ~ logweight, data = fish2Data,
      interval = "prediction", fill = "skyblue") %>%
#  gf_lm(interval = "confidence", fill="blue", color="black") %>%
  gf_point()

1.4.9 So, is it safe to eat small fish?

From the graph, it appears on average, the fish have
below 1 ppm mercury (0 on the y-axis above)
as long as it weighs less than \(2^0=1\) kg.

However, for the particular fish you just caught (prediction),
you might be more comfortable if it weighed less than \(2^{-1.25}=0.4204\) kg (420.4 grams)!

Confidence and prediction bands transformed back to the original scale.
A bit easier to read.

# I found out how to get the confidence/prediction bands on the plot using a ggformula 
# function: gf_lm (ggformula is just a formula-based interface to ggplot2)

# confidence and prediction bands on the original scale
# but data fit using the log(y) and log(x) transformation
gf_lm(mercury ~ weight, data = fish2Data,
      formula = log(y) ~ log(x), backtrans = exp,
     interval = "prediction", fill = "skyblue") %>%
  gf_lm(
     formula = log(y) ~ log(x), backtrans = exp,
     interval = "confidence", color = "blue") %>%
  gf_point()

STAT 224 Chapters 1-2

1 Chapters 1 & 2

1.1 Lecture 1

1.1.1 Models

1.1.1.1 Deterministic (Functional) Models

1.1.1.2 Statistical Models

1.1.1.3 Linear Regression Models

1.1.1.4 “All models are wrong, but some are useful”

1.1.1.5 Some non-linear relationships can be linearized.

1.1.1.6 Linear or not? Can be linearized or not?

1.1.2 Data

1.1.2.1 The Data Format

1.1.2.2 Example: Computer Repair

1.1.2.3 Example: Computer Repairs (response = time, predictor = units)

1.1.3 Summary Statistics

1.1.3.1 Reminder: Summary Statistics

1.1.3.2 Example: Computer Repair

1.1.4 Correlation

1.1.4.1 Reminder: Covariance and Correlation

1.1.4.2 How covariance measures “relationship”

1.1.4.3 How covariance measures “relationship”

1.1.4.4 Facts about correlation and covariance

1.1.4.5 Guess the correlation

1.1.4.6 Always make a scatterplot when correlation calculated

1.1.4.7 Example: Computer Repair

1.1.5 Simple Linear Regression (one predictor)

1.1.5.1 The linear regression model

1.1.5.1.1 Why model the data-generating process

1.1.5.2 Keeping it simple: assumptions for Simple Linear Regression

1.1.5.2.1 Keeping it simple: Treat \(x\) values as “fixed”

1.1.5.2.2 Keeping it simple: A probability model for the errors

1.1.5.2.3 Keeping it simple: The mean of \(Y\) is a linear function of \(x\)

1.2 Lecture 2

1.2.1 The model for individuals in the population

1.2.1.0.1 What does \(\beta_0\) represent?

1.2.1.0.2 What does \(\beta_1\) represent?

1.2.2 Estimating \(Y\) For Simple Linear Regression

1.2.2.0.1 Some terminology for estimates of \(Y\) or its average

1.2.2.0.2 We are actually estimating model parameters

1.2.2.1 Residuals

1.2.3 Model Estimation

1.2.3.1 So, how do we estimate \(\beta_0\) and \(\beta_1\)?

1.2.3.2 How do we estimate \(\beta_0\) and \(\beta_1\)?

1.2.3.3 One criteria: Minimize the sum of squared residuals

1.2.3.4 Least squares estimates: \(\; \widehat{\beta}_{0}\) and \(\; \widehat{\beta}_{1}\)

1.2.3.5 Finally, let’s find the least squares estimates for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.2.3.6 The least squares estimates for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.2.3.7 Rewriting \(\widehat{\beta}_{1}\)

1.2.3.8 Summary: The simplest formulas for estimators \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.2.4 Side note: Simpler Model from Back in STAT 101 (No Predictor): Why did we choose \(\overline{y}\) to estimate \(\mu\)?

1.2.4.0.1 A “loss function”

1.2.4.1 Finding the least squares estimates using R

1.2.5 Checking the specifications of a linear regression model

1.2.5.1 Item 1: The errors are uncorrelated

1.2.5.2 Item 2a: The errors have mean zero

1.2.5.3 Item 2b: The errors are not correlated with the \(x\) values

1.2.5.4 Item 2c: \(\mathrm{var}(\epsilon_i) = \sigma^2\) is same for all \(i\) (for all \(x\))

1.2.5.5 Item 3: The mean of \(Y\) depends linearly on \(x\)

1.3 Lecture 3

1.3.1 Statistical Properties of LS Estimates of Model Parameters

1.3.1.1 Variability of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.3.1.2 Sampling distributions for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.3.1.3 Useful algebraic facts about \(\widehat{\beta}_{1}\)

1.3.1.4 OK. Now we can see \(\widehat{\beta}_{1}\) is an unbiased estimator for \(\beta_1\)

1.3.1.5 OK. Now we can see \(\widehat{\beta}_{0}\) is an unbiased estimator for \(\beta_0\)

1.3.2 Estimating the spread around the line: \(\sigma^2\)

1.3.3 The Normal Distribution

1.3.3.1 Useful choice of a probability distribution for the errors: Normal

1.3.4 Similar Concepts in “STAT 101”

1.3.4.1 This is not so novel. Recall the “STAT 101” one-sample t-test.

1.3.4.2 Back in STAT 101, we found LS estimate unbiased too.

1.3.4.3 So, back in STAT 101, we also used a t-distribution

1.3.5 Demo Sampling Distributions

1.3.5.1 Simulate sampling distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.3.5.2 Analyze simulated sampling distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

1.3.5.2.1 The means of the sampling distributions

1.3.5.2.2 The standard deviations of the sampling distributions

1.3.5.2.3 The shape of the sampling distributions

1.3.5.3 Back to the computer repair data…

1.3.6 Reminder: checking model assumptions