All lectures

1 Chapters 1 & 2

1.1 Lecture 1

First, I loaded some R packages needed for the code in this document. Click the Code button to see my R code.

library(tidyverse)
library(knitr)
library(mosaic)
library(gridExtra)
library(broom)
library(modelr)
library(car)

For any package, you install just once, then use library to load the package when needed. For example:

install.packages("mosaic") only once, ever.

library(mosaic) when you start up RStudio.

1.1.1 Models

1.1.1.1 Deterministic (Functional) Models

  • Some variables are related in nature.
  • Example: Equation from the CRC handbook describing a physical law.
    • \(F \;=\; ma\)
  • For some output variable \(Y\) and input \(X\),
    • \(Y \;=\; f(X)\)
  • A functional model perfectly relates \(Y\) to \(X\) (well… in theory).

1.1.1.2 Statistical Models

  • A Statistical model is a simple, low-dimensional summary of
    • the relationship present in the data
    • the data-generation process
    • the relationship present in the population
  • We may not know the functional form of relationship
  • The model allows errors (uncertainty)
    • \(Y \;=\; f(X) + \epsilon\)
    • random variable = deterministic function + random error
  • The model allows for deviations (variability) from the “true” function when estimating from data
    • \(\widehat{Y} \;=\; \widehat{f}(x) + e_i\)
    • estimate of variable = estimate of deterministic function + observed error

1.1.1.3 Linear Regression Models

Collect data on

  • \(Y\) = response variable
  • \(X_1, X_2, \ldots, X_p\) = predictor variables

General deterministic part of model \(\quad Y \;=\; f(X_1,X_2, \ldots, X_p)\)

We might opt for a linear relationship \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p\] …and then, allow for error (a statistical model) \[Y \;=\; \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots \beta_pX_p + \epsilon\] This is called a linear regression model

1.1.1.4 “All models are wrong, but some are useful”

(George Box)

Few processes are actually linear.

Still, the following are linear regression models

\[\begin{align} Y &\;=\; \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon \\ \\ Y &\;=\; \beta_0 + \beta_1 \log(X) + \epsilon \end{align}\] because the “coefficients” \(\beta_0\) and \(\beta_1\) enter the model linearly.

1.1.1.5 Some non-linear relationships can be linearized.

For example, the non-linear model \(Y \;=\; \beta_0 e^{\beta_1 X} \epsilon\)

can be transformed to a linear model (take the log of both sides) \[\begin{align} log(Y) &\;=\; \log(\beta_0) + \beta_1 X + \log(\epsilon) \\ \\ Y' & \;=\; \beta_0' + \beta_1 X + \epsilon' \end{align}\]

STOP

1.1.1.6 Linear or not? Can be linearized or not?

  1. \(Y \;=\; \beta_0 + \beta_1^X + \epsilon\)
  2. \(Y \;=\; \beta_0 \beta_1^X \epsilon\)
  3. \(Y \;=\; \beta_0 + \beta_1 e^{X^2} + \epsilon\)
  4. \(Y \;=\; \beta_0 + \beta_1 X^2 + \beta_2 \log(X) + \epsilon\)


Once we propose a linear model, the method is intuitive:
Fit a linear function through the data points using some objective criterion for a good “fit”.

1.1.2 Data

1.1.2.1 The Data Format

We will endevour to provide you with tidy data:

  • Each row corresponds to one observational unit (case)
  • Each variable is one column
observation response \(Y\) predictor \(x\)
1 \(y_1\) \(x_1\)
2 \(y_2\) \(x_2\)
\(n\) \(y_n\) \(x_n\)

1.1.2.2 Example: Computer Repair

A sample of records on computer repair service calls was collected.

Two variables were observed:

  • Time to complete work (minutes)
  • Number of units repaired

Which variable is response (\(Y\))?

Which variable is explanatory or “predictor” variable (\(X\))?

It depends… on the question of interest.

What if we know how long the repairer worked.
Did they repair as many units as we would have expected?

What if we knew how many units needed repair.
How much time should we schedule for the repairer?

1.1.2.3 Example: Computer Repairs (response = time, predictor = units)

\(Y\) = Time to complete work (minutes)

\(X\) = Number of units repaired

# Read in the data from the course data repository
repairData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P031.txt")

repairData # view the data (OK, since small sample)
Minutes Units
23 1
29 2
49 3
64 4
74 4
87 5
96 6
97 6
109 7
119 8
149 9
145 9
154 10
166 10
glimpse(repairData) # view a summary of the data (OK, for larger samples too)
Observations: 14
Variables: 2
$ Minutes <dbl> 23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166
$ Units   <dbl> 1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10
  1. Scatterplot using ggplot2 package (with ggformula package to help out)
  2. Scatterplot using lattice package
  3. Scatterplot using base R (defaults not pretty)
  4. Scatterplot using base R (make it prettier)
p1 = gf_point(Minutes ~ Units, data=repairData) # ggplot2 graphics made easier by ggformula package
p2 = xyplot(Minutes ~ Units, data=repairData) # lattice graphics package
grid.arrange(p1, p2, ncol=2) # arrange the plots side by side (2 columns)


par(mfrow=c(1,2)) # prepare to arrange base R plots with 1 row, 2 cols
plot(repairData$Units, repairData$Minutes)
plot(repairData$Units, repairData$Minutes, pch=20, xlab="Units", ylab="Minutes")

par(mfrow=c(1,1)) # return the base R setting to 1 plot (1 row, 1 col)

I usually use the ggplot2 package (with the ggformula package to help out).

STOP

Demonstrate ggformula syntax.




Fancy add-ons you can “chain” together with ggplot2

gf_point(Minutes ~ Units, data=repairData) %>%
  gf_rug() %>%
  gf_smooth(se = TRUE)

1.1.3 Summary Statistics

1.1.3.1 Reminder: Summary Statistics

Sample averages: \(\;\overline{y} \;=\; \frac{1}{n}\sum_{i=1}^n y_i \qquad \overline{x} \;=\; \frac{1}{n}\sum_{i=1}^n x_i\)

Sample variances: \[\mathrm{var}(y) \;=\; s_y^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2 \qquad \mathrm{var}(x) \;=\; s_x^2 \;=\; \frac{1}{n-1}\sum_{i=1}^n (x_i - \overline{x})^2\] Sample standard deviations: \[\mathrm{sd}(y) \;=\; s_y \;=\; \sqrt{s_y^2}\;, \quad \mathrm{sd}(x) \;=\; s_x \;=\; \sqrt{s_x^2}\]

Standardized data: \[z_i \;=\; \frac{y_i - \overline{y}}{s_y} \qquad w_i \;=\; \frac{x_i - \overline{x}}{s_x}\] Can show that \(\overline{z} \;=\; 0\) and \(\mathrm{sd}(z) \;=\; s_z \;=\; 1\) (same for \(w\))

STOP

Show that \(\overline{z} \;=\; 0\)






\[\overline{z} \;=\; \frac{1}{n}\sum_{i=1}^n z_i \;=\; \frac{1}{n}\sum_{i=1}^n \frac{y_i - \overline{y}}{s_y} \;=\; \frac{1}{n(s_y)} \sum_{i=1}^n (y_i - \overline{y}) \;=\; \frac{1}{n(s_y)} \left[ \sum_{i=1}^n y_i - \sum_{i=1}^n \overline{y}\right] \;=\; \frac{1}{n(s_y)} [ n\overline{y} - n\overline{y}] \quad\;=\;\quad 0 \]

1.1.3.2 Example: Computer Repair

Two variables:

  • \(Y\) = Time to complete work (minutes)
  • \(X\) = Number of units repaired
ybar = mean(~ Minutes, data=repairData)
ybar
[1] 97.21
xbar = mean(~ Units, data=repairData)
xbar
[1] 6
sy = sd(~ Minutes, data=repairData)
sy
[1] 46.22
sx = sd(~ Units, data=repairData)
sx
[1] 2.961

\(\bar y = 97.21 \quad \overline{x}= 6\)
\(s_y = 46.22 \quad s_x = 2.961\)

# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"

# Here we add standardized versions of the original variables
repairData = mutate(repairData, 
  stdMinutes = (Minutes - mean(Minutes)) / sd(Minutes),
  stdUnits = (Units - mean(Units)) / sd(Units)
)
repairData
Minutes Units stdMinutes stdUnits
23 1 -1.6058 -1.6885
29 2 -1.4760 -1.3508
49 3 -1.0432 -1.0131
64 4 -0.7187 -0.6754
74 4 -0.5023 -0.6754
87 5 -0.2210 -0.3377
96 6 -0.0263 0.0000
97 6 -0.0046 0.0000
109 7 0.2550 0.3377
119 8 0.4714 0.6754
149 9 1.1205 1.0131
145 9 1.0339 1.0131
154 10 1.2287 1.3508
166 10 1.4883 1.3508

# Confirm that the average of standardized data is 0, with sd=1
zbar = mean(~ stdMinutes, data=repairData)
zbar
[1] 1.192e-16
sz = sd(~ stdMinutes, data=repairData)
sz
[1] 1

For the standardized response (Minutes): \(\overline{z}= 0 \quad s_z = 1,\) as expected.

Standardizing the data does not change the “relationship”

p1 = gf_point(Minutes ~ Units, data=repairData)
p2 = gf_point(stdMinutes ~ stdUnits, data=repairData)
grid.arrange(p1, p2, ncol=2)

1.1.4 Correlation

1.1.4.1 Reminder: Covariance and Correlation

  • variance of \(y\) is

\[\mathrm{var}(y) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y})^2 \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (y_i - \overline{y})\]

  • covariance of \(y\) and \(x\):

\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]

  • variance of \(z\) is

\[\mathrm{var}(z) \;=\; \frac{1}{n-1} \sum_i (z_i - \overline{z})^2 \;=\; \frac{1}{n-1} \sum_i (z_i - 0)^2 \;=\; \frac{1}{n-1} \sum_i z_i^2 \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \] From here, it’s simple to show that \(\mathrm{var}(z) = 1\):

\[\mathrm{var}(z) \quad\;=\;\quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{y_i - \overline{y}}{s_y}\right) \quad\;=\;\quad \frac{1}{s_y^2} \frac{1}{n-1} \sum_i \left(y_i - \overline{y}\right) \left(y_i - \overline{y}\right) \quad\;=\;\quad \frac{1}{s_y^2} var(y) = \frac{1}{s_y^2} s_y^2 = 1 \]

  • correlation of \(y\) and \(x\) is \(cov(z,w)\) of standardized data:

\[\begin{align} r \;=\; \mathrm{corr}(y,x) \quad &\;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right) \\ \\ \quad &\;=\; \quad \frac{\mathrm{cov}(y,x)}{s_y s_x} \quad \;=\; \quad \frac{\mathrm{cov}(y,x)}{\sqrt{\mathrm{var}(y) \mathrm{var}(x)}} \end{align}\]

1.1.4.2 How covariance measures “relationship”

Consider the following variables

  • \(V\) = hours of sleep two nights before midterm exam
  • \(W\) = hours of sleep the night before midterm exam
  • \(Z\) = hours of sleep the night after the midterm exam

Consider \(V\) and \(W\).
Which is reponse?
Which is predictor?
Would you expect a positive or negative correlation?

Consider \(W\) and \(Z\).
Which is reponse?
Which is predictor?
Would you expect a positive or negative correlation?

STOP

Ponder the questions above.








1.1.4.3 How covariance measures “relationship”

  • sleep2Before = \(V\) = hours of sleep two nights before midterm exam
  • sleepBefore = \(W\) = hours of sleep the night before midterm exam
  • sleepAfter = \(Z\) = hours of sleep the night after the midterm exam

Here is a little data:

v = c(5, 7, 4, 8, 6)
w = c(3, 8, 2, 5, 6)
z = c(8, 7, 9, 7, 8)
df = tibble(sleep2Before=v, sleepBefore=w, sleepAfter=z)
df
sleep2Before sleepBefore sleepAfter
5 3 8
7 8 7
4 2 9
8 5 7
6 6 8
# gf_vline adds a vertical line at "xintercept"
# gf_hline adds a horizontal line at "yintercept"
p1 = gf_point(sleepBefore ~ sleep2Before, data=df) %>%
  gf_vline(xintercept=mean(~ sleep2Before, data=df), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepBefore, data=df), linetype=2)
p2 = gf_point(sleepAfter ~ sleepBefore, data=df) %>%
  gf_vline(xintercept=mean(~ sleepBefore, data=df), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepAfter, data=df), linetype=2)
grid.arrange(p1, p2, ncol=2)

\[\mathrm{cov}(y,x) \;=\; \frac{1}{n-1} \sum_i (y_i - \overline{y}) (x_i - \overline{x})\]

covwv = cov(sleepBefore ~ sleep2Before, data=df)
covwv
[1] 2.75
covzw = cov(sleepAfter ~ sleepBefore, data=df)
covzw
[1] -1.55

\(\mathrm{cov}(w,v) = 2.75\)
\(\mathrm{cov}(z,w) = -1.55\)

What if we recorded the data in minutes instead of hours?
Then, the relationships are still the same

df2 = df*60
p3 = gf_point(sleepBefore ~ sleep2Before, data=df2) %>%
  gf_vline(xintercept=mean(~ sleep2Before, data=df2), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepBefore, data=df2), linetype=2)
p4 = gf_point(sleepAfter ~ sleepBefore, data=df2) %>%
  gf_vline(xintercept=mean(~ sleepBefore, data=df2), linetype=2) %>%
  gf_hline(yintercept=mean(~ sleepAfter, data=df2), linetype=2)
grid.arrange(p1, p2, p3, p4, ncol=2)

…but, the covariance changes:

cov(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 9900
cov(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -5580

\(\mathrm{cov}(60w,60v) = 9,900\)
\(\mathrm{cov}(60z,60w) = -5,580\)

Standardization (correlation) takes care of the units problem. \[r \;=\; \mathrm{corr}(y,x) \quad \;=\; \quad \frac{1}{n-1} \sum_i \left(\frac{y_i - \overline{y}}{s_y}\right) \left(\frac{x_i - \overline{x}}{s_x}\right)\]

corrwv = cor(sleepBefore ~ sleep2Before, data=df)
corrwv
[1] 0.7285
cor(60*sleepBefore ~ 60*sleep2Before, data=df)
[1] 0.7285

corrzw = cor(sleepAfter ~ sleepBefore, data=df)
corrzw
[1] -0.776
cor(60*sleepAfter ~ 60*sleepBefore, data=df)
[1] -0.776
covwv = cov(sleepBefore ~ sleep2Before, data=df)
covzw = cov(sleepAfter ~ sleepBefore, data=df)

\(\mathrm{corr}(w,v) = 0.7285\)
\(\mathrm{corr}(60w,60v) = 0.7285\)

\(\mathrm{corr}(z,w) = -0.776\)
\(\mathrm{corr}(60z,60w) = -0.776\)

1.1.4.4 Facts about correlation and covariance

  • Covariance and correlation are measures of association (relationship)
  • Positive or negative, according to the direction of the association
  • Covariance is not scale invariant \[\mathrm{cov}(a + by, c + dx) \;=\; bd\mathrm{cov}(y,x)\]
  • Correlation is scale invariant (unitless) \[\mathrm{corr}(a + by, c + dx) \;=\; {\rm sign}(bd)\mathrm{corr}(y,x)\]
  • The sample correlation coefficient between \(y\) and \(x\) is \(r \;=\; \mathrm{corr}(y,x)\)
  • \(-1 \le r \le 1\)
  • \(r = -1\) or \(r = 1\) means perfect correlation:
    • All the points are on a straight line.
  • Strength of association:\(\;\; r\) closer to \(-1\) or \(1\) is strongest
  • \(r\) is a sample estimate for population correlation coefficient: \(\;\rho \;=\; \mathrm{corr}(Y,X)\).

1.1.4.6 Always make a scatterplot when correlation calculated

Let \(Y = 50 - X^2\), then

  • \(Y\) and \(X\) are have a perfect relationship (no deviations from the curve)
  • …but the relationship is quadratic, not linear
  • By symmetry, the correlation in this case is exactly zero!
mydata = tibble(x = c(-7:7), y = 50 - x^2)
mydata
x y
-7 1
-6 14
-5 25
-4 34
-3 41
-2 46
-1 49
0 50
1 49
2 46
3 41
4 34
5 25
6 14
7 1
gf_point(y ~ x, data=mydata) +
  geom_vline(data=mydata, aes(xintercept=mean(x)), linetype=2) +
  geom_hline(data=mydata, aes(yintercept=mean(y)), linetype=2)

cor(y ~ x, data=mydata)
[1] 0

Believe it or not (HW),
all four relationships below have exactly the same correlation coefficient.

AnscombeData = read.delim("http://statistics.uchicago.edu/~burbank/data/RABE5/P029b.txt")
p1 = gf_point(Y1 ~ X1, data=AnscombeData)
p2 = gf_point(Y2 ~ X2, data=AnscombeData)
p3 = gf_point(Y3 ~ X3, data=AnscombeData)
p4 = gf_point(Y4 ~ X4, data=AnscombeData)
grid.arrange(p1, p2, p3, p4, ncol=2)

We definitely should make a plot to interpret correlation!

1.1.4.7 Example: Computer Repair

Two variables:

  • \(Y\) = Time to complete work (minutes)
  • \(X\) = Number of units repaired

The relationship appears linear.

repairData
Minutes Units stdMinutes stdUnits
23 1 -1.6058 -1.6885
29 2 -1.4760 -1.3508
49 3 -1.0432 -1.0131
64 4 -0.7187 -0.6754
74 4 -0.5023 -0.6754
87 5 -0.2210 -0.3377
96 6 -0.0263 0.0000
97 6 -0.0046 0.0000
109 7 0.2550 0.3377
119 8 0.4714 0.6754
149 9 1.1205 1.0131
145 9 1.0339 1.0131
154 10 1.2287 1.3508
166 10 1.4883 1.3508
gf_point(Minutes ~ Units, data=repairData)

Correlation is quite high.

covyx = cov(Minutes ~ Units, data=repairData)
covyx
[1] 136
corryx = cor(Minutes ~ Units, data=repairData)
corryx
[1] 0.9937

Covariance and correlation in original units:
\(\mathrm{cov}(y,x) = 136\)
\(\mathrm{corr}(y,x) = 0.9937\)

covzw = cov(stdMinutes ~ stdUnits, data=repairData)
covzw
[1] 0.9937
corrzw = cor(stdMinutes ~ stdUnits, data=repairData)
corrzw
[1] 0.9937

Covariance and correlation in standardized units:
\(\mathrm{cov}(z,w) = 0.9937\)
\(\mathrm{corr}(z,w) = 0.9937\)

1.1.5 Simple Linear Regression (one predictor)

1.1.5.1 The linear regression model

\[\begin{align} Y & \;=\; f(X) + \epsilon \\ \\ & \;=\; \beta_0 + \beta_1 X + \epsilon \end{align}\]

This is called a simple linear regression model
since there’s only one predictor: \(X\)

random variable = deterministic linear function + random error

** Illustration of data generation **





1.1.5.1.1 Why model the data-generating process

If we make a detailed model of how the data are generated, we can:

  • Generate samples from the model – simulated data
  • Find fitted regression coefficients for these simulated data, and then…
  • Plot sampling distributions for these fitted coefficients
  • See how much the fitted coefficients differ from the true population coefficients, and then…
  • Find sampling distributions for the differences between fitted and true coefficients, and then…
  • Do hypothesis testing, calculate confidence intervals, etc.

Or, we can do all of the above analytically. This gives us results in nice algebraic form, and lets us see which exact assumptions about the data generating process were necessary to our conclusions.

1.1.5.2 Keeping it simple: assumptions for Simple Linear Regression

If the real data generation process is not well-described by a model meeting the following criteria, then we need a different model.

1.1.5.2.1 Keeping it simple: Treat \(x\) values as “fixed”

Throughout the course, we will consider the \(x\) values as fixed.
That is, all models are built conditional on the \(x\) values observed.
Adding in a model for the distribution of \(X\) is beyond the scope of the course.

Still, the model conditional on the observed \(x\) values (although incorrect as all models are) is very useful for summarizing and describing relationships and making predictions for \(Y\) based on new observations on \(x\).

1.1.5.2.2 Keeping it simple: A probability model for the errors
  1. The errors are uncorrelated with each other
    • \(\mathrm{corr}(Y_i, Y_j) \;=\; \mathrm{corr}(\epsilon_i, \epsilon_j) \;=\; 0\) for all \(i\ne j\)
  2. The errors have no systematic info about \(Y\)
    1. \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\))
      • The distribution of errors has mean zero
      • No systematic bias
    2. The errors are not correlated with the \(x\) values
    3. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x\))
1.1.5.2.3 Keeping it simple: The mean of \(Y\) is a linear function of \(x\)

The mean of \(Y\) must depend linearly on \(x\).

\[\begin{align} E(Y\; |\; X=x) &\;=\; E(\beta_0 +\beta_1 x + \epsilon ) \\ \\ & \;=\; \beta_0 +\beta_1 x + E(\epsilon) \\ \\ & \;=\; \beta_0 +\beta_1 x + 0 \;=\; \beta_0 +\beta_1 x \end{align}\]

So, if the errors have no relationship to the values of \(x\), it follows that the mean of \(Y\) is a linear function of \(x\). (Or a better way to put it: if the mean of \(Y\) is not a linear function of \(x\), the errors won’t meet criterion 2b.)







1.2 Lecture 2

1.2.1 The model for individuals in the population

For individual \(i\) in the population with predictor value \(X=x=x_i\)

\[Y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\] This model says that the data generation process can be described as fixing values for \(x\), then finding corresponding \(y=\beta_0 + \beta_1 x\), and finally adding a random error \(\epsilon\) for each individual.

The average response for such individuals is \(E(Y\; |\; X=x_i) \;=\; \beta_0 +\beta_1 x_i\)

Let’s try generating a sample from a population
where we have fixed values for \(\beta_0\), \(\beta_1\), and \(x\) values.

beta0_sim1 = 3
beta1_sim1 = 4
n_sim1 = 50
mx_sim1 = 2
sx_sim1 = 1

I set \(\beta_{0} = 3\), \(\beta_{1} = 4\), and \(n= 50\).

First we fix the values and pick a set of \(n = 50\) \(x\) values.
I’ll pick them from a Normal(\(\mu = 2, \sigma = 1\)) distribution.

set.seed(1234567)
# Generate x values for individuals in population
X = rnorm(n_sim1, m = mx_sim1, s = sx_sim1) 
gf_histogram(~ X, bins=12, color = "white")

If the relationship is deterministic,
we could generate \(Y\)-values with a simple linear relationship: \(Y = 3 + 4 x\)

Y_without_error = beta0_sim1 + beta1_sim1 * X        # Y values without errors
gf_point(Y_without_error ~ X, color='red')

What is the intercept of this line? What is its slope?


gf_point(Y_without_error ~ X, color = 'red') %>%
  gf_abline(slope = beta1_sim1, intercept = beta0_sim1)

Of course, for a statistical model,
we only require the mean of \(Y\) to be a linear function of \(x\)

sd_error_sim1 = 2

So, let’s generate a set of errors \(\epsilon\), one for each observation.
I’ll pick them from a normal distribution: Normal(\(\mu = 0, \sigma = 2\)) .

Reminder: According to our simple statistical model, the errors have mean = 0

set.seed(123466)
# Generate error values for individuals in population with given x-values
error = rnorm(n_sim1, m = 0, s = sd_error_sim1) 
gf_histogram(~ error, bins=12, color = "white")

Finally, we can generate \(Y\) values from the \(x\) and error values
by adding the errors to the linear function of \(x\) values.

Y = beta0_sim1 + beta1_sim1 * X + error # Y values, with errors
generated_data = tibble(X, error, Y_without_error, Y)
head(generated_data, 10)
X error Y_without_error Y
2.1567 -2.5555 11.627 9.0713
3.3738 -0.2737 16.495 16.2216
2.7307 -2.0445 13.923 11.8782
0.6492 0.8448 5.597 6.4416
1.9915 -0.8568 10.966 10.1092
2.3210 0.9133 12.284 13.1972
0.2219 -3.0707 3.887 0.8167
2.9095 1.4770 14.638 16.1150
1.0806 2.6609 7.322 9.9832
1.8423 0.4994 10.369 10.8685

gf_point(Y ~ X, data = generated_data, color = 'blue', show.legend = TRUE) %>%
  gf_point(Y_without_error ~ X, data = generated_data, color = 'red') %>%
  gf_abline(intercept = beta0_sim1, slope = beta1_sim1)

1.2.1.0.1 What does \(\beta_0\) represent?

We have already seen that \(\beta_0\) is the intercept of the above graphs. Let’s make this more precise.

When \(x=0\), we expect \(Y\) to be \(\beta_0\).

\[E(Y\; |\; X=0) \;=\; \beta_0 +\beta_1 (0) \;=\; \beta_0\]

Computer repair data: What does \(\beta_{0}\) represent?

gf_point(Minutes ~ Units, data=repairData) +
  xlim(-1, max(~ Units, data=repairData)) + 
  ylim(10, max(~ Minutes, data=repairData))

#  gf_smooth(method = "lm", se = FALSE) +


STOP …and think about it.





When \(x=0\), we estimate the mean of \(y\) to be \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} (0) = \widehat{\beta}_{0}\) minutes.

For the computer example, this means that when no units are serviced,
(guessing from the graph) we expect the work to take about 5 minutes.

Hmmmm…. That doesn’t make sense.

Perhaps interpret this a “start up” time before working on the 1st unit?

1.2.1.0.2 What does \(\beta_1\) represent?

We have already seen that \(\beta_1\) is the slope of the above graphs. Let’s make this more precise, too – and in a way that will carry forward when we go to multiple predictor variables, later in the course.

For individuals with predictor value \(x+1\) versus those with \(x\),
we expect the reponse to be \(\beta_1\) units higher (or lower).

\[\begin{align} E(Y\; |\; X=x+1) &\;=\; \beta_0 +\beta_1 (x+1) \\ \\ &\;=\; \beta_0 +\beta_1 x + \beta_1 \;=\; E(Y\; |\; X=x) + \beta_1 \end{align}\]

That is, \(\; E(Y\; |\; X=x+1) - E(Y\; |\; X=x) \;=\; \beta_1\)

Computer repair data: What does \(\beta_1\) represent?

gf_point(Minutes ~ Units, data=repairData) +
  xlim(-1, max(~ Units, data=repairData)) + 
  ylim(10, max(~ Minutes, data=repairData))

#  gf_smooth(method = "lm", se = FALSE) +


STOP …and think about it.





For each additional unit to be serviced, we estimate the mean time to completion (guessing from the graph) to be about 12 minutes longer.

Estimates of \(\beta_{0}\) and \(\beta_{1}\) from data:

tidy(lm(Minutes ~ Units, data=repairData))
term estimate std.error statistic p.value
(Intercept) 4.162 3.355 1.24 0.2385
Units 15.509 0.505 30.71 0.0000

We will soon learn how to “best” calculate a slope and intercept from data.

1.2.2 Estimating \(Y\) For Simple Linear Regression

1.2.2.0.1 Some terminology for estimates of \(Y\) or its average

For individual \(i\) in the sample

\[y_i \;=\; \beta_0 +\beta_1 x_i + \epsilon_i\]

Suppose we know \(x_i\) but not \(y_i\). The true value is \(y_i \;=\; \beta_0 +\beta_1 X + \epsilon_i\), but we don’t know \(\epsilon_i\)

The fitted (estimated) value of \(Y\)
for individual \(i\) already in the sample is \[\widehat{y}_i \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_i\]

The estimated average (expected) value
for all individuals in the population with \(X=x_0\) is \[\widehat{\mu}_0 = \;=\;\widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]

The predicted (estimated) value
for a yet-to-be-observed or just now observed individual in the population with \(X=x_0\) is \[\widehat{y}_0 \;=\; \widehat{\beta}_{0} + \widehat{\beta}_{1} x_0\]

1.2.2.0.2 We are actually estimating model parameters

Most of the time, we don’t know anything about the underlying population, including the true values of the parameters \(\beta_0\) and \(\beta_1\), and \(\mathrm{var}(\epsilon)=\sigma^2\).

All we have are observations in a sample.

We will use these observations to calculate estimates of \(\beta_0\), \(\beta_1\), as well as errors \(\epsilon_i\) and their variance.

1.2.2.1 Residuals

Suppose we have estimates for \(\beta_0\) and \(\beta_1\).

We can then use these estimates to calculate the fitted response \(\widehat{y}_i\) for each observation.

The difference between the fitted response and the observed response is an estimate of \(\epsilon_i\).

The estimated error (estimate of \(\epsilon_i\)) is called a residual (\(e_i\))

\[\begin{align} {\rm residual} \;=\; e_i &\;=\; {\rm observed - fitted} \\ &\;=\; {\rm observed - expected} \\ &\;=\; {\rm observed - predicted} \\ \\ &\;=\; y_i - \widehat{y}_i\\ \\ &\;=\; y_i - (\widehat{\beta}_{0} + \widehat{\beta}_{1} x_i) \end{align}\]

# I'm going to use Base R graphics here since I happen to know how to 
# add lines and text to a plot (off the top of my head).
# Someday, maybe I'll learn to do this in ggplot2 (ggformula).
a = 10
b = 4
mx = 15
sx = 5
s = 20
nsim2 = 100
set.seed(123456)
x = rnorm(nsim2, m=mx, sd=sx)
y = a + b*x + rnorm(nsim2, m=0, sd=s)
ybar = mean(y)
xbar = mean(x)

# plot(y ~ x, xaxt="n", yaxt="n", bty="l", xlim=c(0,max(x)), ylim=c(0,max(y)))
# abline(a=a, b=b, lwd=3)
# abline(h = ybar)
# identify(y ~ x)

xi = xbar + sx
yi = ybar + 3*s
yhati = a + b*xi

plot(y ~ x, type="n", xaxt="n", yaxt="n", bty="l")
abline(a=a, b=b, lwd=3)

points(xi, yi, pch=20, cex=2)
points(xi, yhati, cex=2)

lines(c(xi,xi), c(0, ybar))
lines(c(xi,xi), c(ybar, yhati), lty=2)
lines(c(xi,xi), c(yhati, yi), lty=2, lwd=3)
lines(c(0,xi), c(yi, yi))
lines(c(0,xi), c(ybar, ybar))
lines(c(xbar - sx, xbar - sx), c(ybar, yi))

text(x=xi, y=ybar + 2*s, labels="residual  =  (y_i - yhat_i)", pos=2)
text(x=xi, y=ybar + 0.5*s, labels="(yhat_i - ybar)", pos=4)
text(x=xbar - sx, y=ybar + 1.5*s, labels="(y_i - ybar)", pos=2)
text(x=xbar - 1.2*sx, ybar - 2*s, labels="yhat = b0 + b1(x)")
text(x=xbar-2.75*sx, y=yi, labels="y_i", pos=1)
text(x=xbar-2.75*sx, y=ybar, labels="ybar", pos=1)

Residuals for the computer repair data:

# Find the least squares estimates for intercept and slope (lm function)
repairFit = lm(Minutes ~ Units, data=repairData)
# Once you have an "lm object", you can grab all kinds of summaries from it
# save the fitted values: predict function
# save the residuals: residuals function

# Use the "mutate" verb in the dplyr package to add variables
# to an existing dataset (called a "data frame" in R)
# There is a DataCamp.com course on this: "Data Manipulation in R with dplyr"
repairData = mutate(repairData,
                     fitted = predict(repairFit),
                     resids = residuals(repairFit)
                     )
# This is a pretty fancy plot that I Googled and borrowed from here:
# https://www.r-bloggers.com/visualising-residuals/
ggplot(repairData, aes(x = Units, y = Minutes)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
  geom_segment(aes(xend = Units, yend = fitted), alpha = 0.2) +  # alpha to fade lines
  geom_point() +
  geom_point(aes(y = fitted), shape = 1) +
  theme_bw()  # Add theme for cleaner look

1.2.3 Model Estimation

1.2.3.1 So, how do we estimate \(\beta_0\) and \(\beta_1\)?

What line should we choose to “fit” the model to data?

(This is just computer simulated data for illustration.)

# This entire section is borrowed from Hadley Wickham: R for Data Science
# http://r4ds.had.co.nz/model-basics.html
hadleyData <- modelr::sim1
ggplot(hadleyData, aes(x, y)) + 
  geom_point()

1.2.3.2 How do we estimate \(\beta_0\) and \(\beta_1\)?

m = 250

Here are \(250\) random possible models, but most are terrible!

set.seed(12345)
models = tibble(
  b0 = runif(m, -20, 40),
  b1 = runif(m, -5, 5)
)

ggplot(hadleyData, aes(x, y)) + 
  geom_abline(aes(intercept = b0, slope = b1), data = models, alpha = 1/4) +
  geom_point() 

### Loss functions No matter which two number we might choose to measure slope and intercept,
we are summarizing an entire distribution with two numbers.

  • There is a cost.
  • We lose information.
  • We should measure that loss and be aware of its magnitude.
  • Statisticians measure loss numerically with a “loss function”
    • measures the distance of the data from the line given with a two-number summary (the “slope” and “intercept”) and
    • is a measure of distance (spread) from the line.

1.2.3.3 One criteria: Minimize the sum of squared residuals

The least squares estimates \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) minimize the sum of squared residuals

In our textbook,
SSE is called the “Sum of Squared Residuals (Errors)”.
MSE is called the “Mean Squared Residual (Error)”.

NOTATION CAUTION:
This is a bit confusing since “errors” (\(\epsilon_i\)) are not observed.
Only residuals are observed, but “SSR” is reserved for another quantity to be discussed later.

\[\mathrm{SSE}\;=\; \sum_i e_i^2 \;=\; \sum_i (y_i - \widehat{y}_i)^2 \;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i)^2\]

hadleyFit = lm(y ~ x, data=hadleyData)
mydata = hadleyData
mydata$fitted = predict(hadleyFit)   # Save the fitted values
mydata$residuals = residuals(hadleyFit) # Save the residual values
ggplot(mydata, aes(x, y)) +
  geom_smooth(method = "lm", se = FALSE, color = "lightgrey") +  # Plot regression slope
  geom_segment(aes(xend = x, yend = fitted), alpha = 0.2) +  # alpha to fade lines
  geom_point() +
  geom_point(aes(y = fitted), shape = 1) +
  theme_bw()  # Add theme for cleaner look

1.2.3.4 Least squares estimates: \(\; \widehat{\beta}_{0}\) and \(\; \widehat{\beta}_{1}\)

Which models have the smallest sum of squared residuals (SSE)?

model1 = function(b, data) {
  b[1] + data$x * b[2]
}
measure_distance = function(mod, data) {
  diff = data$y - model1(mod, data)
  sum(diff ^ 2) # changed to SSE from Hadley's RMSE sqrt(mean(diff ^ 2))
}
hadley_SSE = function(b0, b1) {
  measure_distance(c(b0, b1), hadleyData)
}
models = models %>% 
  mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))
ggplot(hadleyData, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = b0, slope = b1, colour = SSE), 
    data = filter(models, rank(SSE) <= 10)
  )

Let’s try to find the smallest SSE on our own.

http://www.rossmanchance.com/applets/RegShuffle.htm

Among the \(250\) models, we have isolated the 10 “closest” to the data

ggplot(models, aes(b0, b1)) +
  geom_point(data = filter(models, rank(SSE) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = SSE)) +
  labs(x = "Intercepts", y = "Slopes")

Here is a more organized approach: A grid search

grid = expand.grid(
  b0 = seq(-5, 20, length = 25),
  b1 = seq(1, 3, length = 25)
  ) %>% 
  mutate(SSE = purrr::map2_dbl(b0, b1, hadley_SSE))

grid %>% 
  ggplot(aes(b0, b1)) +
  geom_point(data = filter(grid, rank(SSE) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = SSE)) 

ggplot(hadleyData, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = b0, slope = b1, colour = SSE), 
    data = filter(grid, rank(SSE) <= 10)
  )

# End of section borrowed from Hadley Wickham: R for Data Science

1.2.3.5 Finally, let’s find the least squares estimates for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

Of course, we can make the grid finer and finer to get closer to the “best fit” line

…or just use calculus to minimize the SSE

We obtain the least squares estimators by minimizing the sum of squared residuals.

You do NOT need to be able to do the math, just understand it.

Find \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) that lead to:

\[\min(\mathrm{SSE}) \;=\; \min \sum_i e_i^2 \;=\; \min \sum_i (y_i-\widehat{y}_i)^2 \;=\; \min \sum_i (y_i-\widehat{\beta}_{0}-\widehat{\beta}_{1} x_i)^2\]

  • Differentiate the SSE with respect to \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\),
  • Set the two equations equal to zero, and
  • solve for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

\[\begin{align} \textrm{w.r.t } \widehat{\beta}_{0} \Rightarrow -2\sum (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (1) \\ \\ \textrm{w.r.t } \widehat{\beta}_{1} \Rightarrow -2\sum x_i(y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1} x_i) \;=\; 0 \hskip1in (2) \end{align}\]

1.2.3.6 The least squares estimates for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

From equation (\(1\)): \(\;\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)

Plug this into equation (\(2\)): \[\begin{align} & \sum x_i(y_i - \overline{y}- \widehat{\beta}_{1}\overline{x}-\widehat{\beta}_{1}x_i)=0 \\ \\ \implies\; & \sum x_i(y_i - \overline{y})\;=\; \widehat{\beta}_{1} \sum x_i(x_i - \overline{x}) \\ \\ \implies\; & \widehat{\beta}_{1}\;=\; \frac{\sum x_i(y_i - \overline{y})}{\sum x_i(x_i - \overline{x})} \hskip2in (3) \end{align}\]

1.2.3.7 Rewriting \(\widehat{\beta}_{1}\)

Note that \[\begin{align} & \sum (x_i-\overline{x})(y_i - \overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) -\sum \overline{x}(y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}\sum (y_i-\overline{y}) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) - \overline{x}(0) \\ \\ &\;=\; \sum x_i (y_i-\overline{y}) \end{align}\]

Similarly, we have

  • \(\sum (x_i-\overline{x})(y_i - \overline{y})\;=\; \sum y_i(x_i-\overline{x})\)

  • \(\sum (x_i-\overline{x})^2\;=\; \sum x_i(x_i-\overline{x})\)

Therefore, according to (\(3\)) we have other ways to write the estimator \[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]

1.2.3.8 Summary: The simplest formulas for estimators \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

\(\widehat{\beta}_{0} \;=\; \overline{y}- \widehat{\beta}_{1} \overline{x}\)

\(\widehat{\beta}_{1} \;=\; r \frac{s_y}{s_x}\)

1.2.4 Side note: Simpler Model from Back in STAT 101 (No Predictor): Why did we choose \(\overline{y}\) to estimate \(\mu\)?

When sampling from one population of quantitative values \(Y\), we estimated the mean of \(Y\): \(E(Y) = \mu\).

Using our new linear regression notation, we actually estimated the mean of \(Y\) for the linear model: \(\qquad Y = \beta_0 + \epsilon \;\;=\;\; \mu + \epsilon\)

Now, we want to estimate \(\widehat{\mu} = \widehat{y}= \widehat{\beta}_{0}\)

1.2.4.0.1 A “loss function”

We will again use as our loss function the sum (or mean) of squared residuals.
\[ SSE(a) = \sum_{i=1}^n\, (y_i-a)^2 \qquad\qquad MSE(a) = \frac{1}{n-1}\sum_{i=1}^n\, (y_i-a)^2 \]

What value of \(a\) should we choose if loss measured by SSE?

It seems reasonable that \(a\) should be in the “center” of the data.
But which value in the middle would be best?

One optimality criteria: Choose a number that minimizes SSE.

Using the least squares estimation method, \(\widehat{\beta}_{0}\) is the value that minimizes

\[\mathrm{SSE}\;=\; \sum (y_i - \widehat{\mu})^2 \;=\; \sum (y_i - \widehat{y})^2 \;=\; \sum(y_i - \widehat{\beta}_{0})^2\]

So, we want to minimize a function of the form \(f(a) = \sum (y_i -a)^2\) with respect to \(a\).

\[\frac{d}{da} f(a) = 2 \sum (y_i - a) (-1) = -2 \sum(y_i - a)\]

Set the derivative = 0 and solve for \(a\).

\[ \begin{align} -2 \sum(y_i - a) = 0 \implies\; & \sum(y_i - a) = 0 \\ \implies\; & \sum y_i - \sum a = 0 \\ \implies\; & \sum y_i - n a = 0 \\ \implies\; & \sum y_i = n a \\ \implies\; & a = \frac{1}{n} \sum y_i \\ \implies\; & a = \overline{y} \end{align} \]

The 2nd derivative of \(f(a)\) is

\[\frac{d}{da} (-2)\sum (y_i -a) = \frac{d}{da} (-2)\sum y_i + \frac{d}{da} (-2)\sum (-a) = 0 + 2n > 0\]

So, \(\widehat{\beta}_{0} = \overline{y}\) is a minimum, called the “least squares” estimate for \(\mu\).

So, \(\; \widehat{\beta}_{0} \;=\; \widehat{\mu} \;=\; \widehat{y}\;=\; \overline{y}\)

1.2.4.1 Finding the least squares estimates using R

lm = “linear model” function

lm(y ~ x, data = mydata)

Back to the computer repair data:

  • \(Y\) = Time to complete work (minutes)
  • \(X\) = Number of units repaired
# Find the least squares estimates for slope and intercept (lm function)
lmfit = lm(Minutes ~ Units, data=repairData)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Minutes ~ Units, data=repairData) %>%
  gf_coefline(model=lmfit)

repairFit = lm(Minutes ~ Units, data=repairData)
# tidy (from the broom package) gives a summary of the slope and intercept
# and accompanying estimate of variance in the estimates
tidy(repairFit)
term estimate std.error statistic p.value
(Intercept) 4.162 3.355 1.24 0.2385
Units 15.509 0.505 30.71 0.0000
b0 = tidy(repairFit)$estimate[1]
b0
[1] 4.162
b1 = tidy(repairFit)$estimate[2]
b1
[1] 15.51

Let’s try the least squares estimate on the simulated data we generated earlier
where we set \(\beta_{0} = 3\) and \(\beta_{1} = 4\) and generated random \(Y\) values.

Since we simulated the data from the linear regression model,
then the least squares estimates should be accurate.

# Find the least squares estimates for slope and intercept (lm function)
lmfit_sim = lm(Y ~ X, data = generated_data)
# gf_coefline will draw the regression line on the scatterplot
gf_point(Y ~ X, data = generated_data) %>%
  gf_coefline(model = lmfit_sim) %>%
  gf_abline(intercept = beta0_sim1, slope = beta1_sim1, color = 'red',alpha=1)

Remember, our true population parameters were \(\beta_0= 3\) and \(\beta_1= 4\).

lmfit_sim = lm(Y ~ X, data = generated_data)
tidy(lmfit_sim)
term estimate std.error statistic p.value
(Intercept) 2.937 0.7552 3.889 0.0003
X 3.983 0.3592 11.088 0.0000
b0_sim = tidy(lmfit_sim)$estimate[1]
b1_sim = tidy(lmfit_sim)$estimate[2]

Indeed \(\widehat{\beta}_{0}=2.937 \approx 3 = \beta_{0} \qquad \text{ and } \qquad \widehat{\beta}_{1}=3.983 \approx 4 = \beta_{1}\)

1.2.5 Checking the specifications of a linear regression model

\(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)

random variable = deterministic linear function + random error

This is called a simple linear regression model since only one predictor: \(x\)

  1. The errors are uncorrelated

    • \(\mathrm{corr}(Y_i, Y_j) \;=\; \mathrm{corr}(\epsilon_i, \epsilon_j) \;=\; 0\) for all \(i\ne j\)
  2. The errors have no systematic info about \(Y\)

    1. The average error = \(E(\epsilon_i)=0\) for all \(i\) (for all \(x\))

      • The distribution of errors has mean zero

      • No systematic bias

    2. The errors are not correlated with the \(x\) values

    3. \(\mathrm{var}(Y_i) \;=\; \mathrm{var}(\epsilon_i) \;=\; \sigma^2\) is same for all \(i\) (for all \(x\))

  3. The mean of \(Y\) depends linearly on \(x\)

    • \(E(Y\; |\; X=x) \;=\; \beta_0 +\beta_1 x\)

1.2.5.1 Item 1: The errors are uncorrelated

In order to determine if this is reasonable in light of the data
typically, you have to know how the data were collected

Example: Computer repair

  • if the computer repair person was new to the job and
  • we knew the order of the repair jobs

Then, perhaps at the start, residuals tend to be larger
(more variability in the minutes per computer)

…and later on, the residuals tend to be smaller
(less variability in the minutes per computer)

That is, the residuals could be correlated over time

We’ll take up this topic in Chapter 8: The Problem of Correlated Errors

1.2.5.2 Item 2a: The errors have mean zero

There is actually nothing we need to check in the data
because least squares residuals always add up to zero!

For any data, \[\begin{align} \sum_i e_i &\;=\; \sum_i (y_i - \widehat{y}_i) \\ \\ &\;=\; \sum_i (y_i - \widehat{\beta}_{0} - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - (\overline{y}- \widehat{\beta}_{1}\overline{x}) - \widehat{\beta}_{1}x_i) \\ \\ &\;=\; \sum_i (y_i - \overline{y}) - \widehat{\beta}_{1} \sum_i (x_i - \overline{x}) \\ \\ &\;=\; 0 - \widehat{\beta}_{1} (0) \;=\; 0 \end{align}\]

So, the observed regression line \(\widehat{y}= \widehat{\beta}_{0} + \widehat{\beta}_{1} x\) is the
observed average and is the balancing point of the observed \(y\)-values.

This should look familiar.
We showed that for data from a population \(Y\): \(\displaystyle{\sum_i (y_i - \overline{y}) = 0}\).
The observed average is the balancing point of the observed \(y\)-values.

1.2.5.3 Item 2b: The errors are not correlated with the \(x\) values

There is actually nothing we need to check in the data
because least squared residuals are always uncorrelated with the \(x\) values

\[\mathrm{cov}(x, e) \;=\; \frac{1}{n-1} \sum_i (x_i - \overline{x}) (e_i - \overline{e})\] We know that \(\sum_i e_i \;=\; 0\), so \(\overline{e}\;=\; 0\). So, \[\begin{align} \sum_i (x_i - \overline{x}) (e_i - \overline{e}) &= \sum_i (x_i - \overline{x}) e_i \\ &\;=\; \sum_i x_ie_i - \overline{x}\sum_i e_i \\ &\;=\; \sum_i x_ie_i - \overline{x}(0) \\ &\;=\; \sum_i x_ie_i \\ \end{align}\]

Finally, you can show that \(\sum_i x_i e_i \;=\; 0\) for any data.

Thus, \(\mathrm{cov}(x, e) \;=\; 0\). Therefore, \(\mathrm{corr}(x, e) \;=\; 0\)

1.2.5.4 Item 2c: \(\mathrm{var}(\epsilon_i) = \sigma^2\) is same for all \(i\) (for all \(x\))

Later, we will look at graphical diagnostics
to check for constant spread around the line for all \(x\) values.

We estimate \(\mathrm{var}(\epsilon)\) as

\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]

degrees of freedom (df)

= sample size \(-\) # parameters estimated for the mean (\(\beta_0\) and \(\beta_1\))

= \(n-2\)

1.2.5.5 Item 3: The mean of \(Y\) depends linearly on \(x\)

Again, we will look at useful graphical diagnostics later.

1.3 Lecture 3

1.3.1 Statistical Properties of LS Estimates of Model Parameters

1.3.1.1 Variability of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

We know that \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) will not exactly match the true values \(\beta_0\) and \(\beta_1\). Here, we are going to explore this variability.

We’ll start with a quick visualization. I generate several samples. The first is large, with 10,000 observations, and it’s meant to represent the population. The other three are small.

library(reshape2)

Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths
set.seed(1234)

sd_error_sim1=5


# Simulate the population (really just a big sample with 10,000 observations)
n_simbig=10000
X_big = runif(n_simbig)
error = rnorm(n_simbig, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y_big = beta0_sim1 + beta1_sim1 * X_big + error

# Now draw 3 smaller samples
n_sim1=10
X = runif(n_sim1)

# First sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y1 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

# Second sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y2 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

# Third sample
error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i
Y3 = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

# Reshape data for plotting
this_data<-data.frame(X,Y1,Y2,Y3)
this_data_long <- melt(this_data, id="X") 

# Plot population & sample 2
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>% 
  gf_point(Y2~X,color="green")%>% gf_smooth(Y2~X,method='lm',se=FALSE,color="green") 


# Plot population & all samples
gf_point(Y_big~X_big,color="grey",opacity=0.2) %>% 
gf_point(value~X,data=this_data_long,color=~variable)%>% gf_smooth(value~X,data=this_data_long,color=~variable,method='lm',se=FALSE) 

As we can see, the estimated slopes \(\widehat{\beta}_{1}\) and intercepts \(\widehat{\beta}_{0}\) differ for each of our three samples. What we’ll discuss now is what distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) we’d expect to see as we increase the number of our samples: the sampling distributions.

1.3.1.2 Sampling distributions for \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

If the model is correct: \(Y \;=\; \beta_0 + \beta_1 x + \epsilon\)

then the least squares estimates are unbiased.

\[E(\widehat{\beta}_{0}) \;=\; \beta_0 \qquad {\rm and} \qquad E(\widehat{\beta}_{1}) \;=\; \beta_1\]

To show this, let’s ponder the formula for \(\widehat{\beta}_{1}\) first.

1.3.1.3 Useful algebraic facts about \(\widehat{\beta}_{1}\)

Recall the various expressions for \(\widehat{\beta}_{1}\):

\[\widehat{\beta}_{1} \;=\; \frac{\sum x_i(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_i-\overline{x})^2} \;=\; \frac{\sum (x_i-\overline{x})(y_i-\overline{y}) }{\sum(x_i-\overline{x})^2} \cdots {\rm some\, algebra} \cdots \;=\; r \frac{s_y}{s_x}\]

In particular, \[\widehat{\beta}_{1} \;=\; \frac{\sum (x_i-\overline{x})y_i}{\sum(x_j-\overline{x})^2} \;=\; \sum{k_i y_i}\] where,

\[k_i \; =\; \frac{(x_i-\overline{x})}{\sum(x_i-\overline{x})^2}\] We’ll soon need these facts about the \(k_i\):

\[\begin{align} \sum_i k_i &\;=\; \sum_i \frac{(x_i-\overline{x})}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})} {\sum(x_j-\overline{x})^2} \;=\; 0 \\ \\ \sum_i k_ix_i &\;=\; \sum_i \frac{(x_i-\overline{x})x_i}{\sum(x_j-\overline{x})^2} \;=\; \frac{\sum_i (x_i-\overline{x})^2} {\sum(x_j-\overline{x})^2} \;=\; 1 \\ \\ \sum_i k_i^2 &\;=\; \sum_i \frac{(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{\sum_i(x_i-\overline{x})^2}{\left[\sum(x_j-\overline{x})^2\right]^2} \;=\; \frac{1}{\sum(x_i-\overline{x})^2} \end{align}\]

1.3.1.4 OK. Now we can see \(\widehat{\beta}_{1}\) is an unbiased estimator for \(\beta_1\)

\[\begin{align} E(\widehat{\beta}_{1}) &\;=\; E\left(\sum k_iY_i\right) \;=\; \sum E(k_i Y_i) \;=\; \sum k_i E(Y_i) \\ \\ &\;=\; \sum k_i (\beta_0 + \beta_1 x_i) \;=\; \beta_0 \sum k_i + \beta_1 \sum k_i x_i \\ \\ &\;=\; \beta_0 (0) + \beta_1 (1) \;=\; \beta_1 \end{align}\]

…and we can find the variance of \(\widehat{\beta}_{1}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{1}) &\;=\; \mathrm{var}\left(\sum k_iY_i\right) \;=\; \sum \mathrm{var}(k_i Y_i) \;=\; \sum k_i^2 \mathrm{var}(Y_i) \\ \\ &\;=\; \sum k_i^2 \sigma^2 \;=\; \sigma^2 \sum k_i^2 \;=\; \frac{\sigma^2}{\sum(x_i-\overline{x})^2} \end{align}\]

1.3.1.5 OK. Now we can see \(\widehat{\beta}_{0}\) is an unbiased estimator for \(\beta_0\)

\[\begin{align} E(\widehat{\beta}_{0}) &\;=\; E(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; E(\overline{Y}) - \overline{x}E(\widehat{\beta}_{1}) \\ \\ &\;=\; (\beta_0 + \beta_1 \overline{x}) - \overline{x}\beta_1 \;=\; \beta_0 \end{align}\]

…and we can find the variance of \(\widehat{\beta}_{0}\): \[\begin{align} \mathrm{var}(\widehat{\beta}_{0}) &\;=\; \mathrm{var}(\overline{Y}- \widehat{\beta}_{1} \overline{x}) \;=\; \mathrm{var}(\overline{Y}) + \overline{x}^2\mathrm{var}(\widehat{\beta}_{1}) -\overline{x}\mathrm{cov}(\overline{Y}, \widehat{\beta}_{1})\\ \\ &(*) \;=\; \frac{\sigma^2}{n} + \frac{\sigma^2 \overline{x}^2}{\sum(x_i-\overline{x})^2} \;=\; \sigma^2 \left[ \frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}\right] \end{align}\] \((*)\) Can show that \(\overline{Y}\) and \(\widehat{\beta}_{1}\) are uncorrelated.

1.3.2 Estimating the spread around the line: \(\sigma^2\)

Of course, we don’t know \(\sigma^2\) (population variance).

So, we use an estimate from data:

\[\widehat\sigma^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{e_i^2}{n-2} \;=\; \frac{\sum_i (y_i - \widehat{y}_i)^2}{n-2}\]

Then, the standard errors (estimated standard deviations) are

\[\mathrm{se}(\widehat{\beta}_{1}) \;=\; \frac{\widehat{\sigma}}{\sqrt{\sum(x_i-\overline{x})^2}}\] \[\mathrm{se}(\widehat{\beta}_{0}) \;=\; \widehat{\sigma} \;\sqrt{\frac{1}{n} + \frac{\overline{x}^2}{\sum(x_i-\overline{x})^2}}\]

1.3.3 The Normal Distribution

1.3.3.1 Useful choice of a probability distribution for the errors: Normal

…and if we add one more model specification

  1. The errors have a normal distribution: \(\epsilon \;\sim\; N(0, \sigma^2)\)

That is \(Y \;\sim\; N(\beta_0 + \beta_1 x, \sigma^2)\)

Then, \[\widehat{\beta}_{0} \sim N(\beta_0, \mathrm{sd}(\widehat{\beta}_{0})) \qquad {\rm and} \qquad \widehat{\beta}_{1} \sim N(\beta_1, \mathrm{sd}(\widehat{\beta}_{1}))\]


Also,

\[\frac{\widehat{\beta}_{0} - \beta_0}{\mathrm{se}(\widehat{\beta}_{0})} \;\sim\; t_{n-2} \qquad {\rm and} \qquad \frac{\widehat{\beta}_{1} - \beta_1}{\mathrm{se}(\widehat{\beta}_{1})} \;\sim\; t_{n-2}\]

1.3.4 Similar Concepts in “STAT 101”

1.3.4.1 This is not so novel. Recall the “STAT 101” one-sample t-test.

Observe data on a response variable: \(y_1, y_2, \ldots, y_n\)

Estimate the population mean (\(\mu\)) with the sample average (\(\overline{y}\)).

And \(\overline{y}\) is the least squares (LS) estimate for \(\mu\).

If the population \(Y \sim N(\mu, \sigma^2)\) then

\[\frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]

In STAT 101, we estimated \(\sigma^2\) just as for all least squares estimates: \[\widehat\sigma^2 \;=\; s^2 \;=\; \frac{\mathrm{SSE}}{\mathrm{df}} \;=\; \frac{\sum_i (y_i - \overline{y})^2}{n-1}\]

degrees of freedom (df)

= sample size \(-\) # parameters estimated for the mean (\(\mu\))

= \(n-1\)

1.3.4.2 Back in STAT 101, we found LS estimate unbiased too.

\[E(\overline{Y}) \;=\; E\left[\frac{1}{n}\sum_i Y_i\right] \;=\; \frac{1}{n}\sum E(Y_i) \;=\; \frac{1}{n}\sum \mu \;=\; \frac{1}{n} (n \mu) = \mu\]

and for a random sample: \(\;\mathrm{var}(\overline{Y}) = \sigma^2/n\)

1.3.4.3 So, back in STAT 101, we also used a t-distribution

If the population is \(N(\mu, \sigma^2)\) then, \(\; \epsilon \;=\; (Y - \mu) \;\sim\; N(0, \sigma^2)\)

and

\[\frac{\overline{y}- \widehat{y}}{\mathrm{se}(\,\overline{y}\,)} = \frac{\overline{y}- \mu}{s/\sqrt{n}} \;\sim\; t_{n-1}\]

1.3.5 Demo Sampling Distributions

We can generate a set of \(Y\)-values from a known population model we used earlier:

kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1))
beta0 beta1 sd_error n
3 4 5 10

\(\beta_{0} = 3\)
\(\beta_{1} = 4\)
\(\sigma = 5\)
\(n = 10\)

  1. The mean of \(Y\) is linear: \(E(Y) = 3 + 4 x\)

  2. The \(x\) values come from a Normal(\(\mu = 2, \sigma = 1\)) distribution.

  3. The errors (\(\epsilon\)) from a Normal(\(\mu = 0, \sigma = 5\)) distribution.

  4. Finally, we can generate \(Y\) values from the \(x\) and error values by adding the errors to the linear function of \(x\) values.

\(Y_i = 3 + 4 x_i + \epsilon_i\)

1.3.5.1 Simulate sampling distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

number_samples = 1000
kable(tibble(beta0=beta0_sim1, beta1=beta1_sim1, sd_error=sd_error_sim1, n=n_sim1, number_samples))
beta0 beta1 sd_error n number_samples
3 4 5 10 1000

Using An Applet:

http://www.rossmanchance.com/applets/RegSim/RegCoeff.html

Inputs needed for the applet:

\(\beta_{1} = 4\)
\(\beta_{0} = 3\)
\(\mu_x = 2\)
\(\sigma_x = 1\)
\(\sigma = 5\)

Range of \(y:\; (-1, 20)\) Range of \(x:\; (-1, 5)\)

Using R:

We can simulate 1,000 samples of \(Y\)-values as follows:

First, complete steps 1 and 2 (just once) to generate \(x\) values.

Then, repeat steps 3 and 4 1,000 times,
and each time calculate \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) and save those values.

Then, look at the distributions of the 1,000 \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\) values.

set.seed(1234)

View the code:

# Set up places to store the beta_0 and beta_1 estimates
beta0hats = 0
beta1hats = 0
plots=NULL
# The code below will repeat number_samples times
# i represents the sample number
for (i in 1:1000) {  

  error = rnorm(n_sim1, m=0, sd=sd_error_sim1) # Generate errors for sample i

  Y = beta0_sim1 + beta1_sim1 * X + error # Generate y-values for sample i

  fit = lm(Y ~ X) # Calculate LS estimates for intercept and slope for sample i

  # assign the current beta_0 estimate to the ith position in vector beta0hats
  beta0hats[i] = tidy(fit)$estimate[1]

  # assign the current beta_1 estimate to the ith position in vector beta1hats
  beta1hats[i] = tidy(fit)$estimate[2]
}
# end of the loop that gets repeated number_sample times

1.3.5.2 Analyze simulated sampling distributions of \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\)

What would you expect center, spread, and shape of these distributions to be?

STOP



gf_dhistogram(~ beta0hats, bins = 20, color = "white")
  
gf_dhistogram(~ beta1hats, bins = 20, color = "white")