
1 Transformations to achieve constant variance

We know some transformations to remedy non-linearity
are also variance-stabilizing transformations.

We can also adjust for non-constant variance by transforming the data or by using weighted least squares, both illustrated below.

1.1 Example: Supervisors vs. workers (a transformation)

Textbook Table 6.9 data:

A study of 27 industrial plants of varying sizes

\(y\) = number of supervisors
\(x\) = number of supervised workers

supervisorData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/supervisor_supervised.csv")

head(supervisorData)
workers supervisors
294 30
247 32
267 37
358 44
423 47
311 49
gf_point(supervisors ~ workers, data = supervisorData)

The relationship looks roughly linear, but there is a clear problem with non-constant variance.

lmfit <- lm(supervisors ~ workers, data = supervisorData)
tidy(lmfit)
term estimate std.error statistic p.value
(Intercept) 14.4481 9.5620 1.511 0.1433
workers 0.1054 0.0113 9.303 0.0000
glance(lmfit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.7759 0.7669 21.73 86.54 0 2 25
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

# p3 <- gf_point(rstandard(lmfit) ~ workers, data = supervisorData) %>%
#   gf_labs(x = "predictor (workers)", y = "std residuals") %>%
#   gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

p3 <- gf_histogram(~ rstandard(lmfit), bins = 6, color = "white") %>%
  gf_labs(x = "standardized residuals")

p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))

grid.arrange(p1, p2, p3, p4, ncol=2)

  • there is a fairly linear relationship,
  • the residuals are not inconsistent with a normal distribution,
  • but the variance is clearly non-constant

\[y_i = \beta_{0} + \beta_{1} x_i + \epsilon_i\]

1.1.1 What transformation might help?

Need a transformation that will

  • preserve the linear relationship between response and predictor(s)
  • result in residuals exhibiting constant variance

In many industrial, economic, and biological applications,

  • when unequal error variances are encountered,
  • the std dev of the residuals often increases as the predictor variable increases.
  • A “fan-shaped” residual plot indicates
    • the std dev of the errors is proportional to the size of the predictor variable

\[ var(\epsilon_i) = k^2x_i^2,\quad k>0\]

If the errors have this variance structure, we could calculate new errors with constant variance by dividing them all by \(x_i\):

\[\mathrm{var}\left[\frac{\epsilon_i}{x_i}\right]=\frac{\mathrm{var}(\epsilon_i)}{x_i^2}=\frac{ k^2 x_i^2}{x_i^2}=k^2 = \text{constant variance}\]
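A quick simulation sketch (made-up values, not the textbook data; k = 0.1 and the x values are arbitrary) checking that dividing such errors by \(x_i\) gives a constant standard deviation:

set.seed(1)                                      # arbitrary seed, for reproducibility
k <- 0.1
x <- rep(c(100, 300, 900), each = 2000)          # made-up predictor values
eps <- rnorm(length(x), mean = 0, sd = k * x)    # sd of errors proportional to x
tapply(eps, x, sd)       # std dev grows with x
tapply(eps / x, x, sd)   # roughly constant, close to k = 0.1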

  • Divide each term in the model by \(x_i\).
  • Since we apply the same transformation to left and right sides of the equation,
    • if the model was a good fit for the data before the transformation,
    • it will remain so after the transformation

\[\frac{y_i}{x_i} = \beta_{0} \frac{1}{x_i} + \beta_{1} \frac{x_i}{x_i} + \frac{\epsilon_i}{x_i}\]

We can see that this is a new linear model by renaming the transformed variables and coefficients:

\[Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}\]

So we are left with a new model:

\[y^\prime_i = \beta_{0}^\prime + \beta_{1}^\prime x^\prime_i + \epsilon^\prime_i\]

NOTE: \(\beta_0\) and \(\beta_1\) have switched places in this new model!
This can be a little confusing.

1.1.2 Back to the supervisors data…

supervisorData <- mutate(supervisorData, 
                         workers.reciprocal = 1 / workers,
                         supervisors.over.x = supervisors / workers
)

lmfit.reciprocal <- lm(supervisors.over.x ~ workers.reciprocal, data = supervisorData)
p1 <- gf_point(supervisors.over.x ~ workers.reciprocal, data = supervisorData) %>%
  gf_coefline(model = lmfit.reciprocal)

p2 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors over x)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

# p3 <- gf_histogram(~ rstandard(lmfit.reciprocal), bins=6, color = "white") %>%
#   gf_labs(x = "standardized residuals")

# p4 <- gf_qq(~ rstandard(lmfit.reciprocal)) %>%
#   gf_labs(x = "standard normal", y = "standardized residuals") %>%
#   gf_qqline(~ rstandard(lmfit.reciprocal))

grid.arrange(p1, p2, ncol=2)

Variance is stabilized.

But model-level inference from this model is NOT a simple matter
of grabbing \(R^2\) or the F-statistic from the regression output.
This is a difficult statistical problem that is still being investigated.

The F-test and \(R^2\) from the reciprocal model output should be ignored.

However, the t-tests for the coefficients are meaningful.
Just be careful that the intercept is the slope and the slope is the intercept. Ugh!
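A quick look at the role reversal, using the two fitted objects from above (coef() just extracts the estimates):

coef(lmfit)              # (Intercept) estimates beta0; workers estimates beta1
coef(lmfit.reciprocal)   # (Intercept) estimates beta1; workers.reciprocal estimates beta0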

Comparison of the regression output:

tidy(lmfit)
term estimate std.error statistic p.value
(Intercept) 14.4481 9.5620 1.511 0.1433
workers 0.1054 0.0113 9.303 0.0000
b0 <- tidy(lmfit)$estimate[1]
b1 <- tidy(lmfit)$estimate[2]
tidy(lmfit.reciprocal)
term estimate std.error statistic p.value
(Intercept) 0.121 0.009 13.4454 0.0000
workers.reciprocal 3.803 4.570 0.8323 0.4131
b0prime <- tidy(lmfit.reciprocal)$estimate[1]  # intercept of reciprocal fit = estimate of beta1
b1prime <- tidy(lmfit.reciprocal)$estimate[2]  # slope of reciprocal fit = estimate of beta0
glance(lmfit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.7759 0.7669 21.73 86.54 0 2 25
glance(lmfit.reciprocal)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.027 -0.012 0.0227 0.6927 0.4131 2 25

Remember the relationships:
\(\beta^{\prime}_0 = \beta_1, \quad \beta^{\prime}_1 = \beta_0\).

Without any transformation (ordinary least squares):
\(\widehat{y}_i = \beta_0 + \beta_1 x_i = 14.45 + 0.1054 x_i\)

Using the transformed data (written back on the original scale):
\(\widehat{y}_i = \beta^\prime_1 + \beta^\prime_0 x_i = 3.803 + 0.121 x_i\)

In the original model, \(\beta_0\) (the intercept) was not significantly different from zero (p = 0.14)
…and in the new model, \(\beta_1^\prime\) (the slope, which plays the role of \(\beta_0\)) is likewise not significantly different from zero (p = 0.41).

1.2 Weighted least squares

Weighted least squares is another approach to correcting non-constant variance
…one that does not involve transforming the data.

To see how it works, we need to remind ourselves why non-constant variance is a problem in the first place. The main idea:

  • When computing ordinary least-squares (minimizing the SSE)
    • observations with large residuals are going to have a large influence on the fit
    • because we are minimizing the square of the residuals.

If the distribution of the response has larger error variance at some values of the predictor(s) than at others,

  • observations coming from areas of higher response variance will have
    • consistently larger errors (either positive or negative), and thus
    • a consistently larger influence on the fit.
  • \(\widehat\sigma^2 = SSE/\text{df}\) will be inflated by these high-variance observations
    • coefficients will appear less significant than they are (larger SEs); see the sketch below
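A small simulation sketch of that last point (made-up data, not the supervisor data; it jumps ahead to the weights = argument used later in these notes). With error sd proportional to x, the OLS standard errors are typically larger than those from a fit that downweights the high-variance observations:

set.seed(2)                                           # arbitrary seed
x <- seq(10, 100, length.out = 50)                    # made-up predictor
y <- 2 + 0.5 * x + rnorm(50, sd = 0.05 * x)           # error sd proportional to x
summary(lm(y ~ x))$coefficients                       # ordinary least squares
summary(lm(y ~ x, weights = 1 / x^2))$coefficients    # weighted fit, w = 1/x^2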
lmfit <- lm(supervisors ~ workers, data = supervisorData)

 gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

What if we could simply tell the computer to give these high-variance observations less weight,
to better equalize the influence of all the observations?

We can do this if we know the variance for each observation,
either for theoretical reasons or because we’ve got repeated measurements at each x value.
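For example, with replicate measurements at each x we could estimate a variance at each x value and use its reciprocal as the weight. A hypothetical sketch with made-up replicate data (not the supervisor data):

set.seed(3)                                            # arbitrary seed
reps <- data.frame(x = rep(c(10, 20, 40), each = 5),
                   y = c(rnorm(5, 12, 1), rnorm(5, 22, 2), rnorm(5, 41, 4)))
vars <- tapply(reps$y, reps$x, var)                    # sample variance at each x
reps$w <- 1 / vars[as.character(reps$x)]               # weight = 1 / variance
lm(y ~ x, data = reps, weights = w)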

OLS finds \(\beta\)s which minimize

\[\sum{(y_i - \widehat{y}_i)^2}=\sum{e_i^2}\]

For weighted least squares, we want to minimize

\[\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{var(\epsilon_i)}\]

1.3 Example: Supervisors vs. workers (WLS)

For the supervisor data, the “fan-shaped” residual plot indicates that
the std dev of the errors is proportional to the size of the predictor variable.

p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

grid.arrange(p1, p2, ncol=2)

\[ var(\epsilon_i) = k^2x_i^2,\quad k>0\]

so we use weights \(\displaystyle w_i=\frac{1}{x_i^2}\) and minimize the function

\[\sum{\frac{1}{x_i^2}\,(y_i - \widehat{y}_i)^2}\;=\;\sum\frac{e_i^2}{x_i^2}\]

1.3.1 Supervisor data with weighted least squares

supervisorData <- mutate(supervisorData, wt = 1 / workers^2)
lmfit.wt <- lm(supervisors ~ workers, data = supervisorData, weights = wt)
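As a check on what the weights = argument is doing, here is a minimal sketch (not from the textbook) that solves the weighted normal equations \((X^\top W X)\,\beta = X^\top W y\) directly; the result should match coef(lmfit.wt):

X <- cbind(1, supervisorData$workers)        # design matrix with intercept column
W <- diag(supervisorData$wt)                 # diagonal matrix of the weights 1/x^2
y <- supervisorData$supervisors
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)    # weighted least-squares estimates
coef(lmfit.wt)                               # same values from lm()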

Comparison of transformation method vs. weighted least squares

tidy(lmfit.reciprocal)
term estimate std.error statistic p.value
(Intercept) 0.121 0.009 13.4454 0.0000
workers.reciprocal 3.803 4.570 0.8323 0.4131
tidy(lmfit.wt)
term estimate std.error statistic p.value
(Intercept) 3.803 4.570 0.8323 0.4131
workers 0.121 0.009 13.4454 0.0000
glance(lmfit.reciprocal)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.027 -0.012 0.0227 0.6927 0.4131 2 25
glance(lmfit.wt)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.8785 0.8737 0.0227 180.8 0 2 25

With weighted least squares, it is OK to use model-level statistics like \(R^2\) and the F-test.

Residual plots:

p1 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors/workers)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

p2 <- gf_point(rstandard(lmfit.wt) ~ fitted(lmfit.wt), data = supervisorData) %>%
  gf_labs(x = "fitted values (supervisors, weighted)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

grid.arrange(p1, p2, ncol=2)

1.4 Equivalence of SSE for reciprocal xfm and WLS

For reciprocal transformation:

\[\begin{align*} SSE & = \sum_{i=1}^n (y_i^\prime - \beta_0^\prime - \beta_1^\prime x_i^\prime)^2\\ & = \sum_{i=1}^n \left[\frac{y_i}{x_i} - \beta_1 - \beta_0\frac{1}{x_i}\right]^2\\ & = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_1 x_i - \beta_0)^2\\ & = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_0 - \beta_1 x_i)^2 \end{align*}\]

…which is the SSE for weighted least squares with weights \(\displaystyle w_i = \frac{1}{x_i^2}\)

So, both methods give the same estimates for the slope and intercept
(but with their roles reversed in the reciprocal-transformation output); see the check below.
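A quick numerical check with the two fits from above:

coef(lmfit.reciprocal)   # (Intercept) ≈ 0.121, workers.reciprocal ≈ 3.803
coef(lmfit.wt)           # (Intercept) ≈ 3.803, workers ≈ 0.121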

The WLS method results in valid \(R^2\) and F-test.