
1 Transformations to achieve constant variance

We know some transformations to remedy non-linearity
are also variance-stabilizing transformations.

We can also adjust for non-constant variance by transforming the data or by using weighted least squares, both illustrated below.

1.1 Example: Supervisors vs. workers (a transformation)

Textbook Table 6.9 data:

A study of 27 industrial plants of varying sizes

\(y\) = number of supervisors
\(x\) = number of supervised workers

supervisorData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/supervisor_supervised.csv")

head(supervisorData)
workers supervisors
294 30
247 32
267 37
358 44
423 47
311 49
gf_point(supervisors ~ workers, data = supervisorData)

The relationship looks roughly linear, but there is a clear problem with non-constant variance.

lmfit <- lm(supervisors ~ workers, data = supervisorData)
tidy(lmfit)
term estimate std.error statistic p.value
(Intercept) 14.4481 9.5620 1.511 0.1433
workers 0.1054 0.0113 9.303 0.0000
glance(lmfit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.7759 0.7669 21.73 86.54 0 2 25
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

# p3 <- gf_point(rstandard(lmfit) ~ workers, data = supervisorData) %>%
#   gf_labs(x = "predictor (workers)", y = "std residuals") %>%
#   gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

p3 <- gf_histogram(~ rstandard(lmfit), bins = 6, color = "white") %>%
  gf_labs(x = "standardized residuals")

p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))

grid.arrange(p1, p2, p3, p4, ncol=2)

  • there is a fairly linear relationship,
  • the residuals are not inconsistent with a normal distribution,
  • but the variance is clearly non-constant

\[y_i = \beta_{0} + \beta_{1} x_i + \epsilon_i\]

1.1.1 What transformation might help?

Need a transformation that will

  • preserve the linear relationship between response and predictor(s)
  • result in residuals exhibiting constant variance

In many industrial, economic, and biological applications,

  • when unequal error variances are encountered,
  • the std dev of the residuals often increases as the predictor variable increases.
  • A “fan-shaped” residual plot indicates
    • the std dev of the errors is proportional to the size of the predictor variable

\[ var(\epsilon_i) = k^2x_i^2,\quad k>0\]

If the errors have this variance structure, we could calculate new errors with constant variance by dividing them all by \(x_i\):

\[\mathrm{var}\left[\frac{\epsilon_i}{x_i}\right]=\frac{\mathrm{var}(\epsilon_i)}{x_i^2}=\frac{ k^2 x_i^2}{x_i^2}=k^2 = \text{constant variance}\]
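A quick simulation sketch (made-up values, not the textbook data; k = 0.1 and the x values are arbitrary) checking that dividing such errors by \(x_i\) gives a constant standard deviation:

set.seed(1)                                      # arbitrary seed, for reproducibility
k <- 0.1
x <- rep(c(100, 300, 900), each = 2000)          # made-up predictor values
eps <- rnorm(length(x), mean = 0, sd = k * x)    # sd of errors proportional to x
tapply(eps, x, sd)       # std dev grows with x
tapply(eps / x, x, sd)   # roughly constant, close to k = 0.1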

  • Divide each term in the model by \(x_i\).
  • Since we apply the same transformation to left and right sides of the equation,
    • if the model was a good fit for the data before the transformation,
    • it will remain so after the transformation

\[\frac{y_i}{x_i} = \beta_{0} \frac{1}{x_i} + \beta_{1} \frac{x_i}{x_i} + \frac{\epsilon_i}{x_i}\]

We can see that this is a new linear model by renaming the transformed variables and coefficients:

\[Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}\]

So we are left with a new model:

\[y^\prime_i = \beta_{0}^\prime + \beta_{1}^\prime x^\prime_i + \epsilon^\prime_i\]

NOTE: \(\beta_0\) and \(\beta_1\) have switched places in this new model!
This can be a little confusing.

1.1.2 Back to the supervisors data…

supervisorData <- mutate(supervisorData, 
                         workers.reciprocal = 1 / workers,
                         supervisors.over.x = supervisors / workers
)

lmfit.reciprocal <- lm(supervisors.over.x ~ workers.reciprocal, data = supervisorData)
p1 <- gf_point(supervisors.over.x ~ workers.reciprocal, data = supervisorData) %>%
  gf_coefline(model = lmfit.reciprocal)

p2 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors over x)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

# p3 <- gf_histogram(~ rstandard(lmfit.reciprocal), bins=6, color = "white") %>%
#   gf_labs(x = "standardized residuals")

# p4 <- gf_qq(~ rstandard(lmfit.reciprocal)) %>%
#   gf_labs(x = "standard normal", y = "standardized residuals") %>%
#   gf_qqline(~ rstandard(lmfit.reciprocal))

grid.arrange(p1, p2, ncol=2)

Variance is stabilized.

But model-level inference from this model is NOT a simple matter
of grabbing \(R^2\) or the F-statistic from the regression output.
This is a difficult statistical problem that is still being investigated.

The F-test and \(R^2\) from the reciprocal model output should be ignored.

However, the t-tests for the coefficients are meaningful.
Just be careful that the intercept is the slope and the slope is the intercept. Ugh!
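A quick look at the role reversal, using the two fitted objects from above (coef() just extracts the estimates):

coef(lmfit)              # (Intercept) estimates beta0; workers estimates beta1
coef(lmfit.reciprocal)   # (Intercept) estimates beta1; workers.reciprocal estimates beta0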

Comparison of the regression output:

tidy(lmfit)
term estimate std.error statistic p.value
(Intercept) 14.4481 9.5620 1.511 0.1433
workers 0.1054 0.0113 9.303 0.0000
b0 <- tidy(lmfit)$estimate[1]
b1 <- tidy(lmfit)$estimate[2]
tidy(lmfit.reciprocal)
term estimate std.error statistic p.value
(Intercept) 0.121 0.009 13.4454 0.0000
workers.reciprocal 3.803 4.570 0.8323 0.4131
b0prime <- tidy(lmfit.reciprocal)$estimate[1]  # intercept of reciprocal fit = estimate of beta1
b1prime <- tidy(lmfit.reciprocal)$estimate[2]  # slope of reciprocal fit = estimate of beta0
glance(lmfit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.7759 0.7669 21.73 86.54 0 2 25
glance(lmfit.reciprocal)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.027 -0.012 0.0227 0.6927 0.4131 2 25

Remember the relationships:
\(\beta^{\prime}_0 = \beta_1, \quad \beta^{\prime}_1 = \beta_0\).

Without any transformation (ordinary least squares):
\(\widehat{y}_i = \beta_0 + \beta_1 x_i = 14.45 + 0.1054 x_i\)

Using the transformed data (written back on the original scale):
\(\widehat{y}_i = \beta^\prime_1 + \beta^\prime_0 x_i = 3.803 + 0.121 x_i\)

In the original model, \(\beta_0\) (the intercept) was not significantly different from zero (p = 0.14)
…and in the new model, \(\beta_1^\prime\) (the slope, which plays the role of \(\beta_0\)) is likewise not significantly different from zero (p = 0.41).

1.2 Weighted least squares

Weighted least squares is another approach to correcting non-constant variance
…one that does not involve transforming the data.

To see how it works, we need to remind ourselves why non-constant variance is a problem in the first place. The main idea:

  • When computing ordinary least-squares (minimizing the SSE)
    • observations with large residuals are going to have a large influence on the fit
    • because we are minimizing the square of the residuals.

If the distribution of the response has larger error variance at some values of the predictor(s) than at others,

  • observations coming from areas of higher response variance will have
    • consistently larger errors (either positive or negative), and thus
    • a consistently larger influence on the fit.
  • \(\widehat\sigma^2 = SSE/\text{df}\) will be inflated by these high-variance observations
    • coefficients will appear less significant than they are (larger SEs); see the sketch below
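A small simulation sketch of that last point (made-up data, not the supervisor data; it jumps ahead to the weights = argument used later in these notes). With error sd proportional to x, the OLS standard errors are typically larger than those from a fit that downweights the high-variance observations:

set.seed(2)                                           # arbitrary seed
x <- seq(10, 100, length.out = 50)                    # made-up predictor
y <- 2 + 0.5 * x + rnorm(50, sd = 0.05 * x)           # error sd proportional to x
summary(lm(y ~ x))$coefficients                       # ordinary least squares
summary(lm(y ~ x, weights = 1 / x^2))$coefficients    # weighted fit, w = 1/x^2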
lmfit <- lm(supervisors ~ workers, data = supervisorData)

 gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

What if we could simply tell the computer to give these high-variance observations less weight,
to better equalize the influence of all the observations?

We can do this if we know the variance for each observation,
either for theoretical reasons or because we’ve got repeated measurements at each x value.
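For example, with replicate measurements at each x we could estimate a variance at each x value and use its reciprocal as the weight. A hypothetical sketch with made-up replicate data (not the supervisor data):

set.seed(3)                                            # arbitrary seed
reps <- data.frame(x = rep(c(10, 20, 40), each = 5),
                   y = c(rnorm(5, 12, 1), rnorm(5, 22, 2), rnorm(5, 41, 4)))
vars <- tapply(reps$y, reps$x, var)                    # sample variance at each x
reps$w <- 1 / vars[as.character(reps$x)]               # weight = 1 / variance
lm(y ~ x, data = reps, weights = w)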

OLS finds \(\beta\)s which minimize

\[\sum{(y_i - \widehat{y}_i)^2}=\sum{e_i^2}\]

For weighted least squares, we want to minimize

\[\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{var(\epsilon_i)}\]

1.3 Example: Supervisors vs. workers (WLS)

For the supervisor data, the “fan-shaped” residual plot indicates that
the std dev of the errors is proportional to the size of the predictor variable.

p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

grid.arrange(p1, p2, ncol=2)

\[ var(\epsilon_i) = k^2x_i^2,\quad k>0\]

so we use weights \(\displaystyle w_i=\frac{1}{x_i^2}\) and minimize the function

\[\sum{\frac{1}{x_i^2}\,(y_i - \widehat{y}_i)^2}\;=\;\sum\frac{e_i^2}{x_i^2}\]

1.3.1 Supervisor data with weighted least squares

supervisorData <- mutate(supervisorData, wt = 1 / workers^2)
lmfit.wt <- lm(supervisors ~ workers, data = supervisorData, weights = wt)
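As a check on what the weights = argument is doing, here is a minimal sketch (not from the textbook) that solves the weighted normal equations \((X^\top W X)\,\beta = X^\top W y\) directly; the result should match coef(lmfit.wt):

X <- cbind(1, supervisorData$workers)        # design matrix with intercept column
W <- diag(supervisorData$wt)                 # diagonal matrix of the weights 1/x^2
y <- supervisorData$supervisors
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)    # weighted least-squares estimates
coef(lmfit.wt)                               # same values from lm()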

Comparison of transformation method vs. weighted least squares

tidy(lmfit.reciprocal)
term estimate std.error statistic p.value
(Intercept) 0.121 0.009 13.4454 0.0000
workers.reciprocal 3.803 4.570 0.8323 0.4131
tidy(lmfit.wt)
term estimate std.error statistic p.value
(Intercept) 3.803 4.570 0.8323 0.4131
workers 0.121 0.009 13.4454 0.0000
glance(lmfit.reciprocal)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.027 -0.012 0.0227 0.6927 0.4131 2 25
glance(lmfit.wt)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.8785 0.8737 0.0227 180.8 0 2 25

With weighted least squares, it is OK to use model-level statistics like \(R^2\) and the F-test.

Residual plots:

p1 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors/workers)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

p2 <- gf_point(rstandard(lmfit.wt) ~ fitted(lmfit.wt), data = supervisorData) %>%
  gf_labs(x = "fitted values (supervisors, weighted)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

grid.arrange(p1, p2, ncol=2)

1.4 Equivalence of SSE for reciprocal xfm and WLS

For reciprocal transformation:

\[\begin{align*} SSE & = \sum_{i=1}^n (y_i^\prime - \beta_0^\prime - \beta_1^\prime x_i^\prime)^2\\ & = \sum_{i=1}^n \left[\frac{y_i}{x_i} - \beta_1 - \beta_0\frac{1}{x_i}\right]^2\\ & = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_1 x_i - \beta_0)^2\\ & = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_0 - \beta_1 x_i)^2 \end{align*}\]

…which is the SSE for weighted least squares with weights \(\displaystyle w_i = \frac{1}{x_i^2}\)

So, both methods give the same estimates for the slope and intercept
(but with their roles reversed in the reciprocal-transformation output); see the check below.
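A quick numerical check with the two fits from above:

coef(lmfit.reciprocal)   # (Intercept) ≈ 0.121, workers.reciprocal ≈ 3.803
coef(lmfit.wt)           # (Intercept) ≈ 3.803, workers ≈ 0.121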

The WLS method results in valid \(R^2\) and F-test.