---
title: "STAT 224 Chapter 7"
output:
html_document:
code_folding: hide
df_print: kable
number_sections: yes
self_contained: yes
toc: FALSE
toc_depth: 4
toc_float: FALSE
params:
dotabs: TRUE
---
```{r test-main, child = 'preamble.Rmd'}
knitr::opts_chunk$set(cache=TRUE)
```
# Transformations to achieve constant variance
We know some transformations to remedy non-linearity
are also **variance-stabilizing transformations**.
We can also adjust for non-consant variance by
+ weighting the observations
+ observations related to more variability in the response...
+ get less weight in determining the regression estimates.
## Example: Supervisors vs. workers (a transformation)
Textbook Table 6.9 data:
A study of 27 industrial plants of varying sizes
$y$ = number of supervisors
$x$ = number of supervised workers
```{r}
supervisorData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/supervisor_supervised.csv")
head(supervisorData)
```
```{r}
gf_point(supervisors ~ workers, data = supervisorData)
```
The relationship looks roughly linear, but there is a clear problem with non-constant variance.
```{r}
lmfit <- lm(supervisors ~ workers, data = supervisorData)
tidy(lmfit)
glance(lmfit)[-c(7:10)]
```
```{r}
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
gf_coefline(model = lmfit)
p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
# p3 <- gf_point(rstandard(lmfit) ~ workers, data = supervisorData) %>%
# gf_labs(x = "predictor (workers)", y = "std residuals") %>%
# gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p3 <- gf_histogram(~ rstandard(lmfit), bins = 6, color = "white") %>%
gf_labs(x = "standardized residuals")
p4 <- gf_qq(~ rstandard(lmfit)) %>%
gf_labs(x = "standard normal", y = "standardized residuals") %>%
gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol=2)
```
+ there is a fairly linear relationship,
+ residuals not inconsistent with data from a normal distn
+ but, non-constant variance
$$y_i = \B0 + \B1 x_i + \epsilon_i$$
### What transformation might help?
Need a transformation that will
+ preserve the linear relationship between response and predictor(s)
+ results in residuals exhibiting constant variance
In many industrial, economic, and biological applications,
+ when unequal error variances are encountered,
+ often the std dev of residuals increases as predictor variable increases.
+ A "fan-shaped" residual plot indicates
+ the std dev of errors is proportional to size of predictor variable
$$ var(\epsilon_i) = k^2x_i^2,\quad k>0$$
If the errors have this variance structure, we could calculate new errors with constant variance by dividing them all by $x_i$:
$$\var\left[\frac{\epsilon_i}{x_i}\right]=\frac{\var(\epsilon_i)}{x_i^2}=\frac{ k^2 x_i^2}{x_i^2}=k^2 = \text{constant variance}$$
+ Divide each term in the model by $x_i$.
+ Since we apply the same transformation to left and right sides of the equation,
+ if the model was a good fit for the data before the transformation,
+ it will remain so after the transformation
$$\frac{y_i}{x_i} = \B0 \frac{1}{x_i} + \B1 \frac{x_i}{x_i} + \frac{\epsilon_i}{x_i}$$
We can see that this is a new linear model by renaming the transformed variables and coefficients:
$$Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}$$
So we are left with a new model:
$$y^\prime_i = \B0^\prime + \B1^\prime x_i + \epsilon^\prime_i$$
**NOTE:** $\beta_0$ and $\beta_1$ have switched places in this new model!
This can be a little confusing.
### Back to the supervisors data...
```{r,code_folding=show}
supervisorData <- mutate(supervisorData,
workers.reciprocal = 1 / workers,
supervisors.over.x = supervisors / workers
)
lmfit.reciprocal <- lm(supervisors.over.x ~ workers.reciprocal, data = supervisorData)
```
```{r}
p1 <- gf_point(supervisors.over.x ~ workers.reciprocal, data = supervisorData) %>%
gf_coefline(model = lmfit.reciprocal)
p2 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
gf_labs(x = "fitted values (supervisors over x)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
# p3 <- gf_histogram(~ rstandard(lmfit.reciprocal), bins=6, color = "white") %>%
# gf_labs(x = "standardized residuals")
# p4 <- gf_qq(~ rstandard(lmfit.reciprocal)) %>%
# gf_labs(x = "standard normal", y = "standardized residuals") %>%
# gf_qqline(~ rstandard(lmfit.reciprocal))
grid.arrange(p1, p2, ncol=2)
```
Variance is stabilized.
But, model-level inference from this model is **NOT** not a simple matter
of grabbing $R^2$ or the F-statistic from the regression output.
This is a difficult statistical problem ...still being investigated.
The F-test and $R^2$ from the
reciprocal model output should be ignored.
However, the t-tests for coefficients are meaningful
Just be careful that the intercept is the slope and the slope is the intercept. Ugh!
Comparison of the regression output:
```{r}
tidy(lmfit)
b0 <- tidy(lmfit)$estimate[1]
b1 <- tidy(lmfit)$estimate[2]
tidy(lmfit.reciprocal)
b0prime <- tidy(lmfit.reciprocal)$estimate[2]
b1prime <- tidy(lmfit.reciprocal)$estimate[1]
```
```{r}
glance(lmfit)[-c(7:10)]
glance(lmfit.reciprocal)[-c(7:10)]
```
Remember the relationships:
$\beta^{\prime}_0 = \beta_1, \quad \beta^{\prime}_1 = \beta_0$.
Without any transformation (ordinary least squares):
$\yhat_i = \beta_0 + \beta_1 x_i = `r b0` + `r b1` x_i$
Using transformed data:
$\yhat_i = \beta^\prime_0 + \beta^\prime_1 x_i = `r b0prime` + `r b1prime` x_i$
In the original model, $\beta_0$ (the intercept) was close to zero
...and in the *new* model, $\beta_1^\prime$ (the slope) is close to zero.
## Weighted least squares
Another approach to correcting non-constant variance problems
...that does not involve transforming data.
To see how it works, we need to remind ourselves why it's a problem in the first place when we have non-constant variance.
The main idea:
+ When computing ordinary least-squares (minimizing the SSE)
+ observations with large residuals are going to have a large influence on the fit
+ because we are minimizing the *square* of the residuals.
If the distribution of the response has larger error variance at different values of the predictor(s)
+ observations coming from areas of higher reponse variance will have
+ consistently larger errors (either positive or negative), and thus
+ a consistently larger influence on the fit.
+ $\widehat\sigma = SSE/\text{df}$ will be inflated by these high-variance observations
+ coefficients will appear less significant than they are (larger SEs)
```{r}
lmfit <- lm(supervisors ~ workers, data = supervisorData)
gf_point(supervisors ~ workers, data = supervisorData) %>%
gf_coefline(model = lmfit)
```
What if we could simply tell the computer to give these high-variance observations less weight,
to better equalize the influence of all the observations?
We can do this if we **know** the variance for each observation,
either for theoretical reasons or because we've got repeated measurements at each x value.
OLS finds $\beta$s which minimize
$$\sum{(y_i - \widehat{y}_i)^2}=\sum{e_i^2}$$
For weighted least squares, we want to minimize
$$\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{var(\epsilon_i)}$$
## Example: Supervisors vs. workers (WLS)
For the superviser data, the "fan-shaped" residual plot indicates
the std dev of errors is proportional to size of predictor variable
```{r, fig.height=3}
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
gf_coefline(model = lmfit)
p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol=2)
```
$$ var(\epsilon_i) = k^2x_i^2,\quad k>0$$
so we use weights $\displaystyle w_i=\frac{1}{x_i^2}$ and minimize the function
$$\sum{\frac{1}{x_i^2}\,(y_i - \widehat{y}_i)^2}\;=\;\sum\frac{e_i^2}{x_i^2}$$
### Supervisor data with weighted least squares
```{r}
supervisorData <- mutate(supervisorData, wt = 1 / workers^2)
lmfit.wt <- lm(supervisors ~ workers, data = supervisorData, weights = wt)
```
Comparison of transformation method vs. weighted least squares
```{r}
tidy(lmfit.reciprocal)
tidy(lmfit.wt)
```
```{r}
glance(lmfit.reciprocal)[-c(7:10)]
glance(lmfit.wt)[-c(7:10)]
```
With weight least squares: OK to use model-level statistics like $R^2$ and F-test.
Residual plots:
```{r,echo=TRUE,fig.height=3}
p1 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
gf_labs(x = "fitted values (supervisors/workers)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p2 <- gf_point(rstandard(lmfit.wt) ~ fitted(lmfit.wt), data = supervisorData) %>%
gf_labs(x = "fitted values (supervisors, weighted)", y = "std residuals") %>%
gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol=2)
```
## Equivalence of SSE for reciprocal xfm and WLS
For reciprocal transformation:
$$\begin{align*}
SSE & = \sum_{i=1}^n (y_i^\prime - \beta_0^\prime - \beta_1^\prime x_i^\prime)^2\\
& = \left[\frac{y_i}{x_i} - \beta_1 - \beta_0\frac{1}{x_i}\right]^2\\
& = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_1 x_i - \beta_0)^2\\
& = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_0 - \beta_1 x_i)^2
\end{align*}$$
...which is the SSE for weighted least squares with weights
$\displaystyle w_i = \frac{1}{x_i^2}$
So, both methods give the same estimates for the slope and intercept,
(but reversed roles for reciprocal transformation).
The WLS method results in valid $R^2$ and F-test.