---
title: "STAT 224 Chapter 7"
output:
  html_document:
    code_folding: hide
    df_print: kable
    number_sections: yes
    self_contained: yes
    toc: FALSE
    toc_depth: 4
    toc_float: FALSE
params:
  dotabs:  TRUE
---


```{r test-main, child = 'preamble.Rmd'}
knitr::opts_chunk$set(cache=TRUE)
```


# Transformations to achieve constant variance

Some of the transformations used to remedy non-linearity  
are also **variance-stabilizing transformations**.

We can also adjust for non-constant variance by

  + weighting the observations 
    + observations associated with more variability in the response...
    + get less weight in determining the regression estimates.

## Example: Supervisors vs. workers (a transformation)

Textbook Table 6.9 data:

A study of 27 industrial plants of varying sizes   

$y$ = number of supervisors   
$x$ = number of supervised workers

```{r}
supervisorData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/supervisor_supervised.csv")

head(supervisorData)
```

```{r}
gf_point(supervisors ~ workers, data = supervisorData)
```

The relationship looks roughly linear, but there is a clear problem with non-constant variance.



```{r}
lmfit <- lm(supervisors ~ workers, data = supervisorData)
tidy(lmfit)
glance(lmfit)[-c(7:10)]
```

```{r}
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

# p3 <- gf_point(rstandard(lmfit) ~ workers, data = supervisorData) %>%
#   gf_labs(x = "predictor (workers)", y = "std residuals") %>%
#   gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

p3 <- gf_histogram(~ rstandard(lmfit), bins = 6, color = "white") %>%
  gf_labs(x = "standardized residuals")

p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))

grid.arrange(p1, p2, p3, p4, ncol=2)
```

  + there is a fairly linear relationship, 
  + residuals not inconsistent with data from a normal distn
  + but, non-constant variance

$$y_i = \B0 + \B1 x_i + \epsilon_i$$

### What transformation might help?

Need a transformation that will 

  + preserve the linear relationship between response and predictor(s)
  + result in residuals exhibiting constant variance

In many industrial, economic, and biological applications, 

  + when unequal error variances are encountered, 
  + the std dev of the errors often increases as the predictor variable increases. 
  + A "fan-shaped" residual plot indicates 
    + the std dev of errors is proportional to size of predictor variable

$$ var(\epsilon_i) = k^2x_i^2,\quad k>0$$
If the errors have this variance structure, we could calculate new errors with constant variance by dividing them all by $x_i$:

$$\var\left[\frac{\epsilon_i}{x_i}\right]=\frac{\var(\epsilon_i)}{x_i^2}=\frac{ k^2 x_i^2}{x_i^2}=k^2 = \text{constant variance}$$
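
A quick simulated illustration (a sketch with made-up numbers, not the textbook data): when the error standard deviation is proportional to $x$, dividing the errors by $x$ produces roughly constant spread.

```{r}
# Sketch: errors with sd = k*x spread out more as x grows,
# but e/x has about the same spread (k) at every x.
set.seed(7)
x <- rep(c(2, 10, 50), each = 200)
e <- rnorm(length(x), sd = 0.3 * x)   # sd proportional to x, with k = 0.3
tapply(e, x, sd)       # spread grows with x
tapply(e / x, x, sd)   # roughly constant, near 0.3
```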



  + Divide each term in the model by $x_i$. 
  + Since we apply the same transformation to left and right sides of the equation, 
    + if the model was a good fit for the data before the transformation, 
    + it will remain so after the transformation

$$\frac{y_i}{x_i} = \B0 \frac{1}{x_i} + \B1 \frac{x_i}{x_i} + \frac{\epsilon_i}{x_i}$$
We can see that this is a new linear model by renaming the transformed variables and coefficients:

$$Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}$$
So we are left with a new model:

$$y^\prime_i = \B0^\prime + \B1^\prime x^\prime_i + \epsilon^\prime_i$$
**NOTE:** $\beta_0$ and $\beta_1$ have switched places in this new model!    
This can be a little confusing.


### Back to the supervisors data...

```{r, class.source = "fold-show"}
supervisorData <- mutate(supervisorData, 
                         workers.reciprocal = 1 / workers,
                         supervisors.over.x = supervisors / workers
)

lmfit.reciprocal <- lm(supervisors.over.x ~ workers.reciprocal, data = supervisorData)
```
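
Equivalently (a sketch, not the textbook's code), the same fit can be written without creating new columns by wrapping the transformations in `I()` inside the formula; it reproduces `lmfit.reciprocal` up to the variable names:

```{r}
# Same reciprocal-transformation fit, specified directly in the formula
lm(I(supervisors / workers) ~ I(1 / workers), data = supervisorData)
```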

```{r}
p1 <- gf_point(supervisors.over.x ~ workers.reciprocal, data = supervisorData) %>%
  gf_coefline(model = lmfit.reciprocal)

p2 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors over x)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

# p3 <- gf_histogram(~ rstandard(lmfit.reciprocal), bins=6, color = "white") %>%
#   gf_labs(x = "standardized residuals")

# p4 <- gf_qq(~ rstandard(lmfit.reciprocal)) %>%
#   gf_labs(x = "standard normal", y = "standardized residuals") %>%
#   gf_qqline(~ rstandard(lmfit.reciprocal))

grid.arrange(p1, p2, ncol=2)
```

Variance is stabilized.     

But, model-level inference from this model is **not** a simple matter    
of grabbing $R^2$ or the F-statistic from the regression output.     
This is a difficult statistical problem ...still being investigated.

The F-test and $R^2$ from the
reciprocal model output should be ignored.

However, the t-tests for the coefficients are meaningful.     
Just be careful that the intercept is now the slope and the slope is now the intercept.  Ugh!
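
One way to read the reciprocal-model t-tests on the original scale (a minimal sketch; the relabeled `term` strings below are ours, added only for readability) is to swap the two rows of the tidy output:

```{r}
# Swap rows: the workers.reciprocal coefficient estimates beta_0 (original intercept),
# and the (Intercept) row estimates beta_1 (original slope); the tests carry over unchanged.
recipTidy <- tidy(lmfit.reciprocal)[c(2, 1), ]
recipTidy$term <- c("beta0 (from workers.reciprocal)", "beta1 (from (Intercept))")
recipTidy
```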

Comparison of the regression output:
```{r}
tidy(lmfit)
b0 <- tidy(lmfit)$estimate[1]
b1 <- tidy(lmfit)$estimate[2]
tidy(lmfit.reciprocal)
b0prime <- tidy(lmfit.reciprocal)$estimate[2]
b1prime <- tidy(lmfit.reciprocal)$estimate[1]
```
```{r}
glance(lmfit)[-c(7:10)]
glance(lmfit.reciprocal)[-c(7:10)]
```

Remember the relationships:   
$\beta^{\prime}_0 = \beta_1, \quad \beta^{\prime}_1 = \beta_0$. 

Without any transformation (ordinary least squares):    
$\yhat_i = \beta_0 + \beta_1 x_i = `r b0` + `r b1` x_i$

Using the transformed data (written back on the original scale, so the roles reverse):    
$\yhat_i = \beta^\prime_1 + \beta^\prime_0 x_i = `r b0prime` + `r b1prime` x_i$


In the original model, $\beta_0$ (the intercept) was close to zero     
...and in the *new* model, $\beta_1^\prime$ (the slope) is close to zero.  


## Weighted least squares 

Another approach to correcting non-constant variance problems   
...that does not involve transforming data.

To see how it works, we need to remind ourselves why it's a problem in the first place when we have non-constant variance. 
The main idea:

  + When computing ordinary least-squares (minimizing the SSE)
    + observations with large residuals are going to have a large influence on the fit
    + because we are minimizing the *square* of the residuals.

If the distribution of the response has larger error variance at some values of the predictor(s)

  + observations coming from areas of higher response variance will have
    + consistently larger errors (either positive or negative), and thus 
    + a consistently larger influence on the fit. 
  + $\widehat\sigma^2 = SSE/\text{df}$ will be inflated by these high-variance observations
    + coefficients will appear less significant than they are (larger SEs); see the simulation sketch below
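
A small simulation sketch of this point (artificial data, with error sd proportional to $x$ as in the supervisor example; the names `fit_ols` and `fit_wls` are ours): WLS with weights $1/x_i^2$ typically reports a smaller standard error for the slope than OLS does.

```{r}
# Simulated heteroscedastic data: sd(e_i) proportional to x_i, so var(e_i) = k^2 x_i^2
set.seed(224)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = 0.5 * x)
fit_ols <- lm(y ~ x)                        # ordinary least squares
fit_wls <- lm(y ~ x, weights = 1 / x^2)     # weighted least squares
# Compare the slope estimate and its reported standard error
rbind(OLS = coef(summary(fit_ols))["x", 1:2],
      WLS = coef(summary(fit_wls))["x", 1:2])
```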

```{r}
lmfit <- lm(supervisors ~ workers, data = supervisorData)

gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)
```

What if we could simply tell the computer to give these high-variance observations less weight,    
to better equalize the influence of all the observations?

We can do this if we **know** the variance for each observation,    
either for theoretical reasons or because we've got repeated measurements at each x value.

OLS finds $\beta$s which minimize

$$\sum{(y_i - \widehat{y}_i)^2}=\sum{e_i^2}$$

For weighted least squares, we want to minimize

$$\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{var(\epsilon_i)}$$
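
In matrix form, the WLS estimates solve the weighted normal equations $(X^\top W X)\,\widehat\beta = X^\top W y$. A minimal sketch (the function name `wls_coef` and its arguments are hypothetical) of what `lm(y ~ x, weights = w)` computes for a simple linear model:

```{r}
# WLS coefficients via the weighted normal equations (X'WX) b = X'Wy
wls_coef <- function(y, x, w) {
  X <- cbind(1, x)                           # design matrix with an intercept column
  solve(t(X) %*% (w * X), t(X) %*% (w * y))  # same as using W = diag(w)
}
# e.g., wls_coef(supervisorData$supervisors, supervisorData$workers,
#                1 / supervisorData$workers^2)   # matches lm(..., weights = wt) below
```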

## Example: Supervisors vs. workers (WLS)

For the supervisor data, the "fan-shaped" residual plot indicates     
the std dev of errors is proportional to size of predictor variable

```{r, fig.height=3}
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)

p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

grid.arrange(p1, p2, ncol=2)
```

$$ var(\epsilon_i) = k^2x_i^2,\quad k>0$$

so we use weights $\displaystyle w_i=\frac{1}{x_i^2}$ and minimize the function

$$\sum{\frac{1}{x_i^2}\,(y_i - \widehat{y}_i)^2}\;=\;\sum\frac{e_i^2}{x_i^2}$$

### Supervisor data with weighted least squares
```{r}
supervisorData <- mutate(supervisorData, wt = 1 / workers^2)
lmfit.wt <- lm(supervisors ~ workers, data = supervisorData, weights = wt)
```

Comparison of transformation method vs. weighted least squares
```{r}
tidy(lmfit.reciprocal)
tidy(lmfit.wt)
```

```{r}
glance(lmfit.reciprocal)[-c(7:10)]
glance(lmfit.wt)[-c(7:10)]
```

With weighted least squares: OK to use model-level statistics like $R^2$ and the F-test.


Residual plots:

```{r,echo=TRUE,fig.height=3}
p1 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors/workers)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

p2 <- gf_point(rstandard(lmfit.wt) ~ fitted(lmfit.wt), data = supervisorData) %>%
  gf_labs(x = "fitted values (supervisors, weighted)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")

grid.arrange(p1, p2, ncol=2)
```


## Equivalence of SSE for reciprocal xfm and WLS

For reciprocal transformation:

$$\begin{align*}
SSE & = \sum_{i=1}^n (y_i^\prime - \beta_0^\prime - \beta_1^\prime x_i^\prime)^2\\
& = \sum_{i=1}^n \left[\frac{y_i}{x_i} - \beta_1 - \beta_0\frac{1}{x_i}\right]^2\\
& = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_1 x_i - \beta_0)^2\\
& = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_0 - \beta_1 x_i)^2
\end{align*}$$

...which is the SSE for weighted least squares with weights
$\displaystyle w_i = \frac{1}{x_i^2}$

So both methods give the same estimates for the slope and intercept,   
just with the roles reversed in the reciprocal-transformation output.
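
As a quick numerical check with the fits above, the estimates agree exactly; the names printed for the reciprocal model still read in reversed roles:

```{r}
coef(lmfit.wt)                # (Intercept) estimates beta0, workers estimates beta1
rev(coef(lmfit.reciprocal))   # same values: workers.reciprocal plays beta0, (Intercept) plays beta1
```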

The WLS method, unlike the transformation approach, yields a valid $R^2$ and F-test.