---
title: "STAT 224 Chapter 7"
output:
  html_document:
    code_folding: hide
    df_print: kable
    number_sections: yes
    self_contained: yes
    toc: FALSE
    toc_depth: 4
    toc_float: FALSE
params:
  dotabs: TRUE
---

```{r test-main, child = 'preamble.Rmd'}
knitr::opts_chunk$set(cache=TRUE)
```

# Transformations to achieve constant variance

We know some transformations to remedy non-linearity are also **variance-stabilizing transformations**.

We can also adjust for non-constant variance by **weighting** the observations:

+ observations associated with more variability in the response...
+ ...get less weight in determining the regression estimates.

## Example: Supervisors vs. workers (a transformation)

Textbook Table 6.9 data: A study of 27 industrial plants of varying sizes

+ $y$ = number of supervisors
+ $x$ = number of supervised workers

```{r}
supervisorData <- read.csv("http://statistics.uchicago.edu/~collins/data/s224/supervisor_supervised.csv")
head(supervisorData)
```

```{r}
gf_point(supervisors ~ workers, data = supervisorData)
```

The relationship looks roughly linear, but there is a clear problem with non-constant variance.

```{r}
lmfit <- lm(supervisors ~ workers, data = supervisorData)
tidy(lmfit)
glance(lmfit)[-c(7:10)]
```

```{r}
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)
p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
# p3 <- gf_point(rstandard(lmfit) ~ workers, data = supervisorData) %>%
#   gf_labs(x = "predictor (workers)", y = "std residuals") %>%
#   gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p3 <- gf_histogram(~ rstandard(lmfit), bins = 6, color = "white") %>%
  gf_labs(x = "standardized residuals")
p4 <- gf_qq(~ rstandard(lmfit)) %>%
  gf_labs(x = "standard normal", y = "standardized residuals") %>%
  gf_qqline(~ rstandard(lmfit))
grid.arrange(p1, p2, p3, p4, ncol = 2)
```

+ there is a fairly linear relationship,
+ the residuals are not inconsistent with data from a normal distribution,
+ but, non-constant variance

$$y_i = \B0 + \B1 x_i + \epsilon_i$$

### What transformation might help?

We need a transformation that will

+ preserve the linear relationship between the response and the predictor(s)
+ result in residuals exhibiting constant variance

In many industrial, economic, and biological applications,

+ when unequal error variances are encountered,
+ the std dev of the residuals often increases as the predictor variable increases.
+ A "fan-shaped" residual plot indicates
+ the std dev of the errors is proportional to the size of the predictor variable:

$$var(\epsilon_i) = k^2x_i^2,\quad k>0$$

If the errors have this variance structure, we could calculate new errors with constant variance by dividing them all by $x_i$:

$$\var\left[\frac{\epsilon_i}{x_i}\right]=\frac{\var(\epsilon_i)}{x_i^2}=\frac{k^2 x_i^2}{x_i^2}=k^2 = \text{constant variance}$$

+ Divide each term in the model by $x_i$.
+ Since we apply the same transformation to the left and right sides of the equation,
+ if the model was a good fit for the data before the transformation,
+ it will remain so after the transformation.

$$\frac{y_i}{x_i} = \B0 \frac{1}{x_i} + \B1 \frac{x_i}{x_i} + \frac{\epsilon_i}{x_i}$$

We can see that this is a new linear model by renaming the transformed variables and coefficients:

$$Y^\prime = \frac{Y}{X} \qquad X^\prime = \frac{1}{X}\qquad \beta^\prime_0 = \beta_1 \qquad \beta^\prime_1 = \beta_0\qquad \epsilon^\prime=\frac{\epsilon}{X}$$

So we are left with a new model:

$$y^\prime_i = \B0^\prime + \B1^\prime x^\prime_i + \epsilon^\prime_i$$

**NOTE:** $\beta_0$ and $\beta_1$ have switched places in this new model! This can be a little confusing.
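Before returning to the data, a small simulation sketch can confirm the algebra. The data below (`simData`) are simulated for illustration only and are not part of the textbook example: errors are generated with standard deviation proportional to $x$ (so $\text{sd}(\epsilon_i) = k\,x_i$ with $k = 0.5$), and dividing them by $x$ removes the fan shape.

```{r}
# Simulation sketch (hypothetical data, not the textbook data):
# if sd(eps_i) = k * x_i, then eps_i / x_i has roughly constant spread.
set.seed(224)
simData <- data.frame(x = runif(200, min = 10, max = 100))
simData <- mutate(simData,
                  eps        = rnorm(200, mean = 0, sd = 0.5 * x),  # k = 0.5
                  eps.over.x = eps / x)
p1 <- gf_point(eps ~ x, data = simData) %>%
  gf_labs(title = "raw errors: fan shape")
p2 <- gf_point(eps.over.x ~ x, data = simData) %>%
  gf_labs(title = "errors divided by x: constant spread")
grid.arrange(p1, p2, ncol = 2)
```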
### Back to the supervisors data...

```{r, class.source = 'fold-show'}
supervisorData <- mutate(supervisorData,
                         workers.reciprocal = 1 / workers,
                         supervisors.over.x = supervisors / workers)
lmfit.reciprocal <- lm(supervisors.over.x ~ workers.reciprocal, data = supervisorData)
```

```{r}
p1 <- gf_point(supervisors.over.x ~ workers.reciprocal, data = supervisorData) %>%
  gf_coefline(model = lmfit.reciprocal)
p2 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors over x)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
# p3 <- gf_histogram(~ rstandard(lmfit.reciprocal), bins = 6, color = "white") %>%
#   gf_labs(x = "standardized residuals")
# p4 <- gf_qq(~ rstandard(lmfit.reciprocal)) %>%
#   gf_labs(x = "standard normal", y = "standardized residuals") %>%
#   gf_qqline(~ rstandard(lmfit.reciprocal))
grid.arrange(p1, p2, ncol = 2)
```

The variance is stabilized.

But model-level inference from this model is **NOT** a simple matter of grabbing $R^2$ or the F-statistic from the regression output. This is a difficult statistical problem ...still being investigated.

The F-test and $R^2$ from the reciprocal model output should be ignored. However, the t-tests for the coefficients are meaningful. Just be careful that the intercept is the slope and the slope is the intercept. Ugh!

Comparison of the regression output:

```{r}
tidy(lmfit)
b0 <- tidy(lmfit)$estimate[1]
b1 <- tidy(lmfit)$estimate[2]
tidy(lmfit.reciprocal)
b0prime <- tidy(lmfit.reciprocal)$estimate[1]
b1prime <- tidy(lmfit.reciprocal)$estimate[2]
```

```{r}
glance(lmfit)[-c(7:10)]
glance(lmfit.reciprocal)[-c(7:10)]
```

Remember the relationships: $\beta^{\prime}_0 = \beta_1, \quad \beta^{\prime}_1 = \beta_0$.

Without any transformation (ordinary least squares): $\yhat_i = \beta_0 + \beta_1 x_i = `r b0` + `r b1` x_i$

Using the transformed data, written back on the original scale (so the roles are reversed): $\yhat_i = \beta^\prime_1 + \beta^\prime_0 x_i = `r b1prime` + `r b0prime` x_i$

In the original model, $\beta_0$ (the intercept) was estimated close to zero relative to its standard error ...and in the *new* model, the corresponding parameter is the slope $\beta_1^\prime$, whose estimate is also close to zero.

## Weighted least squares

Another approach to correcting non-constant variance problems ...one that does not involve transforming the data.

To see how it works, we need to remind ourselves why non-constant variance is a problem in the first place.

The main idea:

+ When computing ordinary least squares (minimizing the SSE),
+ observations with large residuals are going to have a large influence on the fit
+ because we are minimizing the *square* of the residuals.

If the distribution of the response has larger error variance at different values of the predictor(s),

+ observations coming from areas of higher response variance will have
+ consistently larger errors (either positive or negative), and thus
+ a consistently larger influence on the fit.
+ $\widehat\sigma^2 = SSE/\text{df}$ will be inflated by these high-variance observations
+ coefficients will appear less significant than they are (larger SEs)

```{r}
lmfit <- lm(supervisors ~ workers, data = supervisorData)
gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)
```

What if we could simply tell the computer to give these high-variance observations less weight, to better equalize the influence of all the observations?

We can do this if we **know** the variance for each observation, either for theoretical reasons or because we've got repeated measurements at each $x$ value.

OLS finds the $\beta$s which minimize

$$\sum{(y_i - \widehat{y}_i)^2}=\sum{e_i^2}$$

For weighted least squares, we want to minimize

$$\sum{w_i\,(y_i - \widehat{y}_i)^2}\;=\;\sum{w_i\, e_i^2},\qquad \text{where }\;w_i\propto\frac{1}{var(\epsilon_i)}$$
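In R, weighted least squares is available through the `weights` argument of `lm()`. As a minimal sketch (not part of the textbook example), we can check that `lm()` really minimizes this weighted SSE by minimizing it directly with `optim()` and comparing the results. The weights $1/x_i^2$ used here anticipate the choice justified in the next section, and the two sets of estimates should agree up to the tolerance of the numerical optimizer.

```{r}
# Sketch: verify that lm(weights = w) minimizes sum(w * e^2).
# The weights 1 / workers^2 are the choice motivated in the next section.
wSSE <- function(beta, y, x, w) {
  sum(w * (y - beta[1] - beta[2] * x)^2)   # weighted sum of squared errors
}
w <- 1 / supervisorData$workers^2
opt <- optim(par = c(0, 0), fn = wSSE, method = "BFGS",
             y = supervisorData$supervisors, x = supervisorData$workers, w = w)
rbind(direct.minimization = opt$par,
      lm.with.weights     = coef(lm(supervisors ~ workers,
                                    data = supervisorData, weights = w)))
```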
## Example: Supervisors vs. workers (WLS)

For the supervisor data, the "fan-shaped" residual plot indicates that the std dev of the errors is proportional to the size of the predictor variable:

```{r, fig.height=3}
p1 <- gf_point(supervisors ~ workers, data = supervisorData) %>%
  gf_coefline(model = lmfit)
p2 <- gf_point(rstandard(lmfit) ~ fitted(lmfit)) %>%
  gf_labs(x = "fitted values (supervisors)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol = 2)
```

$$var(\epsilon_i) = k^2x_i^2,\quad k>0$$

so we use weights $\displaystyle w_i=\frac{1}{x_i^2}$ and minimize the function

$$\sum{\frac{1}{x_i^2}\,(y_i - \widehat{y}_i)^2}\;=\;\sum\frac{e_i^2}{x_i^2}$$

### Supervisor data with weighted least squares

```{r}
supervisorData <- mutate(supervisorData, wt = 1 / workers^2)
lmfit.wt <- lm(supervisors ~ workers, data = supervisorData, weights = wt)
```

Comparison of the transformation method vs. weighted least squares:

```{r}
tidy(lmfit.reciprocal)
tidy(lmfit.wt)
```

```{r}
glance(lmfit.reciprocal)[-c(7:10)]
glance(lmfit.wt)[-c(7:10)]
```

With weighted least squares, it is OK to use model-level statistics like $R^2$ and the F-test.

Residual plots:

```{r, echo=TRUE, fig.height=3}
p1 <- gf_point(rstandard(lmfit.reciprocal) ~ fitted(lmfit.reciprocal)) %>%
  gf_labs(x = "fitted values (supervisors/workers)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
p2 <- gf_point(rstandard(lmfit.wt) ~ fitted(lmfit.wt), data = supervisorData) %>%
  gf_labs(x = "fitted values (supervisors, weighted)", y = "std residuals") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = "dashed")
grid.arrange(p1, p2, ncol = 2)
```

## Equivalence of SSE for reciprocal xfm and WLS

For the reciprocal transformation:

$$\begin{align*}
SSE & = \sum_{i=1}^n (y_i^\prime - \beta_0^\prime - \beta_1^\prime x_i^\prime)^2\\
& = \sum_{i=1}^n \left[\frac{y_i}{x_i} - \beta_1 - \beta_0\frac{1}{x_i}\right]^2\\
& = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_1 x_i - \beta_0)^2\\
& = \sum_{i=1}^n \frac{1}{x_i^2} (y_i - \beta_0 - \beta_1 x_i)^2
\end{align*}$$

...which is the SSE for weighted least squares with weights $\displaystyle w_i = \frac{1}{x_i^2}$.

So, both methods give the same estimates for the slope and intercept (but with reversed roles for the reciprocal transformation). The WLS method results in a valid $R^2$ and F-test.
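A quick numerical check of this equivalence (not from the textbook): the WLS intercept and slope should match the reciprocal-model slope and intercept, respectively, up to floating-point tolerance.

```{r}
# Numerical check: the WLS (intercept, slope) equals the reciprocal model's
# (slope, intercept), i.e. the same estimates with the roles reversed.
rbind(wls        = coef(lmfit.wt),
      reciprocal = rev(coef(lmfit.reciprocal)))
all.equal(unname(coef(lmfit.wt)), unname(rev(coef(lmfit.reciprocal))))
```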