All lectures

1 Chapter 5: Categorical Variables as Predictors

1.1 Lecture 8

1.1.1 Intro

So far… predictor variables quantitative

1.1.1.1 What’s new… categorical predictors

Not all potential predictors need to be measured on a continuous numeric scale.

Categorical variables include numeric and non-numeric variables
usually with just a few possible values:

  • education level (high school, some college, college degree, grad school)
  • gender (male, female)
  • number of siblings (0, 1, 2, 3, …, some smallish number)
  • management style (delegator, micromanager, somewhere in between)
  • preferences (very low, low, neutral, high, very high)
  • job type (unskilled, skilled, white collar, management, executive)
  • furniture color (blue, brown, black, white)
  • marital status (single, married, widowed, divorced)
  • age group (youth, young adult, middle-aged, elderly)
  • race/ethnicity (black, hispanic, white, other)
1.1.1.1.1 Ordinal vs. nominal categorical variables

Categorical variables are ordinal or nominal

ordinal variables: categories have a natural ordering

  • education level
  • number of siblings (this even has natural numerical values: 0, 1, 2, …)
  • preferences
  • job type
  • age group
  • management style

nominal variables: no ordering

  • gender
  • furniture color
  • marital status
  • race/ethnicity
1.1.1.1.2 Indicator variables

Linear regression requires minimization of a function of numeric values.

So, we need to recode the categories as numbers
in order to include a categorical variable in the regression.

A categorical variable can be represented by one or more indicator variables.

Indicator variables…

  • “indicate” the presence or absence of a given feature,
  • encoded using two values, usually 1 (feature present) and 0 (feature absent),

For example, \[ \mbox{gender} = \left\{ \begin{array}{ll} 0 & \mbox{if female} \\ 1 & \mbox{if male} \end{array} \right. \]

Indicator variables are sometimes called “dummy variables”
since they are not the actual predictor variable, but a representation of the variable.

…like crash test dummies are not real people, but represent us in crash tests. :)
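For instance, here is a minimal R sketch (with a made-up gender vector; the name genderMale is mine) of turning a two-level categorical variable into a 0/1 indicator:

# hypothetical data: gender recorded as text
gender <- c("female", "male", "male", "female", "male")

# indicator variable: 1 if male, 0 if female
genderMale <- ifelse(gender == "male", 1, 0)
genderMale
[1] 0 1 1 0 1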

1.1.1.1.2.1 Why indicate using 0’s and 1’s?

Ease of interpretation.

Example:

\(Y =\) annual salary
\(x_1 =\) gender (male = 1, female = 0)
\(x_2 =\) years of experience

\[Y = \beta_0 + \beta_1\, x_1 + \beta_2\, x_2 + \epsilon\]

Among females, \(x_1 = 0\) and
\[E(Y \mid x_1, x_2) = \beta_0 + \beta_1 (0) + \beta_2\, x_2 = \beta_0 + \beta_2\, x_2\]

Among males, \(x_1 = 1\) and
\[E(Y \mid x_1, x_2) = \beta_0 + \beta_1 (1) + \beta_2\, x_2 = (\beta_0 + \beta_1) + \beta_2\, x_2\]

What does \(\beta_0\) mean?

Females (\(x_1=0\)) with no experience (\(x_2=0\)) can expect to earn \(\beta_0\) per year
\(\beta_0\) = the base salary for females.

What does \(\beta_1\) mean?

Males (\(x_1=1\)) with no experience (\(x_2=0\)) can expect to earn \((\beta_0 + \beta_1)\) per year.
\(\beta_0 + \beta_1\) = the base salary for males.
\(\beta_1\) = the difference in base salary for men vs. women.

\(\beta_1\) is also the average difference in salary
between men and women with the same experience level.



We could code gender \[ x_1 = \mbox{gender} = \left\{ \begin{array}{ll} 1 & \mbox{if male} \\ 2 & \mbox{if female} \end{array} \right. \]

But, then the average salary for females with no experience
is \(\beta_0 + 2\beta_1\) and the coefficients get harder to interpret.
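Here is a quick numerical illustration (simulated, hypothetical data; all object names are mine) that the 1/2 coding gives the same fit but a less convenient intercept:

set.seed(1)
gender01 <- rep(c(0, 1), each = 20)          # female = 0, male = 1
gender12 <- 2 - gender01                     # male = 1, female = 2 (the alternative coding)
exper <- runif(40, 0, 10)                    # years of experience
salary <- 30 + 5 * gender01 + 2 * exper + rnorm(40)

coef(lm(salary ~ gender01 + exper))   # intercept is about 30: the female base salary, read directly
coef(lm(salary ~ gender12 + exper))   # intercept about 40, gender coefficient about -5;
                                      # female base salary = intercept + 2*(gender coef), about 30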

1.1.2 More Details

1.1.2.1 SLR with a categorical predictor

We want to predict some quantity \(Y\) with gender (male, female)
As we will see, this is tantamount to comparing the means between the two groups.

We could write the model as \[ {\mathrm{E}}(Y \mid \mbox{gender}) \;=\; \beta_0 + \beta_1\, x \;=\; \beta_0 + \beta_1\, \mbox{gender} \]

where

\[ \mbox{gender} = \left\{ \begin{array}{ll} 0 & \mbox{if female} \\ 1 & \mbox{if male} \end{array} \right. \]

Recall that the meaning of \(\beta_0\) is \({\mathrm{E}}(Y \mid x=0)\).
So, \(\beta_0\) is the mean of \(Y\) for females.

What is \(\beta_1\)?











For men, \({\mathrm{E}}(Y \mid \mbox{gender}) = \beta_0 + \beta_1 (1) = \beta_0 + \beta_1\)

As always in SLR, \(\beta_1\) is the change in the mean of \(Y\)
associated with a one unit increase in \(x\).

Thus, \(\beta_1\) is the shift, change, …or difference in the mean
for gender=1 (male) vs. gender=0 (female):

\(\beta_1 = \mu_{\rm male} - \mu_{\rm female}\)

So, the test of \(\beta_1 = 0\) is equivalent to
testing “no difference in the mean” due to group membership (male vs. female).

That is, all of the following hypotheses are equivalent:

  • \(H_0: \beta_1 = 0\)
  • \(H_0: \mu_{\rm male} - \mu_{\rm female} = 0\)
  • \(H_0: \mu_{\rm male} = \mu_{\rm female}\)

You might recognize that the last two hypotheses can be tested
using the two-sample (pooled-variance) t-test.

Why the pooled-variance test?
Because the regression model requires constant variance for all \(x\) values.
…and would it even make sense to compare the means if the variances were quite different?

So, if the regression assumptions hold, then \({\mathrm{var}}(Y) = \sigma^2\)
for both men (gender=1) and women (gender=0).

The pooled-variance two-sample t-test has \(n_1 + n_2 -2 = n-2\) degrees of freedom
…the same as when testing \(H_0: \beta_1 = 0\)

And, yes, the two test statistics are algebraically equivalent:

\[t = \frac{{\widehat{\beta}_{1}}}{{\mathrm{se}}({\widehat{\beta}_{1}})} \;=\; \frac{{\overline{y}}_{\rm male} - {\overline{y}}_{\rm female}}{\displaystyle s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}\]

Note: \({\overline{y}}_{\rm male}\) and \({\overline{y}}_{\rm female}\) are the least squares estimates for \(\mu_{\rm male}\) and \(\mu_{\rm female}\), respectively.
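A small simulated check (hypothetical data; object names are mine) that the regression t-statistic and the pooled-variance two-sample t-statistic agree:

set.seed(2)
gender <- rep(c(0, 1), each = 15)            # 0 = female, 1 = male
y <- 50 + 4 * gender + rnorm(30, sd = 3)

# t statistic for the slope in the simple linear regression
coef(summary(lm(y ~ gender)))["gender", "t value"]

# t statistic from the pooled-variance two-sample t-test (same value)
t.test(y[gender == 1], y[gender == 0], var.equal = TRUE)$statistic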

1.1.2.1.1 Why only one indicator variable for predictor with two levels?

Why not set up the following model?

\[ {\mathrm{E}}(Y \mid \mbox{gender}) \;=\; \beta_0 + \beta_1\, x_1 + \beta_2\, x_2 \;=\; \beta_0 + \beta_1(\mbox{genderMale}) + \beta_2(\mbox{genderFemale}) \]

where

\[ \mbox{genderMale} = \left\{ \begin{array}{ll} 0 & \mbox{if female} \\ 1 & \mbox{if male} \end{array} \right. \]

and

\[ \mbox{genderFemale} = \left\{ \begin{array}{ll} 0 & \mbox{if male} \\ 1 & \mbox{if female} \end{array} \right. \]

Two problems:

  1. What is the meaning of \(\beta_0\) in this model?
    • By definition: \(\beta_0 = {\mathrm{E}}(Y \mid x_1=0, x_2=0)\)
    • but \(x_1=0\) and \(x_2=0\) is not possible.
    • …well not in the little dichotomous world of this example
  2. \(x_1\) and \(x_2\) are perfectly correlated
    • each one is a linear function of the other, completely “collinear”
    • genderMale = 1 \(-\) genderFemale
    • This strongly violates the regression assumptions

Conclusion: we need only one indicator variable
for a categorical predictor variable with two categories

1.1.2.2 What if there are more than two categories?

If there are three categories or more,
we likely cannot just plug in values (like 0, 1, 2, …)

For example, marital status: single, married, widowed, divorced

If we assign values 1, 2, 3, 4, we assume these are ordered numeric categories.
What is the correct ordering? Is it meaningful?

…and for the ordinal variable “preference” (very low, low, neutral, high, very high),
if we assign values 1, 2, 3, 4, 5, we are stating that
the average change in the response variable is the same from “very low” to “low”
as it is from “high” to “very high” (a linear relationship)

…but maybe that’s not true.

1.1.2.2.1 Indicator variables for predictors with more than two categories

A categorical predictor with \(k\) possible values (\(k\) feature types, groups, etc)
requires a set of \(k-1\) indicator variables.

The \(k-1\) variables answer the sequence of questions:
“Is feature \(1\) present? Is feature \(2\) present? … Is feature \(k-1\) present?”

Note: If all answers are “no” then it must be true that feature \(k\) is present.
Likewise, if the answer to one of the \(k-1\) questions is “yes”, then the \(k\)-th feature is not present.

There can be at most one “yes” answer among the \(k-1\) questions.

1.1.2.3 A little example with three (or five) categories

Consider the variable “smoking status”

We can categorize it as never smoker, occasional smoker, or heavy smoker

\(k = 3\).

If you want to be more specific: never, ex-, light, moderate, or heavy smoker

\(k = 5\)

Note: The different categories should be mutually exclusive.

When \(k=3\), we can code this variable using \(k-1 = 3-1 = 2\) indicator (dummy) variables.

\[ \mbox{occasSmoker} = \left\{ \begin{array}{ll} 1 & \mbox{if occasionally smoked}\\ 0 & \mbox{otherwise} \end{array} \right. \] \[ \mbox{heavySmoker} = \left\{ \begin{array}{ll} 1 & \mbox{if smoked above some threshold frequency} \\ 0 & \mbox{otherwise} \end{array} \right. \]

The omitted category, which we might call neverSmoker, corresponds to 0’s for both of the indicators above.
This is called the base, control, or reference category.

If the variable is ordinal (as it is here),
we typically set the reference category to the “lowest” group
and bring the other variables into the model in order from “lowest” to “highest”.

Why? This usually makes the model coefficients easiest to interpret.
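In R, one common way to set this up (a sketch with a made-up smoking vector) is to store the variable as a factor whose first level is the reference; model.matrix() shows the \(k-1\) indicator columns that lm() would create:

# hypothetical smoking-status data
smoke <- c("never", "heavy", "occasional", "never", "occasional", "heavy")

# order the levels so "never" (the lowest group) is the reference category
smoke <- factor(smoke, levels = c("never", "occasional", "heavy"))

# the k - 1 = 2 indicator columns lm() would build from this factor
model.matrix(~ smoke)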

1.1.3 Data Description

1.1.3.1 Re-analysis of Section 5.2 salary data

Survey of computer professionals at a large company

Response variable:
\(S =\) salary (dollars)

Predictor variables:
\(x =\) experience (years)
\(e =\) education (highest level: 1=HS, 2=BS, 3=higher)
\(m =\) management (management responsibility: 1=yes, 0=no)

salaryData <- read.delim("http://statistics.uchicago.edu/~collins/data/RABE5/P130.txt")
salaryData <- read.delim("./data/P130.txt")
n <- nrow(salaryData)

There are observations on n = 46 computer professionals.

head(salaryData, 10)
S X E M
13876 1 1 1
11608 1 3 0
18701 1 3 1
11283 1 2 0
11767 1 3 0
20872 2 2 1
11772 2 2 0
10535 2 1 0
12195 2 3 0
12313 3 2 0
1.1.3.1.1 How to represent education in the model?

Three categories: HS, BS, and higher

In the textbook dataset, coded as ordinal (ordered categories: 1, 2, 3)

  • Should we treat education as ordinal?
    • The categories do have quantitative meaning (years of education)
    • But coding as numbers imposes a linearity constraint (perhaps not plausible)
  • Does “average increase corresponding to one unit change in education” make sense?
  • Does it make sense to expect the same average increase?
    • that is, the same salary change from HS to BS as from BS to a higher degree?
    • Perhaps there are diminishing marginal returns on salary? (spacing more like 1, 3, 3.5 than 1, 2, 3)

For now, let’s treat education as a nominal variable
and later see if we can justify a linear effect.

1.1.3.1.2 Generating an indicator variable

We need to generate 0-1 indicator variables for the education levels.

The following code will create three indicator variables
even though we know we really only need to use two of them in the model.

3 categories \(\leadsto\) 2 indicator variables

There are probably dozens of ways to do this in R.
I Googled and found a package called “dummies”. That sounded worth a try!

I asked for dummy variables to be created based on the levels of the variable E



The new dataset looks like this:

# install package once if not already installed
# install.packages("dummies")

# load the package when you first need it
library(dummies)
dummies-1.5.6 provided by Decision Patterns

# Create the dummy variables for variable E (1, 2, 3)
salaryDummy <- dummy.data.frame(salaryData, names=c("E"))

# view the first few data observations
head(salaryDummy, 10)
S X E1 E2 E3 M
13876 1 1 0 0 1
11608 1 0 0 1 0
18701 1 0 0 1 1
11283 1 0 1 0 0
11767 1 0 0 1 0
20872 2 0 1 0 1
11772 2 0 1 0 0
10535 2 1 0 0 0
12195 2 0 0 1 0
12313 3 0 1 0 0
# view the last few data observations
tail(salaryDummy, 10)
S X E1 E2 E3 M
37 24170 12 0 0 1 1
38 15990 13 1 0 0 0
39 26330 13 0 1 0 1
40 17949 14 0 1 0 0
41 25685 15 0 0 1 1
42 27837 16 0 1 0 1
43 18838 16 0 1 0 0
44 17483 16 1 0 0 0
45 19207 17 0 1 0 0
46 19346 20 1 0 0 0

Hmmm…. I would have preferred to keep the original variable E as well.
I’ll just grab E from the original data and stick it into the new dataset.



# Put the education variable back in the dataset
# Select the variables in the order you want them to appear in the dataset
salaryDummy <- salaryDummy %>%
  mutate(E = salaryData$E) %>%
  select(S, X, E, E1, E2, E3, M)
head(salaryDummy,10)
S X E E1 E2 E3 M
13876 1 1 1 0 0 1
11608 1 3 0 0 1 0
18701 1 3 0 0 1 1
11283 1 2 0 1 0 0
11767 1 3 0 0 1 0
20872 2 2 0 1 0 1
11772 2 2 0 1 0 0
10535 2 1 1 0 0 0
12195 2 3 0 0 1 0
12313 3 2 0 1 0 0

Here is a summary (counts) just to confirm that everything is as expected

data.frame(tally(~ E, data=salaryDummy))
E Freq
1 14
2 19
3 13

data.frame(tally(~ E1, data=salaryDummy))
E1 Freq
0 32
1 14
data.frame(tally(~ E2, data=salaryDummy))
E2 Freq
0 27
1 19
data.frame(tally(~ E3, data=salaryDummy))
E3 Freq
0 33
1 13

# data.frame(tally(E1 + E2 + E3 ~ E, data=salaryDummy))
1.1.3.1.3 Fitting a first model including indicator variables

3 categories \(\leadsto\) 2 indicator variables

I’ll choose to use E2 and E3.
Then, E1 (high school) becomes the baseline, or reference, group.

Note: If the categories do have a sense of order,
setting the lowest as the reference category usually makes the model easier to interpret.

Then, include the remaining variables from the next lowest up to the highest.

The textbook chose to use the highest level of education as the reference group.
For myself, I found it very difficult to interpret the model coefficients “backwards”.
See textbook Table 5.3.

So, the model we would like to fit is then

\[ S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 m + \epsilon \]

1.1.3.2 Interpreting coefficients for indicator variables

The model above predicts the following:

For people with HS only (reference category: \(e_2=e_3=0\)) \[ S = \beta_0 + \beta_1 x + \beta_4 m + \epsilon \]

What does \(\beta_0\) represent?

Do you think \(\beta_0\) is positive or negative?

\(\beta_0\) is the average salary for those with a HS education only
with no experience (\(x=0\)) and not in a management position (\(m=0\))



For people with a college degree (\(e_2 = 1\))

\[ S = \beta_0 + \beta_1 x + \beta_2 + \beta_4 m + \epsilon \;=\; (\beta_0 + \beta_2) + \beta_1 x + \beta_4 m + \epsilon \]

\(\beta_2\) interpretation:



Measures the difference in average salary for those with a college degree,
relative to those with a high-school diploma only
…adjusted for years of experience and management status.

Do you think \(\beta_2\) is positive or negative?

What does the sum (\(\beta_0 + \beta_2\)) represent in the model?

For people with a higher degree (\(e_3 = 1\))

\[ S = \beta_0 + \beta_1 x + \beta_3 + \beta_4 m + \epsilon \;=\; (\beta_0 + \beta_3) + \beta_1 x + \beta_4 m + \epsilon \]

What does \(\beta_3\) represent in the model?
Do you think \(\beta_3\) is positive or negative?


What does \((\beta_0 + \beta_3)\) represent in the model?


What does \(\beta_1\) represent in the model?
Do you think \(\beta_1\) is positive or negative?


What does \(\beta_4\) represent in the model? Do you think \(\beta_4\) is positive or negative?

1.1.3.2.1 The estimated model coefficients

\[{\widehat{y}}= {\widehat{\beta}_{0}} + {\widehat{\beta}_{1}} x + {\widehat{\beta}_{2}} e_2 + {\widehat{\beta}_{3}} e_3 + {\widehat{\beta}_{4}} m\]


lmfit.nominal <- lm(S ~ X + E2 + E3 + M, data = salaryDummy)
lmfit.null.model <- lm(S ~ 1, data = salaryDummy) # model with no predictors
tidy(lmfit.nominal) # coefficients and their statistics
term estimate std.error statistic p.value
(Intercept) 8035.6 386.69 20.780 0
X 546.2 30.52 17.896 0
E2 3144.0 361.97 8.686 0
E3 2996.2 411.75 7.277 0
M 6883.5 313.92 21.928 0
anova(lmfit.null.model, lmfit.nominal) # F-test for all predictor coeff = 0
Res.Df RSS Df Sum of Sq F Pr(>F)
45 1001097577 NA NA NA NA
41 43280719 4 957816858 226.8 0
glance(lmfit.nominal)[-c(7:10)] # R^2 and other summaries
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9568 0.9525 1027 226.8 0 5 41

1.1.4 Inference

1.1.4.1 F-tests to assess questions of interest

1.1.4.1.1 Does education affect salary?

\(H_0: \beta_2 = \beta_3 = 0\)

Full model: \(S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 m + \epsilon\)

Reduced model: \(S = \beta_0 + \beta_1 x + \beta_4 m + \epsilon\)

Sample size: \(n = 46\)

p.FM <- 4
p.RM <- 2

\[F = \frac{({\mathrm{SSE}}(RM) - {\mathrm{SSE}}(FM)) \;/\; 2}{{\mathrm{SSE}}(FM) \;/\; 41}\]

lmfit.FM <- lm(S ~ X + E2 + E3 + M, data = salaryDummy) # full model estimate
lmfit.RM <- lm(S ~ X + M, data = salaryDummy) # reduced model estimate
anova(lmfit.RM, lmfit.FM) # F-test
Res.Df RSS Df Sum of Sq F Pr(>F)
43 134806963 NA NA NA NA
41 43280719 2 91526243 43.35 0
1.1.4.1.2 Is a higher degree really worth more than a Bachelor’s?

\(H_0: \beta_2 = \beta_3\)

Full model: \(S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 m + \epsilon\)

Reduced model: \(S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_2 e_3 + \beta_4 m + \epsilon\)

That is, \(S = \beta_0 + \beta_1 x + \beta_2 (e_2 + e_3) + \beta_4 m + \epsilon\)

You can fit this reduced model in R as follows:
lmfit.RM <- lm(S ~ X + I(E2 + E3) + M, data = salaryDummy)

p.FM <- 4
p.RM <- 3

\[F = \frac{({\mathrm{SSE}}(RM) - {\mathrm{SSE}}(FM)) \;/\; 1}{{\mathrm{SSE}}(FM) \;/\; 41}\]

lmfit.FM <- lm(S ~ X + E2 + E3 + M, data = salaryDummy) # full model estimate
lmfit.RM <- lm(S ~ X + I(E2 + E3) + M, data = salaryDummy) # reduced model estimate
anova(lmfit.RM, lmfit.FM) # F-test
Res.Df RSS Df Sum of Sq F Pr(>F)
42 43434219 NA NA NA NA
41 43280719 1 153499 0.1454 0.7049

Hmmm… Perhaps the higher education is not worth it in this industry!

1.1.4.1.3 What if we chose a different reference group?

\[S = \beta_0 + \beta_1 x + \beta_2 e_1 + \beta_3 e_2 + \beta_4 m + \epsilon\]

This is what the authors chose in Section 5.2.

The reference category has changed to the highest education group.

lmfit2.nominal <- lm(S ~ X + E1 + E2 + M, data = salaryDummy)
tidy(lmfit2.nominal)
term estimate std.error statistic p.value
(Intercept) 11031.8 383.22 28.7874 0.0000
X 546.2 30.52 17.8964 0.0000
E1 -2996.2 411.75 -7.2767 0.0000
E2 147.8 387.66 0.3813 0.7049
M 6883.5 313.92 21.9277 0.0000
anova(lmfit.null.model, lmfit2.nominal)
Res.Df RSS Df Sum of Sq F Pr(>F)
45 1001097577 NA NA NA NA
41 43280719 4 957816858 226.8 0
glance(lmfit2.nominal)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9568 0.9525 1027 226.8 0 5 41

Things to notice…

  • The p-value for the F-test of \(H_0: \beta_2=\beta_3\) for our earlier model is identical to the p-value for the t-test of \(H_0: \beta_3 = 0\) in the new model. Why?

  • ANOVA table output is identical (we haven’t really changed the overall model fit)

1.1.4.1.4 The “saturated model” (too many indicators)

What if we tried to use all three indicator variables in the model?

\[S = \beta_0 + \beta_1 x + \beta_2 e_1 + \beta_3 e_2 + \beta_4 e_3 + \beta_5 m + \epsilon\]

Well… The lm function in R seems to be smart enough to ignore our overzealous attempt.


lmfit3.nominal <- lm(S ~ X + E1 + E2 + E3 + M, data = salaryDummy)
tidy(lmfit3.nominal)
term estimate std.error statistic p.value
(Intercept) 11031.8 383.22 28.7874 0.0000
X 546.2 30.52 17.8964 0.0000
E1 -2996.2 411.75 -7.2767 0.0000
E2 147.8 387.66 0.3813 0.7049
M 6883.5 313.92 21.9277 0.0000

R took in the first two indicators and dropped the third (redundant).
That is so polite. I wish R would fix my errors all the time.

The effect is that the highest education group became the reference category.

I prefer the lowest group as the reference and would force R to do that myself:
lmfit.nominal <- lm(S ~ X + E2 + E3 + M, data = salaryDummy)



There are other ways to set the base/control category in R.
I chose to use the dummies package for transparency (pedagogy, simplicity).
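For example, one common alternative (a sketch, not what was done above; the names Ef, Ef3, and lmfit.factor are mine) is to declare E as a factor and let lm() build the indicators, with relevel() changing the reference category if desired:

# treat education as a factor; lm() then creates the k - 1 indicators itself
salaryData$Ef <- factor(salaryData$E)                  # levels "1", "2", "3"; "1" (HS) is the reference
lmfit.factor <- lm(S ~ X + Ef + M, data = salaryData)

# relevel() moves the reference category, e.g., to the highest education group
salaryData$Ef3 <- relevel(salaryData$Ef, ref = "3")
lmfit.factor3 <- lm(S ~ X + Ef3 + M, data = salaryData)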

1.1.4.2 Another salary analysis: ordinal variable

What if we had used the original education variable \(e\) (coded 1, 2, 3)?

Then, the regression model would be

\[S = \beta_0 + \beta_1 x + \beta_2 e + \beta_3 m + \epsilon\] and specifies that

For people with HS only (\(e=1\)) \[ S = \beta_0 + \beta_1 x + \beta_2 + \beta_3 m + \epsilon \;=\; (\beta_0 + \beta_2) + \beta_1 x + \beta_3 m + \epsilon \]

For people with a BS (\(e=2\))

\[ S = \beta_0 + \beta_1 x + 2\beta_2 + \beta_3 m + \epsilon \;=\; (\beta_0 + 2\beta_2) + \beta_1 x + \beta_3 m + \epsilon \]

For people with higher degree (\(e=3\))

\[ S = \beta_0 + \beta_1 x + 3\beta_2 + \beta_3 m + \epsilon \;=\; (\beta_0 + 3\beta_2) + \beta_1 x + \beta_3 m + \epsilon \]

1.1.4.2.1 Model fit with ordinal education variable
lmfit.ordinal <- lm(S ~ X + E + M, data = salaryData)
tidy(lmfit.ordinal)
term estimate std.error statistic p.value
(Intercept) 6963.5 665.69 10.460 0
X 570.1 38.56 14.785 0
E 1578.8 262.32 6.018 0
M 6688.1 398.28 16.793 0
anova(lmfit.null.model, lmfit.ordinal)
Res.Df RSS Df Sum of Sq F Pr(>F)
45 1001097577 NA NA NA NA
42 72383410 3 928714168 179.6 0
glance(lmfit.ordinal)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9277 0.9225 1313 179.6 0 4 42
beta <- tidy(lmfit.ordinal)$estimate

The estimated coefficient for education is \(\widehat{\beta}_2 = 1,579\).

In particular, this model implies we expect
the average salary difference between HS vs BS holders (\(1,579\)) to be the same as
the difference between BS vs higher-degree holders (\(1,579\)).

Does that seem reasonable?

1.1.4.3 Deciding between ordinal and nominal

How would we check if the model with ordinal education levels makes sense?

  • First, the variable has to make sense to be treated as ordered categories.
    • For education level, we would expect some sort of monotone increasing effect,
    • but may be non-linear (diminishing marginal returns, not equal increments)
  • We could do a test by fitting the nominal variable model (indicator variables)
    • Then test \(H_0: \beta_3 = 2\beta_2\) to see if equal (linear) increments are appropriate (see the R sketch after this list)
    • Does a higher degree have twice the effect (relative to HS) that a BS has?
    • This was clearly not the case in the model with indicator variables for education level:
    • there is a large, statistically significant salary bump from HS to BS,
    • …while the bump from BS to a higher degree is not even practically (or statistically) significant.
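Here is a sketch of that 1-df F-test in R, in the same spirit as the earlier \(I(E2 + E3)\) test: under \(H_0: \beta_3 = 2\beta_2\), the education terms collapse to \(\beta_2(e_2 + 2e_3)\). (The name lmfit.linRM is mine.)

lmfit.FM <- lm(S ~ X + E2 + E3 + M, data = salaryDummy)            # full model
lmfit.linRM <- lm(S ~ X + I(E2 + 2 * E3) + M, data = salaryDummy)  # imposes beta3 = 2 * beta2
anova(lmfit.linRM, lmfit.FM)                                       # 1-df F-test of linear increments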


Of course, look at residual plots for both models: nominal and ordinal

1.1.4.3.1 An advantage of the ordinal representation

If an ordinal representation is appropriate for the model,
we prefer to use the ordinal vs. nominal (indicator) representation.

What are the df for the two variance estimates, MSE = \(\widehat{\sigma}^2\)?
Equivalently, what are the degrees of freedom for the two overall F-tests?

Ordinal:

Nominal:

This “savings” in df makes the ordinal model more efficient (higher statistical power for the same \(n\) observations)

For the salary/education data, the difference is only one degree of freedom
(since \(k=3\) leads to \(k-1=2\) indicator variables vs. one ordinal variable).

But, if the number of categories (\(k\)) is large and the sample size small,
this difference becomes very important.
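As a quick check, the residual df can be read directly off the two fits above:

df.residual(lmfit.ordinal)   # ordinal education: n - 4 = 42
df.residual(lmfit.nominal)   # nominal education (two indicators): n - 5 = 41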

1.1.4.4 Investigating the residuals

Now that we have decided the indicator variables are most appropriate for bringing education level into the model, let’s check the model assumptions.

lmfit.nominal <- lm(S ~ X + E2 + E3 + M, data = salaryDummy)
gf_point(rstandard(lmfit.nominal) ~ fitted(lmfit.nominal), data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal education levels)") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = 2)

Whoa! Why do we see clusters of points?

Let’s highlight education level and the management indicator in the plot.

gf_point(rstandard(lmfit.nominal) ~ fitted(lmfit.nominal), size = ~E, color = ~ M, data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal education levels)") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = 2)

Wow! Using the model

\[ S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 m + \epsilon \]

We are clearly not picking up all of the structure present in the data.

Here is another plot I think tells the story of the residuals a little better.

gf_point(rstandard(lmfit.nominal) ~ (E + 3*M), size = ~E, color = ~ M, data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "Education and management categories") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = 2)

1.1.5 Interaction

1.1.5.1 Interactions (“effect modification”)

All our models thus far specify that, at any given level/value of one predictor,
the increment in the mean of \(Y\) due to another predictor is the same.

There are many situations in which the effect of one predictor on the response is
NOT the same within different categories of some other predictor.

An example:
Response: risk of blood clots in women
Predictor: number of months taking oral contraceptives (OC)

For women who don’t smoke, the increase in risk per month on OC is modest.
For women who smoke, the increase in risk per month on OC is very large.

That is, the relationship of OC and blood clot risk is different for smokers and non-smokers.

So, the following model would not be adequate: \[{\rm risk} = \beta_0 + \beta_1 {\rm smoke} + \beta_2 {\rm OC} + \epsilon\]

because for smokers (smoke = 1), \[{\rm risk} = (\beta_0 + \beta_1) + \beta_2 {\rm OC} + \epsilon\] and for non-smokers (smoke = 0), \[{\rm risk} = \beta_0 + \beta_2 {\rm OC} + \epsilon\] This model sets the additional risk-per-month = \(\beta_2\) regardless of smoking status

In fact, if we plotted separate regression lines for clot risk versus time-on-OC
for data from smokers and data from non-smokers,
we would see two lines with different slopes for smokers and non-smokers.

The model above is inadequate. It only allows for different intercepts, not different slopes.

Interaction: In this situation, we say that OC and smoking interact

The effect of smoking is a multiplicative (slope) effect, not just additive (intercept).

1.1.5.1.1 Including effect modification (interaction) in the model

We want a model that allows a different slope (steeper increase in risk from OC)
for smokers vs. non-smokers

We include an “interaction term” between the primary predictor and the other predictor that modifies the effect of the primary predictor.

This is achieved by multiplying the two predictors together to form a new predictor called an interaction or interaction term in the model.

\[{\rm risk} = \beta_0 + \beta_1 {\rm smoke} + \beta_2 {\rm OC} + \beta_3 ({\rm OC}\times {\rm smoke}) + \epsilon\]

For smokers (smoke = 1), \[{\rm risk} = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) {\rm OC} + \epsilon\] and for non-smokers (smoke = 0), \[{\rm risk} = \beta_0 + \beta_2 {\rm OC} + \epsilon\]

In this scenario, \(\beta_3 > 0\) and increases the rate at which blood clot risk increases
for each month of oral contraceptive use for those who smoke vs. non-smokers.
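In R, this model could be fit as sketched below, assuming a hypothetical data frame clotData with columns risk, OC (months of use), and smoke (0/1):

# interaction term written explicitly ...
fit.clot <- lm(risk ~ smoke + OC + smoke:OC, data = clotData)

# ... or equivalently: smoke * OC expands to both main effects plus the interaction
fit.clot <- lm(risk ~ smoke * OC, data = clotData)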

1.1.5.1.2 Another interaction between quantitative and nominal variables

Is the effect of experience on salary the same for people with different education levels?

First, let’s look at salary plotted against experience (marking education level), with the no-interaction model in mind: \[S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \epsilon\]

gf_point(S ~ X, data=salaryDummy, size = ~E, color = ~E) %>%
  gf_labs(x = "Experience (years)", y = "Salary ($1,000)")

Is the slope the same for each education level?

It appears that a regression line fit within each education level
would have about the same slope as the overall trend seen here.

To explore whether there might be different effects (slopes) of experience
according to education level, fit a separate regression line for each education level.


This is called a stratified analysis.


Look for non-parallel regression lines.

lmfitE1 <- lm(S ~ X, data = subset(salaryDummy, E1 == 1))
lmfitE2 <- lm(S ~ X, data = subset(salaryDummy, E2 == 1))
lmfitE3 <- lm(S ~ X, data = subset(salaryDummy, E3 == 1))
gf_point(S ~ X, data=salaryDummy, size = ~E, color = ~E, alpha=0.6) %>%
  gf_labs(x = "Experience (years)", y = "Salary ($1,000)") %>%
  gf_coefline(model = lmfitE1) %>%
  gf_coefline(model = lmfitE2, linetype=2) %>%
  gf_coefline(model = lmfitE3, linetype=3)
Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

The slopes do not appear to be the same!

This suggests a differential effect of experience, modified by education level:
called an effect modification …or interaction effect.

1.1.5.1.3 Incorporating interaction effects into the model

How “non-parallel” do the lines need to be to declare the slopes different?

To some extent, an answer requires considerations beyond the statistical.

But, we can formally test for the necessity of interaction effects in the model.

The new model looks like this:

\[ S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 (e_2\times x) + \beta_5 (e_3 \times x) + \epsilon \]

The products of the experience and education indicator variables are additional predictors in the model.

For people with HS only (\(e_2 = 0\), \(e_3 = 0\)) \[ S = \beta_0 + \beta_1 x + \epsilon \]

For people with a BS (\(e_2 = 1\), \(e_3 = 0\))

\[ S = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x + \epsilon \]

For people with higher degree (\(e_2 = 0\), \(e_3 = 1\))

\[ S = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x + \epsilon \]

1.1.5.1.4 What is the effect of experience on salary in the model?

With one more year of experience, how much do we expect average salary to change?

Now it depends… on the level of education!

  • For HS, the effect (slope) is \(\beta_1\)
  • For BS, it is (\(\beta_1+\beta_4\))
  • For higher degrees, it is (\(\beta_1+\beta_5\))

1.1.5.2 Fitting a model with interaction effects

We can create the additional predictors needed
…or just ask R to include interaction effects.

I’ll demonstrate both methods.

Once we have the model fit, we can use an F-test
to evaluate the need for these extra predictors in the model
(that is, to test whether the slopes are different)

Method #1: Just ask R to include interaction terms


lmfit.interact <- lm(S ~ X + E2 + E3 + E2*X + E3*X, data = salaryDummy)
tidy(lmfit.interact)
term estimate std.error statistic p.value
(Intercept) 12299.0 1740.4 7.0669 0.0000
X 324.5 179.6 1.8065 0.0784
E2 1461.2 2326.4 0.6281 0.5335
E3 898.2 2357.1 0.3811 0.7052
X:E2 216.3 238.6 0.9066 0.3700
X:E3 595.5 288.9 2.0615 0.0458

Method #2: Create new predictors by multiplying old predictors together


salaryInteract <- mutate(salaryDummy, E2X = E2 * X, E3X = E3 * X)
lmfit.interact2 <- lm(S ~ X + E2 + E3 + E2X + E3X, data = salaryInteract)
tidy(lmfit.interact2)
term estimate std.error statistic p.value
(Intercept) 12299.0 1740.4 7.0669 0.0000
X 324.5 179.6 1.8065 0.0784
E2 1461.2 2326.4 0.6281 0.5335
E3 898.2 2357.1 0.3811 0.7052
E2X 216.3 238.6 0.9066 0.3700
E3X 595.5 288.9 2.0615 0.0458
1.1.5.2.1 F-test for interaction effects

From the regression output, the interaction term for E3 has a somewhat low p-value (0.046), while the one for E2 does not (0.37).

Since we are testing two coefficients simultaneously, we should not rely on the individual t-tests: use an F-test or a Bonferroni correction (compare each p-value to 0.05/2 = 0.025), under which neither coefficient would be declared non-zero.

Here is the F-test:

Full model: \(S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 (e_2\times x) + \beta_5 (e_3 \times x) + \epsilon\)

Reduced model: \(S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \epsilon\)

The models differ by \(2\) df (two coefficients).
The full model has \(n - 5 - 1 = 40\) df.

lmfit.FM <- lm(S ~ X + E2 + E3 + E2*X + E3*X, data = salaryDummy) # full model estimate
lmfit.RM <- lm(S ~ X + E2 + E3, data = salaryDummy) # reduced model estimate
anova(lmfit.RM, lmfit.FM) # F-test
Res.Df RSS Df Sum of Sq F Pr(>F)
42 550853135 NA NA NA NA
40 497897342 2 52955792 2.127 0.1325

This result suggests there is weak evidence at best of different slopes per education level.
We might choose to omit the interaction terms from the model,
opting for the simpler main effects model.

Note: The Bonferroni method is more conservative (less likely to reject null hypotheses),
but at least when we do reject, we know exactly which terms were rejected.

If we reject an omnibus test like the F-test, we just know that one of the predictors has a coefficient significantly non-zero …but not which predictor.

1.1.5.2.2 Interactions between continuous and ordinal variables

If we had used education as the original variable “\(e\)” (coded as 1,2,3)
instead of indicator variables \(e_2\) and \(e_3\), then the regression (with interaction) would look like:

\[ S = \beta_0 + \beta_1 x + \beta_2 e + \beta_3 (e\times x) + \epsilon \]

Interpretation:
For each additional year of experience increase, the salary increase will depend on education.

  • For HS, the coefficient for experience is (\(\beta_1+\beta_3\))
  • For BS, it is (\(\beta_1+2\beta_3\))
  • For higher degrees, it is (\(\beta_1+ 3\beta_3\))

From earlier analyses, we don’t actually believe education is ordinal,
so we won’t bother to fit this model.

Note: Two continuous covariates that interact are interpreted similarly.

1.1.5.3 Summary of quantitative-categorical interaction

For a model with continuous predictor \(x\)
and categorical predictor \(c\) (\(k=2\) levels: 0 and 1),
the full interaction model is

\[ Y = \beta_0 + \beta_1 x + \beta_2 c + \beta_3 (c \times x) + \epsilon \]

The parameters (\(\beta\)s) in the model are:

  • \(\beta_0\): mean of \(Y\) when \(x=0, c=0\)
  • \(\beta_1\): effect on mean of \(Y\) of incrementing \(x\) one unit, (when \(c=0\))
  • \(\beta_2\): shift of mean of \(Y\) when \(c=1\), (when \(x = 0\))
  • \(\beta_3\): When \(c=1\) and when there is one unit increase in \(x\), the expected change in \(Y\) is (\(\beta_1+\beta_3\)).
    • So \(\beta_3\) is the additional effect on \(Y\) of a unit increase in \(x\) when \(c=1\), compared with when \(c=0\).
1.1.5.3.1 The variety of models with and without interaction

Different intercepts, different slopes \[ Y = \beta_0 + \beta_1 x + \beta_2 c + \beta_3 (c \times x) + \epsilon \]

Different intercepts, same slope \[ Y = \beta_0 + \beta_1 x + \beta_2 c + \epsilon \]

Same intercept, different slopes \[ Y = \beta_0 + \beta_1 x + \beta_3 (c \times x) + \epsilon \]

The last model is not common.

Instead, it is common practice when including interaction terms to also include all main effects.

1.1.5.3.2 Comments on regression with interactions
  • When interaction effects exist, we call the effect of each predictor alone a main effect

Accepted practice:
If interaction effects are in the model, leave the main effects in the model too
…even if not statistically significant (makes interpretation much easier)

  • In the presence of interaction effects,
    • we can no longer talk about effects of one predictor variable without considering the value taken on by another.
    • That is, the phrase, “while other variables are held constant” makes no sense
  • A critical issue with interaction effects is statistical power.
    • The interaction effects may be present, but detecting them requires an adequate sample size across the many possible combinations of predictor values

Some other thoughts you can read later:

There are some situations in which you may want to leave one or all of the main effects out. It may be that the interaction effect is more relevant for predicting the response variable than the main effect. Often the sample size is quite small, so a parsimonious model is desired. It could be that the main effect itself is intuitively not relevant for predicting the response variable, but the interaction is. Or, including the main effect would result in violations of assumptions while the interaction terms would not; for example, the main effects and the interaction terms are highly correlated, and including both would result in collinearity, so you may drop one depending on which is more relevant or more significant.

1.1.5.4 Using residual plots to help choose an appropriate model

Residuals from the model with nominal education level (\(e_2 = 0, 1\) and \(e_3 = 0, 1\)) \[ S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 m + \epsilon \]

gf_point(rstandard(lmfit.nominal) ~ fitted(lmfit.nominal)) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal education levels)") %>%
  gf_hline(yintercept = 0, linetype = 2)

gf_point(rstandard(lmfit.nominal) ~ fitted(lmfit.nominal), size = ~E, color = ~ M, data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal education levels)") %>%
  gf_hline(yintercept = 0, linetype = 2)

Residuals from the model with ordinal education level (\(e = 1, 2, 3\)) \[ S = \beta_0 + \beta_1 x + \beta_2 e + \beta_3 m + \epsilon \]

gf_point(rstandard(lmfit.ordinal) ~ fitted(lmfit.ordinal)) %>%
  gf_labs(y = "standardized residuals", x = "yhat (ordinal education levels)")

gf_point(rstandard(lmfit.ordinal) ~ fitted(lmfit.ordinal), size = ~E, color = ~ M, data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "yhat (ordinal education levels)")

Residuals from the model with nominal education level (\(e_2, e_3\)) and an experience-education interaction \[ S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 (e_2\times x) + \beta_5 (e_3 \times x) + \epsilon \]

gf_point(rstandard(lmfit.interact) ~ fitted(lmfit.interact)) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal educ, experience interaction)")

gf_point(rstandard(lmfit.interact) ~ fitted(lmfit.interact), size = ~E, color = ~ M, data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal educ, experience interaction)")

Remember that we did not consider the interaction effects to be significant,
so the above plot is just for illustration purposes.

In all residual plots, it seems management status has a differential effect by education level.

Even though interaction between experience and education level was not a significant improvement,
perhaps we need to include interaction of management status and education level?

First, does management also modify the effect of experience?

lmfitM0 <- lm(S ~ X, data = salaryDummy[salaryDummy$M==0, ])
lmfitM1 <- lm(S ~ X, data = salaryDummy[salaryDummy$M==1, ])
gf_text(S ~ X, data=salaryDummy, label = ~E) %>%
  gf_labs(x = "Experience (years)", y = "Salary ($1,000)") %>%
  gf_coefline(model = lmfitM0) %>%
  gf_coefline(model = lmfitM1)
Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

The regression lines are pretty much parallel.
So, we won’t consider an experience-management interaction effect further.

Now, what about education-management interaction?

We can fit a separate regression line for each of 6 education-management combinations.

Caution: We are really slicing up this small data set into little bits: \(n = 46\).

lmfitE1M0 <- lm(S ~ X, data = salaryDummy[(salaryDummy$M==0 & salaryDummy$E==1), ])
lmfitE2M0 <- lm(S ~ X, data = salaryDummy[(salaryDummy$M==0 & salaryDummy$E==2), ])
lmfitE3M0 <- lm(S ~ X, data = salaryDummy[(salaryDummy$M==0 & salaryDummy$E==3), ])
lmfitE1M1 <- lm(S ~ X, data = salaryDummy[(salaryDummy$M==1 & salaryDummy$E==1), ])
lmfitE2M1 <- lm(S ~ X, data = salaryDummy[(salaryDummy$M==1 & salaryDummy$E==2), ])
lmfitE3M1 <- lm(S ~ X, data = salaryDummy[(salaryDummy$M==1 & salaryDummy$E==3), ])
gf_text(S ~ X, data=salaryDummy, label = ~E, color = ~M) %>%
  gf_labs(x = "Experience (years)", y = "Salary ($1,000)") %>%
  gf_coefline(model = lmfitE1M0) %>%
  gf_coefline(model = lmfitE2M0) %>%
  gf_coefline(model = lmfitE3M0) %>%
  gf_coefline(model = lmfitE1M1, linetype=2) %>%
  gf_coefline(model = lmfitE2M1, linetype=2) %>%
  gf_coefline(model = lmfitE3M1, linetype=2)
Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

Warning: Ignoring unknown parameters: inherit.aes

These lines look very close to parallel. They are not perfectly parallel, but very close. This suggests that we try fitting a single model where the 6 different management-education subgroups share the same slope, but are allowed to have different intercepts.

So, let’s fit the following model: \[ S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 (e_2\times m) + \beta_5 (e_3 \times m) + \beta_6 m + \epsilon \]

Each of the 6 subgroups will have the same \(\beta_1\), but they will have six different intercepts.


lmfit.interact.educ.manage <- lm(S ~ X + E2 + E3 + E2*M + E3*M + M, data = salaryDummy)
tidy(lmfit.interact.educ.manage)
term estimate std.error statistic p.value
(Intercept) 9473 80.344 117.90 0
X 497 5.566 89.28 0
E2 1382 77.319 17.87 0
E3 1731 105.334 16.43 0
M 3981 101.175 39.35 0
E2:M 4903 131.359 37.32 0
E3:M 3066 149.330 20.53 0

The education-management interaction effects are highly significant.

What do the residuals from this model look like?

gf_point(rstandard(lmfit.interact.educ.manage) ~ fitted(lmfit.interact.educ.manage)) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal educ, management interaction)")

gf_point(rstandard(lmfit.interact.educ.manage) ~ fitted(lmfit.interact.educ.manage), size = ~E, color = ~ M, data=salaryDummy) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal educ, management interaction)")

Hmmmm…. It seems there is one manager with a college education
who gets paid much, much, much lower than the model would predict.

In order to see the pattern in the residuals better, I’m going to temporarily remove this observation.

omit <- which(rstandard(lmfit.interact.educ.manage) < -5)
omit
33 
33 
lmfit.interact.educ.manage.omit <- lm(S ~ X + E2 + E3 + E2*M + E3*M + M, data = salaryDummy[-omit,])
tidy(lmfit.interact.educ.manage.omit)
term estimate std.error statistic p.value
(Intercept) 9458.4 31.041 304.71 0
X 498.4 2.152 231.64 0
E2 1384.3 29.858 46.36 0
E3 1741.3 40.682 42.80 0
M 3988.8 39.073 102.09 0
E2:M 5049.3 51.667 97.73 0
E3:M 3051.8 57.674 52.91 0
gf_point(rstandard(lmfit.interact.educ.manage.omit) ~ fitted(lmfit.interact.educ.manage.omit)) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal educ, management interaction)")

gf_point(rstandard(lmfit.interact.educ.manage.omit) ~ fitted(lmfit.interact.educ.manage.omit), size = ~E, color = ~ M, data=salaryDummy[-omit, ]) %>%
  gf_labs(y = "standardized residuals", x = "yhat (nominal educ, management interaction)")

What do you think about the residual plot now?
Does it support the model assumptions?

Does the outlier have much influence on predictions?

h.outlier <- hatvalues(lmfit.interact.educ.manage)[omit]
C.outlier <- cooks.distance(lmfit.interact.educ.manage)[omit]
C <- cooks.distance(lmfit.interact.educ.manage)
gf_point(cooks.distance(lmfit.interact.educ.manage) ~ rstandard(lmfit.interact.educ.manage)) %>%
  gf_labs(x = "standardized residuals", y = "Cook's distance (nominal educ, manage interact)")

Regression results with outlier included.
\(S = \beta_0 + \beta_1 x + \beta_2 e_2 + \beta_3 e_3 + \beta_4 (e_2\times m) + \beta_5 (e_3 \times m) + \beta_6 m + \epsilon\)

tidy(lmfit.interact.educ.manage)
term estimate std.error statistic p.value
(Intercept) 9473 80.344 117.90 0
X 497 5.566 89.28 0
E2 1382 77.319 17.87 0
E3 1731 105.334 16.43 0
M 3981 101.175 39.35 0
E2:M 4903 131.359 37.32 0
E3:M 3066 149.330 20.53 0
glance(lmfit.interact.educ.manage)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9988 0.9986 173.8 5517 0 7 39

Regression results with outlier omitted.

tidy(lmfit.interact.educ.manage.omit)
term estimate std.error statistic p.value
(Intercept) 9458.4 31.041 304.71 0
X 498.4 2.152 231.64 0
E2 1384.3 29.858 46.36 0
E3 1741.3 40.682 42.80 0
M 3988.8 39.073 102.09 0
E2:M 5049.3 51.667 97.73 0
E3:M 3051.8 57.674 52.91 0
glance(lmfit.interact.educ.manage.omit)[-c(7:10)]
r.squared adj.r.squared sigma statistic p.value df df.residual
0.9998 0.9998 67.12 35428 0 7 38

The outlier has very little influence on the estimated regression coefficients,
…but omitting it gives a huge reduction in the estimated error variance.

Plus, Cook’s distance implies this observation may have influence on predictions.

1.1.5.4.1 Checking other model assumptions with residuals

Excluding the outlier, we examined the linearity and constant-variance assumptions above.

Here are more residual plots (with the outlier omitted)

gf_point(rstandard(lmfit.interact.educ.manage.omit) ~ X, data = salaryDummy[-omit, ]) %>%
  gf_labs(y = "standardized residuals", x = "x (years of experience)")


# Googled quite a bit to get the legends labeled nicely on this plot
gf_point(rstandard(lmfit.interact.educ.manage.omit) ~ (E + 3*M), size = ~ as.factor(M), color = ~ as.factor(E), data=salaryDummy[-omit, ]) %>%
  gf_labs(y = "standardized residuals", x = "Education and management categories") %>%
  gf_hline(yintercept = c(-2, 0, 2), linetype = 2) + 
  scale_size_discrete(range = c(1.5, 3), name="Manager", 
                      breaks=c("0", "1"),  labels=c("No", "Yes")) +
  scale_color_discrete(name="Education", 
                      breaks=c("1", "2", "3"),  labels=c("HS", "BS", "Grad"))
Warning: Using size for a discrete variable is not advised.

The last plot is interesting (and is an improvement on textbook Figure 5.5).

There seems to be more variability in salary for non-managers vs. managers.
This violates the assumption of constant variance for all levels of the predictors.

So, this model is quite good, but we might be overestimating salary variability for managers.