The best way to assess a model is to ask how well it predicts. Virtually all model statistics are based on the idea of (mostly in-sample) prediction. This problem set is designed to give you the intuition behind the three assumptions of the linear model that we’ve explored. The exercises are intended to a) give you some mileage in R and b) communicate the intuition behind the model assumptions.

We will share the responses on Padlet: https://padlet.com/siljesynnove/blr

We will continue to work with the cross-sectional data on MEPs from 2014. However, compared to last time, I’ve added some variables, so you will need to download the data anew. If you’ve set your working directory correctly, the following code will work.

#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2014.rda",
              destfile = "MEP2014.rda")

#Load in
load("MEP2014.rda")

#Rename the data for convenience
df <- MEP2014

#Load dplyr for the pipe (%>%) and mutate()
library(dplyr)

#Rename the three main variables for convenience
df <-
  df %>%
  mutate(
    #The number of local assistants employed by each MEP in 2014
    y = LocalAssistants,
    #The vote share (in percentage points) that the national party got last election
    x = VoteShare_LastElection,
    #The average labor cost (measured in 100 euros) in the MEPs home state 
    z = LaborCost)

Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.

mod1 <- lm(y ~ x,
           data = df)
MEPs’ local staff (a linear model)

=================================================
                          Dependent variable:
                      ---------------------------
                                   y
-------------------------------------------------
x                              -1.667***
                                (0.553)
Constant                        2.611***
                                (0.165)
-------------------------------------------------
Observations                      679
R2                               0.013
Adjusted R2                      0.012
Residual Std. Error         2.607 (df = 677)
F Statistic              9.080*** (df = 1; 677)
=================================================
Note:                 *p<0.1; **p<0.05; ***p<0.01

Exercise 1: Interpret the model

  1. Run model 1, set a scenario and make an intuitive interpretation.

  2. Rephrase your finding: On average, how many MEPs in your “treatment group” (the second scenario you set) would see a decrease of one employee?
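
Here is a minimal sketch of one way to set a scenario and predict from it. The choice of scenario (the 25th vs. 75th percentile of x) and the object names are my own; any two substantively meaningful values of x will do.

#Two scenarios: low vs. high past vote share (25th and 75th percentiles of x)
scenario <- data.frame(x = quantile(df$x, probs = c(0.25, 0.75), na.rm = TRUE))

#Predicted number of local assistants in each scenario
preds <- predict(mod1, newdata = scenario)
preds

#First difference: expected change in local staff between the two scenarios
diff(preds)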

Exercise 2: Assess the assumption of a normal distribution

  1. Calculate the predicted outcomes, residuals and standardized residuals of the model and store them in your data frame.

  2. Generate a vector with the quantiles (here, deciles) of the standardized residuals: you can use the function quantile(x, probs = seq(from = 0, to = 1, by = 0.1), na.rm = T).

  3. Generate a similar vector from a long vector that follows a standard normal distribution. You can generate the long vector by simulation: rnorm(n = 1000, mean = 0, sd = 1).

  4. You now have two vectors: one with the quantiles from your data, another with the quantiles from the standard normal distribution. Create a scatterplot with the theoretical quantiles (the standard normal distribution) on the x-axis and the quantiles of your standardized residuals on the y-axis.

  5. Draw a diagonal line with intercept = 0 and slope = 1. You can either use geom_abline(aes(intercept = 0, slope = 1)) in ggplot2 or abline(a = 0, b = 1) in base R.

  6. Compare what you find with the plot you get if you simply run qqnorm(x), where x is the standardized residuals. What can you do to make your plot look like this new plot?
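
A sketch of the full exercise, assuming there are no missing values on x and y (otherwise, refit the model with na.action = na.exclude so the vectors match the rows of df); the object names (pred, std_resid, q_data, q_norm) are my own.

#1. Store predictions, residuals and standardized residuals
df$pred <- predict(mod1)
df$resid <- residuals(mod1)
df$std_resid <- rstandard(mod1)

#2. Quantiles of the standardized residuals
q_data <- quantile(df$std_resid, probs = seq(from = 0, to = 1, by = 0.1), na.rm = T)

#3. Quantiles of a simulated standard normal distribution
q_norm <- quantile(rnorm(n = 1000, mean = 0, sd = 1),
                   probs = seq(from = 0, to = 1, by = 0.1))

#4.-5. Scatterplot with a diagonal line (base R)
plot(x = q_norm, y = q_data,
     xlab = "Theoretical quantiles", ylab = "Standardized residuals")
abline(a = 0, b = 1)

#6. The built-in equivalent
qqnorm(df$std_resid)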

Exercise 3: Assess the assumption of homoscedasticity

The calculation of standard errors relies on the assumption that the model’s errors/residuals are distributed evenly over the entire range of the conditional outcome. That is, your predictions are surrounded by the same amount of uncertainty over their entire (empirical) range.

  1. Plot your standardized residuals (y-axis) against your predicted outcomes (x-axis).

  2. Take the square root of the absolute value of your standardized residuals, then make the same plot again. Compare the two plots and tell me what you see.

It is common to take the square root of the (absolute) residuals because it compresses large values and makes it easier to detect small changes among low values.
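
Here is a base R sketch of the two plots, assuming the predictions (pred) and standardized residuals (std_resid) stored in df during Exercise 2.

#Standardized residuals against predicted outcomes
plot(x = df$pred, y = df$std_resid)

#The same plot with the square root of the absolute standardized residuals
plot(x = df$pred, y = sqrt(abs(df$std_resid)))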

  1. Use the cut() function to create a new categorical variable that slices your predicted outcomes into quartiles: cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE), include.lowest = TRUE).

  2. Group the standardized residuals by your new cut-variable (group_by()) and calculate the standard deviation of the residuals within each group (reframe()). What do you see?

  3. In your opinion, is the distribution of residuals heteroscedastic?
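
A possible sketch with dplyr (reframe() requires dplyr 1.1.0 or newer), reusing pred and std_resid from Exercise 2; the variable name pred_q4 is my own.

#Slice the predicted outcomes into quartiles
df <-
  df %>%
  mutate(pred_q4 = cut(pred,
                       breaks = quantile(pred, probs = seq(0, 1, 0.25), na.rm = TRUE),
                       include.lowest = TRUE))

#Standard deviation of the standardized residuals within each quartile
df %>%
  group_by(pred_q4) %>%
  reframe(sd_resid = sd(std_resid, na.rm = TRUE))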

Exercise 4: Assess the omitted variable assumption

  1. Create a scatterplot where you plot the residuals against x in ggplot2.

  2. Draw a local regression line by replacing geom_point() with geom_smooth(). What do you see?

  3. Ask ChatGPT what a “local regression” is. If the answer is too complicated, ask ChatGPT to ELI5 it (“explain it like I’m 5 years old”). Was it useful?

  4. Consider the variable LaborCost (z). Correlate it with x and with y. What do you find? What do you think would happen if you add it to your main model as a control?

  5. Fit a second model (mod2) where you add z as a control. Use stargazer() to display the two models side by side. What happened?
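
A sketch of the full exercise, assuming the residuals are stored as df$resid (Exercise 2) and that the ggplot2 and stargazer packages are installed; type = "text" prints the comparison to the console.

library(ggplot2)

#1. Scatterplot of the residuals against x
ggplot(df, aes(x = x, y = resid)) +
  geom_point()

#2. Local regression line: replace geom_point() with geom_smooth()
ggplot(df, aes(x = x, y = resid)) +
  geom_smooth()

#4. Correlations with the potential confounder
cor(df$z, df$x, use = "complete.obs")
cor(df$z, df$y, use = "complete.obs")

#5. Add z as a control and compare the two models
mod2 <- lm(y ~ x + z, data = df)

library(stargazer)
stargazer(mod1, mod2, type = "text")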