The best way to assess a model is by how well it predicts: virtually all model statistics are based on the idea of (mostly in-sample) prediction. This problem set is designed to give you the intuition behind the three assumptions of the linear model that we've explored. The exercises are intended to a) give you some mileage in R and b) communicate the intuition behind the model assumptions.
We will share the responses on Padlet: https://padlet.com/siljesynnove/blr
We will continue to work on cross-sectional data on MEPs in 2014. However, compared to last time, I've added some variables, so you need to download the data anew. If you've set your working directory correctly, the following code will work.
#Download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2014.rda",
              destfile = "MEP2014.rda")

#Load in
load("MEP2014.rda")

#The recoding below uses the pipe and mutate() from dplyr
library(dplyr)

#Rename the data for convenience
df <- MEP2014

#Rename the three main variables for convenience
df <-
  df %>%
  mutate(
    #The number of local assistants employed by each MEP in 2014
    y = LocalAssistants,
    #The vote share (in percentage points) that the national party got last election
    x = VoteShare_LastElection,
    #The average labor cost (measured in 100 euros) in the MEP's home state
    z = LaborCost)
Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
mod1 <- lm(y ~ x,
           data = df)
|                     | Dependent variable:      |
|---------------------|--------------------------|
|                     | y                        |
| x                   | -1.667***                |
|                     | (0.553)                  |
| Constant            | 2.611***                 |
|                     | (0.165)                  |
| Observations        | 679                      |
| R2                  | 0.013                    |
| Adjusted R2         | 0.012                    |
| Residual Std. Error | 2.607 (df = 677)         |
| F Statistic         | 9.080*** (df = 1; 677)   |
| Note:               | *p<0.1; **p<0.05; ***p<0.01 |
Run model 1, set a scenario and make an intuitive interpretation.
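A sketch of how this could look. The two vote-share values below (10% and 40%) are my own choice of scenarios, not given in the exercise, and I assume the vote share is stored as a proportion; rescale if it runs from 0 to 100.

```r
# Fit model 1 (assumes df has been created as above)
mod1 <- lm(y ~ x, data = df)

# Two hypothetical scenarios: a low (10%) and a high (40%) vote share.
# My choice of values; I assume x is a proportion.
scenarios <- data.frame(x = c(0.10, 0.40))

# Predicted number of local assistants in each scenario
preds <- predict(mod1, newdata = scenarios)
preds

# Change in the expected number of local assistants between scenarios
diff(preds)
```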
Rephrase your finding: On average, how many MEPs would see a decrease of one employee in your “treatment group” (the second scenario you set)?
Calculate the predicted outcomes, residuals and standardized residuals of the model and store them in your data frame.
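One way to do this (a sketch; the na.action = na.exclude refit is my addition so that the predicted values and residuals keep the same length as the data frame if some rows have missing values):

```r
library(dplyr)

# Refit with na.exclude: predict(), residuals() and rstandard() are then
# NA-padded to the length of df if some rows have missing values
mod1 <- lm(y ~ x, data = df, na.action = na.exclude)

df <- df %>%
  mutate(preds = predict(mod1),         # predicted outcomes (y-hat)
         resid = residuals(mod1),       # raw residuals
         resid_std = rstandard(mod1))   # standardized residuals
```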
Generate a vector with the quantiles (percentiles) of the standardized residuals. You can use the function quantile(x, probs = seq(from = 0, to = 1, by = 0.1), na.rm = TRUE).

Generate a similar vector from a long vector that follows a standard normal distribution. You can simulate such a vector with rnorm(n = 1000, mean = 0, sd = 1).
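Put together, the two quantile vectors could be built like this (the set.seed() call and the object names q_resid/q_norm are my additions):

```r
# Deciles of the standardized residuals (resid_std created earlier)
q_resid <- quantile(df$resid_std,
                    probs = seq(from = 0, to = 1, by = 0.1),
                    na.rm = TRUE)

# Deciles of a long vector drawn from a standard normal distribution
set.seed(42)                     # my addition: makes the draw reproducible
q_norm <- quantile(rnorm(n = 1000, mean = 0, sd = 1),
                   probs = seq(from = 0, to = 1, by = 0.1))
```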
You now have two vectors: One with the quantiles from your data, another with quantiles from your standard normal distribution. Create a scatterplot with your theoretical quantiles (the standard normal distribution) on your x-axis and the quantiles from your standardized residuals on the y-axis.
Draw a diagonal line with intercept = 0 and slope = 1. You can either use geom_abline(aes(intercept = 0, slope = 1)) in ggplot2 or abline(a = 0, b = 1) in base R.
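In ggplot2, the scatterplot and the diagonal might look like this (assuming quantile vectors named q_resid and q_norm from the previous steps; those names are my choice):

```r
library(ggplot2)

# Combine the two quantile vectors into a data frame for plotting
qq_df <- data.frame(theoretical = q_norm,
                    sample = q_resid)

ggplot(qq_df, aes(x = theoretical, y = sample)) +
  geom_point() +
  geom_abline(aes(intercept = 0, slope = 1)) +
  labs(x = "Theoretical quantiles (standard normal)",
       y = "Quantiles of standardized residuals")
```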
Compare what you find with what you get if you simply run qqnorm(x), where x is the standardized residuals. What can you do to make your plot look like the new plot?
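For comparison, the base-R shortcut (the qqline() call is my addition; it draws a reference line through the quartiles rather than the 0/1 diagonal):

```r
# Normal QQ plot of the standardized residuals in one call
qqnorm(df$resid_std)
qqline(df$resid_std)   # my addition: reference line through the quartiles
```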
The calculation of standard errors relies on the assumption that the model's errors/residuals are distributed evenly over the entire range of your conditional outcome. That is, your predictions are surrounded by the same uncertainty over their entire (empirical) range.
Plot your standardized residuals (y-axis) against your predicted outcomes (x-axis).
Take the square root of the absolute value of your standardized residuals, then make the same plot again. Compare the two plots and tell me what you see.

It is common to take the square root of the (absolute) residuals because it makes it easier to detect small changes in low values.
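The two plots could be drawn like this (assuming the preds and resid_std columns created earlier; the absolute value is needed before taking the square root, since standardized residuals can be negative):

```r
library(ggplot2)

# Standardized residuals against predicted outcomes
ggplot(df, aes(x = preds, y = resid_std)) +
  geom_point()

# Square root of the absolute standardized residuals ("scale-location" plot)
ggplot(df, aes(x = preds, y = sqrt(abs(resid_std)))) +
  geom_point()
```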
Use the cut() function to create a new categorical variable that slices up your predicted outcomes into quartiles: cut(x, quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE)).

Group the standardized residuals by your new cut-variable (group_by()) and calculate the standard deviation of the residuals within each group (reframe()). What do you see?
In your opinion, is the distribution of residuals heteroscedastic?
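A sketch with dplyr (the include.lowest = TRUE argument is my addition; without it, the smallest predicted value falls outside the first interval and becomes NA):

```r
library(dplyr)

df %>%
  mutate(pred_quartile = cut(preds,
                             quantile(preds, probs = seq(0, 1, 0.25),
                                      na.rm = TRUE),
                             include.lowest = TRUE)) %>%
  group_by(pred_quartile) %>%
  reframe(sd_resid = sd(resid_std, na.rm = TRUE))
```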
Create a scatterplot where you plot the residuals against x in ggplot2.

Draw a local regression line by replacing the function geom_point() with geom_smooth(). What do you see?
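Both steps in one sketch; with a few hundred observations, geom_smooth() defaults to a loess (local regression) fit:

```r
library(ggplot2)

# Scatterplot of residuals against x
ggplot(df, aes(x = x, y = resid)) +
  geom_point()

# Replace the points with a local regression line
ggplot(df, aes(x = x, y = resid)) +
  geom_smooth()
```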
Ask ChatGPT what a "local regression" is. If the answer is too complicated, ask ChatGPT to ELI5 it for you ("explain it like I'm 5 years old"). Was it useful?
Consider the variable LaborCost (z). Correlate it with x and with y. What do you find? What do you think would happen if you added it to your main model as a control?
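The correlations can be checked directly (the use = "pairwise.complete.obs" argument is my addition, to handle missing values):

```r
# Correlation between labor cost and vote share
cor(df$z, df$x, use = "pairwise.complete.obs")

# Correlation between labor cost and the number of local assistants
cor(df$z, df$y, use = "pairwise.complete.obs")
```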
Fit a second model (mod2). Use stargazer() to display the two models side by side. What happened?
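A sketch, assuming mod1 is still in memory and the stargazer package is installed; type = "text" prints the table to the console:

```r
library(stargazer)

# Add z (labor cost) as a control
mod2 <- lm(y ~ x + z, data = df)

# Display the two models side by side
stargazer(mod1, mod2, type = "text")
```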