We can think of hierarchical/multilevel models as a way to mix and match information from different sources (at different levels).
We will share the responses on Padlet: https://padlet.com/siljesynnove/blr
We will work on cross-sectional time-series data on MEPs in the 2012-2017 period.
#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics
#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
destfile = "MEP.rda")
#Load in
load("MEP.rda")
#Rename the data for convenience
df <- MEP
#Rename the three main variables for convenience
df <-
  df %>%
  mutate(
    #The share of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The vote share (in percentage points) that the national party got in the last election
    x = VoteShare_LastElection,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList)
Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
Describe the bivariate relationship between MEPs’ local staff (ShareOfLocalAssistants, y) and their party’s vote share (VoteShare_LastElection, x) by fitting a linear regression. What do you find?
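A minimal sketch of how this fit might look, assuming the renamed variables from the setup chunk; the object name mod.pooled is just a placeholder, but it is reused in later sketches.

#Pooled bivariate regression of local staff on party vote share
#na.exclude keeps the residuals aligned with the rows in df (useful later)
mod.pooled <- lm(y ~ x, data = df, na.action = na.exclude)
summary(mod.pooled)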
Illustrate the relationship between the two in ggplot2. You can use the geometric function for local regression and specify the method: geom_smooth(aes(x, y), method = "loess"). Change the method to "lm" (linear model). What do you think?
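One possible way to draw the plot, assuming the df, x and y objects from the setup chunk:

#Scatterplot with a local (loess) regression line
ggplot(df) +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "loess")
#Swap method = "loess" for method = "lm" to compare with the linear fit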
Building on your previous ggplot code, add color = Nationality inside aes() to create a separate regression line for each MEP nationality. You can remove the uncertainty by adding se = FALSE if the shaded confidence bands make it hard to read the plot. Compare the new plot with the plot from 1b. What happened?
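One way to extend the previous code; note that the grouping variable goes inside aes() (unquoted) so that ggplot2 fits one line per nationality.

#Separate linear fits per nationality, without the confidence bands
ggplot(df) +
  geom_smooth(aes(x, y, color = Nationality), method = "lm", se = FALSE)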
Ok. Calculate the grouped averages of both x and y using group_by() and reframe(). You now have a new data frame with 28 observations. Plot x and y against each other in a scatterplot. Using geom_smooth(aes(x, y), method = "lm") you can also fit a regression line to the data. What do you see?
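A sketch of the aggregation step, assuming 28 member states in the data; df.agg is a placeholder name.

#Country-level averages of the outcome and the predictor
df.agg <-
  df %>%
  group_by(Nationality) %>%
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

#Scatterplot of the 28 group means with a linear fit
ggplot(df.agg) +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm")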
Fit a fixed-effects model where you control for Nationality.
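A sketch of the non-pooled specification; mod.fe is a placeholder name.

#Fixed effects: one dummy per nationality, the first level goes into the intercept
mod.fe <- lm(y ~ x + Nationality, data = df)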
Use stargazer() to compare your pooled model with the fixed-effects model. If you don’t want to see the regression coefficients relating to Nationality, you may use the argument omit = "Nationality". What happened to the effect of vote share? Why do you think this happened?
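Assuming the placeholder object names mod.pooled and mod.fe from the sketches above, the comparison could look like this:

#Text output is easiest to read in the console; omit hides the nationality dummies
stargazer(mod.pooled, mod.fe,
          omit = "Nationality",
          type = "text")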
Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When you introduce fixed effects for a group-identity variable (e.g. nationality), R relegates the first group to the reference category (absorbed into the intercept). If you then introduce other variables that are collinear with your groups (e.g. the electoral system), R has to drop one of the collinear coefficients as well. In other words, you have limited scope for adding group-level predictors once you have group fixed effects in your model.
Fit a pooled model with predictors for vote share (x) and electoral system (z).
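A sketch of this pooled specification; the object name is a placeholder.

#Pooled model with an individual-level and a country-level predictor
mod.pooled.z <- lm(y ~ x + z, data = df)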
Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?
Tweak your fixed-effects model such that you remove z. What happens to Sweden?
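A sketch of the two fixed-effects variants (object names are placeholders). Because z only varies between countries, it is collinear with the nationality dummies, so R returns NA for one of the collinear coefficients when both are included.

#Fixed effects on top of the pooled specification: one coefficient comes back as NA
mod.fe.z <- lm(y ~ x + z + Nationality, data = df)
#Dropping the country-level predictor frees up that dummy again
mod.fe.noz <- lm(y ~ x + Nationality, data = df)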
We can think of hierarchical models/models with varying intercepts as a way to label the errors.
Go back to your pooled model, calculate the residuals and store the result as a variable in your data.
Calculate the group means of the residuals per nationality (i.e. the unexplained staff size per country) using group_by() and reframe(). Store the data frame (n = 28) in an R object.
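A sketch covering the two previous steps, assuming mod.pooled was fitted with na.action = na.exclude so that residuals() lines up with the rows in df; df.res and res_mean are placeholder names.

#Store the pooled residuals as a new variable
df$res <- residuals(mod.pooled)

#Average unexplained staff share per country (one row per nationality)
df.res <-
  df %>%
  group_by(Nationality) %>%
  reframe(res_mean = mean(res, na.rm = TRUE))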
Fit a fixed-effects model without an intercept: mod.stage1 <- lm(y ~ -1 + x + Nationality, df).
Extract the coefficients for nationality. Store them as a new variable in the same data frame (N = 28).
Compare the grouped residuals and the fixed-effects coefficients. You can plot them against each other in a scatter plot, for example. What do you see? What is the link between residuals and fixed effects?
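One way to do the extraction and the comparison, assuming the placeholder objects mod.stage1 and df.res from above; both are ordered alphabetically by country, so the rows should line up.

#Keep only the nationality coefficients (drop the slope on x)
fe.coefs <- coef(mod.stage1)[grepl("Nationality", names(coef(mod.stage1)))]
df.res$fe <- fe.coefs

#Grouped residuals vs. fixed-effects coefficients
ggplot(df.res) +
  geom_point(aes(res_mean, fe))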
Gelman and Hill (2007) explain the intuition of the 2-level model in chapter 11: it is a way for us to group the residuals according to their group identities and then run a second regression on those grouped residuals.
Level-1 regression: Ok, continue with the mod.stage1 results. Compare with your previous fixed-effects model. What is the difference in how you interpret them?
Level-2 data frame: You can build on your data frame (n=28) from exercise 3 or make it anew, but this time, add a variable on the electoral system.
If you make the data frame anew:
Use the group_by() and reframe() functions to make a new data frame from df that reports the mean value of z, the electoral system, by country (N = 28).
Extract the fixed-effects coefficients for Nationality from your level-1 regression and add them as a variable to this new data frame. You can do this using mutate() or cbind().
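A sketch of the level-2 data frame, assuming the placeholder names used so far. The names df2, z_m and coefs are chosen to match the example call in the next step, and both the grouped data and the coefficient vector are ordered alphabetically by country.

#Country-level data: mean of z plus the level-1 nationality coefficients
df2 <-
  df %>%
  group_by(Nationality) %>%
  reframe(z_m = mean(z, na.rm = TRUE)) %>%
  mutate(coefs = coef(mod.stage1)[grepl("Nationality", names(coef(mod.stage1)))])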
Level-2 regression: Regress your fixed-effects coefficients on the aggregated values of z by drawing on the values in your new, aggregated data frame (e.g. lm(coefs ~ z_m, df2)).
Let R do a two-level regression for you. Fetch the lme4 package to run a two-level hierarchical model: lmer(y ~ x + z + (1|Nationality), df).
Use stargazer to compare the results from your four regressions: the pooled model (from exercise 1) and the level-1, level-2 and random-effects models (from this exercise). What are your take-aways?
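A sketch of the final comparison, assuming the placeholder object names from the earlier sketches; stargazer handles lmer fits as well.

#Level-2 and hierarchical models, stored so they can go into the table
mod.level2 <- lm(coefs ~ z_m, df2)
mod.lmer <- lmer(y ~ x + z + (1 | Nationality), df)

#Pooled, level-1 (fixed effects), level-2 and random-effects models side by side
stargazer(mod.pooled, mod.stage1, mod.level2, mod.lmer,
          omit = "Nationality",
          type = "text")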