We can think of hierarchical/multilevel models as a way to partition and leverage variation at different levels.
We begin to will work on cross-sectional time-series data on MEPs in the 2012-2017-period.
#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics
#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
destfile = "MEP.rda")
#Load in
load("MEP.rda")
#Rename the data for convenience
df <- MEP
#Rename the three main variables for convenience
df <-
df %>%
filter(Period == 5) %>%
mutate(
#The number of local assistants employed by each MEP in 2014
y = ShareOfLocalAssistants,
#The vote share (in percentage points) that the national party got last election
x = VoteShare_LastElection,
#Candidate-centered (or not) electoral system for EU elections
z = OpenList)
Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
Let’s start by subsetting the data to period 5 (in 2014) for
excercise 1 and 2:
df <- MEP %>% filter(Period == 5)
.
Describe the bivariate relationship between MEPs’ local staff
(ShareOfLocalStaff
, y) and their party’s vote share
(VoteShare_LastElection
, x) by fitting a linear regression.
What do you find?
Fit a fixed-effects model where you control for
Nationality
.
Illustrate the bivariate relationship between voteshare and the
size of MEP’s local staff in ggplot2
. You can use the
geometric function for local regression and specify the method
geom_smooth(aes(x, y), method = "loess")
. Change the method
to “lm” (linear model). What do you think?
Building on your previous ggplot code, use the
color = "Nationality"
or
facet_wrap(~Nationality)
arguments to create a separate
regression for each MEP’s nationality. You can remove the uncertainty by
adding se = FALSE
if the shaded lines makes it hard to read
the plot. Compare the new plot with the plot from 1c. What
happened?
Ok. Calculate the grouped averages of both x and y using
group_by()
and reframe()
. You now have a new
data frame with 28 observations. Plot x and y against each other in a
scatterplot. Using geom_smooth(aes(x,y), method = "lm")
in
ggplot2 you can also fit a regression line to the data. What do you
see?
Use stargazer()
to compare your pooled model with
the fixed-effects model. If you don’t want to see the regression
coefficients relating to Nationality, you may use the argument
omit = "Nationality"
. What happened to the effect of vote
share? Why do you think this happened?
Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When introducing fixed effects for your group identity variables (e.g. nationality), the R will relegate the first group as a reference group (in the intercept). If you then introduce other variables that are colinear with your groups (e.g. electoral system), they too will have to omit one category. In other words, you have limited possibility to add group-level predictors once you have grouped fixed-effects in your model.
Fit a pooled model with predictors for vote share (x) and electoral system (z).
Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?
Tweak your fixed-effects model such that you remove z. What happens to Sweden?
Ok, now, let’s consider the entire data set:
df <- MEP
.
Reestimate the first model on the entire data set, then calculate the residuals. I did this on the slides last week.
Describe the residuals: What is their mean and spread. Plot them as a histogram.
Calculate the group mean of the residuals for each MEP. The ID
variable for each person is ID
. Store the result in a
separate data frame. Describe the spread and plot a histogram. What does
this say about the between-group variation in the data, once you’ve
accounted for vote share (and nationality)? Is there more within- or
between-group variation here?
Re-do all of the above, but on the model where you control for
nationality (mod.fix
). Store the results in separate
variables for comparison. What happened? Why?
In the two next exercises, we work with the entire data set:
df <- MEP
.
We can think of hierarchical models/models with varying intercepts as a way to label the errors.
Go back to your pooled model and re-estimate it on the entire data and calculate the residuals. Store them as a variable in your data. Remember, the residuals are just the difference between what you predicted and observed (y - predicted y). T
Calculate the group means of the residuals per nationality
(i.e. the unexplained staff size per country) using
group_by()
and reframe()
. Store the data frame
(n = 28) in an R object.
Fit a fixed-effects model without intercept:
mod.stage1 <- lm(y ~ -1 + x + Nationality, df)
.
Extract the coefficients for nationality. Store them as a new
variable in the same data frame (N = 28). You can see them stored as a
data frame in the model object mod.fix$coefficients
. Remove
the coefficients not related to nationality.
Compare the grouped residuals and the fixed-effects coefficients. You can plot them against each other in a scatter plot, for example. What do you see? What is the link between residuals and fixed effects?
Gelman and Hill (2007) explain the intuition of the 2-level model in chapter 11: It is a way for us to group the residuals according to their identities, then run a second regression on the residuals.
Level-1 regression: Ok, continue with the mod.stage1
results. Compare with your previous fixed-effects model. What is the
difference in how you interpret them?
Level-2 data frame: You can build on your data frame (n=28) from
exercise 3 or make it anew, but this time, add a variable on the
electoral system (z
; OpenList
).
If you make the dataframe anew:
Use the group_by()
and reframe()
arguments to make a new data frame from df
that reports the
mean value of z, the electoral system, by country (N = 28).
Extract the fixed-effects coefficients for Nationality from your
level-1 regression and add them as a variable to this new data frame.
You can do this using mutate()
or
cbind
.
Level-2 regression: Regress your fixed-effects coefficients on
the aggregated values of z by drawing on the values in your new,
aggregated data frame (e.g. lm(coefs ~ z_m, df2)
).
Let R do a two-level regression for you. Fetch the
lme4
package to run a two-level hierarchical model:
mod.ran <- lmer(y ~ x + z + (1|Nationality), df)
.
Use stargazer to compare the results from your four regressions: the pooled model, level 1-, level 2- and random-effects model (from this exercise). What are your take-aways?