We can think of hierarchical/multilevel models as a way to partition and leverage variation at different levels.
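As a quick illustration of what "variation at different levels" means, here is a self-contained sketch on simulated data (all names and numbers are invented) that splits the spread in an outcome into a between-group and a within-group part:

```r
#Simulated sketch: decompose variation into between- and within-group parts
set.seed(1)
group <- rep(1:5, each = 20)                     #5 groups, 20 observations each
group_shift <- rnorm(5, mean = 0, sd = 2)[group] #between-group variation
y <- group_shift + rnorm(100, sd = 1)            #within-group variation

#Between-group spread: standard deviation of the group means
between <- sd(tapply(y, group, mean))
#Within-group spread: standard deviation of deviations from each group's mean
within <- sd(y - ave(y, group))
```

Hierarchical models exploit exactly this decomposition: the group means carry the between-group information, the deviations from them the within-group information.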
We will be working with cross-sectional time-series data on MEPs from the 2012-2017 period.
#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics
#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
destfile = "MEP.rda")
#Load in
load("MEP.rda")
#Rename the data for convenience
df <- MEP
#Rename the three main variables for convenience
df <-
  df %>%
  filter(Period == 5) %>%
  mutate(
    #The share of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The vote share (in percentage points) that the national party got in the last election
    x = VoteShare_LastElection,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList)
Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
Let’s start by subsetting the data to period 5 (in 2014) for
exercises 1 and 2:
df <- MEP %>% filter(Period == 5).
Explore the relationship between MEPs’ local staff
(ShareOfLocalStaff, y) and their party’s vote share
(VoteShare_LastElection, x) by fitting a linear regression.
What do you find?

mod0.pool <- lm(y ~ x, df)
# summary(mod0.pool)
Now fit the same model with fixed effects for Nationality.

mod.fix <- lm(y ~ x + Nationality, df)
Plot the relationship in ggplot2. You can use the geometric
function for local regression and specify the method:
geom_smooth(aes(x, y), method = "loess"). Change the method
to “lm” (linear model). What do you think?

df %>%
  ggplot +
  geom_smooth(
    aes(y = y,
        x = x),
    method = "lm", se = T) +
  ggtitle("Pooled regression: Within- and between-country variation")
Now add the color = Nationality or
facet_wrap(~Nationality) arguments to create a separate
regression for each MEP’s nationality. You can remove the uncertainty by
adding se = FALSE if the shaded areas make it hard to read
the plot. Compare the new plot with the plot from 1c. What
happened?

df %>%
ggplot +
geom_smooth(
aes(y = y,
x = x,
color = Nationality),
method = "lm", se = F) +
ggtitle("28 non-pooled regressions: Within-country variation")
Aggregate the data by nationality using
group_by() and reframe(). You now have a new
data frame with 28 observations. Plot x and y against each other in a
scatterplot. Using geom_smooth(aes(x, y), method = "lm") in
ggplot2 you can also fit a regression line to the data. What do you
see?

df %>%
group_by(Nationality) %>%
reframe(y_bar = mean(y, na.rm = T),
x_bar = mean(x, na.rm = T),
n = n()) %>%
ggplot +
geom_point(
aes(
y = y_bar,
x = x_bar,
size = n/10
), alpha = 0.3
) +
geom_smooth(
aes(
y = y_bar,
x = x_bar
),
method = "lm"
) +
ggtitle("Between-group variation: Correlation between vote share and local staff")
Use stargazer() to compare your pooled model with the
fixed-effects model. If you don’t want to see the regression
coefficients relating to Nationality, you may use the argument
omit = "Nationality". What happened to the effect of vote
share? Why do you think this happened?

stargazer(mod0.pool, mod.fix,
omit = "Natio",
title = "Pooled vs. non-pooled regression: The effect of the predictor changes direction",
type = "text")
Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When you introduce fixed effects for a group identity variable (e.g. nationality), R treats the first group as the reference category (absorbed into the intercept). If you then introduce other variables that are collinear with your groups (e.g. electoral system, which is constant within countries), R must drop a category for them too. In other words, you have limited room to add group-level predictors once you have group fixed effects in your model.
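A minimal sketch of this identification problem on simulated data (the variable names here are invented): a predictor that is constant within groups is perfectly collinear with the group dummies, so lm() returns NA for it.

```r
#Simulated sketch: a group-level predictor is collinear with the group dummies
set.seed(2)
nat <- factor(rep(c("A", "B", "C"), each = 10)) #group identity, like Nationality
z   <- as.numeric(nat == "A")                   #group-level predictor, constant within groups
x   <- rnorm(30)
y   <- 1 + 0.5 * x + rnorm(30)

m <- lm(y ~ x + nat + z)
coef(m)["z"] #NA: z is a linear combination of the intercept and the nat dummies
```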
mod.pool <- lm(y ~ x + z, df)
# mod.pool %>% summary
mod.fix <- lm(y ~ x + z + Nationality, df)
# mod.fix %>% summary
mod2.fix <- lm(y ~ x + Nationality, df)
stargazer(mod.fix, mod2.fix, type = "text")
#Sweden disappears/reappears because the model is not identified
#Rename the three main variables for convenience
df <-
MEP %>%
mutate(
#The share of local assistants employed by each MEP
y = ShareOfLocalAssistants,
#The vote share (in percentage points) that the national party got last election
x = VoteShare_LastElection,
#Candidate-centered (or not) electoral system for EU elections
z = OpenList)
OK, now let’s consider the entire data set:
df <- MEP.
mod1.pool <- lm(y ~ x, df)
mod1.fix <- lm(y ~ x + Nationality, df)
df <-
df %>%
mutate(
predicted = predict(mod1.pool, df),
predicted2 = predict(mod1.fix, df),
residuals = ShareOfLocalAssistants - predicted,
residuals2 = ShareOfLocalAssistants - predicted2)
sd(df$residuals, na.rm = T) #68% of MEPs have a local-staff share within +/- 2.87 of the prediction
sd(df$residuals2, na.rm = T) #the residual spread is smaller once nationality is accounted for
df %>%
ggplot +
geom_histogram(aes(residuals))
Average the residuals by MEP ID. Store the result in a
separate data frame. Describe the spread and plot a histogram. What does
this say about the between-group variation in the data, once you’ve
accounted for vote share (and nationality)? Is there more within- or
between-group variation here?

df_mep <-
df %>%
group_by(ID) %>%
reframe(m = mean(residuals, na.rm = T),
m2 = mean(residuals2, na.rm = T))
sd(df_mep$m, na.rm = T) #between-group variation is 2.78
sd(df_mep$m2, na.rm = T) #between-group variation is smaller
df_mep %>%
ggplot +
geom_histogram(aes(m))
Do the same with the fixed-effects model (mod.fix). Store the results in separate
variables for comparison. What happened? Why?

#Rename the three main variables for convenience
df <-
MEP %>%
mutate(
#The share of local assistants employed by each MEP
y = ShareOfLocalAssistants,
#The vote share (in percentage points) that the national party got last election
x = VoteShare_LastElection,
#Candidate-centered (or not) electoral system for EU elections
z = OpenList)
In the two next exercises, we work with the entire data set:
df <- MEP.
We can think of hierarchical models (models with varying intercepts) as a way of labeling the errors by group.
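The point can be made concrete on simulated data (a sketch; names are invented): the varying intercepts estimated by lmer() line up closely with the group means of the residuals from a pooled regression.

```r
library(lme4)

#Simulated sketch: varying intercepts "label" the grouped errors
set.seed(3)
g <- factor(rep(1:10, each = 30))
a <- rnorm(10, sd = 2)[g]  #group-specific intercept shifts
x <- rnorm(300)
y <- a + x + rnorm(300)

#Group means of the residuals from a pooled regression
pool <- lm(y ~ x)
res_by_group <- tapply(resid(pool), g, mean)

#Varying intercepts from a two-level model
ran <- lmer(y ~ x + (1 | g))
ranef_g <- ranef(ran)$g[, 1]

cor(res_by_group, ranef_g) #close to 1
```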
mod.pool <- lm(y ~ x,
df)
df <-
df %>%
mutate(
predicted_pool = predict(mod.pool, df),
residuals_pool = y - predicted_pool)
Aggregate the residuals by nationality using group_by() and
reframe(). Store the data frame (n = 28) in an R
object.

df_nat <-
df %>%
group_by(Nationality) %>%
reframe(residuals_grouped = mean(residuals_pool, na.rm = T),
n = n())
Run the first stage of the two-stage model by regressing y on x and the nationality dummies without a common intercept:
mod.stage1 <- lm(y ~ -1 + x + Nationality, df).

mod.fix <- lm(y ~ -1 + x + Nationality, df)
Extract the coefficients with mod.fix$coefficients. Remove
the coefficients not related to nationality.

df_nat <-
  df_nat %>%
  #Remove the first coefficient (the slope on x), keeping the nationality effects
  mutate(fix = mod.fix$coefficients[-1])
#These are two ways of grouping the residuals. However, they are not perfectly correlated, since the fixed-effects model takes out some of the within-group variation explained by x.
df_nat %>%
ggplot+
geom_point(
aes(
residuals_grouped,
fix
)
)
Gelman and Hill (2007) explain the intuition of the 2-level model in chapter 11: It is a way for us to group the residuals according to their identities, then run a second regression on the residuals.
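On simulated data, the two stages can be sketched like this (all names and true values are invented): stage 1 recovers one intercept per group, stage 2 regresses those intercepts on a group-level predictor.

```r
#Simulated sketch of the two-stage logic
set.seed(4)
n_g <- 20
z_g <- rbinom(n_g, 1, 0.5)                #group-level predictor, like OpenList
a_g <- 1 + 2 * z_g + rnorm(n_g, sd = 0.5) #group intercepts depend on z
g   <- rep(1:n_g, each = 25)
x   <- rnorm(n_g * 25)
y   <- a_g[g] + 0.3 * x + rnorm(n_g * 25)

#Stage 1: fixed effects without a common intercept give one coefficient per group
stage1 <- lm(y ~ -1 + x + factor(g))
a_hat  <- coef(stage1)[-1] #drop the slope on x, keep the group intercepts

#Stage 2: regress the group intercepts on the group-level predictor
stage2 <- lm(a_hat ~ z_g)
coef(stage2)["z_g"] #should be close to the true effect of 2
```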
Inspect the first-stage results. Compare with your previous fixed-effects model. What is the
difference in how you interpret them?

mod.level1 <- lm(y ~ -1 + x + Nationality, df)
# mod.level1 %>% summary
Now bring in the group-level predictor: the electoral system (z; OpenList). If you make the dataframe anew:
use the group_by() and reframe()
arguments to make a new data frame from df that reports the
mean value of z, the electoral system, by country (N = 28).
Extract the fixed-effects coefficients for Nationality from your
level-1 regression and add them as a variable to this new data frame.
You can do this using mutate() or
cbind.
df2 <-
df %>%
group_by(Nationality) %>%
reframe(z_m = mean(z)) %>%
mutate(coefs = mod.level1$coefficients[-1])
Regress the extracted coefficients on the electoral system (lm(coefs ~ z_m, df2)).

mod.level2 <- lm(coefs ~ z_m, df2)
# mod.level2 %>% summary
Finally, use the lme4
package to run a two-level hierarchical model:
mod.ran <- lmer(y ~ x + z + (1|Nationality), df).

mod.ran <- lmer(y ~ x + z + (1|Nationality), df)
stargazer(mod.pool, mod.level1, mod.level2, mod.ran,
omit = "National",
column.labels = c("Pooled", "Level 1", "Level 2", "Two-level"),
model.numbers = F,
model.names = F,
dep.var.caption = "Local staff size (y)",
type = "text")
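After fitting, lme4 separates the pooled coefficients from the group-level deviations. A self-contained sketch on simulated data (names invented):

```r
library(lme4)

#Simulated sketch: inspecting a fitted two-level model
set.seed(5)
g <- factor(rep(1:8, each = 40))
x <- rnorm(320)
y <- rnorm(8, sd = 1.5)[g] + 0.4 * x + rnorm(320)

m <- lmer(y ~ x + (1 | g))
fixef(m)   #the common (pooled) intercept and slope on x
ranef(m)$g #the estimated deviation of each group's intercept from the common one
VarCorr(m) #between-group vs. residual (within-group) standard deviations
```

The same extractors apply to mod.ran above, with Nationality as the grouping factor.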