We can think of hierarchical/multilevel models as a way to mix and match information from different sources (at different levels).

We will share the responses on Padlet: https://padlet.com/siljesynnove/blr

We will work with cross-sectional time-series data on MEPs from the 2012-2017 period.

#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics

#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
              destfile = "MEP.rda")

#Load in
load("MEP.rda")

#Rename the data for convenience
df <- MEP

#Rename the three main variables for convenience
df <-
  df %>%
  mutate(
    #The share of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The vote share (in percentage points) that the national party got last election
    x = VoteShare_LastElection,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList) 

Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
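
Here is one way to write that model up in R, using the shorthand variables defined above (the object name mod.pooled is only a suggestion):

#Pooled (bivariate) model: local staff as a function of past vote share
mod.pooled <- lm(y ~ x, data = df)
summary(mod.pooled)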

Exercise 1: Simpson’s paradox and leveraging variation

  1. Describe the bivariate relationship between MEPs’ local staff (ShareOfLocalAssistants, y) and their party’s vote share (VoteShare_LastElection, x) by fitting a linear regression. What do you find?

  2. Illustrate the relationship between the two in ggplot2. You can use the geometric function for a local regression: geom_smooth(aes(x, y), method = "loess"). Then change the method to "lm" (linear model). What do you think?

  3. Building on your previous ggplot code, map the colour aesthetic to nationality (aes(x, y, color = Nationality)) to create a separate regression line for each MEP nationality. You can remove the uncertainty estimate by adding se = FALSE if the shaded ribbons make it hard to read the plot. Compare the new plot with the plot from step 2. What happened?

  4. Ok. Calculate the grouped averages of both x and y using group_by() and reframe(). You now have a new data frame with 28 observations. Plot x and y against each other in a scatterplot. Using geom_smooth(aes(x, y), method = "lm") you can also fit a regression line to the data. What do you see?

  5. Fit a fixed-effects model where you control for Nationality.

  6. Use stargazer() to compare your pooled model with the fixed-effects model. If you don’t want to see the regression coefficients relating to Nationality, you may use the argument omit = "Nationality". What happened to the effect of vote share? Why do you think this happened? A sketch of the code for these steps follows the list.
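
A possible sketch of the code for these steps (object names such as df.group, mod.pooled and mod.fe are only suggestions):

#Step 2: bivariate relationship, first with a local regression, then a linear fit
ggplot(df) +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "loess")

ggplot(df) +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm")

#Step 3: one regression line per nationality (colour mapped inside aes())
ggplot(df) +
  geom_smooth(aes(x, y, color = Nationality), method = "lm", se = FALSE)

#Step 4: grouped averages (one row per nationality)
df.group <-
  df %>%
  group_by(Nationality) %>%
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

ggplot(df.group) +
  geom_point(aes(x, y)) +
  geom_smooth(aes(x, y), method = "lm")

#Steps 1, 5 and 6: pooled vs. fixed-effects model
mod.pooled <- lm(y ~ x, data = df)
mod.fe <- lm(y ~ x + Nationality, data = df)
stargazer(mod.pooled, mod.fe, omit = "Nationality", type = "text")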

Exercise 2: Non-identified models.

Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When you introduce fixed effects for your group identity variable (e.g. nationality), R will treat the first group as the reference category (absorbed into the intercept). If you then introduce other variables that are collinear with your groups (e.g. electoral system), R will have to drop one of the collinear categories as well. In other words, you have limited scope for adding group-level predictors once you have group fixed effects in your model.
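
One way to see the problem before fitting anything is to check whether the group-level predictor varies within the groups at all. A minimal sketch (assuming z is constant within each nationality):

#Cross-tabulate nationality and electoral system: if each country only has
#observations in one column, z is perfectly collinear with the nationality dummies
table(df$Nationality, df$z)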

  1. Fit a pooled model with predictors for vote share (x) and electoral system (z).

  2. Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?

  3. Tweak your fixed-effects model such that you remove z. What happens to Sweden? A sketch of these three models follows the list.
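
A sketch of the three models (the object names are only suggestions):

#Step 1: pooled model with a group-level predictor
mod.pooled2 <- lm(y ~ x + z, data = df)

#Step 2: add nationality fixed effects; R returns NA for a coefficient it
#cannot identify because z is collinear with the nationality dummies
mod.fe2 <- lm(y ~ x + z + Nationality, data = df)
summary(mod.fe2)

#Step 3: the same model without z
mod.fe3 <- lm(y ~ x + Nationality, data = df)
summary(mod.fe3)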

Exercise 3: Labelling the errors.

We can think of hierarchical models/models with varying intercepts as a way to label the errors.

  1. Go back to your pooled model, calculate the residuals and store the result as a variable in your data.

  2. Calculate the group means of the residuals per nationality (i.e. the unexplained staff size per country) using group_by() and reframe(). Store the data frame (n = 28) in an R object.

  3. Fit a fixed-effects model without intercept: mod.stage1 <- lm(y ~ -1 + x + Nationality, df).

  4. Extract the coefficients for nationality. Store them as a new variable in the same data frame (N = 28).

  5. Compare the grouped residuals and the fixed-effects coefficients. You can plot them against each other in a scatter plot, for example. What do you see? What is the link between residuals and fixed effects? A sketch of these steps follows the list.
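
A possible sketch of these steps, building on the bivariate pooled model (mod.pooled) from exercise 1 (the names res, df.res and coefs are only suggestions):

#Step 1: residuals from the pooled model
#(this assumes lm() did not drop any rows because of missing values)
df$res <- residuals(mod.pooled)

#Step 2: group means of the residuals per nationality
df.res <-
  df %>%
  group_by(Nationality) %>%
  reframe(res_mean = mean(res, na.rm = TRUE))

#Step 3: fixed-effects model without intercept
mod.stage1 <- lm(y ~ -1 + x + Nationality, df)

#Step 4: the nationality coefficients
#(this assumes the coefficients and df.res are sorted in the same alphabetical order)
df.res$coefs <- coef(mod.stage1)[grepl("Nationality", names(coef(mod.stage1)))]

#Step 5: compare the grouped residuals and the fixed-effects coefficients
ggplot(df.res) +
  geom_point(aes(res_mean, coefs))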

Exercise 4: Two-level regression “by hand”.

Gelman and Hill (2007) explain the intuition of the 2-level model in chapter 11: It is a way for us to group the residuals according to their group identity and then run a second regression on those grouped residuals.
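
In code, the intuition could look something like this, building on the residuals (res) from the sketch in exercise 3 (df2 and z_m are only suggested names):

#Aggregate the residuals and the group-level predictor per nationality
df2 <-
  df %>%
  group_by(Nationality) %>%
  reframe(res_mean = mean(res, na.rm = TRUE),
          z_m = mean(z, na.rm = TRUE))

#Second-stage regression: explain the grouped ("labelled") errors with a
#group-level predictor
lm(res_mean ~ z_m, data = df2)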

  1. Level-1 regression: Ok, continue with the mod.stage1 results. Compare with your previous fixed-effects model. What is the difference in how you interpret them?

  2. Level-2 data frame: You can build on your data frame (n = 28) from exercise 3 or make it anew, but this time add a variable on the electoral system. If you make the data frame anew, remember to include z when you aggregate per nationality.

  3. Level-2 regression: Regress your fixed-effects coefficients on the aggregated values of z by drawing on the values in your new, aggregated data frame (e.g. lm(coefs ~ z_m, df2)).

  4. Let R do a two-level regression for you. Fetch the lme4 package to run a two-level hierarchical model: lmer(y ~ x + z + (1|Nationality), df).

  5. Use stargazer to compare the results from your four regressions: the pooled model (from exercise 1) and the level-1, level-2, and random-effects models (from this exercise). What are your take-aways? A sketch of these last steps follows below.
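
A possible sketch of the remaining steps, building on mod.stage1 and df2 from the sketches above (mod.stage2 and mod.ml are only suggested names):

#Step 3: level-2 regression of the fixed-effects coefficients on the aggregated z
#(this assumes the coefficients and df2 are sorted in the same alphabetical order)
df2$coefs <- coef(mod.stage1)[grepl("Nationality", names(coef(mod.stage1)))]
mod.stage2 <- lm(coefs ~ z_m, df2)

#Step 4: the same logic in one go with a varying-intercept model
mod.ml <- lmer(y ~ x + z + (1|Nationality), df)

#Step 5: compare the four regressions
stargazer(mod.pooled, mod.stage1, mod.stage2, mod.ml,
          omit = "Nationality", type = "text")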

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press.