We can think of hierarchical/multilevel models as a way to mix and match information / variation from different sources (at different levels). Here, we will work on within and between group variation through the exploration of fixed effects.

We will work on an inbalanced panel; a cross-sectional time-series data on MEPs in the 2012-2017-period.

#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics

#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
              destfile = "MEP.rda")

#Load in
load("MEP.rda")

#Rename the data for convenience
df <- MEP

#Rename the three main variables for convenience
df <-
  df %>%
  mutate(
    #The number of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The seat share that national party has in national election
    x = SeatsNatPal.prop,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList) 

Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.

Exercise 1: Simpson’s paradox and leveraging variation

\(y_i = \alpha + \beta x_i + \epsilon_i\)

  1. Describe the bivariate relationship between MEPs’ local staff (ShareOfLocalStaff, y) and their party’s seat share (SeatNatPal.prop, x) by fitting a linear regression. What do you find?

  2. Illustrate the relationship between the two in ggplot2. You can use the geometric function for local regression and specify the method geom_smooth(aes(x, y), method = "loess"). Change the method to “lm” (linear model). What do you think?

  3. Building on your previous ggplot code, use the color = "Nationality" argument to create a separate regression for each MEP nationality. You can remove the uncertainty by adding se = FALSE if the shaded lines makes it hard to read the plot. Compare the new plot with the plot from 1b. What happened?

  4. Ok. Calculate the grouped averages of both x and y using group_by() and reframe(). You now have a new data frame with 28 observations. Plot x and y against each other in a scatterplot. Using geom_smooth(aes(x,y), method = “lm”) you can also fit a regression line to the data. What do you see?

  5. Fit a fixed-effects model where you control for Nationality.

  6. Use stargazer() to compare your pooled model with the fixed-effects model. If you don’t want to see the regression coefficients relating to Nationality, you may use the argument omit = "Nationality". What happened to the effect of vote share? Why do you think this happened?

Exercise 2: Two ways to obtain the same result

The fixed effects subsume country-level variation. Even though labor cost varies over time, most of it is at the national level.

  1. Re-fit a pooled model where you control for LaborCost. What happens?

  2. What happens if you add national fixed effects to that model?

Exercise 3: Non-identified models.

Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When introducing fixed effects for your group identity variables (e.g. nationality), the R will relegate the first group as a reference group (in the intercept). If you then introduce other variables that are colinear with your groups (e.g. electoral system), they too will have to omit one category. In other words, you have limited possibility to add group-level predictors once you have grouped fixed-effects in your model.

  1. Fit a pooled model with predictors for party size (x) and electoral system (z).

  2. Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?

  3. Tweak your fixed-effects model such that you remove z. What happens to Sweden?

Exercise 4: Demeaning and fixed effects models

The MEP data is in effect an unbalanced panel where each parliamentarian is observed up to 10 times (twice a year for 5 years). Let the observations in the data (\(j\)) run from 1 to the number of periods they have been observed, while \(i\) denotes each of the 1174 individual MEPs. Observations are thus nested in MEPs, with a total data size of N = 7143. Let’s explore a fixed-effects model where we use individual-level fixed effects.

Your baseline model is the pooled model from exercise 1a:

\(y_{ij} = \alpha + \beta x_{ij} + \epsilon_{ij}\)

  1. begin by calculating the number of MEPs that experienced at least one change in national party size (x) and local staff (y). In your opinion, do you have enough variation in the data to estimate within-individual effects?

To do so, you can use group_by, lag() and create a boolean variable (guessing game) using any(). Specifically, for each MEP (their unique identifying variable is “ID”), ask whether any x is unequal to the lagged version of x. You can use the R notebook for inspiration.

  1. leverage only the between-individual variation between MEPs’ local staff and their party size.

If individual MEPs are denoted as i and group-level residuals are denoted \(u\), then the equation is the following:

\(\bar{y_i} = \alpha + \beta \bar{x_i} + u_i\)

Create a new data frame and call it df0 where you calculate the group mean in y (local staff) and x (seat share) for each MEP. Use group_by() and reframe(). Estimate a model where you correlate the group means with each other. Store the result as mod_between.

  1. leverage only the within-individual variation in a similar model.

\((y_{ij} - \bar{y}_i) = \beta (x_{ij} - \bar{x}_i) + \epsilon_{ij}\)

This time, work in the main data frame. Recalculate the group means for x and y and call the new variables x_id and y_id, then demean the x (party size) and the y by estimating the equation: I(y - y_id) ~ -1 + I(x - x_id). The - 1 removes the intercept for you. Store the results as mod_within.

  1. estimate a fixed effects model where you add MEP’s id variable as a covariate.

\(y_{ij} = \alpha_i + \beta x_{ij} + \epsilon_{ij}\)

In baseR: y ~ x + as.factor(ID). Store the result as mod_fe. Display the results in a results table in stargazer(). You can use the argument omit = "as.factor" to remove the display of all covariates containing these characters (i.e. your fixed effects).

Compare the regression coefficients from mod_within and mod_fe. Are they the same? Good! Why?

  1. do the Mundlak move.

\(y_{ij} = \alpha + \beta_1 \bar{x}_i + \beta_2 (x_{ij} - \bar{x}_i) + \epsilon_{ij}\)

Estimate a model on you main data frame with both the between- and within- variation variables included:y = x_id + I(x - x_id). Store the result as mod_both and add it to stargazer.

  1. discuss with your neighbor: What did you just do? What do these models tell you?

Exercise 5: Variation is information

Go back to your very first code and store in your data which MEP experienced at least one change in x (and in y).

  1. reestimate the individual fixed-effects model on data where you filter out all MEPs that have not experienced a change (filter(change_x == 0)). Store the the result in mod_fe_var and add it to the stargazer results table.

  2. compare the regression coefficents of mod_fe and mod_fe_var. What do you see? Why? What about the standard errors?

Exercise 5: Interpret using partial scenarios

Let’s theorize that MEPs will move to compensate for loss in party funding coming from an unfortunate national election. To do so, we expect that MEPs will respond to changes in their own party’s size. That’s the intuition the fixed effects model narrows down on.

Now that we are focusing on the within-individual change in party size (a proxy for the party’s funding), the partial scenario for interpretation changes. We are no longer interested in the between-MEP variation in party size, but what a realistic change in x is after a national election.

  1. use the same group_by() and mutate() trick and calculate some metric of the typical change in party size following a national election. To do so, calculate the difference between x and the previous x for each MEP in the main data, then filter out all non-changes using filter(). This is the increment (change) you use when you interpret \(\beta\), your slope coefficient for x.

  2. calculate the marginal effect of such a change in MEP’s local staff.

  3. formulate this in a plain English way. In your opinion, is this a “big” effect?

Exercise 6: Consider time varying effects

The general level of MEP’s local staff changed over the period for reasons unrelated to national electoral fortunes. First, the European Parliament finally adressed a phenomenon among some MEPs to invest all their funding in local presence rather than legislative work. In 2016 the European Parliament therefore reformed the system by imposing a rule that only 75% of the funding can be spent locally. After that, the general level of local staff dropped.

Second, the overall local staff size was higher in the spring of 2014 because of the European election where many of them were competing for seats.

  1. Are these elements confounders? How could you adress it?

  2. Explore the general variation in local staff size across periods (Period) in the data.

  3. Address the problem if you perceive any.

  4. Re-interpret your results. What changed?

Exercise 7: Your call

  1. considering all the covariates in the data, which model would you go for?

  2. what did you learn?

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge ; New York: Cambridge University Press.