We can think of hierarchical/multilevel models as a way to combine information (variation) from different sources at different levels. Here, we will work on within- and between-group variation through the exploration of fixed effects.
We will work on an unbalanced panel: cross-sectional time-series data on Members of the European Parliament (MEPs) in the 2012-2017 period.
#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics
#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
destfile = "MEP.rda")
#Load in
load("MEP.rda")
#Rename the data for convenience
df <- MEP
#Rename the three main variables for convenience
df <-
  df %>%
  mutate(
    #The share of each MEP's staff employed locally
    y = ShareOfLocalAssistants,
    #The seat share the MEP's national party won in the last national election
    x = SeatsNatPal.prop,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList)
Here, we will examine the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
\(y_i = \alpha + \beta x_i + \epsilon_i\)
Describe the bivariate relationship between MEPs' local staff
(ShareOfLocalAssistants, y) and their party's seat share
(SeatsNatPal.prop, x) by fitting a linear regression. What
do you find?
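As a minimal sketch, the pooled bivariate model can be fit with lm(), assuming the df, x, and y names created in the setup code (mod_pooled is an illustrative name):

```r
#Pooled bivariate regression: local staff share on national seat share
mod_pooled <- lm(y ~ x, data = df)
summary(mod_pooled)
```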
Illustrate the relationship between the two in
ggplot2. You can use the geometric function for local
regression and specify the method
geom_smooth(aes(x, y), method = "loess"). Change the method
to "lm" (linear model). What do you think?
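One way to overlay the two smoothers in a single plot, assuming the df from the setup code:

```r
ggplot(df, aes(x, y)) +
  geom_point(alpha = 0.3) +
  #Local regression first...
  geom_smooth(method = "loess") +
  #...then a linear fit for comparison
  geom_smooth(method = "lm", color = "red")
```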
Building on your previous ggplot code, map Nationality to color
inside aes() to create a separate
regression line for each MEP nationality. You can remove the uncertainty by
adding se = FALSE if the shaded ribbons make it hard to read
the plot. Compare the new plot with the plot from 1b. What
happened?
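A sketch of the per-nationality regressions, assuming df contains a Nationality variable:

```r
ggplot(df, aes(x, y, color = Nationality)) +
  #One regression line per member state; se = FALSE drops the ribbons
  geom_smooth(method = "lm", se = FALSE)
```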
Ok. Calculate the grouped averages of both x and y using
group_by() and reframe(). You now have a new
data frame with 28 observations. Plot x and y against each other in a
scatterplot. Using geom_smooth(aes(x, y), method = "lm") you can also fit
a regression line to the data. What do you see?
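A sketch of the grouped averages, assuming the variable names from the setup code (df_nat is an illustrative name):

```r
df_nat <- df %>%
  group_by(Nationality) %>%
  #One row per member state with the group means
  reframe(x = mean(x, na.rm = TRUE),
          y = mean(y, na.rm = TRUE))

ggplot(df_nat, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm")
```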
Fit a fixed-effects model where you control for
Nationality.
Use stargazer() to compare your pooled model with
the fixed-effects model. If you don’t want to see the regression
coefficients relating to Nationality, you may use the argument
omit = "Nationality". What happened to the effect of vote
share? Why do you think this happened?
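One way to set up the comparison, assuming the setup variables (the model names are illustrative):

```r
mod_pooled <- lm(y ~ x, data = df)
#Nationality dummies absorb country-level variation
mod_fe_nat <- lm(y ~ x + Nationality, data = df)

stargazer(mod_pooled, mod_fe_nat,
          omit = "Nationality", type = "text")
```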
The fixed effects subsume country-level variation. Even though labor cost varies over time, most of its variation is at the national level.
Re-fit a pooled model where you control for LaborCost. What happens?
What happens if you add national fixed effects to that model?
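A sketch, assuming the data contain a LaborCost variable (the model names are illustrative):

```r
#Pooled model controlling for labor cost...
mod_labor <- lm(y ~ x + LaborCost, data = df)
#...and the same model with national fixed effects
mod_labor_fe <- lm(y ~ x + LaborCost + Nationality, data = df)
stargazer(mod_labor, mod_labor_fe, omit = "Nationality", type = "text")
```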
Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When you introduce fixed effects for a group identity variable (e.g. nationality), R relegates the first group to the reference category (absorbed by the intercept). If you then introduce other variables that are collinear with your groups (e.g. electoral system), one of their categories must be omitted too. In other words, you have limited possibilities for adding group-level predictors once you have group fixed effects in your model.
Fit a pooled model with predictors for party size (x) and electoral system (z).
Fixed-effects model: Add national-level fixed effects to the pooled model. What happened to Sweden? Why?
Tweak your fixed-effects model such that you remove z. What happens to Sweden?
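A sketch of the three models, assuming the setup variables (the model names are illustrative):

```r
#Pooled model with party size and electoral system
mod_z <- lm(y ~ x + z, data = df)
#With national fixed effects, z is collinear with the country dummies,
#so R drops one coefficient and reports it as NA
mod_z_fe <- lm(y ~ x + z + Nationality, data = df)
#Removing z restores a full set of country dummies
mod_fe_noz <- lm(y ~ x + Nationality, data = df)
```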
The MEP data is in effect an unbalanced panel where each parliamentarian is observed up to 10 times (twice a year for 5 years). Let the observations in the data (\(j\)) run from 1 to the number of periods they have been observed, while \(i\) denotes each of the 1174 individual MEPs. Observations are thus nested in MEPs, with a total data size of N = 7143. Let’s explore a fixed-effects model where we use individual-level fixed effects.
Your baseline model is the pooled model from exercise 1a:
\(y_{ij} = \alpha + \beta x_{ij} + \epsilon_{ij}\)
To identify which MEPs experienced a change in x, you can use
group_by(), lag() and
create a boolean variable using any().
Specifically, for each MEP (their unique identifying variable is "ID"),
ask whether any x is unequal to the lagged version of x. You can use the
R notebook for inspiration.
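A sketch of the change flag, assuming the ID variable and the renamed x and y (change_x and change_y follow the naming used later in this exercise):

```r
df <- df %>%
  group_by(ID) %>%
  #1 if the MEP's party size (or staff share) ever changed between periods
  mutate(change_x = as.numeric(any(x != lag(x), na.rm = TRUE)),
         change_y = as.numeric(any(y != lag(y), na.rm = TRUE))) %>%
  ungroup()
```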
If individual MEPs are denoted as i and group-level residuals are denoted \(u\), then the equation is the following:
\(\bar{y}_i = \alpha + \beta \bar{x}_i + u_i\)
Create a new data frame and call it df0 where you
calculate the group mean in y (local staff) and x (seat share) for each
MEP. Use group_by() and reframe(). Estimate a
model where you correlate the group means with each other. Store the
result as mod_between.
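A sketch of the between model, assuming the setup variables:

```r
df0 <- df %>%
  group_by(ID) %>%
  #One row per MEP with the group means
  reframe(y = mean(y, na.rm = TRUE),
          x = mean(x, na.rm = TRUE))
#Correlate the group means with each other
mod_between <- lm(y ~ x, data = df0)
```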
\((y_{ij} - \bar{y}_i) = \beta (x_{ij} - \bar{x}_i) + \epsilon_{ij}\)
This time, work in the main data frame. Recalculate the group means
for x and y and call the new variables x_id and
y_id, then demean the x (party size) and the y by
estimating the equation: I(y - y_id) ~ -1 + I(x - x_id).
The - 1 removes the intercept for you. Store the results as
mod_within.
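A sketch of the demeaned (within) model, assuming the setup variables:

```r
df <- df %>%
  group_by(ID) %>%
  #Group means per MEP
  mutate(x_id = mean(x, na.rm = TRUE),
         y_id = mean(y, na.rm = TRUE)) %>%
  ungroup()
#Demeaned regression without an intercept
mod_within <- lm(I(y - y_id) ~ -1 + I(x - x_id), data = df)
```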
\(y_{ij} = \alpha_i + \beta x_{ij} + \epsilon_{ij}\)
In base R: y ~ x + as.factor(ID). Store the result as
mod_fe. Display the results in a results table in
stargazer(). You can use the argument
omit = "as.factor" to remove the display of all covariates
containing these characters (i.e. your fixed effects).
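A sketch, assuming the setup variables:

```r
#Individual fixed effects: one dummy per MEP
mod_fe <- lm(y ~ x + as.factor(ID), data = df)
stargazer(mod_fe, omit = "as.factor", type = "text")
```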
Compare the regression coefficients from mod_within and mod_fe. Are they the same? Good! Why?
\(y_{ij} = \alpha + \beta_1 \bar{x}_i + \beta_2 (x_{ij} - \bar{x}_i) + \epsilon_{ij}\)
Estimate a model on your main data frame with both the between- and
within-variation variables
included: y ~ x_id + I(x - x_id). Store the result as
mod_both and add it to stargazer.
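A sketch of the combined model, assuming x_id holds the MEP-level mean of x:

```r
df <- df %>%
  group_by(ID) %>%
  mutate(x_id = mean(x, na.rm = TRUE)) %>%
  ungroup()
#Between-variation (x_id) and within-variation (x - x_id) side by side
mod_both <- lm(y ~ x_id + I(x - x_id), data = df)
```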
Go back to your very first code and store in your data which MEPs experienced at least one change in x (and in y).
Re-estimate the individual fixed-effects model on data where you
filter out all MEPs that have not experienced a change
(filter(change_x == 1)). Store the result in
mod_fe_var and add it to the stargazer results
table.
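A sketch, assuming change_x is coded 0/1 for whether the MEP's x ever changed:

```r
df_var <- df %>%
  group_by(ID) %>%
  mutate(change_x = as.numeric(any(x != lag(x), na.rm = TRUE))) %>%
  ungroup() %>%
  #Keep only MEPs whose party size actually changed
  filter(change_x == 1)
mod_fe_var <- lm(y ~ x + as.factor(ID), data = df_var)
```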
Compare the regression coefficients of mod_fe and mod_fe_var. What do you see? Why? What about the standard errors?
Let’s theorize that MEPs will move to compensate for loss in party funding coming from an unfortunate national election. To do so, we expect that MEPs will respond to changes in their own party’s size. That’s the intuition the fixed effects model narrows down on.
Now that we are focusing on the within-individual change in party size (a proxy for the party's funding), the scenario we use for interpretation also changes. We are no longer interested in the between-MEP variation in party size, but in what a realistic change in x looks like after a national election.
Use the same group_by() and mutate()
trick and calculate some metric of the typical change in party
size following a national election. To do so, calculate the difference
between x and the previous x for each MEP in the main data, then filter
out all non-changes using filter(). This is the increment
(change) you use when you interpret \(\beta\), your slope coefficient for
x.
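A sketch of the typical-change calculation, assuming the setup variables (diff_x is an illustrative name):

```r
df %>%
  group_by(ID) %>%
  #Change in party size from the previous observation
  mutate(diff_x = x - lag(x)) %>%
  ungroup() %>%
  #Drop non-changes (and the NA first observations)
  filter(diff_x != 0) %>%
  summarize(median_change = median(diff_x, na.rm = TRUE),
            mean_abs_change = mean(abs(diff_x), na.rm = TRUE))
```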
Calculate the marginal effect of such a change on the MEP's local staff.
Formulate this in plain English. In your opinion, is this a "big" effect?
The general level of MEPs' local staff changed over the period for reasons unrelated to national electoral fortunes. First, the European Parliament finally addressed the tendency among some MEPs to invest all of their funding in local presence rather than legislative work. In 2016, the European Parliament therefore reformed the system by imposing a rule that at most 75% of the funding can be spent locally. After that, the general level of local staff dropped.
Second, the overall local staff size was higher in the spring of 2014 because of the European election, where many MEPs were competing for seats.
Are these elements confounders? How could you address them?
Explore the general variation in local staff size across periods
(Period) in the data.
Address the problem if you perceive any.
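One way to explore and absorb period-level variation, assuming a Period variable in the data (mod_fe_period is an illustrative name):

```r
#Average local staff share per period
df %>%
  group_by(Period) %>%
  reframe(mean_y = mean(y, na.rm = TRUE))

#Period fixed effects soak up common shocks (the 2016 reform, the 2014 election)
mod_fe_period <- lm(y ~ x + as.factor(ID) + as.factor(Period), data = df)
```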
Re-interpret your results. What changed?
Considering all the covariates in the data, which model would you go for?
What did you learn?