We can think of hierarchical/multilevel models as a way to partition and leverage variation at different levels.
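As a quick illustration of what "variation at different levels" means, here is a self-contained sketch on simulated data (all names and numbers are invented) that splits the spread in an outcome into a between-group and a within-group part:

```r
#Simulated sketch: decompose variation into between- and within-group parts
set.seed(1)
group <- rep(1:5, each = 20)                     #5 groups, 20 observations each
group_shift <- rnorm(5, mean = 0, sd = 2)[group] #between-group variation
y <- group_shift + rnorm(100, sd = 1)            #within-group variation

#Between-group spread: standard deviation of the group means
between <- sd(tapply(y, group, mean))
#Within-group spread: standard deviation of deviations from each group's mean
within <- sd(y - ave(y, group))
```

Hierarchical models exploit exactly this decomposition: the group means carry the between-group information, the deviations from them the within-group information.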
We will be working with cross-sectional time-series data on MEPs from the 2012-2017 period.
#libraries
library(lme4) #hierarchical models
library(stargazer) #regression tables
library(dplyr) #data wrangling + pipes
library(ggplot2) #graphics
#download
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
destfile = "MEP.rda")
#Load in
load("MEP.rda")
#Rename the data for convenience
df <- MEP
#Rename the three main variables for convenience
df <-
  df %>%
  filter(Period == 5) %>%
  mutate(
    #The share of local assistants employed by each MEP in 2014
    y = ShareOfLocalAssistants,
    #The vote share (in percentage points) that the national party got in the last election
    x = VoteShare_LastElection,
    #Candidate-centered (or not) electoral system for EU elections
    z = OpenList)
Here, we will check out the predictions of the following linear model. It formulates the expectation that MEPs from parties with larger electoral support in the past will invest less in local staff.
Let’s start by subsetting the data to period 5 (in 2014) for
exercises 1 and 2:
df <- MEP %>% filter(Period == 5).
Explore the relationship between MEPs’ local staff
(ShareOfLocalStaff, y) and their party’s vote share
(VoteShare_LastElection, x) by fitting a linear regression.
What do you find?

mod0.pool <- lm(y ~ x, df)
# summary(mod0.pool)
Now fit the same model with fixed effects for Nationality.

mod.fix <- lm(y ~ x + Nationality, df)
Plot the relationship in ggplot2. You can use the geometric
function for local regression and specify the method:
geom_smooth(aes(x, y), method = "loess"). Change the method
to “lm” (linear model). What do you think?

df %>%
  ggplot +
  geom_smooth(
    aes(y = y,
        x = x),
    method = "lm", se = T) +
  ggtitle("Pooled regression: Within- and between-country variation")
Now add the color = Nationality or
facet_wrap(~Nationality) arguments to create a separate
regression for each MEP’s nationality. You can remove the uncertainty by
adding se = FALSE if the shaded areas make it hard to read
the plot. Compare the new plot with the plot from 1c. What
happened?

df %>%
ggplot +
geom_smooth(
aes(y = y,
x = x,
color = Nationality),
method = "lm", se = F) +
ggtitle("28 non-pooled regressions: Within-country variation")
Aggregate the data by nationality using
group_by() and reframe(). You now have a new
data frame with 28 observations. Plot x and y against each other in a
scatterplot. Using geom_smooth(aes(x, y), method = "lm") in
ggplot2 you can also fit a regression line to the data. What do you
see?

df %>%
group_by(Nationality) %>%
reframe(y_bar = mean(y, na.rm = T),
x_bar = mean(x, na.rm = T),
n = n()) %>%
ggplot +
geom_point(
aes(
y = y_bar,
x = x_bar,
size = n/10
), alpha = 0.3
) +
geom_smooth(
aes(
y = y_bar,
x = x_bar
),
method = "lm"
) +
ggtitle("Between-group variation: Correlation between vote share and local staff")
Use stargazer() to compare your pooled model with the
fixed-effects model. If you don’t want to see the regression
coefficients relating to Nationality, you may use the argument
omit = "Nationality". What happened to the effect of vote
share? Why do you think this happened?

stargazer(mod0.pool, mod.fix,
omit = "Natio",
title = "Pooled vs. non-pooled regression: The effect of the predictor changes direction",
type = "text")
Gelman and Hill (2007) emphasize at several points that fixed-effects (non-pooled) models quickly become non-identified (chapter 11). When you introduce fixed effects for a group identity variable (e.g. nationality), R treats the first group as the reference category (absorbed into the intercept). If you then introduce other variables that are collinear with your groups (e.g. electoral system, which is constant within countries), R must drop a category for them too. In other words, you have limited room to add group-level predictors once you have group fixed effects in your model.
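A minimal sketch of this identification problem on simulated data (the variable names here are invented): a predictor that is constant within groups is perfectly collinear with the group dummies, so lm() returns NA for it.

```r
#Simulated sketch: a group-level predictor is collinear with the group dummies
set.seed(2)
nat <- factor(rep(c("A", "B", "C"), each = 10)) #group identity, like Nationality
z   <- as.numeric(nat == "A")                   #group-level predictor, constant within groups
x   <- rnorm(30)
y   <- 1 + 0.5 * x + rnorm(30)

m <- lm(y ~ x + nat + z)
coef(m)["z"] #NA: z is a linear combination of the intercept and the nat dummies
```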
mod.pool <- lm(y ~ x + z, df)
# mod.pool %>% summary
mod.fix <- lm(y ~ x + z + Nationality, df)
# mod.fix %>% summary
mod2.fix <- lm(y ~ x + Nationality, df)
stargazer(mod.fix, mod2.fix, type = "text")
#Sweden disappears/reappears because the model is not identified
#Rename the three main variables for convenience
df <-
MEP %>%
mutate(
#The share of local assistants employed by each MEP
y = ShareOfLocalAssistants,
#The vote share (in percentage points) that the national party got last election
x = VoteShare_LastElection,
#Candidate-centered (or not) electoral system for EU elections
z = OpenList)
OK, now let’s consider the entire data set:
df <- MEP.
mod1.pool <- lm(y ~ x, df)
mod1.fix <- lm(y ~ x + Nationality, df)
df <-
df %>%
mutate(
predicted = predict(mod1.pool, df),
predicted2 = predict(mod1.fix, df),
residuals = ShareOfLocalAssistants - predicted,
residuals2 = ShareOfLocalAssistants - predicted2)
sd(df$residuals, na.rm = T) #68% of MEPs have a local-staff share within +/- 2.87 of the prediction
sd(df$residuals2, na.rm = T) #the residual spread is smaller once nationality is accounted for
df %>%
ggplot +
geom_histogram(aes(residuals))
Average the residuals by MEP ID. Store the result in a
separate data frame. Describe the spread and plot a histogram. What does
this say about the between-group variation in the data, once you’ve
accounted for vote share (and nationality)? Is there more within- or
between-group variation here?

df_mep <-
df %>%
group_by(ID) %>%
reframe(m = mean(residuals, na.rm = T),
m2 = mean(residuals2, na.rm = T))
sd(df_mep$m, na.rm = T) #between-group variation is 2.78
sd(df_mep$m2, na.rm = T) #between-group variation is smaller
df_mep %>%
ggplot +
geom_histogram(aes(m))
Do the same with the fixed-effects model (mod.fix). Store the results in separate
variables for comparison. What happened? Why?

#Rename the three main variables for convenience
df <-
MEP %>%
mutate(
#The share of local assistants employed by each MEP
y = ShareOfLocalAssistants,
#The vote share (in percentage points) that the national party got last election
x = VoteShare_LastElection,
#Candidate-centered (or not) electoral system for EU elections
z = OpenList)
In the two next exercises, we work with the entire data set:
df <- MEP.
We can think of hierarchical models (models with varying intercepts) as a way of labeling the errors by group.
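The point can be made concrete on simulated data (a sketch; names are invented): the varying intercepts estimated by lmer() line up closely with the group means of the residuals from a pooled regression.

```r
library(lme4)

#Simulated sketch: varying intercepts "label" the grouped errors
set.seed(3)
g <- factor(rep(1:10, each = 30))
a <- rnorm(10, sd = 2)[g]  #group-specific intercept shifts
x <- rnorm(300)
y <- a + x + rnorm(300)

#Group means of the residuals from a pooled regression
pool <- lm(y ~ x)
res_by_group <- tapply(resid(pool), g, mean)

#Varying intercepts from a two-level model
ran <- lmer(y ~ x + (1 | g))
ranef_g <- ranef(ran)$g[, 1]

cor(res_by_group, ranef_g) #close to 1
```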
mod.pool <- lm(y ~ x,
df)
df <-
df %>%
mutate(
predicted_pool = predict(mod.pool, df),
residuals_pool = y - predicted_pool)
Aggregate the residuals by nationality using group_by() and
reframe(). Store the data frame (n = 28) in an R
object.

df_nat <-
df %>%
group_by(Nationality) %>%
reframe(residuals_grouped = mean(residuals_pool, na.rm = T),
n = n())
Run the first stage of the two-stage model by regressing y on x and the nationality dummies without a common intercept:
mod.stage1 <- lm(y ~ -1 + x + Nationality, df).

mod.fix <- lm(y ~ -1 + x + Nationality, df)
Extract the coefficients with mod.fix$coefficients. Remove
the coefficients not related to nationality.

df_nat <-
  df_nat %>%
  #Remove the first coefficient (the slope on x), keeping the nationality effects
  mutate(fix = mod.fix$coefficients[-1])
#These are two ways of grouping the residuals. However, they are not perfectly correlated, since the fixed-effects model takes out some of the within-group variation explained by x.
df_nat %>%
ggplot+
geom_point(
aes(
residuals_grouped,
fix
)
)
Gelman and Hill (2007) explain the intuition of the 2-level model in chapter 11: It is a way for us to group the residuals according to their identities, then run a second regression on the residuals.
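On simulated data, the two stages can be sketched like this (all names and true values are invented): stage 1 recovers one intercept per group, stage 2 regresses those intercepts on a group-level predictor.

```r
#Simulated sketch of the two-stage logic
set.seed(4)
n_g <- 20
z_g <- rbinom(n_g, 1, 0.5)                #group-level predictor, like OpenList
a_g <- 1 + 2 * z_g + rnorm(n_g, sd = 0.5) #group intercepts depend on z
g   <- rep(1:n_g, each = 25)
x   <- rnorm(n_g * 25)
y   <- a_g[g] + 0.3 * x + rnorm(n_g * 25)

#Stage 1: fixed effects without a common intercept give one coefficient per group
stage1 <- lm(y ~ -1 + x + factor(g))
a_hat  <- coef(stage1)[-1] #drop the slope on x, keep the group intercepts

#Stage 2: regress the group intercepts on the group-level predictor
stage2 <- lm(a_hat ~ z_g)
coef(stage2)["z_g"] #should be close to the true effect of 2
```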
Inspect the first-stage results. Compare with your previous fixed-effects model. What is the
difference in how you interpret them?

mod.level1 <- lm(y ~ -1 + x + Nationality, df)
# mod.level1 %>% summary
Now bring in the group-level predictor: the electoral system (z; OpenList). If you make the dataframe anew:
use the group_by() and reframe()
arguments to make a new data frame from df that reports the
mean value of z, the electoral system, by country (N = 28).
Extract the fixed-effects coefficients for Nationality from your
level-1 regression and add them as a variable to this new data frame.
You can do this using mutate() or
cbind.
df2 <-
df %>%
group_by(Nationality) %>%
reframe(z_m = mean(z)) %>%
mutate(coefs = mod.level1$coefficients[-1])
Regress the extracted coefficients on the electoral system (lm(coefs ~ z_m, df2)).

mod.level2 <- lm(coefs ~ z_m, df2)
# mod.level2 %>% summary
Finally, use the lme4
package to run a two-level hierarchical model:
mod.ran <- lmer(y ~ x + z + (1|Nationality), df).

mod.ran <- lmer(y ~ x + z + (1|Nationality), df)
stargazer(mod.pool, mod.level1, mod.level2, mod.ran,
omit = "National",
column.labels = c("Pooled", "Level 1", "Level 2", "Two-level"),
model.numbers = F,
model.names = F,
dep.var.caption = "Local staff size (y)",
type = "text")
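After fitting, lme4 separates the pooled coefficients from the group-level deviations. A self-contained sketch on simulated data (names invented):

```r
library(lme4)

#Simulated sketch: inspecting a fitted two-level model
set.seed(5)
g <- factor(rep(1:8, each = 40))
x <- rnorm(320)
y <- rnorm(8, sd = 1.5)[g] + 0.4 * x + rnorm(320)

m <- lmer(y ~ x + (1 | g))
fixef(m)   #the common (pooled) intercept and slope on x
ranef(m)$g #the estimated deviation of each group's intercept from the common one
VarCorr(m) #between-group vs. residual (within-group) standard deviations
```

The same extractors apply to mod.ran above, with Nationality as the grouping factor.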