Models for event counts


In this session, we will be analyzing count outcomes. We will explore the baseline poisson model, test for overdispersion and address potential problems.

We will familiarize ourselves with:

  • how to estimate different count models

    • poisson and quasi-poisson models from base R
    • negative binomial models from MASS
    • zero-inflated and hurdle models from pscl
  • clustered standard errors from sandwich

    • while these are not specific to count models, we sometimes need to cluster our standard errors because our observations are clustered. You’ll find an example at the end of the script.

We will be working with data from Yoshinaka, Mcelroy, and Bowler (2010) on report allocation in the European Parliament (EP). All legislative proposals are handled by a member of one of the committees in the EP. This person (the “rapporteur”) is in charge of collecting information, negotiating a consensus between the party groups and with the other institutions (the Commission and the Council), and drafting the resulting parliamentary amendments on behalf of the committee. For any single policy proposal, the rapporteur is the single most influential member of parliament (MEP). The rapporteur is appointed by the group leader in the committee in charge of the proposal. Group leaders are committee members of the transnational party groups of MEPs. These are (transnational) parliamentary parties to which national parties adhere (or not).

Back in the day, it was a common belief that national parties were running the show in the EP, since they are in charge of candidate selection before elections, and because transnational groups have little presence outside of Parliament.

This article explores the argument that party leaders will prefer to appoint group members that are ideologically close to the transnational party median and contrasts this with the ideological proximity between MEPs and their national party group. The aim is to understand who allocates political influence in the European Parliament: National parties or European political groups? The model also includes an indicator of whether being a policy expert is an asset.

Introduction

We start by loading in the data and relevant packages.

#Data wrangling
library(dplyr); 

#Plots
library(ggplot2)
#Set preferences for aesthetics on ggplot2
theme_set(
  #The "minimal" theme; try out others
  theme_minimal()
)

#Load in data
load("df_yoshinaka.rda")
#Rename the object
df <- df_yoshinaka %>%
  #Data originally from stata, so redefine as a data frame in R
  as.data.frame

Descriptive statistics

Outcome variable: number of legislative proposals

The outcome variable is the number of reports (i.e. legislative proposals, nreports) each MEP has garnered during the legislative period. We have, in other words, a data generating process in which events occur – in principle independently of each other – within a certain window/exposure time. This is the type of outcome for which we resort to the poisson model or one of its affiliates.

df %>%
  ggplot + 
  geom_histogram(
    aes(nreports)
    ) +
  xlab("Number of reports  (proposals) per MEPs") +
  ylab("Number of MEPs") +
  ggtitle("Distribution of legislative proposals among  MEPs")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of legislative proposals is quite unequal. The distribution looks like a “ski slope”: I have many zero counts (MEPs who never proposed legislation) and some really high counts (MEPs that have handled quite a few proposals).

#Number of observations (MEPs)
nrow(df)
## [1] 906
#Number of no events (MEPs without any reports)
sum(df$nreports==0)
## [1] 367
#Number of extreme events
quantile(df$nreports, 0.95)
## 95% 
##   7
max(df$nreports)
## [1] 22

A fairly large number of MEPs (more than a third) never drafted a report at all, while the 5% most productive members drafted 7 or more reports. The most prolific MEP drafted no less than 22 proposals!

Overall, the variance in our outcome is higher than the mean.

#Mean
mean(df$nreports)
## [1] 1.970199
#Variance
var(df$nreports)
## [1] 8.340547

This is not uncommon for count data. The poisson distribution used in poisson models assumes that the mean is the same as the spread/variance in the conditional outcome. If our regression model contains enough predictors to explain this variance, that will not be a problem. However, it means I need predictors that explain both the long tail and the high zero count.
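
To see what this assumption implies, here is a quick, self-contained illustration (not part of the analysis): in a simulated poisson sample, the variance tracks the mean, which is clearly not what we observe for nreports.

#Quick illustration: in a poisson sample, the variance roughly equals the mean
set.seed(42)
sim <- rpois(10000, lambda = mean(df$nreports))
mean(sim); var(sim)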

Main predictors: political distance

Our research question concerns the political distance between each MEP and their political group (i.e. the median “voter” in the group). Here, I explore the distribution of our main predictor nom_dist1_ep, the absolute left-right distance between each MEP and their group median. Out of curiosity, I also sketch it per political group (see below).

df %>%
  ggplot + 
  geom_histogram(
    aes(nom_dist1_ep)
    ) +
  xlab("Left-rigth political distance between MEP and EP group median") +
  ylab("Number of MEPs") +
  ggtitle("Distribution of political distance among  MEPs") 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
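
As mentioned above, I am also curious about the distribution per political group. The sketch below assumes the data contain a party-group identifier; I use the hypothetical name epgroup here, so check the codebook for the actual variable name.

df %>%
  ggplot + 
  geom_histogram(
    aes(nom_dist1_ep)
    ) +
  #One panel per political group; "epgroup" is a hypothetical variable name
  facet_wrap(~epgroup) +
  xlab("Left-right political distance between MEP and EP group median") +
  ylab("Number of MEPs")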

My interest in the distribution of the predictor is first and foremost to check what variation I have. There is quite some variation between MEPs, but I already get a sense that quite a few MEPs are really close to their group, while some are far off from the group median. This already tells me that – even if my hypothesis is correct – my predictor cannot explain all the variation in the dependent variable. That’s because I’d expect the higher report counts to be found among MEPs that are close to their leaders. Here, we see that group leaders have many MEPs to choose from that are politically close to them, yet only very few MEPs will become “super-rapporteurs” (i.e. the long tail). Similarly, we have too many zero counts in the outcome compared to the number of political outliers in the predictor.

Control variables

We will include other control variables as well: The political distance to the national party, expertise, activity levels etc.

library(stargazer)
df_tab <- 
  df %>%
  #The select function from dplyr; not from MASS
  dplyr::select(nreports, nom_dist1_ep, nom_dist1_nat,
                ctee_expert, activity, attendance, months) %>%
  #Transform to a data frame for stargazer to "understand" (sometimes necessary)
  as.data.frame

stargazer(df_tab,
          title = "Descriptive statistics",
          #File type; could be "text"
          type = "html",
          #I want summary statistics from the data frame
          summary = T)
Descriptive statistics

Statistic      N     Mean    St. Dev.  Min    Max
nreports       906   1.970   2.888     0      22
nom_dist1_ep   899   0.075   0.115     0.000  1.205
nom_dist1_nat  899   0.028   0.076     0.000  0.849
ctee_expert    905   0.310   0.463     0      1
activity       904   0.753   0.211     0.000  0.994
attendance     889   0.517   0.257     0.004  0.984
months         906   42.435  19.388    1      60

Bivariate statistics

With two metric variables, I’d go for a scatterplot and a local regression. Since the outcome variable is a count, I’ll jitter the y-axis to get a feel for how many observations I have.

df %>%
  ggplot +
  #Scatterplot with jittering
  geom_jitter(aes(
    y = nreports,
    x = nom_dist1_ep
  ),
  #Make dots transparent
  alpha = 0.4) +
  #Local regression
  geom_smooth(aes(
    y = nreports,
    x = nom_dist1_ep
  )
  ) +
  #Linear regression
  geom_smooth(aes(
    y = nreports,
    x = nom_dist1_ep
  ),
  method = "lm",
  color = "darkgrey",
  se = F
  ) +
  ggtitle("Bivariate statistics: Legislative proposals as a function of political distance") +
  ylab("Legislative proposals (reports)") +
  xlab("Left-right distance between MEP and the EP group")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 7 rows containing non-finite values (`stat_smooth()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 7 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).

We get the sense that there is definitely something here. However, there are quite a few outliers too.

Why not a linear model? (what not to do)

Ward and Ahlquist (2018) warn us not to run linear regressions on count data. There is one exception: when all your predictors are categorical, you’re just calculating differences between group averages anyway, so you could do a linear model.

In our case, however, we have a continuous (metric) predictor, and my baseline go-to is the poisson regression. The poisson distribution has a single parameter \(\lambda\) (lambda). It describes both the mean/expected value and the spread/variance. That’s both its biggest asset and its biggest drawback.

The asset is that the variance increases with the mean. That is, the distribution “knows” that your data will be heteroscedastic. Look:

mod1.ols <- lm(nreports ~ 
                 nom_dist1_ep +
                 nom_dist1_nat+
                 ctee_expert,
              df)

Standardized residuals:

#Predictions
df$preds_ols <- predict(mod1.ols, df)
#Residuals
df$resid_ols <- df$nreports - df$preds_ols
#Standardize to z-scores
df$resid_ols_z <- df$resid_ols/sd(df$resid_ols, na.rm = T)

Plot the residuals against the predicted values.

df %>%
  ggplot +
  geom_point(aes(y = resid_ols_z,
                 x = preds_ols)) +
  ggtitle("Residual plot")
## Warning: Removed 8 rows containing missing values (`geom_point()`).

We have a fan-shaped distribution. As my predictions increase, I will be increasingly wrong (i.e. my residuals increase). If I were to run an OLS, my standard errors would be too small, because the normal distribution assumes an equal spread.

Some people “fix” the problem by log-transforming the dependent variable before they run a linear model (drawing from a normal distribution). However, even though this fixes the problem that my distribution is bounded at zero, it does not fix the problem with heteroscedasticity.

mod2.ols <- lm(log(nreports+1) ~ 
                 nom_dist1_ep +
                 nom_dist1_nat+
                 ctee_expert,
              df)

Use the ready-made function to get a residual plot. Hit “Return” on your keyboard to flip through different model statistics.

plot(mod2.ols)

The drawback of assuming an equal mean and spread is that sometimes the spread is too high even for the poisson distribution. That’s when we have overdispersion.

Baseline model: poisson

Estimation

Running the poisson model is simple. It is a GLM and is part of base R. The only code we need to change is the probability distribution.

mod1.pois <- glm(nreports ~ 
                 nom_dist1_ep +
                 nom_dist1_nat +
                 ctee_expert,
                 #tell R which probability distribution we use
                 family = "poisson",
              df)

summary(mod1.pois)
## 
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert, 
##     family = "poisson", data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.70968    0.03609  19.666  < 2e-16 ***
## nom_dist1_ep  -1.76930    0.31886  -5.549 2.88e-08 ***
## nom_dist1_nat -1.47164    0.50029  -2.942  0.00327 ** 
## ctee_expert    0.35966    0.04849   7.417 1.20e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2970.4  on 897  degrees of freedom
## Residual deviance: 2849.2  on 894  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: 4371.9
## 
## Number of Fisher Scoring iterations: 6

Exposure (the event window)

Thinking about the data generating process, I realize that MEPs don’t all serve entire terms in the European Parliament. Some quit before the end of the term, others swap committees. In other words, the exposure time varies across MEPs. In the data, there is a variable called months that reports the number of months the member served on the committee.

If the exposure time is not the same across observations, the probability of observing a given number of events also varies. We could address this by controlling for the log-transformed variable, but it is more common to use an offset. The offset is – for all practical purposes – a log-transformed control variable we include in the model, but for which the slope coefficient (parameter) is fixed at 1.

In R, I can do this either by wrapping the exposure in the offset() function inside the model formula or by using the offset argument in glm(). These two solutions are equivalent.

#Take 1
mod2.pois <- glm(nreports ~ 
                   #Political distance to EP party
                   nom_dist1_ep +
                   #Political distance to national party
                   nom_dist1_nat +
                   #Committee expert
                   ctee_expert +
                   #Time on the committee; offset() fixes the slope at 1
                   offset(log(months)),
                 family = "poisson",
                 df)

#Take 2
mod2.pois <- glm(nreports ~ 
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 offset = log(months),
                 family = "poisson",
              df)

The offset in a poisson regression is typically used when we want the coefficient of a predictor to represent a multiplicative effect on the rate rather than the count. This is common in scenarios such as rate per time or area where the exposure (e.g., time, area, population) can vary between observations.

In our case, we are interested in modeling the count of reports (events) relative to the exposure time (months served on the committee). Using log(months) as the offset keeps the relationship multiplicative when working on the log scale of the response in the poisson regression.

As we can see from comparing mod1.pois and mod2.pois, the regression coefficients change substantially. However, the intercept changed too. Since the model is a member of the exponential family with built-in interaction effects, it is hard to know how substantial the change in the coefficients really is.

stargazer(mod1.ols, mod2.ols, mod1.pois, mod2.pois,
          type = "html")
Dependent variable:            nreports    log(nreports+1)  nreports    nreports
                               OLS (1)     OLS (2)          Poisson (3) Poisson (4)
nom_dist1_ep                   -2.313***   -0.799***        -1.769***   -0.853***
                               (0.858)     (0.224)          (0.319)     (0.281)
nom_dist1_nat                  -1.928      -0.589*          -1.472***   -0.937**
                               (1.287)     (0.336)          (0.500)     (0.439)
ctee_expert                    0.771***    0.250***         0.360***    0.223***
                               (0.206)     (0.054)          (0.048)     (0.049)
Constant                       1.975***    0.772***         0.710***    -3.076***
                               (0.133)     (0.035)          (0.036)     (0.035)
Observations                   898         898              898         898
R2                             0.029       0.045
Adjusted R2                    0.025       0.042
Log Likelihood                                              -2,181.944  -1,868.368
Akaike Inf. Crit.                                           4,371.889   3,744.736
Residual Std. Error (df = 894) 2.858       0.746
F Statistic (df = 3; 894)      8.809***    14.041***
Note: *p<0.1; **p<0.05; ***p<0.01

Now, you only have to set the exposure if it is different between observations. If all observations have the same exposure, you can leave it untouched. R will assume your window of events is 1.

Interpretation

Interpreting the results is fairly straightforward. As per usual, it involves back-transforming the recoding of the dependent variable. In our case, the model has log-transformed the (expected) count in order to approximate the latent variable on which it runs the regression. Interpreting the marginal effects works the same way as in binomial logistic regression, while the predicted outcomes (and thus the first differences) are easier: it suffices to take the exponent.

Marginal effects

A one-unit change in x would change y by a factor of \(\exp(\beta)\).

exp(mod2.pois$coefficients["nom_dist1_ep"])
## nom_dist1_ep 
##    0.4260668

Once I’ve taken the exponent, I see that the factor is less than one; i.e. the number of legislative proposals decreases as the political distance increases. I’m then doing my usual trick to change this into a percentage change.

eff1 <- (1-exp(mod2.pois$coefficients["nom_dist1_ep"]))
eff1
## nom_dist1_ep 
##    0.5739332

A one-unit increase in the political distance decreases the number of legislative proposals handled by the MEP by 57%.

How likely is such a shift? That is, how much “elasticity” is there in my predictor?

To determine a likely shift in my x-variable – the political distance between the MEP and the group – I consider the interquartile range. That’s a shift across the 50% most likely values in the data.

#Consider
summary(df$nom_dist1_ep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.01700 0.03900 0.07476 0.08200 1.20500       7
#Interquartile range (1st to 3rd quartiles)
iqr <- IQR(df$nom_dist1_ep, na.rm = T)
iqr
## [1] 0.06499999

Hmmm… a one-unit change is extremely unlikely. In fact, a typical shift (the interquartile range) is 0.065. Let’s reformulate, then.

eff2 <- 1-(exp(mod2.pois$coefficients["nom_dist1_ep"] * iqr))
eff2
## nom_dist1_ep 
##   0.05394572

Right, so: comparing an MEP that is fairly close to the group (1st quartile) to an MEP that is fairly far from the group median (3rd quartile), I find that the most distant MEP tends to write 5% fewer proposals on behalf of the group than the ideologically close MEP.

Another scenario, given the topic, could be to compare someone that shares the group’s views completely (x = 0) to a distant MEP (3rd quartile).

#Consider
summary(df$nom_dist1_ep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.01700 0.03900 0.07476 0.08200 1.20500       7
#The 3rd quartile (75th percentile)
distance <- quantile(df$nom_dist1_ep, na.rm = T, probs = c(0.75))
distance
##   75% 
## 0.082
eff3 <- 1-(exp(mod2.pois$coefficients["nom_dist1_ep"] * distance))
eff3
## nom_dist1_ep 
##     0.067568

The relative effect is somewhat larger, but not massive.

First-differences

To calculate the first-difference, I’d have to set up some full-fledged scenarios, then fill in the equation for predictions, backtransform, then compare.

Scenario

I start by setting a scenario. Here, I use the same shift as before, but I also have to specify the control variables. As long as the glm recoding of the outcome involves a log transformation, all effects will be proportional to each other.

scenario <- data.frame(
  #Two scenarios: one close and one far away
  nom_dist1_ep = quantile(df$nom_dist1_ep, na.rm = T, probs = c(0, 0.75)),
  #Set the control to a typical value; here, the median
  nom_dist1_nat = median(df$nom_dist1_nat, na.rm = T),
  #A non-expert
  ctee_expert = 0,
  #NB: the offset variable also has to be in, but don't log transform!
  months = 60)

When you set your scenario, you will also have to include the offset if it is anything but 1. That is, if you specified an offset during the estimation, then your predictions will have to take it into account. This is because the offset determines the “window of opportunity” or exposure to an event, as a place-holder for the non-events we cannot observe.

Fill in the equation

I can fill in the equation “by hand”:

#The coefficients
a = mod2.pois$coefficients[1]
b1 = mod2.pois$coefficients[2]
b2 = mod2.pois$coefficients[3]
b3 = mod2.pois$coefficients[4]

#The equation; first scenario
pred1.hand <- 
  #intercept
  a * 1 +
  #first scenario, political distance == 0
  b1 * scenario[1,1]+
  #typical distance to national party
  b2 * scenario[1,2]+
  #committee expert
  b3 * scenario[1,3] +
  #The offset variable has a regression coefficient of 1
  1 * log(scenario[1,4])

#The equation; second scenario 
pred2.hand <- 
  #intercept
  a * 1 +
  #second scenario, political distance == 3rd quartile
  b1 * scenario[2,1]+
  #typical distance to national party
  b2 * scenario[2,2]+
  #committee expert
  b3 * scenario[2,3]  +
  #The offset variable has a regression coefficient of 1
  1 * log(scenario[2,4])

#backtransform
exp(pred1.hand); 
## (Intercept) 
##     2.74206
exp(pred2.hand)
## (Intercept) 
##    2.556784
#first difference
exp(pred2.hand) - exp(pred1.hand)
## (Intercept) 
##  -0.1852755

The predicted change in the outcome is next to zero when we consider a likely difference in x. It seems that the preference distance mostly poses a threat to political outliers in the group.

I can make these and similar predictions using the predict() function too.

predict(mod2.pois, scenario,  type = "response")
##       0%      75% 
## 2.742060 2.556784

Be aware that if you have set an offset, you’d have to add the exposure variable to your scenario; it’s a variable in its own right. However, the log transformation is already part of the model specification, so in the scenario the variable should be left untransformed…

preds1 <- predict(mod2.pois, scenario,  type = "response")
preds1
##       0%      75% 
## 2.742060 2.556784

Here, we see that the predicted number of reports for an MEP who is politically close to their EP group, reasonably close to their national party and not a committee expert, is 2.74. An equivalent MEP whose preferences are further away can expect to receive 2.56 reports. A substantial move in preferences therefore leads to a decrease of about 0.19 reports. This is a very small effect.

Another way of phrasing this is to say that roughly every fifth political outlier will lose out on one legislative report during the 60 months (exposure) they served in the EP.
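
To see where the “every fifth MEP” figure comes from, we can simply invert the first difference (using the preds1 predictions calculated above).

#One "lost" report for roughly every how many politically distant MEPs?
1/abs(diff(preds1))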

Effect plot with ggpredict

I can use the ggeffects package to plot the effect of political distance on influence. However, the package does not like it when we recode our variables in the equation. This is what I did when I set the offset. I therefore reestimate the model.

#A new offset variable
df$offset <- log(df$months)
#Re-estimate the model
mod2.pois <- glm(nreports ~ 
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 offset = offset,
                 family = "poisson",
              df)

I let the political distance slide across a reasonable spectrum.

library(ggeffects)
eff <- ggpredict(mod2.pois,
                 terms = list("nom_dist1_ep" = seq(0, 0.5, 0.01)))

Here is the “dirty” way of plotting.

eff %>% plot

R fits the y axis so that everything looks fantastic. I can build on this to create a better plot.

eff %>%
  ggplot +
  geom_ribbon(
    aes(
      ymax = conf.high,
      ymin = conf.low,
      x = x
    ),
    #Transparent ribbon of uncertainty
    alpha = 0.3
  ) +
  geom_line(
    aes(
      y = predicted,
      x = x
    )
  ) +
  ylim(0,3) +
  ylab("Number of legislative proposals") +
  xlab("Political distance from the leadership") +
  ggtitle("Effect of political distance on productivity (legislative proposals) among MEPs")

Model fit

Let’s assess the model’s fit (Ward and Ahlquist 2018, 197–201). This means comparing predicted and observed outcomes. This time, I am scouting for overdispersion, which is the one hitch with the poisson model.

Overdispersion happens when I predict too few high values and low values compared to the actual distribution of the outcome variable. Because the poisson distribution assumes a variance equal to the mean, my standard errors are going to be too small/underestimated. “Everything” becomes statistically significant, yet it would all be lies.
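
Before turning to plots, a quick numeric check (a sketch, not a formal test): the Pearson dispersion statistic, i.e. the sum of squared Pearson residuals divided by the residual degrees of freedom. Values well above 1 point towards overdispersion.

#Pearson dispersion statistic for the poisson model
sum(residuals(mod2.pois, type = "pearson")^2) / mod2.pois$df.residual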

Histograms

I usually start the really intuitive way. How does the distribution of my predictions compare to the distribution of the observed counts?

#Predict outcomes
df$preds_pois2 <- predict(mod2.pois, 
                          newdata = df,
                          #Passed positionally as the type argument: "response" gives exponentiated outcomes
                          "response")

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha=0.5)+
  #Predicted outcomes
  geom_histogram(aes(x = preds_pois2),
                 #transparent red color
                 fill = "red", 
                 alpha =  0.3)+
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (`stat_bin()`).

This looks dismal. We are predicting too few zero events and too few extreme counts. Clearly, the variation in the distribution of influence in the EP is larger than what our current model accounts for. It seems we are in the presence of overdispersion. It means that the standard errors are too small, since the poisson distribution has clear expectations about the variance.

Hanging rootogram

I can plot the same thing using a hanging rootogram (Ward and Ahlquist 2018). The countreg/topmodels package has a ready-made function for us. It is available for installation online.

#install.packages("distributions3"); hit "Yes" in R console
#install.packages("topmodels", repos="http://R-Forge.R-project.org")
library(topmodels)

#Rootogram
rootogram(mod1.pois,
          main = "Hanging rootogram of mod1.pois")

The plot gives us the same message. It uses the span of predicted counts to create the bars and compares, within each count, the predicted and observed frequencies. Once again, we underpredict zeros as well as extreme values: bars that hang below the zero line are counts we underpredict, while bars that never reach the zero line are counts we overpredict.

Addressing overdispersion

Overdispersion is not necessarily a big problem if i) we believe we don’t have an omitted variable bias and ii) we have described the data generating process correctly. If we believe both of these hold, we can simply make the standard errors larger. That is what the quasi-poisson model does (Ward and Ahlquist 2018, 202).

However, I usually take the opportunity to think more closely about the data generating process, i.e. I put on my polisci hat and start thinking about how the events occur. That’s what we will do in the following.

Quasi-poisson: fixing the standard errors

If I believe my model is an unbiased description of the phenomenon I want to study, it suffices to revise my certainty about the size of the predictors, without re-estimating them. The quasipoisson model adds an additional parameter \(\phi\) (“phi”) to the variance of the poisson model (Ward and Ahlquist 2018, 202). It describes the (over)dispersion and corrects the standard errors for it.

\(V(y_i) = \phi \lambda_i\)

We can easily run the model in base R by changing the probability distribution argument.

mod.quasipois <- glm(nreports ~ 
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert +
                   offset(log(months)),
                   #Specify quasipoisson
                 family = "quasipoisson",
              df)

summary(mod.quasipois)
## 
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert + 
##     offset(log(months)), family = "quasipoisson", data = df)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.07626    0.06155 -49.978  < 2e-16 ***
## nom_dist1_ep  -0.85316    0.48985  -1.742  0.08191 .  
## nom_dist1_nat -0.93748    0.76630  -1.223  0.22151    
## ctee_expert    0.22292    0.08472   2.631  0.00865 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.048904)
## 
##     Null deviance: 2263.5  on 897  degrees of freedom
## Residual deviance: 2222.0  on 894  degrees of freedom
##   (8 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Here, we see the dispersion is reported to be 3.0489.
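
If you want to extract the dispersion estimate \(\phi\) directly rather than read it off the printout, it is stored in the model summary.

#Extract the estimated dispersion parameter
summary(mod.quasipois)$dispersion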

The results are reported alongside the baseline model in table 3. We can see that the coefficients are the same, but now, the political distance between MEPs and their national parties is no longer statistically significant. In fact, the standard error is almost as big as the coefficient.

Adding predictors

I generally prefer taking the opportunity to think about my model choice instead of going for the post-hoc fix operated by the quasipoisson. An alternative route is to add predictors. Do not turn this into a garbage can by adding intermediate predictors and then controlling away your main effect. However, you can add predictors for two reasons:

  1. Additional predictors: variables that are correlated with the outcome, but not with the theoretical variable you are interested in. Here, I happen to know that some MEPs don’t generally take part in parliamentary work. They are not present during the plenary sessions, for example. If my theory is that group leaders select some committee members – and not that some committee members self-deselect – then this is an interesting control. However, there is no real reason to think that MEPs with extreme political views (compared to their party groups) are less present.
mod1.predictors <- glm(nreports ~ 
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert +
                     #Additional predictor
                     activity +
                   offset(log(months)),
                 family = "poisson",
              df)

The result is reported in the regression table. We see that the political distance is estimated to be (approximately) the same also after including the control.

  2. Confounders: In this particular case, I happen to know a few mechanisms that determine how many proposals an MEP could potentially be eligible for, and they might be correlated with the MEP’s ideological distance from their party: the leadership position and the committee on which the MEP serves.

First, legislative proposals that no one is interested in are taken on by the committee chair. It means that some of my extreme counts may be due to the committee chair doing his/her job of filling in for the other members. The committee chair is elected by all of the party groups, so this position may be reserved for people that are close to the median voter in the committee, rather than the party. That’s a real potential confounder.

Second, the Commission is more active in some policy areas than others. If parties are strategic, they will put eligible rapporteurs on the more active committees. In the second model, I add fixed effects on committees and add a predictor for the committee chair.

mod2.predictors <- glm(nreports ~ 
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         activity +
                         chair +
                         committee+
                         offset(log(months)),
                       #Specify poisson
                       family = "poisson",
                       df)

Let’s check the histogram again.

#Predict outcomes
df$preds_predictors2 <- predict(mod2.predictors, 
                                newdata = df,
                                #Passed positionally as the type argument: "response" gives exponentiated outcomes
                                "response")

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha=0.5)+
  #Predicted outcomes
  geom_histogram(aes(x = preds_predictors2),
                 #transparent red color
                 fill = "red", 
                 alpha =  0.3)+
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).

… and the rootogram.

#Rootogram
rootogram(mod2.predictors,
          main = "Hanging rootogram of mod2.predictors")

This looks so much better! However, I still have a problem predicting my zeros.
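
A rough way to put a number on this (a quick sketch that rounds the expected counts to the nearest integer) is to compare the number of observed zeros with the number of predicted zeros.

#Observed zeros vs. (rounded) predicted zeros
sum(df$nreports == 0, na.rm = TRUE)
sum(round(df$preds_predictors2) == 0, na.rm = TRUE)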

Comparing with the initial poisson model, we see that the significance level is lower for ideological distance (the ratio between the parameter and the error is smaller). However, it is still more precise than the initial quasipoisson model.

Important for my research question is also that I now have an insignificant effect of the alternative explanation that MEPs would have to be close to their national party to get influence; I find no support for this in the more appropriate model.

stargazer( mod1.pois, mod2.pois,
           mod.quasipois, 
           mod1.predictors, mod2.predictors,
           #don't report the fixed effects
           omit = "committee",
           #change to "text"
           type = "html")
Dependent variable: nreports
                   (1)         (2)         (3)           (4)         (5)
                   Poisson     Poisson     quasipoisson  Poisson     Poisson
nom_dist1_ep       -1.769***   -0.853***   -0.853*       -0.850***   -0.714**
                   (0.319)     (0.281)     (0.490)       (0.279)     (0.302)
nom_dist1_nat      -1.472***   -0.937**    -0.937        -0.497      -0.641
                   (0.500)     (0.439)     (0.766)       (0.412)     (0.439)
ctee_expert        0.360***    0.223***    0.223***      0.182***    0.200***
                   (0.048)     (0.049)     (0.085)       (0.049)     (0.052)
activity                                                 1.666***    1.746***
                                                         (0.186)     (0.189)
chair                                                                1.584***
                                                                     (0.072)
Constant           0.710***    -3.076***   -3.076***     -4.422***   -4.897***
                   (0.036)     (0.035)     (0.062)       (0.157)     (0.197)
Observations       898         898         898           896         896
Log Likelihood     -2,181.944  -1,868.368                -1,820.734  -1,505.961
Akaike Inf. Crit.  4,371.889   3,744.736                 3,651.468   3,055.922
Note: *p<0.1; **p<0.05; ***p<0.01

Negative binomial model

Another route is to opt for the negative binomial model (Ward and Ahlquist 2018, 202–205). The model doesn’t really have a single probability distribution; instead, it merges two parameters into one using the gamma distribution (\(\Gamma\)). It means that the functional form of the negative binomial model is not set in stone and can adjust empirically to the excess counts we have by creating a “fatter tail”.

However, this also means that we change the hypothesis. That is, the negative binomial distribution assumes that the probability of an additional count increases with the number of previous events. The best example is the soccer player who – having scored once – feels on top of things and continues to excel throughout the game, as (s)he accumulates successes.

In our case, we could imagine a world in which group leaders are rational time-savers. That is, when they receive legislative proposals, they will rely on people they already know; either because they are on the leader’s mind, or because they think there will be an economy of scale (for example).

To fit the negative binomial model, we need the MASS package. The regression function is slightly different, glm.nb(), but otherwise the specification is the same as in base R. (With reference to Ward and Ahlquist, this is an NB2 model.)

library(MASS)
mod2.nb <- glm.nb(nreports ~ 
                    nom_dist1_ep +
                    nom_dist1_nat +
                    ctee_expert +
                    activity + 
                    chair + 
                    committee +
                    offset(log(months)),
                    df)

The negative binomial model uses a second parameter (\(\alpha\)) to mediate between the mean (expected value) and the variance. Confusingly, Ward and Ahlquist use \(\alpha\), while R refers to \(\theta\).

Here, \(\theta = \alpha^{-1}\) is reported to be 2.6655782. This corresponds to \(\alpha = \frac{1}{\theta} =\) 0.3751531. As \(\alpha\) approaches zero (i.e. as \(\theta\) increases), the negative binomial and the poisson models become similar.
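
If you want these numbers directly rather than from the printed summary, the estimate is stored in the model object.

#R reports theta; alpha is its inverse
mod2.nb$theta
1/mod2.nb$theta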

The results are also reported in table 4.

stargazer(mod1.predictors, mod2.predictors,
          mod2.nb,
          omit = "committee",
          title = "The poisson vs. the negative binomial model",
          type = "html")
The poisson vs. the negative binomial model
Dependent variable: nreports
                   (1)         (2)         (3)
                   Poisson     Poisson     negative binomial
nom_dist1_ep       -0.850***   -0.714**    -0.761*
                   (0.279)     (0.302)     (0.390)
nom_dist1_nat      -0.497      -0.641      -0.220
                   (0.412)     (0.439)     (0.537)
ctee_expert        0.182***    0.200***    0.249***
                   (0.049)     (0.052)     (0.076)
activity           1.666***    1.746***    1.938***
                   (0.186)     (0.189)     (0.249)
chair                          1.584***    1.651***
                               (0.072)     (0.149)
Constant           -4.422***   -4.897***   -5.148***
                   (0.157)     (0.197)     (0.264)
Observations       896         896         896
Log Likelihood     -1,820.734  -1,505.961  -1,415.508
theta                                      2.666*** (0.346)
Akaike Inf. Crit.  3,651.468   3,055.922   2,875.015
Note: *p<0.1; **p<0.05; ***p<0.01
#Create a plot with one row and three columns
par(mfrow = c(1,3))
#Plot old model
rootogram(mod1.pois)

#Plot new model
rootogram(mod2.predictors)

#Plot new model
rootogram(mod2.nb)

#Return to initial parameters; one row and one column per plot
par(mfrow = c(1,1))

We can compare the models. It is clear that the more complex poisson model is better than the simple one. However, the negative binomial model is a bit of a mixed bag. I have more correct zero counts, and some more extreme counts (>15 reports).

Zero-inflated and hurdle models: a focus on the excess zeros

Another alternative is to consider whether we are, in fact, in the presence of two data generating processes: one accounting for the excess zeros, another for the positive counts (Ward and Ahlquist 2018, 207–14).

We can, for example, model some of the zero counts separately. These would be observations that – for some reason – do not respond to our predictors in the same way (“never takers”).

Effectively, we run two models in parallel: a binomial logistic regression modelling the “never takers”, and a count model (poisson or negative binomial) with the usual predictors.

Zero-inflated models

Here, I start with a poisson count model. I use the pscl package, which contains – among other things – functions for zero-inflated and hurdle models.

#install.packages("pscl")
library(pscl)
## Warning: package 'pscl' was built under R version 4.3.1

I can also make a theory out of this. Let’s say that there is a group of MEPs that are never takers; they’d rather be caught dead than write a legislative proposal. That is, before the group leader makes the selection, there is an initial stage of self (de-)selection into legislative work.

One predictor for this is the activity level of the MEP. The variable attendance captures the share of committee meetings that the member has attended. The idea is that the less you attend a committee, the more you opt out of legislative work.

I specify the predictors for the two models in the function and separate them with a vertical bar (“|”). Before the bar, I specify the predictors for the count part of the model. After the bar are the predictors for the zero/hurdle part (the binomial logistic model). Here, I specify that attendance is a predictor for whether an MEP is willing to engage in committee work, but not for how many positive counts we will see.

mod1.zero.pois <- zeroinfl(nreports ~ 
                             nom_dist1_ep + 
                             nom_dist1_nat +
                             chair +
                             ctee_expert +
                             #Remove the fixed effects for convenience
                             # committee +
                             offset(log(months))|attendance,
                           #Count model == poisson
                           dist = "poisson",
                    df)

summary(mod1.zero.pois)
## 
## Call:
## zeroinfl(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + chair + 
##     ctee_expert + offset(log(months)) | attendance, data = df, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0975 -0.7359 -0.3825  0.3695  9.2827 
## 
## Count model coefficients (poisson with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.92656    0.03987 -73.408   <2e-16 ***
## nom_dist1_ep  -0.39240    0.36559  -1.073   0.2831    
## nom_dist1_nat -0.75077    0.43319  -1.733   0.0831 .  
## chair          1.32410    0.07129  18.573   <2e-16 ***
## ctee_expert    0.08714    0.05080   1.715   0.0863 .  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.4276     0.2428   5.879 4.12e-09 ***
## attendance   -5.6132     0.5893  -9.525  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 14 
## Log-likelihood: -1548 on 7 Df

The attendance effect comes out statistically significant and substantively massive.

#Typical difference between low-attendance and assiduous members
change <- df$attendance %>%
  quantile(., c(0.2, .8), na.rm=T) %>% 
  diff %>%
  as.numeric()

change
## [1] 0.5028251
#Extract the relevant coefficient from the zero part of the model
coef <- mod1.zero.pois$coefficients$zero["attendance"]
effect <- exp(coef * change)-1
effect 
## attendance 
##  -0.940541

Comparing a low-attendee (i.e. low quantile) with an assiduous member (high quantile), I find that the odds of being a “never taker” – an MEP who never receives a legislative proposal from their group leader – decrease by 94%.
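
To move from odds to probabilities, we can fill in the binomial (zero) part of the model by hand; plogis() is the inverse logit. This is a quick sketch using the same attendance quantiles as above.

#Predicted probability of being a "never taker" at low vs. high attendance
att <- quantile(df$attendance, c(0.2, 0.8), na.rm = TRUE)
plogis(mod1.zero.pois$coefficients$zero["(Intercept)"] +
         mod1.zero.pois$coefficients$zero["attendance"] * att)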

Same predictors in both models

If I don’t specify anything else, R will use the same predictors in both parts of the model. We can take the explorative approach and look at what the model yields.

mod2.zero.pois <- zeroinfl(nreports ~ 
                             nom_dist1_ep +
                             nom_dist1_nat +
                             ctee_expert +
                             chair +
                             attendance +
                             offset(log(months)),
                           #Count model == poisson
                           dist = "poisson",
                           df)

# summary(mod2.zero.pois)

Something strange seems to happen. Attendance is clearly connected to receiving legislative influence. However, my theory about political proximity is challenged by the explorative approach: politically distant MEPs seem to be overrepresented among the never takers, while my theory linked the distance to the count part of the model.

Here, I do a negative binomial.

library(pscl)
mod1.zero.nb <- zeroinfl(nreports ~ 
                      nom_dist1_ep +
                      nom_dist1_nat +
                      ctee_expert +
                        chair +
                        # committee +
                    offset(log(months)) | attendance,
                    #Count model == negative binomial
                    dist = "negbin",
                    df)

Stargazer will report the results of either of the two component models, but not both at once. I use the zero.component = T argument to report the binomial model that predicts my zero events.

stargazer(mod1.zero.pois,mod2.zero.pois, mod1.zero.nb,
           omit = "committee",
          title = "The ZERO part of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Zero events (no reports)",
          zero.component = T,
          type = "html")
The ZERO part of the zero-inflated models: poisson vs. the negative binomial model
Zero events (no reports): nreports
                 (1)          (2)          (3)
nom_dist1_ep                  3.908***
                              (1.262)
nom_dist1_nat                 -1.957
                              (2.277)
ctee_expert                   -0.487
                              (0.379)
chair                         -12.719
                              (307.246)
attendance       -5.613***    -6.191***    -8.446***
                 (0.589)      (0.726)      (1.203)
Constant         1.428***     -2.423***    1.634***
                 (0.243)      (0.327)      (0.322)
Observations     885          885          885
Log Likelihood   -1,547.991   -1,536.608   -1,449.776
Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1.zero.pois,mod2.zero.pois, mod1.zero.nb,
           omit = "committee",
          title = "The COUNT PART of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Number of reports",
          zero.component = F,
          type = "html")
The COUNT PART of the zero-inflated models: poisson vs. the negative binomial model
Number of reports: nreports
                 (1)          (2)          (3)
nom_dist1_ep     -0.392       -0.029       -0.747*
                 (0.366)      (0.340)      (0.415)
nom_dist1_nat    -0.751*      -0.637       -0.565
                 (0.433)      (0.451)      (0.567)
chair            1.324***     1.338***     1.433***
                 (0.071)      (0.071)      (0.150)
attendance                    0.663***
                              (0.138)
ctee_expert      0.087*       0.072        0.118
                 (0.051)      (0.053)      (0.074)
Constant         -2.927***    -3.398***    -3.028***
                 (0.040)      (0.101)      (0.055)
Observations     885          885          885
Log Likelihood   -1,547.991   -1,536.608   -1,449.776
Note: *p<0.1; **p<0.05; ***p<0.01

Hurdle models

The hurdle models are similar. They assume that there is a certain threshold an observation has to pass in order to acquire positive counts.

This time, we use the hurdle() function to estimate the model.

mod1.hurdle <- hurdle(nreports ~ 
                      nom_dist1_ep +
                      nom_dist1_nat +
                      ctee_expert +
                        chair +
                        # committee +
                    offset(log(months)) | attendance,
                    #Count model: truncated poisson by default (negbin commented out)
                    # dist = "negbin",
                    df)

summary(mod1.hurdle)
## 
## Call:
## hurdle(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert + 
##     chair + offset(log(months)) | attendance, data = df)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3173 -0.7379 -0.4270  0.4003  8.7110 
## 
## Count model coefficients (truncated poisson with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.90541    0.04091 -71.011   <2e-16 ***
## nom_dist1_ep  -0.02383    0.36089  -0.066   0.9473    
## nom_dist1_nat -0.95820    0.52361  -1.830   0.0673 .  
## ctee_expert    0.06116    0.05263   1.162   0.2451    
## chair          1.29816    0.07142  18.177   <2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.5738     0.1771  -8.888   <2e-16 ***
## attendance    4.0378     0.3286  12.289   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 12 
## Log-likelihood: -1629 on 7 Df

As you can see from the regression table, these models often yield similar results.

stargazer(mod1.zero.pois,mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
           omit = "committee",
          title = "The ZERO part of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Zero events (no reports)",
          zero.component = T,
          type = "html")
The ZERO part of the zero-inflated models: poisson vs. the negative binomial model
Zero events (no reports): nreports
                 zero-inflated count data                  hurdle
                 (1)          (2)          (3)             (4)
nom_dist1_ep                  3.908***
                              (1.262)
nom_dist1_nat                 -1.957
                              (2.277)
ctee_expert                   -0.487
                              (0.379)
chair                         -12.719
                              (307.246)
attendance       -5.613***    -6.191***    -8.446***       4.038***
                 (0.589)      (0.726)      (1.203)         (0.329)
Constant         1.428***     -2.423***    1.634***        -1.574***
                 (0.243)      (0.327)      (0.322)         (0.177)
Observations     885          885          885             885
Log Likelihood   -1,547.991   -1,536.608   -1,449.776      -1,628.550
Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
           omit = "committee",
          title = "The COUNT PART of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Number of reports",
          zero.component = F,
          type = "html")
The COUNT PART of the zero-inflated models: poisson vs. the negative binomial model
Number of reports: nreports
                 zero-inflated count data                  hurdle
                 (1)          (2)          (3)             (4)
nom_dist1_ep     -0.392       -0.029       -0.747*         -0.024
                 (0.366)      (0.340)      (0.415)         (0.361)
nom_dist1_nat    -0.751*      -0.637       -0.565          -0.958*
                 (0.433)      (0.451)      (0.567)         (0.524)
chair            1.324***     1.338***     1.433***        1.298***
                 (0.071)      (0.071)      (0.150)         (0.071)
attendance                    0.663***
                              (0.138)
ctee_expert      0.087*       0.072        0.118           0.061
                 (0.051)      (0.053)      (0.074)         (0.053)
Constant         -2.927***    -3.398***    -3.028***       -2.905***
                 (0.040)      (0.101)      (0.055)         (0.041)
Observations     885          885          885             885
Log Likelihood   -1,547.991   -1,536.608   -1,449.776      -1,628.550
Note: *p<0.1; **p<0.05; ***p<0.01

Assessing the zeroinflated and hurdle models

Assessing and presenting the results from the excess-zero models is somewhat harder because the predicted outcomes are a function of two processes/models. Which of the two should we focus on? Or should we do both?

The binomial part (the zero part) reports the likelihood of a zero versus a positive count, and can be interpreted as such. The results of the count part of the model should be interpreted as the effects of the covariates conditional on being in the group of MEPs eligible for reports (given the binomial model).

Predictions

The predict() function will provide predictions as per usual. However, since the binomial model moderates the results of the count part, we should think about which of the models we’d like to present.

The type= argument lets you decide which outcome you’re interested in.

  • “count” reports the results from the count part of the model, conditional on passing the hurdle
  • “zero” reports the probability from the binomial model.

  • “response” will yield the total expected outcome (with decimals). However these predictions are done at the MEP-level, and may therefore deviate quite substantially from the observed counts, due to the interaction of the two models.

To compare the model predictions, it might be more useful to compare the probabilities of the counts, i.e. the probability of receiving 0, 1, 2 etc legislative proposals from the group leaders.

  • “prob” reports the probability of each possible count of reports

This last option is interesting. Now, the function yields not a vector of predictions but a matrix. Each column corresponds to the count categories (the nreports), while the observations (MEPs) are reported in the rows. The cells report the probability for each MEP to be in a specific category of counts.

#Predict outcomes as a matrix
preds_zero<- predict(mod1.zero.pois, 
                        newdata = df,
                        #Specify the probability of a count
                        "prob")

dim(preds_zero)
## [1] 906  23
head(preds_zero)
##            0          1         2         3          4           5           6
## 1 0.23082468 0.11318799 0.1749544 0.1802843 0.13933247 0.086146324 0.044385377
## 2 0.31116564 0.09609989 0.1520713 0.1604280 0.12693297 0.080344966 0.042380070
## 3 0.08108013 0.11723650 0.1926832 0.2111220 0.17349404 0.114057945 0.062486367
## 4 0.36665427 0.29673189 0.2023309 0.0919748 0.03135717 0.008552533 0.001943889
## 5 0.30150344 0.26567402 0.2256174 0.1277335 0.05423735 0.018423918 0.005215360
## 6 0.08743681 0.12699462 0.2011780 0.2124635 0.16828654 0.106636155 0.056309067
##              7            8            9           10           11           12
## 1 0.0196018149 7.574618e-03 2.601792e-03 8.043166e-04 2.260418e-04 5.823203e-05
## 2 0.0191609874 7.580226e-03 2.665593e-03 8.436225e-04 2.427223e-04 6.401514e-05
## 3 0.0293425747 1.205644e-02 4.403395e-03 1.447434e-03 4.325307e-04 1.184805e-04
## 4 0.0003787053 6.455641e-05 9.781935e-06 1.333991e-06 1.653819e-07 1.879467e-08
## 5 0.0012654348 2.686602e-04 5.070075e-05 8.611284e-06 1.329624e-06 1.881920e-07
## 6 0.0254862179 1.009347e-02 3.553228e-03 1.125766e-03 3.242505e-04 8.561000e-05
##             13           14           15           16           17           18
## 1 1.384755e-05 3.057732e-06 6.301769e-07 1.217578e-07 2.214126e-08 3.802632e-09
## 2 1.558453e-05 3.523058e-06 7.433320e-07 1.470338e-07 2.737300e-08 4.812870e-09
## 3 2.995811e-05 7.033918e-06 1.541406e-06 3.166708e-07 6.123082e-08 1.118172e-08
## 4 1.971602e-09 1.920522e-10 1.746047e-11 1.488209e-12 1.193831e-13 9.044787e-15
## 5 2.458732e-08 2.982887e-09 3.377528e-10 3.585358e-11 3.582096e-12 3.380012e-13
## 6 2.086442e-05 4.721753e-06 9.973260e-07 1.974887e-07 3.680598e-08 6.478449e-09
##             19           20           21           22
## 1 6.187072e-10 9.563339e-11 1.407812e-11 1.978228e-12
## 2 8.016870e-10 1.268613e-10 1.911895e-11 2.750398e-12
## 3 1.934488e-09 3.179413e-10 4.976667e-11 7.435787e-12
## 4 6.491913e-16 4.426604e-17 2.874613e-18 1.781905e-19
## 5 3.021469e-14 2.565912e-15 2.075276e-16 1.602163e-17
## 6 1.080295e-09 1.711345e-10 2.581924e-11 3.718312e-12

This option will give us a better idea about the model’s performance in the aggregate because we can assess the distribution of the outcomes. We can either identify the most likely outcome as if each count were a category, or simply calculate the aggregated probabilities for each category/outcome.

Most likely outcome

Here I sum over each column.

colSums(preds_zero, na.rm = T) 
##            0            1            2            3            4            5 
## 328.49388994 128.11649395 132.15558796 111.25136593  78.76002226  47.94733459 
##            6            7            8            9           10           11 
##  25.70487224  12.60178980   6.06443721   3.23055103   2.13098970   1.71105210 
##           12           13           14           15           16           17 
##   1.49721003   1.31128497   1.10656961   0.88807610   0.67564952   0.48748810 
##           18           19           20           21           22 
##   0.33411593   0.21797666   0.13564867   0.08068526   0.04595930
which.max(preds_zero[1,]) 
## 0 
## 1
max(preds_zero[906,])  == preds_zero[906,]
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE 
##    13    14    15    16    17    18    19    20    21    22 
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Most likely count category for each MEP (columns of preds_zero are the counts 0, 1, 2, ...)
df$preds_zero_cat <- apply(preds_zero, 1,
                           function(p) if (all(is.na(p))) NA else which.max(p) - 1)

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha=0.5)+
  #Predicted outcomes: the most likely count category from the zero-inflated model
  geom_histogram(aes(x = preds_zero_cat),
                 #transparent red color
                 fill = "red", 
                 alpha =  0.3)+
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")

Here, we reported the most likely count category for each MEP. As is clear from the overlay, my model doesn’t predict the categories very precisely. That is, the probability of an MEP to fall into a single count category of reports is not very precisely estimated. However, if we treat the outcome as metric rather than categorical, the perception might change.

Distribution of probabilities

We can simply compare the proportion of observed data in each category of nreports with the predicted probabilities. That way, we can take into account that an MEP has a fair chance of getting – say – 3-5 reports, but a low probability of getting 0 or 10.

To do so, I create two data sets – one with predicted values and one with observed values – then merge them. Last, I make a histogram.

#Data frame with predicted probabilities
predicted_prob <- 
  #the matrix from the predict function
  preds_zero %>%
  #make it into a data frame
  as.data.frame %>%
  #Take the mean of each column (the average predicted probability per count)
  reframe(across(everything(), \(x) mean(x, na.rm = TRUE))) %>%
  #Flip columns and rows to get a data set with two variables
  tidyr::pivot_longer(everything(),
                      names_to = "nreports",
                      values_to = "pred_prob") %>%
  #Recode to a numeric variable
  mutate(nreports = as.numeric(nreports))

#check
head(predicted_prob)
#Data frame with observed proportions
observed_prob <- 
  df %>%
  #Count the number of cases in each category
  group_by(nreports) %>%
  summarise(count = n()) %>%
  #Ungroup to consider all the data
  ungroup %>%
  #Calculate the total number of observations
  mutate(N = sum(count),
         #divide by all observations to get a proportion
         prob = count/N,
         #Recode count category to numeric
         nreports = as.numeric(nreports))


#Merge the two data frames
df_probs <- full_join(observed_prob, 
                      predicted_prob,
                      by = "nreports")

#Plot
df_probs %>%
  # You can filter away the zeros to zoom in on the counts
  # filter(nreports > 0 ) %>%
  ggplot +
  geom_histogram(aes(y = prob,
                     x = nreports),
                 stat = "identity") +
  #Red == predicted values
  geom_histogram(aes(y = pred_prob,
                     x = nreports),
                 stat = "identity",
                 fill = "red", alpha = 0.5) +
  #Cut the x-axis to the values you want to see
  coord_cartesian(xlim = c(0,25)) +
  ggtitle("Predicted vs. observed counts of legislative proposals",
          subtitle = "Results from a zeroinflated model") +
  ylab("Probability") +
  xlab("Number of legislative proposals (`reports`)") +
  theme_minimal()

Rootogram

The rootogram() function handles these models without any problems. The expected frequencies are based on the total expected counts, combining the two components of the model (the count part and the zero-inflation part).

rootogram(mod1.zero.pois)

Robust standard errors: when observations are clustered in groups

The standard errors reported in regression models assume that observations are independently drawn, i.e. that, conditional on the covariates, the outcome of one observation carries no information about the outcome of another. This is NOT an assumption particular to count models; it is an assumption most models share.

However, this is not always the case. Sometimes, our observations share common features. We may, for example, think that the number of legislative proposals an MEP garners is partly determined by the committee that the MEP is a member of. This potentially creates two problems. One is that committee membership may be a confounder for the relationship between MEPs' ideology and their probability of garnering legislative proposals. This could happen, for example, if MEPs with extreme political views were assigned to committees with little legislative activity (as a way for the leadership to side-line the extremists). We can address this by controlling for committee membership the way we did in the previous section. This will potentially change the regression parameters, if there is indeed such a relationship.

Another problem is that the MEPs are clustered in committees and therefore not independently drawn. This affects how we calculate the standard errors (i.e. the statistical significance) of our effects, but it does not change the parameter estimates as such. We can address it by clustering the standard errors: a way of adjusting the standard errors of the regression coefficients to account for correlation within clusters (here: committees), while assuming independence between clusters (here: between committees). In other words, we allow for the possibility that MEPs who sit on the same committee share a similar baseline probability of getting legislative proposals.
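
To get a sense of the clustering structure before adjusting anything, we can simply count how many MEPs sit on each committee. This is a minimal check using the committee variable from the data (the same variable we cluster on below):

#Number of MEPs per committee (cluster sizes)
df %>%
  count(committee) %>%
  arrange(desc(n))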

Normal standard errors

We can calculate the conventional ("normal") standard errors from the variance-covariance matrix of the model object, which we extract with the vcov() function. The matrix reports how the estimated regression parameters vary and covary.

vcov(mod2.zero.pois)
count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
count_(Intercept) 0.010 -0.005 -0.008 -0.001 -0.001 -0.013 0.011 -0.005 -0.015 -0.003 -0.003 -0.011
count_nom_dist1_ep -0.005 0.116 -0.024 0.001 0.000 -0.002 -0.002 0.092 0.030 0.002 -0.003 -0.015
count_nom_dist1_nat -0.008 -0.024 0.204 0.000 0.000 0.008 -0.015 -0.003 0.292 0.011 -0.014 0.012
count_ctee_expert -0.001 0.001 0.000 0.003 0.000 0.000 -0.002 0.000 0.006 0.006 -0.001 0.002
count_chair -0.001 0.000 0.000 0.000 0.005 0.000 0.000 0.003 -0.002 -0.001 0.002 -0.003
count_attendance -0.013 -0.002 0.008 0.000 0.000 0.019 -0.014 -0.005 0.010 0.001 0.003 0.022
zero_(Intercept) 0.011 -0.002 -0.015 -0.002 0.000 -0.014 0.107 -0.040 -0.189 -0.034 0.007 -0.181
zero_nom_dist1_ep -0.005 0.092 -0.003 0.000 0.003 -0.005 -0.040 1.592 -0.545 -0.003 0.030 -0.217
zero_nom_dist1_nat -0.015 0.030 0.292 0.006 -0.002 0.010 -0.189 -0.545 5.187 0.102 -0.244 0.200
zero_ctee_expert -0.003 0.002 0.011 0.006 -0.001 0.001 -0.034 -0.003 0.102 0.144 -0.035 0.005
zero_chair -0.003 -0.003 -0.014 -0.001 0.002 0.003 0.007 0.030 -0.244 -0.035 94400.207 -0.056
zero_attendance -0.011 -0.015 0.012 0.002 -0.003 0.022 -0.181 -0.217 0.200 0.005 -0.056 0.527

Standard errors are calculated as the square root of the diagonal of that matrix.

#Variance-covariance matrix
vcov(mod2.zero.pois) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
##   count_(Intercept)  count_nom_dist1_ep count_nom_dist1_nat   count_ctee_expert 
##          0.10079098          0.34019014          0.45140501          0.05283447 
##         count_chair    count_attendance    zero_(Intercept)   zero_nom_dist1_ep 
##          0.07137039          0.13777445          0.32682849          1.26191966 
##  zero_nom_dist1_nat    zero_ctee_expert          zero_chair     zero_attendance 
##          2.27742363          0.37903328        307.24616665          0.72611315

Can you recognize the results from the summary of mod2.zero.pois?

Clustered standard errors

We can estimate clustered standard errors by using the sandwich package in R.

#install.packages("sandwich")
library(sandwich)
## Warning: pakke 'sandwich' blev bygget under R version 4.3.1

The relevant function is vcovCL(). It takes an additional argument, cluster=, where we provide the variable we want to cluster the errors on. Here, I cluster by committee. For GLMs, the default type= of errors is "HC0", so it is applied automatically.

#Variance-covariance matrix
vcovCL(mod2.zero.pois,
     #Cluster according to committee membership
     cluster = df$committee,
     type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

Be mindful that if you have NAs among the model variables (as I do here), you have to subset the cluster variable so that it matches the observations actually used in the estimation. My hack is to identify those observations with the predict() function: the prediction is NA exactly where the model could not use the row.

#predict
df$preds <- predict(mod2.zero.pois, df)

#Clustered variance-covariance matrix
vcovCL(mod2.zero.pois,
     #Subset the cluster variable, then cluster
     cluster = df$committee[!is.na(df$preds)],
     type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
##   count_(Intercept)  count_nom_dist1_ep count_nom_dist1_nat   count_ctee_expert 
##           0.1999302           0.3590123           0.7006974           0.1399392 
##         count_chair    count_attendance    zero_(Intercept)   zero_nom_dist1_ep 
##           0.1522670           0.2343490           0.4476172           1.3484687 
##  zero_nom_dist1_nat    zero_ctee_expert          zero_chair     zero_attendance 
##           1.8209910           0.4929240           0.5422434           1.2752056
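
As a quick sanity check, the subsetted cluster variable should have exactly one entry per observation that was actually used in the estimation. A small sketch, reusing the preds column created above:

#One cluster value per estimation-sample observation?
length(df$committee[!is.na(df$preds)])

#Number of observations the model was actually fitted on
length(residuals(mod2.zero.pois))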

I now have regression coefficients from the model object and standard errors estimated separately.

#Copy standard errors to a separate object
ses.clustered <-
  vcovCL(mod2.zero.pois,
     #Subset, then cluster
     cluster = df$committee[!is.na(df$preds)]) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

#Summary statistics from the original model (for comparison)
summary.mod2.zero <- summary(mod2.zero.pois)
#Extract the standard errors of the count part of the model
ses.normal <- summary.mod2.zero$coefficients$count[,"Std. Error"]

When we supply our own standard errors to a table function such as stargazer, we usually do so as a list of named vectors. The names must match the names of the regression coefficients.

Here we see that the names of the standard errors are prefixed by count_ for the count part of the model and zero_ for the zero-inflation part.

ses.clustered
##   count_(Intercept)  count_nom_dist1_ep count_nom_dist1_nat   count_ctee_expert 
##           0.1999302           0.3590123           0.7006974           0.1399392 
##         count_chair    count_attendance    zero_(Intercept)   zero_nom_dist1_ep 
##           0.1522670           0.2343490           0.4476172           1.3484687 
##  zero_nom_dist1_nat    zero_ctee_expert          zero_chair     zero_attendance 
##           1.8209910           0.4929240           0.5422434           1.2752056

… however, the coefficients do not have these names.

mod2.zero.pois$coefficients
## $count
##   (Intercept)  nom_dist1_ep nom_dist1_nat   ctee_expert         chair 
##   -3.39832253   -0.02883625   -0.63717556    0.07216832    1.33814804 
##    attendance 
##    0.66311572 
## 
## $zero
##   (Intercept)  nom_dist1_ep nom_dist1_nat   ctee_expert         chair 
##    -2.4234745     3.9084822    -1.9573037    -0.4873135   -12.7187862 
##    attendance 
##    -6.1911903

I will have to change the names to get stargazer to link the coefficients and the standard errors correctly. I use the gsub() function (from our session on recoding of variables) to replace the string count_ with nothing ("").

names(ses.clustered) <- names(ses.clustered) %>%
  gsub("count_", "", .)
ses.clustered
##        (Intercept)       nom_dist1_ep      nom_dist1_nat        ctee_expert 
##          0.1999302          0.3590123          0.7006974          0.1399392 
##              chair         attendance   zero_(Intercept)  zero_nom_dist1_ep 
##          0.1522670          0.2343490          0.4476172          1.3484687 
## zero_nom_dist1_nat   zero_ctee_expert         zero_chair    zero_attendance 
##          1.8209910          0.4929240          0.5422434          1.2752056

Now, I can finally make my regression table. Here, I report the results from the original model and the results with the clustered errors.

stargazer(mod2.zero.pois, mod2.zero.pois,
          se = list(ses.normal, ses.clustered),
          title = "Normal vs. clustered standard errors",
          type = "html")
Normal vs. clustered standard errors

Dependent variable: nreports

|                | (1)               | (2)               |
|----------------|-------------------|-------------------|
| nom_dist1_ep   | -0.029 (0.340)    | -0.029 (0.359)    |
| nom_dist1_nat  | -0.637 (0.451)    | -0.637 (0.701)    |
| ctee_expert    | 0.072 (0.053)     | 0.072 (0.140)     |
| chair          | 1.338*** (0.071)  | 1.338*** (0.152)  |
| attendance     | 0.663*** (0.138)  | 0.663*** (0.234)  |
| Constant       | -3.398*** (0.101) | -3.398*** (0.200) |
| Observations   | 885               | 885               |
| Log Likelihood | -1,536.608        | -1,536.608        |

Note: *p<0.1; **p<0.05; ***p<0.01

I always check the results: the clustered errors should typically be larger than the normal errors, since the whole point of clustering is to acknowledge that my observations do not contribute as much independent information as their number suggests (i.e. there is less variation than we might think, because outcomes depend on committee membership).
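
One quick way to do this check in code is to take the ratio of the clustered to the conventional standard errors for the count part of the model; values above 1 mean that clustering inflates the uncertainty. A small sketch, reusing the objects created above:

#Ratio of clustered to conventional standard errors (count part)
ses.clustered[names(ses.normal)] / ses.normal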


Need a brush-up on robust/clustered standard errors? Robust standard errors and clustered standard errors are both methods for estimating standard errors that remain valid when the assumption of homoscedasticity (i.e., equal error variance) is violated. However, they are used in different contexts and rest on different assumptions.

Robust standard errors are used when the errors are believed to be heteroscedastic, but the source of the heteroscedasticity is unknown. In this case, the robust standard errors adjust the standard errors of the regression coefficients to account for the potential heteroscedasticity of the errors. Robust standard errors can be estimated using various methods, such as the sandwich estimator or the White estimator, and do not require any additional information about the structure of the data.

Clustered standard errors, on the other hand, are used when the errors are believed to be heteroscedastic and correlated within groups, but uncorrelated between groups. In this case, the groups can be defined in various ways, such as geographic regions, schools, or firms, and the clustered standard errors adjust the standard errors of the regression coefficients to account for the correlation of the errors within each group. The clustered standard errors can be estimated using the same methods as the robust standard errors, but they require information about the grouping structure of the data.
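
To make the distinction concrete, here is a minimal sketch with the sandwich package, reusing the model and the cluster variable from above: sandwich() returns the heteroscedasticity-robust ("sandwich") variance-covariance matrix without any clustering, while vcovCL() additionally allows the errors to be correlated within committees.

#Robust standard errors (no clustering)
sandwich(mod2.zero.pois) %>%
  diag %>%
  sqrt

#Clustered standard errors (clustering on committee, as before)
vcovCL(mod2.zero.pois,
       cluster = df$committee[!is.na(df$preds)]) %>%
  diag %>%
  sqrt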


Finally, here is a sketch of simulation-based predictions from a zero-inflated model (here called mod.zero, with five coefficients in the count part and two in the zero-inflation part). The idea is to draw coefficient vectors from the estimated sampling distribution (using mvrnorm() from MASS) and summarize the expected counts they imply across a set of scenarios.

#mvrnorm() comes from MASS
library(MASS)

#Scenarios for the count part: vary the ideological distance to the EP group,
#hold the other covariates at typical values
newdata1 <- data.frame(
  intercept1 = 1,
  nom_dist1_ep = seq(0, 1, 0.01),
  nom_dist1_nat = 0.028,
  ctee_expert = 0,
  chair = 0)

#Scenario for the zero-inflation part
newdata0 <- data.frame(
  intercept0 = 1,
  attendance = 0.7)
# predict(mod.zero, newdata = newdata)

#Draw 1000 coefficient vectors for the count part ...
sim1 <- mvrnorm(n = 1000, 
                mu = coefficients(mod.zero)[1:5], 
                Sigma = vcov(mod.zero)[1:5, 1:5])

#... and for the zero-inflation part
sim0 <- mvrnorm(n = 1000, 
                mu = coefficients(mod.zero)[6:7], 
                Sigma = vcov(mod.zero)[6:7, 6:7])

#Predicted probabilities for each count category from the zero-inflated model
prob <- predict(mod.zero, 
                newdata = cbind(newdata1, newdata0, months = log(60)) %>% as.data.frame(), 
                type = "prob")

# sim_outcomes <- rbinom(n, 1, prob[, 1]) * rpois(n, prob[, 2])

#Linear predictor for the count part: one column per simulated coefficient vector
preds <- as.matrix(newdata1) %*% t(sim1)

#Transform to expected counts
preds <- preds %>% exp()

#Summarize every scenario with the median and a 95% interval
pdf <- as.data.frame(t(apply(X = preds,
                             MARGIN = 1,
                             quantile, probs = c(.025, .5, .975))))

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge ; New York: Cambridge University Press.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.
Yoshinaka, Antoine, Gail Mcelroy, and Shaun Bowler. 2010. “The Appointment of Rapporteurs in the European Parliament.” Legislative Studies Quarterly 35 (4): 457–86. https://doi.org/10.3162/036298010793322384.