Models for event counts
In this session, we will be analyzing count outcomes. We will explore the baseline poisson model, test for overdispersion, and address potential problems.
We will familiarize ourselves with how to estimate different count models:
- poisson and quasi-poisson models from base R
- negative binomial models from MASS
- zero-inflated and hurdle models from pscl
- clustered standard errors from sandwich. While these are not specific to count models, we sometimes need to cluster our standard errors because our observations are clustered. You’ll find an example at the end of the script.
We will be working with data from Yoshinaka, Mcelroy, and Bowler (2010) on report allocation in the European Parliament (EP). All legislative proposals are handled by a member of one of the committees in the EP. This person (the “rapporteur”) is in charge of collecting information, negotiating a consensus between the party groups and with the other institutions (the Commission and the Council), and drafting the resulting parliamentary amendments on behalf of the committee. For any single policy proposal, the rapporteur is the single most influential member of parliament (MEP). The rapporteur is appointed by the group leader in the committee in charge of the proposal. Group leaders are committee members of the transnational party groups of MEPs: (transnational) parliamentary parties to which national parties adhere (or not).
Back in the day, it was a common belief that national parties were running the show in the EP, since they are in charge of candidate selection before elections, and because transnational groups have little presence outside of Parliament.
This article explores the argument that party leaders will prefer to appoint group members that are ideologically close to the transnational party median and contrasts this with the ideological proximity between MEPs and their national party group. The aim is to understand who allocates political influence in the European Parliament. The model also includes an indicator of whether being a policy expert is an asset.
We start by loading in the data and relevant packages.
library(dplyr); library(ggplot2)

#Load in data
load("df_yoshinaka.rda")
df <- df_yoshinaka
Introduction

The outcome variable of choice is the number of reports (i.e. legislative proposals, `nreports`) each MEP has garnered during the legislative period. We have, in other words, a data generating process in which events occur – in principle unrelated to each other – within a certain window/exposure time. This is the type of outcome for which we turn to the poisson model or one of its affiliates.
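For reference, the poisson distribution states that \(\Pr(y_i = k) = \frac{\lambda_i^{k} e^{-\lambda_i}}{k!}\), where the rate is modelled as \(\lambda_i = \exp(x_i^\top \beta)\). Its signature property is that \(E(y_i) = V(y_i) = \lambda_i\); we return to this below.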
hist(df$nreports,
xlab = "Number of reports (proposals)",
main = "Distribution of legislative proposals among MEPs")
#Number of observations
nrow(df)
## [1] 906
#Number of no events
sum(df$nreports==0)
## [1] 367
#Number of extreme events
quantile(df$nreports, 0.95)
## 95%
## 7
max(df$nreports)
## [1] 22
As we can see from the histogram, the distribution of legislative proposals is quite unequal. A fairly large number of MEPs (more than a third) never drafted a report at all, while the 5% most productive members drafted 7 or more reports. The most prolific MEP drafted no less than 22 proposals!
Overall, the variance in our outcome is higher than the mean.
#Mean
mean(df$nreports)
## [1] 1.970199
#Variance
var(df$nreports)
## [1] 8.340547
This is not uncommon for count data. The poisson distribution used in poisson models assumes that the mean is the same as the spread/variance of the conditional outcome. If our regression model contains enough predictors to explain this variance, the assumption may still hold conditionally; we will check once we have a model. As a first point of comparison, we estimate two OLS baselines: one on the raw counts and one on the log-transformed counts.
mod1.ols <- lm(nreports ~
                 nom_dist1_ep +
                 nom_dist1_nat +
                 ctee_expert,
               df)

mod2.ols <- lm(log(nreports + 1) ~
                 nom_dist1_ep +
                 nom_dist2_nat +
                 ctee_expert, df)
Baseline model: poisson
Estimation
Running the poisson model is simple. It is a GLM, and the `glm()` function is part of base R. The only thing we need to change is the probability distribution.
mod1.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 #tell R which probability distribution we use
                 family = "poisson",
                 df)

summary(mod1.pois)
##
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert,
## family = "poisson", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4100 -1.8955 -0.7374 0.6569 9.1278
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.70968 0.03609 19.666 < 2e-16 ***
## nom_dist1_ep -1.76930 0.31886 -5.549 2.88e-08 ***
## nom_dist1_nat -1.47164 0.50029 -2.942 0.00327 **
## ctee_expert 0.35966 0.04849 7.417 1.20e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2970.4 on 897 degrees of freedom
## Residual deviance: 2849.2 on 894 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 4371.9
##
## Number of Fisher Scoring iterations: 6
Exposure (the event window)
Thinking about the data generating process, I realize that MEPs don’t all serve entire terms in the European Parliament. Some quit before the end of the term, others swap committees. In other words, the exposure time varies. In the data, there is a variable called `months` that reports the number of months the member served on the committee.

If the exposure time is not the same across observations, the probability of observing any given number of events also varies. We could address this by controlling for the log-transformed variable. It is more common to use an offset. The offset is – for all practical purposes – a log-transformed control variable we include in the model, but for which the slope coefficient (parameter) is fixed at 1.
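In sketch form, the count model with an offset is \(\log E(y_i) = x_i^\top \beta + \log(\text{months}_i)\); it is equivalent to adding \(\log(\text{months}_i)\) as a control whose coefficient is constrained to 1.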
In R, I can do this by either using the `offset()` function inside the formula or the `offset=` argument in `glm()`.
#Take 1
mod2.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 family = "poisson",
                 offset = log(months),
                 df)

#Take 2
mod2.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert + offset(log(months)),
                 family = "poisson",
                 df)
As we can see from comparing mod1.pois and mod2.pois, the regression coefficients change substantially. However, the intercept changed too. Since the model is a member of the exponential family with built-in interaction effects, it is hard to know from the coefficients alone how substantial the change really is.
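One quick aid – a sketch, assuming `mod2.pois` is still in memory – is to exponentiate the coefficients into incidence rate ratios, i.e. multiplicative effects on the expected count:

#Incidence rate ratios: multiplicative effects on the expected count
exp(coef(mod2.pois))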
library(stargazer)

stargazer(mod1.ols, mod2.ols, mod1.pois, mod2.pois,
          type = "html")
| | (1) OLS: nreports | (2) OLS: log(nreports + 1) | (3) Poisson: nreports | (4) Poisson: nreports |
|---|---|---|---|---|
| nom_dist1_ep | -2.313*** (0.858) | -0.815*** (0.220) | -1.769*** (0.319) | -0.853*** (0.281) |
| nom_dist1_nat | -1.928 (1.287) | | -1.472*** (0.500) | -0.937** (0.439) |
| nom_dist2_nat | | -0.780** (0.346) | | |
| ctee_expert | 0.771*** (0.206) | 0.245*** (0.054) | 0.360*** (0.048) | 0.223*** (0.049) |
| Constant | 1.975*** (0.133) | 0.786*** (0.036) | 0.710*** (0.036) | -3.076*** (0.035) |
| Observations | 898 | 898 | 898 | 898 |
| R2 | 0.029 | 0.047 | | |
| Adjusted R2 | 0.025 | 0.044 | | |
| Log Likelihood | | | -2,181.944 | -1,868.368 |
| Akaike Inf. Crit. | | | 4,371.889 | 3,744.736 |
| Residual Std. Error (df = 894) | 2.858 | 0.745 | | |
| F Statistic (df = 3; 894) | 8.809*** | 14.736*** | | |

Note: *p<0.1; **p<0.05; ***p<0.01
Model fit
Let’s assess the model’s fit. This means comparing predicted and observed outcomes. This time, I am scouting for overdispersion.
Histograms
I usually start in the really intuitive way: how does the distribution of my predictions compare to the distribution of the observed counts?
#Predict outcomes
df$preds_pois2 <- predict(mod2.pois,
                          newdata = df,
                          #the unnamed third argument is the type; "response" gives exponentiated outcomes
                          "response")
#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_pois2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (`stat_bin()`).
This looks dismal. We are predicting too few zero events and too few extreme counts. Clearly, the variation in the distribution of influence in the EP is larger than what our current model accounts for. It seems we are in the presence of overdispersion. This means that the standard errors are too small, since the poisson distribution has strict expectations about the variance.
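If you want a formal check to back up the eyeball test, the `AER` package has a score test of the equidispersion assumption; a minimal sketch, assuming the package is installed:

#install.packages("AER")
library(AER)
#H0: equidispersion; the alternative is overdispersion
dispersiontest(mod1.pois)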
Hanging rootogram
I can plot the same thing using hanging rootograms (Ward and Ahlquist 2018). The `countreg` package has a ready-made function for us. It is available for installation from R-Forge.
#install.packages("countreg", repos="http://R-Forge.R-project.org")
#Rootogram
rootogram(mod1.pois,
main = "Hanging rootogram of mod1.pois")
The plot gives us the same message: the bars (observed frequencies) hang from the curve of predicted frequencies, so for each count we can see whether the model over- or underpredicts. Once again, we underpredict zeros as well as extreme values.
Addressing overdispersion
Overdispersion is not necessarily a big problem if i) we don’t have an omitted variable bias and ii) we have described the data generating process correctly. If we believe both conditions hold, we can simply make the standard errors larger. That is what the quasi-poisson model does.
However, I usually take the opportunity to think more closely about the data generating process, i.e. I put on my polisci hat and start thinking about how the events occur. That’s what we will do in the following.
Quasi-poisson: fixing the standard errors
If I believe my model is an unbiased description of the phenomenon I want to study, it suffices to revise my certainty about the estimates, without re-estimating them. The quasipoisson model adds an additional dispersion parameter \(\phi\) to the variance of the poisson model (Ward and Ahlquist 2018, 202). It describes the (over)dispersion and corrects the standard errors for it:

\(V(y_i) = \phi \lambda_i\)
We can easily run the model in base R by changing the probability distribution argument.
mod.quasipois <- glm(nreports ~
                       nom_dist1_ep +
                       nom_dist1_nat +
                       ctee_expert + offset(log(months)),
                     #Specify quasipoisson
                     family = "quasipoisson",
                     df)

summary(mod.quasipois)
##
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## offset(log(months)), family = "quasipoisson", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6217 -1.5013 -0.5819 0.3392 7.8374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.07626 0.06155 -49.978 < 2e-16 ***
## nom_dist1_ep -0.85316 0.48985 -1.742 0.08191 .
## nom_dist1_nat -0.93748 0.76630 -1.223 0.22151
## ctee_expert 0.22292 0.08472 2.631 0.00865 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.048904)
##
## Null deviance: 2263.5 on 897 degrees of freedom
## Residual deviance: 2222.0 on 894 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Here, we see the dispersion parameter is reported as 3.048904.
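We can recover (approximately) the same number by hand as the Pearson statistic divided by the residual degrees of freedom; a sketch, assuming the poisson fit `mod2.pois` with the same specification is in memory:

#Pearson-based estimate of the dispersion parameter
sum(residuals(mod2.pois, type = "pearson")^2) / df.residual(mod2.pois)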
The results are reported alongside the baseline models in table 3. We can see that the coefficients are the same, but now the political distance between MEPs and their national parties is no longer statistically significant. In fact, the standard error is almost as big as the coefficient.
Adding predictors
I generally prefer taking the opportunity to think about my model choice instead of going for the post-hoc fix operated by the quasipoisson. An alternative route is to add predictors. Do not turn this into a garbage-can model by adding intermediate predictors and controlling away your main effect. However, you can add predictors for two reasons:
- Additional predictors: variables that are correlated with the outcome, but not with the theoretical variable you are interested in. Here, I happen to know that some MEPs don’t generally take part in parliamentary work. They are not present during the plenary sessions, for example. If my theory is that group leaders deselect some committee members – and not that some committee members self-deselect – then this is an interesting control. However, there is no real reason to think that MEPs with extreme political views (compared to their party groups) are less present.
mod1.predictors <- glm(nreports ~
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         activity +
                         # chair +
                         offset(log(months)),
                       #Specify poisson
                       family = "poisson",
                       df)
The result is reported in the regression table. We see that the political distance is estimated to be (approximately) the same also after including the control.
- Confounders: variables you discover to be correlated with both the outcome and your theoretical variable of interest.
In this particular case, I happen to know a few mechanisms that determine how many proposals an MEP could potentially be eligible for. They might be correlated with their ideological distance from their party: the leadership position and the committee on which the MEP serves.
First, legislative proposals that no one is interested in handling are taken on by the committee chair. It means that some of my extreme counts may be due to the committee chair doing his/her job of filling in for the other members. The committee chair is elected by all of the party groups, so this position may be reserved for people that are close to the median voter in the committee, rather than the party.

Second, the Commission is more active in some policy areas than others. If parties are strategic, they will put eligible rapporteurs on the more active committees. In the second model, I add fixed effects for committees and a predictor for the committee chair.
mod2.predictors <- glm(nreports ~
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         activity +
                         chair +
                         committee + offset(log(months)),
                       #Specify poisson
                       family = "poisson",
                       df)
Let’s check the histogram again.
#Predict outcomes
df$preds_predictors2 <- predict(mod2.predictors,
                                newdata = df,
                                #we want exponentiated outcomes
                                "response")

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_predictors2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).
… and the rootogram.
#Rootogram
rootogram(mod2.predictors,
main = "Hanging rootogram of mod2.predictors")
This looks so much better! I still have a problem predicting my zeros, though.

Comparing with the initial poisson model, we see that the significance level is lower for ideological distance (the ratio between the parameter and the error is smaller). However, the estimate is still more precise than in the initial quasipoisson model.
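To put a number on the improvement – a sketch, assuming both extended models are in memory (they are fit on the same 896 observations) – we can compare information criteria:

#Lower AIC indicates a better fit
AIC(mod1.predictors, mod2.predictors)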
stargazer(mod1.pois, mod2.pois,
          mod.quasipois,
          mod1.predictors, mod2.predictors,
          omit = "committee",
          type = "html")
Dependent variable: nreports.

| | (1) Poisson | (2) Poisson | (3) quasipoisson | (4) Poisson | (5) Poisson |
|---|---|---|---|---|---|
| nom_dist1_ep | -1.769*** (0.319) | -0.853*** (0.281) | -0.853* (0.490) | -0.850*** (0.279) | -0.714** (0.302) |
| nom_dist1_nat | -1.472*** (0.500) | -0.937** (0.439) | -0.937 (0.766) | -0.497 (0.412) | -0.641 (0.439) |
| ctee_expert | 0.360*** (0.048) | 0.223*** (0.049) | 0.223*** (0.085) | 0.182*** (0.049) | 0.200*** (0.052) |
| activity | | | | 1.666*** (0.186) | 1.746*** (0.189) |
| chair | | | | | 1.584*** (0.072) |
| Constant | 0.710*** (0.036) | -3.076*** (0.035) | -3.076*** (0.062) | -4.422*** (0.157) | -4.897*** (0.197) |
| Observations | 898 | 898 | 898 | 896 | 896 |
| Log Likelihood | -2,181.944 | -1,868.368 | | -1,820.734 | -1,505.961 |
| Akaike Inf. Crit. | 4,371.889 | 3,744.736 | | 3,651.468 | 3,055.922 |

Committee fixed effects omitted from the table. Note: *p<0.1; **p<0.05; ***p<0.01
Negative binomial model
Another route is to opt for the negative binomial model (Ward and Ahlquist 2018, pp. 202-205). The model doesn’t really have a single probability distribution; instead, it merges two parameters into one using the gamma distribution (\(\Gamma\)). It means that the functional form of the negative binomial model is not set in stone and can adjust empirically to the excess counts that we have by creating a “fatter tail”.

However, this also means that we change the hypothesis. That is, the negative binomial distribution allows the probability of an additional count to increase with the number of previous events. The best example is the soccer player who – having scored once – feels on top of things and continues to excel throughout the game, as (s)he accumulates successes.
In our case, we could imagine a world in which group leaders are rational time-savers. That is, when they receive legislative proposals, they will rely on people they already know; either because they are on the leader’s mind, or because they think there will be an economy of scale (for example).
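In the parameterization used by `MASS::glm.nb`, the variance is \(V(y_i) = \mu_i + \mu_i^2/\theta\): the smaller the \(\theta\), the fatter the tail, and as \(\theta \rightarrow \infty\) we recover the poisson.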
To fit the negative binomial model, we turn to the `MASS` package; the same one we use for our simulations. The function is `glm.nb()`, but otherwise the specification is the same.
library(MASS)

mod1.nb <- glm.nb(nreports ~
                    nom_dist1_ep +
                    nom_dist1_nat +
                    ctee_expert +
                    activity +
                    chair +
                    committee + offset(log(months)),
                  df)
The results are reported in table 4.
stargazer(mod1.predictors, mod2.predictors,
          mod1.nb,
          omit = "committee",
          title = "The poisson vs. the negative binomial model",
          type = "html")
Dependent variable: nreports.

| | (1) Poisson | (2) Poisson | (3) negative binomial |
|---|---|---|---|
| nom_dist1_ep | -0.850*** (0.279) | -0.714** (0.302) | -0.761* (0.390) |
| nom_dist1_nat | -0.497 (0.412) | -0.641 (0.439) | -0.220 (0.537) |
| ctee_expert | 0.182*** (0.049) | 0.200*** (0.052) | 0.249*** (0.076) |
| activity | 1.666*** (0.186) | 1.746*** (0.189) | 1.938*** (0.249) |
| chair | | 1.584*** (0.072) | 1.651*** (0.149) |
| Constant | -4.422*** (0.157) | -4.897*** (0.197) | -5.148*** (0.264) |
| Observations | 896 | 896 | 896 |
| Log Likelihood | -1,820.734 | -1,505.961 | -1,415.508 |
| theta | | | 2.666*** (0.346) |
| Akaike Inf. Crit. | 3,651.468 | 3,055.922 | 2,875.015 |

Committee fixed effects omitted from the table. Note: *p<0.1; **p<0.05; ***p<0.01
#Create a plot window with one row and three columns
par(mfrow = c(1,3))
#Plot the simple poisson model
rootogram(mod1.pois)
#Plot the extended poisson model
rootogram(mod2.predictors)
#Plot the negative binomial model
rootogram(mod1.nb)
#Return to initial parameters; one row and one column per plot
par(mfrow = c(1,1))
We can compare the models. It is clear that the more complex poisson model is better than the simple one. However, the negative binomial model is a bit of a mixed bag: I get more correct zero counts, and some more extreme counts (>15 reports).
Zero-inflated and hurdle models: a focus on the excess zeros
Another alternative is to consider whether we are, in fact, in the presence of two data generating processes: one accounting for the excess zeros, another for the positive counts.

That is, some of the zero counts would come from observations that – for some reason – do not respond to our predictors in the same way (“never takers”).

Effectively, we run two models in parallel: a binomial logistic regression modelling the “never takers”, and a count model (poisson or negative binomial) with the usual predictors.
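In sketch form, the zero-inflated poisson mixes a point mass at zero with a poisson: \(\Pr(y_i = 0) = \pi_i + (1 - \pi_i) e^{-\lambda_i}\) and \(\Pr(y_i = k) = (1 - \pi_i) \frac{\lambda_i^{k} e^{-\lambda_i}}{k!}\) for \(k > 0\), where \(\pi_i\) comes from the logit part and \(\lambda_i\) from the count part.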
Zero-inflated models
Here, I start with a poisson count model. The package I use is `pscl`, which contains – among other things – functions for zero-inflated and hurdle models.
#install.packages("pscl")
library(pscl)
I can also make a theory out of this. Let’s say that there is a group of MEPs that are never takers; they’d rather be caught dead than write a legislative proposal. That is, before the group leader makes the selection, there is an initial stage of self-(de)selection into legislative work.
One predictor for this is the activity level of the MEP. The variable `attendance` captures the share of committee meetings that the member has attended. The idea is that the less you attend a committee, the more you opt out of legislative work.
I specify the predictors for the two models in the function and separate them with a vertical bar (“|”). Before the bar, I specify the predictors for the count part of the model. After the bar are the predictors for the zero/hurdle part (the binomial logistic model). Here, I specify that `attendance` is a predictor for whether an MEP is willing to engage in committee work, but not for how many positive counts we will see.
mod1.zero.pois <- zeroinfl(nreports ~
                             nom_dist1_ep +
                             nom_dist1_nat +
                             chair +
                             ctee_expert +
                             committee + offset(log(months)) | attendance,
                           #Count model == poisson
                           dist = "poisson",
                           df)
The effect of attendance comes out statistically significant and substantively massive.
#Typical difference between assiduous members and less active members
change <- df$attendance %>%
  quantile(., c(0.2, .8), na.rm = T) %>%
  diff %>%
  as.numeric()

change

## [1] 0.5028251

#Extract the relevant predictor from the zero part of the model
coef <- mod1.zero.pois$coefficients$zero["attendance"]
effect <- exp(coef * change) - 1

effect

## attendance
## -0.9585354
Comparing a low-attendance member (20th percentile) with an assiduous member (80th percentile), I find a 96% decrease in the relative odds of being a “never taker”; that is, assiduous members are far more likely to get at least one legislative proposal from their group leader.
Same predictors in both models
If I don’t specify anything else, R will use the same predictors for both models. We can take the explorative approach and look at what the model yields.
mod2.zero.pois <- zeroinfl(nreports ~
                             nom_dist1_ep +
                             nom_dist1_nat +
                             ctee_expert +
                             chair +
                             attendance + offset(log(months)),
                           #Count model == poisson
                           dist = "poisson",
                           df)
summary(mod2.zero.pois)
##
## Call:
## zeroinfl(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## chair + attendance + offset(log(months)), data = df, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.8074 -0.7864 -0.3954 0.3415 8.6202
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.39832 0.10079 -33.717 < 2e-16 ***
## nom_dist1_ep -0.02884 0.34019 -0.085 0.932
## nom_dist1_nat -0.63718 0.45141 -1.412 0.158
## ctee_expert 0.07217 0.05283 1.366 0.172
## chair 1.33815 0.07137 18.749 < 2e-16 ***
## attendance 0.66312 0.13777 4.813 1.49e-06 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4235 0.3268 -7.415 1.22e-13 ***
## nom_dist1_ep 3.9085 1.2619 3.097 0.00195 **
## nom_dist1_nat -1.9573 2.2774 -0.859 0.39010
## ctee_expert -0.4873 0.3790 -1.286 0.19856
## chair -12.7188 307.2462 -0.041 0.96698
## attendance -6.1912 0.7261 -8.526 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 51
## Log-likelihood: -1537 on 12 Df
Something strange seems to happen. Attendance is clearly connected to receiving legislative influence. However, my theory about political proximity is challenged by the explorative approach: MEPs who are ideologically distant from their EP group seem to be overrepresented among the never takers, while my theory linked proximity to the count part.
Here, I estimate the same model with a negative binomial count distribution.
library(pscl)

mod1.zero.nb <- zeroinfl(nreports ~
                           nom_dist1_ep +
                           nom_dist1_nat +
                           ctee_expert +
                           chair +
                           committee + offset(log(months)) | attendance,
                         #Count model == negative binomial
                         dist = "negbin",
                         df)
Stargazer will report the results of either of the two component models, but not both at once. I use the argument `zero.component = T` to report the binomial model that predicts my zero events.
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb,
          omit = "committee",
          title = "The ZERO part of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Zero events (no reports)",
          zero.component = T,
          type = "html")
Zero events (no reports): nreports.

| | (1) | (2) | (3) |
|---|---|---|---|
| nom_dist1_ep | | 3.908*** (1.262) | |
| nom_dist1_nat | | -1.957 (2.277) | |
| ctee_expert | | -0.487 (0.379) | |
| chair | | -12.719 (307.246) | |
| attendance | -6.330*** (0.689) | -6.191*** (0.726) | -8.179*** (1.153) |
| Constant | 1.451*** (0.261) | -2.423*** (0.327) | 1.625*** (0.317) |
| Observations | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 |

Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb,
          omit = "committee",
          title = "The COUNT part of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Number of reports",
          zero.component = F,
          type = "html")
Number of reports: nreports.

| | (1) | (2) | (3) |
|---|---|---|---|
| nom_dist1_ep | -0.358 (0.335) | -0.029 (0.340) | -0.488 (0.395) |
| nom_dist1_nat | -0.780 (0.518) | -0.637 (0.451) | -0.700 (0.543) |
| chair | 1.347*** (0.072) | 1.338*** (0.071) | 1.413*** (0.125) |
| attendance | | 0.663*** (0.138) | |
| ctee_expert | 0.131** (0.055) | 0.072 (0.053) | 0.154** (0.072) |
| Constant | -3.270*** (0.116) | -3.398*** (0.101) | -3.391*** (0.147) |
| Observations | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 |

Note: *p<0.1; **p<0.05; ***p<0.01
Hurdle models
The hurdle models are similar. They assume that there is a certain threshold an observation has to pass in order to acquire positive counts.
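In sketch form: \(\Pr(y_i = 0) = 1 - p_i\) and \(\Pr(y_i = k) = p_i \cdot \frac{f(k; \lambda_i)}{1 - f(0; \lambda_i)}\) for \(k > 0\), where \(p_i\) is the probability of crossing the hurdle (a binomial logistic model) and \(f\) is a count distribution truncated at zero. Unlike in the zero-inflated model, all zeros come from the binomial stage.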
This time, we use the `hurdle()` function to estimate the model.
mod1.hurdle <- hurdle(nreports ~
                        nom_dist1_ep +
                        nom_dist1_nat +
                        ctee_expert +
                        chair +
                        committee + offset(log(months)) | attendance,
                      #Count model: truncated poisson by default
                      # dist = "negbin",
                      df)
summary(mod1.hurdle)
##
## Call:
## hurdle(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## chair + committee + offset(log(months)) | attendance, data = df)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.3664 -0.7268 -0.3941 0.5516 5.8558
##
## Count model coefficients (truncated poisson with log link):
## Estimate
## (Intercept) -3.25496
## nom_dist1_ep 0.01663
## nom_dist1_nat -1.05261
## ctee_expert 0.09707
## chair 1.33031
## committeeCommittee on Budgetary Control 0.42114
## committeeCommittee on Budgets 0.92462
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 0.43619
## committeeCommittee on Constitutional Affairs -0.01209
## committeeCommittee on Culture, Youth, Education, the Media and Sport -0.30796
## committeeCommittee on Development and Cooperation -0.56570
## committeeCommittee on Economic and Monetary Affairs 0.48585
## committeeCommittee on Employment and Social Affairs -0.25478
## committeeCommittee on Fisheries 0.59848
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 0.33292
## committeeCommittee on Industry, External Trade, Research and Energy 0.44134
## committeeCommittee on Legal Affairs and the Internal Market 0.90077
## committeeCommittee on Petitions -0.92295
## committeeCommittee on Regional Policy, Transport and Tourism 0.12009
## committeeCommittee on the Environment, Public Health and Consumer Policy 0.57760
## committeeCommittee on Women's Rights and Equal Opportunities -0.20261
## Std. Error
## (Intercept) 0.12747
## nom_dist1_ep 0.40082
## nom_dist1_nat 0.59465
## ctee_expert 0.05842
## chair 0.07307
## committeeCommittee on Budgetary Control 0.19313
## committeeCommittee on Budgets 0.14955
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 0.15040
## committeeCommittee on Constitutional Affairs 0.20074
## committeeCommittee on Culture, Youth, Education, the Media and Sport 0.21052
## committeeCommittee on Development and Cooperation 0.23656
## committeeCommittee on Economic and Monetary Affairs 0.15068
## committeeCommittee on Employment and Social Affairs 0.18518
## committeeCommittee on Fisheries 0.16468
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 0.14801
## committeeCommittee on Industry, External Trade, Research and Energy 0.14729
## committeeCommittee on Legal Affairs and the Internal Market 0.14279
## committeeCommittee on Petitions 0.35093
## committeeCommittee on Regional Policy, Transport and Tourism 0.15077
## committeeCommittee on the Environment, Public Health and Consumer Policy 0.13935
## committeeCommittee on Women's Rights and Equal Opportunities 0.23147
## z value
## (Intercept) -25.535
## nom_dist1_ep 0.041
## nom_dist1_nat -1.770
## ctee_expert 1.662
## chair 18.206
## committeeCommittee on Budgetary Control 2.181
## committeeCommittee on Budgets 6.183
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 2.900
## committeeCommittee on Constitutional Affairs -0.060
## committeeCommittee on Culture, Youth, Education, the Media and Sport -1.463
## committeeCommittee on Development and Cooperation -2.391
## committeeCommittee on Economic and Monetary Affairs 3.224
## committeeCommittee on Employment and Social Affairs -1.376
## committeeCommittee on Fisheries 3.634
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 2.249
## committeeCommittee on Industry, External Trade, Research and Energy 2.996
## committeeCommittee on Legal Affairs and the Internal Market 6.308
## committeeCommittee on Petitions -2.630
## committeeCommittee on Regional Policy, Transport and Tourism 0.797
## committeeCommittee on the Environment, Public Health and Consumer Policy 4.145
## committeeCommittee on Women's Rights and Equal Opportunities -0.875
## Pr(>|z|)
## (Intercept) < 2e-16
## nom_dist1_ep 0.966905
## nom_dist1_nat 0.076702
## ctee_expert 0.096596
## chair < 2e-16
## committeeCommittee on Budgetary Control 0.029215
## committeeCommittee on Budgets 6.30e-10
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 0.003729
## committeeCommittee on Constitutional Affairs 0.951973
## committeeCommittee on Culture, Youth, Education, the Media and Sport 0.143509
## committeeCommittee on Development and Cooperation 0.016786
## committeeCommittee on Economic and Monetary Affairs 0.001263
## committeeCommittee on Employment and Social Affairs 0.168856
## committeeCommittee on Fisheries 0.000279
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 0.024495
## committeeCommittee on Industry, External Trade, Research and Energy 0.002731
## committeeCommittee on Legal Affairs and the Internal Market 2.82e-10
## committeeCommittee on Petitions 0.008537
## committeeCommittee on Regional Policy, Transport and Tourism 0.425726
## committeeCommittee on the Environment, Public Health and Consumer Policy 3.40e-05
## committeeCommittee on Women's Rights and Equal Opportunities 0.381395
##
## (Intercept) ***
## nom_dist1_ep
## nom_dist1_nat .
## ctee_expert .
## chair ***
## committeeCommittee on Budgetary Control *
## committeeCommittee on Budgets ***
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs **
## committeeCommittee on Constitutional Affairs
## committeeCommittee on Culture, Youth, Education, the Media and Sport
## committeeCommittee on Development and Cooperation *
## committeeCommittee on Economic and Monetary Affairs **
## committeeCommittee on Employment and Social Affairs
## committeeCommittee on Fisheries ***
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy *
## committeeCommittee on Industry, External Trade, Research and Energy **
## committeeCommittee on Legal Affairs and the Internal Market ***
## committeeCommittee on Petitions **
## committeeCommittee on Regional Policy, Transport and Tourism
## committeeCommittee on the Environment, Public Health and Consumer Policy ***
## committeeCommittee on Women's Rights and Equal Opportunities
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5738 0.1771 -8.888 <2e-16 ***
## attendance 4.0378 0.3286 12.289 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 29
## Log-likelihood: -1516 on 23 Df
As you can see from the regression table, these models often yield similar results. Note, though, that the binomial part of the hurdle model predicts the probability of a positive count, while the zero-inflation part predicts the probability of an excess zero; that is why the sign on attendance flips between the zero-inflated columns and the hurdle column.
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
          omit = "committee",
          title = "The ZERO part of the models: zero-inflated vs. hurdle",
          dep.var.caption = "Zero events (no reports)",
          zero.component = T,
          type = "html")
Zero events (no reports): nreports.

| | (1) zero-inflated | (2) zero-inflated | (3) zero-inflated | (4) hurdle |
|---|---|---|---|---|
| nom_dist1_ep | | 3.908*** (1.262) | | |
| nom_dist1_nat | | -1.957 (2.277) | | |
| ctee_expert | | -0.487 (0.379) | | |
| chair | | -12.719 (307.246) | | |
| attendance | -6.330*** (0.689) | -6.191*** (0.726) | -8.179*** (1.153) | 4.038*** (0.329) |
| Constant | 1.451*** (0.261) | -2.423*** (0.327) | 1.625*** (0.317) | -1.574*** (0.177) |
| Observations | 885 | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 | -1,516.094 |

Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
          omit = "committee",
          title = "The COUNT part of the models: zero-inflated vs. hurdle",
          dep.var.caption = "Number of reports",
          zero.component = F,
          type = "html")
Number of reports: nreports.

| | (1) zero-inflated | (2) zero-inflated | (3) zero-inflated | (4) hurdle |
|---|---|---|---|---|
| nom_dist1_ep | -0.358 (0.335) | -0.029 (0.340) | -0.488 (0.395) | 0.017 (0.401) |
| nom_dist1_nat | -0.780 (0.518) | -0.637 (0.451) | -0.700 (0.543) | -1.053* (0.595) |
| chair | 1.347*** (0.072) | 1.338*** (0.071) | 1.413*** (0.125) | 1.330*** (0.073) |
| attendance | | 0.663*** (0.138) | | |
| ctee_expert | 0.131** (0.055) | 0.072 (0.053) | 0.154** (0.072) | 0.097* (0.058) |
| Constant | -3.270*** (0.116) | -3.398*** (0.101) | -3.391*** (0.147) | -3.255*** (0.127) |
| Observations | 885 | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 | -1,516.094 |

Note: *p<0.1; **p<0.05; ***p<0.01
Assessing the zero-inflated and hurdle models

Assessing and presenting the results from models of excess zeros is somewhat harder because the predicted outcomes are a function of two processes/models. Which of the two should we focus on? Or should we do both?

The binomial part (zero part) reports the likelihood of positive/negative counts and can be interpreted as such. The results of the count part, however, should be interpreted as the effects of the covariates conditional on being in the group of MEPs eligible for reports (given the binomial model).
Predictions

The `predict()` function will provide predictions as per usual. However, since the binomial model moderates the results of the count part, we should think about which of the quantities we’d like to present. The `type=` argument lets you decide which outcome you’re interested in:

- “count” reports the results from the count part of the model, conditional on passing the hurdle.
- “zero” reports the probability from the binomial model.
- “response” will yield the total expected outcome (with decimals). However, these predictions are done at the MEP level, and may therefore deviate quite substantially from the observed counts, due to the interaction of the two models.

To compare the model predictions, it might be more useful to compare the probabilities of the counts, i.e. the probability of receiving 0, 1, 2, etc. legislative proposals from the group leaders.

- “prob” reports the probability of each count category of reports.
This last option is interesting. Now, the function yields not a vector of predictions but a matrix. Each column corresponds to a count category (the values of `nreports`), while the observations (MEPs) are reported in the rows. The cells report the probability for each MEP to be in a specific category of counts.
#Predict outcomes as a matrix
preds_zero <- predict(mod1.zero.pois,
                      newdata = df,
                      #Specify the probability of a count
                      "prob")
dim(preds_zero)
## [1] 906 23
head(preds_zero)
## 0 1 2 3 4 5 6
## 1 0.2417732 0.2090101 0.2290520 0.16734384 0.09169520 0.040195136 0.0146831467
## 2 0.3100102 0.1837342 0.2060861 0.15410480 0.08642610 0.038776055 0.0144977666
## 3 0.1120386 0.2075920 0.2527395 0.20513706 0.12487528 0.060813337 0.0246797029
## 4 0.4461237 0.3283828 0.1589427 0.05128725 0.01241193 0.002403032 0.0003877023
## 5 0.3577113 0.3203768 0.2019022 0.08482618 0.02672883 0.006737829 0.0014153985
## 6 0.1308131 0.2308237 0.2593701 0.19429801 0.10916363 0.049065653 0.0183779026
## 7 8 9 10 11 12
## 1 4.597460e-03 1.259577e-03 3.067462e-04 6.723202e-05 1.339616e-05 2.446786e-06
## 2 4.646135e-03 1.302838e-03 3.247407e-04 7.284932e-05 1.485667e-05 2.777340e-06
## 3 8.584880e-03 2.612983e-03 7.069463e-04 1.721388e-04 3.810469e-05 7.731959e-06
## 4 5.361551e-05 6.487696e-06 6.978115e-07 6.755044e-08 5.944640e-09 4.795506e-10
## 5 2.548536e-04 4.015233e-05 5.623132e-06 7.087423e-07 8.120928e-08 8.529714e-09
## 6 5.900211e-03 1.657476e-03 4.138798e-04 9.301305e-05 1.900294e-05 3.558846e-06
## 13 14 15 16 17 18
## 1 4.125243e-07 6.458302e-08 9.436783e-09 1.292709e-09 1.666667e-10 2.029426e-11
## 2 4.792635e-07 7.679538e-08 1.148504e-08 1.610280e-09 2.124913e-10 2.648240e-11
## 3 1.448233e-06 2.518853e-07 4.088876e-08 6.222663e-09 8.912914e-10 1.205701e-10
## 4 3.570928e-11 2.469126e-12 1.593463e-13 9.640779e-15 5.489759e-16 2.952371e-17
## 5 8.269917e-10 7.445316e-11 6.256075e-12 4.928241e-13 3.653870e-14 2.558531e-15
## 6 6.152271e-07 9.875908e-08 1.479638e-08 2.078285e-09 2.747424e-10 3.430227e-11
## 19 20 21 22
## 1 2.341081e-12 2.565567e-13 2.677694e-14 2.667689e-15
## 2 3.126745e-12 3.507125e-13 3.746456e-14 3.820205e-15
## 3 1.545177e-11 1.881225e-12 2.181291e-13 2.414256e-14
## 4 1.504206e-18 7.280604e-20 3.356126e-21 1.476745e-22
## 5 1.697255e-16 1.069614e-17 6.419746e-19 3.677944e-20
## 6 4.057316e-12 4.559093e-13 4.878977e-14 4.983973e-15
This option will give us a better idea about the model’s performance in the aggregate because we can assess the distribution of the outcomes. We can either identify the most likely outcome as if each count were a category, or simply calculate the aggregated probabilities for each category/outcome.
Most likely outcome

Here I sum over each column (the expected number of MEPs in each count category), and can pick out each MEP’s most likely category.
colSums(preds_zero, na.rm = T)
## 0 1 2 3 4 5
## 338.1925200 142.5799967 124.2709222 95.7526814 67.5581729 44.4231577
## 6 7 8 9 10 11
## 27.5343934 16.2979559 9.3912330 5.4211660 3.2588708 2.1232026
## 12 13 14 15 16 17
## 1.5306716 1.2069874 1.0078823 0.8629301 0.7406705 0.6289277
## 18 19 20 21 22
## 0.5244628 0.4277650 0.3404720 0.2641097 0.1995317
which.max(preds_zero[1,])
## 0
## 1
max(preds_zero[906,]) == preds_zero[906,]
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 14 15 16 17 18 19 20 21 22
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Identify the most likely count category for each MEP
df$preds_zero_cat <- apply(preds_zero, 1,
                           function(x) if (all(is.na(x))) NA else which.max(x) - 1)

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes: most likely count category
  geom_histogram(aes(x = preds_zero_cat),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
Here, we reported the most likely count category for each MEP. As is clear from the overlay, my model doesn’t predict the categories very precisely. That is, the probability that an MEP falls into a single count category is not very precisely estimated. However, if we treat the outcome as metric rather than categorical, the perception might change.
Distribution of probabilities

We can simply compare the proportion of observed data in each category of `nreports` with the predicted probabilities. That way, we can take into account that an MEP has a fair chance of getting – say – 3-5 reports, but a low probability of getting 0 or 10.
To do so, I create two data sets – one with predicted values and one with observed values – then merge them. Last, I make a histogram.
#Data frame with predicted probabilities
predicted_prob <-
  #the matrix from the predict function
  preds_zero %>%
  #make it into a data frame
  as.data.frame %>%
  #Take the mean of all the columns
  summarize(across(everything(), mean, na.rm = T)) %>%
  #Flip columns and rows to get a data set with two variables
  tidyr::pivot_longer(everything(),
                      names_to = "nreports",
                      values_to = "pred_prob") %>%
  #Recode to a numeric variable
  mutate(nreports = as.numeric(nreports))

#check
head(predicted_prob)
#Data frame with observed proportions
observed_prob <-
  df %>%
  #Count the number of cases in each category
  group_by(nreports) %>%
  summarise(count = n()) %>%
  #Ungroup to consider all the data
  ungroup %>%
  #Calculate the total number of observations
  mutate(N = sum(count),
         #divide by all observations to get a proportion
         prob = count/N,
         #Recode count category to numeric
         nreports = as.numeric(nreports))
#Merge the two data frames
df_probs <- full_join(observed_prob,
                      predicted_prob,
                      by = "nreports")
#Plot
df_probs %>%
  # You can filter away the zeros to zoom in on the counts
  # filter(nreports > 0) %>%
  ggplot() +
  #Grey == observed proportions
  geom_histogram(aes(y = prob,
                     x = nreports),
                 stat = "identity") +
  #Red == predicted values
  geom_histogram(aes(y = pred_prob,
                     x = nreports),
                 stat = "identity",
                 fill = "red", alpha = 0.5) +
  #Cut the x-axis to the values you want to see
  coord_cartesian(xlim = c(0,25)) +
  ggtitle("Predicted vs. observed counts of legislative proposals",
          subtitle = "Results from a zeroinflated model") +
  ylab("Probability") +
  xlab("Number of legislative proposals (`reports`)") +
  theme_minimal()
This looks much better!
Rootogram
The rootogram function handles these models without any problems. The expected frequencies are based on the total expected counts (from the two model parts combined).
rootogram(mod1.zero.pois)
Robust standard errors: when observations are clustered in groups
The standard errors reported in regression models assume that observations are independently drawn, i.e. that all observations have an equal probability of getting the outcome we observe. This is NOT a particular assumption of count models; it is an assumption most models share.

However, this is not always the case. Sometimes, our observations share some common features. We may, for example, think that the number of legislative proposals is determined by the committee that the MEP is a member of. This creates – potentially – two problems. One is that committee membership may be a confounder for the relationship between MEPs’ ideology and their probability of garnering legislative proposals. This could happen, for example, if MEPs with extreme political views were assigned to committees with little legislative activity (as a way for the leadership to side-line the extremists). We can address this by controlling for committee membership the way we did in the previous section. This will potentially change the regression parameter if – indeed – there is such a relationship.

Another problem is that the MEPs are in fact clustered in committees and therefore are not independently drawn. This has an impact on how we calculate the standard error (i.e. the statistical significance) of our effects, but doesn’t imply any change in the parameter as such. We can address this by clustering the standard errors. Clustered standard errors adjust the standard errors of regression coefficients to account for correlation within clusters (here: committees), while assuming independence between clusters (here: between committees). In other words, we assume that MEPs who sit on the same committee have a similar baseline probability of getting legislative proposals.
Normal standard errors
We can extract normal standard errors from the variance-covariance matrix of the model object using the `vcov()` function. It reports how the regression parameters vary/covary.
vcov(mod2.zero.pois)
| | count_(Intercept) | count_nom_dist1_ep | count_nom_dist1_nat | count_ctee_expert | count_chair | count_attendance | zero_(Intercept) | zero_nom_dist1_ep | zero_nom_dist1_nat | zero_ctee_expert | zero_chair | zero_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count_(Intercept) | 0.010 | -0.005 | -0.008 | -0.001 | -0.001 | -0.013 | 0.011 | -0.005 | -0.015 | -0.003 | -0.003 | -0.011 |
| count_nom_dist1_ep | -0.005 | 0.116 | -0.024 | 0.001 | 0.000 | -0.002 | -0.002 | 0.092 | 0.030 | 0.002 | -0.003 | -0.015 |
| count_nom_dist1_nat | -0.008 | -0.024 | 0.204 | 0.000 | 0.000 | 0.008 | -0.015 | -0.003 | 0.292 | 0.011 | -0.014 | 0.012 |
| count_ctee_expert | -0.001 | 0.001 | 0.000 | 0.003 | 0.000 | 0.000 | -0.002 | 0.000 | 0.006 | 0.006 | -0.001 | 0.002 |
| count_chair | -0.001 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.003 | -0.002 | -0.001 | 0.002 | -0.003 |
| count_attendance | -0.013 | -0.002 | 0.008 | 0.000 | 0.000 | 0.019 | -0.014 | -0.005 | 0.010 | 0.001 | 0.003 | 0.022 |
| zero_(Intercept) | 0.011 | -0.002 | -0.015 | -0.002 | 0.000 | -0.014 | 0.107 | -0.040 | -0.189 | -0.034 | 0.007 | -0.181 |
| zero_nom_dist1_ep | -0.005 | 0.092 | -0.003 | 0.000 | 0.003 | -0.005 | -0.040 | 1.592 | -0.545 | -0.003 | 0.030 | -0.217 |
| zero_nom_dist1_nat | -0.015 | 0.030 | 0.292 | 0.006 | -0.002 | 0.010 | -0.189 | -0.545 | 5.187 | 0.102 | -0.244 | 0.200 |
| zero_ctee_expert | -0.003 | 0.002 | 0.011 | 0.006 | -0.001 | 0.001 | -0.034 | -0.003 | 0.102 | 0.144 | -0.035 | 0.005 |
| zero_chair | -0.003 | -0.003 | -0.014 | -0.001 | 0.002 | 0.003 | 0.007 | 0.030 | -0.244 | -0.035 | 94400.207 | -0.056 |
| zero_attendance | -0.011 | -0.015 | 0.012 | 0.002 | -0.003 | 0.022 | -0.181 | -0.217 | 0.200 | 0.005 | -0.056 | 0.527 |
Standard errors are calculated as the square root of the diagonal of that matrix.
#Variance-covariance matrix
vcov(mod2.zero.pois) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.10079098 0.34019014 0.45140501 0.05283447
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.07137039 0.13777445 0.32682849 1.26191966
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 2.27742363 0.37903328 307.24616665 0.72611315
Can you recognize the results from the summary of `mod2.zero.pois`?
Clustered standard errors
We can estimate clustered standard errors using the `sandwich` package in R.
#install.packages("sandwich")
library(sandwich)
The relevant function is `vcovCL()`. It requires an additional argument, `cluster=`, where we provide the variable we want to cluster the errors on. Here, I cluster by committee. In GLMs the standard `type=` of errors is “HC0”; this is set automatically.
#Variance-covariance matrix
vcovCL(mod2.zero.pois,
       #Cluster according to committee membership
       cluster = df$committee,
       type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
Be mindful that if you have NAs in your model (as I do here), you will have to subset the clustering variable so that the NAs disappear. My hack is to subset using the `predict()` function.
#predict
df$preds <- predict(mod2.zero.pois, df)

#Variance-covariance matrix
vcovCL(mod2.zero.pois,
       #Subset, then cluster
       cluster = df$committee[!is.na(df$preds)],
       type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
I now have regression coefficients from the model object and standard errors estimated separately.
#Copy standard errors to a separate object
ses.clustered <-
  vcovCL(mod2.zero.pois,
         #Subset, then cluster
         cluster = df$committee[!is.na(df$preds)]) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

#Summary statistics from the original model (for comparison)
summary.mod1.zero <- summary(mod2.zero.pois)
#Extract the standard errors of the count part of the model
ses.normal <- summary.mod1.zero$coefficient$count[,"Std. Error"]
We pass the errors to stargazer as a list of named standard errors. These names must be the same as the names of the regression coefficients. Here we see that the names of the standard errors are prefixed by `count_` for the count model and `zero_` for the zero hurdle.
ses.clustered
## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
… however, the coefficients do not have these names.
mod2.zero.pois$coefficients
## $count
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert chair
## -3.39832253 -0.02883625 -0.63717556 0.07216832 1.33814804
## attendance
## 0.66311572
##
## $zero
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert chair
## -2.4234745 3.9084822 -1.9573037 -0.4873135 -12.7187862
## attendance
## -6.1911903
I will have to change the names to get stargazer to understand my objects and link coefficients and errors correctly. I use the `gsub()` function (from our session on recoding of variables) to replace the string `count_` with nothing (`""`).
names(ses.clustered) <- names(ses.clustered) %>%
gsub("count_", "", .)
ses.clustered
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## chair attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
Now, I can finally make my regression table. Here, I report the results from the original model and the results with the clustered errors.
stargazer(mod2.zero.pois, mod2.zero.pois,
se = list(ses.normal, ses.clustered),
title = "Normal vs. clustered standard errors",
type = "html")
Dependent variable: nreports.

| | (1) | (2) |
|---|---|---|
| nom_dist1_ep | -0.029 (0.340) | -0.029 (0.359) |
| nom_dist1_nat | -0.637 (0.451) | -0.637 (0.701) |
| ctee_expert | 0.072 (0.053) | 0.072 (0.140) |
| chair | 1.338*** (0.071) | 1.338*** (0.152) |
| attendance | 0.663*** (0.138) | 0.663*** (0.234) |
| Constant | -3.398*** (0.101) | -3.398*** (0.200) |
| Observations | 885 | 885 |
| Log Likelihood | -1,536.608 | -1,536.608 |

Note: *p<0.1; **p<0.05; ***p<0.01
I always check the results: My clustered errors should be larger than the normal errors, since the whole point of this exercise is to account for the fact that my number of observations overstates the amount of independent information in the data (i.e. I have less variation than we might think, because outcomes depend on committee membership).
Need a brush-up on robust/clustered standard errors? Robust standard errors and clustered standard errors are both methods for estimating standard errors that are robust to violations of the assumption of homoscedasticity (i.e., equal variance) in linear regression models. However, they are used in different contexts and with different assumptions.
Robust standard errors are used when the errors are believed to be heteroscedastic, but the source of the heteroscedasticity is unknown. In this case, the robust standard errors adjust the standard errors of the regression coefficients to account for the potential heteroscedasticity of the errors. Robust standard errors can be estimated using various methods, such as the sandwich estimator or the White estimator, and do not require any additional information about the structure of the data.
Clustered standard errors, on the other hand, are used when the errors are believed to be heteroscedastic and correlated within groups, but uncorrelated between groups. In this case, the groups can be defined in various ways, such as geographic regions, schools, or firms, and the clustered standard errors adjust the standard errors of the regression coefficients to account for the correlation of the errors within each group. The clustered standard errors can be estimated using the same methods as the robust standard errors, but they require information about the grouping structure of the data.
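In sketch form, both are sandwich estimators, \(\widehat{V} = B M B\): the “bread” \(B\) is the usual inverse-Hessian variance, and the “meat” \(M\) is built from the scores. Heteroscedasticity-robust versions sum one score product per observation, while the clustered version first sums the scores within each cluster, \(M = \sum_{g=1}^{G} s_g s_g^\top\).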
Display

To finish, here is a sketch of how we might display effects from the zero-inflated model by simulation: draw coefficients from their estimated sampling distribution, build a scenario, and summarize the resulting expected counts. The scenario values (e.g. attendance = 0.7, an exposure of 60 months) are illustrative choices.

mod.zero <- zeroinfl(nreports ~
                       nom_dist1_ep +
                       nom_dist1_nat +
                       ctee_expert +
                       chair + offset(log(months)) | attendance,
                     #Count model == poisson
                     dist = "poisson",
                     df)

#Scenario for the count part: vary the ideological distance to the EP group
newdata1 <- data.frame(
  intercept1 = 1,
  nom_dist1_ep = seq(0, 1, 0.01),
  nom_dist1_nat = 0.028,
  ctee_expert = 0,
  chair = 0)

#Scenario for the zero part
newdata0 <- data.frame(
  intercept0 = 1,
  attendance = 0.7)

#Simulate the count-part coefficients (the first five parameters)
sim1 <- mvrnorm(n = 1000,
                mu = coefficients(mod.zero)[1:5],
                Sigma = vcov(mod.zero)[1:5, 1:5])

#Simulate the zero-part coefficients (the last two parameters)
sim0 <- mvrnorm(n = 1000,
                mu = coefficients(mod.zero)[6:7],
                Sigma = vcov(mod.zero)[6:7, 6:7])

#Predicted probabilities from the full model; exposure fixed at 60 months
prob <- predict(mod.zero,
                newdata = cbind(newdata1, newdata0, months = 60) %>% as.data.frame(),
                type = "prob")

#Linear predictors from the count part for each simulated coefficient vector
preds <- as.matrix(newdata1) %*% t(sim1)

#Transform to expected counts
preds <- preds %>% exp()

#Median and 95% interval of the expected count for each scenario
pdf <- as.data.frame(t(apply(X = preds,
                             MARGIN = 1,
                             quantile,
                             probs = c(.025, .5, .975))))