Models for event counts
In this session, we will be analyzing count outcomes. We will explore the baseline poisson model, test for overdispersion, and address potential problems.
We will familiarize ourselves with how to estimate different count models:
- poisson and quasi-poisson models from base R
- negative binomial models from MASS
- zero-inflated and hurdle models from pscl
- clustered standard errors from sandwich. While these are not specific to count models, we sometimes need to cluster our standard errors because our observations are clustered. You’ll find an example at the end of the script.
We will be working with data from Yoshinaka, Mcelroy, and Bowler (2010) on report allocation in the European Parliament (EP). All legislative proposals are handled by a member of one of the committees in the EP. This person (the “rapporteur”) is in charge of collecting information, negotiating a consensus between the party groups and with the other institutions (the Commission and the Council), and drafting the resulting parliamentary amendments on behalf of the committee. For any single policy proposal, the rapporteur is the single most influential member of parliament (MEP). The rapporteur is appointed by the group leader in the committee in charge of the proposal. Group leaders are committee members of the transnational party groups of MEPs: (transnational) parliamentary parties to which national parties adhere (or not).
Back in the day, it was a common belief that national parties were running the show in the EP, since they are in charge of candidate selection before elections, and because transnational groups have little presence outside of Parliament.
This article explores the argument that party leaders will prefer to appoint group members that are ideologically close to the transnational party median and contrasts this with the ideological proximity between MEPs and their national party group. The aim is to understand who allocates political influence in the European Parliament. The model also includes an indicator of whether being a policy expert is an asset.
We start by loading in the data and relevant packages.
library(dplyr); library(ggplot2)

#Load in data
load("df_yoshinaka.rda")
df <- df_yoshinaka
Introduction

The outcome variable of choice is the number of reports (i.e. legislative proposals, `nreports`) each MEP has garnered during the legislative period. We have, in other words, a data generating process in which events occur – in principle unrelated to each other – within a certain window/exposure time. This is the type of outcome for which we turn to the poisson model or one of its affiliates.
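For reference, the poisson distribution states that \(\Pr(y_i = k) = \frac{\lambda_i^{k} e^{-\lambda_i}}{k!}\), where the rate is modelled as \(\lambda_i = \exp(x_i^\top \beta)\). Its signature property is that \(E(y_i) = V(y_i) = \lambda_i\); we return to this below.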
hist(df$nreports,
xlab = "Number of reports (proposals)",
main = "Distribution of legislative proposals among MEPs")
#Number of observations
nrow(df)
## [1] 906
#Number of no events
sum(df$nreports==0)
## [1] 367
#Number of extreme events
quantile(df$nreports, 0.95)
## 95%
## 7
max(df$nreports)
## [1] 22
As we can see from the histogram, the distribution of legislative proposals is quite unequal. A fairly large number of MEPs (more than a third) never drafted a report at all, while the 5% most productive members drafted 7 or more reports. The most prolific MEP drafted no less than 22 proposals!
Overall, the variance in our outcome is higher than the mean.
#Mean
mean(df$nreports)
## [1] 1.970199
#Variance
var(df$nreports)
## [1] 8.340547
This is not uncommon for count data. The poisson distribution used in poisson models assumes that the mean is the same as the spread/variance of the conditional outcome. If our regression model contains enough predictors to explain this variance, the assumption may still hold conditionally; we will check once we have a model. As a first point of comparison, we estimate two OLS baselines: one on the raw counts and one on the log-transformed counts.
mod1.ols <- lm(nreports ~
                 nom_dist1_ep +
                 nom_dist1_nat +
                 ctee_expert,
               df)

mod2.ols <- lm(log(nreports + 1) ~
                 nom_dist1_ep +
                 nom_dist2_nat +
                 ctee_expert, df)
Baseline model: poisson
Estimation
Running the poisson model is simple. It is a GLM, and the `glm()` function is part of base R. The only thing we need to change is the probability distribution.
mod1.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 #tell R which probability distribution we use
                 family = "poisson",
                 df)

summary(mod1.pois)
##
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert,
## family = "poisson", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4100 -1.8955 -0.7374 0.6569 9.1278
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.70968 0.03609 19.666 < 2e-16 ***
## nom_dist1_ep -1.76930 0.31886 -5.549 2.88e-08 ***
## nom_dist1_nat -1.47164 0.50029 -2.942 0.00327 **
## ctee_expert 0.35966 0.04849 7.417 1.20e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2970.4 on 897 degrees of freedom
## Residual deviance: 2849.2 on 894 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 4371.9
##
## Number of Fisher Scoring iterations: 6
Exposure (the event window)
Thinking about the data generating process, I realize that MEPs don’t all serve entire terms in the European Parliament. Some quit before the end of the term, others swap committees. In other words, the exposure time varies. In the data, there is a variable called `months` that reports the number of months the member served on the committee.

If the exposure time is not the same across observations, the probability of observing any given number of events also varies. We could address this by controlling for the log-transformed variable. It is more common to use an offset. The offset is – for all practical purposes – a log-transformed control variable we include in the model, but for which the slope coefficient (parameter) is fixed at 1.
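In sketch form, the count model with an offset is \(\log E(y_i) = x_i^\top \beta + \log(\text{months}_i)\); it is equivalent to adding \(\log(\text{months}_i)\) as a control whose coefficient is constrained to 1.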
In R, I can do this by either using the `offset()` function inside the formula or the `offset=` argument in `glm()`.
#Take 1
mod2.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 family = "poisson",
                 offset = log(months),
                 df)

#Take 2
mod2.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert + offset(log(months)),
                 family = "poisson",
                 df)
As we can see from comparing mod1.pois and mod2.pois, the regression coefficients change substantially. However, the intercept changed too. Since the model is a member of the exponential family with built-in interaction effects, it is hard to know from the coefficients alone how substantial the change really is.
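One quick aid – a sketch, assuming `mod2.pois` is still in memory – is to exponentiate the coefficients into incidence rate ratios, i.e. multiplicative effects on the expected count:

#Incidence rate ratios: multiplicative effects on the expected count
exp(coef(mod2.pois))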
library(stargazer)

stargazer(mod1.ols, mod2.ols, mod1.pois, mod2.pois,
          type = "html")
| | (1) OLS: nreports | (2) OLS: log(nreports + 1) | (3) Poisson: nreports | (4) Poisson: nreports |
|---|---|---|---|---|
| nom_dist1_ep | -2.313*** (0.858) | -0.815*** (0.220) | -1.769*** (0.319) | -0.853*** (0.281) |
| nom_dist1_nat | -1.928 (1.287) | | -1.472*** (0.500) | -0.937** (0.439) |
| nom_dist2_nat | | -0.780** (0.346) | | |
| ctee_expert | 0.771*** (0.206) | 0.245*** (0.054) | 0.360*** (0.048) | 0.223*** (0.049) |
| Constant | 1.975*** (0.133) | 0.786*** (0.036) | 0.710*** (0.036) | -3.076*** (0.035) |
| Observations | 898 | 898 | 898 | 898 |
| R2 | 0.029 | 0.047 | | |
| Adjusted R2 | 0.025 | 0.044 | | |
| Log Likelihood | | | -2,181.944 | -1,868.368 |
| Akaike Inf. Crit. | | | 4,371.889 | 3,744.736 |
| Residual Std. Error (df = 894) | 2.858 | 0.745 | | |
| F Statistic (df = 3; 894) | 8.809*** | 14.736*** | | |

Note: *p<0.1; **p<0.05; ***p<0.01
Model fit
Let’s assess the model’s fit. This means comparing predicted and observed outcomes. This time, I am scouting for overdispersion.
Histograms
I usually start in the really intuitive way: how does the distribution of my predictions compare to the distribution of the observed counts?
#Predict outcomes
df$preds_pois2 <- predict(mod2.pois,
                          newdata = df,
                          #the unnamed third argument is the type; "response" gives exponentiated outcomes
                          "response")
#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_pois2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (`stat_bin()`).
This looks dismal. We are predicting too few zero events and too few extreme counts. Clearly, the variation in the distribution of influence in the EP is larger than what our current model accounts for. It seems we are in the presence of overdispersion. This means that the standard errors are too small, since the poisson distribution has strict expectations about the variance.
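If you want a formal check to back up the eyeball test, the `AER` package has a score test of the equidispersion assumption; a minimal sketch, assuming the package is installed:

#install.packages("AER")
library(AER)
#H0: equidispersion; the alternative is overdispersion
dispersiontest(mod1.pois)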
Hanging rootogram
I can plot the same thing using hanging rootograms (Ward and Ahlquist 2018). The `countreg` package has a ready-made function for us. It is available for installation from R-Forge.
#install.packages("countreg", repos="http://R-Forge.R-project.org")
#Rootogram
rootogram(mod1.pois,
main = "Hanging rootogram of mod1.pois")
The plot gives us the same message: the bars (observed frequencies) hang from the curve of predicted frequencies, so for each count we can see whether the model over- or underpredicts. Once again, we underpredict zeros as well as extreme values.
Addressing overdispersion
Overdispersion is not necessarily a big problem if i) we don’t have an omitted variable bias and ii) we have described the data generating process correctly. If we believe both conditions hold, we can simply make the standard errors larger. That is what the quasi-poisson model does.
However, I usually take the opportunity to think more closely about the data generating process, i.e. I put on my polisci hat and start thinking about how the events occur. That’s what we will do in the following.
Quasi-poisson: fixing the standard errors
If I believe my model is an unbiased description of the phenomenon I want to study, it suffices to revise my certainty about the estimates, without re-estimating them. The quasipoisson model adds an additional dispersion parameter \(\phi\) to the variance of the poisson model (Ward and Ahlquist 2018, 202). It describes the (over)dispersion and corrects the standard errors for it:

\(V(y_i) = \phi \lambda_i\)
We can easily run the model in base R by changing the probability distribution argument.
mod.quasipois <- glm(nreports ~
                       nom_dist1_ep +
                       nom_dist1_nat +
                       ctee_expert + offset(log(months)),
                     #Specify quasipoisson
                     family = "quasipoisson",
                     df)

summary(mod.quasipois)
##
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## offset(log(months)), family = "quasipoisson", data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6217 -1.5013 -0.5819 0.3392 7.8374
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.07626 0.06155 -49.978 < 2e-16 ***
## nom_dist1_ep -0.85316 0.48985 -1.742 0.08191 .
## nom_dist1_nat -0.93748 0.76630 -1.223 0.22151
## ctee_expert 0.22292 0.08472 2.631 0.00865 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.048904)
##
## Null deviance: 2263.5 on 897 degrees of freedom
## Residual deviance: 2222.0 on 894 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Here, we see the dispersion parameter is reported as 3.048904.
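We can recover (approximately) the same number by hand as the Pearson statistic divided by the residual degrees of freedom; a sketch, assuming the poisson fit `mod2.pois` with the same specification is in memory:

#Pearson-based estimate of the dispersion parameter
sum(residuals(mod2.pois, type = "pearson")^2) / df.residual(mod2.pois)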
The results are reported alongside the baseline models in table 3. We can see that the coefficients are the same, but now the political distance between MEPs and their national parties is no longer statistically significant. In fact, the standard error is almost as big as the coefficient.
Adding predictors
I generally prefer taking the opportunity to think about my model choice instead of going for the post-hoc fix operated by the quasipoisson. An alternative route is to add predictors. Do not turn this into a garbage-can model by adding intermediate predictors and controlling away your main effect. However, you can add predictors for two reasons:
- Additional predictors: variables that are correlated with the outcome, but not with the theoretical variable you are interested in. Here, I happen to know that some MEPs don’t generally take part in parliamentary work. They are not present during the plenary sessions, for example. If my theory is that group leaders deselect some committee members – and not that some committee members self-deselect – then this is an interesting control. However, there is no real reason to think that MEPs with extreme political views (compared to their party groups) are less present.
mod1.predictors <- glm(nreports ~
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         activity +
                         # chair +
                         offset(log(months)),
                       #Specify poisson
                       family = "poisson",
                       df)
The result is reported in the regression table. We see that the political distance is estimated to be (approximately) the same also after including the control.
- Confounders: variables you discover to be correlated with both the outcome and your theoretical variable of interest.
In this particular case, I happen to know a few mechanisms that determine how many proposals an MEP could potentially be eligible for. They might be correlated with their ideological distance from their party: the leadership position and the committee on which the MEP serves.
First, legislative proposals that no one is interested in handling are taken on by the committee chair. It means that some of my extreme counts may be due to the committee chair doing his/her job of filling in for the other members. The committee chair is elected by all of the party groups, so this position may be reserved for people that are close to the median voter in the committee, rather than the party.

Second, the Commission is more active in some policy areas than others. If parties are strategic, they will put eligible rapporteurs on the more active committees. In the second model, I add fixed effects for committees and a predictor for the committee chair.
mod2.predictors <- glm(nreports ~
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         activity +
                         chair +
                         committee + offset(log(months)),
                       #Specify poisson
                       family = "poisson",
                       df)
Let’s check the histogram again.
#Predict outcomes
df$preds_predictors2 <- predict(mod2.predictors,
                                newdata = df,
                                #we want exponentiated outcomes
                                "response")

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_predictors2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).
… and the rootogram.
#Rootogram
rootogram(mod2.predictors,
main = "Hanging rootogram of mod2.predictors")
This looks so much better! I still have a problem predicting my zeros, though.

Comparing with the initial poisson model, we see that the significance level is lower for ideological distance (the ratio between the parameter and the error is smaller). However, the estimate is still more precise than in the initial quasipoisson model.
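To put a number on the improvement – a sketch, assuming both extended models are in memory (they are fit on the same 896 observations) – we can compare information criteria:

#Lower AIC indicates a better fit
AIC(mod1.predictors, mod2.predictors)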
stargazer(mod1.pois, mod2.pois,
          mod.quasipois,
          mod1.predictors, mod2.predictors,
          omit = "committee",
          type = "html")
Dependent variable: nreports.

| | (1) Poisson | (2) Poisson | (3) quasipoisson | (4) Poisson | (5) Poisson |
|---|---|---|---|---|---|
| nom_dist1_ep | -1.769*** (0.319) | -0.853*** (0.281) | -0.853* (0.490) | -0.850*** (0.279) | -0.714** (0.302) |
| nom_dist1_nat | -1.472*** (0.500) | -0.937** (0.439) | -0.937 (0.766) | -0.497 (0.412) | -0.641 (0.439) |
| ctee_expert | 0.360*** (0.048) | 0.223*** (0.049) | 0.223*** (0.085) | 0.182*** (0.049) | 0.200*** (0.052) |
| activity | | | | 1.666*** (0.186) | 1.746*** (0.189) |
| chair | | | | | 1.584*** (0.072) |
| Constant | 0.710*** (0.036) | -3.076*** (0.035) | -3.076*** (0.062) | -4.422*** (0.157) | -4.897*** (0.197) |
| Observations | 898 | 898 | 898 | 896 | 896 |
| Log Likelihood | -2,181.944 | -1,868.368 | | -1,820.734 | -1,505.961 |
| Akaike Inf. Crit. | 4,371.889 | 3,744.736 | | 3,651.468 | 3,055.922 |

Committee fixed effects omitted from the table. Note: *p<0.1; **p<0.05; ***p<0.01
Negative binomial model
Another route is to opt for the negative binomial model (Ward and Ahlquist 2018, pp. 202-205). The model doesn’t really have a single probability distribution; instead, it merges two parameters into one using the gamma distribution (\(\Gamma\)). It means that the functional form of the negative binomial model is not set in stone and can adjust empirically to the excess counts that we have by creating a “fatter tail”.

However, this also means that we change the hypothesis. That is, the negative binomial distribution allows the probability of an additional count to increase with the number of previous events. The best example is the soccer player who – having scored once – feels on top of things and continues to excel throughout the game, as (s)he accumulates successes.
In our case, we could imagine a world in which group leaders are rational time-savers. That is, when they receive legislative proposals, they will rely on people they already know; either because they are on the leader’s mind, or because they think there will be an economy of scale (for example).
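In the parameterization used by `MASS::glm.nb`, the variance is \(V(y_i) = \mu_i + \mu_i^2/\theta\): the smaller the \(\theta\), the fatter the tail, and as \(\theta \rightarrow \infty\) we recover the poisson.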
To fit the negative binomial model, we turn to the `MASS` package; the same one we use for our simulations. The function is `glm.nb()`, but otherwise the specification is the same.
library(MASS)

mod1.nb <- glm.nb(nreports ~
                    nom_dist1_ep +
                    nom_dist1_nat +
                    ctee_expert +
                    activity +
                    chair +
                    committee + offset(log(months)),
                  df)
The results are reported in table 4.
stargazer(mod1.predictors, mod2.predictors,
          mod1.nb,
          omit = "committee",
          title = "The poisson vs. the negative binomial model",
          type = "html")
Dependent variable: nreports.

| | (1) Poisson | (2) Poisson | (3) negative binomial |
|---|---|---|---|
| nom_dist1_ep | -0.850*** (0.279) | -0.714** (0.302) | -0.761* (0.390) |
| nom_dist1_nat | -0.497 (0.412) | -0.641 (0.439) | -0.220 (0.537) |
| ctee_expert | 0.182*** (0.049) | 0.200*** (0.052) | 0.249*** (0.076) |
| activity | 1.666*** (0.186) | 1.746*** (0.189) | 1.938*** (0.249) |
| chair | | 1.584*** (0.072) | 1.651*** (0.149) |
| Constant | -4.422*** (0.157) | -4.897*** (0.197) | -5.148*** (0.264) |
| Observations | 896 | 896 | 896 |
| Log Likelihood | -1,820.734 | -1,505.961 | -1,415.508 |
| theta | | | 2.666*** (0.346) |
| Akaike Inf. Crit. | 3,651.468 | 3,055.922 | 2,875.015 |

Committee fixed effects omitted from the table. Note: *p<0.1; **p<0.05; ***p<0.01
#Create a plot window with one row and three columns
par(mfrow = c(1,3))
#Plot the simple poisson model
rootogram(mod1.pois)
#Plot the extended poisson model
rootogram(mod2.predictors)
#Plot the negative binomial model
rootogram(mod1.nb)
#Return to initial parameters; one row and one column per plot
par(mfrow = c(1,1))
We can compare the models. It is clear that the more complex poisson model is better than the simple one. However, the negative binomial model is a bit of a mixed bag: I get more correct zero counts, and some more extreme counts (>15 reports).
Zero-inflated and hurdle models: a focus on the excess zeros
Another alternative is to consider whether we are, in fact, in the presence of two data generating processes: one accounting for the excess zeros, another for the positive counts.

That is, some of the zero counts would come from observations that – for some reason – do not respond to our predictors in the same way (“never takers”).

Effectively, we run two models in parallel: a binomial logistic regression modelling the “never takers”, and a count model (poisson or negative binomial) with the usual predictors.
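In sketch form, the zero-inflated poisson mixes a point mass at zero with a poisson: \(\Pr(y_i = 0) = \pi_i + (1 - \pi_i) e^{-\lambda_i}\) and \(\Pr(y_i = k) = (1 - \pi_i) \frac{\lambda_i^{k} e^{-\lambda_i}}{k!}\) for \(k > 0\), where \(\pi_i\) comes from the logit part and \(\lambda_i\) from the count part.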
Zero-inflated models
Here, I start with a poisson count model. The package I use is `pscl`, which contains – among other things – functions for zero-inflated and hurdle models.
#install.packages("pscl")
library(pscl)
I can also make a theory out of this. Let’s say that there is a group of MEPs that are never takers; they’d rather be caught dead than write a legislative proposal. That is, before the group leader makes the selection, there is an initial stage of self-(de)selection into legislative work.
One predictor for this is the activity level of the MEP. The variable `attendance` captures the share of committee meetings that the member has attended. The idea is that the less you attend a committee, the more you opt out of legislative work.
I specify the predictors for the two models in the function and separate them with a vertical bar (“|”). Before the bar, I specify the predictors for the count part of the model. After the bar are the predictors for the zero/hurdle part (the binomial logistic model). Here, I specify that `attendance` is a predictor for whether an MEP is willing to engage in committee work, but not for how many positive counts we will see.
mod1.zero.pois <- zeroinfl(nreports ~
                             nom_dist1_ep +
                             nom_dist1_nat +
                             chair +
                             ctee_expert +
                             committee + offset(log(months)) | attendance,
                           #Count model == poisson
                           dist = "poisson",
                           df)
The effect of attendance comes out statistically significant and substantively massive.
#Typical difference between assiduous members and less active members
change <- df$attendance %>%
  quantile(., c(0.2, .8), na.rm = T) %>%
  diff %>%
  as.numeric()

change

## [1] 0.5028251

#Extract the relevant predictor from the zero part of the model
coef <- mod1.zero.pois$coefficients$zero["attendance"]
effect <- exp(coef * change) - 1

effect

## attendance
## -0.9585354
Comparing a low-attendance member (20th percentile) with an assiduous member (80th percentile), I find a 96% decrease in the relative odds of being a “never taker”; that is, assiduous members are far more likely to get at least one legislative proposal from their group leader.
Same predictors in both models
If I don’t specify anything else, R will use the same predictors for both models. We can take the explorative approach and look at what the model yields.
mod2.zero.pois <- zeroinfl(nreports ~
                             nom_dist1_ep +
                             nom_dist1_nat +
                             ctee_expert +
                             chair +
                             attendance + offset(log(months)),
                           #Count model == poisson
                           dist = "poisson",
                           df)
summary(mod2.zero.pois)
##
## Call:
## zeroinfl(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## chair + attendance + offset(log(months)), data = df, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.8074 -0.7864 -0.3954 0.3415 8.6202
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.39832 0.10079 -33.717 < 2e-16 ***
## nom_dist1_ep -0.02884 0.34019 -0.085 0.932
## nom_dist1_nat -0.63718 0.45141 -1.412 0.158
## ctee_expert 0.07217 0.05283 1.366 0.172
## chair 1.33815 0.07137 18.749 < 2e-16 ***
## attendance 0.66312 0.13777 4.813 1.49e-06 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4235 0.3268 -7.415 1.22e-13 ***
## nom_dist1_ep 3.9085 1.2619 3.097 0.00195 **
## nom_dist1_nat -1.9573 2.2774 -0.859 0.39010
## ctee_expert -0.4873 0.3790 -1.286 0.19856
## chair -12.7188 307.2462 -0.041 0.96698
## attendance -6.1912 0.7261 -8.526 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 51
## Log-likelihood: -1537 on 12 Df
Something strange seems to happen. Attendance is clearly connected to receiving legislative influence. However, my theory about political proximity is challenged by the explorative approach: MEPs who are ideologically distant from their EP group seem to be overrepresented among the never takers, while my theory linked proximity to the count part.
Here, I estimate the same model with a negative binomial count distribution.
library(pscl)

mod1.zero.nb <- zeroinfl(nreports ~
                           nom_dist1_ep +
                           nom_dist1_nat +
                           ctee_expert +
                           chair +
                           committee + offset(log(months)) | attendance,
                         #Count model == negative binomial
                         dist = "negbin",
                         df)
Stargazer will report the results of either of the two component models, but not both at once. I use the argument `zero.component = T` to report the binomial model that predicts my zero events.
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb,
          omit = "committee",
          title = "The ZERO part of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Zero events (no reports)",
          zero.component = T,
          type = "html")
Zero events (no reports): nreports.

| | (1) | (2) | (3) |
|---|---|---|---|
| nom_dist1_ep | | 3.908*** (1.262) | |
| nom_dist1_nat | | -1.957 (2.277) | |
| ctee_expert | | -0.487 (0.379) | |
| chair | | -12.719 (307.246) | |
| attendance | -6.330*** (0.689) | -6.191*** (0.726) | -8.179*** (1.153) |
| Constant | 1.451*** (0.261) | -2.423*** (0.327) | 1.625*** (0.317) |
| Observations | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 |

Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb,
          omit = "committee",
          title = "The COUNT part of the zero-inflated models: poisson vs. the negative binomial model",
          dep.var.caption = "Number of reports",
          zero.component = F,
          type = "html")
Number of reports: nreports.

| | (1) | (2) | (3) |
|---|---|---|---|
| nom_dist1_ep | -0.358 (0.335) | -0.029 (0.340) | -0.488 (0.395) |
| nom_dist1_nat | -0.780 (0.518) | -0.637 (0.451) | -0.700 (0.543) |
| chair | 1.347*** (0.072) | 1.338*** (0.071) | 1.413*** (0.125) |
| attendance | | 0.663*** (0.138) | |
| ctee_expert | 0.131** (0.055) | 0.072 (0.053) | 0.154** (0.072) |
| Constant | -3.270*** (0.116) | -3.398*** (0.101) | -3.391*** (0.147) |
| Observations | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 |

Note: *p<0.1; **p<0.05; ***p<0.01
Hurdle models
The hurdle models are similar. They assume that there is a certain threshold an observation has to pass in order to acquire positive counts.
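In sketch form: \(\Pr(y_i = 0) = 1 - p_i\) and \(\Pr(y_i = k) = p_i \cdot \frac{f(k; \lambda_i)}{1 - f(0; \lambda_i)}\) for \(k > 0\), where \(p_i\) is the probability of crossing the hurdle (a binomial logistic model) and \(f\) is a count distribution truncated at zero. Unlike in the zero-inflated model, all zeros come from the binomial stage.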
This time, we use the `hurdle()` function to estimate the model.
mod1.hurdle <- hurdle(nreports ~
                        nom_dist1_ep +
                        nom_dist1_nat +
                        ctee_expert +
                        chair +
                        committee + offset(log(months)) | attendance,
                      #Count model: truncated poisson by default
                      # dist = "negbin",
                      df)
summary(mod1.hurdle)
##
## Call:
## hurdle(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## chair + committee + offset(log(months)) | attendance, data = df)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.3664 -0.7268 -0.3941 0.5516 5.8558
##
## Count model coefficients (truncated poisson with log link):
## Estimate
## (Intercept) -3.25496
## nom_dist1_ep 0.01663
## nom_dist1_nat -1.05261
## ctee_expert 0.09707
## chair 1.33031
## committeeCommittee on Budgetary Control 0.42114
## committeeCommittee on Budgets 0.92462
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 0.43619
## committeeCommittee on Constitutional Affairs -0.01209
## committeeCommittee on Culture, Youth, Education, the Media and Sport -0.30796
## committeeCommittee on Development and Cooperation -0.56570
## committeeCommittee on Economic and Monetary Affairs 0.48585
## committeeCommittee on Employment and Social Affairs -0.25478
## committeeCommittee on Fisheries 0.59848
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 0.33292
## committeeCommittee on Industry, External Trade, Research and Energy 0.44134
## committeeCommittee on Legal Affairs and the Internal Market 0.90077
## committeeCommittee on Petitions -0.92295
## committeeCommittee on Regional Policy, Transport and Tourism 0.12009
## committeeCommittee on the Environment, Public Health and Consumer Policy 0.57760
## committeeCommittee on Women's Rights and Equal Opportunities -0.20261
## Std. Error
## (Intercept) 0.12747
## nom_dist1_ep 0.40082
## nom_dist1_nat 0.59465
## ctee_expert 0.05842
## chair 0.07307
## committeeCommittee on Budgetary Control 0.19313
## committeeCommittee on Budgets 0.14955
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 0.15040
## committeeCommittee on Constitutional Affairs 0.20074
## committeeCommittee on Culture, Youth, Education, the Media and Sport 0.21052
## committeeCommittee on Development and Cooperation 0.23656
## committeeCommittee on Economic and Monetary Affairs 0.15068
## committeeCommittee on Employment and Social Affairs 0.18518
## committeeCommittee on Fisheries 0.16468
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 0.14801
## committeeCommittee on Industry, External Trade, Research and Energy 0.14729
## committeeCommittee on Legal Affairs and the Internal Market 0.14279
## committeeCommittee on Petitions 0.35093
## committeeCommittee on Regional Policy, Transport and Tourism 0.15077
## committeeCommittee on the Environment, Public Health and Consumer Policy 0.13935
## committeeCommittee on Women's Rights and Equal Opportunities 0.23147
## z value
## (Intercept) -25.535
## nom_dist1_ep 0.041
## nom_dist1_nat -1.770
## ctee_expert 1.662
## chair 18.206
## committeeCommittee on Budgetary Control 2.181
## committeeCommittee on Budgets 6.183
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 2.900
## committeeCommittee on Constitutional Affairs -0.060
## committeeCommittee on Culture, Youth, Education, the Media and Sport -1.463
## committeeCommittee on Development and Cooperation -2.391
## committeeCommittee on Economic and Monetary Affairs 3.224
## committeeCommittee on Employment and Social Affairs -1.376
## committeeCommittee on Fisheries 3.634
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 2.249
## committeeCommittee on Industry, External Trade, Research and Energy 2.996
## committeeCommittee on Legal Affairs and the Internal Market 6.308
## committeeCommittee on Petitions -2.630
## committeeCommittee on Regional Policy, Transport and Tourism 0.797
## committeeCommittee on the Environment, Public Health and Consumer Policy 4.145
## committeeCommittee on Women's Rights and Equal Opportunities -0.875
## Pr(>|z|)
## (Intercept) < 2e-16
## nom_dist1_ep 0.966905
## nom_dist1_nat 0.076702
## ctee_expert 0.096596
## chair < 2e-16
## committeeCommittee on Budgetary Control 0.029215
## committeeCommittee on Budgets 6.30e-10
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs 0.003729
## committeeCommittee on Constitutional Affairs 0.951973
## committeeCommittee on Culture, Youth, Education, the Media and Sport 0.143509
## committeeCommittee on Development and Cooperation 0.016786
## committeeCommittee on Economic and Monetary Affairs 0.001263
## committeeCommittee on Employment and Social Affairs 0.168856
## committeeCommittee on Fisheries 0.000279
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy 0.024495
## committeeCommittee on Industry, External Trade, Research and Energy 0.002731
## committeeCommittee on Legal Affairs and the Internal Market 2.82e-10
## committeeCommittee on Petitions 0.008537
## committeeCommittee on Regional Policy, Transport and Tourism 0.425726
## committeeCommittee on the Environment, Public Health and Consumer Policy 3.40e-05
## committeeCommittee on Women's Rights and Equal Opportunities 0.381395
##
## (Intercept) ***
## nom_dist1_ep
## nom_dist1_nat .
## ctee_expert .
## chair ***
## committeeCommittee on Budgetary Control *
## committeeCommittee on Budgets ***
## committeeCommittee on Citizens' Freedoms and Rights, Justice and Home Affairs **
## committeeCommittee on Constitutional Affairs
## committeeCommittee on Culture, Youth, Education, the Media and Sport
## committeeCommittee on Development and Cooperation *
## committeeCommittee on Economic and Monetary Affairs **
## committeeCommittee on Employment and Social Affairs
## committeeCommittee on Fisheries ***
## committeeCommittee on Foreign Affairs, Human Rights, Common Security and Defence Policy *
## committeeCommittee on Industry, External Trade, Research and Energy **
## committeeCommittee on Legal Affairs and the Internal Market ***
## committeeCommittee on Petitions **
## committeeCommittee on Regional Policy, Transport and Tourism
## committeeCommittee on the Environment, Public Health and Consumer Policy ***
## committeeCommittee on Women's Rights and Equal Opportunities
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5738 0.1771 -8.888 <2e-16 ***
## attendance 4.0378 0.3286 12.289 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 29
## Log-likelihood: -1516 on 23 Df
As you can see from the regression table, these models often yield similar results. Note, though, that the binomial part of the hurdle model predicts the probability of a positive count, while the zero-inflation part predicts the probability of an excess zero; that is why the sign on attendance flips between the zero-inflated columns and the hurdle column.
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
          omit = "committee",
          title = "The ZERO part of the models: zero-inflated vs. hurdle",
          dep.var.caption = "Zero events (no reports)",
          zero.component = T,
          type = "html")
Zero events (no reports): nreports.

| | (1) zero-inflated | (2) zero-inflated | (3) zero-inflated | (4) hurdle |
|---|---|---|---|---|
| nom_dist1_ep | | 3.908*** (1.262) | | |
| nom_dist1_nat | | -1.957 (2.277) | | |
| ctee_expert | | -0.487 (0.379) | | |
| chair | | -12.719 (307.246) | | |
| attendance | -6.330*** (0.689) | -6.191*** (0.726) | -8.179*** (1.153) | 4.038*** (0.329) |
| Constant | 1.451*** (0.261) | -2.423*** (0.327) | 1.625*** (0.317) | -1.574*** (0.177) |
| Observations | 885 | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 | -1,516.094 |

Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
          omit = "committee",
          title = "The COUNT part of the models: zero-inflated vs. hurdle",
          dep.var.caption = "Number of reports",
          zero.component = F,
          type = "html")
Number of reports: nreports.

| | (1) zero-inflated | (2) zero-inflated | (3) zero-inflated | (4) hurdle |
|---|---|---|---|---|
| nom_dist1_ep | -0.358 (0.335) | -0.029 (0.340) | -0.488 (0.395) | 0.017 (0.401) |
| nom_dist1_nat | -0.780 (0.518) | -0.637 (0.451) | -0.700 (0.543) | -1.053* (0.595) |
| chair | 1.347*** (0.072) | 1.338*** (0.071) | 1.413*** (0.125) | 1.330*** (0.073) |
| attendance | | 0.663*** (0.138) | | |
| ctee_expert | 0.131** (0.055) | 0.072 (0.053) | 0.154** (0.072) | 0.097* (0.058) |
| Constant | -3.270*** (0.116) | -3.398*** (0.101) | -3.391*** (0.147) | -3.255*** (0.127) |
| Observations | 885 | 885 | 885 | 885 |
| Log Likelihood | -1,420.134 | -1,536.608 | -1,376.439 | -1,516.094 |

Note: *p<0.1; **p<0.05; ***p<0.01
Assessing the zero-inflated and hurdle models

Assessing and presenting the results from models of excess zeros is somewhat harder because the predicted outcomes are a function of two processes/models. Which of the two should we focus on? Or should we do both?

The binomial part (zero part) reports the likelihood of positive/negative counts and can be interpreted as such. The results of the count part, however, should be interpreted as the effects of the covariates conditional on being in the group of MEPs eligible for reports (given the binomial model).
Predictions

The `predict()` function will provide predictions as per usual. However, since the binomial model moderates the results of the count part, we should think about which of the quantities we’d like to present. The `type=` argument lets you decide which outcome you’re interested in:

- “count” reports the results from the count part of the model, conditional on passing the hurdle.
- “zero” reports the probability from the binomial model.
- “response” will yield the total expected outcome (with decimals). However, these predictions are done at the MEP level, and may therefore deviate quite substantially from the observed counts, due to the interaction of the two models.

To compare the model predictions, it might be more useful to compare the probabilities of the counts, i.e. the probability of receiving 0, 1, 2, etc. legislative proposals from the group leaders.

- “prob” reports the probability of each count category of reports.
This last option is interesting. Now, the function yields not a vector of predictions but a matrix. Each column corresponds to a count category (the values of `nreports`), while the observations (MEPs) are reported in the rows. The cells report the probability for each MEP to be in a specific category of counts.
#Predict outcomes as a matrix
preds_zero <- predict(mod1.zero.pois,
                      newdata = df,
                      #Specify the probability of a count
                      "prob")
dim(preds_zero)
## [1] 906 23
head(preds_zero)
## 0 1 2 3 4 5 6
## 1 0.2417732 0.2090101 0.2290520 0.16734384 0.09169520 0.040195136 0.0146831467
## 2 0.3100102 0.1837342 0.2060861 0.15410480 0.08642610 0.038776055 0.0144977666
## 3 0.1120386 0.2075920 0.2527395 0.20513706 0.12487528 0.060813337 0.0246797029
## 4 0.4461237 0.3283828 0.1589427 0.05128725 0.01241193 0.002403032 0.0003877023
## 5 0.3577113 0.3203768 0.2019022 0.08482618 0.02672883 0.006737829 0.0014153985
## 6 0.1308131 0.2308237 0.2593701 0.19429801 0.10916363 0.049065653 0.0183779026
## 7 8 9 10 11 12
## 1 4.597460e-03 1.259577e-03 3.067462e-04 6.723202e-05 1.339616e-05 2.446786e-06
## 2 4.646135e-03 1.302838e-03 3.247407e-04 7.284932e-05 1.485667e-05 2.777340e-06
## 3 8.584880e-03 2.612983e-03 7.069463e-04 1.721388e-04 3.810469e-05 7.731959e-06
## 4 5.361551e-05 6.487696e-06 6.978115e-07 6.755044e-08 5.944640e-09 4.795506e-10
## 5 2.548536e-04 4.015233e-05 5.623132e-06 7.087423e-07 8.120928e-08 8.529714e-09
## 6 5.900211e-03 1.657476e-03 4.138798e-04 9.301305e-05 1.900294e-05 3.558846e-06
## 13 14 15 16 17 18
## 1 4.125243e-07 6.458302e-08 9.436783e-09 1.292709e-09 1.666667e-10 2.029426e-11
## 2 4.792635e-07 7.679538e-08 1.148504e-08 1.610280e-09 2.124913e-10 2.648240e-11
## 3 1.448233e-06 2.518853e-07 4.088876e-08 6.222663e-09 8.912914e-10 1.205701e-10
## 4 3.570928e-11 2.469126e-12 1.593463e-13 9.640779e-15 5.489759e-16 2.952371e-17
## 5 8.269917e-10 7.445316e-11 6.256075e-12 4.928241e-13 3.653870e-14 2.558531e-15
## 6 6.152271e-07 9.875908e-08 1.479638e-08 2.078285e-09 2.747424e-10 3.430227e-11
## 19 20 21 22
## 1 2.341081e-12 2.565567e-13 2.677694e-14 2.667689e-15
## 2 3.126745e-12 3.507125e-13 3.746456e-14 3.820205e-15
## 3 1.545177e-11 1.881225e-12 2.181291e-13 2.414256e-14
## 4 1.504206e-18 7.280604e-20 3.356126e-21 1.476745e-22
## 5 1.697255e-16 1.069614e-17 6.419746e-19 3.677944e-20
## 6 4.057316e-12 4.559093e-13 4.878977e-14 4.983973e-15
This option will give us a better idea about the model’s performance in the aggregate because we can assess the distribution of the outcomes. We can either identify the most likely outcome as if each count were a category, or simply calculate the aggregated probabilities for each category/outcome.
Most likely outcome

Here I sum over each column (the expected number of MEPs in each count category), and can pick out each MEP’s most likely category.
colSums(preds_zero, na.rm = T)
## 0 1 2 3 4 5
## 338.1925200 142.5799967 124.2709222 95.7526814 67.5581729 44.4231577
## 6 7 8 9 10 11
## 27.5343934 16.2979559 9.3912330 5.4211660 3.2588708 2.1232026
## 12 13 14 15 16 17
## 1.5306716 1.2069874 1.0078823 0.8629301 0.7406705 0.6289277
## 18 19 20 21 22
## 0.5244628 0.4277650 0.3404720 0.2641097 0.1995317
which.max(preds_zero[1,])
## 0
## 1
max(preds_zero[906,]) == preds_zero[906,]
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 13 14 15 16 17 18 19 20 21 22
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Identify the most likely count category for each MEP
df$preds_zero_cat <- apply(preds_zero, 1,
                           function(x) if (all(is.na(x))) NA else which.max(x) - 1)

#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes: most likely count category
  geom_histogram(aes(x = preds_zero_cat),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
Here, we reported the most likely count category for each MEP. As is clear from the overlay, my model doesn’t predict the categories very precisely. That is, the probability that an MEP falls into a single count category is not very precisely estimated. However, if we treat the outcome as metric rather than categorical, the perception might change.
Distribution of probabilities

We can simply compare the proportion of observed data in each category of `nreports` with the predicted probabilities. That way, we can take into account that an MEP has a fair chance of getting – say – 3-5 reports, but a low probability of getting 0 or 10.
To do so, I create two data sets – one with predicted values and one with observed values – then merge them. Last, I make a histogram.
#Data frame with predicted probabilities
predicted_prob <-
  #the matrix from the predict function
  preds_zero %>%
  #make it into a data frame
  as.data.frame %>%
  #Take the mean of all the columns
  summarize(across(everything(), mean, na.rm = T)) %>%
  #Flip columns and rows to get a data set with two variables
  tidyr::pivot_longer(everything(),
                      names_to = "nreports",
                      values_to = "pred_prob") %>%
  #Recode to a numeric variable
  mutate(nreports = as.numeric(nreports))

#check
head(predicted_prob)
#Data frame with observed proportions
observed_prob <-
  df %>%
  #Count the number of cases in each category
  group_by(nreports) %>%
  summarise(count = n()) %>%
  #Ungroup to consider all the data
  ungroup %>%
  #Calculate the total number of observations
  mutate(N = sum(count),
         #divide by all observations to get a proportion
         prob = count/N,
         #Recode count category to numeric
         nreports = as.numeric(nreports))
#Merge the two data frames
df_probs <- full_join(observed_prob,
                      predicted_prob,
                      by = "nreports")
#Plot
df_probs %>%
  # You can filter away the zeros to zoom in on the counts
  # filter(nreports > 0) %>%
  ggplot() +
  #Grey == observed proportions
  geom_histogram(aes(y = prob,
                     x = nreports),
                 stat = "identity") +
  #Red == predicted values
  geom_histogram(aes(y = pred_prob,
                     x = nreports),
                 stat = "identity",
                 fill = "red", alpha = 0.5) +
  #Cut the x-axis to the values you want to see
  coord_cartesian(xlim = c(0,25)) +
  ggtitle("Predicted vs. observed counts of legislative proposals",
          subtitle = "Results from a zeroinflated model") +
  ylab("Probability") +
  xlab("Number of legislative proposals (`reports`)") +
  theme_minimal()
This looks much better!
Rootogram
The rootogram function handles these models without any problems. The expected frequencies are based on the total expected counts (from the two model parts combined).
rootogram(mod1.zero.pois)
Robust standard errors: when observations are clustered in groups
The standard errors reported in regression models assume that observations are independently drawn, i.e. that all observations have an equal probability of getting the outcome we observe. This is NOT a particular assumption of count models; it is an assumption most models share.

However, this is not always the case. Sometimes, our observations share some common features. We may, for example, think that the number of legislative proposals is determined by the committee that the MEP is a member of. This creates – potentially – two problems. One is that committee membership may be a confounder for the relationship between MEPs’ ideology and their probability of garnering legislative proposals. This could happen, for example, if MEPs with extreme political views were assigned to committees with little legislative activity (as a way for the leadership to side-line the extremists). We can address this by controlling for committee membership the way we did in the previous section. This will potentially change the regression parameter if – indeed – there is such a relationship.

Another problem is that the MEPs are in fact clustered in committees and therefore are not independently drawn. This has an impact on how we calculate the standard error (i.e. the statistical significance) of our effects, but doesn’t imply any change in the parameter as such. We can address this by clustering the standard errors. Clustered standard errors adjust the standard errors of regression coefficients to account for correlation within clusters (here: committees), while assuming independence between clusters (here: between committees). In other words, we assume that MEPs who sit on the same committee have a similar baseline probability of getting legislative proposals.
Normal standard errors
We can extract normal standard errors from the variance-covariance matrix of the model object using the `vcov()` function. It reports how the regression parameters vary/covary.
vcov(mod2.zero.pois)
| | count_(Intercept) | count_nom_dist1_ep | count_nom_dist1_nat | count_ctee_expert | count_chair | count_attendance | zero_(Intercept) | zero_nom_dist1_ep | zero_nom_dist1_nat | zero_ctee_expert | zero_chair | zero_attendance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count_(Intercept) | 0.010 | -0.005 | -0.008 | -0.001 | -0.001 | -0.013 | 0.011 | -0.005 | -0.015 | -0.003 | -0.003 | -0.011 |
| count_nom_dist1_ep | -0.005 | 0.116 | -0.024 | 0.001 | 0.000 | -0.002 | -0.002 | 0.092 | 0.030 | 0.002 | -0.003 | -0.015 |
| count_nom_dist1_nat | -0.008 | -0.024 | 0.204 | 0.000 | 0.000 | 0.008 | -0.015 | -0.003 | 0.292 | 0.011 | -0.014 | 0.012 |
| count_ctee_expert | -0.001 | 0.001 | 0.000 | 0.003 | 0.000 | 0.000 | -0.002 | 0.000 | 0.006 | 0.006 | -0.001 | 0.002 |
| count_chair | -0.001 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.003 | -0.002 | -0.001 | 0.002 | -0.003 |
| count_attendance | -0.013 | -0.002 | 0.008 | 0.000 | 0.000 | 0.019 | -0.014 | -0.005 | 0.010 | 0.001 | 0.003 | 0.022 |
| zero_(Intercept) | 0.011 | -0.002 | -0.015 | -0.002 | 0.000 | -0.014 | 0.107 | -0.040 | -0.189 | -0.034 | 0.007 | -0.181 |
| zero_nom_dist1_ep | -0.005 | 0.092 | -0.003 | 0.000 | 0.003 | -0.005 | -0.040 | 1.592 | -0.545 | -0.003 | 0.030 | -0.217 |
| zero_nom_dist1_nat | -0.015 | 0.030 | 0.292 | 0.006 | -0.002 | 0.010 | -0.189 | -0.545 | 5.187 | 0.102 | -0.244 | 0.200 |
| zero_ctee_expert | -0.003 | 0.002 | 0.011 | 0.006 | -0.001 | 0.001 | -0.034 | -0.003 | 0.102 | 0.144 | -0.035 | 0.005 |
| zero_chair | -0.003 | -0.003 | -0.014 | -0.001 | 0.002 | 0.003 | 0.007 | 0.030 | -0.244 | -0.035 | 94400.207 | -0.056 |
| zero_attendance | -0.011 | -0.015 | 0.012 | 0.002 | -0.003 | 0.022 | -0.181 | -0.217 | 0.200 | 0.005 | -0.056 | 0.527 |
Standard errors are calculated as the square root of the diagonal of that matrix.
#Variance-covariance matrix
vcov(mod2.zero.pois) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.10079098 0.34019014 0.45140501 0.05283447
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.07137039 0.13777445 0.32682849 1.26191966
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 2.27742363 0.37903328 307.24616665 0.72611315
Can you recognize the results from the summary of `mod2.zero.pois`?
Clustered standard errors
We can estimate clustered standard errors using the `sandwich` package in R.
#install.packages("sandwich")
library(sandwich)
The relevant function is `vcovCL()`. It requires an additional argument, `cluster=`, where we provide the variable we want to cluster the errors on. Here, I cluster by committee. In GLMs the standard `type=` of errors is “HC0”; this is set automatically.
#Variance-covariance matrix
vcovCL(mod2.zero.pois,
       #Cluster according to committee membership
       cluster = df$committee,
       type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
Be mindful that if you have NAs in your model (as I do here), you will have to subset the clustering variable so that the NAs disappear. My hack is to subset using the `predict()` function.
#predict
df$preds <- predict(mod2.zero.pois, df)

#Variance-covariance matrix
vcovCL(mod2.zero.pois,
       #Subset, then cluster
       cluster = df$committee[!is.na(df$preds)],
       type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
I now have regression coefficients from the model object and standard errors estimated separately.
#Copy standard errors to a separate object
ses.clustered <-
  vcovCL(mod2.zero.pois,
         #Subset, then cluster
         cluster = df$committee[!is.na(df$preds)]) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

#Summary statistics from the original model (for comparison)
summary.mod1.zero <- summary(mod2.zero.pois)
#Extract the standard errors of the count part of the model
ses.normal <- summary.mod1.zero$coefficient$count[,"Std. Error"]
We pass the errors to stargazer as a list of named standard errors. These names must be the same as the names of the regression coefficients. Here we see that the names of the standard errors are prefixed by `count_` for the count model and `zero_` for the zero hurdle.
ses.clustered
## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
… however, the coefficients do not have these names.
mod2.zero.pois$coefficients
## $count
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert chair
## -3.39832253 -0.02883625 -0.63717556 0.07216832 1.33814804
## attendance
## 0.66311572
##
## $zero
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert chair
## -2.4234745 3.9084822 -1.9573037 -0.4873135 -12.7187862
## attendance
## -6.1911903
I will have to change the names to get stargazer to understand my objects and link coefficients and errors correctly. I use the `gsub()` function (from our session on recoding of variables) to replace the string `count_` with nothing (`""`).
names(ses.clustered) <- names(ses.clustered) %>%
gsub("count_", "", .)
ses.clustered
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## chair attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
Now, I can finally make my regression table. Here, I report the results from the original model and the results with the clustered errors.
stargazer(mod2.zero.pois, mod2.zero.pois,
se = list(ses.normal, ses.clustered),
title = "Normal vs. clustered standard errors",
type = "html")
Dependent variable: nreports.

| | (1) | (2) |
|---|---|---|
| nom_dist1_ep | -0.029 (0.340) | -0.029 (0.359) |
| nom_dist1_nat | -0.637 (0.451) | -0.637 (0.701) |
| ctee_expert | 0.072 (0.053) | 0.072 (0.140) |
| chair | 1.338*** (0.071) | 1.338*** (0.152) |
| attendance | 0.663*** (0.138) | 0.663*** (0.234) |
| Constant | -3.398*** (0.101) | -3.398*** (0.200) |
| Observations | 885 | 885 |
| Log Likelihood | -1,536.608 | -1,536.608 |

Note: *p<0.1; **p<0.05; ***p<0.01
I always check the results: My clustered errors should be larger than the normal errors, since the whole point of this exercise is to account for the fact that my number of observations overstates the amount of independent information in the data (i.e. I have less variation than we might think, because outcomes depend on committee membership).
Need a brush-up on robust/clustered standard errors? Robust standard errors and clustered standard errors are both methods for estimating standard errors that are robust to violations of the assumption of homoscedasticity (i.e., equal variance) in linear regression models. However, they are used in different contexts and with different assumptions.
Robust standard errors are used when the errors are believed to be heteroscedastic, but the source of the heteroscedasticity is unknown. In this case, the robust standard errors adjust the standard errors of the regression coefficients to account for the potential heteroscedasticity of the errors. Robust standard errors can be estimated using various methods, such as the sandwich estimator or the White estimator, and do not require any additional information about the structure of the data.
Clustered standard errors, on the other hand, are used when the errors are believed to be heteroscedastic and correlated within groups, but uncorrelated between groups. In this case, the groups can be defined in various ways, such as geographic regions, schools, or firms, and the clustered standard errors adjust the standard errors of the regression coefficients to account for the correlation of the errors within each group. The clustered standard errors can be estimated using the same methods as the robust standard errors, but they require information about the grouping structure of the data.
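In sketch form, both are sandwich estimators, \(\widehat{V} = B M B\): the “bread” \(B\) is the usual inverse-Hessian variance, and the “meat” \(M\) is built from the scores. Heteroscedasticity-robust versions sum one score product per observation, while the clustered version first sums the scores within each cluster, \(M = \sum_{g=1}^{G} s_g s_g^\top\).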
Display

To finish, here is a sketch of how we might display effects from the zero-inflated model by simulation: draw coefficients from their estimated sampling distribution, build a scenario, and summarize the resulting expected counts. The scenario values (e.g. attendance = 0.7, an exposure of 60 months) are illustrative choices.

mod.zero <- zeroinfl(nreports ~
                       nom_dist1_ep +
                       nom_dist1_nat +
                       ctee_expert +
                       chair + offset(log(months)) | attendance,
                     #Count model == poisson
                     dist = "poisson",
                     df)

#Scenario for the count part: vary the ideological distance to the EP group
newdata1 <- data.frame(
  intercept1 = 1,
  nom_dist1_ep = seq(0, 1, 0.01),
  nom_dist1_nat = 0.028,
  ctee_expert = 0,
  chair = 0)

#Scenario for the zero part
newdata0 <- data.frame(
  intercept0 = 1,
  attendance = 0.7)

#Simulate the count-part coefficients (the first five parameters)
sim1 <- mvrnorm(n = 1000,
                mu = coefficients(mod.zero)[1:5],
                Sigma = vcov(mod.zero)[1:5, 1:5])

#Simulate the zero-part coefficients (the last two parameters)
sim0 <- mvrnorm(n = 1000,
                mu = coefficients(mod.zero)[6:7],
                Sigma = vcov(mod.zero)[6:7, 6:7])

#Predicted probabilities from the full model; exposure fixed at 60 months
prob <- predict(mod.zero,
                newdata = cbind(newdata1, newdata0, months = 60) %>% as.data.frame(),
                type = "prob")

#Linear predictors from the count part for each simulated coefficient vector
preds <- as.matrix(newdata1) %*% t(sim1)

#Transform to expected counts
preds <- preds %>% exp()

#Median and 95% interval of the expected count for each scenario
pdf <- as.data.frame(t(apply(X = preds,
                             MARGIN = 1,
                             quantile,
                             probs = c(.025, .5, .975))))