Models for event counts
In this session, we will be analyzing count outcomes. We will explore the baseline poisson model, test for overdispersion, and address potential problems.
We will familiarize ourselves with:
- how to estimate different count models:
  - poisson and quasi-poisson models from base R
  - negative binomial models from MASS
  - zero-inflated and hurdle models from pscl
- clustered standard errors from sandwich. While these are not specific to count models, we sometimes need to cluster our standard errors because our observations are clustered. You'll find an example at the end of the script.
We will be working with data from Yoshinaka, McElroy, and Bowler (2010) on report allocation in the European Parliament (EP). All legislative proposals are handled by a member of one of the committees in the EP. This person (the "rapporteur") is in charge of collecting information, negotiating a consensus between the party groups and with the other institutions (the Commission and the Council), and drafting the resulting parliamentary amendments on behalf of the committee. For any single policy proposal, the rapporteur is the single most influential member of parliament (MEP). The rapporteur is appointed by the group leader in the committee in charge of the proposal. Group leaders are committee members of the transnational party group of MEPs. These are (transnational) parliamentary parties to which national parties adhere (or not).
Back in the day, it was a common belief that national parties were running the show in the EP, since they are in charge of candidate selection before elections, and because transnational groups have little presence outside of Parliament.
This article explores the argument that party leaders will prefer to appoint group members that are ideologically close to the transnational party median and contrasts this with the ideological proximity between MEPs and their national party group. The aim is to understand who allocates political influence in the European Parliament: National parties or European political groups? The model also includes an indicator of whether being a policy expert is an asset.
Introduction
We start by loading in the data and relevant packages.
#Data wrangling
library(dplyr)
#Plots
library(ggplot2)
#Set preferences for aesthetics in ggplot2
theme_set(
  #The "minimal" theme; try out others
  theme_minimal()
)
#Load in data
load("df_yoshinaka.rda")
#Rename the object
df <- df_yoshinaka %>%
  #Data originally from Stata, so redefine as a data frame in R
  as.data.frame()
Descriptive statistics
Outcome variable: number of legislative proposals
The outcome variable is the number of reports (i.e. legislative proposals, nreports) each MEP has garnered during the legislative period. We have, in other words, a data generating process in which events occur – in principle unrelated to each other – within a certain window/exposure time. This is the type of outcome where we resort to the poisson model or one of its affiliates.
df %>%
  ggplot() +
  geom_histogram(
    aes(nreports)
  ) +
  xlab("Number of reports (proposals) per MEP") +
  ylab("Number of MEPs") +
  ggtitle("Distribution of legislative proposals among MEPs")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution of legislative proposals is quite unequal. The distribution looks like a “ski slope”: I have many zero counts (MEPs who never proposed legislation) and some really high counts (MEPs that have handled quite a few proposals).
#Number of observations (MEPs)
nrow(df)
## [1] 906
#Number of no events (MEPs without any reports)
sum(df$nreports==0)
## [1] 367
#Number of extreme events
quantile(df$nreports, 0.95)
## 95%
## 7
max(df$nreports)
## [1] 22
A fairly large number of MEPs (more than a third) never drafted a report at all, while the 5% most productive members drafted 7 or more reports. The most prolific MEP drafted no less than 22 proposals!
Overall, the variance in our outcome is higher than the mean.
#Mean
mean(df$nreports)
## [1] 1.970199
#Variance
var(df$nreports)
## [1] 8.340547
This is not uncommon for count data. The poisson distribution used in poisson models assumes that the mean is the same as the spread/variance in the conditional outcome. If our regression model contains enough predictors to explain this variance, that will not be a problem. However, it means I need predictors that explain both the long tail and the high zero count.
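To get a feel for how restrictive this assumption is, here is a minimal sketch: simulate a counterfactual outcome from a poisson distribution with the same mean as nreports and compare the variances.

#Simulate counts from a poisson with the observed mean (sketch)
set.seed(42)
sim <- rpois(n = nrow(df), lambda = mean(df$nreports))
#By construction, the simulated variance is close to the mean (about 2)...
var(sim)
#...while the observed variance is much larger (about 8.3): a first hint of overdispersion
var(df$nreports)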
Main predictors: political distance
Our research question concerns the political distance between each MEP and their political group (i.e. the median "voter" in the group). Here, I explore the distribution of our main predictor, nom_dist1_ep: the absolute left-right distance between each MEP and their group median. Out of curiosity, I plot it per political group.
df %>%
  ggplot() +
  geom_histogram(
    aes(nom_dist1_ep)
  ) +
  xlab("Left-right political distance between MEP and EP group median") +
  ylab("Number of MEPs") +
  ggtitle("Distribution of political distance among MEPs")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (`stat_bin()`).
My interest in the distribution of the predictor is first and foremost to check what variation I have. I can see that there is quite some variation between MEPs, but I already get a sense that there are quite a few MEPs that are really close to their group, while some people are far off from the group median. It already tells me that – even if my hypothesis is correct – my predictor cannot explain all the variation in the dependent variable. That’s because I’d expect that the higher counts in reports would be among MEPs that are close to their leaders. Here, we see that the group leaders have many MEPs to choose from that are politically close to them; yet only very few MEPs are going to be “super-rapporteur” (i.e. long tails). Similarly, we have too many zero-counts in the outcome compared to political outliers in the predictor.
Control variables
We will include other control variables as well: The political distance to the national party, expertise, activity levels etc.
library(stargazer)

df_tab <-
  df %>%
  #The select function from dplyr; not from MASS
  dplyr::select(nreports, nom_dist1_ep, nom_dist1_nat,
                ctee_expert, activity, attendance, months) %>%
  #Transform to a data frame for stargazer to "understand" (sometimes necessary)
  as.data.frame()

stargazer(df_tab,
          title = "Descriptive statistics",
          #File type; could be "text"
          type = "html",
          #I want summary statistics from the data frame
          summary = T)
Statistic | N | Mean | St. Dev. | Min | Max |
nreports | 906 | 1.970 | 2.888 | 0 | 22 |
nom_dist1_ep | 899 | 0.075 | 0.115 | 0.000 | 1.205 |
nom_dist1_nat | 899 | 0.028 | 0.076 | 0.000 | 0.849 |
ctee_expert | 905 | 0.310 | 0.463 | 0 | 1 |
activity | 904 | 0.753 | 0.211 | 0.000 | 0.994 |
attendance | 889 | 0.517 | 0.257 | 0.004 | 0.984 |
months | 906 | 42.435 | 19.388 | 1 | 60 |
Bivariate statistics
With two metric variables, I’d go for a scatterplot and a local regression. Since the outcome variable is a count, I’ll jitter the y-axis to get a feel for how many observations I have.
df %>%
  ggplot() +
  #Scatterplot with jittering
  geom_jitter(aes(
    y = nreports,
    x = nom_dist1_ep
  ),
  #Make dots transparent
  alpha = 0.4) +
  #Local regression
  geom_smooth(aes(
    y = nreports,
    x = nom_dist1_ep
  )) +
  #Linear regression
  geom_smooth(aes(
    y = nreports,
    x = nom_dist1_ep
  ),
  method = "lm",
  color = "darkgrey",
  se = F) +
  ggtitle("Bivariate statistics: Legislative proposals as a function of political distance") +
  ylab("Legislative proposals (reports)") +
  xlab("Left-right distance between MEP and the EP group")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 7 rows containing non-finite values (`stat_smooth()`).
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 7 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 7 rows containing missing values (`geom_point()`).
We get the feeling that there is definitely something here. However, there are quite a few outliers too.
Why not a linear model? (what not to do)
Ward and Ahlquist (2018) warn us not to run linear regressions on count data. There is one exception: when all your predictors are categorical, you are just calculating the difference between group averages anyway, so you could do a linear model.
In our case, however, we have a continuous predictor, and my baseline go-to is the poisson regression. The poisson distribution has a single parameter \(\lambda\) (lambda). It describes both its mean/expected value and the spread/variance. That's both its biggest asset and a drawback.
The asset is that the variance increases with the mean. That is, the distribution “knows” that your data will be heteroscedastic. Look:
mod1.ols <- lm(nreports ~
                 nom_dist1_ep +
                 nom_dist1_nat +
                 ctee_expert, df)
Standardized residuals:
#Predictions
df$preds_ols <- predict(mod1.ols, df)
#Residuals
df$resid_ols <- df$nreports - df$preds_ols
#Standardize to z-scores
df$resid_ols_z <- df$resid_ols/sd(df$resid_ols, na.rm = T)
Plot the residuals against the predicted values.
df %>%
  ggplot() +
  geom_point(aes(y = resid_ols_z,
                 x = preds_ols)) +
  ggtitle("Residual plot")
## Warning: Removed 8 rows containing missing values (`geom_point()`).
We have a fan-shaped distribution. As my predictions increase, I will be increasingly wrong (i.e. my residuals increase). If I were to run an OLS, my standard errors would be too small, because the normal distribution assumes an equal spread.
Some people “fix” the problem by log-transforming the dependent variable before they run a linear model (drawing from a normal distribution). However, even though this fixes the problem that my distribution is bounded at zero, it does not fix the problem with heteroscedasticity.
mod2.ols <- lm(log(nreports + 1) ~
                 nom_dist1_ep +
                 nom_dist1_nat +
                 ctee_expert, df)
Use the ready-made function to get a residual plot. Hit "Return" on your keyboard to flip through different model statistics.
plot(mod2.ols)
The drawback of the assumption of an equal mean and spread is that sometimes the spread is still too high, even for the poisson distribution. That's when we have overdispersion.
Baseline model: poisson
Estimation
Running the poisson model is simple. It is a GLM, and the glm() function is part of base R. The only code we need to change is the probability distribution.
mod1.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 #tell R which probability distribution we use
                 family = "poisson",
                 df)
summary(mod1.pois)
##
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert,
## family = "poisson", data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.70968 0.03609 19.666 < 2e-16 ***
## nom_dist1_ep -1.76930 0.31886 -5.549 2.88e-08 ***
## nom_dist1_nat -1.47164 0.50029 -2.942 0.00327 **
## ctee_expert 0.35966 0.04849 7.417 1.20e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2970.4 on 897 degrees of freedom
## Residual deviance: 2849.2 on 894 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: 4371.9
##
## Number of Fisher Scoring iterations: 6
Exposure (the event window)
Thinking about the data generating process, I realize that MEPs don't all serve entire terms in the European Parliament. Some quit before the end of the term, others swap committees. In other words, the exposure time varies across MEPs. In the data, there is a variable called months that reports the number of months the member served on the committee.
If the exposure time is not the same across observations, the probability of a number of events happening also varies. We could address this by controlling for the log-transformed variable. It is more common to use an offset. The offset is – for all practical purposes – a log-transformed control variable we include in the model, but for which the slope coefficient (parameter) is fixed at 1.
In R, I can do this either by using the offset() function in the formula or by using the offset= argument in the glm() function. These two solutions are equivalent.
#Take 1: the offset= argument
mod2.pois <- glm(nreports ~
                   #Political distance to EP party
                   nom_dist1_ep +
                   #Political distance to national party
                   nom_dist1_nat +
                   #Committee expert
                   ctee_expert,
                 family = "poisson",
                 #Time on the committee
                 offset = log(months),
                 df)

#Take 2: offset() in the formula
mod2.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert +
                   #Time on the committee
                   offset(log(months)),
                 family = "poisson",
                 df)
The offset in a poisson regression is typically used when we want the coefficient of a predictor to represent a multiplicative effect on the rate rather than the count. This is common in scenarios such as rate per time or area where the exposure (e.g., time, area, population) can vary between observations.
In our case, we are interested in modeling the count of reports (events) relative to the exposure time (months served on the committee). Using log(months) as an offset keeps the relationship multiplicative when working on the log scale of the response in a poisson regression.
As we can see from comparing mod1.pois and mod2.pois, the regression coefficients change substantially. However, the intercept changed too. Since the model is a member of the exponential family with inbuilt interaction effects, it is hard to know to what extent the change in coefficients is substantial.
stargazer(mod1.ols, mod2.ols, mod1.pois, mod2.pois,
type = "html")
Dependent variable: | ||||
nreports | log(nreports + 1) | nreports | ||
OLS | OLS | Poisson | ||
(1) | (2) | (3) | (4) | |
nom_dist1_ep | -2.313*** | -0.799*** | -1.769*** | -0.853*** |
(0.858) | (0.224) | (0.319) | (0.281) | |
nom_dist1_nat | -1.928 | -0.589* | -1.472*** | -0.937** |
(1.287) | (0.336) | (0.500) | (0.439) | |
ctee_expert | 0.771*** | 0.250*** | 0.360*** | 0.223*** |
(0.206) | (0.054) | (0.048) | (0.049) | |
Constant | 1.975*** | 0.772*** | 0.710*** | -3.076*** |
(0.133) | (0.035) | (0.036) | (0.035) | |
Observations | 898 | 898 | 898 | 898 |
R2 | 0.029 | 0.045 | ||
Adjusted R2 | 0.025 | 0.042 | ||
Log Likelihood | -2,181.944 | -1,868.368 | ||
Akaike Inf. Crit. | 4,371.889 | 3,744.736 | ||
Residual Std. Error (df = 894) | 2.858 | 0.746 | ||
F Statistic (df = 3; 894) | 8.809*** | 14.041*** | ||
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Now, you only have to set the exposure if it is different between observations. If all observations have the same exposure, you can leave it untouched. R will assume your window of events is 1.
Interpretation
Interpreting the results is fairly straightforward. As per usual, it involves back-transforming the recoding of the dependent variable. In our case, the model has log-transformed the count in order to approximate the latent variable on which it runs the regression. Interpreting the marginal effects works the same way as in binomial logistic regression; the predicted outcomes (and thus the first differences) are easier, since it suffices to take the exponent.
Marginal effects
A one-unit change in x would change y by a factor of \(exp(\beta)\).
exp(mod2.pois$coefficients["nom_dist1_ep"])
## nom_dist1_ep
## 0.4260668
Once I've taken the exponent, I see that the factor is less than one; i.e. the number of legislative proposals decreases as the political distance increases. I then do my usual trick to change this into a percentage change.
eff1 <- (1 - exp(mod2.pois$coefficients["nom_dist1_ep"]))
eff1
## nom_dist1_ep
## 0.5739332
A one-unit increase in the political distance decreases the number of legislative proposals handled by the MEP by 57%.
How likely is such a shift? That is, how much “elasticity” is there in my predictor?
To determine a likely shift in my x-variable – the political distance between the MEP and the group – I consider the interquartile range. That’s a shift across the 50% most likely values in the data.
#Consider
summary(df$nom_dist1_ep)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.01700 0.03900 0.07476 0.08200 1.20500 7
#Interquartile range (1st to 3rd quartiles)
iqr <- IQR(df$nom_dist1_ep, na.rm = T)
iqr
## [1] 0.06499999
Hmmm… a one-unit change is extremely unlikely. In fact, a typical shift in the data is 0.065. Let's reformulate, then.
eff2 <- 1 - (exp(mod2.pois$coefficients["nom_dist1_ep"] * iqr))
eff2
## nom_dist1_ep
## 0.05394572
Right, so: comparing an MEP that is fairly close to the group (1st quartile) to an MEP that is fairly far from the group median (3rd quartile), I find that the most distant MEP tends to write 5% fewer proposals on behalf of the group than the ideologically close MEP.
Another scenario, given the topic, could be to compare someone that shares the group’s views completely (x = 0) to a distant MEP (3rd quartile).
#Consider
summary(df$nom_dist1_ep)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.01700 0.03900 0.07476 0.08200 1.20500 7
#Third quartile (a fairly distant MEP)
distance <- quantile(df$nom_dist1_ep, na.rm = T, probs = c(0.75))
distance
## 75%
## 0.082
eff3 <- 1 - (exp(mod2.pois$coefficients["nom_dist1_ep"] * distance))
eff3
## nom_dist1_ep
## 0.067568
The relative effect is somewhat larger, but not massive.
First-differences
To calculate the first-difference, I’d have to set up some full-fledged scenarios, then fill in the equation for predictions, backtransform, then compare.
Scenario
I start by setting a scenario. Here, I use the same as before, but I also have to specify the control variables. As long as the glm recoding of the outcome involves a log transformation, all effects will be proportional to each other.
scenario <- data.frame(
  #Two scenarios: one close and one far away
  nom_dist1_ep = quantile(df$nom_dist1_ep, na.rm = T, probs = c(0, 0.75)),
  #Set the control to a typical value; here, the median
  nom_dist1_nat = median(df$nom_dist1_nat, na.rm = T),
  #A non-expert
  ctee_expert = 0,
  #NB: the offset variable also has to be included, but don't log-transform it!
  months = 60)
When you set your scenario, you will also have to include the offset if it is anything but 1. That is, if you specified an offset during the estimation, then your predictions will have to take it into account. This is because the offset determines the "window of opportunity", or exposure to an event, as a place-holder for the non-events we cannot observe.
Fill in the equation
I can fill in the equation “by hand”:
#The coefficients
a = mod2.pois$coefficients[1]
b1 = mod2.pois$coefficients[2]
b2 = mod2.pois$coefficients[3]
b3 = mod2.pois$coefficients[4]

#The equation; first scenario
pred1.hand <-
  #intercept
  a * 1 +
  #first scenario, political distance == 0
  b1 * scenario[1, 1] +
  #typical distance to national party
  b2 * scenario[1, 2] +
  #committee expert
  b3 * scenario[1, 3] +
  #The offset variable has a regression coefficient of 1
  1 * log(scenario[1, 4])

#The equation; second scenario
pred2.hand <-
  #intercept
  a * 1 +
  #second scenario, political distance == 3rd quartile
  b1 * scenario[2, 1] +
  #typical distance to national party
  b2 * scenario[2, 2] +
  #committee expert
  b3 * scenario[2, 3] +
  #offset
  1 * log(scenario[2, 4])
#backtransform
exp(pred1.hand);
## (Intercept)
## 2.74206
exp(pred2.hand)
## (Intercept)
## 2.556784
#first difference
exp(pred2.hand) - exp(pred1.hand)
## (Intercept)
## -0.1852755
The predicted change in the outcome is next to zero when we consider a likely difference in x. It seems that the preference distance is mostly a threat to political outliers in the group.
I can make these and similar predictions using the predict() function too.
predict(mod2.pois, scenario, type = "response")
## 0% 75%
## 2.742060 2.556784
Be aware that if you have set the offset variable, you'd have to add it to your scenario; it's a variable in its own right. However, the predict() function takes the exponent (back-transforms), so in the scenario the offset variable should be left untransformed.
preds1 <- predict(mod2.pois, scenario, type = "response")
preds1
## 0% 75%
## 2.742060 2.556784
Here, we see that the predicted number of reports for an MEP who is politically close to their EP group, reasonably close to their national group, and who is not an expert is 2.74. An equivalent MEP whose preferences are further away can expect to receive 2.56 reports. A substantial move in preferences therefore leads to a 0.1852755 decrease in the number of reports. This is a very small effect.
Another way of phrasing this is to say that roughly one in five political outliers will lose out on a legislative report during the 60 months (exposure) they served in the EP.
Effect plot with ggpredict
I can use the ggeffects package to plot the effect of political distance on influence. However, the package does not like it when we recode our variables in the equation. This is what I did when I set the offset. I therefore reestimate the model.
#A new offset variable
df$offset <- log(df$months)
#Re-estimate the model
mod2.pois <- glm(nreports ~
                   nom_dist1_ep +
                   nom_dist1_nat +
                   ctee_expert,
                 offset = offset,
                 family = "poisson",
                 df)
I let the political distance slide across a reasonable spectrum.
library(ggeffects)
eff <- ggpredict(mod2.pois,
                 terms = list("nom_dist1_ep" = seq(0, 0.5, 0.01)))
Here is the “dirty” way of plotting.
eff %>% plot
R fits the y axis so that everything looks fantastic. I can build on this to create a better plot.
eff %>%
  ggplot() +
  geom_ribbon(
    aes(
      ymax = conf.high,
      ymin = conf.low,
      x = x
    ),
    #Transparent ribbon of uncertainty
    alpha = 0.3
  ) +
  geom_line(
    aes(
      y = predicted,
      x = x
    )
  ) +
  ylim(0, 3) +
  ylab("Number of legislative proposals") +
  xlab("Political distance from the leadership") +
  ggtitle("Effect of political distance on productivity (legislative proposals) among MEPs")
Model fit
Let’s assess the model’s fit (Ward and Ahlquist 2018, 197–201). This means comparing predicted and observed outcomes. This time, I am scouting for overdispersion, which is the one hitch with the poisson model.
Overdispersion happens when I predict too few high values and low values compared to the actual distribution of the outcome variable. Because the poisson distribution assumes a variance equal to the mean, my standard errors are going to be too small/underestimated. "Everything" becomes statistically significant, yet it would all be lies.
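Before turning to the plots, we can get a rough numeric indication. A minimal sketch based on the Pearson residuals of mod2.pois: the Pearson chi-square statistic divided by the residual degrees of freedom should be close to 1 under the poisson assumption, while values well above 1 point to overdispersion.

#Pearson chi-square statistic divided by the residual degrees of freedom (sketch)
sum(residuals(mod2.pois, type = "pearson")^2) / df.residual(mod2.pois)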
Histograms
I usually start the really intuitive way. How does the distribution of my predictions compare to the distribution of the observed counts?
#Predict outcomes
df$preds_pois2 <- predict(mod2.pois,
                          newdata = df,
                          #We can skip the argument name (type =) here; we want exponentiated outcomes
                          "response")
#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_pois2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8 rows containing non-finite values (`stat_bin()`).
This looks dismal. We are predicting too few zero events and too few extreme counts. Clearly, the variation in the distribution of influence in the EP is larger than what our current model accounts for. It seems we are in the presence of overdispersion. It means that the standard errors are too small, since the poisson distribution has clear expectations about the variance.
Hanging rootogram
I can plot the same thing using hanging rootograms (Ward and Ahlquist 2018). The countreg/topmodels package has a ready-made function for us. It is available for installation online.
#install.packages("distributions3"); hit "Yes" in R console
#install.packages("topmodels", repos="http://R-Forge.R-project.org")
library(topmodels)
#Rootogram
rootogram(mod1.pois,
main = "Hanging rootogram of mod1.pois")
The plot gives us the same message: It uses the span of predicted values to create the bars, but compares, within each predicted count, the number of wrong predictions. Once again, we underpredict zeros as well as extreme values. That is, all bars that hang below the zero line are counts we underpredict. All bars that never reach the zero line are counts that are overpredicted.
Addressing overdispersion
Overdispersion is not necessarily a big problem if i) we believe that we don't have an omitted variable bias and ii) we have described the data generating process correctly. If we believe both conditions hold, we can simply make the standard errors larger. That is what the quasi-poisson model does (Ward and Ahlquist 2018, 202).
However, I usually take the opportunity to think more closely about the data generating process, i.e. I put on my polisci hat and start thinking about how the events occur. That’s what we will do in the following.
Quasi-poisson: fixing the standard errors
If I believe my model is an unbiased description of the phenomenon I want to study, it suffices to revise my certainty about the size of the predictors, without re-estimating them. The quasipoisson model adds an additional parameter \(\phi\) (“phi”) to the variance parameter for the poisson model (Ward and Ahlquist 2018, 202). It describes the (over)dispersion and corrects the standard errors for it.
\(V(y_i) = \phi \lambda_i\)
We can easily run the model in base R by changing the probability distribution argument.
mod.quasipois <- glm(nreports ~
                       nom_dist1_ep +
                       nom_dist1_nat +
                       ctee_expert +
                       offset(log(months)),
                     #Specify quasipoisson
                     family = "quasipoisson",
                     df)
summary(mod.quasipois)
##
## Call:
## glm(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## offset(log(months)), family = "quasipoisson", data = df)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.07626 0.06155 -49.978 < 2e-16 ***
## nom_dist1_ep -0.85316 0.48985 -1.742 0.08191 .
## nom_dist1_nat -0.93748 0.76630 -1.223 0.22151
## ctee_expert 0.22292 0.08472 2.631 0.00865 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.048904)
##
## Null deviance: 2263.5 on 897 degrees of freedom
## Residual deviance: 2222.0 on 894 degrees of freedom
## (8 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Here, we see the dispersion is reported to be 3.0489044.
The results are reported alongside the baseline model in table 3. We can see that the coefficients are the same, but now, the political distance between MEPs and their national parties is no longer statistically significant. In fact, the standard error is almost as big as the coefficient.
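To see where the reported dispersion comes from, a minimal sketch (the specification is the same as in mod2.pois): \(\phi\) is estimated as the Pearson chi-square statistic over the residual degrees of freedom, and the quasipoisson standard errors are the poisson standard errors inflated by \(\sqrt{\phi}\).

#Estimate of the dispersion parameter phi (sketch)
phi <- sum(residuals(mod.quasipois, type = "pearson")^2) / df.residual(mod.quasipois)
phi
#The quasipoisson errors are the poisson errors scaled by sqrt(phi)
sqrt(phi) * summary(mod2.pois)$coefficients[, "Std. Error"]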
Adding predictors
I generally prefer taking the opportunity to think about my model choice instead of going for the post-hoc fix operated by the quasipoisson. An alternative route is to add predictors. Do not make this into a garbage can by adding intermediate predictors and then controlling away your main effect. However, you can add predictors for two reasons:
- Additional predictors: variables that are correlated with the outcome, but not with the theoretical variable you are interested in. Here, I happen to know that some MEPs don't generally take part in parliamentary work. They are not present during the plenary sessions, for example. If my theory is that group leaders select some committee members – and not that some committee members self-deselect – then this is an interesting control. However, there is no real reason to think that the MEPs with extreme political views (compared to their party groups) are less present.
mod1.predictors <- glm(nreports ~
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         #Additional predictor
                         activity +
                         offset(log(months)),
                       family = "poisson",
                       df)
The result is reported in the regression table. We see that the political distance is estimated to be (approximately) the same also after including the control.
- Confounders: In this particular case, I happen to know a few mechanisms that determine how many proposals an MEP could potentially be eligible for. They might be correlated to their ideological distance from their party: The leadership position and the committee on which the MEP serves.
First, legislative proposals that no one is interested in are taken on by the committee chair. It means that some of my extreme counts may be due to the committee chair doing his/her job of filling in for the other members. The committee chair is elected by all of the party groups, so this position may be reserved for people that are close to the median voter in the committee, rather than the party. That's a real potential confounder.
Second, the Commission is more active in some policy areas than others. If parties are strategic, they will put eligible rapporteurs on the more active committees. In the second model, I add fixed effects on committees and add a predictor for the committee chair.
mod2.predictors <- glm(nreports ~
                         nom_dist1_ep +
                         nom_dist1_nat +
                         ctee_expert +
                         activity +
                         chair +
                         committee +
                         offset(log(months)),
                       #Still a poisson model
                       family = "poisson",
                       df)
Let’s check the histogram again.
#Predict outcomes
df$preds_predictors2 <- predict(mod2.predictors,
                                newdata = df,
                                #We can skip the argument name (type =) here; we want exponentiated outcomes
                                "response")
#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_predictors2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).
… and the rootogram.
#Rootogram
rootogram(mod2.predictors,
main = "Hanging rootogram of mod2.predictors")
This looks much better! I still have a problem predicting my zeros, though.
Comparing with the initial poisson model, we see that the significance level is lower for ideological distance (the ratio between the parameter and the error is smaller). However, it is still more precise than the initial quasipoisson model.
Important for my research question is also that I now have an insignificant effect of the alternative explanation that MEPs would have to be close to their national party to get influence; I find no support for this in the more appropriate model.
stargazer(mod1.pois, mod2.pois,
          mod.quasipois,
          mod1.predictors, mod2.predictors,
          #don't report the fixed effects
          omit = "committee",
          #change to "text"
          type = "html")
Dependent variable: | |||||
nreports | |||||
Poisson | glm: quasipoisson | Poisson | |||
link = log | |||||
(1) | (2) | (3) | (4) | (5) | |
nom_dist1_ep | -1.769*** | -0.853*** | -0.853* | -0.850*** | -0.714** |
(0.319) | (0.281) | (0.490) | (0.279) | (0.302) | |
nom_dist1_nat | -1.472*** | -0.937** | -0.937 | -0.497 | -0.641 |
(0.500) | (0.439) | (0.766) | (0.412) | (0.439) | |
ctee_expert | 0.360*** | 0.223*** | 0.223*** | 0.182*** | 0.200*** |
(0.048) | (0.049) | (0.085) | (0.049) | (0.052) | |
activity | 1.666*** | 1.746*** | |||
(0.186) | (0.189) | ||||
chair | 1.584*** | ||||
(0.072) | |||||
Constant | 0.710*** | -3.076*** | -3.076*** | -4.422*** | -4.897*** |
(0.036) | (0.035) | (0.062) | (0.157) | (0.197) | |
Observations | 898 | 898 | 898 | 896 | 896 |
Log Likelihood | -2,181.944 | -1,868.368 | -1,820.734 | -1,505.961 | |
Akaike Inf. Crit. | 4,371.889 | 3,744.736 | 3,651.468 | 3,055.922 | |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Negative binomial model
Another route is to opt for the negative binomial model (Ward and Ahlquist 2018, 202-205). The model doesn't really have a single probability distribution; instead, it combines the poisson with a gamma distribution (\(\Gamma\)). It means that the functional form of the negative binomial model is not set in stone and can empirically adjust to the excess counts that we have by creating a "fatter tail".
However, this also means that we change the hypothesis. That is, the negative binomial distribution assumes that the probability of an additional count increases with the number of previous events. The best example is the soccer player who, having scored once, feels on top of things and continues to excel throughout the game as (s)he accumulates successes.
In our case, we could imagine a world in which group leaders are rational time-savers. That is, when they receive legislative proposals, they will rely on people they already know; either because they are on the leader’s mind, or because they think there will be an economy of scale (for example).
To fit the negative binomial model, we need the MASS package. The regression function is slightly different (glm.nb()), but otherwise the specification is the same as in base R. (This is, with reference to Ward and Ahlquist, an NB2 model.)
library(MASS)

mod2.nb <- glm.nb(nreports ~
                    nom_dist1_ep +
                    nom_dist1_nat +
                    ctee_expert +
                    activity +
                    chair +
                    committee +
                    offset(log(months)),
                  df)
The negative binomial model uses a second parameter (\(\alpha\)) to mediate between the mean (expected value) and the variance. Confusingly, Ward and Ahlquist use \(\alpha\), while R refers to \(\theta\).
Here, \(\alpha^{-1}\) (R's \(\theta\)) is reported to be 2.6655782; equivalently, \(\alpha = \frac{1}{\theta} =\) 0.3751531. As \(\theta\) increases (i.e. as \(\alpha\) approaches zero), the negative binomial and the poisson models are going to look increasingly similar.
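To see what \(\theta\) implies for the spread, a minimal sketch using the NB2 variance function \(\lambda + \lambda^2/\theta\) (equivalently \(\lambda + \alpha\lambda^2\)):

#Implied NB2 variance at the average count (sketch)
theta <- mod2.nb$theta
mu <- mean(df$nreports, na.rm = TRUE)
#Larger than the mean itself, unlike in the poisson model
mu + mu^2 / theta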
The results are also reported in table 4.
stargazer(mod1.predictors, mod2.predictors, mod2.nb,
          omit = "committee",
          title = "The poisson vs. the negative binomial model",
          type = "html")
Dependent variable: | |||
nreports | |||
Poisson | negative | ||
binomial | |||
(1) | (2) | (3) | |
nom_dist1_ep | -0.850*** | -0.714** | -0.761* |
(0.279) | (0.302) | (0.390) | |
nom_dist1_nat | -0.497 | -0.641 | -0.220 |
(0.412) | (0.439) | (0.537) | |
ctee_expert | 0.182*** | 0.200*** | 0.249*** |
(0.049) | (0.052) | (0.076) | |
activity | 1.666*** | 1.746*** | 1.938*** |
(0.186) | (0.189) | (0.249) | |
chair | 1.584*** | 1.651*** | |
(0.072) | (0.149) | ||
Constant | -4.422*** | -4.897*** | -5.148*** |
(0.157) | (0.197) | (0.264) | |
Observations | 896 | 896 | 896 |
Log Likelihood | -1,820.734 | -1,505.961 | -1,415.508 |
theta | 2.666*** (0.346) | ||
Akaike Inf. Crit. | 3,651.468 | 3,055.922 | 2,875.015 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
#Create a plot with one row and three columns
par(mfrow = c(1,3))
#Plot old model
rootogram(mod1.pois)
#Plot new model
rootogram(mod2.predictors)
#Plot new model
rootogram(mod2.nb)
#Return to initial parameters; one row and one column per plot
par(mfrow = c(1,1))
We can compare models. It is clear that the more complex poisson model is better than the simple one. However, the negative binomial model is a bit of a mixed bag. I have more correct zero counts, and some more extreme counts (>15 reports).
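As a numeric complement to the rootograms, a quick sketch comparing information criteria for the two models with identical predictors (lower is better):

#Information criteria for the poisson and negative binomial specifications
AIC(mod2.predictors, mod2.nb)
BIC(mod2.predictors, mod2.nb)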
Zero-inflated and hurdle models: a focus on the excess zeros
Another alternative is to consider whether we are, in fact, in the presence of two data generating processes: one accounting for the excess zeros, another for the positive counts (Ward and Ahlquist 2018, 207–14).
We can, for example, model some of the zero counts separately. These would be observations that – for some reason – would not respond to our predictors in the same way ("never takers").
Effectively, we run two models in parallel: A binomial logistic regression modelling the “never takers”, and another count model (poisson or negative binomial) with the usual predictors.
Zero-inflated models
Here, I use a poisson model for the count part. The package I use is pscl, which contains – among other things – functions for zero-inflated and hurdle models.
#install.packages("pscl")
library(pscl)
## Warning: package 'pscl' was built under R version 4.3.1
I can also make a theory out of this. Let’s say that there is a group of MEPs that are never takers; they’d rather be caught dead than write a legislative proposal. That is, before the group leader makes the selection, there is an initial stage of self (de-)selection into legislative work.
One predictor for this is the activity level of the MEP. The variable attendance captures the share of committee meetings that the member has attended. The idea is that the less you attend a committee, the more you opt out of legislative work.
I specify the predictors for the two models in the function and separate them with a vertical bar ("|"). Before the bar, I specify the predictors for the count part of the model. After the bar are the predictors for the zero/hurdle part (a binomial logistic model). Here, I specify that attendance is a predictor for whether an MEP is willing to engage in committee work, but not for how many positive counts we will see.
mod1.zero.pois <- zeroinfl(nreports ~
                             nom_dist1_ep +
                             nom_dist1_nat +
                             chair +
                             ctee_expert +
                             #Remove the fixed effects for convenience
                             # committee +
                             offset(log(months)) | attendance,
                           #Count model == poisson
                           dist = "poisson",
                           df)
summary(mod1.zero.pois)
##
## Call:
## zeroinfl(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + chair +
## ctee_expert + offset(log(months)) | attendance, data = df, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.0975 -0.7359 -0.3825 0.3695 9.2827
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92656 0.03987 -73.408 <2e-16 ***
## nom_dist1_ep -0.39240 0.36559 -1.073 0.2831
## nom_dist1_nat -0.75077 0.43319 -1.733 0.0831 .
## chair 1.32410 0.07129 18.573 <2e-16 ***
## ctee_expert 0.08714 0.05080 1.715 0.0863 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4276 0.2428 5.879 4.12e-09 ***
## attendance -5.6132 0.5893 -9.525 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 14
## Log-likelihood: -1548 on 7 Df
The attendance effect comes out statistically significant and substantively massive.
#Typical difference between low-attendance and assiduous members
change <- df$attendance %>%
  quantile(., c(0.2, .8), na.rm = T) %>%
  diff %>%
  as.numeric()
change
## [1] 0.5028251
#Extract the relevant coefficient from the zero part of the model
coef <- mod1.zero.pois$coefficients$zero["attendance"]
effect <- exp(coef * change) - 1
effect
## attendance
## -0.940541
Comparing a low-attendee (i.e. low quantile) with an assiduous member (high quantile), I find that the odds of being a "never taker" (i.e. never getting a legislative proposal from the group leader) are 94% lower for the assiduous member.
Same predictors in both models
If I don't specify anything else, R will use the same predictors for both parts of the model. We can take the explorative approach and look at what the model yields.
mod2.zero.pois <- zeroinfl(nreports ~
                             nom_dist1_ep +
                             nom_dist1_nat +
                             ctee_expert +
                             chair +
                             attendance +
                             offset(log(months)),
                           #Count model == poisson
                           dist = "poisson",
                           df)
# summary(mod2.zero.pois)
Something strange seems to happen. Attendance is clearly connected to receiving legislative influence. However, my theory about political proximity is challenged by the explorative approach: politically distant MEPs seem to be overrepresented among the never takers, while my theory linked the distance to the count part of the model.
Here, I do a negative binomial.
library(pscl)

mod1.zero.nb <- zeroinfl(nreports ~
                           nom_dist1_ep +
                           nom_dist1_nat +
                           ctee_expert +
                           chair +
                           # committee +
                           offset(log(months)) | attendance,
                         #Count model == negative binomial
                         dist = "negbin",
                         df)
Stargazer will report the results of either of the two parts, but not both. I use zero.component = T to report the binomial model that predicts my zero events.
stargazer(mod1.zero.pois,mod2.zero.pois, mod1.zero.nb,
omit = "committee",
title = "The ZERO part of the zero-inflated models: poisson vs. the negative binomial model",
dep.var.caption = "Zero events (no reports)",
zero.component = T,
type = "html")
Zero events (no reports) | |||
nreports | |||
(1) | (2) | (3) | |
nom_dist1_ep | 3.908*** | ||
(1.262) | |||
nom_dist1_nat | -1.957 | ||
(2.277) | |||
ctee_expert | -0.487 | ||
(0.379) | |||
chair | -12.719 | ||
(307.246) | |||
attendance | -5.613*** | -6.191*** | -8.446*** |
(0.589) | (0.726) | (1.203) | |
Constant | 1.428*** | -2.423*** | 1.634*** |
(0.243) | (0.327) | (0.322) | |
Observations | 885 | 885 | 885 |
Log Likelihood | -1,547.991 | -1,536.608 | -1,449.776 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
stargazer(mod1.zero.pois,mod2.zero.pois, mod1.zero.nb,
omit = "committee",
title = "The COUNT PART of the zero-inflated models: poisson vs. the negative binomial model",
dep.var.caption = "Number of reports",
zero.component = F,
type = "html")
Number of reports | |||
nreports | |||
(1) | (2) | (3) | |
nom_dist1_ep | -0.392 | -0.029 | -0.747* |
(0.366) | (0.340) | (0.415) | |
nom_dist1_nat | -0.751* | -0.637 | -0.565 |
(0.433) | (0.451) | (0.567) | |
chair | 1.324*** | 1.338*** | 1.433*** |
(0.071) | (0.071) | (0.150) | |
attendance | 0.663*** | ||
(0.138) | |||
ctee_expert | 0.087* | 0.072 | 0.118 |
(0.051) | (0.053) | (0.074) | |
Constant | -2.927*** | -3.398*** | -3.028*** |
(0.040) | (0.101) | (0.055) | |
Observations | 885 | 885 | 885 |
Log Likelihood | -1,547.991 | -1,536.608 | -1,449.776 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Hurdle models
The hurdle models are similar. They assume that there is a certain threshold an observation has to pass in order to acquire positive counts.
This time, we use the hurdle() function to estimate the model.
mod1.hurdle <- hurdle(nreports ~
                        nom_dist1_ep +
                        nom_dist1_nat +
                        ctee_expert +
                        chair +
                        # committee +
                        offset(log(months)) | attendance,
                      #Count model == negative binomial
                      # dist = "negbin",
                      df)
summary(mod1.hurdle)
##
## Call:
## hurdle(formula = nreports ~ nom_dist1_ep + nom_dist1_nat + ctee_expert +
## chair + offset(log(months)) | attendance, data = df)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -2.3173 -0.7379 -0.4270 0.4003 8.7110
##
## Count model coefficients (truncated poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.90541 0.04091 -71.011 <2e-16 ***
## nom_dist1_ep -0.02383 0.36089 -0.066 0.9473
## nom_dist1_nat -0.95820 0.52361 -1.830 0.0673 .
## ctee_expert 0.06116 0.05263 1.162 0.2451
## chair 1.29816 0.07142 18.177 <2e-16 ***
## Zero hurdle model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5738 0.1771 -8.888 <2e-16 ***
## attendance 4.0378 0.3286 12.289 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 12
## Log-likelihood: -1629 on 7 Df
As you can see from the regression table, these models often yield similar results.
stargazer(mod1.zero.pois,mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
omit = "committee",
title = "The ZERO part of the zero-inflated models: poisson vs. the negative binomial model",
dep.var.caption = "Zero events (no reports)",
zero.component = T,
type = "html")
Zero events (no reports) | ||||
nreports | ||||
zero-inflated | hurdle | |||
count data | ||||
(1) | (2) | (3) | (4) | |
nom_dist1_ep | 3.908*** | |||
(1.262) | ||||
nom_dist1_nat | -1.957 | |||
(2.277) | ||||
ctee_expert | -0.487 | |||
(0.379) | ||||
chair | -12.719 | |||
(307.246) | ||||
attendance | -5.613*** | -6.191*** | -8.446*** | 4.038*** |
(0.589) | (0.726) | (1.203) | (0.329) | |
Constant | 1.428*** | -2.423*** | 1.634*** | -1.574*** |
(0.243) | (0.327) | (0.322) | (0.177) | |
Observations | 885 | 885 | 885 | 885 |
Log Likelihood | -1,547.991 | -1,536.608 | -1,449.776 | -1,628.550 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
stargazer(mod1.zero.pois, mod2.zero.pois, mod1.zero.nb, mod1.hurdle,
omit = "committee",
title = "The COUNT PART of the zero-inflated models: poisson vs. the negative binomial model",
dep.var.caption = "Number of reports",
zero.component = F,
type = "html")
Number of reports | ||||
nreports | ||||
zero-inflated | hurdle | |||
count data | ||||
(1) | (2) | (3) | (4) | |
nom_dist1_ep | -0.392 | -0.029 | -0.747* | -0.024 |
(0.366) | (0.340) | (0.415) | (0.361) | |
nom_dist1_nat | -0.751* | -0.637 | -0.565 | -0.958* |
(0.433) | (0.451) | (0.567) | (0.524) | |
chair | 1.324*** | 1.338*** | 1.433*** | 1.298*** |
(0.071) | (0.071) | (0.150) | (0.071) | |
attendance | 0.663*** | |||
(0.138) | ||||
ctee_expert | 0.087* | 0.072 | 0.118 | 0.061 |
(0.051) | (0.053) | (0.074) | (0.053) | |
Constant | -2.927*** | -3.398*** | -3.028*** | -2.905*** |
(0.040) | (0.101) | (0.055) | (0.041) | |
Observations | 885 | 885 | 885 | 885 |
Log Likelihood | -1,547.991 | -1,536.608 | -1,449.776 | -1,628.550 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Assessing the zeroinflated and hurdle models
Assessing and presenting the results from the models of excess zero is somewhat harder because the predicted outcomes are a function of two processes/models. Which of those two should we focus on? Or should we do both?
The results of the binomial part (zero part) report the likelihood of positive/negative counts, and can be interpreted as such. However, the results of the count part of the model should be interpreted as the effects of the covariates conditional on being in the group of MEPs eligible for reports (given the binomial model).
Predictions
The predict() function will provide predictions as per usual. However, since the binomial model moderates the results of the count part, we should think about which of the models we'd like to present.
The type= argument lets you decide which outcome you're interested in (a short sketch of the first three options follows after this list).
- "count" reports the results from the count part of the model, conditional on passing the hurdle.
- "zero" reports the probability from the binomial model.
- "response" will yield the total expected outcome (with decimals). However, these predictions are done at the MEP level and may therefore deviate quite substantially from the observed counts, due to the interaction of the two models.

To compare the model predictions, it might be more useful to compare the probabilities of the counts, i.e. the probability of receiving 0, 1, 2, etc. legislative proposals from the group leaders.

- "prob" reports the probability of each count number of reports.
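For concreteness, a small sketch of the first three options applied to mod1.zero.pois:

#"count": expected counts, conditional on not being an excess zero
head(predict(mod1.zero.pois, type = "count"))
#"zero": the predicted probability of being an excess zero ("never taker")
head(predict(mod1.zero.pois, type = "zero"))
#"response": the overall expectation, i.e. (1 - p(zero)) * count
head(predict(mod1.zero.pois, type = "response"))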
This last option is interesting. Now, the function yields not a vector of predictions but a matrix. Each column corresponds to the count categories (the nreports), while the observations (MEPs) are reported in the rows. The cells report the probability for each MEP to be in a specific category of counts.
#Predict outcomes as a matrix
preds_zero <- predict(mod1.zero.pois,
                      newdata = df,
                      #Specify the probability of each count
                      "prob")
dim(preds_zero)
## [1] 906 23
head(preds_zero)
## 0 1 2 3 4 5 6
## 1 0.23082468 0.11318799 0.1749544 0.1802843 0.13933247 0.086146324 0.044385377
## 2 0.31116564 0.09609989 0.1520713 0.1604280 0.12693297 0.080344966 0.042380070
## 3 0.08108013 0.11723650 0.1926832 0.2111220 0.17349404 0.114057945 0.062486367
## 4 0.36665427 0.29673189 0.2023309 0.0919748 0.03135717 0.008552533 0.001943889
## 5 0.30150344 0.26567402 0.2256174 0.1277335 0.05423735 0.018423918 0.005215360
## 6 0.08743681 0.12699462 0.2011780 0.2124635 0.16828654 0.106636155 0.056309067
## 7 8 9 10 11 12
## 1 0.0196018149 7.574618e-03 2.601792e-03 8.043166e-04 2.260418e-04 5.823203e-05
## 2 0.0191609874 7.580226e-03 2.665593e-03 8.436225e-04 2.427223e-04 6.401514e-05
## 3 0.0293425747 1.205644e-02 4.403395e-03 1.447434e-03 4.325307e-04 1.184805e-04
## 4 0.0003787053 6.455641e-05 9.781935e-06 1.333991e-06 1.653819e-07 1.879467e-08
## 5 0.0012654348 2.686602e-04 5.070075e-05 8.611284e-06 1.329624e-06 1.881920e-07
## 6 0.0254862179 1.009347e-02 3.553228e-03 1.125766e-03 3.242505e-04 8.561000e-05
## 13 14 15 16 17 18
## 1 1.384755e-05 3.057732e-06 6.301769e-07 1.217578e-07 2.214126e-08 3.802632e-09
## 2 1.558453e-05 3.523058e-06 7.433320e-07 1.470338e-07 2.737300e-08 4.812870e-09
## 3 2.995811e-05 7.033918e-06 1.541406e-06 3.166708e-07 6.123082e-08 1.118172e-08
## 4 1.971602e-09 1.920522e-10 1.746047e-11 1.488209e-12 1.193831e-13 9.044787e-15
## 5 2.458732e-08 2.982887e-09 3.377528e-10 3.585358e-11 3.582096e-12 3.380012e-13
## 6 2.086442e-05 4.721753e-06 9.973260e-07 1.974887e-07 3.680598e-08 6.478449e-09
## 19 20 21 22
## 1 6.187072e-10 9.563339e-11 1.407812e-11 1.978228e-12
## 2 8.016870e-10 1.268613e-10 1.911895e-11 2.750398e-12
## 3 1.934488e-09 3.179413e-10 4.976667e-11 7.435787e-12
## 4 6.491913e-16 4.426604e-17 2.874613e-18 1.781905e-19
## 5 3.021469e-14 2.565912e-15 2.075276e-16 1.602163e-17
## 6 1.080295e-09 1.711345e-10 2.581924e-11 3.718312e-12
This option will give us a better idea about the model’s performance in the aggregate because we can assess the distribution of the outcomes. We can either identify the most likely outcome as if each count were a category, or simply calculate the aggregated probabilities for each category/outcome.
Most likely outcome

Here, I sum over each column.
colSums(preds_zero, na.rm = T)
## 0 1 2 3 4 5
## 328.49388994 128.11649395 132.15558796 111.25136593 78.76002226 47.94733459
## 6 7 8 9 10 11
## 25.70487224 12.60178980 6.06443721 3.23055103 2.13098970 1.71105210
## 12 13 14 15 16 17
## 1.49721003 1.31128497 1.10656961 0.88807610 0.67564952 0.48748810
## 18 19 20 21 22
## 0.33411593 0.21797666 0.13564867 0.08068526 0.04595930
which.max(preds_zero[1,])
## 0
## 1
max(preds_zero[906,]) == preds_zero[906,]
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## 13 14 15 16 17 18 19 20 21 22
## FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#Plot two overlayed histograms
df %>%
  ggplot() +
  #Observed outcomes
  geom_histogram(aes(x = nreports),
                 #transparent color
                 alpha = 0.5) +
  #Predicted outcomes
  geom_histogram(aes(x = preds_pois2),
                 #transparent red color
                 fill = "red",
                 alpha = 0.3) +
  #ggtheme
  theme_minimal() +
  ggtitle("Distribution of observed vs. predicted outcomes") +
  ylab("Number of MEPs") +
  xlab("Number of reports (legislative proposals)")
Here, we reported the most likely count category for each MEP. As is clear from the overlay, my model doesn’t predict the categories very precisely. That is, the probability of an MEP to fall into a single count category of reports is not very precisely estimated. However, if we treat the outcome as metric rather than categorical, the perception might change.
Distribution of probabilities

We can simply compare the proportion of observed data in each category of nreports with the predicted probabilities. That way, we can take into account that an MEP has a fair chance of getting – say – 3-5 reports, but a low probability of getting 0 or 10.
To do so, I create two data sets – one with predicted values and one with observed values – then merge them. Last, I make a histogram.
#Data frame with predicted probabilities
predicted_prob <-
  #the matrix from the predict function
  preds_zero %>%
  #make it into a data frame
  as.data.frame %>%
  #Take the mean of all the columns
  reframe(across(everything(), mean, na.rm = T)) %>%
  #Flip columns and rows to get a data set with two variables
  tidyr::pivot_longer(everything(),
                      names_to = "nreports",
                      values_to = "pred_prob") %>%
  #Recode to a numeric variable
  mutate(nreports = as.numeric(nreports))
#check
head(predicted_prob)
#Data frame with observed proportions
observed_prob <-
  df %>%
  #Count the number of cases in each category
  group_by(nreports) %>%
  summarise(count = n()) %>%
  #Ungroup to consider all the data
  ungroup %>%
  #Calculate the total number of observations
  mutate(N = sum(count),
         #divide by all observations to get a proportion
         prob = count/N,
         #Recode count category to numeric
         nreports = as.numeric(nreports))
#Merge the two data frames
df_probs <- full_join(observed_prob,
                      predicted_prob,
                      by = "nreports")
#Plot
df_probs %>%
  # You can filter away the zeros to zoom in on the counts
  # filter(nreports > 0) %>%
  ggplot() +
  geom_histogram(aes(y = prob,
                     x = nreports),
                 stat = "identity") +
  #Red == predicted values
  geom_histogram(aes(y = pred_prob,
                     x = nreports),
                 stat = "identity",
                 fill = "red", alpha = 0.5) +
  #Cut the x-axis to the values you want to see
  coord_cartesian(xlim = c(0, 25)) +
  ggtitle("Predicted vs. observed counts of legislative proposals",
          subtitle = "Results from a zeroinflated model") +
  ylab("Probability") +
  xlab("Number of legislative proposals (`reports`)") +
  theme_minimal()
Rootogram
The rootogram function handles these models without any problems. The expected frequency is based on the total expected counts (from the two models).
rootogram(mod1.zero.pois)
Robust standard errors: when observations are clustered in groups
The standard errors reported in regression models assume that observations are independently drawn, i.e. that all observations have an equal probability of getting the outcome we observe. This is NOT a particular assumption for count models, it is an assumption most models share.
However, this is not always the case. Sometimes, our observations share some common features. We may, for example think that the count number of legislative proposals are determined by the committee that the MEP is a member of. This creates – potentially – two problems: One is that committee membership may be a confounder for the relationship between MEPs ideology and their probability of garnering legislative proposals. This could happen, for example, if MEPs with extreme political views were assigned to committees with little legislative activity (as a way for the leadership to side-line the extremists). We can address this by controlling for committee membership the way we did in the previous section. This will potentially change the regression parameter, if – indeed – there is such a relationship.
Another problem is that the MEPs are in fact clustered in committee and therefore are not independently drawn. This has an impact on how we calculate the standard error (i.e. the statistical significance) of our effects, but doesn’t imply any change in the parameter as such. We can address this by clustering the standard errors. They are a way of adjusting the standard errors of regression coefficients to account for correlation within clusters (here: committees), while assuming independence between clusters (here: between committees). In other words, we assume that MEPs that sit on the same committee have a similar base-line probability of getting legislative proposals.
Normal standard errors
We can estimate normal standard errors from the variance-covariance matrix of the model object, which we extract with the vcov() function. It reports how the regression parameters vary/covary.
vcov(mod2.zero.pois)
count_(Intercept) | count_nom_dist1_ep | count_nom_dist1_nat | count_ctee_expert | count_chair | count_attendance | zero_(Intercept) | zero_nom_dist1_ep | zero_nom_dist1_nat | zero_ctee_expert | zero_chair | zero_attendance | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count_(Intercept) | 0.010 | -0.005 | -0.008 | -0.001 | -0.001 | -0.013 | 0.011 | -0.005 | -0.015 | -0.003 | -0.003 | -0.011 |
count_nom_dist1_ep | -0.005 | 0.116 | -0.024 | 0.001 | 0.000 | -0.002 | -0.002 | 0.092 | 0.030 | 0.002 | -0.003 | -0.015 |
count_nom_dist1_nat | -0.008 | -0.024 | 0.204 | 0.000 | 0.000 | 0.008 | -0.015 | -0.003 | 0.292 | 0.011 | -0.014 | 0.012 |
count_ctee_expert | -0.001 | 0.001 | 0.000 | 0.003 | 0.000 | 0.000 | -0.002 | 0.000 | 0.006 | 0.006 | -0.001 | 0.002 |
count_chair | -0.001 | 0.000 | 0.000 | 0.000 | 0.005 | 0.000 | 0.000 | 0.003 | -0.002 | -0.001 | 0.002 | -0.003 |
count_attendance | -0.013 | -0.002 | 0.008 | 0.000 | 0.000 | 0.019 | -0.014 | -0.005 | 0.010 | 0.001 | 0.003 | 0.022 |
zero_(Intercept) | 0.011 | -0.002 | -0.015 | -0.002 | 0.000 | -0.014 | 0.107 | -0.040 | -0.189 | -0.034 | 0.007 | -0.181 |
zero_nom_dist1_ep | -0.005 | 0.092 | -0.003 | 0.000 | 0.003 | -0.005 | -0.040 | 1.592 | -0.545 | -0.003 | 0.030 | -0.217 |
zero_nom_dist1_nat | -0.015 | 0.030 | 0.292 | 0.006 | -0.002 | 0.010 | -0.189 | -0.545 | 5.187 | 0.102 | -0.244 | 0.200 |
zero_ctee_expert | -0.003 | 0.002 | 0.011 | 0.006 | -0.001 | 0.001 | -0.034 | -0.003 | 0.102 | 0.144 | -0.035 | 0.005 |
zero_chair | -0.003 | -0.003 | -0.014 | -0.001 | 0.002 | 0.003 | 0.007 | 0.030 | -0.244 | -0.035 | 94400.207 | -0.056 |
zero_attendance | -0.011 | -0.015 | 0.012 | 0.002 | -0.003 | 0.022 | -0.181 | -0.217 | 0.200 | 0.005 | -0.056 | 0.527 |
Standard errors are calculated as the square root from the diagonal of that matrix.
#Variance-covariance matrix
vcov(mod2.zero.pois) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.10079098 0.34019014 0.45140501 0.05283447
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.07137039 0.13777445 0.32682849 1.26191966
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 2.27742363 0.37903328 307.24616665 0.72611315
Can you recognize the results from mod1.zero?
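If not, one way to check is to compare with the standard errors that summary() reports for the two parts of the model; a small sketch for the model used above:

#Standard errors as reported by summary(): count part and zero part
summary(mod2.zero.pois)$coefficients$count[, "Std. Error"]
summary(mod2.zero.pois)$coefficients$zero[, "Std. Error"]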
Clustered standard errors
We can estimate clustered standard errors by using the sandwich package in R.
#install.packages("sandwich")
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.3.1
The relevant function is vcovCL(). It requires an additional argument cluster= where we provide the variable we want to cluster the errors on. Here, I cluster by committee. For GLMs, the standard type= of errors is "HC0"; this is the default, so we don't strictly have to specify it.
#Variance-covariance matrix
vcovCL(mod2.zero.pois,
       #Cluster according to committee membership
       cluster = df$committee,
       type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt
Be mindful that if you have NAs in your data (as I do here), the model only uses complete observations, so you have to subset the cluster variable to match the estimation sample. My hack is to identify that sample using the predict() function.
#Predict to flag which observations the model actually used
df$preds <- predict(mod2.zero.pois, df)

#Variance-covariance matrix
vcovCL(mod2.zero.pois,
       #Subset, then cluster
       cluster = df$committee[!is.na(df$preds)],
       type = "HC0") %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

##  count_(Intercept)  count_nom_dist1_ep count_nom_dist1_nat   count_ctee_expert
##          0.1999302           0.3590123           0.7006974           0.1399392
##        count_chair    count_attendance    zero_(Intercept)   zero_nom_dist1_ep
##          0.1522670           0.2343490           0.4476172           1.3484687
## zero_nom_dist1_nat    zero_ctee_expert          zero_chair     zero_attendance
##          1.8209910           0.4929240           0.5422434           1.2752056
I now have regression coefficients from the model object and standard errors estimated separately.
#Copy standard errors to a separate object
ses.clustered <-
  vcovCL(mod2.zero.pois,
         #Subset, then cluster
         cluster = df$committee[!is.na(df$preds)]) %>%
  #Diagonal
  diag %>%
  #Square root
  sqrt

#Summary statistics from the original model (for comparison)
summary.mod1.zero <- summary(mod2.zero.pois)
#Extract the standard errors of the count part of the model
ses.normal <- summary.mod1.zero$coefficients$count[, "Std. Error"]
We usually report coefficients and clustered standard errors together in a regression table (with stargazer) by supplying a list of named standard errors. These names must be the same as the names of the regression coefficients.
Here we see that the names of the standard errors are prefixed by count_ for the count part of the model and zero_ for the zero part.
ses.clustered
## count_(Intercept) count_nom_dist1_ep count_nom_dist1_nat count_ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## count_chair count_attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
… however, the coefficients do not have these names.
mod2.zero.pois$coefficients
## $count
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert chair
## -3.39832253 -0.02883625 -0.63717556 0.07216832 1.33814804
## attendance
## 0.66311572
##
## $zero
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert chair
## -2.4234745 3.9084822 -1.9573037 -0.4873135 -12.7187862
## attendance
## -6.1911903
I will have to change the names to get stargazer to understand my objects and link coefficients and errors correctly. I use the gsub() function (from our session on recoding of variables) to replace the string count_ with nothing ("").
names(ses.clustered) <- names(ses.clustered) %>%
gsub("count_", "", .)
ses.clustered
## (Intercept) nom_dist1_ep nom_dist1_nat ctee_expert
## 0.1999302 0.3590123 0.7006974 0.1399392
## chair attendance zero_(Intercept) zero_nom_dist1_ep
## 0.1522670 0.2343490 0.4476172 1.3484687
## zero_nom_dist1_nat zero_ctee_expert zero_chair zero_attendance
## 1.8209910 0.4929240 0.5422434 1.2752056
Now, I can finally make my regression table. Here, I report the results from the original model and the results with the clustered errors.
stargazer(mod2.zero.pois, mod2.zero.pois,
se = list(ses.normal, ses.clustered),
title = "Normal vs. clustered standard errors",
type = "html")
Dependent variable: nreports

|  | (1) | (2) |
|---|---|---|
| nom_dist1_ep | -0.029 | -0.029 |
|  | (0.340) | (0.359) |
| nom_dist1_nat | -0.637 | -0.637 |
|  | (0.451) | (0.701) |
| ctee_expert | 0.072 | 0.072 |
|  | (0.053) | (0.140) |
| chair | 1.338*** | 1.338*** |
|  | (0.071) | (0.152) |
| attendance | 0.663*** | 0.663*** |
|  | (0.138) | (0.234) |
| Constant | -3.398*** | -3.398*** |
|  | (0.101) | (0.200) |
| Observations | 885 | 885 |
| Log Likelihood | -1,536.608 | -1,536.608 |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
I always check the results: My clustered errors should be larger than the normal errors, since the whole point of clustering is to acknowledge that my observations carry less independent information than their number suggests (outcomes partly depend on committee membership).
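A quick way to run that check is to take the ratio between the two sets of errors for the count part of the model; a small sketch using the objects created above (after the renaming, ses.clustered and ses.normal share names for the count coefficients):

#Ratio of clustered to normal standard errors for the count part
#Values above 1 mean the clustered errors are larger, as expected
round(ses.clustered[names(ses.normal)] / ses.normal, 2)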
Need a brush-up on robust/clustered standard errors? Robust standard errors and clustered standard errors are both methods for estimating standard errors that are robust to violations of the assumption of homoscedasticity (i.e., equal variance) in regression models. However, they are used in different contexts and with different assumptions.
Robust standard errors are used when the errors are believed to be heteroscedastic, but the source of the heteroscedasticity is unknown. In this case, the robust standard errors adjust the standard errors of the regression coefficients to account for the potential heteroscedasticity of the errors. Robust standard errors can be estimated using various methods, such as the sandwich estimator or the White estimator, and do not require any additional information about the structure of the data.
Clustered standard errors, on the other hand, are used when the errors are believed to be heteroscedastic and correlated within groups, but uncorrelated between groups. In this case, the groups can be defined in various ways, such as geographic regions, schools, or firms, and the clustered standard errors adjust the standard errors of the regression coefficients to account for the correlation of the errors within each group. The clustered standard errors can be estimated using the same methods as the robust standard errors, but they require information about the grouping structure of the data.
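To make the contrast concrete, here is a minimal sketch of both variants using sandwich on the model from above. It assumes that sandwich can compute the necessary building blocks (estimating functions) for the model object, which is what vcovCL() relied on earlier; the cluster subset is the same as before.

#Robust ("sandwich") standard errors: no grouping information required
vcovHC(mod2.zero.pois, type = "HC0") %>%
  diag %>%
  sqrt

#Clustered standard errors: require a grouping variable (here: committee)
vcovCL(mod2.zero.pois,
       cluster = df$committee[!is.na(df$preds)],
       type = "HC0") %>%
  diag %>%
  sqrt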
As a final aside, here is how we can simulate expected counts (with uncertainty) from the count part of a zero-inflated model (mod.zero) by drawing parameters from their estimated distribution with mvrnorm().

#mvrnorm() comes from the MASS package

#Scenario for the count part: vary the distance to the European party group
newdata1 <- data.frame(
  intercept1 = 1,
  nom_dist1_ep = seq(0, 1, 0.01),
  nom_dist1_nat = 0.028,
  ctee_expert = 0,
  chair = 0)

#Scenario for the zero part
newdata0 <- data.frame(
  intercept0 = 1,
  attendance = 0.7)

# predict(mod.zero, newdata = newdata)

#Draw 1000 sets of parameters for the count part...
sim1 <- mvrnorm(n = 1000,
                mu = coefficients(mod.zero)[1:5],
                Sigma = vcov(mod.zero)[1:5, 1:5])

#... and for the zero part
sim0 <- mvrnorm(n = 1000,
                mu = coefficients(mod.zero)[6:7],
                Sigma = vcov(mod.zero)[6:7, 6:7])

#Predicted probabilities from the zero-inflated model (exposure: 60 months)
prob <- predict(mod.zero,
                newdata = cbind(newdata1, newdata0, months = log(60)) %>% as.data.frame(),
                type = "prob")

# sim_outcomes <- rbinom(n, 1, prob[, 1]) * rpois(n, prob[, 2])

#Linear predictions for the count part: one row per scenario, one column per simulation
preds <- as.matrix(newdata1) %*% t(sim1)

#Transform to expected counts
preds <- preds %>% exp()

#Summarize each scenario by its median and 95 percent interval
pdf <- as.data.frame(t(apply(X = preds,
                             MARGIN = 1,
                             quantile,
                             probs = c(.025, .5, .975))))
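From here, one possible way to display the result is to plot the simulated expectations against the ideological distance covered by the scenario; a sketch assuming the objects created above:

#Plot the median expected count with a 95 percent interval
pdf %>%
  #Rename the quantile columns by position
  rename(lower = 1, median = 2, upper = 3) %>%
  #Add the scenario values for the x-axis
  mutate(nom_dist1_ep = newdata1$nom_dist1_ep) %>%
  ggplot(aes(x = nom_dist1_ep, y = median, ymin = lower, ymax = upper)) +
  geom_ribbon(alpha = 0.3) +
  geom_line() +
  xlab("Ideological distance to the European party group") +
  ylab("Expected number of reports (count part)")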