Missing data
In this session, we will be exploring methods for handling missing data. We start out by exploring the variables: their distribution as well as the distribution of their missing observations. We then move on to the simpler ways of imputing values before we tackle multiple imputation and simulation.
New R functions:
- from base:
  - sample()
- new from the finalfit package:
  - missing_plot()
  - missing_compare()
  - missing_pattern()
- new from the mice package:
  - complete()
  - pool()
  - pool_predictions()
We continue to use the data on Members of the European Parliament (MEPs). We are interested in whether re-election seeking MEPs spend more time in Parliament if they will compete in a system where the party selectorate can allocate “safe seats”. In other words, we have two comparison groups: We compare reelection seekers with non-reelection seekers and party-centered vs. candidate-centered systems through an interaction.
\(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \times x_2\)
This time, the data includes several groupings; it is still time-series cross-sectional (imbalanced panel) data in which each MEP is observed every 6 months over a 5-year period. Thus, I have observations of individuals (ID) in up to 10 periods (Period). These individuals also have a nationality (Nationality) and a national party affiliation (ParlGovID).
Our main variables are Attendance (\(y\)), FutureInEP (\(x_1\)) and ClosedList (\(x_2\)).
I begin by loading in my data. You can find them online here: https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda. I also rename a few of my variables for ease.
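A minimal sketch of that step; the object name df and the shorthand columns below (y = Attendance, x = FutureInEP) are assumptions based on how the variables are referred to later.

library(tidyverse)

#Download and load the .rda file (assumed to contain the data frame df)
download.file("https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda",
              destfile = "MEP.rda")
load("MEP.rda")

#Shorthand names (assumption: y = Attendance, x = FutureInEP)
df <- df %>%
  mutate(y = Attendance,
         x = FutureInEP)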
Introduction
The “hitch” this time is that I use a survey response to measure an MEP’s intention to seek reelection (\(x_1\)). However, most MEPs did not complete the survey. My intention is to impute the answers. While we’re at it, I’ll also scout for missing values on other variables too, including my dependent variable (y, Attendance).
Univariate statistics
Dependent variable: Attendance
I approximate legislative investment by the number of committee meetings an MEP has attended during the semester (the period of observation).
df %>%
  ggplot() +
  geom_histogram(aes(x = y)) +
  ggtitle("Number of committee meetings") +
  xlab("Number of committee meetings")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 119 rows containing non-finite outside the scale range
## (`stat_bin()`).
The variable is naturally bounded at 0 (i.e. no one can attend a negative number of meetings). However, for the purposes of this exercise I am mainly interested in the differences in means between MEPs with an expressed reelection intent and those without such a clear ambition. I could also make the bet that the conditional distribution (and my residuals) will be approximately normal once I have specified my full model. This is always smart to check.
Independent variable: Reelection intent
The survey asks where an MEP sees him/herself in 10 years. One of the possible answers is “European Parliament” (FutureInEP == 1).
df$future <- ifelse(df$x == 1, "reelection", "exit")
df %>%
ggplot +
geom_bar(aes(future)) +
ggtitle("Re-election intent upon election",
subtitle = "EPRG Survey distributed in 2009 and 2014")
As is clear from the barplot, most MEPs did not reply. I have a lot of missing data (71 %). To what extent is this a problem?
If the data is MCAR (missing completely at random), it would mainly be a problem of statistical power. However, if the data is systematically missing – and especially if the assignment mechanism for NA on \(x\) is also related to \(y\) – I should take action. If the data is MAR (missing at random), I might impute some values using observables. If it is MNAR (missing not at random), it’s a different beast; we don’t have any predictors and we cannot know what the source is beyond theory/political science reasoning. The best way to look for MAR is to correlate the missingness with the other variables in the model, as well as whatever other variables we might have.
Explore the missing observations
There is no way to know if the missing data is truly random. However, a pretty sure giveaway that the missing observations are non-random is if a predictor’s NAs covary with either i) the NAs in the other variables in the model (or other variables we may have) or ii) the values of other variables.
For ease, let’s begin by subsetting the data to the covariates we are interested in. We can add variables to this code as we go.
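A sketch, assuming we start with the three model variables:

df0 <- df %>%
  dplyr::select(y, x, ClosedList)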
Which variables contain missing values?
Now, let’s see where I have missing observations. The package finalfit has several convenient functions for exploring patterns in the missing observations. missing_plot() organizes the data row-by-row on the x-axis and reports the variables on the y-axis. It draws light blue bars for all observations that are missing.
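Something like:

library(finalfit)
df0 %>%
  missing_plot()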
We can see that I have some missing observations on my dependent variable “Attendance” (\(y\)) as well as a bunch on my predictor “FutureInEP” (\(x\)). By contrast, we have no missing observations in the indicator for party-centered systems (“ClosedList”).
Unique combinations of NAs
Another functionality that comes with the mice package allows us to explore the patterns of missing observations.
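A sketch of the call; missing_pattern() comes from finalfit and wraps mice::md.pattern(), hence the mice start-up messages below:

library(mice)
df0 %>%
  missing_pattern()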
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
##     ClosedList   y    x
## 388          1   1    1    0
## 914          1   1    0    1
## 30           1   0    1    1
## 89           1   0    0    2
##              0 119 1003 1122
We have four combinations of missing observations in the data:
- no missing (N = 388)
- missing only on the x (N = 914)
- missing only on the y (N = 30)
- missing on both x and y (N = 89)
Pairs of correlations
We should also check for systematic correlations between the missing and observed data across different variables. The missing_compare() function does a simple cross-tabulation of missing values in the variable of your choice (e.g. \(x\)) and the values of other predictors. It then tests the statistical significance of the relationship. When the predictors are categorical, it performs a \(\chi^2\) test.
Here, I compare missingness in y with the values of x1 and x2. It turns out that there is a relationship between NAs in MEPs’ attendance levels (y) and party-centered systems (ClosedList). It is not an effect of the survey question, but another glitch in the data. We can check with a simple correlation test, just to be sure.
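A sketch of the calls; the missing_compare() arguments are illustrative, while the correlation tests match the output below:

df0 %>%
  missing_compare(dependent = "y",
                  explanatory = c("x", "ClosedList"))

cor.test(as.numeric(is.na(df$x)), df$ClosedList)
cor.test(as.numeric(is.na(df$y)), df$x)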
##
## Pearson's product-moment correlation
##
## data: as.numeric(is.na(df$x)) and df$ClosedList
## t = 3.7918, df = 1419, p-value = 0.0001558
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.04840475 0.15136782
## sample estimates:
## cor
## 0.1001544
##
## Pearson's product-moment correlation
##
## data: as.numeric(is.na(df$y)) and df$x
## t = 0.65162, df = 416, p-value = 0.515
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06417985 0.12745652
## sample estimates:
## cor
## 0.0319318
Missingness on my dependent variable is not correlated with MEPs’ ambition (the value of x). However, the reverse does not hold.
##
## Pearson's product-moment correlation
##
## data: as.numeric(is.na(df$x)) and df$y
## t = -3.2667, df = 1300, p-value = 0.001117
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.14385311 -0.03608115
## sample estimates:
## cor
## -0.09023128
It seems like we have two problems: Missing on y and missing on x. Let’s ignore the missing on y for the time being.
Conditional NAs
A quick way to check if missing observations are conditional on other predictors is to regress the NAs on the rest of the predictors.
I begin by recoding my variable into 0 (not missing) and 1 (missing).
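A sketch of the recode and the model that generates the output below:

df0$x_na <- ifelse(is.na(df0$x), 1, 0)
mod.na <- glm(x_na ~ ClosedList,
              family = "binomial",
              data = df0)
summary(mod.na)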
##
## Call:
## glm(formula = x_na ~ ClosedList, family = "binomial", data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.64768 0.08228 7.872 3.5e-15 ***
## ClosedList 0.44051 0.11703 3.764 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1721.8 on 1420 degrees of freedom
## Residual deviance: 1707.5 on 1419 degrees of freedom
## AIC: 1711.5
##
## Number of Fisher Scoring iterations: 4
Ok, so the non-responses to the survey are systematically correlated with the electoral system. Here, I find that the odds of a missing observation on x increase by roughly 55% (exp(0.44) ≈ 1.55) when we go from candidate-centered to party-centered systems.
If we assume the data is MAR and this is the only covariate for missing values, then conditioning on the electoral system in the main model will ideally address the bias induced by the missing observations. We can also use the predictors in the model to impute non-responses. Ward and Ahlquist (2018) also argue that we can use the outcome variable (y) to impute values on x. Their point is that we have a (theoretical) relationship between the two, and that this is reflected in their correlation. The correlation is symmetric: if we can predict y using x, then we can also use y to predict x.
Before we go that far, let’s consider some low-stakes options.
Quick fixes
Removing data
My ideal model looks like this:
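That is, the interaction model from the introduction (the call matching the output below; mod.base is the name used later):

mod.base <- lm(y ~ x * ClosedList, data = df0)
summary(mod.base)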
##
## Call:
## lm(formula = y ~ x * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.283 -8.193 -1.198 6.797 30.807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1931 0.8231 22.103 <2e-16 ***
## x -0.8647 1.4642 -0.591 0.555
## ClosedList -0.9899 1.2150 -0.815 0.416
## x:ClosedList 2.9445 2.1900 1.345 0.180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.912 on 384 degrees of freedom
## (1033 observations deleted due to missingness)
## Multiple R-squared: 0.005149, Adjusted R-squared: -0.002623
## F-statistic: 0.6625 on 3 and 384 DF, p-value: 0.5755
From the base-line model, we can see that MEPs who already know they will seek reelection attend on average about three committee meetings more than colleagues without such ambition in party-centered systems (the interaction term; note that the estimate is not statistically significant).
If I do nothing more at the estimation stage, R will proceed to listwise exclusion of all observations with missing values on any of the covariates included in the model.
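We can check the size of the estimation sample, e.g.:

length(mod.base$residuals)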
## [1] 388
Here, we see that the model was estimated on 388 observations, while my data frame contains 1421 observations.
Creating weights
Sometimes, the main problem is just over/under-representation of some groups of observations. If so, we can include weights in the regression to compensate for the overrepresentation.
df %>%
group_by(Nationality) %>%
summarize(prop = mean(is.na(x))) %>%
dplyr::select(Nationality, prop) %>%
ggplot +
geom_bar(aes(x = prop,
y = Nationality),
stat = "identity") +
ggtitle("Proportion of non-responses by nationality") +
xlab("")
From the figure, we see that there is some variation in response rates across nationalities. If that is all, and we don’t think the same mechanism will affect the values taken by my \(y\), then we can simply weight the observations according to their probability of occurring in the survey. E.g. Hungarian MEPs have a particularly high non-response rate, while Danish MEPs were comparatively likely to respond.
I begin by creating a dummy indicator variable for whether the MEP responded to the survey. I then fit a model to predict the occurrence of non-response.
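A sketch (the object names are assumptions; the formula matches the output below):

df$future_na <- ifelse(is.na(df$x), 1, 0)
mod.na2 <- glm(future_na ~ Nationality,
               family = "binomial",
               data = df)
summary(mod.na2)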
##
## Call:
## glm(formula = future_na ~ Nationality, family = "binomial", data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.34993 0.42414 3.183 0.00146 **
## NationalityBelgium -0.38053 0.55253 -0.689 0.49101
## NationalityBulgaria -0.94446 0.54676 -1.727 0.08410 .
## NationalityCroatia -0.94446 0.77237 -1.223 0.22140
## NationalityCyprus -1.01345 0.72302 -1.402 0.16100
## NationalityCzech Republic -1.19578 0.53204 -2.248 0.02461 *
## NationalityDenmark -1.34993 0.58869 -2.293 0.02184 *
## NationalityEstonia -0.50263 0.80999 -0.621 0.53490
## NationalityFinland -0.71394 0.59146 -1.207 0.22740
## NationalityFrance -0.24026 0.47352 -0.507 0.61187
## NationalityGermany -0.34011 0.45458 -0.748 0.45435
## NationalityGreece -0.34662 0.55149 -0.629 0.52966
## NationalityHungary 1.24034 0.73364 1.691 0.09090 .
## NationalityIreland -1.34993 0.63413 -2.129 0.03327 *
## NationalityItaly -0.56977 0.46156 -1.234 0.21704
## NationalityLatvia -0.89794 0.64316 -1.396 0.16268
## NationalityLithuania -0.30847 0.63670 -0.484 0.62804
## NationalityLuxembourg -0.65678 0.74491 -0.882 0.37795
## NationalityMalta -1.01345 0.72302 -1.402 0.16100
## NationalityNetherlands -0.50263 0.52453 -0.958 0.33794
## NationalityPoland -0.50263 0.47698 -1.054 0.29199
## NationalityPortugal -0.34662 0.55149 -0.629 0.52966
## NationalityRomania -0.85745 0.50312 -1.704 0.08833 .
## NationalitySlovakia -1.34993 0.57770 -2.337 0.01945 *
## NationalitySlovenia -1.34993 0.65566 -2.059 0.03951 *
## NationalitySpain -0.03425 0.48738 -0.070 0.94398
## NationalitySweden -0.45199 0.55483 -0.815 0.41528
## NationalityUnited Kingdom -0.18300 0.46865 -0.390 0.69619
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1721.8 on 1420 degrees of freedom
## Residual deviance: 1670.9 on 1393 degrees of freedom
## AIC: 1726.9
##
## Number of Fisher Scoring iterations: 5
In practice, the weight multiplies each observation’s contribution to the estimation, so that observations count more or less according to the weight they are given.
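A sketch of the weighting step, assuming weight holds the predicted probability of non-response from the nationality model:

df$weight <- predict(mod.na2, df, type = "response")
mod.w <- lm(Attendance ~ x * ClosedList,
            data = df,
            weights = 1/weight)
summary(mod.w)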
##
## Call:
## lm(formula = Attendance ~ x * ClosedList, data = df, weights = 1/weight)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -23.930 -9.556 -1.641 7.906 43.313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3730 0.8092 22.704 <2e-16 ***
## x -0.9877 1.4306 -0.690 0.490
## ClosedList -1.1130 1.2439 -0.895 0.371
## x:ClosedList 3.2636 2.2176 1.472 0.142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.24 on 384 degrees of freedom
## (1033 observations deleted due to missingness)
## Multiple R-squared: 0.005924, Adjusted R-squared: -0.001842
## F-statistic: 0.7628 on 3 and 384 DF, p-value: 0.5155
This is a common way to address some of the information biases we find in surveys such as the European Social Survey, World Values Survey etc.
stargazer(mod.base, mod.w,
          column.labels = c("Listwise excl", "Weights"),
          title = "Listwise exclusion and regression weights.",
          type = "html")
Simple imputations
“Imputing” missing values implies that we fill in the gaps. The non-informative ways of filling in gaps imply giving the observations some value without adding new information. This will attenuate the statistical correlation, but increase the N. If my covariate is a control variable, and I want to keep as many observations as possible (or reduce selection bias) for the main predictors in the model, these are low-energy fixes.
Mean imputation
I can simply replace my missing observations with the mean of the variable. Here, the “mean” is the proportion of MEPs that answer the survey question in the affirmative (x == 1).
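A one-liner does the trick (x.bar is the variable name used in the models below):

df0$x.bar <- ifelse(is.na(df0$x), mean(df0$x, na.rm = TRUE), df0$x)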
I can run my model with this new “imputed” variable. There is no new information in the model; however, my N is higher. It also means that the standard errors will be artificially low, i.e. I’ll be underpredicting while being overly certain about the result.
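The call generating the output below:

mod.m <- lm(y ~ x.bar * ClosedList, data = df0)
summary(mod.m)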
##
## Call:
## lm(formula = y ~ x.bar * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.995 -7.617 -1.530 6.383 73.383
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.5296 0.6362 25.982 <2e-16 ***
## x.bar -0.8445 1.5293 -0.552 0.581
## ClosedList -0.5423 0.9163 -0.592 0.554
## x.bar:ClosedList 2.8522 2.2871 1.247 0.213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.001586, Adjusted R-squared: -0.000722
## F-statistic: 0.6871 on 3 and 1298 DF, p-value: 0.5599
Indicator variables
We can add to this strategy an indicator variable that flags all the cases where I’ve been meddling with my observations. It’s – once again – a way to keep the entire data set. Now, I control for the imputation, so that the coefficient of the imputed variable reports the estimated effect of x on y using only the information from the observed values (plus some noise). The imputed values are subsumed in the indicator. Consider the similarity between the two coefficients for x and x.bar in models 1 and 3.
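The model combines the NA indicator from earlier with the mean-imputed variable (the call matching the output below):

mod.ind <- lm(y ~ x_na * x.bar * ClosedList, data = df0)
summary(mod.ind)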
##
## Call:
## lm(formula = y ~ +x_na * +x.bar * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.283 -7.398 -1.398 5.811 73.811
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.1931 0.8565 21.241 < 2e-16 ***
## x_na -2.5246 0.8743 -2.887 0.00395 **
## x.bar -0.8647 1.5236 -0.568 0.57042
## ClosedList -0.9899 1.2643 -0.783 0.43381
## x_na:x.bar NA NA NA NA
## x_na:ClosedList 0.8581 1.2563 0.683 0.49469
## x.bar:ClosedList 2.9445 2.2789 1.292 0.19655
## x_na:x.bar:ClosedList NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.31 on 1296 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.01056, Adjusted R-squared: 0.006739
## F-statistic: 2.765 on 5 and 1296 DF, p-value: 0.01714
stargazer(mod.base, mod.m, mod.ind,
title = "Low-stakes imputations",
column.labels = c("listwise exclusion", "mean imp.", "mean imp. + indicator"),
type = "html")
Low-stakes imputations. Dependent variable: y.

| | listwise exclusion (1) | mean imp. (2) | mean imp. + indicator (3) |
|---|---|---|---|
| x | -0.865 (1.464) | | |
| x_na | | | -2.525*** (0.874) |
| x.bar | | -0.845 (1.529) | -0.865 (1.524) |
| ClosedList | -0.990 (1.215) | -0.542 (0.916) | -0.990 (1.264) |
| x:ClosedList | 2.945 (2.190) | | |
| x_na:ClosedList | | | 0.858 (1.256) |
| x.bar:ClosedList | | 2.852 (2.287) | 2.945 (2.279) |
| Constant | 18.193*** (0.823) | 16.530*** (0.636) | 18.193*** (0.857) |
| Observations | 388 | 1,302 | 1,302 |
| R2 | 0.005 | 0.002 | 0.011 |
| Adjusted R2 | -0.003 | -0.001 | 0.007 |
| Residual Std. Error | 9.912 (df = 384) | 10.352 (df = 1298) | 10.314 (df = 1296) |
| F Statistic | 0.663 (df = 3; 384) | 0.687 (df = 3; 1298) | 2.765** (df = 5; 1296) |

Note: *p<0.1; **p<0.05; ***p<0.01
Listwise exclusion, regression weights, and mean imputations. Dependent variable: y (Attendance in model 2).

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| x | -0.865 (1.464) | -0.988 (1.431) | | |
| x_na | | | | -2.525*** (0.874) |
| x.bar | | | -0.845 (1.529) | -0.865 (1.524) |
| ClosedList | -0.990 (1.215) | -1.113 (1.244) | -0.542 (0.916) | -0.990 (1.264) |
| x:ClosedList | 2.945 (2.190) | 3.264 (2.218) | | |
| x_na:ClosedList | | | | 0.858 (1.256) |
| x.bar:ClosedList | | | 2.852 (2.287) | 2.945 (2.279) |
| Constant | 18.193*** (0.823) | 18.373*** (0.809) | 16.530*** (0.636) | 18.193*** (0.857) |
| Observations | 388 | 388 | 1,302 | 1,302 |
| R2 | 0.005 | 0.006 | 0.002 | 0.011 |
| Adjusted R2 | -0.003 | -0.002 | -0.001 | 0.007 |
| Residual Std. Error | 9.912 (df = 384) | 12.243 (df = 384) | 10.352 (df = 1298) | 10.314 (df = 1296) |
| F Statistic | 0.663 (df = 3; 384) | 0.763 (df = 3; 384) | 0.687 (df = 3; 1298) | 2.765** (df = 5; 1296) |

Note: *p<0.1; **p<0.05; ***p<0.01
Inserting information into the model
Until now, we haven’t inserted any new information for the missing values. We can do that using imputations: either by inserting information from i) the other covariates or from ii) new variables. There are grounds to say that we can draw on the covariates already in the model: if the covariates in the model are there because they covary (with each other), then we may argue that the information is already in the data; we just need to crystallize it.
Moving forward, we can either replace missing values by group means, or go all the way and use regression predictions.
Group means
We can replace the missing values of x with the group mean within a predictor like ClosedList.
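A sketch, mirroring the Nationality version further down:

df0 <- df0 %>%
  group_by(ClosedList) %>%
  mutate(x.bar.g = if_else(is.na(x), mean(x, na.rm = TRUE), x)) %>%
  ungroup()

mod.g <- lm(y ~ x.bar.g * ClosedList, data = df0)
summary(mod.g)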
##
## Call:
## lm(formula = y ~ x.bar.g * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.096 -7.607 -1.577 6.393 73.393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.5765 0.6479 25.586 <2e-16 ***
## x.bar.g -0.9627 1.5288 -0.630 0.529
## ClosedList -0.5936 0.9116 -0.651 0.515
## x.bar.g:ClosedList 3.0758 2.2869 1.345 0.179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.001771, Adjusted R-squared: -0.000536
## F-statistic: 0.7677 on 3 and 1298 DF, p-value: 0.5122
Let’s instead use an entirely new covariate. In fact, let’s add several predictors to the data set we use. My candidate predictors are the legislator’s Nationality, EPGroup (parliamentary group), EP (legislative period/survey round) and Age when they responded to the survey.
vars <- c("y", "x", "ClosedList",
"EPGroup", "EP", "ID",
"Nationality", "Age")
df0 <- df %>%
dplyr::select(all_of(vars))
Usually, I’d check the patterns of missing data once again (i.e. are there missing observations on the new covariates?).
df0 <-
df0 %>%
group_by(Nationality) %>%
mutate(x.bar.g2 = if_else(is.na(x), mean(x, na.rm = T), x)) %>%
#make sure to ungroup (useful for some of the sampling functions)
ungroup()
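Then fit the corresponding model (the call generating the output below):

mod.g2 <- lm(y ~ x.bar.g2 * ClosedList, data = df0)
summary(mod.g2)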
##
## Call:
## lm(formula = y ~ x.bar.g2 * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.939 -7.712 -1.308 6.383 73.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.7148 0.6179 27.050 <2e-16 ***
## x.bar.g2 -1.4242 1.4422 -0.987 0.324
## ClosedList -0.6270 0.8571 -0.732 0.465
## x.bar.g2:ClosedList 3.2758 2.1325 1.536 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.002097, Adjusted R-squared: -0.0002096
## F-statistic: 0.9091 on 3 and 1298 DF, p-value: 0.4359
stargazer(mod.base, mod.g, mod.g2,
column.labels = c("listwise exclusion", "group mean/ClosedList", "group mean/Nationality"),
type = "html")
Group-mean imputations. Dependent variable: y.

| | listwise exclusion (1) | group mean/ClosedList (2) | group mean/Nationality (3) |
|---|---|---|---|
| x | -0.865 (1.464) | | |
| x.bar.g | | -0.963 (1.529) | |
| x.bar.g2 | | | -1.424 (1.442) |
| ClosedList | -0.990 (1.215) | -0.594 (0.912) | -0.627 (0.857) |
| x:ClosedList | 2.945 (2.190) | | |
| x.bar.g:ClosedList | | 3.076 (2.287) | |
| x.bar.g2:ClosedList | | | 3.276 (2.133) |
| Constant | 18.193*** (0.823) | 16.577*** (0.648) | 16.715*** (0.618) |
| Observations | 388 | 1,302 | 1,302 |
| R2 | 0.005 | 0.002 | 0.002 |
| Adjusted R2 | -0.003 | -0.001 | -0.0002 |
| Residual Std. Error | 9.912 (df = 384) | 10.351 (df = 1298) | 10.350 (df = 1298) |
| F Statistic | 0.663 (df = 3; 384) | 0.768 (df = 3; 1298) | 0.909 (df = 3; 1298) |

Note: *p<0.1; **p<0.05; ***p<0.01
Regression
From group means, there is only a tiny-tiny step to regression predictions. Since my x is binary, I run a binomial logistic regression to get predicted probabilities that I can later use for imputation. I’m regressing on Age.
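The imputation model (matching the output below):

mod.imp.age <- glm(x ~ Age + I(Age * Age),
                   family = "binomial",
                   data = df0)
summary(mod.imp.age)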
##
## Call:
## glm(formula = x ~ Age + I(Age * Age), family = "binomial", data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.424710 2.654323 -3.551 0.000384 ***
## Age 0.411206 0.109819 3.744 0.000181 ***
## I(Age * Age) -0.004549 0.001109 -4.100 4.13e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 519.82 on 417 degrees of freedom
## Residual deviance: 480.01 on 415 degrees of freedom
## (1003 observations deleted due to missingness)
## AIC: 486.01
##
## Number of Fisher Scoring iterations: 5
At this point, I’ll prioritize predictors that make sense theoretically and that have a strong correlation. It makes sense that MEPs that are above a certain age will not plan to stay in office for another 10 years, so age is a substantively good predictor. I add a quadratic term because I think young MEPs are increasingly likely to aim for an EU career (as they lock themselves into a career path); then – as they grow closer to retirement – I expect the opposite.
df0$pred.age <- predict(mod.imp.age, df0, type = "response")
df0 <-
df0 %>%
mutate(x.age = if_else(is.na(x), pred.age, x))
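Then the main model on the imputed variable:

mod.age <- lm(y ~ x.age * ClosedList, data = df0)
summary(mod.age)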
##
## Call:
## lm(formula = y ~ +x.age * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.785 -7.608 -1.412 6.392 73.815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6080 0.6144 27.029 <2e-16 ***
## x.age -1.0810 1.4197 -0.761 0.447
## ClosedList -0.5307 0.8714 -0.609 0.543
## x.age:ClosedList 2.7889 2.0809 1.340 0.180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.001694, Adjusted R-squared: -0.0006132
## F-statistic: 0.7343 on 3 and 1298 DF, p-value: 0.5316
Random effects and fixed effects
A lot of (unobserved) information is often contained in group variables.
Fixed effects
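Here, I add the parliamentary group as fixed effects to the age model (the call matching the output below):

mod.imp.epgr <- glm(x ~ Age + I(Age * Age) + EPGroup,
                    family = "binomial",
                    data = df0)
summary(mod.imp.epgr)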
##
## Call:
## glm(formula = x ~ Age + I(Age * Age) + EPGroup, family = "binomial",
## data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.718932 2.887349 -3.020 0.002530 **
## Age 0.373340 0.116681 3.200 0.001376 **
## I(Age * Age) -0.004263 0.001169 -3.648 0.000264 ***
## EPGroupECR 0.102686 0.529200 0.194 0.846145
## EPGroupEFD -14.791011 759.441910 -0.019 0.984461
## EPGroupEFDD -1.470950 1.117701 -1.316 0.188157
## EPGroupEPP 0.946348 0.396894 2.384 0.017108 *
## EPGroupGreens/EFA -1.282898 0.828435 -1.549 0.121483
## EPGroupGUE/NGL -0.279946 0.671411 -0.417 0.676714
## EPGroupNI 0.631098 0.644278 0.980 0.327312
## EPGroupPES 0.648792 0.417939 1.552 0.120576
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 519.82 on 417 degrees of freedom
## Residual deviance: 449.50 on 407 degrees of freedom
## (1003 observations deleted due to missingness)
## AIC: 471.5
##
## Number of Fisher Scoring iterations: 15
df0$pred.epgr <- predict(mod.imp.epgr, df0, type = "response")
df0 <-
df0 %>%
mutate(x.gr = if_else(is.na(x), pred.epgr, x))
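And the main model on the fixed-effects imputation:

mod.gr <- lm(y ~ x.gr * ClosedList, data = df0)
summary(mod.gr)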
##
## Call:
## lm(formula = y ~ +x.gr * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.538 -7.636 -1.351 6.311 74.028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.4342 0.5924 27.741 <2e-16 ***
## x.gr -0.5450 1.3513 -0.403 0.6868
## ClosedList -0.7192 0.8393 -0.857 0.3917
## x.gr:ClosedList 3.3679 1.9462 1.730 0.0838 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.34 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.003522, Adjusted R-squared: 0.001219
## F-statistic: 1.529 on 3 and 1298 DF, p-value: 0.2052
Fixed effects are always a possibility, but remember: you run the risk of overfitting the data when you hunt around for covariates. Fixed effects are a good example: they predict really well in-group, but what about new groups? Or if there is a change?
Random effects
Random effects are more flexible. We can create new groups simply by drawing from the distribution of random effects (i.e. the random intercepts are assumed to follow a normal distribution with a given mean and standard deviation). We can also add level-2 covariates to improve the out-of-sample predictions.
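A sketch using glmer() from lme4 (the formula matches the model summary below, warnings included):

library(lme4)
mod.imp.ran <- glmer(x ~ Age + I(Age * Age) + (1 | EPGroup),
                     family = "binomial",
                     data = df0)
summary(mod.imp.ran)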
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.211989 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: x ~ Age + I(Age * Age) + (1 | EPGroup)
## Data: df0
##
## AIC BIC logLik deviance df.resid
## 477.9 494.1 -235.0 469.9 414
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1703 -0.6996 -0.4341 0.9599 6.5898
##
## Random effects:
## Groups Name Variance Std.Dev.
## EPGroup (Intercept) 0.337 0.5805
## Number of obs: 418, groups: EPGroup, 9
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.5170989 2.1412475 -4.445 8.80e-06 ***
## Age 0.4044541 0.0869720 4.650 3.31e-06 ***
## I(Age * Age) -0.0045406 0.0008735 -5.198 2.01e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age
## Age -0.982
## I(Age*Age) 0.953 -0.990
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.211989 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
## - Rescale variables?
To allow new groups to be drawn, simply specify it in the prediction.
df0$pred.ran <- predict(mod.imp.ran, df0, type = "response", allow.new.levels = T)
df0 <-
df0 %>%
mutate(x.ran = if_else(is.na(x), pred.ran, x))
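The main model on the random-effects imputation:

mod.ran <- lm(y ~ x.ran * ClosedList, data = df0)
summary(mod.ran)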
##
## Call:
## lm(formula = y ~ +x.ran * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.518 -7.613 -1.353 6.324 74.034
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.4343 0.6005 27.369 <2e-16 ***
## x.ran -0.5441 1.3840 -0.393 0.6943
## ClosedList -0.7012 0.8508 -0.824 0.4100
## x.ran:ClosedList 3.3294 2.0008 1.664 0.0963 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.34 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.003251, Adjusted R-squared: 0.0009476
## F-statistic: 1.411 on 3 and 1298 DF, p-value: 0.2378
stargazer(mod.base, mod.gr, mod.ran,
column.labels = c("listwise exclusion", "fixeff", "raneff"),
type = "html")
Imputations with fixed and random effects. Dependent variable: y.

| | listwise exclusion (1) | fixeff (2) | raneff (3) |
|---|---|---|---|
| x | -0.865 (1.464) | | |
| x.gr | | -0.545 (1.351) | |
| x.ran | | | -0.544 (1.384) |
| ClosedList | -0.990 (1.215) | -0.719 (0.839) | -0.701 (0.851) |
| x:ClosedList | 2.945 (2.190) | | |
| x.gr:ClosedList | | 3.368* (1.946) | |
| x.ran:ClosedList | | | 3.329* (2.001) |
| Constant | 18.193*** (0.823) | 16.434*** (0.592) | 16.434*** (0.600) |
| Observations | 388 | 1,302 | 1,302 |
| R2 | 0.005 | 0.004 | 0.003 |
| Adjusted R2 | -0.003 | 0.001 | 0.001 |
| Residual Std. Error | 9.912 (df = 384) | 10.342 (df = 1298) | 10.344 (df = 1298) |
| F Statistic | 0.663 (df = 3; 384) | 1.529 (df = 3; 1298) | 1.411 (df = 3; 1298) |

Note: *p<0.1; **p<0.05; ***p<0.01
Bringing in the uncertainty
Until now, we’ve added information to the models from other variables to increase the N in the main model. There is quite some uncertainty in these predictions. How can we account for that? In fact, the worse the predictions, the more important it is to account for uncertainty. We can do this by using simulation, i.e. we add a bit of noise to our predictions.
Random sampling
Absent any new information, we can of course fill in the gaps using information from the model, but without any assumptions about a trend; that is, we can do random sampling. It has the advantage that it allows us to inflate the standard error, reflecting the uncertainty in the estimation of x.
I can use the sample() function to draw randomly from the observed part of my x.
sam <- sample(
#Draw values that are not NA
df0$x[!is.na(df0$x)],
#Make as many draws as I have data
size = nrow(df0),
#For each draw, replace so that I can draw anew
replace = T)
#Replace the missing observations by my draws
df0 <-
df0 %>%
mutate(x.sam = if_else(is.na(x), sam, x))
Note that this random part means that I’ll get a new result every time I run my R-code. Also, it won’t help me make much headway if I have a lot of missing observations and I am actually interested in the effect of this particular variable.
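The model on the sampled variable (matching the output below):

mod.sam <- lm(y ~ x.sam * ClosedList, data = df0)
summary(mod.sam)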
##
## Call:
## lm(formula = y ~ x.sam * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.995 -7.449 -1.449 6.551 73.005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.28066 0.50301 32.366 <2e-16 ***
## x.sam -0.05268 0.89938 -0.059 0.953
## ClosedList 0.16861 0.68930 0.245 0.807
## x.sam:ClosedList 0.59846 1.24983 0.479 0.632
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.0005861, Adjusted R-squared: -0.001724
## F-statistic: 0.2537 on 3 and 1298 DF, p-value: 0.8587
Sampling from a prediction
We can also sample from a prediction if we know the distribution of this prediction. One way is to assume a normal distribution of each prediction.
df0$Age2 <- df0$Age^2

##Missing values in x are related to the value of other xs
mod2.na <- glm(x ~ Age + Age2 + EPGroup,
               family = "binomial",
               data = df0)
summary(mod2.na)
##
## Call:
## glm(formula = x ~ +Age + Age2 + EPGroup, family = "binomial",
## data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.718932 2.887349 -3.020 0.002530 **
## Age 0.373340 0.116681 3.200 0.001376 **
## Age2 -0.004263 0.001169 -3.648 0.000264 ***
## EPGroupECR 0.102686 0.529200 0.194 0.846145
## EPGroupEFD -14.791011 759.441910 -0.019 0.984461
## EPGroupEFDD -1.470950 1.117701 -1.316 0.188157
## EPGroupEPP 0.946348 0.396894 2.384 0.017108 *
## EPGroupGreens/EFA -1.282898 0.828435 -1.549 0.121483
## EPGroupGUE/NGL -0.279946 0.671411 -0.417 0.676714
## EPGroupNI 0.631098 0.644278 0.980 0.327312
## EPGroupPES 0.648792 0.417939 1.552 0.120576
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 519.82 on 417 degrees of freedom
## Residual deviance: 449.50 on 407 degrees of freedom
## (1003 observations deleted due to missingness)
## AIC: 471.5
##
## Number of Fisher Scoring iterations: 15
I ask for the standard error of the predictions in order to know the spread of the distribution.
preds <- predict(mod2.na, df0, "response", se.fit = T)
df0$x.glm <- preds$fit
df0$x.se <- preds$se.fit
#Draw from a normal distribution with a mean and a standard deviation
sims <- rnorm(nrow(df0), df0$x.glm, df0$x.se)
#Replace the missing observations by my draws
df0 <-
df0 %>%
mutate(x.sim = if_else(is.na(x), sims, x))
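The model on the simulated variable (matching the output below):

mod.sim <- lm(y ~ x.sim * ClosedList, data = df0)
summary(mod.sim)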
##
## Call:
## lm(formula = y ~ x.sim * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.312 -7.534 -1.331 6.334 73.883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.3818 0.5874 27.887 <2e-16 ***
## x.sim -0.3754 1.3228 -0.284 0.777
## ClosedList -0.5640 0.8324 -0.678 0.498
## x.sim:ClosedList 2.8701 1.9098 1.503 0.133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 1298 degrees of freedom
## (119 observations deleted due to missingness)
## Multiple R-squared: 0.00286, Adjusted R-squared: 0.0005555
## F-statistic: 1.241 on 3 and 1298 DF, p-value: 0.2934
stargazer(mod.base, mod.sam, mod.sim,
column.labels = c("listwise exclusion", "random sample", "simulation"),
type = "html")
Random sampling and simulation. Dependent variable: y.

| | listwise exclusion (1) | random sample (2) | simulation (3) |
|---|---|---|---|
| x | -0.865 (1.464) | | |
| x.sam | | -0.053 (0.899) | |
| x.sim | | | -0.375 (1.323) |
| ClosedList | -0.990 (1.215) | 0.169 (0.689) | -0.564 (0.832) |
| x:ClosedList | 2.945 (2.190) | | |
| x.sam:ClosedList | | 0.598 (1.250) | |
| x.sim:ClosedList | | | 2.870 (1.910) |
| Constant | 18.193*** (0.823) | 16.281*** (0.503) | 16.382*** (0.587) |
| Observations | 388 | 1,302 | 1,302 |
| R2 | 0.005 | 0.001 | 0.003 |
| Adjusted R2 | -0.003 | -0.002 | 0.001 |
| Residual Std. Error | 9.912 (df = 384) | 10.358 (df = 1298) | 10.346 (df = 1298) |
| F Statistic | 0.663 (df = 3; 384) | 0.254 (df = 3; 1298) | 1.241 (df = 3; 1298) |

Note: *p<0.1; **p<0.05; ***p<0.01
A side note: How to compare model coefficients
sjPlot allows us to compare coefficients from several models at the same time.
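library(sjPlot)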
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
Here, I replace the variable names for the regression coefficients so that I will get all coefficients placed right above each other.
replace_names <- function(x,
vars = c("x", "ClosedList", "x:ClosedList")){
names(x$coefficients)[-1] <- vars
return(x)
}
mod.m <- replace_names(mod.m)
mod.g <- replace_names(mod.g)
mod.sam <- replace_names(mod.sam)
mod.sim <- replace_names(mod.sim)
plot_models( mod.base,
mod.m,
mod.g,
mod.sam,
mod.sim,
m.labels = c("Listwise exclusion",
"Mean imputation",
"Group-mean imputation",
"Random sampling",
"Simulation"),
legend.title = "Model approaches",
title = "Coeficients from four strategies of NA handling")
stargazer(mod.base, mod.m, mod.g, mod.sam, mod.sim,
column.labels = c("Base-line", "Mean repl.", "Group repl.", "Random sample", "Imputation"),
type = "html")
Dependent variable: y.

| | Base-line (1) | Mean repl. (2) | Group repl. (3) | Random sample (4) | Imputation (5) |
|---|---|---|---|---|---|
| x | -0.865 (1.464) | -0.845 (1.529) | -0.963 (1.529) | -0.053 (0.899) | -0.375 (1.323) |
| ClosedList | -0.990 (1.215) | -0.542 (0.916) | -0.594 (0.912) | 0.169 (0.689) | -0.564 (0.832) |
| x:ClosedList | 2.945 (2.190) | 2.852 (2.287) | 3.076 (2.287) | 0.598 (1.250) | 2.870 (1.910) |
| Constant | 18.193*** (0.823) | 16.530*** (0.636) | 16.577*** (0.648) | 16.281*** (0.503) | 16.382*** (0.587) |
| Observations | 388 | 1,302 | 1,302 | 1,302 | 1,302 |
| R2 | 0.005 | 0.002 | 0.002 | 0.001 | 0.003 |
| Adjusted R2 | -0.003 | -0.001 | -0.001 | -0.002 | 0.001 |
| Residual Std. Error | 9.912 (df = 384) | 10.352 (df = 1298) | 10.351 (df = 1298) | 10.358 (df = 1298) | 10.346 (df = 1298) |
| F Statistic | 0.663 (df = 3; 384) | 0.687 (df = 3; 1298) | 0.768 (df = 3; 1298) | 0.254 (df = 3; 1298) | 1.241 (df = 3; 1298) |

Note: *p<0.1; **p<0.05; ***p<0.01
Mice (Multivariate Imputation by Chained Equations)
We can go all in on this approach: simulate x’s, store them in several data frames, and run separate models on each of them before we average over the coefficients. This is what mice does.
Basic workflow
The mice() function draws on chained equations. It predicts x1 using x2 and y, then x2 using x1 (including the predicted values) and y, etc. It goes around in circles until the predictions stabilize. For each iteration, it adds random noise drawn from a normal distribution to the predictions. In the end, it produces several data sets for us in which it imputes slightly different values.
Now, we can run our main model on each of those data sets, average over the regression coefficients and calculate the uncertainty for each of them using both the standard error from the model at hand, as well as the variation in the coefficients between models.
Step one: prepare the data and strategy
Mice requires a data set with predictors that will all be imputed/used for imputation. If I don’t specify otherwise, all covariates in the data are used to run around in circles. That’s why it’s useful to subset the data to the most relevant predictors.
Mice requires us to make explicit the measurement level of each covariate, so make sure to transform binary variables and categories to factors.
library(mice)
df0 <- df %>%
#Quadratic term for age
mutate(Age2 = Age*Age) %>%
#Select variables
dplyr::select(y, x, ClosedList,
EPGroup, Age, Age2) %>%
#Recode to factor
mutate(EPGroup = as.factor(EPGroup))
Mice can use many types of models to predict. Here, I choose a linear model and ask for 10 data sets.
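A sketch of the call; reading “linear model” as mice’s "norm" method is an assumption. Printing the iteration log below is the default behavior.

df0.imp <- mice(df0,
                m = 10,
                #Linear imputation model (assumption)
                method = "norm")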
##
## iter imp variable
## 1 1 y x
## 1 2 y x
## 1 3 y x
## 1 4 y x
## 1 5 y x
## 1 6 y x
## 1 7 y x
## 1 8 y x
## 1 9 y x
## 1 10 y x
## 2 1 y x
## 2 2 y x
## 2 3 y x
## 2 4 y x
## 2 5 y x
## 2 6 y x
## 2 7 y x
## 2 8 y x
## 2 9 y x
## 2 10 y x
## 3 1 y x
## 3 2 y x
## 3 3 y x
## 3 4 y x
## 3 5 y x
## 3 6 y x
## 3 7 y x
## 3 8 y x
## 3 9 y x
## 3 10 y x
## 4 1 y x
## 4 2 y x
## 4 3 y x
## 4 4 y x
## 4 5 y x
## 4 6 y x
## 4 7 y x
## 4 8 y x
## 4 9 y x
## 4 10 y x
## 5 1 y x
## 5 2 y x
## 5 3 y x
## 5 4 y x
## 5 5 y x
## 5 6 y x
## 5 7 y x
## 5 8 y x
## 5 9 y x
## 5 10 y x
The result is a list containing the main data frame (data) as well as 10 lists with imputed values.
## [1] "data" "imp" "m" "where"
## [5] "blocks" "call" "nmis" "method"
## [9] "predictorMatrix" "visitSequence" "formulas" "post"
## [13] "blots" "ignore" "seed" "iteration"
## [17] "lastSeedValue" "chainMean" "chainVar" "loggedEvents"
## [21] "version" "date"
Here is my imputed x, for example.
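For example, peeking at the stored imputations (a sketch):

head(df0.imp$imp$x)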
Step 2: Run your main model.
Now that we have the data, we can run the main model on each of them. I could do it by hand, by subsetting the data. In that case, I use the complete() function to fill in the missing information for me. The action= argument specifies which imputed data set I run my model on.
##
## Call:
## lm(formula = y ~ x * ClosedList, data = complete(df0.imp, action = 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.876 -7.649 -1.346 6.466 73.779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6104 0.4916 33.785 <2e-16 ***
## x -0.5988 0.8638 -0.693 0.488
## ClosedList -0.1448 0.6715 -0.216 0.829
## x:ClosedList 1.0452 1.1920 0.877 0.381
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.42 on 1417 degrees of freedom
## Multiple R-squared: 0.0006275, Adjusted R-squared: -0.001488
## F-statistic: 0.2966 on 3 and 1417 DF, p-value: 0.8279
Another way is to do this in a loop over all the imputed data sets. Here, I use the with() function to apply the same lm() call to each of them. The result is a list with many model objects. The model objects are reported in analyses.
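Something like:

mod <- with(df0.imp,
            lm(y ~ x * ClosedList))
names(mod)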
## [1] "call" "call1" "nmis" "analyses"
I can aggregate the results using another function, pool(). It does two things for me: it calculates the mean regression coefficient across all 10 models, and it calculates a standard error that mixes the standard error of each model coefficient with the variation in the coefficients between models.
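Given how res is used below (res$estimate, res$std.error), the pooled object is summarized:

res <- pool(mod) %>%
  summary()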
To present the results in stargazer, I will have to do a bit of wrangling. I add the model object, but then I specify the coefficients and the standard errors separately. Just as we did with the robust standard errors.
Here, I add both the baseline model and the newly estimated one.
coefs <- list(summary(mod.base)$coefficients[,1],
res$estimate)
ses <- list(summary(mod.base)$coefficients[,2],
res$std.error)
stargazer(mod.base, mod$analyses[[1]],
coef = coefs,
se = ses,
title = "Initial results from mice",
type = "html")
Initial results from mice. Dependent variable: y.

| | (1) | (2) |
|---|---|---|
| x | -0.865 (1.464) | -0.200 (1.074) |
| ClosedList | -0.990 (1.215) | -0.285 (0.761) |
| x:ClosedList | 2.945 (2.190) | 1.786 (1.592) |
| Constant | 18.193*** (0.823) | 16.304*** (0.556) |
| Observations | 388 | 1,421 |
| R2 | 0.005 | 0.001 |
| Adjusted R2 | -0.003 | -0.001 |
| Residual Std. Error | 9.912 (df = 384) | 10.423 (df = 1417) |
| F Statistic | 0.663 (df = 3; 384) | 0.297 (df = 3; 1417) |

Note: *p<0.1; **p<0.05; ***p<0.01
I’m a bit underwhelmed by this prediction. Let’s do the predictions with more zeal.
Choice of imputation models and predictors
It is usually best to chose separate models for each variable we want to impute. In the previous example both x and y were imputed using a linear model and the same predictors. This might not be optimal.
I begin by creating a matrix where I specify which variables I want to impute NAs to/from. Here, I remove the y from my set of predictors. I then tell the model where the NAs are. Since Im now left with only x to impute, I opt for a binomial logistic model.
dfNA <- is.na(df0)
#Remove the dependent variable
dfNA[,1] = FALSE
#Recode to categorical variable
df0$x <- as.factor(df0$x)
df1.imp <- mice(df0,
m = 10,
#Binomial logistic imputation model
method = "logreg",
#Where are the NAs
where = dfNA,
printFlag = F
)
mod1 <- with(df1.imp,
lm(y ~
x
* ClosedList)
)
res1 <- pool(mod1)
coefs <- list(summary(mod.base)$coefficients[,1],
res$estimate,
res1$estimate)
ses <- list(summary(mod.base)$coefficients[,2],
res$std.error,
res1$std.error)
stargazer(mod.base, mod$analyses[[1]],mod1$analyses[[1]],
coef = coefs,
se = ses,
title = "Initial results from mice",
type = "html")
Initial results from mice. Dependent variable: y.

| | (1) | (2) | (3) |
|---|---|---|---|
| x | -0.865 (1.464) | -0.200 (1.074) | |
| x1 | | | 0.465 (0.921) |
| ClosedList | -0.990 (1.215) | -0.285 (0.761) | -0.306 (0.690) |
| x:ClosedList | 2.945 (2.190) | 1.786 (1.592) | |
| x1:ClosedList | | | 1.860 (1.243) |
| Constant | 18.193*** (0.823) | 16.304*** (0.556) | 16.132*** (0.492) |
| Observations | 388 | 1,421 | 1,302 |
| R2 | 0.005 | 0.001 | 0.006 |
| Adjusted R2 | -0.003 | -0.001 | 0.004 |
| Residual Std. Error | 9.912 (df = 384) | 10.423 (df = 1417) | 10.327 (df = 1298) |
| F Statistic | 0.663 (df = 3; 384) | 0.297 (df = 3; 1417) | 2.794** (df = 3; 1298) |

Note: *p<0.1; **p<0.05; ***p<0.01
Imputing several variables with different covariates
We can do several imputations at the same time.
I begin by creating “blocks” of covariates. Each block corresponds to a covariate I want to impute. I then create a matrix with dummies for the covariates I want to include in the imputation.
#All NAs
dfNA <- is.na(df0)
#Only the first two variables are missing/to be imputed
dfNA[, 3:5] = FALSE
#Two blocks
blocks <- make.blocks(c( "y", "x"))
#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
blocks = blocks)
predictorMatrix
##   y x ClosedList EPGroup Age Age2
## y 0 1          1       1   1    1
## x 1 0          1       1   1    1
df2.imp <- mice(df0,
m = 10,
#Normal model for y, logistic for x
method = c("norm", "logreg"),
#Blocks of variables
blocks = blocks,
#Data frame specifying where imputations should be made
where = dfNA,
#Matrix flagging the predictors
predictorMatrix = predictorMatrix,
printFlag = F
)
mod2 <- with(df2.imp,
lm(y ~
x * ClosedList
)
)
res2 <- pool(mod2) %>%
summary
res2
Mice imputations of both x and y. Dependent variable: y.

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| x | -0.865 (1.464) | -0.200 (1.074) | | |
| x1 | | | 0.465 (0.921) | -0.023 (1.371) |
| ClosedList | -0.990 (1.215) | -0.285 (0.761) | -0.306 (0.690) | -0.419 (0.722) |
| x:ClosedList | 2.945 (2.190) | 1.786 (1.592) | | |
| x1:ClosedList | | | 1.860 (1.243) | 1.659 (1.453) |
| Constant | 18.193*** (0.823) | 16.304*** (0.556) | 16.132*** (0.492) | 16.314*** (0.615) |
| Observations | 388 | 1,421 | 1,302 | 1,421 |
| R2 | 0.005 | 0.001 | 0.006 | 0.004 |
| Adjusted R2 | -0.003 | -0.001 | 0.004 | 0.001 |
| Residual Std. Error | 9.912 (df = 384) | 10.423 (df = 1417) | 10.327 (df = 1298) | 10.390 (df = 1417) |
| F Statistic | 0.663 (df = 3; 384) | 0.297 (df = 3; 1417) | 2.794** (df = 3; 1298) | 1.665 (df = 3; 1417) |

Note: *p<0.1; **p<0.05; ***p<0.01
Display the results
We can use ggpredict to simulate scenarios for us. The most efficient way to do this is by writing a loop.
library(ggeffects)
#Create an empty list to hold predictions from each imputed data set
predictions <- list()
#Loop over the 10 data sets/models
for(i in 1:10){
  m <- lm(y ~
            x * ClosedList,
          data = complete(df2.imp, action = i))
  predictions[[i]] <- ggpredict(m,
                                terms = list("x" = c(0,1),
                                             "ClosedList" = c(0,1))
  )
}
The function pool_predictions() pools the simulated results for us by drawing on all 10 models. The outcome is the data frame we’re used to.
effs <- pool_predictions(predictions)
effs %>%
#Transform to regular data frame
as.data.frame %>%
#Recode to make variables substantive
mutate(group = if_else(group == 1, "Party-centered", "Candidate-centered"),
x = if_else(x == 1, "2. reelection-seeker", "1. not reelection-seeker")) %>%
ggplot() +
geom_point(aes(x = x,
y = predicted,
group = group),
position = position_dodge(0.3)) +
geom_errorbar(aes(x = x,
ymax = conf.high,
ymin = conf.low,
group = group),
position = position_dodge(0.3)) +
ylab("Number of meetings attended") +
ggtitle("Effect of ambition on committee attendance")
Model assessment
Assessing whether we address missing observations appropriately is not easy. We don’t even know the extent of the problem! However, remember that ignoring the problem means you do a listwise exclusion. That’s a choice too.
When we impute values, the risk we are running is often overfitting: We pick variables that maximize our in-sample predictions, but then do a lousy job of predicting out of sample (i.e. our NAs).
In-sample predictions
You can do your usual cross-correlations between observed and predicted values for the imputed variable.
When using mice, there are a few additional functions. For continuous variables, we can plot the density (distribution) of the 10 simulated data (in red) against the observed density (blue).
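With mice, this is a one-liner (a sketch, applied to the last imputation object):

densityplot(df2.imp)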
The xyplot() function plots the imputed variable against whatever covariate you have used to impute it; just to check if the two overlap appropriately.
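For example (choosing Age as the covariate is an assumption):

xyplot(df2.imp, y ~ Age)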
Out-of-sample predictions
A last possibility is to leave observations out before making the predictions/picking a model. You can then use the model to impute the artificial NAs and compare.
df0$x_miss <- df0$x
id <- sample(row.names(df0 %>%
filter(!is.na(x))),
100)
df0$x_miss[row.names(df0) %in% id ] <- NA
#All NAs
dfNA <- is.na(df0)
#Only y and the artificially amputated x_miss are to be imputed
dfNA[, 2:6] = FALSE
dfNA %>% head
##          y     x ClosedList EPGroup   Age  Age2 x_miss
## [1,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [2,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [3,] FALSE FALSE      FALSE   FALSE FALSE FALSE  FALSE
## [4,]  TRUE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [5,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [6,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
#Two blocks
blocks <- make.blocks(c( "y", "x_miss"))
#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
blocks = blocks)
#Remove x as a predictor
predictorMatrix[,"x"] = 0
predictorMatrix
##        y x ClosedList EPGroup Age Age2 x_miss
## y      0 0          1       1   1    1      1
## x_miss 1 0          1       1   1    1      0
#Recode to factors
df0 <-
df0 %>%
mutate(EPGroup = as.factor(EPGroup),
x_miss = as.factor(x_miss))
df2.amp <- mice(df0,
m = 10,
#Normal model for y, logistic for x_miss
method = c("norm", "logreg"),
#Blocks of variables
blocks = blocks,
#Data frame specifying where imputations should be made
where = dfNA,
#Matrix flagging the predictors
predictorMatrix = predictorMatrix,
printFlag = F
)
test <- mice::complete(df2.amp, action = 3, include = T)
table(test$x, test$x_miss)
##
##       0   1
##   0 555   4
##   1   5 249