Missing data
In this session, we will explore methods for handling missing data. We start by exploring the variables: their distributions as well as the distribution of their missing observations. We then move on to simpler ways of imputing values before we attack multiple imputation and simulation.
New R functions:
- from base: sample()
- new from finalfit package: missing_plot(), missing_compare(), missing_pattern()
- new from mice package: complete(), pool(), pool_predictions()
We continue to use the data on Members of the European Parliament (MEPs). We are interested in whether re-election seeking MEPs spend more time in Parliament if they will compete in a system where the party selectorate can allocate “safe seats”. In other words, we have two comparison groups: We compare reelection seekers with non-reelection seekers and party-centered vs. candidate-centered systems through an interaction.
\(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \times x_2\)
This time, the data includes several groupings; it is still a
time-series cross-sectional (unbalanced panel) data set in which each MEP is
observed every 6 months over a 5-year period. Thus, I have observations
of individuals (ID) in up to 10 periods
(Period). These individuals also have a nationality
(Nationality) and a national party affiliation
(ParlGovID).
Our main variables are Attendance (\(y\)), FutureInEP (\(x_1\)) and ClosedList (\(x_2\)).
I begin by loading in my data. You can find them online here:
https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda.
I also rename a few of my variables for ease.
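A sketch of the loading and renaming step; I assume the .rda file contains a data frame called df and that the renaming maps the original names onto the shorthand used below (this mapping is my assumption, not shown in the original code).

```r
# Load the MEP data straight from the course website.
load(url("https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda"))

library(dplyr)

# Rename for ease; the original column names are assumed from the text.
df <- df %>%
  rename(y = Attendance,   # dependent variable: committee attendance
         x = FutureInEP)   # predictor: reelection intent from the survey
```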
Introduction
The “hitch” this time is that I use a survey response to measure an MEP’s intention to seek reelection (\(x_1\)). However, most MEPs did not complete the survey. My intention is to impute the answers. While we’re at it, I’ll also scout for missingness on other variables too, including my dependent variable (y, Attendance).
Univariate statistics
Dependent variable: Attendance
I approximate legislative investment by the number of committee meetings an MEP has attended during the semester (the period of observation).
df %>%
  ggplot +
  geom_histogram(aes(y)) +
  ggtitle("Number of committee meetings") +
  ylab("Number of committee meetings")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 119 rows containing non-finite outside the scale range
## (`stat_bin()`).

The variable is naturally bounded at 0 (i.e. no one can attend a negative number of meetings). However, for the purposes of this exercise I am mainly interested in the differences in means between MEPs with an expressed reelection intent and those without such a clear ambition. I could also make the bet that the conditional distribution (and my residuals) will be approximately normal once I have specified my full model. This is always smart to check.
Independent variable: Reelection intent
The survey asks where an MEP sees him/herself in 10 years. One of the
possible answers is “European Parliament”
(FutureInEP==1)
df$future <- ifelse(df$x == 1, "reelection", "exit")
df %>%
  ggplot +
  geom_bar(aes(future)) +
  ggtitle("Re-election intent upon election",
          subtitle = "EPRG Survey distributed in 2009 and 2014")
As is clear from the barplot, most MEPs did not reply. I have a lot of
missing data (71 %). To what extent is this a problem?
If the data is MCAR (missing completely at random), it would mainly
be a problem of statistical power. However, if the data is
systematically missing – and especially if the assignment mechanism for
NA on \(x\) is also
related to \(y\) – I should take
action. If the data is MAR (missing at random), I might impute some
values using observables. If it is MNAR (missing not at random), it’s a
different beast: we have no predictors for the missingness and cannot
know its source beyond theory/political-science reasoning. The best way to
look for MAR is to correlate the missingness with the other variables in the
model, as well as whatever other variables we might have.
Explore the missing observations
There is no way to know if the missing data is truly random. However,
a pretty sure giveaway that the missing observations are non-random is
if a predictor’s NAs covary with either i) NAs
in the other variables in the model (or other variables we may have) or
ii) the values of other variables.
For ease, let’s begin by subsetting the data to the covariates we are interested in. We can add variables to this code as we go.
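A minimal sketch of that subsetting step (the variable names are assumed from the text; y and x are the renamed attendance and survey variables):

```r
library(dplyr)

# Keep only the model variables for now; we can extend this vector later.
vars <- c("y", "x", "ClosedList")
df0 <- df %>%
  dplyr::select(all_of(vars))
```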
What are the variables that contain missing?
Now, let’s see where I have missing observations. The package
finalfit has several convenient functions for exploring
patterns in the missing observations. missing_plot()
organizes the data row-by-row on the x-axis and reports the variables on
the y-axis. It draws light blue bars for all observations that are
missing.
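A sketch of the call (assuming the finalfit package is installed and df0 holds the subsetted data):

```r
library(finalfit)

# Rows of the data on the x-axis, variables on the y-axis;
# light blue bars mark the missing observations.
df0 %>%
  missing_plot()
```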
We can see that I have some missing observations on my dependent
variable “Attendance” (\(y\)) as well
as a bunch on my predictor “FutureInEP” (\(x\)). By contrast, we have no missing
observations in the indicator for party-centered systems
(“ClosedList”).
Unique combinations of NAs
Another function from the mice package
allows us to explore the patterns of missing observations.
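The pattern matrix reported below can be produced with md.pattern() from mice (a sketch; finalfit’s missing_pattern() wraps, I believe, the same functionality):

```r
library(mice)

# Each row is a unique observed (1) / missing (0) pattern; the row names
# count how many observations share that pattern, and the last column
# counts how many variables are missing in it.
md.pattern(df0, plot = FALSE)
```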
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind

##     ClosedList   y    x     
## 388          1   1    1    0
## 914          1   1    0    1
## 30           1   0    1    1
## 89           1   0    0    2
##              0 119 1003 1122

We have four combinations of missing observations in the data:
- no missing (N = 388)
- missing only on the x (N = 914)
- missing only on the y (N = 30)
- missing on both x and y (N = 89)
Pairs of correlations
We should also check for systematic correlations between the missing
and observed data across different variables. The
missing_compare() function does a simple cross-tabulation
of missing values in the variable of your choice (e.g. \(x\)) and the values of other
predictors. It then tests the statistical significance of the
relationship. When the predictors are categorical, it performs a \(\chi^2\) test.
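A sketch of how the call might look (argument names from the finalfit documentation; the exact call is not shown in the original):

```r
library(finalfit)

# Compare the values of the explanatory variables between rows where the
# dependent variable is observed and rows where it is missing.
df0 %>%
  missing_compare(dependent = "y",
                  explanatory = c("x", "ClosedList"))
```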
Here, I compare missingness in y with the values of x1 and x2. It turns out that there is a relationship between NAs in MEPs’ attendance levels (y) and party-centered systems (ClosedList). It is not an effect of the survey question, but another glitch in the data. We can check with a simple correlation test, just to be sure.
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(is.na(df$x)) and df$ClosedList
## t = 3.7918, df = 1419, p-value = 0.0001558
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04840475 0.15136782
## sample estimates:
##       cor 
## 0.1001544

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(is.na(df$y)) and df$x
## t = 0.65162, df = 416, p-value = 0.515
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06417985  0.12745652
## sample estimates:
##       cor 
## 0.0319318

Missing on my dependent variable is not correlated with MEPs’ ambition (the value of x). However, the opposite is not true.
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(is.na(df$x)) and df$y
## t = -3.2667, df = 1300, p-value = 0.001117
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14385311 -0.03608115
## sample estimates:
##         cor 
## -0.09023128

It seems like we have two problems: missing on y and missing on x. Let’s ignore the missing on y for the time being.
Conditional NAs
A quick way to check if missing observations are conditional on other predictors is to regress the NAs on the rest of the predictors.
I begin by recoding my variable into 0 (not missing) and 1 (missing)
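The recode and the logistic regression can be reconstructed from the model output below:

```r
# 1 = missing, 0 = observed
df0$x_na <- ifelse(is.na(df0$x), 1, 0)

# Regress the missingness indicator on the remaining predictor.
mod.na <- glm(x_na ~ ClosedList,
              family = "binomial",
              data = df0)
summary(mod.na)
```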
## 
## Call:
## glm(formula = x_na ~ ClosedList, family = "binomial", data = df0)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.64768    0.08228   7.872  3.5e-15 ***
## ClosedList   0.44051    0.11703   3.764 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1721.8  on 1420  degrees of freedom
## Residual deviance: 1707.5  on 1419  degrees of freedom
## AIC: 1711.5
## 
## Number of Fisher Scoring iterations: 4

Ok, so the non-responses to the survey are systematically correlated with the electoral system. Here, I find that the odds of a missing observation on x increase by about 55 % (\(e^{0.44} \approx 1.55\)) when we go from candidate-centered to party-centered systems.
If we assume the data is MAR and this is the only covariate for missing values, then conditioning on electoral system in the main model will ideally address the bias induced by the missing observations. We can also use the predictors in the model to impute non-responses. Ward and Ahlquist (2018) also argue that we can use the outcome variable (y) to impute values on x. Their argument is that we have a (theoretical) relationship between the two, and that this is reflected in their correlation. The correlation is symmetric: if we can predict y using x, then we can also use y to predict x.
Before we go that far, let’s consider some low-stakes options.
Quick fixes
Removing data
My ideal model looks like this:
## 
## Call:
## lm(formula = y ~ x * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.283  -8.193  -1.198   6.797  30.807 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.1931     0.8231  22.103   <2e-16 ***
## x             -0.8647     1.4642  -0.591    0.555    
## ClosedList    -0.9899     1.2150  -0.815    0.416    
## x:ClosedList   2.9445     2.1900   1.345    0.180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.912 on 384 degrees of freedom
##   (1033 observations deleted due to missingness)
## Multiple R-squared:  0.005149,   Adjusted R-squared:  -0.002623 
## F-statistic: 0.6625 on 3 and 384 DF,  p-value: 0.5755

From the baseline model, we can see that MEPs who already know that they will seek reelection attend on average around 3 committee meetings more than their colleagues in party-centered systems without such ambition.
If I do nothing more at the estimation stage, R will default to listwise exclusion of all observations with missing values on any of the covariates included in the model.
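A quick way to see how many observations survive (a sketch; the model object name is an assumption):

```r
# Fit the baseline model; lm() silently drops incomplete rows.
mod.base <- lm(y ~ x * ClosedList, data = df0)

# Observations actually used in the estimation ...
length(mod.base$residuals)
# ... versus rows in the data frame.
nrow(df0)
```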
## [1] 388

Here, we see that the model uses 388 observations, while my data frame contains 1421 observations.
Creating weights
Sometimes, the main problem is just over- or under-representation of some groups of observations. If so, we can include weights in the regression to compensate for the overrepresentation.
df %>%
  group_by(Nationality) %>%
  summarize(prop = mean(is.na(x))) %>%
  dplyr::select(Nationality, prop) %>%
  ggplot +
  geom_bar(aes(x = prop,
               y = Nationality),
           stat = "identity") +
  ggtitle("Proportion of non-responses by nationality") +
  xlab("")

Distribution of non-responses on the EPRG survey
From the Figure, we see that there is some variation in response rates across nationalities. If that is all, and we don’t think the same mechanism will affect the values taken on by my \(y\), then we can simply weight the observations according to their probability of occurring in the survey. E.g. we can see that Hungary and Spain have high non-response rates (and are thus underrepresented among respondents), while Denmark has a comparatively high response rate.
I begin by creating a dummy indicator variable for whether the MEP responded to the survey.
I then fit a model to predict the occurrence of non-response.
## 
## Call:
## glm(formula = future_na ~ Nationality, family = "binomial", data = df)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                1.34993    0.42414   3.183  0.00146 **
## NationalityBelgium        -0.38053    0.55253  -0.689  0.49101   
## NationalityBulgaria       -0.94446    0.54676  -1.727  0.08410 . 
## NationalityCroatia        -0.94446    0.77237  -1.223  0.22140   
## NationalityCyprus         -1.01345    0.72302  -1.402  0.16100   
## NationalityCzech Republic -1.19578    0.53204  -2.248  0.02461 * 
## NationalityDenmark        -1.34993    0.58869  -2.293  0.02184 * 
## NationalityEstonia        -0.50263    0.80999  -0.621  0.53490   
## NationalityFinland        -0.71394    0.59146  -1.207  0.22740   
## NationalityFrance         -0.24026    0.47352  -0.507  0.61187   
## NationalityGermany        -0.34011    0.45458  -0.748  0.45435   
## NationalityGreece         -0.34662    0.55149  -0.629  0.52966   
## NationalityHungary         1.24034    0.73364   1.691  0.09090 . 
## NationalityIreland        -1.34993    0.63413  -2.129  0.03327 * 
## NationalityItaly          -0.56977    0.46156  -1.234  0.21704   
## NationalityLatvia         -0.89794    0.64316  -1.396  0.16268   
## NationalityLithuania      -0.30847    0.63670  -0.484  0.62804   
## NationalityLuxembourg     -0.65678    0.74491  -0.882  0.37795   
## NationalityMalta          -1.01345    0.72302  -1.402  0.16100   
## NationalityNetherlands    -0.50263    0.52453  -0.958  0.33794   
## NationalityPoland         -0.50263    0.47698  -1.054  0.29199   
## NationalityPortugal       -0.34662    0.55149  -0.629  0.52966   
## NationalityRomania        -0.85745    0.50312  -1.704  0.08833 . 
## NationalitySlovakia       -1.34993    0.57770  -2.337  0.01945 * 
## NationalitySlovenia       -1.34993    0.65566  -2.059  0.03951 * 
## NationalitySpain          -0.03425    0.48738  -0.070  0.94398   
## NationalitySweden         -0.45199    0.55483  -0.815  0.41528   
## NationalityUnited Kingdom -0.18300    0.46865  -0.390  0.69619   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1721.8  on 1420  degrees of freedom
## Residual deviance: 1670.9  on 1393  degrees of freedom
## AIC: 1726.9
## 
## Number of Fisher Scoring iterations: 5

In practice, the weight variable scales each observation’s contribution to the estimation up or down according to the weight it is given.
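One way to turn the non-response model into regression weights (a sketch: the construction of `weight` is my assumption, reconstructed from the `weights = 1/weight` call in the regression output below; `mod.na.nat` is a hypothetical name for the nationality model above):

```r
# Predicted probability of NOT responding, from the nationality model.
df$pred.na <- predict(mod.na.nat, df, type = "response")

# Weight = predicted probability of responding. Respondents from
# low-response groups then count for more via weights = 1/weight
# (inverse-probability weighting).
df$weight <- 1 - df$pred.na

mod.w <- lm(Attendance ~ x * ClosedList,
            data = df,
            weights = 1/weight)
```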
## 
## Call:
## lm(formula = Attendance ~ x * ClosedList, data = df, weights = 1/weight)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.930  -9.556  -1.641   7.906  43.313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.3730     0.8092  22.704   <2e-16 ***
## x             -0.9877     1.4306  -0.690    0.490    
## ClosedList    -1.1130     1.2439  -0.895    0.371    
## x:ClosedList   3.2636     2.2176   1.472    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.24 on 384 degrees of freedom
##   (1033 observations deleted due to missingness)
## Multiple R-squared:  0.005924,   Adjusted R-squared:  -0.001842 
## F-statistic: 0.7628 on 3 and 384 DF,  p-value: 0.5155

This is a common way to address some of the representation biases we find in surveys such as the European Social Survey, the World Values Survey etc.
stargazer(mod.base, mod.w,
          column.labels = c("Listwise excl", "Weights"),
          title = "Listwise exclusion and regression weights.",
          type = "html")
Simple imputations
“Imputing” missing values means that we fill in the gaps. The non-informative ways of filling in gaps give the observations some value without adding new information. This will attenuate the statistical correlation, but increase the N. If my covariate is a control variable, and I want to keep as many observations as possible (or reduce selection bias) for the main predictors in the model, these are low-energy fixes.
Mean imputation
I can simply replace my missing observations with the mean of the variable. Here, the “mean” is the proportion of MEPs that answer the survey question in the affirmative (x == 1).
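A sketch of the mean imputation (the imputed variable name matches the x.bar used in the model output below):

```r
# Replace NAs with the observed mean (here: the share answering 1).
df0$x.bar <- ifelse(is.na(df0$x),
                    mean(df0$x, na.rm = TRUE),
                    df0$x)
```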
I can run my model with this new “imputed” variable. There is no new
information in the model; however, my N is higher. It also means that
the standard errors will be artificially low. I.e. I’ll be attenuating
the estimate while being overly certain about the result.
## 
## Call:
## lm(formula = y ~ x.bar * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.995  -7.617  -1.530   6.383  73.383 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.5296     0.6362  25.982   <2e-16 ***
## x.bar             -0.8445     1.5293  -0.552    0.581    
## ClosedList        -0.5423     0.9163  -0.592    0.554    
## x.bar:ClosedList   2.8522     2.2871   1.247    0.213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.001586,   Adjusted R-squared:  -0.000722 
## F-statistic: 0.6871 on 3 and 1298 DF,  p-value: 0.5599

Indicator variables
We can add to this strategy an indicator variable that flags all the cases where I’ve been meddling with my observations. It’s – once again – a way to keep the entire data set. Now, I control for the imputation, so that the coefficient on the imputed variable reflects the estimated effect of x on y using only the information from the observed values (plus some noise). The imputed values are subsumed in the indicator. Consider the similarity between the two coefficients for x and x.bar in models 1 and 3.
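Reconstructed from the model output below: flag the imputed cases, then interact the flag with the imputed variable.

```r
# 1 = the value of x was imputed, 0 = observed.
df0$x_na <- ifelse(is.na(df0$x), 1, 0)

# The indicator absorbs the imputed cases; some interactions drop out
# because x.bar is constant whenever x_na == 1.
mod.ind <- lm(y ~ x_na * x.bar * ClosedList, data = df0)
summary(mod.ind)
```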
## 
## Call:
## lm(formula = y ~ +x_na * +x.bar * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.283  -7.398  -1.398   5.811  73.811 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            18.1931     0.8565  21.241  < 2e-16 ***
## x_na                   -2.5246     0.8743  -2.887  0.00395 ** 
## x.bar                  -0.8647     1.5236  -0.568  0.57042    
## ClosedList             -0.9899     1.2643  -0.783  0.43381    
## x_na:x.bar                  NA         NA      NA       NA    
## x_na:ClosedList         0.8581     1.2563   0.683  0.49469    
## x.bar:ClosedList        2.9445     2.2789   1.292  0.19655    
## x_na:x.bar:ClosedList       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.31 on 1296 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.01056,    Adjusted R-squared:  0.006739 
## F-statistic: 2.765 on 5 and 1296 DF,  p-value: 0.01714

stargazer(mod.base, mod.m, mod.ind,
          title = "Low-stakes imputations",
          column.labels = c("listwise exclusion", "mean imp.", "mean imp. + indicator"),
          type = "html")

|                     | listwise exclusion (1) | mean imp. (2)        | mean imp. + indicator (3) |
|---------------------|------------------------|----------------------|---------------------------|
| x                   | -0.865 (1.464)         |                      |                           |
| x_na                |                        |                      | -2.525*** (0.874)         |
| x.bar               |                        | -0.845 (1.529)       | -0.865 (1.524)            |
| ClosedList          | -0.990 (1.215)         | -0.542 (0.916)       | -0.990 (1.264)            |
| x:ClosedList        | 2.945 (2.190)          |                      |                           |
| x_na:ClosedList     |                        |                      | 0.858 (1.256)             |
| x.bar:ClosedList    |                        | 2.852 (2.287)        | 2.945 (2.279)             |
| Constant            | 18.193*** (0.823)      | 16.530*** (0.636)    | 18.193*** (0.857)         |
| Observations        | 388                    | 1,302                | 1,302                     |
| R2                  | 0.005                  | 0.002                | 0.011                     |
| Adjusted R2         | -0.003                 | -0.001               | 0.007                     |
| Residual Std. Error | 9.912 (df = 384)       | 10.352 (df = 1298)   | 10.314 (df = 1296)        |
| F Statistic         | 0.663 (df = 3; 384)    | 0.687 (df = 3; 1298) | 2.765** (df = 5; 1296)    |

Note: *p<0.1; **p<0.05; ***p<0.01. Dependent variable: y.
|                     | (1) y               | (2) Attendance      | (3) y                | (4) y                  |
|---------------------|---------------------|---------------------|----------------------|------------------------|
| x                   | -0.865 (1.464)      | -0.988 (1.431)      |                      |                        |
| x_na                |                     |                     |                      | -2.525*** (0.874)      |
| x.bar               |                     |                     | -0.845 (1.529)       | -0.865 (1.524)         |
| ClosedList          | -0.990 (1.215)      | -1.113 (1.244)      | -0.542 (0.916)       | -0.990 (1.264)         |
| x:ClosedList        | 2.945 (2.190)       | 3.264 (2.218)       |                      |                        |
| x_na:ClosedList     |                     |                     |                      | 0.858 (1.256)          |
| x.bar:ClosedList    |                     |                     | 2.852 (2.287)        | 2.945 (2.279)          |
| Constant            | 18.193*** (0.823)   | 18.373*** (0.809)   | 16.530*** (0.636)    | 18.193*** (0.857)      |
| Observations        | 388                 | 388                 | 1,302                | 1,302                  |
| R2                  | 0.005               | 0.006               | 0.002                | 0.011                  |
| Adjusted R2         | -0.003              | -0.002              | -0.001               | 0.007                  |
| Residual Std. Error | 9.912 (df = 384)    | 12.243 (df = 384)   | 10.352 (df = 1298)   | 10.314 (df = 1296)     |
| F Statistic         | 0.663 (df = 3; 384) | 0.763 (df = 3; 384) | 0.687 (df = 3; 1298) | 2.765** (df = 5; 1296) |

Note: *p<0.1; **p<0.05; ***p<0.01.
Inserting information to the model
Until now, we haven’t inserted any new information to the missing variables. We can do that using imputations; either by inserting information from i) the other covariates or from ii) new variables. There are grounds to say that we can draw on the covariates already in the model. First, if the covariates in the model are there because they covary (with each other), then we may argue that the information is already in the data; we just need to crystallize it.
Moving forward, we can either replace variables by group means, or go all the way using regressions.
Group means
We can replace the value of x by a predictor like
ClosedList.
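A sketch of the group-mean imputation, mirroring the Nationality version later in the section but grouping on ClosedList:

```r
library(dplyr)

df0 <-
  df0 %>%
  group_by(ClosedList) %>%
  # Impute the within-group mean where x is missing.
  mutate(x.bar.g = if_else(is.na(x), mean(x, na.rm = TRUE), x)) %>%
  ungroup()
```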
## 
## Call:
## lm(formula = y ~ x.bar.g * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.096  -7.607  -1.577   6.393  73.393 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         16.5765     0.6479  25.586   <2e-16 ***
## x.bar.g             -0.9627     1.5288  -0.630    0.529    
## ClosedList          -0.5936     0.9116  -0.651    0.515    
## x.bar.g:ClosedList   3.0758     2.2869   1.345    0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.001771,   Adjusted R-squared:  -0.000536 
## F-statistic: 0.7677 on 3 and 1298 DF,  p-value: 0.5122

Let’s instead use an entirely new covariate.
In fact, let’s add several predictors to the data set we use. My
candidate predictors are the legislator’s Nationality,
EPGroup (parliamentary group), EP (legislative
period/survey round) and Age when they responded to the
survey.
vars <- c("y", "x", "ClosedList",
          "EPGroup", "EP", "ID",
          "Nationality", "Age")
df0 <- df %>%
  dplyr::select(all_of(vars))

Usually, I’d check the patterns of missing data once again (i.e. are there missing observations on the new covariates?).
df0 <- 
  df0 %>%
  group_by(Nationality) %>%
  mutate(x.bar.g2 = if_else(is.na(x), mean(x, na.rm = T), x)) %>%
  #make sure to ungroup (useful for some of the sampling functions)
  ungroup()

## 
## Call:
## lm(formula = y ~ x.bar.g2 * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.939  -7.712  -1.308   6.383  73.404 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          16.7148     0.6179  27.050   <2e-16 ***
## x.bar.g2             -1.4242     1.4422  -0.987    0.324    
## ClosedList           -0.6270     0.8571  -0.732    0.465    
## x.bar.g2:ClosedList   3.2758     2.1325   1.536    0.125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.002097,   Adjusted R-squared:  -0.0002096 
## F-statistic: 0.9091 on 3 and 1298 DF,  p-value: 0.4359

stargazer(mod.base,  mod.g, mod.g2,
          column.labels = c("listwise exclusion", "group mean/ClosedList", "group mean/Nationality"),
          type = "html")

|                     | listwise exclusion (1) | group mean/ClosedList (2) | group mean/Nationality (3) |
|---------------------|------------------------|---------------------------|----------------------------|
| x                   | -0.865 (1.464)         |                           |                            |
| x.bar.g             |                        | -0.963 (1.529)            |                            |
| x.bar.g2            |                        |                           | -1.424 (1.442)             |
| ClosedList          | -0.990 (1.215)         | -0.594 (0.912)            | -0.627 (0.857)             |
| x:ClosedList        | 2.945 (2.190)          |                           |                            |
| x.bar.g:ClosedList  |                        | 3.076 (2.287)             |                            |
| x.bar.g2:ClosedList |                        |                           | 3.276 (2.133)              |
| Constant            | 18.193*** (0.823)      | 16.577*** (0.648)         | 16.715*** (0.618)          |
| Observations        | 388                    | 1,302                     | 1,302                      |
| R2                  | 0.005                  | 0.002                     | 0.002                      |
| Adjusted R2         | -0.003                 | -0.001                    | -0.0002                    |
| Residual Std. Error | 9.912 (df = 384)       | 10.351 (df = 1298)        | 10.350 (df = 1298)         |
| F Statistic         | 0.663 (df = 3; 384)    | 0.768 (df = 3; 1298)      | 0.909 (df = 3; 1298)       |

Note: *p<0.1; **p<0.05; ***p<0.01. Dependent variable: y.
Regression
From group means, there is only a tiny-tiny step to regression predictions. Since my x is binary, I run a binomial logistic regression to get predicted probabilities that I can later use for imputation. I’m regressing on “Age”.
## 
## Call:
## glm(formula = x ~ Age + I(Age * Age), family = "binomial", data = df0)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.424710   2.654323  -3.551 0.000384 ***
## Age           0.411206   0.109819   3.744 0.000181 ***
## I(Age * Age) -0.004549   0.001109  -4.100 4.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 519.82  on 417  degrees of freedom
## Residual deviance: 480.01  on 415  degrees of freedom
##   (1003 observations deleted due to missingness)
## AIC: 486.01
## 
## Number of Fisher Scoring iterations: 5

At this point, I’ll prioritize predictors that make sense theoretically and that have a strong correlation. It makes sense that MEPs above a certain age will not plan to stay in office for another 10 years, so age is a substantively good predictor. I add a quadratic term because I think young MEPs are increasingly likely to aim for an EU career (as they lock themselves into a career path); then – as they grow closer to retirement – I expect the opposite.
df0$pred.age <- predict(mod.imp.age, df0, type = "response")
df0 <-
  df0 %>%
  mutate(x.age = if_else(is.na(x), pred.age, x))

## 
## Call:
## lm(formula = y ~ +x.age * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.785  -7.608  -1.412   6.392  73.815 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.6080     0.6144  27.029   <2e-16 ***
## x.age             -1.0810     1.4197  -0.761    0.447    
## ClosedList        -0.5307     0.8714  -0.609    0.543    
## x.age:ClosedList   2.7889     2.0809   1.340    0.180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.001694,   Adjusted R-squared:  -0.0006132 
## F-statistic: 0.7343 on 3 and 1298 DF,  p-value: 0.5316

Random effects and fixed effects
A lot of (unobserved) information is often contained in group variables.
Fixed effects
## 
## Call:
## glm(formula = x ~ Age + I(Age * Age) + EPGroup, family = "binomial", 
##     data = df0)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.718932   2.887349  -3.020 0.002530 ** 
## Age                 0.373340   0.116681   3.200 0.001376 ** 
## I(Age * Age)       -0.004263   0.001169  -3.648 0.000264 ***
## EPGroupECR          0.102686   0.529200   0.194 0.846145    
## EPGroupEFD        -14.791011 759.441910  -0.019 0.984461    
## EPGroupEFDD        -1.470950   1.117701  -1.316 0.188157    
## EPGroupEPP          0.946348   0.396894   2.384 0.017108 *  
## EPGroupGreens/EFA  -1.282898   0.828435  -1.549 0.121483    
## EPGroupGUE/NGL     -0.279946   0.671411  -0.417 0.676714    
## EPGroupNI           0.631098   0.644278   0.980 0.327312    
## EPGroupPES          0.648792   0.417939   1.552 0.120576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 519.82  on 417  degrees of freedom
## Residual deviance: 449.50  on 407  degrees of freedom
##   (1003 observations deleted due to missingness)
## AIC: 471.5
## 
## Number of Fisher Scoring iterations: 15

df0$pred.epgr <- predict(mod.imp.epgr, df0, type = "response")
df0 <-
  df0 %>%
  mutate(x.gr = if_else(is.na(x), pred.epgr, x))

## 
## Call:
## lm(formula = y ~ +x.gr * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.538  -7.636  -1.351   6.311  74.028 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.4342     0.5924  27.741   <2e-16 ***
## x.gr             -0.5450     1.3513  -0.403   0.6868    
## ClosedList       -0.7192     0.8393  -0.857   0.3917    
## x.gr:ClosedList   3.3679     1.9462   1.730   0.0838 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.003522,   Adjusted R-squared:  0.001219 
## F-statistic: 1.529 on 3 and 1298 DF,  p-value: 0.2052

Fixed effects are always a possibility, but remember: you run the risk of overfitting the data when you hunt around for covariates. Fixed effects are a good example: they predict really well in-group, but what about new groups? Or if there is a change?
Random effects
Random effects are more flexible. We can create new groups simply by drawing from the distribution of random effects (i.e. the random intercepts are assumed to follow a normal distribution with a given mean and standard deviation). We can also add level-2 covariates to improve the out-of-sample predictions.
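The random-intercept model can be reconstructed from the glmer output below (assuming the lme4 package):

```r
library(lme4)

# Logistic imputation model with a random intercept per parliamentary group.
mod.imp.ran <- glmer(x ~ Age + I(Age * Age) + (1 | EPGroup),
                     family = "binomial",
                     data = df0)
summary(mod.imp.ran)
```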
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.211989 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: x ~ Age + I(Age * Age) + (1 | EPGroup)
##    Data: df0
## 
##      AIC      BIC   logLik deviance df.resid 
##    477.9    494.1   -235.0    469.9      414 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.1703 -0.6996 -0.4341  0.9599  6.5898 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  EPGroup (Intercept) 0.337    0.5805  
## Number of obs: 418, groups:  EPGroup, 9
## 
## Fixed effects:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.5170989  2.1412475  -4.445 8.80e-06 ***
## Age           0.4044541  0.0869720   4.650 3.31e-06 ***
## I(Age * Age) -0.0045406  0.0008735  -5.198 2.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) Age   
## Age        -0.982       
## I(Age*Age)  0.953 -0.990
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.211989 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
To allow new groups to be drawn, simply specify it in the prediction.
df0$pred.ran <- predict(mod.imp.ran, df0, type = "response", allow.new.levels = T)
df0 <-
  df0 %>%
  mutate(x.ran = if_else(is.na(x), pred.ran, x))
## 
## Call:
## lm(formula = y ~ +x.ran * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.518  -7.613  -1.353   6.324  74.034 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.4343     0.6005  27.369   <2e-16 ***
## x.ran             -0.5441     1.3840  -0.393   0.6943    
## ClosedList        -0.7012     0.8508  -0.824   0.4100    
## x.ran:ClosedList   3.3294     2.0008   1.664   0.0963 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.003251,   Adjusted R-squared:  0.0009476 
## F-statistic: 1.411 on 3 and 1298 DF,  p-value: 0.2378
stargazer(mod.base,  mod.gr, mod.ran,
          column.labels = c("listwise exclusion", "fixeff", "raneff"),
          type = "html")
| Dependent variable: |||
| y | |||
| listwise exclusion | fixeff | raneff | |
| (1) | (2) | (3) | |
| x | -0.865 | ||
| (1.464) | |||
| x.gr | -0.545 | ||
| (1.351) | |||
| x.ran | -0.544 | ||
| (1.384) | |||
| ClosedList | -0.990 | -0.719 | -0.701 | 
| (1.215) | (0.839) | (0.851) | |
| x:ClosedList | 2.945 | ||
| (2.190) | |||
| x.gr:ClosedList | 3.368* | ||
| (1.946) | |||
| x.ran:ClosedList | 3.329* | ||
| (2.001) | |||
| Constant | 18.193*** | 16.434*** | 16.434*** | 
| (0.823) | (0.592) | (0.600) | |
| Observations | 388 | 1,302 | 1,302 | 
| R2 | 0.005 | 0.004 | 0.003 | 
| Adjusted R2 | -0.003 | 0.001 | 0.001 | 
| Residual Std. Error | 9.912 (df = 384) | 10.342 (df = 1298) | 10.344 (df = 1298) | 
| F Statistic | 0.663 (df = 3; 384) | 1.529 (df = 3; 1298) | 1.411 (df = 3; 1298) | 
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
Bringing in the uncertainty
Until now, we’ve added information to the models from other variables to increase the N in the main model. There is quite some uncertainty in these predictions. How can we account for that? In fact, the worse the predictions, the more important it is to account for uncertainty. We can do this by using simulation, i.e. we add a bit of noise to our predictions.
Random sampling
Absent any new information, we can of course fill in the gaps using information from the model, but without any assumptions about a trend; that is, we can do random sampling. It has the advantage that it allows us to inflate the standard error, reflecting the uncertainty in the estimation of x.
I can use the sample() function to draw randomly from
the observed part of my x.
sam <- sample(
  #Draw values that are not NA
  df0$x[!is.na(df0$x)],
  #Make as many draws as I have data
  size = nrow(df0),
  #For each draw, replace so that I can draw anew
  replace = T)
#Replace the missing observations by my draws
df0 <- 
  df0 %>%
  mutate(x.sam = if_else(is.na(x), sam, x))
Note that this random part means that I’ll get a new result every time I run my R code. Also, it won’t help me make much headway if I have a lot of missing observations and I am actually interested in the effect of this particular variable.
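Because sample() draws at random, the imputed values change between runs. A minimal sketch, using base R only and illustrative numbers (not the MEP data), of how set.seed() makes such draws reproducible:

```r
#Fix the state of the random number generator before drawing
set.seed(42)
draw1 <- sample(1:10, size = 5, replace = TRUE)

#Resetting the same seed reproduces the exact same draws
set.seed(42)
draw2 <- sample(1:10, size = 5, replace = TRUE)

identical(draw1, draw2)  #TRUE
```

Setting a seed at the top of the script is enough to make the whole imputation reproducible.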
## 
## Call:
## lm(formula = y ~ x.sam * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.995  -7.449  -1.449   6.551  73.005 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.28066    0.50301  32.366   <2e-16 ***
## x.sam            -0.05268    0.89938  -0.059    0.953    
## ClosedList        0.16861    0.68930   0.245    0.807    
## x.sam:ClosedList  0.59846    1.24983   0.479    0.632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.0005861,  Adjusted R-squared:  -0.001724 
## F-statistic: 0.2537 on 3 and 1298 DF,  p-value: 0.8587
Sampling from a prediction
We can also sample from a prediction if we know the distribution of this prediction. One way is to assume a normal distribution of each prediction.
df0$Age2 <- df0$Age^2
##Missing values in x are related to the value of other xs
mod2.na <- glm(x ~ 
               + Age
               + Age2
               + EPGroup,
               family="binomial",
               df0
)
summary(mod2.na)
## 
## Call:
## glm(formula = x ~ +Age + Age2 + EPGroup, family = "binomial", 
##     data = df0)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.718932   2.887349  -3.020 0.002530 ** 
## Age                 0.373340   0.116681   3.200 0.001376 ** 
## Age2               -0.004263   0.001169  -3.648 0.000264 ***
## EPGroupECR          0.102686   0.529200   0.194 0.846145    
## EPGroupEFD        -14.791011 759.441910  -0.019 0.984461    
## EPGroupEFDD        -1.470950   1.117701  -1.316 0.188157    
## EPGroupEPP          0.946348   0.396894   2.384 0.017108 *  
## EPGroupGreens/EFA  -1.282898   0.828435  -1.549 0.121483    
## EPGroupGUE/NGL     -0.279946   0.671411  -0.417 0.676714    
## EPGroupNI           0.631098   0.644278   0.980 0.327312    
## EPGroupPES          0.648792   0.417939   1.552 0.120576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 519.82  on 417  degrees of freedom
## Residual deviance: 449.50  on 407  degrees of freedom
##   (1003 observations deleted due to missingness)
## AIC: 471.5
## 
## Number of Fisher Scoring iterations: 15
I ask for the standard error of the predictions in order to know the spread of the distribution.
preds <- predict(mod2.na, df0, "response", se.fit = T) 
df0$x.glm <- preds$fit
df0$x.se <- preds$se.fit
#Draw from a normal distribution with a mean and a standard deviation
sims <- rnorm(nrow(df0), df0$x.glm, df0$x.se)
#Replace the missing observations by my draws
df0 <- 
  df0 %>%
  mutate(x.sim = if_else(is.na(x), sims, x))
## 
## Call:
## lm(formula = y ~ x.sim * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.312  -7.534  -1.331   6.334  73.883 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.3818     0.5874  27.887   <2e-16 ***
## x.sim             -0.3754     1.3228  -0.284    0.777    
## ClosedList        -0.5640     0.8324  -0.678    0.498    
## x.sim:ClosedList   2.8701     1.9098   1.503    0.133    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.00286,    Adjusted R-squared:  0.0005555 
## F-statistic: 1.241 on 3 and 1298 DF,  p-value: 0.2934
stargazer(mod.base,  mod.sam, mod.sim,
          column.labels = c("listwise exclusion", "random sample", "simulation"),
          type = "html")
| Dependent variable: |||
| y | |||
| listwise exclusion | random sample | simulation | |
| (1) | (2) | (3) | |
| x | -0.865 | ||
| (1.464) | |||
| x.sam | -0.053 | ||
| (0.899) | |||
| x.sim | -0.375 | ||
| (1.323) | |||
| ClosedList | -0.990 | 0.169 | -0.564 | 
| (1.215) | (0.689) | (0.832) | |
| x:ClosedList | 2.945 | ||
| (2.190) | |||
| x.sam:ClosedList | 0.598 | ||
| (1.250) | |||
| x.sim:ClosedList | 2.870 | ||
| (1.910) | |||
| Constant | 18.193*** | 16.281*** | 16.382*** | 
| (0.823) | (0.503) | (0.587) | |
| Observations | 388 | 1,302 | 1,302 | 
| R2 | 0.005 | 0.001 | 0.003 | 
| Adjusted R2 | -0.003 | -0.002 | 0.001 | 
| Residual Std. Error | 9.912 (df = 384) | 10.358 (df = 1298) | 10.346 (df = 1298) | 
| F Statistic | 0.663 (df = 3; 384) | 0.254 (df = 3; 1298) | 1.241 (df = 3; 1298) | 
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
A side note: How to compare model coefficients
sjPlot allows us to compare coefficients from several
models at the same time.
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
Here, I replace the variable names for the regression coefficients so that the same coefficient from each model lines up in the plot.
replace_names <- function(x,
                          vars = c("x", "ClosedList", "x:ClosedList")){
  
  names(x$coefficients)[-1] <- vars
  return(x)
}
mod.m <- replace_names(mod.m)
mod.g <- replace_names(mod.g)
mod.sam <- replace_names(mod.sam)
mod.sim <- replace_names(mod.sim)
plot_models( mod.base,
             mod.m,
             mod.g,
             mod.sam,
             mod.sim,
             m.labels = c("Listwise exclusion",
                          "Mean imputation",
                          "Group-mean imputation",
                          "Random sampling",
                          "Simulation"),
             legend.title = "Model approaches",
             title = "Coefficients from five strategies of NA handling")
stargazer(mod.base, mod.m, mod.g, mod.sam, mod.sim,
          column.labels = c("Base-line", "Mean repl.", "Group repl.", "Random sample", "Imputation"),
          type = "html")
| Dependent variable: |||||
| y | |||||
| Base-line | Mean repl. | Group repl. | Random sample | Imputation | |
| (1) | (2) | (3) | (4) | (5) | |
| x | -0.865 | -0.845 | -0.963 | -0.053 | -0.375 | 
| (1.464) | (1.529) | (1.529) | (0.899) | (1.323) | |
| ClosedList | -0.990 | -0.542 | -0.594 | 0.169 | -0.564 | 
| (1.215) | (0.916) | (0.912) | (0.689) | (0.832) | |
| x:ClosedList | 2.945 | 2.852 | 3.076 | 0.598 | 2.870 | 
| (2.190) | (2.287) | (2.287) | (1.250) | (1.910) | |
| Constant | 18.193*** | 16.530*** | 16.577*** | 16.281*** | 16.382*** | 
| (0.823) | (0.636) | (0.648) | (0.503) | (0.587) | |
| Observations | 388 | 1,302 | 1,302 | 1,302 | 1,302 | 
| R2 | 0.005 | 0.002 | 0.002 | 0.001 | 0.003 | 
| Adjusted R2 | -0.003 | -0.001 | -0.001 | -0.002 | 0.001 | 
| Residual Std. Error | 9.912 (df = 384) | 10.352 (df = 1298) | 10.351 (df = 1298) | 10.358 (df = 1298) | 10.346 (df = 1298) | 
| F Statistic | 0.663 (df = 3; 384) | 0.687 (df = 3; 1298) | 0.768 (df = 3; 1298) | 0.254 (df = 3; 1298) | 1.241 (df = 3; 1298) | 
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||||
Mice (Multiple Imputation for missing data using Chained Equations)
We can go all in on this approach and simulate x-s, store them in
several data frames, run separate models on each of them before we
average over the coefficients. This is what mice does.
Basic workflow
The mice() function draws on chained equations. It predicts x1 using x2 and y, then x2 using x1 (including the predicted values) and y, etc. It goes around in circles until the predictions stabilize. At each iteration, it draws randomly from a normal distribution and combines the draw with the predictions to add a bit of noise to the mix. In the end, it produces several data sets for us in which it imputes slightly different values.
Now, we can run our main model on each of those data sets, average over the regression coefficients, and calculate the uncertainty for each of them using both the standard errors from the individual models and the variation in the coefficients between models.
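This pooling step follows Rubin's rules: the total variance of a coefficient is the average within-model variance plus the between-model variance, inflated by a factor 1 + 1/m. A small sketch with made-up numbers (not estimates from the MEP models) to show the arithmetic:

```r
#Hypothetical coefficients and standard errors from m = 3 imputed data sets
b  <- c(2.1, 1.9, 2.3)
se <- c(0.50, 0.55, 0.45)
m  <- length(b)

#Pooled point estimate: the mean of the coefficients
b.pool <- mean(b)

#Within-imputation variance: the average squared standard error
W <- mean(se^2)
#Between-imputation variance: how much the coefficients vary across data sets
B <- var(b)

#Rubin's rule: total variance and pooled standard error
T.total <- W + (1 + 1/m) * B
se.pool <- sqrt(T.total)
```

This is the calculation that pool() automates for us below.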
Step one: prepare the data and strategy
Mice requires a data set with predictors that will all be imputed/used for imputation. If I don’t specify otherwise, all covariates in the data are used to run around in circles. That’s why it’s useful to subset the data to the most relevant predictors.
Mice requires us to make explicit the measurement level of each covariate, so make sure to transform binary and categorical variables to factors.
library(mice)
df0 <- df %>% 
  #Quadratic term for age
  mutate(Age2 = Age*Age) %>%
  #Select variables
  dplyr::select(y, x, ClosedList,
                EPGroup, Age, Age2) %>%
  #Recode to factor
  mutate(EPGroup = as.factor(EPGroup))
Mice can use many types of models to predict. Here, I choose a linear model and ask for 10 data sets.
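The chunk that produced the iteration log below is not echoed; a sketch of what the call might look like, assuming the result is stored as df0.imp (the object name used later in this session):

```r
library(mice)

#Impute with a linear ("norm") imputation model and ask for 10 data sets;
#mice runs 5 chained-equation iterations by default
df0.imp <- mice(df0,
                m = 10,
                method = "norm")
```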
## 
##  iter imp variable
##   1   1  y  x
##   1   2  y  x
##   1   3  y  x
##   1   4  y  x
##   1   5  y  x
##   1   6  y  x
##   1   7  y  x
##   1   8  y  x
##   1   9  y  x
##   1   10  y  x
##   2   1  y  x
##   2   2  y  x
##   2   3  y  x
##   2   4  y  x
##   2   5  y  x
##   2   6  y  x
##   2   7  y  x
##   2   8  y  x
##   2   9  y  x
##   2   10  y  x
##   3   1  y  x
##   3   2  y  x
##   3   3  y  x
##   3   4  y  x
##   3   5  y  x
##   3   6  y  x
##   3   7  y  x
##   3   8  y  x
##   3   9  y  x
##   3   10  y  x
##   4   1  y  x
##   4   2  y  x
##   4   3  y  x
##   4   4  y  x
##   4   5  y  x
##   4   6  y  x
##   4   7  y  x
##   4   8  y  x
##   4   9  y  x
##   4   10  y  x
##   5   1  y  x
##   5   2  y  x
##   5   3  y  x
##   5   4  y  x
##   5   5  y  x
##   5   6  y  x
##   5   7  y  x
##   5   8  y  x
##   5   9  y  x
##   5   10  y  x
The result is a list containing the main data frame
(data) as well as 10 lists with imputed values.
##  [1] "data"            "imp"             "m"               "where"          
##  [5] "blocks"          "call"            "nmis"            "method"         
##  [9] "predictorMatrix" "visitSequence"   "formulas"        "post"           
## [13] "blots"           "ignore"          "seed"            "iteration"      
## [17] "lastSeedValue"   "chainMean"       "chainVar"        "loggedEvents"   
## [21] "version"         "date"Here is my imputed x, for example.
Step 2: Run your main model.
Now that we have the data, we can run the main model on each of them.
I could do it by hand, by subsetting the data. In that case, I use the
complete() function to fill in the missing information for
me. Its action= argument specifies which imputed data set I
run my model on.
## 
## Call:
## lm(formula = y ~ x * ClosedList, data = complete(df0.imp, factor = 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.876  -7.649  -1.346   6.466  73.779 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   16.6104     0.4916  33.785   <2e-16 ***
## x             -0.5988     0.8638  -0.693    0.488    
## ClosedList    -0.1448     0.6715  -0.216    0.829    
## x:ClosedList   1.0452     1.1920   0.877    0.381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.42 on 1417 degrees of freedom
## Multiple R-squared:  0.0006275,  Adjusted R-squared:  -0.001488 
## F-statistic: 0.2966 on 3 and 1417 DF,  p-value: 0.8279
Another way is to do this in a loop. Here, I use the
with() function to apply the same lm()
function to the data.
The result is a list with many model objects. The model objects are
reported in analyses.
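The estimation step behind the output below can be sketched as follows, assuming the mids object df0.imp:

```r
#with() re-runs the same lm() on each of the 10 completed data sets
mod <- with(df0.imp,
            lm(y ~ x * ClosedList))

#The 10 fitted model objects are stored in the analyses element
length(mod$analyses)
```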
## [1] "call"     "call1"    "nmis"     "analyses"
I can aggregate the results using another function, the
pool() function. It does two things for me. It calculates
the mean regression coefficient among all 10 coefficients. It also
calculates a standard error for me: It is a mix between the standard
error of each model coefficient and the difference between the different
coefficients across models.
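Pooling is then a single call; a sketch, assuming mod holds the list of models estimated with with():

```r
#Combine the 10 sets of estimates using Rubin's rules
res <- pool(mod)

#summary() reports pooled coefficients, standard errors and p-values
summary(res)
```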
To present the results in stargazer, I will have to do a bit of wrangling. I add the model object, but then I specify the coefficients and the standard errors separately. Just as we did with the robust standard errors.
Here, I add both the baseline model and the newly estimated one.
coefs <- list(summary(mod.base)$coefficients[,1],
              res$estimate)
ses <- list(summary(mod.base)$coefficients[,2],
           res$std.error)
stargazer(mod.base, mod$analyses[[1]],
          coef = coefs,
          se = ses,
          title = "Initial results from mice",
          type = "html")
| Dependent variable: ||
| y | ||
| (1) | (2) | |
| x | -0.865 | -0.200 | 
| (1.464) | (1.074) | |
| ClosedList | -0.990 | -0.285 | 
| (1.215) | (0.761) | |
| x:ClosedList | 2.945 | 1.786 | 
| (2.190) | (1.592) | |
| Constant | 18.193*** | 16.304*** | 
| (0.823) | (0.556) | |
| Observations | 388 | 1,421 | 
| R2 | 0.005 | 0.001 | 
| Adjusted R2 | -0.003 | -0.001 | 
| Residual Std. Error | 9.912 (df = 384) | 10.423 (df = 1417) | 
| F Statistic | 0.663 (df = 3; 384) | 0.297 (df = 3; 1417) | 
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |
I’m a bit underwhelmed by this prediction. Let’s do the predictions with more zeal.
Choice of imputation models and predictors
It is usually best to choose separate models for each variable we want to impute. In the previous example, both x and y were imputed using a linear model and the same predictors. This might not be optimal.
I begin by creating a matrix where I specify which variables I want to impute NAs to/from. Here, I flag that y should not be imputed. I then tell the model where the NAs are. Since I’m now left with only x to impute, I opt for a binomial logistic model.
dfNA <- is.na(df0)
#Remove the dependent variable
dfNA[,1] = FALSE
#Recode to categorical variable
df0$x <- as.factor(df0$x)
df1.imp <- mice(df0,
                m = 10,
                #logistic model for the binary x
                method = "logreg",
                #Where are the NAs
                where = dfNA,
                printFlag = F
)
mod1 <- with(df1.imp, 
             lm(y ~ 
                  x
                * ClosedList)
)
res1 <- pool(mod1)
coefs <- list(summary(mod.base)$coefficients[,1],
              res$estimate,
              res1$estimate)
ses <- list(summary(mod.base)$coefficients[,2],
           res$std.error,
           res1$std.error)
stargazer(mod.base, mod$analyses[[1]],mod1$analyses[[1]],
          coef = coefs,
          se = ses,
          title = "Initial results from mice",
          type = "html")
| Dependent variable: |||
| y | |||
| (1) | (2) | (3) | |
| x | -0.865 | -0.200 | |
| (1.464) | (1.074) | ||
| x1 | 0.465 | ||
| (0.921) | |||
| ClosedList | -0.990 | -0.285 | -0.306 | 
| (1.215) | (0.761) | (0.690) | |
| x:ClosedList | 2.945 | 1.786 | |
| (2.190) | (1.592) | ||
| x1:ClosedList | 1.860 | ||
| (1.243) | |||
| Constant | 18.193*** | 16.304*** | 16.132*** | 
| (0.823) | (0.556) | (0.492) | |
| Observations | 388 | 1,421 | 1,302 | 
| R2 | 0.005 | 0.001 | 0.006 | 
| Adjusted R2 | -0.003 | -0.001 | 0.004 | 
| Residual Std. Error | 9.912 (df = 384) | 10.423 (df = 1417) | 10.327 (df = 1298) | 
| F Statistic | 0.663 (df = 3; 384) | 0.297 (df = 3; 1417) | 2.794** (df = 3; 1298) | 
| Note: | *p<0.1; **p<0.05; ***p<0.01 | ||
Imputing several variables with different covariates
We can do several imputations at the same time.
I begin by creating “blocks” of covariates. Each block corresponds to a covariate I want to impute. I then create a matrix with dummies for the covariates I want to include in the imputation.
#All NAs
dfNA <- is.na(df0)
#Only two first variables are missing/to be imputed
dfNA[, 3:5] = FALSE
#Two blocks
blocks <- make.blocks(c( "y", "x"))
#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
                                       blocks = blocks)
predictorMatrix
##   y x ClosedList EPGroup Age Age2
## y 0 1          1       1   1    1
## x 1 0          1       1   1    1
df2.imp <- mice(df0,
                m = 10,
                #norm for the continuous y, logreg for the binary x
                method = c("norm", "logreg"),
                #Blocks of variables
                blocks = blocks,
                #Data frame specifying where imputations should be made
                where = dfNA,
                #Matrix flagging the predictors
                predictorMatrix = predictorMatrix,
                printFlag = F
)
mod2 <- with(df2.imp, 
            lm(y ~ 
                  x * ClosedList
)
            )
res2 <- pool(mod2) %>%
  summary
res2
| Dependent variable: ||||
| y | ||||
| (1) | (2) | (3) | (4) | |
| x | -0.865 | -0.200 | ||
| (1.464) | (1.074) | |||
| x1 | 0.465 | -0.023 | ||
| (0.921) | (1.371) | |||
| ClosedList | -0.990 | -0.285 | -0.306 | -0.419 | 
| (1.215) | (0.761) | (0.690) | (0.722) | |
| x:ClosedList | 2.945 | 1.786 | ||
| (2.190) | (1.592) | |||
| x1:ClosedList | 1.860 | 1.659 | ||
| (1.243) | (1.453) | |||
| Constant | 18.193*** | 16.304*** | 16.132*** | 16.314*** | 
| (0.823) | (0.556) | (0.492) | (0.615) | |
| Observations | 388 | 1,421 | 1,302 | 1,421 | 
| R2 | 0.005 | 0.001 | 0.006 | 0.004 | 
| Adjusted R2 | -0.003 | -0.001 | 0.004 | 0.001 | 
| Residual Std. Error | 9.912 (df = 384) | 10.423 (df = 1417) | 10.327 (df = 1298) | 10.390 (df = 1417) | 
| F Statistic | 0.663 (df = 3; 384) | 0.297 (df = 3; 1417) | 2.794** (df = 3; 1298) | 1.665 (df = 3; 1417) | 
| Note: | *p<0.1; **p<0.05; ***p<0.01 | |||
Display the results
We can use ggpredict to simulate scenarios for us. The most efficient way to do this is by writing a loop.
library(ggeffects)
#Create an empty object
predictions = NULL
#Loop from 1 to 10 (number of data sets/models)
for(i in 1:10){  
    m <- lm(y ~ 
             x * ClosedList,
              data = complete(df2.imp, action = i))
    
  predictions[[i]] <-  ggpredict(m, 
                                 terms = list("x" = c(0,1),
                                              "ClosedList" = c(0,1))
  )
  }
The function pool_predictions() pools the simulated
results for us by drawing on all 10 models. The outcome is the data
frame we’re used to.
effs <-  pool_predictions(predictions)
effs %>%
  #Transform to regular data frame
  as.data.frame %>%
  #Recode to make variables substantive
  mutate(group = if_else(group == 1, "Party-centered", "Candidate-centered"),
         x = if_else(x == 1, "2. reelection-seeker", "1. not reelection-seeker")) %>%
  ggplot() +
  geom_point(aes(x = x,
                 y = predicted,
                 group = group),
             position = position_dodge(0.3)) +
  geom_errorbar(aes(x = x,
                 ymax = conf.high,
                 ymin = conf.low,
                 group = group),
             position = position_dodge(0.3)) +
  ylab("Number of meetings attended") +
  ggtitle("Effect of ambition on committee attendance")
Model assessment
Assessing whether we address missing observations appropriately is not easy. We don’t even know the extent of the problem! However, remember that ignoring the problem means you do a listwise exclusion. That’s a choice too.
When we impute values, the risk we are running is often overfitting: We pick variables that maximize our in-sample predictions, but then do a lousy job of predicting out of sample (i.e. our NAs).
In-sample predictions
You can do your usual cross-correlations between observed and predicted values for the imputed variable.
When using mice, there are a few additional functions. For continuous variables, we can plot the density (distribution) of the 10 simulated data (in red) against the observed density (blue).
The xyplot() function plots the imputed variable against whatever covariate you have used to impute, just to check whether the two overlap appropriately.
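Both checks ship with mice and take the mids object directly; a sketch, assuming the imputation object is df0.imp and Age was among the predictors:

```r
library(mice)

#Density of the imputed values (red) against the observed values (blue)
densityplot(df0.imp)

#Observed and imputed x plotted against Age, one panel per imputed data set
xyplot(df0.imp, x ~ Age | .imp)
```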
Out-of-sample predictions
A last possibility is to leave observations out before making the predictions/picking a model. You can then use the model to impute the artificial NAs and compare.
df0$x_miss <- df0$x
id <- sample(row.names(df0 %>% 
                         filter(!is.na(x))), 
             100)
df0$x_miss[row.names(df0) %in% id ] <- NA
#All NAs
dfNA <- is.na(df0)
#Only y and the artificial x_miss are to be imputed
dfNA[, 2:6] = FALSE
dfNA %>% head
##          y     x ClosedList EPGroup   Age  Age2 x_miss
## [1,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [2,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [3,] FALSE FALSE      FALSE   FALSE FALSE FALSE  FALSE
## [4,]  TRUE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [5,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [6,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
#Two blocks
blocks <- make.blocks(c( "y", "x_miss"))
#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
                                       blocks = blocks)
#Remove x as a predictor
predictorMatrix[,"x"] = 0
predictorMatrix
##        y x ClosedList EPGroup Age Age2 x_miss
## y      0 0          1       1   1    1      1
## x_miss 1 0          1       1   1    1      0
#Recode to factors
df0 <- 
  df0 %>%
  mutate(EPGroup = as.factor(EPGroup),
         x_miss = as.factor(x_miss))
df2.amp <- mice(df0,
                m = 10,
                #norm for the continuous y, logreg for the binary x_miss
                method = c("norm", "logreg"),
                #Blocks of variables
                blocks = blocks,
                #Data frame specifying where imputations should be made
                where = dfNA,
                #Matrix flagging the predictors
                predictorMatrix = predictorMatrix,
                printFlag = F
)
test <- mice::complete(df2.amp, action = 3, include = T)
table(test$x, test$x_miss)
##    
##       0   1
##   0 555   4
##   1   5 249
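From a cross-table like the one above, we can compute the share of values the imputation recovers correctly (the diagonal). A sketch using the counts printed above:

```r
#Cross-table of the true x (rows) against the re-imputed x_miss (columns)
tab <- matrix(c(555, 5, 4, 249), nrow = 2,
              dimnames = list(true = c("0", "1"), imputed = c("0", "1")))

#Share of observations on the diagonal, i.e. where imputed == true
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 3)  #About 0.989
```

If the completed data also contain the originally observed values, those match by construction and inflate this share; restricting the comparison to the 100 artificially removed observations gives a stricter check.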