Missing data

In this session, we will explore methods for handling missing data. We start out by exploring the variables: their distribution as well as the distribution of their missing observations. We then move on to the simpler ways of imputing values before we tackle multiple imputation and simulation.

New R functions:

  • from base: sample()
  • new from finalfit package: missing_plot(), missing_compare(), missing_pattern()
  • new from mice package: complete(), pool(); new from ggeffects package: pool_predictions()

We continue to use the data on Members of the European Parliament (MEPs). We are interested in whether re-election seeking MEPs spend more time in Parliament if they will compete in a system where the party selectorate can allocate “safe seats”. In other words, we have two comparison groups: We compare reelection seekers with non-reelection seekers and party-centered vs. candidate-centered systems through an interaction.

\(y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \times x_2\)

This time, the data includes several groupings; it is still time-series cross-sectional (unbalanced panel) data in which each MEP is observed every 6 months over a 5-year period. Thus, I have observations of individuals (ID) in up to 10 periods (Period). These individuals also have a nationality (Nationality) and a national party affiliation (ParlGovID).

Our main variables are Attendance (\(y\)), FutureInEP (\(x_1\)) and ClosedList (\(x_2\)).

I begin by loading in my data. You can find them online here: https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda.

load("MEP.rda")
df <- MEP

I also rename a few of my variables for ease.

df$x <- df$FutureInEP
df$y <- df$Attendance

Introduction

The “hitch” this time is that I use a survey response to measure an MEP’s intention to seek reelection (\(x_1\)). However, most MEPs did not complete the survey. My intention is to impute the answers. While we’re at it, I’ll also scout for missing values on other variables too, including my dependent variable (y, Attendance).

Univariate statistics

Dependent variable: Attendance

I approximate legislative investment by the number of committee meetings an MEP has attended during the semester (the period of observation).

df %>%
  ggplot +
  geom_histogram(aes(y)) +
  ggtitle("Number of committee meetings") +
  xlab("Number of committee meetings")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 119 rows containing non-finite outside the scale range
## (`stat_bin()`).

The variable is naturally bounded at 0 (i.e. no one can attend a negative number of meetings). However, for the purposes of this exercise I am mainly interested in the differences in means between MEPs with an expressed reelection intent and those without such a clear ambition. I could also make the bet that the conditional distribution (and my residuals) will be approximately normal once I have specified my full model. This is always smart to check.
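A minimal sketch of that check, once the full model (mod.base, estimated further below) is in hand:

#Quantile plot of residuals against the normal distribution;
#points hugging the line suggest approximately normal residuals
qqnorm(residuals(mod.base))
qqline(residuals(mod.base))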

Independent variable: Reelection intent

The survey asks where an MEP sees him/herself in 10 years. One of the possible answers is “European Parliament” (FutureInEP == 1).

df$future <- ifelse(df$x == 1, "reelection", "exit")
df %>%
  ggplot +
  geom_bar(aes(future)) +
  ggtitle("Re-election intent upon election",
          subtitle = "EPRG Survey distributed in 2009 and 2014")

As is clear from the barplot, most MEPs did not reply. I have a lot of missing data (71 %). To what extent is this a problem?

If the data is MCAR (missing completely at random), it would mainly be a problem of statistical power. However, if the data is systematically missing – and especially if the assignment mechanism for NA on \(x\) is also related to \(y\) – I should take action. If the data is MAR (missing at random), I might impute some values using observables. If it is MNAR (missing not at random), it’s a different beast; we don’t have any predictors and we cannot know what the source is beyond theory/political science reasoning. The best way to look for MAR is to correlate the missingness with the other variables in the model, as well as whatever other variables we might have.
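To fix ideas, here is a toy simulation (not the MEP data) of a MAR mechanism: the probability that a value goes missing depends on an observed variable z, never on the unobserved value itself.

set.seed(1)
z <- rbinom(500, 1, 0.5)      #observed covariate
x.toy <- rnorm(500)           #variable that will get holes
#more NAs when z == 1: missingness depends on z only
x.toy[runif(500) < ifelse(z == 1, 0.5, 0.1)] <- NA
#compare the NA rates across the groups of z
mean(is.na(x.toy[z == 1])); mean(is.na(x.toy[z == 0]))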

Explore the missing observations

There is no way to know if the missing data is truly random. However, a pretty sure giveaway that the missing observations are non-random is if a predictor’s NAs covary with either i) NAs in the other variables in the model (or other variables we may have) or ii) the values of other variables.

For ease, let’s begin by subsetting the data to the covariates we are interested in. We can add variables to this code as we go.

vars <- c("y", "x", "ClosedList")

df0 <- df %>%
  dplyr::select(all_of(vars))

Which variables contain missing values?

Now, let’s see where I have missing observations. The package finalfit has several convenient functions for exploring patterns in the missing observations. missing_plot() organizes the data row-by-row on the x-axis and reports the variables on the y-axis. It draws light blue bars for all observations that are missing.

library(finalfit)
missing_plot(df0)

We can see that I have some missing observations on my dependent variable “Attendance” (\(y\)) as well as a bunch on my predictor “FutureInEP” (\(x\)). By contrast, we have no missing observations in the indicator for party-centered systems (“ClosedList”).

Unique combinations of NAs

Another function, missing_pattern(), builds on the mice package and allows us to explore the patterns of missing observations.

library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
miss_pat <- missing_pattern(df0)

miss_pat
##     ClosedList   y    x     
## 388          1   1    1    0
## 914          1   1    0    1
## 30           1   0    1    1
## 89           1   0    0    2
##              0 119 1003 1122

We have four combinations of missing observations in the data:

  • no missing (N = 388)
  • missing only on the x (N = 914)
  • missing only on the y (N = 30)
  • missing on both x and y (N = 89)

Pairs of correlations

We should also check for systematic correlations between the missing and observed data across different variables. The missing_compare() function does a simple cross-tabulation of missing values in the variable of your choice (e.g. \(x\)) and the values of other predictors. It then tests the statistical significance of the relationship. When the predictors are categorical, it performs a \(\chi^2\) test.

missing_compare(df0,
                dependent = "x",
                explanatory = c("y", "ClosedList"))

Here, I compare missingness in x with the values of y and ClosedList. It turns out that there is a relationship between non-response to the survey (NAs on x) and party-centered systems (ClosedList). The NAs on attendance (y), by contrast, are not an effect of the survey question, but another glitch in the data. We can check with a few simple correlation tests; just to be sure.

#NA in reelection intent (x) vs. party-centered systems
cor.test(as.numeric(is.na(df$x)), df$ClosedList)
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(is.na(df$x)) and df$ClosedList
## t = 3.7918, df = 1419, p-value = 0.0001558
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04840475 0.15136782
## sample estimates:
##       cor 
## 0.1001544
#NA in attendance (y) vs. reelection intent (x)
cor.test(as.numeric(is.na(df$y)), df$x)
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(is.na(df$y)) and df$x
## t = 0.65162, df = 416, p-value = 0.515
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.06417985  0.12745652
## sample estimates:
##       cor 
## 0.0319318

Missing on my dependent variable is not correlated with MEPs’ ambition (the value of x). However, the opposite is not true:

#NA in ambition vs. attendance
cor.test(as.numeric(is.na(df$x)), df$y)
## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(is.na(df$x)) and df$y
## t = -3.2667, df = 1300, p-value = 0.001117
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14385311 -0.03608115
## sample estimates:
##         cor 
## -0.09023128

It seems like we have two problems: Missing on y and missing on x. Let’s ignore the missing on y for the time being.

Conditional NAs

A quick way to check if missing observations are conditional on other predictors is to regress the NAs on the rest of the predictors.

I begin by recoding my variable into 0 (not missing) and 1 (missing).

df0 <-
  df0 %>%
  mutate(x_na = if_else(is.na(x), 1, 0))
mod.na <- glm(x_na ~ ClosedList,
              family = "binomial",
              df0)
summary(mod.na)
## 
## Call:
## glm(formula = x_na ~ ClosedList, family = "binomial", data = df0)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.64768    0.08228   7.872  3.5e-15 ***
## ClosedList   0.44051    0.11703   3.764 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1721.8  on 1420  degrees of freedom
## Residual deviance: 1707.5  on 1419  degrees of freedom
## AIC: 1711.5
## 
## Number of Fisher Scoring iterations: 4

Ok, so the non-responses to the survey are systematically correlated with the electoral system. Here, I find that the odds of a missing observation on x increase by about 55 % when we go from candidate-centered to party-centered systems.
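The number comes from exponentiating the logit coefficient; a quick check:

#Percentage change in the odds of non-response: exp(beta) - 1
exp(coef(mod.na)["ClosedList"]) - 1
## ClosedList 
##  0.5535004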

If we assume the data is MAR and this is the only covariate for missing values, then conditioning on the electoral system in the main model will ideally address the bias induced by the missing observations. We can also use the predictors in the model to impute non-responses. Ward and Ahlquist (2018) even argue that we can use the outcome variable (y) to impute values on x. Their reasoning is that we have a (theoretical) relationship between the two, and that this is reflected in their correlation. The correlation is symmetric; if we can predict y using x, then we can also use y to predict x.

Before we go that far, let’s consider some low-stakes options.

Quick fixes

Removing data

My ideal model looks like this:

mod.base <- lm(y ~
                 x * ClosedList,
               df0)
summary(mod.base)
## 
## Call:
## lm(formula = y ~ x * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.283  -8.193  -1.198   6.797  30.807 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.1931     0.8231  22.103   <2e-16 ***
## x             -0.8647     1.4642  -0.591    0.555    
## ClosedList    -0.9899     1.2150  -0.815    0.416    
## x:ClosedList   2.9445     2.1900   1.345    0.180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.912 on 384 degrees of freedom
##   (1033 observations deleted due to missingness)
## Multiple R-squared:  0.005149,   Adjusted R-squared:  -0.002623 
## F-statistic: 0.6625 on 3 and 384 DF,  p-value: 0.5755

From the baseline model, we can see that in party-centered systems, MEPs who already know that they will seek reelection attend on average about two more committee meetings than colleagues without such ambition (\(-0.86 + 2.94 \approx 2.1\)); none of the estimates is statistically significant, though.

If I do nothing more at the estimation stage, R will proceed to listwise exclusion of all observations with missing values on any of the covariates included in the model.

mod.base$residuals %>% length
## [1] 388

Here, we see that the model retains only 388 observations, while my data frame contains 1,421.

Creating weights

Sometimes, the main problem is just over/under-representation of some groups of observations. If so, we can include weights in the regression to compensate for the overrepresentation.

df %>%
  group_by(Nationality) %>%
  summarize(prop = mean(is.na(x))) %>%
  dplyr::select(Nationality, prop) %>%
  ggplot +
  geom_bar(aes(x = prop,
               y = Nationality),
           stat = "identity") +
  ggtitle("Proportion of non-responses by nationality") +
  xlab("")
Distribution of non-responses on the EPRG survey

From the figure, we see that there is some variation in response rates across nationalities. If that is all, and we don’t think the same mechanism will affect the values taken on by my \(y\), then we can simply weight the observations according to their probability of appearing in the survey. E.g. we can see that Hungarian and Spanish MEPs have particularly high non-response rates (so they are underrepresented among respondents), while Denmark has a comparatively high response rate.

I begin by creating an indicator variable that flags MEPs who did not respond to the survey.

df$future_na <- df$FutureInEP %>% is.na %>% as.numeric

I then fit a model to predict the occurrence of non-response.

mod.weight <- glm(future_na ~ Nationality, 
                  family = "binomial",
                  df)
summary(mod.weight)
## 
## Call:
## glm(formula = future_na ~ Nationality, family = "binomial", data = df)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)   
## (Intercept)                1.34993    0.42414   3.183  0.00146 **
## NationalityBelgium        -0.38053    0.55253  -0.689  0.49101   
## NationalityBulgaria       -0.94446    0.54676  -1.727  0.08410 . 
## NationalityCroatia        -0.94446    0.77237  -1.223  0.22140   
## NationalityCyprus         -1.01345    0.72302  -1.402  0.16100   
## NationalityCzech Republic -1.19578    0.53204  -2.248  0.02461 * 
## NationalityDenmark        -1.34993    0.58869  -2.293  0.02184 * 
## NationalityEstonia        -0.50263    0.80999  -0.621  0.53490   
## NationalityFinland        -0.71394    0.59146  -1.207  0.22740   
## NationalityFrance         -0.24026    0.47352  -0.507  0.61187   
## NationalityGermany        -0.34011    0.45458  -0.748  0.45435   
## NationalityGreece         -0.34662    0.55149  -0.629  0.52966   
## NationalityHungary         1.24034    0.73364   1.691  0.09090 . 
## NationalityIreland        -1.34993    0.63413  -2.129  0.03327 * 
## NationalityItaly          -0.56977    0.46156  -1.234  0.21704   
## NationalityLatvia         -0.89794    0.64316  -1.396  0.16268   
## NationalityLithuania      -0.30847    0.63670  -0.484  0.62804   
## NationalityLuxembourg     -0.65678    0.74491  -0.882  0.37795   
## NationalityMalta          -1.01345    0.72302  -1.402  0.16100   
## NationalityNetherlands    -0.50263    0.52453  -0.958  0.33794   
## NationalityPoland         -0.50263    0.47698  -1.054  0.29199   
## NationalityPortugal       -0.34662    0.55149  -0.629  0.52966   
## NationalityRomania        -0.85745    0.50312  -1.704  0.08833 . 
## NationalitySlovakia       -1.34993    0.57770  -2.337  0.01945 * 
## NationalitySlovenia       -1.34993    0.65566  -2.059  0.03951 * 
## NationalitySpain          -0.03425    0.48738  -0.070  0.94398   
## NationalitySweden         -0.45199    0.55483  -0.815  0.41528   
## NationalityUnited Kingdom -0.18300    0.46865  -0.390  0.69619   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1721.8  on 1420  degrees of freedom
## Residual deviance: 1670.9  on 1393  degrees of freedom
## AIC: 1726.9
## 
## Number of Fisher Scoring iterations: 5
df$weight <- predict(mod.weight, df, "response")

In a weighted regression, lm() minimizes the weighted sum of squared residuals, \(\sum_i w_i e_i^2\): each observation counts more or less in the estimation according to the weight it is given.

mod.w <- lm(Attendance ~ 
              x * ClosedList,
            weight = 1/weight,
            df)

summary(mod.w)
## 
## Call:
## lm(formula = Attendance ~ x * ClosedList, data = df, weights = 1/weight)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.930  -9.556  -1.641   7.906  43.313 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   18.3730     0.8092  22.704   <2e-16 ***
## x             -0.9877     1.4306  -0.690    0.490    
## ClosedList    -1.1130     1.2439  -0.895    0.371    
## x:ClosedList   3.2636     2.2176   1.472    0.142    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.24 on 384 degrees of freedom
##   (1033 observations deleted due to missingness)
## Multiple R-squared:  0.005924,   Adjusted R-squared:  -0.001842 
## F-statistic: 0.7628 on 3 and 384 DF,  p-value: 0.5155

This is a common way to address some of the information biases we find in surveys such as the European Social Survey, World Values Survey etc.

stargazer(mod.base, mod.w,
          column.labels = c("Listwise excl", "Weights"),
          title = "Listwise exclusion and regression weights.",
          type = "html")

Simple imputations

“Imputing” missing values means that we fill in the gaps. The non-informative ways of filling in gaps give the observations some value without adding new information. This will attenuate the statistical correlation, but increase the N. If my covariate is a control variable, and I want to keep as many observations as possible (or reduce selection bias) for the main predictors in the model, these are low-energy fixes.

Mean imputation

I can simply replace my missing observations by the mean of the variable. Here, the “mean” is the proportion of MEPs that answer the survey question in the affirmative (x == 1).

df0 <- 
  df0 %>%
  mutate(x.bar = if_else(is.na(x), mean(x, na.rm = T), x))
df0$x.bar %>% hist

I can run my model with this new “imputed” variable. There is no new information in the model; however, my N is higher. It also means that the standard errors will be artificially low. I.e. I’ll be attenuating the estimated relationship while being overly certain about the result.

mod.m <- lm(y ~ 
              x.bar * ClosedList,
            df0)

summary(mod.m)
## 
## Call:
## lm(formula = y ~ x.bar * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.995  -7.617  -1.530   6.383  73.383 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.5296     0.6362  25.982   <2e-16 ***
## x.bar             -0.8445     1.5293  -0.552    0.581    
## ClosedList        -0.5423     0.9163  -0.592    0.554    
## x.bar:ClosedList   2.8522     2.2871   1.247    0.213    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.001586,   Adjusted R-squared:  -0.000722 
## F-statistic: 0.6871 on 3 and 1298 DF,  p-value: 0.5599

Indicator variables

We can add to this strategy an indicator variable that flags all the cases where I’ve been meddling with my observations. It’s – once again – a way to keep the entire data set. Now, I control for the imputation, so that the effect of the imputed variable reports the estimated effect of x on y using only the information from the observed values (plus some noise). The imputed values are subsumed in the indicator. Consider the similarity between the two coefficients for x and x.bar in models 1 and 3.

mod.ind <- lm(y ~ 
              + x_na *
              + x.bar * ClosedList,
              df0)
summary(mod.ind)
## 
## Call:
## lm(formula = y ~ +x_na * +x.bar * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.283  -7.398  -1.398   5.811  73.811 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            18.1931     0.8565  21.241  < 2e-16 ***
## x_na                   -2.5246     0.8743  -2.887  0.00395 ** 
## x.bar                  -0.8647     1.5236  -0.568  0.57042    
## ClosedList             -0.9899     1.2643  -0.783  0.43381    
## x_na:x.bar                  NA         NA      NA       NA    
## x_na:ClosedList         0.8581     1.2563   0.683  0.49469    
## x.bar:ClosedList        2.9445     2.2789   1.292  0.19655    
## x_na:x.bar:ClosedList       NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.31 on 1296 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.01056,    Adjusted R-squared:  0.006739 
## F-statistic: 2.765 on 5 and 1296 DF,  p-value: 0.01714
stargazer(mod.base, mod.m, mod.ind,
          title = "Low-stakes imputations",
          column.labels = c("listwise exclusion", "mean imp.", "mean imp. + indicator"),
          type = "html")
Low-stakes imputations
Dependent variable: y.
Models: (1) listwise exclusion, (2) mean imp., (3) mean imp. + indicator.

                       (1)        (2)        (3)
x                    -0.865
                     (1.464)
x_na                                        -2.525***
                                            (0.874)
x.bar                           -0.845      -0.865
                                (1.529)     (1.524)
ClosedList           -0.990     -0.542      -0.990
                     (1.215)    (0.916)     (1.264)
x:ClosedList          2.945
                     (2.190)
x_na:x.bar                                  (dropped)
x_na:ClosedList                              0.858
                                            (1.256)
x.bar:ClosedList                 2.852       2.945
                                (2.287)     (2.279)
x_na:x.bar:ClosedList                       (dropped)
Constant             18.193***  16.530***   18.193***
                     (0.823)    (0.636)     (0.857)
Observations          388        1,302       1,302
R2                    0.005      0.002       0.011
Adjusted R2          -0.003     -0.001       0.007
Residual Std. Error  9.912 (df=384)   10.352 (df=1298)  10.314 (df=1296)
F Statistic          0.663 (df=3;384) 0.687 (df=3;1298) 2.765** (df=5;1296)
Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod.base, mod.w, mod.m, mod.ind, #mod.sam,
          type = "html")
Dependent variable: y (models 1, 3, 4); Attendance (model 2).
Models: (1) listwise exclusion, (2) weights, (3) mean imp., (4) mean imp. + indicator.

                       (1)        (2)        (3)        (4)
x                    -0.865     -0.988
                     (1.464)    (1.431)
x_na                                                   -2.525***
                                                       (0.874)
x.bar                                      -0.845      -0.865
                                           (1.529)     (1.524)
ClosedList           -0.990     -1.113     -0.542      -0.990
                     (1.215)    (1.244)    (0.916)     (1.264)
x:ClosedList          2.945      3.264
                     (2.190)    (2.218)
x_na:x.bar                                             (dropped)
x_na:ClosedList                                         0.858
                                                       (1.256)
x.bar:ClosedList                            2.852       2.945
                                           (2.287)     (2.279)
x_na:x.bar:ClosedList                                  (dropped)
Constant             18.193***  18.373***  16.530***   18.193***
                     (0.823)    (0.809)    (0.636)     (0.857)
Observations          388        388        1,302       1,302
R2                    0.005      0.006      0.002       0.011
Adjusted R2          -0.003     -0.002     -0.001       0.007
Residual Std. Error  9.912 (df=384)   12.243 (df=384)  10.352 (df=1298)  10.314 (df=1296)
F Statistic          0.663 (df=3;384) 0.763 (df=3;384) 0.687 (df=3;1298) 2.765** (df=5;1296)
Note: *p<0.1; **p<0.05; ***p<0.01

Inserting information into the model

Until now, we haven’t inserted any new information about the missing values. We can do that using imputations; either by drawing on i) the other covariates or on ii) new variables. There are grounds to say that we can draw on the covariates already in the model: if they are there because they covary (with each other and with y), then the information is already in the data; we just need to crystallize it.

Moving forward, we can either replace missing values by group means, or go all the way using regression predictions.

Group means

We can replace missing values of x by the mean within groups of a predictor, like ClosedList.

df0 <- 
  df0 %>%
  group_by(ClosedList) %>%
  mutate(x.bar.g = if_else(is.na(x), mean(x, na.rm = T), x))
mod.g <- lm(y ~ 
              x.bar.g * ClosedList,
            df0)

summary(mod.g)
## 
## Call:
## lm(formula = y ~ x.bar.g * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.096  -7.607  -1.577   6.393  73.393 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         16.5765     0.6479  25.586   <2e-16 ***
## x.bar.g             -0.9627     1.5288  -0.630    0.529    
## ClosedList          -0.5936     0.9116  -0.651    0.515    
## x.bar.g:ClosedList   3.0758     2.2869   1.345    0.179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.001771,   Adjusted R-squared:  -0.000536 
## F-statistic: 0.7677 on 3 and 1298 DF,  p-value: 0.5122

Let’s instead use an entirely new covariate.

In fact, let’s add several predictors to the data set we use. My candidate predictors are the legislator’s Nationality, EPGroup (parliamentary group), EP (legislative period/survey round) and Age when they responded to the survey.

vars <- c("y", "x", "ClosedList",
          "EPGroup", "EP", "ID",
          "Nationality", "Age")

df0 <- df %>%
  dplyr::select(all_of(vars))

Usually, I’d check the patterns of missing data once again (i.e. are there missing observations on the new covariates?).

df0 <- 
  df0 %>%
  group_by(Nationality) %>%
  mutate(x.bar.g2 = if_else(is.na(x), mean(x, na.rm = T), x)) %>%
  #make sure to ungroup (useful for some of the sampling functions)
  ungroup()
mod.g2 <- lm(y ~ 
              x.bar.g2 * ClosedList,
            df0)

summary(mod.g2)
## 
## Call:
## lm(formula = y ~ x.bar.g2 * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.939  -7.712  -1.308   6.383  73.404 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          16.7148     0.6179  27.050   <2e-16 ***
## x.bar.g2             -1.4242     1.4422  -0.987    0.324    
## ClosedList           -0.6270     0.8571  -0.732    0.465    
## x.bar.g2:ClosedList   3.2758     2.1325   1.536    0.125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.002097,   Adjusted R-squared:  -0.0002096 
## F-statistic: 0.9091 on 3 and 1298 DF,  p-value: 0.4359
stargazer(mod.base,  mod.g, mod.g2,
          column.labels = c("listwise exclusion", "group mean/ClosedList", "group mean/Nationality"),
          type = "html")
Dependent variable: y.
Models: (1) listwise exclusion, (2) group mean/ClosedList, (3) group mean/Nationality.

                       (1)        (2)        (3)
x                    -0.865
                     (1.464)
x.bar.g                         -0.963
                                (1.529)
x.bar.g2                                    -1.424
                                            (1.442)
ClosedList           -0.990     -0.594      -0.627
                     (1.215)    (0.912)     (0.857)
x:ClosedList          2.945
                     (2.190)
x.bar.g:ClosedList               3.076
                                (2.287)
x.bar.g2:ClosedList                          3.276
                                            (2.133)
Constant             18.193***  16.577***   16.715***
                     (0.823)    (0.648)     (0.618)
Observations          388        1,302       1,302
R2                    0.005      0.002       0.002
Adjusted R2          -0.003     -0.001      -0.0002
Residual Std. Error  9.912 (df=384)   10.351 (df=1298)  10.350 (df=1298)
F Statistic          0.663 (df=3;384) 0.768 (df=3;1298) 0.909 (df=3;1298)
Note: *p<0.1; **p<0.05; ***p<0.01

Regression

From group means, there is only a tiny step to regression predictions. Since my x is binary, I run a binomial logistic regression to get predicted probabilities that I can later use for imputation. I regress x on Age.

mod.imp.age <- glm(x ~ Age + I(Age*Age),
                   family = "binomial",
                   df0)

summary(mod.imp.age)
## 
## Call:
## glm(formula = x ~ Age + I(Age * Age), family = "binomial", data = df0)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.424710   2.654323  -3.551 0.000384 ***
## Age           0.411206   0.109819   3.744 0.000181 ***
## I(Age * Age) -0.004549   0.001109  -4.100 4.13e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 519.82  on 417  degrees of freedom
## Residual deviance: 480.01  on 415  degrees of freedom
##   (1003 observations deleted due to missingness)
## AIC: 486.01
## 
## Number of Fisher Scoring iterations: 5

At this point, I’ll prioritize predictors that make sense theoretically and that correlate strongly with x. It makes sense that MEPs above a certain age will not plan to stay in office for another 10 years, so age is a substantively good predictor. I add a quadratic term because I think young MEPs are increasingly likely to aim for an EU career (as they lock themselves into a career path); then – as they approach retirement – I expect the opposite.
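To see what the quadratic term implies, we can trace the predicted probability over a grid of ages; a small sketch (the age range is illustrative):

#Predicted probability of reelection intent across ages
ages <- data.frame(Age = 25:75)
plot(ages$Age, predict(mod.imp.age, ages, type = "response"),
     type = "l", xlab = "Age", ylab = "Pr(reelection intent)")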

df0$pred.age <- predict(mod.imp.age, df0, type = "response")
df0 <-
  df0 %>%
  mutate(x.age = if_else(is.na(x), pred.age, x))
mod.age <- lm(y ~ 
             + x.age * ClosedList,
            df0)

summary(mod.age)
## 
## Call:
## lm(formula = y ~ +x.age * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.785  -7.608  -1.412   6.392  73.815 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.6080     0.6144  27.029   <2e-16 ***
## x.age             -1.0810     1.4197  -0.761    0.447    
## ClosedList        -0.5307     0.8714  -0.609    0.543    
## x.age:ClosedList   2.7889     2.0809   1.340    0.180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.001694,   Adjusted R-squared:  -0.0006132 
## F-statistic: 0.7343 on 3 and 1298 DF,  p-value: 0.5316

Random effects and fixed effects

A lot of (unobserved) information is often contained in group variables.

Fixed effects

mod.imp.epgr <- glm(x ~ Age + I(Age*Age)
                    +  EPGroup,
                   family = "binomial",
                   df0)

summary(mod.imp.epgr)
## 
## Call:
## glm(formula = x ~ Age + I(Age * Age) + EPGroup, family = "binomial", 
##     data = df0)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.718932   2.887349  -3.020 0.002530 ** 
## Age                 0.373340   0.116681   3.200 0.001376 ** 
## I(Age * Age)       -0.004263   0.001169  -3.648 0.000264 ***
## EPGroupECR          0.102686   0.529200   0.194 0.846145    
## EPGroupEFD        -14.791011 759.441910  -0.019 0.984461    
## EPGroupEFDD        -1.470950   1.117701  -1.316 0.188157    
## EPGroupEPP          0.946348   0.396894   2.384 0.017108 *  
## EPGroupGreens/EFA  -1.282898   0.828435  -1.549 0.121483    
## EPGroupGUE/NGL     -0.279946   0.671411  -0.417 0.676714    
## EPGroupNI           0.631098   0.644278   0.980 0.327312    
## EPGroupPES          0.648792   0.417939   1.552 0.120576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 519.82  on 417  degrees of freedom
## Residual deviance: 449.50  on 407  degrees of freedom
##   (1003 observations deleted due to missingness)
## AIC: 471.5
## 
## Number of Fisher Scoring iterations: 15
df0$pred.epgr <- predict(mod.imp.epgr, df0, type = "response")
df0 <-
  df0 %>%
  mutate(x.gr = if_else(is.na(x), pred.epgr, x))
mod.gr <- lm(y ~ 
             + x.gr * ClosedList,
            df0)

summary(mod.gr)
## 
## Call:
## lm(formula = y ~ +x.gr * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.538  -7.636  -1.351   6.311  74.028 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.4342     0.5924  27.741   <2e-16 ***
## x.gr             -0.5450     1.3513  -0.403   0.6868    
## ClosedList       -0.7192     0.8393  -0.857   0.3917    
## x.gr:ClosedList   3.3679     1.9462   1.730   0.0838 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.003522,   Adjusted R-squared:  0.001219 
## F-statistic: 1.529 on 3 and 1298 DF,  p-value: 0.2052

Fixed effects are always a possibility, but remember: you run the risk of overfitting the data when you hunt around for covariates. Fixed effects are a good example: they predict really well in-group, but what about new groups? Or if there is a change?
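To illustrate the point: prediction from the fixed-effects model breaks down for a group level it has never seen (the group name below is hypothetical).

newgrp <- data.frame(Age = 45, EPGroup = "NewGroup")
#predict(mod.imp.epgr, newgrp, type = "response")
#would throw an error: EPGroup has a level absent from the estimation data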

Random effects

Random effects are more flexible. We can create new groups simply by drawing from the distribution of random effects (i.e. the random intercepts are assumed to follow a normal distribution with a given mean and standard deviation). We can also add level-2 covariates to improve the out-of-sample predictions.

library(lme4)
mod.imp.ran <- glmer(x ~ Age + I(Age*Age)
                    +  (1|EPGroup),
                   family = "binomial",
                   df0)
## Warning: Some predictor variables are on very different scales: consider
## rescaling
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.211989 (tol = 0.002, component 1)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
summary(mod.imp.ran)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: x ~ Age + I(Age * Age) + (1 | EPGroup)
##    Data: df0
## 
##      AIC      BIC   logLik deviance df.resid 
##    477.9    494.1   -235.0    469.9      414 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.1703 -0.6996 -0.4341  0.9599  6.5898 
## 
## Random effects:
##  Groups  Name        Variance Std.Dev.
##  EPGroup (Intercept) 0.337    0.5805  
## Number of obs: 418, groups:  EPGroup, 9
## 
## Fixed effects:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.5170989  2.1412475  -4.445 8.80e-06 ***
## Age           0.4044541  0.0869720   4.650 3.31e-06 ***
## I(Age * Age) -0.0045406  0.0008735  -5.198 2.01e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##            (Intr) Age   
## Age        -0.982       
## I(Age*Age)  0.953 -0.990
## fit warnings:
## Some predictor variables are on very different scales: consider rescaling
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.211989 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
##  - Rescale variables?
## Model is nearly unidentifiable: large eigenvalue ratio
##  - Rescale variables?
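The summary reports the estimated standard deviation of the group intercepts (0.58). That is the distribution we can draw from to invent an intercept for a new, unseen group; a minimal sketch:

#Draw a plausible intercept for an unseen group from the estimated
#random-effects distribution (sd taken from the fitted model)
sd.group <- attr(VarCorr(mod.imp.ran)$EPGroup, "stddev")
rnorm(1, mean = 0, sd = sd.group)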

To make predictions for groups that were absent from the estimation data, simply allow new levels in the prediction (new groups are then given the population-level intercept, i.e. the random effect is set to its mean of zero).

df0$pred.ran <- predict(mod.imp.ran, df0, type = "response", allow.new.levels = T)
df0 <-
  df0 %>%
  mutate(x.ran = if_else(is.na(x), pred.ran, x))
mod.ran <- lm(y ~ 
             + x.ran * ClosedList,
            df0)

summary(mod.ran)
## 
## Call:
## lm(formula = y ~ +x.ran * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.518  -7.613  -1.353   6.324  74.034 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.4343     0.6005  27.369   <2e-16 ***
## x.ran             -0.5441     1.3840  -0.393   0.6943    
## ClosedList        -0.7012     0.8508  -0.824   0.4100    
## x.ran:ClosedList   3.3294     2.0008   1.664   0.0963 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.34 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.003251,   Adjusted R-squared:  0.0009476 
## F-statistic: 1.411 on 3 and 1298 DF,  p-value: 0.2378
stargazer(mod.base,  mod.gr, mod.ran,
          column.labels = c("listwise exclusion", "fixeff", "raneff"),
          type = "html")
Dependent variable: y.
Models: (1) listwise exclusion, (2) fixeff, (3) raneff.

                       (1)        (2)        (3)
x                    -0.865
                     (1.464)
x.gr                            -0.545
                                (1.351)
x.ran                                       -0.544
                                            (1.384)
ClosedList           -0.990     -0.719      -0.701
                     (1.215)    (0.839)     (0.851)
x:ClosedList          2.945
                     (2.190)
x.gr:ClosedList                  3.368*
                                (1.946)
x.ran:ClosedList                             3.329*
                                            (2.001)
Constant             18.193***  16.434***   16.434***
                     (0.823)    (0.592)     (0.600)
Observations          388        1,302       1,302
R2                    0.005      0.004       0.003
Adjusted R2          -0.003      0.001       0.001
Residual Std. Error  9.912 (df=384)   10.342 (df=1298)  10.344 (df=1298)
F Statistic          0.663 (df=3;384) 1.529 (df=3;1298) 1.411 (df=3;1298)
Note: *p<0.1; **p<0.05; ***p<0.01

Bringing in the uncertainty

Until now, we’ve added information to the models from other variables to increase the N in the main model. There is quite some uncertainty in these predictions. How can we account for that? In fact, the worse the predictions, the more important it is to account for uncertainty. We can do this by using simulation, i.e. we add a bit of noise to our predictions.

Random sampling

Absent any new information, we can of course fill in the gaps using information from the model, but without any assumptions about a trend; that is, we can do random sampling. It has the advantage that it allows us to inflate the standard error, reflecting the uncertainty in the estimation of x.

I can use the sample() function to draw randomly from the observed part of my x.

sam <- sample(
  #Draw values that are not NA
  df0$x[!is.na(df0$x)],
  #Make as many draws as I have data
  size = nrow(df0),
  #For each draw, replace so that I can draw anew
  replace = T)

#Replace the missing observations by my draws

df0 <- 
  df0 %>%
  mutate(x.sam = if_else(is.na(x), sam, x))

Note that this random element means that I’ll get a new result every time I run my R code. Also, it won’t help me make much headway if I have a lot of missing observations and I am actually interested in the effect of this particular variable.
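If reproducibility matters, fix the random seed before sampling (any integer will do):

set.seed(20240101) #makes the sample() draws above reproducible across runs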

mod.sam <- lm(y ~ 
              x.sam * ClosedList,
            df0)

summary(mod.sam)
## 
## Call:
## lm(formula = y ~ x.sam * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.995  -7.449  -1.449   6.551  73.005 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16.28066    0.50301  32.366   <2e-16 ***
## x.sam            -0.05268    0.89938  -0.059    0.953    
## ClosedList        0.16861    0.68930   0.245    0.807    
## x.sam:ClosedList  0.59846    1.24983   0.479    0.632    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.0005861,  Adjusted R-squared:  -0.001724 
## F-statistic: 0.2537 on 3 and 1298 DF,  p-value: 0.8587

Sampling from a prediction

We can also sample from a prediction if we know the distribution of this prediction. One way is to assume a normal distribution of each prediction.

df0$Age2 <- df0$Age^2

##Missing values in x are related to the value of other xs
mod2.na <- glm(x ~ 
               + Age
               + Age2
               + EPGroup,
               family="binomial",
               df0
)
summary(mod2.na)
## 
## Call:
## glm(formula = x ~ +Age + Age2 + EPGroup, family = "binomial", 
##     data = df0)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -8.718932   2.887349  -3.020 0.002530 ** 
## Age                 0.373340   0.116681   3.200 0.001376 ** 
## Age2               -0.004263   0.001169  -3.648 0.000264 ***
## EPGroupECR          0.102686   0.529200   0.194 0.846145    
## EPGroupEFD        -14.791011 759.441910  -0.019 0.984461    
## EPGroupEFDD        -1.470950   1.117701  -1.316 0.188157    
## EPGroupEPP          0.946348   0.396894   2.384 0.017108 *  
## EPGroupGreens/EFA  -1.282898   0.828435  -1.549 0.121483    
## EPGroupGUE/NGL     -0.279946   0.671411  -0.417 0.676714    
## EPGroupNI           0.631098   0.644278   0.980 0.327312    
## EPGroupPES          0.648792   0.417939   1.552 0.120576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 519.82  on 417  degrees of freedom
## Residual deviance: 449.50  on 407  degrees of freedom
##   (1003 observations deleted due to missingness)
## AIC: 471.5
## 
## Number of Fisher Scoring iterations: 15

I ask for the standard error of the predictions in order to know the spread of the distribution.

preds <- predict(mod2.na, df0, "response", se.fit = T) 

df0$x.glm <- preds$fit
df0$x.se <- preds$se.fit

#Draw from a normal distribution with a mean and a standard deviation
sims <- rnorm(nrow(df0), df0$x.glm, df0$x.se)

#Replace the missing observations by my draws

df0 <- 
  df0 %>%
  mutate(x.sim = if_else(is.na(x), sims, x))
mod.sim <- lm(y ~ 
              x.sim * ClosedList,
            df0)

summary(mod.sim)
## 
## Call:
## lm(formula = y ~ x.sim * ClosedList, data = df0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.312  -7.534  -1.331   6.334  73.883 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       16.3818     0.5874  27.887   <2e-16 ***
## x.sim             -0.3754     1.3228  -0.284    0.777    
## ClosedList        -0.5640     0.8324  -0.678    0.498    
## x.sim:ClosedList   2.8701     1.9098   1.503    0.133    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 1298 degrees of freedom
##   (119 observations deleted due to missingness)
## Multiple R-squared:  0.00286,    Adjusted R-squared:  0.0005555 
## F-statistic: 1.241 on 3 and 1298 DF,  p-value: 0.2934
stargazer(mod.base,  mod.sam, mod.sim,
          column.labels = c("listwise exclusion", "random sample", "simulation"),
          type = "html")
Dependent variable: y.
Models: (1) listwise exclusion, (2) random sample, (3) simulation.

                       (1)        (2)        (3)
x                    -0.865
                     (1.464)
x.sam                           -0.053
                                (0.899)
x.sim                                       -0.375
                                            (1.323)
ClosedList           -0.990      0.169      -0.564
                     (1.215)    (0.689)     (0.832)
x:ClosedList          2.945
                     (2.190)
x.sam:ClosedList                 0.598
                                (1.250)
x.sim:ClosedList                             2.870
                                            (1.910)
Constant             18.193***  16.281***   16.382***
                     (0.823)    (0.503)     (0.587)
Observations          388        1,302       1,302
R2                    0.005      0.001       0.003
Adjusted R2          -0.003     -0.002       0.001
Residual Std. Error  9.912 (df=384)   10.358 (df=1298)  10.346 (df=1298)
F Statistic          0.663 (df=3;384) 0.254 (df=3;1298) 1.241 (df=3;1298)
Note: *p<0.1; **p<0.05; ***p<0.01

A side note: How to compare model coefficients

sjPlot allows us to compare coefficients from several models at the same time.

library(sjPlot)
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
plot_models(mod.base, mod.g, mod.sim)

Here, I replace the variable names for the regression coefficients so that I will get all coefficients placed right above each other.

replace_names <- function(x,
                          vars = c("x", "ClosedList", "x:ClosedList")){
  
  names(x$coefficients)[-1] <- vars
  return(x)
}

mod.m <- replace_names(mod.m)
mod.g <- replace_names(mod.g)
mod.sam <- replace_names(mod.sam)
mod.sim <- replace_names(mod.sim)
plot_models( mod.base,
             mod.m,
             mod.g,
             mod.sam,
             mod.sim,
             m.labels = c("Listwise exclusion",
                          "Mean imputation",
                          "Group-mean imputation",
                          "Random sampling",
                          "Simulation"),
             legend.title = "Model approaches",
             title = "Coeficients from four strategies of NA handling")

stargazer(mod.base, mod.m, mod.g, mod.sam, mod.sim,
          column.labels = c("Base-line", "Mean repl.", "Group repl.", "Random sample", "Imputation"),
          type = "html")
Dependent variable: y.
Models: (1) Base-line, (2) Mean repl., (3) Group repl., (4) Random sample, (5) Imputation.

                 (1)        (2)        (3)        (4)        (5)
x              -0.865     -0.845     -0.963     -0.053     -0.375
               (1.464)    (1.529)    (1.529)    (0.899)    (1.323)
ClosedList     -0.990     -0.542     -0.594      0.169     -0.564
               (1.215)    (0.916)    (0.912)    (0.689)    (0.832)
x:ClosedList    2.945      2.852      3.076      0.598      2.870
               (2.190)    (2.287)    (2.287)    (1.250)    (1.910)
Constant       18.193***  16.530***  16.577***  16.281***  16.382***
               (0.823)    (0.636)    (0.648)    (0.503)    (0.587)
Observations    388        1,302      1,302      1,302      1,302
R2              0.005      0.002      0.002      0.001      0.003
Adjusted R2    -0.003     -0.001     -0.001     -0.002      0.001
Residual Std. Error  9.912      10.352     10.351     10.358     10.346
                     (df=384)   (df=1298)  (df=1298)  (df=1298)  (df=1298)
F Statistic          0.663      0.687      0.768      0.254      1.241
                     (df=3;384) (df=3;1298) (df=3;1298) (df=3;1298) (df=3;1298)
Note: *p<0.1; **p<0.05; ***p<0.01

Mice (Multivariate Imputation by Chained Equations)

We can go all in on this approach: simulate values for x, store them in several data frames, and run separate models on each of them before we average over the coefficients. This is what mice does.

library(mice)

Basic workflow

The mice() function draws on chained equations. It predicts x1 using x2 and y, then x2 using x1 (including the predicted values) and y, etc. It goes around in circles until the predictions stabilize. For each iteration, it draws randomly from a normal distribution and multiplies it with the predictions to add a bit of noise to the mix. In the end it produces several data sets for us in which it imputes slightly different values.

Now, we can run our main model on each of those data sets, average over the regression coefficients and calculate the uncertainty for each of them using both the standard error from the model at hand, as well as the variation in the coefficients between models.

Step one: prepare the data and strategy

Mice requires a data set with predictors that will all be imputed/used for imputation. If I don’t specify otherwise, all covariates in the data are used in the chained equations. That’s why it’s useful to subset the data to the most relevant predictors.

Mice also requires us to make explicit the measurement level of each covariate, so make sure to transform binary and categorical variables to factors.

library(mice)

df0 <- df %>% 
  #Quadratic term for age
  mutate(Age2 = Age*Age) %>%
  #Select variables
  dplyr::select(y, x, ClosedList,
                EPGroup, Age, Age2) %>%
  #Recode to factor
  mutate(EPGroup = as.factor(EPGroup))

Mice can use many types of models to predict. Here, I choose a linear model (“norm”) and ask for 10 data sets.

df0.imp <- mice(df0,
                m = 10,
                #linear outcome
                method = "norm")
## 
##  iter imp variable
##   1   1  y  x
##   1   2  y  x
##   1   3  y  x
##   1   4  y  x
##   1   5  y  x
##   1   6  y  x
##   1   7  y  x
##   1   8  y  x
##   1   9  y  x
##   1   10  y  x
##   2   1  y  x
##   2   2  y  x
##   2   3  y  x
##   2   4  y  x
##   2   5  y  x
##   2   6  y  x
##   2   7  y  x
##   2   8  y  x
##   2   9  y  x
##   2   10  y  x
##   3   1  y  x
##   3   2  y  x
##   3   3  y  x
##   3   4  y  x
##   3   5  y  x
##   3   6  y  x
##   3   7  y  x
##   3   8  y  x
##   3   9  y  x
##   3   10  y  x
##   4   1  y  x
##   4   2  y  x
##   4   3  y  x
##   4   4  y  x
##   4   5  y  x
##   4   6  y  x
##   4   7  y  x
##   4   8  y  x
##   4   9  y  x
##   4   10  y  x
##   5   1  y  x
##   5   2  y  x
##   5   3  y  x
##   5   4  y  x
##   5   5  y  x
##   5   6  y  x
##   5   7  y  x
##   5   8  y  x
##   5   9  y  x
##   5   10  y  x

The result is a “mids” object: a list containing the main data frame (data) as well as the imputed values for each variable (imp).

df0.imp %>% names
##  [1] "data"            "imp"             "m"               "where"          
##  [5] "blocks"          "call"            "nmis"            "method"         
##  [9] "predictorMatrix" "visitSequence"   "formulas"        "post"           
## [13] "blots"           "ignore"          "seed"            "iteration"      
## [17] "lastSeedValue"   "chainMean"       "chainVar"        "loggedEvents"   
## [21] "version"         "date"

Here is my imputed x, for example.

df0.imp$imp$x %>% head

Step 2: Run your main model.

Now that we have the data, we can run the main model on each of them. I could do it by hand, by subsetting the data. In that case, I use the complete() function to fill in the missing information for me. The action= argument specifies the data set on which I run my model.

mod <- lm(y ~ x * ClosedList,
          data = complete(df0.imp, action = 1))
summary(mod)
## 
## Call:
## lm(formula = y ~ x * ClosedList, data = complete(df0.imp, action = 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.876  -7.649  -1.346   6.466  73.779 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   16.6104     0.4916  33.785   <2e-16 ***
## x             -0.5988     0.8638  -0.693    0.488    
## ClosedList    -0.1448     0.6715  -0.216    0.829    
## x:ClosedList   1.0452     1.1920   0.877    0.381    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.42 on 1417 degrees of freedom
## Multiple R-squared:  0.0006275,  Adjusted R-squared:  -0.001488 
## F-statistic: 0.2966 on 3 and 1417 DF,  p-value: 0.8279

Another way is to do this in one go. Here, I use the with() function to apply the same lm() call to each imputed data set.

mod <- with(df0.imp, 
            lm(y ~ 
                x
                 * ClosedList)
            )

The result is a list with many model objects. The model objects are reported in analyses.

names(mod)
## [1] "call"     "call1"    "nmis"     "analyses"

I can aggregate the results using another function, pool(). It does two things for me. It calculates the mean regression coefficient across all 10 models. It also calculates a standard error for me: a mix of the standard error of each model coefficient (within-imputation variance) and the variation in the coefficients across models (between-imputation variance).
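Under the hood, this follows Rubin’s rules. A minimal sketch of the same computation by hand, assuming mod$analyses holds the 10 fitted models:

b <- sapply(mod$analyses, coef)     #matrix of coefficients, one column per data set
se <- sapply(mod$analyses, function(m) sqrt(diag(vcov(m))))
rowMeans(b)                         #pooled point estimates
W <- rowMeans(se^2)                 #within-imputation variance
B <- apply(b, 1, var)               #between-imputation variance
sqrt(W + (1 + 1/10) * B)            #pooled standard errors; m = 10 imputations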

res <- pool(mod) %>%
  summary
res

To present the results in stargazer, I will have to do a bit of wrangling. I add the model object, but then I specify the coefficients and the standard errors separately. Just as we did with the robust standard errors.

Here, I add both the baseline model and the newly estimated one.

coefs <- list(summary(mod.base)$coefficients[,1],
              res$estimate)

ses <- list(summary(mod.base)$coefficients[,2],
           res$std.error)

stargazer(mod.base, mod$analyses[[1]],
          coef = coefs,
          se = ses,
          title = "Initial results from mice",
          type = "html")
Initial results from mice
Dependent variable: y.
Models: (1) listwise exclusion, (2) mice, pooled estimates.

                 (1)        (2)
x              -0.865     -0.200
               (1.464)    (1.074)
ClosedList     -0.990     -0.285
               (1.215)    (0.761)
x:ClosedList    2.945      1.786
               (2.190)    (1.592)
Constant       18.193***  16.304***
               (0.823)    (0.556)
Observations    388        1,421
R2              0.005      0.001
Adjusted R2    -0.003     -0.001
Residual Std. Error  9.912 (df=384)   10.423 (df=1417)
F Statistic          0.663 (df=3;384) 0.297 (df=3;1417)
Note: *p<0.1; **p<0.05; ***p<0.01

I’m a bit underwhelmed by this prediction. Let’s do the predictions with more zeal.

Choice of imputation models and predictors

It is usually best to choose separate models for each variable we want to impute. In the previous example both x and y were imputed using a linear model and the same predictors. This might not be optimal.

I begin by creating a matrix where I specify which variables I want to impute NAs to/from. Here, I remove the y from my set of predictors. I then tell the model where the NAs are. Since I’m now left with only x to impute, I opt for a binomial logistic model.

dfNA <- is.na(df0)
#Remove the dependent variable
dfNA[,1] = FALSE

#Recode to categorical variable
df0$x <- as.factor(df0$x)

df1.imp <- mice(df0,
                m = 10,
                #binomial logistic model for x
                method = "logreg",
                #Where are the NAs
                where = dfNA,
                printFlag = F
)

mod1 <- with(df1.imp, 
             lm(y ~ 
                  x
                * ClosedList)
)
res1 <- pool(mod1)
coefs <- list(summary(mod.base)$coefficients[,1],
              res$estimate,
              res1$estimate)

ses <- list(summary(mod.base)$coefficients[,2],
           res$std.error,
           res1$std.error)

stargazer(mod.base, mod$analyses[[1]],mod1$analyses[[1]],
          coef = coefs,
          se = ses,
          title = "Initial results from mice",
          type = "html")
Initial results from mice
Dependent variable: y.
Models: (1) listwise exclusion, (2) mice/norm, (3) mice/logreg.

                 (1)        (2)        (3)
x              -0.865     -0.200
               (1.464)    (1.074)
x1                                    0.465
                                     (0.921)
ClosedList     -0.990     -0.285     -0.306
               (1.215)    (0.761)    (0.690)
x:ClosedList    2.945      1.786
               (2.190)    (1.592)
x1:ClosedList                         1.860
                                     (1.243)
Constant       18.193***  16.304***  16.132***
               (0.823)    (0.556)    (0.492)
Observations    388        1,421      1,302
R2              0.005      0.001      0.006
Adjusted R2    -0.003     -0.001      0.004
Residual Std. Error  9.912 (df=384)   10.423 (df=1417)  10.327 (df=1298)
F Statistic          0.663 (df=3;384) 0.297 (df=3;1417) 2.794** (df=3;1298)
Note: *p<0.1; **p<0.05; ***p<0.01

Imputing several variables with different covariates

We can do several imputations at the same time.

I begin by creating “blocks” of covariates. Each block corresponds to a covariate I want to impute. I then create a matrix with dummies for the covariates I want to include in the imputation.

#All NAs
dfNA <- is.na(df0)
#Only the first two variables (y and x) are to be imputed
dfNA[, 3:5] = FALSE

#Two blocks
blocks <- make.blocks(c( "y", "x"))

#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
                                       blocks = blocks)
predictorMatrix
##   y x ClosedList EPGroup Age Age2
## y 0 1          1       1   1    1
## x 1 0          1       1   1    1
df2.imp <- mice(df0,
                m = 10,
                #norm for y, logreg for x
                method = c("norm", "logreg"),
                #Blocks of variables
                blocks = blocks,
                #Data frame specifying where imputations should be made
                where = dfNA,
                #Matrix flagging the predictors
                predictorMatrix = predictorMatrix,
                printFlag = F
)


mod2 <- with(df2.imp, 
            lm(y ~ 
                  x * ClosedList
)
            )
res2 <- pool(mod2) %>%
  summary
res2
Initial results from mice
Dependent variable: y.
Models: (1) listwise exclusion, (2) mice/norm, (3) mice/logreg, (4) mice/blocks.

                 (1)        (2)        (3)        (4)
x              -0.865     -0.200
               (1.464)    (1.074)
x1                                    0.465     -0.023
                                     (0.921)    (1.371)
ClosedList     -0.990     -0.285     -0.306     -0.419
               (1.215)    (0.761)    (0.690)    (0.722)
x:ClosedList    2.945      1.786
               (2.190)    (1.592)
x1:ClosedList                         1.860      1.659
                                     (1.243)    (1.453)
Constant       18.193***  16.304***  16.132***  16.314***
               (0.823)    (0.556)    (0.492)    (0.615)
Observations    388        1,421      1,302      1,421
R2              0.005      0.001      0.006      0.004
Adjusted R2    -0.003     -0.001      0.004      0.001
Residual Std. Error  9.912 (df=384)   10.423 (df=1417)  10.327 (df=1298)  10.390 (df=1417)
F Statistic          0.663 (df=3;384) 0.297 (df=3;1417) 2.794** (df=3;1298) 1.665 (df=3;1417)
Note: *p<0.1; **p<0.05; ***p<0.01

Display the results

We can use ggpredict to simulate scenarios for us. The most efficient way to do this is by writing a loop.

library(ggeffects)

#Create an empty list to store the predictions
predictions <- list()

#Loop over the 10 data sets/models
for(i in 1:10){
  m <- lm(y ~ x * ClosedList,
          data = complete(df2.imp, action = i))

  predictions[[i]] <- ggpredict(m,
                                terms = list("x" = c(0, 1),
                                             "ClosedList" = c(0, 1)))
}

The function pool_predictions() pools the simulated results for us by drawing on all 10 models. The outcome is the data frame we’re used to.

effs <-  pool_predictions(predictions)

effs %>%
  #Transform to regular data frame
  as.data.frame %>%
  #Recode to make variables substantive
  mutate(group = if_else(group == 1, "Party-centered", "Candidate-centered"),
         x = if_else(x == 1, "2. reelection-seeker", "1. not reelection-seeker")) %>%
  ggplot() +
  geom_point(aes(x = x,
                 y = predicted,
                 group = group),
             position = position_dodge(0.3)) +
  geom_errorbar(aes(x = x,
                 ymax = conf.high,
                 ymin = conf.low,
                 group = group),
             position = position_dodge(0.3)) +
  ylab("Number of meetings attended") +
  ggtitle("Effect of ambition on committee attendance")

Model assessment

Assessing whether we address missing observations appropriately is not easy. We don’t even know the extent of the problem! However, remember that ignoring the problem means you do a listwise exclusion. That’s a choice too.

When we impute values, the risk we are running is often overfitting: We pick variables that maximize our in-sample predictions, but then do a lousy job of predicting out of sample (i.e. our NAs).

In-sample predictions

You can do your usual cross-correlations between observed and predicted values for the imputed variable.
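For instance, reusing the age-based imputation model from earlier (mod.imp.age), a quick in-sample check might look like this (a sketch):

#Correlate in-sample predicted probabilities with the observed responses
obs <- !is.na(df$FutureInEP)
pred <- predict(mod.imp.age, df[obs, ], type = "response")
cor(pred, df$FutureInEP[obs], use = "pairwise.complete.obs")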

When using mice, there are a few additional functions. For continuous variables, we can plot the density (distribution) of the 10 imputed data sets (in red) against the observed density (blue).

densityplot(df2.imp, ~ y)

The xyplot() function plots the imputed variable against whatever covariate you have used to impute it; just to check that the two overlap appropriately.

xyplot(df2.imp, y ~ EPGroup )

Out-of-sample predictions

A last possibility is to leave observations out before making the predictions/picking a model. You can then use the model to impute the artificial NAs and compare.

df0$x_miss <- df0$x
id <- sample(row.names(df0 %>% 
                         filter(!is.na(x))), 
             100)
df0$x_miss[row.names(df0) %in% id ] <- NA

#All NAs
dfNA <- is.na(df0)
#Only y and x_miss are to be imputed
dfNA[, 2:6] = FALSE
dfNA %>% head
##          y     x ClosedList EPGroup   Age  Age2 x_miss
## [1,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [2,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [3,] FALSE FALSE      FALSE   FALSE FALSE FALSE  FALSE
## [4,]  TRUE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [5,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
## [6,] FALSE FALSE      FALSE   FALSE FALSE FALSE   TRUE
#Two blocks
blocks <- make.blocks(c( "y", "x_miss"))

#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
                                       blocks = blocks)

#Remove x as a predictor
predictorMatrix[,"x"] = 0
predictorMatrix
##        y x ClosedList EPGroup Age Age2 x_miss
## y      0 0          1       1   1    1      1
## x_miss 1 0          1       1   1    1      0
#Recode to factors
df0 <- 
  df0 %>%
  mutate(EPGroup = as.factor(EPGroup),
         x_miss = as.factor(x_miss))

df2.amp <- mice(df0,
                m = 10,
                #norm for y, logreg for x_miss
                method = c("norm", "logreg"),
                #Blocks of variables
                blocks = blocks,
                #Data frame specifying where imputations should be made
                where = dfNA,
                #Matrix flagging the predictors
                predictorMatrix = predictorMatrix,
                printFlag = F
)

test <- mice::complete(df2.amp, action = 3, include = T)

table(test$x, test$x_miss)
##    
##       0   1
##   0 555   4
##   1   5 249
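We can summarize the same check as the share of artificially deleted values that the imputation recovered correctly; a sketch, where check flags the rows with x observed but x_miss blanked out:

#Accuracy on the artificial NAs only
check <- is.na(df0$x_miss) & !is.na(df0$x)
mean(test$x[check] == test$x_miss[check])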

Literature

Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.