Missing data
In this session, we will be exploring methods for handeling missing data. We start out by exploring the variables: their distribution as well as the distribution of their missing observations. We then move on to the more simple ways of imputing values before we attack multiple imputations and simulation.
New R codes:
- from base:
sample() - new from the
finalfitpackage:missing_plot(),missing_compare(),missing_pattern() - new from the
micepackage:complete(),pool(),pool_predictions()
Research question
We continue to use the data on Members of the European Parliament (MEPs). Both voters and parties prefer when MEPs spend time legislating, and to do so, MEPs who want to be reelected need to spend time in committee meetings where influence is obtained. In our example on count data, we saw that MEPs are more likely to handle legislative proposals when they spend time in Parliament.
However, voters have an informational problem. Where parties usually know if MEPs occupy influential positions in parliament, voters do not. Thus, we may expect that MEPs who compete in party-centered systems – and who seek reelection – will spend more time in committees in order to gain the influence that parties want. Reversely, reelection-seeking MEPs competing in candidate-centered systems are better off spending time in their constituency to garner votes.
In other words, we have two comparison groups: We compare reelection seekers with non-reelection seekers and party-centered vs. candidate-centered systems through an interaction.
\(y = \alpha + \beta_1 x1 + \beta_2x2 + \beta_3x1\times x2\)
Substantive relationship:
\(Attendance_i = \alpha + \beta_1 FutureEP_i + \beta_2 ClosedList_i + \beta_3FutureEP_i\times ClosedList_i\)
The expectation is that the interaction term is positive: Close-list MEPs who claim they want reelection have higher attendance rates compared to open-list members / lame ducks.
The data structure is technically an unbalanced panel. We observe two survey rounds of MEPs and connect it with their behavior midway through two parliamentary periods (Fall 2012 and 2016). The panel is “unbalanced” because MEPs who were reelected in 2014 are observed twice while others are not. Here, we will treat it in a pooled model.
Our main variables are Attendance (\(y\)), FutureInEP (\(x1\)) and ClosedList (\(x2\)).
I begin by loading in my data. You can find them online here:
https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda.
load("MEP.rda")
df <- MEP %>%
#Subset to only one observation per survey round
dplyr::filter(Period == 1 | Period == 9)I also rename a few of my variables for ease.
Introduction
The “hitch” this time is that I use a survey response to measure an MEP’s intention to seek reelection (\(x1\)). However, most MEPs did not complete the survey. My intention is to impute the answers. While we’re at it, I’ll also scout for missing observations on other variables to, including my dependent variable (y, Attendance).
In the following, we will describe the data before we consider different ways to address missing values. This time, my descriptive preliminaries focus on NAs – the observations that haven’t gotten a value – because I am interested in knowing a) whether I have strong grounds to think they are systematically missing and b) whether I have reasonable strategies to adress that problem.
There are three types of missing values
It is always useful to think about the data generating process – especially for missing values – because it tells us how to model them.
| Type | Definition | Example | Strategy |
|---|---|---|---|
| MCAR (Missing Completely At Random) | The probability that an observation is missing is unrelated to either observed or unobserved values in the data. Missingness is random noise; data sample is representative. | A server crashes and randomly deletes some survey responses. | Listwise exclusion / complete-case analysis yield unbiased estimates, but we lose statistical power (size). Multiple imputation is innocuous and can help keep informatio on other covariates. |
| Multiple imputation is innocuous and can help keep informatio on other covariates. | |||
| MAR (Missing At Random) | Missingness depends on observed variables in the data, but not on the missing values themselves once we condition on those observables. | MEPs with low committee attendance are less likely to answer the survey on reelection ambitions. Because committee attendance (y) is observed, we can use it to help predict the missing values on x. | Multiple imputation using observed variables (including from y) can reduce bias. |
| MEPs from candidate-centered systems are less likely to answer a survey, but among those MEPs, response is unrelated to their actual reelection ambitions. | Condition on observables (electoral system). | ||
| MNAR (Missing Not At Random) | Missingness depends on the unobserved value itself, even after conditioning on observed variables. | MEPs with weak reelection ambitions systematically avoid answering questions about future political plans. | Difficult to solve statistically. Requires strong assumptions, theory-driven modeling, sensitivity analysis, or explicit models of the missingness process. |
Univariate statistics
Dependent variable: Attendance
I approximate legislative investment by the number of committee meetings an MEP has attended during the semester (period of observation.)
df %>%
ggplot +
geom_histogram(aes(y)) +
ggtitle("Number of committee meetings (y)") +
xlab("Number of committe meetings")The variable is a count and naturally bounded at 0 (i.e. none can attend a negative number of meetings). The canoncial model for count data is the poisson moel, of course, but for simplicity, I’ll stick to the linear model here.
For the purposes of this exercise, I am mainly interested in the differences in means between MEPs with an expressed reelection intent and those that don’t have such a clear ambition. This is a comparison of means between categorical variables, so the linear model can be defended. (I could also make the bet that the conditional distribution (and thus my residuals) will be approximately normal once I have specified my full model. If so, I could also run a linear model. )
Independent variable: Reelection intent
The EPRG survey is distributed at the beginning of each legislative
term and asks where an MEP sees him/herself in 10 years. One of the
possible answers is the “European Parliament”. This is my
operationalization of a re-election seeker
(FutureInEP==1).
df$future <- ifelse(df$x == 1, "reelection", "exit")
df %>%
ggplot +
geom_bar(aes(future)) +
xlab("Intention to seek reelection") +
ggtitle("Re-election intent upon election",
subtitle = "EPRG Survey distributed in 2009 and 2014")
As is clear from the barplot, most MEPs did not reply to the survey. I
have a lot of missing data (70 %). To what extent is this a problem?
Moderating variable : electoral system
Our moderating variable is the electoral system in which the MEP would compete if they were to seek reelection. It is a binary variable, and the MEPs are approxiamately equeally distributed accross both systems. That’s good news: we have variation.
df %>%
mutate(
ClosedList = if_else(ClosedList == 1, "Party-centered", "Candidate-centered")
) %>%
ggplot +
geom_bar(aes(ClosedList)) +
xlab("Electoral system") +
ggtitle("Electoral system in which MEPs would compete",
subtitle = "EPRG Survey distributed in 2009 and 2014")Potential confounder : Sources of missing observations
There are two potential problems with missing data: statistical power (we lose data) and bias. The key question to ask ourselves is whether the assignment mechanism that allocates NAs is systematic. Is it related to x or y, or both? The problem – and opportunities – depend on their structure and our theory about the data generating process (the one that generates NAs).
Ward and Ahlquist are on a mission in chapter 12: We have litte to lose – and a lot to gain – from imputing the missing values. Once we think about the assumptions surrounding most strategies for addressing missing values, imputation is the better option. This only holds as long as we can claim that the data is either MCAR (missing completely at random) or MAR (missing at random). If the data is MNAR (missing not at random), we need to re-theorize and / or look towards selection models (Heckman regression), sensitivity analysis and what not (fun stuff, but beyond this course).
Problem 1: statistical power
The size of the missing data is not in itself a problem for inference. If the data is MCAR (missing completely at random), it would mainly be a problem of statistical power, but the estimation itself would not be biased. Our data is in effect a random sample.
We can check the share of missing data in the survey response by asking whether an observation is NA, then calculate the mean (proportion).
## [1] 70.38596
I have 70% missing observations in my data. In survey research, this is branded as “not a problem” as long as they are representative of the population.
Imputing values for the non-responses in this case would also be completely innocuous, The key would be to keep the uncertainty from the prediction. (More on that later.)
Statistical power can quickly become a problem even if no single variable is ostensibly plagued with missing observations, but all variables have some missing values – and they are spread out over the data set. That’s why my first move is to map out where the missing values are.
What are the variables that contain missing?
For ease, let’s begin by subsetting the data to the covariates we are interested in. Some of the functions we use work better when we feed them with a data set that only contains the relevant variables. We can add more variables to this code as we go.
#Name my variables
vars <- c("y", "x", "ClosedList")
df0 <-
#Store results in a new data frame
df %>%
#Select them
dplyr::select(all_of(vars))Now, let’s see where I have missing observations. The package
finalfit has several convenient functions for exploring
patterns in the missing observations. missing_plot()
organizes the data row-by-row on the x-axis and reports the variables on
the y-axis. I draws light blue bars for all observations that are
missing.
We can see that I have some missing observations (light blue) on my
dependent variable “Attendance” (y) as well as a bunch on my predictor
“FutureInEP” (x). By contrast, we have no missing observations in the
indicator for party-centered systems (“ClosedList”).
In a data set where 10 variables all contain 10% missing observations that never overlap, you’d lose all of your data! Incidentally, this is a (soft) sign that our missing data is probably fairly random. At least it would mean that no single, systematic process has generated NAs on all variables. Ward and Ahlquist’s answer? This is the ideal case for imputation! Let’s check if each observation on NA is stand-alone or if they co-occur.
Unique combinations of NAs
Another functionality that comes from the mice- package
allows us to explore the patterns of missing observations: do
NA’s co-occur?
## ClosedList y x
## 395 1 1 1 0
## 923 1 1 0 1
## 27 1 0 1 1
## 80 1 0 0 2
## 0 107 1003 1110
We have four combinations of missing observations in the data:
- no missing (N = 395)
- missing only on the x (re-election seeker) (N = 923)
- missing only on the y (committee attendance) (N = 27)
- missing on both x and y (N = 80)
If we were to go for the standard solution, “listwise exclusion” (delete all observations with at least one NA), we’d end up with an analysis of 395 observations. Our problem is not an equal spread of NA accross variables. Instead, most missing observations are in the survey response (x). The only second unique data leakage is from the dependent variable (27 MEPs).
The big mike-drop moment in Ward and Ahlquist’s chapter is their approach to the dependent variable. According to them, there is nothing wrong in a) imputing missing values in the dependent variable, nor in b) using the dependent variable to impute missing values in the predictors. If we have other predictors in the main model, we can throw them into the imputation as well.
This sounds counter-intuitive, but they’re taking our logic to the extreme. If we trust our data enough to regress y on x (with controls), then we should also trust it enough to regress x on y and use the predictions to fill data gaps. The logic is simple: either we believe the missing completely is at random (MCAR), and this would not introduce bias. Or we believe that the missing is at random (MAR). If so, the regression is already biased and internal imputations would simply continue (but not worsen) a problem we already have.
The added value is that, we use the information embedded structure implied in the model to the full, rather than just the information from each variable separately.
They take this one step further, however, and argue that this is the first step to correct bias. If we deleted our observations, we risk having a skewed data set.
Problem 2: Bias
Ward and Ahlquist take glass half-full/half-empty approach to missingness. NA is rarely random noise. After all we study human behavior.
In the worst case, the same mechanism that produces missing observations in x may also influence both the observed values of x and y. Now I have a selection problem that looks eerily like the one we have with unobserved confounders! The same underlying trait may affect both whether an observation is missing and the values taken on by x and y.
For example, some of my MEPs may both skimp on their committee work and avoid answering annoying surveys. If so, simply deleting observations through listwise exclusion would introduce systematic bias. My sample of complete data on MEP activity and ambition is no longer representative of the underlying population.
How do I find systematic bias? There is no way to
know if the missing data is truly random, but there are some pretty sure
giveaways that the missing observations are not completely
random. That is when a predictor’s NAs covary with either
i) the observed values of other variables or ii) NAs in the
other variables (in the model or other variables we might have at our
disposal).
Those correlations are informative in more than one way: they rule out MCAR, but they also make the missing values “observable”. Once we can see the problem, we can address it.
Let’s look for correlations; they are signs of potential adressable bias.
Pairs of correlations
We can begin by checking for systematic correlations between the
missing and observed data across different variables. The
missing_compare() function does a simple cross-tabulation
of missing values in the variable of your choice (e.g. x,
re-election seeker) and the values of other
predictors/outcomes. It then tests the statistical significance of
the bivariate relationship. When the predictors are categorical, it
performs a \(\chi^2\) test.
Here, I compare missing in x with values of y and z (party-centered system). The average number of committee meetings is higher among survey respondents than non-respondents (missing). It also turns out that the cross-tabulation between party-centered systems and survey (non-)responses is statistically significant: MEPs from closed-list systems/party-centered systems where less likely to respond to the survey.
This weakens the MCAR assumptions, and makes me think we are in an MAR scenario: we have systematic missing observations that we should address.
Let’s check the NAs in my dependent variable for good form. I’m using a simple correlation here.
##
## Pearson's product-moment correlation
##
## data: as.numeric(is.na(df$y)) and df$ClosedList
## t = 2.5355, df = 1423, p-value = 0.01134
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01518664 0.11857821
## sample estimates:
## cor
## 0.06706245
Qe see that I miss data on committee attendance for more MEPs from party-centered systems than candidate-centered systems.
I can also correlate the missingness in y with the value of x: missing attendance with observed ambition.
#NA in attendance vs. future in the EP (reelection ambition)
cor.test(as.numeric(is.na(df$y)), df$x)##
## Pearson's product-moment correlation
##
## data: as.numeric(is.na(df$y)) and df$x
## t = 0.63699, df = 420, p-value = 0.5245
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06458376 0.12615157
## sample estimates:
## cor
## 0.03106673
Missing on my dependent variable is not correlated with MEPs ambition (the value of x). However, the opposite is not true. People who did not answer the survey attend fewer committee meetings:
##
## Pearson's product-moment correlation
##
## data: as.numeric(is.na(df$x)) and df$y
## t = -2.5653, df = 1316, p-value = 0.01042
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.12406177 -0.01660529
## sample estimates:
## cor
## -0.07053817
It seems like we have two problems: Missing on y and missing on x. Let’s ignore the missing on y for the time being and focus on the non-responses in my survey.
Conditional NAs
A quick way to check if missing observations are conditional on other predictors is to regress the NAs on the rest of the predictors.
I begin by recoding my variable into 0 (not missing) and 1 (missing).
Now, I can estimate a binomial model to see if the data is balanced.
##
## Call:
## glm(formula = x_na ~ ClosedList, family = "binomial", data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.63908 0.08186 7.807 5.85e-15 ***
## ClosedList 0.44045 0.11665 3.776 0.000159 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.5 on 1424 degrees of freedom
## Residual deviance: 1717.2 on 1423 degrees of freedom
## AIC: 1721.2
##
## Number of Fisher Scoring iterations: 4
… they are not. The non-responses to the survey are systematically correlated with the electoral system. Here, I find that the probability of a missing observation on x increases by 55.3407789 % when we go from candidate-centered to party-centered systems.
If we assume the data is MAR rather than MNAR, then conditioning on electoral system in the main model would ideally address the bias induced by the missing observations. We can also use the predictors in the model to impute non-responses.
Ward and Ahlquist (2018) also argue that we can use the outcome variable (y) to impute values on x. Their approach is that we have a (theoretical) relationship between the two, and that this is reflected in their correlation. The correlation is symmetric, so if we can predict y using x, then they argue that we can also use y to predict x.
Before we go that far, let’s consider some low-stakes options.
Quick fixes
If we can plausibly claim that our data is missing competely at random, we can simply treat the remaining complete-observation sample as if it was randomly drawn, and thus representative of the population.
Removing data (listwise exclusion / complete sample analysis)
My ideal model looks like this:
##
## Call:
## lm(formula = y ~ x * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.839 -7.408 -1.358 6.324 47.642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.4082 0.8460 22.942 <2e-16 ***
## x -1.0500 1.5119 -0.694 0.488
## ClosedList -0.4642 1.2479 -0.372 0.710
## x:ClosedList 2.9452 2.2375 1.316 0.189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.26 on 391 degrees of freedom
## (1030 observations deleted due to missingness)
## Multiple R-squared: 0.00507, Adjusted R-squared: -0.002564
## F-statistic: 0.6641 on 3 and 391 DF, p-value: 0.5745
From the base-line model, we can see that MEPs that already know that they will seek reelection attend on average 3 committee meetings more than their colleagues in party-centered systems without such ambition. The result is not statistically significant, though.
If I do nothing more at the estimation stage, R will proceed to listwise exclusion of all observations with missing values on any of the covariates included in the model. .
## [1] 395
Here, we see that the length of my data is 395, while my data frame contains 1425 observations. We lost quite a lot of data on the other variables in the regression.
Re-weighing data (regression weights)
A common problem / feature among survey researchers is that samples are not representative of the population / original data. Researchers fix the symptom, by re-balancing the data. They bank on the assumption that the observations that are complete, are representative enough for their group to “stand in” for non-responses.
If the main problem is just over/under-representation of some groups of observations. If so, we can include weights in the regression to compensate for the overrepresenation.
df %>%
group_by(Nationality) %>%
summarize(prop = mean(is.na(x))) %>%
dplyr::select(Nationality, prop) %>%
ggplot +
geom_bar(aes(x = prop,
y = Nationality),
stat = "identity") +
ggtitle("Proportion of non-responses by nationality") +
xlab("")Distribution of non-responses on the EPRG survey
From the Figure, we see that there is some variation in response rate across nationalities. If that is all, and we don’t think the same mechanism will affect the values taken on by my \(y\), then we can simply weight the observations according to their probability of occurring in the survey. E.g. we can see that Hungary and Spain are overrepresented, while Denmark has a lower response rate.
I begin by creating a dummy indicator variable for whether the MEP responded to the survey.
I then fit a model to predict the probability of a non-response as a function of nationality.
##
## Call:
## glm(formula = future_na ~ Nationality, family = "binomial", data = df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.34993 0.42414 3.183 0.00146 **
## NationalityBelgium -0.58779 0.53353 -1.102 0.27060
## NationalityBulgaria -0.94446 0.54676 -1.727 0.08410 .
## NationalityCroatia -0.94446 0.77237 -1.223 0.22140
## NationalityCyprus -1.16761 0.73930 -1.579 0.11426
## NationalityCzech Republic -1.19578 0.53204 -2.248 0.02461 *
## NationalityDenmark -1.26988 0.58322 -2.177 0.02945 *
## NationalityEstonia -0.94446 0.77237 -1.223 0.22140
## NationalityFinland -0.71394 0.59146 -1.207 0.22740
## NationalityFrance -0.25131 0.47364 -0.531 0.59570
## NationalityGermany -0.34011 0.45458 -0.748 0.45435
## NationalityGreece -0.21852 0.55871 -0.391 0.69570
## NationalityHungary 1.24034 0.73364 1.691 0.09090 .
## NationalityIreland -1.24457 0.62530 -1.990 0.04655 *
## NationalityItaly -0.60271 0.46113 -1.307 0.19120
## NationalityLatvia -0.99325 0.65019 -1.528 0.12661
## NationalityLithuania -0.30847 0.63670 -0.484 0.62804
## NationalityLuxembourg -0.65678 0.74491 -0.882 0.37795
## NationalityMalta -1.01345 0.72302 -1.402 0.16100
## NationalityNetherlands -0.47446 0.52377 -0.906 0.36502
## NationalityPoland -0.50263 0.47698 -1.054 0.29199
## NationalityPortugal -0.34662 0.55149 -0.629 0.52966
## NationalityRomania -0.85745 0.50312 -1.704 0.08833 .
## NationalitySlovakia -1.26988 0.58322 -2.177 0.02945 *
## NationalitySlovenia -1.34993 0.65566 -2.059 0.03951 *
## NationalitySpain -0.04652 0.48754 -0.095 0.92398
## NationalitySweden -0.41562 0.55364 -0.751 0.45283
## NationalityUnited Kingdom -0.21285 0.46770 -0.455 0.64904
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1731.5 on 1424 degrees of freedom
## Residual deviance: 1681.8 on 1397 degrees of freedom
## AIC: 1737.8
##
## Number of Fisher Scoring iterations: 5
The weight variable multiplies in practice our dependent variable to increase or decrease according to weight it is given. Now, I weigh up the observations from under-represented groups and down the over-represented groups.
##
## Call:
## lm(formula = Attendance ~ x * ClosedList, data = df, weights = 1/weight)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -25.625 -9.453 -1.720 7.607 64.608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.4783 0.8397 23.197 <2e-16 ***
## x -0.8878 1.4916 -0.595 0.552
## ClosedList -0.4212 1.2888 -0.327 0.744
## x:ClosedList 2.9464 2.2863 1.289 0.198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.78 on 391 degrees of freedom
## (1030 observations deleted due to missingness)
## Multiple R-squared: 0.00509, Adjusted R-squared: -0.002543
## F-statistic: 0.6668 on 3 and 391 DF, p-value: 0.5728
This is a common way to address some of the information biases we find in surveys such as the European Social Survy, World Values Survey etc.
stargazer(mod.base, mod.w,
#Esthetics: redefine the dependent variable and column names
dep.var.caption = "Committee attendance",
dep.var.labels.include = F,
model.numbers = F,
column.labels = c("Listwise excl", "Weights"),
title = "Listwise exclusion and regression weights.",
type = "html")| Committee attendance | ||
| Listwise excl | Weights | |
| x | -1.050 | -0.888 |
| (1.512) | (1.492) | |
| ClosedList | -0.464 | -0.421 |
| (1.248) | (1.289) | |
| x:ClosedList | 2.945 | 2.946 |
| (2.237) | (2.286) | |
| Constant | 19.408*** | 19.478*** |
| (0.846) | (0.840) | |
| Observations | 395 | 395 |
| R2 | 0.005 | 0.005 |
| Adjusted R2 | -0.003 | -0.003 |
| Residual Std. Error (df = 391) | 10.257 | 12.780 |
| F Statistic (df = 3; 391) | 0.664 | 0.667 |
| Note: | p<0.1; p<0.05; p<0.01 | |
Simple imputations (the intuition)
“Imputing” missing values implies that we fill in the gaps. The non-informative ways of filling in gaps imply giving the observations some value without adding new information. This will attenuate the statistical correlation, but increase the N. If my covariate is a control variable, and I want to keep as many observations as possible (or reduce selection bias) for the main predictors in the model, these are low-energy fixes.
This section is mainly pedagogical: to have you understand the logic behind multiple imputatation (next section) through simpler examples.
Mean imputation
I can simply replace my missing observations by the mean of the variable. Here, the “mean” is the proportion of MEPs that answer the survey question by the positive (x == 1).
The histogram shows what I just did: I now have my original observations
with 0s and 1s, but I also have a big group of “imputed” observations:
the variable mean.
I can run my model with this new “imputed” variable. There is no new information in the model; to the contrary, I am “watering down” the correlation with noise. My N is higher, however, meaning that the standard errors will be artificially low. I.e. I’ll be underpredicting while being overly certain about the result.
##
## Call:
## lm(formula = y ~ x_bar * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.479 -8.022 -1.206 6.380 74.794
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3558 0.6645 27.623 <2e-16 ***
## x_bar -1.0602 1.5988 -0.663 0.507
## ClosedList -0.7361 0.9540 -0.772 0.440
## x_bar:ClosedList 2.9194 2.3659 1.234 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.85 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.001267, Adjusted R-squared: -0.001013
## F-statistic: 0.5557 on 3 and 1314 DF, p-value: 0.6444
Indicator variables
A traditional “fix” is the indicator-variable trick. We replace missing observations by the variable mean, but an indicator variable that flags all the cases where I’ve been meddling with my observations.
It’s – once again – a way to keep the entire data set. Now, I control for the imputation, so that the effect of the imputed variable reports the estimated effect of x on y using only the information from the observed values (plus some noise). The imputed values are subsumed in the indicator. Consider the similarity between the two coefficients for x and x_bar in model 1 and 3.
##
## Call:
## lm(formula = y ~ +x_na + +x_bar * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.763 -7.762 -1.436 6.238 75.238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.4714 0.7890 24.679 < 2e-16 ***
## x_na -1.7045 0.6534 -2.609 0.00919 **
## x_bar -1.0493 1.5953 -0.658 0.51079
## ClosedList -0.6018 0.9533 -0.631 0.52794
## x_bar:ClosedList 2.9426 2.3607 1.246 0.21281
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.82 on 1313 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.006417, Adjusted R-squared: 0.00339
## F-statistic: 2.12 on 4 and 1313 DF, p-value: 0.07615
One problem with this solution is the big question mark we are left with: Missing values on x (x_na) correlate heavily with the outcome. Respondents have higher attendance rates in parliament.
stargazer(mod.base, mod.w, mod.m, mod.ind, #mod.sam,
dep.var.labels.include = F,
dep.var.caption = "Attendance",
column.labels = c("listwise exclusion", "weights", "mean imp.", "mean imp. + indicator"),
type = "html")| Attendance | ||||
| listwise exclusion | weights | mean imp. | mean imp. + indicator | |
| (1) | (2) | (3) | (4) | |
| x | -1.050 | -0.888 | ||
| (1.512) | (1.492) | |||
| x_na | -1.704*** | |||
| (0.653) | ||||
| x_bar | -1.060 | -1.049 | ||
| (1.599) | (1.595) | |||
| ClosedList | -0.464 | -0.421 | -0.736 | -0.602 |
| (1.248) | (1.289) | (0.954) | (0.953) | |
| x:ClosedList | 2.945 | 2.946 | ||
| (2.237) | (2.286) | |||
| x_bar:ClosedList | 2.919 | 2.943 | ||
| (2.366) | (2.361) | |||
| Constant | 19.408*** | 19.478*** | 18.356*** | 19.471*** |
| (0.846) | (0.840) | (0.665) | (0.789) | |
| Observations | 395 | 395 | 1,318 | 1,318 |
| R2 | 0.005 | 0.005 | 0.001 | 0.006 |
| Adjusted R2 | -0.003 | -0.003 | -0.001 | 0.003 |
| Residual Std. Error | 10.257 (df = 391) | 12.780 (df = 391) | 10.846 (df = 1314) | 10.822 (df = 1313) |
| F Statistic | 0.664 (df = 3; 391) | 0.667 (df = 3; 391) | 0.556 (df = 3; 1314) | 2.120* (df = 4; 1313) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
Inserting information to the model (multiple imputation)
Until now, we haven’t inserted any new information to the missing variables. We can do that using imputations; either by inserting information from i) the other covariates or from ii) new variables. There are grounds to say that we can draw on the covariates already in the model. First, if the covariates in the model are there because they covary (with each other), then we may argue that the information is already in the data; we just need to crystallize it.
Moving forward, we can either replace variables by group means, or go all the way using regressions.
Information from other covariates
Group means
We can replace the value of x by a predictor like
ClosedList.
##
## Call:
## lm(formula = y ~ x_bar_g * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.571 -8.016 -1.198 6.388 74.802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3871 0.6753 27.227 <2e-16 ***
## x_bar_g -1.1271 1.5981 -0.705 0.481
## ClosedList -0.7753 0.9497 -0.816 0.414
## x_bar_g:ClosedList 3.0864 2.3652 1.305 0.192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.85 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.001407, Adjusted R-squared: -0.0008732
## F-statistic: 0.617 on 3 and 1314 DF, p-value: 0.604
Including new covariates
Let’s instead use an entirely new covariate.
In fact, let’s add several predictors to the data set we use. My
candidate predictors are the legislator’s Nationality,
EPGroup (parliamentary group), EP (legislative
period/survey round) and Age when they responded to the
survey.
#Center age (for convenience later)
df <-
df %>%
mutate(
Age_c = Age - 40,
Age2_c = Age_c^2
)
##Select variables
vars <- c("y", "x", "ClosedList",
"EPGroup",
"Nationality",
"Age", "Age_c", "Age2_c",
"position", "lrecon")
df0 <- df %>%
dplyr::select(all_of(vars))Usually, I’d check the patterns of missing data once again (i.e. are there missing observations on the new covariates?).
df0 <-
df0 %>%
group_by(Nationality) %>%
mutate(
x_bar_g2 = if_else(is.na(x), mean(x, na.rm = T), x)) %>%
#make sure to ungroup (useful for some of the sampling functions)
ungroup()##
## Call:
## lm(formula = y ~ x_bar_g2 * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.744 -8.027 -1.277 6.405 74.827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.2853 0.6447 28.362 <2e-16 ***
## x_bar_g2 -0.8313 1.5065 -0.552 0.581
## ClosedList -0.6906 0.8988 -0.768 0.442
## x_bar_g2:ClosedList 2.9393 2.2243 1.321 0.187
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.001561, Adjusted R-squared: -0.0007184
## F-statistic: 0.6848 on 3 and 1314 DF, p-value: 0.5613
stargazer(mod.base, mod.g, mod.g2,
column.labels = c("listwise exclusion", "group mean/ClosedList", "group mean/Nationality"),
type = "html")| Dependent variable: | |||
| y | |||
| listwise exclusion | group mean/ClosedList | group mean/Nationality | |
| (1) | (2) | (3) | |
| x | -1.050 | ||
| (1.512) | |||
| x_bar_g | -1.127 | ||
| (1.598) | |||
| x_bar_g2 | -0.831 | ||
| (1.506) | |||
| ClosedList | -0.464 | -0.775 | -0.691 |
| (1.248) | (0.950) | (0.899) | |
| x:ClosedList | 2.945 | ||
| (2.237) | |||
| x_bar_g:ClosedList | 3.086 | ||
| (2.365) | |||
| x_bar_g2:ClosedList | 2.939 | ||
| (2.224) | |||
| Constant | 19.408*** | 18.387*** | 18.285*** |
| (0.846) | (0.675) | (0.645) | |
| Observations | 395 | 1,318 | 1,318 |
| R2 | 0.005 | 0.001 | 0.002 |
| Adjusted R2 | -0.003 | -0.001 | -0.001 |
| Residual Std. Error | 10.257 (df = 391) | 10.845 (df = 1314) | 10.845 (df = 1314) |
| F Statistic | 0.664 (df = 3; 391) | 0.617 (df = 3; 1314) | 0.685 (df = 3; 1314) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
Regression
From group means, there is only a tiny-tiny step to regression predictions. Since my x is binary, I run a binomial logistic regression to get predicted probabilities that I can later use for imputation. I’m regression on “Age”.
##
## Call:
## glm(formula = x ~ Age_c + Age2_c, family = "binomial", data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.299117 0.166678 -1.795 0.0727 .
## Age_c 0.054337 0.025004 2.173 0.0298 *
## Age2_c -0.004572 0.001119 -4.085 4.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 525.96 on 421 degrees of freedom
## Residual deviance: 487.40 on 419 degrees of freedom
## (1003 observations deleted due to missingness)
## AIC: 493.4
##
## Number of Fisher Scoring iterations: 5
At this point, I’ll prioritize predictors that make sense theoretically and that have a strong correlation. It makes sense that MEPs that are above a certain age will not plan to stay in office for another 10 years, so age is a substantivly good predictor. I add a quadric term because I think young MEPs are increasingly likely to aim for an EU career (as they lock themselves into a career), then – as they grow closer to retirement – i expect the opposite.
df0$pred.age <- predict(mod.imp.age, df0, type = "response")
df0 <-
df0 %>%
mutate(x.age = if_else(is.na(x), pred.age, x))##
## Call:
## lm(formula = y ~ +x.age * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.523 -7.942 -1.247 6.262 75.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.5225 0.6369 29.082 <2e-16 ***
## x.age -1.5927 1.4834 -1.074 0.283
## ClosedList -0.7840 0.9046 -0.867 0.386
## x.age:ClosedList 3.0683 2.1582 1.422 0.155
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.001618, Adjusted R-squared: -0.0006612
## F-statistic: 0.7099 on 3 and 1314 DF, p-value: 0.5461
Random effects and fixed effects
A lot of (unobserved) information is often contained in group variables.
Fixed effects
mod.imp.epgr <- glm(x ~ Age_c + Age2_c
+ EPGroup
,
family = "binomial",
df)
#Check the results
# summary(mod.imp.epgr)df0$pred.epgr <- predict(mod.imp.epgr, df0, type = "response")
df0 <-
df0 %>%
mutate(x_gr = if_else(is.na(x), pred.epgr, x))##
## Call:
## lm(formula = y ~ +x_gr * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.793 -7.945 -1.307 6.370 75.369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3403 0.6147 29.836 <2e-16 ***
## x_gr -1.0334 1.4144 -0.731 0.4651
## ClosedList -0.8833 0.8723 -1.013 0.3114
## x_gr:ClosedList 3.3691 2.0246 1.664 0.0963 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.002448, Adjusted R-squared: 0.0001704
## F-statistic: 1.075 on 3 and 1314 DF, p-value: 0.3587
Fixed effects are always a possibility, but remember: You run the risk of overfitting the data wheen you hunt around for covariates. Fixed effects are a good example: They predict really well in-group, but what about new groups? Or if there is a change?
Random effects
are more flexible. We can create new groups simply by drawing from the distribution of random effects (i.e. the random intercepts are assumed to follow a normal distribution with a given mean and standard deviation). We can also add level-2 covariates to improve the out-of-sample predictions.
library(lme4)
mod.imp.ran <- glmer(x ~ Age_c + Age2_c
+ (1|EPGroup),
family = "binomial",
df0)
summary(mod.imp.ran)## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: x ~ Age_c + Age2_c + (1 | EPGroup)
## Data: df0
##
## AIC BIC logLik deviance df.resid
## 485.3 501.5 -238.7 477.3 418
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.1683 -0.7179 -0.4391 0.9765 7.4477
##
## Random effects:
## Groups Name Variance Std.Dev.
## EPGroup (Intercept) 0.3495 0.5912
## Number of obs: 422, groups: EPGroup, 8
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.596654 0.294265 -2.028 0.0426 *
## Age_c 0.047631 0.026319 1.810 0.0703 .
## Age2_c -0.004556 0.001157 -3.938 8.21e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Age_c
## Age_c -0.209
## Age2_c 0.042 -0.888
## optimizer (Nelder_Mead) convergence code: 0 (OK)
## Model failed to converge with max|grad| = 0.00296411 (tol = 0.002, component 1)
## Model is nearly unidentifiable: very large eigenvalue
## - Rescale variables?
To allow new groups to be drawn, simply specify it int the prediction.
df0$pred.ran <- predict(mod.imp.ran, df0, type = "response", allow.new.levels = T)
df0 <-
df0 %>%
mutate(x.ran = if_else(is.na(x), pred.ran, x))##
## Call:
## lm(formula = y ~ +x.ran * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.781 -7.989 -1.316 6.376 75.371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3368 0.6216 29.500 <2e-16 ***
## x.ran -1.0212 1.4446 -0.707 0.480
## ClosedList -0.8685 0.8829 -0.984 0.325
## x.ran:ClosedList 3.3340 2.0755 1.606 0.108
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.002277, Adjusted R-squared: -5.978e-07
## F-statistic: 0.9997 on 3 and 1314 DF, p-value: 0.3921
stargazer(mod.base, mod.gr, mod.ran,
column.labels = c("listwise exclusion", "fixeff", "raneff"),
type = "html")| Dependent variable: | |||
| y | |||
| listwise exclusion | fixeff | raneff | |
| (1) | (2) | (3) | |
| x | -1.050 | ||
| (1.512) | |||
| x_gr | -1.033 | ||
| (1.414) | |||
| x.ran | -1.021 | ||
| (1.445) | |||
| ClosedList | -0.464 | -0.883 | -0.869 |
| (1.248) | (0.872) | (0.883) | |
| x:ClosedList | 2.945 | ||
| (2.237) | |||
| x_gr:ClosedList | 3.369* | ||
| (2.025) | |||
| x.ran:ClosedList | 3.334 | ||
| (2.076) | |||
| Constant | 19.408*** | 18.340*** | 18.337*** |
| (0.846) | (0.615) | (0.622) | |
| Observations | 395 | 1,318 | 1,318 |
| R2 | 0.005 | 0.002 | 0.002 |
| Adjusted R2 | -0.003 | 0.0002 | -0.00000 |
| Residual Std. Error | 10.257 (df = 391) | 10.840 (df = 1314) | 10.841 (df = 1314) |
| F Statistic | 0.664 (df = 3; 391) | 1.075 (df = 3; 1314) | 1.000 (df = 3; 1314) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
Bringing in the uncertainty
Until now, we’ve added information to the models from other variables to increase the N in the main model. There is quite some uncertainty in these predictions. How can we account for that? In fact, the worse the predictions, the more important is it to account for uncertainty. We can do this by using simulation, i.e. we add a bit of noise in our predictions.
Random sampling
Absent any new information, we can of course fill in the gaps using information from the model, but without any assumptions about a trand; that is, we can do random sampling. It has the advantage that it allows us to inflate the standard error, reflecting the uncertainty in the estimation of x.
I can use the sample() function to draw randomly from
the observed part of my x.
sam <- sample(
#Draw values that are not NA
df0$x[!is.na(df0$x)],
#Make as many draws as I have data
size = nrow(df0),
#For reach draw, replace so that I can draw anew
replace = T)
#Replace the missing observations by my draws
df0 <-
df0 %>%
mutate(x.sam = if_else(is.na(x), sam, x))Note that this random part means that I’ll get a new result every time I run my R-code. Also, it won’t help me make much headway if I have a lot of missing observations and I am actually interested in the effect of this particular variable.
##
## Call:
## lm(formula = y ~ x.sam * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.357 -8.190 -1.190 6.134 74.766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8656 0.5270 33.901 <2e-16 ***
## x.sam 0.4912 0.9325 0.527 0.598
## ClosedList 0.3242 0.7198 0.450 0.652
## x.sam:ClosedList -0.4469 1.2978 -0.344 0.731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.85 on 1314 degrees of freedom
## (107 observations deleted due to missingness)
## Multiple R-squared: 0.0002821, Adjusted R-squared: -0.002
## F-statistic: 0.1236 on 3 and 1314 DF, p-value: 0.9462
Sampling from a prediction
We can also sample from a prediction if we know the distribution of this prediction. One way is to assume a normal distribution of each prediction.
##Missing values in x are related to the value of other xs
mod2.na <- glm(x ~
+ Age_c
+ Age2_c
+ position
+ EPGroup,
family="binomial",
df0
)
summary(mod2.na)##
## Call:
## glm(formula = x ~ +Age_c + Age2_c + position + EPGroup, family = "binomial",
## data = df0)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.366516 1.096786 -2.158 0.0310 *
## Age_c 0.037068 0.029898 1.240 0.2150
## Age2_c -0.004099 0.001283 -3.195 0.0014 **
## position 0.313443 0.169086 1.854 0.0638 .
## EPGroupECR 0.965138 0.740880 1.303 0.1927
## EPGroupEFDD -0.730251 1.285852 -0.568 0.5701
## EPGroupEPP 0.698014 0.402373 1.735 0.0828 .
## EPGroupGreens/EFA -1.918548 1.095199 -1.752 0.0798 .
## EPGroupGUE/NGL 0.433698 0.903595 0.480 0.6312
## EPGroupNI 1.400315 1.017342 1.376 0.1687
## EPGroupPES 0.389912 0.418219 0.932 0.3512
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 458.61 on 369 degrees of freedom
## Residual deviance: 398.12 on 359 degrees of freedom
## (1055 observations deleted due to missingness)
## AIC: 420.12
##
## Number of Fisher Scoring iterations: 5
I ask for the standard error of the predictions in order to know the spread of the distribution.
preds <- predict(mod2.na, df0, "response", se.fit = T)
df0$x.glm <- preds$fit
df0$x.se <- preds$se.fit
#Draw from a normal distribution with a mean an a standard deviation
sims <- rnorm(nrow(df0), df0$x.glm, df0$x.se)
#Replace the missing observations by my draws
df0 <-
df0 %>%
mutate(x_sim = if_else(is.na(x), sims, x))##
## Call:
## lm(formula = y ~ x_sim * ClosedList, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.319 -7.935 -1.131 6.278 75.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.0125 0.6196 29.072 <2e-16 ***
## x_sim -0.2902 1.4190 -0.205 0.838
## ClosedList -0.6291 0.8702 -0.723 0.470
## x_sim:ClosedList 3.2259 2.0178 1.599 0.110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.84 on 1231 degrees of freedom
## (190 observations deleted due to missingness)
## Multiple R-squared: 0.003699, Adjusted R-squared: 0.001271
## F-statistic: 1.523 on 3 and 1231 DF, p-value: 0.2067
stargazer(mod.base, mod.sam, mod.sim,
column.labels = c("listwise exclusion", "random sample", "simulation"),
type = "html")| Dependent variable: | |||
| y | |||
| listwise exclusion | random sample | simulation | |
| (1) | (2) | (3) | |
| x | -1.050 | ||
| (1.512) | |||
| x.sam | 0.491 | ||
| (0.932) | |||
| x_sim | -0.290 | ||
| (1.419) | |||
| ClosedList | -0.464 | 0.324 | -0.629 |
| (1.248) | (0.720) | (0.870) | |
| x:ClosedList | 2.945 | ||
| (2.237) | |||
| x.sam:ClosedList | -0.447 | ||
| (1.298) | |||
| x_sim:ClosedList | 3.226 | ||
| (2.018) | |||
| Constant | 19.408*** | 17.866*** | 18.013*** |
| (0.846) | (0.527) | (0.620) | |
| Observations | 395 | 1,318 | 1,235 |
| R2 | 0.005 | 0.0003 | 0.004 |
| Adjusted R2 | -0.003 | -0.002 | 0.001 |
| Residual Std. Error | 10.257 (df = 391) | 10.852 (df = 1314) | 10.840 (df = 1231) |
| F Statistic | 0.664 (df = 3; 391) | 0.124 (df = 3; 1314) | 1.523 (df = 3; 1231) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
A side note: How do compare model coefficients
sjPlot allows us to compare coefficients from several
models at the same time.
Here, I replace the variable names for the regression coefficients so that I will get all coefficients placed right above each other.
replace_names <- function(x,
vars = c("x", "ClosedList", "x:ClosedList")){
names(x$coefficients)[-1] <- vars
return(x)
}
mod.m <- replace_names(mod.m)
mod.g <- replace_names(mod.g)
mod.sam <- replace_names(mod.sam)
mod.sim <- replace_names(mod.sim)plot_models( mod.base,
mod.m,
mod.g,
mod.sam,
mod.sim,
m.labels = c("Listwise exclusion",
"Mean imputation",
"Group-mean imputation",
"Random sampling",
"Simulation"),
legend.title = "Model approaches",
title = "Coeficients from four strategies of NA handling")stargazer(mod.base, mod.m, mod.g, mod.sam, mod.sim,
column.labels = c("Base-line", "Mean repl.", "Group repl.", "Random sample", "Imputation"),
type = "html")| Dependent variable: | |||||
| y | |||||
| Base-line | Mean repl. | Group repl. | Random sample | Imputation | |
| (1) | (2) | (3) | (4) | (5) | |
| x | -1.050 | -1.060 | -1.127 | 0.491 | -0.290 |
| (1.512) | (1.599) | (1.598) | (0.932) | (1.419) | |
| ClosedList | -0.464 | -0.736 | -0.775 | 0.324 | -0.629 |
| (1.248) | (0.954) | (0.950) | (0.720) | (0.870) | |
| x:ClosedList | 2.945 | 2.919 | 3.086 | -0.447 | 3.226 |
| (2.237) | (2.366) | (2.365) | (1.298) | (2.018) | |
| Constant | 19.408*** | 18.356*** | 18.387*** | 17.866*** | 18.013*** |
| (0.846) | (0.665) | (0.675) | (0.527) | (0.620) | |
| Observations | 395 | 1,318 | 1,318 | 1,318 | 1,235 |
| R2 | 0.005 | 0.001 | 0.001 | 0.0003 | 0.004 |
| Adjusted R2 | -0.003 | -0.001 | -0.001 | -0.002 | 0.001 |
| Residual Std. Error | 10.257 (df = 391) | 10.846 (df = 1314) | 10.845 (df = 1314) | 10.852 (df = 1314) | 10.840 (df = 1231) |
| F Statistic | 0.664 (df = 3; 391) | 0.556 (df = 3; 1314) | 0.617 (df = 3; 1314) | 0.124 (df = 3; 1314) | 1.523 (df = 3; 1231) |
| Note: | p<0.1; p<0.05; p<0.01 | ||||
Mice (Multiple Imputation for missing data using Chained Equations)
We can go all in on this approach and simulate x-s, store them in
several data frames, run separate models on each of them before we
average over the coefficients. This is what mice does.
Basic workflow
The mice() function draws on chained equations. It
predicts x1 using x2, and y, then x2 using x1 (including the predicted
values) and y etc. It goes around in circles until the predictions
stabilizes. For each iteration, it draws randomly from a normal
distribution and multiplies is with the predictions to add a bit of
noise to the mix. In the end it produces several data sets for us in
which it imputes slightly different values.
Now, we can run our main model on each of those data sets, average over the regression coefficients and calculate the uncertainty for each of them using both the standard error from the model at hand, as well as the variation in the coefficients between models.
Step one: prepare the data and strategy
Mice requires a data set with predictors that will all be imputed/used for imputation. If I don’t specify otherwise, all covariates in the data are used to run around in circles. That’s why it’s useful to subset the data to the most relevant predictors.
Mice requires us to make explicit the measurement level of each covariates, so make sure to transform binary variables and categories to factors.
library(mice)
df0 <- df %>%
#Quadric term for age
mutate(x = as.factor(x),
ClosedList = as.factor(ClosedList)) %>%
#Select variables
dplyr::select(y, x, ClosedList, position,
EPGroup, Age_c, Age2_c) %>%
#Recode to factor
mutate(EPGroup = as.factor(EPGroup))Mice can use many types of models to predict. Here, I chose a linear model and ask for 10 data sets.
df0.imp <- mice(df0,
m = 10,
# #linear outcome
# method = c("norm",
# #Binary variables as factors
# "logreg", "logreg",
# #Metric
# "norm",
# #Categorical, unordered
# "polyreg",
# "norm", "norm")
)##
## iter imp variable
## 1 1 y x position
## 1 2 y x position
## 1 3 y x position
## 1 4 y x position
## 1 5 y x position
## 1 6 y x position
## 1 7 y x position
## 1 8 y x position
## 1 9 y x position
## 1 10 y x position
## 2 1 y x position
## 2 2 y x position
## 2 3 y x position
## 2 4 y x position
## 2 5 y x position
## 2 6 y x position
## 2 7 y x position
## 2 8 y x position
## 2 9 y x position
## 2 10 y x position
## 3 1 y x position
## 3 2 y x position
## 3 3 y x position
## 3 4 y x position
## 3 5 y x position
## 3 6 y x position
## 3 7 y x position
## 3 8 y x position
## 3 9 y x position
## 3 10 y x position
## 4 1 y x position
## 4 2 y x position
## 4 3 y x position
## 4 4 y x position
## 4 5 y x position
## 4 6 y x position
## 4 7 y x position
## 4 8 y x position
## 4 9 y x position
## 4 10 y x position
## 5 1 y x position
## 5 2 y x position
## 5 3 y x position
## 5 4 y x position
## 5 5 y x position
## 5 6 y x position
## 5 7 y x position
## 5 8 y x position
## 5 9 y x position
## 5 10 y x position
The result is a list containing the main data frame
(data) as well as 10 lists with imputed values.
## [1] "data" "imp" "m" "where"
## [5] "blocks" "call" "nmis" "method"
## [9] "predictorMatrix" "visitSequence" "formulas" "post"
## [13] "blots" "ignore" "seed" "iteration"
## [17] "lastSeedValue" "chainMean" "chainVar" "loggedEvents"
## [21] "version" "date"
Here is my imputed x, for example.
Step 2: Run your main model.
Now that we have the data, we can run the main model on each of them.
I could do it by hand, by subseting the data. In that case, I use the
compleete() function to fill in the missing information for
me. The factor= argument specifies the data set on which I
run my model.
##
## Call:
## lm(formula = y ~ x * ClosedList, data = complete(df0.imp, factor = 1))
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.514 -7.981 -1.242 6.486 75.019
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.2421 0.5118 35.641 <2e-16 ***
## x1 -0.7513 0.8906 -0.844 0.399
## ClosedList1 -0.2615 0.6980 -0.375 0.708
## x1:ClosedList1 1.2847 1.2164 1.056 0.291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.76 on 1421 degrees of freedom
## Multiple R-squared: 0.0008479, Adjusted R-squared: -0.001261
## F-statistic: 0.402 on 3 and 1421 DF, p-value: 0.7516
Another way is to do this in a loop. Here, I use the
with() function to apply the same lm()
function to the data.
The result is a list with many model objects. The model objects are
reported in analyses.
## [1] "call" "call1" "nmis" "analyses"
I can aggregate the results using another function, the
pool() function. It does two things for me. It calculates
the mean regression coefficient among all 10 coefficients. It also
calculates a standard error for me: It is a mix between the standard
error of each model coefficient and the difference between the different
coefficients across models.
To present the results in stargazer, I will have to do a bit of wrangling. I add the model object, but then I specify the coefficients and the standard errors separately. Just as we did with the robust standard errors.
Here, I add both the baseline model and the newly estimated one.
coefs <- list(summary(mod.base)$coefficients[,1],
res$estimate)
ses <- list(summary(mod.base)$coefficients[,2],
res$std.error)
stargazer(mod.base, mod$analyses[[1]],
coef = coefs,
se = ses,
title = "Initial results from mice",
type = "html")| Dependent variable: | ||
| y | ||
| (1) | (2) | |
| x | -1.050 | |
| (1.512) | ||
| ClosedList | -0.464 | |
| (1.248) | ||
| x:ClosedList | 2.945 | |
| (2.237) | ||
| x1 | -0.640 | |
| (1.081) | ||
| ClosedList1 | -0.432 | |
| (0.766) | ||
| x1:ClosedList1 | 1.648 | |
| (1.535) | ||
| Constant | 19.408*** | 18.190*** |
| (0.846) | (0.561) | |
| Observations | 395 | 1,425 |
| R2 | 0.005 | 0.001 |
| Adjusted R2 | -0.003 | -0.001 |
| Residual Std. Error | 10.257 (df = 391) | 10.760 (df = 1421) |
| F Statistic | 0.664 (df = 3; 391) | 0.402 (df = 3; 1421) |
| Note: | p<0.1; p<0.05; p<0.01 | |
I’m a bit underwhelmed by this prediction. Let’s do the predictions with more zeal.
Choice of imputation models and predictors
It is usually best to chose separate models for each variable we want to impute. In the previous example both x and y were imputed using a linear model and the same predictors. This might not be optimal.
I begin by creating a matrix where I specify which variables I want to impute NAs to/from. Here, I remove the y from my set of predictors. I then tell the model where the NAs are. Since Im now left with only x to impute, I opt for a binomial logistic model.
dfNA <- is.na(df0)
#Remove the dependent variable
dfNA[,1] = FALSE
#Recode to categorical variable
df0$x <- as.factor(df0$x)
df1.imp <- mice(df0,
m = 10,
#linear outcome
# method = "norm",
#Where are the NAs
where = dfNA,
printFlag = F
)
mod1 <- with(df1.imp,
lm(y ~
x
* ClosedList)
)
res1 <- pool(mod1)coefs <- list(summary(mod.base)$coefficients[,1],
res$estimate,
res1$estimate)
ses <- list(summary(mod.base)$coefficients[,2],
res$std.error,
res1$std.error)
stargazer(mod.base, mod$analyses[[1]],mod1$analyses[[1]],
coef = coefs,
se = ses,
title = "Initial results from mice",
type = "html")| Dependent variable: | |||
| y | |||
| (1) | (2) | (3) | |
| x | -1.050 | ||
| (1.512) | |||
| ClosedList | -0.464 | ||
| (1.248) | |||
| x:ClosedList | 2.945 | ||
| (2.237) | |||
| x1 | -0.640 | 0.279 | |
| (1.081) | (0.951) | ||
| ClosedList1 | -0.432 | -0.673 | |
| (0.766) | (0.707) | ||
| x1:ClosedList1 | 1.648 | 2.986** | |
| (1.535) | (1.314) | ||
| Constant | 19.408*** | 18.190*** | 17.941*** |
| (0.846) | (0.561) | (0.514) | |
| Observations | 395 | 1,425 | 1,318 |
| R2 | 0.005 | 0.001 | 0.010 |
| Adjusted R2 | -0.003 | -0.001 | 0.008 |
| Residual Std. Error | 10.257 (df = 391) | 10.760 (df = 1421) | 10.799 (df = 1314) |
| F Statistic | 0.664 (df = 3; 391) | 0.402 (df = 3; 1421) | 4.385*** (df = 3; 1314) |
| Note: | p<0.1; p<0.05; p<0.01 | ||
Imputing several variables with different covariates
We can do several imputations at the same time.
I begin by creating “blocks” of covariates. Each block corresponds to a covariate I want to impute. I then create a matrix with dummies for the covariates I want to include in the imputation.
#All NAs
dfNA <- is.na(df0)
#Only two first variables are missing/to be imputed
dfNA[, 3:5] = FALSE
#Two blocks
blocks <- make.blocks(c( "y", "x"))
#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
blocks = blocks)
predictorMatrix## y x ClosedList position EPGroup Age_c Age2_c
## y 0 1 1 1 1 1 1
## x 1 0 1 1 1 1 1
df2.imp <- mice(df0,
m = 10,
#linear outcome
method = c("norm", "logreg"),
#Blocks of variables
blocks = blocks,
#Data frame specifying where imputations should be made
where = dfNA,
#Matrix flagging the predictors
predictorMatrix = predictorMatrix,
printFlag = F
)
mod2 <- with(df2.imp,
lm(y ~
x * ClosedList
)
)
res2 <- pool(mod2) %>%
summary
res2| Dependent variable: | ||||
| y | ||||
| (1) | (2) | (3) | (4) | |
| x | -1.050 | |||
| (1.512) | ||||
| ClosedList | -0.464 | |||
| (1.248) | ||||
| x:ClosedList | 2.945 | |||
| (2.237) | ||||
| x1 | -0.640 | 0.279 | -0.068 | |
| (1.081) | (0.951) | (1.557) | ||
| ClosedList1 | -0.432 | -0.673 | -0.172 | |
| (0.766) | (0.707) | (0.836) | ||
| x1:ClosedList1 | 1.648 | 2.986** | 1.145 | |
| (1.535) | (1.314) | (1.634) | ||
| Constant | 19.408*** | 18.190*** | 17.941*** | 17.961*** |
| (0.846) | (0.561) | (0.514) | (0.687) | |
| Observations | 395 | 1,425 | 1,318 | 1,335 |
| R2 | 0.005 | 0.001 | 0.010 | 0.014 |
| Adjusted R2 | -0.003 | -0.001 | 0.008 | 0.012 |
| Residual Std. Error | 10.257 (df = 391) | 10.760 (df = 1421) | 10.799 (df = 1314) | 10.727 (df = 1331) |
| F Statistic | 0.664 (df = 3; 391) | 0.402 (df = 3; 1421) | 4.385*** (df = 3; 1314) | 6.410*** (df = 3; 1331) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
Display the results
We can use ggpredict() to simulate scenarios for us. The
most efficient way to do this is by writing a loop.
library(ggeffects)
#Create an empty object
predictions = NULL
#Index from 1 to 10 (number of datasets/models)
i = 1
for(i in 1:10){
m <- lm(y ~
x * ClosedList,
data = complete(df2.imp, action = i))
predictions[[i]] <- ggpredict(m,
terms = list("x" = c(0,1),
"ClosedList" = c(0,1))
)
}The function pool_predictions() pools the simulated
results for us by drawing on all 10 models. The outcome is the data
frame we’re used to.
effs <- pool_predictions(predictions)
effs %>%
plot +
ylab("Number of meetings attended") +
xlab("Election-seeker") +
ggtitle("Effect of ambition on committee attendance")Model assessment
Assessing whether we address missing observations appropriately is not easy. We don’t even know the extent of the problem! However, remember that ignoring the problem means you do a listwise exclusion. That’s a choice too.
When we impute values, the risk we are running is often overfitting: We pick variables that maximize our in-sample predictions, but then do a lousy job of predicting out of sample (i.e. our NAs).
In-sample predictions
You can do your usual cross-correlations between observed and predicted values for the imputed variable.
When using mice, there are a few additional functions. For continuous variables, we can plot the density (distribution) of the 10 simulated data (in red) against the observed density (blue).
The xyplot() plots the imputed variable agains whatever
covariate you have used to impute; just to check if the two overlap
appropriately.
Out-of-sample predictions
A last possibility is to leave observations out before making the predictions/picking a model. You can then use the model to impute the artificial NAs and compare.
df0$x_miss <- df0$x
id <- sample(row.names(df0 %>%
filter(!is.na(x))),
100)
df0$x_miss[row.names(df0) %in% id ] <- NA
#All NAs
dfNA <- is.na(df0)
#Only two first variables are missing/to be imputed
dfNA[, 2:6] = FALSE
dfNA %>% head## y x ClosedList position EPGroup Age_c Age2_c x_miss
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [4,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
#Two blocks
blocks <- make.blocks(c( "y", "x_miss"))
#Matrix with predictors
predictorMatrix = make.predictorMatrix(df0,
blocks = blocks)
#Remove x as a predictor
predictorMatrix[,"x"] = 0
predictorMatrix## y x ClosedList position EPGroup Age_c Age2_c x_miss
## y 0 0 1 1 1 1 1 1
## x_miss 1 0 1 1 1 1 1 0
#Recode to factors
df0 <-
df0 %>%
mutate(EPGroup = as.factor(EPGroup),
x_miss = as.factor(x_miss))
df2.amp <- mice(df0,
m = 10,
#linear outcome
method = c("norm", "logreg"),
#Blocks of variables
blocks = blocks,
#Data frame specifying where imputations should be made
where = dfNA,
#Matrix flagging the predictors
predictorMatrix = predictorMatrix,
printFlag = F
)
test <- mice::complete(df2.amp, factor = 3, include = T)
table(test$x, test$x_miss)##
## 0 1
## 0 543 6
## 1 2 253