Let’s explore missing data imputation the intuitive way: by doing it!

Our research question is whether Members of the European Parliament (MEPs) who do not have to cultivate a personal vote, but instead face a party selectorate that cares about policy delivery, spend more time in Parliament when they seek re-election.

The dependent variable Attendance reports the number of committee meetings the MEP has attended, while ClosedList indicates whether the MEP would compete in a party-centered system if (s)he were to seek re-election. FutureInEP describes whether the MEP plans to seek re-election. The “hitch” here is that FutureInEP is mostly missing because it comes from a survey.

How can we address that problem?

For simplicity, we will run a pooled OLS on the data with an interaction term between the MEP’s ambition and the electoral system. You can find the data online: https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda
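To make the model specification concrete, here is a minimal sketch of a pooled OLS with an interaction term. It runs on simulated toy data, not on the MEP data: the column names mirror the variables described above, but the values and the object names (toy, mod_base) are made up for illustration.

```r
# Toy data (simulated, NOT the MEP data); column names mirror the exercise
set.seed(1)
n <- 200
toy <- data.frame(
  ClosedList = rbinom(n, 1, 0.5),
  FutureInEP = rbinom(n, 1, 0.6)
)
toy$Attendance <- 10 + 2 * toy$FutureInEP + 1 * toy$ClosedList +
  3 * toy$FutureInEP * toy$ClosedList + rnorm(n)

# Make most of FutureInEP missing, as with a survey item
toy$FutureInEP[sample(n, 120)] <- NA

# With R's default settings, lm() drops rows with missing values
# (na.action = na.omit), which amounts to listwise deletion
mod_base <- lm(Attendance ~ FutureInEP * ClosedList, data = toy)

nobs(mod_base)  # number of rows actually used in the estimation
coef(mod_base)  # main effects plus the FutureInEP:ClosedList interaction
```

Comparing nobs() with the number of rows in the data is a quick way to see how many observations the missing values cost you.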

Be prepared to share your table, code and thoughts on Padlet: https://padlet.com/siljesynnove/exercise-1-simple-na-fixes-df5ppo9n2x0a4wuc

Exercise 1: Simple fixes.

Consult Gelman and Hill (2007), pp. 532-34, and follow their descriptions of simple fixes.

  1. estimate a baseline model with an interaction term between FutureInEP and ClosedList, using listwise deletion of the observations with missing values. Personally, I used OLS estimation for this one.
  2. use the EPGroup variable to weight the observations in an alternative model.
  3. replace the NAs with the mean of the FutureInEP variable.
  4. replace the NAs with the group mean of the FutureInEP variable. Use EPGroup as the grouping variable.
  5. estimate a model where you include your imputed variable from step 4, but now also control for missingness with a dummy indicator that flags the observations whose FutureInEP value was originally missing (create it before you impute). What do Gelman and Hill say about this technique? When do you think it might be useful?
  6. put all your models in a single table (e.g. with stargazer) and compare. What happened to your estimates? And what about the standard errors? Thoughts?
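The mean, group-mean and missing-indicator fixes in the list above can be sketched as follows. This is one possible approach on simulated toy data, not a solution for the MEP data: the values, the group labels and the names miss, Future_mean, Future_grp and mod_* are assumptions made for illustration. The weighting step is not covered here.

```r
# Toy data (simulated, NOT the MEP data)
set.seed(2)
n <- 300
toy <- data.frame(
  EPGroup    = sample(c("A", "B", "C"), n, replace = TRUE),
  ClosedList = rbinom(n, 1, 0.5),
  FutureInEP = rbinom(n, 1, 0.6)
)
toy$Attendance <- 10 + 2 * toy$FutureInEP + toy$ClosedList + rnorm(n)
toy$FutureInEP[sample(n, 180)] <- NA

# Dummy flagging the originally missing values (create it BEFORE imputing)
toy$miss <- as.numeric(is.na(toy$FutureInEP))

# Overall-mean imputation: fill NAs with the mean of the observed values
toy$Future_mean <- ifelse(is.na(toy$FutureInEP),
                          mean(toy$FutureInEP, na.rm = TRUE),
                          toy$FutureInEP)

# Group-mean imputation: ave() applies the function within each EPGroup
grp_mean <- ave(toy$FutureInEP, toy$EPGroup,
                FUN = function(v) mean(v, na.rm = TRUE))
toy$Future_grp <- ifelse(is.na(toy$FutureInEP), grp_mean, toy$FutureInEP)

# Models on the imputed predictors; the last one adds the missingness dummy
mod_mean <- lm(Attendance ~ Future_mean * ClosedList, data = toy)
mod_grp  <- lm(Attendance ~ Future_grp * ClosedList, data = toy)
mod_ind  <- lm(Attendance ~ Future_grp * ClosedList + miss, data = toy)
```

Note that all three models now use every row, so any change in the standard errors relative to the listwise model partly reflects the larger (but partly made-up) sample.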

Exercise 2: Sampling

  1. Can you fill in the missing parts of the FutureInEP variable by randomly sampling from the observed parts of the variable?

You can use the sample() function in R. Here, I randomly draw 4 values from my x variable. I also specify that the machine is not allowed to draw the same observation twice.

x <- 1:10

# Draw 4 values from x without replacement
sample(x = x,
       size = 4,
       replace = FALSE)

[1] 8 7 5 1

After having randomly sampled, you’d have to replace only the NAs in FutureInEP with your imputed values, while keeping the observed values intact.
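As a rough illustration of that replacement step, here is a sketch on a small made-up vector (not the MEP data). The names is_miss, observed and x_imp are hypothetical.

```r
set.seed(3)

# Toy binary variable with missing values (made up for illustration)
x <- c(1, 0, NA, 1, NA, NA, 0, 1, NA, 1)

is_miss  <- is.na(x)      # TRUE where the value is missing
observed <- x[!is_miss]   # pool of observed values to draw from

# Draw one observed value per missing entry. replace = TRUE is required
# whenever there are more NAs than observed values.
x_imp <- x
x_imp[is_miss] <- sample(observed, size = sum(is_miss), replace = TRUE)

x_imp
```

Since FutureInEP is mostly missing, sampling without replacement would run out of observed values, which is why the sketch samples with replacement.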

  2. Re-estimate the main model with your new, partially sampled predictor. Compare with the previous models. What happened? Why?

Exercise 3: Deterministic imputation

Can you impute the missing values using regression? Here are a few suggestions for predictors: Age (of the MEP), TermsInOffice (of the MEP) and position (of the national party towards EU integration). The variable ParlGovID identifies the national parties, while Nationality and Period flag the MEP’s member state and the period of study (one of 10 semesters).

  1. fit the regression model of your choice

  2. use the predict() function to extract predicted values. Replace the missing values and re-run the main model. Compare.

  3. can you combine your prediction with an element of random sampling?

    • get the standard error of the prediction (e.g. with the argument se.fit = TRUE in predict())
    • then sample from a normal distribution using the rnorm() function

Here, I draw one value from each of two normal distributions: in the first one, the mean is 1 and the standard error is 0.1; in the second, the mean is 2 and the standard error is 0.1.

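# Named mu rather than mean to avoid masking the mean() function
mu <- c(1, 2)
se <- c(0.1, 0.1)

# rnorm() is vectorized: draw i uses mu[i] and se[i]
rnorm(n = 2,
      mean = mu,
      sd = se)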

[1] 1.024479 1.980594
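Putting the pieces of this exercise together, a deterministic and a random regression imputation could look roughly like this. The sketch uses simulated toy data, not the MEP data, and a linear probability model chosen only for simplicity; the names imp_mod, pred, Future_det and Future_rnd are hypothetical, and only two of the suggested predictors are included.

```r
# Toy data (simulated, NOT the MEP data)
set.seed(4)
n <- 300
toy <- data.frame(
  Age           = round(runif(n, 30, 70)),
  TermsInOffice = rpois(n, 1) + 1
)
p <- plogis(2 - 0.04 * toy$Age + 0.3 * toy$TermsInOffice)
toy$FutureInEP <- rbinom(n, 1, p)
toy$FutureInEP[sample(n, 180)] <- NA

# Imputation model, estimated on the rows where FutureInEP is observed
imp_mod <- lm(FutureInEP ~ Age + TermsInOffice, data = toy)

# Predicted values and their standard errors for every row
pred <- predict(imp_mod, newdata = toy, se.fit = TRUE)

is_miss <- is.na(toy$FutureInEP)

# Deterministic imputation: fill NAs with the predicted value
toy$Future_det <- ifelse(is_miss, pred$fit, toy$FutureInEP)

# Random imputation: one normal draw per missing row, centered on the
# prediction. se.fit only reflects uncertainty about the predicted mean;
# pred$residual.scale could be used instead to add residual noise.
draws <- rnorm(n = sum(is_miss),
               mean = pred$fit[is_miss],
               sd = pred$se.fit[is_miss])
toy$Future_rnd <- toy$FutureInEP
toy$Future_rnd[is_miss] <- draws
```

With a binary variable such as FutureInEP, both versions produce imputed values that are not 0 or 1 (and may even fall outside that range), which is worth keeping in mind when you interpret the main model.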

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press.