Let’s explore missing data imputation the intuitive way: by doing it!
Our research question is whether Members of the European Parliament (MEPs) who do not have to cultivate a personal vote, but instead face a party selectorate that cares about policy delivery, spend more time in Parliament when they are re-election seekers.
The dependent variable Attendance reports the number of
committee meetings the MEP has attended, while ClosedList
indicates whether the MEP would compete in a party-centered system if
(s)he were to seek re-election. FutureInEP describes whether
the MEP plans to seek re-election. The “hitch” here is that this
variable is mostly missing; it comes from a survey.
How can we address that problem?
For simplicity, we will run a pooled OLS on the data with an
interaction term between the MEP’s ambition (FutureInEP) and the
electoral system (ClosedList). You can find the data online at
https://siljehermansen.github.io/teaching/beyond-linear-models/MEP.rda.
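Before touching the missing values, it can help to see what the baseline model looks like. Below is a minimal sketch on simulated data: the data frame toy and the numbers used to generate it are made up for illustration and only borrow the variable names from the exercise. With the real data, you would first load the .rda file with load() and use that data frame instead.

```r
# Simulated stand-in for the MEP data (NOT the real data set).
set.seed(1)
n <- 300
toy <- data.frame(
  ClosedList = rbinom(n, 1, 0.5),
  FutureInEP = rbinom(n, 1, 0.6)
)
# Made-up outcome: attendance is higher for ambitious MEPs in closed-list systems.
toy$Attendance <- 20 + 2 * toy$FutureInEP +
  3 * toy$FutureInEP * toy$ClosedList + rnorm(n, sd = 5)

# Make the survey variable mostly missing (200 of 300 rows).
toy$FutureInEP[sample(n, 200)] <- NA

# Pooled OLS with an interaction term.
# By default, lm() silently drops rows with missing values.
mod_base <- lm(Attendance ~ FutureInEP * ClosedList, data = toy)
summary(mod_base)

# How many rows did the model actually use?
nobs(mod_base)
```

Comparing nobs(mod_base) to nrow(toy) shows how much information the default treatment of NAs throws away.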
Be prepared to share your tables, code and thoughts on Padlet: https://padlet.com/siljesynnove/exercise-1-simple-na-fixes-df5ppo9n2x0a4wuc
Consult Gelman and Hill (2007), pp. 532-34, and follow their descriptions of simple fixes.
Run the main model with FutureInEP and
ClosedList. Personally, I used OLS estimation for this
one.
Use the EPGroup variable to weight the observations in
an alternative model:
Create a variable that indicates whether FutureInEP is observed.
Predict that indicator using EPGroup as a predictor.
Use the inverse of the predictions as weights (weights = 1/preds).
Replace NA by the mean of the
FutureInEP variable.
Replace NA by the group mean of the
FutureInEP variable. Use EPGroup as the
grouping variable.
Can you replace the missing values in the FutureInEP
variable by randomly sampling from the observed parts of the
variable?
You can use the sample() function in R. Here, I randomly
draw 4 values from my x variable. I also specify that the
machine is not allowed to draw the same observation twice.
# Draw 4 values from x without replacement
x <- 1:10
sample(x = x,
       size = 4,
       replace = FALSE)
[1] 8 7 5 1
After having randomly sampled, you’d have to replace only the
NAs in FutureInEP with your imputed values,
while keeping the observed values intact.
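The simple fixes listed above could be sketched as follows. This is one possible way of doing it, not the only one. The data are simulated, the group labels "A", "B" and "C" are invented, and I assume that the weighting model predicts whether FutureInEP is observed (so that the weights are the inverse of the predicted probability of response).

```r
# Simulated stand-in for the MEP data (NOT the real data set).
set.seed(2)
n <- 300
toy <- data.frame(
  EPGroup    = sample(c("A", "B", "C"), n, replace = TRUE),  # invented labels
  ClosedList = rbinom(n, 1, 0.5),
  FutureInEP = rbinom(n, 1, 0.6)
)
toy$Attendance <- 20 + 2 * toy$FutureInEP + rnorm(n, sd = 5)
toy$FutureInEP[sample(n, 200)] <- NA
miss <- is.na(toy$FutureInEP)

# 1) Weighting: model the probability of being observed, by EPGroup,
#    then weight the observed rows by the inverse of that probability.
toy$observed <- as.numeric(!miss)
mod_obs <- glm(observed ~ EPGroup, family = binomial, data = toy)
toy$w <- 1 / predict(mod_obs, type = "response")
mod_w <- lm(Attendance ~ FutureInEP * ClosedList, data = toy, weights = w)

# 2) Mean imputation: fill every NA with the overall observed mean.
toy$Future_mean <- ifelse(miss, mean(toy$FutureInEP, na.rm = TRUE),
                          toy$FutureInEP)

# 3) Group-mean imputation: fill every NA with the mean of its EPGroup.
grp_mean <- ave(toy$FutureInEP, toy$EPGroup,
                FUN = function(x) mean(x, na.rm = TRUE))
toy$Future_grp <- ifelse(miss, grp_mean, toy$FutureInEP)

# 4) Random draw: fill the NAs (and only the NAs) with values sampled
#    from the observed part of the variable.
draws <- sample(toy$FutureInEP[!miss], size = sum(miss), replace = TRUE)
toy$Future_rnd <- toy$FutureInEP
toy$Future_rnd[miss] <- draws

# Re-run the main model on one of the imputed variables and compare.
mod_rnd <- lm(Attendance ~ Future_rnd * ClosedList, data = toy)
```

Note that I draw with replacement here: when more values are missing than observed, sample() cannot draw that many values without replacement.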
Can you impute the missing values using regression? Here are a few
suggestions for predictors: Age (of the MEP),
TermsInOffice (of the MEP) and position (of the
national party towards EU integration). The variable
ParlGovID identifies the national parties, while
Nationality and Period flag the MEP’s member
state and the period of study (one of 10 semesters).
Fit the regression model of your choice.
Use the predict() function to extract predicted
values. Replace the missing values and re-run the main model.
Compare the results.
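These steps could look like the sketch below. The data are again simulated, and the imputation model is my own choice for illustration: a linear probability model with Age, TermsInOffice and ClosedList as predictors. Any other model that predict() understands would follow the same pattern.

```r
# Simulated stand-in for the MEP data (NOT the real data set).
set.seed(3)
n <- 300
toy <- data.frame(
  Age           = round(runif(n, 30, 70)),
  TermsInOffice = rpois(n, 1) + 1,
  ClosedList    = rbinom(n, 1, 0.5)
)
# In this toy example, younger MEPs are more likely to want to stay.
toy$FutureInEP <- rbinom(n, 1, plogis(3 - 0.06 * toy$Age))
toy$Attendance <- 20 + 2 * toy$FutureInEP +
  3 * toy$FutureInEP * toy$ClosedList + rnorm(n, sd = 5)
toy$FutureInEP[sample(n, 200)] <- NA
miss <- is.na(toy$FutureInEP)

# Step 1: fit the imputation model (lm() uses the observed rows only).
mod_imp <- lm(FutureInEP ~ Age + TermsInOffice + ClosedList, data = toy)

# Step 2: predict for all rows, then fill in the NAs only.
preds <- predict(mod_imp, newdata = toy)
toy$Future_reg <- toy$FutureInEP
toy$Future_reg[miss] <- preds[miss]

# Step 3: re-run the main model and compare with the default (NA-dropping) fit.
mod_base <- lm(Attendance ~ FutureInEP * ClosedList, data = toy)
mod_reg  <- lm(Attendance ~ Future_reg * ClosedList, data = toy)
# Both models list their coefficients in the same order.
cbind(listwise = coef(mod_base), regression = coef(mod_reg))
```

With a linear model, the imputed values are predicted probabilities rather than 0s and 1s, and they can in principle fall outside the 0-1 range.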
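Can you combine your prediction with an element of random sampling?
Ask predict() to also return the standard error of each prediction (se.fit = TRUE).
Draw the imputed values using the rnorm() function.
Here, I draw once from each of two normal distributions: the first has mean 1 and standard deviation 0.1, the second has mean 2 and standard deviation 0.1. In the imputation, the standard errors of the predictions play the role of the standard deviations.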
# One mean and one standard deviation per draw
mean <- c(1, 2)
se <- c(0.1, 0.1)
rnorm(n = 2,
      mean = mean,
      sd = se)
[1] 1.024479 1.980594
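Putting predict() and rnorm() together, the imputation with a random element could be sketched like this. As before, the data are simulated and the imputation model (a linear model with Age and TermsInOffice) is only an assumption made for the example.

```r
# Simulated stand-in for the MEP data (NOT the real data set).
set.seed(4)
n <- 300
toy <- data.frame(
  Age           = round(runif(n, 30, 70)),
  TermsInOffice = rpois(n, 1) + 1,
  ClosedList    = rbinom(n, 1, 0.5)
)
toy$FutureInEP <- rbinom(n, 1, plogis(3 - 0.06 * toy$Age))
toy$Attendance <- 20 + 2 * toy$FutureInEP + rnorm(n, sd = 5)
toy$FutureInEP[sample(n, 200)] <- NA
miss <- is.na(toy$FutureInEP)

# Imputation model, fitted on the observed rows.
mod_imp <- lm(FutureInEP ~ Age + TermsInOffice, data = toy)

# Predictions AND their standard errors, for the rows with missing values.
pr <- predict(mod_imp, newdata = toy[miss, ], se.fit = TRUE)

# One random draw per missing value, centered on its own prediction.
draws <- rnorm(n = sum(miss), mean = pr$fit, sd = pr$se.fit)

# Fill in the NAs only; observed values stay intact.
toy$Future_sto <- toy$FutureInEP
toy$Future_sto[miss] <- draws

# Re-run the main model on the imputed variable.
mod_sto <- lm(Attendance ~ Future_sto * ClosedList, data = toy)
```

Keep in mind that se.fit only reflects the uncertainty in the predicted mean. If you also want the draws to carry the unexplained variation between MEPs, the residual standard deviation that predict() returns as pr$residual.scale is one place to start.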