Linear models: Interpretation of point estimates

In this session, we will be exploring how to estimate and interpret linear models. The concepts and procedures we use today may be slightly complicated for the simple example we are working with, but as we move over to the non-linear relationship between x and y in Generalized Linear Models (GLMs), they will come in very handy.

Introduction

Main methodological takeaways

Gelman and Hill (2007) (chapter 3) claim that regression is at its core a description of the data. Potential causal claims also depend on how we design our study: how do we rule out alternative explanations. They therefore explain that linear models are in essence a comparison of differences in means. However, we can also see them as descriptions of correlations in the data.

Here, we will use linear models (OLS) first to estimate group averages. Specifically, we will explore the distribution of the dependent variable within groups of predictors and compare their means. These are instances where OLS is always do-able.

We then move to use the model for inference – predicting outcomes from combinations of (metric) variables. Our focus is then on interpretation. In the application today, the interpretations often involve a detour; we could obtain the same thing in simpler ways. However, when we move to explore GLMs where the relationship between x and y is non-linear, these skills will be useful.

Main substantial take-aways:

  • the four stages of quantitative research design
  • how to combine parameters and variables to make predictions
  • how to use prediction for interpretation
  • three stages of interpretation
  • how to apply this to interaction models

I will introduce a few additional R-functions:

  • linear regression

    • (base R): lm(), predict() , data.frame()
    • ggplot2: geom_line(); ylim()
    • dplyr: lag() (of minor importance)
  • repetition

    • dplyr: group_by(), reframe(), mutate()
  • stargazer() for making summary tables (stargazer package)

  • mvrnorm() for making simulations (MASS package)

  • apply() for applying a function on (base R)

  • geom_ribbon() and geom_errorbar() for ploting uncertainty (ggplot2 package).

Make sure you work through the examples here, so that you are prepared for our next session.

Political science example

We will be working on some of the insights from (Hermansen2024?) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.

For a long time, there was a claim in the academic literature that what MEPs do in office, doesn’t matter for their re-election; they are not held electorally accountable. In this paper, we show that MEPs haven’t really gotten that memo. They behave as if their reelection depends on what they do. Voters in all democracies are confronted with an informational problem. They don’t know what their representatives do. It is up to the representatives to prove their usefulness to their voters. They can do that in different ways.

We make the assumption that most MEPs will seek (re-)election in the future and that they can increase their likelihood of that happening by investing in the right type of activities. The claim we explore today, is that some MEPs have an incentive to cultivate a personal vote, while others do not. Specifically, the electoral rules for EP elections vary by member state. MEPs elected from candidate-centered systems will gain a seat in parts due to the personal votes they get. As a result, they have an incentive to invest in constituency activities; they invest in their local presence in order to get a name recognition. By contrast, MEPs elected from party-centered systems can ride on the party label, and so will invest less.

The data

We will work with a subset of the replication data for the article. Our data lists all MEPs present in parliament in the spring semester of 2014.

I start by fetching the R packages we will use from the library.

#Data wrangling
library(dplyr); 
#Visualization
library(ggplot2)

As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.

We begin by loading in the data.

If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory.

download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2014.rda",
              destfile = "MEP2014.rda")

Now, I can load it into R and rename the object to my liking (df).

#Load in
load("MEP2014.rda")

#Rename
df <- MEP2014

Descriptive statistics

I usually explore the data beforehand: The univarate distribution of the relevant variables, their bivarate relationship. Here, I’m a bit skimpy.

Table of descriptive statistics

df_desc <- 
  df %>%
  dplyr::select(
    #Local staff size reported for the MEP in 2014
    LocalAssistants,
    #Electoral system
    OpenList,
    #Labor cost in country of origin
    LaborCost,
    #Size of national party in national parliament
    SeatsNatPal.prop) %>%
  #redefine as data frame
  as.data.frame()
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(df_desc,
          #Summary statistics, please!
          summary = T,
          #Remove the message about citations
          header = F,
          type = "html", #You want to change to "text"
          title = "Descriptive statitics")
Descriptive statitics
Statistic N Mean St. Dev. Min Max
LocalAssistants 739 2.915 3.212 0.000 39.500
OpenList 739 0.471 0.499 0 1
LaborCost 739 22.824 11.207 3.800 40.600
SeatsNatPal.prop 722 0.261 0.182 0.000 0.668

Distributions

Local staff size is numerical, so a histogram will do the trick.

df %>%
  ggplot +
  geom_histogram(aes(LocalAssistants))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see that the variable is bounded as zero (none can have negative number of local staffers). And skewed upwards; a Polish MEP has gone overboard with his hiring. For the time being, I make note, but don’t address it yet.

Electoral system is categorical (binary), so I opt for a barplot. Here, I see that I’ve got plenty of observations in both groups.

df %>%
  ggplot +
  geom_bar(aes(OpenList)) 

Both my x and my y variable have plenty of variation; thats a good sign.

Linear regression : the concept

The analytical process in quantitative political sciences looks like this

  1. theorization: express it as an equation \(\rightarrow\) a regression model
  2. data collection: draw on your observation of the world \(\rightarrow\) a data object
  3. test the theorized relationships: run the model \(\rightarrow\) parameter size and significance
  4. interpretation: use the applied theory to describe the world \(\rightarrow\) predictions

Let’s exemplify with a numeric outcome (Number of local assistants an MEP has) variable and a binary predictor (Electoral system).

A comparison of means

Gelman and Hill (2007) take the approach that linear regression is, at its core, just a comparison of group means. The easiest way to see this, is to do it.

Here, I want to compare two group means; the local staff size among MEPs (y) from party- and candidate-centered systems respectively (x).

tab <-
  df %>%
  #Group by values of x
  group_by(OpenList) %>%
  #Group mean of y
  reframe(mean_y = mean(LocalAssistants))
tab

We can see that the average staff size in candidate-centered systems is above 3 assistants, while party-centered systems have above 2 staffers per MEP on average.

How large is this difference?

tab <- 
  tab %>%
  #remove the grouping; same operation on both observations
  ungroup %>%
  #lag() reports the value of the previous observation; here, I calculate the difference
  mutate(diff = mean_y - lag(mean_y))
tab

We see that MEPs from candidate-centred systems have on average 1 more local staffers on their pay-roll.

Now, we know that

  • the mean of y in party-centered systems is 2.468
  • the average difference between party- and candidate-centered systems is 0.949.
  • the mean of y in candidate-centered systems is 2.468 + 0.949 = 3.417

This is parallel to how we interpret the regression coefficients in a linear model. Of course, if the predictor is continuous, we’ll have many such group averages. R would then have to fit a regression line so as to minimizes the errors (which is what the Ordinary Least Square estimation does).

Regression

We can also make the estimation by running a regression model. The results are often reported as a regression table where we find key information.

Effect of electoral system on local staff size among MEPs (results from a linear model)
Dependent variable:
LocalAssistants
OpenList 0.949***
(0.234)
Constant 2.468***
(0.161)
Observations 739
R2 0.022
Adjusted R2 0.020
Residual Std. Error 3.179 (df = 737)
F Statistic 16.396*** (df = 1; 737)
Note: p<0.1; p<0.05; p<0.01

Equations are expressions of a theory

Equations are a way to express how our political science theory links different phenomena together. We can then use data to test whether these links are present. Thus, all statistical models can be expressed as equations.

\(y = a + bx + \epsilon\)

  • data: \(y\) and \(x\)
  • parameters: \(a\) and \(b\), \(\epsilon\)

Variables: the observed data

We often make explicit the indexing of the data when we describe a specific model. This is useful for understanding what data structure the model leverages.

\(y_i = a + bx_i + \epsilon_i\)

\(i\) is in fact a variable in its own right: It runs from 1 to the size of our data (\(N\)).

These indexes are useful when we read models that leverage complex data structures, but also for writing loops.

We can write the equation in slightly less generic terms by giving our variables a real name.

\(Local\:Assistants_i = a + b \times Open\:List_i + \epsilon_i\)

Let’s call N, the number of observations in our data frame. The function dim() reports first the number of observations, then the number of variables.

N <- dim(df)[1]
N
## [1] 739

It means that we can give i which ever value we want between 1 and 739. Here, I give it 1.

i <- 1

By indexing, I can see the first observation in my data.

df$LocalAssistants[i]
## [1] 2

The first MEP has 2 local staffers on his pay-roll.

Parameters: estimated quantities

Our regression coefficients are “parameters”. In contrast to the data, the parameters are estimated quantities. In fact, they draw on the data and the social science theory (equation).

We can rewrite our equation to reflect the results from the regression.

\(y = 2.468 + 0.949 x + \epsilon\)

The intercept (a) reports the mean value of the outcome (y) when all the predictors are 0 (x = 0). Here, the predicted local staff size among MEPs from party-centered systems (i.e. the reference level) is 2.468 employees.

The slope (b) reports the difference between the group averages of the outcome (y) when x increases by one unit. Here, the difference in mean local staff size among MEPs from party- and candidate-centered systems, respectively is 0.949 employees.

Prediction: Parameters and data

We can use the model results to make predictions. Even if do not intend to use the model to do forecasting (predicting out of sample), the predictions will help us interpret and understand what we have found.

Once the model is estimated, we can fill in all the values of the equation. We combine the parameters and data by assigning values to our predictors. We can follow two approaches when filling in the data:

  • observed data: fill in with the entire data set.

This is what we do when we make in-sample predictions for model assessment. These in-sample predictions allow us to calculate the residuals and generate model statistics.

  • scenarios: fill in with hypothetical variable values

This is what we do when we interpret the results as well as when we do out-of-sample predictions. We’ll mostly focus on the former.

In our example from the European Parliament, I can make two scenarios. I can make predictions for the staff situation when MEPs compete in candidate-centered systems (x = 1), as well as when they compete in party-centered systems (x = 0)

\(y = 2.468 + 0.949 \times 0 = 2.468\)

\(y = 2.468 + 0.949 \times 1 = 3.417\)

We can interpret the data in five ways (sometimes incrementally). Going through these three steps become increasingly important as we move on to non-linear effects. In linear regression, this is the case when we recode the predictors to be curvilinear or have interactions. In GLMs, all effects are non-linear.

mod2 <- lm(LocalAssistants ~
             OpenList
           + LaborCost,
           df)
summary(mod2)
## 
## Call:
## lm(formula = LocalAssistants ~ OpenList + LaborCost, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.492 -1.938 -0.414  1.078 35.001 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.12659    0.28612  14.422  < 2e-16 ***
## OpenList     0.82883    0.22784   3.638 0.000294 ***
## LaborCost   -0.07020    0.01015  -6.913 1.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.083 on 736 degrees of freedom
## Multiple R-squared:  0.08141,    Adjusted R-squared:  0.07891 
## F-statistic: 32.61 on 2 and 736 DF,  p-value: 2.689e-14

1. Direction and significance

We can eyeball the regression output and observe that the slope coefficient goes in the right direction (given my hypothesis). The effect of OpenList is positive; MEPs from candidate-centered systems tend to hire more local staffers. I can also see that this effect is unlikely to be generated by accident. The standard error is less than half as big as the slope coefficient, the t-value is above 1.96 (i.e. 95% confidence interval) and the p-value is below 0.1 (less than 5% of the confidence overlaps zero).

2. Marginal effect: the regression coefficient

The marginal effect (or the “relative” effect) refers to the relative change in y when x increases by one scale unit. When we refer to the marginal effect, we do not take into account the value of the other predictors in the model. This is the slope of the regression line. In a linear regression with a linear effect, this slope is constant. It increases at the same rate regardless of the values of the intercept and the other predictors.

Here, the slope is 0.949.

Gelman and Hill (2007) highlight that political scientists alternate between two potential interpretations:

  • descriptive/comparisons of means: we can make interpretations that simply refers to the differences in means. I.e. we can say that MEPs’ local staff size is on average 0.949 higher in candidate-centred systems (compared to party-centred systems).

  • causal/relative effect: We can also interpret this in a causal framework by stating that changing the electoral system (a one-unit increase in x) increases the number of local employees MEPs hold. Now, we approach what King, Tomz, and Wittenberg (2000) label the “counterfactual”.

\(\Rightarrow\) If Denmark changed from a candidate- to party-centered system, this would lead to a 0.949 average decrease in staff size.

This is a bold claim, and usually requires you to rely on descriptive statistics and your political knowledge to assess if such a change is realistic and how large a potential change would be.

3. Predicted effects

The third step in an interpretation is to create a scenario where you fill in hypothetical values to the predictors, then fill in the parameters from the equation to calculate predictions. In King, Tomz, and Wittenberg (2000), this corresponds to step 1 and 2 of their simulation.

Here, I construct one scenario. Specifically, I create a data frame where each predictor gets assigned a value. We should be mindful of the value we assign to all the control variables as well. This is particularly important when effects are non-linear (more on that in March).

Specifically, I set the control variable (LaborCost) to its mean, but decide to look at the predictions for the “typical” candidate-centered system (OpenList == 1),

#Create a data frame with hypothetical data
scenario <- data.frame(OpenList = c(1),
                       LaborCost = mean(df$LaborCost, na.rm = T)) 
#Look at the data; there's only one observation/line
scenario

I can now use the regression coefficients to calculate predictions. I can do this by hand or use the predict() function in R. It takes two arguments: the model object (for the parameters/regression coefficients) and a data frame with scenarios.

scenario <- 
  scenario %>%
  #Mutate creates a new variable, predict predicts from my scenario
  mutate(LocalAssistants_pred = predict(mod,
                                        newdata = scenario))
scenario

I find that coming from a country with average labor cost, an MEP in a candidate centered system is expected to hire 3.4166667 local staffers.

4. First-differences:

While we’ve seen that the effect is statistically significant and in the expected direction. We may ask how big this effect is in substantive terms, though. An increase of 0.8288335 staffers is a lot if MEPs generally don’t have any local assistants, but isn’t really much if they tend to have 100 staffers.

The substantive effect has do be assessed in context by drawing on descriptive statistics. It depends on two things: the elasticity of the variable (how much does x really vary/change) and the context (what is the reference level). This is where the first difference and visualizations come in.

The “first difference” is a way to refer to the substantive effect. You pick two scenarios, predict their outcome, then calculate the difference in expected outcomes between the two scenarios.

There are several ways to pick the scenarios parallel to the two types of interpretations (Ward and Ahlquist 2018):

  • construct typical scenarios draw on the data to construct typical examples. Here, we might even let the other predictors have different values in the different scenarios. This is the less common approach, but it is useful if we want to compare dynamics between sub-populations.

  • all-else-equal-scenarios let the theroetically interesting predictor experience a typical change, while holding all other variables constant. This would mean you should know the descriptive statistics. As a rule of thumb, there is little to gain from making predictions in an area of x where you have no observations. At this point, you will set the other predictors (control variables) to a fixed – often typical – value (e.g. mean, mode, median…) as well.

Here, I let the electoral system change, but hold the control variable constant at its mean.

#Create a data frame with hypothetical data
scenario2 <- data.frame(OpenList = c(0, 1),
                        LaborCost = mean(df$LaborCost, na.rm = T))

scenario2 <- 
  scenario2 %>%
  #Mutate creates a new variable, predict predicts from my scenario
  mutate(LocalAssistants_pred = predict(mod2,
                                        #Dot signifies the scenario2
                                        newdata = .))

#Look at the data; there are two observations/lines
scenario2

We can now calculate the first difference between the two predictions.

scenario2 <- 
  scenario2 %>%
  #Calculate the first difference
  mutate(first_difference = LocalAssistants_pred - lag(LocalAssistants_pred))
scenario2

Since this is a linear model and I’m still considering the same change in electoral system, the first difference is the same as the marginal effect. This will change as we move on, though.

\(\Rightarrow\) Here, we’d say that the average local staff size increases from 2.524 to 3.353 employees when we go from studying party-centered to candidate-centered electoral systems. The first difference is 0.8288335.

5. Graphical display: scenarios on speed

A last step in the interpretation is to rely on the predictions from the all-else-equal scenarios to calculate the absolute effect of the predictor. That is, you let the predictor of interest slide from low to high (e.g. its minimum to its maximum value), while holding all other variables constant.

In this example, I let the labor cost vary, but hold the electoral system constant to candidate-centered.

scenario3 <- data.frame(
  #Hold constant
  OpenList = 1,
  #Let vary
  LaborCost = min(df$LaborCost): max(df$LaborCost))

scenario3 <- 
  scenario3 %>%
  #Make prediction
  mutate(preds = predict(mod2, newdata = scenario3))

#Check the data
scenario3 %>% head

I now have my x-values and my expected y-values and can place them in a coordinate system to make an effect plot.

#Data object == my scenario
scenario3 %>%
  #ggplot blanck sheet
  ggplot +
  #geometric element: a line
  geom_line(
    #coordinates
    aes(
      x = LaborCost,
      y = preds
    )
  ) +
  #A title
  ggtitle("Effect of labor cost on size of local staff among MEPs")

More interpretation: scenarios and plots of linear effects

The interpretation is usually much more fun that what we just did. We can calcuclate the uncertainty around our predictions (King, Tomz, and Wittenberg 2000) and make more substantively interesting scenarios (Ward2017?).

The ggeffects package allow you to simulate the predictions using the variance-covariance matrix as recommended by King, Tomz, and Wittenberg (2000).

Automatic

Visualization when the predictor is METRIC

When the predictor is metric, it often makes sense to create an effect plot, a regression line where x varies from low to high (i.e. scenarios on speed) and y reports the predicted value (expected number of local assistants).

I fetch the ggeffects package from the library (if you haven’t installed it, you’d have to do that first.).

library(ggeffects)

The function we’re interested in is the ggpredict(). The function needs to know which model object we’re working with (first argument), and it is useful to specify which variable we want to let vary. By default, the function sets all the other variables to their typical value in the empirical distribution.

eff <- ggpredict(mod2,
                 terms = "LaborCost")

The output is a ggeffect-type of “data frame”. We can look at it.

eff

The printout shows the varying values of the x-value, the predicted number of local assistants and the confidence interval.

eff %>%
  plot

R automatically knows how to plot these scenarios. If the predictor is defined as “numeric”, R will plot an effect plot for you surrounded by the simulated confidence interval.

Visualization when the predictor is CATEGORICAL

Let’s do the same with a scenario where the predictor is categorical: The electoral system. When the predictor is categorical, it makes more sense to plot it using a coefplot. If I redefine the model to a categorical value, R will create a coefplot for me automatically.

#Recode to a categorical variable using ifelse
df <- df %>%
  mutate(electoral_system = if_else(OpenList == 1, 
                                    "Candidate-centered",
                                    "Party-centered"))
#Reestimate the model with the new variable
mod3 <- lm(LocalAssistants ~ electoral_system + LaborCost, df)

#Run through ggpredict
eff.electoral.system <- ggpredict(mod3, terms = "electoral_system")

#Plot it
eff.electoral.system %>% plot

A more realistic counterfactual

Ward and Ahlquist (2018) recommend paying attention to the scenarios we create. If I were to make a counterfactual for Danmark, for example, I’d have to check what Denmark really looks like. This is where descriptive statistics and subsetting come in.

Let’s consider an interpretation where we’d ask, what the local staff would look like for Danish MEPs if Denmark was to change its electoral system from candidate-centered to party-centered. To do so, we should check what the labor cost in Denmark is.

scenario_dk <-
  df %>%
  #Select observations with values Nationality == "Denmark"
  filter(Nationality == "Denmark") %>%
  #Select the relevant variables; since there are other packages with select(), I specify the package
  dplyr::select(Nationality, LaborCost) %>%
  #Only distinct/unique observations, but keep all varibles; note the dot in .keep_all
  distinct(., .keep_all = T)
scenario_dk

We can specify our scenario using the condition= argument in the ggpredict() function.

eff.dk <- ggpredict(mod3,
                    terms = "electoral_system",
                    condition = list("LaborCost" = 40.6))

eff.dk %>% plot

We can pull out these descriptive statistics from the eff.dk object. I retransform it to a conentional data frame. Now, I have the confidence interval and can report this in the text, following the recommendation by (Gelman2000?).

df.eff.dk <- 
  eff.dk %>%
  #Define as data frame
  as.data.frame()

Danish MEPs competing in the current candidate-centered system keep on average 2 local assistants on their payroll, although this is likely to vary between 1.6 and 2.6.

If the electoral incentives had been different – if they had competed on the party label – this number would have been higher with the average MEP paying for 1.3 local assistants, with a likely variation between 0.8 and 1.7.

First difference

Unfortunately, the ggeffects package does not simulate first difference for us. You may calculate the first difference from the point estimates (predicted) in the eff.dk object, but you’d have to revert to the traditional confidence interval or go all in on manual simulation to get the simulated uncertainty.

#first difference
first_difference <- eff.dk$predicted[1]- eff.dk$predicted[2]
#Standard error of two samples: take the root of the sum of the two variances
std_error <- sqrt(eff.dk$std.error[1]^2 + eff.dk$std.error[2]^2)
std_error
## [1] 0.3424799
#Two standard deviations == 95% confidence interval
variation <- std_error * 1.96
variation 
## [1] 0.6712606
#Confidence bounds
high <- first_difference + variation
low <- first_difference - variation

In the event of a change from a candidate-centered to a party-centered system in Denmark, the average size of the local staff size among MEPs would decrease by 0.8288335, give or take 0.6712606.

An alternative is to simulate by hand. I provide code for this in the following. However, for many of you, this may be too much of a step up for the time being.

Simulation “by hand”

The work process for interpretation is as follows:

  1. Estimate the model and store the coefficients

  2. Create a dataset with hypothetical scenarios: This means creating a dataset where all variables have a probable/typical value. Then let the variable(s) you want to illustrate the effect of vary.

  3. Predict y.

  4. Interpret Insert the predicted y-values into the hypothetical dataset so that you have everythingin one place. Now, you can interpret using the three Danish Ts (tekst, tal, tegning). The interpretation focuses on the predicted value of the dependent variable.

    1. Numbers what are the different predicted values?
    2. Text Compare single scenarios and make use of plain English. E.g. what is the difference between two predicted \(y\)s?
    3. Plot the results. The choice of graphics depends on the measurement level of the variable you want to illustrate. The effect of a categorical/binary variable can be best illustrated with a coefficient plot, while an effect plot (a line) is best for continuous variables.

Extract the coefficients

I create a vector of coefficients.

coefs <- mod$coefficients

Create the data for the scenaros

I make a data frame with the scenarios I want. Be aware that you will have to make data for the intercept too.

newdata <- data.frame(
  Intercept = 1,
  OpenList = c(0,1))
newdata

Predict outcomes

Now, I can predict the y values by multiplying the regression coefficients with the data. We can use matrix manipulation.

To that, I have to “flip” (transpose) the matrix, so that the first coefficient in my vector is multiplied with the first column in my data etc.

preds <- coefs %*% t(newdata)
preds 
##          [,1]     [,2]
## [1,] 2.468031 3.416667
  1. Numeric interpretation

The predicted value for scenario 1 is 2.4680307, and the predicted value for scenario 2 is 3.4166667. The difference (in the words of Ward and Ahlquist (2018), the “first difference”) is 0.948636.

  1. Plain English interpretation. MEPs from party-centered systems keep on average 2.4680307 staffers on their payroll in their member state, while MEPs from candidate-centered have a staff of preds[2] people. That is, MEPs from candidate-centered systems keep on average 1 more staffer on their local team.

Here is a bold interpretation. If we were to make all MEPs compete in candidate-centered systems, the European Parliament would have to pay for 370.9166667 additional local staffers. How did I come up with this number?

  1. Plot it.

Start by making a data set for interpretation. We will also use it for graphical display, just as we did before.

#Flip the matrix
preds <- t(preds)

#Add it to your newdata object (and rename it)
dfp <- cbind(newdata, 
             "Staff size" = preds)

dfp

The table is useful for your numeric and english interpretations. It is also useful for graphical display.

ggplot(dfp) +
  geom_point(aes(x = OpenList,
                 y = `Staff size`))

These are point predictions. It’s the average predicted effect. However, we usually want to know the uncertainty of our estimate.

That’s why I simulate the results. The MASS package uses information about the distribution of data/coefficient contained in our model object to make simulations. The purpose is to explore different scenarios by simulating the outcomes of our statistical model.

We may of course assume that the margins of error (the variation) are the same regardless of where you are in the prediction, and that our best estimate is an average value + a prediction interval. However, this is not always the case, especially when moving to GLMs (e.g. logit). In that case, we can simulate outcomes.

I perform the simulation in the MASS package, so I start by loading it from the library. If you haven’t used it before, you need to install it first (install.packages("MASS")).

We can either fetch the package in the library or make a little “pipe” by fetching only the one function we need in the MASS package without actually ever loading it MASS::mvnorm(). Here, I do the latter. The function draws from a multivariate normal distribution, using information contained in our model object.

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod)
)

The data stored in coef.sim are variations around our coefficients. We can use these coefficients as we did before. Now, we get many approximate predictions from each scenario we made.

#See the 10 first simulations
coef.sim[1:10,]
##       (Intercept)  OpenList
##  [1,]    2.645139 0.7314640
##  [2,]    2.498171 0.8618712
##  [3,]    2.415577 0.9338460
##  [4,]    2.619244 0.9425712
##  [5,]    2.920549 0.4987225
##  [6,]    2.387520 1.0723701
##  [7,]    2.543787 0.8019200
##  [8,]    2.535946 0.8418666
##  [9,]    2.333618 1.1298331
## [10,]    2.400752 0.6228863

Now, we can go back to our old workflow. First, make predictions using matrix calculations.

preds <- coef.sim %*% t(newdata)
preds[1:10,]
##           [,1]     [,2]
##  [1,] 2.645139 3.376603
##  [2,] 2.498171 3.360042
##  [3,] 2.415577 3.349422
##  [4,] 2.619244 3.561815
##  [5,] 2.920549 3.419272
##  [6,] 2.387520 3.459890
##  [7,] 2.543787 3.345707
##  [8,] 2.535946 3.377813
##  [9,] 2.333618 3.463451
## [10,] 2.400752 3.023638

Each column in the dataset is now a scenario. We can therefore calculate the summary statistics we’re interested in. Instead of using the mean and standard deviation/error, let’s use the median and the 95% most common predicted values.

dim(preds)
## [1] 10000     2
#Scenario 1
quantile(preds[,1], probs = c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 2.155102 2.469130 2.784512
#Scenario 2
quantile(preds[,2], probs = c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 3.079370 3.416765 3.754005

I can of course keep these calculations and bind them together in a vector. However, I can also do this in one go.

To do so, I’m using an old classic from base R. apply() applies whatever function I want to a matrix.

pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #What a function?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

The output is a data frame.

pred_summary
##           [,1]     [,2]
## 2.5%  2.155102 3.079370
## 50%   2.469130 3.416765
## 97.5% 2.784512 3.754005

I now go back to my old workflow and bind this to my initial data with the scenarios. Be mindful, that you will have to flip the data frame again.

newdata
pred_summary <- t(pred_summary)
pred_summary
##          2.5%      50%    97.5%
## [1,] 2.155102 2.469130 2.784512
## [2,] 3.079370 3.416765 3.754005
#Column bind
dfp2 <- cbind(newdata,
              pred_summary)

Now, you have a data frame with interpretable values. Here, we might say that we are likely to observe MEPs with a local staff ranging between 2 and 3 employees.

dfp2

You can also plot these scenarios.

Plot your scenarios

Coefplots

When my x-variable is categorical – or in any way not continuous – a coefficient graphic is better. Here, I illustrate the median predicted effect of each scenario on the y-axis, surrounded by its 95% most likely alternative outcomes (margin of error).

#Plot predicted values
library(ggplot2)

#Add a new categorical variable for nicer labels
dfp2$`Electoral system` <- c("Candidate-centred",
                             "Party-centred")

ggplot(dfp2) +
  #Plot the points
  geom_point(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = `Electoral system`)
  ) +
  #Add error bars
  geom_errorbar(aes(
    #Different scenarios
    x = `Electoral system`,
    #Where the bar starts
    ymin = `2.5%`,
    #Where the bar ends
    ymax = `97.5%`,
    #Remove the ends of the bars
    width = 0)
  ) +
  ylab("Local staff size") +
  ggtitle(label = "The effect of electoral system on MEPs' local investment") +
  ylab("Local staff") +
  xlab("")+
  #Use a different "theme" as the basis for the aesthetics
  theme_light() 

Note: Unfortunately, the MASS package contains some functions that have the same name as functions in dplyr. You will receive a message about this when loading the package. In this case, the select() function is overwritten by MASS. If you still want to use the select() function from dplyr, you can specify this in your code as dplyr::select().

Your turn

  • Can you do the same with model 2? Predict the local staff size for MEPs as the labor cost increases.

This time, you will have many scenarios. To plot the uncertainty, you will also use the geom_ribbon rather than the errorbars.

#Scenario
newdata <- data.frame(
  Intercept = 1,
  OpenList = 1,
  #Create a vector using values from the descriptive stats.
  LaborCost = seq(from = 4, to = 40, by = 0.1))

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod2$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod2)
)

#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulation
pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #What a function?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))


ggplot(dfp3) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = LaborCost)
  ) +
    #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = LaborCost),
    alpha = 0.3) +
  theme_minimal()

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge ; New York: Cambridge University Press.
King, Gary, Michael Tomz, and Jason Wittenberg. 2000. “Making the Most of Statistical Analyses: Improving Interpretation and Presentation.” American Journal of Political Science 44 (2): 341–55.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press.