Linear models: Interpretation

In this session, we will be exploring how to estimate, present and interpret linear models. The concepts and procedures we use today may be slightly complicated for the simple example we are working with, but as we move over to the non-linear relationship between x and y in GLMs, they will come in very handy.

Introduction

Main methodological takeaways

We will use linear models (OLS) first to estimate group averages. Specifically, we will explore the distribution of the dependent variable within groups of predictors and compare their means. These are instances where OLS is always appropriate.

We then move to use the model for inference – predicting outcomes from combinations of (metric) variables. Our focus is then on interpretation. In the application today, the interpretations often involve a detour; we could obtain the same thing in simpler ways. However, when we move to explore GLMs where the relationship between x and y is non-linear, these skills will be useful.

I will introduce a few additional R-functions:

  • stargazer() for making summary tables (stargazer package)
  • mvrnorm() for making simulations (MASS package)
  • apply() for applying a function over the rows or columns of a matrix (base R)
  • geom_ribbon() and geom_errorbar() for plotting uncertainty (ggplot2 package).

The rest of the R code implements what we have already encountered in the homework (Hermansen 2023) and in the last R session (notebooks online). Make sure you work through the examples here, so that you are prepared for our next session.

Substantive topic (the research)

We will be working on some of the insights from Hermansen and Pegan’s (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.

For a long time, the academic literature claimed that what MEPs do in office doesn’t matter for their reelection; they are not held electorally accountable. In this paper, we show that MEPs haven’t really gotten that memo: they behave as if their reelection depends on what they do. Voters in all democracies are confronted with an informational problem. They don’t know what their representatives do. It is up to the representatives to prove their usefulness to their voters. They can do that in different ways.

We make the assumption that most MEPs will seek (re-)election in the future and that they can increase their likelihood of that happening by investing in the right type of activities. The claim we explore today is that some MEPs have an incentive to cultivate a personal vote, while others do not. Specifically, the electoral rules for EP elections vary by member state. MEPs elected from candidate-centered systems will gain a seat in part due to the personal votes they get. As a result, they have an incentive to invest in constituency activities; they invest in their local presence in order to build name recognition. By contrast, MEPs elected from party-centered systems can ride on the party label, and so will invest less.

The data

We will work with a subset of the replication data for the article. It lists all MEPs present in parliament in the spring semester of 2014.

I start by fetching the R packages we will use from the library.

library(dplyr); library(ggplot2)

Analysis

As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.

We begin by loading in the data.

If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory.

download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2014.rda",
              destfile = "MEP2014.rda")

Now, I can load it into R and rename the object to my liking (df).

#Load in
load("MEP2014.rda")

#Rename
df <- MEP2014

Univariate distribution

To approximate MEPs’ local investment, we measure the number of local/constituency assistants that Members of the European Parliament hire. Our dependent variable, “LocalAssistants”, is thus a count of the number of local employees (constituency staff) an MEP has. Since MEPs can share assistants, the variable also reports fractions.

I begin by exploring the univariate distribution. Here, I inspect the distribution of the variable, asking R to give me a histogram with 20 bars (for a “high-resolution” plot).

hist(df$LocalAssistants,
     breaks = 20)

There are a lot of MEPs with only a few local assistants, some MEPs have many local employees, while none has a negative number of assistants.

My main predictor today is a binary variable that flags whether the MEP was elected (and therefore will run for reelection) in a candidate-centered system (“OpenList”).

barplot(table(df$OpenList))

From the barplot, we see that about half of the MEPs have an incentive to cultivate a personal vote (OpenList == 1), while the other half can ride on the party label (OpenList == 0).

Descriptive statistics using stargazer

The visual inspection of the univariate distribution is the most intuitive by far. However, we may find it useful to summarize the key takeaways in a table of descriptive statistics.

Here I make a table of descriptive statistics using the stargazer package. It creates the summary statistics automatically when we feed it with a data frame or a model object. The package contains only one function, stargazer(), which generates the table.

If you haven’t already installed the package, do that first. Then we can fetch it from the library.

#install.packages("stargazer")
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

To make the table, I begin by selecting the variables I am interested in from the data using dplyr.

df0 <- df %>% select(LocalAssistants,
                     OpenList,
                     LaborCost,
                     position,
                     ProxNatElection,
                     ProxNatElection2)

I select the number of assistants an MEP has in their member state, the electoral system the MEP is elected from, as well as a few other variables we will use later: the labor cost, EU enthusiasm, and the proximity of national elections.

I then make the table. By default, stargazer will give you LaTeX code in the output. Do you want to see the result in R? Use type = "text".

stargazer(df0, type = "text")
## 
## ==================================================
## Statistic         N   Mean  St. Dev.  Min    Max  
## --------------------------------------------------
## LocalAssistants  739 2.915   3.212   0.000  39.500
## OpenList         739 0.471   0.499     0      1   
## LaborCost        739 22.824  11.207  3.800  40.600
## position         664 5.174   1.594   1.000  7.000 
## ProxNatElection  739 -2.290  1.225   -4.000 -0.115
## ProxNatElection2 739 6.745   5.621   0.013  16.000
## --------------------------------------------------

In a more fancy version, I could use additional arguments to specify what kind of statistics I want. Use the ?stargazer command to get tips.

stargazer(df0,
          #Title of the table
          title = "Descriptive statistics",
          #Type of output
          type = "text",
          #Summary stats of data frame
          summary = T,
          #Additional arguments for statistics
          iqr = T, median = T, min.max = T)
## 
## Descriptive statistics
## ===========================================================================
## Statistic         N   Mean  St. Dev.  Min   Pctl(25) Median Pctl(75)  Max  
## ---------------------------------------------------------------------------
## LocalAssistants  739 2.915   3.212   0.000   1.000   2.000   4.000   39.500
## OpenList         739 0.471   0.499     0       0       0       1       1   
## LaborCost        739 22.824  11.207  3.800   9.700   25.800  31.400  40.600
## position         664 5.174   1.594   1.000   4.545   6.000   6.273   7.000 
## ProxNatElection  739 -2.290  1.225   -4.000  -3.255  -2.422  -1.342  -0.115
## ProxNatElection2 739 6.745   5.621   0.013   1.802   5.866   10.594  16.000
## ---------------------------------------------------------------------------

Bivariate statistics

The article makes the claim that MEPs elected from candidate-centred electoral systems are more likely to invest locally, since they need the name recognition. The variable “OpenList” is a binary indicator of whether an MEP is elected from such a system.

A natural start of a bivariate analysis is to compare the average number of local hires in each group. The group_by() function from dplyr is useful for this. I then use summarize() to report the mean of the variable “LocalAssistants”.

df %>%
  group_by("Candidate-centred" = OpenList) %>%
  summarize("Average number" = mean(LocalAssistants))

We can clearly see there is a difference between the two groups, but is it significant?

One way to test if two groups are different, is to perform a t-test. However, we can also use linear models for this. That’s the point Gelman and Hill (2007) make. When we run an OLS with categorical predictors, we can interpret the coefficients as predicted averages for each group.
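To see the equivalence for yourself, a classical two-sample t-test gives the same result as the OLS slope. A minimal sketch, assuming the data frame df from above; var.equal = TRUE matches the OLS assumption of a common error variance:

```r
# Two-sample t-test comparing mean local staff across electoral systems.
# With var.equal = TRUE, the p-value matches the OLS coefficient test.
t.test(LocalAssistants ~ OpenList, data = df, var.equal = TRUE)
```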

The function for linear models in R is lm().

mod <- lm(LocalAssistants ~ OpenList,
          data = df)

Let’s see if we can recognize the results.

summary(mod)
## 
## Call:
## lm(formula = LocalAssistants ~ OpenList, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.417 -2.417 -0.468  1.532 36.083 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4680     0.1608  15.351  < 2e-16 ***
## OpenList      0.9486     0.2343   4.049 5.68e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.179 on 737 degrees of freedom
## Multiple R-squared:  0.02176,    Adjusted R-squared:  0.02044 
## F-statistic:  16.4 on 1 and 737 DF,  p-value: 5.684e-05

Your turn

  • can you calculate the predicted number of local assistants for OpenList == 0?
  • can you calculate the predicted number of local assistants for OpenList == 1?
  • What is the meaning of the \(\beta\) coefficient here?
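One way to check your answers is to plug the coefficients reported in the summary above into the linear equation (numbers rounded from the output):

```r
a <- 2.4680   # intercept: predicted staff when OpenList == 0
b <- 0.9486   # slope: the difference between the two groups

a + b * 0     # party-centered systems (OpenList == 0)
a + b * 1     # candidate-centered systems (OpenList == 1)
```

The slope is thus the difference between the two group averages.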

We could ask R to predict the y-values for us, given the data we have, the info contained in the predictors (OpenList) and the coefficients from the model. This is what we’d call “in-sample” predictions.

#Predict
df$predicted <- predict(mod,
                        #The data we used for estimation
                        newdata = df)

#Distribution of predicted values
table(df$predicted)
## 
## 2.46803069053709 3.41666666663794 
##              391              348

There are two ways of interpreting the regression coefficients:

  • we can use the OLS to compare means \(\rightarrow\) the following OLS assumptions don’t matter
  • we can use it to make inferences: one unit increase in \(x\) is a \(\beta\) increase in \(y\). \(\rightarrow\) the following OLS assumptions do matter

Multivariate regression

There are many potential confounders in this analysis. In particular, MEPs come from different EU member states. They have the same budget for parliamentary assistance, but their local assistants are hired on local wages. It means that Danish MEPs can hire fewer assistants than Polish MEPs. Labor cost is also related to MEPs’ incentive to cultivate a personal vote (through their nationality).

I therefore fit a second model where I add LaborCost as a predictor. I store the output in a different model object.

mod2 <- lm(LocalAssistants ~
             OpenList +
             LaborCost,
           df)
summary(mod2)
## 
## Call:
## lm(formula = LocalAssistants ~ OpenList + LaborCost, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.492 -1.938 -0.414  1.078 35.001 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.12659    0.28612  14.422  < 2e-16 ***
## OpenList     0.82883    0.22784   3.638 0.000294 ***
## LaborCost   -0.07020    0.01015  -6.913 1.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.083 on 736 degrees of freedom
## Multiple R-squared:  0.08141,    Adjusted R-squared:  0.07891 
## F-statistic: 32.61 on 2 and 736 DF,  p-value: 2.689e-14

Present your results: table of results

Ward and Ahlquist (2018) don’t like results tables. They call them BUTONs (Big Ugly Tables of Numbers) because they are counterintuitive. However, most analyses will involve making a regression table after we have fitted the model and before we start the interpretation.

There are several ways to create result tables. You can always export coefficients and standard errors to Excel via write.table(), but there are easier solutions.

The package stargazer gives you a nice, pre-made presentation of most model objects (with significance stars).

We already loaded the package earlier; the function we need is once again stargazer().

stargazer(mod,
          type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                           LocalAssistants      
## -----------------------------------------------
## OpenList                     0.949***          
##                               (0.234)          
##                                                
## Constant                     2.468***          
##                               (0.161)          
##                                                
## -----------------------------------------------
## Observations                    739            
## R2                             0.022           
## Adjusted R2                    0.020           
## Residual Std. Error      3.179 (df = 737)      
## F Statistic           16.396*** (df = 1; 737)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

You can present multiple models in stargazer at the same time. Just add all the model objects first in the function. Then you get one column per model.

stargazer(mod, mod2,
          type = "text")
## 
## ===================================================================
##                                   Dependent variable:              
##                     -----------------------------------------------
##                                     LocalAssistants                
##                               (1)                     (2)          
## -------------------------------------------------------------------
## OpenList                   0.949***                0.829***        
##                             (0.234)                 (0.228)        
##                                                                    
## LaborCost                                          -0.070***       
##                                                     (0.010)        
##                                                                    
## Constant                   2.468***                4.127***        
##                             (0.161)                 (0.286)        
##                                                                    
## -------------------------------------------------------------------
## Observations                  739                     739          
## R2                           0.022                   0.081         
## Adjusted R2                  0.020                   0.079         
## Residual Std. Error    3.179 (df = 737)        3.083 (df = 736)    
## F Statistic         16.396*** (df = 1; 737) 32.612*** (df = 2; 736)
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01

Now it’s easier to compare the models. Among other things, we get an idea of what happens when we add controls.

Here we see that the effect of electoral system is attenuated when we take into account the differences in labor cost.

Pimp your table

Stargazer also has many additional arguments that allow us to customize the appearance. Some of these are especially interesting if you want to customize the table for a Danish audience.

additional_arguments <- c("type =",
                          "title = ",
                          "covariate.labels = ",
                          "dep.var.labels =",
                          "dep.var.caption =",
                          "decimal.mark = ")
alternatives <- c("'text', 'html', 'tex'",
                  "'My Title'",
                  "c('var 1', 'var 2')",
                  "A text, or a vector with text",
                  "'Dependent variable'",
                  "',', '.' etc.")
operation <- c("What file type is this?",
               "Add a title",
               "Vector with variable names",
               "What is my dependent variable called?",
               "Title for dependent variable",
               "What decimal separator do I use? (e.g., ',')")

tab_stargazer <- cbind(additional_arguments,
                       alternatives,
                       operation)
colnames(tab_stargazer) <- c("Argument",
                             "Alternatives",
                             "What it does")
Results table in stargazer: additional arguments

Argument            Alternatives                   What it does
type =              'text', 'html', 'tex'          What file type is this?
title =             'My Title'                     Add a title
covariate.labels =  c('var 1', 'var 2')            Vector with variable names
dep.var.labels =    A text, or a vector with text  What is my dependent variable called?
dep.var.caption =   'Dependent variable'           Title for dependent variable
decimal.mark =      ',', '.' etc.                  What decimal separator do I use?

Export the table

If you want the result in HTML (for export to Word), change the type to “html” and print the table to a separate file.

stargazer(mod,
          type = "html",
          out = "result_table.html")

Open the file in your web browser and copy-paste it into Word.

Interpretation

Let’s interpret the results using inference. We can either focus on the relative change in local staff, or the absolute size of the local staff. When we have an OLS, that’s basically the same thing. Once we move to GLMs, that will not be the case. Therefore, we will spend quite some time on predicting outcomes.

All regressions revolve around a linear theoretical relationship that we can express as follows:

\(y \sim \alpha + \beta \times x\)

The Greek letters are regression coefficients/parameters that R estimates from the data. The Roman letters are the data we feed into the equation. The estimation is done on real data, while the interpretation is done on data we invent. These are “scenarios” (Ward and Ahlquist 2018, ch. 3) that we construct to better understand what the model tells us.

Marginal effect

We can interpret the \(\beta\) (slope parameter/coefficient) as a relative increase in \(y\) when \(x\) increases with one unit.

Here, we might say that MEPs from candidate-centered systems (OpenList == 1) have on average a local staff with about 0.9 more assistants than MEPs from party-centered systems (OpenList == 0). Note how we interpret the units: \(\beta\) reports the increase in local assistants (\(y\)) when \(x\) increases by one unit (“from party-centered to candidate-centered system”).

Be realistic When we make interpretations, we should make sure that the increase we imagine for \(x\) is a realistic change in the variable. Depending on what we are interested in, we might consider realistic differences between observations (e.g. between different MEPs), or realistic changes/elasticity in \(x\) (e.g. a realistic increase in hires for one MEP). To make that assessment, you need to know the descriptive statistics (univariate distributions) of your predictors. Use the table of descriptive statistics actively!
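As a worked example, the descriptive statistics above suggest that an interquartile shift in LaborCost (from 9.7 to 31.4) is a realistic difference between MEPs. A rough sketch of the implied effect, using the LaborCost coefficient from model 2:

```r
# Implied effect of an interquartile shift in labor cost
b_labor <- -0.0702            # LaborCost coefficient from model 2
(31.4 - 9.7) * b_labor        # change in predicted number of local assistants
```

That is roughly 1.5 fewer local assistants, a substantively meaningful difference.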

Predicted values using scenarios

We can interpret the results as a comparison of predicted values when the value of “OpenList” increases by one unit.

To calculate the predicted values of model 1, we fill in the linear equation:

\(y_i = \alpha + \beta \times x_i\)

R gives us the coefficients, we provide the data. This data is artificial, they are “scenarios” that we create. That is, we “invent” data points. The procedure looks something like this. Here I’ve created two scenarios for \(x_1\) and \(x_2\).

scenario             y          \(\alpha\)  \(\beta\)  x
Party-centered       2.4680307  2.4680307   0.948636   0
Candidate-centered   3.4166667  2.4680307   0.948636   1

We can do this in R. We can plug in the numbers and use R as a calculator; we can use an object-oriented calculation and we can do these calculations on many scenarios at the same time. For the time being: do this by hand.

Your turn:

  • Can you make an interpretation of the effect of labor cost on MEPs’ local staff?
  • Use the “Nationality” variable to find out what the labor cost is in Denmark, Italy, and Austria respectively. Make three scenarios and predict the size of MEPs’ local staff in the three countries.

Suggestion:

#Find labor cost
labor_dk <- df %>% 
  filter(Nationality == "Denmark") %>% 
  select(LaborCost) %>%
  unique
list_dk <- df %>%
  filter(Nationality == "Denmark") %>%
  select(OpenList) %>%
  unique

Fill in the equation

a <- mod2$coefficients[1]
b1 <- mod2$coefficients[2]
b2 <- mod2$coefficients[3]

x1 <- list_dk$OpenList    #OpenList value for Danish MEPs
x2 <- labor_dk$LaborCost  #Labor cost in Denmark

y <- a * 1 + b1 * x1 + b2 * x2
y
  • Read the section on scenarios in Ward and Ahlquist (2018) (ch 3) and decide what type of scenarios the combination Denmark/Italy is ; then Denmark/Austria.

  • Based on model 2, how many local assistants would a Danish MEP have if Denmark was to change from a candidate- to a party-centered system?

The work process for interpretation is as follows:

  1. Estimate the model and store the coefficients

  2. Create a dataset with hypothetical scenarios: This means creating a dataset where all variables have a probable/typical value. Then let the variable(s) you want to illustrate the effect of vary.

  3. Predict y.

  4. Interpret. Insert the predicted y-values into the hypothetical dataset so that you have everything in one place. Now, you can interpret using the three Danish Ts (tekst, tal, tegning: text, numbers, drawing). The interpretation focuses on the predicted value of the dependent variable.

    1. Numbers: what are the different predicted values?
    2. Text: compare single scenarios and make use of plain English. E.g., what is the difference between two predicted \(y\)s?
    3. Plot the results: the choice of graphics depends on the measurement level of the variable you want to illustrate. The effect of a categorical/binary variable is best illustrated with a coefficient plot, while an effect plot (a line) is best for continuous variables.

Extract the coefficients

I create a vector of coefficients.

coefs <- mod$coefficients

Create the data for the scenarios

I make a data frame with the scenarios I want. Be aware that you will have to make data for the intercept too.

newdata <- data.frame(
  Intercept = 1,
  OpenList = c(0,1))
newdata

Predict outcomes

Now, I can predict the y values by multiplying the regression coefficients with the data. We can use matrix manipulation.

To do that, I have to “flip” (transpose) the matrix, so that the first coefficient in my vector is multiplied with the first column in my data etc.

preds <- coefs %*% t(newdata)
preds 
##          [,1]     [,2]
## [1,] 2.468031 3.416667

  1. Numeric interpretation

The predicted value for scenario 1 is 2.4680307, and the predicted value for scenario 2 is 3.4166667. The difference (in the words of Ward and Ahlquist (2018), the “first difference”) is 0.948636.

  2. Plain-English interpretation. MEPs from party-centered systems keep on average 2.4680307 staffers on their payroll in their member state, while MEPs from candidate-centered systems have a staff of 3.4166667 people. That is, MEPs from candidate-centered systems keep on average about one more staffer on their local team.

Your turn

Here is a bold interpretation. If we were to make all MEPs compete in candidate-centered systems, the European Parliament would have to pay for 370.9166667 additional local staffers. How did I come up with this number?

  3. Plot it.

Start by making a data set for interpretation. We will also use it for graphical display, just as we did before.

#Flip the matrix
preds <- t(preds)

#Add it to your newdata object (and rename it)
dfp <- cbind(newdata, 
             "Staff size" = preds)

dfp

The table is useful for your numeric and english interpretations. It is also useful for graphical display.

ggplot(dfp) +
  geom_point(aes(x = OpenList,
                 y = `Staff size`))

These are point predictions. It’s the average predicted effect. However, we usually want to know the uncertainty of our estimate.

That’s why I simulate the results. The MASS package uses information about the distribution of the coefficients contained in our model object to make simulations. The purpose is to explore different scenarios by simulating the outcomes of our statistical model.

We may of course assume that the margins of error (the variation) are the same regardless of where you are in the prediction, and that our best estimate is an average value + a prediction interval. However, this is not always the case, especially when moving to GLMs (e.g. logit). In that case, we can simulate outcomes.
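For an OLS like ours, R can also compute analytic confidence intervals directly with predict(); simulation is what generalizes to GLMs. A quick sketch, assuming the model object mod from above:

```r
# Analytic 95% confidence intervals around the two point predictions
predict(mod,
        newdata = data.frame(OpenList = c(0, 1)),
        interval = "confidence")
```

You can compare these bounds with the simulated quantiles we compute below; for a linear model they should be very close.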

I perform the simulation in the MASS package, so I start by loading it from the library. If you haven’t used it before, you need to install it first (install.packages("MASS")).

We can either fetch the package in the library or make a little “pipe” by fetching only the one function we need from the MASS package without ever loading it: MASS::mvrnorm(). Here, I do the latter. The function draws from a multivariate normal distribution, using information contained in our model object.

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod)
)

The data stored in coef.sim are variations around our coefficients. We can use these coefficients as we did before. Now, we get many approximate predictions from each scenario we made.

#See the 10 first simulations
coef.sim[1:10,]
##       (Intercept)  OpenList
##  [1,]    2.334823 1.4200694
##  [2,]    2.259047 1.1318657
##  [3,]    2.414392 1.0983652
##  [4,]    2.266176 0.9188021
##  [5,]    2.269018 1.0775364
##  [6,]    2.594430 0.6084245
##  [7,]    2.586575 1.0647615
##  [8,]    2.579203 0.9952217
##  [9,]    2.453261 0.9395702
## [10,]    2.729737 0.4784737

Now, we can go back to our old workflow. First, make predictions using matrix calculations.

preds <- coef.sim %*% t(newdata)
preds[1:10,]
##           [,1]     [,2]
##  [1,] 2.334823 3.754892
##  [2,] 2.259047 3.390913
##  [3,] 2.414392 3.512757
##  [4,] 2.266176 3.184978
##  [5,] 2.269018 3.346554
##  [6,] 2.594430 3.202854
##  [7,] 2.586575 3.651337
##  [8,] 2.579203 3.574424
##  [9,] 2.453261 3.392831
## [10,] 2.729737 3.208210

Each column in the dataset is now a scenario. We can therefore calculate the summary statistics we’re interested in. Instead of using the mean and standard deviation/error, let’s use the median and the 95% most common predicted values.

dim(preds)
## [1] 10000     2
#Scenario 1
quantile(preds[,1], probs = c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 2.151481 2.467985 2.779521
#Scenario 2
quantile(preds[,2], probs = c(0.025, 0.5, 0.975))
##     2.5%      50%    97.5% 
## 3.072811 3.414616 3.747813

I can of course keep these calculations and bind them together in a vector. However, I can also do this in one go.

To do so, I’m using an old classic from base R. apply() applies whatever function I want to a matrix.

pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #Which function?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

The output is a matrix.

pred_summary
##           [,1]     [,2]
## 2.5%  2.151481 3.072811
## 50%   2.467985 3.414616
## 97.5% 2.779521 3.747813

I now go back to my old workflow and bind this to my initial data with the scenarios. Be mindful that you will have to flip the matrix again.

newdata
pred_summary <- t(pred_summary)
pred_summary
##          2.5%      50%    97.5%
## [1,] 2.151481 2.467985 2.779521
## [2,] 3.072811 3.414616 3.747813
dfp2 <- cbind(newdata,
              pred_summary)

Now, you have a data frame with interpretable values. Here, we might say that we are likely to observe MEPs with a local staff ranging between 2 and 3 employees.

dfp2

You can also plot these scenarios.

Plot your scenarios

Coefplots

When my x-variable is categorical – or in any way not continuous – a coefficient graphic is better. Here, I illustrate the median predicted effect of each scenario on the y-axis, surrounded by its 95% most likely alternative outcomes (margin of error).

#Plot predicted values
library(ggplot2)

#Add a new categorical variable for nicer labels
dfp2$`Electoral system` <- c("Party-centred",
                             "Candidate-centred")

ggplot(dfp2) +
  #Plot the points
  geom_point(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = `Electoral system`)
  ) +
  #Add error bars
  geom_errorbar(aes(
    #Different scenarios
    x = `Electoral system`,
    #Where the bar starts
    ymin = `2.5%`,
    #Where the bar ends
    ymax = `97.5%`,
    #Remove the ends of the bars
    width = 0)
  ) +
  ylab("Local staff size") +
  ggtitle(label = "The effect of electoral system on MEPs' local investment") +
  xlab("") +
  #Use a different "theme" as the basis for the aesthetics
  theme_light() 

Note: Unfortunately, the MASS package contains some functions that have the same name as functions in dplyr. You will receive a message about this when loading the package. In this case, the select() function is overwritten by MASS. If you still want to use the select() function from dplyr, you can specify this in your code as dplyr::select().
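For instance, even with MASS attached, the dplyr version can be called explicitly. A small illustration, reusing the selection we made earlier:

```r
# Explicit namespacing sidesteps the masking problem entirely
df0 <- df %>% dplyr::select(LocalAssistants,
                            OpenList)
head(df0)
```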

Your turn

  • Can you do the same with model 2? Predict the local staff size for MEPs as the labor cost increases.

This time, you will have many scenarios. To plot the uncertainty, use geom_ribbon() rather than the error bars.

#Scenario
newdata <- data.frame(
  Intercept = 1,
  OpenList = 1,
  #Create a vector using values from the descriptive stats.
  LaborCost = seq(from = 4, to = 40, by = 0.1))

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod2$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod2)
)

#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulation
pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #Which function?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))


ggplot(dfp3) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = LaborCost)
  ) +
    #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = LaborCost),
    alpha = 0.3) +
  theme_minimal()

Non-linear effects

These techniques are really useful when the effect of x is not linear. This also happens in linear models when we recode the x's: when the effect of a predictor differs depending on where we are on its scale (i.e. it is curvilinear), and in interaction effects.

Here are two examples.

Curvilinear

Some MEPs will not seek reelection to the European Parliament. Instead, they will run for office in national elections ("parachuting" back to the national level). We may therefore expect that they increase their local investment right before national elections.

The data contains a variable (ProxNatElection) that measures the time remaining until the next national parliamentary election.

I begin by exploring the bivariate relationship.

ggplot(df) +
  geom_smooth(aes(x = ProxNatElection,
                  y = LocalAssistants))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

We see that it is curvilinear. There is no point in having a large local staff right after an election. At that point, it is better to use the parliamentary allowance for staffers in Brussels. However, we might expect that MEPs mobilize (i.e. increase their local staff size) right before the election.

We can model this by using a quadratic term (Hermansen 2023, chap. 8).

mod3 <- lm(LocalAssistants ~ 
             LaborCost +
             OpenList +
             ProxNatElection +
             I(ProxNatElection^2),
           data = df)

summary(mod3)
## 
## Call:
## lm(formula = LocalAssistants ~ LaborCost + OpenList + ProxNatElection + 
##     I(ProxNatElection^2), data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.692 -1.859 -0.359  1.074 34.649 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.24209    0.53779   6.029 2.62e-09 ***
## LaborCost            -0.07587    0.01085  -6.993 6.06e-12 ***
## OpenList              0.92038    0.23796   3.868  0.00012 ***
## ProxNatElection      -0.81047    0.47121  -1.720  0.08586 .  
## I(ProxNatElection^2) -0.13127    0.10368  -1.266  0.20588    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.072 on 734 degrees of freedom
## Multiple R-squared:  0.09009,    Adjusted R-squared:  0.08513 
## F-statistic: 18.17 on 4 and 734 DF,  p-value: 3.053e-14

The effect here is not estimated to be different from 0, but we will illustrate it anyway. Be mindful when you construct your scenario: use the descriptive statistics actively, and remember that both ProxNatElection and its quadratic term have to vary.

#Scenario
newdata <- data.frame(
  Intercept = 1,
  LaborCost = mean(df$LaborCost),
  OpenList = 0,
  ProxNatElection = seq(from = -4, to = 0, by = 0.1),
  ProxNatElection2 = seq(from = -4, to = 0, by = 0.1)^2 
  )



#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod3$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod3)
)

#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulation
pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #Which function to apply
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))



ggplot(dfp3) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = ProxNatElection)
  ) +
    #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = ProxNatElection),
    alpha = 0.3) +
  theme_minimal() +
  ggtitle("The size of local staff as national elections approach") +
  ylab("Number of local staffers") +
  xlab("Proximity of national election")
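A side note on interpreting quadratic terms: the predicted curve turns where the first derivative is zero, i.e. at x = -b/(2c), where b is the coefficient on ProxNatElection and c the coefficient on its square. Below is a sketch using the point estimates copied by hand from summary(mod3) above. Since both coefficients are imprecisely estimated here, the exact location of the turning point should not be over-interpreted.

```r
#Point estimates copied by hand from summary(mod3)
b  <- -0.81047   #coefficient on ProxNatElection
c2 <- -0.13127   #coefficient on I(ProxNatElection^2)

#Turning point of y = a + b*x + c2*x^2: solve dy/dx = b + 2*c2*x = 0
x_star <- -b / (2 * c2)
x_star  #about -3.09, i.e. roughly three years before the election
```

Because c2 is negative, the parabola opens downward, so this turning point is a (very flat) maximum.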

Interaction

Interaction terms mean that the effect of one variable depends on the value of another variable in the regression. Here, we can formulate a hypothesis that Eurosceptic MEPs are more likely to attempt a return to national politics instead of staying in an institution they despise.

The variable "position" measures the party's Euro-enthusiasm (pro-EU attitudes). We might expect pro-integration MEPs to be less sensitive to the domestic electoral cycle.

mod4 <- lm(LocalAssistants ~ 
             LaborCost +
             OpenList +
             ProxNatElection *
             position,
           data = df)

summary(mod4)
## 
## Call:
## lm(formula = LocalAssistants ~ LaborCost + OpenList + ProxNatElection * 
##     position, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.847 -1.848 -0.561  1.052 34.723 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               6.52336    0.80885   8.065 3.47e-15 ***
## LaborCost                -0.08480    0.01148  -7.386 4.61e-13 ***
## OpenList                  0.99611    0.24027   4.146 3.83e-05 ***
## ProxNatElection           0.61116    0.31811   1.921 0.055134 .  
## position                 -0.52739    0.14265  -3.697 0.000236 ***
## ProxNatElection:position -0.16618    0.06004  -2.768 0.005804 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.019 on 658 degrees of freedom
##   (75 observations deleted due to missingness)
## Multiple R-squared:  0.1064, Adjusted R-squared:  0.09961 
## F-statistic: 15.67 on 5 and 658 DF,  p-value: 1.372e-14
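One way to read this table: the marginal effect of ProxNatElection is its own coefficient plus the interaction coefficient multiplied by position. A sketch using the point estimates copied by hand from summary(mod4) above:

```r
#Point estimates copied by hand from summary(mod4)
b_prox <- 0.61116    #ProxNatElection
b_int  <- -0.16618   #ProxNatElection:position

#Marginal effect of the electoral cycle at a given party position
marginal_effect <- function(position) b_prox + b_int * position

marginal_effect(1)  #Eurosceptics: positive
marginal_effect(7)  #Euro-enthusiasts: negative
```

The sign of the effect flips as we move along the position scale, which is exactly what the two scenario plots below will show.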

To illustrate the effect, we set one of the variables (the moderator) to a fixed value while we let the other vary. Afterwards, we may do the same with a new value of the moderator.

I start by estimating the effect of the national electoral cycle among eurosceptics (“position == 1”; the minimum in the data).

#Check names
names(mod4$coefficients)
## [1] "(Intercept)"              "LaborCost"               
## [3] "OpenList"                 "ProxNatElection"         
## [5] "position"                 "ProxNatElection:position"
#Scenario
newdata <- data.frame(
  Intercept = 1,
  LaborCost = mean(df$LaborCost),
  OpenList = 0,
  ProxNatElection = seq(from = -4, to = 0, by = 0.1),
  position = 1,
  Interaction = seq(from = -4, to = 0, by = 0.1) * 1
  )



#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod4$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod4)
)

#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulation
pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #Which function to apply
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp4 <- cbind(newdata,
              t(pred_summary))



ggplot(dfp4) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = ProxNatElection)
  ) +
    #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = ProxNatElection),
    alpha = 0.3) +
  theme_minimal() +
  ggtitle("Effect of national electoral cycle among EUROSCEPTICS")+
  ylab("Number of local staffers") +
  xlab("Proximity of national election")

Can you do the same with the most Euro-enthusiastic MEPs? (position == 7)

#Check names
names(mod4$coefficients)
## [1] "(Intercept)"              "LaborCost"               
## [3] "OpenList"                 "ProxNatElection"         
## [5] "position"                 "ProxNatElection:position"
#Scenario
newdata <- data.frame(
  Intercept = 1,
  LaborCost = mean(df$LaborCost),
  OpenList = 0,
  ProxNatElection = seq(from = -4, to = 0, by = 0.1),
  position = 7,
  Interaction = seq(from = -4, to = 0, by = 0.1) * 7
  )



#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod4$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod4)
)

#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulation
pred_summary <- 
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2, 
        #Which function to apply
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp5 <- cbind(newdata,
              t(pred_summary))



ggplot(dfp5) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = ProxNatElection)
  ) +
    #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = ProxNatElection),
    alpha = 0.3) +
  theme_minimal() +
  ggtitle("Effect of national electoral cycle among EUROENTHUSIASTS") +
  ylab("Number of local staffers") +
  xlab("Proximity of national election")

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge ; New York: Cambridge University Press.
Hermansen, Silje Synnøve Lyder. 2023. R i praksis - en introduktion for samfundsvidenskaberne. 1st ed. Copenhagen: DJØF Forlag. https://www.djoef-forlag.dk/book-info/r-i-praksis.
Hermansen, Silje Synnøve Lyder, and Andreja Pegan. 2024. “Blurred Lines Between Electoral and Parliamentary Representation: The Use of Constituency Staff Among Members of the European Parliament.” European Union Politics, 1–25. https://doi.org/10.1177/14651165221149900.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.