Linear models: Interpretation
In this session, we will be exploring how to estimate, present and interpret linear models. The concepts and procedures we use today may be slightly complicated for the simple example we are working with, but as we move over to the non-linear relationship between x and y in GLMs, they will come in very handy.
Introduction
Main methodological takeaways
We will use linear models (OLS) first to estimate group averages. Specifically, we will explore the distribution of the dependent variable within groups of predictors and compare their means. These are instances where OLS always works.
We then move to use the model for inference – predicting outcomes from combinations of (metric) variables. Our focus is then on interpretation. In the application today, the interpretations often involve a detour; we could obtain the same thing in simpler ways. However, when we move to explore GLMs where the relationship between x and y is non-linear, these skills will be useful.
I will introduce a few additional R functions:

- stargazer() for making summary tables (stargazer package)
- mvrnorm() for making simulations (MASS package)
- apply() for applying a function over the rows or columns of a matrix (base R)
- geom_ribbon() and geom_errorbar() for plotting uncertainty (ggplot2 package).
The rest of the R code implements what we already encountered in the homework (Hermansen 2023) and in the last R session (notebooks online). Make sure you work through the examples here, so that you are prepared for our next session.
Substantive topic (the research)
We will be working on some of the insights from Hermansen and Pegan (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.
For a long time, there was a claim in the academic literature that what MEPs do in office doesn't matter for their re-election; they are not held electorally accountable. In this paper, we show that MEPs haven't really gotten that memo: they behave as if their re-election depends on what they do. Voters in all democracies are confronted with an informational problem: they don't know what their representatives do. It is up to the representatives to prove their usefulness to their voters. They can do that in different ways.
We assume that most MEPs will seek (re-)election in the future and that they can increase the likelihood of that happening by investing in the right type of activities. The claim we explore today is that some MEPs have an incentive to cultivate a personal vote, while others do not. Specifically, the electoral rules for EP elections vary by member state. MEPs elected from candidate-centered systems will gain a seat in part due to the personal votes they receive. As a result, they have an incentive to invest in constituency activities; they invest in their local presence in order to build name recognition. By contrast, MEPs elected from party-centered systems can ride on the party label, and so will invest less.
The data
We will work with a subset of the replication data for the article. The data list all MEPs present in parliament in the spring semester of 2014.
I start by fetching the R packages we will use from the library.
library(dplyr); library(ggplot2)
Analysis
As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.
We begin by loading in the data.
If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory.
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2014.rda",
destfile = "MEP2014.rda")
Now, I can load it into R and rename the object to my liking (df).

#Load in
load("MEP2014.rda")
#Rename
df <- MEP2014
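A quick sanity check never hurts before we start analyzing (a habit I recommend, not part of the article's replication code):

#Number of rows (MEPs) and columns (variables)
dim(df)
#Variable names
names(df)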
Univariate distribution
To approximate MEPs’ local investment, we measure the number of local/constituency assistants that Members of the European Parliament hire. Our dependent variable, “LocalAssistants”, is thus a count of the number of local employees (constituency staff) an MEP has. Since MEPs can share assistants, the variable also reports fractions.
I begin by exploring the univariate distribution, asking R to give me a histogram with 20 bars (for a "high-resolution" plot).
hist(df$LocalAssistants,
breaks = 20)
There are a lot of MEPs with only a few local assistants, some MEPs have many local employees, and none has a negative number of assistants.
My main predictor today is a binary variable that flags whether the MEP was elected (and therefore will run for reelection) in a candidate-centered system (“OpenList”).
barplot(table(df$OpenList))
From the barplot, we see that about half of the MEPs have an incentive to cultivate a personal vote (OpenList == 1), while the other half can ride on the party label (OpenList == 0).
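If we want the exact shares behind that "about half" statement, we can normalize the frequency table (a quick check):

#Shares of MEPs in each type of electoral system
prop.table(table(df$OpenList))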
Descriptive statistics using stargazer
The visual inspection of the univariate distribution is by far the most intuitive. However, we may find it useful to record the key takeaways in a table of descriptive statistics with some key numbers.
Here I make a table of descriptive statistics using the stargazer package. It creates the summary statistics automatically when we feed it a data frame or a model object. The package contains only one function, stargazer(), which generates the table.
If you haven't already installed the package, do that first. Then we can fetch it from the library.
#install.packages("stargazer")
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
To make the table, I begin by selecting the variables I am interested in from the data using dplyr.

df0 <- df %>% select(LocalAssistants,
                     OpenList,
                     LaborCost,
                     position,
                     ProxNatElection,
                     ProxNatElection2)
I select the number of assistants each MEP has in their member state and the electoral system the MEP is elected from, as well as a few other variables we will use later: the labor cost, EU-enthusiasm and the proximity of the national elections.
I then make the table. By default, stargazer will give you LaTeX code in the output. Do you want to see the result in R? Use type = "text".
stargazer(df0, type = "text")
##
## ==================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------
## LocalAssistants 739 2.915 3.212 0.000 39.500
## OpenList 739 0.471 0.499 0 1
## LaborCost 739 22.824 11.207 3.800 40.600
## position 664 5.174 1.594 1.000 7.000
## ProxNatElection 739 -2.290 1.225 -4.000 -0.115
## ProxNatElection2 739 6.745 5.621 0.013 16.000
## --------------------------------------------------
In a fancier version, I could use additional arguments to specify what kind of statistics I want. Use the ?stargazer command to get tips.
stargazer(df0,
#Title of the table
title = "Descriptive statistics",
#Type of output
type = "text",
#Summary stats of data frame
summary = T,
#Additional arguments for statistics
iqr = T, median = T, min.max = T)
##
## Descriptive statistics
## ===========================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Median Pctl(75) Max
## ---------------------------------------------------------------------------
## LocalAssistants 739 2.915 3.212 0.000 1.000 2.000 4.000 39.500
## OpenList 739 0.471 0.499 0 0 0 1 1
## LaborCost 739 22.824 11.207 3.800 9.700 25.800 31.400 40.600
## position 664 5.174 1.594 1.000 4.545 6.000 6.273 7.000
## ProxNatElection 739 -2.290 1.225 -4.000 -3.255 -2.422 -1.342 -0.115
## ProxNatElection2 739 6.745 5.621 0.013 1.802 5.866 10.594 16.000
## ---------------------------------------------------------------------------
Bivariate statistics
The article makes the claim that MEPs elected from candidate-centred electoral systems are more likely to invest locally, since they need the name recognition. The variable “OpenList” is a binary indicator of whether an MEP is elected from such a system.
A natural start of a bivariate analysis is to compare the average number of local hires in each group. The group_by() function from dplyr is useful for this. I then use summarize() to give a summary of the variable "LocalAssistants".
df %>%
  group_by("Candidate-centred" = OpenList) %>%
  summarize("Average number" = mean(LocalAssistants))
We can clearly see there is a difference between the two groups, but is it significant?
One way to test if two groups are different is to perform a t-test. However, we can also use linear models for this. That's the point Gelman and Hill (2007) make: when we run an OLS with categorical predictors, we can interpret the coefficients as predicted averages for each group.
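Just for reference, this is what such a t-test could look like (a sketch; setting var.equal = TRUE mirrors the constant-variance assumption of the OLS below):

#Two-sample t-test comparing the group means
t.test(LocalAssistants ~ OpenList,
       data = df,
       #Assume equal variances, as the OLS does
       var.equal = TRUE)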
The function for linear models in R is lm().

mod <- lm(LocalAssistants ~ OpenList,
          data = df)
Let’s see if we can recognize the results.
summary(mod)
##
## Call:
## lm(formula = LocalAssistants ~ OpenList, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.417 -2.417 -0.468 1.532 36.083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4680 0.1608 15.351 < 2e-16 ***
## OpenList 0.9486 0.2343 4.049 5.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.179 on 737 degrees of freedom
## Multiple R-squared: 0.02176, Adjusted R-squared: 0.02044
## F-statistic: 16.4 on 1 and 737 DF, p-value: 5.684e-05
Your turn
- Can you calculate the predicted number of local assistants for OpenList == 0?
- Can you calculate the predicted number of local assistants for OpenList == 1?
- What is the meaning of the \(\beta\) coefficient here?
We could ask R to predict the y-values for us, given the data we have, the info contained in the predictors (OpenList) and the coefficients from the model. This is what we’d call “in-sample” predictions.
#Predict
df$predicted <- predict(mod,
                        #The data we used for estimation
                        newdata = df)
#Distribution of predicted values
table(df$predicted)

##
## 2.46803069053709 3.41666666663794
##              391              348
There are two ways of interpreting the regression coefficients:

- we can use the OLS to compare means \(\rightarrow\) the usual OLS assumptions don't matter (see the check below)
- we can use it to make inferences: a one-unit increase in \(x\) gives a \(\beta\) increase in \(y\) \(\rightarrow\) the usual OLS assumptions do matter
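As a quick check of the first point, the coefficients reproduce the group means we computed with dplyr above (a minimal sketch using base R):

#Group means of the dependent variable
means <- tapply(df$LocalAssistants, df$OpenList, mean)
means
#The intercept is the mean for the OpenList == 0 group
mod$coefficients[1]
#The slope is the difference between the two group means
diff(means)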
Multivariate regression
There are many potential confounders in this analysis. In particular, MEPs come from different EU member states. They have the same budget for parliamentary assistance, but their local assistants are hired on local wages. This means that Danish MEPs can hire fewer assistants than Polish MEPs. Labor cost is also related to MEPs' incentive to cultivate a personal vote (through their nationality).
I therefore fit a second model where I add LaborCost as a predictor. I store the output in a different model object.
mod2 <- lm(LocalAssistants ~
             OpenList +
             LaborCost,
           data = df)
summary(mod2)
##
## Call:
## lm(formula = LocalAssistants ~ OpenList + LaborCost, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.492 -1.938 -0.414 1.078 35.001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.12659 0.28612 14.422 < 2e-16 ***
## OpenList 0.82883 0.22784 3.638 0.000294 ***
## LaborCost -0.07020 0.01015 -6.913 1.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.083 on 736 degrees of freedom
## Multiple R-squared: 0.08141, Adjusted R-squared: 0.07891
## F-statistic: 32.61 on 2 and 736 DF, p-value: 2.689e-14
Present your results: table of results
Ward and Ahlquist (2018) don't like results tables. They call them BUTONs (Big Ugly Tables of Numbers) because they are counterintuitive. However, most analyses will involve making a regression table after we have fitted the model and before we start the interpretation.
There are several ways to create results tables. You can always export coefficients and standard errors to Excel via write.table(), but there are easier solutions.

The stargazer package gives you a nice, pre-made presentation of most model objects (with significance stars). We already loaded the package earlier; the one function to take note of is stargazer().
stargazer(mod,
type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## LocalAssistants
## -----------------------------------------------
## OpenList 0.949***
## (0.234)
##
## Constant 2.468***
## (0.161)
##
## -----------------------------------------------
## Observations 739
## R2 0.022
## Adjusted R2 0.020
## Residual Std. Error 3.179 (df = 737)
## F Statistic 16.396*** (df = 1; 737)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
You can present multiple models in stargazer at the same time. Just list all the model objects first in the function call. You then get one column per model.
stargazer(mod, mod2,
type = "text")
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## LocalAssistants
## (1) (2)
## -------------------------------------------------------------------
## OpenList 0.949*** 0.829***
## (0.234) (0.228)
##
## LaborCost -0.070***
## (0.010)
##
## Constant 2.468*** 4.127***
## (0.161) (0.286)
##
## -------------------------------------------------------------------
## Observations 739 739
## R2 0.022 0.081
## Adjusted R2 0.020 0.079
## Residual Std. Error 3.179 (df = 737) 3.083 (df = 736)
## F Statistic 16.396*** (df = 1; 737) 32.612*** (df = 2; 736)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Now it's easier to compare the models. Among other things, we get an idea of what happens when we add controls. Here we see that the effect of electoral system is attenuated when we take into account the differences in labor cost.
Pimp your table

Stargazer also has many additional arguments that allow us to customize the appearance. Some of these are especially interesting if you want to customize the table for a Danish audience.
<- c("type =",
additional_arguments "title = ",
"covariate.labels = ",
"dep.var.labels =",
"dep.var.caption =",
"decimal.mark = ")
<- c("'text', 'html', 'tex'",
alternatives "'My Title'",
"c('var 1', 'var 2')",
"A text, or a vector with text",
"'Dependent variable'",
"',', '.' etc.")
<- c("What file type is this?",
operation "Add a title",
"Vector with variable names",
"What is my dependent variable called?",
"Title for dependent variable",
"What decimal separator do I use? (e.g., ',')")
<- cbind(additional_arguments,
tab_stargazer
alternatives,
operation)colnames(tab_stargazer) <- c("Argument",
"Alternatives",
"What it does")
Argument | Alternatives | What it does
---|---|---
type = | 'text', 'html', 'tex' | What file type is this?
title = | 'My Title' | Add a title
covariate.labels = | c('var 1', 'var 2') | Vector with variable names
dep.var.labels = | A text, or a vector with text | What is my dependent variable called?
dep.var.caption = | 'Dependent variable' | Title for dependent variable
decimal.mark = | ',', '.' etc. | What decimal separator do I use? (e.g., ',')
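To see some of these arguments in action, here is one possible customization of the table for our two models. The labels are my own suggestions, not fixed names:

stargazer(mod, mod2,
          type = "text",
          title = "Electoral system and local staff size",
          #Labels follow the order of the covariates in the models
          covariate.labels = c("Candidate-centered system",
                               "Labor cost"),
          dep.var.labels = "Number of local assistants")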
Export the table
If you want the result in HTML (for export to Word), change the type to "html" and print the table to a separate file.
stargazer(mod,
type = "html",
out = "result_table.html")
 | Dependent variable:
---|---
 | LocalAssistants
OpenList | 0.949***
 | (0.234)
Constant | 2.468***
 | (0.161)
Observations | 739
R2 | 0.022
Adjusted R2 | 0.020
Residual Std. Error | 3.179 (df = 737)
F Statistic | 16.396*** (df = 1; 737)
Note: | *p<0.1; **p<0.05; ***p<0.01
Open the file in your web browser and copy-paste it into Word.
Interpretation
Let’s interpret the results using inference. We can either focus on the relative change in local staff, or the absolute size of the local staff. When we have an OLS, that’s basically the same thing. Once we move to GLMs, that will not be the case. Therefore, we will spend quite some time on predicting outcomes.
All regressions revolve around a linear theoretical relationship that we can express as follows:
\(y \sim \alpha + \beta \times x\)
The Greek letters are regression coefficients/parameters that R estimates from the data. The Latin letters are the data we feed into the equation. The estimation is done on real data, while the interpretation is done on data we invent. These are "scenarios" (Ward and Ahlquist 2018, ch. 3) that we construct to better understand what the model tells us.
Marginal effect
We can interpret \(\beta\) (the slope parameter/coefficient) as the increase in \(y\) when \(x\) increases by one unit.
Here, we might say that MEPs from candidate-centered systems (OpenList == 1) have on average a local staff containing 0.9 more assistants than MEPs from party-centered systems (OpenList == 0). Note how we interpret the units: \(\beta\) reports the increase in local assistants (\(y\)) when \(x\) increases by one unit ("from party-centered to candidate-centered system").
Be realistic

When we make interpretations, we should make sure that the increase we imagine for \(x\) is a realistic change in the variable. Depending on what we are interested in, we might consider realistic differences between observations (e.g. between different MEPs), or realistic changes/elasticity in \(x\) (e.g. a realistic increase in hires for one MEP). To make that assessment, you need to know the descriptive statistics (univariate distributions) of your predictors. Use the table of descriptive statistics actively! For an illustration, see the sketch below.
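For instance, a realistic comparison for LaborCost could be the shift from the 25th to the 75th percentile reported in the descriptive statistics. A sketch of that calculation, using model 2:

#A "realistic" shift in labor cost: from the 25th to the 75th percentile
realistic_change <- diff(quantile(df$LaborCost, probs = c(0.25, 0.75)))
realistic_change
#The implied difference in the predicted number of local assistants
realistic_change * mod2$coefficients["LaborCost"]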
Predicted values using scenarios
We can interpret the results as a comparison of predicted values when the value of “OpenList” increases by one unit.
To calculate the predicted values of model 1, we fill in the linear equation:
\(y_i = \alpha + \beta \times x_i\)
R gives us the coefficients; we provide the data. This data is artificial: these are "scenarios" that we create. That is, we "invent" data points. The procedure looks something like this. Here I've created two scenarios, \(x_1 = 0\) and \(x_2 = 1\).
scenario | y | \(\alpha\) | \(\beta\) | x
---|---|---|---|---
Party-centered | 2.4680307 | 2.4680307 | 0.948636 | 0
Candidate-centered | 3.4166667 | 2.4680307 | 0.948636 | 1
We can do this in R: we can plug in the numbers and use R as a calculator, we can use an object-oriented calculation, and we can do these calculations on many scenarios at the same time. For the time being, do this by hand.
Your turn:
- Can you make an interpretation of the effect of labor cost on MEPs’ local staff?
- Use the “Nationality” variable to find out what the labor cost is in Denmark, Italy, and Austria respectively. Make three scenarios and predict the size of MEPs’ local staff in the three countries.
Suggestion:

#Find the labor cost in Denmark
labor_dk <- df %>%
  filter(Nationality == "Denmark") %>%
  select(LaborCost) %>%
  unique

#Find the electoral system in Denmark
list_dk <- df %>%
  filter(Nationality == "Denmark") %>%
  select(OpenList) %>%
  unique
Fill in the equation:

a <- mod2$coefficients[1]
b1 <- mod2$coefficients[2]
b2 <- mod2$coefficients[3]

x1 <- list_dk
x2 <- labor_dk

y = a*1 + b1*x1 + b2*x2
y
Read the section on scenarios in Ward and Ahlquist (2018) (ch. 3) and decide what type of scenario the combination Denmark/Italy is; then Denmark/Austria.
Based on model 2, how many local assistants would a Danish MEP have if Denmark was to change from a candidate- to a party-centered system?
The work process for interpretation is as follows:

1. Estimate the model and store the coefficients.
2. Create a dataset with hypothetical scenarios: This means creating a dataset where all variables have a probable/typical value. Then let the variable(s) whose effect you want to illustrate vary.
3. Predict y.
4. Interpret: Insert the predicted y-values into the hypothetical dataset so that you have everything in one place. Now, you can interpret using the three Danish Ts (tekst, tal, tegning: text, numbers, drawing). The interpretation focuses on the predicted value of the dependent variable.

- Numbers: What are the different predicted values?
- Text: Compare single scenarios and use plain English. E.g. what is the difference between two predicted \(y\)s?
- Plot: Plot the results. The choice of graphics depends on the measurement level of the variable you want to illustrate. The effect of a categorical/binary variable is best illustrated with a coefficient plot, while an effect plot (a line) is best for continuous variables.
Extract the coefficients
I create a vector of coefficients.
coefs <- mod$coefficients
Create the data for the scenarios
I make a data frame with the scenarios I want. Be aware that you will have to make data for the intercept too.
newdata <- data.frame(
  Intercept = 1,
  OpenList = c(0,1))
newdata
Predict outcomes
Now, I can predict the y-values by multiplying the regression coefficients with the data. We can use matrix multiplication.

To do that, I have to "flip" (transpose) the matrix, so that the first coefficient in my vector is multiplied with the first column in my data, and so on.
preds <- coefs %*% t(newdata)
preds
## [,1] [,2]
## [1,] 2.468031 3.416667
- Numeric interpretation
The predicted value for scenario 1 is 2.4680307, and the predicted value for scenario 2 is 3.4166667. The difference (in the words of Ward and Ahlquist (2018), the “first difference”) is 0.948636.
- Plain English interpretation. MEPs from party-centered systems keep on average 2.4680307 staffers on their payroll in their member state, while MEPs from candidate-centered systems have a staff of 3.4166667 (preds[2]) people. That is, MEPs from candidate-centered systems keep on average about one more staffer on their local team.
Your turn
Here is a bold interpretation. If we were to make all MEPs compete in candidate-centered systems, the European Parliament would have to pay for 370.9166667 additional local staffers. How did I come up with this number?
- Plot it.
Start by making a data set for interpretation. We will also use it for graphical display, just as we did before.
#Flip the matrix
preds <- t(preds)

#Add it to your newdata object (and rename it)
dfp <- cbind(newdata,
             "Staff size" = preds)
dfp
The table is useful for your numeric and English interpretations. It is also useful for graphical display.
ggplot(dfp) +
geom_point(aes(x = OpenList,
y = `Staff size`))
These are point predictions: the average predicted effect. However, we usually want to know the uncertainty of our estimate.

That's why I simulate the results. The mvrnorm() function from the MASS package uses the information about the distribution of the coefficients contained in our model object to make simulations. The purpose is to explore different scenarios by simulating the outcomes of our statistical model.
We could of course assume that the margins of error (the variation) are the same regardless of where we are in the prediction, and that our best estimate is an average value plus a prediction interval. However, this is not always the case, especially when we move to GLMs (e.g. logit). In that case, we can simulate outcomes.
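For a plain OLS like model 1, we can in fact get analytic confidence intervals directly from predict(); the simulations below should give approximately the same numbers, but the simulation approach generalizes better to GLMs:

#Analytic 95% confidence intervals for the two scenarios (OLS only)
predict(mod,
        newdata = data.frame(OpenList = c(0, 1)),
        interval = "confidence")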
I perform the simulation with the MASS package, so I would normally start by loading it from the library. If you haven't used it before, you need to install it first (install.packages("MASS")).

We can either fetch the package from the library or make a little "pipe" by fetching only the one function we need from the MASS package without ever loading it: MASS::mvrnorm(). Here, I do the latter. The function draws from a multivariate normal distribution, using information contained in our model object.
#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod)
)
The data stored in coef.sim are variations around our coefficients. We can use these coefficients as we did before. Now, we get many approximate predictions from each scenario we made.
#See the 10 first simulations
coef.sim[1:10,]
## (Intercept) OpenList
## [1,] 2.334823 1.4200694
## [2,] 2.259047 1.1318657
## [3,] 2.414392 1.0983652
## [4,] 2.266176 0.9188021
## [5,] 2.269018 1.0775364
## [6,] 2.594430 0.6084245
## [7,] 2.586575 1.0647615
## [8,] 2.579203 0.9952217
## [9,] 2.453261 0.9395702
## [10,] 2.729737 0.4784737
Now, we can go back to our old workflow. First, make predictions using matrix calculations.
preds <- coef.sim %*% t(newdata)
preds[1:10,]
## [,1] [,2]
## [1,] 2.334823 3.754892
## [2,] 2.259047 3.390913
## [3,] 2.414392 3.512757
## [4,] 2.266176 3.184978
## [5,] 2.269018 3.346554
## [6,] 2.594430 3.202854
## [7,] 2.586575 3.651337
## [8,] 2.579203 3.574424
## [9,] 2.453261 3.392831
## [10,] 2.729737 3.208210
Each column in the dataset is now a scenario. We can therefore calculate the summary statistics we’re interested in. Instead of using the mean and standard deviation/error, let’s use the median and the 95% most common predicted values.
dim(preds)
## [1] 10000 2
#Scenario 1
quantile(preds[,1], probs = c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## 2.151481 2.467985 2.779521
#Scenario 2
quantile(preds[,2], probs = c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## 3.072811 3.414616 3.747813
I can of course keep these calculations and bind them together in a vector. However, I can also do this in one go.

To do so, I'm using an old classic from base R: apply() applies whatever function I want to a matrix.
pred_summary <-
  apply(preds,
        #Apply this to columns (MARGIN = 2)
        MARGIN = 2,
        #Which function to apply?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))
The output is a matrix.
pred_summary
## [,1] [,2]
## 2.5% 2.151481 3.072811
## 50% 2.467985 3.414616
## 97.5% 2.779521 3.747813
I now go back to my old workflow and bind this to my initial data with the scenarios. Be mindful that you will have to flip the matrix again.
newdata
pred_summary <- t(pred_summary)
pred_summary
## 2.5% 50% 97.5%
## [1,] 2.151481 2.467985 2.779521
## [2,] 3.072811 3.414616 3.747813
dfp2 <- cbind(newdata,
              pred_summary)
Now, you have a data frame with interpretable values. Here, we might for example say that we are likely to observe MEPs from party-centered systems with a local staff of between roughly 2 and 3 employees.
dfp2
You can also plot these scenarios.
Plot your scenarios
Coefplots
When my x-variable is categorical – or in any way not continuous – a coefficient plot is better. Here, I put the median predicted outcome for each scenario on the y-axis, surrounded by the 95% most likely alternative outcomes (the margin of error).
#Plot predicted values
library(ggplot2)

#Add a new categorical variable for nicer labels
#(the order must match the scenarios: OpenList = c(0, 1))
dfp2$`Electoral system` <- c("Party-centred",
                             "Candidate-centred")
ggplot(dfp2) +
  #Plot the points
  geom_point(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = `Electoral system`)) +
  #Add error bars
  geom_errorbar(aes(
    #Different scenarios
    x = `Electoral system`,
    #Where the bar starts
    ymin = `2.5%`,
    #Where the bar ends
    ymax = `97.5%`),
    #Remove the ends of the bars
    width = 0) +
  ggtitle(label = "The effect of electoral system on MEPs' local investment") +
  ylab("Local staff size") +
  xlab("") +
  #Use a different "theme" as the basis for the aesthetics
  theme_light()
Note: Unfortunately, the MASS package contains some functions that have the same name as functions in dplyr. You will receive a message about this when loading the package. In this case, the select() function from dplyr is masked by MASS. If you still want to use the select() function from dplyr, you can specify this in your code as dplyr::select().
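For example, the variable selection from earlier could be written like this once MASS is loaded:

#Force the dplyr version of select(), even with MASS loaded
df %>%
  dplyr::select(LocalAssistants, OpenList) %>%
  head()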
Your turn
- Can you do the same with model 2? Predict the local staff size for MEPs as the labor cost increases.
This time, you will have many scenarios. To plot the uncertainty, you will use geom_ribbon() rather than error bars.
#Scenario
newdata <- data.frame(
  Intercept = 1,
  OpenList = 1,
  #Create a vector using values from the descriptive stats
  LaborCost = seq(from = 4, to = 40, by = 0.1))

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod2$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod2)
)
#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulations
pred_summary <-
  apply(preds,
        #Apply this to columns (MARGIN = 2)
        MARGIN = 2,
        #Which function to apply?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))
ggplot(dfp3) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = LaborCost)) +
  #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = LaborCost),
    alpha = 0.3) +
  theme_minimal()
Non-linear effects
These techniques are really useful when the effect of x is not linear. This happens in linear models too, when we recode the x's: when the effect of a predictor differs depending on where we are on its scale (i.e. a curvilinear effect), and in interaction effects.
Here are two examples.
Curvilinear
Some MEPs will not seek re-election to the European Parliament. Instead, they will be running for office in national elections ("parachuting" to the national level). Therefore, we may also expect that they increase their local investment right before national elections.
The data contains a variable (ProxNatElection) that measures the time remaining until the next national parliamentary election.
I begin by exploring the bivariate relationship.
ggplot(df) +
geom_smooth(aes(x = ProxNatElection,
y = LocalAssistants))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We see that the relationship is curvilinear. There is no point in having a large local staff right after an election. At that point, it is better to use the parliamentary allowance for staffers in Brussels. However, we might expect that MEPs mobilize (i.e. increase local staff size) right before the election.
We can model this by using a quadratic term (Hermansen 2023, chap. 8).
mod3 <- lm(LocalAssistants ~
             LaborCost +
             OpenList +
             ProxNatElection +
             I(ProxNatElection^2),
           data = df)
summary(mod3)
##
## Call:
## lm(formula = LocalAssistants ~ LaborCost + OpenList + ProxNatElection +
## I(ProxNatElection^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.692 -1.859 -0.359 1.074 34.649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.24209 0.53779 6.029 2.62e-09 ***
## LaborCost -0.07587 0.01085 -6.993 6.06e-12 ***
## OpenList 0.92038 0.23796 3.868 0.00012 ***
## ProxNatElection -0.81047 0.47121 -1.720 0.08586 .
## I(ProxNatElection^2) -0.13127 0.10368 -1.266 0.20588
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.072 on 734 degrees of freedom
## Multiple R-squared: 0.09009, Adjusted R-squared: 0.08513
## F-statistic: 18.17 on 4 and 734 DF, p-value: 3.053e-14
The effect here is estimated as not significantly different from 0, but never mind: we will do it anyway. Be mindful when you construct your scenario. Use the descriptive statistics actively, and remember that both ProxNatElection and its quadratic term have to vary.
#Scenario
newdata <- data.frame(
  Intercept = 1,
  LaborCost = mean(df$LaborCost),
  OpenList = 0,
  ProxNatElection = seq(from = -4, to = 0, by = 0.1),
  ProxNatElection2 = seq(from = -4, to = 0, by = 0.1)^2
)
#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod3$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod3)
)
#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulations
pred_summary <-
  apply(preds,
        #Apply this to columns (MARGIN = 2)
        MARGIN = 2,
        #Which function to apply?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))
ggplot(dfp3) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = ProxNatElection)) +
  #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = ProxNatElection),
    alpha = 0.3) +
  theme_minimal() +
  ggtitle("The size of local staff as national elections approach") +
  ylab("Number of local staffers") +
  xlab("Proximity of national election")
Interaction
Interaction terms mean that the effect of one variable depends on the value of another variable in the regression. Here, we can formulate a hypothesis that eurosceptic MEPs are more likely to attempt a return to national politics instead of hanging out in an institution they despise.
"position" measures the party's EU-enthusiasm (pro-EU attitudes). We might expect that pro-integration MEPs are less sensitive to the domestic electoral cycle.
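To see why the scenarios below include a separate "Interaction" column, it helps to write out the model we are about to estimate:

\(y = \alpha + \beta_1 \text{LaborCost} + \beta_2 \text{OpenList} + \beta_3 \text{ProxNatElection} + \beta_4 \text{position} + \beta_5 (\text{ProxNatElection} \times \text{position})\)

The marginal effect of ProxNatElection is thus \(\beta_3 + \beta_5 \times \text{position}\): it depends on the party's EU position. This is why both ProxNatElection and the product term must vary together in the scenario data.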
mod4 <- lm(LocalAssistants ~
             LaborCost +
             OpenList +
             ProxNatElection *
             position,
           data = df)
summary(mod4)
##
## Call:
## lm(formula = LocalAssistants ~ LaborCost + OpenList + ProxNatElection *
## position, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.847 -1.848 -0.561 1.052 34.723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.52336 0.80885 8.065 3.47e-15 ***
## LaborCost -0.08480 0.01148 -7.386 4.61e-13 ***
## OpenList 0.99611 0.24027 4.146 3.83e-05 ***
## ProxNatElection 0.61116 0.31811 1.921 0.055134 .
## position -0.52739 0.14265 -3.697 0.000236 ***
## ProxNatElection:position -0.16618 0.06004 -2.768 0.005804 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.019 on 658 degrees of freedom
## (75 observations deleted due to missingness)
## Multiple R-squared: 0.1064, Adjusted R-squared: 0.09961
## F-statistic: 15.67 on 5 and 658 DF, p-value: 1.372e-14
To illustrate the effect, we have to set one of the variables (the moderator) to a fixed value while we let the other vary. Afterwards, we can do the same with a new value of the moderator.
I start by estimating the effect of the national electoral cycle among eurosceptics (position == 1; the minimum in the data).
#Check names
names(mod4$coefficients)
## [1] "(Intercept)" "LaborCost"
## [3] "OpenList" "ProxNatElection"
## [5] "position" "ProxNatElection:position"
#Scenario
newdata <- data.frame(
  Intercept = 1,
  LaborCost = mean(df$LaborCost),
  OpenList = 0,
  ProxNatElection = seq(from = -4, to = 0, by = 0.1),
  position = 1,
  Interaction = seq(from = -4, to = 0, by = 0.1) * 1
)
#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod4$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod4)
)
#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulations
pred_summary <-
  apply(preds,
        #Apply this to columns (MARGIN = 2)
        MARGIN = 2,
        #Which function to apply?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp4 <- cbind(newdata,
              t(pred_summary))
ggplot(dfp4) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = ProxNatElection)) +
  #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = ProxNatElection),
    alpha = 0.3) +
  theme_minimal() +
  ggtitle("Effect of national electoral cycle among EUROSCEPTICS") +
  ylab("Number of local staffers") +
  xlab("Proximity of national election")
Can you do the same with the most euroenthusiastic MEPs (position == 7)?
#Check names
names(mod4$coefficients)
## [1] "(Intercept)" "LaborCost"
## [3] "OpenList" "ProxNatElection"
## [5] "position" "ProxNatElection:position"
#Scenario
newdata <- data.frame(
  Intercept = 1,
  LaborCost = mean(df$LaborCost),
  OpenList = 0,
  ProxNatElection = seq(from = -4, to = 0, by = 0.1),
  position = 7,
  Interaction = seq(from = -4, to = 0, by = 0.1) * 7
)
#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod4$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod4)
)
#Matrix multiplication to get predicted values
preds <- coef.sim %*% t(newdata)

#Summarize the simulations
pred_summary <-
  apply(preds,
        #Apply this to columns (MARGIN = 2)
        MARGIN = 2,
        #Which function to apply?
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp5 <- cbind(newdata,
              t(pred_summary))
ggplot(dfp5) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = ProxNatElection)) +
  #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = ProxNatElection),
    alpha = 0.3) +
  theme_minimal() +
  ggtitle("Effect of national electoral cycle among EUROENTHUSIASTS") +
  ylab("Number of local staffers") +
  xlab("Proximity of national election")