Logistic regression models

In this session, we will be predicting outcomes from regression models. We do this both for interpretation and for model evaluation.

Introduction

Main methodological takeaways

GLMs are non-linear by construction, so interpretation – but also model evaluation – requires a bit more care. Usually, it will involve back-transforming your dependent variable. I distinguish four stages of transformation:

  • no transformation (logodds). These are the regression coefficients from the model output and your results table. They are useful for hypothesis testing.

  • partial reversal (odds). Marginal effects are a one-number summary of the effect of a particular variable; I obtain them by exponentiating the slope coefficient (the odds ratio).

  • reversal (probability). Predicted probabilities are useful for interpretation (first difference, effect graphics) and model assessment (Brier score, separation plot).

  • total reversal to original scale (binary). Useful mostly for model assessment (\(\chi^2\) test, ROC curve).

All model assessments are essentially i) a comparison between the predicted and the observed outcomes and – sometimes – ii) a comparison between models. This is what Ward and Ahlquist (2018) refer to as “model selection”: we are choosing the best model out of several.

Substantive topic (the research)

We will be working on some more of the insights from Hermansen and Pegan's (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP blog post.

This time we will play around with another implication of candidate-centered electoral systems: intra-party competition. We formulate the hypothesis that MEPs who will later compete against party colleagues are less prone to share resources; in particular, the local assistants that allow them to cultivate a personal vote. We furthermore make the assumption that – all else equal – the more strapped an MEP (and their party) is for money, the stronger the incentive to pool resources.

The data

We will work with a subset of the replication data for the article. It lists all MEPs present in Parliament in the spring semester of 2016.

I start by fetching the R packages we will use from the library.

library(tidyr);library(dplyr); library(ggplot2);
library(stargazer)

Analysis

As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.

We begin by loading in the data.

If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory. Note that this time, we work with data on MEPs from 2016.

download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2016.rda",
              destfile = "MEP2016.rda")

Now, I can load it into R and rename the object to my liking (df).

#Load in
load("MEP2016.rda")

#Rename
df <- MEP2016

Univariate distribution

To approximate MEPs’ propensity to share assistants with party colleagues, I have created a binary variable PoolsLocal that indicates whether the MEP has at least one local assistant on their payroll that they also share with colleagues. My main predictor is also binary: Is the MEP elected from a candidate-centered system (OpenList)?

I begin by exploring the univariate distribution in numbers, a table and a barplot.

mean(df$PoolsLocal)
## [1] 0.3536068

We see that the proportion of MEPs that share local assistants is 0.35. That is, the predicted probability of sharing assistants in a base-line intercept-only model is 0.3536068.
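If you want to verify that equivalence, here is a minimal sketch (the object name mod0 is mine) that fits an intercept-only logit with the glm() function introduced further below and back-transforms the constant:

#Intercept-only ("empty") model
mod0 <- glm(PoolsLocal ~ 1,
            family = "binomial",
            data = df)

#Back-transform the intercept from logodds to a probability
plogis(coef(mod0))

#...which matches the raw proportion
mean(df$PoolsLocal)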

For simple statistics, I prefer base plotting to ggplot2. However, you can mix the coding with dplyr pipes if you want to.

#In base
table(df$PoolsLocal)
## 
##   0   1 
## 457 250
barplot(table(df$PoolsLocal))

#Tidyverse
df$PoolsLocal %>%
  table %>%
  barplot

Since we will work with probabilities, it is useful to plot the relative distribution of the dependent variable.

df$PoolsLocal %>%
  table %>%
  prop.table %>%
  barplot(.,
          main = "Pooling resources among MEPs",
          xlab = "Pooling  (shares at least one assistant)",
          ylab=  "Proportion")

I then explore each of the predictors that I will use and make a table of descriptive statistics. I will use them actively when I make my interpretations.

df0 <- df %>% dplyr::select(PoolsLocal,
                            OpenList,
                            LaborCost,
                            SeatsNatPal.prop,
                            position)
library(stargazer)
stargazer(df0,
          #Title of the table
          title = "Descriptive statistics",
          #Type of output
          type = "html",
          #Summary stats of data frame
          summary = T,
          #Additional arguments for statistics
          iqr = T, median = T, min.max = T)
Descriptive statistics

Statistic          N    Mean    St. Dev.  Min     Pctl(25)  Median  Pctl(75)  Max
PoolsLocal         707   0.354   0.478    0        0         0       1         1
OpenList           707   0.463   0.499    0        0         0       1         1
LaborCost          707  23.953  11.222    4.400   10.900    26.700  33.000    42.000
SeatsNatPal.prop   686   0.229   0.173    0.000    0.071     0.241   0.357     0.668
position           656   4.943   1.890    1.059    3.429     5.786   6.400     7.000

Bivariate statistics

My main predictor this time is a binary variable that flags whether the MEP was elected – and therefore might run for reelection – in a candidate-centered system (“OpenList”).

I explore the variation between the two in a cross table. I first calculate the counts of the different value combinations (appropriate for discrete/categorical variables). I can use different R dialects and obtain the same results.

#Base
table(df[, c("OpenList", "PoolsLocal")])
##         PoolsLocal
## OpenList   0   1
##        0 208 172
##        1 249  78
#Tidyverse
df %>%
  #Groups/combinations of values
  group_by(PoolsLocal, OpenList) %>%
  #Count occurences
  count() %>%
  #Spread OpenList by columns: from tidyr package
  pivot_wider(names_from = PoolsLocal,
              values_from = n)

… I then transform to proportions. This is already a first step in the analysis. By calculating proportions row-wise (margin = 1), I make the distributions comparable.

#Base
df[, c("OpenList", "PoolsLocal")] %>%
  table %>%
  prop.table(., margin = 1 )
##         PoolsLocal
## OpenList         0         1
##        0 0.5473684 0.4526316
##        1 0.7614679 0.2385321

Now, we can clearly see that there is a larger proportion of MEPs that pool resources in party-centered systems (PoolsLocal == 1; OpenList == 0) compared to MEPs from candidate-centered systems (PoolsLocal == 1; OpenList == 1).

If I want to, I can test whether the relationship between the two variables is statistically significant already here. That is, I can perform a chisquare test (\(\chi^2\) test). It tests whether the combination of values (counts) in my table is different from what the univariate distributions would lead us to expect. The test relies on another probability distribution you have encountered: the \(\chi^2\) distribution. As you may recall from methods 1, it describes the sum of squares of independent standard normal random variables. Here, we have a small cross table with two variables (four cells): pooling resources and electoral system.

df[, c("OpenList", "PoolsLocal")] %>%
  table %>%
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 34.317, df = 1, p-value = 4.683e-09

We see that the p-value is really low (you’d have to move the decimal point 9 places to the left to write the number out in full). That is, it is highly unlikely that the combinations we see are produced by chance. We can reject the null hypothesis.

Usually, I wouldn’t bother doing the chisquare test here, because that’s the purpose of the regression. However, the chisquare test sets us up nicely for the model assessment where we are interested in whether the predicted and observed outcomes are correlated.

Run a binomial logistic regression

The function for generalized linear models in R is glm(). This time, we also have to specify the probability distribution that will allow us to map the recoded dependent variable (the logodds) to probabilities.

I run a base-line model with my one predictor, then a second more complex model where I also control for the resources the MEP has at his/her disposal.

mod1 <- glm(PoolsLocal ~ OpenList,
            family = "binomial",
            df)

mod2 <- glm(PoolsLocal ~ OpenList +
              LaborCost + 
              SeatsNatPal.prop,
            family = "binomial",
            df)

Present your results: table of results

Although Ward and Ahlquist (2018) don’t like regression tables (i.e. the BUTONs), it is useful to present them too.

stargazer::stargazer(mod1, mod2,
                     dep.var.caption = "Dependent variable",
                     title = "MEPs' propensity to share local assistants (a binomial logit)",
                     type = "html")
MEPs’ propensity to share local assistants (a binomial logit)

                      Dependent variable: PoolsLocal
                          (1)          (2)
OpenList               -0.971***    -1.124***
                        (0.166)      (0.181)
LaborCost                             0.056***
                                     (0.009)
SeatsNatPal.prop                     -1.930***
                                     (0.527)
Constant               -0.190*      -1.094***
                        (0.103)      (0.286)
Observations              707          686
Log Likelihood         -441.336     -392.832
Akaike Inf. Crit.       886.672      793.665
Note: *p<0.1; **p<0.05; ***p<0.01

Now that I have settled on my model, I will assess its performance, then use the interpretations to tell my story. There is no single, perfect model. Also, if there were any such thing, we wouldn’t know unless we saw some lower-performing alternatives. Therefore, the model assessment often involves a comparison between alternative models.
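A quick first comparison is the Akaike Information Criterion (AIC) already reported in the table: lower values indicate a better fit, penalized for the number of parameters. A minimal sketch (keep in mind that mod2 drops the observations with missing predictors, so the two models are not fit on exactly the same sample):

#Compare the two models by AIC (lower is better)
AIC(mod1, mod2)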

Model evaluation

An important step of any analytic work is, of course, to check if the model choice is correct. That is, before we begin the interpretation of the results, we check if the model provides an appropriate description of our phenomenon. All model statistics are – in effect – variations over a comparison between the predicted and the observed values.

The comparisons I make here involve a varying degree of reversal of the recoding of the dependent variable. We can either transform the predicted values into probabilities, or go all the way to discrete outcomes (binary). Each choice has its advantages.

While the interpretation of the results and the assessment of our hypothesis require us to engage with the uncertainty of the predictions, this is not an important point in the model evaluation. That is, we often rely on point estimates (Can you ask chatGPT what that is?). It also means that I will simply use the base-line R function for prediction without any simulations. I specify that I want the predicted values given as probabilities.

df$preds1 <- predict(mod1, 
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response")

We can admire the distribution of the predicted values in a histogram. What do you think?

hist(df$preds1)

We can do the same with the more complex model.

df$preds2 <- predict(mod2, 
                     newdata = df,
                     type = "response")

hist(df$preds2)

Is that better? What just happened?

Reversal to probabilities

The model assigns a probability of observing a positive outcome for each observation in our data. In our case, the model predicts the probability that an MEP shares staff with a colleague. In other words, a model that does a good job in describing the outcome would assign high probabilities to MEPs that share an assistant, and low probabilities to MEPs that don’t share.
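The Brier score mentioned in the introduction is a one-number summary of exactly this idea: the mean squared distance between the predicted probability and the observed outcome (0 or 1), where lower is better. A minimal sketch for model 1:

#Brier score: how far do the predicted probabilities fall from the 0/1 outcomes?
mean((df$preds1 - df$PoolsLocal)^2, na.rm = TRUE)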

A single-number summary: the Hosmer-Lemeshow test

The Hosmer-Lemeshow test orders the data according to the predicted probability, then slices the data into groups; often 10 groups. It then counts the number of successes per group and performs a chisquare test on that table.

df %>% 
  #Filter out potential NAs
  filter(!is.na(preds1)) %>%
  #Select my variables
  dplyr::select(PoolsLocal, preds1) %>%
  #create a variable that identifies observations by their decile
  mutate("decile" = ntile(preds1, 10)) %>%
  #Group by decile
  group_by(decile) %>%
  #Count the number of successes 
  summarise(sum(PoolsLocal)) %>% 
  #Do the chisquare test; I ask for a simulated p-value
  chisq.test(., simulate.p.value = T)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  .
## X-squared = 5.6222, df = NA, p-value = 0.7961

Your turn:

How should we interpret the results of the Hosmer-Lemeshow test?

The Hosmer-Lemeshow test is sensitive to how many groups we create, so it is not a one-stop answer to whether our predictions are any good.
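To see that sensitivity for yourself, you can mirror the chunk above with a different number of groups; here is a sketch with 5 groups (quintiles) instead of 10:

df %>% 
  #Filter out potential NAs
  filter(!is.na(preds1)) %>%
  dplyr::select(PoolsLocal, preds1) %>%
  #Quintiles instead of deciles
  mutate("quintile" = ntile(preds1, 5)) %>%
  group_by(quintile) %>%
  #Count the number of successes per group
  summarise(sum(PoolsLocal)) %>% 
  #Chisquare test with a simulated p-value, as above
  chisq.test(., simulate.p.value = T)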

A visual comparison: the separation plot

We can state the same intuition differently: we should see few MEPs that pool resources at the lower end of the probability scale, and a higher density of MEPs that share assistants at the high end of the probability scale.

We can even illustrate this in a plot: The “separation plot”.

df$preds1 <- predict(mod1, newdata = df,
                     type = "response")

tab <- df %>% 
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))

#Plot
ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(data = tab,
               aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate the segments by predicted outcome
                   color = as.factor(PoolsLocal)),
               #Make colors transparent in case we have multiple observations with the same probability
               alpha = 0.6) +
  scale_color_manual(values = c("grey", "purple"),
                     labels = c("Does not pool", "Pool"),
                     name = "Observed outcomes: MEPs pooling resources") +
  #A line that illustrates the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds1,
                x = index)) +
  #Esthetics
  ylab("Y-hat") +
  xlab("index") +
  ggtitle("Separation plot",
          subtitle = "Does the model assign high probabilities to positive outcomes?") +
  theme_minimal() +
  theme(axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "bottom",
        legend.direction = "vertical") 

The black line illustrates how the predicted probability increases when we order the data from low to high probability. If there is a clear separation in the predicted probabilities between the two groups, this line would have a sharp increase at some point. It would also start close to 0 and finish high up towards 1 on the y-axis. We would furthermore see many purple segments (PoolsLocal == 1) to the right of the plot, where the predicted probability of observing 1s is high, and many grey segments (PoolsLocal == 0) to the left, where the predicted probability is low.

The original R package for this plot still works, but its interface comes across as dated. If you run this code, you’ll have a little R window popping up with the plot; it doesn’t appear in your “Viewer” window.

library(separationplot)
separationplot(pred = df$preds1,
               actual = df$PoolsLocal)

If you are an avid R coder, you can write up your own function to make the plot (See the end of this script).

Your turn:

  • Copy-paste my code chunk for the separation plot into chatGPT and ask it to comment. What does it tell you?
  • Run the code on the predictions for model 2. How does the separation plot change?

Choosing a cut point

An alternative approach is to back transform the probability variable into 0s and 1s. But what cut point should I use? Ward and Ahlquist (2018) warn against simply slicing the variable in two by cutting at 0.5 (in the middle). You can explore the intuition for why that would be such a bad idea in the problem set.

As a rule of thumb, I often use the prediction from the base-line (intercept-only) model. It is the value at which I’d predict as much wrong in one direction as in the other if I had no other predictors in the model.

df$PoolsLocal_pred <-
  (df$preds1 > 0.35) %>%
  as.numeric()
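If you prefer not to hard-code the 0.35, the same rule of thumb can be computed from the data (the object name cut_baseline is mine; with model 1 it yields the same classification):

#Base-line cut point: the observed proportion of MEPs that pool resources
cut_baseline <- mean(df$PoolsLocal, na.rm = TRUE)

df$PoolsLocal_pred <-
  (df$preds1 > cut_baseline) %>%
  as.numeric()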

I can tabulate the predicted and the observed values against each other in a cross table.

df[c("PoolsLocal_pred", "PoolsLocal")] %>%
  table
##                PoolsLocal
## PoolsLocal_pred   0   1
##               0 249  78
##               1 208 172

To make the table more comparable, I can calculate the proportion of correct predictions by the values of the dependent variable. That is, I calculate the correct/false prediction rate for successes (PoolsLocal == 1) and the correct/false prediction rate for failures (PoolsLocal == 0).

tab <- 
  df[c("PoolsLocal_pred", "PoolsLocal")] %>%
  #cross table
  table %>% 
  #proportions by observed outcome (margin = 2: each column sums to 1)
  prop.table(., margin = 2)

tab
##                PoolsLocal
## PoolsLocal_pred         0         1
##               0 0.5448578 0.3120000
##               1 0.4551422 0.6880000

The proportion of correct predictions are reported along the main diagonal. The wrong predictions are reported in the off diagonal.

Here, I can clearly see that I have a 54% correct prediction rate for negative outcomes/failures (PoolsLocal == 0). In other words, my model performs only OK in predicting the MEPs that do not pool resources. It performs fairly well, however, when predicting MEPs that do pool resources (69%).

This illustrates an important point. When we reverse predictions to discrete variables, we will have to decide on our priorities. Would we want to predict true positives (i.e. correct predictions of 1s) really accurately, but be less stringent about the true negatives (i.e. less accurate predictions of 0s)? Or the opposite? Or would we like to balance the two? In most political science applications, we’d opt for the balanced outcome.
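If you want these quantities as named numbers rather than reading them off the table, a small sketch (the object name conf is mine) pulls them from the raw confusion table:

#Raw confusion table: predicted outcomes in rows, observed outcomes in columns
conf <- table(df$PoolsLocal_pred, df$PoolsLocal)

#True positive rate (sensitivity): share of observed 1s predicted as 1
conf["1", "1"] / sum(conf[, "1"])

#True negative rate (specificity): share of observed 0s predicted as 0
conf["0", "0"] / sum(conf[, "0"])

#Overall accuracy: share of all predictions that are correct
sum(diag(conf)) / sum(conf)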

Would you like to illustrate the correlation visually? Make a tile plot with strong colors for high values. Note that, compared to the printed table, the correct predictions now run from the bottom left up to the top right.

tab %>% 
  data.frame %>% 
  ggplot() +  
  geom_tile(aes(x = PoolsLocal_pred, 
                y = PoolsLocal, 
                fill = Freq)) +
  xlab("Observed outcomes") +
  ylab("Predicted outcomes") +
  scale_fill_gradient(high = "blue",
                      low = "white",
                      name = "Probability")

Letting the cutpoint vary: The ROC curve

The ROC curve is a graphical representation that shows the performance of a binary model at all possible cutpoints/thresholds (from probability 0 to 1). The model will predict either a positive or negative outcome for a given observation based on that probability threshold.

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different threshold values. As the threshold value/cut point increases, the model becomes more conservative and predicts fewer positive outcomes, which in turn decreases the false positive rate but also decreases the true positive rate.

A good classifier will have a high true positive rate and a low false positive rate, which corresponds to a curve that is closer to the top-left corner of the ROC plot. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, where a value of 1 indicates perfect classification and a value of 0.5 indicates random guessing.

#load necessary packages; install it if need be
library(pROC)

#define the object to plot
roc1 <- roc(
  #Observed values
  response = df$PoolsLocal, 
  #Predicted values
  predictor = df$preds1)

#Create the ROC plot
ggroc(roc1) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 1") +
  theme_classic()

We begin with the roc() function, which does the calculations for us (sliding the cutpoint, comparing outcomes). The ggroc() function then plots the object. It’s a ggplot object, so I can add all the usual esthetics to it.

Let’s see how my more complex model is doing.

#define the object to plot
roc2 <- roc(
  #Observed values
  response = df$PoolsLocal, 
  #Predicted values
  predictor = df$preds2)

#Create the ROC plot
ggroc(roc2) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 2") +
  theme_classic()

We can now see how the model improves. The package even has a function to explicitly compare (and choose between) models.

roc2 <- roc(
  #Observed values
  response = df$PoolsLocal, 
  #Predicted values
  predictor = df$preds2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc.test(roc1, roc2)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc1 and roc2
## Z = -6.3678, p-value = 1.918e-10
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  -0.14291608 -0.07564453
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.6236973   0.7329776

The test reports the area under the curve (AUC) for each model and even tests whether the difference in AUCs between the models is statistically significant.

The roc2 object is a list. It means that it contains a lot of information that we can extract using indexing. I can ask for all the list elements.

names(roc2)
##  [1] "percent"            "sensitivities"      "specificities"     
##  [4] "thresholds"         "direction"          "cases"             
##  [7] "controls"           "fun.sesp"           "auc"               
## [10] "call"               "original.predictor" "original.response" 
## [13] "predictor"          "response"           "levels"

… and extract the area under the curve:

roc2$auc
## Area under the curve: 0.733

What is my ideal cut point/threshold value? The ROC curve is mostly used for assessing the model’s performance. However, if we have a set of cutpoints and the true positive and the true negative prediction rate for each cutpoint, then we can find the cutpoint that best balances between the two types of errors. That is, we are interested in when the two lines intersect.

ggplot() +
  geom_line(aes(y = roc2$sensitivities,
                x = roc2$thresholds))  +
  geom_line(aes(y = roc2$specificities,
                x = roc2$thresholds),
            lty = 2)

id.threshold <- (roc2$sensitivities - roc2$specificities) %>% 
  abs %>%
  which.min

#Best balance
roc2$thresholds[id.threshold]
## [1] 0.3884075
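pROC also ships a helper, coords(), that reports a “best” threshold directly. Note that its default criterion (Youden’s J) maximizes sensitivity + specificity rather than balancing the two, so it will not necessarily match the value above:

#Threshold that maximizes sensitivity + specificity (Youden's J)
coords(roc2, x = "best", best.method = "youden")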

For a recap, you may for example read this intro: https://drfloreiche.github.io/logit.html

Interpretation

Once we are satisfied with the performance of our model, we can start playing around with it. What does it have to tell us?

1. Regression coefficients

The equation that R optimizes for us is linear. The regression coefficient \(\beta\) therefore reports changes in logodds (my recoded \(y\)).

\(y \sim \alpha + \beta \times x\)

Regression coefficients reported in logodds are useful for the hypothesis testing. I can easily see whether my effects are positive or negative; and whether they are statistically likely to be different from zero. From the results table, I can also see that intra-party competition (OpenList) reduces MEPs’ likelihood of pooling resources. But by how much?
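For the hypothesis test itself, the z-values and p-values sit in the model summary, and confidence intervals on the logodds scale are one function call away; a minimal sketch for model 2:

#Coefficients, standard errors, z-values and p-values (logodds scale)
summary(mod2)$coefficients

#Confidence intervals on the logodds scale
confint(mod2)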

GLMs have an inbuilt interaction effect due to the recoding of the observed variable (the log-transformation) and the models’ mapping of events on the Bernoulli probability distribution (i.e. a binomial probability distribution with one success/failure for each trial). This requires a more careful interpretation on our side. That’s what exercise 1 in the problem set for today aimed to illustrate.

When I interpret, I usually follow up with three additional steps.

2. Marginal effect

In linear regression models, we can interpret the \(\beta\) coefficient (slope parameter/coefficient) as the expected increase in \(y\) when \(x\) increases by one unit. In logistic regression models, the \(\beta\) coefficient represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable. When I partially back-transform the slope parameter in a logistic regression, I obtain the odds ratio (exercise 2b). To do so, I exponentiate the slope parameter. The odds ratio represents the ratio of the odds of the dependent variable between two groups that differ by one unit on the independent variable.

exp(mod1$coefficients[2])
##  OpenList 
## 0.3788176

When we do this, the reference point is 1. All ratios above 1 imply a positive relationship; all below 1, a negative one. The odds ratio is the factor by which the odds of \(y\) are multiplied when \(x\) increases by one unit. Here, the odds change by a factor of 0.379.

Unless your exponentiated coefficient is above 2 (i.e. a doubling), the most intuitive way to communicate the result is in % (not percentage points, but relative increase/decrease). This is particularly true for negative relationships.

exp(mod1$coefficients[2])-1
##   OpenList 
## -0.6211824

In other words, the odds that MEPs from candidate-centered systems share an assistant are 62% lower than for members from party-centered systems.
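The same exponentiation works on the whole coefficient vector at once, which is a quick way to read the full model in odds ratios; intervals that cross 1 are not statistically significant:

#Odds ratios with confidence intervals for model 2
exp(cbind(OddsRatio = coef(mod2), confint(mod2)))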

3. First differences and scenarios

Because of the in-built interaction effect, it is particularly important to go all the way to the predicted effect by backtransforming from the logodds.

Logistic transformation: \[\begin{aligned}\frac{1}{1+exp(-x)}\end{aligned}\]

or

\[\begin{aligned}\frac{exp(x)}{1+exp(x)}\end{aligned}\]
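In R, you do not have to type this transformation out by hand: plogis() is the built-in logistic (inverse-logit) function, and qlogis() goes the other way. A quick sanity check against the formula above:

x <- 0.5 #an arbitrary logodds value

#Hand-coded logistic transformation
1/(1 + exp(-x))

#Built-in equivalent
plogis(x)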

First differences are useful to illustrate a point to the reader. That is, we choose two (or four) typical scenarios and tell the reader what the absolute change in the predicted probability would be. Until now, we have relied on all-else-equal statements when constructing scenarios. However, when we illustrate with first differences, we do not have to do that. It is more important to draw on the typical and make predictions in that neighborhood. This is where your substantive knowledge (and close study of the descriptive statistics) comes in. The actual predicted effect (first difference) may or may not be impressive, depending on the size of the intercept and the values of the other variables.

If the point is merely to give point estimates (expected values, average predicted effect…), I use the predict() function on a hypothetical dataset.

Here, I illustrate the predicted effect of the 2022 election on MEPs from Venstre.

I use dplyr to select variable values from the actual data on Denmark. You can of course just use data.frame() and specify by hand. The seat share for my scenario, I drew from Wikipedia.

share2019 <- 43/179
share2022 <- 23/179

newdata <-
  df %>%
  filter(Nationality == "Denmark") %>%
  dplyr::select(OpenList, LaborCost)%>%
  unique %>%
  cbind(SeatsNatPal.prop = c(share2019, share2022),
        Year = c(2019, 2022))
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
preds <- predict(mod2, 
                 newdata = newdata,
                 type = "response")

#Predicted probabilities
preds
##         1         2 
## 0.4174272 0.4706101
#First difference
diff(preds)
##          2 
## 0.05318288

We see that the Danish Members of the European Parliament had a 5 percentage point higher predicted probability of sharing an assistant after the catastrophic national election in 2022.

Note my vocabulary: When we deal with absolute predicted changes, we talk about percentage points. % is reserved for relative changes.

Graphics

An image speaks more than a thousand words; especially when we work with GLMs. Put your scenarios on speed and illustrate the effect of a predictor over its entire range instead of a simple first difference. When you do this, you will go back to the all-else-equal ideal: set the other variables to a typical but constant value, and let your predictor of choice vary.

When the predictor is binary, the tilted coefplot is a good choice. When it is continuous, you can use the classical effect graphic by drawing a line. In fact, the entire procedure is exactly the same as before, except for one detail: You need to reverse your dependent variable to a probability. You do this after you have simulated the predicted values (logodds), before you start plotting.

I begin by setting up my scenario. When doing so, I rely heavily on my descriptive table. Here, I continue with my riff on Venstre, but illustrate the effect of the electoral system.

#Scenario
newdata <- data.frame(
  Intercept = 1,
  OpenList = c(0, 1),
  #Labor cost in Denmark.
  LaborCost = 42,
  SeatsNatPal.prop = 23/179)

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod2$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod2)
)

#Matrix multiplication to get predicted LOGODDS
logodds <- coef.sim %*% t(newdata)

#Reverse recoding to probability
preds <- 1/(1+exp(-logodds)) #or exp(logodds)/(1 + exp(logodds))

#Summarize the simulation
pred_summary <-
  apply(preds,
        #Apply this to columns (==2)
        MARGIN = 2,
        #Which function to apply
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))


ggplot(dfp3) +
  #Plot a line this time
  geom_point(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = OpenList)
  ) +
  #plot a little line to insinuate a move from one discrete value to another
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = OpenList),
    lty = 2, color = "grey",
  ) +
  #Plot the uncertainty (error bars)
  geom_errorbar(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = OpenList),
    alpha = 0.3,
    #Error bars "sans serif"
    width = 0) +
  theme_minimal()

For the R enthusiast

Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. A sure sign that a function might be useful is when I have a code block where I only change one or two things between each time I use it.

The function that creates functions (sic!) is called function().

  1. In the parenthesis, we put in the arguments (i.e. the data/stuff that varies between each time we run the code).

  2. After that, we open a set of curly brackets. This is where we put our R codes.

  3. The last line inside the curly brackets states what output R should return when you run your function: return(). R will only return one object per function.

  4. Store your function in an R object. my_function <- function(){}

A toy example.

I want a function that tells me if the number I put into it is smaller or larger than 10. My argument x is the input; the output is given in return(). For ease, I create a default value for x: I chose x = 1. However, once the function is compiled (run the R code), the user may redefine x to whatever value they want.

bigger_than10 <- function(x = 1){ 
  return(x > 10)
}

#Try it out
bigger_than10(x = 20)
## [1] TRUE

If you like your function and want to use it later, you can store it as an R object in your working directory.

save(bigger_than10,
     file  = "bigger_than10.rda")

Later, you fetch it as usual.

load("bigger_than10.rda")

Your turn

Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that actually vary across models: the predicted and the observed variables. By standardizing their names, you can write more general R code.

  1. create two R objects/vectors – y and y_pred – on the basis of df$PoolsLocal and df$preds1.

  2. identify all the times the observed y (PoolsLocal) is used in the code. Replace the name by y. Do the same with y_pred and preds1.

  3. create a data table df0 by a data.frame(y, y_pred). This is just to make sure your two variables are the same length. If they’re not, R will throw you a warning.

  4. start the following code (copy paste from previous) without the variable selection element (line 2); you don’t need it. If the code runs, you’re good to go.

  5. pack the code chunk into separation_plot <- function(y = NULL, y_pred = NULL){...}.

tab <- df %>% 
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))

#Plot
ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(data = tab,
               aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate the segments by predicted outcome
                   color = as.factor(PoolsLocal)),
               #Make colors transparent in case we have multiple observations with the same probability
               alpha = 0.6) +
  scale_color_manual(values = c("grey", "purple")) +
  #A line that illustrates the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds1,
                x = index)) +
  #Esthetics
  ylab("Y-hat") +
  xlab("index") +
  ggtitle("Separation plot") +
  theme_minimal()

Want another challenge?

  • What aesthetics would you have in the function that the user can later alter?
  • The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?

Tips: Your challenge will probably be to make sure that the two vectors are of equal length. If you have missing data in the model, this will not be the case. I solved it by indexing on the row names, but there are many ways around the problem. One possible sketch of the basic function follows below.
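Here is one possible sketch of such a function, assuming you pass it the observed outcomes and the predicted probabilities directly (the wrapper that takes a model object is left as the challenge above):

separation_plot <- function(y = NULL, y_pred = NULL){
  #Collect the two vectors; data.frame() fails early if they differ in length
  df0 <- data.frame(y, y_pred)
  
  tab <- df0 %>%
    #Randomize the order of the data
    arrange(sample(nrow(df0))) %>%
    #Sort the data from low to high probability
    arrange(y_pred) %>%
    #Create an index variable for the x-axis
    mutate(index = 1:nrow(df0))
  
  p <- ggplot() +
    #Vertical lines == one per observation in the data
    geom_segment(data = tab,
                 aes(x = index,
                     xend = index,
                     y = 0,
                     yend = 1,
                     color = as.factor(y)),
                 alpha = 0.6) +
    scale_color_manual(values = c("grey", "purple")) +
    #A line that illustrates the increase in predicted probability
    geom_line(data = tab,
              aes(y = y_pred,
                  x = index)) +
    ylab("Y-hat") +
    xlab("index") +
    ggtitle("Separation plot") +
    theme_minimal()
  
  return(p)
}

#Try it out
separation_plot(y = df$PoolsLocal, y_pred = df$preds1)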

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press.
Hermansen, Silje Synnøve Lyder, and Andreja Pegan. 2024. “Blurred Lines Between Electoral and Parliamentary Representation: The Use of Constituency Staff Among Members of the European Parliament.” European Union Politics, 1–25. https://doi.org/10.1177/14651165221149900.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.