Logistic regression models

Binomial logistic models allow us to analyze binary outcomes. They are part of the GLM family of models, meaning that they rely on a recoding strategy to make the dependent variable continuous and unbounded. They also require us to pick a probability distribution that links observed outcomes to the probability of the event. R does all of this for us, as long as we specify the model correctly. However, the interpretation requires us to partially or completely reverse the recoding. Also, although the model assessment compares predicted and observed outcomes, we face the additional challenge of deciding on a useful cut point at which we believe the latent variable translates into 1s instead of 0s.

New R codes:

  • new from base: glm(), prop.table(), chisq.test(), predict(., type = "response")
  • new from dplyr package: n(), distinct()
  • new from ggplot: geom_errorbar(), geom_segment() (less important)
  • new from pROC: roc()
  • new from separationplot: separationplot()
  • new from tidyr: pivot_wider() (less important)

Old R codes:

  • old from base: dim(), table(), as.data.frame()
  • old from dplyr package: group_by(), mutate(), reframe()
  • old from ggplot: geom_bar(), geom_col(), geom_point()
  • old from ggeffects: ggpredict(., terms = , condition = )

Introduction

Main methodological takeaways

GLMs are non-linear by construction, so interpretation – but also model evaluation – requires a bit more care. Usually, it will involve back-transforming your dependent variable. I distinguish four stages of transformation (illustrated in the sketch after this list):

  • no transformation (logodds). These are the regression coefficients from the model output and your results table. They are useful for hypothesis testing.

  • partial reversal (odds ratio, change in odds). Marginal effects are a one-number summary of the relative effect of a particular variable. I obtain them by exponentiating the slope coefficient.

  • reversal (probability). Predicted probabilities are useful for interpretation (first difference, effect graphics) and model assessment (Brier score, separation plot).

  • total reversal to original scale (binary). Useful mostly for model assessment (\(\chi^2\) test, ROC curve).
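To make the stages concrete, here is a minimal sketch in R, assuming a hypothetical fitted logit object mod with a single binary predictor x (placeholder names, not the data we use below).

#1) no transformation: the raw slope coefficient, in logodds
b <- coef(mod)["x"]
b
#2) partial reversal: the odds ratio
exp(b)
#3) reversal: the predicted probability when x = 1
plogis(coef(mod)["(Intercept)"] + b)
#4) total reversal: back to a binary prediction, given a cut point of 0.5
as.numeric(plogis(coef(mod)["(Intercept)"] + b) > 0.5)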

All model assessments are essentially i) a comparison between the predicted and the observed outcome and – sometimes – ii) a comparison between models. This is what Ward and Ahlquist (2018) refer to as “model selection”: we are choosing the best model out of several.

Substantive topic (the research)

We will be working on some more of the insights from Hermansen and Pegan's (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.

This time we will play around with another implication of the electoral systems: intra-party competition. We formulate a hypothesis that MEPs who will later compete against party colleagues are less prone to share resources. In our case, we measure whether MEPs share at least one local assistant with their national party colleagues. Local assistants work in the circumscription/member state and allow MEPs to cultivate a personal vote. We therefore compare MEPs from candidate-centered systems to MEPs from party-centered systems. We furthermore make the assumption that – all else equal – the more strapped an MEP (and their party) is for money, the stronger incentives they have to pool resources. We approximate this by measuring the average labor cost in the MEP's member state and the party's size in the national parliament.

The data

The data we will work with are a subset of the replication data for the article. They list all MEPs present in parliament in the spring semester of 2016.

I start by fetching the R packages we will use from the library.

#Data wrangling
library(tidyr);library(dplyr); 
#Presentation
library(ggplot2);
library(stargazer); library(ggeffects); 
#New packages for model assessment
library(separationplot); library(pROC)

Descriptive statistics

As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.

We begin by loading in the data.

If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory. Note that this time, we work with data on MEPs from 2016.

download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2016.rda",
              destfile = "MEP2016.rda")

Now, I can load it into R and rename the object to my liking (df).

#Load in
load("MEP2016.rda")

#Rename
df <- MEP2016

Univariate distribution

To approximate MEPs’ propensity to share assistants with party colleagues, I have created a binary variable PoolsLocal that indicates whether the MEP indeed has at least one local assistant on their payroll that they share with colleagues. My main predictor is also binary: Is the MEP elected from a candidate-centered system (OpenList)?

I begin by exploring the univariate distribution in numbers, a table and a barplot. The most interesting thing here is to figure out how common the phenomenon is on average. I therefore calculate the mean, i.e. the proportion of MEPs that pool resources.

mean(df$PoolsLocal)
## [1] 0.3536068

We see that the proportion of MEPs that share local assistants is 0.35. That is, the predicted probability of sharing assistants in a base-line intercept-only model is 0.3536068.
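A quick, optional check of this equivalence is to fit an intercept-only model and back-transform its constant; mod0 is just a name I pick here.

#Intercept-only ("empty") model
mod0 <- glm(PoolsLocal ~ 1, family = "binomial", data = df)
#Back-transform the intercept from logodds to a probability: approx. 0.354
plogis(coef(mod0))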

If my phenomenon is rare or very common (outcome close to 0 or 1), it is a sign that

  • the binomial logistic regression likely yields different results than a linear model (ref. the S-shape).
  • I might have problems correctly predicting successes (or failure), because they are rare.

For simple statistics, I prefer base plotting to ggplot2. However, you can mix the coding with dplyr pipes if you want to.

#In base
table(df$PoolsLocal)
## 
##   0   1 
## 457 250
#In dplyr
df %>%
  group_by(PoolsLocal) %>%
  #Count number of observations in the group
  reframe(N = n())
#Quick one in base R
barplot(table(df$PoolsLocal))

#dplyr + ggplot2
df %>%
  ggplot +
  #Function automatically makes frequency table and displays it
  geom_bar(aes(PoolsLocal)) +
  #Plot title
  ggtitle("Distribution of MEPs that share local assisants (pool resources)")

Since we will work with probabilities, it is useful to plot the relative distribution of the dependent variable. I can use base R to transform the frequency table (table) to proportions (prop.table()).

df$PoolsLocal %>%
  table %>%
  prop.table 
## .
##         0         1 
## 0.6463932 0.3536068

To illustrate that in ggplot, I make a new data frame where I calculate the height of each column, then make a barplot. When I provide both x-values and y-values to the graphic, the correct function is geom_col().

#Tidyverse + ggplot2
df %>%
  #Group by outcome
  group_by(PoolsLocal) %>%
  reframe(
    #Number of observations in each group
    n = n(),
    #Number of observations in the data
    N = dim(df)[1],
    #The proportion
    prop = n/N) %>%
  ggplot +
  geom_col(aes(y = prop,
               x = PoolsLocal)) +
  ggtitle("Distribution of MEPs that share local assisants (pool resources)")

I then explore each of the predictors and make a table of descriptive statistics. I will use them actively when I make my interpretations. Here, I create a separate data table with only the relevant variables, so that I can pass them to stargazer() for descriptive statistics.

df0 <- 
  df %>% 
  #The select() function has namesakes in other packages, so I often specify which package I'm drawing on
  dplyr::select(PoolsLocal,
                OpenList,
                LaborCost,
                SeatsNatPal.prop,
                position)

It is already useful here to make a table of descriptive statistics. I can use stargazer() to get them automatically. Here, I nevertheless specify what type of statistics I want (interquartile range, median, minimum and maximum).

library(stargazer)
stargazer(df0,
          #Title of the table
          title = "Descriptive statistics",
          #Type of output
          type = "html",
          #Summary stats of data frame
          summary = T,
          #Additional arguments for statistics
          iqr = T, median = T, min.max = T)
Descriptive statistics

Statistic          N    Mean    St. Dev.  Min    Pctl(25)  Median  Pctl(75)  Max
PoolsLocal         707  0.354   0.478     0      0         0       1         1
OpenList           707  0.463   0.499     0      0         0       1         1
LaborCost          707  23.953  11.222    4.400  10.900    26.700  33.000    42.000
SeatsNatPal.prop   686  0.229   0.173     0.000  0.071     0.241   0.357     0.668
position           656  4.943   1.890     1.059  3.429     5.786   6.400     7.000

Bivariate statistics

My main predictor this time is a binary variable that flags whether the MEP was elected – and therefore might run for reelection – in a candidate-centered system (“OpenList”).

I explore the variation between the two in a cross table. I first calculate the counts of the different value combinations (appropriate for discrete/categorical variables). I can use different R dialects and obtain the same results.

#Base
table(df[, c("OpenList", "PoolsLocal")])
##         PoolsLocal
## OpenList   0   1
##        0 208 172
##        1 249  78
#dplyr + tidyr (for those who look for an R challenge)
df %>%
  #Groups/combinations of values
  group_by(PoolsLocal, OpenList) %>%
  #Count occurences
  reframe(n = n())%>%
  #Spread OpenList by columns: from tidyr package
  tidyr::pivot_wider(names_from = PoolsLocal,
                     values_from = n)

… I then transform to proportions. This is already a first step in the analysis. By calculating proportions row-wise (margin = 1), I make the distributions comparable.

#Base
df[, c("OpenList", "PoolsLocal")] %>%
  table %>%
  prop.table(., margin = 1 )
##         PoolsLocal
## OpenList         0         1
##        0 0.5473684 0.4526316
##        1 0.7614679 0.2385321
#dplyr + tidyr (for those who look for an R challenge)
df %>%
  #Group by the predictor first (to calculate proportion of sharing within each value of x)
  group_by(OpenList) %>%
  #Count number of occurences of each value of the predictor
  mutate(N = n()) %>%
  #Groups/combinations of values
  group_by(PoolsLocal, OpenList) %>%
  #Count occurences and divide by observations in that predictor
  mutate(prop = n()/N) %>%
  #Distinct observations (alternative to mutate)
  distinct(PoolsLocal, prop) %>%
  #arrange by values
  arrange(OpenList, PoolsLocal) %>%
  #Spread by columns: from tidyr package
  tidyr::pivot_wider(names_from = PoolsLocal,
                     values_from = prop)

Now, we can clearly see that there is a larger proportion of MEPs that pool resources in party-centered systems (PoolsLocal == 1; OpenList == 0) compared to MEPs from candidate-centered systems (PoolsLocal == 1; OpenList == 1).

If I want to, I can test whether the relationship between the two variables is statistically significant even here. That is, I can perform a chi-square test (\(\chi^2\) test). It tests whether the combination of values (counts) in my table is different from what the univariate distributions would indicate. The test relies on another probability distribution you have encountered in your intro class to statistics: the \(\chi^2\) distribution. As you may recall from methods 1, it describes the sum of squares of independent standard normal random variables. Here, we have a small cross table with two variables (four cells): pooling resources and electoral system.
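To see the mechanics, here is a by-hand sketch of the expected counts under independence; xtab and expected are names I introduce here, and the statistic omits Yates' continuity correction, so it will not exactly match the chisq.test() output below.

#Cross table of the two binary variables
xtab <- table(df$OpenList, df$PoolsLocal)
#Expected counts if the two variables were independent
expected <- outer(rowSums(xtab), colSums(xtab)) / sum(xtab)
expected
#Chi-square statistic (without continuity correction)
sum((xtab - expected)^2 / expected)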

df[, c("OpenList", "PoolsLocal")] %>%
  table %>%
  chisq.test()
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 34.317, df = 1, p-value = 4.683e-09

We see that the p-value is really low (you’d have to move the decimal point nine places to the left to get all the digits). That is, it is highly unlikely that the combinations we see are produced by accident. We can discard the null hypothesis.

Usually, I wouldn’t bother with the chi-square test here, because that’s the purpose of the regression. However, it sets us up nicely for the model assessment, where we are interested in whether the predicted and observed outcomes are correlated.

Analysis

Estimate a binomial logistic regression

The function for generalized linear models in R is glm(). This time, we also have to specify the probability distribution that will allow us to map the recoded dependent variable (the logodds) to probabilities.

I run a base-line model with my one predictor, then a second more complex model where I also control for the resources the MEP has at his/her disposal.

mod1 <- glm(PoolsLocal ~ OpenList,
            family = "binomial",
            df)

mod2 <- glm(PoolsLocal ~ 
              OpenList +
              LaborCost + 
              SeatsNatPal.prop,
            family = "binomial",
            df)

Present your results: table of results

Although Ward and Ahlquist (2018) don’t like regression tables (i.e. the BUTONs; Big Ugly Table of Numbers), it is useful to present them too.

stargazer::stargazer(mod1, mod2,
                     dep.var.caption = "Dependent variable",
                     title = "MEPs' propensity to share local assistants (a binomial logit)",
                     type = "html")
MEPs’ propensity to share local assistants (a binomial logit)

                       Dependent variable: PoolsLocal
                       (1)            (2)
OpenList               -0.971***      -1.124***
                       (0.166)        (0.181)
LaborCost                             0.056***
                                      (0.009)
SeatsNatPal.prop                      -1.930***
                                      (0.527)
Constant               -0.190*        -1.094***
                       (0.103)        (0.286)
Observations           707            686
Log Likelihood         -441.336       -392.832
Akaike Inf. Crit.      886.672        793.665
Note: *p<0.1; **p<0.05; ***p<0.01

The table lets us compare regression coefficients directly between models. They are reported as changes in logodds (our recoded, “latent” variable). However, the sizes of the coefficients are not really comparable across GLM models, because the coefficient is always proportional to the intercept. Here, we see that the marginal effect of the electoral system has a larger coefficient in the second model, but the intercept is also much smaller. The relative change in the probability (or odds; or logodds) is therefore larger, but it comes from a lower base-line level. The table is still useful for checking the precision (standard error and stars) of the estimates.

Now that I have settled on my model, I will assess its performance, then use the interpretations to tell my story. There is no single, perfect model. Also, if there were any such thing, we wouldn’t know unless we saw some lower-performing alternatives. Therefore, the model assessment often involves a comparison between alternative models.
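One such quick comparison is the AIC already reported in the table above; here is the same check in code (lower values indicate a better fit, penalized for the number of parameters). Note that mod1 and mod2 are fitted on slightly different samples because of missing data, so the comparison is only indicative.

#Compare the two fitted models by AIC (lower is better)
AIC(mod1, mod2)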

Model assessment

An important step of any analytic work is, of course, to check if the model choice is correct. That is, before we begin the interpretation of the results, we check if the model provides an appropriate description of our phenomenon. All model statistics are – in effect – variations over a comparison between the predicted and the observed values.

The comparisons I make here involve a varying degree of reversal of the recoding of the dependent variable. We can either transform the predicted values into probabilities, or go all the way to discrete outcomes (binary). Each choice has its advantages.

Reversal to probabilities

The model predictions can be reported as a probability of observing a positive outcome for each observation in our data. In our case, the model predicts the probability that an MEP shares staff with a colleague. In other words, a model that does a good job in describing the outcome would assign high probabilities to MEPs that share an assistant, and low probabilities to MEPs that don’t share.

I use the predict(., type = "response") function in base R to specify that I want the predicted values given as probabilities.

df <- df %>%
  mutate(
    #Prediction from model 1
    preds1 = predict(mod1, 
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response"),
    #Predictions from model 2
    preds2 = predict(mod2, 
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response"),
  )

We can admire the distribution of the predicted values in a histogram. What do you think?

df %>%
  ggplot +
  geom_histogram(aes(preds1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

My predictions only involve two possible outcomes. That’s because the model only has one binary predictor.

We can do the same with the more complex model.

df %>%
  ggplot +
  geom_histogram(aes(preds2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 21 rows containing non-finite values (`stat_bin()`).

The histogram of predicted probabilities gives a sense of their distribution, but how can we compare them with the model’s observed outcomes? The challenge in the assessment is that we compare a continuous prediction with discrete observed values.

A single-number summary

Brier score

Ward and Ahlquist (2018) suggest using the Brier score as a single-number summary of how well your model predicts in sample.

The Brier score summarizes how far the predicted probabilities fall from the observed outcomes (something like a counterpart to the \(R^2\) for linear models). It calculates the difference between the predicted and observed outcomes, squares these differences and takes their mean. In an unbiased model, the residuals average out to 0; hence the squaring.

\[B_b \equiv \frac{1}{n}\Sigma^{n}_{i = 1}(\hat{\theta_i} - y_i)^2\]

df %>%
  mutate(diff = preds2 - PoolsLocal,
         diff2 = diff^2) %>%
  reframe(Brier = mean(diff2, na.rm = T))

The Brier score ranges from 0 to 1 where lower scores indicate smaller distance between the predicted and observed outcome. Here, my model seems to perform pretty well.
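The same quantity can be computed as a one-liner in base R, which is a handy sanity check on the pipe above.

#Brier score in one line (should match the value above)
mean((df$preds2 - df$PoolsLocal)^2, na.rm = TRUE)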

The Hosmer-Lemeshow test

The Hosmer-Lemeshow test orders the data according to the predicted probability of a positive outcome, then slices the data into groups; often 10 groups. It then counts the number of successes per group and performs a chi-square test on the resulting table.

There are several packages that perform the test for you:

glmtoolbox We can use the glmtoolbox package to fetch the function for the test. It suffices to feed the model object into the hltest() function.

#install.packages("glmtoolbox")
library(glmtoolbox)
## Warning: package 'glmtoolbox' was built under R version 4.3.3
hltest(mod2, df = 8)
## 
##    The Hosmer-Lemeshow goodness-of-fit test
## 
##  Group Size Observed  Expected
##      1   69        4  5.897530
##      2   70       14  9.631471
##      3   69       14 12.535274
##      4   67        5 17.903488
##      5   63       22 20.598544
##      6   70       43 26.823364
##      7   65       21 28.580932
##      8   83       44 42.546345
##      9   73       38 42.634777
##     10   57       41 38.848275
## 
##          Statistic =  37.09993 
## degrees of freedom =  8 
##            p-value =  1.1032e-05

Here, we see that in the first decile of the predicted probabilities, 4 MEPs actually share assistants, while we – when we sum over the probabilities – expect about 6 MEPs (5.9) to do so.

Note that the test in principle checks whether the observed and the predicted outcomes are significantly different from each other. We don’t want that; we want the two groups to be similar to each other. Here, the p-value is \(1.1 \times 10^{-5}\); bad news for us.

If I were to run this test on my base-line model, I’d get an error message.

hltest(mod1)

That’s because I only have two predicted outcomes, while the test requires us to create ten groups (deciles).

ResourceSelection Another package is the ResourceSelection package, using the hoslem.test() function. It asks for the observed outcomes and the predicted probabilities, respectively. If you have NAs in your prediction, the function breaks down, and you’d have to subset the variables to get a comparison.

library(ResourceSelection)
## Warning: package 'ResourceSelection' was built under R version 4.3.3
## ResourceSelection 0.3-6   2023-06-27
hoslem.test(df$PoolsLocal[!is.na(df$preds2)], 
            df$preds2[!is.na(df$preds2)])
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  df$PoolsLocal[!is.na(df$preds2)], df$preds2[!is.na(df$preds2)]
## X-squared = 50.983, df = 8, p-value = 2.644e-08

The Hosmer-Lemeshow test is sensitive to how many groups we create, so it is not a single fix to tell us whether our predictions are any good. As you can see, the different test functions also yield slightly different results, so you will have to use some discernment when assessing your model fit.
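If you want to see the mechanics behind these packages, here is a rough by-hand sketch of the Hosmer-Lemeshow logic: slice the predictions into deciles and compare observed and expected successes per group. It is only an illustration; the packages above compute the test statistic and its degrees of freedom for you.

#By-hand sketch: observed vs. expected successes per decile of predicted probability
df %>%
  filter(!is.na(preds2)) %>%
  #Cut the predictions into ten equally sized groups
  mutate(group = ntile(preds2, 10)) %>%
  group_by(group) %>%
  reframe(Observed = sum(PoolsLocal),
          Expected = sum(preds2))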

A visual comparison: the separation plot

We can state the same intuition differently: we should see few MEPs that pool resources on the lower end of the predicted probability scale, and a higher density of MEPs that share assistants on the high end of the probability scale.

We can even illustrate this in a plot: The “separation plot”.

Here, I select the two variables (observed and predicted outcomes). I then randomly re-order the data by drawing my data anew using the sample() function. This is because I will arrange the data according to the predicted probability to share assistants, but observations with identical probability may vary in whether I observe a 0 or a 1 in the outcome. It means that the plot will look slightly different every time, especially if I have many categorical predictors, but few numeric ones. Last, I order the data according to the MEP’s predicted probability of pooling resources and create an index variable from 1 to N = 707 (the number of rows in the data). The index will constitute my coordinates on the x-axis.

tab <- 
  df %>% 
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds2) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds2) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))

The separation plot then plots a bar (segment) for each observation in the data. The bars are color coded according to the observed outcome, but sorted by their predicted probability on the x-axis.

#Plot
tab %>%
ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate and color the segments by predicted outcome
                   color = as.factor(PoolsLocal)),
               #Make colors transparent in case we have multiple observations with same probability
               alpha = 0.6) +
  #A line that illustrate the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds2,
                x = index))
## Warning: Removed 21 rows containing missing values (`geom_line()`).

The black line illustrates how the probability of observing a “success” (1) increases when we order the data from low to high probability. If there is a clear separation in the predicted probabilities in the two groups, then this line would have a sharp increase at some point. It would also start close to 0 and finish high up towards 1 on the y-axis. We would furthermore see many green segments (PoolsLocal == 1) to the right of the plot, where the predicted probability of observing 1s is high, and many grey segments (PoolsLocal == 0) to the left, where that probability is low.

The original R package for this plot still works, but its interface comes across as old. If you run this code, you’ll have a little R window popping up with the plot; it doesn’t appear in your “Viewer” window.

library(separationplot)
separationplot(pred = df$preds1,
               actual = df$PoolsLocal)

If you are an avid R coder, you can write up your own function to make the plot (See the end of this script).

Your turn:

  • Copy-paste my code chunk for the separation plot into chatGPT and ask it to comment. What does it tell you?
  • Run the code on the predictions for model 1. How does the separation plot change? Why?

Choosing a cut point

An alternative approach is to back transform the probability variable into 0s and 1s. But what cut point (\(\tau\)) should I use? Ward and Ahlquist (2018) warn against simply slicing the variable in two by cutting at 0.5 (in the middle). You can explore the intuition for why that would be such a bad idea in the problem set.

A single cut point (\(\tau\))

As a rule of thumb, I often use the prediction from the base-line (intercept-only) model, i.e. the proportion of successes in the data (here 0.35). It is the value at which I’d predict as much wrong in one direction as in the other without any predictors in the model.

df <-
  df %>%
  mutate(PoolsLocal_pred = as.numeric(df$preds1 > 0.35))

I can plot the predicted and the observed values against each other in a cross table.

#Base R
df[c("PoolsLocal_pred", "PoolsLocal")] %>%
  table
##                PoolsLocal
## PoolsLocal_pred   0   1
##               0 249  78
##               1 208 172
df %>%
  group_by(PoolsLocal, PoolsLocal_pred) %>%
  reframe(N = n()) %>%
  mutate(correct_prediction = (PoolsLocal_pred == PoolsLocal))

To make the table more comparable, I can calculate the proportion of correct predictions by the values of the dependent variable. That is, I calculate the correct/false prediction rate for successes (PoolsLocal == 1) and the correct/false prediction rate for failures (PoolsLocal == 0).

#Base R
tab <- 
  df[c("PoolsLocal_pred", "PoolsLocal")] %>%
  #cross table
  table %>% 
  #proportions by column of observed outcomes (each column sums to 1)
  prop.table(., margin = 2)

tab
##                PoolsLocal
## PoolsLocal_pred         0         1
##               0 0.5448578 0.3120000
##               1 0.4551422 0.6880000
#
tab <- 
  df %>%
  #Group by outcome
  group_by(PoolsLocal) %>%
  #Mean/proportion of 0 predictions and 1 predictions
  reframe(Predicted0 = mean(PoolsLocal_pred == 0),
          Predicted1 = mean(PoolsLocal_pred == 1))

tab

The proportions of correct predictions are reported along the main diagonal. The wrong predictions are reported on the off-diagonal.

Here, I can clearly see that I have a 54% correct prediction rate for negative outcomes/failures (PoolsLocal == 0). In other words, my model performs OK in predicting the MEPs that do not pool resources. However, my model performs fairly well when predicting MEPs that pool resources (69%).

It is a good idea to report the correct positives and negatives in your results table. If so, I usually go for a cut point decided by the proportion of successes in the data. In other words, I ask whether my model with predictors predicts better than a simple guess from a model without predictors.

#For esthetics, round off proportions to two digits
tab <- round(tab, digits = 2)

stargazer(mod2,
          #Add two lines
          add.lines = list(
            c("Correct positives (prop.)", tab$Predicted1[2]),
            c("Correct negatives (prop.)", tab$Predicted0[1])
          ))
## 
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: on, mar 13, 2024 - 22:22:02
## \begin{table}[!htbp] \centering 
##   \caption{} 
##   \label{} 
## \begin{tabular}{@{\extracolsep{5pt}}lc} 
## \\[-1.8ex]\hline 
## \hline \\[-1.8ex] 
##  & \multicolumn{1}{c}{\textit{Dependent variable:}} \\ 
## \cline{2-2} 
## \\[-1.8ex] & PoolsLocal \\ 
## \hline \\[-1.8ex] 
##  OpenList & $-$1.124$^{***}$ \\ 
##   & (0.181) \\ 
##   & \\ 
##  LaborCost & 0.056$^{***}$ \\ 
##   & (0.009) \\ 
##   & \\ 
##  SeatsNatPal.prop & $-$1.930$^{***}$ \\ 
##   & (0.527) \\ 
##   & \\ 
##  Constant & $-$1.094$^{***}$ \\ 
##   & (0.286) \\ 
##   & \\ 
## \hline \\[-1.8ex] 
## Correct positives (prop.) & 0.69 \\ 
## Correct negatives (prop.) & 0.54 \\ 
## Observations & 686 \\ 
## Log Likelihood & $-$392.832 \\ 
## Akaike Inf. Crit. & 793.665 \\ 
## \hline 
## \hline \\[-1.8ex] 
## \textit{Note:}  & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\ 
## \end{tabular} 
## \end{table}

This illustrates an important point. When we reverse predictions to discrete variables, we will have to decide on our priorities. Would we want to predict true positives (i.e. correct predictions of 1s) really accurately, but be less stringent about the true negatives (i.e. less accurate predictions of 0s)? Or the opposite? Or would we like to balance the two? In most political science applications, we’d opt for the balanced outcome.

Letting the cutpoint vary: The ROC curve

The ROC curve is a graphical representation that shows the performance of a binary model at all possible cutpoints/thresholds (from probability 0 to 1). The model will predict either a positive or negative outcome for a given observation based on that probability threshold.

The ROC curve plots the correct positive rate (sensitivity) against the incorrect positive rate (1 - specificity) for different threshold values. As the threshold value/cut point increases, the model becomes more conservative and predicts fewer positive outcomes, which in turn decreases the false positive rate but also decreases the true positive rate.

A good classifier will have a high true positive rate and a low false positive rate, which corresponds to a curve that is closer to the top-left corner of the ROC plot. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, where a value of 1 indicates perfect classification and a value of 0.5 indicates random guessing.
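Before handing this over to pROC, a tiny by-hand sketch may help: at any single threshold (here the hypothetical value 0.5), the true positive rate is the share of observed 1s predicted above the threshold, and the false positive rate is the share of observed 0s predicted above it. The ROC curve simply repeats this calculation for every possible threshold.

#True and false positive rates at a single, hypothetical threshold
tau <- 0.5
c(sensitivity = mean(df$preds2[df$PoolsLocal == 1] > tau, na.rm = TRUE),
  "1 - specificity" = mean(df$preds2[df$PoolsLocal == 0] > tau, na.rm = TRUE))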

#load necessary packages; install it if need be
library(pROC)

#define the object to plot
roc1 <- roc(
  #Observed values
  response = df$PoolsLocal, 
  #Predicted values
  predictor = df$preds1)

#Create the ROC plot
ggroc(roc1) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 1") +
  theme_classic()

We begin with the roc() function that does the calculations for us (sliding the cut point, comparing outcomes). The ggroc() function then plots the object. It’s a ggplot object, so I can add all the usual esthetics to it.

Let’s see how my more complex model is doing.

#define the object to plot
roc2 <- roc(
  #Observed values
  response = df$PoolsLocal, 
  #Predicted values
  predictor = df$preds2)

#Create the ROC plot
ggroc(roc2) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 2") +
  theme_classic()

We can now see how the model improves. The package even has a function to explicitly compare (and choose between) models.

roc2 <- roc(
  #Observed values
  response = df$PoolsLocal, 
  #Predicted values
  predictor = df$preds2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc.test(roc1, roc2)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc1 and roc2
## Z = -6.3678, p-value = 1.918e-10
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  -0.14291608 -0.07564453
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.6236973   0.7329776

The test reports the area under the curve (AUC) for each model and even tests whether the difference in AUCs between the models is statistically significant.

The roc2 object is a list. It means that it contains a lot of information that we can extract using indexing. I can ask for all the list elements.

names(roc2)
##  [1] "percent"            "sensitivities"      "specificities"     
##  [4] "thresholds"         "direction"          "cases"             
##  [7] "controls"           "fun.sesp"           "auc"               
## [10] "call"               "original.predictor" "original.response" 
## [13] "predictor"          "response"           "levels"

… and extract the area under the curve:

roc2$auc
## Area under the curve: 0.733

What is my ideal cut point/threshold value? The ROC curve is mostly used for assessing the model’s performance. However, if we have a set of cutpoints and the true positive and the true negative prediction rate for each cutpoint, then we can find the cutpoint that best balances between the two types of errors. That is, we are interested in when the two lines intersect.

ggplot() +
  geom_line(aes(y = roc2$sensitivities,
                x = roc2$thresholds))  +
  geom_line(aes(y = roc2$specificities,
                x = roc2$thresholds),
            lty = 2)

id.threshold <- (roc2$sensitivities - roc2$specificities) %>% 
  abs %>%
  which.min

#Best balance
roc2$thresholds[id.threshold]
## [1] 0.3884075

Using this as a threshold, my cross table would look like this:

df %>% 
  group_by(PoolsLocal) %>% 
  reframe(Predicted1 = mean(preds2 > roc2$thresholds[id.threshold], na.rm = T), 
          Predicted0 = mean(preds2 <= roc2$thresholds[id.threshold], na.rm = T))

As is clear, at this threshold, I’m (almost) equally wrong/right for my predictions in each group in the outcome (PoolsLocal).
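pROC can also propose a threshold directly through its coords() function. Note that its default “best” criterion is Youden’s J (maximizing sensitivity + specificity - 1), so the suggested value may differ slightly from the intersection criterion above.

#Let pROC suggest a threshold and report the corresponding rates
coords(roc2, x = "best", ret = c("threshold", "sensitivity", "specificity"))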

For a recap, you may for example read this intro: https://drfloreiche.github.io/logit.html

Interpretation

Once we are satisfied with the performance of our model, we can start playing around with it. What does it have to tell us?

1. Regression coefficients

The equation that R optimizes for us is linear on the logodds scale. The regression coefficient \(\beta\) therefore reports changes in logodds (my recoded \(y\)).

\(y \sim \alpha + \beta \times x\)

Regression coefficients reported in logodds are useful for hypothesis testing. I can easily see whether my effects are positive or negative and whether they are statistically different from zero. From the results table, I can also see that intra-party competition (OpenList) reduces MEPs’ likelihood of pooling resources. But by how much?

GLMs have an inbuilt interaction effect due to the recoding of the observed variable (the log-transformation) and the models’ mapping of events on the Bernoulli probability distribution (i.e. a binomial probability distribution with one success/failure for each trial). This requires a more careful interpretation on our side. That’s what exercise 1 in the problem set for today aimed to illustrate.
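A small numeric illustration of this built-in interaction, using plogis() and hypothetical logodds values (not our model): the same change in logodds translates into very different changes in probability depending on the base-line.

#The same -1 change in logodds...
plogis(0) - plogis(0 - 1)    #...moves the probability by about 0.23 near p = 0.5
plogis(-2) - plogis(-2 - 1)  #...but only by about 0.07 near p = 0.12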

When I interpret, I usually follow up with three additional steps.

2. Marginal effect

In linear regression models, we can interpret the \(\beta\) coefficient (slope parameter) as the increase in \(y\) when \(x\) increases by one unit. In logistic regression models, the \(\beta\) coefficient represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable. When I partially back-transform the slope parameter in logistic regression, I obtain the odds ratio (exercise 2b). To do so, I exponentiate the slope parameter. This represents the ratio of the odds of the dependent variable between two groups that differ by one unit on the independent variable.

exp(mod1$coefficients[2])
##  OpenList 
## 0.3788176

When we do this, the reference point is 1. All ratios above 1 imply a positive relationship, all below 1 a negative one. The odds ratio is the factor by which the odds of \(y = 1\) change when \(x\) increases by one unit. Here, the odds change by a factor of 0.379.

Unless your exponentiated coefficient is above 2 (i.e. the effect more than “doubles”), the most intuitive way to communicate the result is in % (not percentage points, but relative increase/decrease). This is particularly true for negative relationships.

exp(mod1$coefficients[2])-1
##   OpenList 
## -0.6211824

The odds that MEPs from candidate-centered systems share an assistant are 62% lower than for members from party-centered systems.

3. First differences and scenarios

Because of the in-built interaction effect, it is particularly important to go all the way to the predicted effect by backtransforming from the logodds.

Logistic transformation: \[\frac{1}{1+exp(-x)}\]

or

\[\frac{exp(x)}{1+exp(x)}\]
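In R, this reversal is simply the plogis() function; here is a quick check with a hypothetical logodds value xb.

#Reverse a logodds value to a probability, by hand and with the built-in function
xb <- -0.5
1 / (1 + exp(-xb))
plogis(xb)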

First differences are useful to illustrate a point to the reader. That is, we choose two (or four) typical scenarios and tell the reader what the absolute change in the predicted probability would be. In linear models (with linear effects), we can rely on all-else-equal statements when constructing scenarios. However, in GLMs we can’t do that. It is more important to draw on typical values and make predictions in that neighborhood. This is where your substantive knowledge (and close study of the descriptive statistics) comes in. The actual predicted effect (first difference) may or may not be impressive, depending on the size of the intercept and the values of the other variables.

If the point is merely to give point estimates (expected values, average predicted effect…), I use the predict() function on a hypothetical dataset.

Here, I illustrate the predicted effect of the 2022 national election on MEPs from Venstre.

I use dplyr to select variable values from the actual data on Denmark. You can of course just use data.frame() and specify by hand. The seat share for my scenario, I drew from Wikipedia.

share2019 <- 43/179
share2022<-23/179

scenario <- 
  df %>%
  filter(Nationality == "Denmark") %>%
  dplyr::select(OpenList, LaborCost) %>%
  distinct(., .keep_all = T) %>%
  data.frame(SeatsNatPal.prop = c(share2019, share2022),
             Year = c(2019, 2022))

preds <- predict(mod2, 
                 newdata = scenario,
                 type = "response")

#Predicted probabilities
preds
##         1         2 
## 0.4174272 0.4706101
#First difference
diff(preds)
##          2 
## 0.05318288

We see that the Danish Members of the European Parliament had a 5-percentage-point higher predicted probability of sharing an assistant after the catastrophic national election in 2022.

Note my vocabulary: When we deal with absolute predicted changes, we talk about percentage points. % is reserved for relative changes.

Graphics

An image speaks more than a 1000 words; especially when we work with GLMs. Put your scenarios on speed and illustrate the effect of a predictor over its entire range instead of a simple first-difference. When you do this, you will go back to the all-else-equal ideal. Set the other variables to a typical but constant value, and let your predictor of choice vary.

The choice of graphic depends on the measurement level of your predictor.

Binary predictor: tilted coefplot

When the predictor is binary, the tilted coefplot is a good choice. When it is continuous, you can use the classical effect graphic by drawing a line. In fact, the entire procedure is exactly the same as before, except for one detail: You need to reverse your dependent variable to a probability. You do this after you have simulated the predicted values (logodds), before you start plotting.

I begin by setting up my scenario.

The ggpredict() function from the ggeffects package decides on a typical scenario for you, given the distribution of each variable in the data. If you just want to illustrate the general take-away from your model, that’s great. The prediction also takes into account where you have the most observations, so you’ll see that the uncertainty around the predictions changes depending on where you are on your predictor.

Simple You can also let R make all the aesthetic choices for you. However, you’ll have to redefine the OpenList variable as categorical (factor) to get the tilted coefplot.

mod2 <- glm(PoolsLocal ~ as.factor(OpenList) + LaborCost + SeatsNatPal.prop, family = "binomial", df)
eff <- ggpredict(mod2,
                 terms = "OpenList")
eff %>% plot

By hand If you have a specific research question (or a decision-maker with priorities) in mind, you may want to change the scenario. If I set my own scenario, I rely heavily on my descriptive table. Here, I continue with my riff on Venstre, but illustrate the effect of the electoral system. I also do the graphic from scratch.

# Prediction with a scenario
library(ggeffects)
eff <- ggpredict(mod2,
                 terms = 
                   list("OpenList" = c(0, 1)),
                 condition = list(
                   LaborCost = 42,
                   SeatsNatPal.prop = 23/179
                 ))

#Check the output
eff %>% as.data.frame
eff %>%
  #Transform to data frame, and recode to more discernable variable names
  as.data.frame %>%
  #Aesthetical recoding: give the predictor proper values, then order the factor from 0 (party centered) to 1 (candidate centered)
  mutate(`Electoral system` = if_else(x == 1,
                                      "Candidate-centered",
                                      "Party-centered"),
         `Electoral system` = reorder(`Electoral system`, x)) %>%
ggplot() +
  #Plot a line this time
  geom_point(aes(
    #Predicted y-values
    y = predicted,
    #Different scenarios
    x = `Electoral system`)
  ) +
  #Plot a little line to insinuate a move from one discrete value to another
  geom_line(aes(
    #Predicted y-values
    y = predicted,
    #Different scenarios
    x = `Electoral system`),
    lty = 2,
  ) +
  #Plot the uncertainty as error bars
  geom_errorbar(aes(
    #Predicted y-values
    ymin = conf.low,
    ymax = conf.high,
    #Different scenarios
    x = `Electoral system`),
    alpha = 0.3,
    #Error bars "sans serif"
    width = 0) +
  #Set the y-axis
  ylim(0, 1) +
  ylab("Probability") +
  ggtitle("Effect of ELECTORAL SYSTEM on MEPs propensity to pool resources") 
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Numeric predictor: effect plot

If the predictor is numeric, you can make fine-grained scenarios and display the effect as a regression line.

Simple I can do this in a simple way. Note how ggplot changes my y-axis to zoom in and give the impression I’ve gotten a massive effect. You can change that by adding ylim(0,1), for example.

eff <- ggpredict(mod2,
                 terms = "LaborCost")
## Data were 'prettified'. Consider using `terms="LaborCost [all]"` to get
##   smooth plots.
eff %>% plot

By hand I can also specify the scenario and make the plot from scratch.

# Prediction with a scenario
library(ggeffects)
eff <- ggpredict(mod2,
                 terms = 
                   list("LaborCost" = seq(min(df$LaborCost), max(df$LaborCost), 0.1)),
                 condition = list(
                   OpenList = 1,
                   SeatsNatPal.prop = 23/179
                 ))

#Check the output
eff %>% as.data.frame %>% head
eff %>%
  ggplot() +
   #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymax = conf.high,
    ymin = conf.low,
    x = x),
    #Make the ribbon transparent
    alpha = 0.3
  ) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = predicted,
    #Different scenarios
    x = x)
  ) +
  #Set the y-axis
  ylim(0, 1) +
  ylab("Probability") + xlab("Labor cost") +
  ggtitle("Effect of LABOR COST on MEPs propensity to pool resources") 

For the R enthusiast: R functions

Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is when I have a code block where I only change one or two things between each time that I use it.

The function that creates functions (sic!) is called function().

  1. In the parenthesis, we put in the arguments (i.e. the data/stuff that varies between each time we run the code).

  2. After that, we open a set of curly brackets. This is where we put our R codes.

  3. The last line in the curly brackets states what output the function should return when you run it: return(). R will only give one object per function.

  4. Store your function in an R object. my_function <- function(){}

A toy example.

I want a function that tells me if the number I put into it is smaller or larger than 10. My argument `x =` is the input; the output is given in return(). For ease, I create a default value for x. I chose x = 1. However, once the function is compiled (run the R code), the user may redefine x to whatever value they want.

bigger_than10 <- function(x = 1){ 
  return(x > 10)
}

#Try it out
bigger_than10(x = 20)
## [1] TRUE

If you like your function and want to use it later, you can store it as an R object in your working directory.

save(bigger_than10,
     file  = "bigger_than10.rda")

Later, you fetch it as usual.

load("bigger_than10.rda")

Your turn!

Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that would actually vary across models: the predicted and the observed variables. By standardizing their names, you can make a more general R code. You can use the following code as a basis:

tab <- df %>% 
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))

#Plot
ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(data = tab,
               aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate the segments by predicted outcome
                   color = as.factor(PoolsLocal)),
               #Make colors transparent in case we have multiple observations with same probability
               alpha = 0.6) +
  scale_color_manual(values = c("grey", "purple")) +
  #A line that illustrate the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds1,
                x = index)) +
  #Esthetics
  ylab("Y-hat") +
  xlab("index") +
  ggtitle("Separation plot") +
  theme_minimal() +
  theme(legend.position = "bottom")

  1. create two R objects/vectors – `y` and `y_pred` – on the basis of df$PoolsLocal and df$preds1.

  2. identify all the times the observed y (PoolsLocal) is used in the code. Replace the name by y. Do the same with y_pred and preds1.

  3. create a data table df0 by a data.frame(y, y_pred). This is just to make sure your two variables are the same length. If they’re not, R will throw you a warning. Replace df by df0 in the code.

  4. Remove the variable selection element (line 2); you don’t need it. If the code runs, you’re good to go.

  5. pack the code chunk into the curly brackets of the following code separation_plot <- function(y = NULL, y_pred = NULL){}.

  6. Ok! Does it run? Try out with preds1 from model 1 as well. Does it work?

Want another challenge?

  • What aesthetics would you have in the function that the user can later alter?
  • The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?.

Tips: Your challenge will probably be to make sure that the two vectors are of equal length. If you have missing data in the model, this will not be the case. I solved it by indexing on the row name, but there are many ways around the problem.

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press.
Hermansen, Silje Synnøve Lyder, and Andreja Pegan. 2024. “Blurred Lines Between Electoral and Parliamentary Representation: The Use of Constituency Staff Among Members of the European Parliament.” European Union Politics, 1–25. https://doi.org/10.1177/14651165221149900.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.