Logistic regression models
Binomial logistic models allow us to analyze binary outcomes. They are part of the GLM family of models, meaning that they rely on a recoding strategy to make the dependent variable continuous and unbounded. They also require us to pick a probability distribution in order to link observed outcomes to the probability of the event. R does all of this for us, as long as we specify the model correctly. However, the interpretation requires us to partially or completely backtrack the recoding. Also, although the model assessment compares predicted and observed outcomes, we have the additional challenge of deciding on a useful cut point at which we believe the latent variable translates into 1s instead of 0s.
New R codes:
- new from base: glm(), prop.table(), chisq.test(), predict(., type = "response")
- new from dplyr package: n(), distinct()
- new from ggplot: geom_errorbar(), geom_segment() (less important)
- new from pROC: roc()
- new from separationplot: separationplot()
- new from tidyr: pivot_wider() (less important)

Old R codes:
- old from base: dim(), table(), as.data.frame()
- old from dplyr package: group_by(), mutate(), reframe()
- old from ggplot: geom_bar(), geom_col(), geom_point()
- old from ggeffects: ggpredict(., terms = , condition = )
Introduction
Main methodological takeaways
GLMs are non-linear by construction, so interpretation – but also model evaluation – requires a bit more care. Usually, it will involve back-transforming your dependent variable. I work with four stages of transformation (sketched in R right after this list):
- no transformation (logodds). These are the regression coefficients from the model output and your results table. They are useful for hypothesis testing.
- partial reversal (odds ratio, change in odds). Marginal effects are a one-number summary of the relative effect of a particular variable. I obtain them by exponentiating the slope coefficient.
- reversal (probability). Predicted probabilities are useful for interpretation (first differences, effect graphics) and model assessment (Brier score, separation plot).
- total reversal to the original scale (binary). Useful mostly for model assessment (\(\chi^2\) test, ROC curve).
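To fix ideas, here is a minimal sketch of the four stages in R. It assumes a hypothetical fitted glm object mod with a predictor named x and an arbitrary cut point of 0.5; the real models and cut points come later in the script.

b <- coef(mod)["x"]                   #1) no transformation: logodds
exp(b)                                #2) partial reversal: odds ratio
p <- predict(mod, type = "response")  #3) reversal: predicted probabilities
y_hat <- as.numeric(p > 0.5)          #4) total reversal: binary outcomes at a chosen cut point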
All model assessments are essentially i) a comparison between the predicted and the observed outcome and – sometimes – ii) a comparison between models. This is what Ward and Ahlquist (2018) refer to as "model selection": we are choosing the best model out of several.
Substantive topic (the research)
We will be working on some more of the insights from Hermansen and Pegan's (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.
This time we will play around with another implication of the electoral systems: intra-party competition. We formulate a hypothesis that MEPs who will later compete against party colleagues are less prone to share resources. In our case, we measure whether MEPs share at least one local assistant with their national party colleagues. Local assistants work in the circumscription/member state and allow MEPs to cultivate a personal vote. We therefore compare MEPs from candidate-centered systems to MEPs from party-centered systems. We furthermore make the assumption that – all else equal – the more strapped an MEP (and their party) is for money, the stronger incentives they have to pool resources. We approximate this by measuring the average labor cost in the MEP's member state and the party's size in the national parliament.
The data
The data we will work with are a subset of the replication data for the article. They list all MEPs present in parliament in the spring semester of 2016.
I start by fetching the R packages we will use from the library.
#Data wrangling
library(tidyr);library(dplyr);
#Presentation
library(ggplot2);
library(stargazer); library(ggeffects);
#New packages for model assessment
library(separationplot); library(pROC)
Descriptive statistics
As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.
We begin by loading in the data.
If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory. Note that this time, we work with data on MEPs from 2016.
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2016.rda",
destfile = "MEP2016.rda")
Now, I can load it into R and rename the object to my liking (df).
#Load in
load("MEP2016.rda")

#Rename
df <- MEP2016
Univariate distribution
To approximate MEPs’ propensity to share assistants with party colleagues, I have created a binary variable PoolsLocal that indicates whether the MEP has indeed at least one local assistant on his/her payroll that they also share with colleagues. My main predictor is also binary: Is the MEP elected from a candidate-centered system (OpenList)?
I begin by exploring the univariate distribution in numbers, a table and a barplot. The most interesting thing here is to figure out how common the phenomenon is on average. I therefore calculate the mean, i.e. the proportion of MEPs that pool resources.
mean(df$PoolsLocal)
## [1] 0.3536068
We see that the proportion of MEPs that share local assistants is 0.35. That is, the predicted probability of sharing assistants in a base-line intercept-only model is 0.3536068.
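A quick sanity check of that claim (a small sketch, assuming the df object loaded above): an intercept-only logit returns exactly this proportion once the intercept is back-transformed with plogis().

mod0 <- glm(PoolsLocal ~ 1, family = "binomial", df)
#Back-transform the intercept: equals mean(df$PoolsLocal)
plogis(coef(mod0))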
If my phenomenon is rare or very common (outcome close to 0 or 1), it is a sign that
- the binomial logistic regression likely yields different results than a linear model (ref. the S-shape).
- I might have problems correctly predicting successes (or failure), because they are rare.
For simple statistics, I prefer base plotting to ggplot2. However, you can mix the coding with dplyr pipes if you want to.
#In base
table(df$PoolsLocal)
##
## 0 1
## 457 250
#In dplyr
df %>%
  group_by(PoolsLocal) %>%
  #Count number of observations in the group
  reframe(N = n())
#Quick one in base R
barplot(table(df$PoolsLocal))
#dplyr + ggplot2
df %>%
  ggplot() +
  #Function automatically makes frequency table and displays it
  geom_bar(aes(PoolsLocal)) +
  #Plot title
  ggtitle("Distribution of MEPs that share local assistants (pool resources)")
Since we will work with probabilities, it is useful to plot the relative distribution of the dependent variable. I can use base R to transform the frequency table (table()) to proportions (prop.table()).
df$PoolsLocal %>%
  table %>%
  prop.table
## .
## 0 1
## 0.6463932 0.3536068
To illustrate that in ggplot, I make a new data frame where I calculate the height of each column, then make a barplot. When I provide both x-values and y-values to the graphic, the correct function is geom_col().
#Tidyverse + ggplot2
df %>%
  #Group by outcome
  group_by(PoolsLocal) %>%
  reframe(
    #Number of observations in each group
    n = n(),
    #Number of observations in the data
    N = dim(df)[1],
    #The proportion
    prop = n/N) %>%
  ggplot() +
  geom_col(aes(y = prop,
               x = PoolsLocal)) +
  ggtitle("Distribution of MEPs that share local assistants (pool resources)")
I then explore each of the predictors and make a table of descriptive statistics. I will use them actively when I make my interpretations. Here, I create a separate data table with only the relevant variables, so that I can pass them to stargazer() for descriptive statistics.
df0 <-
  df %>%
  #The select() function has namesakes in other packages, so I often specify which package I'm drawing on
  dplyr::select(PoolsLocal,
                OpenList,
                LaborCost,
                SeatsNatPal.prop,
                position)
It is useful already here to make a table of descriptive statistics. I can use stargazer() to get them automatically. Here, I nevertheless specify what type of statistics I want (interquartile range, median, minimum and maximum).
library(stargazer)
stargazer(df0,
#Title of the table
title = "Descriptive statistics",
#Type of output
type = "html",
#Summary stats of data frame
summary = T,
#Additional arguments for statistics
iqr = T, median = T, min.max = T)
| Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Median | Pctl(75) | Max |
|---|---|---|---|---|---|---|---|---|
| PoolsLocal | 707 | 0.354 | 0.478 | 0 | 0 | 0 | 1 | 1 |
| OpenList | 707 | 0.463 | 0.499 | 0 | 0 | 0 | 1 | 1 |
| LaborCost | 707 | 23.953 | 11.222 | 4.400 | 10.900 | 26.700 | 33.000 | 42.000 |
| SeatsNatPal.prop | 686 | 0.229 | 0.173 | 0.000 | 0.071 | 0.241 | 0.357 | 0.668 |
| position | 656 | 4.943 | 1.890 | 1.059 | 3.429 | 5.786 | 6.400 | 7.000 |
Bivariate statistics
My main predictor this time is a binary variable that flags whether the MEP was elected – and therefore might run for reelection – in a candidate-centered system (OpenList).
I explore the variation between the two in a cross table. I first calculate the counts of the different value combinations (appropriate for discrete/categorical variables). I can use different R dialects and obtain the same results.
#Base
table(df[, c("OpenList", "PoolsLocal")])
## PoolsLocal
## OpenList 0 1
## 0 208 172
## 1 249 78
#dplyr + tidyr (for those who look for an R challenge)
%>%
df #Groups/combinations of values
group_by(PoolsLocal, OpenList) %>%
#Count occurences
reframe(n = n())%>%
#Spread OpenList by columns: from tidyr package
::pivot_wider(names_from = PoolsLocal,
tidyrvalues_from = n)
… I then transform to proportions. This is already a first step in the analysis. By calculating proportions row-wise (margin = 1), I make the distributions comparable.
#Base
df[, c("OpenList", "PoolsLocal")] %>%
  table %>%
  prop.table(., margin = 1)
## PoolsLocal
## OpenList 0 1
## 0 0.5473684 0.4526316
## 1 0.7614679 0.2385321
#dplyr + tidyr (for those who look for an R challenge)
df %>%
  #Group by the predictor first (to calculate proportion of sharing within each value of x)
  group_by(OpenList) %>%
  #Count number of occurences of each value of the predictor
  mutate(N = n()) %>%
  #Groups/combinations of values
  group_by(PoolsLocal, OpenList) %>%
  #Count occurences and divide by observations in that predictor
  mutate(prop = n()/N) %>%
  #Distinct observations (alternative to mutate)
  distinct(PoolsLocal, prop) %>%
  #arrange by values
  arrange(OpenList, PoolsLocal) %>%
  #Spread by columns: from tidyr package
  tidyr::pivot_wider(names_from = PoolsLocal,
                     values_from = prop)
Now, we can clearly see that there is a larger proportion of MEPs that pool resources in party-centered systems (PoolsLocal == 1; OpenList == 0) compared to MEPs from candidate-centered systems (PoolsLocal == 1; OpenList == 1).
If I want to, I can test whether the relationship between the two variables is statistically significant even here. That is, I can perform a chi-square test (\(\chi^2\) test). It tests whether the combination of values (counts) in my table is different from what the univariate distributions would indicate. The test relies on another probability distribution you have encountered in your intro class to statistics: the \(\chi^2\) distribution. As you may recall from methods 1, it describes the sum of squares of independent standard normal random variables. Here, we have a small cross table with two variables (four cells): pooling resources and electoral system.
c("OpenList", "PoolsLocal")] %>%
df[, %>%
table chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 34.317, df = 1, p-value = 4.683e-09
We see that the p-value is really low (you'd have to move the decimal point 9 places to the left to write it out in full). That is, it is highly unlikely that the combinations we see are produced by accident. We can discard the null hypothesis.
Usually, I wouldn’t bother doing the chisquare test here, because that’s the purpose of the regression. However, the chisquare test sets us up nicely for the model assessment where we are interested in whether the predicted and observed outcomes are correlated.
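If you want to see what chisq.test() does under the hood, here is a hedged sketch that computes the Pearson statistic by hand (without the Yates continuity correction, so the number will differ slightly from the corrected value reported above):

#Observed counts
obs <- table(df[, c("OpenList", "PoolsLocal")])
#Counts expected under independence (row totals x column totals / N)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
#Pearson's chi-square statistic
sum((obs - expected)^2 / expected)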
Analysis
Estimate a binomial logistic regression
The function for generalized linear models in R is glm(). This time, we also have to specify the probability distribution that will allow us to map the recoded dependent variable (the logodds) to probabilities.
I run a base-line model with my one predictor, then a second more complex model where I also control for the resources the MEP has at his/her disposal.
mod1 <- glm(PoolsLocal ~ OpenList,
            family = "binomial",
            df)

mod2 <- glm(PoolsLocal ~
              OpenList +
              LaborCost +
              SeatsNatPal.prop,
            family = "binomial",
            df)
Present your results: table of results
Although Ward and Ahlquist (2018) don’t like regression tables (i.e. the BUTONs; Big Ugly Table of Numbers), it is useful to present them too.
stargazer::stargazer(mod1, mod2,
                     dep.var.caption = "Dependent variable",
                     title = "MEPs' propensity to share local assistants (a binomial logit)",
                     type = "html")
Dependent variable: PoolsLocal

| | (1) | (2) |
|---|---|---|
| OpenList | -0.971*** | -1.124*** |
| | (0.166) | (0.181) |
| LaborCost | | 0.056*** |
| | | (0.009) |
| SeatsNatPal.prop | | -1.930*** |
| | | (0.527) |
| Constant | -0.190* | -1.094*** |
| | (0.103) | (0.286) |
| Observations | 707 | 686 |
| Log Likelihood | -441.336 | -392.832 |
| Akaike Inf. Crit. | 886.672 | 793.665 |
| Note: | | *p<0.1; **p<0.05; ***p<0.01 |
The table lets us compare regression coefficients directly between models. They are reported as changes in logodds (our recoded, "latent" variable). However, the size of the coefficients is not really comparable across GLMs because the coefficient is always proportional to the intercept. Here, we see that the marginal effect of the electoral system has a larger coefficient in the second model, but the intercept is also much smaller. The relative change in the probability (or odds; or logodds) is therefore larger, but it comes from a lower base-line level. The table is still useful for checking the precision (standard error and stars) of the estimates.
Now that I have settled on my model, I will assess its performance, then use the interpretations to tell my story. There is no single, perfect model. Also, if there was any such thing, we wouldn't know unless we saw some lower-performing alternatives. Therefore, the model assessment often involves a comparison between alternative models.
Model assessment
An important step of any analytic work is, of course, to check if the model choice is correct. That is, before we begin the interpretation of the results, we check if the model provides an appropriate description of our phenomenon. All model statistics are – in effect – variations over a comparison between the predicted and the observed values.
The comparisons I make here involve a varying degree of reversal of the recoding of the dependent variable. We can either transform the predicted values into probabilities, or go all the way to discrete outcomes (binary). Each choice has its advantages.
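As a minimal illustration of the two levels of reversal (a sketch assuming the mod1 object estimated above), predict() can return either scale directly:

#Logodds: no reversal
head(predict(mod1, type = "link"))
#Probabilities: reversed with the logistic function
head(predict(mod1, type = "response"))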
Reversal to probabilities
The model predictions can be reported as a probability of observing a positive outcome for each observation in our data. In our case, the model predicts the probability that an MEP shares staff with a colleague. In other words, a model that does a good job in describing the outcome would assign high probabilities to MEPs that share an assistant, and low probabilities to MEPs that don’t share.
I use the predict(., type = "response") function in base R to specify that I want the predicted values given as probabilities.
df <- df %>%
  mutate(
    #Prediction from model 1
    preds1 = predict(mod1,
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response"),
    #Predictions from model 2
    preds2 = predict(mod2,
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response")
  )
We can admire the distribution of the predicted values in a histogram. What do you think?
df %>%
  ggplot() +
  geom_histogram(aes(preds1))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
My predictions only involve two possible outcomes. That's because the model only has one binary predictor.
We can do the same with the more complex model.
df %>%
  ggplot() +
  geom_histogram(aes(preds2))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 21 rows containing non-finite values (`stat_bin()`).
The histogram of predicted probabilities gives a sense of their distribution, but how can we compare them with the model's observed outcomes? The challenge in the assessment is that we compare a continuous prediction with discrete observed values.
A single-number summary
Brier score
Ward and Ahlquist (2018) suggest using the Brier score as a single-number summary of how well your model predicts in sample.
The Brier score summarizes the average prediction error (something like the \(R^2\) for linear models). It calculates the difference between the predicted and observed outcomes, squares these differences and takes their mean. In an unbiased model, the raw residuals average out to 0; hence the squaring.
\[B_b \equiv \frac{1}{n}\Sigma^{n}_{i = 1}(\hat{\theta_i} - y_i)^2\]
df %>%
  mutate(diff = preds2 - PoolsLocal,
         diff2 = diff^2) %>%
  reframe(Brier = mean(diff2, na.rm = T))
The Brier score ranges from 0 to 1 where lower scores indicate smaller distance between the predicted and observed outcome. Here, my model seems to perform pretty well.
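As a hedged benchmark, we can compare it with the Brier score of a naive guess that assigns every MEP the overall sample proportion (the intercept-only prediction):

#Intercept-only prediction
p0 <- mean(df$PoolsLocal)
#Brier score of the naive guess; the model should do better than this
mean((p0 - df$PoolsLocal)^2)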
The Hosmer-Lemeshow test
The Hosmer - Lemeshow test orders the data according to the predicted probability of a positive outcome, then slices the data into groups; often 10 groups. It then counts the number of successes per group and performs a chisquare test of the table.
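Before turning to the packages, here is a minimal sketch of that grouping step, assuming the preds2 variable created above; ntile() from dplyr slices the sorted predictions into ten equally sized groups.

df %>%
  filter(!is.na(preds2)) %>%
  #Slice the sorted predictions into deciles
  mutate(decile = ntile(preds2, 10)) %>%
  group_by(decile) %>%
  #Observed vs. expected number of MEPs that pool resources
  reframe(Observed = sum(PoolsLocal),
          Expected = sum(preds2))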
There are several packages that perform the test for you.
glmtoolbox: We can use the glmtoolbox package to fetch the function for the test. It suffices to feed the model object into the hltest() function.
#install.packages("glmtoolbox")
library(glmtoolbox)
## Warning: pakke 'glmtoolbox' blev bygget under R version 4.3.3
hltest(mod2, df = 8)
##
## The Hosmer-Lemeshow goodness-of-fit test
##
## Group Size Observed Expected
## 1 69 4 5.897530
## 2 70 14 9.631471
## 3 69 14 12.535274
## 4 67 5 17.903488
## 5 63 22 20.598544
## 6 70 43 26.823364
## 7 65 21 28.580932
## 8 83 44 42.546345
## 9 73 38 42.634777
## 10 57 41 38.848275
##
## Statistic = 37.09993
## degrees of freedom = 8
## p-value = 1.1032e-05
Here, we see that in the first decile of the predicted probabilities, there are 4 MEPs that share assistants, while the model – when we sum over the predicted probabilities – expects about 5.9.
Note that the test in principle checks whether the observed and the predicted outcomes are significantly different from each other. We don't want that; we want the two groups to be similar to each other. Here, the p-value is about \(1.1 \times 10^{-5}\); bad news for us.
If I were to run this test on my base-line model, I'd get an error message.
hltest(mod1)
That’s because I only have two predicted outcomes, while the test requires us to create ten groups (deciles).
ResourceSelection: Another package is the ResourceSelection package, using the hoslem.test() function. It asks for the observed outcomes and the predicted probabilities, respectively. If you have NAs in your prediction, the function breaks down, and you'd have to subset the variables to get a comparison.
library(ResourceSelection)
## Warning: pakke 'ResourceSelection' blev bygget under R version 4.3.3
## ResourceSelection 0.3-6 2023-06-27
hoslem.test(df$PoolsLocal[!is.na(df$preds2)],
            df$preds2[!is.na(df$preds2)])
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: df$PoolsLocal[!is.na(df$preds2)], df$preds2[!is.na(df$preds2)]
## X-squared = 50.983, df = 8, p-value = 2.644e-08
The Hosmer-Lemeshow test is sensitive to how many groups we create, so it is not a single fix to tell us whether our predictions are any good. As you can see, the different test functions also yield slightly different results, so you will have to use some discernment when assessing your model fit.
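A small illustration of that sensitivity: rerunning hoslem.test() with a different number of groups (its g argument) changes the statistic and the p-value.

hoslem.test(df$PoolsLocal[!is.na(df$preds2)],
            df$preds2[!is.na(df$preds2)],
            #Five groups instead of the default ten
            g = 5)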
A visual comparison: the separation plot
We can state the same intuition differently: we should see few MEPs that pool resources on the lower end of the predicted probability scale, and a higher density of MEPs that share assistants on the high end of the probability scale.
We can even illustrate this in a plot: The “separation plot”.
Here, I select the two variables (observed and predicted outcomes). I then randomly re-order the data by drawing my data anew using the sample() function. This is because I will arrange the data according to the predicted probability to share assistants, but observations with identical probability may vary in whether I observe a 0 or 1 in the outcome. It means that the plot will look slightly different every time, especially if I have many categorical predictors, but few numeric ones. Last, I order the data according to the MEP's predicted probability of pooling resources and create an index variable from 1 to N = 707 (the number of rows in the data). The index will constitute my coordinates on the x-axis.
tab <-
  df %>%
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds2) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds2) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))
The separation plot then plots a bar (segment) for each observation in the data. The bars are color coded according to the observed outcome, but sorted by their predicted probability on the x-axis.
#Plot
tab %>%
  ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate and color the segments by observed outcome
                   color = as.factor(PoolsLocal)),
               #Make colors transparent in case we have multiple observations with same probability
               alpha = 0.6) +
  #A line that illustrates the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds2,
                x = index))
## Warning: Removed 21 rows containing missing values (`geom_line()`).
The black line illustrates how the probability of observing a "success" (1) increases when we order the data from low to high probability. If there is a clear separation in the predicted probabilities in the two groups, then this line would have a sharp increase at some point. It would also start close to 0 and finish high up towards 1 on the y-axis. We would furthermore see many green segments (PoolsLocal == 1) to the right of the plot where the predicted probability of observing 1s is high, and many grey segments (PoolsLocal == 0) to the left where the predicted probability of observing 1s is low.
The original R package for this plot still works, but its interface comes across as old. If you run this code, you’ll have a little R window popping up with the plot; it doesn’t appear in your “Viewer” window.
library(separationplot)
separationplot(pred = df$preds1,
actual = df$PoolsLocal)
If you are an avid R coder, you can write up your own function to make the plot (See the end of this script).
Your turn:
- Copy-paste my code chunk for the separation plot into chatGPT and ask it to comment. What does it tell you?
- Run the code on the predictions for model 1. How does the separation plot change? Why?
Choosing a cut point
An alternative approach is to back transform the probability variable into 0s and 1s. But what cut point (\(\tau\)) should I use? Ward and Ahlquist (2018) warn against simply slicing the variable in two by cutting at 0.5 (in the middle). You can explore the intuition for why that would be such a bad idea in the problem set.
A single cut point (\(\tau\))
As a rule of thumb, I often use the prediction from the base-line model, i.e. the proportion of successes from the intercept-only model. It is the value at which I'd predict roughly as much wrong in one direction as in the other without any other predictors in the model.
df <-
  df %>%
  mutate(PoolsLocal_pred = as.numeric(df$preds1 > 0.35))
I can plot the predicted and the observed values against each other in a cross table.
#Base R
df[, c("PoolsLocal_pred", "PoolsLocal")] %>%
  table
## PoolsLocal
## PoolsLocal_pred 0 1
## 0 249 78
## 1 208 172
df %>%
  group_by(PoolsLocal, PoolsLocal_pred) %>%
  reframe(N = n()) %>%
  mutate(correct_prediction = (PoolsLocal_pred == PoolsLocal))
To make the table more comparable, I can calculate the proportion of correct predictions by the values of the dependent variable. That is, I could calculate the correct/false prediction rate for successes (PoolsLocal == 1) and the correct/false prediction rate for failures (PoolsLocal == 0).
#Base R
tab <-
  df[, c("PoolsLocal_pred", "PoolsLocal")] %>%
  #cross table
  table %>%
  #proportions by column (sums up to 1 per column)
  prop.table(., margin = 2)
tab
## PoolsLocal
## PoolsLocal_pred 0 1
## 0 0.5448578 0.3120000
## 1 0.4551422 0.6880000
#dplyr
tab <-
  df %>%
  #Group by outcome
  group_by(PoolsLocal) %>%
  #Mean/proportion of 0 predictions and 1 predictions
  reframe(Predicted0 = mean(PoolsLocal_pred == 0),
          Predicted1 = mean(PoolsLocal_pred == 1))
tab
The proportion of correct predictions is reported along the main diagonal. The wrong predictions are reported in the off-diagonal.
Here, I can clearly see that I have a correct prediction rate of only about 54% for failures (PoolsLocal == 0). In other words, my model performs OK in predicting the MEPs that do not pool resources. However, my model performs fairly well when predicting MEPs that pool resources (about 69%).
It is a good idea to report the correct positives and negatives in your results table. If so, I usually go for a cut point decided by the proportion of successes in the data. In other words: does my model with predictors predict better than a simple guess from a model without predictors?
#For esthetics, round off proportions to two digits
tab <- round(tab, digits = 2)

stargazer(mod2,
          #Add two lines
          add.lines = list(
            c("Correct positives (prop.)", tab$Predicted1[2]),
            c("Correct negatives (prop.)", tab$Predicted0[1])
          ))
##
## % Table created by stargazer v.5.2.3 by Marek Hlavac, Social Policy Institute. E-mail: marek.hlavac at gmail.com
## % Date and time: on, mar 13, 2024 - 22:22:02
## \begin{table}[!htbp] \centering
## \caption{}
## \label{}
## \begin{tabular}{@{\extracolsep{5pt}}lc}
## \\[-1.8ex]\hline
## \hline \\[-1.8ex]
## & \multicolumn{1}{c}{\textit{Dependent variable:}} \\
## \cline{2-2}
## \\[-1.8ex] & PoolsLocal \\
## \hline \\[-1.8ex]
## OpenList & $-$1.124$^{***}$ \\
## & (0.181) \\
## & \\
## LaborCost & 0.056$^{***}$ \\
## & (0.009) \\
## & \\
## SeatsNatPal.prop & $-$1.930$^{***}$ \\
## & (0.527) \\
## & \\
## Constant & $-$1.094$^{***}$ \\
## & (0.286) \\
## & \\
## \hline \\[-1.8ex]
## Correct positives (prop.) & 0.69 \\
## Correct negatives (prop.) & 0.54 \\
## Observations & 686 \\
## Log Likelihood & $-$392.832 \\
## Akaike Inf. Crit. & 793.665 \\
## \hline
## \hline \\[-1.8ex]
## \textit{Note:} & \multicolumn{1}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
## \end{tabular}
## \end{table}
This illustrates an important point. When we reverse predictions to discrete variables, we will have to decide on our priorities. Would we want to predict true positives (i.e. correct predictions of 1s) really accurately, but be less stringent about the true negatives (i.e. less accurate predictions of 0s)? Or the opposite? Or would we like to balance the two? In most political science applications, we’d opt for the balanced outcome.
Letting the cutpoint vary: The ROC curve
The ROC curve is a graphical representation that shows the performance of a binary model at all possible cutpoints/thresholds (from probability 0 to 1). The model will predict either a positive or negative outcome for a given observation based on that probability threshold.
The ROC curve plots the correct positive rate (sensitivity) against the incorrect positive rate (1 - specificity) for different threshold values. As the threshold value/cut point increases, the model becomes more conservative and predicts fewer positive outcomes, which in turn decreases the false positive rate but also decreases the true positive rate.
A good classifier will have a high true positive rate and a low false positive rate, which corresponds to a curve that is closer to the top-left corner of the ROC plot. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, where a value of 1 indicates perfect classification and a value of 0.5 indicates random guessing.
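Before handing the calculations to pROC below, here is a hedged sketch of the two rates the curve traces out, computed at a single illustrative threshold (0.5 is an arbitrary choice here):

tau <- 0.5
ok <- !is.na(df$preds2)
#Sensitivity: share of observed 1s predicted as 1
c(TruePositiveRate = mean(df$preds2[ok & df$PoolsLocal == 1] > tau),
  #1 - specificity: share of observed 0s predicted as 1
  FalsePositiveRate = mean(df$preds2[ok & df$PoolsLocal == 0] > tau))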
#load necessary packages; install it if need be
library(pROC)

#define the object to plot
roc1 <- roc(
  #Observed values
  response = df$PoolsLocal,
  #Predicted values
  predictor = df$preds1)

#Create the ROC plot
ggroc(roc1) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 1") +
  theme_classic()
We begin with the roc() function that does the calculations for us (sliding the cutpoint, comparing outcomes). The ggroc() function then plots the object. It's a ggplot object, so I can add all the usual esthetics to it.
Let’s see how my more complex model is doing.
#define the object to plot
roc2 <- roc(
  #Observed values
  response = df$PoolsLocal,
  #Predicted values
  predictor = df$preds2)

#Create the ROC plot
ggroc(roc2) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 2") +
  theme_classic()
We can now see how the model improves. The package even has a function to explicitly compare (and choose between) models.
roc2 <- roc(
  #Observed values
  response = df$PoolsLocal,
  #Predicted values
  predictor = df$preds2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc.test(roc1, roc2)
##
## DeLong's test for two correlated ROC curves
##
## data: roc1 and roc2
## Z = -6.3678, p-value = 1.918e-10
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.14291608 -0.07564453
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.6236973 0.7329776
The test reports the area under the curve (AUC) for each model and even tests whether the difference in AUCs between the models is statistically significant.
The roc2 object is a list. It means that it contains a lot of information that we can extract using indexing. I can ask for all the list elements.
names(roc2)
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"
… and extract the area under the curve:
roc2$auc
## Area under the curve: 0.733
What is my ideal cut point/threshold value? The ROC curve is mostly used for assessing the model’s performance. However, if we have a set of cutpoints and the true positive and the true negative prediction rate for each cutpoint, then we can find the cutpoint that best balances between the two types of errors. That is, we are interested in when the two lines intersect.
ggplot() +
geom_line(aes(y = roc2$sensitivities,
x = roc2$thresholds)) +
geom_line(aes(y = roc2$specificities,
x = roc2$thresholds),
lty = 2)
id.threshold <- (roc2$sensitivities - roc2$specificities) %>%
  abs %>%
  which.min

#Best balance
roc2$thresholds[id.threshold]
## [1] 0.3884075
Using this as a threshold, my cross table would look like this:
df %>%
  group_by(PoolsLocal) %>%
  reframe(Predicted1 = mean(preds2 > roc2$thresholds[id.threshold], na.rm = T),
          Predicted0 = mean(preds2 <= roc2$thresholds[id.threshold], na.rm = T))
As is clear, at this threshold, I'm (almost) equally wrong/right for my predictions in each group of the outcome (PoolsLocal).
For a recap, you may for example read this intro: https://drfloreiche.github.io/logit.html
Interpretation
Once we are satisfied with the performance of our model, we can start playing around with it. What does it have to tell us?
1. Regression coefficients
The equation that R optimizes for us is linear. The regression coefficient \(\beta\) therefore reports changes in logodds (my recoded \(y\)).
\(y \sim \alpha + \beta \times x\)
Regression coefficients reported in logodds are useful for hypothesis testing. I can easily see whether my effects are positive or negative and whether their likelihood of being different from zero is statistically significant. From the results table, I can also see that intra-party competition (OpenList) reduces MEPs' likelihood of pooling resources. But by how much?
GLMs have an inbuilt interaction effect due to the recoding of the observed variable (the log-transformation) and the models’ mapping of events on the Bernoulli probability distribution (i.e. a binomial probability distribution with one success/failure for each trial). This requires a more careful interpretation on our side. That’s what exercise 1 in the problem set for today aimed to illustrate.
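A minimal illustration of that point, using mod1: the same coefficient implies different changes in probability depending on the baseline. The linear predictor values of -2 and 0 are hypothetical numbers picked for illustration.

b <- coef(mod1)["OpenList"]
#Change in probability when the baseline probability is low
plogis(-2 + b) - plogis(-2)
#Change in probability when the baseline probability is 0.5
plogis(0 + b) - plogis(0)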
When I interpret, I usually follow up with three additional steps.
2. Marginal effect
In linear regression models, we can interpret the \(\beta\) coefficient (slope parameter/coefficient) as the relative increase in \(y\) when \(x\) increases by one unit. In logistic regression models, the \(\beta\) coefficient represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable. When I partially back transform the slope parameter in logistic regression, I obtain the odds ratio (exercise 2b). To do so, I exponentiate the slope parameter. This represents the ratio of the odds of the dependent variable between two groups that differ by one unit on the independent variable.
exp(mod1$coefficients[2])
## OpenList
## 0.3788176
When we do this, the reference point is 1. All ratios above 1 imply a positive relationship, all below a negative one. The odds ratio is the factor by which the odds of \(y\) are multiplied when \(x\) increases by one unit. Here, the odds would change by a factor of 0.379.
Unless your exponentiated coefficient is above 2 (i.e. the effect more than "doubles"), the most intuitive way to communicate the result is in % (not percentage points, but relative increase/decrease). This is particularly true for negative relationships.
exp(mod1$coefficients[2])-1
## OpenList
## -0.6211824
The odds that MEPs from candidate-centered systems share an assistant are 62% lower, compared to members from party-centered systems.
3. First differences and scenarios
Because of the in-built interaction effect, it is particularly important to go all the way to the predicted effect by backtransforming from the logodds.
Logistic transformation: \[\frac{1}{1+exp(-x)}\]
or
\[\frac{exp(x)}{1+exp(x)}\]
First-differences are useful to illustrate a point to the reader. That is, we choose two (or four) typical scenarios and tell the reader what the absolute change in the predicted probability would be. In linear models (with linear effects), we can rely on all-else-equal statements in the construction of scenarios. However, in GLMs we can’t do that. It is more important to draw on the typical and do predictions in that neighborhood. This is where your substantive knowledge (and close study of the descriptive statistics) come in. The actual predicted effect (first difference) may or may not be impressive, depending on the size of the intercept and value of the other variables.
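As a small sketch of the reversal by hand (assuming mod1 from above), we can build the linear predictor for an MEP from a candidate-centered system (OpenList = 1) and back-transform it:

logodds <- coef(mod1)[1] + coef(mod1)[2] * 1
#Manual logistic transformation
1 / (1 + exp(-logodds))
#The same reversal with base R's plogis()
plogis(logodds)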
If the point is merely to give point estimates (expected values, average predicted effect…), I use the predict() function on a hypothetical dataset.
Here, I illustrate the predicted effect of the 2022 national election on MEPs from Venstre.
I use dplyr to select variable values from the actual data on Denmark. You can of course just use data.frame() and specify by hand. The seat shares for my scenario, I drew from Wikipedia.
share2019 <- 43/179
share2022 <- 23/179

scenario <-
  df %>%
  filter(Nationality == "Denmark") %>%
  dplyr::select(OpenList, LaborCost) %>%
  distinct(., .keep_all = T) %>%
  data.frame(SeatsNatPal.prop = c(share2019, share2022),
             Year = c(2019, 2022))
preds <- predict(mod2,
                 newdata = scenario,
                 type = "response")

#Predicted probabilities
preds
## 1 2
## 0.4174272 0.4706101
#First difference
diff(preds)
## 2
## 0.05318288
We see that the Danish Members of the European Parliament had a 5 percentage point higher likelihood of sharing an assistant after the catastrophic national election in 2022.
Note my vocabulary: When we deal with absolute predicted changes, we talk about percentage points. % is reserved for relative changes.
Graphics
An image speaks more than a thousand words; especially when we work with GLMs. Put your scenarios on speed and illustrate the effect of a predictor over its entire range instead of a simple first difference. When you do this, you will go back to the all-else-equal ideal. Set the other variables to a typical but constant value, and let your predictor of choice vary.
The choice of graphic depends on the measurement level of your predictor.
Binary predictor: tilted coefplot
When the predictor is binary, the tilted coefplot is a good choice. When it is continuous, you can use the classical effect graphic by drawing a line. In fact, the entire procedure is exactly the same as before, except for one detail: You need to reverse your dependent variable to a probability. You do this after you have simulated the predicted values (logodds), before you start plotting.
I begin by setting up my scenario.
The ggpredict() function (from the ggeffects package) decides on a typical scenario for you, given the distribution of each variable in the data. If you just want to illustrate the general take-away from your model, that's great. The simulation then also takes into account where you have the most observations, so you'll see that the uncertainty around the predictions changes depending on where you are on your predictor.
Simple: You can also let R make all the aesthetic choices for you. However, you'll have to redefine the OpenList variable into a categorical (factor) variable to get the tilted coefplot.
mod2 <- glm(PoolsLocal ~ as.factor(OpenList) + LaborCost + SeatsNatPal.prop, family = "binomial", df)
eff <- ggpredict(mod2,
                 terms = "OpenList")
eff %>% plot
By hand: If you have a specific research question (or a decision-maker with priorities) in mind, you may want to change the scenario. If I set my own scenario, I rely heavily on my descriptive table. Here, I continue with my riff on Venstre, but illustrate the effect of the electoral system. I also do the graphic from scratch.
# Prediction with a scenario
library(ggeffects)

eff <- ggpredict(mod2,
                 terms =
                   list("OpenList" = c(0, 1)),
                 condition = list(
                   LaborCost = 42,
                   SeatsNatPal.prop = 23/179
                 ))

#Check the output
eff %>% as.data.frame
eff %>%
  #Transform to data frame, and recode to more discernable variable names
  as.data.frame %>%
  #Aesthetical recoding: give the predictor proper values, then order the factor from 0 (party-centered) to 1 (candidate-centered)
  mutate(`Electoral system` = if_else(x == 1,
                                      "Candidate-centered",
                                      "Party-centered"),
         `Electoral system` = reorder(`Electoral system`, x)) %>%
  ggplot() +
  #Plot the point estimates
  geom_point(aes(
    #Predicted y-values
    y = predicted,
    #Different scenarios
    x = `Electoral system`)
  ) +
  #Plot a little line to insinuate a move from one discrete value to another
  geom_line(aes(
    #Predicted y-values
    y = predicted,
    #Different scenarios
    x = `Electoral system`),
    lty = 2
  ) +
  #Plot the uncertainty
  geom_errorbar(aes(
    #Predicted y-values
    ymin = conf.low,
    ymax = conf.high,
    #Different scenarios
    x = `Electoral system`),
    alpha = 0.3,
    #Error bars "sans serif"
    width = 0) +
  #Set the y-axis
  ylim(0, 1) +
  ylab("Probability") +
  ggtitle("Effect of ELECTORAL SYSTEM on MEPs propensity to pool resources")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
Numeric predictor: effect plot
If the predictor is numeric, you can make fine-grained scenarios and display the effect as a regression line.
Simple: I can do this in a simple way. Note how ggplot changes my y-axis to zoom in and gives the impression I've gotten a massive effect. You can change that by adding ylim(0, 1), for example.
eff <- ggpredict(mod2,
                 terms = "LaborCost")
## Data were 'prettified'. Consider using `terms="LaborCost [all]"` to get
## smooth plots.
eff %>% plot
By hand: I can also specify the scenario and make the plot from scratch.
# Prediction with a scenario
library(ggeffects)

eff <- ggpredict(mod2,
                 terms =
                   list("LaborCost" = seq(min(df$LaborCost), max(df$LaborCost), 0.1)),
                 condition = list(
                   OpenList = 1,
                   SeatsNatPal.prop = 23/179
                 ))

#Check the output
eff %>% as.data.frame %>% head
eff %>%
  ggplot() +
  #Plot a ribbon of uncertainty
  geom_ribbon(aes(
    #Predicted y-values
    ymax = conf.high,
    ymin = conf.low,
    x = x),
    #Make the ribbon transparent
    alpha = 0.3
  ) +
  #Plot a line this time
  geom_line(aes(
    #Predicted y-values
    y = predicted,
    #Different scenarios
    x = x)
  ) +
  #Set the y-axis
  ylim(0, 1) +
  ylab("Probability") + xlab("Labor cost") +
  ggtitle("Effect of LABOR COST on MEPs propensity to pool resources")
For the R enthusiast: R functions
Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is when I have a code block where I only change one or two things between each time that I use it.
The function that creates functions (sic!) is called function().
- In the parentheses, we put the arguments (i.e. the data/stuff that varies between each time we run the code).
- After that, we open a set of curly brackets. This is where we put our R codes.
- The last line in the curly brackets states if you want to store the output in an object that R should render when you run your function: return(). R will only give one object per function.
- Store your function in an R object.
my_function <- function(){}
A toy example.
I want a function that tells me if the number I put into it is smaller or larger than 10. My argument x = is the input; the output is given in the return(). For ease, I create a default value for x. I chose x = 1. However, once the function is compiled (run the R code), the user may redefine the x object to whatever value (s)he wants.
bigger_than10 <- function(x = 1){
  return(x > 10)
}

#Try it out
bigger_than10(x = 20)
## [1] TRUE
If you like your function and want to use it later, you can store it as an R object in your working directory.
save(bigger_than10,
file = "bigger_than10.rda")
Later, you fetch it as usual.
load("bigger_than10.rda")
Your turn!
Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that would actually vary across models: the predicted and the observed variables. By standardizing their names, you can make a more general R code. You can use the following code as a basis:
tab <- df %>%
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))
#Plot
ggplot() +
#Vertical lines == one per observation in the data
geom_segment(data = tab,
aes(x = index,
xend = index,
y = 0,
yend = 1,
#Separate the segments by observed outcome
color = as.factor(PoolsLocal)),
#Make colors transparent in case we have multiple observations with same probability
alpha = 0.6) +
scale_color_manual(values = c("grey", "purple")) +
#A line that illustrate the increase in predicted probability
geom_line(data = tab,
aes(y = preds1,
x = index)) +
#Esthetics
ylab("Y-hat") +
xlab("index") +
ggtitle("Separation plot") +
theme_minimal() +
theme(legend.position = "bottom")
- Create two R objects/vectors – y and y_pred – on the basis of df$PoolsLocal and df$preds1.
- Identify all the times the observed y (PoolsLocal) is used in the code. Replace the name by y. Do the same with y_pred and preds1.
- Create a data table df0 by a data.frame(y, y_pred). This is just to make sure your two variables are the same length. If they're not, R will throw you a warning. Replace df by df0 in the code.
- Remove the variable selection element (line 2); you don't need it. If the code runs, you're good to go.
- Pack the code chunk into the curly brackets of the following code: separation_plot <- function(y = NULL, y_pred = NULL){}.
- Ok! Does it run? Try out with preds1 from model 1 as well. Does it work?
Want another challenge?
- What aesthetics would you have in the function that the user can later alter?
- The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?
Tips: Your challenge will probably be to make sure that the two vectors are of equal lengths. If you have missing data in the model, this will not be the case. I solved it by indexing on the rowname, but there are many ways around the problem.