Logistic regression models
Binomial logistic models allow us to analyze binary outcomes. They are part of the GLM family of models, meaning that they rely on a recoding strategy to make the dependent variable continuous and unbounded. They also require us to pick a probability distribution in order to link observed outcomes to the probability of the event. R does all of this for us, as long as we specify the model correctly. However, the interpretation requires us to partially or completely backtrack the recoding. Also, although model assessment compares predicted and observed outcomes, we face the additional challenge of deciding on a useful cut point at which we believe the latent variable translates into 1s instead of 0s.
New R codes:
- new from base: glm(), prop.table(), chisq.test(), predict(., type = "response")
- new from dplyr: n(), distinct()
- new from ggplot2: geom_errorbar(), geom_segment() (less important)
- new from pROC: roc()
- new from separationplot: separationplot()
- new from tidyr: pivot_wider() (less important)
Old R codes:
- old from base: dim(), table(), as.data.frame()
- old from dplyr: group_by(), mutate(), reframe()
- old from ggplot2: geom_bar(), geom_col(), geom_point()
- old from ggeffects: ggpredict(., terms = , condition = )
Introduction
Main methodological takeaways
GLMs are non-linear by construction, so interpretation – but also model evaluation – requires a bit more care. Usually, it will involve back-transforming your dependent variable. I distinguish between four stages of transformation:
no transformation: the outcome variable (logodds). These are the values of the recoded outcome variable on which the model is estimated. The model output and results table report coefficients in this unit (i.e. a one-unit change in x results in a \(\beta\) change in logodds).
no transformation (logoddsratio). These are the slope coefficients from the model output and your results table. They are useful for hypothesis testing.
partial reversal (oddsratio, change in odds). Marginal effects are a one-number summary of the relative effect of a particular variable. I do this by exponentiating the slope coefficient. The focus is on the relative change in y when x changes according to your partial scenario.
reversal (probability). This is when we fill in the full scenario with values for all predictors and then back-transform to probability (the “latent variable”, z). Predicted probabilities are useful for interpretation (first difference, effect graphics) and model assessment (Brier score, separation plot).
total reversal to original scale (binary). This is when we go from the latent variable/predicted probabilities to the actual outcome. This requires us to set a cut point (\(\tau\)) and map probabilities to binary outcomes. Useful mostly for model assessment (\(\chi^2\) test, ROC curve). A short R sketch of the stages follows below.
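To make the stages concrete, here is a small illustrative R sketch of each back-transformation. The coefficient, intercept, increment and cut point values are made up for the example; they are not estimates from the data.
#A slope coefficient in logodds(ratios), as reported in the model output (illustrative value)
beta <- 1.5
#Partial reversal: oddsratio (exponentiate the coefficient)
exp(beta)
#Reversal: probability, by filling in a full (illustrative) scenario and applying the logistic function
alpha <- -1.5
plogis(alpha + beta * 0.24)
#Total reversal: binary outcome, by applying a cut point (tau)
tau <- 0.28
as.numeric(plogis(alpha + beta * 0.24) > tau)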
All model assessments are essentially i) a comparison between the predicted and the observed outcome and – sometimes – ii) a comparison between models. This is what Ward and Ahlquist (2018) refer to as “model selection”: we are choosing the best model out of several.
Substantive topic (the research)
We will be drawing on replication data from my forthcoming study of judicial reappointments to the Court of Justice of the European Union (CJEU). The CJEU is the supreme court of the EU. One of its main tasks is to perform judicial review of national legislation. That is, it can strike down government policies that it deems incompatible with EU legislation. Judges at the CJEU are appointed by their national government, and they sit for 6-year renewable terms. In the paper, we argue that governments appoint judges according to two main criteria: their potential influence on the Court’s caselaw (“performance”) and their willingness to take the caselaw in the governments’ preferred direction (“ideology”).
Research design: We study reappointment decisions when the judge’s term expires. We cannot observe judges’ ideology directly, but we can infer it: when the same judge faces two different appointing governments (i.e. an appointing and a reappointing government), their likelihood of exiting the Court should increase with the absolute preference distance between the two governments. That is, governments with different ideologies prefer different judges.
\(H_1\) As the political distance between appointing governments increases, the likelihood of a change in judge increases (positive effect).
Similarly, we hypothesize that high-performing judges – judges that have held many important positions in the past – are more likely to influence the Court’s caselaw in the future as well.
\(H_2\) As the judges’ performance increases, the likelihood of a change in judge decreases (negative effect).
The data
The data lists all instances in which a judge in the CJEU has served until the end of their mandate in the 1958-2021 period.
I start by fetching the R packages we will use from the library.
#Data wrangling
library(tidyr);library(dplyr);
#Presentation
library(ggplot2);
library(stargazer); library(ggeffects);
#New packages for model assessment
library(separationplot); library(pROC)
We begin by loading in the data.
If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory.
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/Reappointments.rda",
destfile = "Reappointments.rda")
Now, I can load it into R and rename the object to my liking (df).
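The loading step is not echoed above; a minimal sketch would look like the following, where the name of the object stored inside the .rda file (here assumed to be Reappointments) is an assumption you should check with ls() after loading.
#Load the data into the R session
load("Reappointments.rda")
#Rename the loaded object; "Reappointments" is an assumed name
df <- Reappointments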
Descriptive statistics
As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.
Univariate distribution
To approximate governments’ propensity to replace judges for political reasons, I have created a binary variable exit that indicates whether the judge exited the court at the end of their mandate or stayed on to serve another term. My main predictor is continuous: the absolute political distance between the appointing and the reappointing government (free_economy_diff). I also use a binary indicator for whether the judge sits on the upper or lower EU court (court).
I begin by exploring the univariate distribution in numbers, a table and a barplot. The most interesting thing here is to figure out how common the phenomenon is on average. I therefore calculate the mean, i.e. the proportion of judges that exit the Court.
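The number below can be obtained with a one-liner (a sketch; exit is assumed to be coded 0/1):
#Proportion of judges that exit at the end of their mandate
mean(df$exit, na.rm = TRUE)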
## [1] 0.2798507
We see that the proportion of judges that exit the Court is 0.28. That is, the predicted probability of a replacement in a base-line intercept-only model is 0.2798507.
If my phenomenon is rare or very common (outcome close to 0 or 1), it is a sign that:
- the binomial logistic regression likely yields different results than a linear model (ref. the S-shape).
- effect sizes are highly variable depending on the other covariates, meaning interpretation requires more care
- I might have problems correctly predicting successes (or failures), because they are rare.
For simple statistics, I prefer base R to ggplot2. However, you can mix the coding with dplyr pipes if you want to.
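A base R frequency table of the outcome, consistent with the output below:
#Frequency table of the outcome variable
table(df$exit)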
##
## 0 1
## 193 75
My variable is categorical (binary), so I’m going for a barplot.
#dplyr + ggplot2
df %>%
ggplot +
#Function automatically makes frequency table and displays it
geom_bar(aes(exit)) +
#Plot title
ggtitle("Distribution of judges that exited the courts")
For the time being, I am looking for whether I have “enough”
observations in the 0 and the 1 category. With few observations in
either, I risk having problems estimating reliable results for many
covariates. Here, I have 75 observations in my smallest category. That’s
promising.
Bonus plot: plotting the proportion
Since we will work with probabilities, it is useful to plot the relative distribution of the dependent variable. I can use base R to transform the frequency table (table()) to proportions (prop.table()).
## .
## 0 1
## 0.7201493 0.2798507
To illustrate that in ggplot, I make a new data frame where I calculate the height of each column, then make a barplot. When I provide both x-values and y-values to the graphic, the correct function is geom_col().
#Tidyverse + ggplot2
df %>%
#Group by outcome
group_by(exit) %>%
reframe(
#Number of observations in each group
n = n(),
#Number of observations in the data
N = dim(df)[1],
#The proportion
prop = n/N) %>%
ggplot +
geom_col(aes(y = prop,
x = exit)) +
ggtitle("Distribution of judges that exit the court")
I then explore each of the predictors. This usually implies a bunch of histograms to check if I have variation in my predictors, if I have outliers and what the range of the variables really is. At this stage, I’m only scouting for potential problems and possibilities.
Main predictor: Governments’ political distance
The main predictor to test hypothesis 1 is the absolute distance in governments’ preferences. This is a highly stylized construct; I calculated it from cabinet parties’ electoral programs, so it will require some care at the interpretation stage in order to figure out the substantive effect size.
df %>%
ggplot +
geom_histogram(aes(free_economy_diff)) +
ggtitle("Main predictor: Political distance between governments")
From the histogram, I retain that the preference distance is small in most instances, but there are a few outliers. There is plenty of variation here, which is promising.
Table of descriptive statistics
I finish off by making a table of descriptive statistics. I will use them actively when I make my interpretations. Here, I create a separate data table with only the relevant variables, so that I can pass them to stargazer() for descriptive statistics.
df0 <-
df %>%
#The select() function has namesakes in other packages, so I often specify which package I'm drawing on
dplyr::select(exit,
# court,
performance,
free_economy_diff,
age,
tenure,
attendance)
It is useful already here to make a table of descriptive statistics. I can use stargazer() to get them automatically. Here, I nevertheless specify what type of statistics I want (interquartile range, median, minimum and maximum).
library(stargazer)
stargazer(df0,
#Title of the table
title = "Descriptive statistics",
#Type of output
type = "html",
#Summary stats of data frame
summary = T,
#Additional arguments for statistics
iqr = T, median = T, min.max = T)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Median | Pctl(75) | Max |
exit | 268 | 0.280 | 0.450 | 0 | 0 | 0 | 1 | 1 |
performance | 267 | 0.925 | 0.565 | 0.000 | 0.714 | 0.975 | 1.092 | 3.854 |
free_economy_diff | 251 | 0.345 | 0.393 | 0.000 | 0.071 | 0.241 | 0.461 | 2.586 |
age | 268 | 60.271 | 8.247 | 37.723 | 54.605 | 60.288 | 65.949 | 83.819 |
tenure | 268 | 7.800 | 4.916 | 1.003 | 3.869 | 6.003 | 11.184 | 32.104 |
attendance | 253 | 5.286 | 21.811 | -50.370 | -9.056 | 1.724 | 17.816 | 70.630 |
Bivariate statistics
My main predictor this time is a continuous variable (free_economy_diff) that measures the absolute preference distance between the appointing and reappointing government.
I can calculate Pearson’s R.
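The output below comes from a standard correlation test along these lines:
#Pearson's correlation test between the outcome and the main predictor
cor.test(df$exit, df$free_economy_diff)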
##
## Pearson's product-moment correlation
##
## data: df$exit and df$free_economy_diff
## t = 4.4536, df = 249, p-value = 1.276e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1529470 0.3825747
## sample estimates:
## cor
## 0.2716223
Here, I find that the two are positively correlated, and the correlation is statistically significant. More interestingly, I can calculate the shared variation between the two by correlating them and then taking the square.
R2 <-
cor(df$exit, df$free_economy_diff,
#How to handle NAs in this function
use = "pairwise.complete.obs")^2
R2
## [1] 0.07377867
They share 7 % of their variation.
Of course, you can also explore this visually. Here, I show where the observations are located on the x-axis through a “rugplot”.
df %>%
ggplot +
geom_smooth(aes(
y = exit,
x = free_economy_diff
)) +
#Rugplot shows observations on the x-axis
geom_point(
aes(
y = 0,
x = free_economy_diff
),
shape = "|"
) +
ggtitle("Correlation between political distance and judges' exit")
From the plot, we can see that there is a positive correlation between exit decisions and political distance. Furthermore, we see that the curve is steepest where there are the most observations on the predictor. There are a few outliers that subdue the final effect somewhat.
Analysis
Estimate a binomial logistic regression
The function for generalized linear models in R is glm(). This time, we also have to specify the probability distribution that will allow us to map the recoded dependent variable (the logodds) to probabilities.
I run a base-line model with my one predictor, then a second, more complex model where I also control for alternative explanations for judges’ exit. The main alternative explanation for a judge’s exit is that they leave of their own volition, so I control for their career stage: age, tenure and change in attendance rate.
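The estimation itself is not echoed in this section; a sketch of what the two calls would look like, assuming the variable names from the table of descriptive statistics and the default logit link:
#Base-line model: political distance as the only predictor
mod1 <- glm(exit ~ free_economy_diff,
            family = binomial(link = "logit"),
            data = df)
#More complex model: adding performance, court and career-stage controls
mod2 <- glm(exit ~ free_economy_diff + performance + court + age + tenure + attendance,
            family = binomial(link = "logit"),
            data = df)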
Present your results: table of results
Although Ward and Ahlquist (2018) don’t like regression tables (i.e. the BUTONs; Big Ugly Table of Numbers), it is useful to present them too.
stargazer::stargazer(mod1, mod2,
dep.var.caption = "Dependent variable",
title = "Exit decisions at the Court of Justice of the European Union (a binomial logit)",
type = "html")
Dependent variable | ||
exit | ||
(1) | (2) | |
free_economy_diff | 1.472*** | 1.490*** |
(0.378) | (0.443) | |
performance | -0.556* | |
(0.307) | ||
courtGC | 0.809** | |
(0.376) | ||
age | 0.132*** | |
(0.027) | ||
tenure | 0.045 | |
(0.036) | ||
attendance | -0.015* | |
(0.009) | ||
Constant | -1.548*** | -9.746*** |
(0.210) | (1.759) | |
Observations | 251 | 235 |
Log Likelihood | -138.038 | -110.922 |
Akaike Inf. Crit. | 280.076 | 235.844 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
The table lets us compare regression coefficients directly between models. They are reported as changes in logodds (our recoded, “latent” variable). Gelman and Hill (2007) have argued that we can think of regressions as a comparison of means, especially when the predictors are categorical. This is still the case. However, there is a difference. The linear model reports the difference in means in the unit of measurement of the dependent variable. For us, that was the number of local staffers hired by MEPs in previous weeks. The binomial logistic regression instead reports the slope coefficients as logoddsratios. That is, for a binary predictor such as the court (General Court vs. Court of Justice), it compares the GC with the CJ by calculating the ratio between the two (it divides the odds in one group by the odds in the other) and reports the log of that ratio.
The coefficients are therefore always proportional to the intercept, so the size of the coefficients is not really comparable across GLM models. Here, we see that political distance has a larger coefficient in the second model, but the intercept is also much smaller. The relative change in the probability (or odds, or logodds) is therefore larger, but it comes from a lower base-line level. The table is still useful for checking the precision (standard error and stars) of the estimates.
Now that I have settled on my model, I will use the interpretation to tell my story, then assess the model’s performance.
Interpretation
Usually, I’d check the performance of the model first, but we’ll do that later in this notebook. Once we are satisfied with the performance of our model, we can start playing around with it. What does it tell us?
1. Regression coefficients
The equation that R optimizes for us is linear. The regression coefficient \(\beta\) therefore reports changes in logodds (my recoded \(y\)).
\(y \sim \alpha + \beta \times x\)
Regression coefficients reported in logodds are useful for hypothesis testing. I can easily see whether my effects are positive or negative and whether they are statistically different from zero. From the results table, I can see that governments with different preferences tend to prefer different judges. Specifically, the effect of political distance is positive and statistically significant, meaning that as the political distance between appointing and reappointing governments increases, the likelihood that a judge is replaced also increases. But by how much?
GLMs have an inbuilt interaction effect due to the recoding of the observed variable (the log-transformation) and the models’ mapping of events on the Bernoulli probability distribution (i.e. a binomial probability distribution with one success/failure for each trial). This requires a more careful interpretation on our side.
When I interpret, I usually follow up with three successive steps.
2. Marginal effect
In linear regression models, we can interpret the \(\beta\) coefficient (slope parameter/coefficient) as the increase in \(y\) when \(x\) increases by one unit. In logistic regression models, the \(\beta\) coefficient represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable. This is too abstract. We’ll have to backtrack to make this legible.
Step 1: construct a partial scenario
To make the interpretation of the marginal effect, I create a partial scenario. That is, I determine an increment in x that I find appealing. This can be informed by the data, a real-world or well-known anecdote or just a typical change. It is rarely useful to let x vary by one unit or from minimum to maximum. In our example, a one-unit increment in political distance is really too big to be realistic; let alone the min-max scenario. We saw this in the bivariate plot.
A few options are:
- interquartile range
- a quartile increment
- min to a quartile increment
- any increment informed by your exploration of the data
It depends on the data.
Here, I opt for a minimum to median change. We saw from the histogram that this is where most changes are.
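The summaries below come from calls along these lines (a sketch):
#Distribution of the political distance variable
summary(df$free_economy_diff)
#The median, i.e. the increment for a minimum-to-median scenario (the minimum is 0)
summary(df$free_economy_diff)["Median"]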
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.07139 0.24067 0.34545 0.46069 2.58621 17
## Median
## 0.2406742
Step 2: fill in to calculate the change in logodds
Now, I can fill in to calculate the change.
\(\beta x =\) 1.4895535 \(\times\) 0.2406742 = 0.3584971
Step 3: partial backtransformation to relative change (%)
I can now back-transform from logoddsratio to % change.
I first take the exponent of the slope parameter (multiplied by the scenario increment) to get the oddsratio. This represents the ratio of the odds of the dependent variable between two groups that differ by the chosen increment on the independent variable.
\(exp(\beta x)\) = 1.4311769
## free_economy_diff
## 1.431177
When we do this, the reference point is 1. All ratios above 1 imply a positive relationship, all below 1 a negative one. The oddsratio is the factor of change in the odds of \(y\) when \(x\) increases according to the scenario. Here, the odds would increase by a factor of 1.4.
Unless your exponentiated coefficient is above 2 once you’ve created your partial scenario (i.e. the effect more than “doubles”), the most intuitive way to communicate the result is in % (not percentage points, but relative increase/decrease). This is particularly true for negative relationships.
\((exp(\beta x) - 1) \times 100\)
## free_economy_diff
## 43.11769
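The numbers in steps 2 and 3 can be reproduced directly from the model object; a sketch, assuming mod2 is the full model:
#Scenario: a change from the minimum (0) to the median of political distance
x_change <- summary(df$free_economy_diff)["Median"]
#Step 2: change in logodds
coef(mod2)["free_economy_diff"] * x_change
#Step 3: oddsratio for the scenario ...
exp(coef(mod2)["free_economy_diff"] * x_change)
#... and the relative change in the odds, in percent
(exp(coef(mod2)["free_economy_diff"] * x_change) - 1) * 100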
A “typical” change in government preferences (from the minimum to the median) would increase the odds that a judge is replaced by roughly 43%.
This sounds like quite an effect! What are the practical implications of this result?
3. First differences and scenarios
Because of the in-built interaction effect in GLMs, it is often interesting to go all the way to the predicted effect by filling in a few full-fledged scenarios and backtransforming the prediction from logodds to probabilities (the probability of an exit).
Logistic transformation: \[\frac{1}{1+exp(-x)}\]
or
\[\frac{exp(x)}{1+exp(x)}\]
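Base R has this transformation built in as plogis() (and its inverse as qlogis()), so there is no need to type out the formula:
#plogis(x) = 1/(1 + exp(-x)): from logodds to probability
plogis(0)
#E.g. the intercept of model 1: the predicted probability when political distance is 0
plogis(-1.548)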
First-differences are useful to illustrate a point to the reader. That is, we choose two (or four) typical scenarios and tell the reader what the absolute change in the predicted probability would be. In linear models (with linear effects), we can rely on all-else-equal statements in the construction of scenarios. However, in GLMs we can’t do that. It is more important to draw on the typical and do predictions in that neighborhood. This is where your substantive knowledge (and close study of the descriptive statistics) come in. The actual predicted effect (first difference) may or may not be impressive, depending on the size of the intercept and value of the other variables.
If the point is merely to give point estimates (expected values, average predicted effect…), I use the predict() function on a hypothetical dataset.
I’ve picked four substantive scenarios as follows (similar to the slides): What is the predicted probability that a judge exits the court when he faces a government similar to the one that appointed him vs. a politically distant government (the same increment as in the partial scenario above)?
To add to the fun, I also vary their age, so that I compare the same scenario for young and old judges as well (40 vs. 65 years).
Lastly, I’d have to set the value for all other covariates. I’ve opted for a judge at the Court of Justice that performs equal to the Court (i.e. the share of “important” cases in his portfolio is equal to that of the Court), who has not changed his participation rate in the last year and who is at the end of his first term at the Court.
I store the scenario as a data frame.
scenarios <-
data.frame(
#Reasonable increment in preference distance between governments
free_economy_diff =
#rep() repeats the vector as many times as I want; here == 2
rep(summary(df$free_economy_diff)[c(1,3)],
2),
#Judge performance == court performance
performance = 1,
#A 40 and a 65 year old judge
age = c(40, 40, 65, 65),
#No change in attendance
attendance = 0,
#After one term in office
tenure = 6,
#At the Court of Justice
court = "CJ"
)
… then I predict the outcome rather than filling in the equation by hand. The type = "response" argument specifies that I want the prediction backtransformed/translated to probabilities. I store the predictions as a variable in the same data set.
scenarios <-
scenarios %>%
mutate(
preds = predict(mod2, scenarios, type = "response")
) %>%
#Calculate first-difference within each pair of scenario
group_by(age) %>%
mutate(first_diff = preds - lag(preds))
#check the results
scenarios
Wow! The predicted probability of exiting the Court is low for young judges (about 1 percentage point), and substantially higher for older judges (about 25 percentage points).
We also see that the first differences vary between the two groups. A moderate shift in government preferences would not change the probability of an exit much for the youngest members of the Court: for a young judge who is performing fine, there is almost no change in the likelihood of an exit. For older judges, the effect is a 6 percentage point increase. Although the marginal effect of political preferences was pretty impressive – we found that the relative increase in the odds of an exit was roughly 43% when the political distance increased – the first difference is actually pretty small.
This is a pretty extreme scenario; the average age in the sample is just over 60 years. Also, it is unlikely that the effect of age is linear; I’d expect voluntary exits to be frequent around judges’ retirement age, but not much before. However, it illustrates my point that binomial logistic regressions have an in-built interaction effect.
If you want to calculate the uncertainty around this scenario, you
can use the simulations in the ggeffects
package.
4. Graphical display/scenarios on speed.
We will usually want to communicate our results through visuals. With a continuous variable, I’d go for an effect plot. You can define your own scenario by specifying the varying variables with the terms= argument. The ggpredict() function sets the other covariates to realistic values given the data. The condition= argument is optional, but this is where you’d define the values of all covariates you want to hold constant.
eff.pref <- ggpredict(mod2,
#Let the political distance vary across its empirical range
terms = "free_economy_diff[all]")
eff.pref %>%
plot +
#Add a rugplot (as previously)
geom_point(
#Specify that I use a different data set here
data = df,
#The coordinates
aes(
y = 0,
x = free_economy_diff
),
shape = "|"
) +
ylab("Probability of replacing the judge") +
xlab("Political distance between appointing governments") +
ggtitle("Effect of ideology on judicial appointments")
Model assessment
An important step of any analytic work is, of course, to check if the model choice is correct. That is, before we begin the interpretation of the results, we check if the model provides an appropriate description of our phenomenon. All model statistics are – in effect – variations over a comparison between the predicted and the observed values.
The comparisons I make here involve a varying degree of reversal of the recoding of the dependent variable. We can either transform the predicted values into probabilities, or go all the way to discrete outcomes (binary). Each choice has its advantages.
Reversal to probabilities
The model predictions can be reported as a probability of observing a positive outcome for each observation in our data. In our case, the model predicts the probability that a judge exits the Court. In other words, a model that does a good job in describing the outcome would assign high probabilities to judges that exit the court, and low probabilities to judges that stay.
I use the predict(., type = "response") function in base R to specify that I want the predicted values given as probabilities. This time, I use the predict function to do predictions in-sample. That is, the newdata = argument is my own data frame.
df <- df %>%
mutate(
#Prediction from model 1
preds1 = predict(mod1,
#In-sample prediction
newdata = df,
#I want probabilities
type = "response"),
#Predictions from model 2
preds2 = predict(mod2,
#In-sample prediction
newdata = df,
#I want probabilities
type = "response")
)
We can admire the distribution of the predicted values in a histogram. What do you think?
The predictions from the first model are driven by a single predictor, so their distribution mirrors the distribution of the political distance variable. We can do the same with the more complex model.
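A sketch of such a histogram for the predictions from model 2:
#Distribution of the in-sample predicted probabilities from model 2
df %>%
  ggplot +
  geom_histogram(aes(preds2)) +
  ggtitle("Predicted probabilities of an exit (model 2)")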
The histogram of predicted probability gives a sense of the distribution of probabilities, but how can we compare with the model’s observed outcomes? The challenge in the assessment is that we compare a continuous prediction with discrete observed values.
A single-number summary
Brier score
Ward and Ahlquist (2018) suggest using the Brier score as a single-number summary of how well your model predicts in sample.
The Brier score summarizes the average prediction error (something like the \(R^2\) for linear models). It calculates the difference between the predicted and observed outcomes, squares these differences and takes the mean. The squaring is needed because, in an unbiased model, the residuals would otherwise average out to 0.
\[B_b \equiv \frac{1}{n}\Sigma^{n}_{i = 1}(\hat{\theta_i} - y_i)^2\]
The Brier score ranges from 0 to 1 where lower scores indicate smaller distance between the predicted and observed outcome. Here, my model seems to perform pretty well.
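A minimal in-sample calculation, assuming preds2 holds the predicted probabilities from model 2:
#Brier score: mean squared distance between predicted probabilities and observed outcomes
mean((df$preds2 - df$exit)^2, na.rm = TRUE)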
A bonus: the Hosmer-Lemeshow test
The Hosmer-Lemeshow test orders the data according to the predicted probability of a positive outcome, then slices the data into groups; often 10 groups. It then counts the number of successes per group and performs a chi-square test comparing the observed and expected counts.
There are several packages that perform the test for you:
glmtoolbox: We can use the glmtoolbox package to fetch the function for the test. It suffices to feed the model object into the hltest() function.
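A minimal call (assuming the package is installed):
#Hosmer-Lemeshow test of model 2
library(glmtoolbox)
hltest(mod2)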
##
## The Hosmer-Lemeshow goodness-of-fit test
##
## Group Size Observed Expected
## 1 24 0 0.7099593
## 2 24 1 1.6778688
## 3 24 4 2.5980216
## 4 24 4 3.6911927
## 5 24 7 5.0078211
## 6 24 6 6.0556472
## 7 24 8 7.9311446
## 8 24 9 10.4143126
## 9 24 12 14.2744994
## 10 19 16 14.6395329
##
## Statistic = 4.69259
## degrees of freedom = 8
## p-value = 0.78987
Here, we see that in the first decile of the predicted probabilities, there are 0 judges that exited the Court, while the model – when we sum over the predicted probabilities – expects about 0.7 exits (i.e. less than one judge).
Note that the test checks whether the observed and the predicted outcomes are significantly different from each other. We don’t want them to be different: we want the predicted groups to be similar to the observed groups. Here, the p-value is 0.79; rather good news for us.
ResourceSelection: Another package is the ResourceSelection package, using the hoslem.test() function. It asks for the observed outcomes and the predicted probabilities, respectively. If you have NAs in your predictions, the function breaks down, so you’d have to subset the variables to get a comparison.
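The call behind the output below (note the subsetting that removes observations with missing predictions):
library(ResourceSelection)
#Observed outcomes and predicted probabilities, without NAs
hoslem.test(df$exit[!is.na(df$preds2)], df$preds2[!is.na(df$preds2)])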
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: df$exit[!is.na(df$preds2)], df$preds2[!is.na(df$preds2)]
## X-squared = 6.4216, df = 8, p-value = 0.6001
The Hosmer-Lemeshow test is sensitive to how many groups we create, so it is not a single fix to tell us whether our predictions are any good. The different test functions also yield slightly different results, so you will have to use some discernment when assessing your model fit. However, you can play around and see whether your model predictions improve as you add covariates. E.g. you can compare model 1 and model 2.
A visual comparison: the separation plot
We can state the same intuition differently: we should see few judges that leave the Court at the lower end of the predicted probability scale, and a higher density of judges that are replaced at the high end of the probability scale.
We can even illustrate this in a plot: The “separation plot”.
Here, I select the two variables (observed and predicted outcomes). I then randomly re-order the data by drawing my data anew using the sample() function. This is because I will arrange the data according to the predicted probability of leaving office, but observations with identical probabilities may vary in whether I observe a 0 or a 1 in the outcome. It means that the plot will look slightly different every time, especially if I have many categorical predictors, but few numeric ones. Last, I order the data according to the judges’ predicted probability of exiting and create an index variable from 1 to N = 268 (the number of rows in the data). The index will constitute my coordinates on the x-axis.
tab <-
df %>%
#Select the predicted probabilities and the observed outcomes
dplyr::select(exit, preds2) %>%
#Randomize the order of the data
arrange(sample(nrow(df))) %>%
#Sort the data from low to high probability
arrange(preds2) %>%
#Create an index variable for the x-axis
mutate(index = 1:nrow(df))
The separation plot then plots a bar (segment) for each observation in the data. The bars are color coded according to the observed outcome, but sorted by their predicted probability on the x-axis.
#Plot
tab %>%
ggplot() +
#Vertical lines == one per observation in the data
geom_segment(aes(x = index,
xend = index,
y = 0,
yend = 1,
#Separate and color the segments by predicted outcome
color = as.factor(exit)),
#Make colors transparent in case we have multiple observations with the same probability
alpha = 0.6) +
#A line that illustrate the increase in predicted probability
geom_line(data = tab,
aes(y = preds2,
x = index))
The black line illustrates how the probability of observing a “success” (1) increases when we order the data from low to high probability. If there is a clear separation in the predicted probabilities between the two groups, this line will have a sharp increase at some point. It will also start close to 0 and finish high up towards 1 on the y-axis. We would furthermore see many green segments (exit == 1) to the right of the plot, where the predicted probability of observing 1s is high, and many grey segments to the left, where the predicted probability is low and we expect mostly 0s (exit == 0).
The original R package for this plot still works, but its interface comes across as old. If you run this code, you’ll have a little R window popping up with the plot; it doesn’t appear in your “Viewer” window.
library(separationplot)
test <- df %>% filter(!is.na(preds1))
separationplot(pred = test$preds1,
actual = test$exit)
If you are an avid R coder, you can write up your own function to make the plot (See the end of this script).
Your turn:
- Copy-paste my code chunk for the separation plot into chatGPT and ask it to comment. What does it tell you?
- Run the code on the predictions for model 1. How does the separation plot change? Why?
Choosing a cut point
An alternative approach is to back-transform the probability variable into 0s and 1s. But what cut point (\(\tau\)) should I use? Ward and Ahlquist (2018) warn against simply slicing the variable in two by cutting at 0.5 (in the middle). You can explore the intuition for why that would be such a bad idea in the problem set.
A single cut point (\(\tau\))
As a rule of thumb, I often use the prediction from the base-line model, i.e. the intercept-only model. It is the value at which I’d predict as much wrong in one direction as in the other without any predictors in the model.
## [1] 0.2798507
I can tabulate the predicted and the observed values against each other in a cross table.
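The classification step is not echoed above; a sketch of how the predicted classes and the cross table can be produced, assuming the cut point is the observed proportion of exits and the predictions come from model 2 (preds2):
#Cut point: the proportion of exits in the data (the intercept-only prediction)
tau <- mean(df$exit, na.rm = TRUE)
#Translate predicted probabilities into predicted 0/1 outcomes
df <- df %>%
  mutate(exit_pred = as.numeric(preds2 > tau))
#Cross table of predicted (rows) against observed (columns) outcomes
table(exit_pred = df$exit_pred, exit = df$exit)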
## exit
## exit_pred 0 1
## 0 139 35
## 1 44 33
df %>%
group_by(exit, exit_pred) %>%
reframe(N = n()) %>%
mutate(correct_prediction = (exit_pred == exit))
To make the table more comparable, I can calculate the proportion of correct predictions by the values of the dependent variable. That is, I can calculate the correct/false prediction rate for successes (exit == 1) and the correct/false prediction rate for failures (exit == 0).
#Base R
tab <-
df[c("exit_pred", "exit")] %>%
#cross table
table %>%
#proportions by column of the observed outcome (each column sums to 1)
prop.table(., margin = 2)
tab
## exit
## exit_pred 0 1
## 0 0.7595628 0.5147059
## 1 0.2404372 0.4852941
#
tab <-
df %>%
#Group by outcome
group_by(exit) %>%
#Mean/proportion of 0 predictions and 1 predictions
reframe(Predicted0 = mean(exit_pred == 0, na.rm = T),
Predicted1 = mean(exit_pred == 1, na.rm = T))
tab
The proportion of correct predictions is reported along the main diagonal. The wrong predictions are reported off the diagonal.
Here, I can clearly see that I predict the non-exits better than the exits. In other words, my model performs OK in predicting the judges that do not exit the court (exit == 0; 76% correct). However, my model performs worse when predicting judges that exit after their mandate ended (49% correct).
It is a good idea to report the correct positives and negatives in your results table. If so, I usually go for a cut point decided by the proportion of successes in the data. In other words, I ask whether my model with predictors predicts better than a simple guess from a model without predictors.
#For esthetics, round off proportions to two digits
tab <- round(tab, digits = 2)
stargazer(mod2,
#Add two lines
add.lines = list(
c("Correct positives (prop.)", tab$Predicted1[2]),
c("Correct negatives (prop.)", tab$Predicted0[1])
),
type = "html")
Dependent variable: | |
exit | |
free_economy_diff | 1.490*** |
(0.443) | |
performance | -0.556* |
(0.307) | |
courtGC | 0.809** |
(0.376) | |
age | 0.132*** |
(0.027) | |
tenure | 0.045 |
(0.036) | |
attendance | -0.015* |
(0.009) | |
Constant | -9.746*** |
(1.759) | |
Correct positives (prop.) | 0.49 |
Correct negatives (prop.) | 0.76 |
Observations | 235 |
Log Likelihood | -110.922 |
Akaike Inf. Crit. | 235.844 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
This illustrates an important point. When we reverse predictions to discrete variables, we will have to decide on our priorities. Would we want to predict true positives (i.e. correct predictions of 1s) really accurately, but be less stringent about the true negatives (i.e. less accurate predictions of 0s)? Or the opposite? Or would we like to balance the two? In most political science applications, we’d opt for the balanced outcome.
Letting the cutpoint vary: The ROC curve
The ROC curve is a graphical representation that shows the performance of a binary model at all possible cutpoints/thresholds (from probability 0 to 1). The model will predict either a positive or negative outcome for a given observation based on that probability threshold.
The ROC curve plots the correct positive rate (sensitivity) against the incorrect positive rate (1 - specificity) for different threshold values. As the threshold value/cut point increases, the model becomes more conservative and predicts fewer positive outcomes, which in turn decreases the false positive rate but also decreases the true positive rate.
A good classifier will have a high true positive rate and a low false positive rate, which corresponds to a curve that is closer to the top-left corner of the ROC plot. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, where a value of 1 indicates perfect classification and a value of 0.5 indicates random guessing.
#load necessary packages; install it if need be
library(pROC)
#define the object to plot
roc1 <- roc(
#Observed values
response = df$exit,
#Predicted values
predictor = df$preds1)
#Create the ROC plot
ggroc(roc1) +
ggtitle("Receiver Operating Curve (ROC)",
subtitle = "Model 1") +
theme_classic()
We begin with the roc() function that does the calculations for us (sliding the cut point, comparing outcomes). The ggroc() function then plots the object. It’s a ggplot object, so I can add all the usual esthetics to it.
Let’s see how my more complex model is doing.
#define the object to plot
roc2 <- roc(
#Observed values
response = df$exit,
#Predicted values
predictor = df$preds2)
#Create the ROC plot
ggroc(roc2) +
ggtitle("Receiver Operating Curve (ROC)",
subtitle = "Model 2") +
theme_classic()
We can now see how the model improves. The package even has a function to explicitly compare (and choose between) models.
roc2 <- roc(
#Observed values
response = df$exit,
#Predicted values
predictor = df$preds2)
roc.test(roc1, roc2)
##
## DeLong's test for two correlated ROC curves
##
## data: roc1 and roc2
## Z = -2.9021, p-value = 0.003707
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.19133552 -0.03707599
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.6738184 0.7880242
The test reports the area under the curve (AUC) for each model and even tests whether the difference in AUCs between the models is statistically significant.
The roc2 object is a list. It means that it contains a lot of information that we can extract using indexing. I can ask for all the list elements.
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"
… and extract the area under the curve:
## Area under the curve: 0.788
What is my ideal cut point/threshold value? The ROC curve is mostly used for assessing the model’s performance. However, if we have a set of cut points and the true positive and true negative prediction rates for each cut point, then we can find the cut point that best balances the two types of errors. That is, we are interested in where the two lines intersect.
ggplot() +
geom_line(aes(y = roc2$sensitivities,
x = roc2$thresholds)) +
geom_line(aes(y = roc2$specificities,
x = roc2$thresholds),
lty = 2)
id.threshold <- (roc2$sensitivities - roc2$specificities) %>%
abs %>%
which.min
#Best balance
roc2$thresholds[id.threshold]
## [1] 0.2584505
Using this as a threshold, my cross table would look like this:
df %>%
group_by(exit) %>%
reframe(Predicted0 = mean(preds2 <= roc2$thresholds[id.threshold], na.rm = T),
Predicted1 = mean(preds2 > roc2$thresholds[id.threshold], na.rm = T))
As is clear, at this threshold, I’m (almost) equally wrong/right in my predictions for each group of the outcome (exit). You can see this from the diagonal.
For a recap, you may for example read this intro: https://drfloreiche.github.io/logit.html
For the R enthusiast: R functions
Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is when I have a code block where I only change one or two things between each time I use it.
The function that creates functions (sic!) is called function().
- In the parentheses, we put the arguments (i.e. the data/stuff that varies between each time we run the code).
- After that, we open a set of curly brackets. This is where we put our R code.
- The last line inside the curly brackets states what output R should render when you run your function: return(). R will only give one object per function.
- Store your function in an R object.
my_function <- function(){}
A toy example: I want a function that tells me if the number I put into it is smaller or larger than 10. My argument x = is the input; the output is given in the return(). For ease, I create a default value for x. I chose x = 1. However, once the function is compiled (run the R code), the user may redefine the x object to whatever value (s)he wants.
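The function itself is not echoed above; a minimal version consistent with the description (and with the TRUE printed below for the default x = 1) would be:
my_function <- function(x = 1){
  #Is the input smaller than 10?
  result <- x < 10
  return(result)
}
#Run with the default value
my_function()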
## [1] TRUE
If you like your function and want to use it later, you can store it as an R object in your working directory.
Later, you fetch it as usual.
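One way to do this (a sketch; the file name is arbitrary):
#Store the compiled function in the working directory...
save(my_function, file = "my_function.rda")
#...and fetch it again in a later session
load("my_function.rda")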
Your turn!
Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that would actually vary across models: the predicted and the observed variables. By standardizing their names, you can make the code more general. You can use the following code as a basis:
tab <- df %>%
#Select the predicted probabilities and the observed outcomes
dplyr::select(exit, preds1) %>%
#Randomize the order of the data
arrange(sample(nrow(df))) %>%
#Sort the data from low to high probability
arrange(preds1) %>%
#Create an index variable for the x-axis
mutate(index = 1:nrow(df))
#Plot
ggplot() +
#Vertical lines == one per observation in the data
geom_segment(data = tab,
aes(x = index,
xend = index,
y = 0,
yend = 1,
#Separate the segments by predicted outcome
color = as.factor(exit)),
#Make colors transparent in case we have multiple observations with the same probability
alpha = 0.6) +
scale_color_manual(values = c("grey", "purple")) +
#A line that illustrate the increase in predicted probability
geom_line(data = tab,
aes(y = preds1,
x = index)) +
#Esthetics
ylab("Y-hat") +
xlab("index") +
ggtitle("Separation plot") +
theme_minimal() +
theme(legend.position = "bottom")
- Create two R objects/vectors – y and y_pred – on the basis of df$exit and df$preds1.
- Identify all the places the observed y (exit) is used in the code. Replace the name with y. Do the same with y_pred and preds1.
- Create a data table df0 with data.frame(y, y_pred). This is just to make sure your two variables are the same length. If they’re not, R will throw you a warning. Replace df with df0 in the code.
- Remove the variable selection element (line 2 of the chunk); you don’t need it. If the code runs, you’re good to go.
- Pack the code chunk into the curly brackets of the following code: separation_plot <- function(y = NULL, y_pred = NULL){}.
- Ok! Does it run? Try it out with preds1 from model 1 as well. Does it work?
Want another challenge?
- What aesthetics would you have in the function that the user can later alter?
- The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?
Tips: Your challenge will probably be to make sure that the two vectors are of equal lengths. If you have missing data in the model, this will not be the case. I solved it by indexing on the rowname, but there are many ways around the problem.