Logistic regression models

Binomial logistic models allow us to analyze binary outcomes. They are part of the GLM family of models, meaning that they require a recoding strategy to make the dependent variable continuous and unbounded. They also require us to pick a probability distribution in order to link observed outcomes to the probability of the event. R does all of this for us, as long as we specify the model correctly. However, the interpretation requires us to partially or completely backtrack the recoding. Also, although model assessment compares predicted and observed outcomes, we have the additional challenge of deciding on a useful cut point at which we believe the latent variable translates into 1s instead of 0s.

New R codes:

  • new from base: glm(), prop.table(), chisq.test(), predict(., type = "response")
  • new from dplyr package: n(), distinct()
  • new from ggplot: geom_errorbar(), geom_segment() (less important)
  • new from pROC: roc()
  • new from separationplot: separationplot()
  • new from tidyr: pivot_wider() (less important)

Old R codes:

  • old from base: dim(), table(), as.data.frame()
  • old from dplyr package: group_by(), mutate(), reframe()
  • old from ggplot: geom_bar(), geom_col(), geom_point()
  • old from ggeffects: ggpredict(., terms = , condition = )

Introduction

Main methodological takeaways

GLMs are non-linear by construction, so interpretation – but also model evaluation – requires a bit more care. Usually, it will involve back-transforming your dependent variable. I distinguish the following stages of transformation (the short R sketch after the list illustrates each step):

  • no transformation: the outcome variable (logodds). These are the values of the recoded outcome variable on which the model is estimated. The model output and results table report coefficients in this unit (i.e. a one-unit change in x results in a \(\beta\) change in logodds).

  • no transformation (logoddsratio). These are the slope coefficients from the model output and your results table. They are useful for hypothesis testing.

  • partial reversal (oddsratio, change in odds). Marginal effects are a one-number summary of the relative effect of a particular variable. I obtain them by exponentiating the slope coefficient. The focus is on the relative change in y when x changes according to your partial scenario.

  • reversal (probability). This is when we fill in the full scenario with values for all predictors and then back-transform to probability (the “latent variable”, z). Predicted probabilities are useful for interpretation (first difference, effect graphics) and model assessment (Brier score, separation plot).

  • total reversal to original scale (binary). This is when we go from the latent variable/predicted probabilities to the actual outcome. This requires us to set a cut point (\(\tau\)) and map probabilities to binary outcomes. Useful mostly for model assessment (\(\chi^2\) test, ROC curve).
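As a compact preview, here is how those stages map onto R functions. The intercept and slope below are made-up numbers; only the operations matter.

#A hypothetical intercept and slope (made-up numbers, for illustration only)
alpha <- -1.5          #intercept (logodds)
beta  <- 0.8           #slope (logoddsratio)

#No transformation: this is what glm() reports
beta

#Partial reversal: exponentiate to get the oddsratio
exp(beta)

#Reversal: fill in a full scenario (here x = 1) and back-transform to a probability
plogis(alpha + beta * 1)

#Total reversal: map the probability to a binary outcome with a cut point (tau = 0.5 here)
as.numeric(plogis(alpha + beta * 1) > 0.5)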

All model assessments are essentially i) a comparison between the predicted and the observed outcomes and – sometimes – ii) a comparison between models. This is what Ward and Ahlquist (2018) refer to as “model selection”: we are choosing the best model out of several.

Substantive topic (the research)

We will be drawing on replication data from my forthcoming study of judicial reappointments to the Court of Justice of the European Union (CJEU). The CJEU is the supreme court of the EU. One of its main tasks is to perform judicial review of national legislation. That is, it can strike down government policies that it deems incompatible with EU legislation. Judges at the CJEU are appointed by their national government, and they sit for 6-year renewable terms. In the paper, we argue that governments appoint judges according to two main criteria: their potential influence on the Court’s caselaw (“performance”) and their willingness to take the caselaw in the governments’ preferred direction (“ideology”).

Research design: We study reappointment decisions when the judge’s term expires. We cannot observe judges’ ideology directly, but we can infer it: when the same judge faces two different appointing governments (i.e. an appointing and a reappointing government), their likelihood of exiting the Court should increase with the absolute preference distance between the two governments. That is, governments with different ideologies prefer different judges.

\(H_1\) As the political distance between appointing governments increases, the likelihood of a change in judge increases (positive effect).

Similarly, we hypothesize that high-performing judges – judges that have held many important positions in the past – are more likely to influence the Court’s caselaw in the future as well.

\(H_2\) As the judges’ performance increases, the likelihood of a change in judge decreases (negative effect).

The data

The data lists all instances in which a judge in the CJEU has served until the end of their mandate in the 1958-2021 period.

I start by fetching the R packages we will use from the library.

#Data wrangling
library(tidyr);library(dplyr); 
#Presentation
library(ggplot2);
library(stargazer); library(ggeffects); 
#New packages for model assessment
library(separationplot); library(pROC)

We begin by loading in the data.

If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory.

download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/Reappointments.rda",
              destfile = "Reappointments.rda")

Now, I can load it into R and rename the object to my liking (df).

#Load in
load("Reappointments.rda")

#Rename
df <- Reappointments

Descriptive statistics

As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.

Univariate distribution

To approximate governments’ propensity to replace judges for political reasons, I have created a binary variable exit that indicates whether the judge exited the court at the end of their mandate or stayed on to serve another term. One of the predictors is also binary: is the judge a member of the upper or lower EU court (court)?

I begin by exploring the univariate distribution in numbers, a table and a barplot. The most interesting thing here is to figure out how common the phenomenon is on average. I therefore calculate the mean, i.e. the proportion of judges that exit the Court.

mean(df$exit)
## [1] 0.2798507

We see that the proportion of judges that exit the Court is 0.28. That is, the predicted probability of a replacement in a base-line intercept-only model is 0.2798507.

If my phenomenon is rare or very common (outcome close to 0 or 1), it is a sign that:

  • the binomial logistic regression likely yields different results than a linear model (ref. the S-shape).
  • effect sizes are highly variable depending on the other covariates, meaning interpretation requires more care
  • I might have problems correctly predicting successes (or failures), because they are rare.

For simple statistics like these, I prefer base R functions to the tidyverse. However, you can mix the coding with dplyr pipes if you want to.

#In base
table(df$exit)
## 
##   0   1 
## 193  75
#In dplyr
df %>%
  group_by(exit) %>%
  #Count number of observations in the group
  reframe(N = n())

My variable is categorical (binary), so I’m going for a barplot.

#dplyr + ggplot2
df %>%
  ggplot +
  #Function automatically makes frequency table and displays it
  geom_bar(aes(exit)) +
  #Plot title
  ggtitle("Distribution of judges that exited the courts")

For the time being, I am looking for whether I have “enough” observations in the 0 and the 1 category. With few observations in either, I risk having problems estimating reliable results with many covariates. Here, I have 75 observations in my smallest category. That’s promising.

Bonus plot: plotting the proportion

Since we will work with probabilities, it is useful to plot the relative distribution of the dependent variable. I can use base R to transform the frequency table (table()) to proportions (prop.table()).

df$exit %>%
  table %>%
  prop.table 
## .
##         0         1 
## 0.7201493 0.2798507

To illustrate that in ggplot, I make a new data frame where I calculate the height of each column, then make a barplot. When I provide both x-values and y-values to the graphic, the correct function is geom_col().

#Tidyverse + ggplot2
df %>%
  #Group by outcome
  group_by(exit) %>%
  reframe(
    #Number of observations in each group
    n = n(),
    #Number of observations in the data
    N = dim(df)[1],
    #The proportion
    prop = n/N) %>%
  ggplot +
  geom_col(aes(y = prop,
               x = exit)) +
  ggtitle("Distribution of judges that exit the court")

I then explore each of the predictors. This usually implies a bunch of histograms to check if I have variation in my predictors, if I have outliers and what the range of the variables really is. At this stage, I’m only scouting for potential problems and possibilities.

Main predictor: Governments’ political distance

The main predictor to test hypothesis 1 is the absolute distance in governments’ preferences. This is a highly stylized construct; I calculated it from cabinet parties’ electoral programs, so it will require some care at the interpretation stage in order to figure out the substantive effect size.

df %>%
  ggplot +
  geom_histogram(aes(free_economy_diff)) +
  ggtitle("Main predictor: Political distance between governments")

From the histogram, I retain that the preference distance is small for most observations, but there are a few outliers. There is plenty of variation here, which is promising.

Table of descriptive statistics

I finish off by making a table of descriptive statistics. I will use them actively when I make my interpretations. Here, I create a separate data table with only the relevant variables, so that I can pass them to stargazer() for descriptive statistics.

df0 <- 
  df %>%
  #The select() function has namesakes in other packages, so I often specify which package I'm drawing on
  dplyr::select(exit,
                # court,
                performance,
                free_economy_diff,
                age,
                tenure,
                attendance)

The stargazer() function generates the table of descriptive statistics automatically. Here, I nevertheless specify what type of statistics I want (interquartile range, median, minimum and maximum).

library(stargazer)
stargazer(df0,
          #Title of the table
          title = "Descriptive statistics",
          #Type of output
          type = "html",
          #Summary stats of data frame
          summary = T,
          #Additional arguments for statistics
          iqr = T, median = T, min.max = T)
Descriptive statistics

Statistic           N    Mean    St. Dev.  Min      Pctl(25)  Median  Pctl(75)  Max
exit                268  0.280   0.450     0        0         0       1         1
performance         267  0.925   0.565     0.000    0.714     0.975   1.092     3.854
free_economy_diff   251  0.345   0.393     0.000    0.071     0.241   0.461     2.586
age                 268  60.271  8.247     37.723   54.605    60.288  65.949    83.819
tenure              268  7.800   4.916     1.003    3.869     6.003   11.184    32.104
attendance          253  5.286   21.811    -50.370  -9.056    1.724   17.816    70.630

Bivariate statistics

My main predictor is a continuous variable (free_economy_diff) that measures the absolute preference distance between the appointing and the reappointing government.

I can calculate Pearson’s R.

cor.test(df$exit, df$free_economy_diff)
## 
##  Pearson's product-moment correlation
## 
## data:  df$exit and df$free_economy_diff
## t = 4.4536, df = 249, p-value = 1.276e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1529470 0.3825747
## sample estimates:
##       cor 
## 0.2716223

Here, I find that the two are positively correlated, and the correlation is statistically significant. More interestingly, I can calculate the shared variation between the two by correlating them and then taking the square.

R2 <- 
  cor(df$exit, df$free_economy_diff, 
      #How to handle NAs in this function
      use = "pairwise.complete.obs")^2
R2
## [1] 0.07377867

They share 7 % of their variation.

Of course, you can also inspect this visually. Here, I add a “rugplot” to show where the observations lie on the x-axis.

df %>%
  ggplot +
  geom_smooth(aes(
    y = exit,
    x = free_economy_diff
  )) +
  #Rugplot shows observations on the x-axis
  geom_point(
    aes(
      y = 0,
      x = free_economy_diff
    ),
    shape = "|"
  ) +
  ggtitle("Correlation between political distance and judges' exit")

From the plot, we can see that there is a positive correlation between exit decisions and political distance. Furthermore, we see that the curve is steepest where there are the most observations on the predictor. There are a few outliers that dampen the overall effect somewhat.


Analysis

Estimate a binomial logistic regression

The function for generalized linear models in R is glm(). This time, we also have to specify the probability distribution that will allow us to map the recoded dependent variable (the logodds) to probabilities.

I run a base-line model with my one predictor, then a second, more complex model where I also control for alternative explanations for judges’ exit. The main alternative explanation is that judges leave of their own volition, so I control for their career stage: age, tenure and the change in attendance rate.

mod1 <- glm(exit ~ 
              free_economy_diff,
            family = "binomial",
            df)

mod2 <- glm(exit ~ 
              free_economy_diff
            + performance
            + court
            + age
            + tenure
            + attendance,
            family = "binomial",
            df)

Present your results: table of results

Although Ward and Ahlquist (2018) don’t like regression tables (i.e. the BUTON: Big Ugly Table Of Numbers), it is useful to present them too.

stargazer::stargazer(mod1, mod2,
                     dep.var.caption = "Dependent variable",
                     title = "Exit decisions at the Court of Justice of the European Union (a binomial logit)",
                     type = "html")
Exit decisions at the Court of Justice of the European Union (a binomial logit)

                        Dependent variable: exit
                        (1)           (2)
free_economy_diff       1.472***      1.490***
                        (0.378)       (0.443)
performance                           -0.556*
                                      (0.307)
courtGC                                0.809**
                                      (0.376)
age                                    0.132***
                                      (0.027)
tenure                                 0.045
                                      (0.036)
attendance                            -0.015*
                                      (0.009)
Constant                -1.548***     -9.746***
                        (0.210)       (1.759)
Observations            251           235
Log Likelihood          -138.038      -110.922
Akaike Inf. Crit.       280.076       235.844
Note: *p<0.1; **p<0.05; ***p<0.01

The table lets us compare regression coefficients directly between models. They are reported as changes in logodds (our recoded, “latent” variable). Gelman and Hill (2007) have argued that we can think of regressions as a comparison of means, especially when the predictors are categorical. This is still the case. However, there is a difference. The linear model reports the difference in means in the unit of measurement of the dependent variable; for us, in previous weeks, that was the number of local staffers hired by MEPs. The binomial logistic regression instead reports the slope coefficients as logoddsratios. That is, for a binary predictor such as the court (General Court vs. Court of Justice), it compares the GC with the CJ by calculating the ratio between the two (it divides the odds in one group by the odds in the other).
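To make the “ratio of odds” reading concrete, here is a rough sketch that computes the raw odds of an exit in each court and compares them with the exponentiated courtGC coefficient. The two will differ somewhat, because the model conditions on the other covariates.

#Raw proportion and odds of an exit in each court
df %>%
  group_by(court) %>%
  reframe(p = mean(exit, na.rm = TRUE),   #proportion of exits
          odds = p / (1 - p))             #odds of an exit

#Model-based odds ratio for the General Court relative to the Court of Justice
exp(coef(mod2)["courtGC"])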

The coefficients are therefore always proportional to the intercept, so the size of the coefficients is not really comparable across GLM models. Here, we see that political distance has a larger coefficient in the second model, but the intercept is also much smaller. The relative change in the probability (or odds, or logodds) is therefore larger, but it comes from a lower base-line level. The table is still useful for checking the precision (standard error and stars) of the estimates.

Now that I have settled on my model, I will use the interpretation to tell my story, then assess the model’s performance.

Interpretation

Usually, I’d check the performance of the model first, but we’ll do that later in this notebook. Once we are satisfied with the performance of our model, we can start playing around with it. What does it tell us?

1. Regression coefficients

The equation that R optimizes for us is linear. The regression coefficients \(\beta\) therefore report changes in logodds (my recoded \(y\)).

\(y \sim \alpha + \beta \times x\)

Regression coefficients reported in logodds are useful for hypothesis testing. I can easily see whether my effects are positive or negative and whether they are statistically distinguishable from zero. From the results table, I can see that governments with different preferences tend to prefer different judges. Specifically, the effect of political distance is positive and statistically significant, meaning that as the political distance between the appointing and reappointing governments increases, the likelihood that a judge is replaced also increases. But by how much?
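If you want the coefficients and their uncertainty outside the table, you can extract them directly from the model object; confint() reports profile-likelihood confidence intervals for a glm (still in logodds).

#Coefficients in logodds, with standard errors, z-values and p-values
summary(mod2)$coefficients

#Profile-likelihood confidence intervals (also in logodds)
confint(mod2)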

GLMs have an in-built interaction effect due to the recoding of the observed variable (the log-transformation) and the model’s mapping of events on the Bernoulli probability distribution (i.e. a binomial probability distribution with one success/failure for each trial). This requires a more careful interpretation on our side.
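A quick way to see this in-built interaction is to apply the same change in logodds at two different base-line levels; the numbers below are arbitrary.

#The same +0.5 change in logodds, evaluated at two different base-line levels
plogis(-2 + 0.5) - plogis(-2)   #low base-line: small change in probability
plogis( 0 + 0.5) - plogis( 0)   #base-line around 0.5: much larger change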

When I interpret the results, I usually proceed in three successive steps.

2. Marginal effect

In linear regression models, we can interpret the \(\beta\) coefficient (slope parameter/coefficient) as the increase in \(y\) when \(x\) increases by one unit. In logistic regression models, the \(\beta\) coefficient represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable. This is too abstract. We’ll have to backtrack to make this legible.

Step 1: construct a partial scenario

To make the interpretation of the marginal effect, I create a partial scenario. That is, I determine an increment in x that I find appealing. This can be informed by the data, a real-world or well-known anecdote or just a typical change. It is rarely useful to let x vary by one unit or from minimum to maximum. In our example, a one-unit increment in political distance is really too big to be realistic; let alone the min-max scenario. We saw this in the bivariate plot.

A few options are:

  • interquartile range
  • a quartile increment
  • min to a quartile increment
  • any increment informed by your exploration of the data

It depends on the data.

Here, I opt for a minimum to median change. We saw from the histogram that this is where most changes are.

summary(df$free_economy_diff)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.07139 0.24067 0.34545 0.46069 2.58621      17
change <- summary(df$free_economy_diff)[c(3)]
change
##    Median 
## 0.2406742

Step 2: fill in to calculate the change in logodds

Now, I can fill in to calculate the change.

\(\beta x =\) 1.4895535 \(\times\) 0.2406742 = 0.3584971

logoddsratio <- coefficients(mod2)[2] * change

Step 3: partial backtransformation to relative change (%)

I can now back-transform from logoddsratio to % change.

I first take the exponent of the slope coefficient (multiplied by the scenario increment) to get the oddsratio. This represents the ratio of the odds of the dependent variable between two groups that differ by the chosen increment on the independent variable.

\(exp(\beta x)\) = 1.4311769

oddsratio <- exp(logoddsratio)
oddsratio
## free_economy_diff 
##          1.431177

When we do this, the reference point is 1. All ratios above 1 imply a positive relationship, all below 1 a negative one. The oddsratio is the factor by which the odds of \(y\) change when \(x\) increases by the chosen increment. Here, the odds would increase by a factor of 1.4.

Unless your exponentiated coefficient is above 2 once you’ve created your partial scenario (i.e. the effect more than “doubles”), the most intuitive way to communicate the result is in % (not percentage points, but relative increase/decrease). This is particularly true for negative relationships.

\((exp(\beta x) - 1) \times 100\)

marginal_effect <- (oddsratio-1) * 100
marginal_effect
## free_economy_diff 
##          43.11769

A “typical” change in government preferences (min-median) would increase the odds that a judge is replaced by roughly 43%.

This sounds like quite an effect! What are the practical implications of this result?

3. First differences and scenarios

Because of the in-built interaction effect in GLMs, it is often interesting to go all the way to the predicted effect by filling in a few full-fledged scenarios and back-transforming the prediction from logodds to percentage points (the probability of an exit).

Logistic transformation: \[\frac{1}{1+exp(-x)}\]

or

\[\frac{exp(x)}{1+exp(x)}\]
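In R, this back-transformation is the plogis() function (the logistic distribution function), so you can always verify a prediction by filling in the equation by hand. A small sketch, using the in-sample linear predictor from mod2:

#Linear predictor (logodds) for the estimation sample...
eta <- predict(mod2)

#...back-transformed to probabilities by hand...
p_hand <- 1 / (1 + exp(-eta))

#...which matches plogis() and the model's fitted probabilities
head(cbind(p_hand, plogis(eta), fitted(mod2)))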

First-differences are useful to illustrate a point to the reader. That is, we choose two (or four) typical scenarios and tell the reader what the absolute change in the predicted probability would be. In linear models (with linear effects), we can rely on all-else-equal statements in the construction of scenarios. However, in GLMs we can’t do that. It is more important to draw on the typical and do predictions in that neighborhood. This is where your substantive knowledge (and close study of the descriptive statistics) come in. The actual predicted effect (first difference) may or may not be impressive, depending on the size of the intercept and value of the other variables.

If the point is merely to give point estimates (expected values, average predicted effect…), I use the predict() function on a hypothetical dataset.

I’ve picked four substantive scenarios (similar to the slides): What is the predicted probability that a judge exits the court when he faces a government similar to the one that appointed him vs. a politically distant government (the same increment as in the partial scenario before)?

To add to the fun, I also vary their age, so that I compare the same scenario for younger and older judges as well (40 vs. 65 years).

Lastly, I have to set the values for all other covariates. I’ve opted for a judge at the Court of Justice who performs on par with the Court (i.e. the share of “important” cases in his portfolio is equal to that of the Court), who has not changed his participation rate in the last year and who is at the end of his first term at the Court.

I store the scenario as a data frame.

scenarios <- 
  data.frame(
  #Reasonable increment in preference distance between governments
    free_economy_diff =
      #rep() repeats the vector as many times as I want; here == 2
      rep(summary(df$free_economy_diff)[c(1,3)],
                            2),
    #Judge performance == court performance
    performance = 1,
    #A 40 and a 65 year old judge
    age = c(40, 40, 65, 65),
    #No change in attendance
    attendance = 0,
    #After one term in office
    tenure = 6,
    #At the Court of Justice
    court = "CJ"
  ) 

… then I predict the outcome rather than filling in the equation by hand. The type = "response" argument specifies that I want the prediction backtransformed/translated to probabilities. I store the predictions as a variable in the same data set.

scenarios <-
  scenarios %>%
  mutate(
    preds = predict(mod2, scenarios, type = "response")
  ) %>%
  #Calculate first-difference within each pair of scenario
  group_by(age) %>%
  mutate(first_diff = preds - lag(preds))

#check the results
scenarios

Wow! The predicted probability of exiting the Court is low for young judges (about 1 percentage point) and substantially higher for older judges (about 25 percentage points).

We also see that the first differences vary between the two groups. A moderate shift in government preferences would not change the probability of an exit much for the youngest members of the Court: For a young judge who is performing fine, there is almost no change in the likelihood of an exit. For older judges, the effect is a 6 percentage point increase. Although the marginal effect of political preferences was pretty impressive – we found that the relative increase in the odds of an exit was about 43% when the political distance increased – the first difference is actually pretty small.

This is a pretty extreme scenario; the average age in the sample is 60.3 years. Also, it is unlikely that the effect of age is linear: I’d expect voluntary exits to be frequent around judges’ retirement age, but not much before. However, it illustrates my point that binomial logistic regressions have an in-built interaction effect in them.

If you want to calculate the uncertainty around this scenario, you can use the simulations in the ggeffects package.
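A sketch of what that could look like, assuming the same min-to-median increment and the 65-year-old judge from the scenarios above; the remaining covariate (court) is left at the value ggpredict() picks by default, and the intervals come from the function’s standard output rather than a manual simulation.

eff.scenario <- ggpredict(mod2,
                          #The min-to-median increment in political distance
                          terms = "free_economy_diff [0, 0.24]",
                          #Covariate values mirroring the older judge in the scenarios above
                          condition = c(age = 65,
                                        performance = 1,
                                        tenure = 6,
                                        attendance = 0))
eff.scenario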

4. Graphical display/scenarios on speed.

We will usually want to communicate our results through visuals. With a continuous variable, I’d go for an effect plot. You can define your own scenario by defining the varying variables with the terms= argument. The ggpredict() function sets the other covariates to realistic values given the data. The condition= argument is optional, but this is where you’d define the values of all covariates you want to hold constant.

eff.pref <- ggpredict(mod2, 
                      #Let the political distance vary across its empirical range
                      terms = "free_economy_diff[all]")


eff.pref %>% 
  plot +
  #Add a rugplot (as previously)
  geom_point(
    #Specify that I use a different data set here
    data = df,
    #The coordinates
    aes(
      y = 0,
      x = free_economy_diff
    ),
    shape = "|"
  ) +
  ylab("Probability of replacing the judge") +
  xlab("Political distance between appointing governments") +
  ggtitle("Effect of ideology on judicial appointments")

Model assessment

An important step of any analytic work is, of course, to check whether the model choice is appropriate. That is, we check whether the model provides an appropriate description of our phenomenon. All model statistics are – in effect – variations over a comparison between the predicted and the observed values.

The comparisons I make here involve a varying degree of reversal of the recoding of the dependent variable. We can either transform the predicted values into probabilities, or go all the way to discrete outcomes (binary). Each choice has its advantages.

Reversal to probabilities

The model predictions can be reported as a probability of observing a positive outcome for each observation in our data. In our case, the model predicts the probability that a judge exits the Court. In other words, a model that does a good job in describing the outcome would assign high probabilities to judges that exit the court, and low probabilities to judges that stay.

I use the predict(., type = "response") function in base R to specify that I want the predicted values given as probabilities. This time, I use the predict function to do predictions in-sample. That is, the newdata = is my data frame.

df <- df %>%
  mutate(
    #Prediction from model 1
    preds1 = predict(mod1, 
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response"),
    #Predictions from model 2
    preds2 = predict(mod2, 
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response"),
  )

We can admire the distribution of the predicted values in a histogram. What do you think?

df %>%
  ggplot +
  geom_histogram(aes(preds1))

The predictions from the first model vary with a single predictor (political distance), so their distribution simply mirrors the distribution of that variable.

We can do the same with the more complex model.

df %>%
  ggplot +
  geom_histogram(aes(preds2))

The histogram of predicted probabilities gives a sense of the distribution of the predictions, but how can we compare them with the observed outcomes? The challenge in the assessment is that we compare a continuous prediction with discrete observed values.

A single-number summary

Brier score

Ward and Ahlquist (2018) suggest using the Brier score as a single-number summary of how well your model predicts in sample.

The Brier score summarizes how close the predictions are to the observed outcomes (something like a counterpart to the \(R^2\) for linear models). It calculates the difference between the predicted and observed outcomes, squares these differences and takes their mean. In an unbiased model, the residuals average to zero; hence the squaring.

\[B_b \equiv \frac{1}{n}\Sigma^{n}_{i = 1}(\hat{\theta_i} - y_i)^2\]

df %>%
  mutate(diff = preds2 - exit,
         diff2 = diff^2) %>%
  reframe(Brier = mean(diff2, na.rm = T))

The Brier score ranges from 0 to 1, where lower scores indicate a smaller distance between the predicted and observed outcomes. Here, my model seems to perform pretty well.

A bonus: the Hosmer-Lemeshow test

The Hosmer-Lemeshow test orders the data according to the predicted probability of a positive outcome, then slices the data into groups, often 10. It then counts the number of successes per group and performs a \(\chi^2\) test on the resulting table.
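Before reaching for a package, you can sketch the underlying table yourself with dplyr; a rough version, assuming the in-sample predictions preds2 from above (the packages below do the grouping and the test more carefully).

#Slice the in-sample predictions into ten groups of increasing predicted probability
df %>%
  filter(!is.na(preds2)) %>%
  mutate(group = ntile(preds2, 10)) %>%
  group_by(group) %>%
  reframe(size = n(),
          observed = sum(exit),     #observed number of exits per group
          expected = sum(preds2))   #expected number of exits (sum of the probabilities)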

There are several packages that perform the test for you:

glmtoolbox: We can use the glmtoolbox package to fetch the function for the test. It suffices to feed the model object into the hltest() function.

#install.packages("glmtoolbox")
library(glmtoolbox)
hltest(mod2, df = 8)
## 
##    The Hosmer-Lemeshow goodness-of-fit test
## 
##  Group Size Observed   Expected
##      1   24        0  0.7099593
##      2   24        1  1.6778688
##      3   24        4  2.5980216
##      4   24        4  3.6911927
##      5   24        7  5.0078211
##      6   24        6  6.0556472
##      7   24        8  7.9311446
##      8   24        9 10.4143126
##      9   24       12 14.2744994
##     10   19       16 14.6395329
## 
##          Statistic =  4.69259 
## degrees of freedom =  8 
##            p-value =  0.78987

Here, we see that in the first decile of the predicted probabilities, there are 0 judges that exited the Court, while we – when we sum over the probabilities – expect about 0.7 exits (i.e. less than one judge).

Note that the test checks whether the observed and the predicted outcomes are significantly different from each other. We don’t want them to be different: we want the predicted groups to resemble the observed groups. Here, the p-value is 0.79, which is rather good news for us.

ResourceSelection: Another option is the ResourceSelection package, using the hoslem.test() function. It asks for the observed outcomes and the predicted probabilities, respectively. If you have NAs in your predictions, the function breaks down, and you have to subset the variables to get a comparison.

library(ResourceSelection)
hoslem.test(df$exit[!is.na(df$preds2)], 
            df$preds2[!is.na(df$preds2)])
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  df$exit[!is.na(df$preds2)], df$preds2[!is.na(df$preds2)]
## X-squared = 6.4216, df = 8, p-value = 0.6001

The Hosmer-Lemeshow test is sensitive to how many groups we create, so it is not a single fix to tell us whether our predictions are any good. The different test functions also yield slightly different results, so you will have to use some discernment when assessing your model fit. However, you can play around and see whether your model predictions improve as you add covariates, e.g. by comparing model 1 and model 2 (as shown below).
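For example, with the same hltest() function as above:

#Does the more complex model pass the test more comfortably than the base-line model?
hltest(mod1)
hltest(mod2)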

A visual comparison: the separation plot

We can state the same intuition differently: we should see few judges that leave the Court at the lower end of the predicted probability scale, and a higher density of judges that are replaced at the high end of the scale.

We can even illustrate this in a plot: The “separation plot”.

Here, I select the two variables (observed and predicted outcomes). I then randomly re-order the data by drawing the rows anew using the sample() function. This is because I will arrange the data according to the predicted probability of leaving office, but observations with identical probabilities may vary in whether I observe a 0 or a 1 in the outcome. It means that the plot will look slightly different every time, especially if I have many categorical predictors but few numeric ones. Last, I order the data according to the judges’ predicted probability of exiting and create an index variable from 1 to N = 268 (the number of rows in the data). The index will constitute my coordinates on the x-axis.

tab <- 
  df %>% 
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(exit, preds2) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds2) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))

The separation plot then plots a bar (segment) for each observation in the data. The bars are color coded according to the observed outcome, but sorted by their predicted probability on the x-axis.

#Plot
tab %>%
ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate and color the segments by predicted outcome
                   color = as.factor(exit)),
               #Make colors transparent in case we have multiple observations with the same probability
               alpha = 0.6) +
  #A line that illustrate the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds2,
                x = index))

The black line illustrates how the predicted probability of observing a “success” (1) increases when we order the data from low to high. If there is a clear separation in the predicted probabilities between the two groups, this line will have a sharp increase at some point. It will also start close to 0 and finish high up towards 1 on the y-axis. We would furthermore see many segments in the exit == 1 color to the right of the plot, where the predicted probability of observing 1s is high, and mostly exit == 0 segments to the left, where the predicted probability is low.

The original R package for this plot still works, but its interface comes across as old. If you run this code, you’ll have a little R window popping up with the plot; it doesn’t appear in your “Viewer” window.

library(separationplot)

test <- df %>% filter(!is.na(preds1))

separationplot(pred = test$preds1,
               actual = test$exit)

If you are an avid R coder, you can write up your own function to make the plot (See the end of this script).

Your turn:

  • Copy-paste my code chunk for the separation plot into chatGPT and ask it to comment. What does it tell you?
  • Run the code on the predictions for model 1. How does the separation plot change? Why?

Choosing a cut point

An alternative approach is to back-transform the probability variable into 0s and 1s. But what cut point (\(\tau\)) should I use? Ward and Ahlquist (2018) warn against simply slicing the variable in two by cutting at 0.5 (in the middle). You can explore the intuition for why that would be such a bad idea in the problem set.

A single cut point (\(\tau\))

As a rule of thumb, I often use the prediction from the base-line (intercept-only) model, i.e. the outcome mean. It is the value at which, without any predictors in the model, I would predict as much wrong in one direction as in the other.

#Mean in data == intercept only
mean(df$exit)
## [1] 0.2798507
df <-
  df %>%
  mutate(exit_pred = as.numeric(preds1 > 0.28))

I can plot the predicted and the observed values against each other in a cross table.

#Base R
df[c("exit_pred", "exit")] %>%
  table
##          exit
## exit_pred   0   1
##         0 139  35
##         1  44  33
df %>%
  group_by(exit, exit_pred) %>%
  reframe(N = n()) %>%
  mutate(correct_prediction = (exit_pred == exit))

To make the table more comparable, I can calculate the proportion of correct predictions by the values of the dependent variable. That is, I calculate the correct/false prediction rate for successes (exit == 1) and for failures (exit == 0).

#Base R
tab <- 
  df[c("exit_pred", "exit")] %>%
  #cross table
  table %>% 
  #proportions by column (sums to 1 per observed outcome)
  prop.table(., margin = 2)

tab
##          exit
## exit_pred         0         1
##         0 0.7595628 0.5147059
##         1 0.2404372 0.4852941
#
tab <- 
  df %>%
  #Group by outcome
  group_by(exit) %>%
  #Mean/proportion of 0 predictions and 1 predictions
  reframe(Predicted0 = mean(exit_pred == 0, na.rm = T),
          Predicted1 = mean(exit_pred == 1, na.rm = T))

tab

The proportion of correct predictions is reported along the main diagonal. The wrong predictions are reported in the off-diagonal.

Here, I can clearly see that I have a 76% correct prediction rate for negative outcomes/failures (exit == 0). In other words, my model performs OK in predicting the judges that do not exit the court. However, my model performs worse when predicting judges that exit after their mandate ended (49%).

It is a good idea to report the correct positives and negatives in your results table. If so, I usually go for a cut point decided by the proportion of successes in the data. In other words, does my model with predictors predict better than a simple guess from a model without predictors?

#For esthetics, round off proportions to two digits
tab <- round(tab, digits = 2)

stargazer(mod2,
          #Add two lines
          add.lines = list(
            c("Correct positives (prop.)", tab$Predicted1[2]),
            c("Correct negatives (prop.)", tab$Predicted0[1])
          ),
          type = "html")
                              Dependent variable: exit
free_economy_diff             1.490***
                              (0.443)
performance                   -0.556*
                              (0.307)
courtGC                       0.809**
                              (0.376)
age                           0.132***
                              (0.027)
tenure                        0.045
                              (0.036)
attendance                    -0.015*
                              (0.009)
Constant                      -9.746***
                              (1.759)
Correct positives (prop.)     0.49
Correct negatives (prop.)     0.76
Observations                  235
Log Likelihood                -110.922
Akaike Inf. Crit.             235.844
Note: *p<0.1; **p<0.05; ***p<0.01

This illustrates an important point. When we reverse predictions to discrete variables, we will have to decide on our priorities. Would we want to predict true positives (i.e. correct predictions of 1s) really accurately, but be less stringent about the true negatives (i.e. less accurate predictions of 0s)? Or the opposite? Or would we like to balance the two? In most political science applications, we’d opt for the balanced outcome.

Letting the cutpoint vary: The ROC curve

The ROC curve is a graphical representation that shows the performance of a binary model at all possible cutpoints/thresholds (from probability 0 to 1). The model will predict either a positive or negative outcome for a given observation based on that probability threshold.

The ROC curve plots the correct positive rate (sensitivity) against the incorrect positive rate (1 - specificity) for different threshold values. As the threshold value/cut point increases, the model becomes more conservative and predicts fewer positive outcomes, which in turn decreases the false positive rate but also decreases the true positive rate.

A good classifier will have a high true positive rate and a low false positive rate, which corresponds to a curve that is closer to the top-left corner of the ROC plot. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, where a value of 1 indicates perfect classification and a value of 0.5 indicates random guessing.
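To connect sensitivity and specificity to the cross tables from before, here is a rough sketch that computes both rates by hand at a single threshold (0.28, the outcome mean used earlier); roc() simply repeats this calculation for every possible threshold.

#A single example threshold: the outcome mean used as a cut point earlier
tau <- 0.28

df %>%
  filter(!is.na(preds2)) %>%
  reframe(
    #True positive rate: predicted exits among observed exits
    sensitivity = sum(preds2 > tau & exit == 1) / sum(exit == 1),
    #True negative rate: predicted stays among observed stays
    specificity = sum(preds2 <= tau & exit == 0) / sum(exit == 0))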

#load necessary packages; install it if need be
library(pROC)

#define the object to plot
roc1 <- roc(
  #Observed values
  response = df$exit, 
  #Predicted values
  predictor = df$preds1)

#Create the ROC plot
ggroc(roc1) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 1") +
  theme_classic()

We begin with the roc() function, which does the calculations for us (sliding the cut point and comparing outcomes). The ggroc() function then plots the object. It’s a ggplot object, so I can add all the usual esthetics to it.

Let’s see how my more complex model is doing.

#define the object to plot
roc2 <- roc(
  #Observed values
  response = df$exit, 
  #Predicted values
  predictor = df$preds2)

#Create the ROC plot
ggroc(roc2) +
  ggtitle("Receiver Operating Curve (ROC)",
          subtitle = "Model 2") +
  theme_classic()

We can now see how the model improves. The package even has a function to explicitly compare (and choose between) models.


roc.test(roc1, roc2)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc1 and roc2
## Z = -2.9021, p-value = 0.003707
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  -0.19133552 -0.03707599
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.6738184   0.7880242

The test reports the area under the curve (AUC) for each model and even tests whether the difference in AUCs between the models is statistically significant.

The roc2 object is a list. It means that it contains a lot of information that we can extract using indexing. I can ask for all the list elements.

names(roc2)
##  [1] "percent"            "sensitivities"      "specificities"     
##  [4] "thresholds"         "direction"          "cases"             
##  [7] "controls"           "fun.sesp"           "auc"               
## [10] "call"               "original.predictor" "original.response" 
## [13] "predictor"          "response"           "levels"

… and extract the area under the curve:

roc2$auc
## Area under the curve: 0.788

What is my ideal cut point/threshold value? The ROC curve is mostly used for assessing the model’s performance. However, if we have a set of cut points and the true positive and true negative prediction rates for each cut point, we can find the cut point that best balances the two types of errors. That is, we are interested in where the two lines intersect.

ggplot() +
  geom_line(aes(y = roc2$sensitivities,
                x = roc2$thresholds))  +
  geom_line(aes(y = roc2$specificities,
                x = roc2$thresholds),
            lty = 2)

id.threshold <- (roc2$sensitivities - roc2$specificities) %>% 
  abs %>%
  which.min

#Best balance
roc2$thresholds[id.threshold]
## [1] 0.2584505

Using this as a threshold, my cross table would look like this:

df %>% 
  group_by(exit) %>% 
  reframe(Predicted0 = mean(preds2 <= roc2$thresholds[id.threshold], na.rm = T),
          Predicted1 = mean(preds2 > roc2$thresholds[id.threshold], na.rm = T))

As is clear, at this threshold I am (almost) equally right/wrong in my predictions for each group of the outcome (exit). You can see this from the diagonal.

For a recap, you may for example read this intro: https://drfloreiche.github.io/logit.html

For the R enthusiast: R functions

Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is when I have a code block where I only change one or two things between each use.

The function that creates functions (sic!) is called function().

  1. In the parenthesis, we put in the arguments (i.e. the data/stuff that varies between each time we run the code).

  2. After that, we open a set of curly brackets. This is where we put our R codes.

  3. The last line in the curly brackets states what output the function should return when you run it: return(). R will only return one object per function.

  4. Store your function in an R object. my_function <- function(){}

A toy example.

I want a function that tells me if the number I put into it is smaller or larger than 10. My argument x = is the input; the output is given in return(). For ease, I create a default value for x: I choose x = 1. However, once the function is compiled (run the R code), the user may redefine x to whatever value they want.

bigger_than10 <- function(x = 1){ 
  return(x > 10)
}

#Try it out
bigger_than10(x = 20)
## [1] TRUE

If you like your function and want to use it later, you can store it as an R object in your working directory.

save(bigger_than10,
     file  = "bigger_than10.rda")

Later, you fetch it as usual.

load("bigger_than10.rda")

Your turn!

Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that actually vary across models: the predicted and the observed variables. By standardizing their names, you can make the R code more general. You can use the following code as a basis:

tab <- df %>% 
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(exit, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))

#Plot
ggplot() +
  #Vertical lines == one per observation in the data
  geom_segment(data = tab,
               aes(x = index,
                   xend = index,
                   y = 0,
                   yend = 1,
                   #Separate the segments by predicted outcome
                   color = as.factor(exit)),
               #Make colors transparent in case we have multiple observations with the same probability
               alpha = 0.6) +
  scale_color_manual(values = c("grey", "purple")) +
  #A line that illustrate the increase in predicted probability
  geom_line(data = tab,
            aes(y = preds1,
                x = index)) +
  #Esthetics
  ylab("Y-hat") +
  xlab("index") +
  ggtitle("Separation plot") +
  theme_minimal() +
  theme(legend.position = "bottom")

  1. create two R objects/vectors – y and y_pred – on the basis of df$exit and df$preds1.

  2. identify all the times the observed y (exit) is used in the code. Replace the name by y. Do the same with y_pred and preds1.

  3. create a data table df0 with data.frame(y, y_pred). This is just to make sure your two variables are the same length. If they’re not, R will throw an error. Replace df by df0 in the code.

  4. Remove the variable selection element (line 2); you don’t need it. If the code runs, you’re good to go.

  5. pack the code chunk into the curly brackets of the following code separation_plot <- function(y = NULL, y_pred = NULL){}.

  6. Ok! Does it run? Try out with preds1 from model 1 as well. Does it work?

Want another challenge?

  • What aesthetics would you have in the function that the user can later alter?
  • The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?

Tips: Your challenge will probably be to make sure that the two vectors are of equal length. If you have missing data in the model, this will not be the case when you work from the original data frame. I solved it by indexing on the row names, but there are many ways around the problem. One possible sketch follows below.
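If you want to peek at one possible solution, here is a sketch of the model-object version; it relies on the mod$y and mod$fitted.values vectors mentioned above, so the two vectors have equal length by construction.

separation_plot <- function(mod = NULL){
  #The glm object stores the observed outcomes and the fitted probabilities
  #for the estimation sample, so the lengths match by construction
  df0 <- data.frame(y = mod$y,
                    y_pred = mod$fitted.values)

  df0 <- df0 %>%
    #Randomize the order, then sort by predicted probability
    arrange(sample(nrow(df0))) %>%
    arrange(y_pred) %>%
    #Index for the x-axis
    mutate(index = 1:nrow(df0))

  ggplot() +
    #One vertical segment per observation, colored by the observed outcome
    geom_segment(data = df0,
                 aes(x = index, xend = index, y = 0, yend = 1,
                     color = as.factor(y)),
                 alpha = 0.6) +
    scale_color_manual(values = c("grey", "purple")) +
    #A line tracing the predicted probabilities
    geom_line(data = df0,
              aes(y = y_pred, x = index)) +
    ylab("Y-hat") +
    xlab("index") +
    ggtitle("Separation plot") +
    theme_minimal() +
    theme(legend.position = "bottom")
}

#Try it out
separation_plot(mod1)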

Literature

Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.