Logistic regression models
In this session, we will be predicting outcomes from logistic regression models. We do this both for interpretation and for model evaluation.
Introduction
Main methodological takeaways
GLMs are non-linear by construction, so interpretation – but also model evaluation – requires a bit more care. Usually, it will involve back-transforming your dependent variable. I distinguish four stages of transformation:
no transformation (logodds). These are the regression coefficients from the model output and your results table. They are useful for hypothesis testing.
partial reversal (odds). Marginal effects are a one-number summary of the effect of a particular variable. I obtain them by exponentiating the slope coefficient.
reversal (probability). Predicted probabilities are useful for interpretation (first differences, effect graphics) and model assessment (Brier score, separation plot).
total reversal to original scale (binary). Useful mostly for model assessment (\(\chi^2\) test, ROC curve).
All model assessments are essentially i) a comparison between the predicted and the observed outcomes and – sometimes – ii) a comparison between models. This is what Ward and Ahlquist (2018) refer to as “model selection”: we are choosing the best model out of several.
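To preview what these stages look like in practice, here is a minimal sketch in R. It assumes a fitted binomial glm object that I call mod here (a stand-in name; we fit the actual models further down).
#Assuming a fitted binomial glm stored as mod (a stand-in name)
coef(mod)                                 #1) no transformation: logodds
exp(coef(mod))                            #2) partial reversal: odds ratios
predict(mod, type = "response")           #3) reversal: predicted probabilities
as.numeric(predict(mod, type = "response") > 0.5) #4) total reversal: binary outcomes (0.5 is just one possible cut point)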
Substantive topic (the research)
We will be working on some more of the insights from Hermansen and Pegan's (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.
This time we will play around with another implication of the candidate-centered electoral systems: Intra-party competition. We formulate a hypothesis that MEPs that will later compete against party colleagues are less prone to share resources; in particular the local assistants that allow them to cultivate a personal vote. We furthermore make the assumption that – all else equal – the more strapped an MEP (and their party) is for money, the stronger incentives they have to pool resources.
The data
The data we will work with are a subset of the replication data for the article. They list all MEPs present in parliament in the spring semester of 2016.
I start by fetching the R packages we will use from the library.
library(tidyr); library(dplyr); library(ggplot2)
library(stargazer)
Analysis
As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.
We begin by loading in the data.
If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory. Note that this time, we work with data on MEPs from 2016.
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2016.rda",
destfile = "MEP2016.rda")
Now, I can load it into R and rename the object to my liking (df).
#Load in
load("MEP2016.rda")
#Rename
df <- MEP2016
Univariate distribution
To approximate MEPs’ propensity to share assistants with party colleagues, I have created a binary variable PoolsLocal that indicates whether the MEP has at least one local assistant on his/her payroll that they share with colleagues. My main predictor is also binary: Is the MEP elected from a candidate-centred system (OpenList)?
I begin by exploring the univariate distribution in numbers, a table and a barplot.
mean(df$PoolsLocal)
## [1] 0.3536068
We see that the proportion of MEPs that share local assistants is 0.35. That is, the predicted probability of sharing assistants in a base-line intercept-only model is 0.3536068.
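If you want to check this equivalence yourself, here is a small sketch (mod0 is just a name I pick here): fit an intercept-only model and back-transform the constant from logodds to a probability.
#Intercept-only (null) model
mod0 <- glm(PoolsLocal ~ 1, family = "binomial", df)
#Back-transform the constant from logodds to a probability
plogis(coef(mod0)) #same number as mean(df$PoolsLocal)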
For simple statistics, I prefer base plotting to ggplot2. However, you can mix the coding with dplyr pipes if you want to.
#In base
table(df$PoolsLocal)
##
## 0 1
## 457 250
barplot(table(df$PoolsLocal))
#Tidyverse
df$PoolsLocal %>%
  table %>%
  barplot
Since we will work with probabilities, it is useful to plot the relative distribution of the dependent variable.
df$PoolsLocal %>%
  table %>%
  prop.table %>%
  barplot(.,
          main = "Pooling resources among MEPs",
          xlab = "Pooling (shares at least one assistant)",
          ylab = "Proportion")
I then explore each of the predictors that I will use and make a table of descriptive statistics. I will use them actively when I make my interpretations.
df0 <- df %>% dplyr::select(PoolsLocal,
                            OpenList,
                            LaborCost,
                            SeatsNatPal.prop,
                            position)
library(stargazer)
stargazer(df0,
#Title of the table
title = "Descriptive statistics",
#Type of output
type = "html",
#Summary stats of data frame
summary = T,
#Additional arguments for statistics
iqr = T, median = T, min.max = T)
Statistic | N | Mean | St. Dev. | Min | Pctl(25) | Median | Pctl(75) | Max |
PoolsLocal | 707 | 0.354 | 0.478 | 0 | 0 | 0 | 1 | 1 |
OpenList | 707 | 0.463 | 0.499 | 0 | 0 | 0 | 1 | 1 |
LaborCost | 707 | 23.953 | 11.222 | 4.400 | 10.900 | 26.700 | 33.000 | 42.000 |
SeatsNatPal.prop | 686 | 0.229 | 0.173 | 0.000 | 0.071 | 0.241 | 0.357 | 0.668 |
position | 656 | 4.943 | 1.890 | 1.059 | 3.429 | 5.786 | 6.400 | 7.000 |
Bivariate statistics
My main predictor this time is a binary variable that flags whether the MEP was elected – and therefore might run for reelection – in a candidate-centered system (“OpenList”).
I explore the variation between the two in a cross table. I first calculate the counts of the different value combinations (appropriate for discrete/categorical variables). I can use different R dialects and obtain the same results.
#Base
table(df[, c("OpenList", "PoolsLocal")])
## PoolsLocal
## OpenList 0 1
## 0 208 172
## 1 249 78
#Tidyverse
df %>%
  #Groups/combinations of values
  group_by(PoolsLocal, OpenList) %>%
  #Count occurrences
  count() %>%
  #Spread PoolsLocal by columns: from tidyr package
  pivot_wider(names_from = PoolsLocal,
              values_from = n)
I then transform the counts to proportions. This is already a first step in the analysis. By calculating proportions row-wise (margin = 1), I make the distributions comparable.
#Base
df[, c("OpenList", "PoolsLocal")] %>%
  table %>%
  prop.table(., margin = 1)
## PoolsLocal
## OpenList 0 1
## 0 0.5473684 0.4526316
## 1 0.7614679 0.2385321
Now, we can clearly see that there is a larger proportion of MEPs that pool resources in party-centered systems (PoolsLocal == 1; OpenList == 0) compared to MEPs from candidate-centered systems (PoolsLocal == 1; OpenList == 1).
If I want to, I can test whether the relationship between the two variables is statistically significant even here. That is, I can perform a chi-square test (\(\chi^2\) test). It tests whether the combination of values (counts) in my table is different from what the univariate distributions would indicate. The test relies on another probability distribution you have encountered: the \(\chi^2\) distribution. As you may recall from methods 1, it describes the sum of squares of independent standard normal random variables. Here, we have a small cross table with two variables (four cells): pooling resources and electoral system.
c("OpenList", "PoolsLocal")] %>%
df[, %>%
table chisq.test()
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 34.317, df = 1, p-value = 4.683e-09
We see that the p-value is really low (you would have to move the decimal point nine places to the left to write out all the decimals). That is, it is highly unlikely that the combinations we see are produced by chance. We can discard the null hypothesis.
Usually, I wouldn’t bother doing the chisquare test here, because that’s the purpose of the regression. However, the chisquare test sets us up nicely for the model assessment where we are interested in whether the predicted and observed outcomes are correlated.
Run a binomial logistic regression
The function for generalized linear models in R is glm(). This time, we also have to specify the probability distribution that will allow us to map the recoded dependent variable (the logodds) to probabilities.
I run a base-line model with my one predictor, then a second more complex model where I also control for the resources the MEP has at his/her disposal.
mod1 <- glm(PoolsLocal ~ OpenList,
            family = "binomial",
            df)

mod2 <- glm(PoolsLocal ~ OpenList +
              LaborCost +
              SeatsNatPal.prop,
            family = "binomial",
            df)
Present your results: table of results
Although Ward and Ahlquist (2018) don't like regression tables (i.e. the BUTONs), it is useful to present them too.
stargazer::stargazer(mod1, mod2,
                     dep.var.caption = "Dependent variable",
                     title = "MEPs' propensity to share local assistants (a binomial logit)",
                     type = "html")
Dependent variable | ||
PoolsLocal | ||
(1) | (2) | |
OpenList | -0.971*** | -1.124*** |
(0.166) | (0.181) | |
LaborCost | 0.056*** | |
(0.009) | ||
SeatsNatPal.prop | -1.930*** | |
(0.527) | ||
Constant | -0.190* | -1.094*** |
(0.103) | (0.286) | |
Observations | 707 | 686 |
Log Likelihood | -441.336 | -392.832 |
Akaike Inf. Crit. | 886.672 | 793.665 |
Note: | *p<0.1; **p<0.05; ***p<0.01 |
Now that I have settled on my model, I will assess its performance, then use the interpretations to tell my story. There is no single, perfect model. Also, if there were any such thing, we wouldn't know unless we saw some lower-performing alternatives. Therefore, the model assessment often involves a comparison between alternative models.
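One quick numeric comparison is the Akaike information criterion already reported at the bottom of the table; lower values indicate a better trade-off between fit and complexity. A minimal sketch using the two models we just fitted (note that they are estimated on slightly different samples because of missing data, so the comparison is only indicative):
#Compare information criteria across the two models; lower is better
AIC(mod1, mod2)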
Model evaluation
An important step of any analytic work is, of course, to check if the model choice is correct. That is, before we begin the interpretation of the results, we check if the model provides an appropriate description of our phenomenon. All model statistics are – in effect – variations over a comparison between the predicted and the observed values.
The comparisons I make here involve a varying degree of reversal of the recoding of the dependent variable. We can either transform the predicted values into probabilities, or go all the way to discrete outcomes (binary). Each choice has its advantages.
While the interpretation of the results and the assessment of our hypothesis require us to engage with the uncertainty of the predictions, this is not an important point in the model evaluation. That is, we often rely on point estimates (Can you ask chatGPT what that is?). It also means that I will simply use the base-line R function for prediction without any simulations. I specify that I want the predicted values given as probabilities.
df$preds1 <- predict(mod1,
                     #In-sample prediction
                     newdata = df,
                     #I want probabilities
                     type = "response")
We can admire the distribution of the predicted values in a histogram. What do you think?
hist(df$preds1)
We can do the same with the more complex model.
df$preds2 <- predict(mod2,
                     newdata = df,
                     type = "response")
hist(df$preds2)
Is that better? What just happened?
Reversal to probabilities
The model assigns a probability of observing a positive outcome for each observation in our data. In our case, the model predicts the probability that an MEP shares staff with a colleague. In other words, a model that does a good job in describing the outcome would assign high probabilities to MEPs that share an assistant, and low probabilities to MEPs that don’t share.
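A one-number summary of this intuition – mentioned among the takeaways above – is the Brier score: the average squared distance between the predicted probability and the observed outcome, where 0 would be a perfect prediction. A minimal sketch for model 1:
#Brier score for model 1: mean squared prediction error
mean((df$preds1 - df$PoolsLocal)^2)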
A single-number summary: the Hosmer-Lemeshow test
The Hosmer - Lemeshow test orders the data according to the predicted probability, then slices the data up into groups; often 10 groups. It then counts the number of successes per group and performs a chisquare test of the table.
df %>%
  #Filter out potential NAs
  filter(!is.na(preds1)) %>%
  #Select my variables
  dplyr::select(PoolsLocal, preds1) %>%
  #Create a variable that identifies observations by their decile
  mutate("decile" = ntile(preds1, 10)) %>%
  #Group by decile
  group_by(decile) %>%
  #Count the number of successes
  summarise(sum(PoolsLocal)) %>%
  #Do the chisquare test; I ask for a simulated p-value
  chisq.test(., simulate.p.value = T)
##
## Pearson's Chi-squared test with simulated p-value (based on 2000
## replicates)
##
## data: .
## X-squared = 5.6222, df = NA, p-value = 0.7961
Your turn:
How should we interpret the results of the Hosmer-Lemeshow test?
The Hosmer-Lemeshow test is sensitive to how many groups we create, so it is not a single fix to tell us whether our predictions are any good.
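To see that sensitivity for yourself, you can rerun the same code with a different number of groups – say, five instead of ten – and compare the p-values. Only the ntile() argument changes:
df %>%
  filter(!is.na(preds1)) %>%
  dplyr::select(PoolsLocal, preds1) %>%
  #Five groups instead of ten
  mutate("quintile" = ntile(preds1, 5)) %>%
  group_by(quintile) %>%
  summarise(sum(PoolsLocal)) %>%
  chisq.test(., simulate.p.value = T)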
A visual comparison: the separation plot
We can state the same intuition differently: We should see few MEPs that pool resources on the lower end of the probability scale, and a higher density of MEPs that share assistants on the high end of the probability scale.
We can even illustrate this in a plot: The “separation plot”.
df$preds1 <- predict(mod1, newdata = df,
                     type = "response")

tab <- df %>%
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))
#Plot
ggplot() +
#Vertical lines == one per observation in the data
geom_segment(data = tab,
aes(x = index,
xend = index,
y = 0,
yend = 1,
#Separate the segments by predicted outcome
color = as.factor(PoolsLocal)),
#Make colors transparent in case we have multiple observations with the same probability
alpha = 0.6) +
scale_color_manual(values = c("grey", "purple"),
labels = c("Does not pool", "Pool"),
name = "Observed outcomes: MEPs pooling resources") +
#A line that illustrates the increase in predicted probability
geom_line(data = tab,
aes(y = preds1,
x = index)) +
#Esthetics
ylab("Y-hat") +
xlab("index") +
ggtitle("Separation plot",
subtitle = "Does the model assign high probabilities to positive outcomes?") +
theme_minimal() +
theme(axis.line.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
legend.position = "bottom",
legend.direction = "vertical")
The black line illustrates how the probability increases when we order the data from low to high probability. If there is a clear separation in the predicted probabilities in the two groups, then this line would have a sharp increase at some point. It would also start close to 0 and finish high up towards 1 on the y-axis. We would furthermore see many purple segments (PoolsLocal == 1) to the right of the plot, where the predicted probability of observing 1s is high, and many grey segments to the left, where the predicted probability is low (PoolsLocal == 0).
The original R package for this plot still works, but its interface comes across as old. If you run this code, you'll have a little R window popping up with the plot; it doesn't appear in your “Viewer” window.
library(separationplot)
separationplot(pred = df$preds1,
actual = df$PoolsLocal)
If you are an avid R coder, you can write up your own function to make the plot (See the end of this script).
Your turn:
- Copy-paste my code chunk for the separation plot into chatGPT and ask it to comment. What does it tell you?
- Run the code on the predictions for model 2. How does the separation plot change?
Choosing a cut point
An alternative approach is to back transform the probability variable into 0s and 1s. But what cut point should I use? Ward and Ahlquist (2018) warn against simply slicing the variable in two by cutting at 0.5 (in the middle). You can explore the intuition for why that would be such a bad idea in the problem set.
As a rule of thumb, I often use the base-line probability from the data (here, the mean of the dependent variable, 0.35). It is the value at which I'd predict roughly as much wrong in one direction as in the other without any other predictors in the model.
df$PoolsLocal_pred <-
  (df$preds1 > 0.35) %>%
  as.numeric()
I can plot the predicted and the observed values against each other in a cross table.
c("PoolsLocal_pred", "PoolsLocal")] %>%
df[ table
## PoolsLocal
## PoolsLocal_pred 0 1
## 0 249 78
## 1 208 172
To make the table more comparable, I can calculate the proportion of correct predictions by the values of the dependent variable. That is, I can calculate the correct/false prediction rate for successes (PoolsLocal == 1) and the correct/false prediction rate for failures (PoolsLocal == 0).
tab <-
  df[, c("PoolsLocal_pred", "PoolsLocal")] %>%
  #cross table
  table %>%
  #proportions by column of observed outcomes (each column sums to 1)
  prop.table(., margin = 2)

tab
## PoolsLocal
## PoolsLocal_pred 0 1
## 0 0.5448578 0.3120000
## 1 0.4551422 0.6880000
The proportion of correct predictions are reported along the main diagonal. The wrong predictions are reported in the off diagonal.
Here, I can clearly see that I have a 54% correct prediction rate for negative outcomes/failures (PoolsLocal == 0). In other words, my model performs only so-so in predicting the MEPs that do not pool resources. It performs fairly well, however, when predicting MEPs that do pool resources (69%).
This illustrates an important point. When we reverse predictions to discrete variables, we will have to decide on our priorities. Would we want to predict true positives (i.e. correct predictions of 1s) really accurately, but be less stringent about the true negatives (i.e. less accurate predictions of 0s)? Or the opposite? Or would we like to balance the two? In most political science applications, we’d opt for the balanced outcome.
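These quantities have names: the share of correct predictions among observed 1s is the sensitivity (true positive rate), the share of correct predictions among observed 0s is the specificity (true negative rate), and the overall share of correct predictions is the accuracy. A small sketch using the predictions we just created:
#Overall accuracy: share of correct predictions
mean(df$PoolsLocal_pred == df$PoolsLocal)
#Sensitivity: correct predictions among observed 1s
mean(df$PoolsLocal_pred[df$PoolsLocal == 1] == 1)
#Specificity: correct predictions among observed 0s
mean(df$PoolsLocal_pred[df$PoolsLocal == 0] == 0)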
Would you like to illustrate the correlation visually? Make a tile plot with strong colors for high proportions. Note that, compared to the printed table, the correct predictions now run from the bottom left upwards to the top right of the plot.
tab %>%
  data.frame %>%
  ggplot() +
  geom_tile(aes(x = PoolsLocal_pred,
                y = PoolsLocal,
                fill = Freq)) +
  xlab("Predicted outcomes") +
  ylab("Observed outcomes") +
  scale_fill_gradient(high = "blue",
                      low = "white",
                      name = "Proportion")
Letting the cutpoint vary: The ROC curve
The ROC curve is a graphical representation that shows the performance of a binary model at all possible cutpoints/thresholds (from probability 0 to 1). The model will predict either a positive or negative outcome for a given observation based on that probability threshold.
The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different threshold values. As the threshold value/cut point increases, the model becomes more conservative and predicts fewer positive outcomes, which in turn decreases the false positive rate but also decreases the true positive rate.
A good classifier will have a high true positive rate and a low false positive rate, which corresponds to a curve that is closer to the top-left corner of the ROC plot. The area under the ROC curve (AUC) is a measure of the overall performance of the classifier, where a value of 1 indicates perfect classification and a value of 0.5 indicates random guessing.
#Load the necessary package; install it if need be
library(pROC)

#Define the object to plot
roc1 <- roc(
  #Observed values
  response = df$PoolsLocal,
  #Predicted values
  predictor = df$preds1)
#Create the ROC plot
ggroc(roc1) +
ggtitle("Receiver Operating Curve (ROC)",
subtitle = "Model 1") +
theme_classic()
We begin with the roc() function that does the calculations for us (sliding the cutpoint, comparing outcomes). The ggroc() function then plots the object. It's a ggplot object, so I can add all the usual esthetics to it.
Let’s see how my more complex model is doing.
#Define the object to plot
roc2 <- roc(
  #Observed values
  response = df$PoolsLocal,
  #Predicted values
  predictor = df$preds2)
#Create the ROC plot
ggroc(roc2) +
ggtitle("Receiver Operating Curve (ROC)",
subtitle = "Model 2") +
theme_classic()
We can now see how the model improves. The package even has a function to explicitly compare (and choose between) models.
roc2 <- roc(
  #Observed values
  response = df$PoolsLocal,
  #Predicted values
  predictor = df$preds2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc.test(roc1, roc2)
##
## DeLong's test for two correlated ROC curves
##
## data: roc1 and roc2
## Z = -6.3678, p-value = 1.918e-10
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.14291608 -0.07564453
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.6236973 0.7329776
The test reports the area under the curve (AUC) for each model and even tests whether the difference in AUCs between the models is statistically significant.
The roc2 object is a list. It means that it contains a lot of information that we can extract using indexing. I can ask for all the list elements.
names(roc2)
## [1] "percent" "sensitivities" "specificities"
## [4] "thresholds" "direction" "cases"
## [7] "controls" "fun.sesp" "auc"
## [10] "call" "original.predictor" "original.response"
## [13] "predictor" "response" "levels"
… and extract the area under the curve:
roc2$auc
## Area under the curve: 0.733
What is my ideal cut point/threshold value? The ROC curve is mostly used for assessing the model's performance. However, if we have a set of cutpoints and the true positive and the true negative prediction rate for each cutpoint, then we can find the cutpoint that best balances the two types of errors. That is, we are interested in where the two lines (the sensitivity and the specificity) intersect.
ggplot() +
geom_line(aes(y = roc2$sensitivities,
x = roc2$thresholds)) +
geom_line(aes(y = roc2$specificities,
x = roc2$thresholds),
lty = 2)
id.threshold <- (roc2$sensitivities - roc2$specificities) %>%
  abs %>%
  which.min

#Best balance
roc2$thresholds[id.threshold]
## [1] 0.3884075
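The pROC package can also do this search for us. The coords() function with the "best" option returns a threshold that optimizes the balance (by default using Youden's J statistic, a slightly different criterion than the intersection above, so the number may differ a little):
#Let pROC suggest a threshold (Youden's J by default)
coords(roc2, x = "best", best.method = "youden")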
For a recap, you may for example read this intro: https://drfloreiche.github.io/logit.html
Interpretation
Once we are satisfied with the performance of our model, we can start playing around with it. What does it have to tell us?
1. Regression coefficients
The equation that R optimizes for us is linear. The regression coefficient \(\beta\) therefore reports changes in logodds (my recoded \(y\)).
\(y \sim \alpha + \beta \times x\)
Regression coefficients reported in logodds are useful for hypothesis testing. I can easily see whether my effects are positive or negative, and whether they are statistically likely to be different from zero. From the results table, I can also see that intra-party competition (OpenList) reduces MEPs' likelihood of pooling resources. But by how much?
GLMs have an inbuilt interaction effect due to the recoding of the observed variable (the log-transformation) and the models’ mapping of events on the Bernoulli probability distribution (i.e. a binomial probability distribution with one success/failure for each trial). This requires a more careful interpretation on our side. That’s what exercise 1 in the problem set for today aimed to illustrate.
When I interpret, I usually follow up with three additional steps.
2. Marginal effect
In linear regression models, we can interpret the \(\beta\) coefficient (slope parameter/coefficient) as the increase in \(y\) when \(x\) increases by one unit. In logistic regression models, the \(\beta\) coefficient represents the change in the log odds of the dependent variable for a one-unit increase in the independent variable. When I partially back-transform the slope parameter in a logistic regression, I obtain the odds ratio (exercise 2b). To do so, I exponentiate the slope parameter. This represents the ratio of the odds of the dependent variable between two groups that differ by one unit on the independent variable.
exp(mod1$coefficients[2])
## OpenList
## 0.3788176
When we do this, the reference point is 1. All ratios above 1 imply a positive relationship, all below 1 a negative one. The odds ratio is the factor by which the odds of \(y\) are multiplied when \(x\) increases by one unit. Here, the odds change by a factor of 0.379.
Unless your exponentiated coefficient is above 2 (i.e. “double”), the most intuitive way to communicate the result is in % (not percentage points, but relative increase/decrease). This is particularly true for negative relationships.
exp(mod1$coefficients[2])-1
## OpenList
## -0.6211824
The odds that an MEP from a candidate-centered system shares an assistant are 62% lower than for members from party-centered systems.
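If you also want to convey the uncertainty around this number, you can exponentiate the confidence interval of the coefficient. A small sketch; confint() on a glm uses profile likelihood, so it prints a message and may take a moment:
#95% confidence interval for the odds ratio of OpenList
exp(confint(mod1))["OpenList", ]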
3. First differences and scenarios
Because of the in-built interaction effect, it is particularly important to go all the way to the predicted effect by backtransforming from the logodds.
Logistic transformation: \[\frac{1}{1+\exp(-x)}\] or \[\frac{\exp(x)}{1+\exp(x)}\]
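In R, this back-transformation is available as plogis(), the cumulative distribution function of the logistic distribution, so you do not have to type out the formula yourself. A quick check that the two give the same answer:
x <- 0.5
1/(1 + exp(-x)) #by hand
plogis(x)       #the built-in equivalent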
First differences are useful to illustrate a point to the reader. That is, we choose two (or four) typical scenarios and tell the reader what the absolute change in the predicted probability would be. Until now, we have relied on all-else-equal statements in the construction of scenarios. However, when we illustrate with first differences, we do not have to do that. It is more important to draw on typical values and make predictions in that neighborhood. This is where your substantive knowledge (and close study of the descriptive statistics) comes in. The actual predicted effect (first difference) may or may not be impressive, depending on the size of the intercept and the values of the other variables.
If the point is merely to give point estimates (expected values, average predicted effect…), I use the predict() function on a hypothetical dataset.
Here, I illustrate the predicted effect of the 2022 election on MEPs from Venstre.
I use dplyr to select variable values from the actual data on Denmark. You can of course just use data.frame() and specify the values by hand. The seat shares for my scenario I drew from Wikipedia.
share2019 <- 43/179
share2022 <- 23/179

newdata <-
  df %>%
  filter(Nationality == "Denmark") %>%
  dplyr::select(OpenList, LaborCost) %>%
  unique %>%
  cbind(SeatsNatPal.prop = c(43/179, 23/179),
        Year = c(2019, 2022))
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
preds <- predict(mod2,
                 newdata = newdata,
                 type = "response")

#Predicted probabilities
preds
## 1 2
## 0.4174272 0.4706101
#First difference
diff(preds)
## 2
## 0.05318288
We see that the Danish Members of the European Parliament had a 5 percentage point higher predicted probability of sharing an assistant after the catastrophic national election in 2022.
Note my vocabulary: When we deal with absolute predicted changes, we talk about percentage points. % is reserved for relative changes.
Graphics
An image speaks more than a thousand words; especially when we work with GLMs. Put your scenarios on speed and illustrate the effect of a predictor over its entire range instead of a single first difference. When you do this, you will go back to the all-else-equal ideal: set the other variables to a typical but constant value, and let your predictor of choice vary.
When the predictor is binary, the tilted coefplot is a good choice. When it is continuous, you can use the classical effect graphic by drawing a line. In fact, the entire procedure is exactly the same as before, except for one detail: You need to reverse your dependent variable to a probability. You do this after you have simulated the predicted values (logodds), before you start plotting.
I begin by setting up my scenario. When doing so, I rely heavily on my descriptive table. Here, I continue with my riff on Venstre, but illustrate the effect of the electoral system.
#Scenario
newdata <- data.frame(
  Intercept = 1,
  OpenList = c(0, 1),
  #Labor cost in Denmark.
  LaborCost = 42,
  SeatsNatPal.prop = 23/179)

#Simulate
coef.sim <- MASS::mvrnorm(
  #Number of simulations
  n = 10000,
  #Coefficients
  mu = mod2$coefficients,
  #Uncertainty/variance-covariance matrix
  Sigma = vcov(mod2)
)

#Matrix multiplication to get predicted LOGODDS
logodds <- coef.sim %*% t(newdata)

#Reverse recoding to probability
preds <- 1/(1 + exp(-logodds)) #or exp(logodds)/(1 + exp(logodds))

#Summarize the simulation
pred_summary <-
  apply(preds,
        #Apply this to columns (== 2)
        MARGIN = 2,
        #Which function to apply
        FUN = quantile, probs = c(0.025, 0.5, 0.975))

#Add summary to data with scenarios
dfp3 <- cbind(newdata,
              t(pred_summary))
ggplot(dfp3) +
  #Plot points this time (binary predictor)
  geom_point(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = OpenList)
  ) +
  #Plot a little line to insinuate a move from one discrete value to another
  geom_line(aes(
    #Predicted y-values
    y = `50%`,
    #Different scenarios
    x = OpenList),
    lty = 2, color = "grey"
  ) +
  #Plot the uncertainty
  geom_errorbar(aes(
    #Predicted y-values
    ymin = `2.5%`,
    ymax = `97.5%`,
    #Different scenarios
    x = OpenList),
    alpha = 0.3,
    #Error bars "sans serif"
    width = 0) +
  theme_minimal()
For the R enthusiast
Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is when I have a code block where I only change one or two things between each time that I use it.
The function that creates functions (sic!) is called function().
- In the parentheses, we put the arguments (i.e. the data/stuff that varies between each time we run the code).
- After that, we open a set of curly brackets. This is where we put our R code.
- The last line in the curly brackets states the output that R should render when you run your function: return(). R will only give one object per function.
- Store your function in an R object.
my_function <- function(){}
A toy example.
I want a function that tells me if the number I put into it is smaller or larger than 10. My argument x is the input; the output is given in the return(). For ease, I create a default value for x. I chose x = 1. However, once the function is compiled (run the R code), the user may redefine the x object to whatever value (s)he wants.
bigger_than10 <- function(x = 1){
  return(x > 10)
}
#Try it out
bigger_than10(x = 20)
## [1] TRUE
If you like your function and want to use it later, you can store it as an R object in your working directory.
save(bigger_than10,
file = "bigger_than10.rda")
Later, you fetch it as usual.
load("bigger_than10.rda")
Your turn
Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that would actually vary across models: the predicted and the observed variables. By standardizing their names, you can make a more general R code.
- Create two R objects/vectors – y and y_pred – on the basis of df$PoolsLocal and df$preds1.
- Identify all the places the observed y (PoolsLocal) is used in the code. Replace the name by y. Do the same with y_pred and preds1.
- Create a data table df0 with data.frame(y, y_pred). This is just to make sure your two variables are the same length. If they're not, R will throw you a warning.
- Start with the following code (copy-pasted from before) without the variable selection element (the dplyr::select() line); you don't need it. If the code runs, you're good to go.
- Pack the code chunk into separation_plot <- function(y = NULL, y_pred = NULL){}.
tab <- df %>%
  #Select the predicted probabilities and the observed outcomes
  dplyr::select(PoolsLocal, preds1) %>%
  #Randomize the order of the data
  arrange(sample(nrow(df))) %>%
  #Sort the data from low to high probability
  arrange(preds1) %>%
  #Create an index variable for the x-axis
  mutate(index = 1:nrow(df))
#Plot
ggplot() +
#Vertical lines == one per observation in the data
geom_segment(data = tab,
aes(x = index,
xend = index,
y = 0,
yend = 1,
#Separate the segments by predicted outcome
color = as.factor(PoolsLocal)),
#Make colors transparent in case we have multiple observations with the same probability
alpha = 0.6) +
scale_color_manual(values = c("grey", "purple")) +
#A line that illustrates the increase in predicted probability
geom_line(data = tab,
aes(y = preds1,
x = index)) +
#Esthetics
ylab("Y-hat") +
xlab("index") +
ggtitle("Separation plot") +
theme_minimal()
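If you get stuck, here is one possible way to wrap the chunk above into a function. Treat it as a sketch rather than the only solution; the names separation_plot, y and y_pred follow the exercise.
separation_plot <- function(y = NULL, y_pred = NULL){
  #Collect the two vectors; this also checks that they are of equal length
  df0 <- data.frame(y, y_pred)
  df0 <- df0 %>%
    #Randomize the order, then sort from low to high predicted probability
    arrange(sample(nrow(df0))) %>%
    arrange(y_pred) %>%
    #Index variable for the x-axis
    mutate(index = 1:nrow(df0))
  #Build and return the plot
  ggplot(df0) +
    geom_segment(aes(x = index, xend = index,
                     y = 0, yend = 1,
                     color = as.factor(y)),
                 alpha = 0.6) +
    scale_color_manual(values = c("grey", "purple")) +
    geom_line(aes(y = y_pred, x = index)) +
    ylab("Y-hat") +
    xlab("index") +
    ggtitle("Separation plot") +
    theme_minimal()
}

#Try it out
separation_plot(y = df$PoolsLocal, y_pred = df$preds1)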
Want another challenge?
- What aesthetics would you have in the function that the user can later alter?
- The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?
Tips: Your challenge will probably be to make sure that the two vectors are of equal length. If you have missing data in the model, this will not be the case. I solved it by indexing on the row names, but there are many ways around the problem.