The best way to assess a model is by how well it predicts. Virtually all model statistics are based on the idea of prediction. This problem set is designed to give you the intuition behind some of the model assessments we have surveyed. Exercises 2c and 3 are intended for the R enthusiasts.

We will continue to work on the same data as before.

Here, we will check out the predictions of the following two binomial logistic models.

MEPs’ propensity to share local assistants (a binomial logit)

                          Dependent variable: PoolsLocal
                               (1)              (2)
OpenList                    -0.971***        -1.124***
                            (0.166)          (0.181)
LaborCost                                     0.056***
                                             (0.009)
SeatsNatPal.prop                             -1.930***
                                             (0.527)
Constant                    -0.190*          -1.094***
                            (0.103)          (0.286)
Observations                 707              686
Log Likelihood              -441.336         -392.832
Akaike Inf. Crit.            886.672          793.665
Note: *p<0.1; **p<0.05; ***p<0.01

Predicted probabilities

We will share the responses on Padlet: https://padlet.com/siljesynnove/brier_score

Exercise 1: Calculate the Brier score

Ward and Ahlquist (2018, ch. 3, pp. 65-66) propose a very intuitive single-number summary of how well a logistic regression describes the outcome variable.

Brier score:

\(B_b \equiv \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_i - y_i)^2\), where \(\hat{\theta}_i\) is the predicted probability and \(y_i\) the observed outcome (0 or 1) for observation \(i\).

  • Run model 1 and calculate the Brier score.
  • Explain in your own words what the score communicates.
  • Do the same for model 2. How does the score change?
  • Ask ChatGPT if there is an R function that can calculate this for you.
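
A minimal sketch of how the score could be computed by hand for model 1 (assuming the fitted model object mod1 and the data frame df from earlier; the object name theta_hat is just a placeholder):

#Predicted probabilities from model 1
theta_hat <- predict(mod1, df, "response")

#Brier score: the mean squared distance between prediction and outcome
mean((theta_hat - df$PoolsLocal)^2, na.rm = TRUE)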

Predicted counts

We will share the responses on Padlet: https://padlet.com/siljesynnove/predicted_counts

The true positive rate (TPR), also known as sensitivity, trades off against the false positive rate (FPR) as we change the cut value (represented as \(\tau\)) of a binary model.

When we change the cut value, this affects the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions made by the model.

The TPR is defined as:

TPR = TP / (TP + FN)

That is, the number of true positives divided by the sum of true positives and false negatives.

The FPR is defined as:

FPR = FP / (FP + TN)

That is, the number of false positives divided by the sum of false positives and true negatives.
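
For instance, with the made-up counts TP = 40, FP = 10, TN = 90 and FN = 60 (numbers chosen purely for illustration):

#Hypothetical counts, for illustration only
TP <- 40; FP <- 10; TN <- 90; FN <- 60

#True positive rate: correctly predicted 1s among all observed 1s
TP / (TP + FN)
## [1] 0.4

#False positive rate: incorrectly predicted 1s among all observed 0s
FP / (FP + TN)
## [1] 0.1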

Confusion matrix: predicted vs. observed values

Exercise 2a: Predicted values

Calculate the predicted values from models 1 and 2; a sketch of the prediction step follows the list below.

  • How many observations have a predicted probability above 0.5?

  • Decide on a \(\tau\), the cut point beyond which the predicted probability translates to a 1. Why did you choose that one?

  • Create a frequency table. How many 0s and 1s did you predict? How does it compare with the observed frequency of PoolsLocal?
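
A minimal sketch of the prediction step (the object name phat1 is just a placeholder; mod1 and df are assumed from earlier):

#Predicted probabilities from model 1
df$phat1 <- predict(mod1, df, "response")

#How many observations have a predicted probability above 0.5?
sum(df$phat1 > 0.5, na.rm = TRUE)

#Frequency table of predicted 0s and 1s vs. the observed outcome
table(df$phat1 > 0.5)
table(df$PoolsLocal)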

Exercise 2b: True positives vs. false positives

  • Create a cross table where you compare your predicted 0s and 1s with the actual observed values. This is called a confusion matrix.

Here is my attempt. I opted for a cutpoint of 0.35.
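
A cross table of this kind can be made with table(); here is a minimal sketch, assuming the predictions come from model 1 (mod1) at the chosen cut point:

#0/1 predictions from model 1 at the cut point 0.35
df$preds2 <- as.numeric(predict(mod1, df, "response") > 0.35)

#Cross table of observed vs. predicted values
table(PoolsLocal = df$PoolsLocal, preds2 = df$preds2)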

##           preds2
## PoolsLocal   0   1
##          0 249 208
##          1  78 172

The true positive rate (sensitivity) is defined as the number of correctly predicted 1s divided by all observed 1s. Here is how I calculated it.

#Cut value;tau
tau <- 0.35

#Predicted  outcome
df$preds0  <- predict(mod1, df, "response")>tau

#True positives
truePos <- sum(df$PoolsLocal == 1 & df$preds0 == 1, na.rm = T)

#All positives
Pos<-sum(df$PoolsLocal,na.rm = T)

#The rate
truePos/Pos
## [1] 0.688
  • Can you calculate the false positive rate? It is defined as the number of observed 0s that are incorrectly predicted as 1s (the false positives), divided by the total number of observed 0s.

Exercise 2c: Make a ROC curve (and learn how to write a loop).

As we change the cut value \(\tau\), we can observe a trade-off between the true positive rate (TPR) and the false positive rate (FPR). A higher cut value will result in fewer positive predictions (both true and false), and therefore a lower TPR and FPR. On the other hand, a lower cut value will result in more positive predictions (both true and false), and therefore a higher TPR and FPR.

This is what the ROC curve illustrates.

  • Calculate the TPR and the FPR for three cut values: 0, 0.35 and 1. Use three vectors/objects in the calculation: tau, TPR and FPR.

Here, I do it for the first cut value by relying on indexation (Hermansen 2023, ch. 3). You can do it for the two remaining cut values by replacing the 1 in the index ([1]) with 2, and then 3.

#Empty objects
TPR=NULL;FPR=NULL

#Cut values
tau<-c(0,0.35,1)

#Indexation
df$preds0<-predict(mod1,df,"response") > tau[1]

truePos<-sum(df$PoolsLocal==1&df$preds0==1,na.rm = T)
Pos<-sum(df$PoolsLocal,na.rm = T)

falsePos<-sum(df$PoolsLocal==0&df$preds0==1,na.rm = T)
Neg<-sum(df$PoolsLocal==0,na.rm = T)

#Indexation
TPR[1]<-truePos/Pos
FPR[1]<-falsePos/Neg
  • Make a plot with the FPR along the x-axis and the TPR along the y-axis. Draw a line between the three dots.
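
A base-R sketch of such a plot, assuming the TPR and FPR vectors above have been filled in for all three cut values:

#ROC-style plot: FPR on the x-axis, TPR on the y-axis, dots joined by a line
plot(FPR, TPR, type = "b", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False positive rate", ylab = "True positive rate")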

Here is my attempt.

  • Did you notice how the only thing you had to change in my code was the value in the indexation ([1])? We can replace that number with an R object and let the object change value automatically. We do that by writing a “for loop”. Here, I ask R to print the value of the object i for me.
i = 1
cat(i)
## 1

Now, I ask R to change the value of i for me from 1 to 3. Note the curly brackets.

for(i in 1:3){
  cat(i)
}
## 123
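
To see the idea in action, here is a minimal sketch that lets i run over the three cut values from above and fills in the TPR each time (the FPR works the same way; mod1, df, tau and TPR are assumed from the code above):

#Loop over the cut values, replacing the fixed index [1] with i
for(i in 1:length(tau)){
  df$preds0 <- predict(mod1, df, "response") > tau[i]
  TPR[i] <- sum(df$PoolsLocal == 1 & df$preds0 == 1, na.rm = TRUE)/sum(df$PoolsLocal, na.rm = TRUE)
}
TPR
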
  • Use my R code for the first ROC curve (the TPR/sensitivity and the FPR/1-specificity). Wrap it in a for loop, and let \(\tau\) vary from 0 to 1 in increments of 0.05.

  • Use your code to illustrate the ROC curve of the second model in this problem sheet.

Here is my attempt.

Exercise 3: Write your own function

Have you tried to store your R code in an object so that you can reuse it later? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is a code block where I only change one or two things between each time I use it.

The function that creates functions (sic!) is called function().

  1. In the parentheses, we put the arguments (i.e. the data/objects that vary between each time we run the code).

  2. After that, we open a set of curly brackets. This is where we put our R code.

  3. The last line inside the curly brackets states what output R should return when you run your function: return(). R will only return one object per function.

  4. Store your function in an R object. my_function <- function(){}
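
Putting these four steps together, a minimal skeleton might look like this (the names are placeholders):

#1. arguments (with a default); 2. code; 3. return(); 4. stored in an object
my_function <- function(x = 1){
  out <- x * 2   #whatever code you want to reuse
  return(out)
}

my_function(x = 5)
## [1] 10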

Exercise 3a: A toy example.

I want a function that tells me whether the number I put into it is smaller or larger than 10. My argument `x =` is the input; the output is given in the return(). For convenience, I create a default value for x; I chose x = 1. However, once the function is compiled (run the R code), the user may redefine the x object to whatever value they want.
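
A minimal sketch of such a function (the name smaller_than_10 is just a placeholder):

#Is x smaller than 10? The default input is x = 1
smaller_than_10 <- function(x = 1){
  return(x < 10)
}

#Run the function with its default value
smaller_than_10()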

## [1] TRUE

If you like your function and want to use it later, you can store it as an R object in your working directory.

Later, you fetch it as usual.
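
One way to do this is with saveRDS() and readRDS(); a sketch, using the toy function from above:

#Store the function as an .rds file in the working directory
saveRDS(smaller_than_10, file = "smaller_than_10.rds")

#Later: fetch it again and give it a name
smaller_than_10 <- readRDS("smaller_than_10.rds")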

Your turn

Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that would actually vary across models: the predicted and the observed variables. By standardizing their names, you can make the R code more general.

  1. create two R objects/vectors – `y` and `y_pred` – on the basis of df$PoolsLocal and df$preds1.

  2. identify all the times the observed y (PoolsLocal) is used in the code. Replace the name with y. Do the same with y_pred and preds1.

  3. create a data frame df0 with data.frame(y, y_pred). This is just to make sure your two variables have the same length. If they don’t, R will throw an error.

  4. start from the code you used previously (copy-paste it), without the variable selection element (line 2); you don’t need it. If the code runs, you’re good to go.

  5. pack the code chunk into separation_plot <- function(y = NULL, y_pred = NULL){}.

Exercise 3b: Generalize your function

  • What aesthetics would you include in the function for the user to alter later?
  • The model object in R contains the two vectors you need: mod1$y and mod1$fitted.values. Can you rewrite your function so that the only argument required is the model object mod1?

Tip: Your challenge will probably be to make sure that the two vectors are of equal length. If you have missing data in the model, this will not be the case. I solved it by indexing on the row names, but there are many ways around the problem.

Literature

Hermansen, Silje Synnøve Lyder. 2023. R i praksis - en introduktion for samfundsvidenskaberne. 1st ed. Copenhagen: DJØF Forlag. https://www.djoef-forlag.dk/book-info/r-i-praksis.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.