The best way to assess a model is by how well it predicts. Virtually all model statistics are based on the idea of prediction. This problem set is designed to give you the intuition behind some of the model assessments we have surveyed. Exercises 2c and 3 are intended for the R enthusiasts.
We will continue to work on the same data as before.
Here, we will check out the predictions of the following two binomial logistic models.
Dependent variable: PoolsLocal

| | (1) | (2) |
|---|---|---|
| OpenList | -0.971*** (0.166) | -1.124*** (0.181) |
| LaborCost | | 0.056*** (0.009) |
| SeatsNatPal.prop | | -1.930*** (0.527) |
| Constant | -0.190* (0.103) | -1.094*** (0.286) |
| Observations | 707 | 686 |
| Log Likelihood | -441.336 | -392.832 |
| Akaike Inf. Crit. | 886.672 | 793.665 |

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.
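The code estimating these models is not shown in this section. Here is a minimal sketch of how they could be fit, assuming the variable names in the table and the data frame `df` used throughout the problem set:

mod1 <- glm(PoolsLocal ~ OpenList,
            family = binomial, data = df)
mod2 <- glm(PoolsLocal ~ OpenList + LaborCost + SeatsNatPal.prop,
            family = binomial, data = df)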
We will share the responses on Padlet: https://padlet.com/siljesynnove/brier_score
Ward and Ahlquist (2018, ch. 3, pp. 65-66) propose a very intuitive single-number summary of how well a logistic regression describes the outcome variable.
Brier score:
\(B_b \equiv \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_i - y_i)^2\)
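Here is a minimal sketch of the calculation in R, assuming `mod1` and `df` from above (the predicted probabilities play the role of \(\hat{\theta}_i\), `PoolsLocal` that of \(y_i\)):

# Predicted probabilities from model 1
theta_hat <- predict(mod1, df, type = "response")
# Brier score: the mean squared distance between prediction and outcome
mean((theta_hat - df$PoolsLocal)^2, na.rm = TRUE)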
We will share the responses on Padlet: https://padlet.com/siljesynnove/predicted_counts
The true positive rate (TPR), also known as sensitivity, is related to the false positive rate (FPR) as we change the cut value (represented as \(\tau\)) for a binary model.
When we change the cut value, this affects the number of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions made by the model.
The TPR is defined as:
TPR = TP / (TP + FN)
That is, the number of true positives divided by the sum of true positives and false negatives.
The FPR is defined as:
FPR = FP / (FP + TN)
That is, the number of false positives divided by the sum of false positives and true negatives.
Calculate the predicted values from models 1 and 2.
How many observations have a predicted probability above 0.5?
Decide on a \(\tau\), the cut point beyond which the predicted probability translates to a 1. Why did you choose that one?
Create a frequency table. How many 0s and 1s did you predict? How does it compare with the observed frequency of `PoolsLocal`?
Here is my attempt. I opted for a cutpoint of 0.35.
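Here is a sketch of code that would reproduce the table below (I use `mod1`, since the counts are consistent with the true positive rate calculated further down; the object name `preds2` mirrors the output):

df$preds2 <- as.numeric(predict(mod1, df, "response") > 0.35)
table(PoolsLocal = df$PoolsLocal, preds2 = df$preds2)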
## preds2
## PoolsLocal 0 1
## 0 249 208
## 1 78 172
The true positive rate (sensitivity) is defined as the number of correctly predicted 1s divided by all observed 1s. Here is how I calculated it.
# Cut value tau
tau <- 0.35
# Predicted outcome
df$preds0 <- predict(mod1, df, "response") > tau
# True positives
truePos <- sum(df$PoolsLocal == 1 & df$preds0 == 1, na.rm = TRUE)
# All observed positives
Pos <- sum(df$PoolsLocal, na.rm = TRUE)
# The rate
truePos / Pos
## [1] 0.688
As we change the cut value \(\tau\), we can observe a trade-off between the true positive rate (TPR) and the false positive rate (FPR). A higher cut value will result in fewer positive predictions (both true and false), and therefore a lower TPR and FPR. On the other hand, a lower cut value will result in more positive predictions (both true and false), and therefore a higher TPR and FPR.
This is what the ROC curve illustrates: the relationship between the TPR and the FPR. Here, I do it for the first cut value by relying on indexation (Hermansen 2023, ch. 3). You can do it for the following two cut values by replacing the 1 in the square brackets (`[1]`) with 2, and then 3.
# Empty objects to store the results
TPR <- NULL; FPR <- NULL
# Cut values
tau <- c(0, 0.35, 1)
# Predicted outcome at the first cut value
df$preds0 <- predict(mod1, df, "response") > tau[1]
truePos <- sum(df$PoolsLocal == 1 & df$preds0 == 1, na.rm = TRUE)
Pos <- sum(df$PoolsLocal, na.rm = TRUE)
falsePos <- sum(df$PoolsLocal == 0 & df$preds0 == 1, na.rm = TRUE)
Neg <- sum(df$PoolsLocal == 0, na.rm = TRUE)
# Store the rates by index
TPR[1] <- truePos / Pos
FPR[1] <- falsePos / Neg
Here is my attempt.
Did you notice the index `[1]`? We can replace that number with an R object and let the object change value automatically. We do that by writing a “for loop”. Here, I ask R to print the value of the object `i` for me.
i <- 1
cat(i)
## 1
Now, I ask R to change the value of `i` for me from 1 to 3. Note the curly brackets.
for(i in 1:3){
cat(i)
}
## 123
Use my R code for the first ROC curve (the TPR/sensitivity and the FPR/1-specificity). Wrap it in a for loop, and let tau vary from 0 to 1 in increments of 0.05.
Use your code to illustrate the ROC curve of the second model in this problem sheet.
Here is my attempt.
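Here is a sketch of how the loop could look, building directly on the code above (my own filling-in; swap `mod1` for the second model to answer the follow-up exercise):

# Cut values from 0 to 1 in increments of 0.05
tau <- seq(0, 1, by = 0.05)
TPR <- NULL; FPR <- NULL
for(i in 1:length(tau)){
  # Predicted outcome at cut value number i
  preds <- predict(mod1, df, "response") > tau[i]
  # True positive and false positive rates at this cut value
  TPR[i] <- sum(df$PoolsLocal == 1 & preds == 1, na.rm = TRUE) /
    sum(df$PoolsLocal, na.rm = TRUE)
  FPR[i] <- sum(df$PoolsLocal == 0 & preds == 1, na.rm = TRUE) /
    sum(df$PoolsLocal == 0, na.rm = TRUE)
}
# The ROC curve
plot(FPR, TPR, type = "b", main = "ROC curve, model 1")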
Have you tried to store your R code in an object to save space next time around? Often we use the same code over and over again, only tweaking the data we put into it. This is usually where writing a function will be a time saver. The sure sign that a function might be useful is when I have a code block where I only change one or two things between each time I use it.
The function that creates functions (sic!) is called `function()`. In the parentheses, we put the arguments (i.e., the data/stuff that varies between each time we run the code). After that, we open a set of curly brackets. This is where we put our R code. The last line inside the curly brackets states the output that R should give when you run your function: `return()`. R will only give one object per function.
Store your function in an R object.
my_function <- function(){}
I want a function that tells me if the number I put into it is smaller or larger than 10. My argument `x =` is the input; the output is given in the `return()`. For ease, I create a default value for x: I chose `x = 1`. However, once the function is compiled (run the R code), the user may redefine the x object to whatever value they want.
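A minimal version of such a function could look like this (the function name is my choice):

# Returns TRUE if x is smaller than 10, FALSE if larger; default x = 1
smaller_than_10 <- function(x = 1){
  return(x < 10)
}
# Call it with the default value
smaller_than_10()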
## [1] TRUE
If you like your function and want to use it later, you can store it as an R object in your working directory.
Later, you fetch it as usual.
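One way to do this is with `saveRDS()` and `readRDS()`; the file name here is my choice:

# Store the function as an .rds file in the working directory
saveRDS(my_function, file = "my_function.rds")
# Later: read it back into an object
my_function <- readRDS("my_function.rds")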
Create your own separation plot function. It consists of two long chunks, but there are only two data elements/objects that would actually vary across models: the predicted and the observed variables. By standardizing their names, you can make a more general R code.
1. Create two R objects/vectors, `y` and `y_pred`, on the basis of `df$PoolsLocal` and `df$preds1`.
2. Identify all the places the observed y (`PoolsLocal`) is used in the code and replace the name with `y`. Do the same with `y_pred` and `preds1`.
3. Create a data table `df0` with `data.frame(y, y_pred)`. This is just to make sure your two variables are the same length; if they are not, R will throw an error.
4. Run the following code (copy-pasted from the previous problem set) without the variable selection element (line 2); you don't need it. If the code runs, you're good to go.
5. Pack the code chunk into `separation_plot <- function(y = NULL, y_pred = NULL){}`. A minimal sketch follows this list.
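For orientation, here is a minimal sketch of what such a function could look like. It is a bare-bones separation plot in base R, not the full code chunk from the previous problem set:

separation_plot <- function(y = NULL, y_pred = NULL){
  # data.frame() fails here if the two vectors differ in length
  df0 <- na.omit(data.frame(y, y_pred))
  # Order the observations by predicted probability
  df0 <- df0[order(df0$y_pred), ]
  n <- nrow(df0)
  # One vertical band per observation: dark bands mark observed 1s
  plot(NULL, xlim = c(0, n), ylim = c(0, 1), xaxt = "n",
       xlab = "Observations ordered by predicted probability",
       ylab = "Predicted probability")
  rect(xleft = 0:(n - 1), ybottom = 0, xright = 1:n, ytop = 1,
       col = ifelse(df0$y == 1, "darkred", "grey90"), border = NA)
  # Trace the predicted probabilities across the ordered observations
  lines(1:n, df0$y_pred, lwd = 2)
}

# Example call, using the predictions from model 1
separation_plot(y = df$PoolsLocal,
                y_pred = predict(mod1, df, "response"))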
The observed outcome and the predicted probabilities are also stored inside the model object itself, as `mod1$y` and `mod1$fitted.values`. Can you rewrite your function so that the only argument required is the model object `mod1`? Tips: Your challenge will probably be to make sure that the two vectors are of equal length. If you have missing data in the model, this will not be the case. I solved it by indexing on the row names, but there are many ways around the problem.
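Here is a sketch of one possible solution, reusing the `separation_plot()` sketch above. Instead of indexing on row names, it pulls both vectors from the model object, which has already dropped the missing observations, so they are of equal length:

# Wrapper: draw a separation plot straight from a fitted glm object
separation_plot_mod <- function(model){
  separation_plot(y = model$y, y_pred = model$fitted.values)
}
separation_plot_mod(mod1)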