Models for event history
In this session, we will be analyzing event history data. These are models that take into account a combination of two outcomes of interest: the time it takes for an event to happen, and whether or not this event occurred. For most of the applications, we will focus on the Cox proportional hazard model.
We will also familiarize ourselves with the ggeffects package, which allows you to make predictions from various models and estimate their uncertainty.
We will familiarize ourselves with:
how to estimate different event history models
- Cox proportional hazard models from the survival package and a brief run-through of parametric models
event history model interpretation
- the flipping between baseline hazards and survival time
- predicting outcomes and estimating their uncertainty using ggeffects
model assessment
- the assumption of proportional hazards, i.e. that coefficients should not depend on where in the spell we are.
We will be working with data from Box-Steffensmeier, Reiter, and Zorn (2003) on election campaigns in the US. They argue that the campaign funding that candidates have (i.e. their “warchest”) may deter potential challenger candidates from entering the electoral competition. Specifically, they observe incumbent candidates in the period leading up to the 1990 election for the US House of Representatives. They measure the candidates’ cash on hand (i.e. money at their disposal) and whether and when a challenger candidate enters the race.
We start by loading in the data and relevant packages.
#Data wrangling and plotting
library(dplyr); library(ggplot2)
#Set ggplot theme to theme_minimal; a question of taste
theme_set(theme_minimal())
#Event history analysis/survival models
library(survival)
Download the data and put it in your working directory.
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/warchest.rda",
destfile = "warchest.rda")
Now you can load it into RStudio and rename the object.
#Load in data
load("warchest.rda")
#Rename
df <- warchest
Introduction: explore the data
The authors measure the time it takes before a challenger enters the electoral race. The intuition is that a well-resourced incumbent candidate will deter challengers, such that the time it takes for a challenger to surface is higher if the candidate has a large “warchest”; in fact, the event may never happen.
Univariate statistics
Check out the size of the data and the number of variables.
dim(df)
## [1] 1376 7
Have a sneak peek at the data.
head(df)
We see that there are 7 variables. The data splits observations into periods in which the incumbent candidate’s warchest is observed. A new period starts when a challenger surfaces. The caseid variable reports the identity of each candidate. The te variable measures the number of weeks of the spell, while cut_hi reports whether a challenger entered the race. Our main predictor (the warchest) is reported in ec.
df$caseid %>% unique %>% length
## [1] 397
There are 397 candidates running for election, although there are 1376 observations in the data.
The number of events
table(df$cut_hi)
##
## 0 1
## 1336 40
We see that there were 40 challengers entering the race. In other words, it is a fairly rare event.
We can check whether there are several events per incumbent candidate.
tab <-
  df %>%
  group_by(caseid) %>%
  summarize(challengers = sum(cut_hi))

tab$challengers %>% table
## .
## 0 1
## 357 40
There isn’t. The study stops when an incumbent is challenged, meaning this is not a repeated event study.
Duration
When observations are split into several sub-periods, it is slightly more challenging to explore the duration of each spell. For descriptive purposes, I create a new data set at the spell level. The te variable already reports the cumulative duration per candidate, so it suffices to retain the cases with the highest duration.
df0 <-
  df %>%
  #... within each candidate
  group_by(caseid) %>%
  #Retain the highest count
  slice_max(te)

df0$te %>%
  hist(.,
       main = "Number of weeks until an incumbent is challenged or the election is held")
A combined outcome variable: segments for durations
We are interested in the combination of these two outcomes. How can we explore that? One way is to plot the durations and flag the ones that end in a challenge.
The base-line code is this.
df0 %>%
  ggplot() +
  #Plot a segment per incumbent candidate
  geom_segment(aes(
    #The segment starts at week 0
    x = 0,
    #... and ends at the end of the study
    xend = te,
    #y axis reports the candidate id
    y = caseid %>% as.factor,
    yend = caseid %>% as.factor,
    color = cut_hi %>% as.factor),
    position = "identity")
We see that most candidates last (“survive”) for a long time, and that the number of challenges is small relative to the number of candidates. The plot is too crammed, however, so let’s do some partitioning and aesthetic tweaks.
I split the data into Democrats and Republicans. Here, I only do the Democrats.
df0 %>%
  #There are so many candidates, that I split them by party
  filter(dem == 1) %>%
  #I give the challenger variable some more human values
  mutate(challenger = if_else(cut_hi == 1,
                              "Challenger enters",
                              "No challenge")) %>%
  ggplot() +
  #Plot a segment per incumbent candidate
  geom_segment(aes(
    #The segment starts at week 0
    x = 0,
    #... and ends at the end of the study
    xend = te,
    #y axis reports the candidate id; here I order them according to their duration
    y = reorder(caseid, te),
    yend = reorder(caseid, te),
    color = challenger %>% as.factor),
    position = "identity",
    #Transparent segments with a size
    alpha = 0.5,
    size = 0.8) +
  #I parse out the events in a different pane
  facet_grid(cols = vars(challenger)) +
  scale_color_manual(values = c("blue", "black"),
                     name = "Challenger enters") +
  #Smaller text on the y axis
  theme(axis.text = element_text(size = 5),
        #suppress legend
        legend.position = "none") +
  ggtitle("Campaign duration and challenger entry",
          subtitle = "Among DEMOCRATS") +
  #no y-axis label
  ylab("") +
  xlab("Campaign duration (weeks)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
A combined variable for censored data
We are now moving into the domain of event history functions. Let’s explore how R combines the two outcomes when the estimation is done. Note that I am now working in the main data frame again (df).
The Surv() function in the survival package creates a new, combined dependent variable for us. (If you haven’t installed the package yet, you will have to do so before you fetch it with library()).
We report the duration with start and stop times (time= and time2=) and whether or not an event has taken place at the end of this period (event=).
In this data, there is no starting time. The te
variable
only reports the duration at the end of the period for the observation,
so we need to create a starting point. This starting point is either 0
(the first risk set) or the end-point of the previous risk set. In other
words, we need to do some recoding where we calculate the lag of the
end-time.
df <-
  df %>%
  #Sort data so that first event is first etc
  arrange(caseid, te) %>%
  #within each candidate
  group_by(caseid) %>%
  #find the previous duration at the end of the period
  mutate(start = lag(te))

#First period starts at 0
df$start[is.na(df$start)] = 0
Now, we are good to go. R can label each survival time/duration according to whether or not it is censored (+).
Surv(time = df$start, time2 = df$te, event = df$cut_hi) %>% head
## [1] ( 0,26+] (26,50] ( 0,26+] (26,53+] (53,65+] (65,75+]
This will constitute the dependent variable in all of our analyses. It’s a combination of three pieces of information.
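To see the same notation in the simplest setting (one row per candidate, no start/stop split), here is a minimal sketch using the spell-level data frame df0 we created above:

#Censored spells (no challenger before the election) are marked with a +
Surv(time = df0$te, event = df0$cut_hi) %>% head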
We can explore the base-line hazard/survival in a model with no covariates by creating a Kaplan-Meier curve. It is somewhat of a homologue to the geom_smooth() statistics in ggplot2: it estimates hazard rates and survival curves for us from the empirical distribution. That is, it estimates the survival rate at each event over time. It gives us a sneak peek of the shape of the base-line hazard. Of course, our predictors may change that shape, but it gives us an indication of the distribution, just as a histogram would do. Incidentally, it also happens to function the same way as the Cox proportional hazard model.
KaplanMeier <- survfit(Surv(start, te, cut_hi) ~ 1, data = df)
KaplanMeier
## Call: survfit(formula = Surv(start, te, cut_hi) ~ 1, data = df)
##
## records n events median 0.95LCL 0.95UCL
## [1,] 1376 397 40 NA NA NA
The R object tells us how many observations we have and how many events. The object is a list containing a lot of information. It reports the cumulative hazard (the increasing hazard of being challenged), the time and the survival rates.
Among other things, it reports the risk sets. Risk sets are the set of people/observations that are “alive” at time \(t\), running the risk of being challenged (i.e. experiencing the event). These are the “mini-datasets” that R uses when it calculates its “mini-logits”. For each risk set, a model is estimated. The \(\beta\)s are then aggregated/averaged over to report a single effect of \(x\) with a single slope coefficient (\(\beta\)). In other words, it is all the unique duration periods where an event happened or – at the end of the study – did not happen. We could have calculated that from our df0.
KaplanMeier$n.risk %>% length
## [1] 80
df0$te %>% unique %>% length
## [1] 80
We can plot the descriptive statistics in many ways. We can do it by hand simply by extracting the information ourselves. The R object contains all of these variables: upper (the confidence interval), lower, time and surv (the mean predicted survival rate).
ggplot() +
geom_ribbon(aes(ymin = KaplanMeier$lower,
ymax = KaplanMeier$upper,
x = KaplanMeier$time),
alpha = 0.2) +
geom_line(aes(x = KaplanMeier$time,
y = KaplanMeier$surv)) +
ggtitle("Kaplan-Meyer: a base-line survival curve")
There are ready-mades that can do this for you in no time. Here, in base R.

KaplanMeier %>%
  plot
Or in ggplot, using the ggsurvfit package.
library(ggsurvfit)
## Warning: package 'ggsurvfit' was built under R version 4.2.3
ggsurvfit(KaplanMeier) +
add_confidence_interval()
It is clear that most incumbent candidates remain in the race unchallenged all the way through to the election. We see that from the survival probability at the end of the study.
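If we want to put a number on it, a rough check is to pull out the last estimated survival rate from the survfit object (a small sketch; it simply takes the last element of the surv component):

#Estimated probability of remaining unchallenged at the end of the study
tail(KaplanMeier$surv, 1)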
Bivariate statistics
It is always useful to explore the bi- and tri-variate relationships in the data. It gives you an idea about your future results, but also challenges and opportunities that might occur.
Segments
Using the segments code from the previous section, I could split or color the data according to their values on the main predictors. We’re moving towards bivariate exploration.
For fun, I recode the warchest into rich/not rich incumbents.
df0$ec %>% hist

df0$rich <- ifelse(df0$ec > 0.25, "rich", "not rich")
… and color code accordingly. The code is mostly a copy-paste from previously. The only new things here are the color specification and moving the legend to the bottom of the plot.
df0 %>%
  #There are so many candidates, that I split them by party
  filter(dem == 1) %>%
  #I give the challenger variable some more human values
  mutate(challenger = if_else(cut_hi == 1,
                              "Challenger enters",
                              "No challenge")) %>%
  ggplot() +
  #Plot a segment per incumbent candidate
  geom_segment(aes(
    #The segment starts at week 0
    x = 0,
    #... and ends at the end of the study
    xend = te,
    #y axis reports the candidate id; here I order them according to their duration
    y = reorder(caseid, te),
    yend = reorder(caseid, te),
    color = rich),
    position = "identity",
    #Transparent segments with a size
    alpha = 0.5,
    size = 0.8) +
  #I parse out the events in a different pane
  facet_grid(cols = vars(challenger)) +
  scale_color_manual(values = c("blue", "black"),
                     name = "Warchest") +
  #Smaller text on the y axis
  theme(axis.text = element_text(size = 5),
        #move legend
        legend.position = "bottom") +
  ggtitle("Campaign duration and challenger entry",
          subtitle = "Among DEMOCRATS") +
  #no y-axis label
  ylab("") +
  xlab("Campaign duration (weeks)")
Now, we can clearly see that the density of “rich” candidates is higher in the “No challenge” group, and high up in the plot where the long durations are. In other words, the idea that rich candidates deter newcomers is visible both in the duration (somewhat) and in the event/no event data.
Your turn
Can you do this for Republicans? And give the bars a continuous color range according to the size of their warchest?
Kaplan-Meier
We can also partition the data into groups and create different Kaplan-Meier curves for subgroups of the data. This is only useful for categorical predictors, so I recode my data again. Note that I am now working in the main data frame (df).
#Create a bin
df$rich <- ifelse(df$ec > 0.25, "rich", "not rich")

KaplanMeier <- survfit(Surv(start, te, cut_hi) ~ rich, data = df)

#Plot using the ggsurvfit package
ggsurvfit(KaplanMeier) +
  add_confidence_interval() +
  ggtitle("Base-line survival rate using ggsurvfit")
Here, we see that the “not rich” candidates have a higher probability of being challenged, and that these challenges seem to arrive after a year (50-75 weeks). The richest candidates have a very high probability of survival (no challenge) throughout the campaign, although – there too – there’s a bit of a rush in challenges towards the end.
Fitting the model
A base-line model with one predictor
Cox proportional hazard
The Cox proportional hazard model is semi-parametric in the sense that it simply uses the empirical base-line hazard (survival rate; number of observations still “alive” and those that experience the event at time \(t\)), once our predictors have accounted for their variation.
mod.cox <- coxph(Surv(start, te, cut_hi) ~
                   ec
                 + cluster(caseid),
                 df)

summary(mod.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec, data = df, cluster = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) robust se z Pr(>|z|)
## ec -3.68704 0.02505 1.35286 1.38564 -2.661 0.00779 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.02505 39.93 0.001657 0.3786
##
## Concordance= 0.665 (se = 0.046 )
## Likelihood ratio test= 10.8 on 1 df, p=0.001
## Wald test = 7.08 on 1 df, p=0.008
## Score (logrank) test = 6.57 on 1 df, p=0.01, Robust = 12.71 p=4e-04
##
## (Note: the likelihood ratio and score tests assume independence of
## observations within a cluster, the Wald and robust score tests do not).
The model output gives us quite a lot of information. It reports the raw coefficient, i.e. the log hazard ratio, but also the hazard ratio, the standard error, the z-statistic and the p-value. In many studies, the results are reported as hazard ratios (the exponentiated coefficient), but the standard error is not transformed. This is not very useful: how can we compare the size of the coefficient and its estimated variation (s.e.)? It is usually better to report the untransformed coefficient (or both) together with the standard error. Alternatively, you might report the confidence interval.
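If we do want hazard ratios, one option is to pair them with their exponentiated confidence intervals instead of untransformed standard errors. A minimal sketch for the base-line model, relying on the generic coef() and confint() extractors:

#Hazard ratios with 95% confidence intervals
exp(cbind("Hazard ratio" = coef(mod.cox), confint(mod.cox)))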
Textual interpretation
The outcome here is a “hazard rate”. We can see that the effect of increasing the warchest is negative. It means that richer candidates have a lower hazard rate (i.e. they have a lower risk of being challenged) and therefore a higher survival rate.
Providing a textual interpretation can be challenging. All the effects are proportional to the duration/timing of the event, and so proportional to the base-line hazard (which is not estimated).
1-exp(mod.cox$coefficients[1]*0.1)
## ec
## 0.3083697
One attempt can be something like this: Holding all covariates constant, we find that a 100,000 dollar increase in an incumbent’s warchest implies a 31% lower hazard of being challenged. Put differently: at any point during the campaign, the rate at which such a candidate attracts a challenger is about 31% lower, which translates into a longer expected time before a challenger enters.
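The calculation generalizes to any change in the predictor: the hazard ratio for a change of \(\Delta x\) is \(\exp(\beta \Delta x)\). A small sketch, using the same hypothetical 0.1-unit increase as above:

#Percent change in the hazard for a 0.1-unit increase in the warchest
delta <- 0.1
100*(exp(coef(mod.cox)["ec"]*delta) - 1)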
The full model
Here, I replicate Box-Steffensmeier’s (1996) model.
mod.full.cox <- coxph(Surv(start, te, cut_hi) ~
                        #warchest
                        ec
                      #prior vote share
                      + iv
                      #democrats
                      + dem
                      #from southern US states
                      + south,
                      ties = "efron",
                      id = caseid,
                      df)
summary(mod.full.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + iv + dem + south,
## data = df, ties = "efron", id = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -3.0250756 0.0485541 1.3866594 -2.182 0.0291 *
## iv -7.0156709 0.0008977 1.6607636 -4.224 2.4e-05 ***
## dem 0.2356684 1.2657545 0.3239191 0.728 0.4669
## south -0.4398328 0.6441441 0.4238880 -1.038 0.2994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.0485541 20.596 3.206e-03 0.73545
## iv 0.0008977 1113.954 3.463e-05 0.02327
## dem 1.2657545 0.790 6.709e-01 2.38820
## south 0.6441441 1.552 2.807e-01 1.47842
##
## Concordance= 0.776 (se = 0.042 )
## Likelihood ratio test= 34.37 on 4 df, p=6e-07
## Wald test = 25.18 on 4 df, p=5e-05
## Score (logrank) test = 26.49 on 4 df, p=3e-05
#Regression table; requires the stargazer package
library(stargazer)

stargazer(mod.cox, mod.full.cox,
          type = "html")
                       Dependent variable: Surv(start, te, cut_hi)
                       (1)                  (2)
ec                     -3.687***            -3.025**
                       (1.353)              (1.387)
iv                                          -7.016***
                                            (1.661)
dem                                         0.236
                                            (0.324)
south                                       -0.440
                                            (0.424)
Observations           1,376                1,376
R2                     0.008                0.025
Max. Possible R2       0.268                0.268
Log Likelihood         -208.923             -197.136
Wald Test              7.080*** (df = 1)    25.180*** (df = 4)
LR Test                10.802*** (df = 1)   34.375*** (df = 4)
Score (Logrank) Test   6.569** (df = 1)     26.489*** (df = 4)

Note: *p<0.1; **p<0.05; ***p<0.01
Clustering and frailties?
When the data is split into a panel, as is the case here, each candidate is observed several times. However, in event history models we don’t necessarily account for these additional observations. This is obvious in single-event studies with continuous time, where there is only one duration (and one line in the data) per candidate. But even when the data is split into a panel with only one event, as long as the candidate is present only once per risk set, we don’t have to account for unobserved heterogeneity.
If we had been in the presence of a multiple-events study, where each candidate can go in and out of spells (being challenged several times), we would account for it by adding a “frailty” term. This is equivalent to a random-intercept model. We would have done that with the function frailty(caseid). Here, this would overidentify the model, so it is not possible. Another possibility in all models is to account for correlations in the observations by clustering the standard errors, but leaving the regression coefficients intact. We can do this directly with the cluster() function. I could, for example, cluster per candidate: cluster(caseid).
In this study, I do neither.
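For reference, a minimal sketch of what the two options would look like in the model formula (shown for illustration only; as noted above, the frailty version would be overidentified with these data):

#Shared frailty per candidate (analogous to a random intercept)
mod.frailty <- coxph(Surv(start, te, cut_hi) ~ ec + frailty(caseid), data = df)

#Clustered standard errors per candidate (coefficients unchanged, robust s.e.)
mod.clustered <- coxph(Surv(start, te, cut_hi) ~ ec + cluster(caseid), data = df)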
Parametric models
Parametric models rely on assumptions about the distributional form of the base-line hazard. The form is estimated using a probability distribution (e.g. exponential) or parameters that approximate the distribution (e.g. Weibull), and the models are named after this approach.
Parametric models can also use the start/stop structure on the dependent variable. However, those implemented in survival cannot, so here we rely on the eha package and its phreg() function.
library(eha)
Weibull model
The Weibull survival model is a commonly used parametric model in survival analysis. Just like the Cox model, it allows for the estimation of the hazard function of a survival time. It assumes that the hazard function follows a Weibull distribution. It has two parameters: a scale parameter (often denoted as \(\lambda\)) and a shape parameter (often denoted as \(k\)).
The Weibull distribution has a monotonic hazard function, meaning that the hazard either increases or decreases over time, depending on the value of the shape parameter.
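As a point of reference, in the standard parameterization (eha's internal scaling may differ slightly) the Weibull hazard can be written as

\[
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},
\]

so a shape parameter \(k > 1\) implies a hazard that rises over time, \(k < 1\) a falling hazard, and \(k = 1\) a constant hazard (the exponential model below).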
mod.weibull <- phreg(Surv(start, te, cut_hi) ~
                       ec
                     + iv
                     + dem
                     + south,
                     dist = "weibull",
                     df)
These parameters are reported in the output.
The exponential model
The exponential model is just a sub-category of the Weibull model. While the Weibull usually estimates its two parameters from the data, we can also force the shape parameter to 1. The result is an assumption that the hazard rate is constant over time. This means that the probability of an event occurring in a given time interval is the same regardless of how long the candidate has been in the study. The exponential model is often used as a baseline model for comparison with more complex models.
mod.exponential <- phreg(Surv(start, te, cut_hi) ~
                           ec
                         + iv
                         + dem
                         + south,
                         dist = "weibull",
                         shape = 1,
                         df)
stargazer(mod.full.cox, mod.weibull, mod.exponential,
# mod.full.curv.cox, mod.full.tt.cox,
colnames = T,
title = "Survival models",
type = "html")
Survival models

                       Dependent variable: Surv(start, te, cut_hi)
                       Cox prop. hazards    Parametric prop. hazards
                       (1)                  (2) Weibull          (3) Exponential
ec                     -3.025**             -2.583*              -1.189
                       (1.387)              (1.380)              (1.158)
iv                     -7.016***            -6.443***            -6.048***
                       (1.661)              (1.629)              (1.634)
dem                    0.236                0.231                0.186
                       (0.324)              (0.324)              (0.325)
south                  -0.440               -0.532               -0.524
                       (0.424)              (0.420)              (0.418)
log(scale)                                  3.189***             2.141**
                                            (0.449)              (1.047)
log(shape)                                  0.890***
                                            (0.143)
Observations           1,376                1,376                1,376
R2                     0.025
Max. Possible R2       0.268
Log Likelihood         -197.136             -272.195             -286.295
Wald Test              25.180*** (df = 4)
LR Test                34.375*** (df = 4)
Score (Logrank) Test   26.489*** (df = 4)

Note: *p<0.1; **p<0.05; ***p<0.01
How do I choose the right model? If I have a clear theory about the shape of the hazard function, then I have a fairly easy time choosing a model (based on theory). Often this is not the case: the variation in the hazard over the duration of the study is often noise/a disturbance in relation to our research question. This is why people often opt for the Cox model. However, if the Cox model underperforms (i.e. it is overidentified, has many ties, etc.) or we are looking to make predictions, we might opt for the parametric models (remember, the Cox model sticks close to the data in terms of the base-line hazard).
The choice between different parametric models often ends up being a comparison of which one performs the best: how good are their predictions and overall fit?
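One quick way to compare the two parametric fits above is an information criterion. A minimal sketch, assuming phreg stores the log likelihood (at zero and at the maximum) in its loglik component, as coxph does; the Weibull estimates one more parameter (the shape) than the exponential:

#AIC = 2k - 2*logL, with k = number of estimated parameters
ll.weibull     <- max(mod.weibull$loglik)
ll.exponential <- max(mod.exponential$loglik)

c(weibull     = 2*(4 + 2) - 2*ll.weibull,
  exponential = 2*(4 + 1) - 2*ll.exponential)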
Lastly, robust research results are not model-dependent. If we find support for our claims in different models, that’s an asset, while if we only find support in one type of model, we need to ask ourselves why.
Visual interpretation
Generally speaking, it can be a challenge to visualize the results from these models, since the baseline hazard varies. That is, everything depends on everything in the model (the time, the other covariates…). Our visualizations are therefore often examples of a very specific scenario. In particular, it is hard to visualize continuous variables. We therefore often end up recoding them for clarity.
Two types of visual interpretations are common.
Make a scenario with the ggeffects package
The ggeffects package can help you make scenarios, estimate uncertainty and make predictions in all of the models we have used thus far in class. Here, I apply it to survival models, but you can do the zero-inflated, the Poisson, the logit etc. with it too. Its output lends itself to ggplotting.
Without any more information, the function will make all kinds of scenarios for you that may or may not make sense.
The function is clunky for survival models. Make sure to specify whether you want a survival or a cumulative hazard rate.
library(ggeffects)

#Predict outcomes
#Remind myself of the coefficients
names(mod.full.cox$coefficients)
## [1] "ec" "iv" "dem" "south"
preds <- ggpredict(mod.full.cox,
                   #specify the model and the backtransformation
                   type = "surv", # or "cumhaz"
                   #the variables I want to vary
                   terms = list(
                     ec = c(0, 1)
                   ),
                   #the predictors I hold constant
                   condition = list(
                     iv = 0.73,
                     dem = 0,
                     south = 0
                   ))
The preds object is a specific form of data frame.
preds %>% class
## [1] "ggeffects" "data.frame"
It contains all we need for illustration: our x (here: time), our predicted values, the boundaries of the confidence interval, and the group variable, which here is my warchest (ec). In other models, the x would obviously be our main predictor as determined by the terms= argument.
head(preds %>% as.data.frame)
Survival curves
The package allows us to plot the output directly.
preds %>%
  plot
Now, it is easy to see why people often avoid reporting the confidence intervals from these models. The predictions are often imprecise.
Here, I do it manually.
preds %>%
  ggplot() +
  geom_line(aes(x = x,
                y = predicted,
                group = group,
                color = group)) +
  ggtitle("Time to challenge: effect of warchests on challenger candidates") +
  xlab("Weeks of campaign") +
  ylab("Survival (remains unchallenged)")
Coefplots with scenarios
You can subset the prediction to time-specific scenarios and make coefplots. That is, the scenario will be very much idiosyncratic to the time period you choose, but it may still communicate the effect of your covariate of choice.
Here, I opted for week 85 of the campaign. At that specific time, you might even create a bunch of predictions by varying your predictor of choice. Here, I go for rich/not rich.
#Subset the predictions to a time point (that exists)
preds1 <- preds %>% filter(x == 85)

#Relabel the grouping variable
preds1$Warchest <- ifelse(preds1$group == 1, "Rich", "Not rich")

preds1 %>%
  ggplot() +
  geom_point(aes(x = Warchest,
                 y = predicted)) +
  geom_errorbar(aes(
    x = Warchest,
    ymax = conf.high,
    ymin = conf.low
  ),
  width = 0.1) +
  ggtitle("Probability of survival 85 weeks after kickoff") +
  ylab("Candidate unchallenged")
Model assessment
The event history models have many assets, but also some Achilles heels. The models we have discussed are all based on the proportional hazard assumption. The models also have some weaknesses of their own:
- proportional hazards: the effect of our predictors (\(\beta\)) is constant over the duration. This assumption is common to all of the models. If the assumption is violated, there are a series of fixes. They usually involve changing the relationship between \(x\) and time through strata (i.e. fixed effects) or interactions.
- ties: the Cox model estimates its effects by ordering the events and comparing the observation that experienced the event with other observations that are still alive, but haven’t yet had an event. If many events happen at the same time, what is the ordering?
- overfitting: all models are prone to overfitting, but the Cox model even more so. Overfitting happens when you have more predictors than the size and variation (in the time and/or event) can vouch for.
- shape of the hazard function: this concerns the parametric models (not Cox).
Proportional hazard assumption
The main assumption of the event history models is the proportional hazard assumption.
The proportional hazards assumption states that the hazard ratio (\(\beta\)) is constant over the entire duration of the study. That is, the effect of the covariates on the hazard function should not vary over time.
The cox.zph()
function examines the correlation between
the scaled Schoenfeld residuals (a measure of the difference between the
observed and expected values of the covariates at each event time) and
time. This correlation should be zero. The function returns a test
statistic and a p-value for each covariate in the Cox model. If the
p-value is less than 0.05, this suggests that the proportional hazards
assumption may not hold for that covariate.
cox.zph(mod.full.cox)
## chisq df p
## ec 0.554 1 0.46
## iv 1.833 1 0.18
## dem 0.289 1 0.59
## south 0.974 1 0.32
## GLOBAL 3.341 4 0.50
This actually looks good. None of the covariates display any clear correlation with time. Often, it is the time-varying covariates that pose problems. The only variable that changes over the course of the study, however, is ec, the warchest. Yet it looks good.
Now, for the sport of it, let’s assume the iv
variable
is in the red (as of now, it has a p-value of 0.18, which is ok). Before
we consider our options about how to fix the “problem”, we can make a
visual inspection. It gives us an idea about where on the time
horizon the problem occurs.
The cox.zph()
function also produces a plot of the
scaled Schoenfeld residuals against time for each covariate in the
model. Now we can visually assess whether the proportional hazards
assumption holds for each covariate. If the plot shows a clear pattern
or trend over time, this suggests that the proportional hazards
assumption may not hold for that covariate. The function gives us one
plot per covariate. Use the arrow-button in your “Plot”-window to go
through them. Here, I ask R to show them all at once.
#I want to see all four coefficients in a 2x2 matrix
par(mfrow = c(2,2))
mod.full.cox %>%
  cox.zph() %>%
  plot
#Go back to one window == one plot
par(mfrow = c(1,1))
Fixing the problem
If the proportional hazards assumption is not met for one or more covariates in the Cox model, there are several strategies that can be used to address the violation. These include stratifying on the covariate, including interaction terms between the covariate and time, or using time-varying coefficients.
Strata
Stratification in the event history models functions as fixed-effects. We can ask the model to make comparisons strictly within each stratum. The difference is that the model output doesn’t report the effect of the fixed-effect; it just makes the comparisons without much ado.
The other difference is more conceptual: To fix the time-problem, we basically create dummies (strata) for different slices of the duration/spell. That is, we manually ask the model to change the level of the base-line hazard for slices of the data.
The following is a dirty hack insofar as it is completely informed by the data (i.e. I am looking to describe the data appropriately, assuming it is noise for my theory), but it can work miracles. Look at the Schoenfeld residuals plot again. Something seems to happen around 75 weeks into the campaign. We saw that already in the descriptive statistics thanks to the Kaplan-Meier curves: the challenging of candidates picks up speed/the challenger rate increases. Here, we can see a slight bend upwards for both iv and ec.
To do this, we also need to split the data into new, appropriate periods/observations according to this new definition. The survSplit() function allows me to make many manual turning points and reshape the data accordingly. I store the output in a new dataframe.
df1 <- survSplit(Surv(start, te, cut_hi) ~ .,
                 data = df,
                 cut = c(73),
                 #The name of the new "dummy"
                 episode = "stratum")

#compare
dim(df); dim(df1)
## [1] 1376 9
## [1] 1519 10
I now have a new dataset with new observations, and we can estimate our model anew, but tell R that the baseline hazard is different towards the end; i.e. in our two strata (before and after week 73).
mod.full.strata.cox <- coxph(Surv(start, te, cut_hi) ~
                               #warchest
                               ec
                             #prior vote share
                             + iv
                             #democrats
                             + dem
                             #from southern US states
                             + south
                             #Stratum
                             + strata(stratum),
                             ties = "breslow",
                             tt = function(x, t, ...){
                               x*log(t)},
                             id = caseid,
                             x = T,
                             na.action = "na.omit",
                             df1)
summary(mod.full.strata.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + iv + dem + south +
## strata(stratum), data = df1, na.action = "na.omit", ties = "breslow",
## x = T, tt = function(x, t, ...) {
## x * log(t)
## }, id = caseid)
##
## n= 1519, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -3.0088853 0.0493467 1.3864091 -2.170 0.030 *
## iv -6.9736554 0.0009362 1.6574337 -4.208 2.58e-05 ***
## dem 0.2340996 1.2637703 0.3239421 0.723 0.470
## south -0.4380054 0.6453223 0.4239028 -1.033 0.301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.0493467 20.2648 3.259e-03 0.74709
## iv 0.0009362 1068.1200 3.636e-05 0.02411
## dem 1.2637703 0.7913 6.698e-01 2.38456
## south 0.6453223 1.5496 2.812e-01 1.48116
##
## Concordance= 0.776 (se = 0.042 )
## Likelihood ratio test= 34.11 on 4 df, p=7e-07
## Wald test = 25.01 on 4 df, p=5e-05
## Score (logrank) test = 26.32 on 4 df, p=3e-05
None of the coefficients changed noticeably; remember, we didn’t really have a problem to begin with, so to us this is good news. That’s unsurprising, since we essentially just added some fixed effects into the mix. The test statistics and the log likelihood are virtually unchanged, while the \(R^2\) is slightly lower.
Linear fixes
The stratification provides manual and discrete fixes to the problem that the baseline hazard and some of the covariates correlate, by changing the (ability of the) base-line hazard (to vary). We could do this with linear predictors instead. That is, we do as we’d do in an OLS: we recode the predictors so that they are no longer linear. I can do this by creating an interaction between x and time, or an interaction between x and x, for example (i.e. a curvilinear effect). The tt= argument allows me to define an R function that is applied to all the variables of my choice in the model. I apply the function by wrapping them in tt().
Take 1: Here, I recode the relationship between the candidate’s past vote share (iv) and itself, such that I get a curvilinear effect.
mod.full.curv.cox <- coxph(Surv(start, te, cut_hi) ~
                             #warchest
                             ec
                           #prior vote share
                           + iv
                           + tt(iv)
                           #democrats
                           + dem
                           #from southern US states
                           + south,
                           ties = "breslow",
                           tt = function(x, ...){x*x},
                           id = caseid,
                           df)
summary(mod.full.curv.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + iv + tt(iv) +
## dem + south, data = df, ties = "breslow", tt = function(x,
## ...) {
## x * x
## }, id = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -2.906e+00 5.471e-02 1.380e+00 -2.105 0.0353 *
## iv -2.645e+01 3.262e-12 1.192e+01 -2.219 0.0265 *
## tt(iv) 1.377e+01 9.529e+05 8.224e+00 1.674 0.0941 .
## dem 1.817e-01 1.199e+00 3.263e-01 0.557 0.5777
## south -5.170e-01 5.963e-01 4.270e-01 -1.211 0.2260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 5.471e-02 1.828e+01 3.657e-03 8.185e-01
## iv 3.262e-12 3.066e+11 2.343e-22 4.540e-02
## tt(iv) 9.529e+05 1.049e-06 9.525e-02 9.534e+12
## dem 1.199e+00 8.339e-01 6.326e-01 2.273e+00
## south 5.963e-01 1.677e+00 2.582e-01 1.377e+00
##
## Concordance= 0.773 (se = 0.041 )
## Likelihood ratio test= 36.6 on 5 df, p=7e-07
## Wald test = 32.78 on 5 df, p=4e-06
## Score (logrank) test = 40.56 on 5 df, p=1e-07
Take 2: Here, I create an interaction term. That is, I recode the relationship between time and the candidate’s past vote share (iv) such that the effect of x is proportional/exponential with time.
mod.full.tt.cox <- coxph(Surv(start, te, cut_hi) ~
                           #warchest
                           ec
                         #prior vote share
                         + tt(iv)
                         #democrats
                         + dem
                         #from southern US states
                         + south,
                         ties = "breslow",
                         tt = function(x, t, ...){
                           x*log(t)},
                         id = caseid,
                         df)
summary(mod.full.tt.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + tt(iv) + dem +
## south, data = df, ties = "breslow", tt = function(x, t, ...) {
## x * log(t)
## }, id = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -3.12756 0.04382 1.38792 -2.253 0.0242 *
## tt(iv) -1.67940 0.18649 0.41627 -4.034 5.47e-05 ***
## dem 0.22727 1.25517 0.32392 0.702 0.4829
## south -0.42788 0.65189 0.42360 -1.010 0.3124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.04382 22.8182 0.002886 0.6654
## tt(iv) 0.18649 5.3623 0.082474 0.4217
## dem 1.25517 0.7967 0.665246 2.3682
## south 0.65189 1.5340 0.284188 1.4954
##
## Concordance= 0.771 (se = 0.044 )
## Likelihood ratio test= 32.05 on 4 df, p=2e-06
## Wald test = 23.74 on 4 df, p=9e-05
## Score (logrank) test = 24.71 on 4 df, p=6e-05
This latter solution is the classic way of fixing violations of the proportional hazard assumption. You can define the shape of the relationship yourself, often by inferring it from the Schoenfeld residuals plot.
stargazer(mod.full.cox, mod.full.strata.cox,
          mod.full.curv.cox, mod.full.tt.cox,
          colnames = T,
          title = "Different fixes to violations of the proportional hazard assumption",
          type = "html")
Different fixes to violations of the proportional hazard assumption

                       Dependent variable: Surv(start, te, cut_hi)
                       (1)                  (2)                  (3)                  (4)
ec                     -3.025**             -3.009**             -2.906**             -3.128**
                       (1.387)              (1.386)              (1.380)              (1.388)
iv                     -7.016***            -6.974***            -26.449**
                       (1.661)              (1.657)              (11.917)
tt(iv)                                                           13.767*              -1.679***
                                                                 (8.224)              (0.416)
dem                    0.236                0.234                0.182                0.227
                       (0.324)              (0.324)              (0.326)              (0.324)
south                  -0.440               -0.438               -0.517               -0.428
                       (0.424)              (0.424)              (0.427)              (0.424)
Observations           1,376                1,519                1,376                1,376
R2                     0.025                0.022                0.026                0.023
Max. Possible R2       0.268                0.246                0.268                0.268
Log Likelihood         -197.136             -197.388             -196.143             -198.422
Wald Test              25.180*** (df = 4)   25.010*** (df = 4)   32.780*** (df = 5)   23.740*** (df = 4)
LR Test                34.375*** (df = 4)   34.113*** (df = 4)   36.604*** (df = 5)   32.046*** (df = 4)
Score (Logrank) Test   26.489*** (df = 4)   26.317*** (df = 4)   40.557*** (df = 5)   24.715*** (df = 4)

Note: *p<0.1; **p<0.05; ***p<0.01
For each attempt at fixing the problem, I would obviously go back and check whether the proportional hazard assumption is now respected. Unfortunately, the cox.zph() function doesn’t handle tt() definitions.
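One possible workaround (a sketch under assumptions, not part of the workflow above) is to split the data at every observed event time with survSplit() and include the time interaction as an ordinary covariate, which cox.zph() can then test:

#Split at the observed event times, so each row ends at an event or censoring time
df2 <- survSplit(Surv(start, te, cut_hi) ~ .,
                 data = df,
                 cut = unique(df$te[df$cut_hi == 1]),
                 episode = "interval")

#The interaction with log(time) is now an ordinary covariate
mod.manual.tt <- coxph(Surv(start, te, cut_hi) ~ ec + iv + iv:log(te) + dem + south,
                       ties = "breslow",
                       data = df2)
cox.zph(mod.manual.tt)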
Overidentification
In an overidentified Cox proportional hazards model, where there are more predictors than events, the prediction interval for survival times will typically be wider than in a model that is not overidentified. This is because the uncertainty in the estimated regression coefficients is greater in an overidentified model, leading to wider confidence intervals for the hazard ratios.
The prediction interval is a measure of the uncertainty in the predicted survival time for a given individual, taking into account both the uncertainty in the estimated regression coefficients and the variability in the baseline hazard function. In an overidentified Cox model, the prediction interval will be wider because there is greater uncertainty in the estimated regression coefficients, which in turn affects the predicted survival times.
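A common rule of thumb (it is only a heuristic) is to have roughly ten events per estimated parameter. A quick check for the full model:

#Events per estimated coefficient
sum(df$cut_hi) / length(coef(mod.full.cox))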