Models for event history
In this session, we will be analyzing event history data. These are models that take into account a combination of two outcomes of interest: the time it takes for an event to happen, and whether or not this event occurred. For most of the applications, we will focus on the Cox proportional hazard model.
We will also familiarize ourselves with the ggeffects package, which allows you to make predictions from various models and estimate their uncertainty.
We will familiarize ourselves with:
how to estimate different event history models
- Cox proportional hazard models from the survival package and a brief run-through of parametric models
event history model interpretation
- the flipping between baseline hazards and survival time
- predicting outcomes and estimating their uncertainty using ggeffects
model assessment
- the assumption of proportional hazards, i.e. that coefficients should not depend on where in the spell we are.
We will be working with data from Box-Steffensmeier, Reiter, and Zorn (2003) on election campaigns in the US. They argue that the campaign funding that candidates have (i.e. their “warchest”) may deter potential challenger candidates from entering the electoral competition. Specifically, they observe incumbent candidates in the period leading up to the 1990 election for the US House of Representatives. They measure the candidates’ cash on hand (i.e. money at their disposal) and whether and when a challenger candidate enters the race.
We start by loading in the data and relevant packages.
#Data wrangling and plotting
library(dplyr); library(ggplot2)
#Set ggplot theme to theme_minimal; a question of taste
theme_set(theme_minimal())
#Event history analysis/survival models
library(survival)
Download the data and put it in your working directory.
download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/warchest.rda",
destfile = "warchest.rda")
Now you can load it into RStudio and rename the object.
#Load in data
load("warchest.rda")
#Rename
df <- warchest
Introduction: explore the data
The authors measure the time it takes before a challenger enters the electoral race. The intuition is that a well-resourced incumbent candidate will deter challengers, such that the time it takes for a challenger to surface is higher if the candidate has a large “warchest”; in fact, the event may never happen.
Univariate statistics
Check out the size of the data and the number of variables.
dim(df)
## [1] 1376 7
Have a sneak peek at the data.
head(df)
We see that there are 7 variables. The data splits observations into periods in which the incumbent candidate’s warchest is observed. A new period starts when a challenger surfaces. The caseid variable reports the identity of each candidate. The te variable measures the number of weeks of the spell, while cut_hi reports whether a challenger entered the race. Our main predictor (the warchest) is reported in ec.
df$caseid %>% unique %>% length
## [1] 397
There are 397 candidates running for election, although there are 1376 observations in the data.
The number of events
table(df$cut_hi)
##
## 0 1
## 1336 40
We see that there were 40 challengers entering the race. In other words, it is a fairly rare event.
We can check whether there are several events per incumbent candidate.
tab <-
  df %>%
  group_by(caseid) %>%
  summarize(challengers = sum(cut_hi))

tab$challengers %>% table
## .
## 0 1
## 357 40
There isn’t. The study stops when an incumbent is challenged, meaning this is not a repeated event study.
Duration
When observations are split into several sub-periods, it is slightly more challenging to explore the duration of each spell. For descriptive purposes, I create a new data set at the spell level. The te variable already reports the cumulative duration per candidate, so it suffices to retain the cases with the highest duration.
df0 <-
  df %>%
  #... within each candidate
  group_by(caseid) %>%
  #Retain the highest count
  slice_max(te)

df0$te %>%
  hist(.,
       main = "Number of weeks until an incumbent is challenged or the election is held")
A combined outcome variable: segments for durations
We are interested in the combination of these two outcomes. How can we explore that? One way is to plot the durations and flag the ones that end in a challenge.
The base-line code is this.
df0 %>%
  ggplot() +
  #Plot a segment per incumbent candidate
  geom_segment(aes(
    #The segment starts at week 0
    x = 0,
    #... and ends at the end of the study
    xend = te,
    #y axis reports the candidate id
    y = caseid %>% as.factor,
    yend = caseid %>% as.factor,
    color = cut_hi %>% as.factor),
    position = "identity")
We see that most candidates last (“survive”) for a long time, and that the number of challenges is small relative to the number of candidates. The plot is too crammed, however, so let’s do some partitioning and aesthetic tweaks.
I split the data into Democrats and Republicans. Here, I only do the Democrats.
df0 %>%
  #There are so many candidates, that I split them by party
  filter(dem == 1) %>%
  #I give the challenger variable some more human values
  mutate(challenger = if_else(cut_hi == 1,
                              "Challenger enters",
                              "No challenge")) %>%
  ggplot() +
  #Plot a segment per incumbent candidate
  geom_segment(aes(
    #The segment starts at week 0
    x = 0,
    #... and ends at the end of the study
    xend = te,
    #y axis reports the candidate id; here I order them according to their duration
    y = reorder(caseid, te),
    yend = reorder(caseid, te),
    color = challenger %>% as.factor),
    position = "identity",
    #Transparent segments with a size
    alpha = 0.5,
    size = 0.8) +
  #I parse out the events in a different pane
  facet_grid(cols = vars(challenger)) +
  scale_color_manual(values = c("blue", "black"),
                     name = "Challenger enters") +
  #Smaller text on the y axis
  theme(axis.text = element_text(size = 5),
        #suppress legend
        legend.position = "none") +
  ggtitle("Campaign duration and challenger entry",
          subtitle = "Among DEMOCRATS") +
  #no y-axis label
  ylab("") +
  xlab("Campaign duration (weeks)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
A combined variable for censored data
We are now moving into the domain of event history functions. Let’s explore how R combines the two outcomes when the estimation is done. Note that I am now working in the main data frame again (df).
The Surv() function in the survival package creates a new, combined dependent variable for us. (If you haven’t installed the package yet, you will have to do so before you fetch it with library()).
We report the duration with start and stop times (time= and time2=) and whether or not an event has taken place at the end of this period (event=).
In this data, there is no starting time. The te
variable
only reports the duration at the end of the period for the observation,
so we need to create a starting point. This starting point is either 0
(the first risk set) or the end-point of the previous risk set. In other
words, we need to do some recoding where we calculate the lag of the
end-time.
df <-
  df %>%
  #Sort data so that first event is first etc
  arrange(caseid, te) %>%
  #within each candidate
  group_by(caseid) %>%
  #find the previous duration at the end of the period
  mutate(start = lag(te))

#First period starts at 0
df$start[is.na(df$start)] = 0
Now, we are good to go. R can label each survival time/duration according to whether or not it is censored (+).
Surv(time = df$start, time2 = df$te, event = df$cut_hi) %>% head
## [1] ( 0,26+] (26,50] ( 0,26+] (26,53+] (53,65+] (65,75+]
This will constitute the dependent variable in all of our analyses. It’s a combination of three pieces of information.
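To see the same notation in the simplest setting (one row per candidate, no start/stop split), here is a minimal sketch using the spell-level data frame df0 we created above:

#Censored spells (no challenger before the election) are marked with a +
Surv(time = df0$te, event = df0$cut_hi) %>% head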
We can explore the base-line hazard/survival in a model with no covariates by creating a Kaplan-Meier curve. It is somewhat of a homologue to the geom_smooth() statistics in ggplot2: it estimates hazard rates and survival curves for us from the empirical distribution. That is, it estimates the survival rate at each event over time. It gives us a sneak peek of the shape of the base-line hazard. Of course, our predictors may change that shape, but it gives us an indication of the distribution, just as a histogram would do. Incidentally, it also happens to function the same way as the Cox proportional hazard model.
KaplanMeier <- survfit(Surv(start, te, cut_hi) ~ 1, data = df)
KaplanMeier
## Call: survfit(formula = Surv(start, te, cut_hi) ~ 1, data = df)
##
## records n events median 0.95LCL 0.95UCL
## [1,] 1376 397 40 NA NA NA
The R object tells us how many observations we have and how many events. The object is a list containing a lot of information. It reports the cumulative hazard (the increasing hazard of being challenged), the time and the survival rates.
Among other things, it reports the risk sets. Risk sets are the set of people/observations that are “alive” at time \(t\), running the risk of being challenged (i.e. experiencing the event). These are the “mini-datasets” that R uses when it calculates its “mini-logits”. For each risk set, a model is estimated. The \(\beta\)s are then aggregated/averaged over to report a single effect of \(x\) with a single slope coefficient (\(\beta\)). In other words, it is all the unique duration periods where an event happened or – at the end of the study – did not happen. We could have calculated that from our df0.
KaplanMeier$n.risk %>% length
## [1] 80
df0$te %>% unique %>% length
## [1] 80
We can plot the descriptive statistics in many ways. We can do it by hand simply by extracting the information ourselves. The R object contains all of these variables: upper (the confidence interval), lower, time and surv (the mean predicted survival rate).
ggplot() +
geom_ribbon(aes(ymin = KaplanMeier$lower,
ymax = KaplanMeier$upper,
x = KaplanMeier$time),
alpha = 0.2) +
geom_line(aes(x = KaplanMeier$time,
y = KaplanMeier$surv)) +
ggtitle("Kaplan-Meyer: a base-line survival curve")
There are ready-mades that can do this for you in no time. Here, in base R.

KaplanMeier %>%
  plot
Or in ggplot, using the ggsurvfit package.
library(ggsurvfit)
## Warning: package 'ggsurvfit' was built under R version 4.2.3
ggsurvfit(KaplanMeier) +
add_confidence_interval()
It is clear that most incumbent candidates remain in the race unchallenged all the way through to the election. We see that from the survival probability at the end of the study.
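If we want to put a number on it, a rough check is to pull out the last estimated survival rate from the survfit object (a small sketch; it simply takes the last element of the surv component):

#Estimated probability of remaining unchallenged at the end of the study
tail(KaplanMeier$surv, 1)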
Bivariate statistics
It is always useful to explore the bi- and tri-variate relationships in the data. It gives you an idea about your future results, but also challenges and opportunities that might occur.
Segments
Using the segments code from the previous section, I could split or color the data according to their values on the main predictors. We’re moving towards bivariate exploration.
For fun, I recode the warchest into rich/not rich incumbents.
df0$ec %>% hist

df0$rich <- ifelse(df0$ec > 0.25, "rich", "not rich")
… and color code accordingly. The code is mostly a copy-paste from previously. The only new things here are the color specification and moving the legend to the bottom of the plot.
df0 %>%
  #There are so many candidates, that I split them by party
  filter(dem == 1) %>%
  #I give the challenger variable some more human values
  mutate(challenger = if_else(cut_hi == 1,
                              "Challenger enters",
                              "No challenge")) %>%
  ggplot() +
  #Plot a segment per incumbent candidate
  geom_segment(aes(
    #The segment starts at week 0
    x = 0,
    #... and ends at the end of the study
    xend = te,
    #y axis reports the candidate id; here I order them according to their duration
    y = reorder(caseid, te),
    yend = reorder(caseid, te),
    color = rich),
    position = "identity",
    #Transparent segments with a size
    alpha = 0.5,
    size = 0.8) +
  #I parse out the events in a different pane
  facet_grid(cols = vars(challenger)) +
  scale_color_manual(values = c("blue", "black"),
                     name = "Warchest") +
  #Smaller text on the y axis
  theme(axis.text = element_text(size = 5),
        #move legend
        legend.position = "bottom") +
  ggtitle("Campaign duration and challenger entry",
          subtitle = "Among DEMOCRATS") +
  #no y-axis label
  ylab("") +
  xlab("Campaign duration (weeks)")
Now, we can clearly see that the density of “rich” candidates is higher in the “No challenge” group, and high up in the plot where the long durations are. In other words, the idea that rich candidates deter newcomers is visible both in the duration (somewhat) and in the event/no event data.
Your turn
Can you do this for Republicans? And give the bars a continuous color range according to the size of their warchest?
Kaplan-Meier
We can also partition the data into groups and create different Kaplan-Meier curves for subgroups of the data. This is only useful for categorical predictors, so I recode my data again. Note that I am now working in the main data frame (df).
#Create a bin
df$rich <- ifelse(df$ec > 0.25, "rich", "not rich")

KaplanMeier <- survfit(Surv(start, te, cut_hi) ~ rich, data = df)

#Plot using the ggsurvfit package
ggsurvfit(KaplanMeier) +
  add_confidence_interval() +
  ggtitle("Base-line survival rate using ggsurvfit")
Here, we see that the “not rich” candidates have a higher probability of being challenged, and that these challenges seem to arrive after a year (50-75 weeks). The richest candidates have a very high probability of survival (no challenge) throughout the campaign, although – there too – there’s a bit of a rush in challenges towards the end.
Fitting the model
A base-line model with one predictor
Cox proportional hazard
The Cox proportional hazard model is semi-parametric in the sense that it simply uses the empirical base-line hazard (survival rate; number of observations still “alive” and those that experience the event at time \(t\)), once our predictors have accounted for their variation.
mod.cox <- coxph(Surv(start, te, cut_hi) ~
                   ec
                 + cluster(caseid),
                 df)

summary(mod.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec, data = df, cluster = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) robust se z Pr(>|z|)
## ec -3.68704 0.02505 1.35286 1.38564 -2.661 0.00779 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.02505 39.93 0.001657 0.3786
##
## Concordance= 0.665 (se = 0.046 )
## Likelihood ratio test= 10.8 on 1 df, p=0.001
## Wald test = 7.08 on 1 df, p=0.008
## Score (logrank) test = 6.57 on 1 df, p=0.01, Robust = 12.71 p=4e-04
##
## (Note: the likelihood ratio and score tests assume independence of
## observations within a cluster, the Wald and robust score tests do not).
The model output gives us quite a lot of information. It reports the raw coefficient, i.e. the log hazard ratio, but also the hazard ratio, the standard error, the z-statistic and the p-value. In many studies, the results are reported as hazard ratios (the exponentiated coefficient), but the standard error is not transformed. This is not very useful: how can we compare the size of the coefficient and its estimated variation (s.e.)? It is usually better to report the untransformed coefficient (or both) together with the standard error. Alternatively, you might report the confidence interval.
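If we do want hazard ratios, one option is to pair them with their exponentiated confidence intervals instead of untransformed standard errors. A minimal sketch for the base-line model, relying on the generic coef() and confint() extractors:

#Hazard ratios with 95% confidence intervals
exp(cbind("Hazard ratio" = coef(mod.cox), confint(mod.cox)))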
Textual interpretation
The outcome here is a “hazard rate”. We can see that the effect of increasing the warchest is negative. It means that richer candidates have a lower hazard rate (i.e. they have a lower risk of being challenged) and therefore a higher survival rate.
Providing a textual interpretation can be challenging. All the effects are proportional to the duration/timing of the event, and so proportional to the base-line hazard (which is not estimated).
1-exp(mod.cox$coefficients[1]*0.1)
## ec
## 0.3083697
One attempt can be something like this: Holding all covariates constant, we find that a 100,000 dollar increase in an incumbent’s warchest implies a 31% lower hazard of being challenged. Put differently: at any point during the campaign, the rate at which such a candidate attracts a challenger is about 31% lower, which translates into a longer expected time before a challenger enters.
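The calculation generalizes to any change in the predictor: the hazard ratio for a change of \(\Delta x\) is \(\exp(\beta \Delta x)\). A small sketch, using the same hypothetical 0.1-unit increase as above:

#Percent change in the hazard for a 0.1-unit increase in the warchest
delta <- 0.1
100*(exp(coef(mod.cox)["ec"]*delta) - 1)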
The full model
Here, I replicate Box-Steffensmeier’s (1996) model.
mod.full.cox <- coxph(Surv(start, te, cut_hi) ~
                        #warchest
                        ec
                      #prior vote share
                      + iv
                      #democrats
                      + dem
                      #from southern US states
                      + south,
                      ties = "efron",
                      id = caseid,
                      df)
summary(mod.full.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + iv + dem + south,
## data = df, ties = "efron", id = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -3.0250756 0.0485541 1.3866594 -2.182 0.0291 *
## iv -7.0156709 0.0008977 1.6607636 -4.224 2.4e-05 ***
## dem 0.2356684 1.2657545 0.3239191 0.728 0.4669
## south -0.4398328 0.6441441 0.4238880 -1.038 0.2994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.0485541 20.596 3.206e-03 0.73545
## iv 0.0008977 1113.954 3.463e-05 0.02327
## dem 1.2657545 0.790 6.709e-01 2.38820
## south 0.6441441 1.552 2.807e-01 1.47842
##
## Concordance= 0.776 (se = 0.042 )
## Likelihood ratio test= 34.37 on 4 df, p=6e-07
## Wald test = 25.18 on 4 df, p=5e-05
## Score (logrank) test = 26.49 on 4 df, p=3e-05
#Regression table; requires the stargazer package
library(stargazer)

stargazer(mod.cox, mod.full.cox,
          type = "html")
                       Dependent variable: Surv(start, te, cut_hi)
                       (1)                  (2)
ec                     -3.687***            -3.025**
                       (1.353)              (1.387)
iv                                          -7.016***
                                            (1.661)
dem                                         0.236
                                            (0.324)
south                                       -0.440
                                            (0.424)
Observations           1,376                1,376
R2                     0.008                0.025
Max. Possible R2       0.268                0.268
Log Likelihood         -208.923             -197.136
Wald Test              7.080*** (df = 1)    25.180*** (df = 4)
LR Test                10.802*** (df = 1)   34.375*** (df = 4)
Score (Logrank) Test   6.569** (df = 1)     26.489*** (df = 4)

Note: *p<0.1; **p<0.05; ***p<0.01
Clustering and frailties?
When the data is split into a panel, as is the case here, each candidate is observed several times. However, in event history models we don’t necessarily account for these additional observations. This is obvious in single-event studies with continuous time, where there is only one duration (and one line in the data) per candidate. But even when the data is split into a panel with only one event, as long as the candidate is present only once per risk set, we don’t have to account for unobserved heterogeneity.
If we had been in the presence of a multiple-events study, where each candidate can go in and out of spells (being challenged several times), we would account for it by adding a “frailty” term. This is equivalent to a random-intercept model. We would have done that with the function frailty(caseid). Here, this would overidentify the model, so it is not possible. Another possibility in all models is to account for correlations in the observations by clustering the standard errors, but leaving the regression coefficients intact. We can do this directly with the cluster() function. I could, for example, cluster per candidate: cluster(caseid).
In this study, I do neither.
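For reference, a minimal sketch of what the two options would look like in the model formula (shown for illustration only; as noted above, the frailty version would be overidentified with these data):

#Shared frailty per candidate (analogous to a random intercept)
mod.frailty <- coxph(Surv(start, te, cut_hi) ~ ec + frailty(caseid), data = df)

#Clustered standard errors per candidate (coefficients unchanged, robust s.e.)
mod.clustered <- coxph(Surv(start, te, cut_hi) ~ ec + cluster(caseid), data = df)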
Parametric models
Parametric models rely on assumptions about the distributional form of the base-line hazard. The form is estimated using a probability distribution (e.g. exponential) or parameters that approximate the distribution (e.g. Weibull), and the models are named after this approach.
Parametric models can also use the start/stop structure on the dependent variable. However, those implemented in survival cannot, so here we rely on the eha package and its phreg() function.
library(eha)
Weibull model
The Weibull survival model is a commonly used parametric model in survival analysis. Just like the Cox model, it allows for the estimation of the hazard function of a survival time. It assumes that the hazard function follows a Weibull distribution. It has two parameters: a scale parameter (often denoted as \(\lambda\)) and a shape parameter (often denoted as \(k\)).
The Weibull distribution has a monotonic hazard function, meaning that the hazard either increases or decreases over time, depending on the value of the shape parameter.
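As a point of reference, in the standard parameterization (eha's internal scaling may differ slightly) the Weibull hazard can be written as

\[
h(t) = \frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1},
\]

so a shape parameter \(k > 1\) implies a hazard that rises over time, \(k < 1\) a falling hazard, and \(k = 1\) a constant hazard (the exponential model below).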
mod.weibull <- phreg(Surv(start, te, cut_hi) ~
                       ec
                     + iv
                     + dem
                     + south,
                     dist = "weibull",
                     df)
These parameters are reported in the output.
The exponential model
The exponential model is just a sub-category of the Weibull model. While the Weibull usually estimates its two parameters from the data, we can also force the shape parameter to 1. The result is an assumption that the hazard rate is constant over time. This means that the probability of an event occurring in a given time interval is the same regardless of how long the candidate has been in the study. The exponential model is often used as a baseline model for comparison with more complex models.
mod.exponential <- phreg(Surv(start, te, cut_hi) ~
                           ec
                         + iv
                         + dem
                         + south,
                         dist = "weibull",
                         shape = 1,
                         df)
stargazer(mod.full.cox, mod.weibull, mod.exponential,
# mod.full.curv.cox, mod.full.tt.cox,
colnames = T,
title = "Survival models",
type = "html")
Survival models

                       Dependent variable: Surv(start, te, cut_hi)
                       Cox prop. hazards    Parametric prop. hazards
                       (1)                  (2) Weibull          (3) Exponential
ec                     -3.025**             -2.583*              -1.189
                       (1.387)              (1.380)              (1.158)
iv                     -7.016***            -6.443***            -6.048***
                       (1.661)              (1.629)              (1.634)
dem                    0.236                0.231                0.186
                       (0.324)              (0.324)              (0.325)
south                  -0.440               -0.532               -0.524
                       (0.424)              (0.420)              (0.418)
log(scale)                                  3.189***             2.141**
                                            (0.449)              (1.047)
log(shape)                                  0.890***
                                            (0.143)
Observations           1,376                1,376                1,376
R2                     0.025
Max. Possible R2       0.268
Log Likelihood         -197.136             -272.195             -286.295
Wald Test              25.180*** (df = 4)
LR Test                34.375*** (df = 4)
Score (Logrank) Test   26.489*** (df = 4)

Note: *p<0.1; **p<0.05; ***p<0.01
How do I choose the right model? If I have a clear theory about the shape of the hazard function, then I have a fairly easy time choosing a model (based on theory). Often this is not the case: the variation in the hazard over the duration of the study is often noise/a disturbance in relation to our research question. This is why people often opt for the Cox model. However, if the Cox model underperforms (i.e. it is overidentified, has many ties, etc.) or we are looking to make predictions, we might opt for the parametric models (remember, the Cox model sticks close to the data in terms of the base-line hazard).
The choice between different parametric models often ends up being a comparison of which one performs the best: how good are their predictions and overall fit?
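One quick way to compare the two parametric fits above is an information criterion. A minimal sketch, assuming phreg stores the log likelihood (at zero and at the maximum) in its loglik component, as coxph does; the Weibull estimates one more parameter (the shape) than the exponential:

#AIC = 2k - 2*logL, with k = number of estimated parameters
ll.weibull     <- max(mod.weibull$loglik)
ll.exponential <- max(mod.exponential$loglik)

c(weibull     = 2*(4 + 2) - 2*ll.weibull,
  exponential = 2*(4 + 1) - 2*ll.exponential)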
Lastly, robust research results are not model-dependent. If we find support for our claims in different models, that’s an asset, while if we only find support in one type of model, we need to ask ourselves why.
Visual interpretation
Generally speaking, it can be a challenge to visualize the results from these models, since the baseline hazard varies. That is, everything depends on everything in the model (the time, the other covariates…). Our visualizations are therefore often examples of a very specific scenario. In particular, it is hard to visualize continuous variables. We therefore often end up recoding them for clarity.
Two types of visual interpretations are common.
Make a scenario with the ggeffects package
The ggeffects package can help you make scenarios, estimate uncertainty and make predictions in all of the models we have used thus far in class. Here, I apply it to survival models, but you can do the zero-inflated, the Poisson, the logit etc. with it too. Its output lends itself to ggplotting.
Without any more information, the function will make all kinds of scenarios for you that may or may not make sense.
The function is clunky for survival models. Make sure to specify whether you want a survival or a cumulative hazard rate.
library(ggeffects)

#Predict outcomes
#Remind myself of the coefficients
names(mod.full.cox$coefficients)
## [1] "ec" "iv" "dem" "south"
preds <- ggpredict(mod.full.cox,
                   #specify the model and the backtransformation
                   type = "surv", # or "cumhaz"
                   #the variables I want to vary
                   terms = list(
                     ec = c(0, 1)
                   ),
                   #the predictors I hold constant
                   condition = list(
                     iv = 0.73,
                     dem = 0,
                     south = 0
                   ))
The preds object is a specific form of data frame.
preds %>% class
## [1] "ggeffects" "data.frame"
It contains all we need for illustration: our x (here: time), our predicted values, the boundaries of the confidence interval, and the group variable, which here is my warchest (ec). In other models, the x would obviously be our main predictor as determined by the terms= argument.
head(preds %>% as.data.frame)
Survival curves
The package allows us to plot the output directly.
preds %>%
  plot
Now, it is easy to see why people often avoid reporting the confidence intervals from these models. The predictions are often imprecise.
Here, I do it manually.
preds %>%
  ggplot() +
  geom_line(aes(x = x,
                y = predicted,
                group = group,
                color = group)) +
  ggtitle("Time to challenge: effect of warchests on challenger candidates") +
  xlab("Weeks of campaign") +
  ylab("Survival (remains unchallenged)")
Coefplots with scenarios
You can subset the prediction to time-specific scenarios and make coefplots. That is, the scenario will be very much idiosyncratic to the time period you choose, but it may still communicate the effect of your covariate of choice.
Here, I opted for week 85 of the campaign. At that specific time, you might even create a bunch of predictions by varying your predictor of choice. Here, I go for rich/not rich.
#Subset the predictions to a time point (that exists)
preds1 <- preds %>% filter(x == 85)

#Relabel the grouping variable
preds1$Warchest <- ifelse(preds1$group == 1, "Rich", "Not rich")

preds1 %>%
  ggplot() +
  geom_point(aes(x = Warchest,
                 y = predicted)) +
  geom_errorbar(aes(
    x = Warchest,
    ymax = conf.high,
    ymin = conf.low
  ),
  width = 0.1) +
  ggtitle("Probability of survival 85 weeks after kickoff") +
  ylab("Candidate unchallenged")
Model assessment
The event history models have many assets, but also some Achilles heels. The models we have discussed are all based on the proportional hazard assumption. The models also have some weaknesses of their own:
- proportional hazards: the effect of our predictors (\(\beta\)) is constant over the duration. This assumption is common to all of the models. If the assumption is violated, there are a series of fixes. They usually involve changing the relationship between \(x\) and time through strata (i.e. fixed effects) or interactions.
- ties: the Cox model estimates its effects by ordering the events and comparing the observation that experienced the event with other observations that are still alive, but haven’t yet had an event. If many events happen at the same time, what is the ordering?
- overfitting: all models are prone to overfitting, but the Cox model even more so. Overfitting happens when you have more predictors than the size and variation (in the time and/or event) can vouch for.
- shape of the hazard function: this concerns the parametric models (not Cox).
Proportional hazard assumption
The main assumption of the event history models is the proportional hazard assumption.
The proportional hazards assumption states that the hazard ratio (\(\beta\)) is constant over the entire duration of the study. That is, the effect of the covariates on the hazard function should not vary over time.
The cox.zph()
function examines the correlation between
the scaled Schoenfeld residuals (a measure of the difference between the
observed and expected values of the covariates at each event time) and
time. This correlation should be zero. The function returns a test
statistic and a p-value for each covariate in the Cox model. If the
p-value is less than 0.05, this suggests that the proportional hazards
assumption may not hold for that covariate.
cox.zph(mod.full.cox)
## chisq df p
## ec 0.554 1 0.46
## iv 1.833 1 0.18
## dem 0.289 1 0.59
## south 0.974 1 0.32
## GLOBAL 3.341 4 0.50
This actually looks good. None of the covariates display any clear correlation with time. Often, it is the time-varying covariates that pose problems. The only variable that changes over the course of the study, however, is ec, the warchest. Yet it looks good.
Now, for the sport of it, let’s assume the iv
variable
is in the red (as of now, it has a p-value of 0.18, which is ok). Before
we consider our options about how to fix the “problem”, we can make a
visual inspection. It gives us an idea about where on the time
horizon the problem occurs.
The cox.zph()
function also produces a plot of the
scaled Schoenfeld residuals against time for each covariate in the
model. Now we can visually assess whether the proportional hazards
assumption holds for each covariate. If the plot shows a clear pattern
or trend over time, this suggests that the proportional hazards
assumption may not hold for that covariate. The function gives us one
plot per covariate. Use the arrow-button in your “Plot”-window to go
through them. Here, I ask R to show them all at once.
#I want to see all four coefficients in a 2x2 matrix
par(mfrow = c(2,2))
mod.full.cox %>%
  cox.zph() %>%
  plot
#Go back to one window == one plot
par(mfrow = c(1,1))
Fixing the problem
If the proportional hazards assumption is not met for one or more covariates in the Cox model, there are several strategies that can be used to address the violation. These include stratifying on the covariate, including interaction terms between the covariate and time, or using time-varying coefficients.
Strata
Stratification in the event history models functions as fixed-effects. We can ask the model to make comparisons strictly within each stratum. The difference is that the model output doesn’t report the effect of the fixed-effect; it just makes the comparisons without much ado.
The other difference is more conceptual: To fix the time-problem, we basically create dummies (strata) for different slices of the duration/spell. That is, we manually ask the model to change the level of the base-line hazard for slices of the data.
The following is a dirty hack insofar as it is completely informed by the data (i.e. I am looking to describe the data appropriately, assuming it is noise for my theory), but it can work miracles. Look at the Schoenfeld residuals plot again. Something seems to happen around 75 weeks into the campaign. We saw that already in the descriptive statistics thanks to the Kaplan-Meier curves: the challenging of candidates picks up speed/the challenger rate increases. Here, we can see a slight bend upwards for both iv and ec.
To do this, we also need to split the data into new, appropriate periods/observations according to this new definition. The survSplit() function allows me to make many manual turning points and reshape the data accordingly. I store the output in a new dataframe.
df1 <- survSplit(Surv(start, te, cut_hi) ~ .,
                 data = df,
                 cut = c(73),
                 #The name of the new "dummy"
                 episode = "stratum")

#compare
dim(df); dim(df1)
## [1] 1376 9
## [1] 1519 10
I now have a new dataset with new observations, and we can estimate our model anew, but tell R that the baseline hazard is different towards the end; i.e. in our two strata (before and after week 73).
mod.full.strata.cox <- coxph(Surv(start, te, cut_hi) ~
                               #warchest
                               ec
                             #prior vote share
                             + iv
                             #democrats
                             + dem
                             #from southern US states
                             + south
                             #Stratum
                             + strata(stratum),
                             ties = "breslow",
                             tt = function(x, t, ...){
                               x*log(t)},
                             id = caseid,
                             x = T,
                             na.action = "na.omit",
                             df1)
summary(mod.full.strata.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + iv + dem + south +
## strata(stratum), data = df1, na.action = "na.omit", ties = "breslow",
## x = T, tt = function(x, t, ...) {
## x * log(t)
## }, id = caseid)
##
## n= 1519, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -3.0088853 0.0493467 1.3864091 -2.170 0.030 *
## iv -6.9736554 0.0009362 1.6574337 -4.208 2.58e-05 ***
## dem 0.2340996 1.2637703 0.3239421 0.723 0.470
## south -0.4380054 0.6453223 0.4239028 -1.033 0.301
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.0493467 20.2648 3.259e-03 0.74709
## iv 0.0009362 1068.1200 3.636e-05 0.02411
## dem 1.2637703 0.7913 6.698e-01 2.38456
## south 0.6453223 1.5496 2.812e-01 1.48116
##
## Concordance= 0.776 (se = 0.042 )
## Likelihood ratio test= 34.11 on 4 df, p=7e-07
## Wald test = 25.01 on 4 df, p=5e-05
## Score (logrank) test = 26.32 on 4 df, p=3e-05
None of the coefficients changed noticeably; remember, we didn’t really have a problem to begin with, so to us this is good news. That’s unsurprising, since we essentially just added some fixed effects into the mix. The test statistics and the log likelihood are virtually unchanged, while the \(R^2\) is slightly lower.
Linear fixes
The stratification provides manual and discrete fixes to the problem that the baseline hazard and some of the covariates correlate, by changing the (ability of the) base-line hazard (to vary). We could do this with linear predictors instead. That is, we do as we’d do in an OLS: we recode the predictors so that they are no longer linear. I can do this by creating an interaction between x and time, or an interaction between x and x, for example (i.e. a curvilinear effect). The tt= argument allows me to define an R function that is applied to all the variables of my choice in the model. I apply the function by wrapping them in tt().
Take 1: Here, I recode the relationship between the candidate’s past vote share (iv) and itself, such that I get a curvilinear effect.
mod.full.curv.cox <- coxph(Surv(start, te, cut_hi) ~
                             #warchest
                             ec
                           #prior vote share
                           + iv
                           + tt(iv)
                           #democrats
                           + dem
                           #from southern US states
                           + south,
                           ties = "breslow",
                           tt = function(x, ...){x*x},
                           id = caseid,
                           df)
summary(mod.full.curv.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + iv + tt(iv) +
## dem + south, data = df, ties = "breslow", tt = function(x,
## ...) {
## x * x
## }, id = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -2.906e+00 5.471e-02 1.380e+00 -2.105 0.0353 *
## iv -2.645e+01 3.262e-12 1.192e+01 -2.219 0.0265 *
## tt(iv) 1.377e+01 9.529e+05 8.224e+00 1.674 0.0941 .
## dem 1.817e-01 1.199e+00 3.263e-01 0.557 0.5777
## south -5.170e-01 5.963e-01 4.270e-01 -1.211 0.2260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 5.471e-02 1.828e+01 3.657e-03 8.185e-01
## iv 3.262e-12 3.066e+11 2.343e-22 4.540e-02
## tt(iv) 9.529e+05 1.049e-06 9.525e-02 9.534e+12
## dem 1.199e+00 8.339e-01 6.326e-01 2.273e+00
## south 5.963e-01 1.677e+00 2.582e-01 1.377e+00
##
## Concordance= 0.773 (se = 0.041 )
## Likelihood ratio test= 36.6 on 5 df, p=7e-07
## Wald test = 32.78 on 5 df, p=4e-06
## Score (logrank) test = 40.56 on 5 df, p=1e-07
Take 2: Here, I create an interaction term. That is, I recode the relationship between time and the candidate’s past vote share (iv) such that the effect of x is proportional/exponential with time.
mod.full.tt.cox <- coxph(Surv(start, te, cut_hi) ~
                           #warchest
                           ec
                         #prior vote share
                         + tt(iv)
                         #democrats
                         + dem
                         #from southern US states
                         + south,
                         ties = "breslow",
                         tt = function(x, t, ...){
                           x*log(t)},
                         id = caseid,
                         df)
summary(mod.full.tt.cox)
## Call:
## coxph(formula = Surv(start, te, cut_hi) ~ ec + tt(iv) + dem +
## south, data = df, ties = "breslow", tt = function(x, t, ...) {
## x * log(t)
## }, id = caseid)
##
## n= 1376, number of events= 40
##
## coef exp(coef) se(coef) z Pr(>|z|)
## ec -3.12756 0.04382 1.38792 -2.253 0.0242 *
## tt(iv) -1.67940 0.18649 0.41627 -4.034 5.47e-05 ***
## dem 0.22727 1.25517 0.32392 0.702 0.4829
## south -0.42788 0.65189 0.42360 -1.010 0.3124
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## ec 0.04382 22.8182 0.002886 0.6654
## tt(iv) 0.18649 5.3623 0.082474 0.4217
## dem 1.25517 0.7967 0.665246 2.3682
## south 0.65189 1.5340 0.284188 1.4954
##
## Concordance= 0.771 (se = 0.044 )
## Likelihood ratio test= 32.05 on 4 df, p=2e-06
## Wald test = 23.74 on 4 df, p=9e-05
## Score (logrank) test = 24.71 on 4 df, p=6e-05
This latter solution is the classic way of fixing violations of the proportional hazard assumption. You can define the shape of the relationship yourself, often by inferring it from the Schoenfeld residuals plot.
stargazer(mod.full.cox, mod.full.strata.cox,
          mod.full.curv.cox, mod.full.tt.cox,
          colnames = T,
          title = "Different fixes to violations of the proportional hazard assumption",
          type = "html")
Different fixes to violations of the proportional hazard assumption

                       Dependent variable: Surv(start, te, cut_hi)
                       (1)                  (2)                  (3)                  (4)
ec                     -3.025**             -3.009**             -2.906**             -3.128**
                       (1.387)              (1.386)              (1.380)              (1.388)
iv                     -7.016***            -6.974***            -26.449**
                       (1.661)              (1.657)              (11.917)
tt(iv)                                                           13.767*              -1.679***
                                                                 (8.224)              (0.416)
dem                    0.236                0.234                0.182                0.227
                       (0.324)              (0.324)              (0.326)              (0.324)
south                  -0.440               -0.438               -0.517               -0.428
                       (0.424)              (0.424)              (0.427)              (0.424)
Observations           1,376                1,519                1,376                1,376
R2                     0.025                0.022                0.026                0.023
Max. Possible R2       0.268                0.246                0.268                0.268
Log Likelihood         -197.136             -197.388             -196.143             -198.422
Wald Test              25.180*** (df = 4)   25.010*** (df = 4)   32.780*** (df = 5)   23.740*** (df = 4)
LR Test                34.375*** (df = 4)   34.113*** (df = 4)   36.604*** (df = 5)   32.046*** (df = 4)
Score (Logrank) Test   26.489*** (df = 4)   26.317*** (df = 4)   40.557*** (df = 5)   24.715*** (df = 4)

Note: *p<0.1; **p<0.05; ***p<0.01
For each attempt at fixing the problem, I would obviously go back and check whether the proportional hazard assumption is now respected. Unfortunately, the cox.zph() function doesn’t handle tt() definitions.
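One possible workaround (a sketch under assumptions, not part of the workflow above) is to split the data at every observed event time with survSplit() and include the time interaction as an ordinary covariate, which cox.zph() can then test:

#Split at the observed event times, so each row ends at an event or censoring time
df2 <- survSplit(Surv(start, te, cut_hi) ~ .,
                 data = df,
                 cut = unique(df$te[df$cut_hi == 1]),
                 episode = "interval")

#The interaction with log(time) is now an ordinary covariate
mod.manual.tt <- coxph(Surv(start, te, cut_hi) ~ ec + iv + iv:log(te) + dem + south,
                       ties = "breslow",
                       data = df2)
cox.zph(mod.manual.tt)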
Overidentification
In an overidentified Cox proportional hazards model, where there are more predictors than events, the prediction interval for survival times will typically be wider than in a model that is not overidentified. This is because the uncertainty in the estimated regression coefficients is greater in an overidentified model, leading to wider confidence intervals for the hazard ratios.
The prediction interval is a measure of the uncertainty in the predicted survival time for a given individual, taking into account both the uncertainty in the estimated regression coefficients and the variability in the baseline hazard function. In an overidentified Cox model, the prediction interval will be wider because there is greater uncertainty in the estimated regression coefficients, which in turn affects the predicted survival times.
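A common rule of thumb (it is only a heuristic) is to have roughly ten events per estimated parameter. A quick check for the full model:

#Events per estimated coefficient
sum(df$cut_hi) / length(coef(mod.full.cox))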