Models of choice: when outcomes are categorical

Many outcomes of interest to political scientists are categorical. These outcomes are often the result of a choice, based on some (latent) preference held by the chooser. Party choice among voters is a typical example, but also self-placement on a left-right scale in a survey can be conceived of as a categorical variable. How can we describe discrete outcomes with a linear equation?

Models for discrete outcomes are estimated by comparing outcomes between a treatment/success group and a reference group; just as with the binomial logit regression. However, they use different reference groups and/or subsets of the data when they recode the outcome before the analysis.

New R codes:

  • new from base: as.factor(), relevel(), reorder(), as.ordered(), predict(., type = "probs"), predict(., type = "class"), colSums()
  • new from nnet package: multinom()
  • new from MASS package: polr()
  • new from ggplot2 package: geom_area()
  • new from dplyr package: bind_rows()

Old R codes:

  • old from base: dim(), table, as.data.frame()
  • old from dplyr package: group_by, mutate(), reframe()
  • old from ggplot: geom_bar(), geom_col, geom_point()
  • old from ggeffects: ggpredict(., terms = , condition = )

Introduction

Main methodological takeaways

The choice of model depends on the measurement level of your outcome:

  • nominal outcome: the variable is categorical, and there is no objective way of ordering the categories. You’re often interested in modeling the utility that choosers derive from the different alternatives they are presented with. The choice of model depends on which perspective you take:

    • choice as a function of chooser characteristics \(\rightarrow\) multinomial logit model. These models are perfect when you are able do distinguish different choosers. For example, you have data on a survey respondent’s political orientation, age, education, income and you’re interested in what party they would vote for in the next election. I.e. the utility derived from a specific alternative will vary across respondents.

    • choice as a function of choice characteristics \(\rightarrow\) conditional logit model. These models are perfect when you are interested in characteristics related to the alternatives that choosers are presented with. However, you think that any chooser would value these charactersitics equally. I.e. the utility derived is the same across all data points.

  • ordinal outcomes: the categories can be ordered from low to high, but you don’t know the distance between each category. This model is somewhat of the middle-child: In contrast to the multinomial model, it retains information about the ordering of categories. In contrast to the linear model, it is adjusted to the reality that outcomes are bounded.

    • predictors’ effect on choice for a higher category is monotonous (i.e. always increasing/decreasing) \(\rightarrow\) ordinal logit model

The data

We will be working on the same data from the European Social Survey (ESS) that I use in my book. You can download it from my website: kap6.rda

download.file("https://siljehermansen.github.io/teaching/beyond-linear-models//kap6.rda",
              destfile = "kap6.rda")
#Load in data
load("kap6.rda")

#Rename
df <- kap6 %>%
  mutate(Education = Uddannelse,
         Income = Indtaegt,
         Female = Kvinde,
         #Reverse the coding
         Burden = (Belastning * -1) + 10,
         Immigrant = Innvandrer)

I start by fetching the R packages we will use from the library.

#Data wrangling
library(dplyr); 
#Presentation
library(ggplot2);
library(stargazer); library(ggeffects)

This time, I’m also setting the aesthetic theme for my ggplots with the theme_set() function. This instructs R to stick to this theme throughout my R session. “Themes” are ready-made templates for how your graphics should look like. There are many such themes. Here, I opted for the theme_minimal(). You can always alter the theme in the theme_minimal() or on a plot-by-plot basis by adding theme(...) to your pipe.

theme_set(theme_minimal())

Many outcomes of interest to political scientists are categorical and cannot immediately be ordered (i.e. they are nominal). These outcomes are often the result of a choice, based on some (latent) preference held by the chooser. Party choice among voters is the most typical example. The theoretical purpose of these models is often to estimate the utility that choosers derive from the different options (categories) that they are presented with. However, you may use them for pure descriptive purposes too.

Multinomial model: choice as a function of chooser characteristics

(Ward and Ahlquist 2018, ch. 9)

Univariate descriptive statistics

I will usually start by mapping out the univariate distribution of the dependent variable in a frequency table and as a barplot. I’ve stored the party names in full (“Parti”) and as an abbreviation (“Party”).

table(df$Party)
## 
##   A  DF   E   K  KF  LA  RV   S  SF   V 
##  17 143  77   8  65  48 134 268 108 311

At this point, I’m scouting for small categories. They might cause problems in the mulitvariate analysis. The multinomial model is estimated as a separate regression for each category, so when the categories are too small, you’ll have soon run into the problem of too many value combinations in the predictors that don’t exist (non-identification). If so, I might merge categories.

I can let ggplot2 report a barplot for me with the simple frequency distribution (geom_bar()).

df %>%
  ggplot +
  geom_bar(aes(Party)) +
  ggtitle("Party choice among ESS respondents")

However, I like to make adjustments to the presentation. In the final presentation, I therefore calculate the height of the bars myself (geom_col()), and take the opportunity to do a few additional tweaks.

df %>%
  group_by(Party) %>%
  reframe(n = n()) %>%
  mutate(N = sum(n),
         p = n/N,
         #Recode NA to a proper category
         Party = if_else(is.na(Party),
                         "Dont know",
                         Party),
         #Define the Party variable as a factor/unordered categorical
         Party = as.factor(Party),
         #For esthetic purposes, make Party into an ordered variable according to size (p)
         Party = reorder(Party, p)) %>%
  #Now, make the plot
  ggplot +
  geom_col(aes(
    #Height = proportion of respondents choosing this party
    y = p,
    #x values: party ordered by size
    x = Party
  )) +
  ylab("Proportion") +
  ggtitle("Party choice among ESS respondents")

Bivariate statistics

Usually, we’d connect the predictor with higher/lower values of the outcome to estimate the bivariate correlation. When one of the two is a categorical variable, we’ll often calculate the group mean. In fact, this is how linear models with categorical predictors are estimated: The regression coefficients are based on the mean value of the outcome within each value of the predictor.

When the outcome is categorical, we can revert this to report the mean value of the predictor for each outcome category. Statistically, all correlations are symmetric.

df %>%
  group_by(Party) %>%
  reframe(Scepticism = mean(Skepsis, na.rm = T)) %>%
  mutate(Party = if_else(is.na(Party),
                          "Dont know",
                          Party),
         Party = as.factor(Party),
         Party = reorder(Party, Scepticism)) %>%
  ggplot +
  geom_col(aes(
    y = Scepticism,
    x = Party
  )
  ) +
  ylim(0,10) +
  ggtitle("Scepticism towards immigration among party supporters")

Multinomial model

Set the reference level

The reference category comes first when we ask R to make a frequency table.

df$Party %>% table
## .
##   A  DF   E   K  KF  LA  RV   S  SF   V 
##  17 143  77   8  65  48 134 268 108 311

I can redefine the reference category. I start by defining my outcome as a categorical variable (as.factor()), then set the reference level (relevel()).

df <- 
  df %>%
  mutate(Party = as.factor(Party),
         Party = relevel(Party, ref = "S"))

Check: The reference category comes first.

df$Party %>% table
## .
##   S   A  DF   E   K  KF  LA  RV  SF   V 
## 268  17 143  77   8  65  48 134 108 311

When I choose reference level, I usually have a few criteria: * it should be a fairly large category (many observations) * it should rank somewhere on the mental scale we use to assess the criteria.

When choosing a party as a reference group, I’d go for a party that is either on the extreme end of an implicit scale or in the middle. Since all comparisons are done with respect to the reference category, it’s usually an asset to have a baseline that is well known to the reader. The descriptive statistics help me in my choice: Both the univariate (i.e. the size of the group) and the bivariate statistics (i.e. some “ranking”).

Estimate the model

The nnet package offers a function for multivariate logit models.

library(nnet)

The function is called multinom().

#Intercept-only
mod0.nom <- multinom(Party ~ 
                      1,
                    df)
## # weights:  20 (9 variable)
## initial  value 2714.747825 
## iter  10 value 2332.511892
## final  value 2326.831829 
## converged
#With predictors
mod1.nom <- multinom(Party ~ 
                      Skepsis
                     + Income
                     + Education
                     + Immigrant
                     + Female,
                    df)
## # weights:  70 (54 variable)
## initial  value 2507.515166 
## iter  10 value 2213.273825
## iter  20 value 2144.466477
## iter  30 value 2071.802382
## iter  40 value 2025.423935
## iter  50 value 2022.401439
## iter  60 value 2022.235229
## iter  70 value 2022.221248
## final  value 2022.221210 
## converged

After the estimation, we may call for a summary of the model object. However, it is slightly less plesant to look at than usual.

summary(mod1.nom)
## Call:
## multinom(formula = Party ~ Skepsis + Income + Education + Immigrant + 
##     Female, data = df)
## 
## Coefficients:
##    (Intercept)     Skepsis       Income    Education   Immigrant      Female
## A    -4.928344  0.28898306  0.075321079  0.040839475   0.8665345 -0.97083266
## DF   -3.218129  0.56484227 -0.007969436 -0.009093293  -0.8969781 -0.63859938
## E    -1.214491 -0.05159780 -0.030202911  0.028376023  -0.1510994  0.01519225
## K    -1.510827 -0.23374916 -0.072302609 -0.021650490 -11.3666588 -0.16085056
## KF   -2.739442  0.07978610  0.125455375  0.034992049  -0.8436163 -0.23819484
## LA   -2.941900  0.06976752  0.139137432  0.042106257  -0.2477992 -1.49084514
## RV   -1.688098 -0.22437478  0.116823317  0.102371601  -0.2008903 -0.32328085
## SF   -1.509715 -0.05712908 -0.018928704  0.071055027  -0.2337257  0.15212100
## V    -1.572977  0.14083006  0.151512520  0.032074457  -0.7957274 -0.51823830
## 
## Std. Errors:
##    (Intercept)    Skepsis     Income  Education    Immigrant    Female
## A    1.3655146 0.15983960 0.10284738 0.06103684 6.882055e-01 0.5666260
## DF   0.5802351 0.07261878 0.04333308 0.02493206 4.947342e-01 0.2304872
## E    0.6759832 0.09022233 0.05099885 0.03031498 4.510340e-01 0.2735720
## K    1.6473564 0.24583091 0.13469888 0.07679220 5.403737e-06 0.7217955
## KF   0.7308503 0.09124773 0.05365609 0.03199630 6.298137e-01 0.2815741
## LA   0.8856141 0.10871385 0.06592124 0.03924939 6.454712e-01 0.4005242
## RV   0.6010844 0.07763143 0.04358292 0.02725140 3.825200e-01 0.2271391
## SF   0.6042029 0.07845397 0.04406638 0.02721589 3.953798e-01 0.2379430
## V    0.4574523 0.05795580 0.03398339 0.01998866 3.576296e-01 0.1797763
## 
## Residual Deviance: 4044.442 
## AIC: 4152.442

You can instead use stargazer() directly to create a regression table. It becomes even more apparent that there is a separate equation for each category (except the Social democrats). It means that the results tables have many columns. It also means that the usual procedure of showing increasingly complex model fits in one table is going to make the table too wide (too many columns). Here, I stick with just the second model (with predictors).

stargazer(mod1.nom,
          type = "text")
## 
## ============================================================================================================
##                                                      Dependent variable:                                    
##                   ------------------------------------------------------------------------------------------
##                       A        DF         E         K         KF        LA        RV        SF         V    
##                      (1)       (2)       (3)       (4)        (5)       (6)       (7)       (8)       (9)   
## ------------------------------------------------------------------------------------------------------------
## Skepsis            0.289*   0.565***   -0.052     -0.234     0.080     0.070   -0.224***  -0.057    0.141** 
##                    (0.160)   (0.073)   (0.090)   (0.246)    (0.091)   (0.109)   (0.078)   (0.078)   (0.058) 
##                                                                                                             
## Income              0.075    -0.008    -0.030     -0.072    0.125**   0.139**  0.117***   -0.019   0.152*** 
##                    (0.103)   (0.043)   (0.051)   (0.135)    (0.054)   (0.066)   (0.044)   (0.044)   (0.034) 
##                                                                                                             
## Education           0.041    -0.009     0.028     -0.022     0.035     0.042   0.102***  0.071***    0.032  
##                    (0.061)   (0.025)   (0.030)   (0.077)    (0.032)   (0.039)   (0.027)   (0.027)   (0.020) 
##                                                                                                             
## Immigrant           0.867    -0.897*   -0.151   -11.367***  -0.844    -0.248    -0.201    -0.234   -0.796** 
##                    (0.688)   (0.495)   (0.451)  (0.00001)   (0.630)   (0.645)   (0.383)   (0.395)   (0.358) 
##                                                                                                             
## Female             -0.971*  -0.639***   0.015     -0.161    -0.238   -1.491***  -0.323     0.152   -0.518***
##                    (0.567)   (0.230)   (0.274)   (0.722)    (0.282)   (0.401)   (0.227)   (0.238)   (0.180) 
##                                                                                                             
## Constant          -4.928*** -3.218***  -1.214*    -1.511   -2.739*** -2.942*** -1.688*** -1.510**  -1.573***
##                    (1.366)   (0.580)   (0.676)   (1.647)    (0.731)   (0.886)   (0.601)   (0.604)   (0.457) 
##                                                                                                             
## ------------------------------------------------------------------------------------------------------------
## Akaike Inf. Crit. 4,152.442 4,152.442 4,152.442 4,152.442  4,152.442 4,152.442 4,152.442 4,152.442 4,152.442
## ============================================================================================================
## Note:                                                                            *p<0.1; **p<0.05; ***p<0.01

Interpretation

We can proceed to two types of interpretations.

  1. We can calculate marginal effects. However, the comparison level would then always be the Social democrats. How relevant is that to my research?
  2. We can calculate the predicted outcomes, either by calculating the probability (i.e. utility of choosing each party) for each outcome or by flagging the most likely outcome (the party choice).

Marginal effects

Regression coefficients the slope parameters usually allow us to assess the direction and significance of the effects. However, since all comparisons are done with respect to the Social democrats, the interpretation is somewhat idiosyncratic.

First, we see that the RV (Radical left; social liberal party) and K (Conservative) voters are both less skeptic towards immigration than the common Social democrat voter (once demographic variables are accounted for). The two parties have similar intercepts and slopes, although only RV voters are significantly different. We also remember from the statistics that there are only 8 respondents that have answered that they have a preference for the Conservatives.

Second, we see that voters from A (the Alternative), DF (Danish Popular party) and V (Liberal Left) are more skeptic towards immigration than the Social democrats.

Third, it seems that S, E, KF, LA and SF all have very similar voters when it comes to attitudes towards immigration (i.e. their regression coefficient is small and statistically insignificant).

Marginal effects We can proceed to the same interpretation as before by partially backtransforming the slope coefficient from logoddsratio to oddsratio to percentage change.

#Scenario 
x = 1
#Coefficient
coef <- summary(mod1.nom)$coefficients[7,"Skepsis"] * x
coef
## [1] -0.2243748
#Multiply by 100
rel_eff_RV <- (exp(coef) - 1) * 100
rel_eff_RV
## [1] -20.09844

We may, for example say that a one unit increase in skepticism decreases the likelihood that a voter will opt for RV rather than S by 20% . The effect is quite substantial. If we consider a two-point increase (the interquartile range), the effect amounts to a 36.1574039% decrease.

Predictions

The rubber only really hits the road when we backtransform the results to generate predictions. At that point, the results are no longer dependent on the reference category. To arrive there, we need to set a scenario. We thus follow three steps:

  1. set a scenario
  2. predict probability for each alternative within the scenario
  3. point prediction: identify the alternative with the highest probability
Descriptive statistics
Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
Skepsis 1,497 5.009 1.726 0.000 4.000 6.000 10.000
Income 1,327 5.784 2.906 1 3 8 10
Education 1,493 13.086 4.892 0 11 16 35
Immigrant 1,497 0.120 0.325 0 0 0 1
Female 1,502 0.481 0.500 0 0 1 1

Each choice alternative is described with its own equation with predictors relating to the chooser. As per usual, we begin by filling in the equation with a scenario to calculate the predicted logodds. Then we can backtransform.

The backtransformation is different, insofar as you’d backtransform the logodds to odds (exp(logodds)), then divide the relevant category by the sum of the odds for all the category. This forces the probabilities to always sum up to 1; which is what distinguishes the multinomial logit from the ordinary logit.

Scenario

I’ll opt for a typical scenario, given the statistics, but let the attitudes toward immigration vary.

#Scenario
scenario <- 
  data.frame(
    Intercept = 1,
    Skepsis = c(4, 6),
    Income = mean(df$Income, na.rm = T),
    Education = mean(df$Education, na.rm = T),
    Immigrant = 0,
    Female = 0)

Backtransformation: Probabilities

By hand We can obtain the predicted probability by simple matrix calculation. Multiply each variable in the matrix containing my scenarios with the relevant coefficient (also stored in a matrix).

#Coefficients stored as a matrix
coefs <- summary(mod1.nom)$coefficients %>% 
  as.matrix

coefs
##    (Intercept)     Skepsis       Income    Education   Immigrant      Female
## A    -4.928344  0.28898306  0.075321079  0.040839475   0.8665345 -0.97083266
## DF   -3.218129  0.56484227 -0.007969436 -0.009093293  -0.8969781 -0.63859938
## E    -1.214491 -0.05159780 -0.030202911  0.028376023  -0.1510994  0.01519225
## K    -1.510827 -0.23374916 -0.072302609 -0.021650490 -11.3666588 -0.16085056
## KF   -2.739442  0.07978610  0.125455375  0.034992049  -0.8436163 -0.23819484
## LA   -2.941900  0.06976752  0.139137432  0.042106257  -0.2477992 -1.49084514
## RV   -1.688098 -0.22437478  0.116823317  0.102371601  -0.2008903 -0.32328085
## SF   -1.509715 -0.05712908 -0.018928704  0.071055027  -0.2337257  0.15212100
## V    -1.572977  0.14083006  0.151512520  0.032074457  -0.7957274 -0.51823830
#Transpose/"tilt" the scenario
logodds <- coefs %*% t(scenario)

#From logodds to odds
odds <- exp(logodds)

#Divide the second category (DF) by the sum of the odds, the 1+ represents the reference category
probDF <- odds[2,1]/(1 + sum(odds[,1]))
probDF
##       DF 
## 0.070965

Check:

predict(mod1.nom, scenario, type = "probs")[1,"DF"]
## [1] 0.070965

Using R The predict function gives us all we need, and so does the ggpredict function.

preds <- predict(mod1.nom, scenario, type = "probs")
preds
##           S          A        DF          E           K         KF         LA
## 1 0.2183367 0.01324651 0.0709650 0.06418495 0.009380842 0.06339158 0.05908853
## 2 0.1805067 0.01951980 0.1815627 0.04786108 0.004859331 0.06147513 0.05616542
##          RV         SF         V
## 1 0.1234468 0.08719279 0.2907663
## 2 0.0651564 0.06430214 0.3185914

Among moderate-to-tolerant male, non-immigrant respondents (x = 4) with average income, the estimated probability of choosing (i.e. their utility) the different parties is as follows.

preds[1,] %>% sort
##           K           A          LA          KF           E          DF 
## 0.009380842 0.013246507 0.059088530 0.063391579 0.064184948 0.070965002 
##          SF          RV           S           V 
## 0.087192794 0.123446842 0.218336687 0.290766269

This preference ranking is slightly reorderd among moderately non-tolerant men in the same group (x = 6):

preds[2,] %>% sort
##           K           A           E          LA          KF          SF 
## 0.004859331 0.019519802 0.047861077 0.056165422 0.061475128 0.064302143 
##          RV           S          DF           V 
## 0.065156399 0.180506674 0.181562658 0.318591367

In particular, we see that intolerance among men favors DF (Danish populist party); which increases from 5 to 2nd place.

Backtransformation: Point prediction of choice

We can estimate which option respondents prefer by identifying the category with the highest predicted probability.

#Skepticism == 4
preds[1,] %>% which.max()
##  V 
## 10
#Skepticism == 6
preds[2,] %>% which.max()
##  V 
## 10

The predict function does this for us if we ask it to. Here, I predict in-sample.

df <- 
  df %>%
  mutate(preds = predict(mod1.nom, df, type = "class"))

Your turn!

  1. How does the ranking of parties look like for women?

  2. The previous scenario treated men and women as all-else-equal categories (only the skepticism changed). Can you instead model a “typical” woman and a “typical” man?

scenario_alt <-
  df %>%
  group_by(Female) %>%
  reframe(
    Skepsis = median(Skepsis, na.rm = T),
    Income = median(Income, na.rm = T),
    Education = median(Education, na.rm = T),
    Immigrant = median(Immigrant, na.rm = T))
preds_alt <- predict(mod1.nom, scenario_alt, "probs")
preds_alt[1,] %>% sort
##           K           A           E          LA          KF          SF 
## 0.006714753 0.016421024 0.055386842 0.059638133 0.064474455 0.074733958 
##          RV          DF           S           V 
## 0.091922950 0.114343199 0.200166927 0.316197760
preds_alt[2,] %>% sort
##           K           A          LA          KF           E          DF 
## 0.007379518 0.008546071 0.018475743 0.069405679 0.076310666 0.078918380 
##          RV          SF           V           S 
## 0.097216831 0.123225127 0.256494282 0.264027703

Visual interpretation (plots)

We can plot the effect of a predictor for each choice alternative by setting a scenario and making predictions for each of them. Here, I use the ggpredict() function. In contrast to the base R prediction, the result is a long-format matrix, where each alternative is a new line.

scenario <- ggpredict(mod1.nom,
                      terms = list(
                        Skepsis = seq(0, 10, 0.1)
                      ),
                      condition = list(
                        Income = 5,
                        Education = 13,
                        Immigrant = 0
                      ))

Effectplot I can then use the data frame with the scenarios to plot the effect plot. Now the facet_wrap() function comes in handy, so that we can display the effect of the predictor for each party.

scenario %>%
  as.data.frame() %>%
  ggplot +
  geom_line(
    aes(y = predicted,
        x = x)) +
  facet_wrap(~response.level)

The figure makes it clear that skepticism towards immigration funnels voters to DF (Danish popular party) and leads them away from/has no effect among all other parties in the data. This also illustrates another key insight. Since probabilities amount to 1, it means that an increase in support for one party, necessarily leads to a decrease elsewhere within the same scenario.

Stacked area plot Another nice visualization is to calculate the predicted probability (share) of each party for each scenario, then to color code them.

scenario %>%
  ggplot +
  #Areas rather than curves
  geom_area(aes(y = predicted,
                x = x,
                fill = response.level)
  ) +
  ggtitle("Party choice and attitudes towards immigration") +
  xlab("Skepticism towards immigration")

The plot effectively communicates the same thing we’ve seen. DF is a dominant choice among those skeptical towards immigration.

Model assessment

Independence of irrelevant alternatives (IIA)

The main assumption that the multinomial model relies on is the idea there is no other alternative that – if I’m presented to it – would make me reshuffle all of my other preferences as well.

Ward and Ahlquist (2018) (p 166-168) link this to the latent variable approach: Multinomial models can be seen as a way to estimate the utility that actors derive from different alternatives/choices. Preferences should be intransitive. That is, when I prefer A to B and B to C, then it should also mean that I prefer A to C. Why is this relevant?

If we can make that assumption, then it means that we can leave categories out of the data and still get (approximately) the same estimation for the other alternatives. Remember, although all alternatives are compared to a specific reference category, the probabilities sum up to 1 anyways. It means that it the base-line probability for the reference category changes (or any of the other alternatives), then this will impact the estimation of all the regression coefficients.

Let’s check this on a toy example. Do the regression coefficients for skepticism change when I remove DF supporters?

mod3.nom <- multinom(Party ~ 
                      Skepsis
                     + Income
                     + Education
                     + Immigrant
                     + Female,
                     subset = Party != "DF",
                    df)
## # weights:  63 (48 variable)
## initial  value 2102.743921 
## iter  10 value 1841.293250
## iter  20 value 1782.301200
## iter  30 value 1716.150237
## iter  40 value 1681.516315
## iter  50 value 1680.722192
## iter  60 value 1680.673087
## final  value 1680.672364 
## converged

Now, let’s check the similarity of the two coefficients. The bind_rows() from dplyr allows us to bind vectors with unequal length together by their shared observation names (party).

bind_rows(
  coefficients(mod1.nom)[,2],
  coefficients(mod3.nom)[,2]
)

The Hausman-McFadden test does a systematic comparison of the regression coefficients when we leave out a category. At this point, we do not want any significant results because this would indicate that the coefficients have changed; the IIA assumption is violated. The Hausman-McFadden test does not always given very consistent results. Furthermore, we don’t often know which alternatives we should/could remove without biasing the results.

The most flagrant violations would be visible to the eye however.

In-sample predictions

Ward and Ahlquist (2018) (p. 170-173) have several suggestions about how to test the model’s performance. Many of these rely on in-sample predictions.

correct <- mean(df$Party == df$preds, na.rm = T)
correct
## [1] 0.3204775

Our model predicts the correct party choice for 32% of the observations. However, this tells us little about how well the model discriminates between alternatives.

Confusion matrix

The confusion matrix makes a cross-tabulation of predicted vs. observed categories.

confusion <- table(df$Party, df$preds)
confusion
##     
##        S   A  DF   E   K  KF  LA  RV  SF   V
##   S  103   0  22   0   0   0   0  12   0 108
##   A    5   0   3   0   0   0   0   0   0   7
##   DF  32   0  41   0   0   0   0   0   0  59
##   E   37   0   7   0   0   0   0   4   0  22
##   K    4   0   0   0   0   0   0   0   0   4
##   KF  14   0   7   0   0   0   0   4   0  40
##   LA   8   0   4   0   0   0   0   2   0  27
##   RV  33   0   3   0   0   0   0  19   0  70
##   SF  47   0   7   0   0   0   0   6   0  45
##   V   68   0  18   0   0   0   0  11   0 186

The correct predictions are along the diagonal.

diag(confusion)
##   S   A  DF   E   K  KF  LA  RV  SF   V 
## 103   0  41   0   0   0   0  19   0 186

It is clear that the model is only able to correctly predict support for some of the larger parties. In fact, for many of the parties, the model does not predict any voters at all.

This is bad news for our model.

Receiver Operator Curve (ROC)

The ROC curve describes how the ratio between false positive vs. correct positive predictions changes as we let the cut point slide. In mulitnomial models, we’d need one ROC curve for each alternative.

Here, I write a loop to do this effectively. That is, I create a matrix with predictions, then go through alternative 1 to alternative 11 and make a separatate plot each time.

library(pROC)

# Step 1: Predict probabilities
pred_mat <- predict(mod1.nom, newdata = df, type = "probs")

# Step 2: Choose a category
i = 2

#Default plotting is in base R, set the facet
par(mfrow = c(3,4))

for(i in 1:ncol(pred_mat)){
   party = colnames(pred_mat)[i]
  
  
  positive_probs <- pred_mat[, party]
  
  # Step 3 & 4: Calculate ROC curve metrics and plot ROC curve
  roc_response <- ifelse(df$Party == party, 1, 0) 
  
  roc_curve <- roc(response = roc_response, 
                   predictor = positive_probs)
  
  # Plot ROC Curve
  plot(roc_curve, 
       main = paste("ROC curve for", party))
}

#Return to no facet
par(mfrow = c(1,1))

Ordered logit

Let’s explore our possibilities when the choice categories are pre-ordered. That is, we don’t use our resondents’ appreciation of the categories to derive their ranking (Ward and Ahlquist 2018, ch. 9).

Descriptive statistics

Univariate statistics

We’ll work with a survey question from ESS asking respondents to answer to what degree they agree with the statement that immigration constitutes a burden to the Danish society. The survey responses range from 0 (disagree) to 10 (agree), and therefore imply 11 response categories.

df %>%
  ggplot +
  geom_histogram(aes(Burden),
                 #If measurment is character or factor
                 stat = "count")

The main predictor is the respondent’s income reported in deciles. They therefore range from 1 to 10, where 10 flags the 10% highest income.

Bivariate statistics

As with the categorical variables, we can calculate group means. This time, if the ordering of the variable make sense and is related to the predictor, we’d expect to see the contour of a trend.

df %>%
  group_by(Burden) %>%
  reframe(Income = mean(Income, na.rm = T)) %>%
  ggplot +
  geom_point(aes(x = Burden,
                 y = Income),
             size = 3) +
  geom_smooth(aes(x = Burden,
                  y = Income),
              color = "black",
              lty = 2, se = F) +
  coord_cartesian(y = c(0,10)) +
  ggtitle("Conditional means: Attitudes towards immigration and income")

Ward and Ahlquist (2018) (p. 152-153) argue in favor of this plot as a means to check the parallel lines assumption. Their intuition, is that the effect of the predictor (Income), should be approximately similary – and so follow a line – across the different levels of the ordinal outcome. In our plot, there is little doubt about this, although things look less clear at the really low levels of “Burden”.

Analysis

Set the measurement level

The analysis requires us to explicitly state the order of the categories. That is, we have to recode the outcome into an ordinal (ranked categorical) variable.

df <- 
  df %>%
  #Define first as a factor/categorical
  mutate(Burden_ord = as.factor(Burden),
         #... then order it.
         Burden_ord = ordered(Burden_ord))

We can see the ordering in the end of the printout in our R console.

head(df$Burden)
## [1]  6  4  2  6  6 NA

Estimate an ordered logit

The model function is drawn from the MASS package: polr(). The specifications look otherwise similar to what we are used to. The only difference is that we have to specify that we want to keep the Hessian matrix.

library(MASS)
mod0.ord <- polr(Burden_ord ~ 
                  1,
                Hess = T,
                df)

mod1.ord <- polr(Burden_ord ~ 
                  Income
                + Education
                + Female
                + Immigrant,
                Hess = T,
                df)
summary(mod1.ord)
## Call:
## polr(formula = Burden_ord ~ Income + Education + Female + Immigrant, 
##     data = df, Hess = T)
## 
## Coefficients:
##              Value Std. Error t value
## Income    -0.06468    0.01774  -3.646
## Education -0.08184    0.01061  -7.714
## Female     0.11291    0.09768   1.156
## Immigrant -0.54020    0.15591  -3.465
## 
## Intercepts:
##      Value    Std. Error t value 
## 0|1   -5.7233   0.2795   -20.4753
## 1|2   -4.7274   0.2203   -21.4552
## 2|3   -3.2601   0.1850   -17.6216
## 3|4   -2.4228   0.1752   -13.8317
## 4|5   -1.9310   0.1706   -11.3200
## 5|6   -0.8156   0.1628    -5.0086
## 6|7   -0.4123   0.1621    -2.5434
## 7|8    0.2727   0.1648     1.6549
## 8|9    1.0305   0.1759     5.8568
## 9|10   1.6479   0.1943     8.4811
## 
## Residual Deviance: 5524.957 
## AIC: 5552.957 
## (203 observations deleted due to missingness)
stargazer(mod0.ord, mod1.ord,
          ord.intercepts = T,
          title = "Is immigration a burden to society?",
          type = "html")
Is immigration a burden to society?
Dependent variable:
Burden_ord
(1) (2)
Income -0.065***
(0.018)
Education -0.082***
(0.011)
Female 0.113
(0.098)
Immigrant -0.540***
(0.156)
0| 1 -4.147*** -5.723***
(0.210) (0.280)
1| 2 -3.178*** -4.727***
(0.133) (0.220)
2| 3 -1.741*** -3.260***
(0.073) (0.185)
3| 4 -0.944*** -2.423***
(0.058) (0.175)
4| 5 -0.488*** -1.931***
(0.054) (0.171)
5| 6 0.593*** -0.816***
(0.054) (0.163)
6| 7 0.985*** -0.412**
(0.059) (0.162)
7| 8 1.638*** 0.273*
(0.071) (0.165)
8| 9 2.388*** 1.030***
(0.094) (0.176)
9| 10 2.970*** 1.648***
(0.121) (0.194)
Observations 1,475 1,299
Note: p<0.1; p<0.05; p<0.01

Interpretation

Hypothesis testing (the coefficients)

We can see from the results table that income has a negative and statistically significant effect on the perception that immigration is a burden to society. But how substantial is this effect?

Marginal effect

The marginal effects in an ordinal model can be interpreted as the relative increase/decrease in likelihood of passing the next threshold.

coef_income <- mod1.ord$coefficients["Income"]
(exp(coef_income) - 1) * 100
##    Income 
## -6.263191

For each decile increase in income, the likelihood of opting for a higher response category decreases by 6%. If we consider the relative difference between the respondents with the highest and the lowest income, we find a 48% decrease.

Predictions

We can think of the effects as either changing the respondents’ propensity to think of immigration as a burden (the latent variable) or as their propensity to chose a certain category in relation to a (varying) intercept (parallel lines; separate regressions). This has a bearing on how we present the results

Predictions: Change in relation to the intercept

  1. set the scenarios
scenario <- ggpredict(mod1.ord,
                      terms = list(
                        Income = seq(0,10,0.1)
                      ))
  1. plot them automatically
scenario %>%
  plot

… or by setting the aesthetics yourself.

scenario %>%
  ggplot +
  geom_ribbon(aes(
    ymax = conf.high,
    ymin = conf.low,
    x = x
  ),
  alpha = 0.4) +
  geom_line(aes(
    y = predicted,
    x = x
  )) +
  facet_wrap(~ response.level)

Predictions: Change on the latent variable

scenario <- 
  data.frame(
    Income = c(2,8),
    Education = 13,
    Female = 0,
    Immigrant = 0
  )

preds <- predict(mod1.ord, scenario, type = "prob")
preds 
##            0          1          2         3          4         5          6
## 1 0.01066536 0.01769111 0.08400523 0.1138909 0.09724706 0.2698001 0.09257072
## 2 0.01564315 0.02560373 0.11601209 0.1439562 0.11224982 0.2691247 0.08036705
##           7          8          9         10
## 1 0.1265732 0.08991572 0.04250064 0.05513994
## 2 0.1016443 0.06701575 0.03030300 0.03808015
#Tilt/transpose the matrix
data.frame(Poor = preds[1,],
           Rich = preds[2,],
           Response = 0:10) %>%
  tidyr::pivot_longer(c(Rich, Poor)) %>%
  ggplot +
  geom_col(aes(y = value,
                   x = Response,
                   fill = name),
               position = "dodge",
               alpha = 0.4) +
  xlab(" ")

We can see that the density of respondents move slightly towards the left (higher propensity to see immigration as a burden) in the “rich” group compared to the “poor” group, although the move is small. Most respondents do not have much of an opinion.

Prediction: the most likely outcome.

Lastly, we can predict which response category a respondent will pick as if this is a multinomial model.

#Pick the option with the highest probability
preds <- predict(mod1.ord, scenario, type = "prob")
preds[1,] %>% which.max
## 5 
## 6
preds[2,] %>% which.max
## 5 
## 6
#...or let R do it
preds <- predict(mod1.ord, scenario, type = "class")
preds
## [1] 5 5
## Levels: 0 1 2 3 4 5 6 7 8 9 10

We see that, indeed, even a substantial shift in income is unlikely to move the needle much in terms of stated support for the idea that immigration is a burden to the Danis society (neither agree nor disagree = 5). In both scenarios, the predicted outcome is 5.

When we treat the outcome as if it was categorical, we can either use the stacked probability plot as before or use barplots to illustrate the distribution under different scenarios.

scenario <- ggpredict(mod1.ord,
                      terms = list(
                        Income = seq(0,10,0.1)
                      ))

scenario %>%
  ggplot +
  geom_area(aes(y = predicted,
                x = x,
                fill = response.level))

This illustrates the hybrid nature of the ordinal model; it’s somewhere between a linear and a discrete outcome model.

Model assessment

In-sample predictions

df <- 
  df %>%
  mutate(preds_ord = predict(mod1.ord, df, type = "class"),
         preds_ord = as.numeric(as.character(preds_ord)))

#Treat as a categorical: proportion correct
mean(df$Burden == df$preds_ord, na.rm = T)
## [1] 0.2540416
#Treat as a numeric: R2
cor(df$Burden, df$preds_ord, use = "pairwise.complete.obs")^2
## [1] 0.005636061
df %>%
  ggplot +
  geom_histogram(aes(preds_ord)) +
  ggtitle("Predicted responses")

Well, my model does not do a great job in discriminating between response categories, it seems..

Are the intercept distinct?

There is no point in disgingishing (ordering) the levels if they are not empirically ordered by the model and its predictors. We may therefore check if the intercepts indeed constitute differences between groups.

#Extract the intercepts
intercepts <- summary(mod1.ord)$coefficients[-(1:4),1]
#...and their standard error
se.intercepts <- summary(mod1.ord)$coefficients[-(1:4),2]

data.frame(intercepts,
           se.intercepts,
           threshold = names(intercepts)) %>%
  ggplot +
  geom_point(aes(y = threshold,
                 x = intercepts)) +
  geom_errorbar(aes(y = threshold,
                    xmin = intercepts - se.intercepts * 1.96,
                    xmax = intercepts + se.intercepts * 1.96),
                width = 0.2) +
  ggtitle("Are the intercepts ordered and distinguishable?")

From the graphic, we can see that the confidence intervals are generally not overlapping and that the intercepts are spaced out approximately equally.

The parallel lines assumption

The ordinal regression estimates a single slope parameter per predictor. It therefore relies on the assumption that passing from one category to another (for each intercept), the effect of the predictor should be (approximately) the same. We can test this by estimating binomial logits for each intercept/cutpoint in the outcome variable.

We can, for example, do this for the first cutpoint, from 1 to 2.

Remember the guessing game? We can ask which (or how many) respondents have rated their agreement to the statement that immigrants are a burden on society to larger than 1.

table(df$Burden > 1)
## 
## FALSE  TRUE 
##    59  1416

We can use this to recode the variable directly in the model equation.

mod1 <- glm(I(Burden >= 1) ~ Income + Education + Immigrant,
            family = "binomial",
            df)

The slopes in this regression should be approximately the same as a regression where Burden > 2. To estimate all of these models, we can either copy-paste or write a loop.

Here, I write a loop. I let R change the value of the cutpoint by defining a variable i that runs from 1 to 10; the number of cut points. Before I start, I define some empty R-objects where I can store my model results.

mod <- NULL
slopes = NULL
se.slopes = NULL

for(i in 1:10){
  mod[[i]] <- glm(I(Burden >= i) ~ Income + Education + Immigrant,
                  family = "binomial",
                  df)
  slopes[i] = summary(mod[[i]])$coefficients[2,1]
  se.slopes[i] = summary(mod[[i]])$coefficients[2,2]
}

I’ll end up with a list of lists (with models), and two vectors: One for the slope and one for their standard error.

Now, I can start comparing the slopes. Here, I compare the the effect of income. We can display this either as an effect plot or a coefplot.

data.frame(slopes, se.slopes, thresholds = 1:10) %>%
  ggplot +
  geom_point(aes(x = thresholds,
                 y = slopes)) +
  geom_errorbar(aes(x = thresholds,
                    ymax = slopes + 1.96 * se.slopes,
                    ymin = slopes - 1.96 * se.slopes)) +
  #Guiding line where the single slope is estimated
  geom_hline(aes(yintercept = mod1.ord$coefficients["Income"]),
             lty = 2)

The slopes from the non-pooled models are not different from the single estimate from the pooled model. The assumption holds.

An alternative take is to visualise the slopes in an effect graphic. We don’t have to backtransform to do this. The point is to check the direction.

data.frame(
  slopes, 
  se.slopes, 
  thresholds = as.factor(1:10)) %>%
  ggplot +
  #Define yaxis accroding to intercepts/slopes
  ylim(-min(slopes), max(slopes)) +
  geom_abline(aes(
    #Either the intercepts or just 0, since we're comparing
    intercept = 0,
    slope = slopes,
    lty = thresholds)) +
  ggtitle("Parallel lines assumption")

From the graphic, it is clear that all the slopes are positive, even though they don’t follow the exact same angle.

Do I need an ordinal model?

If the parallel lines assumption doesn’t hold, we might swap to a multinomial model. If, on the other hand, the model contains many categories that are well-behaved in terms of ordering (and the parallel lines seem to hold), we might go for a linear model.

mod.lm <- lm(Burden ~ 
               Income
             + Education
             + Female
             + Immigrant,
             df)

We find that a decile increase in Income decreases the propensity to view immigration as a burden by -0.08 scale units (out of 11 possible).

stargazer(mod0.ord, mod1.ord, mod.lm,
          ord.intercepts = T,
          title = "Is immigration a burden to society?",
          type = "html")
Is immigration a burden to society?
Dependent variable:
Burden_ord Burden
ordered OLS
logistic
(1) (2) (3)
Income -0.065*** -0.082***
(0.018) (0.023)
Education -0.082*** -0.103***
(0.011) (0.013)
Female 0.113 0.139
(0.098) (0.125)
Immigrant -0.540*** -0.683***
(0.156) (0.195)
0| 1 -4.147*** -5.723***
(0.210) (0.280)
1| 2 -3.178*** -4.727***
(0.133) (0.220)
2| 3 -1.741*** -3.260***
(0.073) (0.185)
3| 4 -0.944*** -2.423***
(0.058) (0.175)
4| 5 -0.488*** -1.931***
(0.054) (0.171)
5| 6 0.593*** -0.816***
(0.054) (0.163)
6| 7 0.985*** -0.412**
(0.059) (0.162)
7| 8 1.638*** 0.273*
(0.071) (0.165)
8| 9 2.388*** 1.030***
(0.094) (0.176)
9| 10 2.970*** 1.648***
(0.121) (0.194)
Constant 6.878***
(0.206)
Observations 1,475 1,299 1,299
R2 0.079
Adjusted R2 0.076
Residual Std. Error 2.246 (df = 1294)
F Statistic 27.756*** (df = 4; 1294)
Note: p<0.1; p<0.05; p<0.01

Personally, I’d say this an instance in which keeping the categories makes the model complex without only benefits. However, it all depends on what I’m interested in. I can compare the predictions of the two models in two histograms.

The predictions from the linear model are too little spread out. Here, I use the observed outcomes as a backdrop for the predictions (dark grey) from the linear model. My predictors fail to capture the more extreme outcomes.

df %>%
  mutate(preds_ols = predict(mod.lm, df, type = "response")) %>%
  ggplot +
  geom_histogram(aes(Burden),
                 alpha = 0.2,
                 binwidth = 1) +
  geom_histogram(aes(preds_ols)) +
  xlim(0,10) +
  ggtitle("Predictions from a linear model")

I can do the same for the ordinal model. Here, I sum over the predicted probabilities for each category of the outcome variable using colSums(). I use them to define the height of my columns, but let the categories define the x-axis.

#Set up data from predictions
data.frame(
  #Sum over the columns of the matrix of predictions, store them as a variable
  preds_ord = colSums(
    predict(mod1.ord, df, "prob"), 
    na.rm = T))  %>%
  #Add x-axis: the row names; i.e ordinal categories
  mutate(Burden = row.names(.),
         #Order the categorical variable according to its values
         Burden = reorder(Burden, 0:10)) %>% 
  #Plot
  ggplot +
  #Plot predictions
  geom_col(
    aes(y = preds_ord,
               x = Burden),
           alpha = 0.4) +
  #Plot observations, specify data object separately
  geom_bar(
    data = df,
    aes(Burden + 1),
    alpha = 0.2) +
  xlab("Belastning")+
  ggtitle("Predictions from an ordinal model")

The figure illustrates an important point. The point predictions from the model lose information in the back transformation from the continuous latent variable. That is, while the linear model has no notion that there might be extreme outcomes in my data, the ordinal (and multinomial) model is open to the idea that most observations have some probability of falling in all of the predefined categories.

After the model assessment, we may opt to keep the ordinal model or switch either to a linear or a categorical model. Usually, we’ll end up with a set of alternative models that we may report in the appendix (or for our own benefit) to show which parts of the results are “model dependent” (they appear only when I opt for one type of model) and which ones are more robust.

In our case, if the question relates to the effect of income on scepticism toward immigration, I’d say we have found that income has an undeniable effect (in both models), although the size of the effect is minor (also in both models).

Model recap this far

This course is structured around how to pick a good regression model, given the phenomenon and/or data we’re interested in. Two factors structure that choice: the structure of the data (what variation we use) and the measurement level of the outcome variable (data generating process). A good analytical project takes an active choice about both; we need a strategy for both.

Dependent variable

Regressions are equations estimated on outcomes that are linear and unbounded. When our outcome variable is manifestly neither, we have to recode it. We then draw on a probability distribution to link observed outcomes with their likelihood of ocuring.

Dependent variable Model R function Rationale Limits B-plan
Numeric Linear model/OLS lm() Calculate group means and compare When data is bounded or censured GLMs (this class)
Binary Binomial logit glm() Compare 1s to 0s in entire data Everything is relative linear model
Categorical Multinomial logit nnet::multinom() Compare each category to a reference category The removal of one choice, doesn’t change the ranking of the others Merge categories and/or logit model
Ordered Ordinal logit MASS::polr() Compare successively above and below two or more cutpoints (between categories) Parallel lines assumption, cutpoints/intercepts are distinguishable linear, multinomial or binomial logit models

Coming up:

  • choice in a choice situation: conditional logit
  • count outcomes: poisson and its fixes
  • duration: event history models

Data structure

The data structure determines the comparisons we can make and the limitations. That is, it allows us to leverage some variation and parse out other.

Structure Model R function Rationale Limits B-plan
Varying intercepts fixed effects lm() Compare strictly within groups Requires within-group variation, no use of between group variation (information) lme4::lmer()
Varying intercepts random effects lme4::lmer() Compare within and between (partial pooling) Little within or between-group variation gives little traction fixed effects or pooled
Varying slopes Separate effects for different groups lme4::lmer() Interaction effect between linear predictor and varying intercept (groups) Requires a lot of variation, very descriptive separate models for each group; a single slope
Two-level analysis two-level hierarchical model w/varying intercepts lme4::lmer() Regress predictors on random intercepts (groups) Requires many separate groups/between-group variation Pooled data with same predictor as level 1

Coming up:

  • missing data

Literature

Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press.