Statistics in general – and logit models in particular – are based on comparisons. All statements are made in comparison to a reference group. This is true for binomial logistic regression (for binary outcomes), where all observations share a single reference category. When outcomes have several categories, we still play the comparison game, but the reference group works differently. This problem set aims to help you see what implications this has for the interpretation of the models and to give you an intuition for what goes on “under the hood” when the models are estimated.
We will be working on the same data from the European Social Survey (ESS) that is used in chapter 6 of Hermansen (2023). You can download it from my website: kap6.rda
download.file("https://siljehermansen.github.io/teaching/beyond-linear-models//kap6.rda",
              destfile = "kap6.rda")
load("kap6.rda")

# Rename the Danish variable names to English
library(dplyr)
df <- kap6 %>%
  mutate(Education = Uddannelse,
         Income = Indtaegt,
         Female = Kvinde,
         Burden = Belastning,
         Immigrant = Innvandrer)
Be prepared to share on padlet: https://padlet.com/siljesynnove/beyond-linear-regression-yxuvstrvfvjia47y
Many outcomes of interest to political scientists are categorical and cannot immediately be ordered (i.e. they are nominal). These outcomes are often the result of a choice, based on some (latent) preference held by the chooser. Party choice among voters is the most typical example. We can use the data to estimate the utility that choosers derive from different options simply by looking at the options they prefer. Here, we use the multinomial logistic model (Ward and Ahlquist 2018, ch 9).
Let’s formulate the hypothesis that voters’ income has a bearing on their choice of party because parties differ in the amount of redistribution they are willing to support.
Before fitting the model, make sure the outcome (party choice) is a factor with a sensible reference category. Useful functions: `as.factor()` and `relevel()`.
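A minimal sketch of how you might set the reference category (assuming the outcome is stored as `df$Party` and that the party labels include "DF"; check the actual level names in your data first):

df$Party <- as.factor(df$Party)
levels(df$Party)                           # inspect the available party labels
df$Party <- relevel(df$Party, ref = "DF")  # "DF" is just an example; pick your own reference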
library(nnet)
mod.cat <- multinom(Party ~ Education + Income, df)
The model effectively estimates the effect of income at equal levels of education. Which party attracts the voters with the highest/lowest income? In your opinion, does our knowledge of respondents’ income help us understand how voters discriminate between parties?
Interpret the effects by exponentiating the slope coefficients (i.e. as relative risk ratios). Which parties are similar to your reference category with respect to the effect of income? For the party pairs that are significantly different, what is the effect of income on the respondents’ choice of party?
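One way to get there (a sketch, using the `mod.cat` object fitted above):

# Exponentiated slope coefficients: one row per non-reference party
exp(coef(mod.cat))

# Rough z-tests: does each coefficient differ significantly from the reference party?
z <- summary(mod.cat)$coefficients / summary(mod.cat)$standard.errors
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
p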
Make predictions for a scenario where a respondent is rich (Income = 9) by filling in the equation for Dansk Folkeparti (DF), Venstre (V) and Socialistisk Folkeparti (SF). What do you find?
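A sketch of the by-hand calculation (assuming Education enters the model as a numeric variable; I hold it at its sample mean, which is my own choice). The row names of the coefficient matrix tell you which party each equation belongs to:

b <- coef(mod.cat)                              # one row per non-reference party
x <- c(1, mean(df$Education, na.rm = TRUE), 9)  # intercept, Education at its mean, Income = 9

# Linear predictor for each non-reference party, then the multinomial (softmax) transformation:
# Pr(m) = exp(x'b_m) / (1 + sum_k exp(x'b_k))
eta <- b %*% x
probs <- exp(eta) / (1 + sum(exp(eta)))
probs
1 - sum(probs)                                  # probability of the reference party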
Re-estimate your model with a new reference category. What happened?
Based on your new model fit, make predictions for the same scenarios as in question 1e. What did you find? Why?
Fit an intercept-only model for party choice using the `multinom()` function from the nnet package. Be mindful of the reference level you pick.
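A minimal sketch (assuming `Party` is the releveled factor from earlier):

library(nnet)
mod.0 <- multinom(Party ~ 1, df)   # intercept-only: one intercept per non-reference party
summary(mod.0)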
Calculate the regression coefficients in an intercept-only model by hand. You might use my code from the lecture slides for inspiration.
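One possible approach (a sketch): in an intercept-only model, each intercept equals the log of a party’s vote share divided by the reference party’s vote share.

p <- prop.table(table(df$Party))   # observed vote shares

# Log-odds of each party relative to the reference category (the first factor level)
log(p / p[1])

# Compare with the estimated intercepts
coef(mod.0)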
Calculate the parameters of a multinomial model of party choice with a binary predictor by hand. The data contains a predictor that flags the respondent’s gender (`Kvinde`/`Female`).
\[Pr(m_j) \sim log(odds)\] \[log(odds) = a_m + b_m \times Female_i\]
* use the code from exercise a, but subset the data according to the value of the predictor. That is, create two data frames, one for all observations with `Female == 0` (`df0`) and one for `Female == 1` (`df1`).
* calculate the odds of voting for each of the m-1 parties (be mindful of your reference category) in each of the two data frames (`df0` and `df1`).
* slope:
    - calculate the odds ratio by dividing the odds among female respondents (`Female == 1`; `df1`) by the odds among the other respondents (`Kvinde == 0`; `df0`).
    - log-transform the odds ratio.
* intercept: log-transform the odds among respondents with `Female == 0`. A sketch of these steps in R follows after this list.
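Here is one way the steps could look (a sketch, assuming `Party` is a factor whose first level is the reference and that the gender dummy is called `Female` after the renaming above):

# Split the data by the value of the predictor
df0 <- df[df$Female == 0, ]
df1 <- df[df$Female == 1, ]

# Vote shares and odds relative to the reference party in each subset
p0 <- prop.table(table(df0$Party)); odds0 <- p0 / p0[1]
p1 <- prop.table(table(df1$Party)); odds1 <- p1 / p1[1]

a <- log(odds0)           # intercepts: log-odds when Female == 0
b <- log(odds1 / odds0)   # slopes: log of the odds ratio (women vs. men)

cbind(a, b)               # compare with coef(multinom(Party ~ Female, df))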
Compare your results to the regression coefficients from the multinomial model estimated with the `multinom()` function.
The data contains an ordinal variable where the respondents are asked to what extent they consider immigration to be a burden to Danish society (`Belastning`/`Burden`). Here, we use the ordered logit model (Ward and Ahlquist 2018, ch 8).
Recode the variable in R into an ordinal variable. To do so, you can use the `as.factor()` and `ordered()` functions. Store the result in a new variable (e.g. `Burden_ord`).
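A minimal sketch (assuming the renamed `Burden` variable is numeric, so the categories come out in the right order; always check the levels):

df$Burden_ord <- ordered(as.factor(df$Burden))
levels(df$Burden_ord)   # check that the categories are ordered as intended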
Fit an ordinal model where agreement to the statement is a function of the respondent’s income, education and immigrant background.
library(MASS)
mod.ord <- polr(Burden_ord ~ Income + Education + Immigrant,
                data = df,
                Hess = TRUE)
Present the results in stargazer. You can use the argument `ord.intercepts =` to display the intercepts/cutvalues.
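For example (a sketch; `type = "text"` simply prints a plain-text table to the console):

library(stargazer)
stargazer(mod.ord,
          ord.intercepts = TRUE,   # report the cutpoints alongside the slopes
          type = "text")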
What is the marginal effect of income? Can you formulate a sentence that states the finding?
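A sketch of one way to start: `coef()` on a polr object returns only the slope coefficients (not the cutpoints), so exponentiating gives odds ratios.

exp(coef(mod.ord))["Income"]   # multiplicative change in the odds of reporting a higher
                               # burden category for each additional unit of income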
In the ordinal model, the cutpoints (where one category transitions to another) can be understood as intercepts. Specifically, the slope coefficients are estimated by comparing all observations above the cutpoint with all observations below the cutpoint.
\[Pr(y < \tau) \sim log(odds)\] \[log(odds) = a - bx\]
Note that the slope parameter is subtracted from the intercept!
* set a fixed scenario for each of the predictors.
* calculate the log-odds by filling in the equation for each of the intercepts/cutpoints, holding the scenario constant.
* back-transform from log-odds to probability by using the logistic function (a sketch of these steps in R follows after the equation below):
\[Pr(y < \tau) = \frac{exp(logodds)}{1 + exp(logodds)}\]
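A sketch of the calculation, assuming numeric predictors and a scenario with Income = 9 and the other predictors set to 0 (an arbitrary choice on my part; pick your own scenario):

b <- coef(mod.ord)      # slopes (Income, Education, Immigrant)
tau <- mod.ord$zeta     # cutpoints ("0|1", "1|2", ...)

x <- c(Income = 9, Education = 0, Immigrant = 0)   # the fixed scenario

# Cumulative log-odds at each cutpoint: log(odds) = tau_k - x'b
logodds <- tau - sum(b * x)

# Back-transform to cumulative probabilities Pr(y < tau_k)
exp(logodds) / (1 + exp(logodds))   # equivalently: plogis(logodds)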
Using the same scenario, calculate the probability that an observation falls between cutpoints “0|1” and “1|2” by subtracting the cumulative probability at one cutpoint from the cumulative probability at the other.
Check your results by having R calculate the predicted probabilities. Remember to keep your scenario constant.
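For example (a sketch; `newdata` must contain the same scenario you used by hand, with predictors coded the same way as in the model):

newdata <- data.frame(Income = 9, Education = 0, Immigrant = 0)
predict(mod.ord, newdata = newdata, type = "probs")   # one probability per category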
Calculate the cumulative sum of the probabilities using the `cumsum()` function: `predict(mod.ord, newdata, type = "probs") %>% cumsum()`. What is the difference?
Estimate the effect of income on the perception that immigrants are a burden. Include the controls for education and immigrant background as before.
We will check the two assumptions of the model: the parallel lines assumption (i.e. the slope is approximately the same regardless of the threshold/cutpoint it passes) and that the categories are indeed ordered and distinguishable from each other.
Create 10 binary variables (one for each of the cutpoints in the “Burden” variable), then run 10 binomial regressions. You might take inspiration from the slides or from the following code.
# Binomial logit for the first cutpoint: Pr(Burden < 1)
mod <- glm(Burden < 1 ~ Income + Education + Immigrant,
           data = df, family = "binomial")
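To avoid typing ten nearly identical calls, here is one possible sketch (assuming Burden runs from 0 to 10, so the cutpoints are 0|1 up to 9|10):

# Fit one binomial logit per cutpoint: Pr(Burden < k) for k = 1, ..., 10
mods <- lapply(1:10, function(k) {
  glm(Burden < k ~ Income + Education + Immigrant,
      data = df, family = "binomial")
})
names(mods) <- paste0(0:9, "|", 1:10)   # label each model by its cutpoint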
Display the 10 binary models and the ordinal model from the previous exercise in a stargazer() results table. Are the regression coefficients very different?
Create a coefplot of the coefficient on income from each of the 10 models. To do so, you will want to make a small data frame with one observation per model and variables reporting the slope coefficient, its standard error and the cutvalue associated with each model. What do you see?
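One way to build the plot (a sketch, using the `mods` list from the previous sketch and ggplot2):

library(ggplot2)

# One row per model: the income coefficient, its standard error and the cutpoint
coefs <- data.frame(
  cut = names(mods),
  b   = sapply(mods, function(m) unname(coef(m)["Income"])),
  se  = sapply(mods, function(m) summary(m)$coefficients["Income", "Std. Error"])
)

ggplot(coefs,
       aes(x = cut, y = b, ymin = b - 1.96 * se, ymax = b + 1.96 * se)) +
  geom_pointrange() +
  labs(x = "Cutpoint", y = "Coefficient on Income (log-odds)")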
Create a similar coefplot for the intercepts/cutpoints of the models. There is no need to back-transform them.