Linear models: Interpretation of point estimates

In this session, we will be exploring how to estimate and interpret linear models. The concepts and procedures we use today may seem overly complicated for the simple example we are working with, but as we move on to the non-linear relationships between x and y in GLMs, they will come in very handy.

Introduction

Main methodological takeaways

We will use linear models (OLS) first to estimate group averages. Specifically, we will explore the distribution of the dependent variable within groups of predictors and compare their means. Comparing group means is something OLS can always do.

We then move on to using the model for inference – predicting outcomes from combinations of (metric) variables. Our focus then shifts to interpretation. In today’s application, the interpretations often involve a detour; we could obtain the same results in simpler ways. However, when we move on to GLMs, where the relationship between x and y is non-linear, these skills will be useful.

Main substantive takeaways:

  • the four stages of quantitative research design
  • how to combine parameters and variables to make predictions
  • how to use prediction for interpretation
  • three stages of interpretation
  • how to apply this to interaction models

I will introduce a few additional R-functions:

  • linear regression
    • (base R): lm(), predict(), data.frame()
    • ggplot2: geom_line(); ylim()
    • dplyr: lag() (of minor importance)
  • repetition
    • dplyr: group_by(), reframe(), mutate()

The rest of the R code implements what we already encountered in the homework (Hermansen 2023) and in the last R session (notebooks online). Make sure you work through the examples here, so that you are prepared for our next session.

Political science example

We will be working with some of the insights from Hermansen and Pegan’s (2024) study of election seeking among Members of the European Parliament (MEPs). You can find a short version of the paper in this EUROPP-blog post.

For a long time, the academic literature claimed that what MEPs do in office doesn’t matter for their re-election; they are not held electorally accountable. In this paper, we show that MEPs haven’t really gotten that memo: they behave as if their re-election depends on what they do. Voters in all democracies are confronted with an informational problem. They don’t know what their representatives do. It is up to the representatives to prove their usefulness to their voters. They can do that in different ways.

We make the assumption that most MEPs will seek (re-)election in the future and that they can increase the likelihood of that happening by investing in the right type of activities. The claim we explore today is that some MEPs have an incentive to cultivate a personal vote, while others do not. Specifically, the electoral rules for EP elections vary by member state. MEPs elected from candidate-centered systems gain their seat in part due to the personal votes they receive. As a result, they have an incentive to invest in constituency activities; they invest in their local presence in order to gain name recognition. By contrast, MEPs elected from party-centered systems can ride on the party label, and so will invest less.

The data

We will work with a subset of the replication data for the article. The data list all MEPs present in parliament during the spring semester of 2014.

I start by fetching the R packages we will use from the library.

library(dplyr); library(ggplot2)

As per usual, I start with descriptive statistics: The univariate distribution of the relevant variables and their bivariate relationship. After that, I piece together a regression model, by adding variables one by one and presenting the results.

We begin by loading in the data.

If I have an internet connection, I can download the file directly from the class’ website and put it in my working directory.

download.file(url = "https://siljehermansen.github.io/teaching/beyond-linear-models/MEP2014.rda",
              destfile = "MEP2014.rda")

Now, I can load it into R and rename the object to my liking (df).

#Load in
load("MEP2014.rda")

#Rename
df <- MEP2014

Linear regression: the concept

The analytical process in quantitative political science looks like this:

  1. theorization: express it as an equation \(\rightarrow\) a regression model
  2. data collection: draw on your observation of the world \(\rightarrow\) a data object
  3. test the theorized relationships: run the model \(\rightarrow\) parameter size and significance
  4. interpretation: use the applied theory to describe the world \(\rightarrow\) predictions

Let’s exemplify with a numeric outcome variable (the number of local assistants an MEP has) and a binary predictor (the electoral system).

A comparison of means

Gelman and Hill (2007) take the approach that linear regression is, at its core, just a comparison of group means. The easiest way to see this is to do it.

Here, I want to compare two group means; the local staff size among MEPs from party- and candidate-centered systems respectively.

tab <-
  df %>%
  group_by(OpenList) %>%
  reframe("mean_y" = mean(LocalAssistants))
tab

We can see that the average staff size in candidate-centered systems is about 3.4 assistants, while party-centered systems have about 2.5 staffers per MEP on average.

How large is this difference?

tab <- 
  tab %>%
  ungroup %>%
  mutate(diff = mean_y - lag(mean_y))
tab

We see that MEPs from candidate-centered systems have, on average, about one more local staffer on their payroll.

Now, we know that

  • the mean of y in party-centered systems is 2.468
  • the average difference between party- and candidate-centered systems is 0.949.
  • the mean of y in candidate-centered systems is 2.468 + 0.949 = 3.417

This is parallel to how we interpret the regression coefficients in a linear model. Of course, if the predictor is continuous, we’ll have many such group averages. R would then have to fit a regression line that minimizes the errors (which is what the Ordinary Least Squares estimation does).

Regression

We can also make the estimation by running a regression model. The results are often reported as a regression table where we find key information.
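Before reading the table, here is a minimal sketch of the estimation behind it. The model is the bivariate regression we just discussed; the table layout suggests a package such as stargazer, which is my assumption:

#Estimate the bivariate model: local staff size as a function of the electoral system
mod1 <- lm(LocalAssistants ~ OpenList, data = df)

#One way to render a text table like the one below (assuming the stargazer package)
#stargazer::stargazer(mod1, type = "text")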

Effect of electoral system on local staff size among MEPs (results from a linear model)

                          Dependent variable:
                          LocalAssistants
-----------------------------------------------------
OpenList                  0.949***
                          (0.234)
Constant                  2.468***
                          (0.161)
-----------------------------------------------------
Observations              739
R2                        0.022
Adjusted R2               0.020
Residual Std. Error       3.179 (df = 737)
F Statistic               16.396*** (df = 1; 737)
-----------------------------------------------------
Note: *p<0.1; **p<0.05; ***p<0.01

Equations are expressions of a theory

Equations are a way to express how our political science theory links different phenomena together. We can then use data to test whether these links are present. Thus, all statistical models can be expressed as equations.

\(y = a + bx + \epsilon\)

  • data: \(y\) and \(x\)
  • parameters: \(a\) and \(b\)

Variables: the observed data

We often make explicit the indexing of the data when we describe a specific model. This is useful for understanding what data structure the model leverages.

\(y_i = a + bx_i + \epsilon_i\)

\(i\) is in fact a variable in its own right: It runs from 1 to the size of our data (\(N\)).

These indexes are useful when we read models that leverage complex data structures, but also for writing loops.
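For instance, a minimal sketch of a loop that uses the index \(i\) (illustrative only):

#The index i steers the loop through the first three observations
for(i in 1:3){
  print(df$LocalAssistants[i])
}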

We can write the equation in slightly less generic terms by giving our variables a real name.

\(Local\:Assistants_i = a + b \times Open\:List_i + \epsilon_i\)

Let’s call N the number of observations in our data frame. The function dim() reports first the number of observations (rows), then the number of variables (columns).

N <- dim(df)[1]
N
## [1] 739

It means that we can give i whichever value we want between 1 and 739. Here, I give it 1.

i <- 1

By indexing, I can see the first observation in my data.

df$LocalAssistants[i]
## [1] 2

The first MEP has 2 local staffers on their payroll.

Parameters: estimated quantities

Our regression coefficients are “parameters”. In contrast to the data, the parameters are estimated quantities: they draw on both the data and the theory (the equation).

We can rewrite our equation to reflect the results from the regression.

\(y = 2.468 + 0.949 x + \epsilon\)

The intercept (a) reports the mean value of the outcome (y) when all the predictors are 0 (x = 0). Here, the predicted local staff size among MEPs from party-centered systems (i.e. the reference level) is 2.468 employees.

The slope (b) reports the difference between the group averages of the outcome (y) when x increases by one unit. Here, the difference in mean local staff size among MEPs from party- and candidate-centered systems, respectively is 0.949 employees.
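A quick sketch in R: both parameters sit in the model object (mod1, the bivariate model behind the regression table above):

#Extract the estimated parameters (a and b)
coef(mod1)
#(Intercept) is a (about 2.468); OpenList is b (about 0.949)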

Prediction: Parameters and data

We can use the model results to make predictions. Even if we do not intend to use the model for forecasting (predicting out of sample), the predictions will help us interpret and understand what we have found.

Once the model is estimated, we can fill in all the values of the equation. We combine the parameters and data by assigning values to our predictors. We can follow two approaches when filling in the data:

  • observed data: fill in with the entire data set.

This is what we do when we make in-sample predictions for model assessment. These in-sample predictions allow us to calculate the residuals and generate model statistics.

  • scenarios: fill in with hypothetical variable values

This is what we do when we interpret the results as well as when we do out-of-sample predictions. We’ll mostly focus on the former.

In our example from the European Parliament, I can make two scenarios: predictions for the staff situation when MEPs compete in party-centered systems (x = 0), as well as when they compete in candidate-centered systems (x = 1).

\(y = 2.468 + 0.949 \times 0 = 2.468\)

\(y = 2.468 + 0.949 \times 1 = 3.417\)
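A minimal sketch of the same computation in R, assuming mod1 holds the bivariate model:

#Combine parameters and hypothetical data by hand
a <- coef(mod1)[1]  #intercept
b <- coef(mod1)[2]  #slope

a + b * 0  #party-centered systems: about 2.468
a + b * 1  #candidate-centered systems: about 3.417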

We can interpret the results in three steps. Going through these steps becomes increasingly important as we move on to non-linear effects. In linear regression, this is the case when we recode the predictors to be curvilinear or add interactions. In GLMs, all effects are non-linear.

mod2 <- lm(LocalAssistants ~
             OpenList
           + LaborCost,
           df)
summary(mod2)
## 
## Call:
## lm(formula = LocalAssistants ~ OpenList + LaborCost, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.492 -1.938 -0.414  1.078 35.001 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.12659    0.28612  14.422  < 2e-16 ***
## OpenList     0.82883    0.22784   3.638 0.000294 ***
## LaborCost   -0.07020    0.01015  -6.913 1.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.083 on 736 degrees of freedom
## Multiple R-squared:  0.08141,    Adjusted R-squared:  0.07891 
## F-statistic: 32.61 on 2 and 736 DF,  p-value: 2.689e-14

1. Marginal effect: the regression coefficient

The marginal effect (or the “relative” effect) refers to the relative change in y when x increases by one scale unit. This is the slope of the regression line. In a linear regression with a linear effect, this slope is constant, regardless of the values of the intercept and the other predictors.

Here, the slope of the electoral system in the bivariate model (mod1) is 0.949.

Gelman and Hill (2007) highlight that political scientists alternate between two potential interpretations:

  • descriptive/comparison of means: we can make interpretations that simply refer to differences in means. I.e. we can say that MEPs’ local staff size is on average 0.949 higher in candidate-centered systems (compared to party-centered systems).

  • causal/relative effect: We can also interpret this in a causal framework by stating that changing the electoral system (a one-unit increase in x) increases the number of local employees MEPs hold.

\(\Rightarrow\) If Denmark changed from a candidate- to party-centered system, this would lead to a 0.949 average decrease in staff size.

This is a bold claim, and usually requires you to rely on descriptive statistics and your political knowledge to assess if such a change is realistic and how large a potential change would be.

Where the regression line crosses the y-axis depends on the intercept (and the other predictors), though.

2. Predicted effects

The second step in an interpretation is to create a scenario where you fill in hypothetical values for the predictors, then fill in the equation to calculate predictions.

Here, I construct two scenarios. Specifically, I create a data frame where each observation is a scenario, where I let the electoral system change in each scenario.

scenario <- data.frame(OpenList = c(0,1)) 
scenario

I can now use the regression coefficients to calculate predictions. I can do this by hand or use the predict() function in R. It takes two arguments: the model object (for the parameters/regression coefficients) and a data frame with scenarios.

scenario <- 
  scenario %>%
  #Predict staff size from the bivariate model (mod1)
  mutate(LocalAssistants_pred = predict(mod1,
                                        newdata = scenario))
scenario

Constructing scenarios also means being mindful of the values we assign to the control variables. This is particularly important when effects are non-linear.

There are several ways to pick the scenarios, parallel to the two types of interpretations (Ward and Ahlquist 2018):

  • typical scenarios: draw on the data to construct typical examples. Here, we might even let the other predictors take different values in the different scenarios. This is the less common approach, but it is useful if we want to compare dynamics between sub-populations.

  • all-else-equal scenarios: let the predictor slide from its minimum to its maximum value while holding all other variables constant. This means you should know your descriptive statistics. As a rule of thumb, there is little to gain from making predictions in an area of x where you have no observations. You will set the other predictors (control variables) to a fixed – often typical – value (e.g. mean, mode, median…).

This is the most common way of constructing scenarios. You can use the predictions to create an effect plot, for example. The approach is close to the causal interpretation of the regression: we communicate how the outcome would change as x is modified.

\(\Rightarrow\) Here, we’d say that the average local staff size increases from 2.468 to 3.417 employees when we go from studying party-centered to candidate-centered electoral systems.

3. First-differences/absolute effect

A third step in the interpretation is to rely on the predictions from the all-else-equal scenarios to calculate the absolute effect of the predictor.

scenario <- 
  scenario %>%
  mutate(first_difference = LocalAssistants_pred - lag(LocalAssistants_pred))
scenario

In our example, this means calculating the difference between the predicted outcomes when the predictor changes by one unit (or any number of units, really).

\(\Rightarrow\) The staff size increases by 0.948636 employees when the electoral system changes from party- to candidate-centered.

Your turn!

You can run a linear regression in R with the lm() function. You state the equation first, then the data frame, and store the results in a model object.

mod1 <- lm(LocalAssistants ~
             OpenList,
           data = df)

  1. Add labor cost as a control variable and store the results in mod2. Consider the regression coefficients/parameters by looking at R’s summary of the results: summary(mod2).

  2. What is the marginal effect of the electoral system in model 2? What is the difference with model 1 and why do you think the effect changed?

  3. Consider the descriptive statistics for Denmark and France and construct two typical scenarios for the two countries in a separate data frame.

  4. Make predictions given these scenarios. What is the expected staff size in each country?

  5. Compare the predicted outcomes using first-differences.

Non-linear effects

Political science theories often imply a non-linear effect of x; that is, the effect of x depends on the values of other variables in our scenario. This is true when (see the sketch after this list):

  • the effect of x varies depending on the value of x itself: non-linear recodings, i.e. equations with quadratic terms or exponential effects (log-transformations)
  • the effect of x varies depending on the value of another variable (the moderating variable, z): interaction effects
  • the effect of x depends on the intercept and all the other predictors in the equation: e.g. generalized linear models (GLMs), where the outcome (y) is transformed in a way that alters the relationship between y and all the predictors in the equation
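As a sketch of the first two items, here is how such recodings look in lm() syntax. The variable choices (LaborCost, OpenList) are mine, for illustration only:

#The effect of labor costs may depend on the level of labor costs: a quadratic term
mod_quadratic <- lm(LocalAssistants ~ LaborCost + I(LaborCost^2), data = df)

#The effect of labor costs may depend on the electoral system: an interaction
mod_interaction <- lm(LocalAssistants ~ LaborCost * OpenList, data = df)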

Interaction effect

Models with interaction effects are like Russian dolls: they contain regressions within regressions. Specifically, they let the effect (b) of one variable (x) depend on the value of another. To express that relationship, we add a multiplicative term between the two variables and estimate a parameter for it (\(b_3\)).

\(y = a + b_1x + b_2z + b_3xz\)

Interaction effects are notoriously hard to interpret. In my opinion, this is where creating scenarios and inspecting predicted effects becomes crucial.

Interaction effects pose two challenges relating to theory-testing.

The pitfall of p-hacking

Interaction effects draw on a subset of the data to estimate a separate effect of x within a subgroup defined by z (often labelled the “moderating” variable). This opens the question of how big our effective sample is.

If we consider regressions as comparisons of means, any group difference that would occur “by chance” in fewer than 1 out of 10 samples counts as “statistically significant” (at the 10% level). In other words, it suffices to try out all the interaction effects that your model could potentially allow for: if you randomly try 10 interaction effects, you have a fair chance of finding one that appears “significant”.

In other words, your choice of interaction effects should be theory-driven.

Stringent theory testing

Berry, Golder, and Milton (2012) take this approach to a new level. They argue that theories that predict interaction effects can (and should) be tested thoroughly. They highlight that too many researchers forget that interaction effects are symmetric. We should only retain interactive theories when we can substantiate 1) that the effect of x on y depends on z and 2) that the effect of z on y depends on x. This test should be done substantively (i.e. does the interaction logically make sense in our theoretical story?) and statistically (our topic here).

To do so, they suggest focusing on marginal effects to validate as many of the following statements as possible:

  1. marginal effect of x is substantial/as predicted when z is at its minimum
  2. marginal effect of x is substantial/as predicted when z is at its maximum
  3. marginal effect of z is substantial/as predicted when x is at its minimum
  4. marginal effect of z is substantial/as predicted when x is at its maximum
  5. the effect of the product term (\(b_3xz\)) should be substantial

Let’s consider a conditional theory of local investment among MEPs. Specifically, I expect that as national elections draw near, some MEPs (those that would want to transition to national politics) are inclined to hire larger teams of local assistants. However, I also believe that this is more pressing for MEPs that would compete in a candidate-centered system.

\(Local \:Assistants = b_1 + b_2\times Calendar + b_3\times Electoral \:System + b_4 \times Calendar \times Electoral \:System\)

The symmetry means that we should have an answer to all of these questions. For simplicity, Berry, Golder, and Milton (2012) suggest considering only the minimum and maximum levels of the moderating variable. However, it is up to us, as researchers, to determine what the predicted effect of x should be when z is low and high, respectively.

  1. MEPs increase local staff when they get closer to the national elections when they hail from candidate-centered systems (i.e. the effect of the electoral calendar is positive for candidate-centered systems) (this is definitely implied in my theory)

  2. MEPs increase local staff when they get closer to the national elections when they hail from party-centered systems (i.e. the effect of the electoral calendar is positive for party-centered systems) (is this implied in my theory?)

  3. MEPs from candidate-centered systems hire more local assistants (than MEPs from party-centered systems) when we are far from a national election (is this implied in my theory?).

  4. MEPs from candidate-centered systems hire more local assistants (than MEPs from party-centered systems) when we are close to a national election (this is definitely implied in my theory)

  5. Is the increase in staff substantial/noteworthy for a realistic move in the electoral calendar (e.g. immediately before vs. after an election) and between electoral systems (i.e. is staff size substantially different between the two groups)?

I implement this as a model in R by interacting the variables ProxNatElection (x, measuring the negative number of years until the next election) and NationalCandidateCentered (z, binary).

mod5 <- lm(
  LocalAssistants ~
    NationalCandidateCentered 
  * ProxNatElection,
  df
)
summary(mod5)
## 
## Call:
## lm(formula = LocalAssistants ~ NationalCandidateCentered * ProxNatElection, 
##     data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.850 -1.875 -0.658  0.967 36.307 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                 0.2339     0.4585   0.510     0.61
## NationalCandidateCentered                   3.7274     0.5440   6.851 1.55e-11
## ProxNatElection                            -0.7427     0.1547  -4.802 1.90e-06
## NationalCandidateCentered:ProxNatElection   1.0683     0.1978   5.400 9.01e-08
##                                              
## (Intercept)                                  
## NationalCandidateCentered                 ***
## ProxNatElection                           ***
## NationalCandidateCentered:ProxNatElection ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.116 on 735 degrees of freedom
## Multiple R-squared:  0.06293,    Adjusted R-squared:  0.05911 
## F-statistic: 16.45 on 3 and 735 DF,  p-value: 2.343e-10

From the model parameters, we already understand that the product term is sizeable and statistically significant. However, what does this mean?

1. Marginal effect

Each of the two variables involved in the interaction now has two regression coefficients. They can be read as an intercept and a slope, where the slope of the marginal effect depends on the value of the moderating (other) predictor.

In the following, I implement Berry, Golder, and Milton (2012)’s advice for effects 1. and 2.

Intercept: The effect of the calendar is -0.743 when the national electoral system is party-centered (z = 0).

\(ME(Calendar|Electoral\:System = "Party") = b_2 + b_3 \times 0\)

\(-0.743 + 1.068 \times 0 = -0.743\)

Slope: The interaction coefficient reports how the effect of the calendar changes with the electoral system. When the system is candidate-centered (z = 1), the effect of the calendar is 0.326.

\(ME(Calendar|Electoral\:System = "Candidate") = b_2 + b_3 \times 1\)

\(-0.743 + 1.068 \times 1 = 0.326\)

The interaction is symmetric: the effect of a candidate-centered system depends, in turn, on the calendar. One year before the election (\(Calendar = -1\)), it is \(3.727 + 1.068 \times -1 = 2.659\).
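A small sketch of this arithmetic in R, pulling the coefficients from mod5:

#Marginal effect of the calendar, conditional on the electoral system (z)
b2 <- coef(mod5)["ProxNatElection"]
b3 <- coef(mod5)["NationalCandidateCentered:ProxNatElection"]

b2 + b3 * 0  #party-centered systems: about -0.743
b2 + b3 * 1  #candidate-centered systems: about 0.326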

In this example, the z-predictor has only two values. Berry, Golder, and Milton (2012) focus on the marginal effect, suggesting that we always make scenarios by letting z – the moderating variable – slide from low to high to verify that the effect of x is consistent with the theory. In the marginal-effect plot, the x-axis reports the value of z, while the y-axis reports how the marginal effect of x changes. This makes the most sense when both predictors are numeric.

#Data frame with scenarios of coefficients
me <- 
  data.frame(
    #Intercept of the marginal-effect line: the calendar coefficient
    b1 = unname(mod5$coefficients[3]),
    #Slope of the marginal-effect line: the interaction coefficient
    b2 = unname(mod5$coefficients[4]),
    #Moderator: the electoral system
    z = c(0, 1)
  ) %>%
  mutate(
    #Marginal effect of the calendar at each level of z
    x = b1 + b2 * z
  )
me %>%
  ggplot +
  geom_line(
    aes(x = z,
        y = x)
  ) +
  #Line to guide the eye
  geom_hline(
    aes(
      yintercept = 0
    ),
    lty = 2
  ) +
  ylab("Marginal effect of electoral cycle (ME(x|z))") +
  xlab("Electoral system (z)") +
  ggtitle("Marginal effect of electoral cycle on MEP local staff is conditional on electoral system")

Personally, I find it more intuitive to consider the predicted outcomes.

2. Predicted outcomes

We can also go all the way and fill in values for the entire equation.

Here, I construct two sets of scenarios: One for the effect of electoral calendar when the national electoral system is candidate-centered, another for the effect of the electoral calendar when the system is party-centered.

scenario <- 
  data.frame(
    #Moderator (z): party- vs. candidate-centered systems (the two values are recycled across rows)
    NationalCandidateCentered = c(0, 1),
    #Let the electoral calendar (x) slide over its range, repeated twice (as many times as z has values)
    ProxNatElection = rep(
      seq(-4, 0, 0.1), 
      2)
    ) %>%
  #Predict from the model parameters
  mutate(
    LocalAssistants_pred = predict(mod5, .)
  )
scenario %>%
  mutate(electoral_system = if_else(NationalCandidateCentered == 1,
                                    "Candidate-centered",
                                    "Party-centered")) %>%
  ggplot +
  geom_line(
    aes(
      x = ProxNatElection,
      y = LocalAssistants_pred,
      color = electoral_system
    )
  ) +
  ylim(0, 5) +
  ggtitle("Predicted effect of electoral cycle on local staff size conditional on the electoral system")

3. First differences

We can play around with the first differences.

Take 1

I can now calculate the effect of a one-year increment within each electoral system. I do this by subsetting my scenarios to the calendar values -1 and 0.

scenario %>%
  filter(
    #One year before the election
    ProxNatElection == -1 |
      #During the election
      ProxNatElection == 0 ) %>%
  group_by(NationalCandidateCentered) %>%
  reframe(
    LocalAssistants_pred,
    first_diff = LocalAssistants_pred - lag(LocalAssistants_pred))

These are the effects of a one-year increment within each electoral system.

Take 2

scenario %>%
  filter(
    #One year before the election
    ProxNatElection == -1 |
      #During the election
      ProxNatElection == 0 ) %>%
  group_by(NationalCandidateCentered) %>%
  reframe(first_diff = LocalAssistants_pred - lag(LocalAssistants_pred)) %>%
  filter(!is.na(first_diff)) %>%
  #The difference between the two one-year effects
  mutate(second_diff = first_diff - lag(first_diff))

More interestingly, the difference between these two one-year effects shows how much larger the effect is in candidate-centered systems. Do you recognize it from the regression output?

Take 3

A realistic scenario is to compare not a one-year but a four-year increment. This way, I compare staff sizes immediately after an election with staff sizes immediately before the next one (during the campaign period). I therefore subset my scenarios to -4 and 0.

scenario %>%
  filter(
    #Four years before the election (immediately after the previous one)
    ProxNatElection == -4 |
      #During the election
      ProxNatElection == 0 ) %>%
  group_by(NationalCandidateCentered) %>%
  reframe(
    LocalAssistants_pred,
    first_diff = LocalAssistants_pred - lag(LocalAssistants_pred))

Your turn!

  1. Remove the electoral system variable from the equation for model 5. Run identical models using proximity of election as a predictor on two subsets of the data (z == 0; z == 1). How does this relate to your results in model 5?

  2. I find some substantial differences/effects in model 5, but how consistent are they with my theory?

  3. Can you calculate the marginal effect of the electoral system, given the electoral calendar, following Berry, Golder, and Milton (2012)’s advice (effects 3. and 4.)? You can refer to the table below to navigate their advice.

Table of marginal effects

Hypothesized relationship   Variable names                                    Intercept    Slope       Moderating variable   Marginal effect
\(ME(x|z_{min})\)           \(ME(Calendar|System = "Party-centered")\)        -0.7427308   1.0682636   0                     -0.7427308
\(ME(x|z_{max})\)           \(ME(Calendar|System = "Candidate-centered")\)    -0.7427308   1.0682636   1                     0.3255328
\(ME(z|x_{min})\)           \(ME(System|Calendar = -4)\)
\(ME(z|x_{max})\)           \(ME(System|Calendar = 0)\)

Uncertainties and errors

We can use regressions to describe data. However, we often use regressions with the intention of making inferences, i.e. we want to describe a general trend beyond the data (in a population). Since we only observe part of the real world, there is some uncertainty about our inferences. We use the variation in the data to infer how much our estimates might deviate from the population values we do not observe.

This is reflected in the standard error of the parameters and in the error term (residuals) of the regression.

Parameters: point estimates and standard errors

Until now, we’ve talked about differences in means without taking into account the variation there is in the data. However, if we plot the distribution of the two groups, we can see that they overlap. It is not the case that all MEPs from party-centered systems have fewer staffers than their counterparts from candidate-centered systems.

df %>%
  mutate(OpenList = if_else(OpenList == 1,
                            "Candidate-centred",
                            "Party-centred")) %>%
  ggplot +
  geom_density(
    aes(
      LocalAssistants,
      group = OpenList,
      fill = OpenList
    ),
    alpha = 0.5
  )
Plotting the distribution of local staff among MEPs shows that the two groups largely overlap.

In fact, there is quite a lot of within-group variation that our model does not account for. That’s why we have standard errors.

Gelman and Hill (2007) differentiate between \(a\) and \(b\) and \(\hat a\) and \(\hat b\). This is to make explicit that the regression coefficients you see in the regression table are estimated on a data sample. In a new – same-sized – sample from the same population, you would most likely find slightly different results.

In fact, the estimation builds on the assumption that the (differences in) group averages are drawn from the distribution of group averages that would result from other – similar-sized – samples drawn from the theoretical population. This sampling distribution is assumed to approximate a normal distribution if the sample size is large enough. Otherwise, we refer to the t-distribution, which allows for more variation.

  • regression coefficients are called “point estimates” insofar as we believe they are the average (i.e. most likely) value of that theorized distribution. They summarize the link between two variables in a single number.

  • the standard errors report the spread of the theorized distribution. However, the only thing we know about the population comes from our sample. In principle, the more heterogeneity there is in our sample, the more different regression coefficients could potentially describe our theory: if the within-group variation is large (within value combinations of the predictors), there is more variation – more uncertainty – in the point estimates. We reserve the term “standard deviation” for the variation in an observed variable, and use “standard error” for the variation in a theorized variable (such as the parameters); see the sketch after this list.
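A sketch of where these quantities live in R, using the bivariate model mod1:

#Point estimates and standard errors side by side
summary(mod1)$coefficients

#Approximate 95% confidence intervals around the point estimates
confint(mod1)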

Residuals: regression error and model assessment

By describing the relationship between x and y with a single number, we will necessarily be wrong most of the time. This is reflected in the error term (\(\epsilon_i\) or \(r_i\)). However, if the estimation is unbiased, we will be right on average; that is, our errors should sum to 0. This is why we often omit the error term when we present the model’s equation.

\[y_i = a + bx_i + \epsilon_i\]

The residuals (\(r\)) are defined as the difference between what the model predicts and what we observe.

df <- 
  df %>% 
  #Difference between observed y and predicted y
  mutate(LocalAssistants_pred = predict(mod5, df),
         residuals = LocalAssistants - LocalAssistants_pred)

We can check if the residuals are unbiased.

df$residuals %>% mean
## [1] 3.480269e-15

This is approximately 0.

We draw on residuals for model statistics. That is, we assess the model’s ability to describe the data by relying on how often and under what conditions the model is wrong in its in-sample predictions.

Model fit (in-sample predictions)

As we add more (relevant) predictors to our model, the spread in the residuals should drop. We can see this by comparing the residual standard deviation (\(\sigma\)) between different models.

\[\sigma = \sqrt{\frac{\sum_i r_i^2}{n - k}}\]

You can calculate this by hand, but R’s summary() function already does it for us.
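For the record, a by-hand sketch using the mod5 residuals we stored in df earlier:

#Residual standard deviation for model 5, by hand
n <- sum(!is.na(df$residuals))  #number of observations with a prediction
k <- length(coef(mod5))         #number of estimated parameters (incl. the intercept)
sqrt(sum(df$residuals^2, na.rm = TRUE) / (n - k))  #about 3.116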

#Residual standard deviation of model 1
summary(mod1)$sigma
## [1] 3.178989
#Model 5 (the more complex model)
summary(mod5)$sigma
## [1] 3.115609

We can also correlate the predictions with the observed values to get the \(R^2\):

cor(df$LocalAssistants, df$LocalAssistants_pred)^2
## [1] 0.06293001
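For an OLS model with an intercept, this squared correlation is the same \(R^2\) that summary() reports, so we can double-check:

#The same quantity, straight from the model summary
summary(mod5)$r.squared  #matches cor()^2 above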

Assumptions of the linear model

Normally distributed errors

A typical assumption of the linear model is that the residuals are normally distributed. That is, the outcome variable does not have to follow a normal distribution, but once we’ve added the predictors to the model, the residuals should do so.

We can get a “gut feeling” about the distribution of errors by making a histogram.

df$residuals %>% hist(., breaks = 30)

Another way is to standardize the residuals and then compare their quantiles to the quantiles in a standard normal distribution.

df <- 
  df %>% 
  mutate(residuals_s = residuals/sd(residuals, na.rm = T))

df$residuals_s %>% qqnorm()

If the residuals are normally distributed, the observations in the scatterplot would follow the diagonal in the plot.
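Base R can draw that reference diagonal for us; a small sketch:

#QQ-plot of the standardized residuals with the reference line
qqnorm(df$residuals_s)
qqline(df$residuals_s)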

Gelman and Hill (2007) argue that this assumption – while often put forward – is the least important condition to satisfy for the model to be valid. We can trace this back to their view of linear regressions as comparisons of averages.

Homoskedasticity

More crucial is the assumption that the errors are equally distributed across the entire span of the outcome variable. This is because the standard errors reported for the regression coefficients assume that the model is equally wrong across all predictions.

We can verify this by plotting the residuals against the predicted values of our model.

df %>%
  ggplot +
  geom_point(
    aes(
      y = residuals,
      x = LocalAssistants_pred
    )
  )

Here, we see that the residuals “fan out”; they have a larger spread at higher predicted values. I also underpredict the size of MEPs’ local staff more often than I overpredict it (i.e. there are more large positive than large negative residuals).

I read a few things into this:

  • standard errors: This is a tell-tale sign that my standard errors might be too small (I’m too certain about my results). Sometimes this is fixed by calculating “robust standard errors” that account for the higher variability in the parameters (see the sketch at the end of this section).

  • consider a GLM: Unequal spread also often indicates that I should consider a regression model that doesn’t rely on the normal distribution; that is, I should consider going beyond the linear model.

Lastly, unequal spread is sometimes an early warning that I may have confounders that the model doesn’t account for (an omitted variable bias). That’s the topic for next week.
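A sketch of the robust-standard-error fix mentioned above, using the sandwich and lmtest packages (neither is loaded elsewhere in this session, so this is an addition on my part):

#Heteroskedasticity-consistent ("robust") standard errors for model 5
library(sandwich); library(lmtest)
coeftest(mod5, vcov. = vcovHC(mod5, type = "HC1"))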

Literature

Berry, William D., Matt Golder, and Daniel Milton. 2012. “Improving Tests of Theories Positing Interaction.” The Journal of Politics 74 (3): 653–71. https://doi.org/10.1017/S0022381612000199.
Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge; New York: Cambridge University Press.
Hermansen, Silje Synnøve Lyder. 2023. R i praksis - en introduktion for samfundsvidenskaberne. 1st ed. Copenhagen: DJØF Forlag. https://www.djoef-forlag.dk/book-info/r-i-praksis.
Hermansen, Silje Synnøve Lyder, and Andreja Pegan. 2024. “Blurred Lines Between Electoral and Parliamentary Representation: The Use of Constituency Staff Among Members of the European Parliament.” European Union Politics, 1–25. https://doi.org/10.1177/14651165221149900.
Ward, Michael D., and John S. Ahlquist. 2018. Maximum Likelihood for Social Science: Strategies for Analysis. Analytical Methods for Social Research. Cambridge: Cambridge University Press. https://doi.org/10.1017/9781316888544.