Introduction to linear regression analysis

IUROPA workshop, 11 September, Oslo

In this session, we will go through the typical workflow of a regression analysis: setting up the data, descriptive statistics, model fitting, and interpretation.

Introduction

We will work with data on preliminary reference rulings delivered by the European Court of Justice (ECJ). Specifically, we’ll work with the hypothesis that the length of decision making (y) – the time it takes the Court to reach a conclusion – depends on how complex the legal question is (x1) and on the number of actors involved (x2).

Main methodological takeaways

We can describe this relationship with a linear equation (that we can later estimate).

\(y_i = \alpha + \beta x_i + \epsilon_i\)

Data We are interested in explaining variation in the outcome (“dependent”) variable, y, by variation in the predictor (“independent”) variable, x. x and y are vectors of data points.

Parameters Our parameters describe the relationship between the two variables. We estimate them by applying theory (the linear equation) to data (our variables).

  • intercept (\(\alpha\)): the value of y when x is 0.
  • slope (\(\beta\)): the rate of increase in y when x increases by 1 unit.

The parameters are interesting for several purposes. Our research question and hypothesis is usually related to the slope parameter. Our theory predicts the direction of the relationship (negative/positive), and we may want to know the strength of the relationship (size of the coefficient).

Once we have estimated the parameters, we can also start predicting outcomes by combining the parameters with hypothetical data values (scenarios). This is useful for understanding the scope of what you’ve found.

  • errors (\(\epsilon\)): the error term is actually just the difference between our observed outcome y and the predicted outcome given our data and parameters. It varies by observation unit. Often we don’t report it, and if we do, we tend to summarize the errors by their spread (standard deviation).
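In equation form, the prediction and the error for each observation are:

\(\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i\)

\(\epsilon_i = y_i - \hat{y}_i\)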

We can think of regression analysis in two ways: as a comparison of means or as a means of inference.

In this workbook, we will use linear models (OLS) first to estimate differences in group averages. Specifically, we will calculate the mean of the dependent variable for each category in our predictor and compare these means. We then move to using the model for inference – predicting outcomes from combinations of (metric) variables. Our focus is then on interpretation. Last, we will use regression analysis to make an all else equal argument. That is, we will control for – or rule out – alternative explanations for the variation in y beyond the one we have theorized.

R packages

I will work with the following R packages:

  • dplyr (or tidyverse) for data wrangling
  • ggplot2 and ggeffects for visualization
  • stargazer for tables

I start by fetching the R packages we will use from the library.

#Data wrangling and pipes
library(dplyr); 
#Visualization; set my preferred theme
library(ggplot2); theme_set(theme_minimal())

Make sure that your working directory is set to the folder where your data is and where you will store your new files. The easiest way to do this is to set up a “project”. Here, I check where my working directory is.

getwd()
## [1] "C:/Users/dhf568/Dropbox/Teaching/Universitetet i Oslo/IUROPA/introduction to regression analysis"

The data

We will be working with one outcome variable – case duration – and one, then two predictors. We will set up and explore the data as we go. The three variables are stored in three data frames.

You can find the data we will work with on my website; the URLs are given in the code below.

You can either i) open your browser, download the files and drag and drop them into your working folder (for me, this is the “iuropa” folder on my computer), or ii) download the files directly from R.

#Main data frame: preliminary references observed at the judgment level
download.file(url = "https://siljehermansen.github.io/teaching/iuropa/df.rda",
              destfile = "df.rda")

#File on judicial actors positions on legal questions
download.file(url = "https://siljehermansen.github.io/teaching/iuropa/positions.rda",
              destfile = "positions.rda")

#File on member state submissions/observations
download.file(url = "https://siljehermansen.github.io/teaching/iuropa/observations.rda",
              destfile = "observations.rda")

Outcome variable: Duration of the proceeding

The main data frame (df.rda) contains all the information we need to set up our dependent variable. It also contains the data structure.

Once I have downloaded and stored the data on my computer, I load it into R.

load("df.rda")

The head() function in base R allows us to see the first six observations in the data frame. I use it to inspect the data. An alternative is the popup window that appears when we use View(df).

head(df)

As is clear, the data structure here is at the judgment level. The case number, celex, and ecli are institutional ID numbers for each document/judgment. We will later use celex as the key to merge data from different sources.

Our two variables of interest are the dates when the case was lodged and the judgment rendered. I check how the data is stored: that is, what is the registered measurement level?

class(df$date_lodged); class(df$date_decision)
## [1] "Date"
## [1] "Date"

The answer is that these are Date objects. In effect, they count the number of days since 1970-01-01.
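A quick illustration (a side-check, not part of the workflow): the origin is day zero, and each later date adds one day.

#The origin of the Date class: 1970-01-01 is day 0
as.numeric(as.Date("1970-01-01"))
#One day later is day 1
as.numeric(as.Date("1970-01-02"))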

I can check what period the data covers.

min(df$date_lodged); max(df$date_lodged)
## [1] "1995-01-04"
## [1] "2011-12-29"

Recode to create the dependent variable

I now create my dependent variable: Duration of the decision. To do so, I mutate() the data frame (i.e. change/create the specified variables). Next, I define the duration as a numeric variable to get the number of days between the two events.

df <-
  df %>%
  mutate(
    #Calculate the number of days between the lodging and judgment in the case
    duration = date_decision - date_lodged,
    #Transform the variable class to number of days (numeric)
    duration = as.numeric(duration))

When I create new variables or recode old ones, I always check the result. I eyeball the data and run descriptive statistics. It’s amazing how often something goes wrong.

Descriptive statistics for numeric/continuous data

The descriptives that are useful depend on the measurement level: numeric (counts, any numbers) or categorical (unordered or ordered).

Summary numbers

When the variable is numeric, we can calculate the mean, standard deviation (spread), quantiles, minimum, maximum etc. The summary() function can summarize any R object for us.

summary(df$duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   146.0   528.0   632.0   660.9   764.2  1680.0

I’m scouting for weird observations: Very low and very high values, for example. I know that the duration cannot be negative and that very high durations might skew the final analysis.
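One way to do this is to sort the data and inspect the most extreme cases (a sketch; any equivalent check works):

#Sort from longest to shortest duration
df %>%
  arrange(desc(duration)) %>%
  #Keep the ten longest-running cases
  slice_head(n = 10) %>%
  #Retain only the columns needed to judge plausibility
  select(celex, date_lodged, date_decision, duration)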

Here, we see that cases lasted on average 661 days; slightly less than two years. However, there is quite some variation.

Histogram

Personally, I always visualize the descriptives. When the variable is numeric, histograms are great. To do so, I use the ggplot2 vocabulary. Here, I first use the tidyverse pipe (%>%) to connect the data to the graphic, then the ggplot pipe (+) to connect the graphical elements.

#Specify the R object
df %>%
  #Create a plot
  ggplot +
  #Specify the type of plot and variable
  geom_histogram(aes(duration))

The histogram tells us that there is variation in the dependent variable. That’s good news: Variation is information and the data analysis is all about connecting the variation in one variable with the variation in another variable. That’s why I always check the variation in my data.

This is like Lego: I start with the basic structure, but will add more visual elements later when I share the graphic.
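For example, a later version of the histogram might add labels and a title (the labels here are illustrative choices, not part of the workshop code):

df %>%
  ggplot +
  geom_histogram(aes(duration)) +
  #Label the axes and the plot before sharing it
  labs(x = "Duration of proceedings (days)",
       y = "Number of judgments",
       title = "Distribution of case duration")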

Predictor: case complexity

One of the two predictors we’ll consider here is the number of issues (legal questions) that the domestic judge asks the Court. My expectation is that as the number of legal questions increases, the duration increases. It takes longer to resolve the case.

Aggregate

The data was collected by the IUROPA project. Coders with legal training read the judgments and identified each question. The data is stored as a separate data frame.

load("positions.rda")

Check the data structure.

head(positions) # or View(positions)

This time, the data structure is different. In each “judgment” (celex), there are several “issues” (issue_id) and several possible answers proposed by different actors (position_id). Here, we’ll ignore the latter. Instead, we will aggregate the data back up to the judgment level.

I do this by counting the number (length()) of unique issues (unique()) per case (group_by(celex)). The result is stored in a new R object (df0). To do this, I use the tidyverse/dplyr verbs group_by() and reframe().

df0 <- 
  #Original data frame
  positions %>%
  #Aggregate to celex/judgment level
  group_by(celex) %>%
  #Create a new data frame on the celex level
  reframe(n_issues = length(unique(issue_id)))

I always check the result. Does this look correct?

head(df0)

It does. I now have the key variable (celex) that I use to connect the two data frames. For each observation, I have a number of issues reported.

I merge the two data frames using the left_join() function from dplyr. It requires two data frames (x and y) and preferably the name of the common key. Merging data is tricky: we need to know what data structure we want the result to have. That’s why I always decide which data frame has the correct structure (if any). Here, the df object sets the standard. I then check the dimensions of the data before and after the operation.

dim(df)
## [1] 2212   10
df <- left_join(df, df0,
                #The key variable in both data sets
                by = "celex")
dim(df)
## [1] 2212   11

A handy trick is to test the code first by creating a new object, e.g. df_test, so that I don’t have to go back and rerun all my code if anything goes wrong.
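For example, before running the merge above, I could have tested it on a throw-away object (df_test is purely illustrative):

#Test the merge on a copy first
df_test <- left_join(df, df0, by = "celex")
#Same number of rows as before? One new column?
dim(df_test)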

I also check the data structure.

head(df)

Univariate descriptives for integer or categorical data

This time, our variable is a count: the number of issues. These are integers, so we can treat them either as numeric or as (ordered) categories. Whenever I have categorical data, the descriptive statistics are frequency tables (how often does a category occur?) and barplots.

Frequency table

I can create the frequency table in base R…

table(df$n_issues)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  17 
## 808 632 382 176  95  54  26  17   9   3   4   1   2   2   1

… or dplyr.

df %>%
  group_by(n_issues) %>%
  reframe(count = n())

Either way, we see that most cases have only one or two issues in them, but a few contain a bunch.

Barplot

Categorical data is best visualized with barplots.

df %>%
  ggplot +
  geom_bar(aes(n_issues))

Ok. It is clear that I have variation on both variables. I do take note, however, that some of that variation might be mostly noise. That is, most of the variation in my predictor is in the five categories 1-5. However, there is much more variation and nuance in the outcome. There will be unexplained variation. This is normal.
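To put a number on this (a quick side-check): judging from the frequency table above, roughly 95% of judgments have five or fewer issues. The cumulative shares can be computed directly.

#Cumulative proportion of judgments by number of issues
cumsum(prop.table(table(df$n_issues)))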

Main R functions in this section:

Function                                                          Name               Package
download files from the internet                                  download.file()    base
load data                                                         load()             base
retain only unique values of a variable/data frame                unique()           base
length of a vector                                                length()           base
dimensions of a data frame                                        dim()              base
make operations within groups of data                             group_by()         dplyr
aggregate data to a new data frame                                reframe()          dplyr
alter an existing data frame without changing the structure       mutate()           dplyr
merge data using the left (x) data frame as a basis               left_join()        dplyr
histogram for the univariate distribution of numeric variables    geom_histogram()   ggplot2
barplot for the univariate distribution of categorical variables  geom_bar()         ggplot2

Your turn!

Can you do the same? The data frame observations.rda contains the member state observations (amicus curiae briefs) that are filed in each case. Our second expectation is that it will take longer for the Court to deliver a judgment when more governments submit their views on how the Court should resolve the case. Can you…

  • load in the data and check the structure
  • aggregate the data to obtain a count of the number of governments submitting
  • merge the data by adding the new variable to your extant data frame (df).
  • find a useful way of describing the variation in the variable
  • can you see a problem? How would you fix it?

Bivariate analysis

Let’s move to the more fun part. How are the two variables connected? I will go through this from two angles:

  • regression as a comparison of group means: this mostly makes sense for a categorical predictor.
  • regression for inference

Comparison of group averages

We can use the same trick as before and calculate the group means, i.e. the average duration among cases that contain one legal issue, two legal issues, etc.

df %>%
  group_by(n_issues) %>%
  reframe(y_m = mean(duration))

We can already see that the average duration tends to increase with the number of issues treated. However, the increments are not all the same size, and the duration does not increase at every step.

Let’s calculate the increment and the size of the group. That is, I am interested in the difference between the group means and how many judgments are in each group. To do so, I use the lag() function that fetches the value of the previous observation and the n() function that counts the number of occurrences in each group. This time, I store the result in an R object for convenience (tab).

tab <- 
  df %>%
  #Categories: number of legal issues
  group_by(n_issues) %>%
  #New aggregated data frame with group averages
  reframe(
    #Average case duration for each category
    y_m = mean(duration),
          #Number of observations with n issues
          n = n()) %>%
  #New variable in the data frame: difference between groups
  mutate(diff = y_m - lag(y_m))
tab

Cases that involve two legal issues instead of one last on average 26.9 days longer.
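Rather than copying the number by hand, I can index the tab object (the same logic as the “peek behind the scenes” later on):

#Difference in average duration between one- and two-issue cases
tab$diff[tab$n_issues == 2]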

When we have many ordered categories, we can even visualize this.

tab %>%
  ggplot +
  #Create a point for each group
  geom_point(
    aes(
      #Average duration on the y axis
      y = y_m,
      #Legal issues on the x axis
      x = n_issues,
      size = n
    )
  ) +
  #Draw a line between the groups
  geom_line(
    aes(
      y = y_m,
      x = n_issues
    )
  )

From the plot, we see that most of our data is in the lower span of the predictor. Also, this is where the increase in y is as theorized. Towards the extreme end of the predictor, we’re all over the place but note how few observations we have! This is not representative of the general trend.

Regression as a comparison of means

Regression analysis is really just a comparison of averages… but it smooths over the averages and gives more weight to i) value combinations that are common (big groups) and, unfortunately, ii) sometimes to weird outliers.

You can regress a numeric variable (outcome) on both categorical and numeric predictors. I’ll start by creating a categorical variable out of our predictor.

df <- 
  df %>%
  mutate(n_issues_cat = as.factor(n_issues))

When the predictor is categorical, we compare the group means of the outcome against a reference category. If you don’t specify one, R uses the first category (the level that comes first numerically or alphabetically).
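If we wanted a different baseline, we could change the reference level ourselves before fitting the model. A sketch (df_relevel is a hypothetical object; the models below keep one legal issue as the reference):

#Hypothetical alternative: make two-issue cases the reference category
df_relevel <-
  df %>%
  mutate(n_issues_cat = relevel(n_issues_cat, ref = "2"))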

To fit a linear model, I specify my equation in the lm() function: \(y \sim x\) (y is modeled as a function of x). I also have to specify where the variables are stored (the data argument) and store the results in a separate R object (mod.cat).

mod.cat <- lm(duration ~ 
            n_issues_cat,
          df
)

The result is a list object in R. We can admire our parameters by asking for a summary.

summary(mod.cat)
## 
## Call:
## lm(formula = duration ~ n_issues_cat, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -531.85 -129.07  -29.04   97.16 1023.08 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     630.009      6.456  97.590  < 2e-16 ***
## n_issues_cat2    26.911      9.745   2.762 0.005800 ** 
## n_issues_cat3    44.057     11.394   3.867 0.000114 ***
## n_issues_cat4    87.315     15.265   5.720 1.21e-08 ***
## n_issues_cat5    75.844     19.903   3.811 0.000142 ***
## n_issues_cat6    70.621     25.793   2.738 0.006231 ** 
## n_issues_cat7    92.684     36.563   2.535 0.011316 *  
## n_issues_cat8    71.756     44.972   1.596 0.110730    
## n_issues_cat9    94.991     61.508   1.544 0.122642    
## n_issues_cat10  363.991    106.143   3.429 0.000616 ***
## n_issues_cat11  261.741     91.979   2.846 0.004473 ** 
## n_issues_cat12  115.991    183.618   0.632 0.527650    
## n_issues_cat13  480.991    129.918   3.702 0.000219 ***
## n_issues_cat14  343.491    129.918   2.644 0.008254 ** 
## n_issues_cat17 -172.009    183.618  -0.937 0.348978    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 183.5 on 2197 degrees of freedom
## Multiple R-squared:  0.03953,    Adjusted R-squared:  0.03341 
## F-statistic: 6.458 on 14 and 2197 DF,  p-value: 6.279e-13
  • the intercept (\(\alpha\)) is 630 days. This is the average duration when the number of legal issues is one.
  • each category then reports the difference in averages with the reference group (\(\beta\)). I.e. the average duration of judgments with two legal issues is on average 27 days higher than when there is only one issue to resolve.

We can predict the expected case duration by filling in the equation, where x is now an indicator (0/1) for the category of interest (here: two legal issues).

\(y = \alpha + \beta x\)

\(y = 630 + 27 x\)

  • it means that the average duration in judgments with two legal issues is 657 days.
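We can let R do the arithmetic by extracting the coefficients from the model object (the names match the summary output above):

#Expected duration with two legal issues: intercept + category effect (about 657 days)
coef(mod.cat)["(Intercept)"] + coef(mod.cat)["n_issues_cat2"]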

Visualize effects when predictor is categorical: coefplot

The best way to grasp what we have found is to make predictions and visualize them. When the predictor is categorical, we can visualize each expected group average as a dot and add whiskers/errorbars to show how certain we are about each prediction. The (un)certainty comes from the size of the group and the within-group variation.

We can of course make the predictions by hand as we did in the previous subsection, but we can also make R do this for us. The ggeffects package uses the data and the model to simulate what the results would be if our model were a correct description of the world.

library(ggeffects)
eff.cat <- ggpredict(mod.cat,
                     terms = "n_issues_cat")

I store the answer in a separate R object that I call eff.cat. It is a data frame containing all the relevant information:

eff.cat %>%
  as.data.frame %>%
  head

We can see that it makes predictions for each value of x (number of legal issues) as well as calculating the confidence intervals for these predictions.

The ggeffects package provides a plot() method for these objects, so a default plot is only one call away.

eff.cat %>%
  plot

We can use ggplot vocabulary to tweak the aesthetics of the plot if we want. Ggplot is like the mirror of our dreams; it always zooms in on the visual elements so that the effects look large. I usually zoom out to report the entire span of the y axis.

eff.cat %>%
  plot +
  ylim(0,1400)

Regression as a means of inference

A second way of exploring the bivariate relationship is to treat the predictor as a numeric value. This usually means that the regression generalizes more; it summarizes the data even more succinctly to highlight the general trend.

A correlation (relationship) between two variables is characterized by covariation; when one changes, so does the other.
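A single number that summarizes this covariation is the correlation coefficient; a quick look before plotting (we use cor.test() more formally later):

#Pearson correlation between duration and number of issues
cor(df$duration, df$n_issues)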

Scatterplot

A basic way to visualize this is with the scatterplot.

df %>%
  ggplot +
  geom_point(
    aes(
      y = duration,
      x = n_issues
    )
    )

Since the predictor is an integer, it is hard to see exactly how many observations I have, so I can jitter (“shake”) the x-scale a little (only for the visuals) and make the dots transparent.

df %>%
  ggplot +
  #jitter instead of points
  geom_jitter(
    aes(
      y = duration,
      x = n_issues
    ),
    #Transparent dots
    alpha = 0.5,
    #I let the observations vary by +/- 0.4
    width = 0.4
    )

Is this a correlation? It becomes clear that while the average duration tends to be higher when there are more legal issues, this is far from always the case for individual judgments.

Local regression (loess)

A second step is to fit a loess (local regression) model. This is a moving group average where the group is defined by a “sliding window”, such that we obtain a multitude of local estimates (in effect, a line) that bends with the average case duration over the local span of the x axis (legal issues).

df %>%
  ggplot +
  geom_smooth(
    aes(
      y = duration,
      x = n_issues
    )
    )

Right… Now we see that there is – on average – a pretty straight, continuous increase. We also see that the uncertainty increases when the number of legal issues is high. This is normal, since there are few observations there.

Regression

We’ve seen that the increments vary by group and that the group sizes vary. Regression analysis simplifies this by describing the relationship as an average increment in y when x increases by one unit. That is, it finds a way to summarize the data for us.

mod <- lm(duration ~ 
            n_issues,
          df
)
summary(mod)
## 
## Call:
## lm(formula = duration ~ n_issues, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -535.86 -128.73  -31.07   98.27 1026.27 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  616.314      6.753  91.270  < 2e-16 ***
## n_issues      18.710      2.307   8.109 8.35e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 184 on 2210 degrees of freedom
## Multiple R-squared:  0.0289, Adjusted R-squared:  0.02846 
## F-statistic: 65.76 on 1 and 2210 DF,  p-value: 8.354e-16
  • intercept: cases with zero legal issues tend to last 616 days. This is completely fictional, of course; our scale starts at one.

  • slope: average duration increases on average by 19 days for each additional legal issue.

  • prediction: by filling in the equation, I find that the predicted duration for a case with one legal issue is 635 days.
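As a check, we can let R fill in the equation with predict() from base R:

#Expected duration for a case with one legal issue (about 635 days)
predict(mod, newdata = data.frame(n_issues = 1))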

Visualize the effect of a numeric predictor: effect plot

The ggeffects package can also handle predictions with numeric predictors. This time, the result is a line and a ribbon of uncertainty (confidence interval).

eff.numeric <- ggpredict(mod,
                         terms = "n_issues") 

eff.numeric %>%
  plot +
  ylim(0, 1400)

Often, we will want to illustrate where the data points are. We can do this with a rugplot. That is, at the bottom of the plot, we add a dot or a bar per observation, placed at its value on the x axis (number of legal issues).

eff.numeric %>%
  plot +
  ylim(0, 1400) +
  geom_jitter(data = df,
             aes(y = 0,
                 x = n_issues),
             alpha = 0.5,
             shape = "|")


A peek behind the scenes

When I make my numeric interpretations, I index the R objects to extract the numbers. This way, if the data or the model changes, I have updated results.

#One unit increase
increase <- summary(mod)$coefficients["n_issues", "Estimate"]

paste("For each additional legal question, the duration of the case increases on average with",
      round(increase),
      "days.")
## [1] "For each additional legal question, the duration of the case increases on average with 19 days."
# Marginal effect with scenario

#Check the spread of the predictor to choose a realistic scenario
summary(df$n_issues)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.386   3.000  17.000
#A two-unit increase: from one legal question (1st quartile) to three (3rd quartile)
typical_increase <- summary(mod)$coefficients["n_issues", "Estimate"] * 2

paste("Going from a fairly straightforward case (one legal question) to a fairly complex case (3 questions) increases the duration by",
      round(typical_increase),
      "days.")
## [1] "Going from a fairly straightforward case (one legal question) to a fairly complex case (3 questions) increases the duration by 37 days."

Presentation of results

In addition to the numeric interpretations and visualizations of different scenarios, it is common to report the regression table. The stargazer package is built around one function, stargazer(), which extracts the relevant information for you automatically.

library(stargazer)

By default, it will print LaTeX code for you. However, you can also ask for text or HTML output. Here, I ask for text output so that I can read it. If I put several model objects into the function, stargazer will sort out the variable names and make a column for each model. It’s perfect for comparing models. We’ll see that when we move to the multivariate analysis.

stargazer(mod.cat, mod,
          type = "text")
## 
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                         duration                     
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## n_issues_cat2              26.911***                                 
##                             (9.745)                                  
##                                                                      
## n_issues_cat3              44.057***                                 
##                             (11.394)                                 
##                                                                      
## n_issues_cat4              87.315***                                 
##                             (15.265)                                 
##                                                                      
## n_issues_cat5              75.844***                                 
##                             (19.903)                                 
##                                                                      
## n_issues_cat6              70.621***                                 
##                             (25.793)                                 
##                                                                      
## n_issues_cat7               92.684**                                 
##                             (36.563)                                 
##                                                                      
## n_issues_cat8                71.756                                  
##                             (44.972)                                 
##                                                                      
## n_issues_cat9                94.991                                  
##                             (61.508)                                 
##                                                                      
## n_issues_cat10             363.991***                                
##                            (106.143)                                 
##                                                                      
## n_issues_cat11             261.741***                                
##                             (91.979)                                 
##                                                                      
## n_issues_cat12              115.991                                  
##                            (183.618)                                 
##                                                                      
## n_issues_cat13             480.991***                                
##                            (129.918)                                 
##                                                                      
## n_issues_cat14             343.491***                                
##                            (129.918)                                 
##                                                                      
## n_issues_cat17              -172.009                                 
##                            (183.618)                                 
##                                                                      
## n_issues                                            18.710***        
##                                                      (2.307)         
##                                                                      
## Constant                   630.009***               616.314***       
##                             (6.456)                  (6.753)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                 2,212                    2,212          
## R2                           0.040                    0.029          
## Adjusted R2                  0.033                    0.028          
## Residual Std. Error   183.505 (df = 2197)      183.974 (df = 2210)   
## F Statistic         6.458*** (df = 14; 2197) 65.759*** (df = 1; 2210)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

To export your table to word, you can do the following:

  1. write out the object as an HTML file using the out = argument. The file will appear in your working folder.
stargazer(mod.cat, mod,
          type = "html",
          out = "table.html")
  2. open the file in your web browser
  3. copy-paste to Word

It’s a hack, but it works fine!


Your turn!

Can you do the same with the number of member state observations?

  • explore the univariate distribution of the variable
  • explore the bivariate relationship between case duration and the number of governments submitting to the Court
Main R functions in this section:

Function                                      Name              Package
estimate a regression                         lm()              base
create a results table                        stargazer()       stargazer
create scenarios and simulate outcomes        ggpredict()       ggeffects
add points to a graphic                       geom_point()      ggplot2
add jittered points to a graphic              geom_jitter()     ggplot2
add lines to a graphic                        geom_line()       ggplot2
add a local regression line to a graphic      geom_smooth()     ggplot2

Multivariate results

As researchers, we are often confronted with competing explanations of a phenomenon. This is where regression analysis becomes really useful. We can use multivariate regression (regression with several predictors) to rule out alternative explanations.

Thus, we usually run regression analyses not only because we want to summarize the data. We also want to make a ceteris paribus argument. That is, we want to make a statement such as “all else equal, the effect of the number of actors involved in the case on duration is…”. Our analysis takes alternative explanations into account by comparing data points within groups (combinations of values) defined by those alternative explanations.

Another way of saying this is that the biggest threat to our inference is spurious relationships. A third variable may provoke the variation in both x and y and give us the impression that the correlation (or lack thereof) that we find is causal. When we include other predictors in our analysis, we may find that the initial omission of the third variable (the “control variable”) either suppressed or provoked the correlation between x and y.

For example, we may have a theory that the Court’s decision making becomes harder as the political salience of the case increases. We measure (“operationalize”) “political salience” as the number of member state governments that submit observations to the Court and “difficulty in the decision making” as the duration of the case.

The bivariate relationship between the two is positive, so it aligns with our expectations.

cor.test(df$duration, df$n_observations)
## 
##  Pearson's product-moment correlation
## 
## data:  df$duration and df$n_observations
## t = 5.2758, df = 2210, p-value = 1.451e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07017355 0.15249300
## sample estimates:
##       cor 
## 0.1115246

However, the number of observations is naturally limited by the number of member state governments that could have submitted their views to the court. This is related to the size of the European Union. The correlation between the two is positive.

cor.test(df$eu_size, df$n_observations)
## 
##  Pearson's product-moment correlation
## 
## data:  df$eu_size and df$n_observations
## t = 8.435, df = 2210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1359311 0.2166901
## sample estimates:
##       cor 
## 0.1766078

Furthermore, we know that the Court’s capacity has varied over time, depending on its decision-making rules (i.e. chamber size, number of judges…) and its case load. This is also related to the size of the EU. Here, the correlation is negative.

cor.test(df$eu_size, df$duration)
## 
##  Pearson's product-moment correlation
## 
## data:  df$eu_size and df$duration
## t = -20.181, df = 2210, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4290922 -0.3586892
## sample estimates:
##        cor 
## -0.3944694

This means that case duration is on average shorter after the EU enlargement, while the number of member state observations is “artificially” higher. Might it be that the relationship between difficulty of decision making and political salience is even stronger than we initially thought?

To check for this explanation, we can estimate a separate regression line for each EU size. That is, instead of calculating one line like we do here…

df %>%
  ggplot +
  geom_smooth(
    aes(
      y = duration,
      x = n_observations
    )
  )

… we can estimate a separate line for each EU size.

df %>%
  ggplot +
  geom_smooth(
    aes(
      y = duration,
      x = n_observations
    )
  ) + 
  facet_wrap(~ eu_size)

Instead of calculating a local regression, I can also force the line to be straight. That is, I can run a bivariate linear regression with a separate intercept and slope per group.

df %>%
  ggplot +
  geom_smooth(
    aes(
      y = duration,
      x = n_observations
    ),
    #Bivariate linear model
    method = "lm"
  ) + 
  facet_wrap(~ eu_size)

Here, we see that the slopes are very similar for each of the three regressions. It is only the intercept (the general duration) that really varies. This is what a multivariate regression does. It estimates a single slope, but allows for the general level of y to vary by a second predictor (the control variable, EU size).

Here, I estimate first a bivariate, then a trivariate regression and report the results side by side.

#Bivariate regression
mod.bi <- lm(duration ~ 
               n_observations,
             df
  
)

#Trivariate regression
mod.tri <- lm(duration ~ 
                n_observations
              + eu_size,
              df
  
)
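The side-by-side table below can be produced with stargazer(), as earlier (the exact call is not shown in the original output; this is the assumed equivalent):

stargazer(mod.bi, mod.tri,
          type = "text")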
## =========================================================================
##                                     Dependent variable:
##                     -----------------------------------------------------
##                                          duration
##                               (1)                        (2)
## -------------------------------------------------------------------------
## n_observations             10.582***                  17.745***
##                             (2.006)                    (1.846)
## 
## eu_size                                               -13.955***
##                                                         (0.635)
## 
## Constant                   635.088***                 900.856***
##                             (6.292)                    (13.375)
## 
## -------------------------------------------------------------------------
## Observations                 2,212                      2,212
## R2                           0.012                      0.189
## Adjusted R2                  0.012                      0.189
## Residual Std. Error   185.526 (df = 2210)        168.113 (df = 2209)
## F Statistic         27.834*** (df = 1; 2210) 258.228*** (df = 2; 2209)
## =========================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Now, it is easy to see that the effect of political salience was suppressed (under-estimated) in my initial analysis.

If we want to, we can illustrate the regression line for each of the three EU-size scenarios.

eff.tri <- ggpredict(mod.tri,
                     terms = c("n_observations",
                               "eu_size[15, 27, 28]"))

eff.tri %>%
  plot

However, our interpretation is often more stylized. That is, we are interested in the common slope parameter (the rate of increase in duration when number of governments increases) rather than the varying intercepts/levels of the control variables.

Statistical uncertainty

Until now, I’ve glossed over the assumptions that the regression analysis relies on, as well as the statistical uncertainty in our estimates. Clearly, the average effect is only that: an average. In each case, the duration of the decision making may vary more or less than we expect. Regression analysis takes this into account by estimating the uncertainty of our effects.

summary(mod.tri)
## 
## Call:
## lm(formula = duration ~ n_observations + eu_size, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -530.73 -110.77  -20.66   84.32  986.78 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    900.8561    13.3745   67.36   <2e-16 ***
## n_observations  17.7452     1.8465    9.61   <2e-16 ***
## eu_size        -13.9550     0.6353  -21.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 168.1 on 2209 degrees of freedom
## Multiple R-squared:  0.1895, Adjusted R-squared:  0.1888 
## F-statistic: 258.2 on 2 and 2209 DF,  p-value: < 2.2e-16

The regression coefficient we see is only the expected effect; the mean change in y (duration) given one unit increment in x (number of observations). However, it is reasonable to believe that this effect is sometimes smaller and sometimes higher than expected. The model results address this by reporting a hypothetical normal distribution of possible effects. We can infer that distribution from the regression coefficient (the mean) and the standard error (the spread) attached to each predictor.

m <- summary(mod.tri)$coefficients["n_observations", "Estimate"]
se <- summary(mod.tri)$coefficients["n_observations", "Std. Error"]

With these two parameters, we can “reconstruct” the range of possible effects by using simulation. Here, I hypothesize 10000 parallel universes where member state governments submit observations and where the Court makes decisions. I then specify that in all of these universes my (estimation of the) theory is a natural law. That is, I specify the distribution of the regression coefficient and make 10000 draws from this distribution.

#Simulate a normal distribution
sim <- 
  rnorm(
    #Number of simulations
    10000,
    #Mean
    m,
    #Standard deviation
    se)

We can illustrate the distribution. Here, I also tweak the x-axis and label the zero effect (\(\beta\) = 0).

sim %>%
  as.data.frame() %>%
  ggplot +
  geom_histogram(aes(sim)) +
  xlim(0, 30) +
  geom_vline(aes(xintercept = 0),
             lty = 2)

According to statistical theory, if I were to draw a similar-sized data set as we have here (N = 2212) and run the same model with the same predictors, the regression coefficients would vary according to this distribution.

It means that I can use the spread of the distribution to construct a range of probable effect sizes (a “confidence interval”). According to convention, we tend to look at the 95% most likely outcomes. They can be found within +/- 1.96 standard errors of the estimated regression coefficient.

data.frame(
  beta = m,
  low.conf = m - 1.96 * se,
  high.conf = m + 1.96 * se
) %>%
  ggplot +
  geom_point(aes(x = beta,
                 y = 0)) +
  geom_errorbar(aes(xmin = low.conf,
                    xmax = high.conf,
                    y = 0),
                width = 0) +
  xlim(0, 30) +
  ylim(0, 2000) +
  geom_vline(aes(xintercept = 0),
             lty = 2)
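As a cross-check, base R can compute essentially the same interval directly with confint() (not used above, but handy):

#95% confidence intervals for all coefficients in the model
confint(mod.tri, level = 0.95)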

As long as the confidence interval does not include zero (the no-effect value), we may infer that the direction of our effect is not a statistical fluke; it is statistically significant.

How can the model know the uncertainty?

Of course, this is an estimation as well. We use statistical theory and information in the data to infer the uncertainty. Specifically, the model uses the variation in the data and the size of the data (number of data points).

We therefore have two types of variation in the data that the model uses:

  • information: the shared variation between the predictor and the outcome
  • noise: the random errors/uncertainty

The better the predictor is at accounting for the variation in the outcome, the lower the noise (standard error).
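In a bivariate regression, this intuition corresponds to the formula for the standard error of the slope:

\(SE(\hat{\beta}) = \sqrt{\dfrac{\hat{\sigma}^2}{\sum_i (x_i - \bar{x})^2}}\)

The numerator is the noise (the residual variance), while the denominator grows with the number of observations and the spread of the predictor: more data and more variation in x give a smaller standard error.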

Main takeaways

  • a positive theory can be described by a linear equation: \(y = \alpha + \beta x\)
  • a regression applies the theory to empirical data: can the theory explain what we see?
  • we can think of a regression as:
    • a comparison of the differences in means (typically when the predictor is categorical)
    • a means of inference (typically when the predictor is continuous)
  • a regression is useful for:
    • describing the relationship between two variables
    • …while making a ceteris paribus argument (by “controlling for” alternative explanations)
  • our research interest is usually in the direction of the regression coefficients
    • we can use statistical theory to check whether the estimated direction is a statistical fluke or something systematic