Descriptive statistics
Exploring our data
All data analysis starts by exploring the data, regardless of whether it is hand-me-down data from someone else or your own brand-new stuff. My exploration usually involves a lot of figures. In doing so, I often use ggplot for visuals, in addition to the tools for numeric summaries available in tidyverse. The type of descriptive statistics that is useful for you will depend on the measurement level of your variables.
Introduction
I have two objectives with this notebook:
Overview of descriptive statistics For each measurement level of the variable(s), I suggest a numeric and a graphical way of exploring them.
Graphical display using ggplot2 In parallel, I introduce the ggplot2 dialect for plotting.
Why descriptive statistics?
Many students (and researchers!) ignore the value of descriptive statistics. They seem so simple! And why bother to do them when you are going to do more fancy analyses? The purpose of descriptive statistics is not analysis, however. They are there to describe your variables. One practical implication is that descriptive statistics rarely involve tests of statistical significance. That’s the role of the analysis. Descriptive statistics help you identify possibilities, flag potential problems, probe for errors, and interpret your results.
Discover possibilities
When I get my hands on new data, I always spend time on descriptives. Often, this exploration is fairly unstructured. I explore each variable separately in univariate statistics. I then start to relate one to the other in bivariate and trivariate statistics. Some researchers claim that if the relationship they study is not visible in the bivariate, it is not a project worth pursuing! That’s a pretty extreme approach.
The point is to get a sense of what information is contained in the data: Variation is information. Variables with a lot of variation therefore contain more information.
My choice of model furthermore depends on how the dependent variable is measured and the structure of the data. Certainly, I can read the codebook (or design it), but I may also perceive new modeling opportunities by knowing what information is available to me. To begin with, I therefore explore the phenomenon I’m interested in (\(y\)).
For example, you may go into a project where you test some theoretical argument about when a government chooses to bomb their enemy. A possible operationalization of your dependent variable is the number of air strikes in a particular area. However, you might also realize that you have a time variable (date, time of day) of each strike. Now, you can also consider an event history model that analyzes each strike separately; it’s a more fine-grained analysis.
Flag potential problems
Data come with all kinds of problems and challenges that you have to handle. It is not always easy to know which problem you might have in the future, but by knowing the data, you will know the quirks of each variable and more easily spot problems (and solutions) when you run into them.
For example, there can also be too much variation. I could, for example, discover that my continuous variable on respondent income has a few very rich people in it. These are outliers. They may or may not skew your analysis later on. Thanks to the descriptive statistics, you will have the potential problem on your radar.
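One quick way to put such outliers on the radar is to compare the mean with the median and inspect the extreme quantiles. A minimal base-R sketch on simulated income data (the numbers are invented for illustration, not taken from the kap6 data):

```r
#Simulated incomes: most respondents around 300, plus a few very rich ones
set.seed(1)
income <- c(rnorm(500, mean = 300, sd = 50), 5000, 8000)

#The mean is dragged upwards by the outliers; the median is robust
mean(income)
median(income)

#The top quantiles reveal how extreme the tail is
quantile(income, probs = c(0.5, 0.95, 0.99, 1))

#A boxplot flags the outliers visually
boxplot(income)
```

If the mean and the median are far apart, you know the variable has a skewed tail worth keeping in mind.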
Error probe
Analyzing data usually involves a lot of data wrangling: You can create new variables from different information sources or change the measurement level of a variable. You can move up a notch by making indexes to get a more fine-grained variable or you can move down a measurement level because there is not enough/too much variation in the original data. After changing my data, I always check if my code resulted in what I wanted. The best way to do this is through some visualization of the univariate distribution of the new data.
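One way to do this error probing numerically is to cross-tabulate the old variable against the new one after a recode; every old value should land in exactly one intended new category. A small sketch with made-up values (the 11-point scale here is hypothetical, not a kap6 variable):

```r
#Made-up 11-point scale (e.g. an attitude item)
old <- c(0, 2, 5, 7, 10, 10, 3)

#Collapse it into three ordered categories
new <- cut(old,
           breaks = c(-Inf, 3, 7, Inf),
           labels = c("Low", "Mid", "High"))

#Cross-table of old vs. new values: check that the mapping is as intended
table(old, new)
```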
Interpret and communicate
Knowledge of your descriptive statistics will come back once you’ve run your statistical model. Now you have to interpret it. It is time to figure out what scenarios are realistic and interesting to discuss. By that, I mean that you need to consider which values on the \(x\)-variables are common and what changes in these variables are likely. Now you can use your model to predict what the outcome is. This is at the very heart of your analytical venture. Descriptive statistics therefore helps you understand what you have found.
For example, it may be that your regression analysis reports a small coefficient for the effect of income on scepticism towards immigration. This does not automatically mean that the effect is not substantial. If income is measured in DKK, then a one-unit increase is not a lot. However, if you consider a salary bump of 100,000 DKK, then the shift – even if the regression coefficient is small – might still be substantial.
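The back-of-the-envelope check is simply to multiply the coefficient by a realistic change in \(x\). A sketch with an invented coefficient (the number is for illustration only):

```r
#Invented coefficient: change in scepticism per 1 DKK of extra income
b <- 0.00002

#A one-unit (1 DKK) increase: negligible
b * 1

#A realistic salary bump of 100,000 DKK: no longer negligible
b * 100000
```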
Graphics in R
R had already established itself as good software for visual data presentation, but the ggplot2 package has contributed to making R unparalleled by making the graphical toolkit widely accessible. You can find a comprehensive online book on how to use ggplot2 here. ggplot2 works well in conjunction with data preparation in dplyr. It has – annoyingly enough – its own pipe operator (+). Overall, ggplot2 requires more arguments and functions to produce simple plots, but the possibilities for customization are also very large. My ggplot graphics are therefore produced with relatively many lines of code. I use them mostly when presenting to others, while I resort to faster solutions when I just want to take a quick look.
Getting started
We start by loading the packages we will use. If you haven’t already installed ggplot2, you will have to do that first.
#install.packages("ggplot2")
library(dplyr); library(ggplot2)
We will use the same data as in the previous session: It is available in my package or online on my website.
library(RiPraksis)
data(kap6)
df <- kap6
Main logic
The workflow in ggplot2 is a pipe. This pipe is read chronologically: I first establish a sheet, then I add elements.
specify the data you want to work with
establish a blank sheet
add elements to the sheet one by one
the visual adaptations (“theme”) can be based on a template (there is a grey default template). The template already specifies all the aesthetics for you. However, you can “tweak” what you want after having established the template.
Google and ChatGPT are your best friends, as usual.
Example: Let’s make a scatterplot between income and scepticism.
The main function, ggplot(), establishes the plot, but no plot elements. If you are going to use the same data for all the graphical elements, you can specify the data object in this function. When you do that, you do not have to specify your dataset multiple times in the pipe, so you save time. From then on, you can refer directly to the variables in it.
You bind all elements together with a pipe using “+” when you use ggplot. However, you can mix tidyverse and ggplot dialects. Here, I start with a tidyverse pipe, then continue with ggplot to make an empty sheet.
df %>%
  ggplot()
The main elements of the graphics are included using a series of geom_...() functions. They specify the “geometry”: what information should be included (information from variables) and where it should be placed (coordinates). Inside each of these functions, you will find the aesthetic function, aes(), where you specify the x/y coordinates that come from your variables.
Here, I am creating a scatterplot for the relationship between immigration skepticism and income. This means that I want to draw points (geom_point()).
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Univariate distributions
The first thing I do is usually to eyeball the data and check a) conceptually what measurement level I think the variable has, decide b) how I want to treat the variable (e.g. binary variables could be treated as either categorical or numeric), and c) sometimes check how R has registered the variable (class(df$my_variable)).
From there, I pick my statistics using a pretty standardized set of statistics and graphics.
Numeric variables
Numeric variables include all variables that I can treat arithmetically: continuous variables (numbers with decimals) and counts (integers, bounded at zero), but sometimes we also treat variables with many ordered categories (integers, bounded at zero and some upper limit) as numeric.
Numeric summary: mean, quantiles…
Summarizing continuous variables usually means looking at their average, but also their minimum, maximum values and their spread.
summary(df$Skepsis)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.000 5.000 5.009 6.000 10.000 5
I can also check out their standard deviation to get a feel for how much variation I have in them.
sd(df$Skepsis, na.rm = TRUE)
## [1] 1.72577
Visual: histogram
base
df$Skepsis %>% hist
We can make it relative using a simple additional argument.
df$Skepsis %>% hist(., probability = T)
ggplot2 Histograms are easily implemented in ggplot too.
df %>%
  ggplot() +
  geom_histogram(aes(Skepsis))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (`stat_bin()`).
You can tweak them any which way you want, but it requires some more coding skills. Here, I make the histogram relative by manipulating the y-axis statistic with after_stat(density).
df %>%
  ggplot(aes(x = Skepsis)) +
  geom_histogram(aes(y = after_stat(density)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (`stat_bin()`).
Categorical variables
We can explore categorical variables primarily by their frequency distribution. That is, we count the number of observations we have in each category. This makes the basis both for the numeric summary (a frequency table) and the visual inspection (a barplot).
Numeric summary: frequency table
Here, we are looking at what our respondents voted for in the last election.
df$Parti %>%
  table
## .
## Andet Dansk Folkeparti
## 17 143
## Det Konservative Folkeparti Det Radikale Venstre
## 65 134
## Enhedslisten Kristendemokraterne
## 77 8
## Liberal Alliance Socialdemokraterne
## 48 268
## Socialistisk Folkeparti Venstre
## 108 311
Visual: barplot
base
In base, we need to make the frequency table first, then plot it using the appropriate function.
df$Parti %>%
  table %>%
  barplot()
Usually, it is much more intuitive to check the relative distribution than the absolute. The function prop.table() does exactly that for us.
df$Parti %>%
  #Frequency table
  table %>%
  #Relative distribution
  prop.table %>%
  #Barplot
  barplot()
ggplot2
In ggplot2, R does the frequency plot for us.
df %>%
  ggplot() +
  geom_bar(aes(Parti))
Do you notice the difference? The base function removes all “NA” (people who did not answer the question). In ggplot2, we’d have to remove them manually.
We can do that using tidyverse filtering!
df %>%
  #Filter out missing observations before plotting
  filter(!is.na(Parti)) %>%
  ggplot() +
  #Barplot
  geom_bar(aes(x = Parti))
You can find more about data exploration in this tutorial, for example.
Your turn!
- Describe the distribution of education among respondents (Uddannelse). What do you see?
- Replicate my barplot of the respondents’ party choice using ggplot2. Change the direction of the bars by swapping the x- and y-axes in the aes() specification.
- Can you describe the distribution of respondents that have a weak link to the labor market (Prekaritet)?
- Store the plot in an R object.
Bivariate statistics
After exploring the variables separately, we usually check out their bivariate relationship. Once again, the choice of statistics depends on the measurement level.
Two numeric variables
Numeric summary: correlation
When we have two numeric variables, we can calculate their correlation using Pearson’s r (\(r\)).
cor(df$Indtaegt, df$Skepsis,
use = "pairwise.complete.obs")
## [1] -0.1526337
The negative correlation is apparent, but what does it mean? Can we put it into plain English?
If we take the square of this correlation, we get the “coefficient of determination” (\(R^2\)). We can interpret the result as a proportion of shared variation between the two variables. Sometimes it is even given a causal interpretation, but that requires strong assumptions.
cor(df$Indtaegt, df$Skepsis,
use = "pairwise.complete.obs")^2
## [1] 0.02329706
We may say that 2 % of the variation in attitudes towards immigration is shared with (explained by?) income.
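This link between the correlation and \(R^2\) also holds in a bivariate linear regression: the model’s \(R^2\) is exactly the squared Pearson correlation. A quick check on R’s built-in mtcars data (used here instead of kap6 so the example is self-contained):

```r
#Squared correlation between car weight and fuel economy
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2

#R-squared from the corresponding bivariate regression
r2_lm <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

#The two quantities are identical
all.equal(r2_cor, r2_lm)
```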
We can test the statistical significance between the two.
cor.test(df$Indtaegt, df$Skepsis)
##
## Pearson's product-moment correlation
##
## data: df$Indtaegt and df$Skepsis
## t = -5.6155, df = 1322, p-value = 2.387e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.20482310 -0.09957898
## sample estimates:
## cor
## -0.1526337
Considering numbers in this way is really abstract though. The most intuitive way of exploring a correlation is through visuals.
Visual: scatterplot
When both variables are numeric, I can create a scatterplot for the relationship between immigration skepticism and income.
base R
The scatterplot plots the variable values against each other such that each observation has a coordinate that places it as a point on the x- and y-axis. This is more intuitive when your data set is reasonably small.
Here, I use a tidyverse pipe to select the two variables, then use the base R plot() function.
df %>%
  select(Indtaegt, Skepsis) %>%
  plot
ggplot2
I can do the same thing in ggplot and take advantage of its additional functions to make the plot more legible.
I want to draw points, so I add the elements with the geom_point() function to the blank sheet.
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
It is hard to see any relationship between the two variables. Also, they only shared some 2% of their variation. However, the income variable also consists only of integers. Therefore, the points overlap each other, and we get a poor idea of how many units exist within the different value combinations. We can’t see what value combinations are the most common, only that there are a lot of them.
We can solve this in two ways: by making the points partially transparent, or by moving them slightly apart.
Transparent color I can specify that I want transparent colors. I do this with the argument alpha=. The value I specify ranges from 0 (completely transparent/invisible) to 1 (not transparent).
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt),
             #all observations are transparent
             alpha = 0.2)
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Note that when we want to adjust the points without the adjustment being dependent on information from the data, this is done outside of aes() but within the geom_…() function.
Jitter data Another alternative is to “jitter” the data points a little. This means that R adds a random variation to the coordinates of the points. The actual data points become more imprecise, but the point here is not to perform a precise analysis, but to present relationships in the data so that the human eye gets an intuitive understanding of what is going on.
df %>%
  ggplot() +
  geom_jitter(aes(y = Skepsis,
                  x = Indtaegt),
              #Horizontal, but not vertical shake
              height = 0, width = 0.3)
## Warning: Removed 178 rows containing missing values (`geom_point()`).
The additional arguments width= and height= specify how much the points should be jittered horizontally and vertically, respectively. Here, I say that I want precise data along the y-axis (“in height”), but I let the points randomly vary by 0.3 units along the x-axis (“in width”).
Trivariate graphics: Group the data.
I can easily add more groupings. Here I am coloring the points based on the value of a third variable: The variable can be continuous or categorical.
Numeric/continuous grouping “Prekaritet” flags respondents that have a weak link to the labor market (i.e. they are unemployed or on a temporary work contract).
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt,
                 #Grouping
                 color = Prekaritet))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Now I get a “heat scale” for the respondent’s link to the labor market. ggplot also automatically provides a legend. However, the variable only has two values, 0 or 1, so we had better treat it as a categorical moderator.
Categorical grouping If I want a categorical grouping, I need to feed the function a categorical variable (“character” or “factor”). It’s best to do this by mutating the data frame using tidyverse, not in the ggplot() function. Right here, I use as.factor() to change the measurement level into categorical.
df %>%
  mutate(Prekaritet = as.factor(Prekaritet)) %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt,
                 #Grouping
                 color = Prekaritet))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Of course, I can combine all these solutions – jittering, transparency and recoding – until I get a plot that tells the story I want to convey.
#tidyverse to transform the data
df %>%
  mutate(Prekaritet = as.factor(Prekaritet)) %>%
  #ggplot2 to plot the data
  ggplot() +
  geom_jitter(aes(y = Skepsis,
                  x = Indtaegt,
                  #Categorical moderating third variable
                  color = Prekaritet),
              #Jitter the data horizontally and vertically
              height = 0.1, width = 0.4,
              #Transparency
              alpha = 0.7)
## Warning: Removed 178 rows containing missing values (`geom_point()`).
It is hard to tell much from this graphic, so let’s move on to other visualizations.
Bivariate regression lines
I can also show the relationship between two variables using regression lines. ggplot does not report regression coefficients but instead illustrates the regression line and the uncertainty around it. This is particularly relevant if at least one of the variables (the x-variable) is continuous.
Now we can choose between local regression and ordinary linear regression. Which one we choose depends on what we want to achieve.
Local regression
Local regression is – in my eyes – the queen of all ggplot functions. I use it all the time when exploring my data. It draws a “smooth” line (locally, the sliding average over the different x-values). Perfectly suited for time trends or exploring non-linear relationships, for example.
The function that gives us bivariate regression lines is geom_smooth(). The default setting is local regression.
df %>%
  ggplot() +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (`stat_smooth()`).
Here I illustrate the relationship between income and skepticism about immigration. ggplot adjusts the axis limits for us automatically so as to zoom in on the results. Beware: this zoom means that even small effects can come across as large.
Grouped: Is the relationship the same within different categories?
df %>%
  ggplot() +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt,
                  color = as.factor(Prekaritet)))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (`stat_smooth()`).
Would you like to replace the “spaghetti” with straight lines? Ask for a (bivariate) linear model.
df %>%
  ggplot() +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt,
                  color = as.factor(Prekaritet)),
              method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 178 rows containing non-finite values (`stat_smooth()`).
Here we clearly see that the association between immigration skepticism and income is primarily present for respondents who have a safe connection to the labor market.
Your turn!
- Plot the relationship between income (x = Indtaegt) and weak link to the labor market (y = Prekaritet) using a local regression. What do you find? Would it make sense to swap the two, i.e. let income be on the y-axis?
- Use the tidyverse mutate() function to create a new variable that flags whether a respondent has answered the question about labor market connection. Redefine the new variable into a numeric using the as.numeric() function. Then plot the relationship between income (Indtaegt) and the new variable using a local regression. What do you find?
One numeric and one categorical variable (+ more tidyverse)
When we have a numeric and a categorical variable we’d like to explore, we will usually calculate the group mean; i.e. we calculate the average of the numeric variable for each of the categories in the other.
Numeric summary: group averages
This is a good excuse to have a second look at tidyverse. dplyr has a very handy function that allows us to do operations on subsets of the data depending on a grouping variable. Let’s calculate the average scepticism among voters of different parties and create a summary statistic. To do so, we use reframe().
df %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T))
Before moving on, I’m removing the NAs.
df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T))
The observations are ordered alphabetically. It is not intuitive. Let’s rearrange.
df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T)) %>%
  arrange(Scepticism)
No surprises here; Dansk Folkeparti has voters that are more sceptical than other parties. This being said, in general, there is not a massive spread here.
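If you want a quick base-R cross-check of such group averages, tapply() computes the same statistic without any packages. A sketch on R’s built-in mtcars data (used so the example is self-contained):

```r
#Average fuel economy per number of cylinders
means <- tapply(mtcars$mpg, mtcars$cyl, mean, na.rm = TRUE)

#Sort the group means from lowest to highest, as with arrange()
sort(means)
```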
My results looked OK, so I’m storing them in tab.
tab <-
  df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T)) %>%
  arrange(Scepticism)
Visual: barplot
Let’s instead plot this.
base
barplot(tab$Scepticism,
names.arg = tab$Parti,
las = 2)
ggplot2
Instead of plotting the entire data set, I now rely on my table also in ggplot2. That is, I’ve already calculated the height of the bars, so there is no need for ggplot to do that. I signal my intent by using the geom_col() function. It simply reads the coordinates from the data frame, so that the y-values and the x-values are defined by me.
tab %>%
  ggplot() +
  geom_col(aes(y = Scepticism,
               x = Parti))
Bonus
Would you like to order your columns by size? You can do that by reordering the x-variable (Parti) according to the values of scepticism; i.e. you define it as an ordered categorical variable. The function is reorder().
tab %>%
  mutate(Parti = reorder(Parti, Scepticism)) %>%
  ggplot() +
  geom_col(aes(y = Scepticism,
               x = Parti))
Two categorical variables
When we have two categorical variables, we do a cross-table.
table(df$Parti, df$Kvinde)
##
## 0 1
## Andet 11 6
## Dansk Folkeparti 85 58
## Det Konservative Folkeparti 33 32
## Det Radikale Venstre 69 65
## Enhedslisten 32 45
## Kristendemokraterne 4 4
## Liberal Alliance 38 10
## Socialdemokraterne 116 152
## Socialistisk Folkeparti 43 65
## Venstre 175 136
Absolute frequencies are hard to fathom. I like them relative. Also, this time around, we want to know the proportion of voters that are men/women per party. It means that I calculate the distribution by rows (margin = 1).
table(df$Parti, df$Kvinde) %>%
prop.table(., margin = 1)
##
## 0 1
## Andet 0.6470588 0.3529412
## Dansk Folkeparti 0.5944056 0.4055944
## Det Konservative Folkeparti 0.5076923 0.4923077
## Det Radikale Venstre 0.5149254 0.4850746
## Enhedslisten 0.4155844 0.5844156
## Kristendemokraterne 0.5000000 0.5000000
## Liberal Alliance 0.7916667 0.2083333
## Socialdemokraterne 0.4328358 0.5671642
## Socialistisk Folkeparti 0.3981481 0.6018519
## Venstre 0.5627010 0.4372990
We don’t usually test statistical significance when we explore the data. However, we can. The chi-square test allows us to check whether the cross table could have been generated by chance. The p-value reports the probability of observing a relationship at least this strong if the two variables were in fact independent. Here, that’s very unlikely.
table(df$Parti, df$Kvinde) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 38.546, df = 9, p-value = 1.391e-05
base
table <- table(df$Kvinde, df$Parti)
barplot(table,
beside = T)
ggplot2
df %>%
  #Create a new variable with categorical names
  mutate(Gender = if_else(Kvinde == 1,
                          "Woman",
                          "Man")) %>%
  #Plot
  ggplot() +
  geom_bar(aes(x = Parti,
               #Moderate by gender
               fill = Gender),
           position = "dodge")
Your turn!
Key take-aways
- explore the data and use graphics
- choose statistics according to measurement level
- use tidyverse/dplyr to reshape the data for group averages etc.
- use ggplot2 to plot the data
Key functions in tidyverse/dplyr (numeric manipulation)

Function | What it does |
---|---|
filter() | Filters out observations |
select() | Selects variables |
group_by() | Groups the data |
mutate() | Adds/changes variables, but not the number of observations |
reframe() | Reduces to a data frame with fewer/summarizing variables |
Key functions in ggplot (graphical display)

Function | What it does | Reshapes the data? | Measurement level | Statistic |
---|---|---|---|---|
ggplot() | Creates a blank page | | | |
geom_bar() | barplot/frequency count | Yes | One or two categorical variables | Univariate, bi-/trivariate |
geom_histogram() | histogram | Yes | Numeric | Univariate |
geom_point() | points/scatterplot | No. Uses variables as coordinates. | Two numeric variables | Bivariate |
geom_col() | bars/barplot | No. Uses data as coordinates, so dplyr::mutate() might be needed. | One categorical and one continuous variable; height is group mean | Bivariate |
geom_smooth() | local regression line | Yes | Two numeric, or one binary (y) and one numeric | Bivariate |
Recap exercise
Our end-of-class activity was to find some potential variables, define their measurement level, the statistics we can use to describe them, and the visuals that might be useful.
Overview of potential descriptive statistics:
N. of variables | variable name | measurement level | statistic | graphic | geom_… |
---|---|---|---|---|---|
Univariate | | | | | |
 | income | continuous, bounded at 0 | mean, quartiles, median, sd | histogram | geom_histogram() |
 | party choice | categorical | frequency table | barplot | geom_bar() |
 | job satisfaction | ordinal, treated as either numeric or categorical | | | |
 | air strikes | count; bounded at 0 | mean, etc. | histogram | geom_histogram() |
Bivariate | | | | | |
 | party choice + gender | categorical + categorical (or treated as continuous) | cross table | barplot | geom_bar() |
 | income + grades | continuous + ordinal (treated as categorical) | group means; avg. income for each grade | barplot | geom_col() |
 | negotiation time + preference agreement | continuous + continuous | Pearson’s r | scatterplot; local regression | |
Workflow and tweaks in ggplot2
Let’s follow an alternative workflow and create our own small dataset for plotting. Here, I am interested in immigration skepticism among respondents with different work situations.
dfp <-
  df %>%
  #Filter out NA
  filter(!is.na(Prekaritet) & !is.na(Innvandrer)) %>%
  #Group
  group_by(Prekaritet, Innvandrer) %>%
  #Create a smaller data frame with group summary statistics
  reframe(
    #Group average of immigration attitudes
    Sceptical = mean(Skepsis, na.rm = T)) %>%
  #Define new variable names to appear for the viewer
  mutate(
    #New variable with intuitive answers; note the `backticks` around the name
    `How is your work situation?` = if_else(Prekaritet == 1,
                                            "Unstable",
                                            "Stable"),
    Immigrant = if_else(Innvandrer == 1,
                        "Immigrant", "Non-immigrant"))

dfp
p <-
  dfp %>%
  ggplot() +
  #Classical definition of x and y coordinates
  geom_col(aes(y = Sceptical,
               x = Immigrant,
               #Fill in with colors following the grouping
               fill = `How is your work situation?`),
           #place bars side-by-side
           position = "dodge")
Note how I have saved my plot in an object, p. To see the plot, I request it.
p
It is useful to save the object this way because I will modify my R object further. I can do that using pipes.
Aesthetic tweaks
Once you have added the graphical elements with your data, you can spend an endless amount of time fine-tuning the rest of the plot.
Tell R what language you speak (and where you are)
R was not created by Norwegians, so Danish characters can sometimes cause problems. We have to actively consider encoding choices in two settings: when we import data and when we save data, including in graphics.
Computers read and provide information as a series of 0s and 1s. To read and write text, we use an encoding, a translation of binary codes into the alphabet we know. Traditionally, each country and language had its own encoding system. This was cumbersome, but today, most text exchanged on the internet follows “utf-8” encoding, which includes most of the characters we know how to read. Sys.setlocale() tells R where you are in the world. Here, I am telling it that I want a Danish locale with utf-8 encoding.
Sys.setlocale(category = "LC_ALL",
locale = "dk_DK.UTF-8")
Titles and axis names
p <-
  p +
  #Add a title and a subtitle
  ggtitle(label = "Relationship between immigration skepticism and work situation",
          subtitle = "Data from ESS (2014)") +
  #Name of y-axis
  ylab("Scepticism towards immigration") +
  #Name of x-axis: empty
  xlab("") +
  #What are the limits on my axis?
  ylim(c(0,6))
p
Define colors
You can choose your own colors, and there are many ways to do so. If you have linked your data with color choices, it may be useful to define a “palette.”
The text (“Stable” and “Unstable”) is part of the data I created, so it is located in the variable dfp$`How is your work situation?`. I can change it in ggplot if I want to. I do this when defining the colors.
p <-
  p +
  #Define colors
  scale_color_manual(
    #What colors?
    values = c("purple", "magenta"),
    #Color the entire bar
    aesthetics = "fill",
    #What's the title of the legend?
    name = "Link to labor market",
    #What are the category names?
    labels = c("Stable",
               "Unstable")
  )

p
Background
You can define different templates for the aesthetic aspects of the plot with theme_...().
p <-
  p +
  # Define a theme
  theme_minimal()
p
You can alter the themes at will. You do that after you have defined the theme, using the generic theme() function.
p <-
  p +
  # Modify the minimal theme after defining it
  theme(
    # Move the legend down to below the plot
    legend.position = "bottom",
    # Bold font for the title in the legend
    legend.title = element_text(face = "bold"),
    # Italic for axis values
    axis.text = element_text(face = "italic"),
    # Remove the gridlines
    panel.grid = element_blank())
p
The element_...() functions are used inside the theme() function. They specify everything except the data content/information for an element. Should it be a blank element? element_blank(). A specific font? element_text(). And so on. For example, I can specify that the text used for the axis names should be displayed in italics: element_text(face = "italic").