Exploration of our data

We start by loading the packages we will use.

library(dplyr); library(ggplot2)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(RiPraksis)
data(kap6)
df <- kap6

Univariate distributions

Continuous variables

Summarizing a continuous variable usually means looking at its average, its minimum and maximum values, and its spread.

summary(df$Skepsis)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   4.000   5.000   5.009   6.000  10.000       5

df$Skepsis %>% hist

We can plot relative frequencies (densities) instead of counts with one additional argument.

df$Skepsis %>% hist(., probability = TRUE)
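To see what the extra argument changes, here is a minimal sketch on simulated scores. The vector x is made up for illustration and is not the Skepsis variable. hist() returns its bins, so we can inspect counts and densities without drawing anything.

```r
# Simulated 0-10 scores; a hypothetical stand-in for df$Skepsis
set.seed(1)
x <- round(pmin(pmax(rnorm(500, mean = 5, sd = 2), 0), 10))

# plot = FALSE returns the bin information instead of drawing it
h <- hist(x, plot = FALSE)

# Absolute frequencies: the bar heights sum to the number of observations
sum(h$counts)

# Densities (what probability = TRUE draws): the bar areas sum to 1
sum(h$density * diff(h$breaks))
```

The shape of the histogram is the same in both cases; only the scale of the y-axis differs.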

Categorical variables

We can explore categorical variables primarily by their frequency distribution.

Here, we are looking at which party our respondents voted for in the last election.

df$Parti %>%
  table

## .
##                       Andet            Dansk Folkeparti 
##                          17                         143 
## Det Konservative Folkeparti        Det Radikale Venstre 
##                          65                         134 
##                Enhedslisten         Kristendemokraterne 
##                          77                           8 
##            Liberal Alliance          Socialdemokraterne 
##                          48                         268 
##     Socialistisk Folkeparti                     Venstre 
##                         108                         311

In base R, we need to make the frequency table first, then plot it using the appropriate function.

df$Parti %>%
  table %>%
  barplot()

Usually, it is much more intuitive to check the relative distribution than the absolute. The function prop.table() does exactly that for us.

df$Parti %>%
  #Frequency table
  table %>%
  #Relative distribution
  prop.table %>%
  #Barplot
  barplot()
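As a small worked example of what prop.table() computes, the sketch below uses a made-up vector of votes (the object votes is hypothetical, not part of the survey data).

```r
# Hypothetical party votes, for illustration only
votes <- c("A", "B", "A", "C", "A", "B")

freq <- table(votes)        # absolute frequencies
rel  <- prop.table(freq)    # each count divided by the total

# prop.table() gives the same result as dividing by the sum
all.equal(as.numeric(rel), as.numeric(freq) / sum(freq))

# Expressed as percentages
round(100 * rel, 1)
```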

You can find more about data exploration in this tutorial, for example.

Bivariate statistics

After exploring the variables separately, we usually check out their bivariate relationship. Once again, the choice of statistics depends on the measurement level.

Two continuous variables (+ ggplot2)

When we have two continuous variables, we can calculate their correlation using Pearson's $r$.

cor(df$Indtaegt, df$Skepsis,
    use = "pairwise.complete.obs")

## [1] -0.1526337

The negative correlation is apparent, but what does it mean? Can we put it into plain English?

If we take the square of this correlation, we can interpret the result as a proportion of shared variation. Sometimes it is read as variation “explained”, but the correlation alone does not tell us whether the relationship is causal.

cor(df$Indtaegt, df$Skepsis,
    use = "pairwise.complete.obs")^2

## [1] 0.02329706

We may say that 2 % of the variation in attitudes towards immigration is shared with (explained by?) income.
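The “shared variation” reading can be checked against a bivariate regression. The sketch below uses R's built-in mtcars data purely as a stand-in (it is not the survey data), and shows that the squared correlation equals the R-squared of a linear model with one predictor.

```r
# Built-in data, used here only as a stand-in for df
r <- cor(mtcars$mpg, mtcars$wt)

# R-squared from a bivariate linear regression
fit <- lm(mpg ~ wt, data = mtcars)
r2_lm <- summary(fit)$r.squared

# The squared correlation and the regression R-squared coincide
all.equal(r^2, r2_lm)
```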

We can test whether the correlation is statistically significant.

cor.test(df$Indtaegt, df$Skepsis)

## 
##  Pearson's product-moment correlation
## 
## data:  df$Indtaegt and df$Skepsis
## t = -5.6155, df = 1322, p-value = 2.387e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.20482310 -0.09957898
## sample estimates:
##        cor 
## -0.1526337

Considering numbers in this way is really abstract though. The most intuitive way of exploring a correlation is through visuals. Let’s take a deep-dive into ggplot2.

Scatterplot

The main function ggplot() establishes the plot, but no plot elements. If you are going to use the same data for all the graphical elements, you can specify the data object in this function. When you do that, you do not specify your dataset multiple times in the pipe, so you save time. From then on, you can refer directly to the variables in it.

You bind all elements together with “+”, which works like a pipe. It’s easy to forget and reach for the %>% pipe from tidyverse instead. You’re not the only one who will make that mistake, so the message you get is surprisingly intuitive: Did you use %>% instead of +?.

ggplot(data = df)

The main elements of the graphics are included using a series of geom_...() functions. They specify the “geometry,” which is what information should be included (information from variables) and where it should be placed (coordinates).

Inside each of these functions, you will find the aesthetic function, aes(), where you specify the x/y coordinates that come from your variables.

Here, I am creating a scatterplot for the relationship between immigration skepticism and income. This means that I want to draw points (geom_point()).

ggplot(df) +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt))

## Warning: Removed 178 rows containing missing values (geom_point).

A simple scatter plot gives an impression of the bivariate relationship between immigration skepticism and income.

The income variable consists only of integers. Therefore, the points overlap each other, and we get a poor idea of how many units exist within the different value combinations. We can solve this in two ways: by making the points partially transparent, or by moving them slightly apart.

Transparent color

I can specify that I want transparent colors. I do this with the argument alpha=. The value I specify ranges from 0 (completely transparent/invisible) to 1 (not transparent).

ggplot(df) +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt),
             #all observations are transparent
             alpha = 0.2)

## Warning: Removed 178 rows containing missing values (geom_point).

By using a transparent color, I can show where the majority of the observations are located in the distribution.

Note that when we want to adjust the points without the adjustment being dependent on information from the data, this is done outside of aes() but within the geom_…() function.
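The difference between mapping and setting can be sketched with the built-in mtcars data (a stand-in, not the survey data). The object names p_mapped and p_fixed are chosen for illustration only.

```r
library(ggplot2)

# Inside aes(): color is mapped to a variable, so ggplot builds a legend
p_mapped <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg, color = factor(cyl)))

# Outside aes(): color and transparency are fixed for every point
p_fixed <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg),
             color = "steelblue", alpha = 0.2)

# Print either object to draw it
p_fixed
```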

Jitter data

Another alternative is to “jitter” the data points a little. This means that R adds a random variation to the coordinates of the points. The actual data points become more imprecise, but the point here is not to perform a precise analysis, but to present relationships in the data so that the human eye gets an intuitive understanding of what is going on.

ggplot(df) +
  geom_jitter(aes(y = Skepsis,
                 x = Indtaegt),
              #Horizontal, but not vertical shake
              height = 0, width = 0.3)

## Warning: Removed 178 rows containing missing values (geom_point).

I get a better grasp of where my observations are by ‘shaking’ the data points.

The additional arguments width= and height= specify how much the points should be jittered horizontally and vertically, respectively. Here, I say that I want precise data along the y-axis (“in height”), but I let the points randomly vary by 0.3 units along the x-axis (“in width”).
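What the jitter does to the x-coordinates can be imitated by hand. This is a sketch on simulated integer values (the vector income is hypothetical), assuming uniform noise of at most 0.3 units in either direction.

```r
set.seed(42)
# Hypothetical integer-coded income brackets
income <- sample(1:10, size = 200, replace = TRUE)

# Horizontal jitter of width 0.3: add uniform noise between -0.3 and 0.3
income_jittered <- income + runif(length(income), min = -0.3, max = 0.3)

# Every point stays within 0.3 units of its true value
max(abs(income_jittered - income))
```

Because the noise is smaller than half the distance between two neighboring integers, the jittered points never spill over into the next income bracket.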

Trivariate graphics: Group the data.

I can easily add more groupings. Here I am coloring the points based on the value of a third variable: The variable can be continuous or categorical.

Continuous grouping

“Subjective income” (the variable Prekaritet) reports how rich the respondent feels in four categories from 0 (“low satisfaction”) to 3 (“high satisfaction”).

ggplot(df) +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt,
                 #Grouping
                 color = Prekaritet))

## Warning: Removed 178 rows containing missing values (geom_point).

Now I get a “heat scale” for the respondent’s perception of their own income. ggplot also automatically provides a legend.

Here we see that most of the respondents who report low satisfaction (dark color) are located to the left of the plot. They have low (objective) income.

Categorical grouping

If I want a categorical grouping, I need to feed the function a categorical variable (“character” or “factor”). It’s best to recode the variable in the plot data beforehand, not inside the ggplot() call. Right here, I deviate from this rule by wrapping the variable in as.factor().

ggplot(df) +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt,
                 #Grouping
                 color = as.factor(Prekaritet)))

## Warning: Removed 178 rows containing missing values (geom_point).

One solution does not prevent another. Of course, I can combine all these solutions until I get a plot that tells the story I want to convey.

ggplot(df) +
  geom_jitter(aes(y = Skepsis,
                 x = Indtaegt,
                 #Continuous moderating third variable
                 color = Prekaritet),
              #Shake data horizontally and vertically
              height = 0.1, width = 0.4,
              #Transparency
              alpha = 0.7)

## Warning: Removed 178 rows containing missing values (geom_point).

Bivariate regression lines

I can also show the relationship between two variables using regression lines. ggplot does not report regression coefficients but instead illustrates the regression line and the uncertainty around it. This is particularly relevant if at least one of the variables (the x-variable) is continuous.

Now we can choose between local regression and ordinary linear regression. Which one we choose depends on what we want to achieve.

Local regression

Local regression is – in my eyes – the queen of all ggplot functions. I use it all the time when exploring my data. It draws a “smooth” line (locally, the sliding average over the different x-values). Perfectly suited for time trends or exploring non-linear relationships, for example.

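The function that gives us bivariate regression lines is geom_smooth(). Its default is a smooth line: local regression (loess) when there are fewer than 1,000 observations, and a generalized additive model (gam) for larger data sets such as ours, as the message below the code shows.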

ggplot(df) +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 178 rows containing non-finite values (stat_smooth).

Local regression: A simple way to present empirical relationships between two variables.

Here I illustrate the relationship between age and skepticism about immigration. The figure demonstrates the greatest advantage of local regression: There is no linear relationship between age and skepticism about immigration. Instead, it appears that younger and older respondents are more skeptical.

R adjusts the “window” automatically for us, but we can choose the size ourselves with the additional argument span =. It sets how much of the data is used for each local average: low values give a more jagged line, high values give more generalization. Note that span only applies to local regression, so with more than 1,000 observations we also have to request method = "loess" explicitly.

ggplot(df) +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt),
              method = "loess",
              span = 20)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 178 rows containing non-finite values (stat_smooth).

Local regression: A simple way to present a bivariate relationship.
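The effect of the window size can also be explored outside ggplot with the base function loess(). The sketch below uses simulated data (x and y are made up), and the helper roughness() is a hypothetical name introduced here to compare how wiggly the two fitted lines are.

```r
set.seed(7)
# Simulated curved relationship; not the survey data
x <- seq(0, 10, length.out = 300)
y <- sin(x) + rnorm(300, sd = 0.4)

# Small span: narrow window, the line follows local wiggles
fit_narrow <- loess(y ~ x, span = 0.1)
# Large span: wide window, a smoother and more general line
fit_wide   <- loess(y ~ x, span = 0.9)

# Roughness measured as the average absolute second difference of the fit
roughness <- function(fit) mean(abs(diff(fitted(fit), differences = 2)))

# Compare the two fits
roughness(fit_narrow) > roughness(fit_wide)
```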

Grouped: Is the relationship the same within different categories?

ggplot(df) +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt,
                  color = as.factor(Prekaritet)))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 178 rows containing non-finite values (stat_smooth).

Local regression: The queen of trivariate plots?

Would you like to replace the “spaghetti” with straight lines? Ask for a (bivariate) linear model.

ggplot(df) +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt,
                  color = as.factor(Prekaritet)),
              method = "lm")

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 178 rows containing non-finite values (stat_smooth).

Here we clearly see that the association between immigration skepticism and income is primarily present for respondents who feel that they lack money (categories 0 and 1).

One continuous and one categorical variable (+ more tidyverse)

When we want to explore a continuous and a categorical variable together, we usually calculate group means; i.e. we calculate the average of the continuous variable within each category of the other.

This is a good excuse to have a second look at tidyverse. dplyr has a very handy function, group_by(), that allows us to do operations on subsets of the data defined by a grouping variable.

Let’s calculate the average skepticism among voters of different parties and create a summary statistic. To do so, we use summarize().

df %>%
  group_by(Parti) %>%
  summarize("Scepticism" = mean(Skepsis, na.rm = T))

If we like this table, we can store it in an object. Before I do so, however, I’m removing the NAs.

tab <-
  df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  summarize("Scepticism" = mean(Skepsis, na.rm = T))

The parties are ordered alphabetically, which is not very informative. Let’s sort them by average skepticism instead.

tab <- 
  tab %>%
  arrange(Scepticism)

tab

No surprises here; Dansk Folkeparti has voters that are more skeptical than those of other parties. That said, there is not a massive spread overall.
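The same filter, group, summarize and arrange steps can be followed by hand on a tiny data set. The data frame toy and its columns are invented for illustration, so the numbers are easy to verify.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  party = c("A", "A", "B", "B", "C", NA),
  score = c(2, 4, 7, NA, 5, 3)
)

toy_means <- toy %>%
  # Drop respondents without a party
  filter(!is.na(party)) %>%
  # One group per party
  group_by(party) %>%
  # Mean within each group, ignoring missing scores
  summarize(mean_score = mean(score, na.rm = TRUE)) %>%
  # Lowest mean first
  arrange(mean_score)

toy_means
```

Party A averages (2 + 4) / 2 = 3, party B keeps only its single non-missing score of 7, and party C has 5, so the sorted table lists A, C and B.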

Let’s instead plot this.

base

barplot(tab$Scepticism,
        names.arg = tab$Parti,
        las = 2)

ggplot2

Instead of plotting the entire data set, I now rely on my table in ggplot2 as well. I’ve already calculated the height of the bars, so there is no need for ggplot to do that. I signal my intent by using stat = "identity". Now, the y-values and the x-values are defined by me.

ggplot(tab) +
  #Bars with the heights taken directly from the table
  geom_bar(aes(y = Scepticism,
               x = Parti),
           stat = "identity")

Two categorical variables

When we have two categorical variables, we make a cross-table.

table(df$Parti, df$Kvinde)

##                              
##                                 0   1
##   Andet                        11   6
##   Dansk Folkeparti             85  58
##   Det Konservative Folkeparti  33  32
##   Det Radikale Venstre         69  65
##   Enhedslisten                 32  45
##   Kristendemokraterne           4   4
##   Liberal Alliance             38  10
##   Socialdemokraterne          116 152
##   Socialistisk Folkeparti      43  65
##   Venstre                     175 136

Absolute frequencies are hard to fathom. I like them relative. Also, this time around, we want to know the proportion of voters that are men/women per party. It means that I calculate the distribution by rows (margin = 1).

table(df$Parti, df$Kvinde) %>%
  prop.table(., margin = 1)

##                              
##                                       0         1
##   Andet                       0.6470588 0.3529412
##   Dansk Folkeparti            0.5944056 0.4055944
##   Det Konservative Folkeparti 0.5076923 0.4923077
##   Det Radikale Venstre        0.5149254 0.4850746
##   Enhedslisten                0.4155844 0.5844156
##   Kristendemokraterne         0.5000000 0.5000000
##   Liberal Alliance            0.7916667 0.2083333
##   Socialdemokraterne          0.4328358 0.5671642
##   Socialistisk Folkeparti     0.3981481 0.6018519
##   Venstre                     0.5627010 0.4372990
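The role of the margin argument is easiest to see on a small table. The counts below are invented for illustration; margin = 1 makes each row sum to 1, while margin = 2 does the same for each column.

```r
# Hypothetical 2 x 2 cross-table: party (rows) by gender (columns)
m <- matrix(c(30, 10,
              20, 40),
            nrow = 2, byrow = TRUE,
            dimnames = list(party = c("A", "B"), woman = c("0", "1")))
m <- as.table(m)

by_row <- prop.table(m, margin = 1)  # share of men/women within each party
by_col <- prop.table(m, margin = 2)  # share of each party within each gender

rowSums(by_row)
colSums(by_col)
```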

We don’t usually test statistical significance when we explore the data. However, we can. The chi-squared test lets us check whether the pattern in the table could plausibly have arisen by chance. The p-value reports the probability of seeing an association at least this strong if party choice and gender were in fact unrelated. Here, that’s very unlikely.

table(df$Parti, df$Kvinde) %>%
  chisq.test()

## Warning in chisq.test(.): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  .
## X-squared = 38.546, df = 9, p-value = 1.391e-05
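To make the test less of a black box, the statistic can be computed by hand from observed and expected counts. This is a sketch on an invented 2 x 2 table (obs is hypothetical); the continuity correction that chisq.test() applies to 2 x 2 tables is switched off so the two results are comparable. The warning in the output above typically appears when some expected counts are small.

```r
# Hypothetical 2 x 2 table of counts
obs <- matrix(c(30, 10,
                20, 40),
              nrow = 2, byrow = TRUE)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson's chi-squared statistic, degrees of freedom and p-value
x2 <- sum((obs - expected)^2 / expected)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

# Same result as chisq.test() without the continuity correction
test <- chisq.test(obs, correct = FALSE)
all.equal(x2, unname(test$statistic))
```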

base

crosstab <- table(df$Kvinde, df$Parti)

barplot(crosstab,
        beside = TRUE)

ggplot2

ggplot(df) +
  geom_bar(aes(x = Parti,
               fill = as.factor(Kvinde)),
           position = "dodge")

Workflow and tweaks in ggplot2

Let’s follow an alternative workflow and create our own small dataset for plotting. Here, I am interested in immigration skepticism among respondents with different work situations.

dfp <- df %>%
  #Filter out NA
  filter(!is.na(Prekaritet)) %>%
  #Group
  group_by(Prekaritet) %>%
  #Group average
  summarize(Innvandringsskepsis = mean(Skepsis, na.rm = T))


#Make a new variable with the survey question

dfp$`How is your work situation?` <- dfp$Prekaritet %>% 
  as.factor

#Redefine the categories as intuitive answers
levels(dfp$`How is your work situation?`) <- c("Insecure",
                                               "Secure")

Now you can ask R to plot.

ggplot(dfp) +
  #Classical definition of a coordinate system
  geom_bar(aes(y = Innvandringsskepsis,
               x = `How is your work situation?`),
           #Use the numbers you provide as bar heights instead of letting ggplot count
           stat = "identity")

dfp <- df %>%
  #Filter out NA
  filter(!is.na(Prekaritet) & !is.na(Kvinde)) %>%
  #Group
  group_by(Prekaritet, Kvinde) %>%
  #Group average
  summarize(Innvandringsskepsis = mean(Skepsis, na.rm = T))

## `summarise()` has grouped output by 'Prekaritet'. You can override using the
## `.groups` argument.

#Make a new variable with the survey question
dfp$`How is your work situation?` <- dfp$Prekaritet %>% 
  as.factor
#Redefine the categories as intuitive answers
levels(dfp$`How is your work situation?`) <- c("Precarious",
                                               "Safe")
dfp$Gender <- ifelse(dfp$Kvinde == 1,
                     "Woman", "Man")
dfp

p <- 
  ggplot(dfp) +
  #Classical definition of both x and y coordinates
  geom_bar(aes(y = Innvandringsskepsis,
               x = `How is your work situation?`,
               #Fill in with colors following the grouping
               fill = Gender),
           #place bars side-by-side
           position = "dodge",
           #Use the numbers you provide as bar heights without counting
           stat = "identity")

Note how I have saved my plot in an object: p <- ggplot(). To see the plot, I request it.

A bar chart created from a separate dataset where I have aggregated and prepared the information.

It is useful to save the object this way because I will modify my R object further. I can do that by adding new elements with “+”.

Aesthetic tweaks

Once you have added the graphical elements with your data, you can spend an endless amount of time fine-tuning the rest of the plot.

Tell R what language you speak (and where you are)

R was not created by Scandinavians, so Danish characters can sometimes cause problems. We have to actively consider encoding choices in two settings: when we import data and when we save data, including in graphics.

Computers read and provide information as a series of 0s and 1s. To read and write text, we use encoding, a translation of binary codes to the alphabet we know. Traditionally, each country and language had its own encoding system. This was cumbersome, but today, most text exchanged on the internet follows “utf-8” encoding. It includes most of the characters we know how to read. Sys.setlocale() tells R where you are in the world. Here, I am telling it that I want Danish language with utf-8 encoding.

Sys.setlocale(category = "LC_ALL",
              locale = "da_DK.UTF-8")
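Locale names differ between operating systems, so the request may fail on some machines. The sketch below assumes a locale called "da_DK.UTF-8" (common on Linux and macOS, but an assumption here) and checks the return value, which is an empty string when the request cannot be honored.

```r
# Ask for a Danish UTF-8 locale for character handling only.
# Assumption: a locale named "da_DK.UTF-8" may or may not exist on this machine.
res <- suppressWarnings(
  Sys.setlocale(category = "LC_CTYPE", locale = "da_DK.UTF-8")
)

# Sys.setlocale() returns "" when the locale is not available
if (identical(res, "")) {
  message("Locale not available; keeping ", Sys.getlocale("LC_CTYPE"))
} else {
  message("Locale set to ", res)
}
```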

Titles and axis names

p <- 
  p +
  #Add a title and a subtitle
  ggtitle(label = "Relationship between immigration skepticism and work situation",
          subtitle = "Data from ESS (2014)") +
  #Name of x-axis
  xlab("How is your work situation?") +
  #What are the limits on my axis?
  ylim(c(0,6))

p

Define colors

You can choose your own colors, and there are many ways to do so. If you have linked your data with color choices, it may be useful to define a “palette.”

The legend text (“Man” and “Woman”) is part of the data I provided, so it is located in the variable dfp$Gender. I can change it in ggplot if I want to. I do this when defining the colors.

p <- 
  p +  
  #Define colors
  scale_color_manual(
    #What colors?
    values = c("purple", "magenta"),
    #Color the entire bar
    aesthetics = "fill",
    #What's the title of the legend?
    name = "Gender",
    #What are the category names?
    labels = c("Male",
               "Female")
    ) 
p

Background

You can define different templates for the aesthetic aspects of the plot with theme_...().

p <- 
  p +
  # Define a theme
  theme_minimal()
p

Here, I am using ready-made functions to modify the visual expression of the plot.

You can alter the themes at will. You do that after you have defined the theme, using the generic theme() function.

p <-
  p +
  # Modify the minimal theme after defining it
  theme(
    # Move the legend down to below the plot
    legend.position = "bottom",
    # Bold font for the title in the legend
    legend.title = element_text(face = "bold"),
    # Italic for axis values
    axis.text = element_text(face = "italic"),
    # Remove the gridlines
    panel.grid = element_blank())
p

Here, I’ve changed the background and other thematic elements in my graphic.

The element_...() functions are used inside the theme() function. They specify everything except the data content/information for an element. Should the element be blank (element_blank())? Should the text have a specific font (element_text())? And so on.

For example, I can specify that the text used for the axis values should be displayed in italic with element_text(face = "italic").