Take2: Dialects and data manipulation

R-workshop 30th of October, 4A-1-60, Faculty of Law, University of Copenhagen

So far, we have been working in base R. It is the oldest and most stable dialect of R, and all our readings are in base as well. Most operations can be performed in base R, but in recent years two newer dialects have emerged: ggplot2 and the tidyverse. The first is for graphical display; the other is more general, but is particularly useful for data wrangling.

Tidyverse

consists of a set of R packages that all follow roughly the same linguistic principles. This dialect has several advantages:

Data cleaning: As the name suggests, these packages are particularly well suited to cleaning up data. dplyr, one of the tidyverse packages, has made much data preparation very fast. R has traditionally been poorly adapted to large datasets, because functions that have to run through long variables have been slow. In addition, dplyr makes it possible to communicate with databases (SQL).

Intuitive: In addition to its own vocabulary (a set of functions), the tidyverse also relies on a different type of syntax. Many people find that this syntax is closer to the way we think and build an argument, so the code can seem more intuitive.

ggplot2

is an R package on which many other packages are built. They all aim to help us with our graphics. You can do your plots in base R; that is what I do in the book, but you may also find ggplot2 helpful in your own work. Generally speaking, I use base functions when I want to do things "quick and dirty", while I use ggplot2 for more advanced plots. The reason is that ggplot2 often requires quite a few lines of code even for basic tweaks.

It is an advantage to have basic knowledge of all dialects. This way, you can work around any problem.

Pipes

Before using functions from the tidyverse, you must fetch the package (or one of the packages built on it) from the library and into your workspace memory. The functions and syntax live in the package.

We will rely on the dplyr package, which is part of the tidyverse. We start by fetching the package from the library. If you haven’t already installed it, you need to do that first.

# install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

R follows up by informing us which functions are now masked in our workspace. For example, dplyr contains the intersect() function, which has a namesake in base R. If we use the function name on its own, it is the dplyr version we will now be served.
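
If you ever need a specific version of a masked function, you can ask for it explicitly with the package prefix ::. A small sketch:

# Ask for the base-R version explicitly
base::intersect(c(1, 2, 3), c(2, 3, 4))

# ... or for the dplyr version
dplyr::intersect(c(1, 2, 3), c(2, 3, 4))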

Example data

Here we will use a version of the Danish ESS (2014). We start by loading the data and giving it a name we like.

load("kap6.rda")
df <- kap6
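
It is usually worth taking a quick look at what we just loaded before doing anything else. A minimal sketch (the exact variable names will of course depend on your version of the file):

# Quick sanity check: size and variable names of the data
dim(df)
names(df)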

Syntax: Pipes or parentheses?

In base-R: parentheses

In base R, you communicate by first writing the function you want to use, then, in parentheses, the name of the object followed by any additional arguments, all separated by commas.

Here, I calculate the mean of the variable "Skepsis" (scepticism) among the ESS respondents. The additional argument na.rm = TRUE handles the problem that some respondents did not answer, by removing the missing values.

mean(df$Skepsis, na.rm = T)
## [1] 5.009241

To make the code more readable, we will often break the line after each comma.

mean(df$Skepsis, 
     na.rm = T)
## [1] 5.009241

In tidyverse: pipes

In the tidyverse, we use "pipes" to link different operations together. We do this with %>% (percent, greater-than, percent). We can use functions from base R or from the tidyverse together with pipes.

We specify the object first, then the function. All arguments other than the object are added in the "old-fashioned" way. We indicate where the object would otherwise have gone with a ".".

df$Skepsis %>% mean(., na.rm = T)
## [1] 5.009241

With pipes, it is especially neat to break the line for readability.

df$Skepsis %>% 
  mean(., na.rm = T)
## [1] 5.009241
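
As a side note: when the piped object should go into the first argument of the function, the "." can usually be left out. A small sketch of the same calculation:

# The "." is implicit when the object is the first argument
df$Skepsis %>%
  mean(na.rm = TRUE)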

Multiple operations on the same object

We can always build up a sequence of code. This should be done chronologically (first operation first). We can do it in three ways:

  • one operation at a time. We can do each operation separately and save them in separate objects.
na <- is.na(df$Skepsis)
ant.na <- sum(na)
ant.na
## [1] 5

We have 5 missing values in the variable.

  • two operations at a time. We can either nest the functions (base R) or chain them with a pipe (tidyverse).

base-R: We can put parentheses inside parentheses. The innermost function is evaluated first, and the object sits innermost of all.

sum(is.na(df$Skepsis))
## [1] 5

tidyverse: We can create a pipe: the object first, then the first operation, then the operation we want to perform on the result of the first.

df$Skepsis %>%
  is.na(.) %>% 
  sum()
## [1] 5

There are no rules for what is best. You decide!

Recoding

80% of the analytical work in research is actually not running models. It is data manipulation: collecting and merging data, exploring the data, correcting errors, and recoding to make new variables.

When we recode, we use information from variables that are already in the data to create new variables. When doing so, we can either increase the measurement level by inserting information, or decrease the measurement level by aggregating.

We can do these manipulations both in base and in the tidyverse.

Making indexes

Here, I exemplify by creating an additive index.

In base

We can do this by hand by summing the values of the three survey questions for each respondent.

df$TagerJobs + df$TagerSkat + df$Belastning

Now the scale goes from 0 to 30. That’s not very intuitive. Let’s instead keep the old scale by calculating the average.

(df$TagerJobs + df$TagerSkat + df$Belastning)/3

If I’m happy with the outcome, I can store it in a new variable.

df$index <- 
  (df$TagerJobs + df$TagerSkat + df$Belastning)/3
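
One thing to be aware of: with this sum-and-divide approach, a respondent with a missing value on any of the three items gets a missing index. If you would rather average across the items that were actually answered, a sketch could use rowMeans(); the variable name index2 here is just for illustration:

# Average across the answered items instead of dropping the whole respondent
df$index2 <- rowMeans(df[, c("TagerJobs", "TagerSkat", "Belastning")],
                      na.rm = TRUE)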

dplyr

I can do the exact same manipulation in the tidyverse. The function mutate() creates new variables for me.

df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)

If I’m happy with the outcome, I overwrite the entire dataframe.

df <- 
  df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)

I always check the outcome of such operations. The first check is the distribution of the new variable in a histogram.

hist(df$index)
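
A numerical summary is a useful complement to the histogram:

# Five-number summary plus mean and number of missing values
summary(df$index)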

When I recode from old variables, I also check whether the new and old variables correlate as they should.

plot(df[, c("index", "TagerJobs", "TagerSkat", "Belastning")])
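
If you prefer numbers to a scatterplot matrix, the same check can be done with a correlation matrix; this is just a sketch using the base cor() function:

# Pairwise correlations between the index and its components
cor(df[, c("index", "TagerJobs", "TagerSkat", "Belastning")],
    use = "pairwise.complete.obs")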

Conditional recoding

Often, our recoding involves going down a notch on the measurement level. This is when we aggregate information. To do so, we need to tell R what our rules for aggregation are.

Remember the guessing game?

5 > 8
## [1] FALSE

We can use this for recoding. Here, I recode income. I start by making a rule: people who earn less than the median income are "poor".

Here are the poor respondents:

df$Indtaegt < 5
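
To see how many respondents the rule catches, we can count the TRUEs and FALSEs:

# TRUE = below the cut-off ("poor"), FALSE = at or above it
table(df$Indtaegt < 5)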

base

I begin by making a new variable.

df$Poor <- NA
df$Poor[df$Indtaegt < 5] <- 1
df$Poor[df$Indtaegt >= 5] <- 0
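
For reference, a more compact base-R sketch that should give the same result (respondents with a missing income stay NA in both versions):

# One-line alternative with ifelse()
df$Poor <- ifelse(df$Indtaegt < 5, 1, 0)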

Now, I can check the frequency table.

table(df$Poor)
## 
##   0   1 
## 853 474

… and plot it

barplot(table(df$Poor))

tidyverse

In the tidyverse, we use the case_when() function to express the conditions.

df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
         )

If this looks ok, then we can store the outcome.

df <- 
  df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
  )
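
As with any recode, I check the result afterwards; including the missing values in the table is a cheap safeguard:

# Frequency table of the new variable, with missing values shown
table(df$Poor, useNA = "ifany")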

Group averages

We can perform manipulations on parts of the data. The tidyverse is unsurpassed here.

df <- 
  df %>%
  group_by(Poor) %>%
  mutate("scepticism" = mean(Skepsis, na.rm = T))

Remove variables or data points

base

I can remove data points using indexing. Here, I keep only the poor respondents:

df_poor <- df[df$Poor == 1,]

I can also remove variables using the same indexing. Here, I drop the third column:

df_poor <- df_poor[, -3]
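
Removing by position is fragile if the column order changes. In base R, a variable can also be removed by name; a small sketch, assuming the variable Uddannelse (education) is in the data:

# Drop a variable by name instead of by position
df_poor$Uddannelse <- NULL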

tidyverse

In the tidyverse we have, as you are by now used to, functions to do the same operations.

I can remove observations using filter().

df_poor <-
  df %>%
  filter(Poor == 1)

I can also select or remove variables using the select() function. I can select variables by specifying their names, or deselect them by putting an exclamation mark in front.

df_poor <-
  df_poor %>%
  select(!Uddannelse)
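
The same deselection can also be written with a minus sign, and select() can of course be used the other way around, to keep only a few named variables. A sketch on the full data frame (where Uddannelse still exists):

# Deselect with a minus sign instead of an exclamation mark
df %>% select(-Uddannelse)

# Keep only a handful of named variables
df %>% select(Skepsis, Indtaegt, Poor)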