Dialects and data manipulation

So far, we have been working in base R. It is the oldest and most stable dialect in R, and all our readings are also in base R. It is possible to perform most operations in base R, but in recent years two new dialects have emerged: ggplot2 and tidyverse. The first is for graphical display; the other is more general but is particularly interesting for data wrangling.

Tidyverse

The tidyverse consists of a set of R packages that all follow roughly the same linguistic principles. This dialect has several advantages:

Data cleaning: As the name suggests, the tidyverse is particularly well-suited for cleaning up data. dplyr is one of the packages in the tidyverse. It has made a lot of data preparation very fast. Base R has traditionally been poorly adapted to large datasets: functions that have to go through long variables have been very slow. In addition, the package makes it possible to communicate with databases (SQL).

Intuitive: In addition to its own vocabulary (a set of functions), the package also relies on a different type of syntax. Many people find that this syntax is closer to the way we think and build an argument. Therefore, the code can seem more intuitive.

ggplot2

ggplot2 is an R package on which many other packages are based. They all aim to help us with our graphics. You can do your plots in base R. That’s what I do in the book (Hermansen, 2023, ch 3), but you may also find ggplot2 helpful in your own work. Generally speaking, I use base functions when I want to do things “quick and dirty”, while I use ggplot2 for more advanced plots. The reason is that ggplot2 often requires quite a few lines of code to do even basic tweaks.

It is an advantage to have basic knowledge of all dialects. This way, you can work around any problem.

Before we start

How to use R packages

Before using functions from tidyverse, you must first install the package on your computer (only once). Then, at the beginning of each session, you need to fetch the package (or one of the packages that are based on it) from the library and load it into your workspace memory. Only then will the functions and syntax located in the package be available to you.

Here, we will rely on the dplyr package, which is part of the tidyverse. We start by fetching the package from the library. If you haven’t already installed it, you need to do that first.

# install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

R follows up by informing us which functions are now masked in our workspace. For example, dplyr contains the intersect() function, which has a namesake in base R. If we use the function name, it is the dplyr function that we will now be served.
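If you still need the masked version, you can spell out the package name in front of the function with two colons. Here is a minimal sketch; the vector x is made up for illustration and is not part of the ESS data.

```r
library(dplyr)

# A small made-up vector, only for illustration
x <- c(1, 2, 3, 4)

# stats::filter() is the masked function: a linear filter
# (here a two-point moving average)
ma <- stats::filter(x, rep(1 / 2, 2))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
kept <- dplyr::filter(data.frame(x = x), x > 2)
kept
```

Writing the package name also makes it clear to a reader which of the two namesakes you intended.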

Three ways to load in data in R-format

For this session, we will use a version of the Norwegian ESS (2014). We start by loading the data and giving it a name we like.

You can find the data in my R package: RiPraksis

library(RiPraksis)
data(kap6)

Alternatively, you can download the file from my webpage. You can do this automatically from RStudio. Here, I have already set my working directory, so it suffices to specify that the destination file should be called kap6.rda.

download.file("https://siljehermansen.github.io/teaching/beyond-linear-models/kap6.rda",
              destfile = "kap6.rda")

I can now load the R-file into R using the load() function. The data is an R-object that is stored as an .rda-file (R-object file). It is good practice to call the file the same thing as the object when you store it.

load("kap6.rda")

… or you can do the download by hand. Navigate to the link in your browser, click to download. Move the file from your download folder to your working folder (where your working directory is set). From R, you can now open the file by clicking: “File” -> “Open file” etc.
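The load() function restores the object under the name it had when it was saved, which is why it helps to name the file after the object. A small round trip illustrates this, assuming a made-up data frame called toy and a temporary file (neither is part of the course data):

```r
# A made-up object, only for illustration
toy <- data.frame(id = 1:3, score = c(4, 7, 9))

# Store it in a temporary .rda file named after the object
path <- file.path(tempdir(), "toy.rda")
save(toy, file = path)

# Remove the object, then load the file:
# the object comes back under its stored name
rm(toy)
loaded <- load(path)

loaded    # the name(s) of the restored object(s)
dim(toy)  # the object is available again
```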

Finally, I copy my R object into another object with a name that is easier to write. My main data is usually called df.

df <- kap6

Eye-ball the data

I always start a session by checking out my data. I can look at the entire spreadsheet:

View(df)

I at least check the dimensions (number of observations and variables) and variable names.

dim(df); names(df)

## [1] 1502   14

##  [1] "Parti"         "Indtaegt"      "Uddannelse"    "Kvinde"       
##  [5] "Innvandrer"    "Skepsis"       "KultSkepsis"   "TagerJobs"    
##  [9] "TagerSkat"     "Belastning"    "Arbejdslos"    "Begraenset"   
## [13] "IngenKontrakt" "Prekaritet"

I can also check the first six observations.

head(df)

…and the last six observations

tail(df)

Pipes

Syntax: Pipes or parentheses?

In base-R: parentheses

In base-R, you communicate by first specifying the function you want to use, then the object name in parentheses, as well as arguments/additional arguments separated by a comma.

Here, I calculate the mean of the variable “Skepsis” among ESS respondents. The additional argument na.rm = TRUE addresses the problem of some respondents not having responded.

mean(df$Skepsis, na.rm = T)

## [1] 5.009241

To make the code more readable, we will often break the line after each comma.

mean(df$Skepsis, 
     na.rm = T)

## [1] 5.009241
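To see what na.rm does, here is a small sketch with a made-up vector of answers (not the ESS data) in which one respondent did not answer:

```r
# A made-up vector with one missing answer
answers <- c(3, 7, NA, 5)

# Without na.rm, one missing value makes the mean unknown
mean(answers)

# With na.rm = TRUE, the mean is calculated on the answers we do have
mean(answers, na.rm = TRUE)
```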

In tidyverse: pipes

In the tidyverse, we use “pipes” to link different operations together. We do this with “%>%” (percent, greater-than, percent). We can use functions from base R or from the tidyverse together with the pipes.

We specify the object first, then the function. All arguments that do not specify the object must be added in the “old-fashioned” way. We indicate where the object should have been with “.”.

df$Skepsis %>% mean(., na.rm = T)

## [1] 5.009241

With pipes, it is extra neat to break the line for readability.

df$Skepsis %>% 
  mean(., na.rm = T)

## [1] 5.009241
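The placeholder is optional when the object belongs in the first argument of the function, but necessary when it belongs somewhere else. A minimal sketch, again with a made-up vector:

```r
library(dplyr)  # makes the %>% pipe available

# A made-up vector with one missing answer
answers <- c(3, 7, NA, 5)

# By default the object lands in the first argument, so the dot can be left out
without_dot <- answers %>% mean(na.rm = TRUE)

# Writing the dot gives the same result
with_dot <- answers %>% mean(., na.rm = TRUE)

# The dot is needed when the object belongs in another argument
elsewhere <- TRUE %>% mean(answers, na.rm = .)

c(without_dot, with_dot, elsewhere)
```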

Multiple operations on the same object

We can always build up a sequence of code. This should be done chronologically (first operation first). We can do it in three ways:

one operation at a time. We can do each operation separately and save them in separate objects.

na <- is.na(df$Skepsis)
n_na <- sum(na)
n_na

## [1] 5

We have 5 missing values in the variable.

two operations at a time

base-R: We can put parentheses in parentheses. The last operation is written first (outermost). The object comes last (innermost).

sum(is.na(df$Skepsis))

## [1] 5

tidyverse: We can create a pipe: The object first, then the first operation, then the operation we want to perform on the result of this operation.

df$Skepsis %>%
  is.na(.) %>% 
  sum()

## [1] 5

There are no rules for what is best. You decide!
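The difference becomes more visible with three operations. This sketch counts the number of distinct answers in a made-up vector in all three ways; the object names are only illustrative.

```r
library(dplyr)

# A made-up vector with one missing answer and one repeated answer
answers <- c(3, 7, NA, 5, 7)

# One operation at a time, each stored in its own object
answered <- na.omit(answers)
distinct <- unique(answered)
n_steps  <- length(distinct)

# Nested parentheses: read from the inside out
n_nested <- length(unique(na.omit(answers)))

# Pipe: read from top to bottom
n_pipe <- answers %>%
  na.omit() %>%
  unique() %>%
  length()

c(n_steps, n_nested, n_pipe)
```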

Recoding

Roughly 80% of the analytical work in research is actually not running models. It is data manipulation: collecting and merging data, exploring the data, correcting errors and recoding to make new variables.

When we recode, we use information from variables that are already in the data to create new variables. When doing so, we can either increase the measurement level by inserting information, or decrease the measurement level by aggregating.

We can do the manipulations in both base and in tidyverse.

Making indexes

Here, I exemplify by creating an additive index.

In base R: We can do this by hand, by summing the values of the three survey questions for each respondent.

df$TagerJobs + df$TagerSkat + df$Belastning

Now the scale goes from 0 to 30. That’s not very intuitive. Let’s instead keep the old scale by calculating the average.

(df$TagerJobs + df$TagerSkat + df$Belastning)/3

If I’m happy with the outcome, I can store it in a new variable.

df$index <- 
  (df$TagerJobs + df$TagerSkat + df$Belastning)/3

In dplyr: I can do the exact same manipulation in tidyverse. The function mutate() creates new variables for me.

df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)

If I’m happy with the outcome, I overwrite the entire dataframe.

df <- 
  df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)

I always check the outcome of such operations. A good first check here is the distribution of the variable in a histogram.

hist(df$index)

When I recode from old variables, I also check whether the new and old variables correlate as they should.

plot(df[, c("index", "TagerJobs", "TagerSkat", "Belastning")])
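One thing to keep an eye on is missing values: when the index is calculated by hand, a respondent who skipped one question gets a missing index. As an alternative to the approach above (not a replacement for it), base R's rowMeans() gives the same result and can optionally average over the questions that were answered. A sketch with made-up data and hypothetical variable names:

```r
# Made-up data: three survey items, one missing answer
toy <- data.frame(jobs   = c(2, 8, 5),
                  tax    = c(4, 6, NA),
                  burden = c(6, 10, 5))

items <- c("jobs", "tax", "burden")

# By hand: one missing item makes the whole index missing
by_hand <- (toy$jobs + toy$tax + toy$burden) / 3

# rowMeans() gives the same result ...
with_rowmeans <- rowMeans(toy[, items])

# ... and can instead average over the items that were answered
lenient <- rowMeans(toy[, items], na.rm = TRUE)

by_hand
lenient
```

Whether the lenient version is appropriate depends on the research question; it treats a respondent with two answers the same way as one with three.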

Conditional recoding

Often, our recoding involves going down a notch on the measurement level. This is when we aggregate information. To do so, we need to tell R what our rules are for what to aggregate.

Remember the guessing game?

5 > 8

## [1] FALSE

We can use this for recoding. Here, I recode income. I start by making a rule: People who earn less than the median income are “poor”.

Here are the poor people

df$Indtaegt < 5

base

I begin by making a new variable.

df$Poor <- NA

df$Poor[df$Indtaegt < 5] <- 1
df$Poor[df$Indtaegt >= 5] <- 0

Now, I can check the frequency table.

table(df$Poor)

## 
##   0   1 
## 853 474

… and plot it

barplot(table(df$Poor))

tidyverse

In tidyverse, we use the case_when()-function to express the conditioning.

df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
         )

If this looks ok, then we can store the outcome.

df <- 
  df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
  )

A nice tool for conditional recoding is the if… else… algorithm. It is based on a condition and an instruction: if a statement is true, then do this; if the statement is not true, then do that.

base

In base R this is done by the ifelse() function.

df$Poor <- ifelse(df$Indtaegt < 5, 
                  yes = 1,
                  no = 0)

tidyverse

In tidyverse this is handled by the if_else() function.

df <- 
  df %>%
  mutate(Poor = if_else(Indtaegt < 5, 
                        1,
                        0))
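Both functions pass a missing condition on as a missing result. In addition, if_else() has a missing argument that lets you decide what those cases should become. A sketch with a made-up income vector and illustrative labels:

```r
library(dplyr)

# A made-up income variable with one missing value
income <- c(2, 6, NA, 9)

# Base R: a missing income gives a missing result
base_version <- ifelse(income < 5, "Poor", "Not poor")

# dplyr: same result by default
tidy_version <- if_else(income < 5, "Poor", "Not poor")

# The missing argument assigns a value of our choice to the missing cases
tidy_labelled <- if_else(income < 5, "Poor", "Not poor",
                         missing = "Unknown")

base_version
tidy_labelled
```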

This is a true time-saver. Also, you can nest one statement in the other such that you end up with several conditions and outputs.

df <- 
  df %>%
  mutate(Rich = if_else(Indtaegt < 3, 
                        "Poor",
                        if_else(Indtaegt > 7,
                                "Rich",
                                "Medium")))

Look! I’ve got three categories.

df$Rich %>% table

## .
## Medium   Poor   Rich 
##    669    241    417
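Nested statements become hard to read as the number of categories grows. The same three categories could also be written with case_when(), where each line is one rule and the rules are tried from top to bottom. This is a sketch on made-up data; the last rule is my own way of catching everyone with a known income who is neither poor nor rich.

```r
library(dplyr)

# Made-up incomes, one of them missing
toy <- tibble(income = c(1, 5, 9, NA))

toy <-
  toy %>%
  mutate(group = case_when(income < 3 ~ "Poor",
                           income > 7 ~ "Rich",
                           # everyone else with a known income
                           !is.na(income) ~ "Medium"))

toy$group
```

Respondents who match none of the rules, here the one with a missing income, are left as missing.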

Group averages

We can make manipulations on parts of the data. Tidyverse is unsurpassed on this one. My favorite function is group_by(). It groups/indexes the data according to a grouping variable (“Poor”) so that we can do operations within each group. Here I calculate the mean level of scepticism towards immigrants among poor and not poor respondents.

df <- 
  df %>%
  group_by(Poor) %>%
  mutate("scepticism_gr" = mean(Skepsis, na.rm = T))

I now have a new variable in the data set called “scepticism_gr”.

I always check if everything went as planned. That’s where simple descriptive statistics are useful. Here, I make a frequency table.

df$scepticism_gr %>% table

## .
## 4.80219349784567 5.24373795761079 5.29598308668076 
##              853              175              474

Hm… I recoded “Poor” to only two values, but here I’ve got three. Let’s look closer at the data. There are many variables in the data frame, so I select only a few.

df %>% select(Parti, scepticism_gr)

## Adding missing grouping variables: `Poor`

Right… Sometimes, there is a missing observation in my “Poor” variable. R has thus created a third group of “NA” for me. Good to know…
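The same behaviour can be reproduced on a small scale. In this sketch (made-up data, hypothetical variable names), summarise() returns one row per group, which makes the extra “NA” group easy to spot. I also use ungroup() at the end, since a data frame stays grouped after mutate(); that is why select() added the grouping variable back in above.

```r
library(dplyr)

# Made-up data: the grouping variable has one missing value
toy <- tibble(poor       = c(1, 1, 0, 0, NA),
              scepticism = c(6, 8, 4, 2, 5))

# One row per group: the missing value forms a group of its own
by_group <-
  toy %>%
  group_by(poor) %>%
  summarise(mean_scepticism = mean(scepticism),
            n = n())
by_group

# Add the group mean as a variable, then drop the grouping again
toy <-
  toy %>%
  group_by(poor) %>%
  mutate(scepticism_gr = mean(scepticism)) %>%
  ungroup()
```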

Remove variables or data points

base

I can remove data points using the indexing.

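# which() keeps only rows where the condition is TRUE;
# a bare logical index would return rows with a missing Poor as all-NA rows
df_poor <- df[which(df$Poor == 1), ]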

I can also remove variables using the same indexing.

df_poor <- df_poor[, -3]

tidyverse

In tidyverse we have, as you are now used to, a function to do the same operations.

I can remove observations using the filter() function.

df_poor <-
  df %>%
  filter(Poor == 1)

I can also select or remove variables using the select() function. I can select variables by specifying their name, or deselect them by adding an exclamation mark first.

df_poor <-
  df_poor %>%
  select(!Uddannelse)
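One difference between the two approaches is how they treat missing values in the condition. Base indexing with a logical condition returns a row of NAs for each observation where the condition is missing, while filter() drops those observations. A sketch on made-up data:

```r
library(dplyr)

# Made-up data with one missing value in the condition variable
toy <- data.frame(id = 1:4, poor = c(1, 0, NA, 1))

# Base indexing: the missing condition comes back as a row of NAs
base_rows <- toy[toy$poor == 1, ]

# Wrapping the condition in which() keeps only the rows where it is TRUE
which_rows <- toy[which(toy$poor == 1), ]

# filter() also drops rows where the condition is missing
filter_rows <- toy %>% filter(poor == 1)

nrow(base_rows)
nrow(which_rows)
nrow(filter_rows)
```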

Your turn!

Can you do the same voodoo?

What is the average number of years of education among respondents in the data?
What is the average level of education within each party?
Use reframe() instead of mutate() to calculate the group averages in R. What happens?
Store the results in a new R object.
Which party has the lowest educated voters?
Use indexing to sort the data frame according to level of education. You can use order() from base R or arrange() in dplyr.

Recap of key functions

Function	What it does	Package
`dim()`	Dimension. Reports the number of observations and variables in the data.	base R
`names()`	Variable names. Reports the variable names in a data frame	base R
`mean()`	Calculates the mean of a numerical variable	base R
`is.na()`	TRUE/FALSE. Reports whether the observation is missing (TRUE) or not (FALSE).	base R
`mutate()`	Returns a “tibble”/data frame. Alters the data frame by creating a new/overwriting the old variable.	tidyverse
`group_by()`	Groups the data according to a specified grouping variable	tidyverse
`df[, 1]`	Selects first variable/column. Can also be indexed by variable names	base R
`select()`	Selects a variable/column. An alternative to base indexing.	tidyverse
`df[1,]`	Selects first observation/row. Used in combination with the “guessing game”/conditional statements.	base R
`filter()`	Selects observations/rows. Used in combination with the “guessing game”/conditions as an alternative to base indexing.	tidyverse
`ifelse()`	Conditional recoding. Recodes the entire variable at once	base R
`if_else()`	Conditional recoding. For use in the `mutate()`	tidyverse