Dialects and data manipulation
So far, we have been working in base-R. It is the oldest and most
stable dialect in R, and all our readings are in base as well. It is possible
to perform most operations in base-R, but in recent years two new
dialects have emerged: ggplot2 and tidyverse.
The first one is for graphical display; the other is more general but is
particularly useful for data wrangling.
Tidyverse consists of a set of R packages that all follow roughly the same linguistic principles. This dialect has several advantages:
- Data cleaning: As the name suggests, the tidyverse is particularly well suited for cleaning up data. dplyr is one of the packages in the tidyverse. It has made a lot of data preparation very fast. R has traditionally been poorly adapted to large datasets, because functions that have to go through long variables have been very slow. In addition, the package makes it possible to communicate with databases (SQL).
- Intuitive: In addition to its own vocabulary (a set of functions), the tidyverse also relies on a different type of syntax. Many people find that this syntax is closer to the way we think and build an argument. Therefore, the code can seem more intuitive.
ggplot2 is an R package on which many other packages are based. They all aim to help us with our graphics. You can do your plots in base-R. That’s what I do in the book, but you may also find ggplot2 helpful in your own work. Generally speaking, I use base functions when I want to do things “quick and dirty”, while I use ggplot2 for more advanced plots. The reason is that ggplot2 often requires quite a few lines of code to do even basic tweaks.
It is an advantage to have basic knowledge of all dialects. This way, you can work around any problem.
Pipes
Before using functions from tidyverse, you must fetch the package (or one of the packages that are based on it) out of the library and into your workspace memory. The functions and syntax are located in the package.
We will be using the dplyr package, which is part of the tidyverse. We start by loading the package from the library. If you have not already installed it, you must do so first.
# install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
R follows up by informing us which functions are now masked in our workspace. For example, dplyr contains the intersect() function, which has a namesake in base-R. If we use the function name on its own, it is now the dplyr function that we get.
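To see what masking means in practice, the sketch below calls the same function name with and without a package prefix. The vectors x and y are made up for illustration; the :: operator is standard R for naming the package explicitly, and the same prefix works for stats::filter().

```r
library(dplyr)

# Two made-up vectors, for illustration only
x <- c(1, 2, 3)
y <- c(2, 3, 4)

# After library(dplyr), the bare name resolves to dplyr's version
res_masked <- intersect(x, y)

# The package prefix reaches the base-R version explicitly
res_base <- base::intersect(x, y)

# For plain vectors like these, the two should agree
identical(res_masked, res_base)
```

Masking therefore rarely changes results for simple vectors, but the prefix is a safe habit when two packages share a function name.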
Example data
Here we will use a version of the Norwegian ESS (2014). We start by loading the data and giving it a name we like.
library(RiPraksis)
data(kap6)
df <- kap6
Syntax: Pipes or parentheses?
In base-R: parentheses
In base-R, you communicate by first specifying the function you want to use, then the object name in parentheses, as well as arguments/additional arguments separated by a comma.
Here, I calculate the mean of the variable “Skepsis” among ESS respondents. The additional argument na.rm = TRUE addresses the problem of some respondents not having responded.
mean(df$Skepsis, na.rm = T)
## [1] 5.009241
To make the code more readable, we will often break the line after each comma.
mean(df$Skepsis,
     na.rm = T)
## [1] 5.009241
In tidyverse: pipes
In the tidyverse, we use “pipes” to link different operations together. We do this with “%>%” (percent, greater-than, percent). We can use functions from base-R or from the tidyverse together with the pipes.
We specify the object first, then the function. All arguments that do not specify the object must be added in the “old-fashioned” way. We indicate where the object should have been with “.”.
df$Skepsis %>% mean(., na.rm = T)
## [1] 5.009241
With pipes, it is extra neat to break the line for readability.
df$Skepsis %>%
  mean(., na.rm = T)
## [1] 5.009241
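The equivalence between the two syntaxes can be checked on a small example. This is a minimal sketch on a made-up vector x (not the ESS data); it also shows that the “.” placeholder may be left out when the object is the first argument of the function.

```r
library(dplyr)  # makes the %>% pipe available

# Made-up values with one missing observation
x <- c(4, 6, NA, 5)

# Base-R style: function first, object inside the parentheses
m_base <- mean(x, na.rm = TRUE)

# Pipe style with an explicit placeholder
m_dot <- x %>% mean(., na.rm = TRUE)

# The placeholder can be dropped when the object is the first argument
m_nodot <- x %>% mean(na.rm = TRUE)

c(m_base, m_dot, m_nodot)
```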
Multiple operations on the same object
We can always build up a sequence of code. This should be done chronologically (first operation first). We can do it in three ways:
- one operation at a time. We can do each operation separately and save the results in separate objects.
na <- is.na(df$Skepsis)
ant.na <- sum(na)
ant.na
## [1] 5
We have 5 missing values in the variable.
- two operations at a time
base-R: We can nest parentheses inside parentheses. The last operation is written first, and the object sits innermost.
sum(is.na(df$Skepsis))
## [1] 5
tidyverse: We can create a pipe: The object first, then the first operation, then the operation we want to perform on the result of this operation.
df$Skepsis %>%
  is.na(.) %>%
  sum()
## [1] 5
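The three ways of sequencing operations can be compared side by side. The sketch below uses a made-up vector x with two missing values rather than the ESS data; the object names are hypothetical.

```r
library(dplyr)

# Made-up variable with two missing values
x <- c(3, NA, 7, NA, 5)

# 1) One operation at a time, storing each step
is_missing <- is.na(x)
n_step <- sum(is_missing)

# 2) Nested parentheses: read from the inside out
n_nested <- sum(is.na(x))

# 3) A pipe: read from top to bottom
n_piped <- x %>%
  is.na() %>%
  sum()

c(n_step, n_nested, n_piped)
```

All three give the same count; they differ only in how easy the code is to read and to debug step by step.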
There are no rules for what is best. You decide!
Recoding
It is often said that 80% of the analytical work in research is not running models. It is data manipulation: collecting and merging data, exploring the data, correcting errors and recoding to make new variables.
When we recode, we use information from variables that are already in the data to create new variables. When doing so, we can either increase the measurement level by inserting information, or decrease the measurement level by aggregating.
We can do the manipulations in both base and in tidyverse.
Making indexes
Here, I exemplify by creating an additive index.
base
We can do this by hand, by summing the values of the three survey questions for each respondent.
df$TagerJobs + df$TagerSkat + df$Belastning
Now the scale goes from 0 to 30. That’s not very intuitive. Let’s instead keep the old scale by calculating the average.
(df$TagerJobs + df$TagerSkat + df$Belastning)/3
If I’m happy with the outcome, I can store it in a new variable.
df$index <-
  (df$TagerJobs + df$TagerSkat + df$Belastning)/3
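One thing to watch when adding items by hand is that a respondent with a missing answer on any item gets a missing index. A possible alternative, sketched here on a made-up data frame (toy, with hypothetical items a, b and c), is base-R's rowMeans(), which can average over the items that were answered. Whether that is appropriate depends on the analysis; it is a different rule from the one used in the text.

```r
# Made-up data: three items, one missing answer in the third row
toy <- data.frame(a = c(2, 8, 5),
                  b = c(4, 6, NA),
                  c = c(6, 10, 7))

# Sum-and-divide: any missing item makes the index missing
toy$index_sum <- (toy$a + toy$b + toy$c) / 3

# rowMeans() with na.rm = TRUE averages the available items instead
toy$index_rm <- rowMeans(toy[, c("a", "b", "c")], na.rm = TRUE)

toy
```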
tidyverse
I can do the exact same manipulation in the tidyverse. The dplyr function mutate() creates new variables for me.
df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)
If I’m happy with the outcome, I overwrite the entire dataframe.
df <-
  df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)
I always check the outcome of such operations. A good first step here is to look at the distribution of the new variable in a histogram.
hist(df$index)
When I recode from old variables, I also check whether the new and old variables correlate as they should.
plot(df[, c("index", "TagerJobs", "TagerSkat", "Belastning")])
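Alongside the scatterplot matrix, the same check can be done numerically. The sketch below is one possible way, on a made-up data frame (toy, with hypothetical items a, b and c); it uses cor() with pairwise deletion so that missing values do not turn the whole matrix into NA.

```r
# Made-up items and their average
toy <- data.frame(a = c(1, 4, 6, 9),
                  b = c(2, 3, 7, 8),
                  c = c(0, 5, 5, 10))
toy$index <- (toy$a + toy$b + toy$c) / 3

# Correlation of the index with each variable; pairwise deletion handles NAs
r <- cor(toy, use = "pairwise.complete.obs")[, "index"]
round(r, 2)
```

An index built as an average of its items should correlate positively with each of them; a negative or near-zero correlation suggests a coding error or a reversed item.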
Conditional recoding
Often, our recoding involves going down a notch on the measurement level. This is when we aggregate information. To do so, we need to tell R what our rules are for what to aggregate.
Remember the guessing game?
5 > 8
## [1] FALSE
We can use this for recoding. Here, I recode income. I start by making a rule: People who earn less than the median income are “poor”.
Here are the poor people:
df$Indtaegt < 5
base
I begin by making a new variable.
df$Poor <- NA
df$Poor[df$Indtaegt < 5] <- 1
df$Poor[df$Indtaegt >= 5] <- 0
Now, I can check the frequency table.
table(df$Poor)
##
## 0 1
## 853 474
… and plot it
barplot(table(df$Poor))
tidyverse
In tidyverse, we use the case_when() function to express the conditioning.
df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
  )
If this looks ok, then we can store the outcome.
df <-
  df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
  )
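It can help to see how case_when() treats respondents who match none of the conditions, such as those with missing income. This is a minimal sketch on a made-up data frame (toy, with a hypothetical income variable), not the ESS data.

```r
library(dplyr)

# Made-up income deciles, including one non-response
toy <- data.frame(income = c(2, 5, 9, NA))

toy <- toy %>%
  mutate(poor = case_when(income < 5  ~ 1,
                          income >= 5 ~ 0))

# Rows that match no condition (here the NA) are left as NA
toy$poor
```

This mirrors the base-R version, where the new variable starts out as NA and is only filled in where a rule applies.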
Group averages
We can also manipulate parts of the data separately, for example one group at a time. The tidyverse is particularly strong here.
df <-
  df %>%
  group_by(Poor) %>%
  mutate("scepticism" = mean(Skepsis, na.rm = T))
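The difference between adding a group mean to every row and collapsing the data to one row per group is easy to miss. The sketch below illustrates both on a made-up data frame (toy, with hypothetical variable names); the ungroup() call at the end is an addition of mine so that later steps are not grouped by accident.

```r
library(dplyr)

# Made-up data: two groups, one missing value
toy <- data.frame(poor = c(0, 0, 1, 1),
                  skepsis = c(4, 6, 8, NA))

# mutate() keeps every row and repeats the group mean within each group
with_means <- toy %>%
  group_by(poor) %>%
  mutate(group_mean = mean(skepsis, na.rm = TRUE)) %>%
  ungroup()

# summarise() collapses the data to one row per group instead
per_group <- toy %>%
  group_by(poor) %>%
  summarise(group_mean = mean(skepsis, na.rm = TRUE))
```

Use mutate() when the group mean is needed as a variable in further analysis, and summarise() when the goal is a small table of group statistics.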
Remove variables or data points
base
I can remove data points using indexing.
df_poor <- df[df$Poor == 1,]
I can also remove variables using the same indexing.
df_poor <- df_poor[, -3]
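Two details of base-R indexing are worth knowing. A logical condition that evaluates to NA produces a row of NAs rather than dropping the case, and removing a column by position breaks silently if the column order changes. The sketch below shows one way around both, on a made-up data frame (toy, with hypothetical variables id, poor and edu).

```r
# Made-up data with one missing value on the grouping variable
toy <- data.frame(id = 1:4,
                  poor = c(1, 0, NA, 1),
                  edu = c("low", "high", "low", "high"))

# A logical index containing NA returns an all-NA row for that case
with_na_row <- toy[toy$poor == 1, ]

# which() keeps only the rows where the condition is TRUE
only_poor <- toy[which(toy$poor == 1), ]

# Dropping a column by name is less fragile than dropping it by position
no_edu <- only_poor[, setdiff(names(only_poor), "edu")]
```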
tidyverse
In tidyverse we have, as you are now used to, a function to do the same operations.
I can remove observations using the filter() function.
df_poor <-
  df %>%
  filter(Poor == 1)
I can also select or remove variables using the select() function. I can select variables by specifying their name, or deselect them by adding an exclamation mark first.
df_poor <-
  df_poor %>%
  select(!Uddannelse)
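The two steps can also be chained in a single pipe. This is a minimal sketch on a made-up data frame (toy, with hypothetical variables id, poor and edu); note that filter() keeps only rows where the condition is TRUE, so cases with a missing value on the condition are dropped.

```r
library(dplyr)

# Made-up data with one missing value on the grouping variable
toy <- data.frame(id = 1:4,
                  poor = c(1, 0, NA, 1),
                  edu = c("low", "high", "low", "high"))

poor_only <- toy %>%
  filter(poor == 1) %>%  # rows where the condition is TRUE; the NA row is dropped
  select(!edu)           # every column except edu

poor_only
```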