Dialects and data manipulation
So far, we have been working in base R. It is the oldest and most stable dialect in R, and all our readings are also in base R. It is possible to perform most operations in base R, but in recent years two new dialects have emerged: ggplot2 and tidyverse. The first one is for graphical display; the other is more general but is particularly interesting for data wrangling.
Tidyverse consists of a set of R packages that all follow roughly the same linguistic principles. This dialect has several advantages:
- Data cleaning: As the name suggests, the tidyverse is particularly well suited for cleaning up data. dplyr is one of the packages in the tidyverse. It has made a lot of data preparation very fast. R has traditionally been poorly adapted to large datasets: functions that have to go through long variables have been very slow. In addition, the package makes it possible to communicate with databases (SQL).
- Intuitive: In addition to its own vocabulary (a set of functions), the package also relies on a different type of syntax. Many people find that this syntax is closer to the way we think and build an argument. Therefore, the code can seem more intuitive.
ggplot2 is an R package on which many other packages are based. They all aim to help us with our graphics. You can do your plots in base R. That’s what I do in the book (Hermansen, 2023, ch 3), but you may also find ggplot2 helpful in your own work. Generally speaking, I use base functions when I want to do things “quick and dirty”, while I use ggplot2 for more advanced plots. The reason is that ggplot2 often requires quite a few lines of code to do even basic tweaks.
It is an advantage to have basic knowledge of all dialects. This way, you can work around any problem.
Before we start
How to use R packages
Before using functions from tidyverse, you must first install the package on your computer (only once). Then, at the beginning of each session, you need to fetch the package (or one of the packages that are based on it) from the library and load it into your workspace memory. Only then will the functions and syntax located in the package be available to you.
Here, we will rely on the dplyr package, which is part of the tidyverse. We start by fetching the package from the library. If you haven’t already installed it, you need to do that first.
# install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
R follows up by informing us of which functions are now masked in our workspace. For example, dplyr contains the intersect() function, which has a namesake in base R. If we use the function name, it is the dplyr function that we will now be served.
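To make the masking concrete, here is a minimal sketch on a made-up data frame (`toy` is hypothetical and not part of the ESS data). It assumes dplyr is installed. The `package::function()` notation lets us state explicitly which of the two namesakes we want.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(a = 1:5)

# After library(dplyr), a bare filter() is dplyr's version:
# it keeps the rows where the condition is TRUE
kept <- filter(toy, a > 3)

# The masked stats version is still reachable with the :: prefix.
# It does something quite different: here, a 3-point moving average
smoothed <- stats::filter(1:5, rep(1/3, 3))

kept
smoothed
```

If a function suddenly behaves strangely after you have loaded a package, masking is a likely suspect, and the `::` prefix is one way to check.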
Three ways to load in data in R-format
For this session, we will use a version of the Norwegian ESS (2014). We start by loading the data and giving it a name we like.
- You can find the data in my R package: RiPraksis
library(RiPraksis)
data(kap6)
- Alternatively, you can download the file from my webpage. You can do this automatically from RStudio. Here, I have already set my working directory, so it suffices to specify that the destination file should be called kap6.rda.
download.file("https://siljehermansen.github.io/teaching/beyond-linear-models/kap6.rda",
destfile = "kap6.rda")
I can now load the R-file into R using the load() function. The data is an R object that is stored as an .rda file (R-object file). It is good practice to give the file the same name as the object when you store it.
load("kap6.rda")
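If you want to see how an .rda file behaves without downloading anything, the following sketch saves a made-up object to a temporary file and loads it back. The object name `toy` and the file name are assumptions for illustration only.

```r
# Hypothetical toy object, for illustration only
toy <- data.frame(id = 1:3, score = c(4, 7, 9))

# Store it as an .rda file in a temporary folder
path <- file.path(tempdir(), "toy.rda")
save(toy, file = path)

# Remove the object from the workspace, then load it back
rm(toy)
restored <- load(path)

# load() returns the names of the objects it restored;
# the object comes back under the name it had when it was saved
restored
head(toy)
```

Note that the object reappears under its original name, not the file name. That is one reason why giving the file and the object the same name saves confusion later.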
- … or you can do the download by hand. Navigate to the link in your browser, click to download. Move the file from your download folder to your working folder (where your working directory is set). From R, you can now open the file by clicking: “File” -> “Open file” etc.
Finally, I copy my R object into another object with a name that is easier to write. My main data is usually called df.
df <- kap6
Eye-ball the data
I always start a session by checking out my data. I can look at the entire spreadsheet:
View(df)
I at least check the dimensions (number of observations and variables) and variable names.
dim(df); names(df)
## [1] 1502 14
## [1] "Parti" "Indtaegt" "Uddannelse" "Kvinde"
## [5] "Innvandrer" "Skepsis" "KultSkepsis" "TagerJobs"
## [9] "TagerSkat" "Belastning" "Arbejdslos" "Begraenset"
## [13] "IngenKontrakt" "Prekaritet"
I can also check the first six observations.
head(df)
…and the last six observations.
tail(df)
Pipes
Syntax: Pipes or parentheses?
In base-R: parentheses
In base-R, you communicate by first specifying the function you want to use, then the object name in parentheses, as well as arguments/additional arguments separated by a comma.
Here, I calculate the mean of the variable “Skepsis” among ESS respondents. The additional argument na.rm = TRUE addresses the problem of some respondents not having responded.
mean(df$Skepsis, na.rm = T)
## [1] 5.009241
To make the code more readable, we will often break the line after each comma.
mean(df$Skepsis,
     na.rm = T)
## [1] 5.009241
In tidyverse: pipes
In the tidyverse, we use “pipes” to link different arguments together. We do this with “%>%” (percent, greater-than, percent). We can use functions from base-R or from the tidyverse together with the pipes.
We specify the object first, then the function. All arguments that do not specify the object must be added in the “old-fashioned” way. We indicate where the object should have been with “.”.
df$Skepsis %>% mean(., na.rm = T)
## [1] 5.009241
With pipes, it is extra neat to break the line for readability.
df$Skepsis %>%
  mean(., na.rm = T)
## [1] 5.009241
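As a quick sanity check, the sketch below runs both spellings on a small made-up vector (`skepsis_toy` is hypothetical, not the ESS variable) and confirms that they return the same number. It assumes dplyr is installed and loaded, since that is what makes `%>%` available here.

```r
library(dplyr)

# Hypothetical toy vector with one missing answer
skepsis_toy <- c(3, 7, NA, 5)

# Base R: function first, object inside the parentheses
nested <- mean(skepsis_toy, na.rm = TRUE)

# Pipe: object first, then the function; "." marks where the object goes
piped <- skepsis_toy %>% mean(., na.rm = TRUE)

nested
piped
identical(nested, piped)
```

The two versions are different ways of writing the same call, so the choice is about readability rather than results.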
Multiple operations on the same object
We can always build up a sequence of code. This should be done chronologically (first operation first). We can do it in three ways:
- one operation at a time. We can do each operation separately and save them in separate objects.
na <- is.na(df$Skepsis)
n_na <- sum(na)
n_na
## [1] 5
We have 5 missing values in the variable.
- two operations at a time
base R: We can nest parentheses inside parentheses. The operation that is performed last is written first (outermost), and the object comes last (innermost).
sum(is.na(df$Skepsis))
## [1] 5
tidyverse: We can create a pipe: The object first, then the first operation, then the operation we want to perform on the result of this operation.
df$Skepsis %>%
  is.na(.) %>%
  sum()
## [1] 5
There are no rules for what is best. You decide!
Recoding
80% of the analytical work in research is actually not running models. It is data manipulation: collecting and merging data, exploring the data, correcting errors and recoding to make new variables.
When we recode, we use information from variables that are already in the data to create new variables. When doing so, we can either increase the measurement level by inserting information, or decrease the measurement level by aggregating.
We can do the manipulations both in base R and in tidyverse.
Making indexes
Here, I exemplify by creating an additive index.
In base
We can do this by hand, by first summing the values of the three survey questions for each respondent.
df$TagerJobs + df$TagerSkat + df$Belastning
Now the scale goes from 0 to 30. That’s not very intuitive. Let’s instead keep the old scale by calculating the average.
(df$TagerJobs + df$TagerSkat + df$Belastning)/3
If I’m happy with the outcome, I can store it in a new variable.
df$index <-
  (df$TagerJobs + df$TagerSkat + df$Belastning)/3
dplyr
I can do the exact same manipulation in tidyverse. The function mutate() creates new variables for me.
df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)
If I’m happy with the outcome, I overwrite the entire dataframe.
df <-
  df %>%
  mutate("index" = (TagerJobs + TagerSkat + Belastning)/3)
I always check the outcome of such operations. One good way is to check the distribution of the new variable in a histogram.
hist(df$index)
When I recode from old variables, I also check whether the new and old variables correlate as they should.
plot(df[, c("index", "TagerJobs", "TagerSkat", "Belastning")])
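One detail of the additive index is worth a small sketch: adding the items with `+` returns NA for any respondent who skipped even one of the questions. The made-up data below (`toy` is hypothetical; only the variable names are borrowed from the data) compares the by-hand average with base R's `rowMeans()`, which offers an `na.rm` argument. Whether averaging over the remaining items is acceptable is a substantive decision, not a technical one.

```r
# Hypothetical toy data: the third respondent skipped one question
toy <- data.frame(TagerJobs  = c(2, 8, NA),
                  TagerSkat  = c(4, 6, 5),
                  Belastning = c(6, 7, 3))

items <- c("TagerJobs", "TagerSkat", "Belastning")

# By hand: one missing item makes the whole index missing
by_hand <- (toy$TagerJobs + toy$TagerSkat + toy$Belastning) / 3

# rowMeans() gives the same result by default ...
by_rowmeans <- rowMeans(toy[, items])

# ... but can also average over the items that were answered
by_rowmeans_narm <- rowMeans(toy[, items], na.rm = TRUE)

by_hand
by_rowmeans_narm
```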
Conditional recoding
Often, our recoding involves going down a notch on the measurement level. This is when we aggregate information. To do so, we need to tell R what our rules are for what to aggregate.
Remember the guessing game?
5 > 8
## [1] FALSE
We can use this for recoding. Here, I recode income. I start by making a rule: People who earn less than the median income are “poor”.
Here are the poor people
df$Indtaegt < 5
base
I begin by making a new variable.
df$Poor <- NA
df$Poor[df$Indtaegt < 5] <- 1
df$Poor[df$Indtaegt >= 5] <- 0
Now, I can check the frequency table.
table(df$Poor)
##
## 0 1
## 853 474
… and plot it
barplot(table(df$Poor))
tidyverse
In tidyverse, we use the case_when() function to express the conditioning.
df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
  )
If this looks ok, then we can store the outcome.
df <-
  df %>%
  mutate("Poor" = case_when(Indtaegt < 5 ~ 1,
                            Indtaegt >= 5 ~ 0)
  )
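It can help to see what `case_when()` does with respondents who have no income information. The sketch below uses a made-up data frame (`toy` is hypothetical) and assumes dplyr is installed. A missing income satisfies neither condition, so the new variable is missing as well.

```r
library(dplyr)

# Hypothetical toy data: the last respondent did not report an income
toy <- data.frame(Indtaegt = c(2, 5, 9, NA))

toy <-
  toy %>%
  mutate(Poor = case_when(Indtaegt < 5 ~ 1,
                          Indtaegt >= 5 ~ 0))

# Rows that match no condition get NA
toy
table(toy$Poor, useNA = "ifany")
```

Adding `useNA = "ifany"` to `table()` is a cheap way to make such missing values visible when checking a recode.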
A nice tool for conditional recoding is the if… else… algorithm. It is based on a condition and an instruction: if a statement is true, then do this; if the statement is not true, then do that.
base
In base R this is done by the ifelse() function.
df$Poor <- ifelse(df$Indtaegt < 5,
                  yes = 1,
                  no = 0)
tidyverse
In tidyverse this is handled by the if_else() function.
df <-
  df %>%
  mutate(Poor = if_else(Indtaegt < 5,
                        1,
                        0))
This is a true time-saver. Also, you can nest one statement in the other such that you end up with several conditions and outputs.
df <-
  df %>%
  mutate(Rich = if_else(Indtaegt < 3,
                        "Poor",
                        if_else(Indtaegt > 7,
                                "Rich",
                                "Medium")))
Look! I’ve got three categories.
df$Rich %>% table
## .
## Medium Poor Rich
## 669 241 417
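Nested statements get hard to read once there are more than two or three categories. As a sketch of an alternative, the same three-way rule can be written flat with `case_when()`. The data below are made up (`toy` is hypothetical), dplyr is assumed to be installed, and the middle condition is spelled out so that missing incomes stay missing in both versions.

```r
library(dplyr)

# Hypothetical toy incomes, including one missing value
toy <- data.frame(Indtaegt = c(1, 3, 5, 7, 9, NA))

toy <-
  toy %>%
  mutate(
    # Nested version, as in the text
    nested = if_else(Indtaegt < 3,
                     "Poor",
                     if_else(Indtaegt > 7,
                             "Rich",
                             "Medium")),
    # Flat version: one line per category
    flat = case_when(Indtaegt < 3 ~ "Poor",
                     Indtaegt > 7 ~ "Rich",
                     Indtaegt >= 3 & Indtaegt <= 7 ~ "Medium")
  )

toy
identical(toy$nested, toy$flat)
```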
Group averages
We can make manipulations on parts of the data. Tidyverse is unsurpassed on this one. My favorite function is group_by(). It groups/indexes the data according to a grouping variable (“Poor”) so that we can do operations within each group. Here I calculate the mean level of scepticism towards immigrants among poor and not poor respondents.
df <-
  df %>%
  group_by(Poor) %>%
  mutate("scepticism_gr" = mean(Skepsis, na.rm = T))
I now have a new variable in the data set called “scepticism_gr”.
I always check if everything went as planned. That’s where simple descriptive statistics are useful. Here, I make a frequency table.
df$scepticism_gr %>% table
## .
## 4.80219349784567 5.24373795761079 5.29598308668076
## 853 175 474
Hm… I recoded “Poor” to only two values, but here I’ve got three. Let’s look closer at the data. There are many variables in the data frame, so I select only a few.
df %>% select(Parti, scepticism_gr)
## Adding missing grouping variables: `Poor`
Right… Sometimes, there is a missing observation in my “Poor” variable. R has thus created a third group of “NA” for me. Good to know…
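The message about “missing grouping variables” appears because a data frame stays grouped after `group_by()` until we say otherwise. A minimal sketch on made-up data (`toy` is hypothetical; dplyr is assumed to be installed) shows both the extra NA group and how `ungroup()` removes the grouping once the group means are in place.

```r
library(dplyr)

# Hypothetical toy data: the last respondent has no value on Poor
toy <- data.frame(Poor    = c(1, 1, 0, 0, NA),
                  Skepsis = c(6, 4, 3, 5, 8))

toy <-
  toy %>%
  group_by(Poor) %>%
  mutate(scepticism_gr = mean(Skepsis, na.rm = TRUE)) %>%
  ungroup()   # drop the grouping again once we are done

# The NA respondent forms a group of their own
toy

# The data frame is no longer grouped
is_grouped_df(toy)
```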
Remove variables or data points
base
I can remove data points using the indexing.
df_poor <- df[df$Poor == 1,]
I can also remove variables using the same indexing.
df_poor <- df_poor[, -3]
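A word of caution when the condition contains missing values: with base indexing, a row where the condition is NA comes back as an empty row full of NAs. The sketch below uses made-up data (`toy` is hypothetical) and shows one common workaround, wrapping the condition in `which()`.

```r
# Hypothetical toy data: the third respondent has no value on Poor
toy <- data.frame(id = 1:4, Poor = c(1, 0, NA, 1))

# Plain logical indexing: the NA in the condition produces an all-NA row
with_na <- toy[toy$Poor == 1, ]
nrow(with_na)

# which() returns only the positions where the condition is TRUE
without_na <- toy[which(toy$Poor == 1), ]
nrow(without_na)
```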
tidyverse
In tidyverse we have, as you are now used to, a function to do the same operations.
I can remove observations using the filter() function.
df_poor <-
  df %>%
  filter(Poor == 1)
I can also select or remove variables using the select() function. I can select variables by specifying their name, or deselect them by adding an exclamation mark first.
df_poor <-
  df_poor %>%
  select(!Uddannelse)
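The two steps can also be chained in a single pipe. The sketch below does so on made-up data (`toy` is hypothetical, borrowing only the variable names; dplyr is assumed to be installed). Note that `filter()` keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data: the third respondent has no value on Poor
toy <- data.frame(id         = 1:4,
                  Uddannelse = c(12, 9, 15, 11),
                  Poor       = c(1, 0, NA, 1))

toy_poor <-
  toy %>%
  filter(Poor == 1) %>%    # keep the poor respondents
  select(!Uddannelse)      # drop the education variable

toy_poor
```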
Your turn!
Can you do the same voodoo?
What is the average number of years of education among respondents in the data?
What is the average level of education within each party?
Use reframe() instead of mutate() to calculate the group averages in R. What happens? Store the results in a new R object.
Which party has the lowest educated voters?
Use indexing to sort the data frame according to level of education. You can use order() from base R or arrange() in dplyr.
Recap of key functions
Function | What it does | Package |
---|---|---|
dim() | Dimension. Reports the number of observations and variables in the data. | base R |
names() | Variable names. Reports the variable names in a data frame. | base R |
mean() | Calculates the mean of a numerical variable. | base R |
is.na() | TRUE/FALSE. Reports whether information is missing for the observation. | base R |
mutate() | Returns a “tibble”/data frame. Alters the data frame by creating a new variable or overwriting an old one. | tidyverse |
group_by() | Groups the data according to a specified grouping variable. | tidyverse |
df[, 1] | Selects first variable/column. Can also be indexed by variable names. | base R |
select() | Selects a variable/column. An alternative to base indexing. | tidyverse |
df[1,] | Selects first observation/row. Used in combination with the “guessing game”/conditional statements. | base R |
filter() | Selects observations/rows. Used in combination with the “guessing game”/conditions as an alternative to base indexing. | tidyverse |
ifelse() | Conditional recoding. Recodes the entire variable at once. | base R |
if_else() | Conditional recoding. For use inside mutate(). | tidyverse |