Dialects and data manipulation
So far, we have been working in base R. It is the oldest and most stable dialect of R, and all our readings are also in base R. It is possible to perform most operations in base R, but in recent years two new dialects have emerged: ggplot2 and tidyverse. The first is for graphical display; the other is more general but is particularly useful for data wrangling.
tidyverse consists of a set of R packages that all follow roughly the same linguistic principles. This dialect has several advantages:
- Data cleaning: As the name suggests, the package is particularly well suited for cleaning up data: recoding variables, aggregating data, pulling out descriptive statistics. In class, I will mostly rely on dplyr, one of the packages in the tidyverse.
- Intuitive: In addition to its own vocabulary (a set of functions), the package also relies on a different type of syntax. Many people find that this syntax is closer to the way we think and build an argument. Therefore, the code can seem more intuitive.
ggplot2 helps us with our graphics. It is an R package on which many other packages for graphical display are based. You can also do your plots in base R. That’s what I do in the book (Hermansen, 2023, ch 3), but you may also find ggplot2 helpful in your own work. Generally speaking, I use base functions when I want to do things “quick and dirty”, while I use ggplot2 for more advanced plots. The reason is that ggplot2 often requires quite a few lines of code to do even basic tweaks.
It is an advantage to have basic knowledge of all dialects. This way, you can work around any problem.
Before we start
How to use R packages
Before using functions from tidyverse, you must first install the package on your computer. You only need to do this once. At the beginning of each session, you then need to fetch the package (or one of the packages that are based on it) from the “library” where all installed packages are stored. By doing this you load it into your workspace memory. Only then will the functions and syntax located in the package be available to you.
Here, we will rely on the dplyr package, which is part of the tidyverse. We start by fetching the package from the library. If you haven’t already installed it, you need to do that first.
R follows up by informing us of which functions are masked in our workspace. For example, dplyr contains the intersect() function, which has a namesake in base R. If we use the function name, it is the dplyr function that we will now be served.
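The install-once, load-every-session routine can be sketched as follows. This is a minimal illustration, not the original chunk: the install line is commented out so the example does not download anything, and the vectors a and b are made up.

```r
# install.packages("dplyr")  # run once per computer (commented out here)
library(dplyr)               # run at the start of every session

# Both base R and dplyr define intersect(). After library(dplyr),
# the plain name refers to the dplyr version.
a <- c(1, 2, 3)
b <- c(2, 3, 4)
common_dplyr <- intersect(a, b)        # dplyr's version
common_base  <- base::intersect(a, b)  # explicitly ask for the base R version
```

The `package::function()` notation is a way to reach a masked function without unloading anything.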
Three ways to load in data in R-format
For this session, we will use a version of the Danish ESS (2014). We start by loading the data and giving it a name we like.
- You can find the data in my R package:
- Alternatively, you can download the file from my webpage. You can do this automatically from RStudio. Here, I have already set my working directory, so it suffices to specify that the destination file should be called kap6.rda.
destfile = "kap6.rda")
I can now load the R-file into R using the load() function. The data is an R-object that is stored as an .rda-file (R-object file). It is good practice to call the file the same thing as the object when you store it.
- … or you can do the download by hand. Navigate to the link in your browser, click to download. Move the file from your download folder to your working folder (where your working directory is set). From R, you can now open the file by clicking: “File” -> “Open file” etc.
Finally, I copy my R object into another object with a name that is easier to write. My main data is usually called df.
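As a rough sketch of the save, load and rename steps, the example below writes a made-up toy object to a temporary folder instead of downloading the course file. The object name kap6 and its contents are assumptions for illustration only.

```r
# Hypothetical toy object standing in for the course data
kap6 <- data.frame(Skepsis = c(4, 6, NA), Kvinde = c(1, 0, 1))

# Store the object in an .rda file named after the object
path <- file.path(tempdir(), "kap6.rda")
save(kap6, file = path)

# Remove it from the workspace, then load it back in
rm(kap6)
loaded <- load(path)  # load() returns the names of the restored objects

# Copy the object into one with a shorter name
df <- kap6
```

Note that load() restores the object under the name it had when it was saved, which is why naming the file after the object helps.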
Eye-ball the data
I always start a session by checking out my data. I can look at the entire spreadsheet:
I at least check the dimensions (number of observations and variables) and variable names.
## [1] 1502 14
## [1] "Parti" "Indtaegt" "Uddannelse" "Kvinde"
## [5] "Innvandrer" "Skepsis" "KultSkepsis" "TagerJobs"
## [9] "TagerSkat" "Belastning" "Arbejdslos" "Begraenset"
## [13] "IngenKontrakt" "Prekaritet"
I can also check the first six observations.
…and the last six observations
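The inspection steps above could look like this on a small made-up data frame. The values are invented and only some of the variable names are borrowed from the data.

```r
# Toy data frame with invented values
df <- data.frame(
  Parti      = c("A", "B", "A", "C", "B", "A", "C", "B"),
  Uddannelse = c(12, 15, 9, 17, 13, 11, 16, 10),
  Skepsis    = c(5, 3, NA, 2, 6, 7, 4, 5)
)

# View(df)             # opens the spreadsheet view (interactive sessions only)
dims      <- dim(df)   # number of observations (rows) and variables (columns)
vars      <- names(df) # variable names
first_six <- head(df)  # first six observations
last_six  <- tail(df)  # last six observations
```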
Syntax: Pipes or parentheses?
In base-R: parentheses
In base-R, you communicate by first specifying the function you want to use, then the object name in parentheses, as well as arguments/additional arguments separated by a comma.
Here, I calculate the mean of the variable “Skepsis” among Danish ESS respondents. The additional argument na.rm = TRUE handles the problem of some respondents not having responded.
## [1] 5.009241
To make the code more readable, we will often break the line after each comma.
## [1] 5.009241
It has no effect on how R reads the script, but the human eye finds it easier to navigate. You can automatically indent the code chunk by marking the lines and hitting “Ctrl + I”.
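A minimal sketch of the parenthesis syntax, using a made-up vector of answers rather than the ESS data:

```r
# Toy data: one respondent has not answered
df <- data.frame(Skepsis = c(5, 3, NA, 2, 6))

# Function first, then the object, then additional arguments
m_one_line <- mean(df$Skepsis, na.rm = TRUE)

# Same call, with a line break after the comma for readability
m_broken <- mean(df$Skepsis,
                 na.rm = TRUE)

# Without na.rm = TRUE, the missing value makes the result NA
m_with_na <- mean(df$Skepsis)
```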
In tidyverse: pipes
In tidyverse, we use “pipes” to link different commands together: the object, the verb/function. We do this with “%>%” (percent, greater-than, percent). We can use functions from base-R or from the tidyverse together with the pipes.
We specify the object first, then the function. All arguments that do not specify the object must be added in the “old-fashioned” way. We indicate where the object should have been with “.”.
## [1] 5.009241
With pipes, it is extra neat to break the line for readability.
## [1] 5.009241
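The same calculation written with a pipe might look like this (toy data again, and the pipe is made available by loading dplyr):

```r
library(dplyr)

# Toy data: one respondent has not answered
df <- data.frame(Skepsis = c(5, 3, NA, 2, 6))

# Object first, then the function; "." marks where the object goes
m_pipe <- df$Skepsis %>%
  mean(., na.rm = TRUE)
```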
Multiple operations on the same object
We can always build up a sequence of code. This should be done chronologically (first operation first). We can do it in three ways:
- one operation at a time. We can do each operation separately and save them in separate objects.
#First operation: which observations are missing?
na <- is.na(df$Skepsis)
#Second operation: sum over the number of missing observations
n_na <- sum(na)
#Inspect the information contained in the object
n_na
## [1] 5
We have 5 missing values in the variable. This is useful when you’re uncertain about whether each operation performs the task you want, and it makes debugging easier.
- two operations at a time
base-R: We can put parentheses in parentheses. The last operation is written first, and the object comes last.
## [1] 5
This saves time, but given the reverse order of the operations, it is less reflective of how we think insofar as we usually consider the first operation first.
tidyverse: We can create a pipe that performs several tasks successively: The object first, then the first operation, then the operation we want to perform on the result of this operation etc.
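As a sketch of the nested version, on a made-up vector with two missing answers:

```r
# Toy data with two missing answers
df <- data.frame(Skepsis = c(5, NA, 3, NA, 6))

# Read from the inside out: is.na() runs first, then sum()
n_na_nested <- sum(is.na(df$Skepsis))
```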
## [1] 5
If you want to store the result, you’d have to save it in a separate object as we did in the previous example, e.g. n_na.
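The piped equivalent, stored in an object, could be written like this (toy data, pipe loaded via dplyr):

```r
library(dplyr)

# Toy data with two missing answers
df <- data.frame(Skepsis = c(5, NA, 3, NA, 6))

# Read from top to bottom: the object, then is.na(), then sum()
n_na <- df$Skepsis %>%
  is.na() %>%
  sum()
```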
There are no rules for what is best. You decide!
80% of the analytical work in research is actually not running models. It is data manipulation: collecting and merging data, exploring the data, correcting errors and recoding to make new variables.
When we recode, we use information from variables that are already in the data to create new variables. When doing so, we can either increase the measurement level by inserting information, or decrease the measurement level by aggregating.
We can do the manipulations in both base and in tidyverse.
Making indexes
Here, I exemplify by creating an additive index.
In base R: We can do this by hand, by summing the values of the three survey questions for each respondent.
Now the scale goes from 0 to 30. That’s not very intuitive. Let’s instead keep the old scale by calculating the average.
If I’m happy with the outcome, I can store it in a new variable.
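A sketch of the additive index in base R. The text does not show which three survey questions go into the index, so the choice of TagerJobs, TagerSkat and Belastning is an assumption, as are the values and the name Index.

```r
# Toy data; which three items form the index is an assumption
df <- data.frame(
  TagerJobs  = c(2, 5, 10, NA),
  TagerSkat  = c(4, 5, 8, 3),
  Belastning = c(3, 5, 9, 6)
)

# Sum the three items: the scale now runs from 0 to 30
index_sum <- df$TagerJobs + df$TagerSkat + df$Belastning

# Divide by the number of items to get back to the 0 to 10 scale
index_mean <- index_sum / 3

# Store the result in a new variable (hypothetical name)
df$Index <- index_mean
```

A respondent with a missing answer on any of the items gets a missing value on the index.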
In dplyr: I can do the exact same manipulation in tidyverse. The function mutate() creates new variables.
If I’m happy with the outcome, I overwrite the entire dataframe by storing it in the old object.
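The same index with mutate() might look like this. As before, the three items, the values and the name Index are assumptions made for illustration.

```r
library(dplyr)

# Toy data; which three items form the index is an assumption
df <- data.frame(
  TagerJobs  = c(2, 5, 10, NA),
  TagerSkat  = c(4, 5, 8, 3),
  Belastning = c(3, 5, 9, 6)
)

# Create the new variable and overwrite the old data frame
df <- df %>%
  mutate(Index = (TagerJobs + TagerSkat + Belastning) / 3)
```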
I always check the outcome of such operations. One good way is to check the distribution of the variable in a histogram.
When I recode from old variables, I also check whether the new and old
variables correlate as they should.
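One way to run both checks is sketched below on made-up data. The item names and the name Index are assumptions, and the histogram is computed without drawing so the example also runs in a non-interactive session.

```r
# Toy data with a hypothetical index built from three items
df <- data.frame(
  TagerJobs  = c(2, 5, 10, 7, 1),
  TagerSkat  = c(4, 5, 8, 6, 2),
  Belastning = c(3, 5, 9, 8, 0)
)
df$Index <- (df$TagerJobs + df$TagerSkat + df$Belastning) / 3

# Distribution: hist(df$Index) draws the plot; plot = FALSE only computes the bins
h <- hist(df$Index, plot = FALSE)

# Correlation between the new variable and one of the old ones
r <- cor(df$Index, df$TagerJobs, use = "complete.obs")
```

A strong positive correlation is what we would expect here, since the old variable is one of the components of the new one.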
Conditional recoding
Often, our recoding involves going down a notch on the measurement level. This is when we aggregate information. To do so, we need to tell R what our rules are for what to aggregate.
Remember the guessing game?
## [1] FALSE
We can use this for recoding. Here, I recode income. I start by making a rule: People who earn less than the median income are “poor”.
Here are the poor people
I begin by making a new variable.
Now, I can check the frequency table.
## 0 1
## 853 474
… and plot it
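A sketch of the base R steps on a made-up income variable. Turning the TRUE/FALSE answers into 1/0 with as.numeric() is one possible way to build the variable, not necessarily the one used for the table above.

```r
# Toy income variable with one missing answer
df <- data.frame(Indtaegt = c(1, 3, 5, 7, 9, NA))

# The guessing game: R answers TRUE or FALSE
guess <- 5 > 7

# The rule: income below the median counts as poor
cutoff  <- median(df$Indtaegt, na.rm = TRUE)
is_poor <- df$Indtaegt < cutoff

# New variable: 1 = poor, 0 = not poor, NA stays NA
df$Poor <- as.numeric(is_poor)

# Frequency table (missing values are left out by default)
poor_table <- table(df$Poor)
# barplot(poor_table)   # plot the table (draws a figure)
```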
In tidyverse, we can use the case_when() function to express the conditioning.
On my Danish keyboard, the tilde character (~) is to the right of my Enter key. I get it by typing “AltGr + ¨”.
If this looks ok, then we can store the outcome.
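A possible case_when() version of the same rule, on made-up data. Each line inside case_when() reads as: condition, tilde, value to assign.

```r
library(dplyr)

# Toy income variable with one missing answer
df <- data.frame(Indtaegt = c(1, 3, 5, 7, 9, NA))

# Store the outcome by overwriting the data frame
df <- df %>%
  mutate(Poor = case_when(
    Indtaegt <  median(Indtaegt, na.rm = TRUE) ~ 1,
    Indtaegt >= median(Indtaegt, na.rm = TRUE) ~ 0
  ))
```

Respondents that match none of the conditions, here the one with missing income, end up as NA.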
A nice tool for conditional recoding is the if… else… algorithm. It is based on a condition and an instruction: if a statement is true, then do this; if the statement is not true, then do that.
In base R this is done by the ifelse() function.
In tidyverse this is handled by the if_else() function.
This is a true time-saver. Also, you can nest one statement in the other such that you end up with several conditions and outputs.
Look! I’ve got three categories.
## .
## Medium Poor Rich
## 669 241 417
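The two functions, including a nested version with three categories, could be sketched like this. The cut-off values 4, 5 and 6 and the variable names Poor2 and Income3 are invented for the toy data.

```r
library(dplyr)

# Toy income variable with one missing answer
df <- data.frame(Indtaegt = c(1, 3, 5, 7, 9, NA))

# Base R: condition, value if TRUE, value if FALSE
df$Poor2 <- ifelse(df$Indtaegt < 5, "Poor", "Not poor")

# dplyr: one if_else() nested in another gives three categories
df <- df %>%
  mutate(Income3 = if_else(Indtaegt < 4, "Poor",
                           if_else(Indtaegt > 6, "Rich", "Medium")))

# Frequency table of the three categories
income_table <- table(df$Income3)
```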
Group averages
We can make manipulations on parts of the data. Tidyverse is unsurpassed on this one. My favorite function is group_by(). It groups/indexes the data according to a grouping variable (“Poor”) so that we can do operations within each group. Here I calculate the mean level of scepticism towards immigrants among poor and not poor respondents.
I now have a new variable in the data set called “scepticism_gr”.
I always check if everything went as planned. That’s where simple descriptive statistics are useful. Here, I make a frequency table.
## .
## 4.80219349784567 5.24373795761079 5.29598308668076
## 853 175 474
Hm… I recoded “Poor” to only two values, but here I’ve got three. Let’s look closer at the data. There are many variables in the data frame, so I select only a few.
## Adding missing grouping variables: `Poor`
Right… Sometimes, there is a missing observation in my “Poor” variable. R has thus created a third group of “NA” for me. Good to know…
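The grouping step, and the extra NA group it produces, can be reproduced on made-up data. The ungroup() call at the end is my addition; it removes the grouping again once the group means are stored.

```r
library(dplyr)

# Toy data: the last respondent has a missing value on Poor
df <- data.frame(
  Poor    = c(1, 1, 0, 0, NA),
  Skepsis = c(6, 4, 3, 5, 7)
)

# Group by Poor, then store each group's mean next to every respondent
df <- df %>%
  group_by(Poor) %>%
  mutate(scepticism_gr = mean(Skepsis, na.rm = TRUE)) %>%
  ungroup()

# Frequency table: three values, because NA forms its own group
gr_table <- table(df$scepticism_gr)
```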
Remove variables or data points
I can remove or retain data points using the indexing. Here, I retain only observations of respondents that are categorized as poor and store the result in a new data frame.
I can also remove variables using the same indexing. Here, I remove the third variable.
It is often more precise to do this by calling variables by their name, as the third variable might change (by definition, once I’ve removed it, the third variable is “Kvinde”). However, the minus sign does not work with variable names, so you’d need a different approach.
You might either use the all-powerful %in%, which identifies all elements in the first vector that are also present in the second.
Alternatively, you might use the function subset() in base-R. Note how I remove a variable by putting a minus sign in front of its name.
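The base R alternatives described above can be sketched as follows, on a small made-up data frame:

```r
# Toy data frame with invented values
df <- data.frame(
  Parti      = c("A", "B", "C", "A"),
  Indtaegt   = c(3, 7, 5, 2),
  Uddannelse = c(12, 15, 9, 17),
  Poor       = c(1, 0, 0, 1)
)

# Retain only the rows where the respondent is poor
poor <- df[which(df$Poor == 1), ]

# Remove the third variable by position
no_third <- df[, -3]

# Remove a variable by name with %in%
by_name <- df[, !(names(df) %in% "Uddannelse")]

# subset(): rows by condition, columns by name (minus sign = drop)
by_subset <- subset(df, Poor == 1, select = -Uddannelse)
```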
In tidyverse we have, as you are now used to, a function to do the same operations.
I can remove observations using the filter() function.
I can also select or remove variables using the select() function. I can select variables by specifying their name, or deselect them by adding an exclamation mark first.
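In dplyr, the corresponding steps might look like this (same made-up data frame as an illustration):

```r
library(dplyr)

# Toy data frame with invented values
df <- data.frame(
  Parti      = c("A", "B", "C", "A"),
  Indtaegt   = c(3, 7, 5, 2),
  Uddannelse = c(12, 15, 9, 17),
  Poor       = c(1, 0, 0, 1)
)

# Keep only the poor respondents
poor <- df %>% filter(Poor == 1)

# Select variables by name
kept <- df %>% select(Parti, Uddannelse)

# Deselect a variable with an exclamation mark
dropped <- df %>% select(!Uddannelse)
```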
Your turn!
Can you do the same voodoo?
What is the average number of years of education among respondents in the data?
What is the average level of education within each party?
Use the summarise() function instead of the mutate() function to calculate the group averages in R. What happens? Store the results in a new R object.
Which party has the lowest educated voters?
Use indexing to sort the data frame according to level of education. You can use order() from base R or arrange() in dplyr.
Recap of key functions
Function | What it does | Package |
dim() | Dimension. Reports the number of observations and variables in the data. | base R |
names() | Variable names. Reports the variable names in a data frame. | base R |
mean() | Calculates the mean of a numerical variable. | base R |
is.na() | TRUE/FALSE. Reports whether there is information on the observation. | base R |
mutate() | Returns a “tibble”/data frame. Alters the data frame by creating a new/overwriting the old variable. | tidyverse |
group_by() | Groups the data according to a specified moderating variable. | tidyverse |
df[, 1] | Selects first variable/column. Can also be indexed by variable names. | base R |
select() | Selects a variable/column. An alternative to base indexing. | tidyverse |
df[1,] | Selects first observation/row. Used in combination with the “guessing game”/conditional statements. | base R |
filter() | Selects observations/rows. Used in combination with the “guessing game”/conditions as an alternative to base indexing. | tidyverse |
ifelse() | Conditional recoding. Recodes the entire variable at once. | base R |
if_else() | Conditional recoding. For use in the mutate() function. | tidyverse |