Exploration of our data
We start by loading the packages we will use.
library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RiPraksis)
data(kap6)
df <- kap6
Univariate distributions
Continuous variables
Summarizing continuous variables usually means looking at their average, but also at their minimum and maximum values and their spread.
summary(df$Skepsis)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.000 5.000 5.009 6.000 10.000 5
df$Skepsis %>% hist
We can show densities (relative frequencies) instead of counts by using a simple additional argument.
df$Skepsis %>% hist(., probability = TRUE)
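To see what probability = TRUE changes, here is a minimal sketch on simulated scores (the vector x is made up for illustration and is not the kap6 data). It compares the counts and the densities that hist() computes; with densities, the areas of the bars sum to 1.

```r
# Simulated 0-10 scores standing in for df$Skepsis (illustrative only)
set.seed(1)
x <- sample(0:10, size = 200, replace = TRUE)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, plot = FALSE)

# Absolute frequencies: the bar heights sum to the number of observations
sum(h$counts)

# Densities: bar height times bar width sums to 1
sum(h$density * diff(h$breaks))
```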
Categorical variables
We can explore categorical variables primarily by their frequency distribution.
Here, we are looking at how our respondents voted in the last election.
df$Parti %>%
  table
## .
## Andet Dansk Folkeparti
## 17 143
## Det Konservative Folkeparti Det Radikale Venstre
## 65 134
## Enhedslisten Kristendemokraterne
## 77 8
## Liberal Alliance Socialdemokraterne
## 48 268
## Socialistisk Folkeparti Venstre
## 108 311
In base R, we need to make the frequency table first, then plot it using the appropriate function.
df$Parti %>%
  table %>%
  barplot()
Usually, it is much more intuitive to check the relative distribution than the absolute one. The function prop.table() does exactly that for us.
df$Parti %>%
  #Frequency table
  table %>%
  #Relative distribution
  prop.table %>%
  #Barplot
  barplot()
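The step from absolute to relative frequencies can be checked on a tiny example. The votes vector below is made up for illustration; it only stands in for df$Parti.

```r
# Made-up votes standing in for df$Parti (illustrative only)
votes <- c("A", "B", "A", "C", "A", "B", "A", "C", "B", "A")

# Absolute frequencies
counts <- table(votes)
counts

# Relative frequencies: each count divided by the total
shares <- prop.table(counts)
shares

# Multiply by 100 and round to read the shares as percentages
round(100 * shares, 1)
```

The shares always sum to 1, which makes groups of different sizes easier to compare.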
You can find more about data exploration in this tutorial, for example.
Bivariate statistics
After exploring the variables separately, we usually check out their bivariate relationship. Once again, the choice of statistics depends on the measurement level.
Two continuous variables (+ ggplot2)
When we have two continuous variables, we can calculate their correlation using Pearson's r (\(r\)).
cor(df$Indtaegt, df$Skepsis,
use = "pairwise.complete.obs")
## [1] -0.1526337
The negative correlation is apparent, but what does it mean? Can we put it into plain English?
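If we take the square of this correlation, we can interpret the result as the proportion of variation that the two variables share. Sometimes it is also read as a causal relationship (variation in one variable “explained by” the other), but that reading rests on assumptions the correlation itself cannot test.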
cor(df$Indtaegt, df$Skepsis,
use = "pairwise.complete.obs")^2
## [1] 0.02329706
We may say that 2 % of the variation in attitudes towards immigration is shared with (explained by?) income.
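The “explained by” wording comes from regression: in a bivariate linear model, the R-squared equals the squared Pearson correlation. A minimal sketch on simulated data (the variables income and skepsis below are made up and are not the kap6 columns):

```r
# Simulated data (illustrative only; not the kap6 variables)
set.seed(42)
income <- rnorm(500)
skepsis <- 5 - 0.3 * income + rnorm(500, sd = 2)

# Squared Pearson correlation
r2_cor <- cor(income, skepsis)^2

# R-squared from a bivariate linear regression
r2_lm <- summary(lm(skepsis ~ income))$r.squared

# The two numbers are identical (up to floating-point error)
all.equal(r2_cor, r2_lm)
```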
We can test whether the correlation between the two is statistically significant.
cor.test(df$Indtaegt, df$Skepsis)
##
## Pearson's product-moment correlation
##
## data: df$Indtaegt and df$Skepsis
## t = -5.6155, df = 1322, p-value = 2.387e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.20482310 -0.09957898
## sample estimates:
## cor
## -0.1526337
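If you need the numbers from this printout later, for a table or a plot label, you can pull them out of the object that cor.test() returns. A small sketch on simulated data (x and y are made up for illustration):

```r
# Simulated data (illustrative only)
set.seed(7)
x <- rnorm(300)
y <- -0.2 * x + rnorm(300)

# cor.test() returns a list-like object of class "htest"
ct <- cor.test(x, y)

ct$estimate   # the correlation coefficient
ct$p.value    # the p-value
ct$conf.int   # the 95 percent confidence interval

# The point estimate matches what cor() reports
all.equal(unname(ct$estimate), cor(x, y))
```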
Considering numbers in this way is really abstract though. The most intuitive way of exploring a correlation is through visuals. Let’s take a deep-dive into ggplot2.
Scatterplot
The main function ggplot() establishes the plot, but no plot elements. If you are going to use the same data for all the graphical elements, you can specify the data object in this function. When you do that, you do not specify your dataset multiple times in the pipe, so you save time. From then on, you can refer directly to the variables in it.
You bind all elements together with a pipe using “+”. It’s easy to forget this and use the %>% pipe from tidyverse instead. You’re not the only one who will make that mistake, so the error message you get is surprisingly intuitive: Did you use %>% instead of +?
ggplot(data = df)
The main elements of the graphics are included using a series of geom_...() functions. They specify the “geometry,” which is what information should be included (information from variables) and where it should be placed (coordinates).
Inside each of these functions, you will find the aesthetic function, aes(), where you specify the x/y coordinates that come from your variables.
Here, I am creating a scatterplot for the relationship between immigration skepticism and income. This means that I want to draw points (geom_point()).
ggplot(df) +
geom_point(aes(y = Skepsis,
x = Indtaegt))
## Warning: Removed 178 rows containing missing values (geom_point).
The income variable consists only of integers. Therefore, the points overlap each other, and we get a poor idea of how many units exist within the different value combinations. We can solve this in two ways: by making the points partially transparent, or by moving them slightly apart.
Transparent color. I can specify that I want transparent colors. I do this with the argument alpha=. The value I specify ranges from 0 (completely transparent/invisible) to 1 (not transparent).
ggplot(df) +
geom_point(aes(y = Skepsis,
x = Indtaegt),
#all observations are transparent
alpha = 0.2)
## Warning: Removed 178 rows containing missing values (geom_point).
Note that when we want to adjust the points without the adjustment being dependent on information from the data, this is done outside of aes() but within the geom_…() function.
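The difference between setting a fixed value and mapping a variable can be sketched side by side. The data frame d and its columns are simulated for illustration and are not part of kap6.

```r
library(ggplot2)

# Simulated data (illustrative only; column names are made up)
set.seed(3)
d <- data.frame(x = sample(1:10, 300, replace = TRUE),
                y = sample(0:10, 300, replace = TRUE),
                g = sample(c("a", "b"), 300, replace = TRUE))

# Setting: alpha is a fixed value, so it goes outside aes()
p_set <- ggplot(d) +
  geom_point(aes(x = x, y = y), alpha = 0.2)

# Mapping: color depends on a variable, so it goes inside aes()
p_map <- ggplot(d) +
  geom_point(aes(x = x, y = y, color = g))
```

Only the mapped version gets a legend, because only there does the color carry information from the data.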
Jitter data. Another alternative is to “jitter” the data points a little. This means that R adds a random variation to the coordinates of the points. The actual data points become more imprecise, but the point here is not to perform a precise analysis, but to present relationships in the data so that the human eye gets an intuitive understanding of what is going on.
ggplot(df) +
geom_jitter(aes(y = Skepsis,
x = Indtaegt),
#Horizontal, but not vertical shake
height = 0, width = 0.3)
## Warning: Removed 178 rows containing missing values (geom_point).
The additional arguments width= and height= specify how much the points should be jittered horizontally and vertically, respectively. Here, I say that I want precise data along the y-axis (“in height”), but I let the points randomly vary by up to 0.3 units in either direction along the x-axis (“in width”).
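What the jitter does to the numbers can be sketched in base R. This is my reading of the ggplot2 documentation, not the package's own code: uniform noise between -width and +width is added to each coordinate. The vector x is simulated for illustration.

```r
# Integer x-values, as in the income variable (simulated, illustrative only)
set.seed(10)
x <- sample(1:10, 100, replace = TRUE)

# Sketch of width = 0.3: add uniform noise between -0.3 and 0.3
# (an assumption about how the jitter is drawn, not ggplot2 internals)
x_jittered <- x + runif(length(x), min = -0.3, max = 0.3)

# No point has moved more than 0.3 units from its true position
max(abs(x_jittered - x))

# Because the noise is below 0.5, rounding recovers the original integers
all(round(x_jittered) == x)
```

Keeping the width below 0.5 means neighboring integer values never blend into each other.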
Trivariate graphics: Group the data.
I can easily add more groupings. Here I am coloring the points based on the value of a third variable. The variable can be continuous or categorical.
Continuous grouping. “Subjective income” (Prekaritet) reports how rich the respondent feels in four categories from 0 (“low satisfaction”) to 3 (“high satisfaction”).
ggplot(df) +
geom_point(aes(y = Skepsis,
x = Indtaegt,
#Grouping
color = Prekaritet))
## Warning: Removed 178 rows containing missing values (geom_point).
Now I get a “heat scale” for the respondent’s perception of their own income. ggplot also automatically provides a legend.
Here we see that most of the respondents who report low satisfaction (dark color) are located to the left of the plot. They have low (objective) income.
Categorical grouping. If I want a categorical grouping, I need to feed the function a categorical variable (“character” or “factor”). It’s best to do this in the plot data, not in the ggplot() function. Right here, I deviate from this rule by using as.factor().
ggplot(df) +
geom_point(aes(y = Skepsis,
x = Indtaegt,
#Grouping
color = as.factor(Prekaritet)))
## Warning: Removed 178 rows containing missing values (geom_point).
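Following the rule instead of deviating from it could look like the sketch below: recode the variable once in the data, then plot. The data frame d, the column group_num and the four labels are made up for illustration.

```r
library(ggplot2)

# Simulated data (illustrative only; 'group_num' is a made-up variable)
set.seed(5)
d <- data.frame(x = rnorm(100),
                y = rnorm(100),
                group_num = sample(0:3, 100, replace = TRUE))

# Recode once, in the data, with readable labels (hypothetical labels)
d$group <- factor(d$group_num,
                  levels = 0:3,
                  labels = c("Very low", "Low", "High", "Very high"))

# The plot code stays clean, and the legend shows the labels
ggplot(d) +
  geom_point(aes(x = x, y = y, color = group))
```

A side effect is that the legend title becomes "group" rather than "as.factor(group_num)".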
One solution does not prevent another. Of course, I can combine all these solutions until I get a plot that tells the story I want to convey.
ggplot(df) +
geom_jitter(aes(y = Skepsis,
x = Indtaegt,
#Continuous moderating third variable
color = Prekaritet),
#Jitter the data horizontally and vertically
height = 0.1, width = 0.4,
#Transparency
alpha = 0.7)
## Warning: Removed 178 rows containing missing values (geom_point).
Bivariate regression lines
I can also show the relationship between two variables using regression lines. ggplot does not report regression coefficients but instead illustrates the regression line and the uncertainty around it. This is particularly relevant if at least one of the variables (the x-variable) is continuous.
Now we can choose between local regression and ordinary linear regression. Which one we choose depends on what we want to achieve.
Local regression
Local regression is – in my eyes – the queen of all ggplot functions. I use it all the time when exploring my data. It draws a “smooth” line (locally, the sliding average over the different x-values). Perfectly suited for time trends or exploring non-linear relationships, for example.
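The function that gives us bivariate regression lines is geom_smooth(). The default setting is local regression (loess) when there are fewer than 1,000 observations. With more observations, as here, ggplot2 switches to a GAM smoother, which is what the message below reports.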
ggplot(df) +
geom_smooth(aes(y = Skepsis,
x = Indtaegt))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (stat_smooth).
Here I illustrate the relationship between age and skepticism about immigration. The figure demonstrates the greatest advantage of local regression: There is no linear relationship between age and skepticism about immigration. Instead, it appears that younger and older respondents are more skeptical.
R adjusts the “window” automatically for us, but we can choose the size ourselves with the additional argument span =. How much of the range of the variable should be used for the local average? Low values give a more jagged line, high values give more generalization. Note that span only affects the loess smoother.
ggplot(df) +
geom_smooth(aes(y = Skepsis,
x = Indtaegt),
span = 20)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (stat_smooth).
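To see the effect of the window size in isolation, here is a sketch that calls base R's loess() directly on simulated data (x and y are made up, and the two span values are arbitrary choices for illustration). In ggplot2, the corresponding call would be geom_smooth(method = "loess", span = ...).

```r
# Simulated curved relationship (illustrative only)
set.seed(11)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.5)

# Two local regressions with different window sizes
fit_narrow <- loess(y ~ x, span = 0.2)  # 20 % of the data in each window
fit_wide   <- loess(y ~ x, span = 0.9)  # 90 % of the data in each window

# Plot the data and both smoothed lines
ord <- order(x)
plot(x, y, col = "grey")
lines(x[ord], fitted(fit_narrow)[ord], col = "red")   # narrow window
lines(x[ord], fitted(fit_wide)[ord], col = "blue")    # wide window
```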
Grouped: Is the relationship the same within different categories?
ggplot(df) +
geom_smooth(aes(y = Skepsis,
x = Indtaegt,
color = as.factor(Prekaritet)))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (stat_smooth).
Would you like to replace the “spaghetti” with straight lines? Ask for a (bivariate) linear model.
ggplot(df) +
geom_smooth(aes(y = Skepsis,
x = Indtaegt,
color = as.factor(Prekaritet)),
method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 178 rows containing non-finite values (stat_smooth).
Here we clearly see that the association between immigration skepticism and income is primarily present for respondents who feel that they lack money (categories 0 and 1).
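The straight lines in such a plot are ordinary bivariate regressions fitted separately within each group. If you want the slopes as numbers, you can fit them yourself. The sketch below uses simulated data in which the slope is built to differ by group; none of the names or values come from kap6.

```r
# Simulated data with a slope that differs by group (illustrative only)
set.seed(21)
d <- data.frame(x = rnorm(400),
                g = rep(c("0", "1", "2", "3"), each = 100),
                stringsAsFactors = FALSE)
true_slope <- c("0" = -0.8, "1" = -0.4, "2" = 0, "3" = 0)
d$y <- 5 + unname(true_slope[d$g]) * d$x + rnorm(400)

# Fit one bivariate linear model per group and keep the slope
slopes <- sapply(split(d, d$g), function(part) {
  coef(lm(y ~ x, data = part))[["x"]]
})
round(slopes, 2)
```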
One continuous and one categorical variable (+ more tidyverse)
When we have a continuous and a categorical variable we’d like to explore, we will usually calculate the group mean; i.e. we calculate the average of the continuous variable for each of the categories in the other.
This is a good excuse to have a second look at tidyverse. dplyr has a very handy function, group_by(), that allows us to do operations on subsets of the data depending on a moderating variable.
Let’s calculate the average scepticism among voters of different parties and create a summary statistic. To do so, we use summarize().
df %>%
  group_by(Parti) %>%
  summarize("Scepticism" = mean(Skepsis, na.rm = T))
If we like this table, we can store it in an object. Before I do so, however, I’m removing the NAs.
tab <-
  df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  summarize("Scepticism" = mean(Skepsis, na.rm = T))
The observations are ordered alphabetically, which is not intuitive. Let’s rearrange.
tab <-
  tab %>%
  arrange(Scepticism)
tab
No surprises here; Dansk Folkeparti has voters that are more sceptical than other parties. This being said, in general, there is not a massive spread here.
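The same group means can be cross-checked in base R with tapply(). The small data frame below is made up for illustration and only stands in for df$Parti and df$Skepsis.

```r
# Made-up data standing in for df$Parti and df$Skepsis (illustrative only)
d <- data.frame(party = c("A", "A", "B", "B", "C", "C"),
                score = c(4, 6, 7, NA, 2, 4))

# Mean of 'score' within each party, ignoring missing values
means <- tapply(d$score, d$party, mean, na.rm = TRUE)

# Sort from lowest to highest, as arrange() does
sort(means)
```

Without na.rm = TRUE, party B would get NA as its mean because one of its scores is missing.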
Let’s instead plot this.
base
barplot(tab$Scepticism,
names.arg = tab$Parti,
las = 2)
ggplot2
Instead of plotting the entire data set, I now rely on my table in ggplot2 as well. That is, I’ve already calculated the height of the bars, so there is no need for ggplot to do that. I signal my intent by using stat = "identity". Now, the y-values and the x-values are defined by me.
ggplot(tab) +
geom_histogram(aes(y = Scepticism,
x = Parti),
stat = "identity")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
Two categorical variables
When we have two categorical variables, we make a cross-table.
table(df$Parti, df$Kvinde)
##
## 0 1
## Andet 11 6
## Dansk Folkeparti 85 58
## Det Konservative Folkeparti 33 32
## Det Radikale Venstre 69 65
## Enhedslisten 32 45
## Kristendemokraterne 4 4
## Liberal Alliance 38 10
## Socialdemokraterne 116 152
## Socialistisk Folkeparti 43 65
## Venstre 175 136
Absolute frequencies are hard to fathom. I like them relative. Also, this time around, we want to know the proportion of voters that are men/women per party. It means that I calculate the distribution by rows (margin = 1).
table(df$Parti, df$Kvinde) %>%
prop.table(., margin = 1)
##
## 0 1
## Andet 0.6470588 0.3529412
## Dansk Folkeparti 0.5944056 0.4055944
## Det Konservative Folkeparti 0.5076923 0.4923077
## Det Radikale Venstre 0.5149254 0.4850746
## Enhedslisten 0.4155844 0.5844156
## Kristendemokraterne 0.5000000 0.5000000
## Liberal Alliance 0.7916667 0.2083333
## Socialdemokraterne 0.4328358 0.5671642
## Socialistisk Folkeparti 0.3981481 0.6018519
## Venstre 0.5627010 0.4372990
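The choice of margin decides which question the table answers. A sketch on a made-up two-by-two table (the counts and labels are invented for illustration):

```r
# Made-up cross-table (illustrative only): parties in rows, gender in columns
tab2 <- matrix(c(30, 10,
                 20, 40),
               nrow = 2, byrow = TRUE,
               dimnames = list(party = c("A", "B"),
                               woman = c("0", "1")))

# margin = 1: each row sums to 1 (gender split within each party)
by_row <- prop.table(tab2, margin = 1)
rowSums(by_row)

# margin = 2: each column sums to 1 (party split within each gender)
by_col <- prop.table(tab2, margin = 2)
colSums(by_col)
```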
We don’t usually test the statistical significance when we explore the data. However, we can. The chi-square test allows us to check whether the pattern in the table could plausibly have arisen by chance. The p-value reports the probability of seeing a relationship at least this strong if the two variables were in fact independent. Here, that’s very unlikely.
table(df$Parti, df$Kvinde) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 38.546, df = 9, p-value = 1.391e-05
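The warning appears when some cells are expected to contain very few observations, which is typical when a category is small. The sketch below uses a made-up table to show how to inspect the expected counts, and one common workaround, a simulated p-value. The counts, the seed and the number of replicates are arbitrary choices for illustration.

```r
# Made-up cross-table with one very small category (illustrative only)
tab3 <- matrix(c(85, 58,
                 116, 152,
                 4, 4),
               nrow = 3, byrow = TRUE)

# suppressWarnings() hides the approximation warning for this demonstration
res <- suppressWarnings(chisq.test(tab3))

# Expected counts under independence; small values trigger the warning
res$expected

# One workaround: a Monte Carlo p-value instead of the approximation
set.seed(99)
chisq.test(tab3, simulate.p.value = TRUE, B = 2000)$p.value
```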
#Store the cross-table under its own name so it does not mask the table() function
crosstab <- table(df$Kvinde, df$Parti)
barplot(crosstab,
        beside = T)
ggplot2
ggplot(df) +
geom_bar(aes(x = Parti,
fill = as.factor(Kvinde)),
position = "dodge")
Workflow and tweaks in ggplot2
Let’s follow an alternative workflow and create our own small dataset for plotting. Here, I am interested in immigration skepticism among respondents with different work situations.
dfp <- df %>%
  #Filter out NA
  filter(!is.na(Prekaritet)) %>%
  #Group
  group_by(Prekaritet) %>%
  #Group average
  summarize(Innvandringsskepsis = mean(Skepsis, na.rm = T))
#Make a new variable with the survey question
dfp$`How is your work situation?` <- dfp$Prekaritet %>%
  as.factor
#Redefine the categories as intuitive answers
levels(dfp$`How is your work situation?`) <- c("Insecure",
                                               "Secure")
Now you can ask R to plot.
ggplot(dfp) +
  #Classical definition of a coordinate system
  geom_bar(aes(y = Innvandringsskepsis,
               x = `How is your work situation?`),
           #Use the numbers you provide as bar heights instead of counting rows
           stat = "identity")
dfp <- df %>%
  #Filter out NA in both grouping variables
  filter(!is.na(Prekaritet) & !is.na(Kvinde)) %>%
  #Group
  group_by(Prekaritet, Kvinde) %>%
  #Group average
  summarize(Innvandringsskepsis = mean(Skepsis, na.rm = T))
## `summarise()` has grouped output by 'Prekaritet'. You can override using the
## `.groups` argument.
#Make a new variable with the survey question
dfp$`How is your work situation?` <- dfp$Prekaritet %>%
  as.factor
#Redefine the categories as intuitive answers
levels(dfp$`How is your work situation?`) <- c("Precarious",
                                               "Safe")
dfp$Gender = ifelse(dfp$Kvinde == 1,
                    "Woman", "Man")
dfp
p <-
  ggplot(dfp) +
  #Classical definition of both x and y coordinates
  geom_bar(aes(y = Innvandringsskepsis,
               x = `How is your work situation?`,
               #Fill in with colors following the grouping
               fill = Gender),
           #place bars side-by-side
           position = "dodge",
           #Use the numbers you provide as bar heights without counting rows
           stat = "identity")
Note how I have saved my plot in an object: p <- ggplot(). To see the plot, I request it.
p
It is useful to save the object this way because I will modify my R object further. I can do that using the “+” pipe.
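The store-then-extend pattern can be sketched in a few lines. The table d and the title are made up for illustration; p0 is used as the object name so that it does not collide with the p built in this tutorial.

```r
library(ggplot2)

# Made-up summary table (illustrative only)
d <- data.frame(group = c("a", "b"), value = c(3, 5))

# Store the plot; nothing is drawn yet
p0 <- ggplot(d) +
  geom_bar(aes(x = group, y = value), stat = "identity")

# Add to the stored object and overwrite it
p0 <- p0 + ggtitle("A title added later")

# Printing the object draws the plot
p0
```

Each later tweak can then be added in its own small step, which makes it easy to see which line changed what.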
Aesthetic tweaks
Once you have added the graphical elements with your data, you can spend an endless amount of time fine-tuning the rest of the plot.
Tell R what language you speak (and where you are)
R was not created by Scandinavians, so Danish characters can sometimes cause problems. We have to actively consider encoding choices in two settings: when we import data and when we save data, including in graphics.
Computers read and provide information as a series of 0s and 1s. To read and write text, we use encoding, a translation of binary codes to the alphabet we know. Traditionally, each country and language had its own encoding system. This was cumbersome, but today, most text exchanged on the internet follows “utf-8” encoding. It includes most of the characters we know how to read. Sys.setlocale() tells R where you are in the world. Here, I am telling it that I want Danish language with utf-8 encoding.
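#Danish is "da" (language) and "DK" (country); available locale names depend on your operating system
Sys.setlocale(category = "LC_ALL",
              locale = "da_DK.UTF-8")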
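Why encoding matters for Danish characters can be seen by counting bytes. The sketch below writes the word with a Unicode escape so that it does not depend on how the script file itself is saved; the object names are made up for illustration.

```r
# "Indtægt" written with a Unicode escape for the letter æ
x <- "Indt\u00e6gt"

nchar(x, type = "chars")  # 7 characters
nchar(x, type = "bytes")  # 8 bytes: the letter æ takes two bytes in UTF-8

# Convert to Latin-1, where the same letter is a single byte
x_latin1 <- iconv(x, from = "UTF-8", to = "latin1")
nchar(x_latin1, type = "bytes")  # 7 bytes

# The locale R is currently using for character handling
Sys.getlocale("LC_CTYPE")
```

The same letter is stored differently in the two encodings, so reading a file with the wrong one garbles exactly these characters.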
Titles and axis names
p <-
  p +
  #Add a title and a subtitle
  ggtitle(label = "Relationship between immigration skepticism and work situation",
          subtitle = "Data from ESS (2014)") +
  #Name of x-axis (the x-variable is the work situation)
  xlab("Work situation") +
  #What are the limits on my axis?
  ylim(c(0,6))
p
Define colors
You can choose your own colors, and there are many ways to do so. If you have linked your data with color choices, it may be useful to define a “palette.”
The text in the legend (“Man” and “Woman”) is part of the data I provided, so it is located in the variable dfp$Gender. I can change it in ggplot if I want to. I do this when defining the colors.
p <-
  p +
  #Define colors
  scale_color_manual(
    #What colors?
    values = c("purple", "magenta"),
    #Color the entire bar
    aesthetics = "fill",
    #What's the title of the legend? (the fill variable is Gender)
    name = "Gender",
    #What are the category names?
    labels = c("Male",
               "Female")
  )
p
Background
You can define different templates for the aesthetic aspects of the plot with theme_...().
p <-
  p +
  # Define a theme
  theme_minimal()
p
You can alter the themes at will. You do that after you have defined the theme, using the generic theme() function.
p <-
  p +
  # Modify the minimal theme after defining it
  theme(
    # Move the legend down to below the plot
    legend.position = "bottom",
    # Bold font for the title in the legend
    legend.title = element_text(face = "bold"),
    # Italic for axis values
    axis.text = element_text(face = "italic"),
    # Remove the gridlines
    panel.grid = element_blank())
p
The element_...() functions are used inside the theme() function. They specify everything except the data content/information for an element. Should it be a blank element, element_blank()? A specific font, element_text()? And so on.
For example, I can specify that the text used for the axis names should be displayed in italic: element_text(face = "italic").