Descriptive statistics
Exploring our data
All data analysis starts by exploring the data, regardless of whether it is hand-me-down data from someone else or your own brand-new stuff. My exploration usually involves a lot of figures. In doing so, I often use ggplot for visuals, in addition to the tools for numeric summaries available in tidyverse. The type of descriptive statistics that is useful for you will depend on the measurement level of your variables.
Introduction
I have two objectives with this notebook:
Overview of descriptive statistics For each measurement level of the variable(s), I suggest a numeric and a graphical way of exploring them.
Graphical display using ggplot2 In parallel, I introduce the ggplot2 dialect for plotting.
Why descriptive statistics?
Many students (and researchers!) ignore the value of descriptive statistics. They seem so simple! And why bother to do them when you are going to do more fancy analyses? The purpose of descriptive statistics is not analysis, however. They are there to describe your variables. One practical implication is that descriptive statistics rarely involve tests of statistical significance. That’s the role of the analysis. Descriptive statistics help you identify possibilities, flag potential problems, probe for errors, and interpret your results.
Discover possibilities
When I get my hands on new data, I always spend time on descriptives. Often, this exploration is fairly unstructured. I explore each variable separately in univariate statistics. I then start to relate one to the other in bivariate and trivariate statistics. Some researchers claim that if the relationship they study is not visible in the bivariate, it is not a project worth pursuing! That’s a pretty extreme approach.
The point is to get a sense of what information is contained in the data: Variation is information. Variables with a lot of variation therefore contain more information.
My choice of model furthermore depends on how the dependent variable is measured and the structure of the data. Certainly, I can read the codebook (or design it), but I may also perceive new modeling opportunities by knowing what information is available to me. To begin with, I therefore explore the phenomenon I’m interested in (\(y\)).
For example, you may go into a project where you test some theoretical argument about when a government chooses to bomb their enemy. A possible operationalization of your dependent variable is the number of air strikes in a particular area. However, you might also realize that you have a time variable (date, time of day) of each strike. Now, you can also consider an event history model that analyzes each strike separately; it’s a more fine-grained analysis.
Flag potential problems
Data come with all kinds of problems and challenges that you have to handle. It is not always easy to know which problem you might have in the future, but by knowing the data, you will know the quirks of each variable and more easily spot problems (and solutions) when you run into them.
For example, there can also be too much variation. I could, for example, discover that my continuous variable on respondent income has a few very rich people in it. These are outliers. They may or may not skew your analysis later on. Thanks to the descriptive statistics, you will have the potential problem on your radar.
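One quick way to put such outliers on the radar is to compare the mean with the median and inspect the extreme quantiles. A minimal base-R sketch on simulated income data (the numbers are invented for illustration, not taken from the kap6 data):

```r
#Simulated incomes: most respondents around 300, plus a few very rich ones
set.seed(1)
income <- c(rnorm(500, mean = 300, sd = 50), 5000, 8000)

#The mean is dragged upwards by the outliers; the median is robust
mean(income)
median(income)

#The top quantiles reveal how extreme the tail is
quantile(income, probs = c(0.5, 0.95, 0.99, 1))

#A boxplot flags the outliers visually
boxplot(income)
```

If the mean and the median are far apart, you know the variable has a skewed tail worth keeping in mind.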
Error probe
Analyzing data usually involves a lot of data wrangling: You can create new variables from different information sources or change the measurement level of a variable. You can move up a notch by making indexes to get a more fine-grained variable or you can move down a measurement level because there is not enough/too much variation in the original data. After changing my data, I always check if my code resulted in what I wanted. The best way to do this is through some visualization of the univariate distribution of the new data.
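One way to do this error probing numerically is to cross-tabulate the old variable against the new one after a recode; every old value should land in exactly one intended new category. A small sketch with made-up values (the 11-point scale here is hypothetical, not a kap6 variable):

```r
#Made-up 11-point scale (e.g. an attitude item)
old <- c(0, 2, 5, 7, 10, 10, 3)

#Collapse it into three ordered categories
new <- cut(old,
           breaks = c(-Inf, 3, 7, Inf),
           labels = c("Low", "Mid", "High"))

#Cross-table of old vs. new values: check that the mapping is as intended
table(old, new)
```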
Interpret and communicate
Knowledge of your descriptive statistics will come back once you’ve run your statistical model. Now you have to interpret it. It is time to figure out what scenarios are realistic and interesting to discuss. By that, I mean that you need to consider which values on the \(x\)-variables are common and what changes in these variables are likely. Now you can use your model to predict what the outcome is. This is at the very heart of your analytical venture. Descriptive statistics therefore helps you understand what you have found.
For example, it may be that your regression analysis reports a small coefficient for the effect of income on scepticism towards immigration. This does not automatically mean that the effect is not substantial. If income is measured in DKK, then a one-unit increase is not a lot. However, if you consider a salary bump of 100,000 DKK, then the shift – even if the regression coefficient is small – might still be substantial.
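The back-of-the-envelope check is simply to multiply the coefficient by a realistic change in \(x\). A sketch with an invented coefficient (the number is for illustration only):

```r
#Invented coefficient: change in scepticism per 1 DKK of extra income
b <- 0.00002

#A one-unit (1 DKK) increase: negligible
b * 1

#A realistic salary bump of 100,000 DKK: no longer negligible
b * 100000
```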
Graphics in R
R had already established itself as good software for visual data presentation, but the ggplot2 package has contributed to making R unparalleled by making the graphical toolkit widely accessible. You can find a comprehensive online book on how to use ggplot2 here. ggplot2 works well in conjunction with data preparation in dplyr. It has – annoyingly enough – its own pipe operator (+). Overall, ggplot2 requires more arguments and functions to produce simple plots, but the possibilities for customization are also very large. My ggplot graphics are therefore produced with relatively many lines of code. I use them mostly when presenting to others, while I resort to faster solutions when I just want to take a quick look.
Getting started
We start by loading the packages we will use. If you haven’t already installed ggplot2, you will have to do that first.
#install.packages("ggplot2")
library(dplyr); library(ggplot2)
We will use the same data as in the previous session: It is available in my package or online on my website.
library(RiPraksis)
data(kap6)
df <- kap6
Main logic
The workflow in ggplot2 is a pipe. This pipe is read chronologically: I first establish a sheet, then I add elements.
specify the data you want to work with
establish a blank sheet
add elements to the sheet one by one
the visual adaptations (“theme”) can be based on a template (there is a grey default template). The template already specifies all the aesthetics for you. However, you can “tweak” what you want after having established the template.
Google and ChatGPT are your best friends, as usual.
Example: Let’s make a scatterplot between income and scepticism.
The main function, ggplot(), establishes the plot, but no plot elements. If you are going to use the same data for all the graphical elements, you can specify the data object in this function. When you do that, you do not have to specify your dataset multiple times in the pipe, so you save time. From then on, you can refer directly to the variables in it.
You bind all elements together with a pipe using “+” when you use ggplot. However, you can mix tidyverse and ggplot dialects. Here, I start with a tidyverse pipe, then continue with ggplot to make an empty sheet.
df %>%
  ggplot()
The main elements of the graphics are included using a series of geom_...() functions. They specify the “geometry”: what information should be included (information from variables) and where it should be placed (coordinates). Inside each of these functions, you will find the aesthetic function, aes(), where you specify the x/y coordinates that come from your variables.
Here, I am creating a scatterplot for the relationship between immigration skepticism and income. This means that I want to draw points (geom_point()).
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Univariate distributions
The first thing I do is usually to eyeball the data and check a) conceptually what measurement level I think the variable has, decide b) how I want to treat the variable (e.g. binary variables could be treated as either categorical or numeric), and c) sometimes check how R has registered the variable (class(df$my_variable)).
From there, I pick my statistics using a pretty standardized set of statistics and graphics.
Numeric variables
Numeric variables include all variables that I can treat arithmetically: continuous variables (numbers with decimals) and counts (integers, bounded at zero), but sometimes we also treat variables with many ordered categories (integers, bounded at zero and some upper limit) as numeric.
Numeric summary: mean, quantiles…
Summarizing continuous variables usually means looking at their average, but also their minimum, maximum values and their spread.
summary(df$Skepsis)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 4.000 5.000 5.009 6.000 10.000 5
I can also check out their standard deviation to get a feel for how much variation I have in them.
sd(df$Skepsis, na.rm = TRUE)
## [1] 1.72577
Visual: histogram
base
df$Skepsis %>% hist
We can make it relative using a simple additional argument.
df$Skepsis %>% hist(., probability = T)
ggplot2 Histograms are easily implemented in ggplot too.
df %>%
  ggplot() +
  geom_histogram(aes(Skepsis))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (`stat_bin()`).
You can tweak them any which way you want, but it requires some more coding skills. Here, I make the histogram relative by manipulating the y-axis statistic with after_stat(density).
df %>%
  ggplot(aes(x = Skepsis)) +
  geom_histogram(aes(y = after_stat(density)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (`stat_bin()`).
Categorical variables
We can explore categorical variables primarily by their frequency distribution. That is, we count the number of observations we have in each category. This makes the basis both for the numeric summary (a frequency table) and the visual inspection (a barplot).
Numeric summary: frequency table
Here, we are looking at what our respondents voted for in the last election.
df$Parti %>%
  table
## .
## Andet Dansk Folkeparti
## 17 143
## Det Konservative Folkeparti Det Radikale Venstre
## 65 134
## Enhedslisten Kristendemokraterne
## 77 8
## Liberal Alliance Socialdemokraterne
## 48 268
## Socialistisk Folkeparti Venstre
## 108 311
Visual: barplot
base
In base, we need to make the frequency table first, then plot it using the appropriate function.
df$Parti %>%
  table %>%
  barplot()
Usually, it is much more intuitive to check the relative distribution than the absolute. The function prop.table() does exactly that for us.
df$Parti %>%
  #Frequency table
  table %>%
  #Relative distribution
  prop.table %>%
  #Barplot
  barplot()
ggplot2
In ggplot2, R does the frequency plot for us.
df %>%
  ggplot() +
  geom_bar(aes(Parti))
Do you notice the difference? The base function removes all “NA” (people who did not answer the question). In ggplot2, we’d have to remove them manually.
We can do that using tidyverse filtering!
df %>%
  #Filter out missing observations before plotting
  filter(!is.na(Parti)) %>%
  ggplot() +
  #Barplot
  geom_bar(aes(x = Parti))
You can find more about data exploration in this tutorial, for example.
Your turn!
- Describe the distribution of education among respondents (Uddannelse). What do you see?
- Replicate my barplot of the respondents’ party choice using ggplot2. Change the direction of the bars by swapping the x- and y-axes in the aes() specification.
- Can you describe the distribution of respondents that have a weak link to the labor market (Prekaritet)?
- Store the plot in an R object.
Bivariate statistics
After exploring the variables separately, we usually check out their bivariate relationship. Once again, the choice of statistics depends on the measurement level.
Two numeric variables
Numeric summary: correlation
When we have two numeric variables, we can calculate their correlation using Pearson’s r (\(r\)).
cor(df$Indtaegt, df$Skepsis,
use = "pairwise.complete.obs")
## [1] -0.1526337
The negative correlation is apparent, but what does it mean? Can we put it into plain English?
If we take the square of this correlation, we get the “coefficient of determination” (\(R^2\)). We can interpret the result as a proportion of shared variation between the two variables. Sometimes it is even given a causal interpretation, but that requires strong assumptions.
cor(df$Indtaegt, df$Skepsis,
use = "pairwise.complete.obs")^2
## [1] 0.02329706
We may say that 2 % of the variation in attitudes towards immigration is shared with (explained by?) income.
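This link between the correlation and \(R^2\) also holds in a bivariate linear regression: the model’s \(R^2\) is exactly the squared Pearson correlation. A quick check on R’s built-in mtcars data (used here instead of kap6 so the example is self-contained):

```r
#Squared correlation between car weight and fuel economy
r2_cor <- cor(mtcars$wt, mtcars$mpg)^2

#R-squared from the corresponding bivariate regression
r2_lm <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

#The two quantities are identical
all.equal(r2_cor, r2_lm)
```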
We can test the statistical significance between the two.
cor.test(df$Indtaegt, df$Skepsis)
##
## Pearson's product-moment correlation
##
## data: df$Indtaegt and df$Skepsis
## t = -5.6155, df = 1322, p-value = 2.387e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.20482310 -0.09957898
## sample estimates:
## cor
## -0.1526337
Considering numbers in this way is really abstract though. The most intuitive way of exploring a correlation is through visuals.
Visual: scatterplot
When both variables are numeric, I can create a scatterplot for the relationship between immigration skepticism and income.
base R
The scatterplot plots the variable values against each other such that each observation has a coordinate that places it as a point on the x- and y-axis. This is more intuitive when your data set is reasonably small.
Here, I use a tidyverse pipe to select the two variables, then use the base R plot() function.
df %>%
  select(Indtaegt, Skepsis) %>%
  plot
ggplot2
I can do the same thing in ggplot and take advantage of its additional functions to make the plot more legible.
I want to draw points, so I add the elements with the geom_point() function to the blank sheet.
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
It is hard to see any relationship between the two variables. Also, they only shared some 2% of their variation. However, the income variable also consists only of integers. Therefore, the points overlap each other, and we get a poor idea of how many units exist within the different value combinations. We can’t see what value combinations are the most common, only that there are a lot of them.
We can solve this in two ways: by making the points partially transparent, or by moving them slightly apart.
Transparent color I can specify that I want transparent colors. I do this with the argument alpha=. The value I specify ranges from 0 (completely transparent/invisible) to 1 (not transparent).
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt),
             #all observations are transparent
             alpha = 0.2)
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Note that when we want to adjust the points without the adjustment being dependent on information from the data, this is done outside of aes() but within the geom_…() function.
Jitter data Another alternative is to “jitter” the data points a little. This means that R adds a random variation to the coordinates of the points. The actual data points become more imprecise, but the point here is not to perform a precise analysis, but to present relationships in the data so that the human eye gets an intuitive understanding of what is going on.
df %>%
  ggplot() +
  geom_jitter(aes(y = Skepsis,
                  x = Indtaegt),
              #Horizontal, but not vertical shake
              height = 0, width = 0.3)
## Warning: Removed 178 rows containing missing values (`geom_point()`).
The additional arguments width= and height= specify how much the points should be jittered horizontally and vertically, respectively. Here, I say that I want precise data along the y-axis (“in height”), but I let the points randomly vary by 0.3 units along the x-axis (“in width”).
Trivariate graphics: Group the data.
I can easily add more groupings. Here I am coloring the points based on the value of a third variable: The variable can be continuous or categorical.
Numeric/continuous grouping “Prekaritet” flags respondents that have a weak link to the labor market (i.e. they are unemployed or on a temporary work contract).
df %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt,
                 #Grouping
                 color = Prekaritet))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Now I get a “heat scale” for the respondent’s link to the labor market. ggplot also automatically provides a legend. However, the variable only has two values, 0 or 1, so we had better treat it as a categorical moderator.
Categorical grouping If I want a categorical grouping, I need to feed the function a categorical variable (“character” or “factor”). It’s best to do this by mutating the data frame using tidyverse, not in the ggplot() function. Right here, I use as.factor() to change the measurement level into categorical.
df %>%
  mutate(Prekaritet = as.factor(Prekaritet)) %>%
  ggplot() +
  geom_point(aes(y = Skepsis,
                 x = Indtaegt,
                 #Grouping
                 color = Prekaritet))
## Warning: Removed 178 rows containing missing values (`geom_point()`).
Of course, I can combine all these solutions – jittering, transparency and recoding – until I get a plot that tells the story I want to convey.
#tidyverse to transform the data
df %>%
  mutate(Prekaritet = as.factor(Prekaritet)) %>%
  #ggplot2 to plot the data
  ggplot() +
  geom_jitter(aes(y = Skepsis,
                  x = Indtaegt,
                  #Categorical moderating third variable
                  color = Prekaritet),
              #Jitter the data horizontally and vertically
              height = 0.1, width = 0.4,
              #Transparency
              alpha = 0.7)
## Warning: Removed 178 rows containing missing values (`geom_point()`).
It is hard to tell much from this graphic, so let’s move on to other visualizations.
Bivariate regression lines
I can also show the relationship between two variables using regression lines. ggplot does not report regression coefficients but instead illustrates the regression line and the uncertainty around it. This is particularly relevant if at least one of the variables (the x-variable) is continuous.
Now we can choose between local regression and ordinary linear regression. Which one we choose depends on what we want to achieve.
Local regression
Local regression is – in my eyes – the queen of all ggplot functions. I use it all the time when exploring my data. It draws a “smooth” line (locally, the sliding average over the different x-values). Perfectly suited for time trends or exploring non-linear relationships, for example.
The function that gives us bivariate regression lines is geom_smooth(). The default setting is local regression.
df %>%
  ggplot() +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (`stat_smooth()`).
Here I illustrate the relationship between income and skepticism about immigration. ggplot adjusts the axis limits for us automatically so as to zoom in on the results. Beware: this zoom means that even small effects can come across as large.
Grouped: Is the relationship the same within different categories?
df %>%
  ggplot() +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt,
                  color = as.factor(Prekaritet)))
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## Warning: Removed 178 rows containing non-finite values (`stat_smooth()`).
Would you like to replace the “spaghetti” with straight lines? Ask for a (bivariate) linear model.
df %>%
  ggplot() +
  geom_smooth(aes(y = Skepsis,
                  x = Indtaegt,
                  color = as.factor(Prekaritet)),
              method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 178 rows containing non-finite values (`stat_smooth()`).
Here we clearly see that the association between immigration skepticism and income is primarily present for respondents who have a safe connection to the labor market.
Your turn!
- Plot the relationship between income (x = Indtaegt) and weak link to the labor market (y = Prekaritet) using a local regression. What do you find? Would it make sense to swap the two, i.e. let income be on the y-axis?
- Use the tidyverse mutate() function to create a new variable that flags whether a respondent has answered the question about labor market connection. Redefine the new variable into a numeric using the as.numeric() function. Then plot the relationship between income (Indtaegt) and the new variable using a local regression. What do you find?
One numeric and one categorical variable (+ more tidyverse)
When we have a numeric and a categorical variable we’d like to explore, we will usually calculate the group mean; i.e. we calculate the average of the numeric variable for each of the categories in the other.
Numeric summary: group averages
This is a good excuse to have a second look at tidyverse. dplyr has a very handy function that allows us to do operations on subsets of the data depending on a grouping variable. Let’s calculate the average scepticism among voters of different parties and create a summary statistic. To do so, we use reframe().
df %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T))
Before moving on, I’m removing the NAs.
df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T))
The observations are ordered alphabetically. It is not intuitive. Let’s rearrange.
df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T)) %>%
  arrange(Scepticism)
No surprises here; Dansk Folkeparti has voters that are more sceptical than other parties. This being said, in general, there is not a massive spread here.
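If you want a quick base-R cross-check of such group averages, tapply() computes the same statistic without any packages. A sketch on R’s built-in mtcars data (used so the example is self-contained):

```r
#Average fuel economy per number of cylinders
means <- tapply(mtcars$mpg, mtcars$cyl, mean, na.rm = TRUE)

#Sort the group means from lowest to highest, as with arrange()
sort(means)
```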
My results looked OK, so I’m storing them in tab.
tab <-
  df %>%
  filter(!is.na(Parti)) %>%
  group_by(Parti) %>%
  reframe("Scepticism" = mean(Skepsis, na.rm = T)) %>%
  arrange(Scepticism)
Visual: barplot
Let’s instead plot this.
base
barplot(tab$Scepticism,
names.arg = tab$Parti,
las = 2)
ggplot2
Instead of plotting the entire data set, I now rely on my table also in ggplot2. That is, I’ve already calculated the height of the bars, so there is no need for ggplot to do that. I signal my intent by using the geom_col() function. It simply reads the coordinates from the data frame, so that the y-values and the x-values are defined by me.
tab %>%
  ggplot() +
  geom_col(aes(y = Scepticism,
               x = Parti))
Bonus
Would you like to order your columns by size? You can do that by reordering the x-variable (Parti) according to the values of scepticism; i.e. you define it as an ordered categorical variable. The function is reorder().
tab %>%
  mutate(Parti = reorder(Parti, Scepticism)) %>%
  ggplot() +
  geom_col(aes(y = Scepticism,
               x = Parti))
Two categorical variables
When we have two categorical variables, we do a cross-table.
table(df$Parti, df$Kvinde)
##
## 0 1
## Andet 11 6
## Dansk Folkeparti 85 58
## Det Konservative Folkeparti 33 32
## Det Radikale Venstre 69 65
## Enhedslisten 32 45
## Kristendemokraterne 4 4
## Liberal Alliance 38 10
## Socialdemokraterne 116 152
## Socialistisk Folkeparti 43 65
## Venstre 175 136
Absolute frequencies are hard to fathom. I like them relative. Also, this time around, we want to know the proportion of voters that are men/women per party. It means that I calculate the distribution by rows (margin = 1).
table(df$Parti, df$Kvinde) %>%
prop.table(., margin = 1)
##
## 0 1
## Andet 0.6470588 0.3529412
## Dansk Folkeparti 0.5944056 0.4055944
## Det Konservative Folkeparti 0.5076923 0.4923077
## Det Radikale Venstre 0.5149254 0.4850746
## Enhedslisten 0.4155844 0.5844156
## Kristendemokraterne 0.5000000 0.5000000
## Liberal Alliance 0.7916667 0.2083333
## Socialdemokraterne 0.4328358 0.5671642
## Socialistisk Folkeparti 0.3981481 0.6018519
## Venstre 0.5627010 0.4372990
We don’t usually test statistical significance when we explore the data. However, we can. The chi-square test allows us to check whether the cross table could have been generated by chance. The p-value reports the probability of observing a relationship at least this strong if the two variables were in fact independent. Here, that’s very unlikely.
table(df$Parti, df$Kvinde) %>%
chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 38.546, df = 9, p-value = 1.391e-05
base
table <- table(df$Kvinde, df$Parti)
barplot(table,
beside = T)
ggplot2
df %>%
  #Create a new variable with categorical names
  mutate(Gender = if_else(Kvinde == 1,
                          "Woman",
                          "Man")) %>%
  #Plot
  ggplot() +
  geom_bar(aes(x = Parti,
               #Moderate by gender
               fill = Gender),
           position = "dodge")
Your turn!
Key take-aways
- explore the data and use graphics
- choose statistics according to measurement level
- use tidyverse/dplyr to reshape the data for group averages etc.
- use ggplot2 to plot the data
Key functions in tidyverse/dplyr (numeric manipulation)

Function | What it does |
---|---|
filter() | Filters out observations |
select() | Selects variables |
group_by() | Groups the data |
mutate() | Adds/changes variables, but not the number of observations |
reframe() | Reduces to a data frame with fewer/summarizing variables |
Key functions in ggplot (graphical display)

Function | What it does | Reshapes the data? | Measurement level | Statistic |
---|---|---|---|---|
ggplot() | Creates a blank page | | | |
geom_bar() | barplot/frequency count | Yes | One or two categorical variables | Univariate, bi-/trivariate |
geom_histogram() | histogram | Yes | Numeric | Univariate |
geom_point() | points/scatterplot | No. Uses variables as coordinates. | Two numeric variables | Bivariate |
geom_col() | bars/barplot | No. Uses data as coordinates, so dplyr::mutate() might be needed. | One categorical and one continuous variable; height is group mean | Bivariate |
geom_smooth() | local regression line | Yes | Two numeric, or one binary (y) and one numeric | Bivariate |
Recap exercise
Our end-of-class activity was to find some potential variables, define their measurement level, the statistics we can use to describe them, and the visuals that might be useful.
Overview of potential descriptive statistics:
N. of variables | variable name | measurement level | statistic | graphic | geom_… |
---|---|---|---|---|---|
Univariate | | | | | |
 | income | continuous, bounded at 0 | mean, quartiles, median, sd | histogram | geom_histogram() |
 | party choice | categorical | frequency table | barplot | geom_bar() |
 | job satisfaction | ordinal, treated as either numeric or categorical | | | |
 | air strikes | count; bounded at 0 | mean, etc. | histogram | geom_histogram() |
Bivariate | | | | | |
 | party choice + gender | categorical + categorical (or treated as continuous) | cross table | barplot | geom_bar() |
 | income + grades | continuous + ordinal (treated as categorical) | group means; avg. income for each grade | barplot | geom_col() |
 | negotiation time + preference agreement | continuous + continuous | Pearson’s r | scatterplot; local regression | |
Workflow and tweaks in ggplot2
Let’s follow an alternative workflow and create our own small dataset for plotting. Here, I am interested in immigration skepticism among respondents with different work situations.
dfp <-
  df %>%
  #Filter out NA
  filter(!is.na(Prekaritet) & !is.na(Innvandrer)) %>%
  #Group
  group_by(Prekaritet, Innvandrer) %>%
  #Create a smaller data frame with group summary statistics
  reframe(
    #Group average of immigration attitudes
    Sceptical = mean(Skepsis, na.rm = T)) %>%
  #Define new variable names to appear for the viewer
  mutate(
    #New variable with intuitive answers; note the `backticks` around the name
    `How is your work situation?` = if_else(Prekaritet == 1,
                                            "Unstable",
                                            "Stable"),
    Immigrant = if_else(Innvandrer == 1,
                        "Immigrant", "Non-immigrant"))

dfp
p <-
  dfp %>%
  ggplot() +
  #Classical definition of x and y coordinates
  geom_col(aes(y = Sceptical,
               x = Immigrant,
               #Fill in with colors following the grouping
               fill = `How is your work situation?`),
           #place bars side-by-side
           position = "dodge")
Note how I have saved my plot in an object, p. To see the plot, I request it.
p
It is useful to save the object this way because I will modify my R object further. I can do that using pipes.
Aesthetic tweaks
Once you have added the graphical elements with your data, you can spend an endless amount of time fine-tuning the rest of the plot.
Tell R what language you speak (and where you are)
R was not created by Norwegians, so Danish characters can sometimes cause problems. We have to actively consider encoding choices in two settings: when we import data and when we save data, including in graphics.
Computers read and provide information as a series of 0s and 1s. To read and write text, we use an encoding, a translation of binary codes into the alphabet we know. Traditionally, each country and language had its own encoding system. This was cumbersome, but today, most text exchanged on the internet follows “utf-8” encoding, which includes most of the characters we know how to read. Sys.setlocale() tells R where you are in the world. Here, I am telling it that I want a Danish locale with utf-8 encoding.
Sys.setlocale(category = "LC_ALL",
locale = "dk_DK.UTF-8")
Titles and axis names
p <-
  p +
  #Add a title and a subtitle
  ggtitle(label = "Relationship between immigration skepticism and work situation",
          subtitle = "Data from ESS (2014)") +
  #Name of y-axis
  ylab("Scepticism towards immigration") +
  #Name of x-axis: empty
  xlab("") +
  #What are the limits on my axis?
  ylim(c(0,6))
p
Define colors
You can choose your own colors, and there are many ways to do so. If you have linked your data with color choices, it may be useful to define a “palette.”
The text (“Stable” and “Unstable”) is part of the data I created, so it is located in the variable dfp$`How is your work situation?`. I can change it in ggplot if I want to. I do this when defining the colors.
p <-
  p +
  #Define colors
  scale_color_manual(
    #What colors?
    values = c("purple", "magenta"),
    #Color the entire bar
    aesthetics = "fill",
    #What's the title of the legend?
    name = "Link to labor market",
    #What are the category names?
    labels = c("Stable",
               "Unstable")
  )

p
Background
You can define different templates for the aesthetic aspects of the plot with theme_...().
p <-
  p +
  # Define a theme
  theme_minimal()
p
You can alter the themes at will. You do that after you have defined the theme, using the generic theme() function.
p <-
  p +
  # Modify the minimal theme after defining it
  theme(
    # Move the legend down to below the plot
    legend.position = "bottom",
    # Bold font for the title in the legend
    legend.title = element_text(face = "bold"),
    # Italic for axis values
    axis.text = element_text(face = "italic"),
    # Remove the gridlines
    panel.grid = element_blank())
p
The element_...() functions are used inside the theme() function. They specify everything except the data content/information for an element. Should it be a blank element? element_blank(). A specific font? element_text(). And so on. For example, I can specify that the text used for the axis names should be displayed in italics: element_text(face = "italic").