# Loading packages
library(tidyverse) # For working with data
library(rio) # For importing datasets
library(DescTools) # Used here for calculating the Median
# Loading the datasets
<- import("Democracy Cross-National Data V4.1 09092015.sav")
dta <- import("2019 Canadian Election Study.rds") canada
3 Creating, summarising & recoding variables
In this chapter we will use the packages listed and the datasets loaded in the following code chunk. We will use Pippa Noriss’ Democracy Cross-national Data and the 2019 Canadian Election Study. Make sure you load these packages and datasets before you start working with the examples in this overview:
3.1 Creating or changing variables
We can use the function mutate
from package dplyr
to create new variables or change existing variables in a data frame. Package dplyr
is loaded as part of the tidyverse
.
We’ll use the variable Pop2006 as an example, which registers the population of countries in 2006:
head(dta$Pop2006)
[1] NA 3137503.17 33347948.22 66900.00 16391381.89 83612.15
If we would like to create a variable that registers the population in millions, we can create the new variable as follows:
<- dta |>
dta mutate(Pop2006_million = Pop2006 / 1000000)
dta <- dta |>
-
This part of the code says that we are going to start of with the data frame
dta
and assign the result to a data frame with the same name. Basically it means: we’re going to change the data framedta
. If you are working with your own data, you would replace this by the name of your own data frame.Note that we are using the pipe operator
|>
here, which means that we are taking the original data frame, and using this in the mutate function (see below). mutate(...)
-
The mutate function creates a new variable or changes an existing one. Within the brackets we specify what kind of mutation we want.
Pop2006_million
-
This is the name of the variable we’re creating. If you do not write the name of a new variable but instead use the name of an existing variable, this would change the values of that variable. In your own data, replace
Pop2006_million
by the name of the variable you would like to create. Pop2006 / 1000000
-
This is how we would like to create this new variable, in this case by dividing the existing variable
Pop2006
by one million. You can apply any kind of function or operator, as long as it works on a vector (in this case: an entire column of a dataset).
We can compare the first five cases to check if the transformation worked as intended:
|>
dta select(Pop2006, Pop2006_million) |> # Select the original and transformed variable
head() # Display the first few rows
Pop2006 Pop2006_million
1 NA NA
2 3137503.17 3.13750317
3 33347948.22 33.34794822
4 66900.00 0.06690000
5 16391381.89 16.39138189
6 83612.15 0.08361215
Other examples of mutations are:
# Calculate the difference between the population in 2006 and 2000
<- dta |>
dta mutate(Pop2006_difference = Pop2006 - Pop2000)
# Take the square root of the population
<- dta |>
dta mutate(Pop2006_squared = sqrt(Pop2006))
# Take the natural logarithm of the population
<- dta |>
dta mutate(Pop2006_log = log(Pop2006))
# Create a new variable with the same value (1) for all cases
<- dta |>
dta mutate(Country = 1)
Important things about mutate:
Do not use the
$
operator to select variables from a data frame. You must simply use the name of the variabe.Do not forget to assign the result, i.e. start the code with
dta <- dta |>
(replacingdta
by the name of your dataset). If you do not assign the result, R will simply print out the mutated data frame but not save it.If you apply a mutation to a data frame and assign the result, this will not generate any output. When it works succesfully, it will simply change your data frame or create a new one. You can subsequently inspect the data to see if the transformation worked.
3.2 Summarising data
We can summarise data using the function summarise
from pacakge dplyr
. Package dplyr
is loaded as part of the tidyverse
.
|>
dta summarise(mean_population = mean(Pop2006, na.rm=TRUE))
mean_population
1 34502734
This will simply give us the mean value of Pop2006 for all countries in the world. The most powerful use of summarise
is in combination with group_by
. For example, the below code groups countries by continent (DD_un_continent_name
) and then calculates the mean population size for each continent:
|>
dta group_by(DD_un_continent_name) |>
summarise(mean_population = mean(Pop2006, na.rm=TRUE))
# A tibble: 6 × 2
DD_un_continent_name mean_population
<chr> <dbl>
1 "" 32600
2 "Africa" 17475659.
3 "Americas" 25412866.
4 "Asia" 87338061.
5 "Europe" 17711125.
6 "Oceania" 2732066.
We can also group by more than one grouping variable. For example, here we calculate the mean population for democracies (Cheibub2Type = 0) and dictatorships (Cheibub2Type = 1) in each continent:
|>
dta group_by(DD_un_continent_name, Cheibub2Type) |>
summarise(mean_population = mean(Pop2006, na.rm=TRUE))
# A tibble: 15 × 3
# Groups: DD_un_continent_name [6]
DD_un_continent_name Cheibub2Type mean_population
<chr> <dbl> <dbl>
1 "" 0 NaN
2 "" NA 32600
3 "Africa" 0 18924251.
4 "Africa" 1 16597725.
5 "Americas" 0 26786031.
6 "Americas" 1 14770838.
7 "Asia" 0 138303981.
8 "Asia" 1 63553966.
9 "Asia" NA NaN
10 "Europe" 0 18538856.
11 "Europe" 1 7226530.
12 "Europe" NA NaN
13 "Oceania" 0 3515919.
14 "Oceania" 1 380506.
15 "Oceania" NA NaN
You can also calculate multiple summary functions:
|>
dta group_by(DD_un_continent_name) |>
summarise(mean_population = mean(Pop2006, na.rm=TRUE),
median_poplation = Median(Pop2006, na.rm=TRUE), # Median from pacakge DescTools
sd_population = sd(Pop2006, na.rm=TRUE))
# A tibble: 6 × 4
DD_un_continent_name mean_population median_poplation sd_population
<chr> <dbl> <dbl> <dbl>
1 "" 32600 32600 NA
2 "Africa" 17475659. 9244288. 25102665.
3 "Americas" 25412866. 6016000 59575769.
4 "Asia" 87338061. 14829470. 253704497.
5 "Europe" 17711125. 7441475. 28476042.
6 "Oceania" 2732066. 200462. 5920588.
If you want to save the result to a new data frame instead of printing it out, you can simply assign the result. Note that there is no output from the below code chunk because the result is saved in the data frame summary_data
:
<- dta |>
summary_data group_by(DD_un_continent_name) |>
summarise(mean_population = mean(Pop2006, na.rm=TRUE),
median_poplation = Median(Pop2006, na.rm=TRUE), # Median from pacakge DescTools
sd_population = sd(Pop2006, na.rm=TRUE))
The difference between mutate and summarise is that mutate creates a new value for each case in the dataset, while summarise will summarise the data by the grouping variable.
3.3 Recoding variables
Recoding variables means that we change the coding of values. This often applies to nominal or ordinal variables (factors), for example when we would like to group some categories together.
3.3.1 Recoding nominal variables
We illustrate this using the variable religion
‘Majority religion’ that we create from Pippa Noris’ Democracy Cross-national Data. We already imported the dataset (dta
) and loaded the relevant packages (tidyverse
, rio
) at the start of this overview.
First, we create variable religion
using the function factorize
from rio
. This creates a factor version of the original (labelled numeric) variable Fox_emajrel
.
<- dta |>
dta mutate(religion = factorize(Fox_emajrel))
The variable religion
can take the following values:
table(dta$religion)
Catholic Orthodox Christian Protestant Christian
43 12 16
Other Christian Islam, Sunni Islam, Shi'i
0 33 3
Islam, Other Buddhist Hindu
1 8 2
Jewish Animist Other
1 4 5
Islam (general) Mixed Christian (general)
8 12 25
We see that we have various large religions and various smaller categories. We can also see that various Islamic denominations and Christian denominations have been split up. Suppose we want, for the purpose of further analysis, simplify the categorisation to five categories: Christian, Islamic, Hindu, Buddhist and Other.
We can use the recode
function from dplyr
to recode the categories. Package dplyr
is loaded as part of the tidyverse
.
<- dta |>
dta mutate(religion_recoded = recode(religion,
"Animist" = "Other",
"Catholic" = "Christian",
"Islam (general)" = "Islam",
"Islam, Other" = "Islam",
"Islam, Shi'i" = "Islam",
"Islam, Sunni" = "Islam",
"Jewish" = "Other",
"Mixed" = "Other",
"Orthodox Christian" = "Christian",
"Protestant Christian" = "Christian",
"Other Christian" = "Christian",
"Christian (general)" = "Christian"
) )
mutate(religion_recoded =
-
This part of the code indicates that we want to create a new variable
religion_recoded
. If you work with your own data you would of course choose an appropriate name. recode(religion,
-
This part of the code indicates that we would like to recode the existing variable
religion
. If you work with your own data, specify the name of the existing variable that you want to recode. "Animist" = "Other"
-
These are the recoding statements, where we writing the old name on the left and the new name on the right between double quotation marks (“).
We can include as many of these statements as we like. If we do not include a value, it will remain unchanged (by default). For example, we do not include a statement for
Other
, so this value will be the same in the recoded variable.
We can see that the recoding has resulted in the five categories that we would like:
table(dta$religion_recoded)
Christian Islam Buddhist Hindu Other
96 45 8 2 22
We can also check whether the old values have been correctly recoded in the new values:
table(dta$religion, dta$religion_recoded)
Christian Islam Buddhist Hindu Other
Catholic 43 0 0 0 0
Orthodox Christian 12 0 0 0 0
Protestant Christian 16 0 0 0 0
Other Christian 0 0 0 0 0
Islam, Sunni 0 33 0 0 0
Islam, Shi'i 0 3 0 0 0
Islam, Other 0 1 0 0 0
Buddhist 0 0 8 0 0
Hindu 0 0 0 2 0
Jewish 0 0 0 0 1
Animist 0 0 0 0 4
Other 0 0 0 0 5
Islam (general) 0 8 0 0 0
Mixed 0 0 0 0 12
Christian (general) 25 0 0 0 0
(For example, all four majority Animist countries have correctly been recoded to ‘Other’).
3.3.2 Recoding ordinal variables
If you have an ordinal variable you want to recode, you would probably like the order of the variable to be kept.
For example, we have variable economy
which we create from the Norris dataset. It is a classification of of countries based on their GDP per capita. We already imported the dataset (dta
) and loaded the relevant packages (tidyverse
, rio
) at the start of this overview.
First, we create variable economy
using the function factorize
from rio
. This creates a factor version of the original (labelled numeric) variable TypeEcon2006
.
<- dta |>
dta mutate(economy = factorize(TypeEcon2006))
The variable economy
can take the following values:
table(dta$economy)
High ($15,000+) Medium ($2,000-14,999) Low ($2000 or less)
36 68 87
Sometimes one wants to merge certain categories, for example, to take together the High and Medium level countries, but preserve the order from low to high. We can achieve this using recode_factor
:
<- dta |>
dta mutate(economy_recoded = recode_factor(economy,
"Low ($2000 or less)" = "Lower",
"High ($15,000+)" = "Higher",
"Medium ($2,000-14,999)" = "Higher",
.ordered = TRUE)
)
mutate(economy_recoded
-
This part of the code indicates that we want to create a new variable
economy_recoded
. If you work with your own data you would choose an appropriate name for the recoded variable. recode_factor(economy,
-
This part of the code indicates that we would like to recode the existing variable
economy
. If you work with your own data, specify the name of the existing variable that you want to recode. "Low ($2000 or less)"="Lower"
-
These are the recoding statements, where we writing the old name on the left and the new name on the right between double quotation marks (“).
We can include as many of these statements as we like. Note that the spelling of the original category needs to be exact.
.ordered = TRUE
-
This indicates that we would like to create an ordered factor. Please note the dot before ordered.
If we inspect the first few values, we see that the factor is ordered. The order of the factor is determined by the order in which you have specified the recoding statements (i.e. Lower, then Higher):
head(dta$economy_recoded)
[1] Lower Lower Higher Higher Lower Higher
Levels: Lower < Higher
3.3.3 Recoding numeric variables
Recoding numeric variables works similarly to recoding norminal or ordinal variables.
As an example take the Freedom House Democracy Ratings for 2014 (fhrate14
), which are epxressed on a scale from 1 to 7. Freedom House also uses a three-point rating: Free (1.0 to 2.5), Partly Free (3.0 to 5.0) and Not Free (5.5 to 7). We already imported the dataset (dta
) and loaded the relevant packages (tidyverse
, rio
) at the start of this overview.
We can recode the variable fhrate14
as follows:
<- dta |>
dta mutate(fhrate14_recoded = recode(fhrate14,
"1" = 1,
"1.5" = 1,
"2.0" = 1,
"2.5" = 1,
"3" = 2,
"3.5" = 2,
"4" = 2,
"4.5" = 2,
"5" = 2,
"5.5" = 3,
"6" = 3,
"6.5" = 3,
"7" = 3)
)
table(dta$fhrate14_recoded)
1 2 3
89 55 51
Note: if you have many unique values, this may not be the most efficient way to recode a numeric variable. In such cases using ifelse
or case_when
will work more efficiently; these functions will be discussed in one of the upcoming weeks.1 There are also other packages that offer recoding functions (such as recode
from package car
or rec
from pacakge sjmisc
), but we will not discuss these here.
3.4 Missing data
Many real world datasets are incomplete: they have missing data. For example, in Pippa Norris’ dataset on countries many variables are not observed for each country. We already imported this dataset (dta
) and loaded the relevant packages (tidyverse
, rio
) at the start of this overview.
For some countries the dataset contains missing values on the variable Pop2006
(population size in 2006):
|>
dta filter(is.na(Pop2006)) |>
select(Nation, Pop2006)
Nation Pop2006
1 Afghanistan NA
2 Iraq NA
3 Kosovo NA
4 Montenegro NA
5 Nauru NA
6 South Sudan NA
7 Taiwan NA
8 Timor Leste NA
9 Tuvalu NA
The term NA
means Not Available
. In this case there may not have been reliable data for these countries’ population sizes in 2006 (Afghanistan, Iraq) or the country simply did not exist in 2006 (South Sudan). There can be various reasons for missing data in a dataset.
3.4.1 Types of missing data
In R, NA
is commonly used to signal missing data, but if we load datasets in an SPSS, Stata or other file format, missing data may have different codes. Often, this type of information can be found in the codebook, a separate file (often a PDF) that describes the dataset and its variables.
One example is the variable cps19_pidtrad
(Traditional party identification) from the 2019 Canadian Election Study (we loaded this data at the start of this overview):
levels(canada$pes19_pidtrad)
[1] "Liberal" "Conservative"
[3] "NDP" "Bloc Québécois"
[5] "Green" "People’s Party"
[7] "Another party (please specify)" "None of these"
[9] "Don't know/ Prefer not to answer"
This variable has a category ‘None of these’ and ‘Don’t know/Prefer not to answer’ which for most analysis should be treated as missing data. However, there are currently treated as regular answer categories.
We recommend to always check the levels or values of a variable to check for missing data issues.
This type of problem can also be encountered in interval-ratio variables (numeric variables), where in some cases numbers like 999
are used to indicate missing values. Note: this is something we do not recommend, but you may encounter it in real-world data you are working with.
One example in which the value 999
has been used to indicate missing data. If we calculate the mean for this variable without telling R that 999
is actually missing data, we will overestimate the mean age:
<- data.frame(age = c(55, 64, 37, 56, 999, 42, 47, 22, 49, 68, 59, 999))
data_age mean(data_age$age) # The result is incorrect, because of the incorrect treatment of the missing values
[1] 208.0833
3.4.2 Recoding missing data
If there are values in the data that you would like to treat as missing data, you can use na_if
from package dplyr
. Package dplyr
is loaded as part of the tidyverse
.
<- canada |>
canada mutate(pes19_pidtrad = na_if(pes19_pidtrad,
"Don't know/ Prefer not to answer")) |>
mutate(pes19_pidtrad = na_if(pes19_pidtrad,
"None of these")) |>
mutate(pes19_pidtrad = droplevels(pes19_pidtrad))
table(canada$pes19_pidtrad, useNA = "ifany") # Display a table including NAs
Liberal Conservative
1746 1501
NDP Bloc Québécois
693 186
Green People’s Party
274 64
Another party (please specify) <NA>
23 33335
mutate(pes19_pidtrad = ...
-
We are going to change the existing variable
pes19_pidtrad
. na_if(pes19_pidtrad, "Don't know/ Prefer not to answer")
-
This function changes particular values in a variable to
NA
. In this example we would like to change values of"Don't know/ Prefer not to answer"
ofpes19_pidtrad
into missing values (NA
). For your own data you will need to insert the appropriate variable name and the value you would like to have changed toNA
. mutate(pes19_pidtrad = droplevels(pes19_pidtrad))
-
Finally, we use
droplevels
to ensure that the levels that we recoded asNA
are completely removed as levels from factorpes19_pidtrad
, so that in any subsequent analyses these are completely ignored. This is not necessary for other types of variables (numeric
orcharacter
).
We see that there are no more respondents who answer ‘None of these’ or ‘Don’t know/Prefer not to answer’. Note: Because we have two values that we would like to be transformed into NA
, we have two mutate
statements.
This also works for replacing numeric values, like in our data_age
example. Here we want to change the value of 999
to NA
:
<- data_age |>
data_age mutate(age = na_if(age, 999))
$age data_age
[1] 55 64 37 56 NA 42 47 22 49 68 59 NA
And the mean will be correctly calculated after recoding the missing values:
mean(data_age$age, na.rm=TRUE)
[1] 49.9
Note: it is better to use na_if
than recode
to recode missing values.2
3.4.3 Filtering out missing data
You can filter out missing data, using the is.na
function:
|>
dta filter(!is.na(Pop2006)) |>
select(Nation, Pop2006) |>
head()
Nation Pop2006
1 Albania 3137503.17
2 Algeria 33347948.22
3 Andorra 66900.00
4 Angola 16391381.89
5 Antigua & Barbuda 83612.15
6 Argentina 39120455.54
filter(!is.na(Pop2006))
-
We filter out cases which have a non-missing value on variable Pop2006. Note the
!
which means not, so we want only cases that do not have a missing value on Pop2006.
In our example, we could use:
<- dta |> dta mutate(fhrate14_recoded2 = case_when(fhrate14 <= 2.5 ~ 1, <= 5 ~ 2, fhrate14 <= 7 ~ 3, fhrate14 TRUE ~ NA_real_)) table(dta$fhrate14_recoded2)
1 2 3 89 55 51
The reason is that if you use recode to recode to missing values, you need to tell R exactly what type of missing data you have, for example
NA_character_
instead of justNA
for a character variable. Otherwise you will run into incompatible vector problems that are better avoided.↩︎