library(tidyverse)
library(rio)
6 Combining & joining data sets
The required packages for this chapter are:
We use data from the 2019 Canadian Election Study as an example in this overview:
<- import("2019 Canadian Election Study.rds") canada
6.1 Adding rows or columns
6.1.1 Adding rows
If we have two datasets with the same columns (variables), we can add the rows of the two data sets together using bind_rows
from package dplyr
(this package is loaded with the tidyverse).
As an example, we have two data sets with the same variables, names
and age
:
<- data.frame(name = c("Billy", "Xin", "Hugo"), age = c(20, 30, 40))
dta1 <- data.frame(name = c("Theresa", "Elin", "Lena"), age = c(25, 35, 45))
dta2
dta1
name age
1 Billy 20
2 Xin 30
3 Hugo 40
dta2
name age
1 Theresa 25
2 Elin 35
3 Lena 45
We can combine these data sets as follows:
<- bind_rows(dta1, dta2)
dta_combined dta_combined
name age
1 Billy 20
2 Xin 30
3 Hugo 40
4 Theresa 25
5 Elin 35
6 Lena 45
6.1.2 Adding columns
If we want to add additional columns to a data frame, we can use bind_cols
from package dplyr
(this package is loaded with the tidyverse).
For example, if we have two datasets, one with name and age of people, and another with their self-reported gender:
<- data.frame(name = c("Billy", "Xin", "Hugo"), age = c(20, 30, 40))
dta <- data.frame(gender = c("Non-binary", "Female", "Male")) gender
dta
name age
1 Billy 20
2 Xin 30
3 Hugo 40
gender
gender
1 Non-binary
2 Female
3 Male
We can combine this, using bind_cols
:
bind_cols(dta, gender)
name age gender
1 Billy 20 Non-binary
2 Xin 30 Female
3 Hugo 40 Male
Important: bind_cols
will match by position, so the order of rows must be exactly the same. For more flexible ways of adding columns to a dataset, see ‘joining of data sets’ below.
6.2 Joining data sets
If we have two datasets with a common variable (i.e. a variable that has the same name and comparable coding, for example a country name or a unique identifier), we can combine the datasets, using join
from dplyr
.
Let’s start with some toy data on 4 countries:
<- data.frame(country = c("USA", "Germany", "Netherlands", "Kenya"),
country_dta1 population = c(332, 84, 18, 56))
<- data.frame(country = c("Netherlands", "Germany", "Kenya", "Argentina"),
country_dta2 official_name = c("Nederland", "Bundesrepublik Deutschland", "Republic of Kenya", "República Argentina"))
country_dta1
country population
1 USA 332
2 Germany 84
3 Netherlands 18
4 Kenya 56
country_dta2
country official_name
1 Netherlands Nederland
2 Germany Bundesrepublik Deutschland
3 Kenya Republic of Kenya
4 Argentina República Argentina
Note that only three countries feature in both datasets, but the variable country
is present in both data frames and the values are the same for the three countries that are in both datasets.
We can merge the data in these datasets as follows, using so called ‘joins’ from pacakge dplyr
. The function full_join
returns all rows in both datasets:
<- full_join(x = country_dta1,
joined_data y = country_dta2,
by = c("country"))
x = country_dta1
-
This specifies the first of the two datasets. Replace with your own dataset when working with your own data.
y = country_dta2
-
This specifies the second of the two datasets.
by = c("country")
-
This specifies the column on which we would like to match values of the two data frames (specified as a vector, i.e. between double parentheses). In our example, we use the variable country, as this variable is comparable between our two data sets.
joined_data
country population official_name
1 USA 332 <NA>
2 Germany 84 Bundesrepublik Deutschland
3 Netherlands 18 Nederland
4 Kenya 56 Republic of Kenya
5 Argentina NA República Argentina
We can see that all five countries are present in the joined data set. The USA is missing from country_dta2
, so has a missing value (NA
) on official_name
in the joined data set. Argentina is missing from country_dta1
, so has a missing value on population
in the joined data set.
There are four types of joins:
Function | Includes |
---|---|
inner_join() |
All rows in x and y . |
left_join() |
All rows in x . |
right_join() |
All rows in y . |
full_join() |
All rows in x or y . |
We can see the result if we run and print the four types of joins:
# Inner join: only the three countries that are in both data sets
inner_join(x = country_dta1, y = country_dta2, by = c("country"))
country population official_name
1 Germany 84 Bundesrepublik Deutschland
2 Netherlands 18 Nederland
3 Kenya 56 Republic of Kenya
# Left join: only the four countries in the first data set
left_join(x = country_dta1, y = country_dta2, by = c("country"))
country population official_name
1 USA 332 <NA>
2 Germany 84 Bundesrepublik Deutschland
3 Netherlands 18 Nederland
4 Kenya 56 Republic of Kenya
# Right join: only the four countries in the second data set
right_join(x = country_dta1, y = country_dta2, by = c("country"))
country population official_name
1 Germany 84 Bundesrepublik Deutschland
2 Netherlands 18 Nederland
3 Kenya 56 Republic of Kenya
4 Argentina NA República Argentina
# Full join: all five countries
full_join(x = country_dta1, y = country_dta2, by = c("country"))
country population official_name
1 USA 332 <NA>
2 Germany 84 Bundesrepublik Deutschland
3 Netherlands 18 Nederland
4 Kenya 56 Republic of Kenya
5 Argentina NA República Argentina
6.2.1 Joining with different variables
If we have two data frames with different names, we need to instruct R which variables are to be compared.
For example, suppose we have a third data frame with country information:
<- data.frame(country_name = c("USA", "Netherlands", "Germany"),
country_dta3 capital_city = c("Washington DC", "Amsterdam", "Berlin"))
country_dta3
country_name capital_city
1 USA Washington DC
2 Netherlands Amsterdam
3 Germany Berlin
We note that the variable containing the country name is called country_name
in this data frame. In order to join it with country_dta1
, we need to specify by
as follows:
full_join(x = country_dta1, y = country_dta3,
by = c("country" = "country_name"))
country population capital_city
1 USA 332 Washington DC
2 Germany 84 Berlin
3 Netherlands 18 Amsterdam
4 Kenya 56 <NA>
by = c("country" = "country_name")
-
The syntax
"country" = "country_name"
means that we are comparing variablecountry
from the first data set to variablecountry_name
from the second data set.
Note that this solution only works when the country names are spelled identically (and only the variable name is different). The below will not work properly, because the country names are not comparable between the two datasets:
<- data.frame(country = c("United States of America", "Deutschland", "Nederland"),
country_dta4 capital_city = c("Washington DC", "Berlin", "Amsterdam"))
country_dta4
country capital_city
1 United States of America Washington DC
2 Deutschland Berlin
3 Nederland Amsterdam
full_join(country_dta1, country_dta4)
country population capital_city
1 USA 332 <NA>
2 Germany 84 <NA>
3 Netherlands 18 <NA>
4 Kenya 56 <NA>
5 United States of America NA Washington DC
6 Deutschland NA Berlin
7 Nederland NA Amsterdam
In such cases, the solution would be to first recode the variable that is used for joining, for example:
<- country_dta4 |>
country_dta4_recoded mutate(country = recode(country,
"United States of America" = "USA",
"Deutschland" = "Germany",
"Nederland" = "Netherlands"))
full_join(country_dta1, country_dta4_recoded)
country population capital_city
1 USA 332 Washington DC
2 Germany 84 Berlin
3 Netherlands 18 Amsterdam
4 Kenya 56 <NA>