4 Missing data
Many real-world datasets are incomplete: they have missing data. In this chapter we discuss how to deal with this in R.
We will use the packages listed and the datasets loaded in the following code chunk. We will use Pippa Norris's Democracy Cross-National Data and the 2019 Canadian Election Study. Make sure you load these packages and datasets before you start working with the examples in this overview:
# Loading packages
library(tidyverse) # For working with data
library(rio) # For importing datasets
# Loading the datasets
dta <- import("Democracy Cross-National Data V4.1 09092015.sav")
canada <- import("2019 Canadian Election Study.rds")
For some countries the dataset contains missing values on the variable Pop2006 (population size in 2006):
dta |>
  filter(is.na(Pop2006)) |>
  select(Nation, Pop2006)
Nation Pop2006
1 Afghanistan NA
2 Iraq NA
3 Kosovo NA
4 Montenegro NA
5 Nauru NA
6 South Sudan NA
7 Taiwan NA
8 Timor Leste NA
9 Tuvalu NA
The term NA means Not Available. In this case there may not have been reliable data for these countries' population sizes in 2006 (Afghanistan, Iraq) or the country simply did not exist in 2006 (South Sudan). There can be various reasons for missing data in a dataset.
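Before looking at individual variables, it can help to count how many values are missing in each column. The following is a minimal sketch on a small made-up data frame (the names toy, nation, pop and gdp are hypothetical and not part of the Democracy dataset); it uses only base R functions:

```r
# Toy data frame with hypothetical values (not taken from the Democracy dataset)
toy <- data.frame(
  nation = c("A", "B", "C", "D"),
  pop    = c(1.2, NA, 3.4, NA),
  gdp    = c(10, 20, NA, 40)
)

# is.na() marks every missing cell as TRUE; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# anyNA() gives a quick yes/no answer for a single variable
anyNA(toy$nation)
```

The same two calls should work on a real dataset by replacing toy with the name of your own data frame.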
4.1 Types of missing data
In R, NA is commonly used to signal missing data, but if we load datasets in an SPSS, Stata or other file format, missing data may have different codes. Often, this type of information can be found in the codebook, a separate file (often a PDF) that describes the dataset and its variables.
One example is the variable pes19_pidtrad (Traditional party identification) from the 2019 Canadian Election Study (we loaded this data at the start of this overview):
levels(canada$pes19_pidtrad)
[1] "Liberal" "Conservative"
[3] "NDP" "Bloc Québécois"
[5] "Green" "People’s Party"
[7] "Another party (please specify)" "None of these"
[9] "Don't know/ Prefer not to answer"
This variable has the categories 'None of these' and 'Don't know/ Prefer not to answer', which for most analyses should be treated as missing data. However, they are currently treated as regular answer categories.
We recommend always checking the levels or values of a variable for missing data issues.
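As a small illustration of such a check, the sketch below uses a made-up factor (answers is a hypothetical variable, not one from the Canadian Election Study). Listing the levels shows which categories exist, and a frequency table shows how often each one occurs:

```r
# Hypothetical survey answers stored as a factor
answers <- factor(c("Yes", "No", "Don't know", "Yes", "No", "Yes"))

# Which answer categories exist?
levels(answers)

# How many respondents fall in each category (NA is shown too, if present)?
counts <- table(answers, useNA = "ifany")
counts
```

A category such as "Don't know" showing up in the levels is a hint that it may need to be recoded as missing before analysis.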
This type of problem can also be encountered in interval-ratio variables (numeric variables), where in some cases numbers like 999 are used to indicate missing values. Note: this is something we do not recommend, but you may encounter it in real-world data you are working with.
Below is an example in which the value 999 has been used to indicate missing data. If we calculate the mean for this variable without telling R that 999 is actually missing data, we will overestimate the mean age:
data_age <- data.frame(age = c(55, 64, 37, 56, 999, 42, 47, 22, 49, 68, 59, 999))
mean(data_age$age) # The result is incorrect, because of the incorrect treatment of the missing values
[1] 208.0833
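One way to spot such a code is to look at the range of a numeric variable before computing any statistics. The sketch below repeats the ages from the example in a plain vector (age_values is a name introduced here for illustration only):

```r
# Same ages as in the example above; 999 is the suspected missing-data code
age_values <- c(55, 64, 37, 56, 999, 42, 47, 22, 49, 68, 59, 999)

# The minimum and maximum: a maximum of 999 is not a plausible age
age_range <- range(age_values)
age_range

# How many observations carry the suspicious code?
n_suspicious <- sum(age_values == 999)
n_suspicious
```

Whether an extreme value really is a missing-data code should still be confirmed in the codebook.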
4.2 Recoding missing data
If there are values in the data that you would like to treat as missing data, you can use na_if from package dplyr. Package dplyr is loaded as part of the tidyverse.
canada <- canada |>
  mutate(pes19_pidtrad = na_if(pes19_pidtrad,
                               "Don't know/ Prefer not to answer")) |>
  mutate(pes19_pidtrad = na_if(pes19_pidtrad,
                               "None of these")) |>
  mutate(pes19_pidtrad = droplevels(pes19_pidtrad))
table(canada$pes19_pidtrad, useNA = "ifany") # Display a table including NAs
Liberal Conservative
1746 1501
NDP Bloc Québécois
693 186
Green People’s Party
274 64
Another party (please specify) <NA>
23 33335
mutate(pes19_pidtrad = ...
- We are going to change the existing variable pes19_pidtrad.
na_if(pes19_pidtrad, "Don't know/ Prefer not to answer")
- This function changes particular values in a variable to NA. In this example we would like to change values of "Don't know/ Prefer not to answer" of pes19_pidtrad into missing values (NA). For your own data you will need to insert the appropriate variable name and the value you would like to have changed to NA.
mutate(pes19_pidtrad = droplevels(pes19_pidtrad))
- Finally, we use droplevels to ensure that the levels that we recoded as NA are completely removed as levels from factor pes19_pidtrad, so that in any subsequent analyses these are completely ignored. This is not necessary for other types of variables (numeric or character).
We see that there are no more respondents who answered 'None of these' or 'Don't know/ Prefer not to answer'. Note: Because we have two values that we would like to be transformed into NA, we have two mutate statements.
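If there are many values to recode, one mutate statement per value becomes repetitive. As an alternative, the sketch below collects the unwanted values in a vector and sets all matching cases to NA in one step, using base R's replace together with %in%. This swaps na_if for a different technique. The data frame toy, its variable party and the vector missing_codes are hypothetical stand-ins for your own data:

```r
library(dplyr)

# Hypothetical toy data standing in for a survey variable
toy <- data.frame(
  party = factor(c("Liberal", "None of these", "Conservative",
                   "Don't know/ Prefer not to answer", "Liberal"))
)

# All values that should be treated as missing
missing_codes <- c("None of these", "Don't know/ Prefer not to answer")

toy <- toy |>
  # Set every case whose value is listed in missing_codes to NA
  mutate(party = replace(party, party %in% missing_codes, NA)) |>
  # Remove the now-unused factor levels
  mutate(party = droplevels(party))

table(toy$party, useNA = "ifany") # Display a table including NAs
```

As with na_if, the droplevels step is only needed for factors.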
This also works for replacing numeric values, like in our data_age example. Here we want to change the value of 999 to NA:
data_age <- data_age |>
  mutate(age = na_if(age, 999))
data_age$age
[1] 55 64 37 56 NA 42 47 22 49 68 59 NA
And the mean will be correctly calculated after recoding the missing values:
mean(data_age$age, na.rm=TRUE)
[1] 49.9
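The argument na.rm = TRUE matters here: without it, mean returns NA as soon as the variable contains a single missing value. A minimal sketch with made-up numbers (x is a hypothetical vector):

```r
# Hypothetical values with one missing observation
x <- c(55, 64, NA, 42)

# Without na.rm the missing value propagates and the result is NA
mean_with_na <- mean(x)
mean_with_na

# With na.rm = TRUE the mean is computed over the three observed values only
mean_without_na <- mean(x, na.rm = TRUE)
mean_without_na
```

Other summary functions such as sum, sd, min and max accept the same argument.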
Note: it is better to use na_if than recode to recode missing values.[1]
4.3 Filtering out missing data
You can filter out missing data, using the is.na function:
dta |>
  filter(!is.na(Pop2006)) |>
  select(Nation, Pop2006) |>
  head()
Nation Pop2006
1 Albania 3137503.17
2 Algeria 33347948.22
3 Andorra 66900.00
4 Angola 16391381.89
5 Antigua & Barbuda 83612.15
6 Argentina 39120455.54
filter(!is.na(Pop2006))
- We keep only the cases that have a non-missing value on variable Pop2006. Note the !, which means not: we want only cases that do not have a missing value on Pop2006.
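An alternative to combining filter and is.na is the function drop_na from package tidyr, which is also loaded as part of the tidyverse. The sketch below applies both approaches to a small made-up data frame (toy, nation and pop are hypothetical names) so that the results can be compared:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing population value
toy <- data.frame(
  nation = c("A", "B", "C"),
  pop    = c(100, NA, 300)
)

# Approach used in this section: keep rows where pop is not missing
kept_filter <- toy |>
  filter(!is.na(pop))

# Alternative: drop rows with a missing value in the listed column(s)
kept_drop_na <- toy |>
  drop_na(pop)

kept_filter
kept_drop_na
```

Calling drop_na() without naming any column removes every row that has a missing value in any column, which can discard more cases than intended.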
[1] The reason is that if you use recode to recode to missing values, you need to tell R exactly what type of missing data you have, for example NA_character_ instead of just NA for a character variable. Otherwise you will run into incompatible vector problems that are better avoided.