4 Missing data

Many real world datasets are incomplete: they have missing data. In this chapter we discuss how to deal with this in R.

We will use the packages listed and the datasets loaded in the following code chunk. We will use Pippa Noriss’ Democracy Cross-national Data and the 2019 Canadian Election Study. Make sure you load these packages and datasets before you start working with the examples in this overview:

# Loading packages
library(tidyverse)  # For working with data
library(rio)        # For importing datasets

# Loading the datasets
dta <- import("Democracy Cross-National Data V4.1 09092015.sav") 
canada <- import("2019 Canadian Election Study.rds")

For some countries the dataset contains missing values on the variable Pop2006 (population size in 2006):

dta |>
  filter(is.na(Pop2006)) |>
  select(Nation, Pop2006)

       Nation Pop2006
1 Afghanistan      NA
2        Iraq      NA
3      Kosovo      NA
4  Montenegro      NA
5       Nauru      NA
6 South Sudan      NA
7      Taiwan      NA
8 Timor Leste      NA
9      Tuvalu      NA

The term NA means Not Available. In this case there may not have been reliable data for these countries’ population sizes in 2006 (Afghanistan, Iraq) or the country simply did not exist in 2006 (South Sudan). There can be various reasons for missing data in a dataset.

4.1 Types of missing data

In R, NA is commonly used to signal missing data, but if we load datasets in an SPSS, Stata or other file format, missing data may have different codes. Often, this type of information can be found in the codebook, a separate file (often a PDF) that describes the dataset and its variables.

One example is the variable cps19_pidtrad (Traditional party identification) from the 2019 Canadian Election Study (we loaded this data at the start of this overview):

levels(canada$pes19_pidtrad)

[1] "Liberal"                          "Conservative"                    
[3] "NDP"                              "Bloc Québécois"                  
[5] "Green"                            "People’s Party"                  
[7] "Another party (please specify)"   "None of these"                   
[9] "Don't know/ Prefer not to answer"

This variable has a category ‘None of these’ and ‘Don’t know/Prefer not to answer’ which for most analysis should be treated as missing data. However, there are currently treated as regular answer categories.

We recommend to always check the levels or values of a variable to check for missing data issues.

This type of problem can also be encountered in interval-ratio variables (numeric variables), where in some cases numbers like 999 are used to indicate missing values. Note: this is something we do not recommend, but you may encounter it in real-world data you are working with.

One example in which the value 999 has been used to indicate missing data. If we calculate the mean for this variable without telling R that 999 is actually missing data, we will overestimate the mean age:

data_age <- data.frame(age = c(55, 64, 37, 56, 999, 42, 47, 22, 49, 68, 59, 999))
mean(data_age$age)  # The result is incorrect, because of the incorrect treatment of the missing values

[1] 208.0833

4.2 Recoding missing data

If there are values in the data that you would like to treat as missing data, you can use na_if from package dplyr. Package dplyr is loaded as part of the tidyverse.

canada <- canada |>
  mutate(pes19_pidtrad = na_if(pes19_pidtrad, 
                               "Don't know/ Prefer not to answer")) |>
  mutate(pes19_pidtrad = na_if(pes19_pidtrad, 
                               "None of these")) |>
  mutate(pes19_pidtrad = droplevels(pes19_pidtrad))

table(canada$pes19_pidtrad, useNA = "ifany") # Display a table including NAs


                       Liberal                   Conservative 
                          1746                           1501 
                           NDP                 Bloc Québécois 
                           693                            186 
                         Green                 People’s Party 
                           274                             64 
Another party (please specify)                           <NA> 
                            23                          33335

mutate(pes19_pidtrad = ...: We are going to change the existing variable pes19_pidtrad.
na_if(pes19_pidtrad, "Don't know/ Prefer not to answer"): This function changes particular values in a variable to NA . In this example we would like to change values of "Don't know/ Prefer not to answer" of pes19_pidtrad into missing values (NA). For your own data you will need to insert the appropriate variable name and the value you would like to have changed to NA.
mutate(pes19_pidtrad = droplevels(pes19_pidtrad)): Finally, we use droplevels to ensure that the levels that we recoded as NA are completely removed as levels from factor pes19_pidtrad, so that in any subsequent analyses these are completely ignored. This is not necessary for other types of variables (numeric or character).

We see that there are no more respondents who answer ‘None of these’ or ‘Don’t know/Prefer not to answer’. Note: Because we have two values that we would like to be transformed into NA, we have two mutate statements.

This also works for replacing numeric values, like in our data_age example. Here we want to change the value of 999 to NA:

data_age <- data_age |>
  mutate(age = na_if(age, 999))
data_age$age

 [1] 55 64 37 56 NA 42 47 22 49 68 59 NA

And the mean will be correctly calculated after recoding the missing values:

mean(data_age$age, na.rm=TRUE)

[1] 49.9

Note: it is better to use na_if than recode to recode missing values.¹

4.3 Filtering out missing data

You can filter out missing data, using the is.na function:

dta |>
  filter(!is.na(Pop2006)) |>
  select(Nation, Pop2006) |>
  head()

             Nation     Pop2006
1           Albania  3137503.17
2           Algeria 33347948.22
3           Andorra    66900.00
4            Angola 16391381.89
5 Antigua & Barbuda    83612.15
6         Argentina 39120455.54

filter(!is.na(Pop2006)): We filter out cases which have a non-missing value on variable Pop2006. Note the ! which means not, so we want only cases that do not have a missing value on Pop2006.

The reason is that if you use recode to recode to missing values, you need to tell R exactly what type of missing data you have, for example NA_character_ instead of just NA for a character variable. Otherwise you will run into incompatible vector problems that are better avoided.↩︎