2 Filtering, selecting & renaming

Here we will introduce a number of important techniques that will help you to work with existing data sets and to set up your own data sets. This concerns how to select specific columns, how to filter cases based on conditions and how to rename variables.

We will illustrate them on different data sets. One is the ‘deniro.csv’ file from week 1 (available here) which contains Rotten Tomato ratings of movies with Robert De Niro. There are 87 records. Each record has Year, Rating, Title.

Make sure to save the file to your project folder, because that is where R will look for the file when you open it.

You can import this data set in R using the following code:

library(rio) 
deniro_data <- import("deniro.csv")

We can take a look at the first six rows of the data set by using the head() function:

head(deniro_data)

  Year Score                Title
1 1968    86            Greetings
2 1970    17          Bloody Mama
3 1970    73             Hi, Mom!
4 1971    40          Born to Win
5 1973    98         Mean Streets
6 1973    88 Bang the Drum Slowly

2.1 Tibbles

We already encountered data frames in the first week. Data frames are one of the ‘original’ ways in R to interact with data. However, because of the way they are displayed, it can sometimes be difficult to manipulate data in them and to print their content.

A tibble, or tbl_df, is a modern ‘version’ of the data frame. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. Tibbles are therefore easier to use if you have large data sets containing complex objects. While the data set used here is not particularly complex, it makes sense to get used to tibbles.

To be able to work with tibbles, you must install tidyverse install.packages("tidyverse") (already done on University computers). We can load the tidyverse packages by typing:

library(tidyverse)

To transform our data set into a tibble, we can write.

deniro_data <- as_tibble(deniro_data)

deniro_data <- as_tibble(deniro_data): With this code we turn the existing object deniro_data into a tibble, a data frame with class tbl_df. If your data frame has a different name, you need to change the name. When we now open the data frame we can see that the way the data frame is displayed has changed. Note that nothing has been changed in terms of content. All the original variables and values are still the same. It is only a visual change at this point.

deniro_data

# A tibble: 87 × 3
    Year Score Title                 
   <int> <int> <chr>                 
 1  1968    86 Greetings             
 2  1970    17 Bloody Mama           
 3  1970    73 Hi, Mom!              
 4  1971    40 Born to Win           
 5  1973    98 Mean Streets          
 6  1973    88 Bang the Drum Slowly  
 7  1974    97 The Godfather, Part II
 8  1976    41 The Last Tycoon       
 9  1976    99 Taxi Driver           
10  1977    47 1900                  
# ℹ 77 more rows

2.2 Selecting

Once we have opened our data set we can start selecting variables. For example, if we want to select the columns (variables) regarding publication year (Year) and the title of the movie (Title) by their names, we can write:

deniro_data |> 
  select(Year, Title)

# A tibble: 87 × 2
    Year Title                 
   <int> <chr>                 
 1  1968 Greetings             
 2  1970 Bloody Mama           
 3  1970 Hi, Mom!              
 4  1971 Born to Win           
 5  1973 Mean Streets          
 6  1973 Bang the Drum Slowly  
 7  1974 The Godfather, Part II
 8  1976 The Last Tycoon       
 9  1976 Taxi Driver           
10  1977 1900                  
# ℹ 77 more rows

deniro_data |>: This part of the code extract one or multiple columns from deniro_data. Note that we use the native pipe operator (|>), see week 2.
select(Year, Title): Here we select two columns of the tibble, named Year and Title. Note that as a matter of style, we suggest to put any new commands in a pipe on the next line (see overview of week 2)

There are several special functions that can be used inside select(). It is possible to apply a function to the columns and, for example, select only numeric columns:

deniro_data |> 
  select(where(is.numeric))

# A tibble: 87 × 2
    Year Score
   <int> <int>
 1  1968    86
 2  1970    17
 3  1970    73
 4  1971    40
 5  1973    98
 6  1973    88
 7  1974    97
 8  1976    41
 9  1976    99
10  1977    47
# ℹ 77 more rows

deniro_data |>: This part of the code extract one or multiple columns from deniro_data. Note that we use the native pipe operator (|>), see week 2.
select(where(is.numeric)): Here we select only those columns in deniro_data that are numeric.

select() can be also used to remove columns from the data frame. For this we would use the ‘-’ sign:

deniro_data |> 
  select(-Year, -Title)

# A tibble: 87 × 1
   Score
   <int>
 1    86
 2    17
 3    73
 4    40
 5    98
 6    88
 7    97
 8    41
 9    99
10    47
# ℹ 77 more rows

Note that at this point, none of the changes are ‘permanent’ because the resulting data frame (tibble) is not ‘created’ but only printed. You simply ask R to execute the command in the existing data set and print out the result (not save it). If you want to select (or remove) a variable on a permanent basis then you would need to create a new object. So, for example, if you want to only keep the columns (variables) of Year and Movie Title by their names, then you would need to write:

new_deniro_data <- deniro_data |> 
  select(Year, Title)

new_deniro_data <-: This part of the code creates a new object called ‘new_deniro_data’.
deniro_data |> select(Year, Title): This part of the code extract one or multiple columns as a data table.

Note that you do not always have to create a new data set. If you would like to change the existing data set, assign the result to an object with the same name:

deniro_data <- deniro_data |> 
  select(Year, Title)

Watch out: This will overwrite the existing deniro_data with the changes. If you want to get back to the original data set you need to open it again.

2.3 Filtering variables

The filter() function is used to subset the rows. There are a number of important functions and operators that are useful when constructing the expressions used to filter the data:

== means ‘equal to’
< means ‘smaller than’
> means ‘larger than’
<= means ‘equal to or smaller than’
>= means ‘equal to or larger than’
& means ‘AND’
| means ‘OR’
! means ‘is not’
%in% is used to select one of multiple values

To illustrate the use of some of them, I will use the ‘starwars’ data set. This is a data set that is automatically loaded with the tidyverse package. We can load the package by typing:

library(tidyverse)
starwars

# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Let us assume we want to filter by one or more criteria. For a character variable, we would write:

starwars |> 
  filter(eye_color == "blue")

# A tibble: 19 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172  77   blond      fair       blue            19   male  mascu…
 2 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 3 Beru Wh…    165  75   brown      light      blue            47   fema… femin…
 4 Anakin …    188  84   blond      fair       blue            41.9 male  mascu…
 5 Wilhuff…    180  NA   auburn, g… fair       blue            64   male  mascu…
 6 Chewbac…    228 112   brown      unknown    blue           200   male  mascu…
 7 Jek Ton…    180 110   brown      fair       blue            NA   <NA>  <NA>  
 8 Lobot       175  79   none       light      blue            37   male  mascu…
 9 Mon Mot…    150  NA   auburn     fair       blue            48   fema… femin…
10 Qui-Gon…    193  89   brown      fair       blue            92   male  mascu…
11 Finis V…    170  NA   blond      fair       blue            91   male  mascu…
12 Ric Olié    183  NA   brown      fair       blue            NA   male  mascu…
13 Adi Gal…    184  50   none       dark       blue            NA   fema… femin…
14 Mas Ame…    196  NA   none       blue       blue            NA   male  mascu…
15 Cliegg …    183  NA   brown      fair       blue            82   male  mascu…
16 Luminar…    170  56.2 black      yellow     blue            58   fema… femin…
17 Barriss…    166  50   black      yellow     blue            40   fema… femin…
18 Jocasta…    167  NA   white      fair       blue            NA   fema… femin…
19 Tarfful     234 136   brown      brown      blue            NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter(eye_color == "blue"): The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask for a specific eye color.

One mistake that people often make in filtering character variables is that they forget the quotation marks.

If there are multiple eye colors that you would want to select, you can use %in%:

starwars |> 
  filter(eye_color %in% c("blue", "red"))

# A tibble: 24 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 6 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 7 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 8 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 9 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
10 IG-88       200   140 none       metal      red             15   none  mascu…
# ℹ 14 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter(eye_color %in% c("blue", "red")): The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask for the eye_color to be one of the colors in the vector c("blue", "red") , i.e. that eye_color is equal to blue or to red.

For a numeric variable (such as height), we do not use the quotation marks. Let us also filter for all characters that are at least 172 cm tall and have blue eyes.

starwars |> 
  filter(eye_color == "blue" & height >= 172)

# A tibble: 13 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 3 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 4 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 5 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 6 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
 7 Lobot       175    79 none       light      blue            37   male  mascu…
 8 Qui-Gon…    193    89 brown      fair       blue            92   male  mascu…
 9 Ric Olié    183    NA brown      fair       blue            NA   male  mascu…
10 Adi Gal…    184    50 none       dark       blue            NA   fema… femin…
11 Mas Ame…    196    NA none       blue       blue            NA   male  mascu…
12 Cliegg …    183    NA brown      fair       blue            82   male  mascu…
13 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter(eye_color == "blue" & height >= 172): The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask for a specific eye color and at least a specific height.

If you would like to filter characters that are at least 172 cm tall or have blue eyes, you can use:

starwars |> 
  filter(eye_color == "blue" | height >= 172)

# A tibble: 60 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 7 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 8 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 9 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 50 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

You can apply the expressions mentioned above to filter the data. The “!” operator works a bit different because it will select all cases that do not satisfy a condition. So, for example, if we want to filter all cases that do not have blue eyes, we would write:

starwars |> 
  filter(!(eye_color == "blue"))

# A tibble: 68 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 2 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 4 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 5 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Han Solo    180    80 brown      fair       brown           29   male  mascu…
 9 Greedo      173    74 <NA>       green      black           44   male  mascu…
10 Jabba D…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
# ℹ 58 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter(!(eye_color == "blue")): The filter() function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask to only select cases that do not have a specific eye color. Note that we put an exclamation point around the condition and that the condition is surrouded by brackets.

Note again that none of the changes are ‘permanent’. You simply ask R to execute the command in the existing data set ‘temporarily’. If you want to filter on a permanent basis then you would need to create a new object (or assign to the current object). So, for example, if you want to only select the cases whose home planet is ‘Tatooine’, then you would need to write:

new_starwars <- starwars |> 
  filter(homeworld == "Tatooine")

new_starwars <-: This part of the code creates a new object called ‘new_starwars’.
starwars |>: This part of the code takes the starwars tibble as the start of our pipe.
filter(homeworld == "Tatooine"): This part of the code subset a data frame using the condition as specified.

One mistake that people often make in filters is using only one equal sign =, where there should be two ==.

2.4 Renaming variables

To change the names of individual variables we use the rename() function. It uses the order new_name = old_name to rename selected variables.

If we want to rename the column name to character.name, you would write:

starwars |> 
  rename(character.name = name)

# A tibble: 87 × 14
   character.name  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>            <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke Skywalker     172    77 blond      fair       blue            19   male 
 2 C-3PO              167    75 <NA>       gold       yellow         112   none 
 3 R2-D2               96    32 <NA>       white, bl… red             33   none 
 4 Darth Vader        202   136 none       white      yellow          41.9 male 
 5 Leia Organa        150    49 brown      light      brown           19   fema…
 6 Owen Lars          178   120 brown, gr… light      blue            52   male 
 7 Beru Whitesun …    165    75 brown      light      blue            47   fema…
 8 R5-D4               97    32 <NA>       white, red red             NA   none 
 9 Biggs Darkligh…    183    84 black      light      brown           24   male 
10 Obi-Wan Kenobi     182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

However, as was pointed out above, this is not a permanent change. To do this, we need to make sure that it is changed in the existing data frame.

starwars <- starwars |> 
  rename(character.name = name)

starwars <- starwars |>: This part of the code overwrites the existing data frame ‘starwars’.
rename(character.name = name): This part of the code renames a variable with the general pattern: <new_name> = <old_name>