library(rio)
<- import("deniro.csv") deniro_data
2 Filtering, selecting & renaming
Here we will introduce a number of important techniques that will help you to work with existing data sets and to set up your own data sets. This concerns how to select specific columns, how to filter cases based on conditions and how to rename variables.
We will illustrate them on different data sets. One is the ‘deniro.csv’ file from week 1 (available here) which contains Rotten Tomato ratings of movies with Robert De Niro. There are 87 records. Each record has Year, Rating, Title.
Make sure to save the file to your project folder, because that is where R will look for the file when you open it.
You can import this data set in R using the following code:
We can take a look at the first six rows of the data set by using the head()
function:
head(deniro_data)
Year Score Title
1 1968 86 Greetings
2 1970 17 Bloody Mama
3 1970 73 Hi, Mom!
4 1971 40 Born to Win
5 1973 98 Mean Streets
6 1973 88 Bang the Drum Slowly
2.1 Tibbles
We already encountered data frames in the first week. Data frames are one of the ‘original’ ways in R to interact with data. However, because of the way they are displayed, it can sometimes be difficult to manipulate data in them and to print their content.
A tibble, or tbl_df, is a modern ‘version’ of the data frame. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. Tibbles are therefore easier to use if you have large data sets containing complex objects. While the data set used here is not particularly complex, it makes sense to get used to tibbles.
To be able to work with tibbles, you must install tidyverse install.packages("tidyverse")
(already done on University computers). We can load the tidyverse packages by typing:
library(tidyverse)
To transform our data set into a tibble, we can write.
<- as_tibble(deniro_data) deniro_data
deniro_data <- as_tibble(deniro_data)
-
With this code we turn the existing object
deniro_data
into a tibble, a data frame with class tbl_df. If your data frame has a different name, you need to change the name. When we now open the data frame we can see that the way the data frame is displayed has changed. Note that nothing has been changed in terms of content. All the original variables and values are still the same. It is only a visual change at this point.
deniro_data
# A tibble: 87 × 3
Year Score Title
<int> <int> <chr>
1 1968 86 Greetings
2 1970 17 Bloody Mama
3 1970 73 Hi, Mom!
4 1971 40 Born to Win
5 1973 98 Mean Streets
6 1973 88 Bang the Drum Slowly
7 1974 97 The Godfather, Part II
8 1976 41 The Last Tycoon
9 1976 99 Taxi Driver
10 1977 47 1900
# ℹ 77 more rows
2.2 Selecting
Once we have opened our data set we can start selecting variables. For example, if we want to select the columns (variables) regarding publication year (Year
) and the title of the movie (Title
) by their names, we can write:
|>
deniro_data select(Year, Title)
# A tibble: 87 × 2
Year Title
<int> <chr>
1 1968 Greetings
2 1970 Bloody Mama
3 1970 Hi, Mom!
4 1971 Born to Win
5 1973 Mean Streets
6 1973 Bang the Drum Slowly
7 1974 The Godfather, Part II
8 1976 The Last Tycoon
9 1976 Taxi Driver
10 1977 1900
# ℹ 77 more rows
deniro_data |>
-
This part of the code extract one or multiple columns from
deniro_data
. Note that we use the native pipe operator (|>
), see week 2. select(Year, Title)
-
Here we select two columns of the tibble, named
Year
andTitle
. Note that as a matter of style, we suggest to put any new commands in a pipe on the next line (see overview of week 2)
There are several special functions that can be used inside select()
. It is possible to apply a function to the columns and, for example, select only numeric columns:
|>
deniro_data select(where(is.numeric))
# A tibble: 87 × 2
Year Score
<int> <int>
1 1968 86
2 1970 17
3 1970 73
4 1971 40
5 1973 98
6 1973 88
7 1974 97
8 1976 41
9 1976 99
10 1977 47
# ℹ 77 more rows
deniro_data |>
-
This part of the code extract one or multiple columns from
deniro_data
. Note that we use the native pipe operator (|>
), see week 2. select(where(is.numeric))
-
Here we select only those columns in
deniro_data
that are numeric.
select()
can be also used to remove columns from the data frame. For this we would use the ‘-’ sign:
|>
deniro_data select(-Year, -Title)
# A tibble: 87 × 1
Score
<int>
1 86
2 17
3 73
4 40
5 98
6 88
7 97
8 41
9 99
10 47
# ℹ 77 more rows
Note that at this point, none of the changes are ‘permanent’ because the resulting data frame (tibble) is not ‘created’ but only printed. You simply ask R to execute the command in the existing data set and print out the result (not save it). If you want to select (or remove) a variable on a permanent basis then you would need to create a new object. So, for example, if you want to only keep the columns (variables) of Year and Movie Title by their names, then you would need to write:
<- deniro_data |>
new_deniro_data select(Year, Title)
new_deniro_data <-
-
This part of the code creates a new object called ‘new_deniro_data’.
deniro_data |> select(Year, Title)
-
This part of the code extract one or multiple columns as a data table.
Note that you do not always have to create a new data set. If you would like to change the existing data set, assign the result to an object with the same name:
<- deniro_data |>
deniro_data select(Year, Title)
Watch out: This will overwrite the existing deniro_data
with the changes. If you want to get back to the original data set you need to open it again.
2.3 Filtering variables
The filter()
function is used to subset the rows. There are a number of important functions and operators that are useful when constructing the expressions used to filter the data:
== means ‘equal to’
< means ‘smaller than’
> means ‘larger than’
<= means ‘equal to or smaller than’
>= means ‘equal to or larger than’
& means ‘AND’
| means ‘OR’
! means ‘is not’
%in% is used to select one of multiple values
To illustrate the use of some of them, I will use the ‘starwars’ data set. This is a data set that is automatically loaded with the tidyverse
package. We can load the package by typing:
library(tidyverse)
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Let us assume we want to filter by one or more criteria. For a character variable, we would write:
|>
starwars filter(eye_color == "blue")
# A tibble: 19 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Owen La… 178 120 brown, gr… light blue 52 male mascu…
3 Beru Wh… 165 75 brown light blue 47 fema… femin…
4 Anakin … 188 84 blond fair blue 41.9 male mascu…
5 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
6 Chewbac… 228 112 brown unknown blue 200 male mascu…
7 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
8 Lobot 175 79 none light blue 37 male mascu…
9 Mon Mot… 150 NA auburn fair blue 48 fema… femin…
10 Qui-Gon… 193 89 brown fair blue 92 male mascu…
11 Finis V… 170 NA blond fair blue 91 male mascu…
12 Ric Olié 183 NA brown fair blue NA male mascu…
13 Adi Gal… 184 50 none dark blue NA fema… femin…
14 Mas Ame… 196 NA none blue blue NA male mascu…
15 Cliegg … 183 NA brown fair blue 82 male mascu…
16 Luminar… 170 56.2 black yellow blue 58 fema… femin…
17 Barriss… 166 50 black yellow blue 40 fema… femin…
18 Jocasta… 167 NA white fair blue NA fema… femin…
19 Tarfful 234 136 brown brown blue NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
filter(eye_color == "blue")
-
The
filter()
function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask for a specific eye color.
One mistake that people often make in filtering character variables is that they forget the quotation marks.
If there are multiple eye colors that you would want to select, you can use %in%
:
|>
starwars filter(eye_color %in% c("blue", "red"))
# A tibble: 24 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Beru Wh… 165 75 brown light blue 47 fema… femin…
5 R5-D4 97 32 <NA> white, red red NA none mascu…
6 Anakin … 188 84 blond fair blue 41.9 male mascu…
7 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
8 Chewbac… 228 112 brown unknown blue 200 male mascu…
9 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
10 IG-88 200 140 none metal red 15 none mascu…
# ℹ 14 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
filter(eye_color %in% c("blue", "red"))
-
The
filter()
function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask for theeye_color
to be one of the colors in the vectorc("blue", "red")
, i.e. that eye_color is equal to blue or to red.
For a numeric variable (such as height), we do not use the quotation marks. Let us also filter for all characters that are at least 172 cm tall and have blue eyes.
|>
starwars filter(eye_color == "blue" & height >= 172)
# A tibble: 13 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Owen La… 178 120 brown, gr… light blue 52 male mascu…
3 Anakin … 188 84 blond fair blue 41.9 male mascu…
4 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
5 Chewbac… 228 112 brown unknown blue 200 male mascu…
6 Jek Ton… 180 110 brown fair blue NA <NA> <NA>
7 Lobot 175 79 none light blue 37 male mascu…
8 Qui-Gon… 193 89 brown fair blue 92 male mascu…
9 Ric Olié 183 NA brown fair blue NA male mascu…
10 Adi Gal… 184 50 none dark blue NA fema… femin…
11 Mas Ame… 196 NA none blue blue NA male mascu…
12 Cliegg … 183 NA brown fair blue 82 male mascu…
13 Tarfful 234 136 brown brown blue NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
filter(eye_color == "blue" & height >= 172)
-
The
filter()
function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask for a specific eye color and at least a specific height.
If you would like to filter characters that are at least 172 cm tall or have blue eyes, you can use:
|>
starwars filter(eye_color == "blue" | height >= 172)
# A tibble: 60 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Owen La… 178 120 brown, gr… light blue 52 male mascu…
4 Beru Wh… 165 75 brown light blue 47 fema… femin…
5 Biggs D… 183 84 black light brown 24 male mascu…
6 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
7 Anakin … 188 84 blond fair blue 41.9 male mascu…
8 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
9 Chewbac… 228 112 brown unknown blue 200 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 50 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
You can apply the expressions mentioned above to filter the data. The “!” operator works a bit different because it will select all cases that do not satisfy a condition. So, for example, if we want to filter all cases that do not have blue eyes, we would write:
|>
starwars filter(!(eye_color == "blue"))
# A tibble: 68 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 C-3PO 167 75 <NA> gold yellow 112 none mascu…
2 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
3 Darth V… 202 136 none white yellow 41.9 male mascu…
4 Leia Or… 150 49 brown light brown 19 fema… femin…
5 R5-D4 97 32 <NA> white, red red NA none mascu…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Han Solo 180 80 brown fair brown 29 male mascu…
9 Greedo 173 74 <NA> green black 44 male mascu…
10 Jabba D… 175 1358 <NA> green-tan… orange 600 herm… mascu…
# ℹ 58 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
filter(!(eye_color == "blue"))
-
The
filter()
function is used to subset a data frame, retaining all rows that satisfy your conditions. Here we ask to only select cases that do not have a specific eye color. Note that we put an exclamation point around the condition and that the condition is surrouded by brackets.
Note again that none of the changes are ‘permanent’. You simply ask R to execute the command in the existing data set ‘temporarily’. If you want to filter on a permanent basis then you would need to create a new object (or assign to the current object). So, for example, if you want to only select the cases whose home planet is ‘Tatooine’, then you would need to write:
<- starwars |>
new_starwars filter(homeworld == "Tatooine")
new_starwars <-
-
This part of the code creates a new object called ‘new_starwars’.
starwars |>
-
This part of the code takes the starwars tibble as the start of our pipe.
filter(homeworld == "Tatooine")
-
This part of the code subset a data frame using the condition as specified.
One mistake that people often make in filters is using only one equal sign =, where there should be two ==.
2.4 Renaming variables
To change the names of individual variables we use the rename()
function. It uses the order new_name = old_name
to rename selected variables.
If we want to rename the column name
to character.name
, you would write:
|>
starwars rename(character.name = name)
# A tibble: 87 × 14
character.name height mass hair_color skin_color eye_color birth_year sex
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Luke Skywalker 172 77 blond fair blue 19 male
2 C-3PO 167 75 <NA> gold yellow 112 none
3 R2-D2 96 32 <NA> white, bl… red 33 none
4 Darth Vader 202 136 none white yellow 41.9 male
5 Leia Organa 150 49 brown light brown 19 fema…
6 Owen Lars 178 120 brown, gr… light blue 52 male
7 Beru Whitesun … 165 75 brown light blue 47 fema…
8 R5-D4 97 32 <NA> white, red red NA none
9 Biggs Darkligh… 183 84 black light brown 24 male
10 Obi-Wan Kenobi 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
However, as was pointed out above, this is not a permanent change. To do this, we need to make sure that it is changed in the existing data frame.
<- starwars |>
starwars rename(character.name = name)
starwars <- starwars |>
-
This part of the code overwrites the existing data frame ‘starwars’.
rename(character.name = name)
-
This part of the code renames a variable with the general pattern:
<new_name> = <old_name>