library(dice) # Import required package
Warning: package 'dice' was built under R version 4.4.3
Loading required package: gtools
Warning: package 'gtools' was built under R version 4.4.3
We can use R to calculate the probabilities of various dice-rolling events, such as the probability of getting specific numbers when rolling a six-sided dice several times. This can be helpful, for example, to check your manual calculations and to see if you grasped the concepts of probability correctly. For dice-rolling events, we can use the dice
package. If you are working on your personal computer you may need to install the package first using the command install.packages("dice")
. Installing the package is not necessary on university computers.
You can load the package as follows:
library(dice) # Import required package
Warning: package 'dice' was built under R version 4.4.3
Loading required package: gtools
Warning: package 'gtools' was built under R version 4.4.3
For a specified dice-rolling process, the command getEventProb()
calculates the probability of an event. We can specify this by passing a list object in to eventList
. The package can simulate various scenarios. I will demonstrate the basic functions but you may want to have a look at the additional functions, see the manual.
getEventProb(nrolls = 1,
ndicePerRoll = 1,
nsidesPerDie = 6,
eventList = list(6))
[1] 0.1666667
getEventProb(
This part of the code specifies that we want to get the probability for an event.
nrolls = 1,
This integer represents the number of dice rolls to make. In this case, we roll the die once.
ndicePerRoll = 1,
This integer represents the number of dice to use in each dice roll.
nsidesPerDie = 6,
This specifies the number of sides of the dice. You can, for example, change this if you would like to work with a 10 side dice. If you wand to simulate a six-side die, leave it at 6. One short remark here: Modern English considers dice as both singular and plural nouns. However, the R function still uses ‘Die’. You must write the function exactly like this. R is, as mentioned earlier, case sensitive and can only find the function if you write it as nsidesPerDie
.
eventList = list(6))
This specifies the event that we are interested in. Note that you have to use a list()
command. A list is a type of object in R. For now, you do not need to know more about this. If you want to get the probability of e.g. rolling a 4, you need to change this part of the code to eventList = list(4))
We can change the code according to the event that we are interested in. If we want to know the probability of rolling either a 1 or a 4 when rolling one six-sided die, we would write:
getEventProb(nrolls = 1,
ndicePerRoll = 1,
nsidesPerDie = 6,
eventList = list(c(1,4)))
[1] 0.3333333
getEventProb(
This part of the code specifies that we want to get the probability for an event.
nrolls = 1,
This integer represents the number of dice rolls to make.
ndicePerRoll = 1,
This integer represents the number of dice to use in each dice roll.
nsidesPerDie = 6,
This specifies the number of sides of the dice.
eventList = list(c(1,4)))
This specifies the event that we are interested in. Here, we are interested in a 1 or a 4. You write it by including c(1,4)
in your list. You can add more numbers. For example, if you want to calculate the probaility of rolling a 1, 2 or a 5 you would write eventList = list(c(1,2,5)))
.
If we want to know the probability of rolling two 1’s when rolling a six-sided die twice, we would write:
getEventProb(nrolls = 2,
ndicePerRoll = 1,
nsidesPerDie = 6,
eventList = list(1,1),
orderMatters = TRUE)
[1] 0.02777778
getEventProb(
This part of the code specifies that we want to get the probability for an event.
nrolls = 2,
This integer represents the number of dice rolls to make.
ndicePerRoll = 1,
This integer represents the number of dice to use in each dice roll.
nsidesPerDie = 6,
This specifies the number of sides of the dice.
eventList = list(1,1))
This specifies the event that we are interested in. Here, we are interested in two 1’s. Note that this is different than before because we do not write c()
. Instead, we specify that the outcome of interest is, in this example, 1, followed by a 1. For other events, change the numbers in the bracket.
orderMatters = TRUE
This option indicates that the order of conditions in evenList
matters. In this example, it makes no difference.
We can use R to calculate the probabilities of drawing combinations of cards from a deck of cards. A standard deck of cards has 52 cards in 13 values and 4 suits. The suits are Spade, Club, Diamond and Heart. Each suit has 13 card values: 2-10, 3 “face cards” Jack, Queen, King (J, Q, K) and an Ace. Unfortunately, there is no R package available similar to the dice rolling example. However, the openintro
data contains a pre-defined deck of cards which contains all the cards in a standard deck:
data(cards, package = "openintro") # Specify that we want to use the dataset cards from openintro
There are 52 observations on 4 variables in the data set.
value
a factor with levels 10 2 3 4 5 6 7 8 9 A J K Q
color
a factor with levels black red
suit
a factor with levels Club Diamond Heart Spade
face
a logical vector (TRUE for face cards and FALSE for all other cards)
To calculate the probability of an ‘Ace’ card we can check how often such a card occurs. This can be easily done in R with the filter()
function from the tidyverse
package. The package is installed on all university computers.
library(tidyverse) # Import required package
We can then use the filter()
function to pick cases based on their values. The function was introduced above. For our purpose here, we use a specific way to select cards in our deck as follows:
|>
cards filter(value == "A")
# A tibble: 4 × 4
value color suit face
<fct> <fct> <fct> <lgl>
1 A red Heart TRUE
2 A red Diamond TRUE
3 A black Spade TRUE
4 A black Club TRUE
cards |>
This part of the code specifies that we are working with the ‘cards’ data frame. |>
is called the pipe operator in R (see week 2).
filter(value == "A")
Here we ask R to pick cases (i.e. filter) if any of the cards (note that this variable is named ‘value’ equals the string ‘A’. Please note that the function is case sensitive (i.e. ‘a’ will not work).
Now that we know that there are 4 aces and because we know that there are 52 cards in our deck, we can simply divide 4 by 52 to get the probability.
4/52
[1] 0.07692308
Similarly, we can test how many ‘Diamond’ cards there are and calculate the probability:
|>
cards filter(suit == "Diamond")
# A tibble: 13 × 4
value color suit face
<fct> <fct> <fct> <lgl>
1 2 red Diamond FALSE
2 3 red Diamond FALSE
3 4 red Diamond FALSE
4 5 red Diamond FALSE
5 6 red Diamond FALSE
6 7 red Diamond FALSE
7 8 red Diamond FALSE
8 9 red Diamond FALSE
9 10 red Diamond FALSE
10 J red Diamond TRUE
11 Q red Diamond TRUE
12 K red Diamond TRUE
13 A red Diamond TRUE
13/52
[1] 0.25
If we are interested in face cards, our code changes slightly because this is a TRUE/FALSE vector. We do not include quotation marks but simply write:
|>
cards filter(face == TRUE)
# A tibble: 16 × 4
value color suit face
<fct> <fct> <fct> <lgl>
1 J red Heart TRUE
2 Q red Heart TRUE
3 K red Heart TRUE
4 A red Heart TRUE
5 J red Diamond TRUE
6 Q red Diamond TRUE
7 K red Diamond TRUE
8 A red Diamond TRUE
9 J black Spade TRUE
10 Q black Spade TRUE
11 K black Spade TRUE
12 A black Spade TRUE
13 J black Club TRUE
14 Q black Club TRUE
15 K black Club TRUE
16 A black Club TRUE
16/52
[1] 0.3076923
We can find the occurrence of combinations, e.g. find all cards that are either Ace or a Diamond. To do this, we use the “|” symbol to indicate ‘or’.
|>
cards filter(suit == "Diamond" | value == "A")
# A tibble: 16 × 4
value color suit face
<fct> <fct> <fct> <lgl>
1 A red Heart TRUE
2 2 red Diamond FALSE
3 3 red Diamond FALSE
4 4 red Diamond FALSE
5 5 red Diamond FALSE
6 6 red Diamond FALSE
7 7 red Diamond FALSE
8 8 red Diamond FALSE
9 9 red Diamond FALSE
10 10 red Diamond FALSE
11 J red Diamond TRUE
12 Q red Diamond TRUE
13 K red Diamond TRUE
14 A red Diamond TRUE
15 A black Spade TRUE
16 A black Club TRUE
Based on this, there are 16 cards that are either an Ace or a Diamond card. Therefore, the probability of drawing either an Ace card or a Diamond card is 16 divided by 52.
16/52
[1] 0.3076923
You can also use the “&” symbol to indicate ‘AND’, for example to get the number of cards that are red and an Ace.
|>
cards filter(color == "red" & value == "A")
# A tibble: 2 × 4
value color suit face
<fct> <fct> <fct> <lgl>
1 A red Heart TRUE
2 A red Diamond TRUE
You can extend the code, for example by writing cards |> filter(value %in% c("A", "K") | suit == "Diamond")
if you want to find the number of cards that are either Diamond, or Ace, or King.
We can explore probabilities within a contingency table. For this illustration, we use the data set from the book called photo_classify
from the openintro
package. This is a simulated data set for photo classifications based on a machine learning algorithm versus what the true classification is for those photos. While the data are not real, they resemble performance that would be reasonable to expect in a well-built classifier. For more information, see chapter 3.2 in the book.
First, we make a contingency table of the two variables mach_lean
and truth
:
library(flextable)
Attaching package: 'flextable'
The following object is masked from 'package:purrr':
compose
This code displays the percentages across columns:
library(openintro)
data(photo_classify)
<- proc_freq(x = photo_classify,
table_example row = "mach_learn",
col = "truth",
include.row_percent = FALSE,
include.table_percent = FALSE)
table_example
mach_learn | truth | |||
---|---|---|---|---|
fashion | not | Total | ||
pred_fashion | Count | 197 | 22 | 219 |
Col. pct | 63.8% | 1.5% | ||
pred_not | Count | 112 | 1,491 | 1,603 |
Col. pct | 36.2% | 98.5% | ||
Total | Count | 309 | 1,513 | 1,822 |
The flextable
package also allows us to display the table percentages. The code below creates a table with joint probabilities for the photo_classify
data. The proportions are calculated by dividing the count in each cell by the table’s total (1822).
<- proc_freq(x = photo_classify,
table_example2 row = "mach_learn",
col = "truth",
include.row_percent = FALSE,
include.column_percent = FALSE,
include.table_percent = TRUE,
include.column_total = TRUE)
table_example2
mach_learn | truth | ||
---|---|---|---|
fashion | not | Total | |
pred_fashion | 197 (10.8%) | 22 (1.2%) | 219 (12.0%) |
pred_not | 112 (6.1%) | 1,491 (81.8%) | 1,603 (88.0%) |
Total | 309 (17.0%) | 1,513 (83.0%) | 1,822 (100.0%) |
table_example = proc_freq(…)
We define that we want to create a table called table_example2
using the function proc_freq
.
x = photo_classify
This specifies which data set (data.frame) we would like to use.
row = "mach_learn",
This specifies the variable we want to use in the rows of the cross table.
col = "truth",
This specifies the variable we want to use in the column of the cross table.
include.row_percent = FALSE
This specifies that we do not want to include row percentages.
include.column_percent = FALSE,
This specifies that we do not want to include column percentages.
include.table_percent = TRUE
This specifies that we do want to include table percentages.
include.column_total = TRUE)
This specifies that we want to include column totals.
Based on this we can see that in 10.81 % of the cases the classifier correctly predicted the fashion photos (see cell fashion
, pred_fashion
) and in 6.15 % of the cases, the classifier did not correctly predict a fashion photo. Likewise, in 1.21 % of the cases the classifier predicted a photo that was not a fashion photo as ‘fashion’ and in 81.83 % of the cases the classifier was correct in its assessment that a photo which was not a fashion photo was indeed not a fashion photo.
This is the same table as displayed on p. 96 of the book.
To further investigate conditional probabilities, we can use the information from the table that we just created:
Assume we want to calculate the probability that a photo was about fashion (truth
is fashion
) given that the prediction was that the photo is about fashion (mach_learn
is pred_fashion
). We can see that there were 219 cases in which the prediction was ‘fashion’ and the photo was, in fact, a fashion photo in 197 cases.
P(truth
is fashion
given mach_learn
is pred_fashion
) = \((\frac{197}{219})= 0.900\)
In R, we can calculate the value like this:
197/219
[1] 0.8995434