10  Probability in R

10.1 Dice rolling

We can use R to calculate the probabilities of various dice-rolling events, such as the probability of getting specific numbers when rolling a six-sided dice several times. This can be helpful, for example, to check your manual calculations and to see if you grasped the concepts of probability correctly. For dice-rolling events, we can use the dice package. If you are working on your personal computer you may need to install the package first using the command install.packages("dice"). Installing the package is not necessary on university computers.

You can load the package as follows:

library(dice) # Import required package
Warning: package 'dice' was built under R version 4.4.3
Loading required package: gtools
Warning: package 'gtools' was built under R version 4.4.3

For a specified dice-rolling process, the command getEventProb() calculates the probability of an event. We can specify this by passing a list object in to eventList. The package can simulate various scenarios. I will demonstrate the basic functions but you may want to have a look at the additional functions, see the manual.

10.1.1 Probability of rolling a 6 when rolling one six-sided dice

getEventProb(nrolls = 1,
             ndicePerRoll = 1,
             nsidesPerDie = 6,
             eventList = list(6))
[1] 0.1666667
getEventProb(

This part of the code specifies that we want to get the probability for an event.

nrolls = 1,

This integer represents the number of dice rolls to make. In this case, we roll the die once.

ndicePerRoll = 1,

This integer represents the number of dice to use in each dice roll.

nsidesPerDie = 6,

This specifies the number of sides of the dice. You can, for example, change this if you would like to work with a 10 side dice. If you wand to simulate a six-side die, leave it at 6. One short remark here: Modern English considers dice as both singular and plural nouns. However, the R function still uses ‘Die’. You must write the function exactly like this. R is, as mentioned earlier, case sensitive and can only find the function if you write it as nsidesPerDie.

eventList = list(6))

This specifies the event that we are interested in. Note that you have to use a list() command. A list is a type of object in R. For now, you do not need to know more about this. If you want to get the probability of e.g. rolling a 4, you need to change this part of the code to eventList = list(4))

10.1.2 Probability of rolling a 1 or a 4 when rolling one six-sided die

We can change the code according to the event that we are interested in. If we want to know the probability of rolling either a 1 or a 4 when rolling one six-sided die, we would write:

getEventProb(nrolls = 1,
             ndicePerRoll = 1,
             nsidesPerDie = 6,
             eventList = list(c(1,4)))
[1] 0.3333333
getEventProb(

This part of the code specifies that we want to get the probability for an event.

nrolls = 1,

This integer represents the number of dice rolls to make.

ndicePerRoll = 1,

This integer represents the number of dice to use in each dice roll.

nsidesPerDie = 6,

This specifies the number of sides of the dice.

eventList = list(c(1,4)))

This specifies the event that we are interested in. Here, we are interested in a 1 or a 4. You write it by including c(1,4) in your list. You can add more numbers. For example, if you want to calculate the probaility of rolling a 1, 2 or a 5 you would write eventList = list(c(1,2,5))).

10.1.3 Probability of rolling two 1’s when rolling a six-sided die twice

If we want to know the probability of rolling two 1’s when rolling a six-sided die twice, we would write:

getEventProb(nrolls = 2, 
             ndicePerRoll = 1,
             nsidesPerDie = 6,
             eventList = list(1,1),
             orderMatters = TRUE)
[1] 0.02777778
getEventProb(

This part of the code specifies that we want to get the probability for an event.

nrolls = 2,

This integer represents the number of dice rolls to make.

ndicePerRoll = 1,

This integer represents the number of dice to use in each dice roll.

nsidesPerDie = 6,

This specifies the number of sides of the dice.

eventList = list(1,1))

This specifies the event that we are interested in. Here, we are interested in two 1’s. Note that this is different than before because we do not write c(). Instead, we specify that the outcome of interest is, in this example, 1, followed by a 1. For other events, change the numbers in the bracket.

orderMatters = TRUE

This option indicates that the order of conditions in evenList matters. In this example, it makes no difference.

10.2 Deck of cards

We can use R to calculate the probabilities of drawing combinations of cards from a deck of cards. A standard deck of cards has 52 cards in 13 values and 4 suits. The suits are Spade, Club, Diamond and Heart. Each suit has 13 card values: 2-10, 3 “face cards” Jack, Queen, King (J, Q, K) and an Ace. Unfortunately, there is no R package available similar to the dice rolling example. However, the openintro data contains a pre-defined deck of cards which contains all the cards in a standard deck:

data(cards, package = "openintro")  # Specify that we want to use the dataset cards from openintro

There are 52 observations on 4 variables in the data set.

value

a factor with levels 10 2 3 4 5 6 7 8 9 A J K Q

color

a factor with levels black red

suit

a factor with levels Club Diamond Heart Spade

face

a logical vector (TRUE for face cards and FALSE for all other cards)

To calculate the probability of an ‘Ace’ card we can check how often such a card occurs. This can be easily done in R with the filter() function from the tidyverse package. The package is installed on all university computers.

library(tidyverse) # Import required package

We can then use the filter() function to pick cases based on their values. The function was introduced above. For our purpose here, we use a specific way to select cards in our deck as follows:

cards |> 
  filter(value == "A")
# A tibble: 4 × 4
  value color suit    face 
  <fct> <fct> <fct>   <lgl>
1 A     red   Heart   TRUE 
2 A     red   Diamond TRUE 
3 A     black Spade   TRUE 
4 A     black Club    TRUE 
cards |>

This part of the code specifies that we are working with the ‘cards’ data frame. |> is called the pipe operator in R (see week 2).

filter(value == "A")

Here we ask R to pick cases (i.e. filter) if any of the cards (note that this variable is named ‘value’ equals the string ‘A’. Please note that the function is case sensitive (i.e. ‘a’ will not work).

Now that we know that there are 4 aces and because we know that there are 52 cards in our deck, we can simply divide 4 by 52 to get the probability.

4/52
[1] 0.07692308

Similarly, we can test how many ‘Diamond’ cards there are and calculate the probability:

cards |> 
  filter(suit == "Diamond") 
# A tibble: 13 × 4
   value color suit    face 
   <fct> <fct> <fct>   <lgl>
 1 2     red   Diamond FALSE
 2 3     red   Diamond FALSE
 3 4     red   Diamond FALSE
 4 5     red   Diamond FALSE
 5 6     red   Diamond FALSE
 6 7     red   Diamond FALSE
 7 8     red   Diamond FALSE
 8 9     red   Diamond FALSE
 9 10    red   Diamond FALSE
10 J     red   Diamond TRUE 
11 Q     red   Diamond TRUE 
12 K     red   Diamond TRUE 
13 A     red   Diamond TRUE 
13/52
[1] 0.25

If we are interested in face cards, our code changes slightly because this is a TRUE/FALSE vector. We do not include quotation marks but simply write:

cards |> 
  filter(face == TRUE) 
# A tibble: 16 × 4
   value color suit    face 
   <fct> <fct> <fct>   <lgl>
 1 J     red   Heart   TRUE 
 2 Q     red   Heart   TRUE 
 3 K     red   Heart   TRUE 
 4 A     red   Heart   TRUE 
 5 J     red   Diamond TRUE 
 6 Q     red   Diamond TRUE 
 7 K     red   Diamond TRUE 
 8 A     red   Diamond TRUE 
 9 J     black Spade   TRUE 
10 Q     black Spade   TRUE 
11 K     black Spade   TRUE 
12 A     black Spade   TRUE 
13 J     black Club    TRUE 
14 Q     black Club    TRUE 
15 K     black Club    TRUE 
16 A     black Club    TRUE 
16/52
[1] 0.3076923

We can find the occurrence of combinations, e.g. find all cards that are either Ace or a Diamond. To do this, we use the “|” symbol to indicate ‘or’.

cards |> 
  filter(suit == "Diamond" | value == "A")
# A tibble: 16 × 4
   value color suit    face 
   <fct> <fct> <fct>   <lgl>
 1 A     red   Heart   TRUE 
 2 2     red   Diamond FALSE
 3 3     red   Diamond FALSE
 4 4     red   Diamond FALSE
 5 5     red   Diamond FALSE
 6 6     red   Diamond FALSE
 7 7     red   Diamond FALSE
 8 8     red   Diamond FALSE
 9 9     red   Diamond FALSE
10 10    red   Diamond FALSE
11 J     red   Diamond TRUE 
12 Q     red   Diamond TRUE 
13 K     red   Diamond TRUE 
14 A     red   Diamond TRUE 
15 A     black Spade   TRUE 
16 A     black Club    TRUE 

Based on this, there are 16 cards that are either an Ace or a Diamond card. Therefore, the probability of drawing either an Ace card or a Diamond card is 16 divided by 52.

16/52
[1] 0.3076923

You can also use the “&” symbol to indicate ‘AND’, for example to get the number of cards that are red and an Ace.

cards |> 
  filter(color == "red" & value == "A")
# A tibble: 2 × 4
  value color suit    face 
  <fct> <fct> <fct>   <lgl>
1 A     red   Heart   TRUE 
2 A     red   Diamond TRUE 

You can extend the code, for example by writing cards |> filter(value %in% c("A", "K") | suit == "Diamond") if you want to find the number of cards that are either Diamond, or Ace, or King.

10.3 Conditional probabilities with a contingency table

We can explore probabilities within a contingency table. For this illustration, we use the data set from the book called photo_classify from the openintro package. This is a simulated data set for photo classifications based on a machine learning algorithm versus what the true classification is for those photos. While the data are not real, they resemble performance that would be reasonable to expect in a well-built classifier. For more information, see chapter 3.2 in the book.

First, we make a contingency table of the two variables mach_lean and truth:

library(flextable) 

Attaching package: 'flextable'
The following object is masked from 'package:purrr':

    compose

This code displays the percentages across columns:

library(openintro)
data(photo_classify)

table_example <- proc_freq(x = photo_classify, 
                           row = "mach_learn", 
                           col = "truth", 
                           include.row_percent = FALSE, 
                           include.table_percent = FALSE) 
table_example

mach_learn

truth

fashion

not

Total

pred_fashion

Count

197

22

219

Col. pct

63.8%

1.5%

pred_not

Count

112

1,491

1,603

Col. pct

36.2%

98.5%

Total

Count

309

1,513

1,822

The flextable package also allows us to display the table percentages. The code below creates a table with joint probabilities for the photo_classify data. The proportions are calculated by dividing the count in each cell by the table’s total (1822).

table_example2 <- proc_freq(x = photo_classify, 
                           row = "mach_learn", 
                           col = "truth", 
                           include.row_percent = FALSE,
                           include.column_percent = FALSE,
                           include.table_percent = TRUE,
                           include.column_total = TRUE) 
table_example2

mach_learn

truth

fashion

not

Total

pred_fashion

197 (10.8%)

22 (1.2%)

219 (12.0%)

pred_not

112 (6.1%)

1,491 (81.8%)

1,603 (88.0%)

Total

309 (17.0%)

1,513 (83.0%)

1,822 (100.0%)

table_example = proc_freq(…)

We define that we want to create a table called table_example2 using the function proc_freq.

x = photo_classify

This specifies which data set (data.frame) we would like to use.

row = "mach_learn",

This specifies the variable we want to use in the rows of the cross table.

col = "truth",

This specifies the variable we want to use in the column of the cross table.

include.row_percent = FALSE

This specifies that we do not want to include row percentages.

include.column_percent = FALSE,

This specifies that we do not want to include column percentages.

include.table_percent = TRUE

This specifies that we do want to include table percentages.

include.column_total = TRUE)

This specifies that we want to include column totals.

Based on this we can see that in 10.81 % of the cases the classifier correctly predicted the fashion photos (see cell fashion, pred_fashion) and in 6.15 % of the cases, the classifier did not correctly predict a fashion photo. Likewise, in 1.21 % of the cases the classifier predicted a photo that was not a fashion photo as ‘fashion’ and in 81.83 % of the cases the classifier was correct in its assessment that a photo which was not a fashion photo was indeed not a fashion photo.

This is the same table as displayed on p. 96 of the book.

10.3.1 Conditional probability

To further investigate conditional probabilities, we can use the information from the table that we just created:

Assume we want to calculate the probability that a photo was about fashion (truth is fashion) given that the prediction was that the photo is about fashion (mach_learn is pred_fashion). We can see that there were 219 cases in which the prediction was ‘fashion’ and the photo was, in fact, a fashion photo in 197 cases.

P(truth is fashion given mach_learn is pred_fashion ) = \((\frac{197}{219})= 0.900\)

In R, we can calculate the value like this:

197/219
[1] 0.8995434