Below are the solutions to these exercises for answering probability questions with simulation.

#################### # # # Exercise 1 # # # #################### # Number of simulation. The higher this number is, the better our estimates. n <- 1e5L # Number of flips nflips <- 100L # How many in a row for a success comparison <- 10L # Keep track of "successes" num_succe <- 0L # We let 1 represent "heads" and 0 "not heads" for (i in 1:n) { num_succe <- num_succe + (max(rle(rbinom(nflips, 1, 0.5))$lengths) >= comparison) } # Estimated probability num_succe / n

## [1] 0.08387

#################### # # # Exercise 2 # # # #################### # Number of simulation n <- 5e5L # Number the kids 1, 2, 3, ..., 6 according to their alphabetical order kids <- 1L:6L # Keep track of "successes" num_succe <- 0L for (i in 1L:n) { num_succe <- num_succe + identical(sample(kids, size = 6), 1L:6L) } num_succe / n

## [1] 0.00139

# Analytical solution 1 / factorial(6)

## [1] 0.001388889

#################### # # # Exercise 3 # # # #################### # Number of simulation n <- 5e5L # Number the kids 1, 2, 3, ..., 6 and let 1, 2, 3 be girls. kids <- 1L:6L # Keep track of "successes" num_succe <- 0L # Remember that success is when all of the girls come first, which also means the sum of # the first three elements is equal to six. for (i in 1L:n) { num_succe <- num_succe + (sum(sample(kids, size = 3)) == 6L) } # Estimated probability num_succe / n

## [1] 0.05012

# Analytical solution factorial(3)^2 / factorial(6)

## [1] 0.05

#################### # # # Exercise 4 # # # #################### # Number of simulation n <- 1e5L # Run length --- Number of flips nflips <- 6L # Keep track of "successes" num_succe <- 0L # We let 1 represent "heads" and 0 "not heads" for (i in 1:n) { num_succe <- num_succe + (max(rle(rbinom(nflips, 1, 0.5))$lengths) == 1L) } num_succe / n

## [1] 0.03175

# Analytical solution (1/2)^5

## [1] 0.03125

#################### # # # Exercise 5 # # # #################### # Define a deck of cards, and just know that here we are only concerned with the suit. deck <- rep(c("C","D", "S", "H"), times = rep(13, 4)) # Number of simulation n <- 3e5L # Keep track of "successes" num_succe <- 0L for (i in 1L:n) { num_succe <- num_succe + (length(unique(sample(deck, 5L))) == 1L) } num_succe / n

## [1] 0.001833333

# Analytical solution 4 * choose(13, 5) / choose(52, 5)

## [1] 0.001980792

#################### # # # Exercise 6 # # # #################### # Build a complete deck. deck <- matrix(c(rep(as.character(1:13), times = 4), rep(c("C","D", "S", "H"), times = rep(13, 4))), nrow = 52L, dimnames = list(NULL, c("number", "suit"))) # Number of simulation n <- 2e5L # Keep track of "successes" num_succe <- 0L for (i in 1L:n) { # Draw random hand hand <- deck[sample(1L:52L, 13L), ] # Check that 'none is heart,' then proceed to next hand if (!any(hand[, "suit"] == "H")) { # Check also that there is no ace if (!any(hand[, "number"] == "1")) { num_succe <- num_succe + 1L } } } num_succe / n

## [1] 0.00364

# Analytical solution choose(36, 13) / choose(52, 13)

## [1] 0.003638961

#################### # # # Exercise 7 # # # #################### ## Birthday problem # Possible days to be born in days_pos <- 1L:365L # Party sizes to check groups <- c(13, 23, 33, 53) # Number of simulation n <- 1e5L # Keep track of "successes", i.e. two people in the party share a birthday num_succe <- vector(mode = "numeric", length = 4) names(num_succe) <- paste0("party_size_", groups) # Run simulation and add to num_succe each time there is a duplicate in each group for (i in 1L:n) { a <- lapply(groups, function(x) sample(days_pos, size = x, replace = TRUE)) num_succe <- num_succe + sapply(a, function(x) any(duplicated(x))) } # Estimated probability num_succe / n

## party_size_13 party_size_23 party_size_33 party_size_53 ## 0.19519 0.50492 0.77646 0.98052

#################### # # # Exercise 8 # # # #################### ## Problem related to St. Petersburg paradox # Number of simulation n <- 1e5L # Maximum number of throws max_throws <- 1000L # Keep track of "successes" num_succe <- 0L # Costs cost_game <- 15L # Define coin-throwing function coin_throw <- function() { rbinom(1, 1, 0.5) } for (i in 1L:n) { for (throw in 1L:max_throws) { if(coin_throw() == 0L) { num_succe <- num_succe + (2^throw > cost_game) break } } } # Estimated probability num_succe / n

## [1] 0.12411

# Analytical probability throws_needed <- min(which(2^(1:1000) > cost_game)) - 1 1 - sum(1/2^(1:throws_needed))

## [1] 0.125

#################### # # # Exercise 9 # # # #################### # Define coin-throwing function coin_throw <- function() { rbinom(1, 1, 0.5) } # Define a function for one simulation of the problem hht_vs_thh <- function() { # Vector of last three throws. Don't need to store more. last_three <- vector(mode = "numeric", length = 3) # Definition of win/loss win <- c(1, 1, 0) loss <- c(0, 1, 1) # Fill first three for (i in 1L:3L) { last_three[i] <- coin_throw() } # Start checking and adding one more throw if no winner while (TRUE) { if (identical(last_three, win)) { value <- 1L break } else if (identical(last_three, loss)) { value <- 0L break } # Throw once more last_three[1:2] <- last_three[2:3] last_three[3] <- coin_throw() } return(value) } # Keep track of "successes" num_succe <- 0L # Number of simulation n <- 1e5L for (i in 1:n) { num_succe <- num_succe + hht_vs_thh() } num_succe / n

## [1] 0.25092

# Analytical solution 1/4

## [1] 0.25

#################### # # # Exercise 10 # # # #################### ## Monty Hall problem # What can be behind the doors possible_rewards <- c("G", "G", "C") # Name the doors 1, 2, 3 num_door <- 1L:3L # Number of simulation n <- 1e5L # Keep track of "successes" num_succe <- 0L for (i in 1:n) { # What is behind the doors is randomized behind_doors <- sample(possible_rewards) # You choose a door at random chos_door <- sample(num_door, size = 1) # The doors you didn't choose remain_doors <- setdiff(num_door, chos_door) # Doors remaining that have a goat remain_goat_doors <- intersect(which(behind_doors == "G"), remain_doors) # Host selects one remaining goat door at random to show you if (length(remain_goat_doors) == 1L) { host_door <- remain_goat_doors } else { host_door <- sample(remain_goat_doors, 1) } # You strategy is to choose the door that is # not the one the host showed you # and not the one you choose initially # and you succeed if a car is there num_succe <- num_succe + (behind_doors[-c(chos_door, host_door)] == "C") } # Estimated solution num_succe / n

## [1] 0.66489

# Analytical solution 2 / 3

## [1] 0.6666667

Marc says

For Exercise 5, I think the analytical calculation should be (13/52)*(12/51)*(11/50)*(10/49)*(9/48) as you are using up cards from the deck as they are being dealt. Thus the correct answer would be 0.0004951981 (0.05%)

sindri says

Thanks for the comment Marc. My analytical and simulated solution do take into account that we are “sampling without replacement” from the deck. Since the game of poker is very popular the result is also readily available online. But actually taking a closer look your solution, you just need to multiply by 4 to get mine. This represents the first card can be in any of the four suits.

Carlos Pedrotti says

For exercise #9, maybe there’s a simpler solution, working with strings and regex match:

x <- 0L

for(i in 1:100000) {

y <- paste(sample(c("H", "T"), 300, replace = TRUE), collapse = "")

x <- x + (regexpr("HHT", y) < regexpr("THH", y))

}

p <- x/100000

p

Please comment if there's some mistake

Carlos Pedrotti says

For exercise #10, another alternative solution using grep:

x <- 0L

for (i in 1:10000) {

y <- sample(c(1,0,0)) #permutation – 1 for the car, 0 for the goats

z <- 0L

# use [-grep(…)] to discard goats from the y[2:3], considering y[1] the first choice

z <- z + sum(y[2:3][-grep(0, y[2:3])]) # z == 1 only if there is a car in the remainder

if (z == 1) {x <- x+1}

}

p <- x/10000

p

sindri says

Good job Carlos. #10 seems legit on first sight. Regarding solution of exercise 9, how does your code deal with the case when neither pattern appears in the first 300 throws? I suspect failure to deal with this case might result in a very slight (downward) bias. Also, I always like to define constants upfront even though it makes the code longer, see interesting Wikipedia article on the topic here.

Carlos Pedrotti says

If there is no match for both sides, the result for (regexpr(“HHT”, y) < regexpr("THH", y) is FALSE, since -1 !< -1. If only the left side results -1, there would be a false positive. If only the right side results is -1 there would be a false negative. Since the last two probabilities are equal, they annulate each one. And the first probability is < 10^-7, so wouldn't affect the simulation model.

Carlos Pedrotti says

Specifically: p = 1.47e-90 (pretty rare)

Ghislain Gueye says

Exercise 1 reads “what is the probability of having the same side come up 10 times in a row?”.

The solution given here implies that the exercise requires the calculation of the probability of having the same side come up AT LEAST 10 times in a row. The code should be

max(rle(rbinom(nflips, 1, 0.5))$lengths) == comparison

instead of

max(rle(rbinom(nflips, 1, 0.5))$lengths) >= comparison

Notice the “==” and the “>=”

sindri says

Thanks for the comment Ghislain. The devil lies in the detail and I guess I could have made the question clearer. But I will stick with the question as it is since 11 in a row also implies 10, 9, 8, 7, …, 2 in a row happened before that. Doesn’t it?

Ghislain Gueye says

It does; however, a real life situation might require that you need exactly 10 and nothing else (no more and no less). But as you said, it is a minor detail that can be overlooked in this exercise. Thank you for your response and I have further comments that I will post soon.

Raghu says

The analytical for exercise 1 is (90*2^90)/2^100.

Close to the simulation but this assumes a pure 10 and not at least 10.

Evan says

hi, sindri

This problem set seems very intriguing but as far as your R solutions, is there a more simplified route to take?

I’m confused by your code. I’m not too advanced in R but there must be a more comprehendible method?

sindri says

Thanks Evan. I guess there is a room for simplifying the code in all these solutions. However, I don’t think most of the solutions are overly complex. I would suggest playing around with some easier examples using the sample() function and for loops. For example: What is the probability of getting 10 when throwing two standard dices?

Evan says

What do the, for example, 1e5L in question 1 mean? Or 5e5L in question 2 mean?

sindri says

Good question.

These are small details that make for shorter and sometimes faster code. But, honestly, I wouldn’t include them if I were to rewrite the solutions.

Anyway, here are some clues:

3e2 is short for 300, so 3e4 is short for 30000 and 3e-2 for 0.03. It’s R way of writing numbers in ‘scientific notation’.

30L tells R that 30 is to be stored as integer and not point float number. Similar to as.integer(30).

ZiCheng says

Hi Sindri,

thank you so much for the problem sets! I have a different solution to Exercise 1.

B=1e5

n=100

aa<-replicate(B,{

toss=10)

})

sum(aa)/B

The answer is slightly different from yours. I assume it is because of the randomness of rbinom?

ZiCheng says

B=1e5

n=100

library(tidyverse)

aa<-replicate(B,{

toss=10)

})

sum(aa)/B

ZiCheng says

I don’t know why my code didn’t show properly…

B=1e5

n=100

aa<-replicate(B,{toss=10)})

sum(aa)/B

ZiCheng says

I give up… why? why don’t the code show properly…?

Drug Rehab Orlando says

Hi Sindri,

B=1e5

n=100

aa<-replicate(B,{

toss=10)

})

sum(aa)/B