
In today’s set we take a break from hypothesis testing and return to the fundamentals of statistics: probability. Specifically, in this set you will see how to compute probabilities of complex events, use conditional and marginal distribution functions, and learn to sample from and plot a multivariate distribution function.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

**Exercise 1**

So far we know that probabilities take a value between 0 and 1. We know that the probabilities of single events that form a set can be added together to compute the probability that any event in that set is realized. For example, the probability of getting hit by a bus on a given day or bitten by a shark is equal to the sum of those two probabilities. However, we can say that only because it’s almost impossible that both events happen on the same day (if you know somebody who got bitten by a shark, survived, then got hit by a bus, please stay far from them for your own safety!). Those kinds of events are called mutually exclusive and can be identified by looking at the Venn diagram of the outcomes. More info here. For those interested: when two events are not mutually exclusive, we can still add their probabilities together to get the probability of realizing one event or the other, but we must subtract the probability of both events happening from the total. The next exercise should give you an idea of why we must subtract this value.
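To see why the overlap must be subtracted, here is a small R sketch with made-up bug counts (the numbers 40, 25 and 10 are hypothetical, not taken from the exercise’s dataset):

```r
# Hypothetical counts: 40 graphic issues, 25 collision bugs,
# 10 bugs that fall in both categories
n.graphic   <- 40
n.collision <- 25
n.both      <- 10
n.total     <- n.graphic + n.collision - n.both  # 55 distinct bugs

p.graphic   <- n.graphic / n.total
p.collision <- n.collision / n.total
p.both      <- n.both / n.total

# Adding p.graphic and p.collision counts the overlap twice,
# so we subtract p.both once:
p.either <- p.graphic + p.collision - p.both
p.either  # equals 1 here, since every bug is in at least one category
```

Note that `p.graphic + p.collision` alone would exceed 1, because the 10 overlapping bugs are counted twice.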

The quality assurance department of a video game studio classifies the bugs it finds into two categories: graphic issues and collision bugs. One of the testers created this dataset compiling the bugs he found during an average workday.

- Use the `VennDiagram` package to draw the Venn diagram of the dataset.
- What is the probability of finding only a graphic issue? Of finding only a collision bug?
- What is the probability that the tester finds a graphic bug that is also a collision bug?
- What is the probability that the tester finds a graphic bug or a collision bug?

**Exercise 2**

If we have two events A and B, we know how to compute the probability that A or B happens. Now, if you want to know the probability of observing A and B, there are two possible scenarios: one where the realization of A influences the probability of the realization of B, and one where the probability of B stays the same whether A happens or not. The latter case is the easier to compute: we just have to multiply both probabilities to get the probability that both events are realized.

This result can be extended to more than two events. For example, if you flip a coin three times and want to know the probability of getting three heads, you know that each coin flip result doesn’t influence the next result. As a consequence, you can just multiply the probabilities of each event, in this case 0.5*0.5*0.5=0.125, to get the probability of this particular result.
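A quick simulation (a sketch, with an arbitrary number of trials) confirms the 0.125 figure:

```r
set.seed(1)
# Simulate 3 independent coin flips, 100000 times: 1 = heads, 0 = tails
flips <- matrix(sample(c(0, 1), 3 * 100000, replace = TRUE), ncol = 3)

# Proportion of trials where all three flips were heads
p.three.heads <- mean(rowSums(flips) == 3)
p.three.heads   # close to the theoretical 0.5^3 = 0.125
```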

- Sample with replacement 500 integers between 1 and 10 and store the result in a vector called Event.A.
- Sample with replacement 500 integers between 1 and 5 and store the result in a vector called Event.B.
- Each element of the two vectors represents the result of a simultaneous draw. Empirically, what is the probability of drawing the number 5 in both vectors at the same time? What is the probability of drawing a 1 in the first vector and a number bigger than 3 in the second?
- Use the multiplication rule to compute the probabilities of those events and compare them with the empirical results above.

**Exercise 3**

When the realization of an event A changes the probability of the realization of an event B, we estimate what is called a conditional probability. To do so, we use the same process we used to estimate a probability, but since the event A changes the possible outcomes of event B, we use the number of those possible outcomes as the denominator in our formula. So the general formula for estimating a probability, (number of observations of B)/(total number of observations), becomes (number of observations of B when A happened)/(total number of observations when A happened). You can find a more formal definition here.
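A sketch of this estimation formula on a hypothetical die-rolling example, where A is “the roll is even” and B is “the roll is greater than 3”:

```r
set.seed(1)
# Hypothetical data: 10000 rolls of a fair six-sided die
rolls <- sample(1:6, 10000, replace = TRUE)
A <- rolls %% 2 == 0   # event A: the roll is even
B <- rolls > 3         # event B: the roll is greater than 3

# P(B | A): restrict the denominator to the outcomes where A happened
p.B.given.A <- sum(B & A) / sum(A)
p.B.given.A   # theoretical value: 2/3 (faces 4 and 6 out of {2, 4, 6})
```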

- Load this dataset and explore it (make a histogram and list the unique observed value).
- Compute the probability to observe each value.
- There seem to be two sub-processes that compose those random events. Let’s assume that this dataset represents a lottery where you have 1 chance out of 100 to get a bonus that multiplies your prize by 10, and that this bonus appears only in a winning situation. In this case, we could be interested in knowing the probability of winning and not having the bonus. Use the dataset to estimate the probabilities of those individual events.
- Compute the probabilities of winning each amount when the bonus is applied.

**Exercise 4**

At a rural fair, people can pay 5 dollars to play a game where they choose to open one of three doors and pick a plastic ball from a closed box that sits behind the door. If the ball is red, they win 50 dollars; if the ball is blue, they win nothing. Each box contains 50 balls, but the number of red balls changes from one box to another. A bored statistician has spent an afternoon compiling which door was chosen by 450 players and whether they won.

- Load this dataset.
- Estimate the probability of winning at this game.
- Estimate the probability of winning at this game, if you choose the first door, the second door or the third door.
- Create a contingency table of this situation.
- Use the table to compute the conditional probability of winning if someone chose the first door, the second or the third door.

Just as for ordinary probabilities, we can create a distribution from the conditional probabilities to better understand how a random process behaves. The easiest way to compute such a distribution is to use a contingency table, where all the outcomes of two events are listed in the margins and the elements are the numbers of observations of each combination of outcomes. The conditional distribution given that an outcome Ai happened corresponds to the ECDF computed using the observations in the row or column of Ai.

Another useful distribution is the marginal distribution, which is the distribution of the individual events A and B. The name marginal comes from the fact that when using a contingency table to estimate it, we must use the totals of each row and column to compute the ECDF, and those values are often put in the margins. The next exercise should help you get familiar with those concepts.
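Here is a sketch of how such tables can be built and read in R. The door/win data below are simulated stand-ins, not the exercise’s dataset:

```r
set.seed(1)
# Hypothetical sample: door chosen and game outcome for 450 players
door <- sample(c("door1", "door2", "door3"), 450, replace = TRUE)
won  <- sample(c("win", "lose"), 450, replace = TRUE, prob = c(0.3, 0.7))

tab <- table(door, won)            # contingency table
addmargins(tab)                    # same table with totals in the margins

prop.table(tab)                    # joint probabilities
margin.table(tab, 1) / sum(tab)    # marginal distribution of the doors
prop.table(tab, 1)                 # conditional distribution of the outcome given the door
```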

**Exercise 5**

A sample of 50 articles from three websites on the same subject has been analyzed by a professional fact checker to assess the quality of their news coverage. Each article has been classified into one of three categories: factually correct, mostly correct and fake news. The following dataset shows the results of his work.

- What is the probability of getting factually correct, mostly correct and fake news by looking at a random article from one of those sites?
- What is the probability of reading a fake news article on the first website?
- What is the probability of reading the second website if you are reading a factually correct article?
- What is the marginal distribution in this situation?
- What is the conditional distribution for the mostly correct news?

Let’s look at the multivariate normal distribution and how the marginal and conditional distributions are used in this case. Basically, a multivariate normal distribution is a distribution of dimension higher than 1 whose components are normally distributed.

**Exercise 6**

- Generate 2000 points from a standard normal distribution and store the results in a vector called x.
- Generate 2000 points from a normal distribution of mean 10 and a standard deviation of 5 and store the result in a vector called y.
- Create a matrix with two columns x and y which will be the coordinates of 2000 points.
- Make a basic plot of the points in that matrix and draw the histograms of both x and y.

We know the marginal distributions of the multivariate normal distribution of the last exercise: they are the distributions of the x and y variables. Fun fact: the projection of the multivariate normal distribution onto the x-z plane is identical to the distribution of the variable x, i.e. if we look at the 3D histogram of those points by putting our eye over the x axis, the shape of the curve looks like the distribution of x. The same goes for the projection of the curve onto the y-z plane.
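The sampling described in Exercise 6, together with the marginal histograms just discussed, can be sketched as:

```r
set.seed(1)
x <- rnorm(2000)                    # standard normal
y <- rnorm(2000, mean = 10, sd = 5) # normal with mean 10, sd 5
xy <- cbind(x, y)                   # 2000 points in the plane

# The marginal distributions are just the distributions of each column
plot(xy, pch = ".")
hist(x)
hist(y)
```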

**Exercise 7**

Create a histogram of the points from the matrix of the last exercise whose x coordinates are bigger than 1.3 but smaller than 1.5. Then do the same for the points whose y coordinates are between 10 and 11.

Those are the conditional distributions for some fixed value of x or y. We can see that those conditional distributions are also normally distributed!

**Exercise 8**

Earlier we made a basic plot of the points from this multivariate distribution, but that plot didn’t show the shape of the distribution. We can do better. Use the `plot3D` package and the `hist3D()` function (more details here) to draw the 3D histogram of the dataset from the last exercise.

**Exercise 9**

Another way to represent a 3D distribution in 2D is to use a heatmap. Draw the heatmap of your sample by using:

- the `image2D` function from the `plot3D` package.
- the `hist2d()` function from the `gplots` package.
- the `hexbinplot()` function from the `hexbin` package.

**Exercise 10**

The factors x and y of our multivariate normal distribution are independent, meaning that the value of one doesn’t influence the value of the other. To create a more realistic sample, you should use the `mvrnorm()` function from the `MASS` package, which lets you pass as an argument a matrix containing the covariances between the variables. Covariance is a measure of the dependence between two factors; its normalized version, the correlation, takes a value between -1 and 1. You can read more about it here.

Use the `mvrnorm()` function to sample 500 points from a multivariate normal distribution of dimension two. The marginal distribution of the first factor is a normal distribution with a mean of 5 and a standard deviation of 3, while the marginal distribution of the second has a mean of 9 and a standard deviation of 1.5. The covariance between the two factors is 0.6.

Then draw the heatmap of this distribution.
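A sketch of the sampling step (the heatmap itself can then be drawn with any of the tools from Exercise 9):

```r
library(MASS)
set.seed(1)

# Covariance matrix: variances on the diagonal (3^2 = 9 and 1.5^2 = 2.25),
# the covariance of 0.6 off the diagonal
Sigma <- matrix(c(9, 0.6, 0.6, 2.25), nrow = 2)
sample.2d <- mvrnorm(n = 500, mu = c(5, 9), Sigma = Sigma)

colMeans(sample.2d)   # should be close to (5, 9)
cov(sample.2d)        # should be close to Sigma
```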


If you are not familiar with `apply`, check the R documentation.

Note: We are going to use random number functions and random process functions in R, such as `runif`. A problem with these functions is that every time you run them, you obtain a different value. To make your results reproducible, you can specify the value of the seed using `set.seed()` with any number before calling a random function. (If you are not familiar with seeds, think of them as the tracking number of your random number process.) For this set of exercises, we will use `set.seed(1)`. Don’t forget to specify it before every exercise that includes random numbers.

Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

**Generating dice rolls** Set your seed to 1 and generate 30 random numbers using `runif`. Save them in an object called `random_numbers`. Then use the `ceiling` function to round the values up. These values represent rolled dice values.
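The instructions don’t pin down the interval passed to `runif`; one reading, assuming the draws are on (0, 6), is:

```r
set.seed(1)
random_numbers <- runif(30, min = 0, max = 6)  # 30 uniform draws on (0, 6)
dice <- ceiling(random_numbers)                # round up to an integer 1..6
dice
```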

**Exercise 2**

Simulate one dice roll using the function `rmultinom`. Make sure `n = 1` is inside the function call, and save the result in an object called `die_result`. The matrix `die_result` is a collection of 1 one and 5 zeros, with the one indicating which value was obtained during the process. Use the function `which` to create an output that shows only the value obtained after the die is rolled.
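A sketch of this step:

```r
set.seed(1)
# One roll of a fair six-sided die: a 6 x 1 matrix of five 0s and one 1
die_result <- rmultinom(n = 1, size = 1, prob = rep(1/6, 6))
which(die_result == 1)   # row index of the 1, i.e. the face rolled
```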

**Exercise 3**

Using `rmultinom`, simulate 30 dice rolls. Save the result in a variable called `dice_result` and use `apply` to transform the matrix into a vector with the result of each die.

**Exercise 4**

Some gambling games use 2 dice whose values are summed after the roll. Simulate throwing 2 dice 30 times and record the sum of the values of each pair: use `rmultinom` to simulate the throws and the function `apply` to record the sum of the values of each experiment.
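One way to sketch this, converting each column’s face counts into a sum of pips:

```r
set.seed(1)
# Each column is one experiment: two dice, each landing on a face 1..6
two.dice <- rmultinom(n = 30, size = 2, prob = rep(1/6, 6))

# Multiply each face count by its face value and sum down the columns
sums <- apply(two.dice, 2, function(counts) sum(counts * 1:6))
sums
```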

**Exercise 5**

Simulate normal distribution values. Imagine a population in which the average height is 1.70 m with a standard deviation of 0.1. Using `rnorm`, simulate the height of 100 people and save it in an object called `heights`.

To get an idea of the values of `heights`, use the function `summary`.

**Exercise 6**

90% of the population is shorter than ____________?

**Exercise 7**

What percentage of the population is taller than 1.60 m?
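One way to approach Exercises 5 to 7 (a sketch, not the official solution):

```r
set.seed(1)
heights <- rnorm(100, mean = 1.70, sd = 0.1)

quantile(heights, 0.9)  # 90% of the sampled population is shorter than this
mean(heights > 1.60)    # proportion of the sample taller than 1.60 m

# For comparison, the theoretical values:
# qnorm(0.9, 1.70, 0.1) and 1 - pnorm(1.60, 1.70, 0.1)
```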

**Exercise 8**

Run the following lines of code before this exercise. They will load the library required for the exercise.

`if (!'MASS' %in% installed.packages()) install.packages('MASS')`
`library(MASS)`

Simulate 1000 people with height and weight using the function `mvrnorm` with `mu = c(1.70, 60)` and `Sigma = matrix(c(.1, 3.1, 3.1, 100), nrow = 2)`.

**Exercise 9**

How many people from the simulated population are taller than 1.70 m and heavier than 60 kg?

**Exercise 10**

How many people from the simulated population are taller than 1.75 m and lighter than 60 kg?


Until now, we used random variable simulation and bootstrapping to test hypotheses and compute statistics of a single sample. In today’s set, we’ll learn how to use permutation to test hypotheses about two different samples and how to adapt bootstrapping to this situation.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

**Exercise 1**

- Generate 500 points from a beta distribution with parameters a=2 and b=1.5, then store the result in a vector named beta1.
- Generate 500 points from the same distribution and store those points in a vector named beta2.
- Concatenate both vectors to create a vector called beta.data.
- Plot the ecdf of beta1 and beta2.
- Sample 500 points from beta.data and plot the ecdf of this sample. Repeat this process 5 times.
- Do all those samples share the same distribution, and if so, what is that distribution?

**Exercise 2**

When we test a hypothesis, we suppose that the hypothesis is true, we simulate what would happen if that’s the case, and if our initial observation happens less than α percent of the time, we reject the hypothesis. Now, from the first exercise, we know that if two samples share the same distribution, we can assume that any sample drawn from their pooled observations will follow the same distribution. In particular, if we shuffle the observations from a sample of size n1 together with those of a sample of size n2 and draw two new samples of sizes n1 and n2, they should all have a similar CDF. We can use this fact to test the hypothesis that two samples have the same distribution. This process is called a permutation test.

Load this dataset, in which each column represents a variable, and suppose we want to know if the two variables are identically distributed. Each step below is part of a permutation test.

- What are the null and alternative hypotheses for this test?
- Concatenate both samples into a new vector called data.ex.2.
- Write a function that takes data.ex.2 and the sizes of both samples as arguments, creates a temporary vector by permuting data.ex.2 and returns two new samples. The first sample has the same number of observations as the first column of the dataset; the second is made from the rest of the observations. Name this function permutation.sample (we will use it in the next exercise). Why do we want the function to return samples of those sizes?
- Plot the ECDF of both initial variables in black.
- Use the function permutation.sample 100 times to generate permuted samples, then compute the ECDFs of those samples and add those curves to the previous plot. Use the color red for the first batch of samples and green for the second batch.
- By looking at the plot, can you tell if the null hypothesis is true?
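A sketch of `permutation.sample`, with simulated stand-ins for the two columns of the dataset:

```r
set.seed(1)
# Hypothetical stand-ins for the two columns of the dataset
sample.1 <- rnorm(100)
sample.2 <- rnorm(120)
data.ex.2 <- c(sample.1, sample.2)

# Shuffle the pooled data, then split it back into samples of the original sizes
permutation.sample <- function(data, n1, n2) {
  permuted <- sample(data, size = n1 + n2, replace = FALSE)
  list(sample1 = permuted[1:n1],
       sample2 = permuted[(n1 + 1):(n1 + n2)])
}

perm <- permutation.sample(data.ex.2, length(sample.1), length(sample.2))
```

The sizes matter because, under the null hypothesis, each permuted sample must mimic one of the original samples exactly, differing only in which pooled observations it received.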

**Exercise 3**

A business analyst thinks that the daily returns of Apple stock follow a normal distribution with a mean of 0 and a standard deviation of 0.1. Use this dataset of the daily returns of the stock for the last 10 years to test this hypothesis.

**Exercise 4**

Permutation tests can help us verify whether two samples come from the same distribution; if that is true, we can conclude that both samples share the same statistics. As a consequence, permutation tests can also be used to test whether a statistic of two samples is the same. One really useful application is testing whether two means are the same or significantly different (as you have probably realized by now, statisticians are obsessed with means and love to spend time studying them!). In this situation, the question is whether the difference of means between two samples is random or a consequence of a difference in distributions.

You should be quite familiar with tests by now, so how would you proceed with a permutation test to verify whether two means are equal? Use that process to test the equality of the means of both samples in this dataset.

**Exercise 5**

Looking at the average annual wages of the United States and Switzerland, both countries have relatively the same level of wealth, since those statistics are 60,154 and 60,124 US dollars respectively. In this dataset, you will find simulated annual wages of citizens of both countries. Test the hypothesis that the Americans and the Swiss have the same average annual wage, based on those samples, at a level of 5%.

**Exercise 6**

To test whether two samples from different distributions have the same statistic, we cannot use the permutation test; we instead use bootstrapping. To test whether two samples have the same mean, for example, you should follow these steps:

- Formulate a null and an alternative hypothesis.
- Set a significance level.
- Compute the difference of mean of both samples. This will be the reference value we will use to compute the p-value.
- Concatenate both samples and compute the mean of this new dataset.
- Shift both samples so that they share the mean of the concatenated dataset.
- Use bootstrap to generate an estimate of the mean of both shifted samples.
- Compute the difference of both means.
- Repeat the last two steps at least 1000 times.
- Compute the p-value and draw a conclusion.
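The steps above can be sketched as follows, with simulated wages standing in for the real dataset:

```r
set.seed(1)
# Hypothetical samples standing in for the US and Swiss wages
sample.a <- rnorm(200, mean = 60154, sd = 15000)
sample.b <- rnorm(200, mean = 60124, sd = 12000)

observed.diff <- mean(sample.a) - mean(sample.b)

# Shift both samples so they share the mean of the pooled data (the null hypothesis)
pooled.mean <- mean(c(sample.a, sample.b))
shifted.a <- sample.a - mean(sample.a) + pooled.mean
shifted.b <- sample.b - mean(sample.b) + pooled.mean

# Bootstrap the difference of means under the null
boot.diff <- replicate(1000, {
  mean(sample(shifted.a, replace = TRUE)) - mean(sample(shifted.b, replace = TRUE))
})

# Two-sided p-value: how often is the bootstrap difference as extreme as observed?
p.value <- mean(abs(boot.diff) >= abs(observed.diff))
p.value
```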

Use the dataset from the last exercise to see if the USA and Switzerland have the same average wage at a level of 5%.

**Exercise 7**

Test the hypothesis that both samples in this dataset have the same mean.

**Exercise 8**

R has functions that use analytic methods to test whether two samples have equal means.

- Use the `t.test()` function to test the equality of the means of the samples from the last exercise.
- Use this function to test the hypothesis that the average wage in the US is higher than in Switzerland.

**Exercise 9**

The globular cluster luminosity dataset lists measurements of the luminosity of star clusters in different regions of the Milky Way and the Andromeda galaxy. Test the hypothesis that the average luminosities of the two galaxies differ by 24.78.

**Exercise 10**

A company that molds aluminum for auto parts has bought a smaller company to increase the number of parts it can produce each year. In its factory, the smaller company used the standard equipment, but it used a different factory layout, had a different supply line and managed its employees’ work schedules in a completely different manner than its new parent company. Before changing the company culture, the engineers at the parent company want to know which approach is more effective. To do so, they measured the time it took to make an auto part in each factory, 150 times each, and created this dataset, where the first column represents the sample from the smaller factory.

- Is the average time it takes to make a part the same in both factories?
- Does the production time follow the same distribution in both factories?
- If the engineers want to minimize the percentage of parts that take more than one hour to make, which setup should they implement in both factories: that of the parent company or that of the smaller company?


Probability is at the heart of data science. Simulation is also commonly used in algorithms such as the bootstrap. After completing this exercise, you will have a slightly stronger intuition for probability and for writing your own simulation algorithms.

Most of the problems in this set have an exact analytical solution, which is not the case for all probability problems, but they are great for practice since we can check against the exact correct answer.

To get the most out of the exercises, it pays off to read the instructions carefully and think about what the solution should be before starting to write `R` code. Often this helps you weed out irrelevant information that can otherwise make your algorithm unnecessarily complicated.

Answers are available here.

**Exercise 1**

In 100 coin tosses, what is the probability of having the same side come up 10 times in a row?

You might want to use some of the following functions to answer this question: `sample()`, `rbinom()`, `rle()`.
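A sketch of how `rle` helps here: it encodes a sequence as runs, so the longest run of identical tosses is just the largest run length.

```r
set.seed(1)
# One trial: 100 coin tosses; check for a run of at least 10 equal outcomes
has.long.run <- function() {
  tosses <- sample(c("H", "T"), 100, replace = TRUE)
  max(rle(tosses)$lengths) >= 10
}

# Estimate the probability over many trials
p.run <- mean(replicate(10000, has.long.run()))
p.run
```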

**Exercise 2**

Six kids are standing in line. What is the probability that they are in alphabetical order by name? Assume no two children have the same exact name.

**Exercise 3**

Remember the kids from the last question? There are three boys and three girls. How likely is it that all the girls come first?

**Exercise 4**

In six coin tosses, what is the probability of having a different side come up with each throw, that is, that you never get two tails or two heads in a row?

**Exercise 5**

A random five-card poker hand is dealt from a standard deck. What is the chance of a flush (all cards are the same suit)?

**Exercise 6**

In a random thirteen-card hand from a standard deck, what is the probability that none of the cards is an ace and none is a heart (♥)?

**Exercise 7**

At four parties each attended by 13, 23, 33, and 53 people respectively, how likely is it that at least two individuals share a birthday at each party? Assume there are no leap days, that all years are 365 days, and that births are uniformly distributed over the year.

**Exercise 8**

A famous coin tossing game has the following rules: The player tosses a coin repeatedly until a tail appears or tosses it a maximum of 1000 times if no tail appears. The initial stake starts at 2 dollars and is doubled every time heads appears. The first time tails appears, the game ends and the player wins whatever is in the pot. Thus the player wins 2 dollars if tails appears on the first toss, 4 dollars if heads appears on the first toss and tails on the second, 8 dollars if heads appears on the first two tosses and tails on the third, and so on. Mathematically, the player wins 2^{k} dollars, where k equals the number of tosses until the first tail. What is the probability of profit if it costs 15 dollars to participate?

**Exercise 9**

Back to coin tossing. What is the probability the pattern heads-heads-tails appears before tails-heads-heads?

**Exercise 10**

Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car; behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He then says to you, “Do you want to pick door #2?” What is the probability of winning the car if you use the strategy of first picking a random door and then switching doors every time? Note that the host will always open a door you did not pick, and it always reveals a goat.


In today’s set you will have to use the material you’ve seen in the first four installments of this series of exercise sets, but in a more practical setting. Take this as a fun test before we start learning cool stuff like A/B testing, conditional probability and Bayes’ theorem. I hope you will enjoy doing it!

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

**Exercise 1**

A company makes windows that should be able to withstand winds of 120 km/h. The quality assurance department of that company has the mandate of making sure that the failure rate of those windows is less than 1% for each batch of windows produced by its factory. To do so, they randomly choose 10 windows per batch of 150 and place them in a wind tunnel where they are tested.

- Which probability function should be used to compute the number of failing windows in a QA test if the failure rate is 1%?
- What is the probability that a window works correctly during the QA test?
- What is the probability that no windows break during the test?
- What is the probability that up to 3 windows break during the test?
- Simulate this process to estimate the average number of window failures during the test.
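The questions above map onto R’s binomial functions; a sketch:

```r
# Binomial model: 10 windows tested, each failing independently with p = 0.01
n <- 10
p <- 0.01

dbinom(0, n, p)   # probability that no window breaks
pbinom(3, n, p)   # probability that up to 3 windows break

set.seed(1)
# Simulate the test many times to estimate the average number of failures
failures <- rbinom(10000, size = n, prob = p)
mean(failures)    # should be close to n * p = 0.1
```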

**Exercise 2**

A team of biologists is interested in a type of bacteria that seems to be resistant to extreme changes in its environment. In a particular study, they put a culture of bacteria in an acidic solution, observed how many days each of 250 individual bacteria survived, and created this dataset. Find the 90% confidence interval for the mean of this dataset.

**Exercise 3**

The MNIST database is a large dataset of handwritten digits used by data scientists and computer science experts as a reference to test and compare the effectiveness of different machine learning and computer vision algorithms. If a state-of-the-art algorithm can identify the handwritten digits in this dataset 99.79% of the time and we use this algorithm on a set of 1000 digits:

- What is the probability that this algorithm fails to recognize exactly 4 digits?
- What is the probability that it fails to recognize 6 or 7 digits?
- What is the probability that it fails to recognize 3 digits or fewer?
- If we use this algorithm on a set of 3000 digits, what is the probability that it fails more than 10 times?

**Exercise 4**

A customs officer at an airport has to check the luggage of every passenger that goes through customs. If 5% of all passengers travel with forbidden substances or objects:

- What is the chance that the fourth traveler who is checked has a forbidden item in his luggage?
- What is the probability that the first traveler caught with a forbidden item is caught before the fourth traveler?

**Exercise 5**

A start-up wants to know if its marketing push in a specific market has been successful. To do so, they interview 1000 people in a survey and ask them if they know the product. Of that number, 710 were able to identify or name the product. Since the start-up has limited resources, they decided that they would reallocate half the marketing budget to their data science department if more than 70% of the market knew about the product.

- Simulate the result of the survey by creating a matrix containing 710 ones, representing the positive responses, and 290 zeros, representing the negative responses.
- Use bootstrapping to compute the proportion of positive answers that is smaller than 95% of the other possible proportions.
- What percentage of the bootstrapped proportions is smaller than 70%?
- As a consequence of your last answer, what should the start-up do?

**Exercise 6**

A data entry position needs to be filled at a tech company. After the interview process, human resources selected the two ideal candidates for a final test in which they had to complete a sample day of work (they take data entry really seriously at this company). The first candidate did the work with an average time of 5 minutes per form and a variance of 35 minutes, while the second did it with a mean of 6.5 minutes and a variance of 25. Assuming that the time needed by an employee to fill in a form follows a normal distribution:

- Simulate the work of both candidates by generating 200 points of data from both distributions.
- Use bootstrapping to compute the 95% confidence interval for both means.
- Can we conclude that one candidate is faster than the other?

**Exercise 7**

A business wants to launch a product in a new market. Its study shows that, to be viable, a market must be composed of at least 60% of potential consumers making more than $35,000. If the last census shows that the salaries of this population follow an exponential distribution with a mean of 60,000, and the rate of an exponential distribution is equal to 1/mean, should this business launch its product in this market?

**Exercise 8**

A batch of 1000-ohm resistors is scheduled to be soldered to two 200-ohm resistors each to create series circuits of 1400 ohms. But no manufacturing process is perfect, and no resistor has exactly the value it is supposed to have. Suppose that the first resistor comes from a normal process that makes batches of resistors with a mean of 998 ohms and a standard deviation of 5.2 ohms, while the two others come from another process that produces batches of resistors with a mean of 202 ohms and a variance of 2.25. What percentage of circuits will have a resistance in the interval [1385, 1415]? (Note: you can use the bootstrap to solve this problem, or you can use the fact that the sum of two normal random variables is another normal random variable whose mean is the sum of the two means; the variance of the new distribution is computed the same way. You can learn more here.)
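Following the analytical route from the note, a sketch of the arithmetic:

```r
# The circuit resistance is the sum of three independent normal variables:
# mean: 998 + 202 + 202 = 1402, variance: 5.2^2 + 2.25 + 2.25 = 31.54
total.mean <- 998 + 202 + 202
total.sd   <- sqrt(5.2^2 + 2.25 + 2.25)

# Proportion of circuits with resistance in [1385, 1415]
p.interval <- pnorm(1415, total.mean, total.sd) - pnorm(1385, total.mean, total.sd)
p.interval
```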

**Exercise 9**

A probiotic supplement company claims that three kinds of bacteria are present in equal parts in each of its pills. An independent laboratory is hired to test whether the company respects this claim. After taking a small sample of five pills, they get the following dataset, where the numbers are in millions.

In this dataset, the rows represent pills used in the sample and each column represents a different kind of bacteria. For each kind of bacteria:

- Compute the mean.
- Compute the variance.
- Compute the quartile.
- Compute the range, which is defined as the maximum value minus the minimum value.

**Exercise 10**

A shipping company estimates that the delivery delays of its shipments, in hours, follow a Student's t-distribution with 6 degrees of freedom. What proportion of deliveries are between 1 hour late and 3 hours late?
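With `pt()` this is a one-liner:

```r
# P(1 < T < 3) for a Student t-distribution with 6 degrees of freedom
pt(3, df = 6) - pt(1, df = 6)
```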

- Hacking statistics or: How I Learned to Stop Worrying About Calculus and Love Stats Exercises (Part-8)
- Become a Top R Programmer Fast with our Individual Coaching Program
- Explore all our (>4000) R exercises
- Find an R course using our R Course Finder directory

Until now, in this series of exercise sets, we have used only continuous probability distributions, which are functions defined on all the real numbers of a certain interval. As a consequence, random variables that have those distributions can assume an infinity of values. However, a lot of random situations only have a finite number of possible outcomes, and using a continuous probability distribution to analyze them is not really useful. In today's set, we'll introduce the concept of discrete probability distributions, which can be used in those situations, along with some examples of problems in which they arise.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

**Exercise 1**

Just as continuous probability distributions are characterized by a probability density function, discrete probability distributions are characterized by a probability mass function, which gives the probability that a random variable is equal to a given value.

The first probability mass function we will use today is the binomial distribution, which is used to simulate n iterations of a random process that can either result in a success, with probability p, or a failure, with probability (1-p). Basically, if you want to simulate something like a coin flip, the binomial distribution is the tool you need.

Suppose you roll a 20-sided die 200 times and you want to know the probability of getting a 20 exactly five times across your rolls. Use the `dbinom(x, size, prob)` function to compute this probability.
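For a fair 20-sided die, a success ("rolling a 20") has probability 1/20, so the call could look like:

```r
dbinom(5, size = 200, prob = 1/20)  # P(exactly five 20s), about 0.036
```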

**Exercise 2**

For the binomial distribution, the individual counts of successes are mutually exclusive events: you cannot get exactly two tails and exactly three tails on the same 10 flips. This means the probability of observing any one of several counts can be calculated by adding their individual probabilities, a principle that generalizes to any number of such events. For example, the probability of getting three tails or less when you flip a coin 10 times is equal to the probability of getting 0 tails, plus the probability of getting 1 tail, plus the probability of getting 2 tails, plus the probability of getting 3 tails.

Knowing this, use the `dbinom()` function to compute the probability of getting six correct responses on a test made of 10 true-or-false questions if you answer randomly. Then, use the `pbinom()` function to compute the cumulative probability of the binomial distribution in that situation.
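One way to write these two calls:

```r
dbinom(6, size = 10, prob = 0.5)  # exactly 6 correct: 210/1024
pbinom(6, size = 10, prob = 0.5)  # 6 or fewer correct: 848/1024
```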

**Exercise 3**

Another consequence of this additivity is that if we know the probability of a set of events, we can compute the probability of one of its subsets by subtracting the probability of the unwanted events. For example, the probability of getting two or three tails when you flip a coin 10 times is equal to the probability of getting three tails or less minus the probability of getting one tail or less.

Knowing this, compute the probability of getting 6 or more correct answers on the test described in the previous exercise.
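The complement rule makes this straightforward:

```r
1 - pbinom(5, size = 10, prob = 0.5)                  # P(X >= 6)
pbinom(5, size = 10, prob = 0.5, lower.tail = FALSE)  # same thing
```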

**Exercise 4**

Let’s say that in an experiment a success is defined as getting a 1 when you roll a 20-sided die. Use the `barplot()` function to represent the probability of getting from 0 to 10 successes if you roll the die 10 times. What happens to the barplot if you roll a 10-sided die instead? A 3-sided die?
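A possible sketch (the plot titles are illustrative):

```r
barplot(dbinom(0:10, size = 10, prob = 1/20), names.arg = 0:10,
        main = "Successes in 10 rolls, 20-sided die")
# Larger success probabilities shift the mass to the right:
barplot(dbinom(0:10, 10, 1/10), names.arg = 0:10, main = "10-sided die")
barplot(dbinom(0:10, 10, 1/3),  names.arg = 0:10, main = "3-sided die")
```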

**Exercise 5**

Another discrete probability distribution close to the binomial distribution is the Poisson distribution, which gives the probability of a number of events occurring during a fixed amount of time if we know the average rate of occurrence. For example, we could use this distribution to estimate the number of visitors who go to a website if we know the average number of visitors per second. In this case, we must assume two things: first, that the website has visitors from around the world, since the rate of visitors must be constant throughout the day; and second, that a visitor coming to the site is not influenced by the last visitor, since a process can be expressed by the Poisson distribution only if the events are independent from each other.

Use the `dpois()` function to estimate the probability of having 85 visitors on a website in the next hour if, on average, 80 individuals connect to the site per hour. What is the probability of getting 2000 unique visitors on the website in a day?
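With `lambda` set to the hourly (then daily) average rate:

```r
dpois(85, lambda = 80)         # exactly 85 visitors in an hour
dpois(2000, lambda = 80 * 24)  # exactly 2000 visitors in a day
```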

**Exercise 6**

The Poisson distribution can also be used to compute the probability of an event occurring in an amount of space, as long as the unit of the average rate is compatible with the unit of measure of the space you use. Suppose that a fishing boat catches 1/2 ton of fish when its net goes through 5 square kilometers of sea. If the boat combed 20 square kilometers, what is the probability that it catches 5 tons of fish?

**Exercise 7**

Until now, we used the Poisson distribution to compute the probability of observing precisely n occurrences of an event. In practice, we are often interested in knowing the probability that an event occurs n times or less. To do so, we can use the `ppois()` function to compute the cumulative Poisson distribution. If we are interested in knowing the probability of observing strictly more than n occurrences, we can use this function and set the parameter `lower.tail` to `FALSE`.

In the situation of the previous exercise, what is the probability that the boat caught 5 tons of fish or less? What is the probability that it caught more than 5 tons of fish?

Note that, just as in a binomial experiment, the events in a Poisson process are independent, so you can add or subtract probabilities of events to compute the probability of a particular set of events.
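For the fishing-boat situation, the average rate over 20 square kilometers works out to 2 tons, so the two questions could be answered with:

```r
lambda <- 0.5 * (20 / 5)              # expected catch over 20 km^2: 2 tons
ppois(5, lambda)                      # P(5 tons or less)
ppois(5, lambda, lower.tail = FALSE)  # P(strictly more than 5 tons)
```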

**Exercise 8**

Draw the Poisson distribution for average rates of 1, 3, 5 and 10.

**Exercise 9**

The last discrete probability distribution we will use today is the negative binomial distribution, which gives the probability of observing a certain number of successes before observing a fixed number of failures. For example, imagine that a professional football player will retire at the end of the season. This player has scored 495 goals in his career and would really like to reach the 500-goal mark before retiring. If he is set to play 8 games until the end of the season and scores one goal every three games on average, we can use the negative binomial distribution to compute the probability that he will meet his goal in his last game, supposing that he won't score more than one goal per game.

Use the `dnbinom()` function to compute this probability. In this case, the number of successes is 5, the probability of success is 1/3 and the number of failures is 3.
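In R's parametrization, `dnbinom(x, size, prob)` gives the probability of `x` failures before the `size`-th success, so:

```r
dnbinom(3, size = 5, prob = 1/3)  # choose(7, 3) * (1/3)^5 * (2/3)^3
```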

**Exercise 10**

Like for the Poisson distribution, R gives us the option to compute the cumulative negative binomial distribution with the function `pnbinom()`. Again, the `lower.tail` parameter gives you the option to compute the probability of realizing more than n successes when it is set to `FALSE`.

In the situation of the last exercise, what is the probability that the football player will score at most 5 goals before the end of his career?
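A sketch with `pnbinom()` (same parametrization as `dnbinom()`: the first argument counts failures):

```r
pnbinom(3, size = 5, prob = 1/3)                      # at most 3 failures before the 5th goal
pnbinom(3, size = 5, prob = 1/3, lower.tail = FALSE)  # strictly more than 3 failures
```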


In the first two parts of this series, we've seen how to identify the distribution of a random variable by plotting the distribution of a sample and by estimating statistics. We also saw that it can be tricky to identify a distribution from a small sample of data. Today, we'll see how to estimate the confidence interval of a statistic in this situation by using a powerful method called bootstrapping.

Answers to the exercises are available here.

For other parts of this exercise set follow the tag Hacking stats

**Exercise 1**

Load this dataset and draw the histogram, the ECDF of this sample and the ECDF of a density that is a good fit for the data.

**Exercise 2**

Write a function that takes a dataset and a number of iterations as parameters. For each iteration, this function must create a sample with replacement of the same size as the dataset, calculate the mean of the sample and store it in a matrix, which the function must return.
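One possible implementation (the names are illustrative):

```r
boot_means <- function(data, n_iter) {
  res <- matrix(NA_real_, nrow = n_iter, ncol = 1)
  for (i in seq_len(n_iter)) {
    # resample with replacement, same size as the dataset
    res[i, 1] <- mean(sample(data, length(data), replace = TRUE))
  }
  res
}
out <- boot_means(rnorm(50), 1000)
dim(out)  # 1000 rows, 1 column
```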

**Exercise 3**

Use the `t.test()` function to compute the 95% confidence interval estimate for the mean of your dataset.
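The confidence interval is part of the object `t.test()` returns; with a stand-in dataset (the exercise's actual data is not reproduced here):

```r
x <- rnorm(100, mean = 5, sd = 2)  # stand-in for the exercise dataset
t.test(x)$conf.int                 # 95% CI (conf.level = 0.95 is the default)
```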

**Exercise 4**

Use the function you just wrote to estimate the mean of your sample 10,000 times. Then draw the histogram of the results and mark the sample mean of the data.

The probability distribution of the estimate of a mean is a normal distribution centered around the real value of the mean. In other words, if we take a lot of samples from a population and compute the mean of each sample, the histogram of those means will look like that of a normal distribution centered around the real value of the mean we try to estimate. We have recreated this process artificially by creating a bunch of new samples from the dataset, resampling it with replacement, and now we can do a point estimation of the mean by computing the average of the sample of means, or compute the confidence interval by finding the correct percentiles of this distribution. This process is basically what is called bootstrapping.

**Exercise 5**

Calculate the values of the 2.5th and 97.5th percentiles of your sample of 10,000 estimates of the mean, and the mean of this sample. Compare this last value to the sample mean of your data.

**Exercise 6**

Bootstrapping can be used to compute the confidence interval of any statistic of interest, but you don't have to write a function for each of them! You can use the `boot()` function from the library of the same name and pass the statistic as an argument to compute the bootstrapped sample. Use this function with 10,000 replicates to bootstrap the median of the dataset.
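A sketch of the call (the statistic function passed to `boot()` must accept the data and a vector of resample indices; the dataset here is a stand-in):

```r
library(boot)
x <- rnorm(100)  # stand-in for the exercise dataset
med_fun <- function(data, idx) median(data[idx])
b <- boot(x, statistic = med_fun, R = 10000)
b$t0                            # the sample median
quantile(b$t, c(0.025, 0.975))  # percentile 95% confidence interval
```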

**Exercise 7**

Look at the structure of your result and plot its histogram. On the same plot, draw the value of the sample median of your dataset and plot the 95% confidence interval of this statistic by adding two vertical green lines at the lower and upper bounds of the interval.

**Exercise 8**

Write functions to compute by bootstrapping the following statistics:

- Variance
- kurtosis
- Max
- Min

**Exercise 9**

Use the functions from last exercise and the boot function with 10,000 replicates to compute the following statistics:

- Variance
- kurtosis
- Max
- Min

Then draw the histogram of the bootstrapped sample and plot the 95% confidence interval of the statistics.

**Exercise 10**

Generate 1000 points from a normal distribution with mean and standard deviation equal to those of the dataset. Use the bootstrap method to estimate the 95% confidence intervals of the mean, the variance, the kurtosis, the min and the max of this density. Then plot the histograms of the bootstrap samples for each of the statistics and draw the 95% confidence intervals as two red vertical lines.

Two bootstrap estimates of the same statistic from two samples that follow the same density should be pretty similar. When we compare these last plots with the confidence intervals we drew before, we see that they are. More importantly, the confidence intervals computed in exercise 10 overlap the confidence intervals of the statistics of the first dataset. As a consequence, we can't conclude that the two samples come from different density distributions, and in practice we could use a normal distribution with a mean of 0.4725156 and a standard deviation of 1.306665 to simulate this random variable.

This is the third part of the series; it contains the main distributions that you will use most of the time. This part is designed to make sure that you have (or will have after solving this set of exercises) the knowledge needed for the parts to come. The distributions that we will see are:

1) Binomial Distribution: The binomial distribution fits repeated trials, each with a dichotomous outcome such as success-failure, healthy-disease, heads-tails.

2) Normal Distribution: It is the most famous distribution; it is also assumed for many gene expression values.

3) T-Distribution: The T-distribution has many useful applications for testing hypotheses when the sample size is lower than thirty.

4) Chi-squared Distribution: The chi-squared distribution plays an important role in testing hypotheses about frequencies.

5) F-Distribution: The F-distribution is important for testing the equality of two variances.

Before proceeding, it might be helpful to look over the help pages for `choose`, `dbinom`, `pbinom`, `rbinom`, `qbinom`, `pnorm`, `qnorm`, `rnorm`, `dnorm`, `pchisq`, `qchisq`, `dchisq`, `df`, `pf`, `qf`.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Let X be binomially distributed with n = 100 and p = 0.3. Compute the following:

a) P(X = 34), P(X ≥ 34), and P(X ≤ 34)

b) P(30 ≤ X ≤ 60)

c) The quantiles x_{0.025}, and x_{0.975}
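These all map onto R's d/p/q family for the binomial distribution:

```r
dbinom(34, size = 100, prob = 0.3)                      # P(X = 34)
pbinom(33, size = 100, prob = 0.3, lower.tail = FALSE)  # P(X >= 34)
pbinom(34, size = 100, prob = 0.3)                      # P(X <= 34)
pbinom(60, 100, 0.3) - pbinom(29, 100, 0.3)             # P(30 <= X <= 60)
qbinom(c(0.025, 0.975), size = 100, prob = 0.3)         # the two quantiles
```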

**Exercise 2**

Let X be normally distributed with mean = 3 and standard deviation = 1. Compute the following:

a) P(X < 2), P(X > 2), P(2 ≤ X ≤ 4)

b) The quantiles x_{0.025}, x_{0.5}, and x_{0.975}.

**Exercise 3**

Let T_{8} denote a t-distributed random variable with 8 degrees of freedom. Compute the following:

a) P(T_{8} < 1), P(T_{8} > 2), P(-1 < T_{8} < 1).

b) The quantiles t_{0.025}, t_{0.5}, and t_{0.975}. Can you justify the values of the quantiles?

**Exercise 4**

Compute the following for the chi-squared distribution with 5 degrees of freedom:

a) P(X^{2}_{5} < 2), P(X^{2}_{5} > 4), P(4 < X^{2}_{5} < 6).

b) The quantiles g_{0.025}, g_{0.5}, and g_{0.975}.

**Exercise 5**

Compute the following for the F_{6,3} distribution:

a) P(F_{6,3} < 2), P(F_{6,3} > 3), P(1 < F_{6,3} < 4).

b) The quantiles f_{0.025}, f_{0.5}, and f_{0.975}.

**Exercise 6**

Generate 100 observations following a binomial distribution and plot them (if possible in the same plot):

a) n = 20, p = 0.3

b) n = 20, p = 0.5

c) n = 20, p = 0.7

**Exercise 7**

Generate 100 observations following a normal distribution and plot them (if possible in the same plot):

a) standard normal distribution ( N(0,1) )

b) mean = 0, s = 3

c) mean = 0, s = 7

**Exercise 8**

Generate 100 observations following a T distribution and plot them (if possible in the same plot):

a) df = 5

b) df = 10

c) df = 25

**Exercise 9**

Generate 100 observations following a chi-squared distribution and plot them (if possible in the same plot):

a) df = 5

b) df = 10

c) df = 25

**Exercise 10**

Generate 100 observations following an F distribution and plot them (if possible in the same plot):

a) df_{1} = 3, df_{2} = 9

b) df_{1} = 9, df_{2} = 3

c) df_{1} = 15, df_{2} = 15


Creating sample data is a common task performed in many different scenarios.

R has several base functions that make the sampling process quite easy and fast.

Below is an explanation of the main functions used in the current set of exercises:

1. set.seed() – Although R uses a random mechanism for sample creation, the set.seed() function allows us to reproduce the exact same sample each time we execute a random-related function.

2. sample() – Sampling function. The arguments of the function are:

x – a vector of values,

size – the sample size,

replace – whether a chosen value can be used more than once,

prob – the probabilities of each value in the input vector.

3. seq()/seq.Date() – Create a sequence of values/dates, ranging from a ‘start’ to an ‘end’ value.

4. rep() – Repeat a value/vector n times.

5. rev() – Revert the values within a vector.

You can get additional explanations for those functions by adding a ‘?’ prior to each function’s name.
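A quick tour of these functions (the seed value here is arbitrary):

```r
set.seed(1)                                   # reproducible results
sample(1:6, size = 10, replace = TRUE)        # ten die rolls
sample(c(0, 1), 5, replace = TRUE, prob = c(0.8, 0.2))  # biased coin
seq(1, 9, by = 2)                             # 1 3 5 7 9
rep(c(0, 1), times = 3)                       # 0 1 0 1 0 1
rev(1:4)                                      # 4 3 2 1
```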

Answers to the exercises are available here.

If you have different solutions, feel free to post them.

**Exercise 1**

1. Set seed with value 1235

2. Create a Bernoulli sample of 100 'fair coin' flips.

Populate a variable called `fair_coin`

with the sample results.

**Exercise 2**

1. Set seed with value 2312

2. Create a sample of 10 integers, based on a vector ranging from 8 thru 19.

Allow the sample to have repeated values.

Populate a variable called `hourselect1`

with the sample results

**Exercise 3**

1. Create a vector variable called `probs`

with the following probabilities:

‘0.05,0.08,0.16,0.17,0.18,0.14,0.08,0.06,0.03,0.03,0.01,0.01’

2. Make sure the sum of the vector equals 1.

**Exercise 4**

1. Set seed with value 1976

2. Create a sample of 10 integers, based on a vector ranging from 8 thru 19.

Allow the sample to have repeated values and use the probabilities defined in the previous question.

Populate a variable called `hourselect2`

with the sample results

**Exercise 5**

Let’s prepare the variables for a biased coin:

1. Populate a variable called `coin`

with 5 zeros in a row and 5 ones in a row

2. Populate a variable called `probs`

having 5 times value ‘0.08’ in a row and 5 times value ‘0.12’ in a row.

3. Make sure the sum of probabilities on `probs`

variable equals 1.

**Exercise 6**

1. Set seed with value 345124

2. Create a biased sample of length 100, having as input the `coin`

vector, and as probabilities `probs`

vector of probabilities.

Populate a variable called `biased_coin`

with the sample results.

**Exercise 7**

Compare the sum of values in `fair_coin`

and `biased_coin`

**Exercise 8**

1. Create a ‘Date’ variable called `startDate`

with value 9th of February 2010 and a second ‘Date’ variable called `endDate`

with value 9th of February 2005

2. Create a descending sequence of dates having all 9th’s of the month between those two dates. Populate a variable called `seqDates`

with the sequence of dates.

**Exercise 9**

Revert the sequence of dates created in the previous question, so they are in ascending order and place them in a variable called `RevSeqDates`

**Exercise 10**

1. Set seed with value 10

2. Create a sample of 20 unique values from the RevSeqDates vector.

Today we will focus on generating random numbers from some of the built-in distributions in R as well as using the sample() function to obtain random samples from a given population.

In the next posts, we will use some of these functions to create our own data sets to practice, and we will build upon some of the concepts discussed here.

So… what are we waiting for?

For this set of exercises, we will make use of the following functions: runif(), rnorm(), rbinom(), rpois(), sample() and set.seed().

In stark contrast with my ex-GF, R is always willing to tell you exactly what it wants, so don’t be afraid of using ‘?’ (e.g.?sample) to get documentation for each function.

Give it a go to the exercises, and compare your solutions with mine. Please feel free to comment if you have different ideas of how to solve the exercises, I am eager to hear what you come up with!

**Exercise 0**

Quick! Come up with a random number! And another one! And another one!…And one more! And so on until you reach 100 different random numbers! Uff that doesn’t sound so easy, right? And after you have gone through all that trouble of generating a list of 100 numbers, are you sure they are actually random? Difficult to tell! Human brains are incredibly terrible when it comes to dealing with randomness. But don’t worry! R has your back!

**Exercise 1**

Let’s warm up first. Have R generate a vector containing 100 completely random numbers, from 1 to 10

Easy peasy, right?… But let’s kick it up a notch… suppose you don’t want so many floating numbers in your life. Try to generate a vector of random integer numbers.

We will discuss the following concepts in more depth in future sets, particularly when we talk about summary measures; but for now, it suffices to say that variables (i.e. the values obtained from a sample) can be broadly classified into categorical or numeric. Categorical variables can be further divided into three different subtypes, one of which is binary, such as yes/no, life/death, or heads/tails when flipping a coin.

One excellent example of a random experiment that gives binary outcomes (variables, e.g. success/failure) is a Bernoulli trial, which can be achieved in R by using the rbinom() function.

**Exercise 2**

Create a vector with the outcomes of 20 Bernoulli trials that simulate the toss of a fair coin (i.e. p = 0.50 for tails and p = 0.50 for heads). Just for the sake of immersion, we will consider 0s as failures and 1s as successes.
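Both routes look something like this (the seed is chosen arbitrarily):

```r
set.seed(123)
tosses <- rbinom(20, size = 1, prob = 0.5)  # 20 Bernoulli trials of a fair coin
tosses
# The same simulation via sample():
sample(c(0, 1), size = 20, replace = TRUE)
```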

Excellent! I bow to your skills. Now let’s kick it up a notch… can you figure out how to create a similar simulation using the sample() function?

But perhaps a binary existence of 0s and 1s is not for you. You want the excitement of numeric variables, but you don’t wish to deal with those pesky floating numbers. The fabulous world of discrete numeric variables is for you then!

Discrete numeric variables, also known as counts, are basically integers that represent the number of events during a determined interval (e.g. time or space). Typical examples of counts in biostatistics include the number of births, the number of deaths, the number of individuals, etc. Just remember, the events represented by this type of numeric variable are always integers (i.e. you cannot have a quarter or half death… unless you are Schwarzenegger in Terminator!)

**Exercise 3**

Let’s create a random Poisson distribution (pronounced like “pwasö”) to simulate the occurrence of births in the next 20 years in the (fictitious) country of “Los Cocos”, that has an annual mean birth rate of 100.

Pro-tip: lambda refers to the average rate.

Excellent! You have the eternal (fictitious) gratitude of the noble (fictitious) country of “Los Cocos”.

So now you are feeling a little bit more daring, and you have decided to “man-up” (or “woman up”, hey! I am no sexist!) and deal with all those seemingly infinite (well in theory they actually are) decimal numbers. Well, my dear fellow, a continuous, normal distribution may just be right for you!

**Exercise 4**

Let’s create a vector containing 100 random values of weight measurements with a mean of 70kg and a standard deviation of 15 kg.

Sweet! Keep up that good work!

There are several other built-in distributions in R…and so many more you can get from CRAN, but these are the “work-horses” of basic stats, as we are going to see in the upcoming sets. Keep them fresh in your mind, because they are going to make a comeback, just like Alf…in pog form.

**Exercise 5**

Now, let’s simulate a sample of 10 throws of a die. (For those who have been living in a cave, a die has 6 possible, mutually exclusive, outcomes that range from 1 to 6.)

Brilliant job! Let’s kick it up a notch… let’s create an “unfair” die that has a 0.34 probability of resulting in a 6, and a probability of 0.16 for each of the other outcomes.

Kind of cool, huh? If only we could bring with us our recently created R dice to the casino…oh well…

Randomised Controlled Trials are a type of medical experiment, where the eligible participants are randomly assigned (allocated) to one of the two (or more) branches of the study. A randomised controlled trial is considered the gold standard of clinical trials. In these studies, randomisation helps to control for confounding factors, and evenly distribute prognostic factors across groups.

**Exercise 6**

Lets (coarsely) randomly select 10 of the participants in x to be our intervention group.

Excellent…now that we have our victims…err…intervention group let’s kick it up a notch and create a data frame that lists the intervention and control groups. Keep in mind that we are dealing with randomisation right now, so it is ok if our (hypothetical) study is not blinded.

Pro-tip: You may want to create an index to store the randomised individuals rows, and then just use an index to subset x.

**Exercise 7**

Let’s apply all this to a real database. For this exercise you can use any database you may have handy, or download the IST trial database, which was the one I employed when writing this exercise. You can either download it manually, or let R do it for you (isn’t R the best wingman ever, or what?)

To let R download the database, just run this:

download.file("http://datashare.is.ed.ac.uk/bitstream/handle/10283/128/IST%20dataset%20supp1%20%281%29%20trials%202011.csv?sequence=4&isAllowed=y", destfile = "stroke.csv")

And remember to read the csv file with:

dB <- read.csv("stroke.csv")

Wow! 19,435 cases! That is certainly a decent sample size. But perhaps we are not interested in all of those cases, and we just want a subset. Let’s create a data set that keeps all 112 variables, but only 200 randomly selected cases from the original database.

Pro-tip: you may want to use a strategy similar to the one utilised in the previous exercise. I find seq_len() and length() very useful in these cases.

I bow to your mastery of R-fu.

All this talk about randomness and random samples may be having its toll on you, and it may have caused you to develop some philosophical/existential doubts, so let’s end these exercises in a more “stable” note. Let’s create replicable random numbers.

Say what!?!? Well, it happens that sometimes you may write an R script and you want it to choose randomly a sample of cases from a larger database (let’s use the IST database again). You perform your analysis, and everything looks nice and dandy…but then you run your script again and… merciful Zeus! The values are all different because you have just taken another sample at random! You feel like your sanity starts to fade away…but before you pick up that axe, and start running naked à la Christian Bale in American psycho, just take a deep breath and remember, R has you covered.

**Exercise 8**

Create a dataset by randomly selecting cases from the IST database, but that can be replicated every time you run your script.

That is it for now, grasshopper. This was just a warm-up. Tune in next week for some discussion of probability and biostatistics concepts using R. (Yay! that is fun^2!)
