This post aims to introduce you to the basics of Bayesian inference. The ultimate goal of this introductory set of exercises is to get you ready for Bayesian inference using Markov Chain Monte Carlo (MCMC).
Little reminder
The whole Bayesian paradigm is based on Bayes' theorem, which we all know (right?), generally formulated as:
$latex \Pr(A|B) = \frac{\Pr(B|A)\cdot \Pr(A)}{\Pr(B)}$
where $latex \Pr(A|B)$ stands for the posterior probability of an event of interest A given the evidence of some event B.
In our case, let’s slightly rephrase this, since we are interested in estimating a parameter $latex \theta$ of some distribution:
$latex \Pr(\theta|X) = \frac{\Pr(X|\theta)\cdot \Pr(\theta)}{\Pr(X)}$
where $latex \theta$ is the parameter of interest (which can be multidimensional) and $latex \Pr(\theta|X)$ is its posterior distribution. $latex \Pr(X|\theta)$ is better known as the likelihood (the probability of the observed data X given $latex \theta$) and $latex \Pr(\theta)$ is the prior distribution. Since $latex \Pr(X)$ does not depend on the parameter of interest, the posterior is proportional to prior × likelihood.
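To make this proportionality concrete, here is a minimal sketch in base R that approximates a posterior on a grid of candidate values. The data (6 successes out of 20 trials) and the flat prior are made up for illustration and are not taken from the exercises below.

```r
# Grid approximation: posterior is proportional to prior * likelihood.
# Hypothetical data: 6 successes out of 20 Bernoulli trials.
theta <- seq(0, 1, length.out = 1001)        # grid of candidate values
prior <- dunif(theta, min = 0, max = 1)      # flat prior (an assumption)
likelihood <- dbinom(6, size = 20, prob = theta)

unnormalised <- prior * likelihood
step <- theta[2] - theta[1]
# Dividing by the approximate integral plays the role of Pr(X)
posterior <- unnormalised / sum(unnormalised * step)

theta_mode <- theta[which.max(posterior)]    # posterior mode on the grid
```

Because the normalising constant is the same for every value of the parameter, the shape of the posterior is already fixed by the product `prior * likelihood`; the division only rescales it to integrate to one.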
Context
Doctors usually recommend sleeping 8 hours per night. We want to analyze, on a sample of statisticians, whether they actually do so.
For this introductory set of exercises, you will not need to install specific R packages. We will rely only on the base and stats packages. Solutions to these exercises can be found here.
Exercise 1
In this exercise, we are interested in estimating the proportion p of statisticians who sleep less than 8 hours per night. In our experimental sample of size 1000, 735 statisticians sleep less than 8 hours.
a. What is the probability distribution of the data? Compute the likelihood and plot it. Note: express everything in percentages.
b. Which continuous probability distribution should be used to describe the prior of this proportion? Hence, which (family of) function(s) will you use in R to do that? Look at the parameters and pay attention to the domain on which the distribution is defined.
Exercise 2
A group of doctors claimed that only 30% of statisticians sleep less than 8 hours, with a variance of 0.1. We want to use this information as a prior for our study. How can we do that? Represent this graphically. (Hint: look at the equations for the mean and variance of the beta distribution.)
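For reference, a Beta distribution with shape parameters $latex \alpha$ and $latex \beta$ has mean $latex \mu = \frac{\alpha}{\alpha+\beta}$ and variance $latex \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$. Solving these two equations gives $latex \alpha+\beta = \frac{\mu(1-\mu)}{\sigma^2} - 1$, which requires $latex \sigma^2 < \mu(1-\mu)$. The sketch below wraps this in a small helper; `beta_from_moments` is a hypothetical name, not a base R function, and the mean and variance used are illustrative values rather than those of the exercise.

```r
# Convert a mean and a variance into Beta shape parameters (method of moments).
# beta_from_moments is a hypothetical helper, not part of base R or stats.
beta_from_moments <- function(mu, sigma2) {
  stopifnot(mu > 0, mu < 1, sigma2 > 0, sigma2 < mu * (1 - mu))
  nu <- mu * (1 - mu) / sigma2 - 1     # nu = shape1 + shape2
  c(shape1 = mu * nu, shape2 = (1 - mu) * nu)
}

# Illustrative values only: mean 0.5, variance 0.05
shapes <- beta_from_moments(0.5, 0.05)

# Sanity check: plug the shapes back into the moment formulas
a <- shapes[["shape1"]]
b <- shapes[["shape2"]]
check_mean <- a / (a + b)
check_var  <- a * b / ((a + b)^2 * (a + b + 1))
```

The resulting shapes can be passed straight to `dbeta()` to draw the prior density.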
Exercise 3
Using the fact that the beta distribution is a conjugate prior for a binomial process, find the posterior distribution of p. Then plot it together with the prior on the same graph. What do you notice?
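As a reminder of what conjugacy means here: with a Beta($latex \alpha$, $latex \beta$) prior and x successes observed in n binomial trials, the posterior is Beta($latex \alpha + x$, $latex \beta + n - x$). The following sketch applies this update to made-up numbers (a Beta(2, 2) prior and 12 successes in 40 trials), which are assumptions for illustration and not the values of the exercise.

```r
# Beta-binomial conjugate update with made-up numbers.
prior_a <- 2                       # hypothetical Beta(2, 2) prior
prior_b <- 2
x <- 12                            # hypothetical data: 12 successes ...
n <- 40                            # ... in 40 trials

post_a <- prior_a + x              # shape1 of the posterior
post_b <- prior_b + n - x          # shape2 of the posterior

# Evaluate both densities on a common grid so they can share one plot
p_grid <- seq(0, 1, length.out = 501)
prior_dens <- dbeta(p_grid, prior_a, prior_b)
post_dens  <- dbeta(p_grid, post_a, post_b)

if (interactive()) {               # only draw in an interactive session
  plot(p_grid, post_dens, type = "l", xlab = "p", ylab = "density")
  lines(p_grid, prior_dens, lty = 2)
  legend("topright", legend = c("posterior", "prior"), lty = c(1, 2))
}
```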
Exercise 4
What would happen if we had observed a sample of size 10 instead of 1000, with 7 statisticians who sleep less than 8 hours?
Exercise 5
Generate a large random sample (M=10,000) from the posterior of Exercise 3 (the large-sample case).
Make it reproducible and compare this sample with the theoretical posterior found in Exercise 3.
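One possible way to set this up is sketched below. It fixes the seed with `set.seed()`, draws from a Beta distribution with `rbeta()`, and compares a few empirical quantiles with the theoretical ones from `qbeta()`. The Beta(14, 30) posterior and the seed value are placeholders chosen for illustration, not the answer to Exercise 3.

```r
# Reproducible draws from a Beta posterior, compared with theory.
# Beta(14, 30) is a hypothetical posterior, not the answer to Exercise 3.
set.seed(2024)                     # fixes the random number stream
M <- 10000
draws <- rbeta(M, shape1 = 14, shape2 = 30)

# Compare empirical and theoretical quantiles at a few probabilities
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
comparison <- cbind(
  empirical   = quantile(draws, probs),
  theoretical = qbeta(probs, 14, 30)
)
max_gap <- max(abs(comparison[, "empirical"] - comparison[, "theoretical"]))
```

For a visual comparison, `hist(draws, freq = FALSE)` followed by `curve(dbeta(x, 14, 30), add = TRUE)` overlays the theoretical density on the histogram of the draws.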
Exercise 6
Based on the random sample generated from the posterior distribution, compute the mean, the median, and the standard deviation. What do you observe?
Exercise 7
Find a 95% quantile-based credible interval for the posterior distribution of p based on this random sample.
How can we interpret it?
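A quantile-based (equal-tailed) interval simply cuts off 2.5% of the posterior mass in each tail. Here is a minimal sketch, again using a hypothetical Beta(14, 30) posterior rather than the one from the exercises, that computes it from the draws and from the exact distribution.

```r
# 95% quantile-based (equal-tailed) credible interval from posterior draws.
# Beta(14, 30) is a hypothetical posterior used only for illustration.
set.seed(2024)
draws <- rbeta(10000, shape1 = 14, shape2 = 30)

ci_sample <- quantile(draws, probs = c(0.025, 0.975))   # from the sample
ci_exact  <- qbeta(c(0.025, 0.975), 14, 30)             # from the distribution
```

With a sample this large, the two intervals should agree closely, which is a useful check that the sample represents the posterior well.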
Exercise 8
For the last two exercises, let’s assume we obtain a new sample of 500 computer scientists. This shows that out of 500 sleep less than 8 hours per night.
Assuming a non-informative (uniform) prior, generate a random sample (M=10,000) from the posterior distribution of the corresponding proportion p2. Compare this with the results for the sample of statisticians, and generate a random sample for the:
a. posterior odds ratio
b. posterior difference.
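When two posteriors are independent, draws of any function of the two proportions can be obtained by applying that function to paired draws. The sketch below illustrates this for a difference and an odds ratio. Both Beta posteriors are hypothetical stand-ins, not the posteriors of the exercises, and `odds` is a helper defined here for convenience.

```r
# Posterior samples of a difference and an odds ratio between two proportions.
# Both posteriors below are hypothetical stand-ins, not the exercise answers.
set.seed(2024)
M <- 10000
p1 <- rbeta(M, 14, 30)             # stand-in posterior for group 1
p2 <- rbeta(M, 21, 21)             # stand-in posterior for group 2 (independent)

post_diff <- p2 - p1               # draws of the posterior difference

odds <- function(p) p / (1 - p)    # helper: probability -> odds
post_or <- odds(p2) / odds(p1)     # draws of the posterior odds ratio

# Share of draws in which group 2 has the larger proportion
prob_p2_larger <- mean(post_diff > 0)
```

Summaries such as `quantile(post_diff, c(0.025, 0.975))` can then be computed on these derived samples exactly as for a single proportion.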
Exercise 9
Based on the previous exercise, can you find evidence that more computer scientists than statisticians sleep less than 8 hours per night?