This blog post is the first of a set of exercises about STAN that will introduce the STAN platform and how to link it with R. STAN is a statistical modeling platform that is used as an example for MCMC computations for Bayesian inference. It is more efficient for most analysis since it is written in a low-level language (c++). While we have already seen exercises about the Gibbs sampler or the Metropolis algorithm, STAN typically deals with Hamiltonian Monte Carlo (HMC).
The next exercises will more specifically present the diagnostics and the visualizations of a STAN model.
Solutions for this set of exercises can be found here. The first chunk defines, installs (if needed), and loads the packages needed in this set.
After installing and loading the
rstan package, set the global options that allows you to automatically save a bare version of a compiled Stan program. This will insure that it does not need to be recompiled, along with executing multiple chains in parallel by taking all the available cores of your machine (use the
Load the data “pulp” from the
faraway package using the
Then, display the structure of this data-set and plot the “bright” against the “operator.” Use a small jitter to distinguish the data points.
Could we fit a model on this data?
We want to estimate a model with a random operator.
Write the STAN model online in your R script by following the pertaining blocks of a STAN code and use uninformative priors for the overall mean and the two standard deviations (residual and operator):
– The data block that will gather all the data used in the STAN model: N (the number of samples: it cannot be negative), J (the number of different operators: it cannot be negative), response (bright values) and predictor (a particular value of the predictor corresponding to the response.)
– The parameters block: eta (the parameter associated with the operator), mu (the intercept), sigma_alpha, and sigma_epsilon (the standard deviations, respectively, for the operator variance component and for the residual.)
– The transformed parameters block: creates the parameter “a” (the predicted values of the model for each operator) and yhat ( the predicted values of the model for each rows.) (HINT: use “a” for loop to fill yhat.)
– The model block: defines the distribution of eta (take a standard normal distribution) and creates the response vector of interest by adding the residual random noise.
Do not forget to constraint the upper and lower bounds in the data and parameter block declarations.
Note that it is recommended to write the STAN code in a separate file (“.stan” extension) and then call it directly through the
stan() function, instead of writing it in a character string as done here. That would be easier to debug, read, etc.
Define the named data list that will be used inside the STAN data block written above.
(HINT: you have to define the same set of parameters with the same names as the sample data we will be using.)
Run the STAN model using the
stan() function and using the following input parameters:
– The STAN code defined in Exercise 3.
– The data list defined in Exercise 4.
– 4 different chains.
– 1000 iterations per chain.
– A warm-up phase of 200. This phase defines the number of iterations that are used by the sampler for the adaptation phase before sampling begins (and is thus different of the burn-in phase we have seen in the Metropolis or Gibbs samplers.)
Use all the available cores of your machine and make it reproducible (take the
Read the warnings carefully that appear in the console after having run the STAN model. Try to improve by following the given advice.
(HINT: check the
control input parameter of the
Examine the chains scatter-plot matrix of the parameters mu, sigma_alpha, sigma_epsilon, and “a.”
Make a quick check of the convergence of the chains for the same set of covariates by using traceplots.
(HINT: simply use the available method of the rstan package.)
Retrieve the sampler parameters of the model from Exercise 6.
Then, compute a summary of these chains in the following way:
a. Chains by chains (hence, you will have 4 summary tables.)
b. For the whole aggregated chain.
Restrict the numerical outputs to 2 digits.
Print the first summary statistics of the complete chain for the same set of variables as in Exercises 6 and 7.
What can you say about this?