This exercise set will continue the introduction to the STAN platform and its main features. Whereas the first post introduced the
rstan package, we will now present the
rstanarm package and related features.
The goal here is to fit a series of regressions predicting cognitive test scores of children given characteristics of their mothers, using data from a survey of adult American women and their children.
Information about the Generalized Linear Models in STAN can be found here.
You can find first information on
rstanarm package in this vignette . You can also find all pertaining information about this package here. This exercise set will also make use of the
rstan and the
Solutions of this set of exercises can be found here.
After having installed and loaded the
rstanarm packages, set the global options that allows you to automatically save a bare version of a compiled Stan program so that it does not need to be recompiled and to execute multiple chains in parallel by taking all the available cores of your machine (HINT : use the
Then, load the data kidiq directly from the
rstanarm package and set you preferred theme with
theme_set() that will be applied to the following ggplots.
Check the structure of the dataset. it includes the following columns :
– kid_score : The cognitive kid test scores
– mom_hs : a binary indicator for whether the mother has a high-school degree (covariate)
– mom_iq : the mother’s score on an IQ test (covariate)
– mom_age : the age of the mother (covariate)
Then, plot the data by representing all the different factors of interest in order to bring us insight on the model to choose.
Run the simple linear model that tries to explain the kid_score with the mom_iq. Run the model in a frequentist (simply with the
glm() function) and check the summary of the results.
Then, save the estimated coefficients of the model and put them in an object.
Now, run the same model as in the previous exercise but with a Bayesian model using the
stan_glm() function from the
rstanarm package. this function uses a pre-compiled STAN model that can directly be used to run a glm model.
Keep the default weakly informative priors for
stan_glm(), which are currently set to
normal(0,10) for the intercept and
normal(0,5) for the other regression coefficients. Check the summary of the posterior chains and their convergence.
Save the estimated coefficients of the model and put them in an object.
Plot the results of both the frequentist and the Bayesian model on the same plot. What can you say about the relationship of interest and about the comparison of the two different models used in Exercise 4 and Exercise 5 ?
Now, we could try different models with the data at hand. Fit three contending models where the first model will use mom_hs, the second and the third models will use both covariates, with the third including a term for the interaction of the two predictors.
Plot these models together with the data points in order to have insights on the interpretation of all these models.
(HINT : for the models with more than 1 covariate, you can plot the continuous mom_iq on the x-axis and use color to indicate which points correspond to the different subpopulatations defined by mom_hs. Plot two regression lines, one for each subpopulation)
Now that we have several models each with its own interpretation, we can wonder which model fits “best” the data relying on a quantitative criterion. Compute the leave-one-out (loo) cross-validation criterion in order to select a model.
Which model do we prefer with this criterion ?
(HINT : you can directly use the
loo() function from the loo package)
A good way to detect whether the fitted Bayesian model is a good model for the data is to have a look the Posterior Predictive Distribution (PPD).
Check this PPD for the selected model compare it with the actual data :
a. Plot the histograms of the values and compare it with 5 predictions from the PPD.
b. Plot the densities of the values and compare it with 5 predictions from the PPD.
c. Plot the densities in one plot only and do that for 25 samples simulated from the PPD.
d. Check that the means and the standard devations of several samples simulated from the PPD are scattered close to the true mean and sd of the kid_score.
What can you say about the fit of the selected model relying on these visualizations from the PPD ?
(HINT: look at
pp_check() at its parameters)