This exercise set continues the introduction to the Stan platform and its main features. Whereas the first post introduced the `rstan` package, we will now present the `rstanarm` package and related features.

The goal here is to fit a series of regressions predicting cognitive test scores of children given characteristics of their mothers, using data from a survey of adult American women and their children.

Information about generalized linear models in Stan can be found here.

You can find introductory information on the `rstanarm` package in this vignette. You can also find all pertaining information about this package here. This exercise set also makes use of the `rstan` and `tidyverse` packages.

Solutions of this set of exercises can be found here.

**Exercise 1**

After having installed and loaded the `rstan` and `rstanarm` packages, set the global options that allow you to automatically save a bare version of a compiled Stan program, so that it does not need to be recompiled, and to execute multiple chains in parallel using all the available cores of your machine (HINT: use the `parallel` package).

Then, load the *kidiq* data directly from the `rstanarm` package and set your preferred theme with `theme_set()`, which will be applied to the following ggplots.
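One possible sketch of this setup (the theme choice here is just illustrative):

```r
library(rstan)
library(rstanarm)
library(ggplot2)

# Save a bare version of compiled Stan programs to avoid recompilation
rstan_options(auto_write = TRUE)

# Run the MCMC chains in parallel on all available cores
options(mc.cores = parallel::detectCores())

# Load the kidiq data shipped with rstanarm
data(kidiq, package = "rstanarm")

# Set a theme applied to all subsequent ggplots (theme_bw() is one possible choice)
theme_set(theme_bw())
```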

**Exercise 2**

Check the structure of the dataset. It includes the following columns:

– *kid_score*: the child's cognitive test score

– *mom_hs*: a binary indicator for whether the mother has a high-school degree (covariate)

– *mom_iq*: the mother's score on an IQ test (covariate)

– *mom_age*: the age of the mother (covariate)

Then, plot the data, representing all the different factors of interest, in order to gain insight into which model to choose.
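One way to sketch this exploration (the aesthetic mappings are just one reasonable choice):

```r
# Inspect the structure of the dataset
str(kidiq)

# Scatter kid_score against mom_iq, colouring points by the mother's
# high-school status; mom_age could be explored with facets in the same way
ggplot(kidiq, aes(x = mom_iq, y = kid_score, color = factor(mom_hs))) +
  geom_point() +
  labs(color = "mom_hs")
```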

**Exercise 3**

Run the simple linear model that tries to explain *kid_score* with *mom_iq*. Run the model in a frequentist setting (simply with the `glm()` function) and check the summary of the results.

Then, save the estimated coefficients of the model in an object.
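A minimal sketch (the object names `fit_freq` and `coefs_freq` are arbitrary):

```r
# Frequentist linear model: kid_score explained by mom_iq
fit_freq <- glm(kid_score ~ mom_iq, data = kidiq, family = gaussian())
summary(fit_freq)

# Store the estimated coefficients for later use
coefs_freq <- coef(fit_freq)
```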

**Exercise 4**

Now, run the same model as in the previous exercise, but as a Bayesian model using the `stan_glm()` function from the `rstanarm` package. This function uses a pre-compiled Stan model that can be used directly to run a GLM.

Keep the default weakly informative priors for `stan_glm()`, which are currently set to `normal(0, 10)` for the intercept and `normal(0, 5)` for the other regression coefficients. Check the summary of the posterior chains and their convergence.

Save the estimated coefficients of the model in an object.
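The Bayesian counterpart can be sketched as follows (object names are again arbitrary):

```r
# Bayesian version of the same model, keeping the default priors
fit_bayes <- stan_glm(kid_score ~ mom_iq, data = kidiq, family = gaussian())

# Posterior summaries, including Rhat and effective sample sizes
summary(fit_bayes)

# Posterior medians of the regression coefficients
coefs_bayes <- coef(fit_bayes)
```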

**Exercise 5**

Plot the results of both the frequentist and the Bayesian model on the same plot. What can you say about the relationship of interest, and how do the two models fitted in Exercise 3 and Exercise 4 compare?
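A sketch of such a comparison plot, refitting both models so the snippet is self-contained (colors and line types are arbitrary):

```r
coefs_freq  <- coef(glm(kid_score ~ mom_iq, data = kidiq))
coefs_bayes <- coef(stan_glm(kid_score ~ mom_iq, data = kidiq))

ggplot(kidiq, aes(x = mom_iq, y = kid_score)) +
  geom_point() +
  # Frequentist fit (dashed) and Bayesian posterior-median fit (solid)
  geom_abline(intercept = coefs_freq[1], slope = coefs_freq[2],
              color = "red", linetype = "dashed") +
  geom_abline(intercept = coefs_bayes[1], slope = coefs_bayes[2],
              color = "blue")
```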

**Exercise 6**

Now, we can try different models with the data at hand. Fit three contending models: the first using *mom_hs* only, the second and third using both covariates, with the third also including a term for the interaction of the two predictors.
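These three models could be fitted along these lines (the names `fit2` to `fit4` are just placeholders):

```r
# mom_hs only
fit2 <- stan_glm(kid_score ~ mom_hs, data = kidiq)

# both covariates, additive
fit3 <- stan_glm(kid_score ~ mom_hs + mom_iq, data = kidiq)

# both covariates plus their interaction
fit4 <- stan_glm(kid_score ~ mom_hs * mom_iq, data = kidiq)
```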

**Exercise 7**

Plot these models together with the data points in order to gain insight into the interpretation of all these models.

(HINT: for the models with more than one covariate, you can plot the continuous *mom_iq* on the x-axis and use color to indicate which points correspond to the different subpopulations defined by *mom_hs*. Plot two regression lines, one for each subpopulation.)
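For the interaction model, for instance, the two subpopulation lines can be built from the posterior medians; a self-contained sketch:

```r
fit4 <- stan_glm(kid_score ~ mom_hs * mom_iq, data = kidiq)
b <- coef(fit4)  # posterior medians of the coefficients

ggplot(kidiq, aes(x = mom_iq, y = kid_score, color = factor(mom_hs))) +
  geom_point() +
  # regression line for the mom_hs == 0 subpopulation
  geom_abline(intercept = b["(Intercept)"], slope = b["mom_iq"]) +
  # regression line for mom_hs == 1: intercept and slope shifted by the hs terms
  geom_abline(intercept = b["(Intercept)"] + b["mom_hs"],
              slope = b["mom_iq"] + b["mom_hs:mom_iq"])
```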

**Exercise 8**

Now that we have several models, each with its own interpretation, we can ask which model fits the data "best" according to a quantitative criterion. Compute the leave-one-out (LOO) cross-validation criterion in order to select a model.

Which model do we prefer according to this criterion?

(HINT: you can directly use the `loo()` function from the `loo` package.)
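A sketch of the LOO comparison, assuming the three fits from Exercise 6 are stored as `fit2`, `fit3`, and `fit4`:

```r
library(loo)

loo2 <- loo(fit2)
loo3 <- loo(fit3)
loo4 <- loo(fit4)

# Higher elpd_loo (equivalently, lower looic) indicates better
# expected out-of-sample predictive fit
loo_compare(loo2, loo3, loo4)
```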

**Exercise 9**

A good way to detect whether the fitted Bayesian model is a good model for the data is to have a look at the *Posterior Predictive Distribution* (PPD).

Check the PPD of the selected model and compare it with the actual data:

**a.** Plot the histogram of the observed values and compare it with histograms of 5 predictions from the PPD.

**b.** Plot the density of the observed values and compare it with densities of 5 predictions from the PPD.

**c.** Overlay the densities in a single plot, using 25 samples simulated from the PPD.

**d.** Check that the means and the standard deviations of several samples simulated from the PPD are scattered close to the true mean and SD of *kid_score*.

What can you say about the fit of the selected model based on these visualizations of the PPD?

(HINT: look at `pp_check()` and its parameters.)
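The four checks can be sketched with `pp_check()` as follows (assuming the selected model is stored in an object called `fit_best`; substitute whichever model the LOO comparison favoured):

```r
pp_check(fit_best, plotfun = "hist", nreps = 5)           # a. histograms
pp_check(fit_best, plotfun = "dens", nreps = 5)           # b. densities
pp_check(fit_best, plotfun = "dens_overlay", nreps = 25)  # c. overlaid densities
pp_check(fit_best, plotfun = "stat_2d", stat = c("mean", "sd"))  # d. mean vs sd
```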
