Logistic regression is a modelling approach for a binary dependent variable (think yes/no or 1/0 rather than continuous). It is used in machine learning for prediction and is a building block for more complicated algorithms such as neural networks. In the social sciences and medicine, logistic regression is widely used to model causal mechanisms.

We will use a dataset containing health-related measurements on women and whether they can be (or will be at a future point?) classified as diabetic. The data was collected by the US National Institute of Diabetes and Digestive and Kidney Diseases and is contained in the `MASS` package.

Answers to the exercises are available here.

**Exercise 1**

Load the `MASS` package and combine `Pima.tr` and `Pima.tr2` into a `data.frame` called `train`, and save `Pima.te` as `test`. Change the coding of our variable of interest (`type`) to 0 (non-diabetic) and 1 (diabetic). Check for and take note of any missing values.
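As a hint rather than a solution, here is a minimal sketch of the three mechanics involved (stacking data frames, recoding a factor to 0/1, counting missing values) on a made-up data frame. The names `part_a`, `part_b`, `combined`, `score` and `label`, and the level names `"No"`/`"Yes"`, are assumptions for illustration; check the real levels with `levels()` before recoding.

```r
# Two small made-up data frames with identical columns (hypothetical names)
part_a <- data.frame(
  score = c(5.1, NA, 3.3),
  label = factor(c("No", "Yes", "No"), levels = c("No", "Yes"))
)
part_b <- data.frame(
  score = c(4.0, 2.2),
  label = factor(c("Yes", "No"), levels = c("No", "Yes"))
)

# Stack the rows; rbind() requires matching column names
combined <- rbind(part_a, part_b)

# Recode the factor to integer 0/1 by comparing against the "positive" level
combined$label <- as.integer(combined$label == "Yes")

# Number of missing values per column
colSums(is.na(combined))
```

Comparing against a level name is safer than `as.integer(factor) - 1`, which silently depends on the order of the levels.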

**Exercise 2**

Take a look at the data. Plot a scatterplot matrix of all the explanatory variables using `pairs()`, and color-code the dots according to diabetic classification. Furthermore, try to plot `type` as a function of age. Use jitter to make your graph more informative. Bonus: Can you add a logistic fit based on age on top of your plot?
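The plotting ideas can be sketched on simulated data, so the exercise itself stays open. Everything here (`toy`, its columns, the colours, and the coefficients used to simulate the outcome) is made up for illustration.

```r
# Simulated stand-in data; names and coefficients are hypothetical
set.seed(42)
toy <- data.frame(
  age = round(runif(150, 21, 70)),
  bmi = rnorm(150, mean = 32, sd = 6)
)
toy$y <- rbinom(150, size = 1, prob = plogis(-3 + 0.06 * toy$age))

# One colour per outcome class
point_col <- ifelse(toy$y == 1, "red", "blue")

# Scatterplot matrix of the explanatory variables, coloured by outcome
pairs(toy[, c("age", "bmi")], col = point_col)

# A 0/1 outcome overplots heavily; a little vertical jitter separates the points
plot(toy$age, jitter(toy$y, amount = 0.05), col = point_col,
     xlab = "age", ylab = "outcome (jittered)")

# Overlay the fitted probability curve from a one-variable logistic model
fit_age <- glm(y ~ age, data = toy, family = binomial)
curve_age <- seq(min(toy$age), max(toy$age), length.out = 100)
curve_p <- predict(fit_age, newdata = data.frame(age = curve_age),
                   type = "response")
lines(curve_age, curve_p)
```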

**Exercise 3**

Using `glm()` and the `train` data, fit a logistic model of `type` on age and bmi. Print out the coefficients and their p-values.
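For orientation, this is roughly what a logistic fit and its coefficient table look like on simulated data. The data frame `toy`, the predictors `x1` and `x2`, and the simulation coefficients are all assumptions for illustration.

```r
# Simulated stand-in data with two predictors (hypothetical names)
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

# family = binomial selects logistic regression (logit link by default);
# without it glm() fits an ordinary Gaussian linear model
fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Coefficient matrix: estimate, standard error, z value and p-value per term
coef_table <- coef(summary(fit))
coef_table[, c("Estimate", "Pr(>|z|)")]
```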

**Exercise 4**

What probability does the model fitted in Exercise 3 predict for someone aged 35 with a bmi of 32? What about a bmi of 22?
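A sketch of how predictions for new observations can be obtained from a fitted `glm`, again on simulated data with hypothetical names (`toy`, `x`, `new_obs`). The point to note is the `type` argument of `predict()`: the default for a `glm` is the link scale (log-odds), not probabilities.

```r
# Simulated stand-in data and model (hypothetical names and coefficients)
set.seed(2)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.3 + 1.5 * toy$x))
fit <- glm(y ~ x, data = toy, family = binomial)

# New observations must use the same column names as the model formula
new_obs <- data.frame(x = c(-1, 1))

# type = "response" returns probabilities; type = "link" returns log-odds
p_hat <- predict(fit, newdata = new_obs, type = "response")
log_odds <- predict(fit, newdata = new_obs, type = "link")

# The two scales are connected by the inverse logit, plogis()
all.equal(unname(plogis(log_odds)), unname(p_hat))
```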

**Exercise 5**

According to our model, what are the odds that a woman in our sample is diabetic given age 55 and a bmi of 37? Remember that odds in this context have a very precise definition, which is different from probability.
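As a reminder of the definition: the odds are the probability of the event divided by the probability of its complement, and the logistic model is linear in the log of the odds. The probability below is an arbitrary example value, not a model output.

```r
# An arbitrary example probability (not taken from any model)
p <- 0.8

# Odds: probability of the event relative to the probability of no event
odds <- p / (1 - p)   # 0.8 / 0.2, i.e. odds of 4 to 1

# Log-odds (the logit) is the scale on which logistic regression is linear
eta <- qlogis(p)      # same as log(p / (1 - p))

# Going back: exp() of the log-odds gives odds, plogis() gives the probability
exp(eta)
plogis(eta)
```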

**Exercise 6**

Build the confusion matrix, a table of actual diabetic classification against model prediction. Use a cutoff value of 0.5, meaning that women who the model estimates to have at least 0.5 chance of being diabetic are predicted to be diabetic. What is the prediction accuracy?
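The mechanics of thresholding and cross-tabulating can be sketched on a handful of made-up values. The vectors `actual` and `p_hat` are hypothetical; in the exercise they would come from the data and the fitted model.

```r
# Made-up true classes and predicted probabilities (hypothetical values)
actual <- c(0, 0, 1, 1, 1, 0, 1, 0)
p_hat  <- c(0.10, 0.60, 0.80, 0.40, 0.90, 0.30, 0.50, 0.20)

# Classify as 1 when the predicted probability is at least the cutoff
predicted <- as.integer(p_hat >= 0.5)

# Confusion matrix: rows are actual classes, columns are predicted classes
conf_mat <- table(actual = actual, predicted = predicted)
conf_mat

# Accuracy: share of observations classified correctly
accuracy <- mean(predicted == actual)
accuracy
```

Here six of the eight observations are classified correctly, so the accuracy is 0.75. Equivalently, accuracy is `sum(diag(conf_mat)) / sum(conf_mat)`.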

**Exercise 7**

Apply the fitted model to the test set. Print the confusion matrix and prediction accuracy.

**Exercise 8**

Draw up the ROC curve and calculate the AUC.
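Packages exist for ROC analysis, but the idea can be sketched in base R: sweep the cutoff over the predicted scores, record the true and false positive rates, and compute the AUC from ranks (the Mann-Whitney formulation). The helper names `roc_points` and `auc_rank` and the toy vectors are hypothetical, written only for this illustration.

```r
# ROC points: true/false positive rate at every distinct cutoff
roc_points <- function(actual, score) {
  thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
  tpr <- sapply(thresholds, function(t) mean(score[actual == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(score[actual == 0] >= t))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# AUC as the probability that a random positive outranks a random negative
auc_rank <- function(actual, score) {
  r  <- rank(score)            # average ranks, so ties count as one half
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up classes and scores (hypothetical values)
actual <- c(0, 0, 1, 1, 1, 0, 1, 0)
score  <- c(0.10, 0.60, 0.80, 0.40, 0.90, 0.30, 0.50, 0.20)

roc <- roc_points(actual, score)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "false positive rate", ylab = "true positive rate")
abline(0, 1, lty = 2)          # reference line for a random classifier

auc_rank(actual, score)        # 14 of 16 positive/negative pairs are ordered correctly
```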

**Exercise 9**

Add the number of pregnancies and age squared as explanatory variables, then redraw the ROC curve on the test set and calculate its AUC.
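One detail worth knowing before adding a squared term: inside an R formula, `^` is a formula operator, so `age^2` on its own does not square the variable. Wrapping the term in `I()` makes R treat it as arithmetic. A sketch on simulated data, with hypothetical names and coefficients:

```r
# Simulated data with a hump-shaped age effect (hypothetical coefficients)
set.seed(7)
toy <- data.frame(age = runif(300, 20, 70))
toy$y <- rbinom(300, size = 1,
                prob = plogis(-8 + 0.35 * toy$age - 0.0035 * toy$age^2))

# I(age^2) protects the arithmetic; a bare age^2 would collapse to just age
fit_sq <- glm(y ~ age + I(age^2), data = toy, family = binomial)
names(coef(fit_sq))

# For comparison: the unprotected version yields no squared term
fit_plain <- glm(y ~ age + age^2, data = toy, family = binomial)
names(coef(fit_plain))
```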

**Exercise 10**

According to the model, for a 35-year-old woman who is the mother of 2 children, by how much does the probability of diabetes increase if her bmi is 35 instead of 25? What about the marginal effect at bmi = 25?
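The two quantities asked for differ: a discrete change compares predicted probabilities at two values, while the marginal effect is the slope of the probability curve at one point, which for a logistic model equals the coefficient times p(1 - p). A sketch with made-up coefficients (`b0` and `b_x` are hypothetical and not estimates from the Pima data):

```r
# Hypothetical intercept and slope, chosen only for illustration
b0  <- -4
b_x <- 0.1

# Predicted probability as a function of the single predictor x
prob_at <- function(x) plogis(b0 + b_x * x)

# Discrete change: difference in predicted probability between two values
discrete_change <- prob_at(35) - prob_at(25)

# Marginal effect at x = 25: derivative of the logistic curve, b * p * (1 - p)
marginal_effect <- b_x * prob_at(25) * (1 - prob_at(25))

# Numerical check of the derivative with a central difference
h <- 1e-6
numeric_deriv <- (prob_at(25 + h) - prob_at(25 - h)) / (2 * h)

c(discrete_change = discrete_change,
  marginal_effect = marginal_effect,
  numeric_deriv   = numeric_deriv)
```

With several predictors the same formula applies, with the other variables held at fixed values inside the linear predictor.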
