Logistic regression is a modelling approach for a binary dependent variable (think yes/no or 1/0 instead of continuous). It is used in machine learning for prediction and is a building block for more complicated algorithms such as neural networks. In the social sciences and medicine, logistic regression is widely used to model causal mechanisms.
We will use a data set containing health-related measurements on women and whether they can be (or will be at a future point?) classified as diabetic. The data was collected by the US National Institute of Diabetes and Digestive and Kidney Diseases and is contained in the MASS package.
Answers to the exercises are available here.
Load the MASS package and combine Pima.tr and Pima.tr2 to a data frame called train and save Pima.te as test. Change the coding of our variable of interest (type) to 0 (non-diabetic) and 1 (diabetic). Check for and take note of any missing values.
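One way the data preparation could look is sketched below. It assumes the MASS data sets Pima.tr, Pima.tr2 and Pima.te, in which type is a factor with levels "No" and "Yes". According to the MASS documentation, Pima.tr2 already contains the rows of Pima.tr, so stacking the two as the exercise describes duplicates those rows; the sketch follows the exercise text as written.

```r
library(MASS)

# Stack the two training sets as the exercise asks; Pima.te becomes the test set.
train <- rbind(Pima.tr, Pima.tr2)
test <- Pima.te

# Recode the factor type ("No"/"Yes") to 0 (non-diabetic) / 1 (diabetic).
train$type <- as.integer(train$type == "Yes")
test$type <- as.integer(test$type == "Yes")

# Count missing values per column in each set.
na_train <- colSums(is.na(train))
na_test <- colSums(is.na(test))
print(na_train)
print(na_test)
```

Keeping the NA counts in named vectors makes it easy to see later which model terms will cause rows to be dropped.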
Take a look at the data. Plot a scatterplot matrix between all the explanatory variables using pairs(), and color code the dots according to diabetic classification. Furthermore, try to plot type as a function of age. Use jitter to make your graph more informative. Bonus: Can you add a logistic fit based on age on top of your plot?
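A possible base-graphics sketch of these plots follows. The colours, the jitter amount and the helper names (predictors, age_grid, fit_age) are illustrative choices, not part of the exercise.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")

# Scatterplot matrix of the explanatory variables, coloured by class.
predictors <- setdiff(names(train), "type")
pairs(train[, predictors],
      col = ifelse(train$type == 1, "red", "blue"), pch = 20)

# type against age; vertical jitter separates the overlapping 0/1 points.
plot(train$age, jitter(train$type, amount = 0.05),
     xlab = "Age", ylab = "Diabetic (jittered)", pch = 20)

# Bonus: overlay the fitted probabilities from a logistic regression on age.
fit_age <- glm(type ~ age, data = train, family = binomial)
age_grid <- data.frame(age = seq(min(train$age, na.rm = TRUE),
                                 max(train$age, na.rm = TRUE),
                                 length.out = 100))
lines(age_grid$age,
      predict(fit_age, newdata = age_grid, type = "response"),
      col = "red", lwd = 2)
```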
Using glm() and the train data, fit a logistic model of type on age and bmi. Print out the coefficients and their p-values.
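A minimal sketch of such a fit, assuming the 0/1 recoding of type described earlier. By default glm() drops rows with missing values in the model variables.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")

# Logistic regression of type on age and bmi.
fit <- glm(type ~ age + bmi, data = train, family = binomial)

# Coefficient estimates and their p-values from the summary table.
coef_table <- coef(summary(fit))[, c("Estimate", "Pr(>|z|)")]
print(coef_table)
```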
What probability of diabetes does the model fitted in exercise 3 predict for someone aged 35 with a bmi of 32? What about a bmi of 22?
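Predicted probabilities can be obtained with predict() and type = "response". The sketch below refits the age and bmi model so that it runs on its own; new_women is a hypothetical name for the two profiles in question.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")
fit <- glm(type ~ age + bmi, data = train, family = binomial)

# Two hypothetical women: same age, different bmi.
new_women <- data.frame(age = c(35, 35), bmi = c(32, 22))

# type = "response" returns probabilities instead of log-odds.
probs <- predict(fit, newdata = new_women, type = "response")
print(cbind(new_women, probability = probs))
```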
According to our model, what are the odds that a woman in our sample is diabetic given age 55 and a bmi of 37? Remember that odds in this context have a very precise definition, which is different from probability.
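Odds are defined as p / (1 - p), where p is the probability of the event. In a logistic model this equals the exponential of the linear predictor. The sketch below computes the odds both ways, again assuming the age and bmi model; woman is a hypothetical name.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")
fit <- glm(type ~ age + bmi, data = train, family = binomial)

woman <- data.frame(age = 55, bmi = 37)

# Linear predictor (log-odds) and the corresponding probability.
eta <- predict(fit, newdata = woman, type = "link")
p <- plogis(eta)

# Odds from the definition, and directly from the log-odds.
odds <- p / (1 - p)
odds_from_link <- exp(eta)
print(c(probability = unname(p), odds = unname(odds)))
```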
Build the confusion matrix, a table of actual diabetic classification against model prediction. Use a cutoff value of 0.5, meaning that women for whom the model estimates a probability of at least 0.5 of being diabetic are predicted to be diabetic. What is the prediction accuracy?
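A sketch of the confusion matrix on the training data, assuming the age and bmi model. Rows with a missing bmi produce NA predictions and drop out of the table, so accuracy is computed over the classified rows only.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")
fit <- glm(type ~ age + bmi, data = train, family = binomial)

# Predicted probabilities, then hard predictions at the 0.5 cutoff.
prob <- predict(fit, newdata = train, type = "response")
pred <- as.integer(prob >= 0.5)

# Fixing the factor levels guarantees a 2 x 2 table even if one class is never predicted.
conf <- table(actual = factor(train$type, levels = 0:1),
              predicted = factor(pred, levels = 0:1))
print(conf)

# Accuracy: share of correctly classified cases (the diagonal of the table).
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```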
Apply the fitted model to the test set. Print the confusion matrix and prediction accuracy.
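The same steps can be applied to held-out data by passing the test set as newdata. The sketch assumes the age and bmi model fitted on train and the 0/1 recoding of type in both sets.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")
test <- Pima.te
test$type <- as.integer(test$type == "Yes")

# Fit on the training data only.
fit <- glm(type ~ age + bmi, data = train, family = binomial)

# Predict on the test data and classify at the 0.5 cutoff.
prob_test <- predict(fit, newdata = test, type = "response")
pred_test <- as.integer(prob_test >= 0.5)

conf_test <- table(actual = factor(test$type, levels = 0:1),
                   predicted = factor(pred_test, levels = 0:1))
print(conf_test)

accuracy_test <- sum(diag(conf_test)) / sum(conf_test)
print(accuracy_test)
```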
Draw up the ROC curve and calculate the AUC.
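The ROC curve plots the true positive rate against the false positive rate as the cutoff varies, and the AUC is the area under that curve. Below is a base R sketch that assumes the curve is wanted for the test set and the age and bmi model; roc_points() and auc_trapezoid() are hypothetical helpers written for this illustration. Packages such as pROC provide ready-made alternatives.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")
test <- Pima.te
test$type <- as.integer(test$type == "Yes")

fit <- glm(type ~ age + bmi, data = train, family = binomial)
prob_test <- predict(fit, newdata = test, type = "response")

# True and false positive rates for every distinct cutoff, from strictest to loosest.
roc_points <- function(labels, scores) {
  cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE), -Inf)
  data.frame(
    fpr = sapply(cutoffs, function(cut) mean(scores[labels == 0] >= cut)),
    tpr = sapply(cutoffs, function(cut) mean(scores[labels == 1] >= cut))
  )
}

# Area under the curve by the trapezoid rule.
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(test$type, prob_test)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # reference line of a random classifier

auc <- auc_trapezoid(roc)
print(auc)
```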
Add number of pregnancies and age squared as explanatory variables, redraw the ROC curve on the test set, and calculate its AUC.
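A sketch of the extended model, assuming that number of pregnancies corresponds to the npreg column and that age squared enters via I(age^2). The ROC helper is a hypothetical base R function written for this illustration; the AUC is computed here from the equivalent rank (Mann-Whitney) formulation, that is, the share of diabetic/non-diabetic pairs in which the diabetic woman receives the higher score, with ties counted as one half.

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")
test <- Pima.te
test$type <- as.integer(test$type == "Yes")

# Extended model: adds number of pregnancies and a quadratic age term.
fit2 <- glm(type ~ age + I(age^2) + bmi + npreg,
            data = train, family = binomial)
prob2 <- predict(fit2, newdata = test, type = "response")

# True and false positive rates for every distinct cutoff.
roc_points <- function(labels, scores) {
  cutoffs <- c(Inf, sort(unique(scores), decreasing = TRUE), -Inf)
  data.frame(
    fpr = sapply(cutoffs, function(cut) mean(scores[labels == 0] >= cut)),
    tpr = sapply(cutoffs, function(cut) mean(scores[labels == 1] >= cut))
  )
}

roc2 <- roc_points(test$type, prob2)
plot(roc2$fpr, roc2$tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)

# AUC as the probability that a diabetic case outranks a non-diabetic one.
pos <- prob2[test$type == 1]
neg <- prob2[test$type == 0]
auc2 <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
print(auc2)
```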
According to the model, for a woman aged 35 who is a mother of 2 children, by how much does the probability of diabetes increase if her bmi is 35 instead of 25? What about the marginal effect at bmi = 25?
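The two quantities can be sketched as follows, under two assumptions: "the model" is the extended one with npreg and age squared, and a mother of 2 children corresponds to npreg = 2. The discrete change is a difference of two predicted probabilities. The marginal effect is the derivative of the probability with respect to bmi, which for a logistic model is the bmi coefficient times p(1 - p).

```r
library(MASS)

train <- rbind(Pima.tr, Pima.tr2)
train$type <- as.integer(train$type == "Yes")

fit2 <- glm(type ~ age + I(age^2) + bmi + npreg,
            data = train, family = binomial)

# Same woman at two bmi values (npreg = 2 is an assumption, see above).
at25 <- data.frame(age = 35, npreg = 2, bmi = 25)
at35 <- transform(at25, bmi = 35)

p25 <- predict(fit2, newdata = at25, type = "response")
p35 <- predict(fit2, newdata = at35, type = "response")

# Discrete change in probability when bmi goes from 25 to 35.
prob_change <- unname(p35 - p25)

# Marginal effect at bmi = 25: dp/dbmi = beta_bmi * p * (1 - p).
marginal_effect <- unname(coef(fit2)["bmi"] * p25 * (1 - p25))

print(c(change_25_to_35 = prob_change, marginal_at_25 = marginal_effect))
```

The marginal effect is a local slope per unit of bmi, so multiplying it by 10 will generally not reproduce the discrete change, because the logistic curve is not linear.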