Regression techniques are a crucial part of any data scientist's or statistician's toolkit. Even for people unfamiliar with regression modeling, a simple linear model is a nice way to introduce yourself to the topic.
A linear model explains how a continuous response variable behaves as a function of a set of covariates, or explanatory variables. Whilst often insufficient for complex problems, linear models exercise fundamental skills, such as variable selection and diagnostic examination, and are therefore a worthwhile introduction to statistical regression techniques.
In this tutorial, we'll create a couple of linear models and compare their performance on the Boston Housing dataset. You will need the caret and mlbench packages installed, and you may find ggplot2 and dplyr useful too, though these are not essential.
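If you want to follow along, the two required packages can be loaded up front (ggplot2 and dplyr are optional extras):

```r
library(caret)    # data splitting and modelling utilities
library(mlbench)  # provides the BostonHousing dataset
```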
Solutions to these exercises can be found here.
Load the Boston Housing dataset from the mlbench library and inspect the different types of variables present.
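One way to do this with base R (a minimal sketch):

```r
library(mlbench)

data(BostonHousing)   # attaches the data frame to the workspace
str(BostonHousing)    # 506 observations of 14 variables
# Most variables are numeric; chas is stored as a two-level factor,
# and medv (median home value) is our continuous response.
```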
Explore and visualize the distribution of our target variable.
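A base-graphics sketch of one approach; a ggplot2 histogram would work equally well:

```r
library(mlbench)
data(BostonHousing)

summary(BostonHousing$medv)   # five-number summary plus the mean

# A histogram reveals the positive (right) skew of the response
hist(BostonHousing$medv,
     breaks = 30,
     main = "Distribution of medv",
     xlab = "Median home value (in $1000s)")
```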
Explore and visualize any potential correlations between medv and the variables crim, rm, age, rad, tax and lstat.
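A quick sketch using a correlation matrix plus pairwise scatterplots:

```r
library(mlbench)
data(BostonHousing)

vars <- c("crim", "rm", "age", "rad", "tax", "lstat", "medv")

# Correlation matrix of the candidate predictors and the response
round(cor(BostonHousing[, vars]), 2)

# Pairwise scatterplots for a visual check of the same relationships
pairs(BostonHousing[, vars])
```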
Set a seed of 123 and split your data into a train and test set using a 75/25 split. You may find the caret library helpful here.
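One possible approach uses caret's createDataPartition(), which stratifies the split on the response (the train/test names are illustrative):

```r
library(caret)
library(mlbench)
data(BostonHousing)

set.seed(123)
# list = FALSE returns a matrix of row indices rather than a list
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]
test  <- BostonHousing[-train_idx, ]
```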
We have seen that crim, rm, tax, and lstat could be good predictors of medv. To get the ball rolling, let us fit a linear model for these terms.
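A sketch of the fit, assuming the seed-123 split from the previous exercise (repeated here so the block runs on its own):

```r
library(caret); library(mlbench)
data(BostonHousing)
set.seed(123)
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]

# Linear model of medv on the four candidate predictors
fit1 <- lm(medv ~ crim + rm + tax + lstat, data = train)
summary(fit1)
```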
Obtain an r-squared value for your model and examine the diagnostic plots found by plotting your linear model.
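summary() reports R-squared, and plotting an lm object produces the four standard diagnostic plots. A sketch, rebuilding fit1 so the block is self-contained:

```r
library(caret); library(mlbench)
data(BostonHousing)
set.seed(123)
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]
fit1 <- lm(medv ~ crim + rm + tax + lstat, data = train)

summary(fit1)$r.squared   # extract R-squared directly

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit1)
par(mfrow = c(1, 1))
```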
We can immediately see a few problems with our model: observations such as 381 exhibit high leverage, the Q-Q plot departs from normality in the tails, and the r-squared value is relatively poor.
Let us try another model, this time transforming medv due to the positive skewness it exhibited.
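A log transform is a common choice for a positively skewed response; a sketch of that approach (rebuilding the split so the block runs on its own):

```r
library(caret); library(mlbench)
data(BostonHousing)
set.seed(123)
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]

# Model the log of medv to counter the positive skew of the response
fit2 <- lm(log(medv) ~ crim + rm + tax + lstat, data = train)
summary(fit2)
```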
Examine the diagnostics for the model. What do you conclude? Is this an improvement on the first model?
One assumption of a linear model is that the mean of the residuals is zero. Try testing this for your model.
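A quick check on the transformed model (setup repeated so the block is self-contained):

```r
library(caret); library(mlbench)
data(BostonHousing)
set.seed(123)
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]
fit2 <- lm(log(medv) ~ crim + rm + tax + lstat, data = train)

# For OLS with an intercept the residuals sum to zero by construction,
# so the mean should be zero up to floating-point error
mean(resid(fit2))
```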
Create a data frame of your predicted values and the original values.
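One possible sketch; note the predictions must be back-transformed with exp() since the second model was fit on log(medv). Column names are illustrative:

```r
library(caret); library(mlbench)
data(BostonHousing)
set.seed(123)
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]
test  <- BostonHousing[-train_idx, ]
fit2 <- lm(log(medv) ~ crim + rm + tax + lstat, data = train)

# Back-transform predictions onto the original medv scale
preds <- data.frame(
  predicted = exp(predict(fit2, newdata = test)),
  original  = test$medv
)
head(preds)
```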
Plot this to visualize the performance of your model.
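A base-graphics sketch of an observed-vs-predicted plot, reusing the same data frame (a ggplot2 version would work just as well):

```r
library(caret); library(mlbench)
data(BostonHousing)
set.seed(123)
train_idx <- createDataPartition(BostonHousing$medv, p = 0.75, list = FALSE)
train <- BostonHousing[train_idx, ]
test  <- BostonHousing[-train_idx, ]
fit2 <- lm(log(medv) ~ crim + rm + tax + lstat, data = train)
preds <- data.frame(
  predicted = exp(predict(fit2, newdata = test)),
  original  = test$medv
)

# Points close to the red identity line are accurate predictions
plot(preds$original, preds$predicted,
     xlab = "Observed medv", ylab = "Predicted medv",
     main = "Predicted vs observed on the test set")
abline(0, 1, col = "red")
```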