Fitting a statistical model to a data set and obtaining parameter estimates is not the end of the analysis. You also need to check that the assumptions of the model you used were met.
One tool is to examine the model residuals. We previously discussed this in a tutorial. The residuals are the differences between your observed data and your predicted values. In this exercise set, you will examine several aspects of residual plots. These plots help determine whether you have met your model assumptions.
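As a quick illustration of that definition, here is a minimal sketch on simulated data (not the exercise data). The variable names x, y, fit, and manual_resid are made up for this example; it only shows that subtracting the fitted values from the observed values gives the same numbers as residuals().

```r
# Simulated data for illustration only (hypothetical, not the exercise data)
set.seed(1)
x <- runif(30, 0, 10)
y <- 2 + 3 * x + rnorm(30, sd = 1)

fit <- lm(y ~ x)

# A residual is the observed value minus the fitted (predicted) value
manual_resid <- y - fitted(fit)

# residuals() returns the same quantities
all.equal(unname(manual_resid), unname(residuals(fit)))
```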
Answers to the exercises are available here.
Exercise 1
Load the cars data set using the data() function. The data set contains stopping distances (feet) for different car speeds (miles per hour). The data were recorded in the 1920s.
Exercise 2
Plot stopping distances on the y-axis and car speeds on the x-axis. What kind of pattern is present?
Exercise 3
Use the lm() function to fit a linear model to the data, with stopping distance as the response variable. Plot the line of best fit.
Exercise 4
Use summary() to obtain parameter estimates and model details. Is the slope significantly different from zero? How much of the variance in stopping distance can be explained by car speed?
Exercise 5
Use the plot() command on the linear model to obtain the four plots of residuals.
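If it helps, the sketch below shows one way to arrange the four diagnostic plots on a single page. It uses simulated data and made-up names (x, y, fit, old_par), so it illustrates the mechanics only and is not the answer to the exercise.

```r
# Simulated data for illustration only (hypothetical, not the exercise data)
set.seed(2)
x <- runif(40, 0, 10)
y <- 1 + 0.5 * x + rnorm(40)
fit <- lm(y ~ x)

# Use a 2 x 2 grid so all four diagnostic plots fit on one page
old_par <- par(mfrow = c(2, 2))
plot(fit)      # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(old_par)   # restore the previous layout

# The 'which' argument selects a single plot, e.g. residuals vs fitted
plot(fit, which = 1)
```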
Exercise 6
Are the data homoscedastic? Homoscedasticity means that the errors have the same variance for all values of the explanatory variable.
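To make the idea concrete, the following sketch simulates one data set with constant error variance and one where the error spread grows with x, then compares the residual spread in the lower and upper halves of x. Everything here (the simulated data, the helper sd_ratio, the 0.5 * x spread) is an assumption chosen for illustration, and the ratio is a rough informal check rather than a formal test.

```r
# Simulated data for illustration only (hypothetical, not the exercise data)
set.seed(3)
n <- 200
x <- seq(1, 10, length.out = n)

y_const <- 2 + 3 * x + rnorm(n, sd = 2)        # error SD is constant
y_fan   <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)  # error SD grows with x

fit_const <- lm(y_const ~ x)
fit_fan   <- lm(y_fan ~ x)

# Ratio of residual SD in the upper half of x to that in the lower half
upper <- x > median(x)
sd_ratio <- function(fit) {
  sd(residuals(fit)[upper]) / sd(residuals(fit)[!upper])
}

sd_ratio(fit_const)  # expected to be near 1 for homoscedastic errors
sd_ratio(fit_fan)    # expected to be well above 1 for the fan-shaped errors
```

In a residuals versus fitted plot, the second case would show up as a funnel shape that widens from left to right.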
Exercise 7
Are the residuals normally distributed?
Exercise 8
Are the residuals correlated with the explanatory variable?
Exercise 9
Bonus test. Now take a look at the fourth plot: the residuals versus leverage. This plot does not indicate whether we have met model assumptions, but it does tell us whether certain data points are more influential than others in the regression. By default, plot() labels the most extreme points with their row numbers, and dashed contour lines mark levels of Cook’s distance. Points with a high Cook’s distance are particularly influential for the regression; they are usually the points not clustered with the majority. Are there any points with a high Cook’s distance?
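Cook’s distance can also be inspected numerically with cooks.distance(). The sketch below plants one unusual point in simulated data (the 30th observation, an assumption made purely for this example) and shows how to find the most influential observation.

```r
# Simulated data for illustration only (hypothetical, not the exercise data)
set.seed(4)
x <- c(runif(29, 0, 10), 25)               # last point lies far from the rest in x
y <- c(1 + 2 * x[1:29] + rnorm(29), 10)    # and far below the trend in y
fit <- lm(y ~ x)

d <- cooks.distance(fit)

which.max(d)                          # index of the most influential observation
head(sort(d, decreasing = TRUE), 3)   # the three largest Cook's distances

plot(fit, which = 5)                  # residuals vs leverage plot
```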
Exercise 10
Remove the 49th record (the one with a large Cook’s distance) from the data. How does this change the parameter estimates in the regression model?
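One common way to drop a row is negative indexing on the data frame before refitting. The sketch below does this on simulated data with a planted outlier in row 30; the data frame dat and the model names fit_all and fit_drop are hypothetical, so the numbers say nothing about the cars data.

```r
# Simulated data for illustration only (hypothetical, not the exercise data)
set.seed(5)
dat <- data.frame(x = c(runif(29, 0, 10), 25))
dat$y <- 1 + 2 * dat$x + rnorm(30)
dat$y[30] <- 10                            # make row 30 an influential outlier

fit_all  <- lm(y ~ x, data = dat)
fit_drop <- lm(y ~ x, data = dat[-30, ])   # negative index drops row 30

# Compare the intercept and slope with and without the dropped row
rbind(all = coef(fit_all), dropped = coef(fit_drop))
```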