In this exercise set, we handle over-dispersion in a count model using a quasi-Poisson GLM. Over-dispersion simply means that the variance is greater than the mean. It matters because it leads to underestimated standard errors and inflated test statistics, which increases the risk of Type I errors. We will use a data-set on amphibian road kills (Zuur et al., 2009). It has 17 explanatory variables; we will focus on nine of them, using the total number of kills (TOT.N) as the response variable.
Please download the data-set here and name it “Road.” Answers to these exercises are available here. If you obtained a different (correct) answer from those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and the required packages before running the exercises.
Exercise 1
Do some plotting. You should see that both the number of kills and their variability decrease with distance.
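A minimal sketch of such a plot, assuming the data were loaded into a data frame called Road and that the distance column is named D.PARK (distance to the park), as in Zuur et al. (2009); adjust the file name and column names to match your copy of the data:

# Read the data (file name and format are assumptions; adjust as needed)
Road <- read.table("Road.txt", header = TRUE)

# Kills against distance: both the number and the spread of kills
# shrink as distance increases
plot(Road$D.PARK, Road$TOT.N,
     xlab = "Distance to park",
     ylab = "Total number of road kills (TOT.N)")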
Exercise 2
Fit a GLM with distance as the explanatory variable.
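One possible sketch of this step, again assuming the distance variable is D.PARK; we start with a plain Poisson GLM so that over-dispersion can be assessed later:

# Poisson GLM with distance as the only explanatory variable
M1 <- glm(TOT.N ~ D.PARK, family = poisson, data = Road)
summary(M1)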
Exercise 3
Add more covariates to the model and check the model summary to see what happens.
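A hedged sketch of an extended model; the additional covariate names (OPEN.L, MONT.S, L.WAT.C, SHRUB) follow Zuur et al. (2009) and are assumptions here, so replace them with the columns you actually want to use:

# Poisson GLM with several covariates added
M2 <- glm(TOT.N ~ D.PARK + OPEN.L + MONT.S + L.WAT.C + SHRUB,
          family = poisson, data = Road)
summary(M2)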
Exercise 4
Check for collinearity using VIFs, and set the base-R option that controls how missing values are handled.
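One way to do this, assuming the car package is available for the VIFs and that the base-R option meant here is na.action:

# Make model fitting fail loudly when missing values are present
options(na.action = "na.fail")

# Variance inflation factors; values well above 5-10 suggest collinearity
library(car)
vif(M2)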
Exercise 5
Check the summary again with the base-R option in place. See the previous related exercise post for why we do this.
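In code this step is short; a sketch, assuming the option set in Exercise 4 is still in effect:

summary(M2)              # re-check coefficients and standard errors
getOption("na.action")   # confirm how missing values are currently handled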
Exercise 6
Check for over-dispersion (rule of thumb: the dispersion statistic should be close to 1). If it is clearly greater or smaller than 1, check the diagnostic plots and re-fit the GLM with a different model structure.
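A common estimate of the dispersion parameter is the sum of squared Pearson residuals divided by the residual degrees of freedom; a sketch based on the model above:

# Dispersion statistic: roughly 1 for a well-specified Poisson model
E2 <- resid(M2, type = "pearson")
sum(E2^2) / df.residual(M2)

# If it is well above 1, refit with the quasi-Poisson family; summary()
# then reports the estimated dispersion parameter
M3 <- glm(TOT.N ~ D.PARK + OPEN.L + MONT.S + L.WAT.C + SHRUB,
          family = quasipoisson, data = Road)
summary(M3)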
Exercise 7
Restructure the model by dropping the least significant term, then refit and repeat until only significant terms remain.
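A hedged sketch of this backwards pruning (see the comment below on why stepwise selection is problematic); for a quasi-Poisson fit, drop1() with an F test is the appropriate comparison:

drop1(M3, test = "F")          # find the least significant term

# Remove that term (OPEN.L is only a placeholder here) and refit,
# repeating until all remaining terms are significant
M4 <- update(M3, . ~ . - OPEN.L)
drop1(M4, test = "F")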
Exercise 8
Check the diagnostic plots. If problems remain, we might need another type of regression, such as negative binomial regression, which we will discuss in the next exercise post.
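A sketch of the standard diagnostic plots for whatever model you end up with (M4 here is the pruned model from Exercise 7):

# Residuals vs fitted, Q-Q, scale-location, and leverage plots
par(mfrow = c(2, 2))
plot(M4)
par(mfrow = c(1, 1))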
Exercise 7 is bad statistical practice, as it ignores recent research. The comments below are derived from Chapter 4, Section 3 (Variable Selection) of the book “Regression Modeling Strategies – with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis” by Frank E. Harrell, Jr.
Stepwise variable selection is commonly employed to form a reduced set of predictive variables from the potential superset (1) to make a more concise model, (2) to avoid collinearity, or (3) because of a false belief that it is not legitimate to include “insignificant” regression coefficients. This stepwise selection is used when the analyst has a set of potential predictors but does not have the necessary subject matter knowledge to enable pre-specification of the “important” variables to include in the model.
Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing. Here is a summary of the problems with this method.
It yields R^2 values that are biased high.
The ordinary F and chi^2 test statistics do not have the claimed distribution. Variable selection is based on methods (e.g., F test for nested models) that were intended to be used to test only prespecified hypotheses.
The method yields standard errors of regression coefficient estimates that are biased low and confidence intervals for effects and predicted values that are falsely narrow.
It yields P-values that are too small (i.e., there are severe multiple comparison problems) and that do not have the proper meaning, and the proper correction for them is a very difficult problem.
It provides regression coefficients that are biased high in absolute value and need shrinkage. Even if only a single predictor were being analyzed and one only reported the regression coefficient for that predictor if its association with Y were “statistically significant,” the estimate of the regression coefficient is biased (too large in absolute value).
Rather than solving problems caused by collinearity, variable selection is made arbitrary by collinearity.
It allows us to not think about the problems.
Again from Harrell (2015): The problems of P-value-based variable selection are exacerbated when the analyst interprets the final model as if it were prespecified. J. B. Copas and Tianyong Long (see below for BibTeX reference) stated one of the most serious problems with stepwise modeling eloquently when they said, “The choice of the variables to be included depends on estimated regression coefficients rather than their true values, and so X_j is more likely to be included if its regression coefficient is over-estimated than if its regression coefficient is underestimated.” Shelley Derksen and H. J. Keselman (see below for BibTeX reference) studied stepwise variable selection, backward elimination, and forward selection, with these conclusions:
“The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model.
The number of candidate predictor variables affected the number of noise variables that gained entry to the model.
The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.
The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model.”
They found that variables selected for the final model represented noise 0.20 to 0.74 of the time and that the final model usually contained fewer than half of the actual number of authentic predictors. Hence there are many reasons for using methods such as full-model fits or data reduction, instead of using any stepwise variable selection algorithm.
@article{10.2307/2348223,
  author    = {J. B. Copas and Tianyong Long},
  title     = {Estimating the Residual Variance in Orthogonal Regression with Variable Selection},
  journal   = {Journal of the Royal Statistical Society. Series D (The Statistician)},
  volume    = {40},
  number    = {1},
  pages     = {51--59},
  year      = {1991},
  publisher = {[Royal Statistical Society, Wiley]},
  issn      = {00390526, 14679884},
  url       = {http://www.jstor.org/stable/2348223}
}
@article{doi:10.1111/j.2044-8317.1992.tb00992.x,
  author  = {Shelley Derksen and H. J. Keselman},
  title   = {Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables},
  journal = {British Journal of Mathematical and Statistical Psychology},
  volume  = {45},
  number  = {2},
  pages   = {265--282},
  year    = {1992},
  doi     = {10.1111/j.2044-8317.1992.tb00992.x},
  url     = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.1992.tb00992.x},
  eprint  = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.2044-8317.1992.tb00992.x}
}
Hi Mark,
Thank you very much for your feedback. I highly appreciate it!
Yes, I should mention that the step in Exercise 7 should be used carefully. There is a chance that results from stepwise variable selection are misleading. Obviously, we need to bring subject-matter knowledge into the modelling process and not just rely on statistical significance or a concise model to draw conclusions. But again, thank you for evaluating this exercise. I will try to address this using full-model fits or data reduction.