Hi Mark,

Thank you very much for your feedback. I highly appreciate it!

Yes, I should mention that the step in Exercise 7 should be used carefully. There is a real chance that results from stepwise variable selection are misleading. Obviously, we need to consider the subject-matter knowledge behind the statistical process and not rely solely on a statistical value or a concise model to draw a conclusion. But again, thank you for evaluating this exercise. I will try to solve this problem using full-model fits or data reduction instead.

Again from Harrell (2015): The problems of P-value-based variable selection are exacerbated when the analyst interprets the final model as if it were prespecified. J. B. Copas and Tianyong Long (see below for BibTeX reference) stated one of the most serious problems with stepwise modeling eloquently when they said, “The choice of the variables to be included depends on estimated regression coefficients rather than their true values, and so X_j is more likely to be included if its regression coefficient is over-estimated than if its regression coefficient is underestimated.” Shelley Derksen and H. J. Keselman (see below for BibTeX reference) studied stepwise variable selection, backward elimination, and forward selection, with these conclusions:

“The degree of correlation between the predictor variables affected the frequency with which authentic predictor variables found their way into the final model.

The number of candidate predictor variables affected the number of noise variables that gained entry to the model.

The size of the sample was of little practical importance in determining the number of authentic variables contained in the final model.

The population multiple coefficient of determination could be faithfully estimated by adopting a statistic that is adjusted by the total number of candidate predictor variables rather than the number of variables in the final model.”

They found that variables selected for the final model represented noise 0.20 to 0.74 of the time and that the final model usually contained fewer than half of the actual number of authentic predictors. Hence there are many reasons for using methods such as full-model fits or data reduction, instead of using any stepwise variable selection algorithm.
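The noise-variable finding is easy to reproduce in miniature. The sketch below (my own small simulation, not Derksen and Keselman's actual design; the `forward_select` helper, sample sizes, and alpha level are all assumptions for the demo) runs p-value-driven forward selection on data where y is independent of every candidate predictor, and counts how many pure-noise variables still enter the final model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def forward_select(X, y, alpha=0.05):
    """Greedy forward selection: at each step add the candidate with the
    smallest partial-F p-value; stop when no candidate has p < alpha."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining:
        best_p, best_j = 1.0, None
        for j in remaining:
            # Full model: intercept + already-selected columns + candidate j
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss1 = np.sum((y - A @ beta) ** 2)
            # Reduced model: intercept + already-selected columns only
            A0 = (np.column_stack([np.ones(n), X[:, selected]])
                  if selected else np.ones((n, 1)))
            beta0, *_ = np.linalg.lstsq(A0, y, rcond=None)
            rss0 = np.sum((y - A0 @ beta0) ** 2)
            df2 = n - A.shape[1]
            F = (rss0 - rss1) / (rss1 / df2)
            pval = stats.f.sf(F, 1, df2)
            if pval < best_p:
                best_p, best_j = pval, j
        if best_j is None or best_p >= alpha:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# 200 replications: 100 observations, 20 candidate predictors,
# ALL pure noise -- y is generated independently of X.
hits = []
for _ in range(200):
    X = rng.standard_normal((100, 20))
    y = rng.standard_normal(100)
    hits.append(len(forward_select(X, y)))

print("mean number of noise variables selected:", np.mean(hits))
print("fraction of runs selecting at least one:", np.mean(np.array(hits) > 0))
```

With 20 candidates tested at alpha = 0.05, a majority of runs admit at least one noise variable, even though none has any real association with y.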

@article{10.2307/2348223,
  author    = {J. B. Copas and Tianyong Long},
  title     = {Estimating the Residual Variance in Orthogonal Regression with Variable Selection},
  journal   = {Journal of the Royal Statistical Society. Series D (The Statistician)},
  volume    = {40},
  number    = {1},
  pages     = {51--59},
  year      = {1991},
  publisher = {[Royal Statistical Society, Wiley]},
  issn      = {00390526, 14679884},
  url       = {http://www.jstor.org/stable/2348223}
}

@article{doi:10.1111/j.2044-8317.1992.tb00992.x,
  author  = {Shelley Derksen and H. J. Keselman},
  title   = {Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables},
  journal = {British Journal of Mathematical and Statistical Psychology},
  volume  = {45},
  number  = {2},
  pages   = {265--282},
  year    = {1992},
  doi     = {10.1111/j.2044-8317.1992.tb00992.x},
  url     = {https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.1992.tb00992.x},
  eprint  = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.2044-8317.1992.tb00992.x}
}

Stepwise variable selection is commonly employed to form a reduced set of predictive variables from the potential superset (1) to make a more concise model, (2) to avoid collinearity, or (3) because of a false belief that it is not legitimate to include “insignificant” regression coefficients. This stepwise selection is used when the analyst has a set of potential predictors but does not have the necessary subject matter knowledge to enable pre-specification of the “important” variables to include in the model.

Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing. Here is a summary of the problems with this method.

It yields R^2 values that are biased high.

The ordinary F and chi^2 test statistics do not have the claimed distribution. Variable selection is based on methods (e.g., the F test for nested models) that were intended to test only prespecified hypotheses.

The method yields standard errors of regression coefficient estimates that are biased low and confidence intervals for effects and predicted values that are falsely narrow.

It yields P-values that are too small (i.e., there are severe multiple comparison problems) and that do not have the proper meaning, and the proper correction for them is a very difficult problem.

It provides regression coefficients that are biased high in absolute value and need shrinkage. Even if only a single predictor were being analyzed and one only reported the regression coefficient for that predictor if its association with Y were “statistically significant,” the estimate of the regression coefficient is biased (too large in absolute value).

Rather than solving problems caused by collinearity, variable selection is made arbitrary by collinearity.

It allows us to not think about the problems.
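The coefficient-bias point above (a slope reported only when it is “statistically significant” is too large in absolute value) can be checked with a quick simulation. This is a sketch under assumed values: the true slope of 0.2, n = 50, and the no-intercept model are arbitrary choices for the demo, not anything from Harrell:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_beta = 0.2        # weak true effect (assumed for the demo)
n = 50
all_betas, kept = [], []

for _ in range(5000):
    x = rng.standard_normal(n)
    y = true_beta * x + rng.standard_normal(n)
    # No-intercept OLS slope and its t-test
    bhat = (x @ y) / (x @ x)
    resid = y - bhat * x
    se = np.sqrt(resid @ resid / (n - 1)) / np.sqrt(x @ x)
    p = 2 * stats.t.sf(abs(bhat / se), n - 1)
    all_betas.append(bhat)
    if p < 0.05:       # "report the coefficient only if significant"
        kept.append(bhat)

print("true beta:                      ", true_beta)
print("mean over ALL estimates:        ", np.mean(all_betas))  # ~unbiased
print("mean over 'significant' ones:   ", np.mean(kept))       # biased high
```

Averaged over all replications the OLS estimate is essentially unbiased, but conditioning on significance keeps only the replications where the estimate happened to be large, so the reported-coefficient average lands well above the true value of 0.2.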