A generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows the response variable to follow an error distribution other than the normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function. It also allows the magnitude of the variance of each measurement to be a function of its predicted value.
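The two ideas above can be written compactly. The notation here is an assumption introduced for illustration (the text does not define symbols): \(\mu_i\) is the expected response, \(g\) the link function, \(V\) the variance function and \(\phi\) a dispersion parameter.

```latex
\begin{align*}
  g(\mu_i) &= \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},
  & \mu_i &= \mathrm{E}[Y_i], \\
  \mathrm{Var}(Y_i) &= \phi \, V(\mu_i).
\end{align*}
% Poisson regression with the log link:
%   g(\mu_i) = \log \mu_i, \quad V(\mu_i) = \mu_i, \quad \phi = 1.
```

For Poisson regression the variance equals the mean, which is why the dispersion check later in the exercise looks for a value close to 1.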
The GLMs used most often for this kind of data fall into three groups:
• Poisson Regression – for count data with no over- or under-dispersion.
• Quasi-Poisson or Negative Binomial Models – for count data that are over-dispersed.
• Logistic Regression Models – where the response data are binary (e.g. present or absent, male or female) or proportional (e.g. percentages).
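As a rough illustration of how these three groups map onto the `family` argument of R's `glm()`, here is a minimal sketch on simulated data. The variables `x`, `counts` and `present` are made up for this example and are not part of the exercise data-set.

```r
# Simulated data only: one predictor, one count response, one binary response.
set.seed(42)
n <- 100
x <- runif(n)
counts  <- rpois(n, exp(0.5 + 1.2 * x))      # count response
present <- rbinom(n, 1, plogis(-1 + 2 * x))  # binary response (0/1)

# Poisson regression: counts, variance assumed equal to the mean.
fit_pois  <- glm(counts ~ x, family = poisson(link = "log"))

# Quasi-Poisson: same mean model, but the dispersion is estimated from the data.
# (A negative binomial fit would typically use MASS::glm.nb instead.)
fit_quasi <- glm(counts ~ x, family = quasipoisson(link = "log"))

# Logistic regression: binary response.
fit_logit <- glm(present ~ x, family = binomial(link = "logit"))

print(coef(fit_pois))
print(coef(fit_logit))
```

The Poisson and quasi-Poisson fits return the same coefficient estimates; only the standard errors differ, because the quasi-Poisson fit scales them by the estimated dispersion.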
In this exercise, we will focus on GLMs that use Poisson regression. Please download the data-set for this exercise here. The data-set investigates the biogeographical determinants of species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type in their paper.
Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.
Load the data and check the data structure using the scatterplotMatrix function. Assess its co-variation and data patterning.
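A minimal sketch of this step follows. Because the real file is not reproduced here, it uses simulated data with hypothetical column names (`Srich`, `Latitude`, `Elevation`, `Habitat`); replace the simulation with your own `read.csv()` call on the downloaded file. It assumes `scatterplotMatrix` comes from the `car` package and falls back to base R's `pairs()` if `car` is not installed.

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

str(ants)  # check variable types and dimensions

# Scatterplot matrix of the numeric variables.
if (requireNamespace("car", quietly = TRUE)) {
  car::scatterplotMatrix(~ Srich + Latitude + Elevation, data = ants)
} else {
  pairs(ants[, c("Srich", "Latitude", "Elevation")])  # base R fallback
}

# Pairwise correlations as a numeric summary of co-variation.
cor_mat <- cor(ants[, c("Srich", "Latitude", "Elevation")])
print(round(cor_mat, 2))
```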
Fit a GLM and compute variance inflation factors (VIF) to check for variance inflation. Pay attention to collinearity among the predictors.
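One way to sketch this step in base R is shown below, again on simulated data with hypothetical column names. The helper computes the classical VIF, 1 / (1 - R²), by regressing each design-matrix column on the others. Note that `car::vif()` works from the fitted model's coefficient covariance instead, so for a GLM its numbers may differ somewhat from these.

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

# Poisson GLM with main effects only.
fit <- glm(Srich ~ Latitude + Elevation + Habitat, family = poisson, data = ants)

# Classical VIF per design-matrix column (intercept dropped).
X <- model.matrix(fit)[, -1]
vif_manual <- sapply(colnames(X), function(v) {
  others <- X[, colnames(X) != v, drop = FALSE]
  1 / (1 - summary(lm(X[, v] ~ others))$r.squared)
})
print(round(vif_manual, 2))
```

A common rule of thumb treats values well above roughly 5 to 10 as a sign of problematic collinearity, though the exact cut-off is a judgment call.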
If there are any issues with the co-variation, try to center the predictor variables.
Re-run VIF with the new variables.
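Centering mainly helps when the model contains interaction (or polynomial) terms, because a raw product such as latitude × elevation is strongly correlated with its components. The sketch below assumes such an interaction model, which the exercise text does not spell out, and uses simulated data with hypothetical column names.

```r
# Simulated predictors only (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550)
)

# Classical VIF for each column of a design matrix.
vif_cols <- function(X) {
  sapply(colnames(X), function(v) {
    others <- X[, colnames(X) != v, drop = FALSE]
    1 / (1 - summary(lm(X[, v] ~ others))$r.squared)
  })
}

# Raw predictors with their interaction.
X_raw <- model.matrix(~ Latitude * Elevation, data = ants)[, -1]
vif_raw <- vif_cols(X_raw)

# Center each predictor on its mean (scale = FALSE keeps the original units).
ants$cLatitude  <- as.numeric(scale(ants$Latitude,  scale = FALSE))
ants$cElevation <- as.numeric(scale(ants$Elevation, scale = FALSE))

X_cen <- model.matrix(~ cLatitude * cElevation, data = ants)[, -1]
vif_centered <- vif_cols(X_cen)

print(round(vif_raw, 1))
print(round(vif_centered, 1))
```

Centering does not change the fitted values of the model; it only changes how the main-effect coefficients are interpreted (effects at the mean of the other predictor rather than at zero).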
Check for influential data points using influence measures (Cook's distance) and plot the result. If every value is below 1, no observation is unduly influential and you can proceed.
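A minimal sketch of the influence check, using base R's `cooks.distance()` on a Poisson GLM fitted to simulated data (column names are hypothetical):

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

fit <- glm(Srich ~ Latitude + Elevation + Habitat, family = poisson, data = ants)

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# One vertical bar per observation, with the threshold of 1 marked.
plot(cd, type = "h", xlab = "Observation", ylab = "Cook's distance")
abline(h = 1, lty = 2)

# Indices of observations exceeding the threshold (empty if none do).
influential <- which(cd > 1)
print(influential)
```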
Check for over-dispersion. The dispersion statistic should be close to 1 before you move on to the next step.
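A common way to compute the dispersion statistic is the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch below applies this to simulated data with hypothetical column names; the exercise's own solution may use a different but equivalent formula.

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

fit <- glm(Srich ~ Latitude + Elevation + Habitat, family = poisson, data = ants)

# Pearson chi-square divided by residual degrees of freedom.
pearson_chisq <- sum(residuals(fit, type = "pearson")^2)
dispersion <- pearson_chisq / df.residual(fit)
print(dispersion)
```

Values well above 1 suggest over-dispersion, in which case a quasi-Poisson or negative binomial model is usually the safer choice.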
Check the model summary. What can we infer?
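When reading the summary, it can help to remember that a Poisson GLM with the log link estimates effects on the log scale, so exponentiated coefficients act as multiplicative changes in the expected count. A small sketch on simulated data (hypothetical column names):

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

fit <- glm(Srich ~ Latitude + Elevation + Habitat, family = poisson, data = ants)

# Coefficient table: estimate, standard error, z value and p-value.
coef_table <- coef(summary(fit))
print(round(coef_table, 4))

# Multiplicative effect on expected richness per unit change in each predictor.
rate_ratios <- exp(coef(fit))
print(round(rate_ratios, 3))
```

For example, a rate ratio of 0.8 for latitude would mean expected richness drops by about 20% for each additional degree of latitude, holding the other predictors fixed.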
Since we have several candidate variables, we will do model averaging. The first step is to set the base R option that controls how missing values are handled. Then, try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation and habitat variables to produce the best model.
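The exercise most likely intends a dedicated package for this step (for instance `MuMIn`, whose `dredge()` requires `options(na.action = "na.fail")`), but that is an assumption. The base R sketch below shows the underlying idea on simulated data with hypothetical column names: fit every subset of the three predictors, convert AIC differences into Akaike weights, and sum the weights of the models that contain each variable.

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

# Fail loudly on missing values so all models use the same rows.
old_opts <- options(na.action = "na.fail")

# All 2^3 = 8 subsets of the predictors, including the intercept-only model.
grid <- expand.grid(Latitude = c(FALSE, TRUE),
                    Elevation = c(FALSE, TRUE),
                    Habitat = c(FALSE, TRUE))
fits <- lapply(seq_len(nrow(grid)), function(i) {
  keep <- names(grid)[unlist(grid[i, ])]
  rhs  <- if (length(keep) > 0) paste(keep, collapse = " + ") else "1"
  glm(as.formula(paste("Srich ~", rhs)), family = poisson, data = ants)
})

# Akaike weights from AIC differences.
aic      <- sapply(fits, AIC)
delta    <- aic - min(aic)
akaike_w <- exp(-delta / 2) / sum(exp(-delta / 2))

# Sum of weights over the models that include each variable.
importance <- sapply(names(grid), function(v) sum(akaike_w[grid[[v]]]))
print(round(importance, 3))

options(old_opts)  # restore the previous na.action setting
```

With small samples, AICc is often preferred over AIC; this sketch uses plain AIC to stay within base R.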
Check validation plots.
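Which plots count as "validation plots" is not specified here, so the following is one reasonable choice rather than the exercise's own solution: Pearson residuals against fitted values, plus a normal Q-Q plot of the deviance residuals. Data are simulated and column names are hypothetical.

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

fit <- glm(Srich ~ Latitude + Elevation + Habitat, family = poisson, data = ants)

res_pearson  <- residuals(fit, type = "pearson")
res_deviance <- residuals(fit, type = "deviance")

op <- par(mfrow = c(1, 2))
# Residuals vs fitted: look for trends or a funnel shape.
plot(fitted(fit), res_pearson,
     xlab = "Fitted values", ylab = "Pearson residuals")
abline(h = 0, lty = 2)
# Q-Q plot of deviance residuals: look for strong departures from the line.
qqnorm(res_deviance, main = "Deviance residuals")
qqline(res_deviance)
par(op)
```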
Produce a base plot of the data and add the model's predicted values as points.
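A minimal sketch of this last step with base graphics follows. The choice of latitude for the x-axis, like the column names and the simulated data, is an assumption made for illustration.

```r
# Simulated stand-in for the exercise data (column names are assumptions).
set.seed(1)
n <- 44
ants <- data.frame(
  Latitude  = runif(n, 41.9, 45.0),
  Elevation = runif(n, 1, 550),
  Habitat   = factor(rep(c("Bog", "Forest"), each = n / 2))
)
ants$Srich <- rpois(n, exp(12 - 0.24 * ants$Latitude - 0.001 * ants$Elevation +
                           0.6 * (ants$Habitat == "Forest")))

fit <- glm(Srich ~ Latitude + Elevation + Habitat, family = poisson, data = ants)

# Predictions on the response (count) scale rather than the log scale.
ants$pred <- as.numeric(predict(fit, type = "response"))

# Observed richness as filled grey points, predictions as open red points.
plot(Srich ~ Latitude, data = ants, pch = 16, col = "grey40",
     xlab = "Latitude", ylab = "Species richness")
points(ants$Latitude, ants$pred, pch = 1, col = "red")
legend("topright", legend = c("Observed", "Predicted"),
       pch = c(16, 1), col = c("grey40", "red"))
```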