- Spatial Data Analysis: Introduction to Raster Processing (Part 1)
- Spatial Data Analysis: Introduction to Raster Processing: Part-3
- Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification
- Become a Top R Programmer Fast with our Individual Coaching Program
- Explore all our (>4000) R exercises
- Find an R course using our R Course Finder directory

This is the last exercise set in the Basic Generalized Linear Modeling (GLM) series. If you missed the other parts of the Basic GLM Exercise, please click here to find them.

In this exercise, we will discuss logistic regression as one of the GLM methods. The model is used when the response data is binary (e.g. male or female, presence or absence) or proportional (e.g. a percentage or ratio).

`M1 <- glm(response ~ Predictor1 + Predictor2, family = binomial)`
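
As a minimal sketch of this call, we can fit the same kind of model to simulated data. The data frame `sim`, the predictor `ratio` and the coefficients used to simulate it are all made up for illustration; they are not the exercise data.

```r
# Simulated stand-in for a presence/absence data set (not the exercise data)
set.seed(1)
n <- 50
ratio <- runif(n, 0, 60)                  # hypothetical continuous predictor
p_true <- plogis(3 - 0.2 * ratio)         # true probability of presence
presence <- rbinom(n, size = 1, prob = p_true)
sim <- data.frame(presence, ratio)

# Binomial family with its default logit link
m1 <- glm(presence ~ ratio, family = binomial, data = sim)
summary(m1)$coefficients
```

The coefficients are on the log-odds scale, so they need to be back-transformed before they can be read as probabilities.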

The data-set comes from Polis et al. (1998), who recorded island characteristics in the Gulf of California. Following the analysis in Quinn and Keough (2002), we model the presence/absence of a spider predator against the perimeter-to-area ratio of the islands.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set here, call it ‘spider’ and load all the required packages before running the exercise.

**Exercise 1**

Visualize the data.

**Exercise 2**

Run the model.

**Exercise 3**

Check for over-dispersion.

**Exercise 4**

Use component+residual plots (`crPlots`) for further checking of the model, in particular whether the relationship with the predictor looks linear.

**Exercise 5**

Check influential values.

**Exercise 6**

Check Cook's distance and the model summary.

**Exercise 7**

Check residuals.

**Exercise 8**

Plot and predict. Calculate the predicted values based on the fitted model.

**Exercise 9**

Produce a final plot that includes the base plot, the fitted model and 95% CI bands.
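
One common way to build such bands, sketched here on simulated data, is to predict on the link scale, add and subtract 1.96 standard errors, and back-transform with the inverse logit. The variable names and the simulated coefficients are assumptions for illustration, not the exercise solution.

```r
set.seed(3)
n <- 60
ratio <- runif(n, 0, 60)                              # hypothetical predictor
presence <- rbinom(n, 1, plogis(3 - 0.2 * ratio))     # simulated 0/1 response
fit <- glm(presence ~ ratio, family = binomial)

# Predict on the link (logit) scale, where the normal approximation is applied
new <- data.frame(ratio = seq(min(ratio), max(ratio), length.out = 100))
pr <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# Back-transform the fit and the +/- 1.96 SE limits to the probability scale
new$fit   <- plogis(pr$fit)
new$lower <- plogis(pr$fit - 1.96 * pr$se.fit)
new$upper <- plogis(pr$fit + 1.96 * pr$se.fit)

plot(ratio, presence, xlab = "Perimeter-to-area ratio", ylab = "Presence")
lines(new$ratio, new$fit)
lines(new$ratio, new$lower, lty = 2)
lines(new$ratio, new$upper, lty = 2)
```

Building the interval on the link scale keeps the bands inside the 0 to 1 range after back-transformation.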

**Exercise 10**

Compute the odds ratio to estimate how the odds of presence change for each unit increase in the perimeter-to-area ratio.
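
In a logistic regression the exponentiated slope is the odds ratio per unit increase in the predictor. The sketch below shows this on simulated data; the names and numbers are hypothetical, and the interval is a simple Wald interval from `confint.default`.

```r
set.seed(11)
n <- 80
ratio <- runif(n, 0, 60)                              # hypothetical predictor
presence <- rbinom(n, 1, plogis(3 - 0.2 * ratio))     # simulated 0/1 response
fit <- glm(presence ~ ratio, family = binomial)

# exp(coefficient) is the multiplicative change in the odds per unit increase
odds_ratio <- exp(coef(fit)["ratio"])

# Wald interval on the log-odds scale, back-transformed to the odds-ratio scale
ci <- exp(confint.default(fit)["ratio", ])

odds_ratio
ci
```

An odds ratio below 1 means the odds of presence fall as the predictor increases; above 1 means they rise.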

**Exercise 11**

Estimate the R² value. What can be inferred?
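
GLMs have no single agreed R². One frequently used analogue, which we assume here, is the proportion of deviance explained, 1 minus the ratio of residual to null deviance. The sketch uses simulated data with hypothetical names.

```r
set.seed(5)
n <- 80
ratio <- runif(n, 0, 60)                              # hypothetical predictor
presence <- rbinom(n, 1, plogis(3 - 0.2 * ratio))     # simulated 0/1 response
fit <- glm(presence ~ ratio, family = binomial)

# Proportion of the null deviance that the fitted model accounts for
r2_dev <- 1 - fit$deviance / fit$null.deviance
r2_dev
```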

This exercise continues the previous GLM exercise set, so the exercise numbers start at 9. Please make sure you have read and worked through the previous exercise before you continue.

In the last exercise, we found that the model was over-dispersed, so we tried quasi-Poisson regression together with step-wise variable selection. Please note that we assumed no influence from background theory or prior knowledge about the data. That is rarely realistic in practice, but we take this step purely as a general exercise.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 9**

Load the `MASS` package to fit a negative binomial model. Fit the model using all the explanatory variables.
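
A minimal sketch of a negative binomial fit with `glm.nb` from `MASS`, using simulated counts. The data frame `sim` and its columns are hypothetical; with the real data the formula would name the road-kill variables instead.

```r
library(MASS)  # ships with standard R distributions

set.seed(9)
n <- 200
x1 <- runif(n)
x2 <- rnorm(n)
# Simulated over-dispersed counts (negative binomial with size = 1.5)
counts <- rnbinom(n, mu = exp(2 + 1 * x1 - 0.5 * x2), size = 1.5)
sim <- data.frame(counts, x1, x2)

# "." on the right-hand side uses every other column as an explanatory variable
nb <- glm.nb(counts ~ ., data = sim)
summary(nb)$coefficients
nb$theta   # estimated dispersion parameter of the negative binomial
```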

**Exercise 10**

Check the summary of the model.

**Exercise 11**

Set the base R option that controls how missing values are handled.

**Exercise 12**

The previous exercise suggested that variables 1, 3, 4 and 6, or variables 1, 4 and 6, give the best model performance. Refit the model using those variables.

**Exercise 13**

Check the diagnostic plots and conclude whether this model gives the best performance.

In this exercise, we will handle an over-dispersed model using the quasi-Poisson model. Over-dispersion simply means that the variance is greater than the mean. It matters because it makes the standard errors too small, which inflates the test statistics and increases the chance of Type I errors. We will use a data-set on amphibian road kill (Zuur et al., 2009). It has 17 explanatory variables. We are going to focus on nine of them, using the total number of kills (TOT.N) as the response variable.
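
To make the effect on the standard errors concrete, here is a small sketch on simulated over-dispersed counts. The variables `distance` and `kills` are invented stand-ins, not the road-kill data. The quasi-Poisson fit keeps the Poisson coefficients but scales the standard errors by the square root of the estimated dispersion.

```r
set.seed(7)
n <- 200
distance <- runif(n, 0, 10)                      # hypothetical covariate
kills <- rnbinom(n, mu = exp(3 - 0.2 * distance), size = 2)

pois  <- glm(kills ~ distance, family = poisson)
quasi <- glm(kills ~ distance, family = quasipoisson)

# Same coefficients; quasi-Poisson standard errors are larger by sqrt(dispersion)
phi <- summary(quasi)$dispersion
se_pois  <- summary(pois)$coefficients[, "Std. Error"]
se_quasi <- summary(quasi)$coefficients[, "Std. Error"]
cbind(se_pois, se_quasi, ratio = se_quasi / se_pois, sqrt_phi = sqrt(phi))
```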

Please download the data-set here and name it “Road.” Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 1**

Plot the data. You should see the variability of kills decreasing with distance.

**Exercise 2**

Run a GLM with distance as the explanatory variable.

**Exercise 3**

Add more covariates to the model and see what happens by checking the model summary.

**Exercise 4**

Check for collinearity using VIFs. Set the options in base R concerning missing values.

**Exercise 5**

Check the summary again and set the base R options. The previous related exercise post explains why we do this.

**Exercise 6**

Check for over-dispersion (as a rule of thumb, the dispersion value should be around 1). If it is still clearly greater or less than 1, we need to check the diagnostic plots and re-run the GLM with a different model structure.
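
One common way to compute this value, which we assume is what the rule of thumb refers to, is the sum of squared Pearson residuals divided by the residual degrees of freedom. The sketch uses simulated counts with hypothetical names.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 2)
# Negative binomial counts: the variance exceeds the mean by construction
y <- rnbinom(n, mu = exp(1 + 0.5 * x), size = 1)

fit <- glm(y ~ x, family = poisson)

# Dispersion statistic: Pearson chi-square divided by residual degrees of freedom
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion
```

A value well above 1, as these simulated data should give, points to over-dispersion relative to the Poisson assumption.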

**Exercise 7**

Restructure the model by dropping the least significant term and refitting. Repeat until the model contains fewer terms, all of them significant.

**Exercise 8**

Check the diagnostic plots. If there are still some problems, then we might need to use other types of regression, like Negative Binomial regression. We’ll discuss it in the next exercise post.

A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.

The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function. It also allows the magnitude of the variance of each measurement to be a function of its predicted value.
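
The link and variance functions are stored in R's family objects, so we can inspect them directly. This is a small illustration with made-up mean values, not part of the exercise data.

```r
fam <- poisson()            # Poisson family, log link by default
fam$link                    # "log"

mu <- c(0.5, 2, 10)         # hypothetical expected counts
eta <- fam$linkfun(mu)      # linear-predictor scale: log(mu)
fam$linkinv(eta)            # back to the response scale
fam$variance(mu)            # Poisson variance function: equal to the mean

binomial()$linkinv(0)       # logit link: a linear predictor of 0 maps to 0.5
```

The linear model is fitted on the `eta` scale, and the variance function is what lets the spread of each observation depend on its predicted mean.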

GLMs can be split into three groups:

• **Poisson regression**: for count data with no over- or under-dispersion issues.

• **Quasi-Poisson** or **negative binomial models**: where the models are over-dispersed.

• **Logistic regression models**: where the response data is binary (e.g. present or absent, male or female) or proportional (e.g. percentages).

In this exercise, we will focus on GLMs that use Poisson regression. Please download the data-set for this exercise here. The data-set investigates the biogeographical determinants of species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type from their paper.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 1**

Load the data and check the data structure using the `scatterplotMatrix` function. Assess its co-variation and data patterning.

**Exercise 2**

Run a GLM and a VIF analysis to check for variance inflation. Pay attention to collinearity.

**Exercise 3**

If there are any issues with the co-variation, try to center the predictor variables.
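
Centering helps most when the model contains interaction or polynomial terms, because a raw predictor is usually strongly correlated with its own products. The sketch below assumes such an interaction between two invented predictors, `lat` and `elev`; whether the exercise model includes one is an assumption.

```r
set.seed(2)
n <- 100
lat  <- runif(n, 41, 45)     # hypothetical latitude
elev <- runif(n, 0, 500)     # hypothetical elevation

# Raw elevation is almost perfectly correlated with the latitude-by-elevation product
cor_raw <- cor(elev, lat * elev)

# Centre by subtracting the mean (scale = FALSE keeps the original units)
lat_c  <- as.numeric(scale(lat,  center = TRUE, scale = FALSE))
elev_c <- as.numeric(scale(elev, center = TRUE, scale = FALSE))

cor_centred <- cor(elev_c, lat_c * elev_c)
c(raw = cor_raw, centred = cor_centred)
```

Centering changes the interpretation of the intercept and main effects (they now refer to the average site) but not the fitted values.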

**Exercise 4**

Re-run VIF with the new variables.

**Exercise 5**

Check for influential data points and outliers using influence measures (Cook's distance) and plot them. As a rule of thumb, values below 1 are acceptable.
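
A minimal sketch of this check with base R's `cooks.distance`, on simulated Poisson data with hypothetical names:

```r
set.seed(4)
n <- 40
x <- runif(n, 0, 5)
y <- rpois(n, exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

cd <- cooks.distance(fit)
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 1, lty = 2)      # rule-of-thumb threshold of 1
which(cd > 1)               # indices of observations flagged as influential
```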

**Exercise 6**

Check for over-dispersion. The dispersion value should be around 1 before moving on to the next step.

**Exercise 7**

Check the model summary. What can we infer?

**Exercise 8**

Since we have lots of variables, we will do model averaging. The first step is to set options in base R regarding missing values. Then, try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation and habitat variables to produce the best model.
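
The missing-value option in question is most likely `na.action`. Model-selection helpers such as `dredge` in the `MuMIn` package expect it to be set to `"na.fail"` so that every candidate model is fitted to the same rows; that this is the tool the exercise has in mind is an assumption. The sketch below only shows the effect of the option on a tiny invented data frame.

```r
old <- options(na.action = "na.fail")   # make model fitting stop on missing values

d <- data.frame(y = c(2, 0, 3, 1, 4), x = c(1, 2, NA, 4, 5))

# With na.fail, glm() refuses to silently drop the row containing NA
res <- tryCatch(glm(y ~ x, family = poisson, data = d),
                error = function(e) "stopped: missing values present")
res

options(old)                            # restore the previous setting
```

Failing loudly is the point: silently dropped rows would make models with different variables incomparable.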

**Exercise 9**

Check validation plots.

**Exercise 10**

Produce a base plot and add the predicted values as points.