- Spatial Data Analysis: Introduction to Raster Processing (Part 1)
- Spatial Data Analysis: Introduction to Raster Processing: Part-3
- Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification
- Become a Top R Programmer Fast with our Individual Coaching Program
- Explore all our (>4000) R exercises
- Find an R course using our R Course Finder directory

This is the last exercise set on basic Generalized Linear Modeling (GLM). Please click here to find the other parts of the basic GLM exercises that you may have missed.

In this exercise, we will discuss logistic regression models as one of the GLM methods. The model is used where the response data is binary (e.g., male or female, present or absent) or proportional (e.g., percentages and ratios).

`M1 <- glm(response ~ Predictor1 + Predictor2, family = binomial)`

The data set is based on Polis et al. (1998), which recorded island characteristics in the Gulf of California. Following the analysis in Quinn and Keough (2002), we model the presence/absence of a spider predator against the perimeter-to-area ratio of the islands.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set here, call it ‘spider’ and load all the required packages before running the exercise.

**Exercise 1**

Visualize the data.

**Exercise 2**

Run the model.

**Exercise 3**

Check for over-dispersion.
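
A minimal sketch of this check, assuming the fitted model is stored in `M1` as above: the dispersion statistic is the residual deviance (or the sum of squared Pearson residuals) divided by the residual degrees of freedom, and values well above 1 indicate over-dispersion.

```r
# Dispersion statistic: residual deviance / residual degrees of freedom.
# Values well above 1 suggest over-dispersion.
deviance(M1) / df.residual(M1)

# An alternative based on Pearson residuals:
sum(residuals(M1, type = "pearson")^2) / df.residual(M1)
```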

**Exercise 4**

Use component-plus-residual plots (`crPlots` from the `car` package) for further model checking.

**Exercise 5**

Check influential values.

**Exercise 6**

Check the Cook's distance and the model summary.

**Exercise 7**

Check residuals.

**Exercise 8**

Plot and predict. Calculate the predicted values based on the fitted model.

**Exercise 9**

Produce a final plot, including the base plot, the fitted model and 95% CI bands.
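
One possible sketch, assuming hypothetical column names `PA` (0/1 presence) and `RATIO` (perimeter-to-area ratio) in the `spider` data: predict on the link scale with standard errors, then back-transform so the bands stay within [0, 1].

```r
# Prediction grid over the observed range of the predictor
xs   <- data.frame(RATIO = seq(min(spider$RATIO), max(spider$RATIO),
                               length.out = 100))
pred <- predict(M1, newdata = xs, type = "link", se.fit = TRUE)

# Back-transform from the logit scale so the CI stays within [0, 1]
fit   <- plogis(pred$fit)
upper <- plogis(pred$fit + 1.96 * pred$se.fit)
lower <- plogis(pred$fit - 1.96 * pred$se.fit)

plot(PA ~ RATIO, data = spider,
     xlab = "Perimeter-to-area ratio", ylab = "Presence")
lines(xs$RATIO, fit)
lines(xs$RATIO, upper, lty = 2)
lines(xs$RATIO, lower, lty = 2)
```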

**Exercise 10**

Check the odds ratio to estimate how the probability of presence changes with a unit increase in the perimeter-to-area ratio.
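
Since the coefficients of a logistic model are on the log-odds scale, odds ratios are obtained by exponentiating them. A minimal sketch, again assuming the model `M1`:

```r
exp(coef(M1))     # odds ratios per unit increase in each predictor
exp(confint(M1))  # 95% confidence intervals on the odds-ratio scale
```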

**Exercise 11**

Estimate the R2 value. What can be inferred?
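
`glm` does not report an R² directly, but a McFadden-style pseudo-R² can be computed from the deviances. A sketch, assuming the model `M1`:

```r
# Pseudo-R2: proportional reduction in deviance relative to the null model
1 - M1$deviance / M1$null.deviance
```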

In this exercise, we will continue solving problems from the last GLM exercise here; therefore, the exercise numbering starts at 9. Please make sure you read and follow the previous exercise before continuing.

In the last exercise, we found that the model was over-dispersed, so we tried quasi-Poisson regression along with step-wise variable-selection algorithms. Please note that here we assume no influence from background theory or knowledge about the data. Obviously, that never holds in the real world, but we use this assumption purely as an exercise.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 9**

Load the package called “MASS” to fit the negative binomial model. Fit the model, considering all the explanatory variables.
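
A minimal sketch, assuming (hypothetically) that the response is `TOT.N` and the data frame is named `Road`, as in the related road-kill exercise; the formula is illustrative and should list the explanatory variables under consideration.

```r
library(MASS)

# Negative binomial GLM; "." includes all remaining columns as predictors
# and may need trimming to the intended explanatory variables.
M.nb <- glm.nb(TOT.N ~ ., data = Road)
summary(M.nb)
```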

**Exercise 10**

Check the summary of the model.

**Exercise 11**

Set options in base R for handling missing values (e.g., `options(na.action = "na.fail")`, which the model-selection step requires).

**Exercise 12**

The previous exercise gave insight that variables 1, 3, 4 and 6, or 1, 4 and 6, produce the best model performance. Therefore, refit the model using those variables.

**Exercise 13**

Check the diagnostic plots and draw a conclusion about whether the model gives the best performance.

In this exercise, we will try to handle a model that is over-dispersed using the quasi-Poisson model. Over-dispersion simply means that the variance is greater than the mean. It matters because it deflates the estimated standard errors, inflating the test statistics and increasing the possibility of Type I errors. We will use a data set on amphibian road kill (Zuur et al., 2009). It has 17 explanatory variables. We’re going to focus on nine of them, using the total number of kills (TOT.N) as the response variable.

Please download the data-set here and name it “Road.” Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 1**

Do some plotting; we can see decreasing variability of kills with distance.

**Exercise 2**

Run the GLM with distance as the explanatory variable.

**Exercise 3**

Add more co-variables to the model and see what’s happening by checking the model summary.

**Exercise 4**

Check for collinearity using VIFs. Set options in base R concerning missing values.

**Exercise 5**

Check the summary again and set base R options. See the previous related exercise post for why we do this.

**Exercise 6**

Check for over-dispersion (rule of thumb: the value should be around 1). If it is still clearly greater or less than 1, we need to check the diagnostic plots and re-run the GLM with another model structure.
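
As a sketch (with `M2` as a hypothetical name for the fitted Poisson GLM), the dispersion can be estimated and, if needed, the model refitted with a quasi-Poisson family:

```r
# Ratio of residual deviance to residual degrees of freedom; ~1 is ideal
deviance(M2) / df.residual(M2)

# If clearly greater than 1, refit so the dispersion is estimated freely
M2q <- update(M2, family = quasipoisson)
```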

**Exercise 7**

Restructure the model by dropping the least significant terms, and refit until only significant terms remain.

**Exercise 8**

Check the diagnostic plots. If there are still some problems, then we might need to use other types of regression, like Negative Binomial regression. We’ll discuss it in the next exercise post.

A generalized linear model (GLM) is a flexible generalization of an ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.

The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function. It also allows the magnitude of the variance of each measurement to be a function of its predicted value.

GLMs can be split into three groups:

• **Poisson regression** – for count data with no over/under-dispersion issues.

• **Quasi-Poisson or negative binomial models** – where the model is over-dispersed.

• **Logistic regression models** – where the response data is binary (e.g., present or absent, male or female) or proportional (e.g., percentages).
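
These three groups map onto `glm()` family arguments as follows (the variable and data names here are purely illustrative):

```r
m_pois  <- glm(count    ~ x, family = poisson,      data = d)  # Poisson
m_quasi <- glm(count    ~ x, family = quasipoisson, data = d)  # over-dispersed counts
m_logit <- glm(presence ~ x, family = binomial,     data = d)  # binary response

# Negative binomial models are fitted with MASS::glm.nb() rather than
# through a glm() family argument.
```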

In this exercise, we will focus on GLMs that use Poisson regression. Please download the data set for this exercise here. The data set investigates the biogeographic determinants of species richness at a regional scale (Gotelli and Ellison, 2002). The main purpose of this exercise is to replicate the Poisson regression of ant species richness against latitude, elevation and habitat type from their paper.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 1**

Load the data and check the data structure using the `scatterplotMatrix` function. Assess its co-variation and data patterning.

**Exercise 2**

Run a GLM and run a VIF analysis to check for inflation. Pay attention to the collinearity.

**Exercise 3**

If there are any issues with the co-variation, try to center the predictor variables.

**Exercise 4**

Re-run VIF with the new variables.

**Exercise 5**

Check for any influential data points using influence measures (Cook's distance) and create the plot. If the value is less than 1, then we are OK to proceed.

**Exercise 6**

Check for over-dispersion. It needs to be around 1 to go to the next step.

**Exercise 7**

Check the model summary. What can we infer?

**Exercise 8**

Since we have lots of variables, we will do model averaging. The first step is to set options in base R regarding missing values. Then, try to assess which variables have a significant influence on the response variable. Here we include the latitude, elevation and habitat variables to produce the best model.

**Exercise 9**

Check validation plots.

**Exercise 10**

Produce a base-plot and the points of predicted values.

Generalized Additive Models (GAMs) are non-parametric models that fit a smoother to the data. In this exercise, we will look at GAMs using a cubic spline with the `mgcv` package. The data set used can be downloaded here. It contains the results of an experiment on grassland richness over time in Yellowstone National Park (Sikkink et al., 2007).
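
A minimal sketch of such a model, with hypothetical names (`RICHNESS` for the response, `veg` for the data frame); `bs = "cr"` requests a cubic regression spline:

```r
library(mgcv)

g1 <- gam(RICHNESS ~ s(ROCK, bs = "cr"), data = veg)
summary(g1)  # smooth-term significance and deviance explained
plot(g1)     # the estimated smoother with confidence bands
```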

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Load the data-set and required package before running the exercise.

**Exercise 1**

Observe the data-set and try to classify the response and explanatory variables. We will focus on ROCK as an explanatory variable.

**Exercise 2**

Do some scatter-plots.

**Exercise 3**

Since the relationship is not linear, try to fit a GAM with the ROCK variable.

**Exercise 4**

Check the result. What can be inferred?

**Exercise 5**

Do some validation plots.

**Exercise 6**

Plot the base graph.

**Exercise 7**

Predict across the data and add the fitted lines.

**Exercise 8**

Plot the fitted values.

Why do we only use the ROCK variable? It gives the best fit without incorporating all the explanatory variables. Try playing around with the other explanatory variables to see the difference.

A mechanistic model for the relationship between x and y sometimes needs parameter estimation. When model linearisation does not work, we need to use non-linear modeling.

There are three main differences between non-linear and linear modeling in R:

1. Specify the exact nature of the equation.

2. Replace `lm()` with `nls()`, which stands for non-linear least squares.

3. Sometimes we also need to specify the model parameters a, b and c.
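
For example, fitting a power model y = a * x^b with `nls()` might look like this (the data-frame name `clump` is illustrative; INDIV and AREA are the variables described below):

```r
# Starting values for a and b are required because nls() has no defaults
m.pow <- nls(INDIV ~ a * AREA^b, data = clump,
             start = list(a = 0.1, b = 1))
summary(m.pow)
```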

In this exercise, we will use the same data-set as the previous exercise in polynomial regression here. Download the data-set here.

A quick overview of the data-set:

Response variable = number of invertebrates (INDIV)

Explanatory variable = the area of each clump (AREA)

Additional possible response variables = Species richness of invertebrates (SPECIES)

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Load the data set and specify the model. Try to use the power function with `nls()`, using a = 0.1 and b = 1 as the initial parameter values.

**Exercise 2**

Do a quick check by creating a residuals vs. fitted plot, since a normal plot will not work.

**Exercise 3**

Try to build a self-starting function for the power model.

**Exercise 4**

Generate the asymptotic model.

**Exercise 5**

Compare the asymptotic model to the power model using AIC. What can we infer?

**Exercise 6**

Plot the model in one graph.

**Exercise 7**

Predict across the data and plot all three lines.

Here, we use ecological data (Peake and Quinn, 1993) to investigate abundance patterns of invertebrates living in mussel beds in intertidal areas. Possible variable configuration:

Response variable = number of invertebrates (INDIV)

Explanatory variable = the area of each clump (AREA)

Additional possible response variables = Species richness of invertebrates (SPECIES)

Download the data-set here.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Load the data-set and try to look at its structure, particularly the normality. What’s the best guess based on the scatter-plot?

**Exercise 2**

Assess its linearity using the `car` package.

**Exercise 3**

Add in polynomial terms for the distance variable up to the 3rd order.

**Exercise 4**

Validate the model for each order of polynomial models.

**Exercise 5**

Create the predictive model and generate the regression equation. Which one is the best model?

Have a look at this plan view below to get an illustration of how the 2-dimensions model work.

The water levels are stored in a 2-D array. They are numbered as follows:

The water flows are stored in 2 different 2-D arrays.

1. QV: defines water flows between buckets down the screen (in plan view)

2. QH: defines water flows between buckets across the screen.

Let’s get into the modeling by cracking the exercises below. Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Set all the required settings for the model:

a. Set the number of time steps. Here we use 1000 (it shows how many time steps we are going to run in the model; you can change it as much as you want.)

b. Set the total number of cells; here we have 25 x 25 water tanks.

c. Set the time-step in seconds; here it is 1.

d. Set the time at the start of simulations.

e. Set k between each pair of water tanks; here we use a uniform value of k = 0.01.

**Exercise 2**

Create matrix H for the initial water level in the water tank.

**Exercise 3**

Set boundary conditions for the model. Here we have water flow (qh) into the water tanks from three sides (top, left and right) and water flowing out at the bottom (qv; see the plan view). Water flow to the right and to the bottom is considered positive. Don’t forget to declare the matrices for qh and qv.

**Exercise 4**

Create an output model for every 100 time steps.

**Exercise 5**

Run the model by creating a loop for qh, qv, the water-storage update and the model output (remember the threshold loop from the previous exercise).

**Exercise 6**

Plot a model output using a contour plot.

Here, the boundary condition refers to water level and/or water flow as the two main variables that can be manipulated. There are basically two options for doing this: Dirichlet and von Neumann boundary conditions. The Dirichlet method simply fixes q1 and q6 at chosen values, meaning the flows q1 and q6 remain constant over the simulation. The von Neumann condition allows the model to derive q1 and q6 from other influencing factors; for example, the level of a river, represented here as the “tank water level.”

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Dirichlet Boundary Conditions**

**Exercise 1**

Try to modify your latest model with a new step: define q1 and q6 as constant values. The model script will look exactly the same as the previous one; just add a constant-value declaration of q1 and q6 as the first and sixth entries of the q matrix output.

**Exercise 2**

Plot the model simulation.

**Von Neumann Boundary Conditions**

**Exercise 3**

Instead of defining q1 and q6, try to define the water level in tanks 1 and 5. Pay attention while running the model; you only need to calculate the water storage for buckets 2 to 5.

**Exercise 4**

Plot the data.

**Numerical Instability**

As we learned, rounding off and truncation may generate errors in the final result of the model. Therefore, it is important to maintain a relatively small time-step. Furthermore, shortening the time-step gives more useful information. The exercises below simulate running the model for 500 time-steps while only outputting every 10th.

**Exercise 5**

Set the initial information and the output matrices q and H for the calculated water storage. Set all the initial values, including H and k. Use the Dirichlet boundary-condition values from the model script above.

**Exercise 6**

This is the tricky part. Take a deep breath. The purpose of this step is to get the model output and save the data every time-step. Since we will only output for every 10th, we need to create a matrix output that defines the rule.

**Exercise 7**

Create an output array whose size equals the number of outputs by the number of tanks.

**Exercise 8**

We also need to tell the model that only every 10th calculation will be outputted, by creating a loop for the output time.

**Exercise 9**

Tell the model in which row the data should be stored; it increases by one each time. Also, define the output timer, which resets back to zero when the number of time-steps reaches the output interval (every 10th).

**Exercise 10**

Run the model and add conditional logic (an `if` statement) to the output section. Ask the script whether the incremented output timer equals the threshold output value (10). If so, increase the output count by 1 and write the data to the next row. Then, reset the output timer back to zero so that the cycle repeats.

**Exercise 11**

Plot the data.

Instead of playing with parameters and loop calculations, we will develop our single water tank into a network. Assume that we add another two similar water tanks connected through pipes. This could represent the layers of a soil profile or different areas across a catchment. Below is an illustration of the model. The goal of this exercise is to understand how space can be represented in a model.

The new data-set required can be downloaded here.

Answers to these exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Before starting the real exercises, try to load the data-set and plot it to see the data characteristic.

**Exercise 1**

Set the initial time as 0, with time-step = 1 and final time-step = 100. Define the parameters of each water tank, including k (parameter) and s (initial water level), and create an empty output `matrix`, this time with four columns.

**Exercise 2**

Similar to before, try to run the loop through the number of time-steps.

**Exercise 3**

Plot the model output for each water tank; pay attention to the range of the result.

You can try to play with longer time-steps and varieties in the parameter to see if the model is realistic enough to predict the future.

What if the water tank arrangement has two-way interactions? In many situations, water flow direction varies over time, such as water flow in a soil profile under different conditions. The figure below is the schematic diagram of the next model.

On the diagram above, each tank has a hole that connects to the adjacent bucket. If the water level in tank 1 is higher than tank 2, water will obviously flow from tank 1 into 2 until the water levels are equal and vice versa. So, water will flow in either direction. The rate of flow (q) depends on the difference between two tanks (H1-H2) and the size of the hole (k) or pipe that connect the two tanks, written mathematically as follows:

`q = k * (H1 - H2) * t`

Assume that there is no hole on the outside of either bucket 1 or 5, so that the corresponding k values, and hence q1 and q6, are zero.
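
The update rule above can be sketched as a loop (a minimal sketch; the tank count, initial levels and k values are illustrative):

```r
# Two-way tank network: flow between adjacent tanks depends on the
# level difference and the connecting hole size k.
n.tanks <- 5
H  <- c(2, 4, 0, 0, 5)           # initial water levels (illustrative)
k  <- rep(0.01, n.tanks + 1)     # hole sizes between tanks
k[1] <- 0; k[n.tanks + 1] <- 0   # no holes at the outer boundaries
dt <- 1                          # time-step

for (t in 1:100) {
  # q[i] is the flow across boundary i; positive when the left tank is higher
  q <- c(0, k[2:n.tanks] * (H[1:(n.tanks - 1)] - H[2:n.tanks]) * dt, 0)
  # Update each tank: inflow from the left minus outflow to the right
  H <- H + q[1:n.tanks] - q[2:(n.tanks + 1)]
}
H  # levels converge toward equal values as flows balance out
```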

Clear the previous values before starting the following exercises.

**Exercise 4**

Define the initial conditions, including t, the time-step, the final time-step, and ncell as the number of water tanks available. Create an empty `matrix` for the water level (H) and the water flux output (q).

**Exercise 5**

Set the initial parameter for H and k.

**Exercise 6**

Try to use the `seq()` function to declare the time-steps.

**Exercise 7**

Run the loop based on the parameters and the initial condition above.

**Exercise 8**

Plot the water level for each water tank. Pay attention to the output range.

Fancy using a loop for the water-level plotting?

Does the model meet your expectations?

Try to play around with the parameters and see how the model responds with those alterations.