- Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification
- Spatial Data Analysis: Introduction to Raster Processing (Part 1)
- Advanced Techniques With Raster Data – Part 2: Supervised Classification
- Become a Top R Programmer Fast with our Individual Coaching Program
- Explore all our (>4000) R exercises
- Find an R course using our R Course Finder directory

Geospatial data is increasingly used to solve numerous ‘real-life’ problems (check out some examples here). In turn, R has become a powerful open-source solution for handling this type of data, currently providing an exceptional range of functions and tools for GIS and remote-sensing data analysis.

In particular, **raster data** supports the representation of spatial phenomena by dividing the surface into a grid (or matrix) composed of cells of regular size. Each raster data-set has a certain number of columns and rows, and each cell contains a value with information on the variable of interest. Stored data can be either thematic, representing a **discrete** variable (e.g., a land-cover classification map), or **continuous** (e.g., elevation).

The `raster` package currently provides an extensive set of functions to create, read, export, manipulate and process raster data-sets. It also provides low-level functionalities for creating more advanced processing chains, as well as the ability to manage large data-sets. For more information, see: `vignette("functions", package = "raster")`. You can also learn more about raster data in the tutorial series on this topic here.
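As a quick taste of the package's API, here is a minimal sketch (assuming the `raster` package is installed) that builds a small raster from scratch and inspects it:

```
library(raster)

# Build a small 10 x 10 raster covering a 10 x 10 extent and fill it with
# random values
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10)
values(r) <- runif(ncell(r))

# Basic inspection
ncell(r)            # number of cells (100)
res(r)              # cell resolution in x and y
cellStats(r, mean)  # summary statistic computed over all cells
```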

In this exercise set, we will explore the following topics in raster data processing and geostatistical analysis (previously discussed in this tutorial series):

- Unsupervised classification/clustering of satellite data
- Regression-kriging (RK)

We will also address how to use the `RStoolbox` package (link) to calculate the:

- Tasseled Cap Transformation (TCT)
- PCA rotation/transformation

Both data compression techniques examined here will use spectral data from satellite imagery.

Answers to these exercises are available here.

**Exercise 1**

Use the data in this link (Landsat-8 surface reflectance data, bands 1-7, for Peneda-Geres National Park – PGNP, NW Portugal) to answer the next exercises (1 to 6). Download the data, uncompress it, and create a raster brick. How many pixels and layers does the data have?

**Exercise 2**

Make an RGB plot with bands 5, 1, and 3 with linear stretching.

**Exercise 3**

Using the k-means algorithm, perform an unsupervised classification/clustering of the data with 5 clusters.
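If you get stuck, the general pattern for clustering a raster with k-means looks like the sketch below (`rstBrick` is a placeholder for the brick created in Exercise 1; the full solution is in the answers link):

```
library(raster)

# rstBrick stands in for the RasterBrick from Exercise 1 (placeholder name)
vals <- values(rstBrick)          # pixels x bands matrix
idx  <- complete.cases(vals)      # keep only pixels with no NA in any band

set.seed(1)
km <- kmeans(vals[idx, ], centers = 5, iter.max = 100)

# Put the cluster labels back into a single-layer raster for mapping
rstClust <- raster(rstBrick)      # empty raster with the same geometry
rstClust[which(idx)] <- km$cluster
plot(rstClust, main = "k-means clusters (k = 5)")
```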

**Exercise 4**

Use the CLARA algorithm (package `cluster`) to perform an unsupervised classification/clustering of the data with 5 clusters and Euclidean distance.

**Exercise 5**

Using package `RStoolbox`, calculate the Tasseled Cap Transformation of the data (remember it is Landsat-8 data with bands 1-7).

**Exercise 6**

Using package `RStoolbox`, calculate the standardized PCA transform. What is the cumulative % of explained variance in the first three components?

**Exercise 7**

- Use the data in this link to answer the next exercises (annual average temperature for weather stations in Portugal; column `AvgTemp`). Using the Lat and Lon columns from the `clim_data_pt.csv` table, create a `SpatialPointsDataFrame` object with CRS WGS 1984.
- Using Ordinary Kriging from package `gstat`, interpolate the temperature values employing a *Spherical* empirical variogram. Calculate the RMSE from 5-fold cross-validation (see function `krige.cv`) and use `set.seed(12345)`.

**Exercise 8**

Following the rationale of the previous exercise, experiment now with an *Exponential* model. Calculate the RMSE, also from 5-fold CV. Which model performed best according to RMSE?

**Exercise 9**

Using the Cubist regression algorithm (package `Cubist`), predict `AvgTemp` based on latitude (column `Lat`), elevation (column `Elev`) and distance to the coastline (column `distCoast`). Calculate the RMSE for a random test set of 15 observations. Use `set.seed(12345)`.

**Exercise 10**

From the previous exercise, extract the train residuals and interpolate them. Following a regression-kriging approach, add the interpolated residuals to the regression results. Calculate the RMSE for the test set (defined in Exercise 9) and check whether this further improves the modeling performance.
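For Exercises 7 and 8, the cross-validated kriging step could be sketched roughly as below (`statPoints` is a placeholder for the `SpatialPointsDataFrame` you create; the full solution is in the answers link):

```
library(gstat)

set.seed(12345)
# statPoints: SpatialPointsDataFrame holding AvgTemp (placeholder name)
variog    <- variogram(AvgTemp ~ 1, statPoints)
variogFit <- fit.variogram(variog, vgm("Sph"))   # Spherical model, parameters fitted

# 5-fold cross-validation of Ordinary Kriging predictions
cv   <- krige.cv(AvgTemp ~ 1, statPoints, model = variogFit, nfold = 5)
rmse <- sqrt(mean(cv$residual^2))
```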


In this post, the ninth of the geospatial processing series with raster data, I will focus on interpolating and modeling air surface temperature data recorded at weather stations. For this purpose I will explore **regression-kriging** (RK), a spatial prediction technique commonly used in geostatistics that combines a regression of the dependent variable (air temperature in this case) on auxiliary/predictive variables (e.g., elevation, distance from shoreline) with kriging of the regression residuals. RK is mathematically equivalent to the interpolation method variously called universal kriging and kriging with external drift, where auxiliary predictors are used directly to solve the kriging weights.

**Regression-kriging** is an implementation of the best linear unbiased predictor (BLUP) for spatial data, i.e. the best linear interpolator assuming the *universal model of spatial variation*. Hence, RK is capable of modeling the value of a target variable at some location as a sum of a deterministic component (handled by regression) and a stochastic component (kriging). In RK, both deterministic and stochastic components of spatial variation can be modeled separately. Once the deterministic part of variation has been estimated, the obtained residuals can be interpolated with kriging and added back to the estimated trend.
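In the usual geostatistical notation, the RK prediction at an unvisited location \(s_0\) sums these two components:

```
\hat{z}(s_0)
  = \underbrace{\sum_{k=0}^{p} \hat{\beta}_k \, q_k(s_0)}_{\text{deterministic trend (regression)}}
  + \underbrace{\sum_{i=1}^{n} \lambda_i \, e(s_i)}_{\text{stochastic part (kriging of residuals)}}
```

where the \(q_k\) are the \(p\) auxiliary predictors (with \(q_0 = 1\) for the intercept), \(\hat{\beta}_k\) the estimated regression coefficients, \(e(s_i)\) the regression residuals at the \(n\) observation points, and \(\lambda_i\) the kriging weights.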

**Regression-kriging** is used in various fields, including meteorology, climatology, soil mapping, geological mapping, species distribution modeling and similar. The only requirement for using RK is that one or more covariates exist which are significantly correlated with the dependent variable.

Although powerful, RK can perform poorly if the point sample is small and non-representative of the target variable, if the relation between the target variable and the predictors is non-linear (although some non-linear regression techniques can help in this respect), or if the points fail to represent the feature space or represent only its central part.

Seven prediction approaches will be used and compared through cross-validation (10-fold CV):

- Interpolation:
- Ordinary Kriging (OK)

- Regression:
- Generalized Linear Model (GLM)
- Generalized Additive Model (GAM)
- Random Forest (RF)

- Regression-kriging:
- GLM + OK of residuals
- GAM + OK of residuals
- RF + OK of residuals

The sample data used for examples is the annual average air temperature for mainland Portugal which includes and summarizes daily records that range from 1950 to 2000. A total of 95 stations are available, unevenly dispersed throughout the country.

Four auxiliary variables were considered as candidates to model the variation of air temperature:

- Elevation (*Elev*, in meters a.s.l.),
- Distance to the coastline (*distCoast*, in degrees),
- Latitude (*Lat*, in degrees), and,
- Longitude (*Lon*, in degrees).

One raster layer *per* predictive variable, with a spatial resolution of 0.009 deg (ca. 1000m) in WGS 1984 Geographic Coordinate System, is available for calculating a continuous surface of temperature values.

We will start by downloading and unzipping the sample data from the GitHub repository:

```
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/CLIM_DATA_PT.zip", "./data-raw/CLIM_DATA_PT.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/CLIM_DATA_PT.zip", exdir = "./data-raw")
```

Now, let’s load the raster layers containing the predictive variables used to build the regression model of air temperature:

```
library(raster)
# GeoTIFF file list
fl <- list.files("./data-raw/climData/rst", pattern = ".tif$", full.names = TRUE)
# Create the raster stack
rst <- stack(fl)
# Change the layer names to coincide with table data
names(rst) <- c("distCoast", "Elev", "Lat", "Lon")
```

`plot(rst)`

Next step, let’s read the point data containing annual average temperature values along with location and predictive variables for each weather station:

```
climDataPT <- read.csv("./data-raw/ClimData/clim_data_pt.csv")
knitr::kable(head(climDataPT, n=10))
```

StationName | StationID | Lat | Lon | Elev | AvgTemp | distCoast |
---|---|---|---|---|---|---|

Sagres | 1 | 36.98 | -8.95 | 40 | 16.3 | 0.0000000 |

Faro | 2 | 37.02 | -7.97 | 8 | 17.0 | 0.0201246 |

Quarteira | 3 | 37.07 | -8.10 | 4 | 16.6 | 0.0090000 |

Vila do Bispo | 4 | 37.08 | -8.88 | 115 | 16.1 | 0.0360000 |

Praia da Rocha | 5 | 37.12 | -8.53 | 19 | 16.7 | 0.0000000 |

Tavira | 6 | 37.12 | -7.65 | 25 | 16.9 | 0.0458912 |

S. Brás de Alportel | 7 | 37.17 | -7.90 | 240 | 15.9 | 0.1853213 |

Vila Real Sto. António | 8 | 37.18 | -7.42 | 7 | 17.1 | 0.0127279 |

Monchique | 9 | 37.32 | -8.55 | 465 | 15.0 | 0.1980000 |

Zambujeira | 10 | 37.50 | -8.75 | 106 | 15.0 | 0.0450000 |

Based on the previous data, create a `SpatialPointsDataFrame` object to store all points and make some preliminary plots:

```
proj4Str <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
statPoints <- SpatialPointsDataFrame(coords = climDataPT[, c("Lon","Lat")],
                                     data = climDataPT,
                                     proj4string = CRS(proj4Str))
```

```
par(mfrow=c(1,2),mar=c(5,6,3,2))
plot(rst[["Elev"]], main="Elevation (meters a.s.l.) for Portugal\n and weather stations",
xlab = "Longitude", ylab="Latitude")
plot(statPoints, add=TRUE)
hist(climDataPT$AvgTemp, xlab= "Temperature (ºC)", main="Annual avg. temperature")
```

From the figure we can see that: (i) weather stations tend to be concentrated near the coastline and at lower altitudes, and (ii) temperature values are ‘left-skewed’, with a median equal to 15 and a median-absolute deviation (MAD) of 15.

Before proceeding, it is a good idea to inspect the correlation matrix and analyze the strength of association between the response and the predictive variables. For this, we will use the `corrplot` package with some nice graphical options 👍 👍

```
library(corrplot)
corMat <- cor(climDataPT[,3:ncol(climDataPT)])
corrplot.mixed(corMat, number.cex=0.8, tl.cex = 0.9, tl.col = "black",
outline=FALSE, mar=c(0,0,2,2), upper="square", bg=NA)
```

The correlation plot shows that all predictive variables seem to be correlated with average temperature, especially ‘Elevation’ and ‘Latitude’, which are well-known regional controls of temperature variation. It also shows that (as expected, given the country’s shape) ‘Longitude’ and ‘Distance to the coast’ are highly correlated. As such, since ‘Longitude’ is less associated with temperature and its climatic effect is less “direct” (compared to ‘distCoast’), we will remove it.

For comparing the different RK algorithms, we will use 10-fold cross-validation and the root-mean-square error (RMSE) as the evaluation metric.

*Kriging parameters* **nugget**, (partial) **sill** and **range** will be fitted through Ordinary Least Squares (OLS) from a set of previously defined values, adjusted with the help of visual inspection and trial-and-error. The *Exponential* model was selected since it generally gave the best results in preliminary analyses.

The functionalities in package `gstat` were used for all geostatistical analyses.

Now, let’s define some ancillary functions for creating the k-fold train/test data splits and for obtaining the regression residuals out of a random forest object:

```
# Generate the k-fold train/test splits
# x are the row indices
# Outputs a list with train (if train = TRUE) or test indices
kfoldSplit <- function(x, k = 10, train = TRUE){
  x <- sample(x, size = length(x), replace = FALSE)
  out <- suppressWarnings(split(x, factor(1:k)))
  if(train) out <- lapply(out, FUN = function(x, len) (1:len)[-x], len = length(unlist(out)))
  return(out)
}

# Regression residuals from a randomForest object
resid.RF <- function(x) return(x$y - x$predicted)
```
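As a quick sanity check of `kfoldSplit` (defined just above), for 10 indices and k = 2 each returned element holds the training indices of one fold, and together with its complement (the test fold) it covers all rows:

```
# Assumes kfoldSplit() from the chunk above is already defined
set.seed(1)
splits <- kfoldSplit(1:10, k = 2, train = TRUE)
str(splits)

# Train indices plus the held-out test fold reconstruct the full index set
all(sort(c(splits[["1"]], setdiff(1:10, splits[["1"]]))) == 1:10)  # TRUE
```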

We also need to define some additional parameters, get the test/train splits with the `kfoldSplit` function, and initialize the matrix that will store all RMSE values (one for each training round and modelling technique; the `evalData` object).

```
set.seed(12345)
k <- 10
kfolds <- kfoldSplit(1:nrow(climDataPT), k = 10, train = TRUE)
evalData <- matrix(NA, nrow = k, ncol = 7,
                   dimnames = list(1:k, c("OK","RF","GLM","GAM","RF_OK","GLM_OK","GAM_OK")))
```

Now we are ready to start modelling! 😋 One code block, inside the ‘for’ loop, will be used for each regression algorithm tested. Notice how (train) residuals are interpolated through kriging and then (test) residuals are added to (test) regression results for evaluation. Use the comments to guide you through the code.

```
library(randomForest)
library(mgcv)
library(gstat)
for(i in 1:k){
cat("K-fold...",i,"of",k,"....\n")
# TRAIN indices as integer
idx <- kfolds[[i]]
# TRAIN indices as a boolean vector
idxBool <- (1:nrow(climDataPT)) %in% idx
# Observed test data for the target variable
obs.test <- climDataPT[!idxBool, "AvgTemp"]
## ----------------------------------------------------------------------------- ##
## Ordinary Kriging ----
## ----------------------------------------------------------------------------- ##
# Make variogram
formMod <- AvgTemp ~ 1
mod <- vgm(model = "Exp", psill = 3, range = 100, nugget = 0.5)
variog <- variogram(formMod, statPoints[idxBool, ])
# Variogram fitting by Ordinary Least Squares (OLS)
variogFitOLS <- fit.variogram(variog, model = mod, fit.method = 6)
#plot(variog, variogFitOLS, main="OLS Model")
# kriging predictions
OK <- krige(formula = formMod ,
locations = statPoints[idxBool, ],
model = variogFitOLS,
newdata = statPoints[!idxBool, ],
debug.level = 0)
ok.pred.test <- OK@data$var1.pred
evalData[i,"OK"] <- sqrt(mean((ok.pred.test - obs.test)^2))
## ----------------------------------------------------------------------------- ##
## RF calibration ----
## ----------------------------------------------------------------------------- ##
RF <- randomForest(y = climDataPT[idx, "AvgTemp"],
x = climDataPT[idx, c("Lat","Elev","distCoast")],
ntree = 500,
mtry = 2)
rf.pred.test <- predict(RF, newdata = climDataPT[-idx,], type="response")
evalData[i,"RF"] <- sqrt(mean((rf.pred.test - obs.test)^2))
# Ordinary Kriging of Random Forest residuals
#
statPointsTMP <- statPoints[idxBool, ]
statPointsTMP@data <- cbind(statPointsTMP@data, residRF = resid.RF(RF))
formMod <- residRF ~ 1
mod <- vgm(model = "Exp", psill = 0.6, range = 10, nugget = 0.01)
variog <- variogram(formMod, statPointsTMP)
# Variogram fitting by Ordinary Least Squares (OLS)
variogFitOLS <- fit.variogram(variog, model = mod, fit.method = 6)
#plot(variog, variogFitOLS, main="OLS Model")
# kriging predictions
RF.OK <- krige(formula = formMod ,
locations = statPointsTMP,
model = variogFitOLS,
newdata = statPoints[!idxBool, ],
debug.level = 0)
rf.ok.pred.test <- rf.pred.test + RF.OK@data$var1.pred
evalData[i,"RF_OK"] <- sqrt(mean((rf.ok.pred.test - obs.test)^2))
## ----------------------------------------------------------------------------- ##
## GLM calibration ----
## ----------------------------------------------------------------------------- ##
GLM <- glm(formula = AvgTemp ~ Elev + Lat + distCoast, data = climDataPT[idx, ])
glm.pred.test <- predict(GLM, newdata = climDataPT[-idx,], type="response")
evalData[i,"GLM"] <- sqrt(mean((glm.pred.test - obs.test)^2))
# Ordinary Kriging of GLM residuals
#
statPointsTMP <- statPoints[idxBool, ]
statPointsTMP@data <- cbind(statPointsTMP@data, residGLM = resid(GLM))
formMod <- residGLM ~ 1
mod <- vgm(model = "Exp", psill = 0.4, range = 10, nugget = 0.01)
variog <- variogram(formMod, statPointsTMP)
# Variogram fitting by Ordinary Least Squares (OLS)
variogFitOLS <- fit.variogram(variog, model = mod, fit.method = 6)
#plot(variog, variogFitOLS, main="OLS Model")
# kriging predictions
GLM.OK <- krige(formula = formMod ,
locations = statPointsTMP,
model = variogFitOLS,
newdata = statPoints[!idxBool, ],
debug.level = 0)
glm.ok.pred.test <- glm.pred.test + GLM.OK@data$var1.pred
evalData[i,"GLM_OK"] <- sqrt(mean((glm.ok.pred.test - obs.test)^2))
## ----------------------------------------------------------------------------- ##
## GAM calibration ----
## ----------------------------------------------------------------------------- ##
GAM <- gam(formula = AvgTemp ~ s(Elev) + s(Lat) + s(distCoast), data = climDataPT[idx, ])
gam.pred.test <- predict(GAM, newdata = climDataPT[-idx,], type="response")
evalData[i,"GAM"] <- sqrt(mean((gam.pred.test - obs.test)^2))
# Ordinary Kriging of GAM residuals
#
statPointsTMP <- statPoints[idxBool, ]
statPointsTMP@data <- cbind(statPointsTMP@data, residGAM = resid(GAM))
formMod <- residGAM ~ 1
mod <- vgm(model = "Exp", psill = 0.3, range = 10, nugget = 0.01)
variog <- variogram(formMod, statPointsTMP)
# Variogram fitting by Ordinary Least Squares (OLS)
variogFitOLS <- fit.variogram(variog, model = mod, fit.method = 6)
#plot(variog, variogFitOLS, main="OLS Model")
# kriging predictions
GAM.OK <- krige(formula = formMod ,
locations = statPointsTMP,
model = variogFitOLS,
newdata = statPoints[!idxBool, ],
debug.level = 0)
gam.ok.pred.test <- gam.pred.test + GAM.OK@data$var1.pred
evalData[i,"GAM_OK"] <- sqrt(mean((gam.ok.pred.test - obs.test)^2))
}
```

```
## K-fold... 1 of 10 ....
## K-fold... 2 of 10 ....
## K-fold... 3 of 10 ....
## K-fold... 4 of 10 ....
## K-fold... 5 of 10 ....
## K-fold... 6 of 10 ....
## K-fold... 7 of 10 ....
## K-fold... 8 of 10 ....
## K-fold... 9 of 10 ....
## K-fold... 10 of 10 ....
```

Let’s check the mean and standard deviation of the RMSE across the 10 CV folds:

`round(apply(evalData,2,FUN = function(x,...) c(mean(x,...),sd(x,...))),3)`

```
## OK RF GLM GAM RF_OK GLM_OK GAM_OK
## [1,] 1.193 0.678 0.598 0.569 0.613 0.551 0.521
## [2,] 0.382 0.126 0.195 0.186 0.133 0.179 0.163
```

From the results above we can see that RK performed generally better than the regression techniques alone or than Ordinary Kriging. The **GAM-based RK method obtained the best scores** with an RMSE of ca. 0.521. These are pretty good results!! 😋 👍 👍

To finalize, we will predict the temperature values for the entire surface of mainland Portugal based on GAM regression-kriging, the best-performing technique in the test. For this we will not use any test/train partition but the entire dataset:

```
GAM <- gam(formula = AvgTemp ~ s(Elev) + s(Lat) + s(distCoast), data = climDataPT)
rstPredGAM <- predict(rst, GAM, type="response")
```

Next, we need to obtain a surface with kriging-interpolated residuals. For that, we have to convert the input `RasterStack` or `RasterLayer` into a `SpatialPixelsDataFrame` so that the `krige` function can use it as a reference:

`rstPixDF <- as(rst[[1]], "SpatialPixelsDataFrame")`

Like before, we will interpolate the regression residuals with kriging and add them back to the regression results.

```
# Create a temporary SpatialPointsDF object to store GAM residuals
statPointsTMP <- statPoints
crs(statPointsTMP) <- crs(rstPixDF)
statPointsTMP@data <- cbind(statPointsTMP@data, residGAM = resid(GAM))
# Define the kriging parameters and fit the variogram using OLS
formMod <- residGAM ~ 1
mod <- vgm(model = "Exp", psill = 0.15, range = 10, nugget = 0.01)
variog <- variogram(formMod, statPointsTMP)
variogFitOLS <- fit.variogram(variog, model = mod, fit.method = 6)
# Plot the results
plot(variog, variogFitOLS, main="Semi-variogram of GAM residuals")
```

The exponential semi-variogram looks reasonable, although there were some lack-of-convergence problems… 😟 😔

Finally, let’s check the average temperature map obtained from GAM RK:

```
residKrigMap <- krige(formula = formMod ,
locations = statPointsTMP,
model = variogFitOLS,
newdata = rstPixDF)
residKrigRstLayer <- as(residKrigMap, "RasterLayer")
gamKrigMap <- rstPredGAM + residKrigRstLayer
plot(gamKrigMap, main="Annual average air temperature\n(GAM regression-kriging)",
xlab="Longitude", ylab="Latitude", cex.main=0.8, cex.axis=0.7, cex=0.8)
```

This concludes our exploration of the raster package and regression kriging for this post. Hope you find it useful! 😄 👍 👍

In **supervised classification**, contrary to the unsupervised version, *a priori* defined reference classes are used as additional information. This initial process determines which classes will result from the classification. Usually, a statistical or machine-learning algorithm is used to obtain or “learn” a classification function from a set of training examples, which is then used to map every instance (pixel or object, depending on the approach) to its corresponding class. The following workflow is commonly used for deploying a supervised classifier:

- Definition of the thematic classes of land cover/use (ex. coniferous forest, deciduous forest, water, agriculture, urban)
- Selection of suitable training areas (reference areas/pixels for each class)
- Calibration of a classification algorithm (for the training set)
- Classification of the entire image (outside the ‘training space’)
- Classification performance evaluation, verification, and inspection of the results (for the testing set)

Usually, in supervised classification, spectral data from each of the sensor bands are used to obtain a statistical or rule-based *spectral signature* for each class. Besides “direct” spectral data, other kinds of information or features can be used for classifier training. These include band ratios, spectral indices, texture features, temporal variation features (ex. green-up and senescence changes), as well as ancillary data (ex. elevation, slope, built-up masks, roads.)

Combining the training data and the spectral (or other) features in a classifier algorithm allows you to classify the entire image outside the training space. Usually, a form of train/test split set strategy (holdout cross-validation, k-fold CV, etc.) is used. The training set is used for classifier calibration, while the testing set is used for evaluating the classification performance. This process is usually repeated a few times and then an average value of validation indices is calculated.

Because R currently provides a very large set of classification algorithms (a good package to access them is `caret`), it is particularly well-equipped to handle this kind of problem. For developing the examples, the Random Forest (RF) algorithm will be used. RF is implemented in the (conveniently named 😉) `randomForest` package. Although packages such as `caret` provide many useful functions to handle classification (training, tuning and evaluation processes), I will not use them here. My objective in this post is to explore and show the basic, *“under the hood”* workflow in *pixel-based* classification of raster data.

In a nutshell, RF is an ensemble learning method for classification, regression and other tasks that operates by constructing multiple decision trees during the training stage (‘bagging’) and outputting the class that is the mode of the classes (classification) or the average prediction (regression) of the individual trees. This way, RF corrects for decision trees’ habit of over-fitting to their training set. See more here and here.
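To make the idea concrete, here is a minimal, self-contained RF sketch on the classic `iris` data (not part of this post's Berlin workflow), assuming the `randomForest` package is installed:

```
library(randomForest)

set.seed(42)
# 500 trees, each grown on a bootstrap sample of the data; the final class
# is the majority vote across trees
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

print(rf)        # includes the out-of-bag (OOB) error estimate
importance(rf)   # mean decrease in Gini for each predictor
```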

Sample data from the optical **Sentinel-2a** (S2) satellite platform will be used in the examples below (see here for more details.) This data was made available in the 2017 IEEE GRSS Data Fusion Contest and provided by the *GRSS Data and Algorithm Standard Evaluation* (DASE) website (you have to register to access the sample data-sets currently available.) More specifically, we will use one Sentinel-2 scene for Berlin containing 10 spectral bands, originally at 10m and 20m of spatial resolution, but re-sampled to 100m in DASE.

Along with S2 spectral data, DASE also provides training samples for calibrating classifiers. The legend encompasses a total of 17 land cover/use classes, presented in the table below (NOTE: only 12 of the 17 classes actually occur in the Berlin area.)

```
legBerlin <- read.csv(url("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/legend_berlin.csv"))
knitr::kable(legBerlin)
```

Type | Code | Land.cover.class.description |
---|---|---|

Artificial | 1 | Compact high-rise |

Artificial | 2 | Compact midrise |

Artificial | 3 | Compact low-rise |

Artificial | 4 | Open high-rise |

Artificial | 5 | Open midrise |

Artificial | 6 | Open low-rise |

Artificial | 7 | Lightweight low-rise |

Artificial | 8 | Large low-rise |

Artificial | 9 | Sparsely built |

Artificial | 10 | Heavy industry |

Vegetated | 11 | Dense trees |

Vegetated | 12 | Scattered trees |

Vegetated | 13 | Bush and scrub |

Vegetated | 14 | Low plants |

Vegetated | 15 | Bare rock or paved |

Vegetated | 16 | Bare soil or sand |

Vegetated | 17 | Water |

For more info on raster data processing, check out the complete tutorial series here.

Now that we have defined some useful concepts, the workflow, and the data, we can start coding! 👍 👍 The first step is to download and uncompress the spectral data for Sentinel-2. These will later be used as training input features for the classification algorithm to identify the spectral signatures for each class.

```
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/berlin.zip", "./data-raw/berlin.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/berlin.zip", exdir = "./data-raw")
```

Load the required packages for the post:

```
library(raster)
library(randomForest)
```

Now that we have the data files available, let’s create a `RasterStack` object from them. We will also change the layer names to more convenient ones.

```
fl <- list.files("./data-raw/berlin/S2", pattern = ".tif$", full.names = TRUE)
rst <- stack(fl)
names(rst) <- c(paste("b",2:8,sep=""),"b8a","b11","b12")
```

Now, let’s use the `plotRGB` function to visually explore the spectral data from Sentinel-2. RGB composites made from different band combinations allow us to highlight different aspects of land cover and see different layers of the Earth’s surface. Note: band numbers in the Sentinel-2 satellite differ from their positions (integer indices) in the `RasterStack` object.

Let’s start by making a natural-color RGB composite from S2 bands 4, 3, and 2.

`plotRGB(rst, r=3, g=2, b=1, scale=1E5, stretch="lin")`

Next, let’s see a healthy-vegetation composite from S2 bands 8, 11, and 2.

`plotRGB(rst, r=7, g=9, b=1, scale=1E5, stretch="lin")`

Finally, a false-color urban composite using S2 bands 12, 11, and 4.

`plotRGB(rst, r=10, g=9, b=3, scale=1E5, stretch="lin")`

Now, let’s load the training samples used in calibration. This data serves as a reference or example from which the classifier algorithm will “learn” the spectral signatures of each class.

```
rstTrain <- raster("./data-raw/berlin/train/berlin_lcz_GT.tif")
# Remove zeros from the train set (background NA)
rstTrain[rstTrain==0] <- NA
# Convert the data to factor/discrete representation
rstTrain <- ratify(rstTrain)
# Change the layer name
names(rstTrain) <- "trainClass"
# Visualize the data
plot(rstTrain, main="Train areas by class for Berlin")
```

Let’s see how many training pixels we have for each of the 12 classes (\(N_{total} = 24537\)):

```
tab <- table(values(rstTrain))
print(tab)
```

```
##
## 2 4 5 6 8 9 11 12 13 14 16 17
## 1534 577 2448 4010 1654 761 4960 1028 1050 4424 359 1732
```

Although perhaps not the best approach in some cases, we will convert our raster data-set into a `data.frame` object so we can use the RF classifier. Take into consideration that, depending on the size of your `RasterStack` and the available memory, this approach may not be possible. One simple way to overcome this is to convert the training raster into a `SpatialPoints` object and then run the `extract` function, so that only specific pixels from the stack are retrieved. In any case, let’s proceed to get the pixel values into our calibration data frame:

```
rstDF <- na.omit(values(stack(rstTrain, rst)))
rstDF[,"trainClass"] <- as.factor(as.character(rstDF[,"trainClass"]))
```
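The memory-friendlier alternative mentioned above could look roughly like the sketch below (not run in this post; it reuses the `rst` and `rstTrain` objects and assumes the resulting class column keeps the layer name `trainClass`):

```
# Sample only the training pixels instead of pulling the whole stack into memory
trainPoints <- rasterToPoints(rstTrain, spatial = TRUE)  # one point per train pixel
predVals    <- extract(rst, trainPoints)                 # band values at those points

rstDF2 <- data.frame(trainClass = as.factor(trainPoints@data$trainClass),
                     predVals)
```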

As you probably noticed from the code above, `NA`’s were removed and the reference class column was converted to a categorical/factor variable. In practice, “removing `NA`’s” means that we are restricting the data to the set of training pixels in `rstTrain` (reducing from 428,238 rows to 24,537 rows 👍.)

Next up, setting some parameters. In this example, we will use holdout cross-validation (HOCV) to evaluate the RF classifier’s performance. This means we will use an iterative split-set approach with a training and a testing set. For this purpose, we need to define the proportion of instances used for training (`pTrain`); the remainder will be set aside for evaluation. Here I took into consideration the fact that RF tends to take some time to calibrate with large numbers of observations (roughly >10,000), hence the relatively ‘large’ train proportion. We also need to define the number of repetitions of HOCV (`nEvalRounds`).

```
# Number of holdout evaluation rounds
nEvalRounds <- 20
# Proportion of the data used for training the classifier
pTrain <- 0.5
```

Now, let’s initialize some objects that will allow us to store some info on the classification performance and validation:

```
n <- nrow(rstDF)
nClasses <- length(unique(rstDF[,"trainClass"]))
# Initialize objects
confMats <- array(NA, dim = c(nClasses,nClasses,nEvalRounds))
evalMatrix <- matrix(NA, nrow = nEvalRounds, ncol = 3,
                     dimnames = list(paste("R_", 1:nEvalRounds, sep = ""),
                                     c("Accuracy","Kappa","PSS")))
pb <- txtProgressBar(1, nEvalRounds, style = 3)
```

Now, with all set, let’s calibrate and evaluate our RF classifier (use comments to guide you through the code):

```
# Evaluation function
source(url("https://raw.githubusercontent.com/joaofgoncalves/Evaluation/master/eval.R"))

# Run the classifier
for(i in 1:nEvalRounds){

  # Create the random index for row selection at each round
  sampIdx <- sample(1:n, size = round(n*pTrain))

  # Calibrate the RF classifier
  rf <- randomForest(y = rstDF[sampIdx, "trainClass"],
                     x = rstDF[sampIdx, -1],
                     ntree = 200)

  # Predict the class to the test set
  testSetPred <- predict(rf, newdata = rstDF[-sampIdx,], type = "response")

  # Get the observed class vector
  testSetObs <- rstDF[-sampIdx,"trainClass"]

  # Evaluate
  evalData <- Evaluate(testSetObs, testSetPred)

  evalMatrix[i,] <- c(evalData$Metrics["Accuracy",1],
                      evalData$Metrics["Kappa",1],
                      evalData$Metrics["PSS",1])

  # Store the confusion matrices by eval round
  confMats[,,i] <- evalData$ConfusionMatrix

  # Classify the whole image with raster::predict function
  rstPredClassTMP <- predict(rst, model = rf,
                             factors = levels(rstDF[,"trainClass"]))

  if(i==1){
    # Initiate the predicted raster
    rstPredClass <- rstPredClassTMP
    # Get precision and recall for each class
    Precision <- evalData$Metrics["Precision",,drop=FALSE]
    Recall <- evalData$Metrics["Recall",,drop=FALSE]
  }else{
    # Stack the predicted rasters
    rstPredClass <- stack(rstPredClass, rstPredClassTMP)
    # Get precision and recall for each class
    Precision <- rbind(Precision, evalData$Metrics["Precision",,drop=FALSE])
    Recall <- rbind(Recall, evalData$Metrics["Recall",,drop=FALSE])
  }

  setTxtProgressBar(pb, i)
}

# save.image(file = "./data-raw/P8-session.RData")
```

Three classification evaluation measures will be used: (i) **overall accuracy**, (ii) **Kappa**, and (iii) the **Peirce skill score** (PSS; aka true-skill statistic). Let’s print out the results by round:
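For reference, these three measures can also be derived directly from a confusion matrix. Here is a minimal base-R sketch using a small hypothetical matrix (rows taken as observed classes, columns as predicted), just to make the formulas concrete:

```r
# Hypothetical confusion matrix (rows = observed, columns = predicted)
cm <- matrix(c(50,  5,  3,
                4, 40,  6,
                2,  8, 45), nrow = 3, byrow = TRUE)

N  <- sum(cm)
po <- sum(diag(cm)) / N          # observed agreement = overall accuracy
pObs  <- rowSums(cm) / N         # observed class proportions
pPred <- colSums(cm) / N         # predicted class proportions
pe <- sum(pObs * pPred)          # chance agreement

accuracy <- po
kappa <- (po - pe) / (1 - pe)               # Cohen's Kappa
pss   <- (po - pe) / (1 - sum(pObs^2))      # multiclass Peirce skill score

round(c(Accuracy = accuracy, Kappa = kappa, PSS = pss), 3)
```

Note that Kappa and PSS share the same numerator (agreement beyond chance) and differ only in the normalizing denominator.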

`knitr::kable(evalMatrix, digits = 3)`

| | Accuracy | Kappa | PSS |
|---|---|---|---|
| R_1 | 0.788 | 0.755 | 0.749 |
| R_2 | 0.789 | 0.756 | 0.751 |
| R_3 | 0.782 | 0.748 | 0.741 |
| R_4 | 0.795 | 0.763 | 0.757 |
| R_5 | 0.789 | 0.755 | 0.749 |
| R_6 | 0.789 | 0.755 | 0.750 |
| R_7 | 0.784 | 0.751 | 0.744 |
| R_8 | 0.785 | 0.751 | 0.745 |
| R_9 | 0.785 | 0.751 | 0.745 |
| R_10 | 0.789 | 0.756 | 0.750 |
| R_11 | 0.785 | 0.752 | 0.746 |
| R_12 | 0.786 | 0.752 | 0.746 |
| R_13 | 0.786 | 0.753 | 0.747 |
| R_14 | 0.785 | 0.751 | 0.745 |
| R_15 | 0.794 | 0.762 | 0.756 |
| R_16 | 0.785 | 0.752 | 0.745 |
| R_17 | 0.791 | 0.759 | 0.753 |
| R_18 | 0.782 | 0.748 | 0.741 |
| R_19 | 0.781 | 0.746 | 0.739 |
| R_20 | 0.785 | 0.751 | 0.744 |

Next, let’s calculate the mean and standard deviation of these measures across all rounds:

`round(apply(evalMatrix,2,FUN = function(x,...) c(mean(x,...), sd(x,...))), 3)`

```
## Accuracy Kappa PSS
## [1,] 0.787 0.753 0.747
## [2,] 0.004 0.004 0.005
```

Overall measures seem to indicate that results are acceptable, with very low variation between calibration rounds. Let’s check out some average **precision** (aka positive predictive value, PPV), **recall** (aka true positive rate, TPR) and **F1** measures by class:

```
avgPrecision <- apply(Precision,2,mean)
print(round(avgPrecision, 3))
```

```
## 11 12 13 14 16 17 2 4 5 6 8 9
## 0.963 0.705 0.717 0.906 0.789 0.998 0.657 0.345 0.509 0.698 0.664 0.484
```

```
avgRecall <- apply(Recall,2,mean)
print(round(avgRecall, 3))
```

```
## 11 12 13 14 16 17 2 4 5 6 8 9
## 0.979 0.685 0.648 0.960 0.434 1.000 0.611 0.093 0.481 0.869 0.647 0.268
```

```
avgF1 <- (2 * avgPrecision * avgRecall) / (avgPrecision+avgRecall)
print(round(avgF1, 3))
```

```
## 11 12 13 14 16 17 2 4 5 6 8 9
## 0.971 0.695 0.681 0.932 0.560 0.999 0.633 0.147 0.495 0.774 0.655 0.345
```

Well, things are not so great here… 🙍♂️ Some classes, such as 4, 9 and 5 (different artificial/urban types) and 16 (bare soil/sand), have relatively low precision/recall/F1 values. This may be a consequence of the loss of detail due to the 100m re-sampling, some class intermixing, and/or train-data generalization… 🤔 … this subject definitely requires more investigation. 😉

Now, let’s check the confusion matrix for the best round:

```
# Best round for Kappa
evalMatrix[which.max(evalMatrix[,"Kappa"]), , drop=FALSE]
```

```
## Accuracy Kappa PSS
## R_4 0.7946858 0.7626567 0.7569426
```

```
# Show confusion matrix for the best kappa
cm <- as.data.frame(confMats[,,which.max(evalMatrix[,"Kappa"])])
# Change row/col names
colnames(cm) <- rownames(cm) <- paste("c",levels(rstDF[,"trainClass"]),sep="_")
knitr::kable(cm)
```

| | c_11 | c_12 | c_13 | c_14 | c_16 | c_17 | c_2 | c_4 | c_5 | c_6 | c_8 | c_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c_11 | 2442 | 30 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 8 | 0 | 3 |
| c_12 | 48 | 330 | 57 | 23 | 0 | 0 | 0 | 1 | 5 | 24 | 1 | 7 |
| c_13 | 7 | 43 | 358 | 97 | 1 | 1 | 0 | 0 | 2 | 6 | 4 | 3 |
| c_14 | 0 | 20 | 48 | 2096 | 1 | 0 | 0 | 1 | 1 | 11 | 0 | 5 |
| c_16 | 0 | 3 | 4 | 29 | 80 | 0 | 0 | 0 | 4 | 4 | 51 | 5 |
| c_17 | 0 | 0 | 0 | 0 | 0 | 914 | 0 | 0 | 0 | 0 | 0 | 0 |
| c_2 | 0 | 0 | 0 | 4 | 1 | 0 | 507 | 6 | 199 | 23 | 73 | 1 |
| c_4 | 2 | 4 | 6 | 5 | 2 | 0 | 2 | 31 | 114 | 105 | 16 | 9 |
| c_5 | 5 | 3 | 3 | 10 | 1 | 0 | 171 | 29 | 619 | 262 | 87 | 14 |
| c_6 | 14 | 19 | 5 | 4 | 0 | 0 | 1 | 6 | 142 | 1753 | 13 | 38 |
| c_8 | 2 | 7 | 6 | 15 | 7 | 3 | 78 | 7 | 136 | 35 | 516 | 13 |
| c_9 | 3 | 9 | 10 | 17 | 1 | 0 | 0 | 2 | 10 | 190 | 5 | 104 |

Since we obtained one classified map for each round, we can pull all that information together by ensembling it through a majority vote (ex. calculating the modal class). The `modal` function from the `raster` package makes this really easy to calculate:

```
rstModalClass <- modal(rstPredClass)
rstModalClassFreq <- modal(rstPredClass, freq=TRUE)
medFreq <- zonal(rstModalClassFreq, rstTrain, fun=median)
```
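Conceptually, `modal` performs a per-cell majority vote across the stacked layers. The idea can be sketched in base R for a single pixel (the vote values below are hypothetical, and `majorityVote` is a throwaway helper, not a `raster` function):

```r
# Hypothetical per-pixel class votes from 5 classification rounds
votes <- c(3, 3, 5, 3, 2)

majorityVote <- function(x) {
  tb <- table(x)                                       # count votes per class
  list(class = as.integer(names(tb)[which.max(tb)]),   # modal class (ties -> first)
       freq  = max(tb))                                # modal frequency
}

majorityVote(votes)   # class 3 with frequency 3
```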

Using the modal frequency of the 20 classification rounds, let’s check out which classes obtained the highest ‘uncertainty’:

```
colnames(medFreq) <- c("ClassCode","MedianModalFrequency")
medFreq[order(medFreq[,2],decreasing = TRUE),]
```

```
## ClassCode MedianModalFrequency
## [1,] 6 20
## [2,] 11 20
## [3,] 12 20
## [4,] 14 20
## [5,] 17 20
## [6,] 2 19
## [7,] 8 19
## [8,] 13 19
## [9,] 5 15
## [10,] 16 14
## [11,] 9 13
## [12,] 4 11
```

These results somewhat confirm those of the class-wise precision/recall. The classes that shift most often between rounds (lower modal frequency) are those with lower values for these performance measures.

Finally, let’s plot our results of the final modal classification map (and modal frequency):

```
par(mfrow=c(1,2), cex.main=0.8, cex.axis=0.8)
plot(rstModalClass, main = "RF modal land cover class")
plot(rstModalClassFreq, main = "Modal frequency")
```

The map on the right provides some insight for identifying areas that are more problematic for the classification process. As you can see, the results are acceptable but could be improved in many ways.

See more stuff about geospatial and raster data processing here.

This concludes our exploration of the raster package and supervised classification for this post. Hope you find it useful! 😄 👍 👍

- Spatial Data Analysis: Introduction to Raster Processing (Part 1)
- Spatial Data Analysis: Introduction to Raster Processing: Part-3
- Spatial Data Analysis: Introduction to Raster Processing (Part 2)

The process of *unsupervised classification* (UC; also commonly known as *clustering*) uses the properties and moments of the statistical distribution of pixels within a feature space (ex. formed by different spectral bands) to differentiate between relatively similar groups. *Unsupervised classification* provides an effective way of partitioning remotely-sensed imagery in a multi-spectral feature space and extracting useful land-cover information. We can perhaps differentiate UC from clustering because the former implies that we inspect the results *a posteriori* and label each class according to its properties. For example, if the objective is to obtain a land cover map, different groups will perhaps be differentiated and labeled as urban, agriculture, forest and other classes alike.

Clustering is also known as a *data reduction* technique; i.e., it compresses highly diverse information at pixel-level into groups or clusters of pixels with similar and more homogeneous values. Contrary to supervised classification, the *unsupervised* version does not require the user to provide training samples or cases. In fact, UC needs minimal input from the operator: typically only the number of groups and which bands to use. The algorithm then attempts to find the best solution for clustering the pixel values, such that *‘within-group’* distances are minimized and *‘between-group’* separation is maximized.

In this post, we will explore how to:

- Perform unsupervised classification/clustering
- Compare the performance of different clustering algorithms
- Assess the *“best”* number of clusters/groups to capture the image data

One satellite scene from Landsat 8 will be used for this purpose. The data contains surface reflectance information for seven spectral bands (or layers, following the terminology for `RasterStack` objects) in the GeoTIFF file format.

The following table summarizes info on Landsat 8 spectral bands used in this tutorial.

| Band # | Band name | Wavelength (micrometers) |
|---|---|---|
| Band 1 | Ultra Blue | 0.435 – 0.451 |
| Band 2 | Blue | 0.452 – 0.512 |
| Band 3 | Green | 0.533 – 0.590 |
| Band 4 | Red | 0.636 – 0.673 |
| Band 5 | Near Infrared (NIR) | 0.851 – 0.879 |
| Band 6 | Shortwave Infrared (SWIR) 1 | 1.566 – 1.651 |
| Band 7 | Shortwave Infrared (SWIR) 2 | 2.107 – 2.294 |

Landsat 8 spatial resolution (or pixel size) is 30 meters. Valid reflectance values are typically within 0.00 – 1.00 but, to decrease file size, they are multiplied by a 10^4 scaling factor and stored as integers in the range 0 – 10000. The image acquisition date is July 15th, 2015.
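As a quick illustration of the scaling, converting the stored integers back to reflectance is just a division (the cell values below are hypothetical):

```r
# Convert scaled integer reflectance back to the 0-1 range
scaledValues <- c(0L, 2543L, 10000L)   # hypothetical stored cell values
reflectance  <- scaledValues / 1e4     # undo the 10^4 scaling factor
reflectance                            # 0.0000 0.2543 1.0000
```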

For more information on raster data processing, see here, as well as the tutorial part-1, tutorial part-2, tutorial part-3, and, tutorial part-4 of this series.

To perform the unsupervised classification/clustering, we will employ and compare two algorithms:

- **K-means**
- **CLARA**

The *k-means* clustering algorithm attempts to define the centroid of each cluster with its mean value. This means that data is partitioned into *k* clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. In a geometric interpretation, k-means partitions the data space into *Voronoi cells* (see plot below). K-means provides an “exclusive” solution, meaning that each observation belongs to one (and only one) cluster. The algorithm is generally efficient for dealing with large data-sets, which commonly happens for raster data (ex. satellite or aerial images.)
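As a minimal, self-contained sketch of this idea, here is base R’s `kmeans` applied to synthetic ‘two-band’ pixel values drawn from two hypothetical, well-separated spectral groups:

```r
set.seed(42)
# Synthetic reflectance values for two 'bands' and two spectral groups (hypothetical)
pixels <- rbind(matrix(rnorm(200, mean = 0.1, sd = 0.05), ncol = 2),   # darker group
                matrix(rnorm(200, mean = 0.6, sd = 0.05), ncol = 2))   # brighter group

# Partition the pixels into k = 2 exclusive clusters
km <- kmeans(pixels, centers = 2, iter.max = 50)

table(km$cluster)   # two clusters of 100 pixels each
```

Because the two groups are far apart relative to their spread, k-means recovers them perfectly here; real spectral classes overlap far more.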

The CLARA (Clustering LARge Application) algorithm is based on the Partition Around Medoids (PAM) algorithm, which in turn is an implementation of K-medoids… … so, let’s try to dig in by parts. Also, since these posts are intended to be *‘short-and-sweet’*, I will not use mathematical notation or pseudo-code in the descriptions. There are plenty of awesome books and online resources on these subjects that you can consult for more information.

The *k-medoids* algorithm is a clustering algorithm similar to k-means. Both are *partitional* algorithms in the sense that they break-up the data into groups. They both also attempt to minimize the distance between points labeled in a cluster and a point designated as the center of that cluster.

However, in contrast to the k-means algorithm, k-medoids chooses specific data-points as centers (named *medoids*) and works with a generalization of the Manhattan Norm to define the distance between data points. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal. In other words, it is the most centrally located point in the cluster.

Generally, k-medoids is more robust to noise and outliers than k-means because it minimizes a sum of pair-wise dissimilarities instead of a sum of squared Euclidean distances.
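A tiny base-R example illustrates this robustness: with an outlier present, the mean shifts markedly while the medoid (here, the observation minimizing summed absolute distances) stays central. The values are hypothetical:

```r
x <- c(1, 2, 3, 4, 100)            # one extreme outlier (hypothetical values)
mean(x)                            # pulled towards the outlier: 22

# Medoid: the observation minimizing the summed absolute distances to all others
sumAbsDist <- sapply(x, function(p) sum(abs(x - p)))
medoid <- x[which.min(sumAbsDist)]
medoid                             # 3 -- still the most central actual data point
```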

*PAM* is the most common realization of the k-medoids clustering algorithm. PAM uses a greedy search, which may not find the optimum solution, however, it is faster than an exhaustive search. PAM has a high computational cost and uses a large amount of memory to compute the dissimilarity object. This leads us to the *CLARA* algorithm!

*CLARA* randomly chooses a small portion of the actual data as a representative of the data; then, medoids are chosen from this sample using *PAM*. If the sample is robustly selected, in a fairly random manner and with enough data points, it should closely represent the original data-set.

*CLARA* draws multiple samples of the data-set, applies PAM to each sample, finds the medoids, and then returns its best clustering as the output. First, a sample data-set is drawn from the original data-set; then the PAM algorithm is applied to find the *k* medoids. Using these *k* medoids and the whole data-set, the ‘current’ dissimilarity is calculated. If it is smaller than the one obtained in the previous iteration, these *k* medoids are kept as the best ones. This selection process is repeated a specified number of times.
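The sample-then-PAM logic described above can be compressed into a short sketch using `cluster::pam` on synthetic data (this illustrates the idea only and is not the actual `clara` implementation; all data values are hypothetical):

```r
library(cluster)
set.seed(1)

# Synthetic data: two well-separated groups of 50 observations each (hypothetical)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

k <- 2; bestCost <- Inf; bestClust <- NULL

for(s in 1:5){                                    # draw several random samples
  idx <- sample(nrow(X), size = 30)               # small sample of the data
  med <- pam(X[idx, ], k = k)$medoids             # PAM medoids from the sample
  # Assign every observation to its nearest medoid (Manhattan distance)
  d <- as.matrix(dist(rbind(med, X), method = "manhattan"))[-(1:k), 1:k]
  cost <- sum(apply(d, 1, min))                   # total dissimilarity on the FULL data
  if(cost < bestCost){                            # keep the best sample's medoids
    bestCost <- cost
    bestClust <- apply(d, 1, which.min)
  }
}

table(bestClust)   # final clustering over all 100 observations
```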

In R, the `clara` function from the `cluster` package implements this algorithm. It accepts dissimilarities calculated based on the `"euclidean"` or `"manhattan"` distance. Euclidean distances are root sums of squares of differences, while Manhattan distances are sums of absolute differences.
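The difference between the two metrics is easy to check with the base `dist` function:

```r
# Distance between points (0,0) and (3,4) under both metrics
pts <- rbind(c(0, 0), c(3, 4))
dist(pts, method = "euclidean")   # sqrt(3^2 + 4^2) = 5
dist(pts, method = "manhattan")   # |3| + |4| = 7
```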

Our approach to clustering the Landsat 8 spectral raster data will employ two stages:

- Cluster the image data with *k-means* and *CLARA* for a number of clusters between 2 and 12
- Assess each clustering solution’s performance through the average *Silhouette Index*

Let’s start by downloading and uncompressing the Landsat-8 surface reflectance sample data for the Peneda-Geres National Park:

```
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/LT8_PNPG_MultiBand.zip", "./data-raw/LT8_PNPG_MultiBand.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/LT8_PNPG_MultiBand.zip", exdir = "./data-raw")
```

Now, let’s read the multi-band GeoTIFF file into R as a `RasterBrick` object:

```
library(raster)
# Load the multi-band GeoTIFF file with brick function
rst <- brick("./data-raw/LC82040312015193LGN00_sr_b_1_7.tif")
# Change band names
names(rst) <- paste("b",1:7,sep="")
```

Plot the data in the RGB display (bands 4,3,2) to see if everything is fine:

`plotRGB(rst, r=4, g=3, b=2, scale=10000, stretch="lin", main="RGB composite (b4,b3,b2) of Landsat-8")`

The data looks OK, which means we can proceed to *step #1*: clustering the raster data with both algorithms and for different numbers of clusters (from 2 to 12).

To simplify the processing workflow, the ‘internal’ values of the raster object will be entirely loaded into memory as a `data.frame` object (in this case with 2.2866 × 10^6 rows – one for each cell – and 7 columns – one for each layer). See `?values` and `?getValues` for more info on this. Although this makes things easier and faster, in some cases, depending on the size of the image and its number of layers/bands, it will be unfeasible to push all the data into RAM.

However, if your raster object is too large to fit into a data frame in memory, you can still use R to perform k-means clustering. Packages such as **RSToolbox** provide an implementation of k-means that may be more suited to your case (check out its `unsuperClass` function).

Also, we will need to be careful regarding `NA` values, because clustering algorithms will not run with these (typically throwing an error). One simple way to handle them is to use a logical index to sub-set the data.

Let’s see how this works out in actual R code (use comments as guidance):

```
library(cluster)

# Extract all values from the raster into a data frame
rstDF <- values(rst)

# Check NA's in the data
idx <- complete.cases(rstDF)

# Initiate the raster datasets that will hold all clustering solutions
# from 2 groups/clusters up to 12
rstKM <- raster(rst[[1]])
rstCLARA <- raster(rst[[1]])

for(nClust in 2:12){

  cat("-> Clustering data for nClust =", nClust, "......")

  # Perform K-means clustering
  km <- kmeans(rstDF[idx,], centers = nClust, iter.max = 50)

  # Perform CLARA's clustering (using manhattan distance)
  cla <- clara(rstDF[idx, ], k = nClust, metric = "manhattan")

  # Create a temporary integer vector for holding cluster numbers
  kmClust <- vector(mode = "integer", length = ncell(rst))
  claClust <- vector(mode = "integer", length = ncell(rst))

  # Generate the temporary clustering vector for K-means (keeps track of NA's)
  kmClust[!idx] <- NA
  kmClust[idx] <- km$cluster

  # Generate the temporary clustering vector for CLARA (keeps track of NA's too ;-)
  claClust[!idx] <- NA
  claClust[idx] <- cla$clustering

  # Create a temporary raster for holding the new clustering solution
  # K-means
  tmpRstKM <- raster(rst[[1]])
  # CLARA
  tmpRstCLARA <- raster(rst[[1]])

  # Set raster values with the cluster vector
  # K-means
  values(tmpRstKM) <- kmClust
  # CLARA
  values(tmpRstCLARA) <- claClust

  # Stack the temporary rasters onto the final ones
  if(nClust==2){
    rstKM <- tmpRstKM
    rstCLARA <- tmpRstCLARA
  }else{
    rstKM <- stack(rstKM, tmpRstKM)
    rstCLARA <- stack(rstCLARA, tmpRstCLARA)
  }

  cat(" done!\n\n")
}

# Write the clustering solutions for each algorithm
writeRaster(rstKM,"./data-raw/LT8_PGNP_KMeans_nc2_12-1.tif", overwrite=TRUE)
writeRaster(rstCLARA,"./data-raw/LT8_PGNP_CLARA_nc2_12-1.tif", overwrite=TRUE)
```

For evaluating the performance of each clustering solution and selecting the *“best”* number of clusters for partitioning the sample data, we will use the **Silhouette Index**.

More specifically, the *silhouette* refers to a method of interpreting and validating the consistency within clusters of data (hence it is called an *internal criterion* in the `clusterCrit` package). In a nutshell, this method provides a graphical representation that depicts how well each clustered object lies within its cluster. The silhouette value is thus a measure of how similar an object is to its own cluster (assessing intra-cluster *cohesion*) compared to other clusters (denoting between-cluster *separation*).

The silhouette index ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is considered appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters.
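For a concrete feel of these values, here is the silhouette computed by hand — s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from object i to its own cluster and b(i) the mean distance to the nearest other cluster — on a tiny hypothetical 1-D dataset with two clusters:

```r
x  <- c(1, 2, 9, 10)           # two obvious groups (hypothetical values)
cl <- c(1, 1, 2, 2)            # cluster labels

sil <- sapply(seq_along(x), function(i) {
  own   <- x[cl == cl[i]]
  own   <- own[own != x[i]]     # same cluster, excluding i (values unique here)
  other <- x[cl != cl[i]]       # with two clusters, this IS the nearest other cluster
  a <- mean(abs(own - x[i]))    # cohesion: mean distance within own cluster
  b <- mean(abs(other - x[i]))  # separation: mean distance to the other cluster
  (b - a) / max(a, b)
})

round(sil, 3)                  # all close to 1: objects sit well in their clusters
mean(sil)                      # high average => good clustering configuration
```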

In R, the `clusterCrit` package provides an implementation of this internal clustering criterion in the `intCriteria` function (among many others, such as the Dunn, Ball-Hall, Davies-Bouldin, GDI and Tau indices). Check out `library(help="clusterCrit")` and `vignette("clusterCrit")` for more info on this package.

Now that we have defined the conceptual underpinnings of the silhouette index, we can implement it in R code. One important detail before we proceed: since calculating the silhouette index is a rather slow process for large numbers of observations (>5000), we will use a stratified random sampling approach.

This means that we will take a sub-set of cells from each cluster and calculate the index based on those, assuming that the sample is reasonably robust and representative of the whole set of cells. Ideally, this process should be repeated several times and an average value calculated (using a *bootstrap* approach would also be nice here). However, for the sake of simplicity (and also because the estimation generally yields relatively low errors… you will have to trust me here…) we will use a single sample of cells in this example.

Let’s see how this works out (use comments to guide you through the code):

```
library(clusterCrit)

# Start a data frame that will store all silhouette values
# for k-means and CLARA
clustPerfSI <- data.frame(nClust = 2:12, SI_KM = NA, SI_CLARA = NA)

for(i in 1:nlayers(rstKM)){ # Iterate through each layer

  cat("-> Evaluating clustering performance for nClust =", (2:12)[i], "......")

  # Extract random cell samples stratified by cluster
  cellIdx_RstKM <- sampleStratified(rstKM[[i]], size = 2000)
  cellIdx_rstCLARA <- sampleStratified(rstCLARA[[i]], size = 2000)

  # Get cell values from the Stratified Random Sample from the raster
  # data frame object (rstDF)
  rstDFStRS_KM <- rstDF[cellIdx_RstKM[,1], ]
  rstDFStRS_CLARA <- rstDF[cellIdx_rstCLARA[,1], ]

  # Make sure all columns are numeric (intCriteria function is picky on this)
  rstDFStRS_KM[] <- sapply(rstDFStRS_KM, as.numeric)
  rstDFStRS_CLARA[] <- sapply(rstDFStRS_CLARA, as.numeric)

  # Compute the sample-based Silhouette index for:
  #
  # K-means
  clCritKM <- intCriteria(traj = rstDFStRS_KM,
                          part = as.integer(cellIdx_RstKM[,2]),
                          crit = "Silhouette")
  # and CLARA
  clCritCLARA <- intCriteria(traj = rstDFStRS_CLARA,
                             part = as.integer(cellIdx_rstCLARA[,2]),
                             crit = "Silhouette")

  # Write the silhouette index value to clustPerfSI data frame holding
  # all results
  clustPerfSI[i, "SI_KM"] <- clCritKM[[1]][1]
  clustPerfSI[i, "SI_CLARA"] <- clCritCLARA[[1]][1]

  cat(" done!\n\n")
}

write.csv(clustPerfSI, file = "./data-raw/clustPerfSI.csv", row.names = FALSE)
```

Let’s print out a nice table with the silhouette index results for comparing each clustering solution:

```
knitr::kable(clustPerfSI, digits = 3, align = "c",
             col.names = c("#clusters","Avg. Silhouette (k-means)","Avg. Silhouette (CLARA)"))
```

| #clusters | Avg. Silhouette (k-means) | Avg. Silhouette (CLARA) |
|---|---|---|
| 2 | 0.378 | 0.351 |
| 3 | 0.381 | 0.258 |
| 4 | 0.306 | 0.308 |
| 5 | 0.442 | 0.280 |
| 6 | 0.427 | 0.393 |
| 7 | 0.388 | 0.260 |
| 8 | 0.384 | 0.272 |
| 9 | 0.367 | 0.325 |
| 10 | 0.326 | 0.311 |
| 11 | 0.356 | 0.285 |
| 12 | 0.320 | 0.255 |

We can also make a plot for comparing the two algorithms:

```
plot(clustPerfSI[,1], clustPerfSI[,2],
     xlim = c(1,13), ylim = range(clustPerfSI[,2:3]), type = "n",
     ylab="Avg. Silhouette Index", xlab="# of clusters",
     main="Silhouette index by # of clusters")

# Plot Avg Silhouette values across # of clusters for K-means
lines(clustPerfSI[,1], clustPerfSI[,2], col="red")

# Plot Avg Silhouette values across # of clusters for CLARA
lines(clustPerfSI[,1], clustPerfSI[,3], col="blue")

# Grid lines
abline(v = 1:13, lty=2, col="light grey")
abline(h = seq(0.30,0.44,0.02), lty=2, col="light grey")

legend("topright", legend=c("K-means","CLARA"), col=c("red","blue"), lty=1, lwd=1)
```

From both the table and the plot, we can see widely different results in terms of clustering performance with the **k-means algorithm clearly performing better**. This may be due to the fact that CLARA works on a sub-set of the data, and hence, is less capable of finding the best cluster centers. In addition, for the k-means algorithm, we can see that partitioning the data into 5 groups/clusters seems to be the best option (although 6 also seems a perfectly reasonable solution.)

Finally, let’s make a plot of the best solutions according to the silhouette index:

`plot(rstKM[[4]])`

The final step (typical in the Remote Sensing domain) would be to interpret the clustering results, analyze their spectral and land cover properties, and provide a label for each cluster (ex. urban, agriculture, forest). Albeit very important, that is outside the scope of this tutorial.

This concludes our exploration of the raster package and unsupervised classification for this post. Hope you find it useful!

- Spatial Data Analysis: Introduction to Raster Processing (Part 1)
- Spatial Data Analysis: Introduction to Raster Processing (Part 2)
- Spatial Data Analysis: Introduction to Raster Processing: Part-3

In the fourth part of this tutorial series on Spatial Data Analysis using the `raster` package, we will explore more functionalities, this time related to time-series analysis of raster data. For more information on raster data processing, see here, as well as the tutorial part-1, tutorial part-2, and tutorial part-3 of this series.

We will use an Enhanced Vegetation Index (EVI) 5-year time-series (from 2012 to 2016), acquired by the Terra/MODIS satellite/sensor platform over the Peneda-Geres National Park (PGNP, in NW Portugal), to develop some examples.

This data corresponds to MODIS’s **MOD13Q1 data product** version-006 (+info here), which has a 250m spatial resolution and a 16-day temporal resolution (more precisely, the product is generated from maximum daily value composites for each 16-day period). This means that each year has a total of 23 observations. This data was downloaded from the EarthData platform and later assembled and re-projected to the WGS 1984 – UTM 29N Coordinate Reference System (CRS) using the MODIS Reprojection Tool – MRT (sorry, but these pre-processing steps are outside the scope of this tutorial 😁).

In this post, we will introduce `RasterBrick`s, a multi-layer raster object typically created from a multi-layer (or multi-band) file, although they can also exist entirely in memory. These objects are similar to `RasterStack`s, but processing time should be shorter when using a `RasterBrick` (irrespective of whether values are on disk or in memory). However, these objects are less flexible, as they can only point to a single file, while `RasterStack`s can point to multiple files.

Besides the `raster` package, we will also work with `rts`, which provides classes and methods for manipulating and processing raster time-series data (e.g. a time-series of satellite images). A raster time-series object is created by combining a `RasterStack` or `RasterBrick` object (from the `raster` package) and a set of dates of class `POSIXct`, `POSIXt`, `Date` or `timeDate`. The time information in `rts` is then handled by an `xts` object.

The function `rts` is used to build a raster time-series (either a `RasterBrickTS` or a `RasterStackTS`), which is simply an S4 object composed of two slots:

- Slot [*raster*]: a `RasterStack` or `RasterBrick` object.
- Slot [*time*]: an `xts` object with dates for each layer in the raster object.

One key advantage of using the `rts` package is that it facilitates subsetting, extraction, or the application of functions over specific periods using date notation (instead of integer or name indices as in `raster`).

First up: download and uncompress the sample data! 👍 The .zip archive contains a multi-layer GeoTIFF file with a 16-day (composite) EVI time-series from 2012 to 2016. This means that we have a total of 23 images per year and a total of 115 layers in the file (for the whole five-year series).

```
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/MODIS_EVI_TS_PGNP_MultiBand.zip", "./data-raw/MODIS_EVI_TS_PGNP_MultiBand.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/MODIS_EVI_TS_PGNP_MultiBand.zip", exdir = "./data-raw")
```

Creating the `RasterBrick` from the downloaded data is really easy and similar to using `stack`. The main advantage is that the input is just one multi-layer file instead of a vector of multiple files, as is usual when using `stack`. We will also change the names of each layer to make them more readable. Let’s see how this goes:

```
library(raster)
# Load the raster data into a RasterBrick object
rst <- brick("./data-raw/MOD13Q1.2012_2016.PGNP_250m_EVI_16days.tif")
names(rst) <- paste("EVI",1:nlayers(rst),sep="_")
```

MODIS data products, such as *MOD13Q1* (and others alike), use 7-digit dates composed of the year (4 digits) followed by the Julian day (3 digits) to identify the reference date of an image composite. For example, the date code 2012001 corresponds to 2012-01-01 (in YYYY-mm-dd format), and 2012161 to 2012-06-09. Usually, these dates are inscribed in the image files or meta-data but, since we don’t have them, we will generate them first and then process them to obtain a `Date` object for each layer. We can then use these properly formatted dates in the `rts` function to create a raster time-series (see `?rts` for more details).

```
padIt <- function(x)
  if(x < 10) paste("00", x, sep="") else if(x < 100 & x >= 10) paste("0", x, sep="") else return(as.character(x))

padWithZeros <- function(x)
  sapply(x, FUN = padIt)

# Generate the MODIS-like dates for each layer
julDay <- padWithZeros(rep(seq(from = 1, to = 365, by = 16), 5))
yrs <- as.character(rep(2012:2016, each = 23))
MODISYrJday <- paste(yrs, julDay, sep="")

# Print out the MODIS dates for the year 2012
print(MODISYrJday[1:23])
```

```
## [1] "2012001" "2012017" "2012033" "2012049" "2012065" "2012081" "2012097"
## [8] "2012113" "2012129" "2012145" "2012161" "2012177" "2012193" "2012209"
## [15] "2012225" "2012241" "2012257" "2012273" "2012289" "2012305" "2012321"
## [22] "2012337" "2012353"
```

Now that we have our MODIS-like dates (in year and Julian day format), we need to convert them into a more ‘human-readable’ format, also accepted by the `rts` function:

```
# Extract the year
MOD.getYear <- function(x)
  as.integer(sapply(x, FUN = function(x) substr(x, 1, 4)))

# Extract the Julian day
MOD.getDOY <- function(x)
  as.integer(sapply(x, FUN = function(x) substr(x, 5, 7)))

# Process the MODIS-like date into YYYY-mm-dd format as a Date object
MOD.getDate <- function(x)
  as.Date(sapply(x, FUN = function(x)
    as.character(as.Date(MOD.getDOY(x) - 1, origin = paste(MOD.getYear(x), "01-01", sep="-")))))

MODdates <- MOD.getDate(MODISYrJday)
class(MODdates)
```

`## [1] "Date"`

```
# Check the result for year 2012
print(MODdates[1:23])
```

```
## [1] "2012-01-01" "2012-01-17" "2012-02-02" "2012-02-18" "2012-03-05"
## [6] "2012-03-21" "2012-04-06" "2012-04-22" "2012-05-08" "2012-05-24"
## [11] "2012-06-09" "2012-06-25" "2012-07-11" "2012-07-27" "2012-08-12"
## [16] "2012-08-28" "2012-09-13" "2012-09-29" "2012-10-15" "2012-10-31"
## [21] "2012-11-16" "2012-12-02" "2012-12-18"
```

With the date vector (class `Date`) for each layer, we can now generate a raster time-series with the `rts` constructor:

```
# Install the rts package
if(!("rts" %in% installed.packages()[,1]))
  install.packages(c("rts"), dependencies = TRUE)

library(rts)

rstTS <- rts(rst, MODdates)
```

With the `RasterBrickTS` object created, we can extract sub-sets of the data for particular periods or dates (check `?rts::subset` for more details):

```
# Subset a specific period
rstTSsubset1 <- subset(rstTS,"2013-05-15/2014-08-25")
# Subset the whole year of 2012
rstTSsubset2 <- subset(rstTS,"2012")
# Subset years from 2013 to 2014
rstTSsubset3 <- subset(rstTS,"2013/2014")
# Subset all years from (and including) 2014 to the series end
rstTSsubset4 <- subset(rstTS,"2014/")
# Subset all to the end of 2014
rstTSsubset5 <- subset(rstTS,"/2014")
# Subset all up to the end of June 2014
rstTSsubset6 <- subset(rstTS,"/2014-06")
# Subset a specific month
rstTSsubset7 <- subset(rstTS,"2016-05")
# Plot the May 2016 data
plot(rstTSsubset7)
```

As you can see, the `subset` function is pretty handy for extracting parts of a time-series. The date parameter must be left-specified with respect to the standard ISO 8601 time format “CCYY-MM-DD HH:MM:SS”. It is also possible to specify a range of times via index-based sub-setting, using the ISO-recommended “/” as the range operator. Generally, it works with “from/to” dates, where both sides are optional; if one side is missing, it is interpreted as a request to retrieve layers from the beginning or through the end of the raster time-series.

One of the best features of the `rts` package for processing raster time-series is the ability to apply a specified function to distinct periods. Let’s see how we can do this:

```
# Apply function to each quarter
# Mean
rstTS_quarterlyMN <- apply.quarterly(rstTS, FUN = mean, na.rm=TRUE)
# Standard-deviation
rstTS_quarterlySD <- apply.quarterly(rstTS, FUN = sd, na.rm=TRUE)
# Apply function to each year
# Mean
rstTS_yearlyMN <- apply.yearly(rstTS, FUN = mean, na.rm=TRUE)
# Standard-deviation
rstTS_yearlySD <- apply.yearly(rstTS, FUN = sd, na.rm=TRUE)
# Plot the time-series for annual EVI
plot(rstTS_yearlyMN)
```

As we can see, these functions can be very useful for applying specific functions over certain calendar periods without the hassle of having to specify indices – you simply work with dates, which are nicer! 👍 😉

However, in certain cases, you may want to work with the “simpler” / more general functions that the `raster` package offers to apply functions over a raster time-series. In that case, `calc` and `stackApply` can be used.

The difference between these functions is that `calc` applies the defined function over the whole series, pixel-by-pixel, while `stackApply` applies a function to subsets of a `RasterStack` or `RasterBrick`. For `stackApply`, the layers to be combined are indicated by an integer vector of indices. The function used should return a single value, and the number of layers in the output `Raster*` equals the number of unique values in the indices. In contrast, `calc` allows a function that outputs multiple values; in that case, a multi-layer `Raster*` object is returned with one layer per output value.

Also, keep in mind that, for large objects, `calc` will compute values chunk by chunk. This means that, for the result of `fun` to be correct, it should not depend on having access to *all* values at once. Let’s look at some examples.
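
To illustrate this caveat, here is a minimal sketch with a synthetic brick (the data and object names are invented for this example): a function passed to `calc` should use only the per-cell vector of layer values it receives, while whole-layer statistics should be computed once beforehand (e.g., with `cellStats`) and passed in as constants.

```r
library(raster)

# Small synthetic 3-layer brick (illustrative values only)
set.seed(1)
b <- brick(array(runif(10 * 10 * 3), dim = c(10, 10, 3)))

# Safe under chunked processing: uses only the per-cell layer values
cellRange <- calc(b, fun = function(x) max(x) - min(x))

# A whole-layer statistic must be computed once, outside of calc
globalMax <- cellStats(b[[1]], max)
scaled <- calc(b[[1]], fun = function(x) x / globalMax)
```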

Using the `calc` function to compute a global average and standard-deviation of the entire raster time-series:

```
rstMean <- calc(rst, fun = mean)
rstStd <- calc(rst, fun = sd)
```

Now, let’s use `calc` for a multi-value output with a specific function:

```
# Calculate quantiles
# NOTE: you have to use na.rm=TRUE to make this work
rstQuantiles <- calc(rst, fun = function(x,...) as.numeric(quantile(x,probs=c(0.05, 0.5, 0.95),...)), na.rm=TRUE)
print(rstQuantiles)
```

```
## class : RasterBrick
## dimensions : 186, 179, 33294, 3 (nrow, ncol, ncell, nlayers)
## resolution : 250, 250 (x, y)
## extent : 549486, 594236, 4613206, 4659706 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=29 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : layer.1, layer.2, layer.3
## min values : -290.7, 13.0, 271.5
## max values : 4229.3, 5572.0, 7458.0
```

*Et voilà!* ❕ We have three layers; one for each calculated quantile.

Now, switching to `stackApply`, we will emulate the behavior of the `rts` function `apply.yearly`:

`rstYrMean <- stackApply(rst, fun=mean, indices = rep(1:5,each=23))`

Well… to be honest, I was not aware of this, but, strangely, the `rts` function took much longer to calculate the annual averages. Check out the comparison below:

```
# stackApply with RasterBrick
system.time({rstYrMean <- stackApply(rst, fun=mean, indices = rep(1:5,each=23))})
```

```
## user system elapsed
## 0.52 0.04 0.58
```

```
# apply.yearly with RasterBrickTS
system.time({rstTS_yearlyMN <- apply.yearly(rstTS, FUN = mean, na.rm=TRUE)})
```

```
## user system elapsed
## 23.47 1.25 24.75
```

Let’s check that the results are equal, just to be sure:

```
for(i in 1:nlayers(rstYrMean)) {
  print(compareRaster(rstYrMean[[i]], rstTS_yearlyMN@raster[[i]], values = TRUE))
}
```

```
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
```

Yup, all the layers are equal…

Not sure why this is happening, though… 🤔 Do you have any ideas/comments on this?

This concludes our exploration of the raster and the rts packages for this post. We covered several handy tools for processing raster time-series. Hope you find them useful! 😄 👍 👍



In the third part of this tutorial series focused on spatial data analysis using the `raster` package, we will explore more functionalities, namely:

- Masking
- Aggregation
- Zonal analysis
- Cross-tabulation

For more information on raster data processing, see here, as well as part 1 and part 2 of this series.

Masking a raster is often a required operation when we want to represent and/or analyze only the subset of pixels included in a specific area or region. The remaining pixels are set to `NA` (or another user-defined value).

For this purpose, we can use the `mask` function. It requires a ‘mask’ layer, which can be either a `Raster*` object (with the same extent and resolution) or a `Spatial*` object (e.g., `SpatialPolygons`), in which case all cells not covered by this object are set to `updatevalue` (`NA` by default).

We will start by downloading, uncompressing, and loading the sample data. A `SpatialPolygons*` layer will be used as the mask layer. The objective of this example is to mask elevation values that are inside the Peneda-Geres National Park (NW Portugal). First up, let’s prepare the elevation data:

```
library(raster)
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/srtm_pnpg.zip", "./data-raw/srtm_pnpg.zip", method = "auto")
unzip("./data-raw/srtm_pnpg.zip", exdir = "./data-raw")
## Create the RasterLayer object
rst <- raster("./data-raw/srtm_pnpg.tif")
```

Now, let’s download and read the mask layer (using `rgdal`):

```
library(sp)
library(rgdal)
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/BOUNDS_PNPG.zip", "./data-raw/BOUNDS_PNPG.zip", method = "auto")
unzip("./data-raw/BOUNDS_PNPG.zip", exdir = "./data-raw")
# Read the mask layer and convert it to a 'simpler' SpatialPolygons dataset
maskLayer <- as(readOGR(dsn = "./data-raw", layer = "pnpg_bounds"), "SpatialPolygons")
```

```
## OGR data source with driver: ESRI Shapefile
## Source: "./data-raw", layer: "pnpg_bounds"
## with 1 features
## It has 5 fields
```

Plot the data to see if everything is OK:

```
plot(rst, main="Elevation (meters) for Peneda-Geres\n National Park", xlab="X-coordinates",
ylab = "Y-coordinates")
plot(maskLayer, add=TRUE)
```

Finally, let’s mask the values for the PG National Park boundaries:

```
rstMasked <- mask(rst, maskLayer)
plot(rstMasked, main="Elevation (meters) for Peneda-Geres\n National Park", xlab="X-coordinates",
ylab = "Y-coordinates")
plot(maskLayer, add=TRUE)
```

From the image, we can see that only pixels occurring inside the Park are represented or ‘masked’. Exactly what we wanted!

Raster aggregation is the process of creating a new `RasterLayer` by grouping cell values in rectangular areas to create larger/coarser cells. This grouping can employ any user-defined function that summarizes the multiple values in each rectangular area into a single value (e.g., mean, sd, min, max, sum). This ‘upscaling’ to a coarser resolution allows you to represent and analyze the spatial distribution of cell values inside each rectangular area.

The `aggregate` function will be used for this purpose. The ‘coarseness’ of the aggregation is controlled by the `fact` parameter (aggregation factor), which expresses the number of cells in each direction (horizontal and vertical). Alternatively, two integers can be used to separately express the horizontal and vertical aggregation factors.

Let’s see how this works out with different aggregation factors:

- Aggregation factor 2 – pixel size 160 m
- Aggregation factor 7 – pixel size 560 m

```
# Aggregation factor = 2
rstAggFact2Mean <- aggregate(rst, fact = 2, fun = mean)
rstAggFact2SD <- aggregate(rst, fact = 2, fun = sd)
# Aggregation factor = 7
rstAggFact7Mean <- aggregate(rst, fact = 7, fun = mean)
rstAggFact7SD <- aggregate(rst, fact = 7, fun = sd)
# Plot the newly aggregated rasters
par(mfrow = c(2,2))
plot(rstAggFact2Mean, main = "Aggregation factor = 2 | Mean")
plot(rstAggFact2SD, main = "Aggregation factor = 2 | SD")
plot(rstAggFact7Mean, main = "Aggregation factor = 7 | Mean")
plot(rstAggFact7SD, main = "Aggregation factor = 7 | SD")
```

Notice the change in coarseness as we move from an aggregation factor of 2 to 7, with much less detail in the latter. In this particular example (a DEM analysis), larger factors allow us to understand general landforms (mean) as well as topographic heterogeneity (standard-deviation).
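
As noted above, `fact` can also take two integers for different horizontal and vertical factors. A minimal sketch with a small synthetic raster (names invented for this example):

```r
library(raster)

# 8 x 8 synthetic raster
r <- raster(matrix(1:64, nrow = 8, ncol = 8))

# Aggregate 2 cells horizontally and 4 cells vertically
rAgg <- aggregate(r, fact = c(2, 4), fun = mean)
ncol(rAgg)  # 8 / 2 = 4
nrow(rAgg)  # 8 / 4 = 2
```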

For this part of the tutorial, we will address zonal analysis. This method summarizes the values of a `Raster*` object for each “zone” defined in a second `RasterLayer` (typically by an integer code). The `zonal` function is used for this purpose. Notice that both `Raster*` input objects must have the same extent, resolution, and CRS.

Applications of this technique include summarizing cell values for administrative regions (like in the example explored below) or calculating summary statistics for raster segments (useful in an object-based image analysis approach).
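
Before moving to the real data, the mechanics can be sketched with two tiny synthetic rasters (values and names invented for illustration):

```r
library(raster)

# Value raster and a zone raster with integer codes 1 and 2
vals  <- raster(matrix(1:16, nrow = 4))
zones <- raster(matrix(rep(c(1, 2), each = 8), nrow = 4))

# Mean of 'vals' within each zone
zs <- zonal(vals, zones, fun = mean)
print(zs)
```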

In the example, we will calculate zonal statistics for elevation data for each civil parish (N=17) within the Peneda-Geres National Park (PGNP – NW Portugal).

As customary, we will start by downloading, uncompressing, and loading civil parish data:

```
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/CIVPARISH_PNPG.zip", "./data-raw/CIVPARISH_PNPG.zip", method = "auto")
unzip("./data-raw/CIVPARISH_PNPG.zip", exdir = "./data-raw")
rstCivPar <- raster("./data-raw/PNPG_CivilParishes.tif")
```

Plot the data to see if it’s OK:

```
par(mfrow = c(1,2))
plot(rst, main="Elevation (meters)")
plot(maskLayer, add=TRUE)
plot(rstCivPar, main="Civil parishes PGNP")
plot(maskLayer, add=TRUE)
```

We can now calculate elevation statistics for each civil parish (identified by an integer number). `zonal` accepts summarizing functions (in the argument `fun`) with single or multiple outputs. In the example below, we will calculate the mean.

```
# Single-output
zonal(rst, rstCivPar, fun=mean)
```

```
## zone value
## [1,] 1 1041.2407
## [2,] 2 308.6019
## [3,] 3 622.7727
## [4,] 4 790.4464
## [5,] 5 970.8552
## [6,] 6 787.0810
## [7,] 7 549.3588
## [8,] 8 963.2459
## [9,] 9 307.2516
## [10,] 10 1106.3882
## [11,] 11 1156.9009
## [12,] 12 713.7014
## [13,] 13 1044.0795
## [14,] 14 641.9319
## [15,] 15 1023.6904
## [16,] 16 867.7098
## [17,] 17 930.2353
```

In the next example, the column `value_1` holds the median and `value_2` the median absolute deviation (MAD), obtained through a specific multi-output function:

```
# Multi-output
zonal(rst, rstCivPar, fun=function(x,...) c(MED=median(x,...), MAD=mad(x,...)))
```

```
## zone value_1 value_2
## [1,] 1 1059.0 171.9816
## [2,] 2 271.0 131.9514
## [3,] 3 637.0 191.2554
## [4,] 4 813.5 405.4911
## [5,] 5 956.0 293.5548
## [6,] 6 765.0 392.8890
## [7,] 7 555.5 306.1569
## [8,] 8 954.0 219.4248
## [9,] 9 289.0 182.3598
## [10,] 10 1125.0 123.0558
## [11,] 11 1162.0 96.3690
## [12,] 12 698.0 332.1024
## [13,] 13 1024.0 240.1812
## [14,] 14 617.0 265.3854
## [15,] 15 1017.0 195.7032
## [16,] 16 875.0 413.6454
## [17,] 17 977.0 189.7728
```

Performing a cross-tabulation of two raster data-sets is very useful when, for example, you want to assess land cover changes between two different dates. It is also a preliminary step for generating a confusion matrix from which several classification performance metrics can be calculated.
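
The idea can be sketched with two tiny categorical rasters (invented values) before applying it to real land cover data:

```r
library(raster)

# Land cover at two hypothetical dates: one pixel changes from class 1 to 2
lc1 <- raster(matrix(c(1, 1, 2, 2), nrow = 2))
lc2 <- raster(matrix(c(1, 2, 2, 2), nrow = 2))

# Long-format contingency table of class transitions
ctDemo <- crosstab(lc1, lc2, long = TRUE)
print(ctDemo)
```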

In this example, we will use Corine Land Cover (CLC), a dataset from the European Environment Agency (EEA), for the years 2006 and 2012 to analyze changes in land cover composition. In this case, we have two categorical rasters with integer values corresponding to different land cover classes (see details in the table below).

```
clcLeg <- read.csv(url("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/legend_clc.csv"),
stringsAsFactors = FALSE)
clcLeg <- data.frame(clcLeg[,1:2],
CLC_abr=toupper(abbreviate(gsub("-"," ",clcLeg[,3]), 6)),
Label=clcLeg[,3], row.names = 1:nrow(clcLeg))
knitr::kable(clcLeg)
```

Raster_value | CLC_code | CLC_abr | Label |
---|---|---|---|
1 | 111 | CNTNUF | Continuous urban fabric |
2 | 112 | DSCNUF | Discontinuous urban fabric |
7 | 131 | MNRLES | Mineral extraction sites |
12 | 211 | NNIRAL | Non-irrigated arable land |
13 | 212 | PRMNIL | Permanently irrigated land |
15 | 221 | VNYRDS | Vineyards |
18 | 231 | PASTRS | Pastures |
19 | 241 | ANCWPC | Annual crops with permanent crops |
20 | 242 | CMPLCP | Complex cultivation patterns |
21 | 243 | AGRWNV | Agriculture with natural vegetation |
23 | 311 | BRDLVF | Broad-leaved forest |
24 | 312 | CNFRSF | Coniferous forest |
25 | 313 | MXDFRS | Mixed forest |
26 | 321 | NTRLGR | Natural grasslands |
27 | 322 | MRSANH | Moors and heathland |
29 | 324 | TRNSWS | Transitional woodland-shrub |
31 | 332 | BRRCKS | Bare rocks |
32 | 333 | SPRSVA | Sparsely vegetated areas |
33 | 334 | BRNTAR | Burnt areas |
40 | 511 | WTRCRS | Water courses |
41 | 512 | WTRBDS | Water bodies |

Now, let’s download, uncompress, and load the raster data into R and then perform the cross-tabulation:

```
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/CLC_06_12.zip", "./data-raw/CLC_06_12.zip", method = "auto")
unzip("./data-raw/CLC_06_12.zip", exdir = "./data-raw")
# Load the Corine Land cover dataset for 2006 and 2012
clc06 <- raster("./data-raw/clc2006_100m.tif")
clc12 <- raster("./data-raw/clc2012_100m.tif")
# 'Ratify' the rasters, i.e., inform that these are
# categorical/discrete datasets
clc06 <- ratify(clc06)
clc12 <- ratify(clc12)
# Perform the crosstab
ct <- crosstab(clc06, clc12, long = TRUE)
# Plot the contingency table
knitr::kable(ct)
```

  | clc2006_100m | clc2012_100m | Freq |
---|---|---|---|
1 | 1 | 1 | 28 |
25 | 2 | 2 | 503 |
49 | 7 | 7 | 50 |
73 | 12 | 12 | 1966 |
97 | 13 | 13 | 206 |
121 | 15 | 15 | 196 |
145 | 18 | 18 | 3242 |
169 | 19 | 19 | 6229 |
193 | 20 | 20 | 5638 |
197 | 25 | 20 | 1 |
217 | 21 | 21 | 14263 |
219 | 24 | 21 | 6 |
222 | 27 | 21 | 9 |
223 | 29 | 21 | 7 |
228 | 41 | 21 | 1 |
240 | 21 | 23 | 4 |
241 | 23 | 23 | 14763 |
243 | 25 | 23 | 5 |
246 | 29 | 23 | 167 |
249 | 33 | 23 | 28 |
265 | 24 | 24 | 6761 |
266 | 25 | 24 | 26 |
269 | 29 | 24 | 563 |
286 | 21 | 25 | 1 |
288 | 24 | 25 | 44 |
289 | 25 | 25 | 9183 |
292 | 29 | 25 | 637 |
313 | 26 | 26 | 12791 |
315 | 29 | 26 | 4 |
318 | 33 | 26 | 112 |
333 | 23 | 27 | 24 |
334 | 24 | 27 | 103 |
335 | 25 | 27 | 101 |
337 | 27 | 27 | 58718 |
338 | 29 | 27 | 54 |
341 | 33 | 27 | 172 |
353 | 19 | 29 | 1 |
356 | 23 | 29 | 371 |
357 | 24 | 29 | 1398 |
358 | 25 | 29 | 1466 |
359 | 26 | 29 | 2 |
360 | 27 | 29 | 96 |
361 | 29 | 29 | 23358 |
363 | 32 | 29 | 1 |
364 | 33 | 29 | 390 |
366 | 41 | 29 | 1 |
385 | 31 | 31 | 1706 |
406 | 27 | 32 | 15 |
407 | 29 | 32 | 1 |
409 | 32 | 32 | 35083 |
410 | 33 | 32 | 24 |
425 | 23 | 33 | 25 |
426 | 24 | 33 | 294 |
427 | 25 | 33 | 50 |
429 | 27 | 33 | 457 |
430 | 29 | 33 | 392 |
457 | 40 | 40 | 98 |
471 | 23 | 41 | 9 |
481 | 41 | 41 | 3800 |
505 | 128 | 128 | 97250 |

The first two columns of the table show, respectively, the land cover class in 2006 and in 2012. The third column shows the number of pixels (frequency). In cases where the values of columns 1 and 2 coincide, no land cover transition occurred. Conversely, different values are evidence of change.

We can also convert the contingency table into a confusion matrix with the following (not-so-pretty) code. Confusion matrices are sometimes easier to analyze than contingency tables…

```
# Get the class integer codes and size
lv <- unique(c(levels(ct[,1]), levels(ct[,2])))
n <- length(lv)
# Create the square confusion matrix filled with 0's
cm <- matrix(0, nrow = n, ncol = n, dimnames = list(lv,lv))
# Fill the matrix following each line of the contingency table
for(i in 1:nrow(ct)){
cm[ct[i,1], ct[i,2]] <- ct[i,3]
}
knitr::kable(cm)
```

  | 1 | 2 | 7 | 12 | 13 | 15 | 18 | 19 | 20 | 21 | 23 | 24 | 25 | 26 | 27 | 29 | 31 | 32 | 33 | 40 | 41 | 128 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 503 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 1966 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 | 206 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 196 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 3242 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6229 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5638 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14263 | 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 14763 | 0 | 0 | 0 | 24 | 371 | 0 | 0 | 25 | 0 | 9 | 0 |
24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 6761 | 44 | 0 | 103 | 1398 | 0 | 0 | 294 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 5 | 26 | 9183 | 0 | 101 | 1466 | 0 | 0 | 50 | 0 | 0 | 0 |
26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 12791 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 0 | 0 | 58718 | 96 | 0 | 15 | 457 | 0 | 0 | 0 |
29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 167 | 563 | 637 | 4 | 54 | 23358 | 0 | 1 | 392 | 0 | 0 | 0 |
31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1706 | 0 | 0 | 0 | 0 | 0 |
32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 35083 | 0 | 0 | 0 | 0 |
33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 0 | 0 | 112 | 172 | 390 | 0 | 24 | 0 | 0 | 0 | 0 |
40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 98 | 0 | 0 |
41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3800 | 0 |
128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 97250 |
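
The same reshaping done by the loop above can be expressed in one line with base R’s `xtabs`, illustrated here with a small stand-in for the contingency table (column names invented for the example):

```r
# Long-format transitions: class 'from' at date 1, 'to' at date 2
ct <- data.frame(from = factor(c(1, 1, 2)),
                 to   = factor(c(1, 2, 2)),
                 Freq = c(10, 3, 7))

# xtabs fills unobserved combinations (e.g., 2 -> 1) with zeros
cm <- xtabs(Freq ~ from + to, data = ct)
print(cm)
```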

From either the contingency table or the matrix, we can assess how many pixels remained in the same class or, for some reason, changed to another land cover category. If we look closely, we see that several areas of forest (class IDs 23, 24 and 25) changed to class ID 29 (*‘Transitional woodland-shrub’*) and 33 (*‘Burnt areas’*), thus indicating forest loss.

Using a simple rule-set, we can identify pixels that correspond to forest loss areas, identified by the following class transition: *2006 Class IDs {23, 24, 25} —> 2012 Class IDs {29, 33}*

```
# Calculate
forestLossAreas <- (clc06 %in% 23:25) & (clc12 %in% c(29,33))
# Plot the results
plot(forestLossAreas, main="Forest loss in PGNP (NW PT)", xlab="x-coord", ylab="y-coord")
plot(spTransform(maskLayer, CRSobj = crs(clc06)), add=TRUE)
```

The areas highlighted in green correspond to forest and/or habitat quality loss (probably due to wildfires…).
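
Building on the objects from the chunk above, the affected area can be quantified with `cellStats` (a sketch; `forestLossAreas` and `clc06` come from the previous step):

```r
# TRUE pixels count as 1, so the sum is the number of forest-loss pixels
nLoss <- cellStats(forestLossAreas, sum)

# Convert the pixel count to hectares (CLC pixels are 100 m x 100 m = 1 ha)
lossHa <- nLoss * prod(res(clc06)) / 10000
print(lossHa)
```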

This concludes our exploration of the raster package for this post. Hope you find it useful!



Answers to the exercises are available here.

You can also check more about raster data on the tutorial series about this topic here.

Start by downloading, uncompressing, and loading the sample data for these exercises from this link (digital elevation model data from SRTM-v4.1 for the Peneda-Geres National Park, Portugal). The data is in GeoTIFF format with the file name *srtm_pnpg.tif*.

**Exercise 1**

Check out the size of the data in terms of number of rows, columns, cells and layers.

**Exercise 2**

Check the spatial resolution of the raster and its coordinate reference system (CRS).

**Exercise 3**

Get the raster extent object and calculate the ‘height’ (in the y-axis) and the ‘length’ (in the x-axis) of the raster.

**Exercise 4**

Calculate the mean and standard-deviation for all pixels.

**Exercise 5**

Calculate the 1%, 25%, 50%, 75% and 99% quantiles for all pixels.

**Exercise 6**

Using a QQ-plot, investigate deviations from normality in the distribution of elevation values.

**Exercise 7**

Extract raster values for 100 randomly generated points within the image (use `set.seed(12345)` to obtain the same values as in the solutions).

**Exercise 8**

Convert the elevation units of the DEM from meters to feet. Make a `RasterStack` object containing both rasters: meters (original) and feet (new).

**Exercise 9**

Crop the raster to the following extent: upper-left {ymax = 4654705, xmin = 554615} and lower-right {ymin = 4618355, xmax = 589015}.

**Exercise 10**

Re-project the sample raster to Datum ETRS 1989 (European Terrestrial Reference System 1989), projection Lambert Azimuthal Equal-Area (LAEA), and change the resolution to 100m using the bilinear method.


In the second part of this tutorial series on spatial data analysis using the `raster` package, we will explore new functionalities, namely:

- Raster algebra
- Cropping
- Reprojection and resampling

We will also introduce a new type of object named `RasterStack`, which, in essence, is a collection of `RasterLayer` objects with the same spatial extent, resolution, and coordinate reference system (CRS).

For more information on raster data processing, see here and here.

We will start this tutorial by downloading the sample raster data and creating a `RasterStack` composed of multiple image files. One satellite scene from Landsat 8 will be used for this purpose. The data contains surface reflectance information for seven spectral bands (or layers, following the terminology for `RasterStack` objects) in GeoTIFF file format.

The following table summarizes info on Landsat 8 spectral bands used in this tutorial.

Band # | Band name | Wavelength (micrometers) |
---|---|---|
Band 1 | Ultra Blue | 0.435 – 0.451 |
Band 2 | Blue | 0.452 – 0.512 |
Band 3 | Green | 0.533 – 0.590 |
Band 4 | Red | 0.636 – 0.673 |
Band 5 | Near Infrared (NIR) | 0.851 – 0.879 |
Band 6 | Shortwave Infrared (SWIR) 1 | 1.566 – 1.651 |
Band 7 | Shortwave Infrared (SWIR) 2 | 2.107 – 2.294 |

Landsat 8 spatial resolution (or pixel size) equals 30 meters. Valid reflectance values are decimals typically within 0.00 – 1.00; but, to decrease file size, this range is multiplied by a 10^4 scaling factor, yielding integers in the range 0 – 10000. The image acquisition date is the 15th of July 2015.

```
library(raster)
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/LT8_PNPG.zip", "./data-raw/LT8_PNPG.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/LT8_PNPG.zip", exdir = "./data-raw")
```

With the data downloaded and uncompressed, we can now generate a `RasterStack` object. The `stack` function accepts a character vector as input, containing the paths to each raster layer. To generate this vector, we will use the `list.files` function.

```
# Get file paths and check/print the list
fp <- list.files(path = "./data-raw", pattern = ".tif$", full.names = TRUE)
print(fp)
```

```
## [1] "./data-raw/LC82040312015193LGN00_sr_band1.tif"
## [2] "./data-raw/LC82040312015193LGN00_sr_band2.tif"
## [3] "./data-raw/LC82040312015193LGN00_sr_band3.tif"
## [4] "./data-raw/LC82040312015193LGN00_sr_band4.tif"
## [5] "./data-raw/LC82040312015193LGN00_sr_band5.tif"
## [6] "./data-raw/LC82040312015193LGN00_sr_band6.tif"
## [7] "./data-raw/LC82040312015193LGN00_sr_band7.tif"
```

```
# Create the raster stack and print basic info
rst <- stack(fp)
print(rst)
```

```
## class : RasterStack
## dimensions : 1545, 1480, 2286600, 7 (nrow, ncol, ncell, nlayers)
## resolution : 30, 30 (x, y)
## extent : 549615, 594015, 4613355, 4659705 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=29 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : LC8204031//0_sr_band1, LC8204031//0_sr_band2, LC8204031//0_sr_band3, LC8204031//0_sr_band4, LC8204031//0_sr_band5, LC8204031//0_sr_band6, LC8204031//0_sr_band7
## min values : -27, -1, 29, -86, -216, -212, -102
## max values : 3170, 3556, 4296, 4931, 6904, 7413, 6696
```
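
As a side note, the 10^4 scaling factor mentioned earlier can be undone with simple raster algebra (a sketch; the `reflectance` name is introduced here):

```r
# Convert the scaled integers back to reflectance in [0, 1]
# (small negative values are atmospheric-correction artifacts)
reflectance <- rst / 10000
```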

Changing raster layer names (which are usually difficult to read, as we saw above) is really straightforward. Also, if necessary, using simple names makes it easier to access layers *‘by name’* in the `RasterStack`.

`names(rst) <- paste("b",1:7,sep="")`

Let’s check if the data is being stored in memory:

`inMemory(rst)`

`## [1] FALSE`

Similarly to `RasterLayer` objects, by default (and unless necessary), a `RasterStack` object only holds metadata and connections to the actual data, to spare memory.
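
If needed, the values can be forced into memory with `readAll` (only advisable when the raster comfortably fits in RAM):

```r
# Read all cell values from disk into memory
rstMem <- readAll(rst)
inMemory(rstMem)  # now TRUE
```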

Now, let’s plot the data for a fast visualization.

`plot(rst)`

Notice how each layer gets a separate tile in the plot.

**Raster algebra**

Now we can proceed to some raster algebra calculations. We will accomplish this using three different methods: (i) direct computation, (ii) the `calc` function, and (iii) the `overlay` function. In this example, we will calculate the Normalized Difference Vegetation Index (NDVI) using the red (b4) and near-infrared (NIR; b5) bands as:

NDVI = (NIR – Red) / (NIR + Red)

**Method #1** (Direct)

This method allows you to directly use the raster layers in the stack, called by their indices (or names). Typical operators (e.g., `+`, `-`, `/`, `*`) can be used, as well as functions (e.g., `sqrt`, `log`, `cos`). However, since processing occurs all at once in memory, you must be sure that your data fits into RAM.

```
# Calling raster layers by index
ndvi <- (rst[[5]] - rst[[4]]) / (rst[[5]] + rst[[4]])
# Or calling by name
ndvi <- (rst[["b5"]] - rst[["b4"]]) / (rst[["b5"]] + rst[["b4"]])
```

Notice how the data types of the input rasters and the final raster (a ratio) differ (from integer to float; see `?dataType` for details):

`dataType(rst)`

`## [1] "INT2S" "INT2S" "INT2U" "INT2S" "INT2S" "INT2S" "INT2S"`

`dataType(ndvi)`

`## [1] "FLT4S"`

**Method #2** (Calc Function)

For large objects, `calc` will compute values by raster chunks, thus saving memory. This means that, for the result of the defined function to be correct, it should not depend on having access to all values at once.

```
calcNDVI_1 <- function(x) return((x[[5]] - x[[4]]) / (x[[5]] + x[[4]]))
ndvi1 <- calc(rst, fun = calcNDVI_1)
```

**Method #3** (Overlay Function)

The `overlay` function allows you to combine two (or more) `Raster*` objects. It should be more efficient when using large raster datasets that cannot be loaded into memory (similarly to `calc`).

```
calcNDVI_2 <- function(x, y) return((x - y) / (x + y))
ndvi2 <- overlay(x = rst[[5]], y = rst[[4]], fun = calcNDVI_2)
```

Overall, the first method is not advisable in cases where the raster data is “big”. In those cases, it is recommended to use more “memory-friendly” methods such as `calc` or `overlay`. Also, as a general rule, if a calculation needs to use multiple individual layers separately (sometimes in different objects), it will be easier to set up in `overlay` than in `calc`.

Plotting the NDVI data requires some fine-tuning, because some ‘strange’ values appeared. Note that the NDVI range is between -1.00 and 1.00. In the summary below, notice how the ‘resistant’ measures (quartiles) are fine, but not the extremes. For NDVI, values closer to 1 represent higher vegetation cover.

```
# NDVI summary
summary(ndvi)
# Set values outside the 'normal' range as NA's
# Indexing for RasterLayers works similarly to matrix or data frame objects
ndvi[ndvi < -1] <- NA
ndvi[ndvi > 1] <- NA
# Plot NDVI
plot(ndvi, main="NDVI Peneda-Geres National Park", xlab = "X-coordinates", ylab = "Y-coordinates")
```

It is also fairly easy to perform logical operations. For example, creating an NDVI mask with values above 0.4:

```
ndviMask <- ndvi > 0.4
plot(ndviMask, main="NDVI mask", xlab = "X-coordinates", ylab = "Y-coordinates")
```

This creates a new Boolean raster with 0’s for pixels that are equal to or below 0.4 and 1’s for values above 0.4. This is very useful for separating vegetated from non-vegetated surfaces! 🙂
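
Building on the mask from the chunk above, the vegetated area can be quantified directly (a sketch; `ndviMask` comes from the previous step):

```r
# Number of pixels above the 0.4 threshold (TRUE counts as 1)
nVeg <- cellStats(ndviMask, sum)

# Convert to hectares using the 30 m pixel size
vegHa <- nVeg * prod(res(ndviMask)) / 10000
print(vegHa)
```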

Often, we want to crop (or clip) a raster dataset to a specific study area. For doing that, the `raster` package provides the `crop` function, which accepts as input a `Raster*` object and an `Extent` object used to define the new bounding coordinates (see `?extent` for more details).

```
# Bounding coordinates
xmin <- 554615
xmax <- 589015
ymin <- 4618355
ymax <- 4654705
# Create the extent object by defining the bounding coordinates
newExtent <- extent(xmin, xmax, ymin, ymax)
# Crop
cropRst <- crop(rst, newExtent)
```
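One detail worth knowing: `crop` snaps the requested extent to the input’s cell boundaries, so the output extent may differ slightly from the coordinates you asked for. A sketch with a synthetic 10 m grid (hypothetical values):

```r
library(raster)

# Synthetic grid with 10 x 10 m cells
r <- raster(nrows = 100, ncols = 100, xmn = 0, xmx = 1000, ymn = 0, ymx = 1000,
            vals = 1:10000)

# The requested coordinates are snapped to cell boundaries (multiples of 10 here)
cr <- crop(r, extent(103, 497, 201, 799))
```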

Often, after downloading some raster data (e.g., satellite imagery) for a given area, it is necessary to change its coordinate reference system (CRS). The `projectRaster` function allows projecting the values of a `Raster*` object to a new one with another CRS. It is possible to do this by providing the new projection info as a single argument (a `CRS` object); in this case, the function sets the extent and resolution of the new object. To ensure that the newly created object lines up with other datasets, you can instead provide a target `Raster*` object with the properties that the input data should be projected to. `projectRaster` also allows changing the spatial resolution (or pixel size) of the input raster.

In the first example, we will keep the same Datum as in the original data, but change from a projected CRS (in Universal Transverse Mercator – UTM 29N) to a geographic lat/lon CRS. Notice how the pixel size is not constant across the x- and y-axes.

```
# Create an object of class CRS with the target reference system
targetCRS <- CRS("+init=epsg:4326")
# Reproject
ndvi_ReprojWGS84 <- projectRaster(ndvi, method = "ngb", crs = targetCRS)
print(ndvi_ReprojWGS84)
```

```
## class : RasterLayer
## dimensions : 1575, 1504, 2368800 (nrow, ncol, ncell)
## resolution : 0.000362, 0.00027 (x, y)
## extent : -8.40579, -7.861342, 41.6645, 42.08975 (xmin, xmax, ymin, ymax)
## coord. ref. : +init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : layer
## values : -18, 40 (min, max)
```

In this second example, we will change the data to the Portuguese official CRS: Datum ETRS 1989, Transverse Mercator projection, Ellipsoid GRS 1980 (see more details here).

```
# Create an object of class CRS with the target reference system
targetCRS <- CRS("+proj=tmerc +lat_0=39.66825833333333 +lon_0=-8.133108333333334 +k=1 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs ")
# Reproject
ndvi_ReprojETRS89 <- projectRaster(ndvi, method = "ngb", crs = targetCRS)
print(ndvi_ReprojETRS89)
```

```
## class : RasterLayer
## dimensions : 1570, 1506, 2364420 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : -22707.33, 22472.67, 221784.7, 268884.7 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=tmerc +lat_0=39.66825833333333 +lon_0=-8.133108333333334 +k=1 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs
## data source : in memory
## names : layer
## values : -18, 40 (min, max)
```
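This CRS (ETRS89 / Portugal TM06) is registered in the EPSG database as code 3763, so — assuming your `rgdal`/PROJ installation ships the EPSG tables — the same target can likely be written more compactly. A sketch with a synthetic UTM 29N layer standing in for the NDVI raster (all values hypothetical):

```r
library(raster)
library(rgdal)

# Small synthetic UTM 29N layer standing in for the NDVI raster
r <- raster(nrows = 10, ncols = 10,
            xmn = 554615, xmx = 555415, ymn = 4618355, ymx = 4619155,
            crs = CRS("+proj=utm +zone=29 +datum=WGS84 +units=m +no_defs"),
            vals = runif(100))

# EPSG:3763 = ETRS89 / Portugal TM06
r_etrs89 <- projectRaster(r, method = "ngb", crs = CRS("+init=epsg:3763"))
```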

Now, let’s resample the data from the initial 30 m resolution of Landsat 8 to 25 m. For this purpose, we use the `res` parameter:

```
ndvi_ReprojETRS89_25m <- projectRaster(ndvi, res = 25, method = "ngb", crs = targetCRS)
print(ndvi_ReprojETRS89_25m)
```

```
## class : RasterLayer
## dimensions : 1882, 1805, 3397010 (nrow, ncol, ncell)
## resolution : 25, 25 (x, y)
## extent : -22682.33, 22442.67, 221809.7, 268859.7 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=tmerc +lat_0=39.66825833333333 +lon_0=-8.133108333333334 +k=1 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs
## data source : in memory
## names : layer
## values : -18, 40 (min, max)
```
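When alignment with an existing dataset matters, `projectRaster` can take that dataset directly through the `to` argument instead of a `crs`/`res` pair — the output then inherits the template’s grid, extent, and resolution. A sketch with synthetic data (the object names are ours):

```r
library(raster)
library(rgdal)

# Source layer in a projected CRS (hypothetical values)
r_from <- raster(nrows = 10, ncols = 10,
                 xmn = 550000, xmx = 551000, ymn = 4600000, ymx = 4601000,
                 crs = CRS("+proj=utm +zone=29 +datum=WGS84 +units=m +no_defs"),
                 vals = runif(100))

# A template Raster* defines the target grid (here: the same data in lat/lon)
r_template <- projectRaster(r_from, crs = CRS("+proj=longlat +datum=WGS84 +no_defs"))

# Projecting 'to' the template guarantees both rasters line up exactly
r_aligned <- projectRaster(from = r_from, to = r_template, method = "ngb")
```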

This concludes our exploration of the raster package for this post. Hope you find it useful!

- Spatial Data Analysis: Introduction to Raster Processing (Part 3)
- Spatial Data Analysis: Introduction to Raster Processing (Part 2)
- Advanced Techniques With Raster Data: Part 1 – Unsupervised Classification

Geospatial data is becoming increasingly used to solve numerous ‘real-life’ problems (check out some examples here). In turn, R is becoming more equipped than ever to handle this type of data, providing an exceptional open-source solution to many problems in the Geographic Information Sciences and Remote Sensing domains.

In general, two types of geospatial data models are used to represent, visualize, and model spatial phenomena. These are:

- **Vector Data**: This represents the world in three simple geometries: **points**, **lines**, and **polygons**. As such, it allows you to represent spatial phenomena or variables that are typically discrete and with well-defined boundaries (e.g., touristic points-of-interest, gas stations, rivers, roads, drainage basins, country GDP).
- **Raster Data**: This provides support for representing spatial phenomena by dividing the surface into a grid (or matrix) composed of cells of regular size. Each raster dataset has a certain number of columns and rows, and each cell contains a value with information for the variable of interest. Stored data can be either: (i) thematic – representing a *discrete* variable (e.g., land cover classification map), or (ii) *continuous* (e.g., elevation).

Choosing the appropriate data model to use depends on the domain of application and the specific problem at hand. Typically, people from social sciences tend to use the vector data model more. R packages such as **sp** or **sf** (a relatively new package, starting in 2016), provide support for this type of data. In contrast, in the environmental sciences, the raster data model is more often used because of satellite data, or the need to represent spatially continuous phenomena, such as pollution levels, temperature or precipitation values, the abundance or habitat suitability for a species, etc. The **raster** package, introduced in March 2010 by Robert Hijmans & Jacob van Etten, currently provides many useful functions for using this type of data. Despite these differences, GIS specialists and researchers often use both data models to tackle their problems.

Throughout these posts, we will cover the basics, intermediate, and some advanced stuff in **raster data** handling, manipulation, and modeling in R. Examples will be given along with the tutorials. Some exercises, with different difficulty levels, will be provided so you can practice.

The `raster` package currently provides an extensive set of functions to create, read, export, manipulate, and process raster datasets. It also provides low-level functionalities for creating more advanced processing chains, as well as the ability to manage large datasets. For more information, see: `vignette("functions", package = "raster")`.

This first post on raster data is divided into two sub-sections:

(i) accessing raster attributes, and (ii) viewing raster values and calculating simple statistics.

First, we need to install the `raster` package (as well as `sp` and `rgdal`):

```
if(!("rgdal" %in% installed.packages()[,1]))
  install.packages("rgdal", dependencies = TRUE)
if(!("sp" %in% installed.packages()[,1]))
  install.packages("sp", dependencies = TRUE)
if(!("raster" %in% installed.packages()[,1]))
  install.packages("raster", dependencies = TRUE)

library(rgdal)
library(sp)
library(raster)
```

Next, download and unzip the sample data. We will use SRTM – version 4.1 elevation data (in meters a.s.l.) for the Peneda-Geres National Park, Portugal, in the examples.

```
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/srtm_pnpg.zip", "./data-raw/srtm_pnpg.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/srtm_pnpg.zip", exdir = "./data-raw")
```

In the first part (of two) of this tutorial, we will focus on reading raster data and accessing its core attributes.

After finishing the download, load the data into R using the `raster` function (see `?raster` for more details). Then use `print` to inspect the “essential” attributes of the dataset.

```
# In this example the function uses a string with the data location
rst <- raster("./data-raw/srtm_pnpg.tif")
# Print raster attributes
print(rst)
```

```
## class : RasterLayer
## dimensions : 579, 555, 321345 (nrow, ncol, ncell)
## resolution : 80, 80 (x, y)
## extent : 549619.7, 594019.7, 4613377, 4659697 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=29 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : D:\MyDocs\R-dev\R_exercises_raster_tutorial\data-raw\srtm_pnpg.tif
## names : srtm_pnpg
## values : 9, 1520 (min, max)
```

From the above, we can see some important information about our raster dataset. Given that we used the `raster` function for data loading, we have now created a `RasterLayer`, i.e., a raster object with a single layer. We can also see its dimensions: 579 rows, 555 columns, and the pixel size in the x and y dimensions, a.k.a. the **spatial resolution**, equal to 80m (we are using a projected coordinate system with units in meters; more on this below).

We can use the function `inMemory` to check if the raster dataset is currently stored in RAM:

`inMemory(rst)`

`## [1] FALSE`

As we can see, the raster data is currently stored on the disk. So, at this point, our `RasterLayer` object is actually “made of” metadata and a link to the actual raster data on disk. This allows preserving RAM space.

The package also provides several functions to access each raster attribute individually.

```
## Raster layer name(s) / more useful for multi-layer rasters
## By default coincides with the file name without extension
names(rst)
```

`## [1] "srtm_pnpg"`

```
## Number of rows, columns and layers
dim(rst)
```

`## [1] 579 555 1`

```
## Nr of rows
nrow(rst)
```

`## [1] 579`

```
# Nr of columns
ncol(rst)
```

`## [1] 555`

```
## Total number of grid cells
ncell(rst)
```

`## [1] 321345`

```
## Spatial resolution in x and y dimensions
res(rst)
```

`## [1] 80 80`

```
## Data type - see ?dataType for more details
dataType(rst)
```

`## [1] "INT2S"`

```
## Extent (returns a S4 object of class "Extent")
extent(rst)
```

```
## class : Extent
## xmin : 549619.7
## xmax : 594019.7
## ymin : 4613377
## ymax : 4659697
```

Info on extent coordinates can be retrieved individually:

`xmin(rst)`

`## [1] 549619.7`

`xmax(rst)`

`## [1] 594019.7`

`ymin(rst)`

`## [1] 4613377`

`ymax(rst)`

`## [1] 4659697`

Finally, we can also see info about the Coordinate Reference System (CRS) used to represent the data. Many different CRSs are used to describe geographic data, depending on the location, extent, time, and domain (among other features) of the collected data.

`crs(rst)`

```
## CRS arguments:
## +proj=utm +zone=29 +datum=WGS84 +units=m +no_defs +ellps=WGS84
## +towgs84=0,0,0
```

For the raster package, a **proj4string** is used to set and define the CRS of the data. This string contains some important details of the CRS, such as the *Projection*, the *Datum*, the *Ellipsoid* and the *units* (e.g., meters, degree). You can see more info on *proj4* parameters here. Use the site spatialreference.org to find the appropriate *proj4string* (or other information) for the CRS of your choice.
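If you know the EPSG code of a CRS, it can be used instead of spelling out the full *proj4* parameters (assuming the EPSG tables installed with `rgdal`/PROJ are available). For example, the WGS 84 / UTM zone 29N system used here is registered as EPSG:32629:

```r
library(sp)
library(rgdal)

# Two equivalent ways of building the same CRS object
crs_proj4 <- CRS("+proj=utm +zone=29 +datum=WGS84 +units=m +no_defs")
crs_epsg  <- CRS("+init=epsg:32629")
```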

**Summary Statistics**

For the second, and last part, of this tutorial, we are going to explore raster functions for visualizing, summarizing and accessing/querying values at specific locations.

To visualize the data, we can simply use the function `plot`.

```
plot(rst, main="Elevation (meters) for Peneda-Geres\n National Park, Portugal",
xlab="X-coordinates", ylab="Y-coordinates")
```

We can also use a histogram to visualize the distribution of elevation values in the sample data.

```
# Generate histogram from a sample of pixels (by default 100K are randomly used)
hist(rst, col="light grey", main = "Histogram of elevation", prob = TRUE,
xlab = "Elevation (meters a.s.l.)")
# Generate the density plot object and then overlap it
ds <- density(rst, plot = FALSE)
lines(ds, col = "red", lwd = 2)
```

Calculating summary statistics is fairly easy using the `raster` package. The generic method `summary` is available for this type of object (note: this function will use a sample of pixels to calculate statistics).

`summary(rst)`

`## Warning in .local(object, ...): summary is an estimate based on a sample of 1e+05 cells (31.12% of all cells)`

```
## srtm_pnpg
## Min. 11
## 1st Qu. 529
## Median 776
## 3rd Qu. 984
## Max. 1520
## NA's 0
```

Minimum and maximum values can be calculated with the following functions (no sample employed):

```
## Min
minValue(rst)
```

`## [1] 9`

```
## Max
maxValue(rst)
```

`## [1] 1520`

The package also provides a more general interface for calculating cell statistics: the `cellStats` function (no sample employed).

```
## Mean
cellStats(rst, mean)
```

`## [1] 747.2759`

```
## Standard-deviation
cellStats(rst, sd)
```

`## [1] 311.8615`

```
## Median
cellStats(rst, median)
```

`## [1] 774`

```
## Median-absolute deviation (MAD)
cellStats(rst, mad)
```

`## [1] 336.5502`

```
## Quantiles
## 5%, 25%, 50%, 75% and 95%
cellStats(rst, function(x,...) quantile(x, probs=c(0.05, 0.25, 0.5, 0.75, 0.95),...))
```

```
## 5% 25% 50% 75% 95%
## 186 527 774 983 1224
```

`cellStats` does not use a random sample of the pixels to calculate statistics. Hence, it will fail (gracefully) for very large `Raster*` objects, except for certain predefined functions: `sum`, `mean`, `min`, `max`, `sd`, `'skew'`, and `'rms'`.
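For rasters that comfortably fit in memory, an alternative to `cellStats` is to pull all cell values into a plain vector with `getValues` and then apply any base R function directly (no sampling involved). A sketch on a synthetic layer standing in for `rst`:

```r
library(raster)

# Synthetic stand-in for rst
r <- raster(nrows = 10, ncols = 10, vals = 1:100)

v <- getValues(r)   # all cell values as a numeric vector
quantile(v, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)
```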

**Extracting Values**

The `raster` package allows several possibilities to extract data at specific points, lines, or polygons. The `extract` function used for this purpose accepts as input a two-column `matrix` or `data.frame` (with x, y coordinates), or spatial objects from the `sp` package such as `SpatialPoints*`, `SpatialPolygons*`, `SpatialLines`, or `Extent`.

For the first example, we will start by extracting raster values using points as input:

```
## One specific point location (with coordinates in the same CRS as the input raster)
xy <- data.frame(x = 570738, y = 4627306)
xy <- SpatialPoints(xy, proj4string = crs(rst))
extract(rst, xy)
```

```
##
## 611
```

```
## Extract raster values for 20 randomly located points
xy <- data.frame(x = runif(20, xmin(rst), xmax(rst)), y = runif(20, ymin(rst), ymax(rst)))
xy <- SpatialPoints(xy, proj4string = crs(rst))
extract(rst, xy)
```

```
## [1] 137 761 1034 828 834 837 691 342 597 272 1263 935 270 1240
## [15] 339 1136 1245 991 1073 1153
```
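`extract` can also summarize the pixels within a given distance of each point through its `buffer` (in map units) and `fun` arguments. A sketch on a synthetic projected grid (coordinates and values are hypothetical):

```r
library(raster)
library(sp)

# Synthetic projected grid with 80 m cells
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 800, ymn = 0, ymx = 800,
            crs = CRS("+proj=utm +zone=29 +datum=WGS84 +units=m +no_defs"),
            vals = 1:100)

pt <- SpatialPoints(data.frame(x = 400, y = 400), proj4string = crs(r))

# Mean of all pixels within 200 m of the point
meanVal <- extract(r, pt, buffer = 200, fun = mean)
```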

Typically, we are also interested in extracting raster values for specific regions-of-interest (ROI). In this example, we will use a polygon (a broad-leaf forest area) to assess the distribution of elevation values within it.

```
## Download the vector data with the woodland patch ROI
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/WOODS_PNPG.zip", "./data-raw/WOODS_PNPG.zip", method = "auto")
## Uncompress the data
unzip("./data-raw/WOODS_PNPG.zip", exdir = "./data-raw")
## Convert the data into SpatialPolygons (discards the attached attribute but keeps geometry)
woods <- as(readOGR(dsn = "./data-raw", layer = "woods_pnpg"), "SpatialPolygons")
```

```
## OGR data source with driver: ESRI Shapefile
## Source: "./data-raw", layer: "woods_pnpg"
## with 1 features
## It has 6 fields
```

Let’s check out the polygon data with a simple plot:

```
## Plot elevation raster
plot(rst, main="Elevation (meters) for Peneda-Geres\n National Park, Portugal",
xlab="X-coordinates", ylab="Y-coordinates")
## Add the ROI
plot(woods, add = TRUE)
```

Now, let’s extract the raster values from the polygon ROI and calculate some statistics:

```
elev <- extract(rst, woods)[[1]] ## Subset the first (and only) geometry element
# Tukey's five number summary: minimum, lower-hinge, median, upper-hinge, and, maximum
fivenum(elev)
```

`## [1] 556 677 745 822 1086`

When using `extract` with a `SpatialPolygons*` object, by default, we get a `list` containing a set of raster values for each individual polygon.
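If only a summary per polygon is needed, the list step can be skipped by passing `fun` directly to `extract`; sketched below with a hypothetical square polygon in place of the woods ROI:

```r
library(raster)
library(sp)

# Synthetic raster and a hypothetical square ROI
r <- raster(nrows = 10, ncols = 10, xmn = 0, xmx = 10, ymn = 0, ymx = 10, vals = 1:100)
p <- SpatialPolygons(list(Polygons(list(
  Polygon(cbind(c(2, 8, 8, 2), c(2, 2, 8, 8)))), ID = "roi")))

# One summary value per polygon, instead of a list of raster values
meanElev <- extract(r, p, fun = mean, na.rm = TRUE)
```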

Now, using the extracted values, we can investigate the distribution of elevation values for the target patch.

```
hist(elev, main = "Histogram of ROI elevation", xlab = "Elevation (meters a.s.l.)")
abline(v = mean(elev), lwd = 2) ## Mean line
```

This concludes our first exploration of the `raster` package – an awesome resource for handling geospatial data in R! Hope you find this post useful.

Cheers!