
Cryptocurrencies are digital assets that facilitate the exchange of value between two parties. They are intended to be secure: encryption systems are used to regulate the generation of coins and to verify the exchange of value between the parties involved. Cryptocurrencies have recently taken the world by storm, grabbing attention from investors, academia, financial institutions, governments, and individuals.

The crypto package retrieves data on cryptocurrencies from ‘Cryptocurrency Market Capitalizations’ (CoinMarketCap), a website that provides current and historical data on numerous cryptocurrencies, by scraping its tables.

Answers to these exercises are available here.

Please install and load the **crypto** package, along with all its dependencies. You will also need to be connected to the internet to retrieve data from the website. This exercise also assumes basic knowledge of **ggplot2** for graph plotting and will require you to load the package.
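A minimal setup sketch (package names as used in this set; `crypto` availability on CRAN may vary):

```r
# Install (if missing) and load the packages used in this exercise set
for (p in c("crypto", "ggplot2")) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}
library(crypto)
library(ggplot2)
```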

**Exercise 1**

Get the list of all coins available in CoinMarketCap.

**Exercise 2**

Retrieve the current prices and market capitalization data for the top 20 coins (by Market Capitalization.)

**Exercise 3**

Plot a bar graph to see the top 20 coins by market capitalization.

**Exercise 4**

Which coin (name) among the top 20 has been most profitable in the last 7 days? Store the name in a variable called “*most_profitable*.”

**Exercise 5**

Retrieve data on historical prices from May 1st, 2017 to May 1st, 2018 for the most_profitable coin. Store data in a variable called “*historical_data*.”

**Exercise 6**

Plot the closing price in a line graph to see how the cryptocurrency price has fluctuated in recent times from “*historical_data*.” Save the plot as a variable named “*p*.”

**Exercise 7**

Cryptocurrency prices fluctuate a lot within a day. Let’s add the absolute price spread (maximum price – minimum price) from “*historical_data*” to the graph plotted above. In case the plot doesn’t come out well due to bad data (like in the figure on the solutions page), try the same graph for “Bitcoin”: replace “most_profitable” with “Bitcoin” in the solutions for Exercises 5, 6, and 7.

**Exercise 8**

Now, let’s retrieve historical data for a few more coins. Use the function **getCoins()** to retrieve data from May 1st, 2017 to May 1st, 2018 for the top 20 coins (hint: use the “*coins_data*” variable.)

We will use the above data for exploring Cryptocurrency markets further in coming exercises.


This set of exercises is about the `shinystan` package.
As we already know, the STAN platform typically uses particular Markov Chain Monte Carlo (MCMC) algorithms: Hamiltonian Monte Carlo (HMC) and the No-U-Turn sampler (NUTS). It is also typically more efficient than other popular samplers, such as the Gibbs sampler or the Metropolis–Hastings algorithm, because its updates converge more quickly to the stationary posterior distribution.

We have already introduced STAN in the three previous posts: an introduction with `rstan`, then with `rstanarm`, and with `bayesplot`. Now, the goal is to use all the knowledge learned so far to build a shiny application through the `shinystan` framework.

Several vignettes are available for the `shinystan` package: for examples, see here, or here for deployment to the shinyapps server. The solutions to this set of exercises can be found here.

**Exercise 1**

In order to have a first visualization of the structure of a shiny application from STAN, run the demo app of the `shinystan` package. This demo app comes from the “Meta-analysis” chapter of the STAN manual. You can also see it in this rstan vignette, in the example section.

**Exercise 2**

We want to simulate data in order to build our own shiny application with the `shinystan` package.

We will take the same kind of example as in the previous sets of exercises (see the exercises with the “Bayesian inference” and “MCMC” tags), with the two-parameter Gumbel distribution from Extreme Value Theory. This kind of posterior distribution is often hard to evaluate, which is why we need MCMC samplers.

Simulate a sample of size 100 from a Gumbel distribution (relying on the `evd` package) with location and scale parameters set to 10 and 3, respectively.
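As a sketch, assuming the `evd` package is installed, the simulation could look like this:

```r
# Draw 100 values from a Gumbel(location = 10, scale = 3) distribution
library(evd)
set.seed(1234)   # for reproducibility
y <- rgumbel(100, loc = 10, scale = 3)
summary(y)
```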

Do not forget to set the global options that allow you to automatically save a bare version of a compiled Stan program, so that it does not need to be recompiled, and to execute multiple chains in parallel using all the available cores of your machine (use the `parallel` package.)
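The usual `rstan` idiom for these two options is:

```r
library(rstan)
rstan_options(auto_write = TRUE)              # cache compiled models on disk
options(mc.cores = parallel::detectCores())   # execute chains in parallel
```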

**Exercise 3**

Write the STAN model in your R script that estimates the generated sample by following the pertaining blocks of a STAN code:

**–** The **data** block that will gather the data used in the STAN model: n (the number of data points; it cannot be negative) and y (the observations of interest.)

**–** The **parameters** block: mu and sigma (cannot be negative) that represent the two parameters of a Gumbel distribution that we will have to estimate.

**–** The **model** block: defines the prior distributions for the two parameters: weakly informative, centered on their sample estimates (for example, the empirical mean and the empirical SD of the sample) with a variance of 25. Then, create the vector of interest from these two parameters.

**–** The **generated quantities** block: defines the posterior predictive values (y_pred) from the model.

Do not forget to constrain the upper and lower bounds in the data and parameter block declarations.
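One possible sketch of such a Stan program (block contents and priors are illustrative, not the official solution; Stan provides `gumbel` and `gumbel_rng` natively):

```stan
data {
  int<lower=0> n;        // number of data points (cannot be negative)
  vector[n] y;           // observations of interest
}
parameters {
  real mu;               // location
  real<lower=0> sigma;   // scale (cannot be negative)
}
model {
  // weakly informative priors centered on the sample estimates, variance 25
  mu ~ normal(mean(y), 5);
  sigma ~ normal(sd(y), 5);
  y ~ gumbel(mu, sigma);
}
generated quantities {
  real y_pred = gumbel_rng(mu, sigma);  // posterior predictive draw
}
```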

Note that it is recommended to write the STAN code in a separate file (“.stan” extension) and then call it directly through the `stan()` function instead of writing it in a character string, as done here. That would be easier to debug, read, etc.

**Exercise 4**

Define the named data list that will be used inside the STAN data block written above.

**Exercise 5**

Run the STAN model using the `stan()` function with the following input parameters:

– The STAN code defined in Exercise 3

– The data list defined in Exercise 4

– 4 different chains

– 1000 iterations per chain

– A warm-up phase of 200. This phase defines the number of iterations that are used by the sampler for the adaptation phase before sampling begins; it is different from the burn-in phase we have seen in the Metropolis or Gibbs samplers.

Use all the available cores of your machine and make it reproducible (take the `seed` 1234.)
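Put together, the call might look like this (`stan_code` and `data_list` stand for the objects defined in Exercises 3 and 4):

```r
library(rstan)
fit <- stan(model_code = stan_code,  # or file = "gumbel.stan"
            data   = data_list,
            chains = 4,              # 4 different chains
            iter   = 1000,           # iterations per chain
            warmup = 200,            # adaptation phase
            cores  = parallel::detectCores(),
            seed   = 1234)           # reproducibility
```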

**Exercise 6**

Now, launch the shiny application that gives you full access to the MCMC visualizations and diagnostics. Specify that you want to visualize the application in your own RStudio viewer pane.

**Exercise 7**

Now, do the same as Exercise 6, but by viewing the application on your localhost directly in your default browser.
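Both launches go through `launch_shinystan()`; its `rstudio` argument controls where the app opens (assuming `fit` is the stanfit object from Exercise 5):

```r
library(shinystan)
launch_shinystan(fit, rstudio = TRUE)   # RStudio viewer pane (Exercise 6)
launch_shinystan(fit, rstudio = FALSE)  # localhost in the default browser (Exercise 7)
```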

**Exercise 8**

Now, explore the shiny application and do the following:

a. Do a Trace-plot for the location parameter with all 4 of the chains

b. Do a Trace-plot for the location parameter, but visualize only the 3rd chain

c. Do a Trace-plot for the scale parameter with the 4th chain only. Moreover, take the log10 transformation of this parameter.

d. Find the standard deviation summary of the sampler parameters. Include the warm-up phase and display 3 decimals.

e. Check the effective sample sizes, the Monte Carlo standard error and posterior SD, and the Gelman–Rubin Rhat. Set the warning levels to 50%, 15%, and 1.01, respectively.

f. Compute the **partial** auto-correlation for every chain and for every parameter. (HINT: check the “Show/Hide” Options.) Then save the plot as a PDF.

After that, close the app.

**Exercise 9**

Now that you have diagnosed the convergence of the generated Markov chains, we can go through the estimation of the model. Go to the “estimate” tab in the main panel:

a. Compute the 95% and 70% posterior intervals for the parameter sigma. Include an indicator of the Rhat (Gelman–Rubin diagnostic) in the plot.

b. Display the summary table of the parameter estimates. Highlight the scale parameter and order the table by effective sample size. Then, download the table.

c. Finally, go to the “explore” tab and check the bi-variate plot of the posterior draws of the two parameters.

**Exercise 10 (BONUS)**

Since this feature is still experimental, some bugs may still appear.

Check the fit of the model with the posterior predictive distribution. Rely on the generated quantities of your STAN model (Exercise 3.)

You need to use the “*PPcheck*” tab and relaunch the app in your browser. It will check whether you have both the posterior predictive samples and the sample data in a good format.


This set of exercises is about the `bayesplot` package. This package is very useful for constructing diagnostics that give insights into the convergence of the MCMC sampling, since the convergence of the generated chains is the main issue in most STAN models. You can find the first bits of information on the `bayesplot` package in this introductory vignette or in this vignette.
We will make use of a survey of residents from a small area in Bangladesh that was affected by arsenic in drinking water. Respondents with elevated arsenic levels in their wells were asked if they were interested in getting water from a neighbor’s well. A series of logistic regressions could be fit to predict this binary response given various information about the households. Here we fit a model for the well-switching response, given two predictors: the arsenic level of the water in the resident’s home and the distance of the house from the nearest safe well.

More information on this data can be found in this interesting paper.

Solutions for this set of exercises can be found here. The first chunk defines, installs (if needed) and loads the packages needed in this set.

**Exercise 1**

After having installed and loaded the `rstan` and `bayesplot` packages, set the global options that allow you to automatically save a bare version of a compiled Stan program. This way, it does not need to be recompiled, and multiple chains can be executed in parallel using all the available cores of your machine (use the `parallel` package).

Set the color “bright blue” so that it applies to all subsequent plots.

Finally, import the data of interest: the wells.dat data-set, which you can import from the Gelman repository by simply making use of the `read.table()` function from the R utils package. Create the variable “*dist100*”, which is equal to `dist/100`.
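A sketch of this setup (the file path is illustrative; point `read.table()` at wherever you saved wells.dat):

```r
library(bayesplot)
color_scheme_set("brightblue")   # applies to all subsequent bayesplot plots
# wells.dat is a whitespace-separated table with a header row
wells <- read.table("wells.dat", header = TRUE)
wells$dist100 <- wells$dist / 100
```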

**Exercise 2**

Create a plot to visualize the relationship between the two variables of interest with the dependent binary variable “*switch*“.

What can you say about this relationship?

Then, fit the model with the frequentist `glm()` function and inspect the summary of the estimation results.
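The shape of the frequentist call, sketched here on simulated data (the real model regresses *switch* on *dist100* and *arsenic*):

```r
# Simulate a binary outcome driven by two predictors
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- rbinom(200, 1, plogis(0.5 + df$x1 - df$x2))

# Logistic regression via glm(); summary() reports coefficients and z-tests
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = df)
summary(fit)
```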

**Exercise 3**

Write the STAN model of interest in your R script by following the pertaining blocks of a STAN code, using weak priors for the parameters (e.g., a multivariate normal prior with a standard deviation of 10).

**–** The **data** block that will gather all the data used in the STAN model: N (the number of data points; it cannot be negative), P (the number of different predictors; it cannot be negative), X (the matrix of covariates, including the intercept) and y (the binary outcome).

**–** The **parameters** block: beta, a parameter vector holding the parameters of the logistic regression model.

**–** The **model** block: it defines the prior distribution for the parameters beta and creates the response vector of interest by multiplying the matrix of covariates with the parameters.

Do not forget to constrain the upper and lower bounds in the data and parameter block declarations.

Note that it is recommended to write the STAN code in a separate file (“.stan” extension) and then call it directly through the `stan()` function instead of writing it in a character string, as done here. That would be easier to debug, read, etc.

**Exercise 4**

Define the named data list that will be used inside the STAN data block written above.

**Exercise 5**

Run the STAN model using the `stan()` function with the following input parameters:

– The STAN code defined in Exercise 3

– The data list defined in Exercise 4

– Four different chains

– 1000 iterations per chain

– A warm-up phase of 200. This phase defines the number of iterations that are used by the sampler for the adaptation phase before sampling begins (and is thus different from the burn-in phase we have seen in the Metropolis or Gibbs samplers).

Use all the available cores of your machine and make it reproducible (take the `seed` 1234). Then, print the estimation outputs of the parameters.

**Exercise 6**

Plot the central posterior uncertainty intervals of each parameter in the form of horizontal lines with the median as a central point.

**Exercise 7**

Do the same as Exercise 6, but with the posterior estimates represented by density areas. Take the mean as a central point and 80%/99% as inner/outer credible intervals.

**Exercise 8**

– Plot the histograms of the three parameters, taking a binwidth of 1/50 for each. Then, do the same, but display each of the 4 chains and take a binwidth of 1/100.

– Do the same but with density plots. Overlay chains in the density plots instead of plotting them separately, like the histograms.

– Draw violin-plots of the 4 chains separately for each parameter except the intercept. Highlight the quantiles 5%, 20%, 50%, 70%, and 95%.

**Exercise 9**

Compute the traceplots for each parameter separately and overlay the 4 pertaining chains in each plot.

**Exercise 10**

Since we are interested only in the two covariates (*dist100* and *arsenic*), represent a bivariate scatter-plot of the posterior draws. Then, try to avoid over-plotting with hexagonal binning.

Finally, represent a scatter-plot table of all the parameters with the density plots lying in the diagonal and the posterior draws in the off-diagonal.


**R FOR HYDROLOGISTS **

LOADING AND PLOTTING THE DATA (Part 3)

Creating a box plot of the data can be a good approach to inspecting the historical behavior of the river level, and it can show us how the data spreads across different time groupings (month/year). If you are not familiar with this, a boxplot is a method for graphically depicting groups of numerical data through their quartiles. The lower and upper bounds of the box are the first and third quartiles, and the line inside the box is the median. The whiskers extend to the most extreme data points within 1.5 times the interquartile range from the box (the ggplot2 default); points beyond the whiskers are plotted as individual outliers.

If you don’t have the data, please first see the first part of the tutorial here.

Answers to the exercises are available here.

**Exercise 1**

Please create a box plot of the `LEVEL` with `geom_boxplot`.

**Exercise 2**

Now, please create a box plot for every `MONTH`. Hint: Use a `group` in the `aes` parameter.

**Exercise 3**

Good; now please create a box plot for every `YEAR`, plotting each box with a different color according to the year. Hint: Use `col` in the `aes` parameter.

**Exercise 4**

Another good way to see how data is distributed is through a histogram. Please create a histogram of the `LEVEL` with the function `geom_histogram`.

**Exercise 5**

As you can see, the function tells us that it is using 30 bins for the histogram, but we can pick a better value with `binwidth`. Please select a binwidth according to the Freedman–Diaconis rule: `binwidth = 2 * IQR(river_data$LEVEL) / (length(river_data$LEVEL)^(1/3))`.
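The formula itself is plain base R; on a reproducible stand-in sample it gives a single positive number:

```r
# Freedman–Diaconis binwidth: 2 * IQR / n^(1/3)
set.seed(42)
level <- rnorm(1000, mean = 5, sd = 1)   # stand-in for river_data$LEVEL
binwidth <- 2 * IQR(level) / (length(level)^(1/3))
binwidth
```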

**Exercise 6**

Please use `geom_density` to plot a kernel density estimate of the `LEVEL`, which is a smoothed version of the histogram.

**Exercise 7**

Now, please create a kernel density estimate for every month and overlay them.

**Exercise 8**

The plot is very confusing because all curves have the same color. Please assign a discrete color to each month. Hint: You can get the month string using `month.abb[MONTH]` inside the `aes`.


The next exercises will more specifically present the diagnostics and the visualizations of a STAN model.

This webpage references all the relevant information you would need for STAN, like the manual, wiki, tutorials, etc.

Solutions for this set of exercises can be found here. The first chunk defines, installs (if needed), and loads the packages needed in this set.

**Exercise 1**

After installing and loading the `rstan` package, set the global options that allow you to automatically save a bare version of a compiled Stan program. This ensures that it does not need to be recompiled, and lets you execute multiple chains in parallel using all the available cores of your machine (use the `parallel` package.)

**Exercise 2**

Load the data “*pulp*” from the `faraway` package using the `data()` function.

Then, display the structure of this data-set and plot “*bright*” against “*operator.*” Use a small jitter to distinguish the data points.

Could we fit a model on this data?

**Exercise 3**

We want to estimate a model with a random operator.

Write the STAN model inline in your R script by following the pertaining blocks of a STAN code, and use uninformative priors for the overall mean and the two standard deviations (residual and operator):

**–** The **data** block that will gather all the data used in the STAN model: N (the number of samples; it cannot be negative), J (the number of different operators; it cannot be negative), response (*bright* values) and predictor (a particular value of the predictor corresponding to the response.)

**–** The **parameters** block: eta (the parameter associated with the operator), mu (the intercept), sigma_alpha, and sigma_epsilon (the standard deviations, respectively, for the operator variance component and for the residual.)

**–** The **transformed parameters** block: creates the parameter “a” (the predicted values of the model for each operator) and yhat (the predicted values of the model for each row.) (HINT: use a for loop to fill yhat.)

**–** The **model** block: defines the distribution of eta (take a standard normal distribution) and creates the response vector of interest by adding the residual random noise.

Do not forget to constrain the upper and lower bounds in the data and parameter block declarations.

Note that it is recommended to write the STAN code in a separate file (“.stan” extension) and then call it directly through the `stan()` function, instead of writing it in a character string as done here. That would be easier to debug, read, etc.

**Exercise 4**

Define the named data list that will be used inside the STAN data block written above.

(HINT: you have to define the same set of parameters with the same names as the sample data we will be using.)

**Exercise 5**

Run the STAN model using the `stan()` function with the following input parameters:

– The STAN code defined in Exercise 3.

– The data list defined in Exercise 4.

– 4 different chains.

– 1000 iterations per chain.

– A warm-up phase of 200. This phase defines the number of iterations that are used by the sampler for the adaptation phase before sampling begins (and is thus different from the burn-in phase we have seen in the Metropolis or Gibbs samplers.)

Use all the available cores of your machine and make it reproducible (take the `seed` 1234.)

**Exercise 6**

Carefully read the warnings that appear in the console after having run the STAN model. Try to improve the fit by following the given advice.

(HINT: check the `control` input parameter of the `stan()` function.)

Examine the chains scatter-plot matrix of the parameters mu, sigma_alpha, sigma_epsilon, and “a.”

**Exercise 7**

Make a quick check of the convergence of the chains for the same set of covariates by using traceplots.

(HINT: simply use the available method of the rstan package.)

**Exercise 8**

Retrieve the sampler parameters of the model from Exercise 6.

Then, compute a summary of these chains in the following way:

a. Chains by chains (hence, you will have 4 summary tables.)

b. For the whole aggregated chain.

Restrict the numerical outputs to 2 digits.

**Exercise 9**

Print the first summary statistics of the complete chain for the same set of variables as in Exercises 6 and 7.

What can you say about this?


**R FOR HYDROLOGISTS **

LOADING AND PLOTTING THE DATA (Part 2)

In hydrology, it is very common to analyze the annual behavior of the levels in order to see if there is any recurrent pattern over the year (seasonality.) In order to observe the historical behavior of the river, we have to construct a plot with the levels of all the years overlapped, from the first of January to the 31st of December. This task can be solved in many possible ways, but this time we will use the capabilities that `ggplot2` has to offer to organize information from the data frames.

If you don’t have the data, please first see the first part of the tutorial here.

Answers to these exercises are available here.

**Exercise 1**

First, we have to process the data frame a little in order to provide it to the `ggplot` function in the right format. Please add a `YEAR` and a `DOY` (day of year) column to `river_data`. Hint: the `lubridate` package has the function `yday` (year day); you can install the package with the line:

`if(!require(lubridate)){install.packages("lubridate", dep = TRUE)}`
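For example, on an assumed date (a base-R equivalent is shown, since `yday` simply counts days from the 1st of January):

```r
d <- as.Date("2000-02-01")
# lubridate::yday(d) returns the day of year; the base-R equivalent is:
doy <- as.POSIXlt(d)$yday + 1   # 31 days of January + 1 = 32
doy
```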

**Exercise 2**

Create a plot with the levels of all the years overlapped, from the first of January to the 31st of December, with the `ggplot` function.

**Exercise 3**

Now it is plotted, but it doesn’t seem very clear because all the lines have the same color. Please plot each line with a different color according to the year.

**Exercise 4**

That looks better. Now we want to see the average annual behavior, so we will calculate the mean value of `LEVEL` for each `DOY` (day of the year) using the function `aggregate`. Then assign it to the data frame `mean_data`.

**Exercise 5**

Please add a `DOY` column to `mean_data`, as well as a `YEAR` column with the value “2000.” This last column has to be inserted in order to overlap the plots with the function `ggplot`.

**Exercise 6**

Please overlay the plot generated in Exercise 3 with the `mean_data`.

**Exercise 7**

The mean looks a little bit spiky. In order to visualize it better, we will smooth the mean values with the function `qplot`.

**Exercise 8**

The default smoothing parameter flattens out too many details. Please adjust the parameter `span` to get more detail.


LOADING AND PLOTTING THE DATA (Part 1)

Working with hydro-meteorological data can be very time-consuming and exhausting. Luckily, R provides a framework to easily import and process data in order to implement statistical analyses and models. In these tutorials, we are going to explore and analyze data from a tropical basin in order to create a simple forecast model.

Let’s have a look at these exercises and examples.

Answers to these exercises are available here.

**Exercise 1**

First, let’s import the daily levels of a river and the rainfall data from the basin, stored in a CSV file. Please download the data here (PAICOL.csv) and import it with the function `read.csv`. Then, assign it to `river_data`.

Remember that `river_data` is a data frame, so we can access its attributes with `$`; for example, you can get the date values with `river_data$DATE`.

**Exercise 2**

To guarantee that the `DATE` column has the proper format, it is crucial to convert the string values into dates with the function `as.Date`. Please replace the values of `DATE` with formatted dates.
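`as.Date` parses ISO dates (YYYY-MM-DD) directly; for any other layout, describe it with a `format` string (the actual format depends on the CSV):

```r
# ISO dates need no format string; other layouts do
as.Date("2001-05-15")
as.Date("15/05/2001", format = "%d/%m/%Y")   # day/month/year layout
```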

**Exercise 3**

Create a summary of the `river_data`.

**Exercise 4**

Normally, we could use the built-in R functions; but this time, we will use the `ggplot2` package, which, in my opinion, creates better plots. Before we start, install it and load it:

install.packages("ggplot2")

library(ggplot2)

Create a line plot of the `LEVEL` with the `ggplot` function.

**Exercise 5**

Create a scatter plot of the `RAIN` against the `LEVEL`.

**Exercise 6**

Create a plot of the `RAIN` and the `LEVEL`.

**Exercise 7**

Find and plot circles on the `LEVEL` plot at the maximum and minimum values.

**Exercise 8**

Plot the `LEVEL` for the year “2001.”

`ggplot2` is a great tool for complex data visualization. Let’s practice it a bit!

Answers to these exercises are available here.

For each exercise, please replicate the given graph. Some exercises require additional data wrangling. If you obtain a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Fancy the `iris` dot-plot.

**Exercise 2**

Faceted smoothing (`iris`, once again).

**Exercise 3**

Tufte-style `mtcars`.

**Exercise 4**

`mtcars` bubble-plot.

**Exercise 5**

Polar barplot of the mean `diamonds` price per cut and color.

**Exercise 6**

Economist-style `economics` time series. (Hint: you will need the `ggthemes` package.)


Visualization is a key component to understanding and communicating your understanding to an audience. The more second nature turning your data into plots becomes, the more you can focus on the overall goals instead of being stuck on technical details.

As a freelance data analyst, I know that the time between when a project arrives on your desk and when it needs to be delivered is often shorter than you would like, leaving limited time to consult documentation and search Stack Overflow.

This exercise set is a drilling exercise for the advanced user, but can be completed by a novice with patience and willingness to learn.

Solutions are available here.

**Exercise 1**

Load the `ggplot2`, `MASS`, and `viridis` packages. Combine the three Pima data-sets from `MASS` (used in the previous exercise set) and make a 2D density plot (a density heat map) of `bp` versus `bmi` using `scale_fill_viridis()`.

**Exercise 2**

Using the same data, overlay a histogram of `bmi` with a normal density curve using the sample mean and standard deviation.

**Exercise 3**

Using the `accdeaths` data-set from `MASS`, make a line plot with time on the x-axis. Mark the maximum and minimum values of accidental deaths in a month with a red and a blue dot, respectively. Note that the data does not come in a ggplot-friendly format.

**Exercise 4**

The internet surely loves cats, but most users have little idea how much a cat’s organs weigh. Using the `cats` data from the `MASS` package, make two 2D density plots of total weight versus heart weight, side by side; one for each sex. In addition, add a dot for each observation.

**Exercise 5**

Back to the Pima data. Make a boxplot of `glu` (glucose concentration), splitting the observations into five age groups with approximately the same number of observations.
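One common way to build equal-count groups is `cut()` on quantile breaks, sketched here on made-up ages:

```r
# Split a numeric vector into five groups of roughly equal size
age <- c(21, 24, 25, 28, 30, 33, 35, 42, 50, 61)
breaks <- quantile(age, probs = seq(0, 1, by = 0.2))
groups <- cut(age, breaks = breaks, include.lowest = TRUE)
table(groups)   # five groups, two observations each
```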

**Exercise 6**

Using `ggplot2`‘s inbuilt `economics` data-set, make a stacked bar plot with the proportions of unemployed to employed (employed or not seeking work), with the date on the x-axis.

**Exercise 7**

Using `ggplot2`‘s inbuilt `msleep` data-set, make a scatter plot (body weight versus total sleep) of all animals of the order Artiodactyla. Mark the domesticated animals with a different color (from black) and annotate their names onto the graph.

**Exercise 8**

Using `msleep`, make one density plot for the total sleep, colored by `vore`. Play with the transparency and the parameters of the density estimation.

**Exercise 9**

Using the Gapminder data (available from the `gapminder` package) and data from the `rworldmap` package, color countries by life expectancy in 2007. Use `geom_map`.

**Exercise 10**

Still using the Gapminder data, make a scatter plot with GDP per capita on a log scale on the x-axis and life expectancy on the y-axis. Map population to size and color to continent. Write a loop that makes a graph for each year and saves it with `ggsave` to your hard drive, so that later you can turn it into an animated graph.


In the second part of this tutorial series on spatial data analysis using the `raster` package, we will explore new functionalities, namely:

- Raster algebra
- Cropping
- Reprojection and resampling

We will also introduce a new type of object named `RasterStack`, which, in essence, is a collection of `RasterLayer` objects with the same spatial extent, resolution, and coordinate reference system (CRS).

For more information on raster data processing, see here and here.

We will start this tutorial by downloading the sample raster data and creating a `RasterStack` composed of multiple image files. One satellite scene from Landsat 8 will be used for this purpose. The data contains surface reflectance information for seven spectral bands (or layers, following the terminology for `RasterStack` objects) in GeoTIFF file format.

The following table summarizes info on Landsat 8 spectral bands used in this tutorial.

| Band # | Band name | Wavelength (micrometers) |
|---|---|---|
| Band 1 | Ultra Blue | 0.435 – 0.451 |
| Band 2 | Blue | 0.452 – 0.512 |
| Band 3 | Green | 0.533 – 0.590 |
| Band 4 | Red | 0.636 – 0.673 |
| Band 5 | Near Infrared (NIR) | 0.851 – 0.879 |
| Band 6 | Shortwave Infrared (SWIR) 1 | 1.566 – 1.651 |
| Band 7 | Shortwave Infrared (SWIR) 2 | 2.107 – 2.294 |

Landsat 8 spatial resolution (or pixel size) is 30 meters. Valid reflectance values are decimals within 0.00 – 1.00; to decrease file size, they are multiplied by a 10^4 scaling factor and stored as integers in the range 0 – 10000. The image was acquired on July 15th, 2015.
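As a quick base-R sketch of that scaling convention (using hypothetical digital numbers, not values from the actual scene), dividing the stored integers by 10^4 recovers reflectance in the 0.00 – 1.00 range:

```r
# Hypothetical stored integer values (reflectance x 10^4)
dn <- c(312, 1550, 6904)
# Undo the scaling factor to get reflectance in 0.00 - 1.00
reflectance <- dn / 10000
print(reflectance)  # 0.0312 0.1550 0.6904
```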

```
library(raster)
## Create a folder named data-raw inside the working directory to place downloaded data
if(!dir.exists("./data-raw")) dir.create("./data-raw")
## If you run into download problems try changing: method = "wget"
download.file("https://raw.githubusercontent.com/joaofgoncalves/R_exercises_raster_tutorial/master/data/LT8_PNPG.zip", "./data-raw/LT8_PNPG.zip", method = "auto")
## Uncompress the zip file
unzip("./data-raw/LT8_PNPG.zip", exdir = "./data-raw")
```

With the data downloaded and uncompressed, we can now generate a `RasterStack` object. The `stack` function accepts as input a character vector containing the paths to each raster layer; we will generate that vector with the `list.files` function.

```
# Get file paths and check/print the list
fp <- list.files(path = "./data-raw", pattern = ".tif$", full.names = TRUE)
print(fp)
```

```
## [1] "./data-raw/LC82040312015193LGN00_sr_band1.tif"
## [2] "./data-raw/LC82040312015193LGN00_sr_band2.tif"
## [3] "./data-raw/LC82040312015193LGN00_sr_band3.tif"
## [4] "./data-raw/LC82040312015193LGN00_sr_band4.tif"
## [5] "./data-raw/LC82040312015193LGN00_sr_band5.tif"
## [6] "./data-raw/LC82040312015193LGN00_sr_band6.tif"
## [7] "./data-raw/LC82040312015193LGN00_sr_band7.tif"
```

```
# Create the raster stack and print basic info
rst <- stack(fp)
print(rst)
```

```
## class : RasterStack
## dimensions : 1545, 1480, 2286600, 7 (nrow, ncol, ncell, nlayers)
## resolution : 30, 30 (x, y)
## extent : 549615, 594015, 4613355, 4659705 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=29 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
## names : LC8204031//0_sr_band1, LC8204031//0_sr_band2, LC8204031//0_sr_band3, LC8204031//0_sr_band4, LC8204031//0_sr_band5, LC8204031//0_sr_band6, LC8204031//0_sr_band7
## min values : -27, -1, 29, -86, -216, -212, -102
## max values : 3170, 3556, 4296, 4931, 6904, 7413, 6696
```

Changing raster layer names (which are usually hard to read, as we saw above) is really straightforward. Using simple names also makes it easier, if necessary, to access layers *‘by name’* in the `RasterStack`.

`names(rst) <- paste("b",1:7,sep="")`

Let’s check if the data is being stored in memory:

`inMemory(rst)`

`## [1] FALSE`

Similarly to `RasterLayer` objects, by default (and unless necessary), a `RasterStack` object only holds metadata and connections to the actual data, to spare memory.

Now, let’s plot the data for a fast visualization.

`plot(rst)`

Notice how each layer gets a separate tile in the plot.

**Raster algebra**

Now we can proceed to some raster algebra calculations, using three different methods: (i) direct computation, (ii) the `calc` function, and (iii) the `overlay` function. In this example, we will calculate the Normalized Difference Vegetation Index (NDVI) using the red (b4) and near-infrared (NIR; b5) bands as:

NDVI = (NIR – Red) / (NIR + Red)
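Before touching any rasters, the formula can be sanity-checked on plain numbers. A minimal base-R sketch (with hypothetical reflectance values): note that, NDVI being a ratio, the 10^4 scaling factor applied to the stored integers cancels out, so the index can be computed directly on the scaled values.

```r
# Hypothetical scaled surface reflectances (integers, reflectance x 10^4)
nir <- 3000  # band 5 (NIR)
red <- 600   # band 4 (Red)

# NDVI on the scaled integers...
ndvi_scaled   <- (nir - red) / (nir + red)
# ...equals NDVI on the unscaled reflectances: the ratio cancels the factor
ndvi_unscaled <- (nir / 10000 - red / 10000) / (nir / 10000 + red / 10000)

print(ndvi_scaled)                            # 0.6666667
print(all.equal(ndvi_scaled, ndvi_unscaled))  # TRUE
```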

**Method #1** (Direct)

This method allows you to use the raster layers in the stack directly, called by their indices (or names). Typical operators (e.g., `+`, `-`, `/`, `*`) can be used, as well as functions (e.g., `sqrt`, `log`, `cos`). However, since processing occurs all at once in memory, you must be sure that your data fits into RAM.

```
# Calling raster layers by index
ndvi <- (rst[[5]] - rst[[4]]) / (rst[[5]] + rst[[4]])
# Or calling by name
ndvi <- (rst[["b5"]] - rst[["b4"]]) / (rst[["b5"]] + rst[["b4"]])
```

Notice how the data types of the input rasters and of the final raster (a ratio) differ (from integer to float; see `?dataType` for details):

`dataType(rst)`

`## [1] "INT2S" "INT2S" "INT2U" "INT2S" "INT2S" "INT2S" "INT2S"`

`dataType(ndvi)`

`## [1] "FLT4S"`

**Method #2** (Calc Function)

For large objects, `calc` computes values by raster chunks, thus saving memory. This means that, for the result of the defined function to be correct, it should not depend on having access to all values at once.

```
calcNDVI_1 <- function(x) return((x[[5]] - x[[4]]) / (x[[5]] + x[[4]]))
ndvi1 <- calc(rst, fun = calcNDVI_1)
```
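The chunk-safety requirement can be illustrated without any raster objects. In this base-R sketch (synthetic values, not Landsat data), a cell-wise function returns the same result whether applied to all values at once or chunk by chunk, while a function that needs the full dataset (here, mean-centering) does not:

```r
vals <- c(1, 2, 3, 10, 20, 30)
# Simulate processing in two chunks, as calc would do for a large raster
chunks <- split(vals, rep(1:2, each = 3))

# Cell-wise function: safe to apply per chunk
f_safe <- function(x) x * 2
print(identical(unname(unlist(lapply(chunks, f_safe))), f_safe(vals)))  # TRUE

# Global function: per-chunk results differ from the whole-dataset result
f_unsafe <- function(x) x - mean(x)
print(identical(unname(unlist(lapply(chunks, f_unsafe))), f_unsafe(vals)))  # FALSE
```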

**Method #3** (Overlay Function)

The `overlay` function allows you to combine two (or more) `Raster*` objects. Similarly to `calc`, it should be more efficient for large raster datasets that cannot be loaded into memory.

```
calcNDVI_2 <- function(x, y) return((x - y) / (x + y))
ndvi2 <- overlay(x = rst[[5]], y = rst[[4]], fun = calcNDVI_2)
```

Overall, the first method is not advisable in cases where the raster data is “big”. In those cases, it is recommended to use the more “memory-friendly” methods, `calc` or `overlay`. Also, as a general rule, if a calculation needs to use multiple individual layers separately (sometimes in different objects), it will be easier to set up with `overlay` than with `calc`.

Plotting the NDVI data requires some fine-tuning, because some ‘strange’ values appeared. Note that the NDVI range is -1.00 to 1.00. In the summary below, notice how the ‘resistant’ measures (quartiles) are fine, but the extremes are not. For NDVI, values closer to 1 represent higher vegetation cover.

```
# NDVI summary
summary(ndvi)
# Set values outside the 'normal' range as NA's
# Indexing for RasterLayers works similarly to matrix or data frame objects
ndvi[ndvi < -1] <- NA
ndvi[ndvi > 1] <- NA
# Plot NDVI
plot(ndvi, main="NDVI Peneda-Geres National Park", xlab = "X-coordinates", ylab = "Y-coordinates")
```

It is also fairly easy to perform logical operations. For example, creating an NDVI mask with values above 0.4:

```
ndviMask <- ndvi > 0.4
plot(ndviMask, main="NDVI mask", xlab = "X-coordinates", ylab = "Y-coordinates")
```

This creates a new Boolean raster with 0’s for pixels equal to or below 0.4, and 1’s for values above 0.4. This is very useful for separating vegetated from non-vegetated surfaces! 🙂
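The same thresholding logic can be seen on plain numbers in base R (hypothetical NDVI values): comparing against 0.4 yields `TRUE`/`FALSE`, and coercing to numeric gives the 0/1 values stored in the mask raster.

```r
# Hypothetical NDVI values for four pixels
ndvi_values <- c(0.15, 0.40, 0.55, 0.82)
# Values equal to or below the threshold become 0, values above it become 1
mask_values <- as.numeric(ndvi_values > 0.4)
print(mask_values)  # 0 0 1 1
```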

Often, we want to crop (or clip) a raster dataset to a specific study area. For that, the `raster` package provides the `crop` function, which accepts as input a `Raster*` object and an `Extent` object defining the new bounding coordinates (see `?extent` for more details).

```
# Bounding coordinates
xmin <- 554615
xmax <- 589015
ymin <- 4618355
ymax <- 4654705
# Create the extent object by defining the bounding coordinates
newExtent <- extent(xmin, xmax, ymin, ymax)
# Crop
cropRst <- crop(rst, newExtent)
```
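Conceptually, `crop` translates the bounding coordinates into row/column counts using the pixel size. Here is a back-of-the-envelope base-R sketch of that arithmetic (not a substitute for `crop` itself, which also snaps the new extent to whole-cell boundaries of the original grid, so the real output may differ by a row or column):

```r
pix <- 30  # Landsat 8 pixel size in metres
# Requested crop window (same coordinates as above)
xmin <- 554615; xmax <- 589015
ymin <- 4618355; ymax <- 4654705

# Approximate size of the cropped raster in pixels
ncols <- ceiling((xmax - xmin) / pix)
nrows <- ceiling((ymax - ymin) / pix)
print(c(ncols = ncols, nrows = nrows))  # 1147, 1212
```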

Often, after downloading raster data (e.g., satellite imagery) for a given area, you need to change its coordinate reference system (CRS). The `projectRaster` function projects the values of a `Raster*` object into a new object with another CRS. You can do this by providing the new projection info as a single argument (a `CRS` object); in this case, the function sets the extent and resolution of the new object. To ensure that the newly created object lines up with other datasets, you can instead provide a target `Raster*` object with the properties that the input data should be projected to. `projectRaster` also allows changing the spatial resolution (or pixel size) of the input raster.

In the first example, we will keep the same datum as in the original data, but change from a projected CRS (Universal Transverse Mercator, zone 29N) to a geographic lat/lon CRS. Notice how the pixel size is no longer constant across the x- and y-axes.

```
# Create an object of class CRS with the target reference system
targetCRS <- CRS("+init=epsg:4326")
# Reproject
ndvi_ReprojWGS84 <- projectRaster(ndvi, method = "ngb", crs = targetCRS)
print(ndvi_ReprojWGS84)
```

```
## class : RasterLayer
## dimensions : 1575, 1504, 2368800 (nrow, ncol, ncell)
## resolution : 0.000362, 0.00027 (x, y)
## extent : -8.40579, -7.861342, 41.6645, 42.08975 (xmin, xmax, ymin, ymax)
## coord. ref. : +init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : layer
## values : -18, 40 (min, max)
```
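The non-square pixel size in the output above has a simple geometric explanation: one degree of longitude shrinks with the cosine of latitude, while a degree of latitude stays roughly constant (about 111 km). A back-of-the-envelope check in base R (using approximate spherical-Earth constants, not anything computed by the `raster` package) reproduces the printed resolution:

```r
# Centre latitude of the scene (from the extent printed above)
lat_centre <- (41.6645 + 42.08975) / 2

# Approximate metres per degree on a spherical Earth
m_per_deg_lon <- 111320 * cos(lat_centre * pi / 180)
m_per_deg_lat <- 111132

# Degrees spanned by a 30 m Landsat pixel in each direction
res_x_deg <- 30 / m_per_deg_lon
res_y_deg <- 30 / m_per_deg_lat
print(round(c(x = res_x_deg, y = res_y_deg), 6))  # ~0.000362 and ~0.000270
```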

In this second example, we will change the data to the Portuguese official CRS: Datum ETRS 1989, Transverse Mercator projection, GRS 1980 ellipsoid (see here for more details).

```
# Create an object of class CRS with the target reference system
targetCRS <- CRS("+proj=tmerc +lat_0=39.66825833333333 +lon_0=-8.133108333333334 +k=1 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs ")
# Reproject
ndvi_ReprojETRS89 <- projectRaster(ndvi, method = "ngb", crs = targetCRS)
print(ndvi_ReprojETRS89)
```

```
## class : RasterLayer
## dimensions : 1570, 1506, 2364420 (nrow, ncol, ncell)
## resolution : 30, 30 (x, y)
## extent : -22707.33, 22472.67, 221784.7, 268884.7 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=tmerc +lat_0=39.66825833333333 +lon_0=-8.133108333333334 +k=1 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs
## data source : in memory
## names : layer
## values : -18, 40 (min, max)
```

Now, let’s change the resolution from the initial 30 m of Landsat 8 to a smaller 25 m pixel size (i.e., resampling to a finer grid). For this purpose, we use the `res` parameter:

```
ndvi_ReprojETRS89_25m <- projectRaster(ndvi, res = 25, method = "ngb", crs = targetCRS)
print(ndvi_ReprojETRS89_25m)
```

```
## class : RasterLayer
## dimensions : 1882, 1805, 3397010 (nrow, ncol, ncell)
## resolution : 25, 25 (x, y)
## extent : -22682.33, 22442.67, 221809.7, 268859.7 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=tmerc +lat_0=39.66825833333333 +lon_0=-8.133108333333334 +k=1 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs
## data source : in memory
## names : layer
## values : -18, 40 (min, max)
```

This concludes our exploration of the `raster` package for this post. We hope you found it useful!