**R FOR HYDROLOGISTS**

CORRELATION AND INFORMATION THEORY MEASUREMENTS (Part 1)

In this tutorial, we will show you how to apply tools such as correlation, auto-correlation, entropy, and mutual information as an introductory exercise in the analysis of time series dynamics. The first measurement that we will calculate is the linear correlation coefficient, ρ. This statistic quantifies the strength of the linear relationship between two variables, that is, how closely the paired observations fall along a straight line. If there is a positive relationship between the variables, the value of ρ approaches 1. If there is a negative relationship, ρ approaches -1. Finally, if there is no linear relationship, ρ approaches 0.
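As a minimal sketch of these three cases, the following uses synthetic data rather than the tutorial's river data. The variable names and the helper `pearson` are made up for illustration. It writes the Pearson coefficient out from its definition and compares it with base R's `cor`.

```r
# Synthetic data for illustration only (not the tutorial's river data)
set.seed(42)
x      <- rnorm(200)
y_pos  <-  2 * x + rnorm(200, sd = 0.5)  # positive linear relationship
y_neg  <- -2 * x + rnorm(200, sd = 0.5)  # negative linear relationship
y_none <- rnorm(200)                     # generated independently of x

# Pearson correlation written out from its definition (hypothetical helper)
pearson <- function(a, b) {
  sum((a - mean(a)) * (b - mean(b))) /
    sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))
}

rho_manual  <- pearson(x, y_pos)
rho_builtin <- cor(x, y_pos)   # should agree with rho_manual

rho_pos  <- cor(x, y_pos)      # close to 1
rho_neg  <- cor(x, y_neg)      # close to -1
rho_none <- cor(x, y_none)     # close to 0
```

The hand-written version is only there to make the definition visible; in practice `cor` is the function to use.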

If you don’t have the data, please first see the first part of the tutorial here. Install and load the `ggplot2`, `GGally`, and `forecast` packages.

if(!require(ggplot2)){install.packages("ggplot2", dependencies=TRUE); library(ggplot2)}

if(!require(GGally)){install.packages("GGally", dependencies=TRUE); library(GGally)}

if(!require(forecast)){install.packages("forecast", dependencies=TRUE); library(forecast)}

Answers to these exercises are available here.

**Exercise 1**

Please calculate the correlation coefficient between the `LEVEL` and the `RAIN` with the function `cor`.

**Exercise 2**

One nice way to summarize information is through `ggpairs`. For a data matrix `x`, this plot shows a scatter plot of `x[,i]` against `x[,j]` in the lower triangle, a kernel density estimate of each variable on the diagonal, and the correlation values in the upper triangle. Please use `ggpairs` for columns 2 and 3 of `river_data`.

**Exercise 3**

The scatterplot can also be customized. Please change the size of the text on the upper triangle to 8. Hint: Pass a `list(continuous = wrap("cor", par1=value1))` to the parameter `upper`.

**Exercise 4**

Now, please add a trend line to the scatterplots and change the color of the dots to blue. Hint: `.smooth`.

**Exercise 5**

Good, the plot looks nice; but, as you can see, the correlation between the precipitation and the level of the river is very low. That is reasonable, because there is a delay between the moment the precipitation occurs and the moment the river level rises as runoff travels across the basin. That is why we also have to estimate the correlation between the variables and their lags. A good example of this is the auto-correlation function, which indicates how much linear correlation exists between the values of the series at time t and the values at time t-i.
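To illustrate why the lag matters, here is a small sketch on synthetic data. The series `rain` and `level`, and the 3-step `delay`, are assumptions invented for this example and are not taken from the river data. The same-time correlation is near zero, while the correlation at the assumed delay is high.

```r
# Synthetic example: the "level" reacts to "rain" with a fixed delay
set.seed(1)
n     <- 300
delay <- 3                                  # assumed travel time, in time steps
rain  <- rgamma(n, shape = 0.5, scale = 4)  # hypothetical rainfall series

# The level responds to the rain that fell `delay` steps earlier, plus noise
level <- c(rep(0, delay), rain[1:(n - delay)]) + rnorm(n, sd = 0.5)

# Correlation between level at time t and rain at time t - k, for k = 0..10
lags    <- 0:10
lag_cor <- sapply(lags, function(k) cor(level[(k + 1):n], rain[1:(n - k)]))

cor_lag0 <- lag_cor[1]                # same-time correlation
best_lag <- lags[which.max(lag_cor)]  # lag with the strongest correlation
```

In this constructed case `best_lag` recovers the assumed delay; real basins respond over a range of lags rather than at a single one.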

Please use the `ggAcf` function and plot the auto-correlation function of the `LEVEL` and the `RAIN`.
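For intuition about what an auto-correlation function reports, the sketch below simulates a first-order autoregressive series with base R's `arima.sim` and inspects the values returned by `acf`. The coefficient `phi` and the series `x` are illustrative assumptions, unrelated to the river data.

```r
# Simulate an AR(1) series: each value depends on the previous one
set.seed(7)
phi <- 0.8   # assumed autoregressive coefficient
n   <- 500
x   <- as.numeric(arima.sim(model = list(ar = phi), n = n))

# Compute the auto-correlation function without drawing the plot
r          <- acf(x, lag.max = 5, plot = FALSE)
acf_values <- as.numeric(r$acf)  # element 1 is lag 0, element 2 is lag 1, ...

# The lag-1 value is roughly the correlation of the series with itself
# shifted by one step
lag1_manual <- cor(x[-1], x[-n])
```

For an AR(1) process the lag-1 auto-correlation should be close to `phi`, and the values decay as the lag grows.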

**Exercise 6**

Another common operation on time series is to take the difference of the series, `x[t] - x[t-k]` (in this case k=1), and estimate its auto-correlation function. Please use the `diff` function and plot the auto-correlation function for the difference of the `LEVEL` and the difference of the `RAIN`.
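As a quick reminder of what `diff` returns, here is a toy vector (made up for this example) differenced with k=1 and k=2. Note that the result is shorter than the input by k elements.

```r
x <- c(1, 4, 9, 16)

d1 <- diff(x)           # x[t] - x[t-1]  ->  3 5 7
d2 <- diff(x, lag = 2)  # x[t] - x[t-2]  ->  8 12
```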

**Exercise 7**

Please use the `ggCcf` function and plot the cross correlation function of the `LEVEL` and the `RAIN`.

**Exercise 8**

Another interesting way to explore system properties is to create our own lag decompositions of the time series. To do so, please use this function:

```
# Generate lagged copies of a variable
createLags <- function(x, numberOfLags, VarName = NULL) {
  if (!is.vector(x))
    stop('x must be a vector')
  # Fall back to a generic name when none is supplied
  if (is.null(VarName))
    VarName <- "x"

  lags <- as.data.frame(x)
  lagNames <- paste0(VarName, '(t)')
  for (i in seq_len(numberOfLags)) {
    # Generate the lag: shift x down by i positions, padding with NA
    lag <- c(rep(NA, i), x)[1:length(x)]
    # Stack the lag
    lags <- cbind(lags, lag)
    # Stack the name of the lag
    lagNames <- c(lagNames, paste0(VarName, '(t-', i, ')'))
  }
  # Assign names to the columns
  colnames(lags) <- lagNames
  # Trim the first rows, which contain NA
  return(lags[(numberOfLags + 1):length(x), , drop = FALSE])
}
```

Please generate a lag decomposition of the `RAIN` and the `LEVEL` for the first 5 lags.
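To see what a lag decomposition looks like on a tiny input, base R's `embed` (from the `stats` package) builds a matrix with the same layout: column 1 holds x(t), column 2 holds x(t-1), and so on, with the incomplete leading rows dropped. This is a cross-check on a toy vector, offered as a sketch and not as a replacement for the function above.

```r
x <- 1:5

# dimension = number of lags + 1, so 3 columns means lags 0, 1 and 2
m <- embed(x, dimension = 3)
colnames(m) <- c("x(t)", "x(t-1)", "x(t-2)")

# m is:
#      x(t) x(t-1) x(t-2)
# [1,]    3      2      1
# [2,]    4      3      2
# [3,]    5      4      3
```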

**Exercise 9**

Please create one data frame with all the lags, `lags_all=cbind(lags_level,lags_rain)`, and generate a pairs plot with `ggpairs`.
