`dist`, `hclust`, `cutree`, `rect.hclust`

We will be using a custom-made dataset. Before starting the exercise please run the following code to obtain the capital locations for Europe (note that you will need to have the `ggmap` library installed):

library(ggmap)

capitals <- c("Albania, Tirana", "Andorra, Andorra la Vella", "Armenia, Yerevan", "Austria, Vienna", "Azerbaijan, Baku", "Belarus, Minsk", "Belgium, Brussels", "Bosnia and Herzegovina, Sarajevo", "Bulgaria, Sofia", "Croatia, Zagreb", "Cyprus, Nicosia", "Czech Republic, Prague", "Denmark, Copenhagen", "Estonia, Tallinn", "Finland, Helsinki", "France, Paris", "Germany, Berlin", "Greece, Athens", "Georgia, Tbilisi", "Hungary, Budapest", "Iceland, Reykjavik", "Italy, Rome", "Latvia, Riga", "Kazakhstan, Astana", "Liechtenstein, Vaduz", "Lithuania, Vilnius", "Luxembourg, Luxembourg", "Macedonia, Skopje", "Malta, Valletta", "Moldova, Chişinău", "Monaco, Monaco-Ville", "Montenegro, Podgorica", "Netherlands, Amsterdam", "Norway, Oslo", "Poland, Warsaw", "Portugal, Lisbon", "Republic of Ireland, Dublin", "Romania, Bucharest", "Russia, Moscow", "San Marino, San Marino", "Serbia, Belgrade", "Slovakia, Bratislava", "Slovenia, Ljubljana", "Spain, Madrid", "Sweden, Stockholm", "Switzerland, Bern", "Turkey, Ankara", "Ukraine, Kiev", "United Kingdom, London", "Vatican City, Vatican City")

theData <- geocode(capitals)

rownames(theData) <- capitals

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Calculate the Euclidean latitude/longitude distances between all pairs of capital cities.

**Exercise 2**

Use the obtained distances to produce the hierarchical clustering dendrogram object. Use all the default parameters.

NOTE: By default the clusters will be merged together using the maximum possible distance between all pairs of their elements (this fact will be useful later).
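The note above can be checked on a toy example. The following is a minimal sketch using three hypothetical points (not the capitals dataset); it compares the merge heights produced by the default linkage with those of single linkage.

```r
# Three hypothetical points on a line (not the capitals data).
pts <- matrix(c(0, 0,
                1, 0,
                5, 0),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c"), c("x", "y")))

d <- dist(pts)                               # Euclidean distances: ab = 1, ac = 5, bc = 4
hc_complete <- hclust(d)                     # default method is "complete"
hc_single   <- hclust(d, method = "single")  # shown only for comparison

hc_complete$method   # "complete"
hc_complete$height   # 1 5: {a, b} joins c at max(5, 4)
hc_single$height     # 1 4: {a, b} joins c at min(5, 4)
```

With the default method the last merge happens at the largest pairwise distance between the two clusters, which is what makes the cut height interpretable later on.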

**Exercise 3**

Visualize the obtained hierarchical clustering dendrogram.

**Exercise 4**

In the previous step the leaves of our dendrogram were placed at different heights. Let’s redo the plot so that all capital names are written at the same level.

**Exercise 5**

The hierarchical clustering procedure builds a hierarchy of clusters. One advantage of this method is that we can use the same dendrogram to obtain different numbers of groups.

Cluster the European capitals into 3 groups.

**Exercise 6**

Instead of specifying the wanted number of groups we can select the dendrogram height where the tree will be partitioned into groups. Since we used the maximum linkage function (default in exercise 2) this height has a useful interpretation – it ensures that all elements within one cluster are not more than the selected distance apart.

a) Cluster the European capitals by cutting the tree at height=20.

b) Plot the dendrogram and visualize the height at which the tree was cut into groups using a line.
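The interpretation of the cut height described in this exercise can be verified numerically. Below is a rough sketch on 20 hypothetical random points (the data, the seed and the cut height `h = 0.5` are all assumptions made for illustration, not part of the exercise): after cutting a complete-linkage tree at height `h`, no two points in the same group should be further apart than `h`.

```r
set.seed(1)
pts <- matrix(runif(40), ncol = 2)   # 20 hypothetical 2-D points
d   <- dist(pts)
hc  <- hclust(d)                     # complete linkage by default

h      <- 0.5                        # arbitrary cut height, for illustration only
groups <- cutree(hc, h = h)          # group label for every point

# Largest distance between two points that share a group.
dm         <- as.matrix(d)
same_group <- outer(groups, groups, "==")
max_within <- max(dm[same_group])

max_within <= h                      # TRUE for complete linkage
```

The same check generally fails for other linkage methods such as `"single"`, which is why the note ties this property to the default used in exercise 2.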

**Exercise 7**

Now visualize the clustering solution obtained in the 5th exercise on the dendrogram plot. This should be done by drawing a rectangle around all capitals that fall in the same group. Use different colors for different groups.

**Exercise 8**

Visualize the dendrogram again but this time present both cluster versions obtained in exercise 5 and exercise 6 on the same plot. Use red color to represent exercise 5 clusters and blue to represent clusters from exercise 6.

**Exercise 9**

The `hclust` function implements 8 different linkage methods (the rules used to merge two clusters when building the dendrogram). We want to experiment with all of them.

Produce a dendrogram, obtain 5 groups and visualize them using rectangles of different colors. Repeat this for all available linkage methods.

**Exercise 10**

Design your own clustering solution based on what you learned in this exercise and visualize it as a map.

Plot capital coordinates with longitude on the x-axis and latitude on the y-axis and color them based on the groups obtained using your hierarchical clustering version.

The `boxplot` function has a lot of useful parameters allowing us to change the behaviour and appearance of the boxplot graphs. In this exercise we will try to use those parameters in order to replicate the visual style of Matlab’s boxplot. Before trying out this exercise please make sure that you are familiar with the following functions: `bxp`, `boxplot`, `axis`, `mtext`.

Here is the plot we will be replicating:

We will be using the **iris** dataset, which is available in R by default in the variable of the same name, `iris`. The exercises will require you to make incremental changes to the default boxplot style.

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

Make a default boxplot of Sepal.Width stratified by Species.

**Exercise 2**

Change the range of the y-axis so it starts at 2 and ends at 4.5.

**Exercise 3**

Modify the boxplot function call so that it draws neither ticks nor labels on the x and y axes.

**Exercise 4**

Add notches (triangular dents around the median representing confidence intervals) to the boxes in the plot.

**Exercise 5**

Increase the distance between boxes in the plot.

**Exercise 6**

Change the color of the box borders to blue.

**Exercise 7**

a. Change the color of the median lines to red.

b. Change the line width of the median line to 1.

**Exercise 8**

a. Change the color of the outlier points to red.

b. Change the symbol of the outlier points to “+”.

c. Change the size of the outlier points to 0.8.

**Exercise 9**

a. Add a title to the boxplot (try to replicate the style of Matlab’s boxplot).

b. Add a y-axis label to the boxplot (try to replicate the style of Matlab’s boxplot).

**Exercise 10**

a. Add the x-axis (try to make it resemble the x-axis in Matlab’s boxplot).

b. Add the y-axis (try to make it resemble the y-axis in Matlab’s boxplot).

c. Add the y-axis ticks on the other side.

NOTE: You can use `format(as.character(c(2, 4.5)), drop0trailing=TRUE, justify="right")` to obtain the text for y-axis labels.
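As a small illustration of the note above, the same call can be applied to a longer vector of tick positions. The `ticks` vector here is an assumption chosen for the example; the resulting character vector is the kind of value that can be passed to the `labels` argument of `axis`.

```r
# Hypothetical tick positions between 2 and 4.5.
ticks <- seq(2, 4.5, by = 0.5)

# as.character() turns 2 into "2" rather than "2.0";
# justify = "right" should pad the strings to a common width.
labs <- format(as.character(ticks), drop0trailing = TRUE, justify = "right")
labs
```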

`expand.grid`, `combn`, `outer` and `choose`.
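For orientation, here is a brief sketch of what each of the four functions listed above returns on small, made-up inputs (the inputs and variable names are assumptions, unrelated to the exercises that follow).

```r
# expand.grid: every combination of the supplied factors, one row each.
grid <- expand.grid(letter = c("a", "b"), number = 1:3)
nrow(grid)                       # 2 * 3 = 6 rows

# combn: all subsets of a given size, one subset per column.
pair_mat <- combn(4, 2)
dim(pair_mat)                    # 2 rows, 6 columns

# outer: apply a function to every pair of elements from two vectors.
sums <- outer(1:2, 1:3, "+")
sums[2, 3]                       # 2 + 3 = 5

# choose: the number of subsets, without listing them.
n_pairs <- choose(4, 2)
n_pairs                          # 6
```

Note that `choose(4, 2)` equals the number of columns returned by `combn(4, 2)`: one function counts the combinations, the other enumerates them.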
Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

**Exercise 1**

You first toss a coin with 2 possible outcomes: it lands either heads or tails. Then you roll a die with 6 possible outcomes: 1, 2, 3, 4, 5 and 6.

Generate all possible results of your action.

**Exercise 2**

Generate a multiplication table for numbers ranging from 1 to 10.

**Exercise 3**

You have a set of card values:

` values <- c("A", 2, 3, 4, 5, 6, 7, 8, 9, 10, "J", "Q", "K") `

and a set of suits representing diamonds, clubs, spades and hearts:

` suits <- c("d", "c", "s", "h") `

Generate a deck of playing cards. (e.g. King of spades should be represented as Ks).

Note: the function `paste(..., sep="")` can be used to combine characters.
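A quick sketch of how `paste` behaves with `sep=""`, using a couple of hand-picked values rather than the full deck:

```r
# Two single strings are glued together without a separator.
one_card <- paste("K", "s", sep = "")
one_card                               # "Ks"

# paste is vectorized: the shorter argument is recycled.
two_cards <- paste(c("A", "K"), "s", sep = "")
two_cards                              # "As" "Ks"

# paste0 is shorthand for paste(..., sep = "").
same_card <- paste0("K", "s")
```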

**Exercise 4**

Think about a game of poker using a standard deck of cards like the one generated earlier. A starting hand in poker consists of 5 cards and their order does not matter. How many different starting hands are there in total?

**Exercise 5**

You have a set of colors to choose from:

` colors <- c("red", "blue", "green", "white", "black", "yellow") `

You have to pick 3 colors and you can’t pick the same color more than once. List all possible combinations.

**Exercise 6**

Using the same set of colors – pick 3 without picking the same more than once, just like in the previous exercise.

List all possible combinations but this time sort each combination alphabetically.

**Exercise 7**

You have the same choices of colors and have to pick 3 but this time you can pick the same color more than once.

List all possible combinations.

**Exercise 8**

You have the same set of colors but this time instead of having to pick 3 you can choose to pick either 1, 2 or 3.

How many different choices can you make?

**Exercise 9**

You have the same set of colors and you can choose to pick either 1, 2 or 3.

Make a list of all possible choices.

**Exercise 10**

There are 3 color palettes: the first one has 4 colors, the second has 6 colors and the third has 8 colors. You have to pick a palette and then choose up to 5 (1, 2, 3, 4 or 5) colors from the chosen palette. How many different possibilities are there?

`Map`, `Reduce`, `Filter`, `Find`, `Position`, `Negate`. They enable us to complete complex operations by using simple single-purpose functions as their building blocks. In R this is especially helpful in cases where we cannot depend on vectorization and have to utilize control statements like for loops. In such scenarios higher order functions help us by: a) simplifying and shortening the syntax, b) getting rid of counter indices and c) getting rid of temporary storage values.

The exercises in this section have to be solved by using one or more of the higher order functions mentioned above. It might be useful to read their help pages before continuing.
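To make the three benefits listed above concrete, here is a minimal sketch that computes the sum of the squares of the even numbers from 1 to 10, first with a loop and then with higher order functions. The task and the helper names `is_even` and `square` are made up for illustration.

```r
x <- 1:10

# Loop version: needs a counter index and a temporary accumulator.
total_loop <- 0
for (i in seq_along(x)) {
  if (x[i] %% 2 == 0) total_loop <- total_loop + x[i]^2
}

# Higher order version: small single-purpose functions as building blocks.
is_even <- function(n) n %% 2 == 0
square  <- function(n) n^2
total_hof <- Reduce(`+`, Map(square, Filter(is_even, x)))

total_loop                           # 220
total_hof                            # 220

# The remaining functions from the list, on the same vector.
first_even <- Find(is_even, x)       # first element that passes the test: 2
first_pos  <- Position(is_even, x)   # index of that element: 2
odds       <- Filter(Negate(is_even), x)   # 1 3 5 7 9
```

In practice this particular task could be vectorized as `sum(x[x %% 2 == 0]^2)`; the higher order functions pay off when the individual steps cannot be vectorized.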

Answers to the exercises are available here.

**Exercise 1**

You are working on 3 datasets all at once:

`multidata <- list(mtcars, USArrests, rock)`

`summary(multidata[[1]])` will return the summary information for a single dataset.

Obtain summary information for every dataset in the list.

**Exercise 2**

`cumsum(1:100)` returns the cumulative sums of a vector of numbers from 1 to 100.

Do the same using `sum` and an appropriate higher order function.

**Exercise 3**

You have a vector of numbers from 1 to 10. You want to multiply all those numbers first by 2 and then by 4. Why does the following line not work, and how can it be fixed?

`` Map(`*`, 1:10, c(2,4)) ``

**Exercise 4**

The expression `sample(LETTERS, 5, replace=TRUE)` obtains 5 random letters.

Generate a list with 10 elements, where first element contains 1 random letter, second element 2 random letters and so on.

Note: use a fixed random seed: `set.seed(14)`

**Exercise 5**

The `spatstat` library has a function `is.prime()` that checks if a given number is a prime.

Find all prime numbers between 100 and 200.

**Exercise 6**

We have a vector containing all the words of the English language –

`words <- scan("http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt", what="character")`

a. Using a function that checks if a given word contains any vowels:

`containsVowel <- function(x) grepl("a|o|e|i|u", x)`

find all words that do not contain any vowels.

b. Using the function `is.colour()` from the `spatstat` library, find the index of the first word inside the words vector corresponding to a valid R color.

**Exercise 7**

a. Find the smallest number between 10000 and 20000 that is divisible by 1234.

b. Find the largest number between 10000 and 20000 that is divisible by 1234.

**Exercise 8**

Consider the `babynames` dataset from the `babynames` library.

Start with a list containing the used names for each year:

`library(babynames); namesData <- split(babynames$name, babynames$year)`

a. Obtain a set of names that were present in every year.

b. Obtain a set of names that are only present in year 2014.

**Exercise 9**

Using the same `babynames` dataset and a function that checks if a word has more than 3 letters: `moreThan3 <- function(x) nchar(x) > 3`

Inside each year list leave only the names that have 3 letters or fewer.

**Exercise 10**

Using the same `babynames` dataset:

a. Split each name to a list of letters.

b. Join each list of letters by inserting an underscore “_” after each letter.

Note: if you have a word `x <- "exercise"` you can split it with `x2 <- strsplit(x, "")` and join using underscores with `paste(x2[[1]], collapse="_")`.
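The calls from the note can be run as follows. This short sketch only spells out the intermediate values for the single word used in the note; the second input, `c("ab", "cd")`, is an added illustration.

```r
x  <- "exercise"

# strsplit returns a list with one element per input string,
# which is why the letters are accessed with [[1]].
x2 <- strsplit(x, "")
class(x2)                                # "list"
x2[[1]]                                  # "e" "x" "e" "r" "c" "i" "s" "e"

# collapse = "_" places an underscore between consecutive letters.
joined <- paste(x2[[1]], collapse = "_")
joined                                   # "e_x_e_r_c_i_s_e"

# With several input strings the list has several elements.
many <- strsplit(c("ab", "cd"), "")
length(many)                             # 2
```

Because `strsplit` always returns a list, applying it to a whole vector of names produces a list of letter vectors, one per name.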