R has many tools for speeding up computations by using multiple CPU cores, either on one computer or across several machines. This series of exercises introduces the basic techniques for implementing parallel computations using multiple CPU cores on one machine.

The first step in preparing to parallelize a computation is to decide whether the task can and should be run in parallel. Some tasks involve sequential computation, where operations in one round depend on the results of the previous round. Such computations cannot be parallelized. The next question is whether parallel computation is worth the effort. On the one hand, running tasks in parallel may reduce the computer time spent on calculations. On the other hand, it takes additional time to write code that can be run in parallel and to check that it yields correct results.

Code that implements parallel computations does three things:

- splits the task into pieces,
- runs them in parallel, and
- combines the results.
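The three steps can be sketched on a toy task. This is only an illustration under stated assumptions: it uses the base `parallel` package rather than `snowfall`, it assumes at least two CPU cores are available, and the data and the names `chunks`, `cl`, and `partial_sums` are made up for the example.

```r
library(parallel)

x <- 1:100

# 1. Split the task into pieces (here: four chunks of 25 numbers each).
chunks <- split(x, rep(1:4, each = 25))

# 2. Run the pieces in parallel on a small cluster of two worker processes.
cl <- makeCluster(2)
partial_sums <- parLapply(cl, chunks, sum)
stopCluster(cl)

# 3. Combine the partial results into the final answer.
total <- sum(unlist(partial_sums))
```

For a task this small the overhead of starting the workers outweighs any gain; the point is only to show the split, run, and combine structure.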

This set of exercises provides practice in using the `snowfall` package to perform parallel computations. The set is based on the example of parallelizing the k-means algorithm, which splits data into clusters (i.e. splits data points into groups based on their similarity). The standard k-means algorithm is sensitive to the choice of initial points, so it is advisable to run the algorithm multiple times with different initial points to get the best result. It is assumed that your computer has two or more CPU cores.

The data for the exercises can be downloaded here.

For other parts of the series follow the tag parallel computing.

Answers to the exercises are available here.

**Exercise 1**

Use the `detectCores` function from the `parallel` package to find the number of physical CPU cores on your computer. Then change the arguments of the function to find the number of logical CPU cores.

**Exercise 2**

Load the data set, and assign it to the `df` variable.

**Exercise 3**

Use the `system.time` function to measure the time spent on execution of the command `fit_30 <- kmeans(df, centers = 3, nstart = 30)`, which finds three clusters in the data.

Note that this command runs the `kmeans` algorithm 30 times sequentially with different (randomly chosen) initial points, and then selects the ‘best’ way of clustering (the one that minimizes the sum of squared distances between each data point and its cluster center).
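The effect of `nstart` can be seen on a small example. The sketch below is not part of the exercises: it uses the built-in `iris` data set as a stand-in for the exercise data, and the names `x`, `fit_1`, and `fit_many` are chosen only for illustration.

```r
# iris is a built-in data set, used here only as a stand-in for the exercise data.
x <- iris[, 1:4]

set.seed(1)

# A single run from one random set of initial points.
fit_1 <- kmeans(x, centers = 3, nstart = 1)

# Thirty random starts; kmeans keeps the solution with the lowest tot.withinss.
fit_many <- kmeans(x, centers = 3, nstart = 30)

# Compare the total within-cluster sum of squares of the two fits.
c(single = fit_1$tot.withinss, best_of_30 = fit_many$tot.withinss)
```

A single run may stop at a poorer local solution, which is why the best of many starts is usually at least as good as one run.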

**Exercise 4**

Now we’ll try to parallelize the runs of `kmeans`. The first step is to write the code that performs a single run of the `kmeans` function. The code has to do the following:

- Randomly choose three rows in the data set (this can be done using the `sample` function).
- Subset the data set keeping only the chosen rows (they will be used as initial points in the k-means algorithm).
- Transform the obtained subset into a matrix.
- Run the `kmeans` function using the original data set, the obtained matrix (as the `centers` argument), and without the `nstart` argument.

**Exercise 5**

The second step is to wrap the code written in the previous exercise into a function. It should take one argument, which is not used (see explanation on the solutions page), and should return the output of the `kmeans` function.

Such functions are often labelled as `wrapper`, but they may have any name.

**Exercise 6**

Let’s prepare for parallel execution of the function:

- Initialize a cluster for parallel computations using the `sfInit` function from the `snowfall` package. Set the `parallel` argument equal to `TRUE`. If your machine has two logical CPUs, assign two to the `cpus` argument; if the number of CPUs exceeds two, set this argument equal to the number of logical CPUs on your machine minus one.
- Make the data set available for parallel processes with the `sfExport` function.
- Prepare the random number generation for parallel execution using the `sfClusterSetupRNG` function. Set the `seed` argument equal to 1234.

(Note that `kmeans` is a function from the base R packages. If you want to run a function from a downloaded package in parallel, you also have to make it available for parallel execution with the `sfLibrary` function.)
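As an illustration of the note above, the following sketch loads a package on the workers before using one of its functions. It is not needed for the exercises. It assumes the `snowfall` package is installed, uses `MASS` (a recommended package shipped with R) purely as an example, and should be run in a fresh R session because it starts and stops its own cluster.

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 2)

# Load the MASS package on every worker process.
sfLibrary(MASS)

# Each worker can now call functions from MASS, such as mvrnorm().
draws <- sfLapply(1:4, function(i) {
  mvrnorm(n = 5, mu = c(0, 0), Sigma = diag(2))
})

sfStop()
```

Each element of `draws` is a matrix with five rows and two columns, one per call of the anonymous function.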

**Exercise 7**

Use the `sfLapply` function from the `snowfall` package to run the wrapper function (written in Exercise 5) 30 times in parallel, and store the output of `sfLapply` in the `result` variable. Also apply the `system.time` function to measure the time spent on execution of `sfLapply`.

Note that `sfLapply` is a parallel version of the `lapply` function. It takes two main arguments: (1) a vector or a list (in this case it should be a numeric vector of length 30), and (2) the function to be applied to each element of the vector or list.

**Exercise 8**

Stop the cluster for parallel execution with the `sfStop` function from the `snowfall` package.

**Exercise 9**

Explore the output of `sfLapply` (the `result` object):

- Find out to what class it belongs.
- Print its length.
- Print the structure of its first element.
- Find the value of the `tot.withinss` sub-element in the first element (it represents the total sum of squared distances between data points and their cluster centers in a given solution to the clustering problem). Print that value.

**Exercise 10**

Find an element of the `result` object with the lowest `tot.withinss` value (there may be multiple such elements), and assign it to the `best_result` variable.

Compare the `tot.withinss` value of that variable with the corresponding value of the `fit_30` variable, which was obtained in Exercise 3.
