R has many tools for speeding up computations by using multiple CPU cores, either on one computer or across multiple machines. This series of exercises introduces the basic techniques for implementing parallel computations using multiple CPU cores on one machine.
The first step in preparing to parallelize a computation is to decide whether the task can and should be run in parallel. Some tasks involve sequential computation, where operations in one round depend on the results of the previous round. Such computations cannot be parallelized. The next question is whether parallelization is worth the effort. On the one hand, running tasks in parallel may reduce the computer time spent on calculations. On the other hand, it takes additional time to write code that can run in parallel and to check that it yields correct results.
Code that implements parallel computations basically does three things:
- splits the task into pieces,
- runs them in parallel, and
- combines the results.
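The three steps above can be sketched with a toy task. This is only an illustration, not part of the exercises: it uses the parallel package that ships with R (rather than snowfall), and the task (a sum of squares), the chunk count, and the variable names are all assumptions made for the example.

```r
library(parallel)

x <- 1:1000

# 1. Split the task into pieces (here: four chunks of the input vector).
chunks <- split(x, cut(seq_along(x), 4, labels = FALSE))

# 2. Run the pieces in parallel on a small cluster of two worker processes.
cl <- makeCluster(2)
partial <- parLapply(cl, chunks, function(v) sum(v^2))
stopCluster(cl)

# 3. Combine the partial results into the final answer.
total <- Reduce(`+`, partial)
print(total)
```

For a task this small the overhead of starting workers outweighs any gain; the point is only the split / run / combine structure.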
This set of exercises provides practice in using the snowfall package to perform parallel computations. The set is based on the example of parallelizing the k-means algorithm, which splits data into clusters (i.e. splits data points into groups based on their similarity). The standard k-means algorithm is sensitive to the choice of initial points, so it is advisable to run the algorithm multiple times with different initial points to get the best result. It is assumed that your computer has two or more CPU cores.
The data for the exercises can be downloaded here.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.
Exercise 1
Use the detectCores function from the parallel package to find the number of physical CPU cores on your computer. Then change the arguments of the function to find the number of logical CPU cores.
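A minimal sketch of the two calls, assuming the distinction is made through the logical argument of detectCores. Support for logical = FALSE varies by platform, and the function may return NA when the count cannot be determined; the variable names are illustrative.

```r
library(parallel)

# Physical cores (platform support varies; may be NA).
n_physical <- detectCores(logical = FALSE)

# Logical cores (includes hyper-threads where present).
n_logical <- detectCores(logical = TRUE)

cat("physical:", n_physical, " logical:", n_logical, "\n")
```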
Exercise 2
Load the data set, and assign it to the
Exercise 3
Use the system.time function to measure the time spent on execution of the command fit_30 <- kmeans(df, centers = 3, nstart = 30), which finds three clusters in the data.
Note that this command runs the kmeans function 30 times sequentially with different (randomly chosen) initial points, and then selects the ‘best’ way of clustering (the one that minimizes the sum of squared distances between each data point and its cluster center).
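The timing step might look like the sketch below. Because the exercise data set is not reproduced here, the example builds a synthetic two-column data frame called df; that data, its size, and the seed are assumptions for illustration only.

```r
# Synthetic stand-in for the exercise data (an assumption, not the real file).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)

# Time 30 sequential random starts; the assignment happens inside system.time.
timing <- system.time(
  fit_30 <- kmeans(df, centers = 3, nstart = 30)
)

print(timing)
print(fit_30$tot.withinss)
```

The "elapsed" entry of the timing is the wall-clock time, which is the figure to compare against a parallel run.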
Exercise 4
Now we’ll try to parallelize the runs of kmeans. The first step is to write the code that performs a single run of the kmeans function. The code has to do the following:
- Randomly choose three rows in the data set (this can be done using the
- Subset the data set keeping only the chosen rows (they will be used as initial points in the k-means algorithm).
- Transform the obtained subset into a matrix.
- Run the kmeans function using the original data set, the obtained matrix (as the centers argument), and without the
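The steps in the list above can be sketched as follows. The use of sample to pick rows, the synthetic df, and the variable names are assumptions made for this illustration; nstart is simply left at its default so that only one run is performed.

```r
# Synthetic stand-in for the exercise data (an assumption).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)

# Randomly choose three row indices.
idx <- sample(nrow(df), 3)

# Keep only the chosen rows and turn them into a matrix of initial centers.
init_centers <- as.matrix(df[idx, ])

# A single k-means run from these initial points.
fit_single <- kmeans(df, centers = init_centers)
print(fit_single$tot.withinss)
```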
Exercise 5
The second step is to wrap the code written in the previous exercise into a function. It should take one argument, which is not used (see explanation on the solutions page), and should return the output of the
Such functions are often labelled as wrapper, but they may have any name.
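One possible shape for such a function is sketched below, assuming a synthetic df and sample for choosing rows. The argument i is never used in the body; it exists only because lapply-style functions pass each element of their input to the function as its first argument.

```r
# Synthetic stand-in for the exercise data (an assumption).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)

# One k-means run from three randomly chosen rows; `i` is deliberately unused.
wrapper <- function(i) {
  init_centers <- as.matrix(df[sample(nrow(df), 3), ])
  kmeans(df, centers = init_centers)
}

# Sequential check that the function works before running it in parallel.
fit <- wrapper(1)
print(class(fit))
```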
Exercise 6
Let’s prepare for parallel execution of the function:
- Initialize a cluster for parallel computations using the sfInit function from the snowfall package. Set the parallel argument equal to TRUE. If your machine has two logical CPUs, assign two to the cpus argument; if the number of CPUs exceeds two, set this argument equal to the number of logical CPUs on your machine minus one.
- Make the data set available for parallel processes with the
- Prepare the random number generation for parallel execution using the sfClusterSetupRNG function. Set the seed argument equal to 1234.
kmeans is a function from the base R packages. If you want to run in parallel a function from a downloaded package, you also have to make it available for parallel execution with the
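The preparation steps could be sketched as below. Several things here are assumptions: the snowfall package must be installed, the data is a synthetic df, sfExport is used as one way to copy the data to the workers, and the default RNGstream generator behind sfClusterSetupRNG relies on the rlecuyer package being available. The cluster is stopped at the end only so that the sketch cleans up after itself.

```r
library(snowfall)
library(parallel)

# Synthetic stand-in for the exercise data (an assumption).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)

# Two workers on a two-CPU machine, otherwise logical CPUs minus one.
n_logical <- detectCores(logical = TRUE)
n_cpus <- if (is.na(n_logical) || n_logical <= 2) 2 else n_logical - 1

sfInit(parallel = TRUE, cpus = n_cpus)

# Copy the data set to every worker process.
sfExport("df")

# Give the workers independent, reproducible random number streams.
sfClusterSetupRNG(seed = 1234)

# Record the cluster state, then shut the cluster down.
was_parallel <- sfParallel()
n_workers <- sfCpus()
sfStop()
```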
Exercise 7
Use the sfLapply function from the snowfall package to run the wrapper function (written in Exercise 5) 30 times in parallel, and store the output of sfLapply in the result variable. Also apply the system.time function to measure the time spent on execution of
sfLapply is a parallel version of the lapply function. It takes two main arguments: (1) a vector or a list (in this case it should be a numeric vector of length 30), and (2) the function to be applied to each element of the vector or list.
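A self-contained sketch of the parallel run is shown below. It assumes snowfall is installed, uses a synthetic df, fixes the number of workers at two, and leaves out the random number setup to keep the example short, so the individual solutions are not reproducible across runs.

```r
library(snowfall)

# Synthetic stand-in for the exercise data (an assumption).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)

# One k-means run from three random rows; the argument is unused.
wrapper <- function(i) {
  init_centers <- as.matrix(df[sample(nrow(df), 3), ])
  kmeans(df, centers = init_centers)
}

sfInit(parallel = TRUE, cpus = 2)
sfExport("df")

# Apply wrapper to each of 1..30 across the workers and time the call.
timing <- system.time(
  result <- sfLapply(1:30, wrapper)
)

sfStop()
print(timing)
```

On a small data set the parallel version can easily be slower than the sequential one, because starting workers and moving data between processes has a fixed cost.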
Exercise 8
Stop the cluster for parallel execution with the sfStop function from the
Exercise 9
Explore the output of
- Find out to what class it belongs.
- Print its length.
- Print the structure of its first element.
- Find the value of the tot.withinss sub-element in the first element (it represents the total sum of squared distances between data points and their cluster centers in a given solution to the clustering problem). Print that value.
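The inspection steps in the list above can be sketched as follows. To keep the example runnable without a cluster, result is built here with the sequential lapply as a stand-in for the parallel output; the synthetic df and the wrapper definition are likewise assumptions.

```r
# Synthetic data and a sequential stand-in for the parallel result (assumptions).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)
wrapper <- function(i) {
  init_centers <- as.matrix(df[sample(nrow(df), 3), ])
  kmeans(df, centers = init_centers)
}
result <- lapply(1:30, wrapper)

# Class of the whole object and the number of stored runs.
print(class(result))
print(length(result))

# Structure of the first run, then its total within-cluster sum of squares.
str(result[[1]])
first_tot <- result[[1]]$tot.withinss
print(first_tot)
```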
Exercise 10
Find an element of the result object with the lowest tot.withinss value (there may be multiple such elements), and assign it to the
Compare the tot.withinss value of that variable with the corresponding value of the fit_30 variable, which was obtained in Exercise 3.
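One way to pick the best run and make the comparison is sketched below. The name result_best is hypothetical, result is again produced with the sequential lapply as a stand-in for the parallel output, and df is synthetic.

```r
# Synthetic data and a sequential stand-in for the parallel result (assumptions).
set.seed(123)
df <- data.frame(
  x = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 8)),
  y = c(rnorm(200, 0), rnorm(200, 4), rnorm(200, 0))
)
wrapper <- function(i) {
  init_centers <- as.matrix(df[sample(nrow(df), 3), ])
  kmeans(df, centers = init_centers)
}
result <- lapply(1:30, wrapper)
fit_30 <- kmeans(df, centers = 3, nstart = 30)

# Collect tot.withinss from every run; which.min returns the first minimum.
tot <- sapply(result, function(fit) fit$tot.withinss)
result_best <- result[[which.min(tot)]]  # hypothetical variable name

# Side-by-side comparison of the two approaches.
print(c(best_of_30_runs = result_best$tot.withinss,
        nstart_30 = fit_30$tot.withinss))
```

If both approaches find the same clustering, the two values should agree up to small numerical differences.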