foreach is a statement for iterating over items in a collection without using any explicit counter. In R, it is also a way to run code in parallel, which may be more convenient and readable than alternatives such as the sfLapply function (considered in the previous set of exercises of this series).
Apart from being able to run code in parallel, R’s foreach differs from the standard for loop in several other ways. Specifically, foreach:
- allows iterating over several variables simultaneously,
- returns a value (a list, a vector, a matrix, or another object),
- is able to skip some iterations based on a condition (the last two properties make it similar to a list comprehension, a construct found in Python and some other languages),
- has a special syntax that includes operators such as %do% (see an example in Exercise 1).
The first six exercises in this set provide practice with basic operations using the foreach statement, and the last four show how to run it in parallel using multiple CPU cores on one machine. The task will be to parallelize identical operations on a set of files (the zipped data files can be downloaded here). It is assumed that your computer has two or more CPU cores.
The exercises require the packages foreach, doParallel, and parallel. The first two packages have to be installed, and the last one comes with the standard R distribution. The packages doParallel and parallel are necessary to run foreach in parallel.
For other parts of the series follow the tag parallel computing.
Answers to the exercises are available here.
Exercise 1
The foreach function (from the package of the same name) is typically used as part of a special statement. In its simplest form, the statement looks like this:
result <- foreach(i = 1:3) %do% sqrt(i)
The statement above consists of three parts:
- foreach(i = 1:3): a call to the foreach function, with an argument that includes an iteration variable (i) and a sequence to be iterated over (1:3),
- %do%: a special operator,
- sqrt(i): an R expression, which represents an operation to be performed on the iteration variable (this part of the statement is equivalent to the body of a loop).
The code iterates over the sequence, applies the operation defined in the expression to each element of the sequence, and stores the output in the result object.
Note that if the expression extends over several lines, it has to be enclosed in curly braces. The use of the iteration variable is not mandatory: if you just want to repeat the expression n times without passing anything to it, you can use a sequence of length n as the only input to foreach.
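Both points can be illustrated with a minimal sketch. The variable names below are illustrative and are not part of the exercise; the snippet assumes the foreach package is installed.

```r
library(foreach)

# A body that spans several lines must be wrapped in curly braces;
# the value of the last expression is what each iteration returns.
squares_plus_one <- foreach(i = 1:3) %do% {
  squared <- i^2
  squared + 1
}

# An unnamed sequence only controls how many times the expression is repeated;
# nothing is passed to the expression itself.
greetings <- foreach(1:3) %do% "hello"

print(unlist(squares_plus_one))
print(length(greetings))
```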
In this exercise:
- Run the code above, print the result object, and find which class it belongs to.
- Use the foreach function to reverse the result, i.e. write a line of code that receives the result object as input and outputs the original sequence. Print the sequence.
Exercise 2
The foreach function allows the use of several iteration variables simultaneously. They are passed to the function as arguments separated by commas.
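As a small illustration (with values chosen here, not those of the exercise), the sketch below pairs up two vectors of equal length element by element. The names a and b are arbitrary. The parentheses are needed because %do% binds more tightly than the arithmetic operator.

```r
library(foreach)

# Two iteration variables advance together: (1, 10), (2, 20), (3, 30).
products <- foreach(a = 1:3, b = c(10, 20, 30)) %do% (a * b)

print(unlist(products))
```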
Use the foreach function with two iteration variables to get a sequence of their sums. The variables have to iterate over a vector of integers from 1 to 3 and a vector of 5 integers of value 10. Print the result.
(Tip: if you want to use an arithmetic operator to calculate the sum, the expression must be placed in parentheses or curly braces.)
What is the length of the resulting object? How does the function deal with vectors of different lengths?
Exercise 3
The iterators package provides several functions that can be used to create sequences for the foreach function. For example, the irnorm function creates an object that iterates over vectors of normally distributed random numbers. It is useful when you need to use random variables drawn from one distribution in an expression that is run in parallel.
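To see what such an iterator produces on its own, the following sketch pulls values from it by hand with nextElem (also from the iterators package). The vector length, the count, and the seed are arbitrary choices made for this illustration.

```r
library(iterators)

set.seed(42)

# Each call to nextElem() returns a fresh vector of 4 standard normal draws;
# count = 2 means the iterator can be advanced only twice.
it <- irnorm(4, count = 2)
first_draw <- nextElem(it)
second_draw <- nextElem(it)

# A further call signals an error, which is how an iterator reports exhaustion.
exhausted <- inherits(try(nextElem(it), silent = TRUE), "try-error")

print(first_draw)
print(exhausted)
```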
In this exercise, use the foreach and irnorm functions to iterate over 3 vectors, each containing 5 random variables. Find the largest value in each vector, and print those largest values.
Before running the foreach function, set the seed to 1234.
Exercise 4
By default, the foreach function returns a list. But it can also return sequences of other types. This requires changing the value of the .combine parameter of the function. This exercise provides practice in using this parameter.
As in the previous exercise, use the foreach and irnorm functions to iterate over 3 vectors, each containing 5 random variables. But now use an expression that returns all variables generated by irnorm. Pass the .combine parameter to the foreach function with the value 'c'. Print the result, and find its class and length.
Then run the code again with the 'cbind' value assigned to the .combine parameter. Print the result, and find its class and size.
Both 'c' and 'cbind' name R functions from the base package. Other functions (including user-written ones) can be used as well to combine the outputs of the expression.
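The sketch below shows the .combine parameter with a base function and with a user-written one, on inputs unrelated to the exercise. The join_with_dash helper is hypothetical and exists only for this illustration.

```r
library(foreach)

# With .combine = 'c' the per-iteration vectors are concatenated into one vector
# instead of being collected in a list.
flat <- foreach(i = 1:3, .combine = 'c') %do% c(i, i * 10)

# A user-written combiner takes the accumulated result and the next value.
join_with_dash <- function(a, b) paste(a, b, sep = "-")
joined <- foreach(ch = c("x", "y", "z"), .combine = join_with_dash) %do% toupper(ch)

print(flat)
print(joined)
```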
Exercise 5
The results of the expression placed after the %do% operator can be combined in different ways. Look at the documentation for the foreach function to find what value has to be assigned to the .combine parameter to sum the values produced by the expression in each iteration.
Run the code used in the previous exercise with that value assigned to the .combine parameter, and print the result.
Before running the code, set the seed to 1234.
Exercise 6
The sequence passed to the foreach function can be filtered so that the expression after %do% is applied only to a part of the sequence. This is done using a syntax like this:
result <- foreach(i = some_sequence) %:% when(i > 0) %do% sqrt(i)
Note that the %:% operator and the when function, which contains a Boolean expression involving the iteration variable, are added to a standard foreach statement.
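A concrete version of that pattern, with an input vector made up for this illustration, might look as follows. Only the positive elements reach sqrt, so no NaN values are produced.

```r
library(foreach)

some_sequence <- c(-4, 9, -1, 16)

# when() drops the iterations for which the condition is FALSE.
positive_roots <- foreach(i = some_sequence, .combine = 'c') %:%
  when(i > 0) %do% sqrt(i)

print(positive_roots)
```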
Modify the example above to get a vector of logs of all even integers in the range from 1 to 10. Print the result.
Exercise 7
Now let’s parallelize the execution of the foreach function. We’ll use it to read similarly named files and perform identical calculations on the data from each file.
As a first step, write a function to be run in parallel. The function takes an integer as input, and performs the following actions:
- Create a string (character vector) with a file name by concatenating the constant parts of the name (the prefix and the .csv extension) with the integer.
- Read the file with the obtained name from the current working directory into a data frame.
- Calculate mean values for each column in the data frame.
- Return a vector of those values.
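The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the reference answer: the "sample_" prefix, the column_means_for name, and the dir argument are all hypothetical, and the usage part writes a throwaway file to a temporary directory rather than using the real exercise data.

```r
# Hypothetical helper: "sample_" is an assumed prefix, not the real file name.
column_means_for <- function(n, dir = ".") {
  file_name <- paste0("sample_", n, ".csv")   # e.g. "sample_1.csv"
  data <- read.csv(file.path(dir, file_name)) # read the file into a data frame
  colMeans(data)                              # named vector of column means
}

# Usage with a small temporary file created just for this example.
tmp <- tempdir()
write.csv(data.frame(x = c(1, 2, 3), y = c(10, 20, 30)),
          file.path(tmp, "sample_1.csv"), row.names = FALSE)
means <- column_means_for(1, dir = tmp)
print(means)
```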
Exercise 8
The second step is to create a backend for parallel execution:
- Make a cluster for parallel execution using the makeCluster function from the parallel package; pass the size of the cluster (i.e. the number of CPU cores that you want to be used in computations) as an argument to this function.
- Register the cluster with the registerDoParallel function from the doParallel package.
Note that by default the makeCluster function creates a PSOCK cluster, which is an enhanced version of the SOCK cluster implemented in the snow package. Accordingly, the PSOCK cluster is a pool of worker processes that exchange data with the master process via sockets. The makeCluster function can also create other types of clusters.
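A minimal sketch of the backend setup, assuming a machine with at least two cores and the doParallel package installed. The cluster size of 2 is an arbitrary choice for this illustration, and the cluster is stopped at the end so that the snippet does not leave worker processes behind.

```r
library(foreach)
library(parallel)
library(doParallel)

# Start two worker processes; type = "PSOCK" is spelled out here for clarity.
cl <- makeCluster(2, type = "PSOCK")

# Tell foreach to send %dopar% work to this cluster.
registerDoParallel(cl)

# getDoParWorkers() (from foreach) reports how many workers are registered.
n_workers <- getDoParWorkers()
print(n_workers)

# Release the workers once they are no longer needed.
stopCluster(cl)
```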
Exercise 9
The last step is to run the foreach function to read and analyze 10 test files (contained in this archive) using the function created in Exercise 7. Combine the outputs of that function using the .combine parameter.
Perform this task twice:
- with the %do% operator, which evaluates the expression sequentially, and
- with the %dopar% operator, which evaluates the expression in parallel.
In both cases, measure the execution time using the system.time function. Print the result of the last run.
IMPORTANT: after completing the parallel computations, stop the cluster (created in Exercise 8) using the stopCluster function from the parallel package.
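The timing comparison can be tried out on a toy workload before touching the files. In the sketch below, Sys.sleep stands in for the real work, and the cluster size, the sleep duration, and the variable names are assumptions made for this illustration. Actual timings depend on the machine, so no particular speed-up is guaranteed.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# Sequential run: four iterations, each pausing for 0.2 seconds.
time_seq <- system.time(
  res_seq <- foreach(i = 1:4, .combine = 'c') %do% {
    Sys.sleep(0.2)
    i^2
  }
)

# Parallel run of the same expression on the registered cluster.
time_par <- system.time(
  res_par <- foreach(i = 1:4, .combine = 'c') %dopar% {
    Sys.sleep(0.2)
    i^2
  }
)

# Always stop the cluster when the parallel work is done.
stopCluster(cl)

print(time_seq["elapsed"])
print(time_par["elapsed"])
```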
Exercise 10
Modify the code written in Exercise 7 and Exercise 9 to calculate the mean and the variance of the values contained in the first column of each file. The resulting object must be a two-column matrix, with the first column containing the means and the second column containing the variances (the number of rows must be equal to the number of files).
Repeat the actions listed in Exercise 8 to prepare a cluster for parallel execution, then run the modified code in parallel.
Print the result.
Stop the cluster.