Add Some Spark to Your Analysis – Sparklyr: Exercise 1
In this exercise set, we will see how to work with big data in Spark using sparklyr. Please read the documentation and download the data set required for this exercise from here.
Answers to this set are available here.
Install the sparklyr package, then use it to install Spark version 2.2.0.
Create a Spark connection with spark_connect, using the locally installed Spark.
Upload the data set into the Spark cluster using spark_read_csv.
List the data sets available in the cluster.
sparklyr can also copy any data frame available in an R session to the Spark cluster. Copy the iris data set to the Spark cluster.
sparklyr registers a table name in Spark when you upload a CSV file or transfer a data frame from R to a Spark cluster. You can use the tbl command to get a reference to that table, together with its metadata, in your R session. Do this and check the resulting tibble.
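As a starting point for the setup exercises above, here is a minimal sketch rather than the official answer. The file path and the table name "terrorism" are placeholders chosen for illustration; they do not come from the exercise. Spark 2.2.0 may not be offered by newer sparklyr releases, in which case spark_available_versions() shows what can be installed.

```r
# install.packages("sparklyr")  # one-time install from CRAN
library(sparklyr)
library(dplyr)

# One-time download of a local Spark distribution.
spark_install(version = "2.2.0")

# Connect to the locally installed Spark.
sc <- spark_connect(master = "local", version = "2.2.0")

# Placeholder path: point this at the CSV you downloaded for the exercise.
csv_path <- "path/to/dataset.csv"
if (file.exists(csv_path)) {
  # "terrorism" is a hypothetical table name.
  spark_read_csv(sc, name = "terrorism", path = csv_path,
                 header = TRUE, infer_schema = TRUE)
}

# Copy an R data frame into Spark.
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

# List the tables Spark now knows about.
src_tbls(sc)

# tbl() returns a lazy reference to the Spark table, not the data itself.
iris_ref <- tbl(sc, "iris")
glimpse(iris_ref)

# Call spark_disconnect(sc) when you are finished with the session.
```

The same tbl() call with the name used in spark_read_csv gives a reference to the uploaded CSV table.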
Select the country, the motive of the attack, the city, and the number of kills from the tibble.
Now filter for terrorist activities that happened after the year 2000 in the US.
Then, arrange the activities by descending year.
Create a new column (large_death_toll) which is 1 when the number of kills is more than 10 and 0 otherwise.
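The dplyr exercises above can be sketched as two small helper functions. This is one possible approach, not the published answer. The column names (year, country, city, motive, kills) and the country value "United States" are assumptions made for illustration; check the real names with glimpse() on your tibble and substitute them.

```r
library(dplyr)

# Hypothetical column names: year, country, city, motive, kills.
# Replace them with the names your table actually uses.

# Keep only the country, motive, city and number of kills.
select_attack_columns <- function(attacks) {
  attacks %>%
    select(country, motive, city, kills)
}

# Filter to US activities after 2000, sort by descending year,
# and flag rows where more than 10 people were killed.
us_attacks_after_2000 <- function(attacks) {
  attacks %>%
    filter(year > 2000, country == "United States") %>%
    arrange(desc(year)) %>%
    mutate(large_death_toll = ifelse(kills > 10, 1, 0))
}

# Usage with a Spark table reference (assumes a connection `sc`
# and a table uploaded under the hypothetical name "terrorism"):
# attacks <- tbl(sc, "terrorism")
# select_attack_columns(attacks)
# us_attacks_after_2000(attacks) %>% collect()
```

Because the year column is needed for filtering and sorting, the filter runs on the full tibble rather than on the result of the select step. On a Spark table these verbs are translated to Spark SQL and evaluated lazily; collect() brings the result into the R session.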