In this exercise set, we will see how to work with big data in Spark using sparklyr. Please read the documentation and download the data set required for this exercise from here.
Answers to this set are available here.
Exercise 1
Install the sparklyr package. Then install Spark version 2.2.0 via sparklyr.
Exercise 2
Create a Spark connection with spark_connect, using the locally installed Spark.
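As a hint for Exercises 1 and 2, the setup might look roughly like the sketch below. It assumes a machine with Java available and an internet connection for the one-time Spark download; the variable name sc is only a convention, not something the exercise requires.

```r
# Run once if the package is not yet installed:
# install.packages("sparklyr")
library(sparklyr)

# Download and install a local copy of Spark 2.2.0 (one-time step).
spark_install(version = "2.2.0")

# Open a connection to the locally installed Spark.
sc <- spark_connect(master = "local", version = "2.2.0")
```

When you are done with the session, spark_disconnect(sc) closes the connection.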
Exercise 3
Upload the data set into the Spark cluster using spark_read_csv.
Exercise 4
Check which data sets are available in the cluster.
Exercise 5
sparklyr can also copy any data frame available in an R session to the Spark cluster. Copy the iris data set to the Spark cluster.
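One possible shape for Exercises 3 to 5 is sketched below. To keep the sketch runnable without the real download, it writes a tiny made-up CSV to a temporary file; the table name "terrorism", the column names, and the toy rows are all assumptions for illustration. In the actual exercise, point path at the downloaded file instead.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Stand-in CSV with made-up rows (assumption: replace csv_path with the
# location of the downloaded data set).
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(country = c("US", "FR"), year = c(2005L, 2010L), kills = c(3, 20)),
  csv_path,
  row.names = FALSE
)

# Read the CSV into Spark and register it under a table name.
terrorism_tbl <- spark_read_csv(sc, name = "terrorism", path = csv_path)

# Copy an in-memory R data frame to Spark.
iris_tbl <- copy_to(sc, iris, name = "iris", overwrite = TRUE)

# List the tables currently registered in the Spark session.
src_tbls(sc)
```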
Exercise 6
sparklyr registers a table name for the data frame when you upload a CSV or transfer a data frame from R to a Spark cluster. You can use the tbl command to retrieve the metadata from Spark and use it in an R session. Do this and check the resulting tibble.
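A minimal sketch of the tbl step, shown here on the iris table so that it runs without the downloaded file. The table name "iris" is an assumption carried over from the copy step; use whatever name you registered your own data under.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Make sure a table named "iris" exists in Spark.
copy_to(sc, iris, name = "iris", overwrite = TRUE)

# tbl() returns a lazy reference to the Spark table, not a local copy.
iris_ref <- tbl(sc, "iris")

# Printing the reference shows the first rows, fetched from Spark on demand.
iris_ref
```

Because the reference is lazy, dplyr verbs applied to it are translated to Spark SQL, and the data only comes into R when you print the result or call collect().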
Exercise 7
Select the country, the motive of the attack, the city, and the number of kills from the tibble.
Exercise 8
Now, filter for the terrorist activities that happened in the US after the year 2000.
Exercise 9
Then, arrange the activities by descending year.
Exercise 10
Create a new column (large_death_toll) which is 1 when the number of kills is more than 10 and 0 otherwise.
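The dplyr verbs needed for Exercises 7 to 10 can be tried on a small local tibble first. The sketch below uses made-up toy rows and hypothetical column names (country, motive, city, year, kills); the real data set uses its own column names, so check them before adapting the pipeline. The same verbs should work on the reference returned by tbl.

```r
library(dplyr)

# Toy stand-in data: every value here is invented for illustration.
attacks <- tibble(
  country = c("US", "US", "FR", "US"),
  city    = c("A", "B", "C", "D"),
  motive  = c("m1", "m2", "m3", "m4"),
  year    = c(1999L, 2005L, 2010L, 2015L),
  kills   = c(12, 3, 20, 15)
)

result <- attacks %>%
  # Keep year as well, since the later filter and sort need it.
  select(country, motive, city, kills, year) %>%
  # Activities in the US after the year 2000.
  filter(year > 2000, country == "US") %>%
  # Most recent activities first.
  arrange(desc(year)) %>%
  # Flag rows with more than 10 kills.
  mutate(large_death_toll = ifelse(kills > 10, 1, 0))

result
```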