We will continue learning about sparklyr. For this exercise, download the H-1B dataset from Kaggle here.
Answers to this exercise are available here.
For other parts of this series, please follow the tag spark.
Load the data into Spark. Select the distinct work sites from the data. Use the collect method to bring the result into your R session.
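A possible approach is sketched below. It assumes a local Spark installation, that the Kaggle CSV is saved as `h1b_kaggle.csv`, and that the work-site column is named `WORKSITE` (as in the Kaggle H-1B dataset):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance and load the CSV into Spark
sc  <- spark_connect(master = "local")
h1b <- spark_read_csv(sc, name = "h1b", path = "h1b_kaggle.csv")

# Select the distinct work sites and collect them into the R session
h1b %>%
  distinct(WORKSITE) %>%
  collect()
```

Until `collect()` is called, all of the work happens inside Spark; `collect()` is what pulls the result back into R as a local data frame.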
Now, find the top 10 work sites by count of H-1B visa applications.
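One way to do this with dplyr verbs (again assuming the connection `sc` and table `h1b` from the previous step):

```r
# Count records per work site and keep the ten largest
h1b %>%
  group_by(WORKSITE) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10)
```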
Since Spark uses lazy evaluation, it is useful to store intermediate computations in the Spark session so we can reuse them in later analyses.
Now, as in the previous exercise, find the top 10 work sites and store the result as top_worksite in Spark. Check the session for the newly created Spark data frame.
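A sketch using dplyr's `compute()`, which materialises the result as a temporary table in the Spark session (table and variable names are illustrative):

```r
# Materialise the top-10 work sites as a temporary Spark table
top_worksite <- h1b %>%
  group_by(WORKSITE) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10) %>%
  compute("top_worksite")

# List the tables registered in the Spark session;
# "top_worksite" should now appear
src_tbls(sc)
```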
Find the mean wage for each work site.
N.B. – See what happens when you try to use median; we will discuss later how to tackle this problem.
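A sketch of the grouped mean, assuming the wage column is `PREVAILING_WAGE` (as in the Kaggle dataset):

```r
# Mean prevailing wage per work site
h1b %>%
  group_by(WORKSITE) %>%
  summarise(mean_wage = mean(PREVAILING_WAGE))

# Swapping mean() for median() fails, because dplyr translates
# the pipeline to Spark SQL, which has no median() function.
```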
Sparklyr can leverage Spark SQL through dbGetQuery. Use SQL to find records from Jersey City, New Jersey where the salary is over 200,000.
You need the DBI library to use dbGetQuery.
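A sketch of the query. It assumes the table was registered as `h1b`, and that work-site values are stored in upper case (e.g. `'JERSEY CITY, NEW JERSEY'`) as in the Kaggle dataset — check `distinct(WORKSITE)` if your values differ:

```r
library(DBI)

# Records from Jersey City, New Jersey with wages above 200,000
dbGetQuery(sc, "
  SELECT *
  FROM h1b
  WHERE WORKSITE = 'JERSEY CITY, NEW JERSEY'
    AND PREVAILING_WAGE > 200000
")
```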
SQL queries can be simple, like the one above, or more complex, since you can write full Spark SQL. Find the median wage for each work site. Remember, there is no median function in Spark SQL; instead, use percentile_approx over the wage column, e.g. percentile_approx(PREVAILING_WAGE, 0.5) AS median.
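Putting that together (same assumed table and column names as above):

```r
# Approximate median wage per work site via Spark SQL
dbGetQuery(sc, "
  SELECT WORKSITE,
         percentile_approx(PREVAILING_WAGE, 0.5) AS median_wage
  FROM h1b
  GROUP BY WORKSITE
")
```

`percentile_approx` computes an approximate quantile in a single pass, which is why Spark offers it instead of an exact `median` over distributed data.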
For many statistical operations, it is important to work with sample data. Use sparklyr to draw 100 samples from the dataset.
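One sketch: `sdf_sample()` takes a fraction rather than a row count, so a row count must be approximated from the table size (variable names are illustrative):

```r
# sdf_sample() samples by fraction, so derive the fraction
# that corresponds to roughly 100 rows
n_total <- sdf_nrow(h1b)

h1b_sample <- h1b %>%
  sdf_sample(fraction = 100 / n_total, replacement = FALSE, seed = 42)
```

Note the sample size is approximate; Spark samples each row independently with the given probability.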
Sparklyr can also leverage windowed (grouped) operations, just like dplyr. Use this to find the employers in each work site whose mean wage is over 100,000.
Those who are familiar with window and rank functions in dplyr or SQL may find this easy; those who are not should consult the dplyr documentation on window functions.
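One reading of the exercise, sketched with a grouped filter (dplyr translates the aggregate inside `filter()` into a Spark SQL window function; column names are the assumed Kaggle ones):

```r
# Employers whose mean wage within a work site exceeds 100,000
h1b %>%
  group_by(WORKSITE, EMPLOYER_NAME) %>%
  filter(mean(PREVAILING_WAGE) > 100000) %>%
  select(WORKSITE, EMPLOYER_NAME, PREVAILING_WAGE)
```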
Rank the employers in each work site by decreasing mean wage.
You can see the SQL generated with the sql_render command. Check the SQL and see whether it matches your intuition.
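A sketch covering both steps. After `summarise()`, the result is still grouped by `WORKSITE`, so the `min_rank()` in `mutate()` ranks employers within each work site; `sql_render()` (from dbplyr) then shows the generated query without executing it:

```r
library(dbplyr)

ranked <- h1b %>%
  group_by(WORKSITE, EMPLOYER_NAME) %>%
  summarise(mean_wage = mean(PREVAILING_WAGE)) %>%
  mutate(rank = min_rank(desc(mean_wage)))   # rank within each work site

# Inspect the Spark SQL that dplyr generates for this pipeline
sql_render(ranked)
```

You should see a `RANK() OVER (PARTITION BY ... ORDER BY ... DESC)` clause in the rendered SQL, which is how dplyr expresses the grouped ranking.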