In the previous exercise set, you saw the basic functionality of H2O. In this set, we explore H2O further and see how to wrangle data with it.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Create an H2O frame named aq from R's airquality dataset, and arrange it by the column “Temp”.
As in Exercise 1, create an H2O frame from the iris dataset and arrange it by Sepal.Length. Once that works, change the Species column from factor to character and try to arrange the frame again.
See what happens. Change the column back to factor once you have seen the error.
Take the loan dataset from the previous exercise set and import it into your local H2O cluster. Find the numeric and categorical columns.
H2O has a function similar to plyr's ddply. Use it to compute the average Sepal.Length per species.
Your result should be similar to:
library(dplyr)
iris %>% group_by(Species) %>% summarise(sum(Sepal.Length) / n())
Find the top 10 percent of loan_amnt values in the loan H2O frame.
Find the bottom 10 percent of loan_amnt values in the loan H2O frame.
Find the number of NA values in each column of the aq H2O frame and of the loan H2O frame.
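As a reference point for the aq part, the same counts can be computed on the plain R data frame. This is only a base-R sketch for checking your result, not the H2O call the exercise asks for; the name na_counts is made up for illustration.

```r
# Count missing values per column of the built-in airquality data frame.
# is.na() gives a logical matrix; colSums() adds up the TRUEs per column.
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

The per-column counts from your H2O frame should agree with these, since aq is built from the same data.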
Fill the NA values in the aq H2O frame with the previous value in the same column.
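To make "previous value" concrete, here is a minimal base-R sketch of a forward fill on the plain airquality data frame. The helper fill_forward and the name aq_ref are hypothetical and are not part of H2O. H2O's own fill may limit how many consecutive NAs it fills, so results can differ on long runs of missing values.

```r
# Hypothetical helper: carry the last observed value forward over NAs.
# Leading NAs (with no earlier value to copy) are left as NA.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[idx + 1]    # look up the most recent non-NA value
}

# Apply the helper column by column to the plain data frame.
aq_ref <- as.data.frame(lapply(airquality, fill_forward))
head(aq_ref)
```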
Fill the NA values of each column with that column's mean. Because you already filled the NA values in the previous exercise, first create aq again from the airquality dataset.
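For comparison, mean imputation on the plain R data frame might look like the sketch below. The names impute_mean and aq_mean are illustrative only, and this is base R rather than the H2O approach the exercise asks for.

```r
# Hypothetical helper: replace NAs in a numeric vector with the mean
# of its observed (non-NA) values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Impute every column of the original (unfilled) airquality data.
aq_mean <- as.data.frame(lapply(airquality, impute_mean))
colMeans(aq_mean)
```

Filling with the mean leaves each column's mean unchanged, which gives a quick sanity check for the H2O result.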
Sometimes you need the incremental value of a column, that is, how it relates to its own lagged value. Create a new column in aq that holds the lag 1 value of Temp.
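"Lag 1 value" can be read either as the previous row's value or as the change from the previous row. The base-R sketch below shows both on the plain airquality data so you can compare with what H2O returns; the variable names are made up for illustration.

```r
temp <- airquality$Temp

# Previous row's Temp; the first row has no predecessor, so it is NA.
temp_lag1 <- c(NA, head(temp, -1))

# Change in Temp relative to the previous row (the "incremental" value).
temp_diff1 <- temp - temp_lag1

head(data.frame(Temp = temp, lag1 = temp_lag1, diff1 = temp_diff1))
```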