In this exercise set we will see how to work with h2o’s main machine learning algorithms and their parameters.
Download the Energy efficiency dataset from the UCI Machine Learning Repository and let’s get started.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.
Load the data into h2o using h2o’s import function. Check the column types and take a quick look at the columns to see whether the data types are represented correctly. Convert the necessary columns to factors. Remember that H2O does not allow columns with floating-point values to be converted to factors. Split the data into train and test sets, with 80% of the data as train and 20% as test.
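Keep in mind that H2O splits frames probabilistically, so the 80/20 ratio you get is approximate rather than exact. The idea can be sketched in plain Python; this is not the H2O API, and `probabilistic_split` is a hypothetical helper written only for illustration.

```python
import random

def probabilistic_split(rows, train_ratio=0.8, seed=42):
    """Send each row to train with probability train_ratio, otherwise to test."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_ratio else test).append(row)
    return train, test

# Stand-in row ids; the Energy efficiency dataset has 768 rows.
rows = list(range(768))
train, test = probabilistic_split(rows)
print(len(train), len(test))  # close to 80/20, but usually not exact
```

Fixing the seed matters when you later compare models, because every model should be scored on the same test rows.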
Create a linear model with Y2 as the response variable and X1-X8 as explanatory variables.
Find the variable importance in the linear model. Notice how the factor variables are handled.
Create a grid search and try out different distribution families to find the best model. You may need to create multiple grids. Always check the performance of the best model and how it compares to the baseline model. How to find the best model is shown in the last answer.
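Whatever grids you end up with, the comparison step is the same: pick the model with the best metric and see how far it improves on the baseline. Here is a minimal plain-Python sketch of that bookkeeping. The helper names, the model ids, and all the numbers are made up for illustration; they are not results from this dataset and not part of the H2O API.

```python
def best_model(scores, lower_is_better=True):
    """Return (model_id, score) for the best entry of a {model_id: score} dict."""
    pick = min if lower_is_better else max
    model_id = pick(scores, key=scores.get)
    return model_id, scores[model_id]

def improvement_over_baseline(best_score, baseline_score):
    """Relative reduction of an error metric versus the baseline (positive = better)."""
    return (baseline_score - best_score) / baseline_score

# Placeholder RMSE values, for illustration only.
grid_scores = {"glm_gaussian": 3.0, "glm_gamma": 2.5, "glm_tweedie": 2.8}
baseline_rmse = 4.0

model_id, score = best_model(grid_scores)
gain = improvement_over_baseline(score, baseline_rmse)
print(model_id, gain)
```

If you collect scores from several grids, merge them into one dictionary first so that all candidates are ranked on the same metric and the same test set.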
Create an h2o grid for the tweedie family. Tweedie covers several distributions depending on its variance power, so it can be grid searched as above.
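To see why one family can stand in for several, recall that a Tweedie distribution has variance proportional to the mean raised to a power. The sketch below only evaluates that variance function; `tweedie_variance` is a hypothetical helper, not an H2O function. In H2O's GLM the power is set through the `tweedie_variance_power` parameter, but check the documentation for the values it accepts.

```python
def tweedie_variance(mu, power, dispersion=1.0):
    """Tweedie variance function: dispersion * mu ** power."""
    return dispersion * mu ** power

mu = 3.0
# Power 0 behaves like the normal, 1 like the Poisson, 2 like the gamma.
for power, name in [(0, "normal"), (1, "Poisson"), (2, "gamma")]:
    print(name, tweedie_variance(mu, power))
```

Powers between 1 and 2 give compound Poisson-gamma behaviour, which is why the variance power is a natural hyperparameter to search over.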
Create a random forest model with default parameter settings.
Check the model details of the random forest model and see how variable importance is reported here.
Check the performance of the model on the test set.
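For a regression target like Y2, RMSE is one of the metrics you will typically read from the performance output. As a reminder of what that number means, here is a small plain-Python sketch. The `rmse` helper and the toy values are assumptions for illustration, not H2O output.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

# Toy values only, to show the call.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(rmse(actual, predicted))
```

RMSE is in the same units as the response, so it can be compared directly across the linear model and the random forest as long as both are scored on the same test set.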
Create a grid search on the parameters of the random forest and check the best model from it.
Check the performance of the best model from the random forest grid on the test data.
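Before launching the grid it helps to know how many models it will train. A Cartesian grid builds one model per combination of values, which the plain-Python sketch below enumerates. `ntrees`, `max_depth` and `mtries` are random forest parameter names in H2O, but the values are arbitrary picks for illustration, and `expand_grid` is a hypothetical helper, not the H2O API.

```python
from itertools import product

def expand_grid(hyper_params):
    """List every combination of values in a {name: [values]} dict."""
    names = list(hyper_params)
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

# Illustrative values only; tune these to your own run.
hyper_params = {
    "ntrees": [50, 100],
    "max_depth": [10, 20],
    "mtries": [2, 4, 6],
}

combos = expand_grid(hyper_params)
print(len(combos))  # 2 * 2 * 3 combinations
for combo in combos[:3]:
    print(combo)
```

The count grows multiplicatively with each added parameter, so keep the value lists short on a first pass.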