This is the last exercise set in the series on H2O’s machine learning algorithms. Please do the exercises in sequence. Some of them require additional data; I have provided the links, so please download the files as needed.
Answers to the exercises are available here. Please check the documentation before starting this exercise set.
For other parts of this series please follow the tag: h2o
Load the bank data from the previous exercise. You can see that the count of response class “0” is much higher than that of class “1”. H2O provides a balance_classes parameter in its classification algorithms. Create a naive Bayes model with balance_classes = TRUE and check its performance by comparing it with the base naive Bayes classifier. Does it improve performance?
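A minimal sketch of the comparison, assuming the bank data has already been split into `train`, `valid`, and `test` H2OFrames with a response column named `y` (these names are assumptions carried over from the previous exercise, not fixed by this text):

```r
library(h2o)
h2o.init()

# Assumed frame and column names from the previous exercise
predictors <- setdiff(names(train), "y")

# Base naive Bayes model
nb_base <- h2o.naiveBayes(x = predictors, y = "y",
                          training_frame = train,
                          validation_frame = valid)

# Naive Bayes with class balancing enabled
nb_bal <- h2o.naiveBayes(x = predictors, y = "y",
                         training_frame = train,
                         validation_frame = valid,
                         balance_classes = TRUE)

# Compare validation performance of the two models
h2o.auc(nb_base, valid = TRUE)
h2o.auc(nb_bal, valid = TRUE)
```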
You can see that this does not improve the score, but ideally it should. What went wrong? Although we instructed the model to use balance_classes = TRUE, we did not specify the sampling ratios for the “No” and “Yes” classes, so H2O did not sample them properly. We can specify these ratios with the class_sampling_factors parameter. Try undersampling the “No” class, and also try oversampling it. Do you see a change in the error rate?
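One way to sketch this, again assuming the `train`/`valid` frames and `predictors` from above: class_sampling_factors takes one factor per class, in the lexicographic order of the class labels (here assumed to be “No”, then “Yes”), where values below 1 undersample and values above 1 oversample. The specific factors below are illustrative choices, not values from the text:

```r
# Under-sample the majority "No" class (first factor < 1)
nb_under <- h2o.naiveBayes(x = predictors, y = "y",
                           training_frame = train,
                           validation_frame = valid,
                           balance_classes = TRUE,
                           class_sampling_factors = c(0.2, 1.0))

# Over-sample the "No" class instead (first factor > 1)
nb_over <- h2o.naiveBayes(x = predictors, y = "y",
                          training_frame = train,
                          validation_frame = valid,
                          balance_classes = TRUE,
                          class_sampling_factors = c(3.0, 1.0))

# Inspect how the error rates shift
h2o.confusionMatrix(nb_under, valid = TRUE)
h2o.confusionMatrix(nb_over, valid = TRUE)
```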
Create a grid search for a GBM with the hyperparameters nbins_cats and learn_rate. The distribution should be bernoulli.
Find the best model from the grid and check its performance on the test set.
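The two steps above might be sketched as follows; the grid values and the grid_id are illustrative assumptions, and the `train`/`valid`/`test` frames are assumed from earlier:

```r
# Hypothetical hyperparameter grid for the GBM
gbm_params <- list(nbins_cats = c(16, 64, 1024),
                   learn_rate = c(0.01, 0.1, 0.3))

gbm_grid <- h2o.grid("gbm", x = predictors, y = "y",
                     grid_id = "gbm_grid",
                     training_frame = train,
                     validation_frame = valid,
                     distribution = "bernoulli",
                     hyper_params = gbm_params)

# Sort the grid by AUC and pull out the best model
gbm_sorted <- h2o.getGrid("gbm_grid", sort_by = "auc", decreasing = TRUE)
best_gbm <- h2o.getModel(gbm_sorted@model_ids[[1]])

# Performance on the held-out test set
h2o.performance(best_gbm, newdata = test)
```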
Next, we will create a random forest classifier. Create a base random forest model and check its AUC score on the validation and test data sets.
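A baseline with default settings might look like this (same assumed frames and response column as before):

```r
# Base random forest with default hyperparameters
rf_base <- h2o.randomForest(x = predictors, y = "y",
                            training_frame = train,
                            validation_frame = valid)

# AUC on the validation frame and on the test frame
h2o.auc(rf_base, valid = TRUE)
h2o.auc(h2o.performance(rf_base, newdata = test))
```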
Create a grid search for the random forest classifier. Check the documentation or the first exercise set of this H2O series for parameters that can be used in the grid search.
As always, find the best model and check its performance.
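As a sketch, with an illustrative choice of hyperparameters (ntrees and max_depth are just two of the parameters the documentation lists; the values and grid_id are assumptions):

```r
# Hypothetical hyperparameter grid for the random forest
rf_params <- list(ntrees = c(50, 100, 200),
                  max_depth = c(10, 20, 30))

rf_grid <- h2o.grid("randomForest", x = predictors, y = "y",
                    grid_id = "rf_grid",
                    training_frame = train,
                    validation_frame = valid,
                    hyper_params = rf_params)

# Pick the best model by AUC and evaluate on the test set
rf_sorted <- h2o.getGrid("rf_grid", sort_by = "auc", decreasing = TRUE)
best_rf <- h2o.getModel(rf_sorted@model_ids[[1]])
h2o.performance(best_rf, newdata = test)
```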
Create an ensemble model for the bank data where the base models are a random forest and a GBM. Create the ensemble and check its performance.
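One way to sketch this with h2o.stackedEnsemble: the base learners must be cross-validated with the same fold assignment and must keep their cross-validation predictions, otherwise the ensemble cannot be built. Frame and column names are assumptions as before:

```r
# Base learners: cross-validated, with kept CV predictions
rf_cv <- h2o.randomForest(x = predictors, y = "y",
                          training_frame = train,
                          nfolds = 5,
                          fold_assignment = "Modulo",
                          keep_cross_validation_predictions = TRUE)

gbm_cv <- h2o.gbm(x = predictors, y = "y",
                  training_frame = train,
                  distribution = "bernoulli",
                  nfolds = 5,
                  fold_assignment = "Modulo",
                  keep_cross_validation_predictions = TRUE)

# Stack the two base models into an ensemble
ensemble <- h2o.stackedEnsemble(x = predictors, y = "y",
                                training_frame = train,
                                base_models = list(rf_cv, gbm_cv))

# Compare ensemble performance with the base learners
h2o.performance(ensemble, newdata = test)
```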
For the next few exercises, we will use the prostate data. Download it from here.
Create a K-means model and find the clusters.
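A minimal sketch, assuming the prostate data has been loaded into an H2OFrame called `prostate`; the column names and the choice of k = 3 are illustrative assumptions, not taken from the text:

```r
# Assumed: prostate data already imported as an H2OFrame
prostate <- h2o.importFile("prostate.csv")  # path is an assumption

# K-means with an illustrative k and standardized features
km_model <- h2o.kmeans(training_frame = prostate,
                       x = c("AGE", "PSA", "VOL", "GLEASON"),
                       k = 3,
                       standardize = TRUE)

# Cluster centers and per-row cluster assignments
h2o.centers(km_model)
h2o.predict(km_model, prostate)
```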
In the K-means model, try setting the init parameter to “Furthest” and “PlusPlus”, respectively. Retrieve the standardized cluster centers (centers_std) from each model for the explanatory variables.
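This could be sketched as follows, reusing the assumed `prostate` frame and column choices from above:

```r
# K-means with the "Furthest" initialization scheme
km_furthest <- h2o.kmeans(training_frame = prostate,
                          x = c("AGE", "PSA", "VOL", "GLEASON"),
                          k = 3, standardize = TRUE,
                          init = "Furthest")

# K-means with the "PlusPlus" (k-means++) initialization scheme
km_plusplus <- h2o.kmeans(training_frame = prostate,
                          x = c("AGE", "PSA", "VOL", "GLEASON"),
                          k = 3, standardize = TRUE,
                          init = "PlusPlus")

# Standardized cluster centers for the explanatory variables
h2o.centersSTD(km_furthest)
h2o.centersSTD(km_plusplus)
```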
Next, you will see why principal component analysis (PCA) is used in machine learning. Download the data from here. Load the data set and examine the first two principal components. The summary should give you an intuitive idea that the first two PCs explain 98% of the variance.
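A sketch of the PCA step, assuming the downloaded data has been imported as an H2OFrame called `pca_data` (the name, file path, and choice of k are assumptions):

```r
# Assumed: the downloaded data imported as an H2OFrame
pca_data <- h2o.importFile("pca_data.csv")  # path is an assumption

# PCA on standardized columns, keeping a few components
pca_model <- h2o.prcomp(training_frame = pca_data,
                        k = 5,
                        transform = "STANDARDIZE")

# The summary reports the proportion of variance per component;
# the cumulative proportion of the first two PCs should be ~98%
summary(pca_model)
```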