eXtreme Gradient Boosting is a machine learning algorithm that became very popular a few years ago after it was used in the winning solutions of several Kaggle competitions. It is a very powerful algorithm that combines an ensemble of weak learners into a strong learner. Its R implementation is available in the xgboost package, and it is well worth including in anyone's machine learning portfolio.
This is the second part of the eXtremely Boost your machine learning series. For the other parts, follow the xgboost tag.
Answers to the exercises are available here.
If you obtained a different (correct) answer from the ones listed on the solutions page, please feel free to post your answer as a comment on that page.
To prepare for the exercises, please run the following chunk of code (this was covered in the first part of the series):
library(xgboost)

# Load the German credit data set
url <- "http://freakonometrics.free.fr/german_credit.csv"
credit <- read.csv(url, header = TRUE, sep = ",")

# Convert the categorical columns to factors
factor_columns <- c(2, 4, 5, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 20)
for (i in factor_columns) credit[, i] <- as.factor(credit[, i])

# Build the design matrix and split into training and test sets
X <- model.matrix(~ . - Creditability, data = credit)
inTraining <- sample(1:nrow(credit), size = 700)
dtrain <- xgb.DMatrix(X[inTraining, ], label = credit$Creditability[inTraining])
dtest <- xgb.DMatrix(X[-inTraining, ], label = credit$Creditability[-inTraining])

model <- xgboost(data = dtrain,
                 max_depth = 4,
                 nrounds = 3,
                 objective = "binary:logistic")
Exercise 1
Plot the trees of the model.
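As a hint, not the official answer: one way to do this is xgb.plot.tree from the xgboost package (it needs the DiagrammeR package installed to render):

# Render the trees of the fitted booster (requires the DiagrammeR package)
xgb.plot.tree(model = model)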
Exercise 2
Visualise the ensemble of trees as a single collective unit.
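A possible approach, assuming the condensed multi-tree view is what is meant, is xgb.plot.multi.trees, which projects all trees of the ensemble onto a single diagram:

# Condense the whole ensemble into one collective tree diagram;
# features_keep controls how many features are shown per node position
xgb.plot.multi.trees(model = model, features_keep = 5)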
Exercise 3
Save the model to a file.
Exercise 4
Load the model from the file and continue training it for another 10 rounds.
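A minimal sketch covering both exercises 3 and 4 (the file name xgboost.model is arbitrary): xgb.save writes the booster to disk, and the xgb_model argument of xgboost lets training continue from a saved model:

# Save the fitted booster to disk
xgb.save(model, "xgboost.model")

# Continue training the saved model for another 10 rounds
model_continued <- xgboost(data = dtrain,
                           max_depth = 4,
                           nrounds = 10,
                           objective = "binary:logistic",
                           xgb_model = "xgboost.model")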
Exercise 5
Plot the deepness of the model.
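A one-liner should suffice here; xgb.plot.deepness visualises how the leaves are distributed over tree depths:

# Plot the distribution of leaves by depth across the ensemble
xgb.plot.deepness(model = model)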
Exercise 6
Use 10-fold cross-validation with AUC as the evaluation metric to determine the optimal number of rounds to train the model.
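One possible sketch; the upper bound of 50 candidate rounds is an arbitrary choice, and the exact name of the log column may differ between xgboost versions:

# 10-fold cross-validation with AUC, trying up to 50 rounds
cv <- xgb.cv(data = dtrain,
             nfold = 10,
             nrounds = 50,
             max_depth = 4,
             objective = "binary:logistic",
             metrics = "auc")

# Round with the highest mean test AUC
which.max(cv$evaluation_log$test_auc_mean)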
Exercise 7
Try five different values of max_depth and check how they influence performance and the optimal number of rounds.
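One way to organise this; the five depths below are just an example set, not prescribed values:

# Example candidate depths -- any five values could be tried
for (depth in c(2, 4, 6, 8, 10)) {
  cv <- xgb.cv(data = dtrain,
               nfold = 10,
               nrounds = 50,
               max_depth = depth,
               objective = "binary:logistic",
               metrics = "auc",
               verbose = FALSE)
  cat("max_depth =", depth,
      "best mean test AUC =", max(cv$evaluation_log$test_auc_mean),
      "at round", which.max(cv$evaluation_log$test_auc_mean), "\n")
}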
Exercise 8
Repeat exercise 7 with eta equal to
Exercise 9
Check if you can get better results with a linear booster.
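A starting point, keeping the rest of the setup unchanged; booster = "gblinear" replaces the trees with a regularised linear model (max_depth no longer applies):

# Train with the linear booster instead of trees
linear_model <- xgboost(data = dtrain,
                        booster = "gblinear",
                        nrounds = 3,
                        objective = "binary:logistic")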
Exercise 10
Use the F1 score as the evaluation metric. HINT: You have to define a custom evaluation function.
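A sketch of such a function, assuming predicted probabilities are thresholded at 0.5; it follows the (preds, dtrain) signature that xgb.cv expects for its feval argument:

# Custom F1 evaluation function in the form xgb.cv expects
f1_score <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  pred_class <- as.numeric(preds > 0.5)  # the 0.5 threshold is an assumption
  tp <- sum(pred_class == 1 & labels == 1)
  fp <- sum(pred_class == 1 & labels == 0)
  fn <- sum(pred_class == 0 & labels == 1)
  list(metric = "f1", value = 2 * tp / (2 * tp + fp + fn))
}

cv <- xgb.cv(data = dtrain,
             nfold = 10,
             nrounds = 50,
             max_depth = 4,
             objective = "binary:logistic",
             feval = f1_score,
             maximize = TRUE)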