Extreme Gradient Boosting Classification Learner
Source: R/LearnerClassifXgboost.R
mlr_learners_classif.xgboost.Rd
eXtreme Gradient Boosting classification.
Calls xgboost::xgb.train() from package xgboost.
Note that passing the evals parameter directly will lead to problems when wrapping this mlr3::Learner in a mlr3pipelines GraphLearner, as the preprocessing steps will not be applied to the data in evals.
See the section Early Stopping and Validation below on how to configure validation data instead.
Note
To compute on GPUs, you first need to compile xgboost yourself and link against CUDA. See https://xgboost.readthedocs.io/en/stable/build.html#building-with-gpu-support.
The outputmargin, predcontrib, predinteraction, and predleaf parameters are not supported.
You can still call e.g. predict(learner$model, newdata = newdata, outputmargin = TRUE) to get these predictions.
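For example, a minimal sketch (assuming learner has already been trained and newdata is a numeric feature matrix with the same columns as the training data):
# obtain margin predictions directly from the wrapped xgb.Booster
margin_preds = predict(learner$model, newdata = newdata, outputmargin = TRUE)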
Initial parameter values
nrounds: Actual default: no default.
Adjusted default: 1000.
Reason for change: Without a default, construction of the learner would error. The lightgbm learner has a default of 1000, so we use the same here.
nthread: Actual value: Undefined, triggering auto-detection of the number of CPUs.
Adjusted value: 1.
Reason for change: Conflicts with parallelization via future.
verbose: Actual default: 1.
Adjusted default: 0.
Reason for change: Reduce verbosity.
verbosity: Actual default: 1.
Adjusted default: 0.
Reason for change: Reduce verbosity.
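These adjusted values can be overridden at construction. A minimal sketch (the concrete values are illustrative, not recommendations):
# override the adjusted defaults when constructing the learner
learner = lrn("classif.xgboost", nrounds = 500, nthread = 2, verbose = 1)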
Early Stopping and Validation
To monitor validation performance during training, you can set the $validate field of the Learner.
For information on how to configure the validation set, see the Validation section of mlr3::Learner.
This validation data can also be used for early stopping, which can be enabled by setting the early_stopping_rounds parameter.
The final validation scores (or, when early stopping is used, those of the best iteration) can be accessed via $internal_valid_scores, and the optimal nrounds via $internal_tuned_values.
The internal validation measure can be set via the custom_metric parameter, which accepts an mlr3::Measure, a function, or a character string naming one of the internal xgboost measures.
Using an mlr3::Measure is slower than the internal xgboost measures, but allows using the same measure for tuning and validation.
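A minimal sketch of this workflow, mirroring the extended example at the end of this page (the concrete values are illustrative). Because the validation data is constructed by the Learner itself, this also works when the learner is wrapped in a mlr3pipelines GraphLearner:
# early stopping on an internal validation split (30% of the training data)
learner = lrn("classif.xgboost", nrounds = 500, early_stopping_rounds = 20, validate = 0.3)
# optionally, use an mlr3 measure as the internal validation metric (slower, see above):
# learner$param_set$set_values(custom_metric = msr("classif.logloss"))
learner$train(tsk("sonar"))
learner$internal_tuned_values$nrounds  # boosting rounds selected by early stopping
learner$internal_valid_scores          # validation score(s) at that iteration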
Dictionary
This mlr3::Learner can be instantiated via the dictionary mlr3::mlr_learners or with the associated sugar function mlr3::lrn():
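# construct via the dictionary or with the sugar function
mlr_learners$get("classif.xgboost")
lrn("classif.xgboost")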
Meta Information
Task type: “classif”
Predict Types: “response”, “prob”
Feature Types: “logical”, “integer”, “numeric”
Required Packages: mlr3, mlr3learners, xgboost
Parameters
| Id | Type | Default | Levels | Range |
| alpha | numeric | 0 | - | \([0, \infty)\) |
| approxcontrib | logical | FALSE | TRUE, FALSE | - |
| base_score | numeric | - | - | \((-\infty, \infty)\) |
| booster | character | gbtree | gbtree, gblinear, dart | - |
| callbacks | untyped | list() | - | - |
| colsample_bylevel | numeric | 1 | - | \([0, 1]\) |
| colsample_bynode | numeric | 1 | - | \([0, 1]\) |
| colsample_bytree | numeric | 1 | - | \([0, 1]\) |
| device | untyped | "cpu" | - | - |
| disable_default_eval_metric | logical | FALSE | TRUE, FALSE | - |
| early_stopping_rounds | integer | NULL | - | \([1, \infty)\) |
| eta | numeric | 0.3 | - | \([0, 1]\) |
| evals | untyped | NULL | - | - |
| eval_metric | untyped | - | - | - |
| custom_metric | untyped | - | - | - |
| extmem_single_page | logical | FALSE | TRUE, FALSE | - |
| feature_selector | character | cyclic | cyclic, shuffle, random, greedy, thrifty | - |
| gamma | numeric | 0 | - | \([0, \infty)\) |
| grow_policy | character | depthwise | depthwise, lossguide | - |
| interaction_constraints | untyped | - | - | - |
| iterationrange | untyped | - | - | - |
| lambda | numeric | 1 | - | \([0, \infty)\) |
| max_bin | integer | 256 | - | \([2, \infty)\) |
| max_cached_hist_node | integer | 65536 | - | \((-\infty, \infty)\) |
| max_cat_to_onehot | integer | - | - | \((-\infty, \infty)\) |
| max_cat_threshold | numeric | - | - | \((-\infty, \infty)\) |
| max_delta_step | numeric | 0 | - | \([0, \infty)\) |
| max_depth | integer | 6 | - | \([0, \infty)\) |
| max_leaves | integer | 0 | - | \([0, \infty)\) |
| maximize | logical | NULL | TRUE, FALSE | - |
| min_child_weight | numeric | 1 | - | \([0, \infty)\) |
| missing | numeric | NA | - | \((-\infty, \infty)\) |
| monotone_constraints | untyped | 0 | - | - |
| nrounds | integer | - | - | \([1, \infty)\) |
| normalize_type | character | tree | tree, forest | - |
| nthread | integer | - | - | \([1, \infty)\) |
| num_parallel_tree | integer | 1 | - | \([1, \infty)\) |
| objective | untyped | "binary:logistic" | - | - |
| one_drop | logical | FALSE | TRUE, FALSE | - |
| print_every_n | integer | 1 | - | \([1, \infty)\) |
| rate_drop | numeric | 0 | - | \([0, 1]\) |
| refresh_leaf | logical | TRUE | TRUE, FALSE | - |
| seed | integer | - | - | \((-\infty, \infty)\) |
| seed_per_iteration | logical | FALSE | TRUE, FALSE | - |
| sampling_method | character | uniform | uniform, gradient_based | - |
| sample_type | character | uniform | uniform, weighted | - |
| save_name | untyped | NULL | - | - |
| save_period | integer | NULL | - | \([0, \infty)\) |
| scale_pos_weight | numeric | 1 | - | \((-\infty, \infty)\) |
| skip_drop | numeric | 0 | - | \([0, 1]\) |
| subsample | numeric | 1 | - | \([0, 1]\) |
| top_k | integer | 0 | - | \([0, \infty)\) |
| training | logical | FALSE | TRUE, FALSE | - |
| tree_method | character | auto | auto, exact, approx, hist, gpu_hist | - |
| tweedie_variance_power | numeric | 1.5 | - | \([1, 2]\) |
| updater | untyped | - | - | - |
| use_rmm | logical | - | TRUE, FALSE | - |
| validate_features | logical | TRUE | TRUE, FALSE | - |
| verbose | integer | - | - | \([0, 2]\) |
| verbosity | integer | - | - | \([0, 2]\) |
| xgb_model | untyped | NULL | - | - |
| use_pred_offset | logical | - | TRUE, FALSE | - |
Offset
If a Task has a column with the role offset, it will automatically be used during training.
The offset is incorporated through the xgboost::xgb.DMatrix interface, using the base_margin field.
During prediction, the offset column from the test set is used only if use_pred_offset = TRUE (the default) and the Task has a column with the role offset.
The test set offsets are passed via the base_margin argument of xgboost::predict.xgb.Booster().
Otherwise, i.e. if use_pred_offset = FALSE or the Task has no column with the offset role, the (possibly estimated) global intercept from the training set is applied.
See https://xgboost.readthedocs.io/en/stable/tutorials/intercept.html.
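A minimal sketch of using an offset (the column name "my_offset" is hypothetical; the Task must contain such a numeric column):
# assign the "offset" role to a (hypothetical) numeric column "my_offset"
task$set_col_roles("my_offset", roles = "offset")
# the offset is passed as base_margin during training; with use_pred_offset = TRUE
# (the default) the test set offsets are also used at prediction time
learner = lrn("classif.xgboost", nrounds = 100, use_pred_offset = TRUE)
learner$train(task)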
References
Chen, Tianqi, Guestrin, Carlos (2016). “XGBoost: A scalable tree boosting system.” In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 785–794. ACM. doi:10.1145/2939672.2939785.
See also
Chapter in the mlr3book: https://mlr3book.mlr-org.com/chapters/chapter2/data_and_basic_modeling.html#sec-learners
Package mlr3extralearners for more learners.
as.data.table(mlr_learners) for a table of available Learners in the running session (depending on the loaded packages).
mlr3pipelines to combine learners with pre- and postprocessing steps.
Extension packages for additional task types:
mlr3proba for probabilistic supervised regression and survival analysis.
mlr3cluster for unsupervised clustering.
mlr3tuning for tuning of hyperparameters, mlr3tuningspaces for established default tuning spaces.
Other Learner:
mlr_learners_classif.cv_glmnet,
mlr_learners_classif.glmnet,
mlr_learners_classif.kknn,
mlr_learners_classif.lda,
mlr_learners_classif.log_reg,
mlr_learners_classif.multinom,
mlr_learners_classif.naive_bayes,
mlr_learners_classif.nnet,
mlr_learners_classif.qda,
mlr_learners_classif.ranger,
mlr_learners_classif.svm,
mlr_learners_regr.cv_glmnet,
mlr_learners_regr.glmnet,
mlr_learners_regr.kknn,
mlr_learners_regr.km,
mlr_learners_regr.lm,
mlr_learners_regr.nnet,
mlr_learners_regr.ranger,
mlr_learners_regr.svm,
mlr_learners_regr.xgboost
Super classes
mlr3::Learner -> mlr3::LearnerClassif -> LearnerClassifXgboost
Active bindings
internal_valid_scores (named list() or NULL)
The validation scores extracted from model$evaluation_log. If early stopping is activated, this contains the validation scores of the model for the optimal nrounds, otherwise the scores are taken from the final boosting round nrounds.
internal_tuned_values (named list() or NULL)
If early stopping is activated, this returns a list with nrounds, which is extracted from $best_iteration of the model, and NULL otherwise.
validate (numeric(1) or character(1) or NULL)
How to construct the internal validation data. This parameter can be either NULL, a ratio, "test", or "predefined".
model (any)
The fitted model. Only available after $train() has been called.
Methods
Inherited methods
mlr3::Learner$base_learner()
mlr3::Learner$configure()
mlr3::Learner$encapsulate()
mlr3::Learner$format()
mlr3::Learner$help()
mlr3::Learner$predict()
mlr3::Learner$predict_newdata()
mlr3::Learner$print()
mlr3::Learner$reset()
mlr3::Learner$selected_features()
mlr3::Learner$train()
mlr3::LearnerClassif$predict_newdata_fast()
Method importance()
The importance scores are calculated with xgboost::xgb.importance().
Returns
Named numeric().
Examples
# Define the Learner and set parameter values
learner = lrn("classif.xgboost")
print(learner)
#>
#> ── <LearnerClassifXgboost> (classif.xgboost): Extreme Gradient Boosting ────────
#> • Model: -
#> • Parameters: nrounds=1000, nthread=1, verbose=0, verbosity=0,
#> use_pred_offset=TRUE
#> • Validate: NULL
#> • Packages: mlr3, mlr3learners, and xgboost
#> • Predict Types: [response] and prob
#> • Feature Types: logical, integer, and numeric
#> • Encapsulation: none (fallback: -)
#> • Properties: hotstart_forward, importance, internal_tuning, missings,
#> multiclass, offset, twoclass, validation, and weights
#> • Other settings: use_weights = 'use'
# Define a Task
task = tsk("sonar")
# Create train and test set
ids = partition(task)
# Train the learner on the training ids
learner$train(task, row_ids = ids$train)
# Print the model
print(learner$model)
#> ##### xgb.Booster
#> call:
#> xgboost::xgb.train(params = pv[names(pv) %in% formalArgs(xgboost::xgb.params)],
#> data = xgb_data, nrounds = pv$nrounds, evals = pv$evals,
#> custom_metric = pv$custom_metric, verbose = pv$verbose, print_every_n = pv$print_every_n,
#> early_stopping_rounds = pv$early_stopping_rounds, maximize = pv$maximize,
#> save_period = pv$save_period, save_name = pv$save_name, callbacks = pv$callbacks %??%
#> list())
#> # of features: 60
#> # of rounds: 1000
# Importance method
if ("importance" %in% learner$properties) print(learner$importance())
#> V12 V52 V45 V20 V48 V36
#> 0.1913429724 0.1156073538 0.0754178506 0.0727729692 0.0705704052 0.0649181658
#> V11 V23 V49 V44 V47 V37
#> 0.0511647632 0.0342193616 0.0322207985 0.0284559922 0.0269096965 0.0199268940
#> V31 V28 V25 V51 V43 V21
#> 0.0189730382 0.0185971956 0.0169041089 0.0162353892 0.0125918171 0.0120586522
#> V9 V15 V5 V24 V34 V4
#> 0.0104133960 0.0092412054 0.0090557387 0.0083255828 0.0079496959 0.0071776699
#> V40 V46 V10 V55 V38 V17
#> 0.0070807163 0.0070548352 0.0070059491 0.0065617249 0.0061949628 0.0060334511
#> V32 V53 V54 V60 V59 V50
#> 0.0058413994 0.0041317725 0.0033295236 0.0032392012 0.0026154487 0.0021014714
#> V8 V39 V33 V29 V27 V18
#> 0.0015051329 0.0013138548 0.0011376653 0.0011284622 0.0006982279 0.0006765504
#> V57 V58 V1
#> 0.0005266029 0.0004945725 0.0002777623
# Make predictions for the test rows
predictions = learner$predict(task, row_ids = ids$test)
# Score the predictions
predictions$score()
#> classif.ce
#> 0.2028986
# Early stopping
learner = lrn("classif.xgboost", nrounds = 100, early_stopping_rounds = 10, validate = 0.3)
# Train learner with early stopping
learner$train(task)
# Inspect optimal nrounds and validation performance
learner$internal_tuned_values
#> $nrounds
#> [1] 32
#>
learner$internal_valid_scores
#> $logloss
#> [1] 0.3309164
#>