Produce the cross-validated loss (risk) for a `super_learner` specified by a closure that accepts data and returns a `super_learner` prediction function.

Usage

cv_super_learner(
  data,
  learners,
  formulas,
  y_variable,
  n_folds = 5,
  determine_super_learner_weights = determine_super_learner_weights_nnls,
  continuous_or_discrete = "continuous",
  cv_schema = cv_random_schema,
  outcome_type = "continuous",
  extra_learner_args = NULL,
  verbose_output = FALSE,
  loss_metric
)

Arguments

data

Data to use in training a `super_learner`.

learners

A list of learner functions, i.e., closures that accept data and a formula and return prediction functions. See Details.

formulas

Either a single regression formula or a vector of regression formulas.

y_variable

Typically `y_variable` can be inferred automatically from the `formulas`, but if needed, the y_variable can be specified explicitly.

n_folds

The number of cross-validation folds to use in constructing the `super_learner`.

determine_super_learner_weights

A function/method to determine the weights for each of the candidate `learners`. The default is to use `determine_super_learner_weights_nnls`.

continuous_or_discrete

Defaults to `'continuous'`, but can be set to `'discrete'`.

cv_schema

A function that takes `data` and `n_folds` and returns a list containing `training_data` and `validation_data`, each of which is a list of `n_folds` data frames.
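For illustration, a custom schema matching this interface might look like the following. This is a hypothetical sketch of random fold assignment, not the package's `cv_random_schema` implementation:

```r
# A hypothetical cv_schema: assign rows to folds at random, then
# return the training/validation data frames for each fold.
my_cv_schema <- function(data, n_folds) {
  fold_ids <- sample(rep(1:n_folds, length.out = nrow(data)))
  list(
    training_data   = lapply(1:n_folds, function(k) data[fold_ids != k, , drop = FALSE]),
    validation_data = lapply(1:n_folds, function(k) data[fold_ids == k, , drop = FALSE])
  )
}
```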

outcome_type

One of `'continuous'`, `'binary'`, or `'density'`. `outcome_type` is used to infer the correct `determine_super_learner_weights` function if it is not explicitly passed.

extra_learner_args

A list of the same length as `learners` with additional arguments to pass to each of the specified learners.
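As a sketch, per-learner arguments might be supplied like so, assuming two learners were specified; the particular arguments shown are illustrative, not required by any specific learner:

```r
# One list element per learner, in the same order as `learners`;
# use NULL for learners that need no extra arguments.
extra_learner_args = list(
  NULL,               # no extra arguments for the first learner
  list(ntree = 500)   # e.g., an argument passed through to a random forest learner
)
```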

verbose_output

If `verbose_output = TRUE` then return a list containing the fit learners with their predictions on held-out data as well as the prediction function closure from the trained `super_learner`.

loss_metric

A loss metric function, like the mean-squared-error or negative-log-loss, to be used in evaluating the learners on held-out data and minimized through convex optimization. A loss metric should take two vector arguments, predictions and true outcomes, and return a single statistic summarizing the performance of a learner. Defaults to the mean-squared-error, `nadir:::mse()`.
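A custom loss metric following this contract could be sketched as below. This mean-absolute-error is an illustrative example, not a function exported by nadir:

```r
# A loss metric takes predictions and true outcomes and returns
# a single summary number; smaller values indicate better fit.
mean_absolute_error <- function(predictions, true_outcomes) {
  mean(abs(predictions - true_outcomes))
}

# Then, for example: cv_super_learner(..., loss_metric = mean_absolute_error)
```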

Value

A list containing `$trained_learners` and `$cv_loss`, which respectively include 1) the `super_learner` models trained on each fold of the data along with their holdout predictions, and 2) the cross-validated estimate of the risk (expected loss) on held-out data.

Details

The idea is that `cv_super_learner` splits the data into training/validation splits, trains `super_learner` on each training split, and then evaluates their predictions on the held-out validation data, computing the chosen `loss_metric` (mean-squared-error by default) on those held-out data.

This function prints a message if the `loss_metric` argument is not set explicitly, letting the user know that the mean-squared-error will be used by default. Pass in `loss_metric = nadir:::mse` explicitly if you'd like to suppress this message, or use a similar approach with the appropriate loss function depending on context.

Examples

if (FALSE) { # \dontrun{
  cv_super_learner(
    data = mtcars,
    formulas = mpg ~ cyl + hp,
    learners = list(lnr_mean, lnr_lm, lnr_rf))

  cv_super_learner(
    data = mtcars,
    formulas = am ~ cyl + hp,
    learners = list(lnr_mean, lnr_lm, lnr_logistic, lnr_rf_binary),
    outcome_type = 'binary')
} # }