Produce cv-rmse for a `super_learner` specified by a closure that accepts data and returns a `super_learner` prediction function.
Usage
cv_super_learner(
  data,
  learners,
  formulas,
  y_variable,
  n_folds = 5,
  determine_super_learner_weights = determine_super_learner_weights_nnls,
  continuous_or_discrete = "continuous",
  cv_schema = cv_random_schema,
  outcome_type = "continuous",
  extra_learner_args = NULL,
  verbose_output = FALSE,
  loss_metric
)
Arguments
- data
Data to use in training a `super_learner`.
- learners
A list of predictor/closure-returning-functions. See Details.
- formulas
Either a single regression formula or a vector of regression formulas.
- y_variable
Typically `y_variable` can be inferred automatically from the `formulas`, but it can be specified explicitly if needed.
- n_folds
The number of cross-validation folds to use in constructing the `super_learner`.
- determine_super_learner_weights
A function/method to determine the weights for each of the candidate `learners`. The default is to use `determine_super_learner_weights_nnls`.
- continuous_or_discrete
Defaults to `'continuous'`, but can be set to `'discrete'`.
- cv_schema
A function that takes `data` and `n_folds` and returns a list containing `training_data` and `validation_data`, each of which is a list of `n_folds` data frames.
- outcome_type
One of 'continuous', 'binary', or 'density'. `outcome_type` is used to infer the correct `determine_super_learner_weights` function if it is not explicitly passed.
- extra_learner_args
A list of equal length to the `learners` with additional arguments to pass to each of the specified learners.
- verbose_output
If `verbose_output = TRUE` then return a list containing the fit learners with their predictions on held-out data as well as the prediction function closure from the trained `super_learner`.
- loss_metric
A loss metric function, like the mean-squared-error or negative-log-loss to be used in evaluating the learners on held-out data and minimized through convex optimization. A loss metric should take two (vector) arguments: predictions, and true outcomes, and produce a single statistic summarizing the performance of each learner. Defaults to the mean-squared-error, `nadir:::mse()`.
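The `cv_schema` contract described above (a function of `data` and `n_folds` returning `training_data` and `validation_data`, each a list of `n_folds` data frames) can be illustrated with a minimal base R sketch. The name `my_random_schema` is hypothetical and is not part of the package; the package's own `cv_random_schema` may be implemented differently.

```r
# Hypothetical schema (not the package's implementation): assign each row to
# one of n_folds folds at random, then build one training/validation pair per fold.
my_random_schema <- function(data, n_folds = 5) {
  # Balanced fold labels 1..n_folds, shuffled across the rows.
  fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(data)))
  list(
    # Fold k trains on every row NOT labelled k ...
    training_data = lapply(seq_len(n_folds), function(k) {
      data[fold_id != k, , drop = FALSE]
    }),
    # ... and validates on the rows labelled k.
    validation_data = lapply(seq_len(n_folds), function(k) {
      data[fold_id == k, , drop = FALSE]
    })
  )
}

set.seed(1)
splits <- my_random_schema(mtcars, n_folds = 4)
```

Each row of `mtcars` lands in exactly one validation fold, and the training and validation frames for a given fold together cover the full data.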
Value
A list containing `$trained_learners` and `$cv_loss`, which respectively include 1) the trained super learner models on each fold of the data and their holdout predictions, and 2) the cross-validated estimate of the risk (expected loss) on held-out data.
Details
The idea is that `cv_super_learner` splits the data into training/validation splits, trains `super_learner` on each training split, and then evaluates their predictions on the held-out validation data, calculating a root-mean-squared-error on those held-out data.
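The procedure in the preceding paragraph can be sketched in base R. This is only an illustration under stated assumptions: `fit_stand_in` is a hypothetical stand-in (a single `lm` fit) for `super_learner`, and averaging the per-fold losses before taking the square root is one reasonable aggregation, not necessarily the one the package uses.

```r
set.seed(42)
n_folds <- 5

# Random, balanced fold labels for each row of the example data.
fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(mtcars)))

# Hypothetical stand-in for `super_learner`: fit on the training split and
# return a closure that predicts on new data.
fit_stand_in <- function(training_data) {
  model <- lm(mpg ~ wt + hp, data = training_data)
  function(newdata) unname(predict(model, newdata = newdata))
}

# Loss metric taking (predictions, true outcomes) and returning one number.
mse <- function(predictions, true_outcomes) {
  mean((predictions - true_outcomes)^2)
}

# Train on each training split, evaluate on the matching held-out split.
fold_losses <- vapply(seq_len(n_folds), function(k) {
  predict_fn <- fit_stand_in(mtcars[fold_id != k, ])
  held_out <- mtcars[fold_id == k, ]
  mse(predict_fn(held_out), held_out$mpg)
}, numeric(1))

cv_mse <- mean(fold_losses)  # assumed aggregation: mean of per-fold losses
cv_rmse <- sqrt(cv_mse)      # root-mean-squared-error on held-out data
```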
This function prints a message if the `loss_metric` argument is not set explicitly, letting the user know that the mean-squared-error will be used by default. Pass in `loss_metric = nadir:::mse` to `super_learner()` if you'd like to suppress this message, or use a similar approach for the appropriate loss function depending on context.
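As a rough illustration of what a loss function suited to a given context might look like, the base R sketch below defines three functions with the shape described under Arguments: two vector arguments (predictions, then true outcomes) and a single summary statistic. The names `mse_loss`, `mae_loss`, and `neg_log_loss` are hypothetical and are not functions exported by the package.

```r
# Mean-squared-error, suited to continuous outcomes.
mse_loss <- function(predictions, true_outcomes) {
  mean((predictions - true_outcomes)^2)
}

# Mean absolute error, a more outlier-tolerant alternative.
mae_loss <- function(predictions, true_outcomes) {
  mean(abs(predictions - true_outcomes))
}

# Negative log loss for binary (0/1) outcomes, where predictions are
# probabilities. Probabilities are clipped away from 0 and 1 so that
# log() never returns -Inf.
neg_log_loss <- function(predictions, true_outcomes) {
  eps <- 1e-15
  p <- pmin(pmax(predictions, eps), 1 - eps)
  -mean(true_outcomes * log(p) + (1 - true_outcomes) * log(1 - p))
}

example_mse <- mse_loss(c(1, 2, 3), c(1, 2, 5))
example_mae <- mae_loss(c(1, 2, 3), c(1, 2, 5))
example_nll <- neg_log_loss(c(0.5, 0.5), c(0, 1))
```

Going by the signature shown under Usage, a function of this shape would be supplied through the `loss_metric` argument.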