Super learning with functional programming!
Usage
super_learner(
  data,
  learners,
  formulas,
  y_variable,
  n_folds = 5,
  determine_super_learner_weights = determine_super_learner_weights_nnls,
  continuous_or_discrete = "continuous",
  cv_schema = cv_random_schema,
  extra_learner_args = NULL,
  verbose_output = FALSE
)
Arguments
- data
Data to use in training a `super_learner`.
- learners
A list of learners: functions that each accept `data` and a `formula` and return a prediction closure. See Details.
- formulas
Either a single regression formula or a vector of regression formulas.
- y_variable
Typically the `y_variable` can be inferred automatically from the `formulas`, but if needed it can be specified explicitly.
- n_folds
The number of cross-validation folds to use in constructing the `super_learner`.
- determine_super_learner_weights
A function/method to determine the weights for each of the candidate `learners`. The default is to use `determine_super_learner_weights_nnls`.
- continuous_or_discrete
Whether to fit a continuous super learner (a weighted combination of the candidate learners) or a discrete one (the single best-performing candidate learner). Defaults to `'continuous'`; can be set to `'discrete'`.
- cv_schema
A function that takes `data` and `n_folds` and returns a list containing `training_data` and `validation_data`, each of which is a list of `n_folds` data frames.
- extra_learner_args
A list of the same length as `learners` with additional arguments to pass to each of the specified learners.
- verbose_output
If `verbose_output = TRUE` then return a list containing the fit learners with their predictions on held-out data as well as the prediction function closure from the trained `super_learner`.
Details
The goal of any super learner is to use cross-validation with a set of candidate learners to 1) evaluate how the candidate learners perform on held-out data and 2) use that evaluation either to produce a weighted average of the candidate learners (for a continuous super learner) or to pick the single best candidate learner (for a discrete super learner).
Super learner and its statistically desirable properties have been written about at length, including at least the following references:
* <https://biostats.bepress.com/ucbbiostat/paper222/>
* <https://www.stat.berkeley.edu/users/laan/Class/Class_subpages/BASS_sec1_3.1.pdf>
`nadir::super_learner` adopts several user-interface design-perspectives that will be useful to know in understanding what it does and how it works:
* The specification of learners should be _very flexible_: the only real constraint is that the candidate learners be designed for the same prediction problem, while their details can vary wildly from learner to learner.
* It should be easy to specify a customized or new learner.
`nadir::super_learner` at its core accepts `data`, a `formula` (a single one passed to `formulas` is fine), and a list of `learners`.
`learners` are taken to be lists of functions of the following specification:
* a learner must accept `data` and `formula` arguments,
* a learner may accept more arguments, and
* a learner must return a prediction function that accepts `newdata` and produces a vector of prediction values given `newdata`.
In essence, a learner is specified to be a function taking (`data`, `formula`, ...) and returning a _closure_ (see <http://adv-r.had.co.nz/Functional-programming.html#closures> for an introduction to closures) which is a function accepting `newdata` returning predictions.
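As a concrete illustration of this convention, a minimal learner wrapping `lm` might look like the following (the name `lm_learner` is illustrative, not part of the package):

```r
# A hypothetical custom learner: takes (data, formula, ...) and returns a
# closure over the fitted model that predicts on newdata.
lm_learner <- function(data, formula, ...) {
  model <- lm(formula, data = data, ...)
  function(newdata) {
    predict(model, newdata = newdata)
  }
}

# The returned closure produces a vector of predictions given newdata:
fit <- lm_learner(mtcars, mpg ~ hp + wt)
preds <- fit(mtcars)
```

Any modeling function can be wrapped this way, as long as the wrapper conforms to the `(data, formula, ...)` in, prediction-closure out contract.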
Since many candidate learners will have hyperparameters that should be tuned, like the depth of trees in random forests or the `lambda` parameter for `glmnet`, extra arguments can be passed to each learner via the `extra_learner_args` argument. `extra_learner_args` should be a list of lists, one list of extra arguments for each learner. If some learners need no additional arguments but others do, you can put a `NULL` value in the corresponding positions of `extra_learner_args`. See the examples.
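The shape of `extra_learner_args` can be sketched as follows (the argument names `lambda` and `ntree` are illustrative hyperparameters, not requirements of the package):

```r
# Sketch: one list of extra arguments per learner, in the same order as
# the learners. Learners needing no extras get NULL in their position.
extra_args <- list(
  NULL,                # first learner: no extra arguments
  list(lambda = 0.1),  # second learner: e.g. a glmnet penalty
  list(ntree = 500)    # third learner: e.g. a random forest size
)
```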
In order to seamlessly support features implemented by extensions to the formula syntax (like random intercepts or slopes using the `(age | strata)` syntax in `lme4`, or splines like `s(age, by = strata)` in `mgcv`), we allow the `formulas` argument to be either a single fixed formula that `super_learner` will use for all the models, or a vector of formulas, one for each learner specified.
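For example, a per-learner vector of formulas might look like this (the formula contents are illustrative only):

```r
# Sketch: one formula per learner, allowing learner-specific syntax.
# Combining formulas with c() yields a list of formula objects.
formulas <- c(
  mpg ~ hp + wt,         # plain formula, e.g. for lm
  mpg ~ hp + (1 | cyl)   # lme4-style random intercept for a mixed model
)
```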
Note that in the examples a mean-squared error (MSE) is calculated on the same data used for training, which is only useful as a crude diagnostic that `super_learner` is working. A more rigorous performance metric for evaluating `super_learner` is the CV-RMSE produced by `cv_super_learner`.