Super learning with functional programming!
Usage
super_learner(
  data,
  learners,
  formulas,
  y_variable,
  n_folds = 5,
  determine_super_learner_weights = determine_super_learner_weights_nnls,
  continuous_or_discrete = "continuous",
  cv_schema = cv_random_schema,
  extra_learner_args = NULL,
  verbose_output = FALSE
)
Arguments
- data
Data to use in training a `super_learner`.
- learners
A list of learners: functions that each accept `data` and a `formula` and return a prediction closure. See Details.
- formulas
Either a single regression formula or a vector of regression formulas.
- y_variable
Typically the `y_variable` can be inferred automatically from the `formulas`, but if needed it can be specified explicitly.
- n_folds
The number of cross-validation folds to use in constructing the `super_learner`.
- determine_super_learner_weights
A function/method to determine the weights for each of the candidate `learners`. The default is to use `determine_super_learner_weights_nnls`.
- continuous_or_discrete
Whether to fit a continuous super learner (a weighted combination of the candidate learners) or a discrete one (the single best-performing candidate learner). Defaults to `'continuous'`; can be set to `'discrete'`.
- cv_schema
A function that takes `data` and `n_folds` and returns a list containing `training_data` and `validation_data`, each of which is a list of `n_folds` data frames.
- extra_learner_args
A list of the same length as `learners` with additional arguments to pass to each of the specified learners.
- verbose_output
If `verbose_output = TRUE` then return a list containing the fit learners with their predictions on held-out data as well as the prediction function closure from the trained `super_learner`.
Details
The goal of any super learner is to use cross-validation with a set of candidate learners to 1) evaluate how the candidate learners perform on held-out data and 2) use that evaluation either to produce a weighted average of the candidate learners (for a continuous super learner) or to pick the single best candidate learner (for a discrete super learner).
Super learner and its statistically desirable properties have been written about at length, including at least the following references:
* <https://biostats.bepress.com/ucbbiostat/paper222/>
* <https://www.stat.berkeley.edu/users/laan/Class/Class_subpages/BASS_sec1_3.1.pdf>
`nadir::super_learner` adopts several user-interface design-perspectives that will be useful to know in understanding what it does and how it works:
* The specification of learners should be _very flexible_: the only real constraint is that the candidate learners be designed for the same prediction problem, while their details can vary wildly from learner to learner.
* It should be easy to specify a customized or new learner.
`nadir::super_learner` at its core accepts `data`, a `formula` (a single one passed to `formulas` is fine), and a list of `learners`.
`learners` are taken to be lists of functions of the following specification:
* a learner must accept `data` and `formula` arguments,
* a learner may accept more arguments, and
* a learner must return a prediction function that accepts `newdata` and produces a vector of prediction values given `newdata`.
In essence, a learner is specified to be a function taking (`data`, `formula`, ...) and returning a _closure_ (see <http://adv-r.had.co.nz/Functional-programming.html#closures> for an introduction to closures) which is a function accepting `newdata` returning predictions.
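As a concrete illustration of this convention, a minimal learner wrapping `lm` might look like the following (the name `lm_learner` is illustrative, not part of the package):

```r
# A hypothetical custom learner: takes (data, formula, ...) and returns a
# closure over the fitted model that predicts on newdata.
lm_learner <- function(data, formula, ...) {
  model <- lm(formula, data = data, ...)
  function(newdata) {
    predict(model, newdata = newdata)
  }
}

# The returned closure produces a vector of predictions given newdata:
fit <- lm_learner(mtcars, mpg ~ hp + wt)
preds <- fit(mtcars)
```

Any modeling function can be wrapped this way, as long as the wrapper conforms to the `(data, formula, ...)` in, prediction-closure out contract.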
Since many candidate learners will have hyperparameters that should be tuned, like the depth of trees in random forests or the `lambda` parameter for `glmnet`, extra arguments can be passed to each learner via the `extra_learner_args` argument. `extra_learner_args` should be a list of lists, one list of extra arguments for each learner. If some learners need no additional arguments but others do, you can put a `NULL` value in the corresponding positions of `extra_learner_args`. See the examples.
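The shape of `extra_learner_args` can be sketched as follows (the argument names `lambda` and `ntree` are illustrative hyperparameters, not requirements of the package):

```r
# Sketch: one list of extra arguments per learner, in the same order as
# the learners. Learners needing no extras get NULL in their position.
extra_args <- list(
  NULL,                # first learner: no extra arguments
  list(lambda = 0.1),  # second learner: e.g. a glmnet penalty
  list(ntree = 500)    # third learner: e.g. a random forest size
)
```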
In order to seamlessly support features implemented by extensions to the formula syntax (like random intercepts or slopes using the `(age | strata)` syntax in `lme4`, or splines like `s(age, by = strata)` in `mgcv`), we allow the `formulas` argument to be either a single fixed formula that `super_learner` will use for all the models, or a vector of formulas, one for each learner specified.
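For example, a per-learner vector of formulas might look like this (the formula contents are illustrative only):

```r
# Sketch: one formula per learner, allowing learner-specific syntax.
# Combining formulas with c() yields a list of formula objects.
formulas <- c(
  mpg ~ hp + wt,         # plain formula, e.g. for lm
  mpg ~ hp + (1 | cyl)   # lme4-style random intercept for a mixed model
)
```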
Note that in the examples a mean-squared error (MSE) is calculated on the same data used for training, which is only useful as a crude diagnostic that `super_learner` is working. A more rigorous performance metric for evaluating `super_learner` is the CV-RMSE produced by `cv_super_learner`.