Let’s start with an extremely simple example: a prediction problem on a continuous outcome, where we want to use cross-validation to minimize the expected risk/loss on held-out data across a few different models. We’ll use the `iris` dataset to do this. `nadir::super_learner()` strives to keep the syntax simple, so the simplest call to `super_learner()` might look something like this:
```r
super_learner(
  data = iris,
  formula = Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
  learners = list(lnr_lm, lnr_rf, lnr_earth, lnr_mean))
#> function(newdata) {
#>   # for each model, predict on the newdata and apply the model weights
#>   parallel_lapply(1:length(fit_learners), function(i) {
#>     fit_learners[[i]](newdata) * learner_weights[[i]]
#>   }) |>
#>     Reduce(`+`, x = _) # aggregate across the weighted model predictions
#> }
#> <bytecode: 0x1157b5c40>
#> <environment: 0x1157dcf28>
#> attr(,"class")
#> [1] "function"           "nadir_sl_predictor"
```
Notice what it returns: a function of `newdata` that predicts across the learners, sums up according to the learned weights, and returns the ensemble predictions.
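To make that aggregation concrete, here is a toy sketch of the same weighted-ensemble idea using two hand-fit linear models; `fit1`, `fit2`, and the 0.7/0.3 weights are illustrative stand-ins, not models or weights `super_learner()` would actually produce:

```r
# toy version of the returned predictor: each learner predicts on
# newdata, and the predictions are combined as a weighted sum
fit1 <- lm(Petal.Width ~ Petal.Length, data = iris)
fit2 <- lm(Petal.Width ~ Sepal.Length + Sepal.Width, data = iris)
toy_weights <- c(0.7, 0.3)  # illustrative, not learned

toy_ensemble <- function(newdata) {
  toy_weights[1] * predict(fit1, newdata) +
    toy_weights[2] * predict(fit2, newdata)
}
toy_ensemble(head(iris))
```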
We can store that learned predictor function and use it:
```r
# We recommend storing more complicated arguments used repeatedly to simplify
# the call to super_learner()
petal_formula <- Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width
learners <- list(lnr_lm, lnr_rf, lnr_earth, lnr_mean)

learned_sl_predictor <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)
```
In particular, we can use it to predict on the same dataset,
```r
learned_sl_predictor(iris) |> head()
#>         1         2         3         4         5         6 
#> 0.2298699 0.1703070 0.1897956 0.2529110 0.2507371 0.3838603
```
On a random sample of it,
```r
learned_sl_predictor(iris[sample.int(size = 10, n = nrow(iris)), ]) |>
  head()
#>      130      138      141      150       65      118 
#> 1.947879 1.994989 2.060553 1.866126 1.186865 2.365248
```
Or on completely new data.
```r
fake_iris_data <- cbind.data.frame(
  Sepal.Length = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Length),
    sd = sd(iris$Sepal.Length)
  ),
  Sepal.Width = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Width),
    sd = sd(iris$Sepal.Width)
  ),
  Petal.Length = rnorm(
    n = 6,
    mean = mean(iris$Petal.Length),
    sd = sd(iris$Petal.Length)
  )
)

learned_sl_predictor(fake_iris_data) |>
  head()
#>         1         2         3         4         5         6 
#> 1.0684633 2.1233415 0.3412173 1.8459358 1.9028170 1.7436178
```
## Getting More Information Out
Suppose we want to know a lot more about the `super_learner()` process: how it weighted the candidate learners, what the candidate learners predicted on the held-out data, and so on. For that, we use the `verbose = TRUE` option.
```r
sl_model_iris <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners,
  verbose = TRUE)

str(sl_model_iris, max.level = 2)
#> List of 5
#>  $ sl_predictor       :function (newdata)
#>   ..- attr(*, "srcref")= 'srcref' int [1:8] 448 39 454 3 39 3 2442 2448
#>   .. ..- attr(*, "srcfile")=Classes 'srcfilealias', 'srcfile' <environment: 0x115606d70>
#>  $ y_variable         : chr "Petal.Width"
#>  $ outcome_type       : chr "continuous"
#>  $ learner_weights    : Named num [1:4] 0.588 0.412 0 0
#>   ..- attr(*, "names")= chr [1:4] "lm" "rf" "earth" "mean"
#>  $ holdout_predictions: tibble [150 × 6] (S3: tbl_df/tbl/data.frame)
#>  - attr(*, "class")= chr [1:2] "list" "nadir_sl_verbose_output"
```
To put some description to what’s contained in the `verbose = TRUE` output from `super_learner()`:

- `$sl_predictor()`: a prediction function that takes `newdata`.
- `$y_variable` and `$outcome_type`: character fields that provide some context on the learning task that was performed.
- `$learner_weights`: the weights given to each of the candidate learners.
- `$holdout_predictions`: a data.frame of predictions from each of the candidate learners, along with the actual outcome from the held-out data.
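Since the verbose output is just a list (with a `nadir_sl_verbose_output` class attached), these components can be pulled out with the usual `$` accessor, for example:

```r
# inspect the learned ensemble weights and the held-out predictions
sl_model_iris$learner_weights
sl_model_iris$holdout_predictions |> head()
```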
We can call `compare_learners()` on the verbose output from `super_learner()` if we want to assess how the different learners performed. We can also call `cv_super_learner()` with the same arguments as `super_learner()` to wrap the `super_learner()` call in another layer of cross-validation and assess how `super_learner()` itself performs on held-out data.
```r
compare_learners(sl_model_iris)
#> Inferring the loss metric for learner comparison based on the outcome type:
#> outcome_type=continuous -> using mean squared error
#> # A tibble: 1 × 4
#>       lm     rf earth  mean
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1 0.0390 0.0454  1.70 0.612
```
```r
cv_super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)$cv_loss
#> The loss_metric is being inferred based on the outcome_type=continuous -> using CV-MSE
#> [1] 0.03496255
```
We can, of course, do anything with a super learned model that we would do with a conventional prediction model, such as calculating performance statistics.
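As a quick sketch, here is one way to compute R² in-sample from the stored predictor; the hand-rolled formula below is the standard calculation, not a nadir helper, and for an honest held-out estimate you would prefer `cv_super_learner()`:

```r
# in-sample R² = 1 - SS_residual / SS_total
preds <- learned_sl_predictor(iris)
ss_res <- sum((iris$Petal.Width - preds)^2)
ss_tot <- sum((iris$Petal.Width - mean(iris$Petal.Width))^2)
1 - ss_res / ss_tot
```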