
Let’s start with an extremely simple example: a prediction problem on a continuous outcome, where we want to use cross-validation to minimize the expected risk/loss on held-out data across a few different models.
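
Concretely, with V cross-validation folds and squared-error loss (which nadir defaults to for continuous outcomes, as the output below confirms), the held-out risk estimated for each candidate model f is

$$\text{CV-MSE}(f) = \frac{1}{n} \sum_{v=1}^{V} \sum_{i \in \text{fold } v} \left( Y_i - \hat{f}^{(-v)}(X_i) \right)^2,$$

where $\hat{f}^{(-v)}$ denotes the model fit on all the data except fold v.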

We’ll use the iris dataset to do this.

nadir::super_learner() strives to keep the syntax simple, so the simplest call to super_learner() might look something like this:

super_learner(
  data = iris,
  formula = Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width,
  learners = list(lnr_lm, lnr_rf, lnr_earth, lnr_mean))
#> $predict
#> function(newdata) {
#>     # for each model, predict on the newdata and apply the model weights
#>     future_lapply(1:length(fit_learners), function(i) {
#>       fit_learners[[i]](newdata) * learner_weights[[i]]
#>     }, future.seed = TRUE) |>
#>       Reduce(`+`, x = _) # aggregate across the weighted model predictions
#>   }
#> <bytecode: 0x13ce840c0>
#> <environment: 0x13ce91aa0>
#> 
#> $y_variable
#> [1] "Petal.Width"
#> 
#> $outcome_type
#> [1] "continuous"
#> 
#> $learner_weights
#>        lm        rf     earth      mean 
#> 0.5819815 0.4180185 0.0000000 0.0000000 
#> 
#> $holdout_predictions
#> # A tibble: 150 × 6
#>    .sl_fold    lm    rf earth  mean Petal.Width
#>       <int> <dbl> <dbl> <dbl> <dbl>       <dbl>
#>  1        1 0.282 0.232  1.93  1.23         0.2
#>  2        1 0.247 0.232  2.14  1.23         0.2
#>  3        1 0.230 0.261  1.81  1.23         0.2
#>  4        1 0.211 0.243  2.14  1.23         0.1
#>  5        1 0.167 0.283  2.38  1.23         0.4
#>  6        1 0.300 0.233  2.19  1.23         0.4
#>  7        1 0.318 0.348  2.25  1.23         0.5
#>  8        1 0.162 0.290  2.45  1.23         0.4
#>  9        1 0.265 0.272  2.40  1.23         0.2
#> 10        1 0.243 0.230  1.77  1.23         0.2
#> # ℹ 140 more rows
#> 
#> attr(,"class")
#> [1] "nadir_sl_model"

Notice what it returns: a nadir_sl_model object whose $predict element is a function of newdata that collects predictions from each fitted learner, combines them according to the learned weights, and returns the ensemble predictions.

We can store that fitted super learner and use its learned prediction function:

# We recommend storing more complicated arguments used repeatedly to simplify 
# the call to super_learner()
petal_formula <- Petal.Width ~ Petal.Length + Sepal.Length + Sepal.Width
learners <- list(lnr_lm, lnr_rf, lnr_earth, lnr_mean)

sl_model <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)

In particular, we can use it to predict on the same dataset,

predict(sl_model, iris) |> head()
#>         1         2         3         4         5         6 
#> 0.2308244 0.1711635 0.1904240 0.2491505 0.2517029 0.3793405

On a random sample of it,

predict(sl_model, iris[sample.int(size = 10, n = nrow(iris)), ]) |> 
  head()
#>       142        28        84        70       127         7 
#> 1.8443894 0.2395949 1.7579094 1.2004224 1.6369481 0.2752224

Or on completely new data.

fake_iris_data <- cbind.data.frame(
  Sepal.Length = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Length),
    sd = sd(iris$Sepal.Length)
  ),
  Sepal.Width = rnorm(
    n = 6,
    mean = mean(iris$Sepal.Width),
    sd = sd(iris$Sepal.Width)
  ),
  Petal.Length = rnorm(
    n = 6,
    mean = mean(iris$Petal.Length),
    sd = sd(iris$Petal.Length)
  )
)

predict(sl_model, fake_iris_data) |> 
  head()
#>           1           2           3           4           5           6 
#>  2.15292372 -0.02860532  1.53747064  0.52215488  1.11711273  0.78535573
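
Under the hood, $predict() is just a weighted sum of the candidate learners’ predictions. Here is a rough sketch of the same computation done by hand. It assumes, as the $predict() source printed above suggests, that calling a learner such as lnr_lm(data, formula) returns a prediction function of newdata; that calling convention is an assumption here, and the random-forest refit is stochastic, so the result will only approximately match predict(sl_model, ...).

# Refit each candidate learner on the full data, predict with each, and
# combine the predictions using the stored ensemble weights.
fits <- lapply(learners, function(lnr) lnr(iris, petal_formula))
manual_prediction <- Reduce(`+`, Map(
  function(fit, weight) fit(head(iris)) * weight,
  fits,
  sl_model$learner_weights
))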

Getting More Information Out

If we want to know more about the super_learner() process (how it weighted the candidate learners, what the candidate learners predicted on the held-out data, and so on), we can look at the other metadata contained in the nadir_sl_model object it produces:

sl_model_iris <- super_learner(
  data = iris,
  formula = petal_formula,
  learners = learners)

str(sl_model_iris, max.level = 2)
#> List of 5
#>  $ predict            :function (newdata)  
#>   ..- attr(*, "srcref")= 'srcref' int [1:8] 515 39 521 3 39 3 2973 2979
#>   .. ..- attr(*, "srcfile")=Classes 'srcfilealias', 'srcfile' <environment: 0x13cd6cdf0> 
#>  $ y_variable         : chr "Petal.Width"
#>  $ outcome_type       : chr "continuous"
#>  $ learner_weights    : Named num [1:4] 0.665 0.335 0 0
#>   ..- attr(*, "names")= chr [1:4] "lm" "rf" "earth" "mean"
#>  $ holdout_predictions: tibble [150 × 6] (S3: tbl_df/tbl/data.frame)
#>  - attr(*, "class")= chr "nadir_sl_model"

To describe what’s contained in the output from super_learner():

  • A prediction function, $predict(), that takes newdata.
  • Character fields like $y_variable and $outcome_type that give context about the learning task performed.
  • $learner_weights, indicating the weight each candidate learner was given.
  • $holdout_predictions: a data.frame of predictions from each candidate learner, along with the observed outcome from the held-out data.
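
Each of these components can be pulled out of the fitted object directly, for example:

sl_model_iris$learner_weights            # named numeric vector of ensemble weights
head(sl_model_iris$holdout_predictions)  # per-learner predictions on held-out folds
sl_model_iris$predict(head(iris))        # should match predict(sl_model_iris, head(iris))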

We can call compare_learners() on the output from super_learner() if we want to assess how the different candidate learners performed. We can also call cv_super_learner() with the same arguments as super_learner() to wrap the super_learner() call in another layer of cross-validation and assess how super_learner() itself performs on held-out data.

compare_learners(sl_model_iris)
#> Inferring the loss metric for learner comparison based on the outcome type:
#> outcome_type=continuous -> using mean squared error
#> # A tibble: 1 × 4
#>       lm     rf earth  mean
#>    <dbl>  <dbl> <dbl> <dbl>
#> 1 0.0381 0.0493  2.03 0.579
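
The same per-learner losses can be reproduced by hand from $holdout_predictions, which makes explicit what compare_learners() is measuring. A sketch based on the tibble layout shown earlier:

# Mean squared error of each candidate learner's held-out predictions
# against the observed outcome.
hp <- sl_model_iris$holdout_predictions
sapply(c("lm", "rf", "earth", "mean"), function(learner) {
  mean((hp[[learner]] - hp$Petal.Width)^2)
})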

cv_super_learner(
  data = iris, 
  formula = petal_formula,
  learners = learners)$cv_loss
#> The loss_metric is being inferred based on the outcome_type=continuous -> using CV-MSE
#> [1] 0.03294804
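
Conceptually, cv_super_learner() is doing something like the following hand-rolled nested cross-validation (a sketch for intuition, not nadir’s actual implementation):

# Hold out each fold in turn, run super_learner() on the remaining data,
# and score its ensemble predictions on the held-out fold.
folds <- sample(rep(1:5, length.out = nrow(iris)))
fold_mses <- sapply(1:5, function(v) {
  sl <- super_learner(
    data = iris[folds != v, ],
    formula = petal_formula,
    learners = learners)
  mean((predict(sl, iris[folds == v, ]) - iris$Petal.Width[folds == v])^2)
})
mean(fold_mses)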

We can, of course, do anything with a super learned model that we would do with a conventional prediction model, such as computing performance statistics like R².

# R² against the model's outcome, Petal.Width, computed in-sample
var_residuals <- var(iris$Petal.Width - predict(sl_model_iris, iris))
total_variance <- var(iris$Petal.Width)
variance_explained <- total_variance - var_residuals 

rsquared <- variance_explained / total_variance
print(rsquared)