
We try to cover some anticipated frequently asked questions here.

I want to use super_learner() for count or nonnegative outcomes.

In principle, you can use super_learner() for whatever type of outcome you want, as long as a few things hold: the learners you supply produce sensible predictions for that outcome type (e.g., nonnegative predictions for a count outcome), and the loss function used in the meta-learning step is appropriate for it (see the discussion of outcome_type at the end of this page).
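
For count outcomes specifically, here is a minimal sketch, assuming you write your own Poisson regression learner against the learner contract described below (a learner takes data and a formula and returns a prediction function). The lnr_poisson name is our own illustrative choice, not a learner we know to ship with nadir; lnr_mean is the mean learner used elsewhere on this page.

lnr_poisson <- function(data, formula, ...) {
  fit <- glm(formula, data = data, family = poisson())
  # predictions on the response scale are guaranteed nonnegative
  function(newdata) predict(fit, newdata = newdata, type = 'response')
}
attr(lnr_poisson, 'sl_lnr_type') <- 'continuous'

sl_counts <- super_learner(
  data = warpbreaks, # built-in dataset with a count outcome
  formula = breaks ~ wool + tension,
  learners = list(poisson = lnr_poisson, mean = lnr_mean)
)

With the default outcome_type = 'continuous', the meta-learning step uses non-negative weights, so the ensemble's predictions remain nonnegative whenever each learner's predictions are.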

What are the limitations of nadir::super_learner()?

There are a few key limitations of the design.

  • Because learners (see ?learners) are understood to be functions that take data and a formula and return a prediction function, there is little to no support (outside of manually following along with the internals of a learner) for checking on the internals of learner fits.
    • That is to say, if you want to peek into the beta coefficients or other fit statistics of a learner, this is not supported in nadir::super_learner() by design. The reasoning is that an explicit goal of nadir is to keep learner objects lightweight so that building a super_learner() can be fast.
  • So far, no thought has been put into complex left-hand sides of regression formulas: there is no support for left-hand sides that are not simply the name of a column in the data passed in. For now, if you want to model some transformation of the Y variable, apply the transformation, store the result in the data under a new column name, and use that column name in your regression formula(s); see the sketch after this list.
    • As an explicit subpoint to call attention to: this means that, so far, no work has been put into supporting survival-type outcomes.
  • So far, everything in nadir assumes completeness (no missingness) of the data.
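
As a concrete sketch of the transformation workaround mentioned above (using the mtcars data and the lnr_lm and lnr_mean learners that appear in examples later on this page):

# store the transformed outcome as its own column, then reference that
# column name on the left-hand side of the formula
mtcars$log_mpg <- log(mtcars$mpg)

sl_log <- super_learner(
  data = mtcars,
  formula = log_mpg ~ cyl + hp,
  learners = list(lm = lnr_lm, mean = lnr_mean)
)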

What if the learner that I want to write really isn’t formula based?

A solution for such a case is to more or less ditch the formula piece of a learner entirely, treating it as an unused argument; for your custom needs, you can always build learners that encode details of the structure of your data.

set.seed(1) # for reproducibility
data <- matrix(data = rnorm(n = 200), nrow = 20) # 20 observations, 10 predictors
colnames(data) <- paste0("X", 1:10)
data <- cbind(data, data %*% rnorm(10)) # outcome is linear in the X's
colnames(data)[ncol(data)] <- 'Y'

lnr_nonformula1 <- function(data, formula, ...) {
  
  # notice that, by knowing things about our data structure, we never reference
  # the formula; so if you truly don't want to use it, you don't have to.
  
  # as an example, here we do OLS assuming inputs are numeric matrices — 
  # this might even be computationally more performant given how much extra
  # stuff is inside an lm or glm fit.
  X <- as.matrix(data[,grepl(pattern = "^X", colnames(data))])
  Y <- as.matrix(data[,'Y'])
  model_betas <- solve(t(X) %*% X) %*% t(X) %*% Y
  
  learned_predictor <- function(newdata) {
    if (is.data.frame(newdata)) {
      newdata <- as.matrix(newdata)
    }
    # drop the outcome column (if present) by name, so that only the X
    # columns enter the matrix product; this also works when 'Y' is absent
    X_new <- newdata[, colnames(newdata) != 'Y', drop = FALSE]
    as.vector(X_new %*% model_betas)
  }
  return(learned_predictor)
}
attr(lnr_nonformula1, 'sl_lnr_type') <- 'continuous'
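
A quick way to see the learner contract in action is to call the learner directly and use the prediction function it returns; since this learner ignores its formula argument, passing NULL is fine.

# fit once on the full simulated data and predict back on it
single_fit <- lnr_nonformula1(data, formula = NULL)
head(single_fit(data))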

# this is essentially a re-implementation of lnr_mean with no reference to the formula 
lnr_nonformula2 <- function(data, formula, ...) {
  
  Y <- data[,'Y']
  Y_mean <- mean(Y)
  
  learned_predictor <- function(newdata) {
    rep(Y_mean, nrow(newdata))
  }
  return(learned_predictor)
}
attr(lnr_nonformula2, 'sl_lnr_type') <- 'continuous'

learned_super_learner <- super_learner(
  data = data,
  learners = list(
    nonformula1 = lnr_nonformula1,
    nonformula2 = lnr_nonformula2),
  formulas = . ~ ., # it doesn't matter what we put here, because neither
  # learner uses its formula input.
  y_variable = 'Y',
  verbose = TRUE
  )

# observe that the OLS model gets all the weight because it's the correct model:
round(learned_super_learner$learner_weights, 10) 
#> nonformula1 nonformula2 
#>           1           0


rm(data) # cleanup

As the immediately prior code snippet shows, if you have some niche application where you would like to avoid using the formulas argument to nadir::super_learner() entirely, you can do so by taking advantage of what you know about how you're going to structure the data argument.

Can I use {origami} with {nadir}?

Yes, you can.

There’s a wrapper provided for working with the folds_* functions from origami.

The first example below is a bit boring, but it does internally use origami::folds_vfold. The second example demonstrates how to pass another folds_* function from the origami package; any extra arguments passed to cv_origami_schema get passed on to the origami::folds_* function.

sl_model <- super_learner(
  data = mtcars,
  formula = mpg ~ cyl + hp,
  learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
  cv_schema = cv_origami_schema,
  verbose = TRUE
)

# if you want to use a different origami::folds_* function, pass it into cv_origami_schema
sl_model <- super_learner(
  data = mtcars,
  formula = mpg ~ cyl + hp,
  learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
  cv_schema = \(data, n_folds) {
    cv_origami_schema(data, n_folds, fold_fun = origami::folds_loo)
  },
  verbose = TRUE
)
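
To illustrate the extra-argument pass-through mentioned above, here is a sketch assuming origami::folds_montecarlo and its pvalidation argument (the learner lineup is just the one from the examples above):

sl_model <- super_learner(
  data = mtcars,
  formula = mpg ~ cyl + hp,
  learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
  cv_schema = \(data, n_folds) {
    # pvalidation is forwarded through cv_origami_schema to folds_montecarlo,
    # holding out 20% of the data in each randomly drawn validation set
    cv_origami_schema(data, n_folds, fold_fun = origami::folds_montecarlo,
                      pvalidation = 0.2)
  },
  verbose = TRUE
)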

Are there potentially ‘sharp edges’ to {nadir} worth knowing about?

Yes! Though nadir tries to make the process user-friendly, there may be unexpected behaviors if you use it outside its design scope and tested functionality.

For example, nadir and its learners are so far only built to handle regression formulas where the left-hand side is a column name of the data passed in. That means transformations of the outcome variable implied through the formula are not supported. Our advice is that if you want to transform the outcome variable, you should store the transformed values in a column in your data and run super_learner() using that column name in your formula(s).

Another sharp edge is around the meta-learning step. When predicting continuous outcomes, one should specify outcome_type = 'continuous' (the default) to super_learner(), so that non-negative least squares is used to find the linear combination of the candidate learners that minimizes held-out mean squared error. Setting outcome_type = 'binary' or outcome_type = 'density' instead uses the negative log likelihood or the negative log predicted density, respectively, as the loss function. In each of these cases, the default translates to setting the determine_weights_for_super_learner argument appropriately to one of determine_super_learner_weights_nnls(), determine_weights_for_binary_outcomes(), or determine_weights_using_neg_log_loss(). These loss functions are selected based on work in the loss-based estimation literature, especially [1], [2], [3].
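
As a sketch of the binary case: the learner definitions below are illustrative, written against the learner contract described earlier on this page, and tagging them with sl_lnr_type = 'binary' is our assumption about that attribute's value rather than documented behavior.

lnr_logit <- function(data, formula, ...) {
  fit <- glm(formula, data = data, family = binomial())
  # return predicted probabilities
  function(newdata) predict(fit, newdata = newdata, type = 'response')
}
attr(lnr_logit, 'sl_lnr_type') <- 'binary' # assumed attribute value

lnr_marginal_rate <- function(data, formula, ...) {
  # intercept-only learner: predict the marginal event rate everywhere
  p_hat <- mean(model.response(model.frame(formula, data)))
  function(newdata) rep(p_hat, nrow(newdata))
}
attr(lnr_marginal_rate, 'sl_lnr_type') <- 'binary' # assumed attribute value

sl_binary <- super_learner(
  data = mtcars,
  formula = am ~ mpg + hp,
  learners = list(logit = lnr_logit, marginal = lnr_marginal_rate),
  outcome_type = 'binary' # selects negative log likelihood as the loss
)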