We try to cover some anticipated frequently asked questions here.
I want to use super_learner()
for count or nonnegative
outcomes.
In principle, you can use super_learner()
for whatever
type of outcomes you want as long as a few things hold:
- The learners that you pass to
super_learner()
predict that type of outcome. - The loss function used inside the
determine_super_learner_weights()
function argument is consistent with the loss function that should be used with your type of data. What loss function “should” be used depends on the context, but using the mean-squared-error loss for continuous outcomes and negative log loss for binary outcomes and conditional density models is what has been written so far. Refer to the sourceR/determine_weights.R
. - If using
nadir::super_learner()
for applications in the context of Targeted Learning, it may be useful to understand better the arguments of
Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples by Mark van der Laan and Sandrine Dudoit, 2003 first to understand how an appropriate loss function should be chosen depending on the outcome type. My understanding is that some people have used Poisson distribution motivated loss functions (https://discuss.pytorch.org/t/poisson-loss-function/44301/6, https://pytorch.org/docs/stable/generated/torch.nn.PoissonNLLLoss.html) but I would need to think more about if this is the right thing to do in general for count outcome data.
What are the limitations of
nadir::super_learner()
?
There are a few key limitations of the design.
- Because
learners
(see?learners
) are understood to be functions that takedata
, aformula
and return a prediction function, there is little to no ability (outside of manually following along with the internals of a learner) to check on the internals of learner fits.- That is to say, if you want to peek into the beta coefficients or
other fit statistics of a learner, this is not supported in
nadir::super_learner()
by design. The reasoning is that an explicit goal of nadir is to keep learner objects lightweight so that building asuper_learner()
can be fast.
- That is to say, if you want to peek into the beta coefficients or
other fit statistics of a learner, this is not supported in
- So far, no thought has been put into complex left-hand-sides of
regression equations. There is no support for left-hand-sides that are
not just the name of a column in the
data
passed. The advice for now if you want to model some transformation of theY
variable is to apply the transformation and store that in thedata
with a column name and to use that new column name in your regression formula(s).- As an explicit subpoint to call attention to, this means so far, no work has been put into supporting survival type outcomes.
- So far, everything in nadir assumes completeness (no missingness) of the data.
What if the learner that I want to write really isn’t formula based?
A solution for such a case is to more-or-less ditch the formula piece of a learner entirely, just treating it as an unused argument, and for your custom needs you can always build learners that encode details of the structure of your data.
data <- matrix(data = rnorm(n = 200), nrow = 20)
colnames(data) <- paste0("X", 1:10)
data <- cbind(data, data %*% rnorm(10))
colnames(data)[ncol(data)] <- 'Y'
lnr_nonformula1 <- function(data, formula, ...) {
# notice by way of knowing things about our data structure, we never reference
# the formula; so if you truly don't want to use it, you don't have to.
# as an example, here we do OLS assuming inputs are numeric matrices —
# this might even be computationally more performant given how much extra
# stuff is inside an lm or glm fit.
X <- as.matrix(data[,grepl(pattern = "^X", colnames(data))])
Y <- as.matrix(data[,'Y'])
model_betas <- solve(t(X) %*% X) %*% t(X) %*% Y
learned_predictor <- function(newdata) {
if ('Y' %in% colnames(newdata)) {
index_of_y <- which(colnames(newdata) == 'Y')[[1]]
} else {
index_of_y <- NULL
}
if (is.data.frame(newdata)) {
newdata <- as.matrix(newdata)
}
as.vector(t(model_betas) %*% t(newdata[,-index_of_y, drop=FALSE]))
}
return(learned_predictor)
}
attr(lnr_nonformula1, 'sl_lnr_type') <- 'continuous'
# this is essentially a re-implementation of lnr_mean with no reference to the formula
lnr_nonformula2 <- function(data, formula, ...) {
Y <- data[,'Y']
Y_mean <- mean(Y)
learned_predictor <- function(newdata) {
rep(Y_mean, nrow(newdata))
}
return(learned_predictor)
}
attr(lnr_nonformula2, 'sl_lnr_type') <- 'continuous'
learned_super_learner <- super_learner(
data = data,
learners = list(
nonformula1 = lnr_nonformula1,
nonformula2 = lnr_nonformula2),
formulas = . ~ ., # it doesn't matter what we put here, because neither
# learner uses their formula inputs.
y_variable = 'Y',
verbose = TRUE
)
# observe that the OLS model gets all the weight because it's the correct model:
round(learned_super_learner$learner_weights, 10)
#> nonformula1 nonformula2
#> 1 0
rm(data) # cleanup
So you can see from the immediately prior code snippet, if you have
some niche application where you would like to avoid using the
formulas
argument to nadir::super_learner()
at
all, you can do this by taking advantage of what you know about how
you’re going to structure the data
argument.
Can I use {origami}
with {nadir}
?
Yes, you can.
There’s a wrapper provided for working with the folds_*
functions from origami.
The first example below is a bit boring, but it does internally use
origami::folds_vfold
. The second example demonstrates how
to pass another fold_*
function from the
origami package, and any extra arguments passed to
cv_origami_schema
get passed on to the
origami::folds_*
function.
sl_model <- super_learner(
data = mtcars,
formula = mpg ~ cyl + hp,
learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
cv_schema = cv_origami_schema,
verbose = TRUE
)
# if you want to use a different origami::folds_* function, pass it into cv_origami_schema
sl_model <- super_learner(
data = mtcars,
formula = mpg ~ cyl + hp,
learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
cv_schema = \(data, n_folds) {
cv_origami_schema(data, n_folds, fold_fun = origami::folds_loo)
},
verbose = TRUE
)
Are there potentially ‘sharp edges’ to {nadir}
worth
knowing about?
Yes! Though nadir tries to make the process user-friendly, there may be unexpected behaviors if you use it outside its design-scope and tested functionality.
For example, nadir and its learners are so far only
built to handle regression formulas where the left-hand-side appears as
a column name of the data passed in. That means transformations of the
outcome variable implied through the formula are not supported. Our
advice is that if you want to transform the outcome variable, you should
store that in a column in your data and run super_learner()
using that column name in your formula(s).
Another sharp-edge is around the meta-learning step. For example,
predicting continuous outcomes, one should specify to
super_learner()
that
outcome_type = 'continuous'
(the default) so that
non-negative least squares is used to minimize a linear combination of
the candidate learners based on held-out mean squared error for the loss
function. Additionally, by setting outcome_type = 'binary'
or outcome_type = 'density'
the negative log likelihood /
negative log predicted density are used respectively as loss functions.
In each of these cases, these defaults translate to setting the
determine_weights_for_super_learner
function argument
appropriately to one of
determine_super_learner_weights_nnls()
,
determine_weights_for_binary_outcomes
,
determine_weights_using_neg_log_loss
. These loss functions
are selected based on work in the loss based estimation literature,
especially1 2 3.