We try to cover some anticipated frequently asked questions here.
How should we think about “misspecified” learners?
It’s important to be clear about the fact that super learner is a purely predictive algorithm that has some statistically useful properties. A fitted super learner is not an explanatory model. Therefore, from a pure prediction standpoint, it does not matter whether aspects of the fitted models are “incorrect” from a structural standpoint.
To make this more clear, here is an example:
- If a glm is used in super learner and the predictions from that model are ones that minimize the loss function, that glm model will be given higher weight in super learner even if it, say, uses the wrong link function or family specification from the standpoint of traditional parametric generalized linear modeling.
- For example, a “linear probability model” (i.e., a linear model fit to binary outcome data) may perform quite well in some instances and outperform a logistic regression model. The main downside of the linear probability model is that it can make predictions outside the [0,1] interval, so a user may want to modify their linear probability model learner to produce predictions truncated to the [0,1] interval.
However, certain kinds of misspecification are genuinely problematic.
For example, as in any supervised learning situation, if predictors are included that are based on the outcomes and would not be available in a situation where predictions are needed but the outcomes have not yet been observed, this would be incorrect usage of the super learner algorithm.
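As a concrete sketch of the truncation advice above, here is what a truncated linear probability model learner might look like, following the convention (used throughout nadir) that a learner is a function taking data and a formula and returning a prediction function. The learner name lnr_lpm_truncated is hypothetical, and the 'binary' type label in the attribute is an assumption:

```r
# Hypothetical learner: a linear probability model whose predictions are
# truncated to the [0, 1] interval. Follows the convention that a learner
# takes (data, formula, ...) and returns a prediction function.
lnr_lpm_truncated <- function(data, formula, ...) {
  fit <- lm(formula, data = data)
  function(newdata) {
    # clamp raw linear-model predictions into the [0, 1] interval
    pmin(pmax(predict(fit, newdata = newdata), 0), 1)
  }
}
# assuming 'binary' is the relevant learner type label here
attr(lnr_lpm_truncated, 'sl_lnr_type') <- 'binary'
```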
I want to use super_learner() for count or nonnegative outcomes.
In principle, you can use super_learner() for whatever type of outcomes you want as long as a few things hold:
- The learners that you pass to super_learner() predict that type of outcome.
- The loss function used inside the determine_super_learner_weights() function argument is consistent with the loss function that should be used with your type of data. What loss function “should” be used depends on the context, but what has been written so far uses mean-squared-error loss for continuous outcomes and negative log loss for binary outcomes and conditional density models. Refer to the source in R/determine_weights.R.
- If using nadir::super_learner() for applications in the context of Targeted Learning, it may be useful to first read Unified Cross-Validation Methodology For Selection Among Estimators and a General Cross-Validated Adaptive Epsilon-Net Estimator: Finite Sample Oracle Inequalities and Examples by Mark van der Laan and Sandrine Dudoit, 2003, to understand how an appropriate loss function should be chosen depending on the outcome type. My understanding is that some people have used Poisson-distribution-motivated loss functions (https://discuss.pytorch.org/t/poisson-loss-function/44301/6, https://pytorch.org/docs/stable/generated/torch.nn.PoissonNLLLoss.html), but I would need to think more about whether this is the right thing to do in general for count outcome data.
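For readers exploring that direction, a Poisson negative log-likelihood loss might be sketched as follows. The function name is hypothetical, and whether this is an appropriate loss for count outcomes in general is, as noted above, an open question:

```r
# Hypothetical Poisson negative-log-likelihood loss for count outcomes.
# Dropping the additive log(y!) term (which does not depend on the
# predicted rate), the mean negative log-likelihood is:
poisson_nll_loss <- function(predicted_rate, observed_counts) {
  mean(predicted_rate - observed_counts * log(predicted_rate))
}
```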
What are the limitations of nadir::super_learner()?
There are a few key limitations of the design.
- Because learners (see ?learners) are understood to be functions that take data and a formula and return a prediction function, there is little to no ability (outside of manually following along with the internals of a learner) to check on the internals of learner fits.
  - That is to say, if you want to peek into the beta coefficients or other fit statistics of a learner, this is not supported in nadir::super_learner() by design. The reasoning is that an explicit goal of nadir is to keep learner objects lightweight so that building a super_learner() can be fast.
- So far, no thought has been put into complex left-hand-sides of regression equations. There is no support for left-hand-sides that are not just the name of a column in the data passed. The advice for now, if you want to model some transformation of the Y variable, is to apply the transformation, store the result in the data under a new column name, and use that new column name in your regression formula(s).
  - As an explicit subpoint to call attention to, this means that so far, no work has been put into supporting survival type outcomes.
- So far, everything in nadir assumes completeness (no missingness) of the data.
- nadir::super_learner() is a pure prediction algorithm and does not provide confidence intervals. To obtain confidence intervals, nadir::super_learner() needs to be embedded in an inferential paradigm such as influence-function-based estimation and inference as in Targeted Learning, or similar.
What if the learner that I want to write really isn’t formula based?
A solution for such a case is to more-or-less ditch the formula piece of a learner entirely, treating it as an unused argument. For your custom needs, you can always build learners that encode details of the structure of your data.
data <- matrix(data = rnorm(n = 200), nrow = 20)
colnames(data) <- paste0("X", 1:10)
data <- cbind(data, data %*% rnorm(10))
colnames(data)[ncol(data)] <- 'Y'
lnr_nonformula1 <- function(data, formula, ...) {
  # notice that by knowing things about our data structure, we never reference
  # the formula; so if you truly don't want to use it, you don't have to.
  # as an example, here we do OLS assuming the inputs are numeric matrices;
  # this might even be computationally more performant given how much extra
  # stuff is inside an lm or glm fit.
  X <- as.matrix(data[, grepl(pattern = "^X", colnames(data))])
  Y <- as.matrix(data[, 'Y'])
  model_betas <- solve(t(X) %*% X) %*% t(X) %*% Y

  learned_predictor <- function(newdata) {
    if (is.data.frame(newdata)) {
      newdata <- as.matrix(newdata)
    }
    # drop the outcome column only if it is present in newdata
    if ('Y' %in% colnames(newdata)) {
      newdata <- newdata[, colnames(newdata) != 'Y', drop = FALSE]
    }
    as.vector(t(model_betas) %*% t(newdata))
  }
  return(learned_predictor)
}
attr(lnr_nonformula1, 'sl_lnr_type') <- 'continuous'
# this is essentially a re-implementation of lnr_mean with no reference to the formula
lnr_nonformula2 <- function(data, formula, ...) {
  Y <- data[, 'Y']
  Y_mean <- mean(Y)
  learned_predictor <- function(newdata) {
    rep(Y_mean, nrow(newdata))
  }
  return(learned_predictor)
}
attr(lnr_nonformula2, 'sl_lnr_type') <- 'continuous'
learned_super_learner <- super_learner(
  data = data,
  learners = list(
    nonformula1 = lnr_nonformula1,
    nonformula2 = lnr_nonformula2),
  formulas = . ~ ., # it doesn't matter what we put here, because neither
                    # learner uses their formula inputs.
  y_variable = 'Y',
  verbose = TRUE
)
# observe that the OLS model gets all the weight because it's the correct model:
round(learned_super_learner$learner_weights, 10)
#> nonformula1 nonformula2
#> 1 0
rm(data) # cleanup
So you can see from the immediately prior code snippet that if you have some niche application where you would like to avoid using the formulas argument to nadir::super_learner() at all, you can do this by taking advantage of what you know about how you’re going to structure the data argument.
Can I use {origami} with {nadir}?
Yes, you can.
There’s a wrapper provided for working with the folds_*
functions from origami.
The first example below is a bit boring, but it does internally use
origami::folds_vfold. The second example demonstrates how
to pass another fold_* function from the
origami package, and any extra arguments passed to
cv_origami_schema get passed on to the
origami::folds_* function.
sl_model <- super_learner(
  data = mtcars,
  formula = mpg ~ cyl + hp,
  learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
  cv_schema = cv_origami_schema,
  verbose = TRUE
)
# if you want to use a different origami::folds_* function, pass it into cv_origami_schema
sl_model <- super_learner(
  data = mtcars,
  formula = mpg ~ cyl + hp,
  learners = list(rf = lnr_rf, lm = lnr_lm, mean = lnr_mean),
  cv_schema = \(data, n_folds) {
    cv_origami_schema(data, n_folds, fold_fun = origami::folds_loo)
  },
  verbose = TRUE
)
Are there potentially ‘sharp edges’ to {nadir} worth knowing about?
Yes! Though nadir tries to make the process user-friendly, there may be unexpected behaviors if you use it outside its design scope and tested functionality.
Outcome Transformations
For example, nadir and its learners are so far only built to handle regression formulas where the left-hand-side appears as a column name of the data passed in. That means transformations of the outcome variable implied through the formula are not supported. The recommended way to handle outcome transformations is to store the transformed outcome as a new column in your data frame and refer to it by that column name on the left-hand-side of your formula(s).
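For instance, to model a log-transformed outcome (using mtcars purely as an illustration):

```r
# Store the transformed outcome as its own column, rather than writing
# log(mpg) on the left-hand-side of a formula (which is not supported):
mtcars$log_mpg <- log(mtcars$mpg)
# then use, e.g., formula = log_mpg ~ cyl + hp in super_learner()
```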
What does outcome_type = ... do and what doesn’t it do?
Another sharp edge is around the meta-learning step. For example, when predicting continuous outcomes, one should specify to super_learner() that outcome_type = 'continuous' (the default) so that non-negative least squares is used to minimize a linear combination of the candidate learners based on held-out mean squared error as the loss function. Additionally, by setting outcome_type = 'binary' or outcome_type = 'density', the negative log likelihood or negative log predicted density, respectively, is used as the loss function. In each of these cases, these defaults translate to setting the determine_weights_for_super_learner function argument appropriately to one of determine_super_learner_weights_nnls(), determine_weights_for_binary_outcomes(), or determine_weights_using_neg_log_loss(). These loss functions are selected based on work in the loss-based estimation literature, especially1 2 3.
The outcome_type argument doesn’t modify the behavior of the candidate learners, so if you want to use, say, lnr_glm with family = binomial(link = 'logit'), you need to pass this in the extra_learner_args argument to super_learner().
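As a sketch of how that might look, one could build up the extra arguments separately. The exact structure expected by extra_learner_args (here assumed to be a named list mapping a learner’s name to a list of extra arguments for that learner) should be checked against ?super_learner:

```r
# Hypothetical structure for extra_learner_args: a named list mapping a
# learner's name to the list of extra arguments for that learner.
# Check ?super_learner for the authoritative interface.
extra_args <- list(
  glm = list(family = binomial(link = 'logit'))
)
# then, e.g.:
# super_learner(..., learners = list(glm = lnr_glm, mean = lnr_mean),
#               extra_learner_args = extra_args,
#               outcome_type = 'binary')
```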
