library(nadir)
#> Registered S3 method overwritten by 'future':
#>   method               from
#>   all.equal.connection parallelly
This article contains some advice for writing and constructing new learners.
Weights
We recommend explicitly handling weights as an argument
to learners so that it is a protected argument. The internals of
different algorithms vary, and some use other names for
weights, so handling weights explicitly standardizes
the weights argument across different learner algorithms. As a concrete
example, the ranger::ranger() function takes
case.weights as its argument rather than
weights.
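As an illustration, a sketch of a wrapper along these lines (lnr_rangerWeights is a hypothetical name used here for illustration) might translate the standardized weights argument into ranger's case.weights:
# a hedged sketch: translate the standardized `weights` argument into
# ranger's `case.weights` argument
lnr_rangerWeights <- function(data, formula, weights = NULL, ...) {
  model <- ranger::ranger(
    formula = formula,
    data = data,
    case.weights = weights,  # ranger's name for observation weights
    ...
  )
  return(function(newdata) {
    predict(model, data = newdata)$predictions
  })
}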
A typical learner that supports weights might look
like:
lnr_supportsWeights <- function(data, formula, weights = NULL, ...) {
  # train the model
  model <- model_fit(data = data, formula = formula, weights = weights, ...)
  return(function(newdata) {
    predict(model, newdata = newdata)
  })
}
However, some model fitting procedures do not like being passed
weights = NULL, so it may be necessary to take care not to pass the
default weights = NULL on to the model fitting
procedure.
As an example of this, we refer the curious reader to the source of
lnr_glm in https://github.com/ctesta01/nadir/blob/main/R/learners.R.
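The essential pattern is to only forward weights when they were actually supplied; a minimal sketch, again using the model_fit() placeholder from above:
lnr_guardedWeights <- function(data, formula, weights = NULL, ...) {
  if (is.null(weights)) {
    # omit the weights argument entirely when none were supplied
    model <- model_fit(data = data, formula = formula, ...)
  } else {
    model <- model_fit(data = data, formula = formula, weights = weights, ...)
  }
  return(function(newdata) {
    predict(model, newdata = newdata)
  })
}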
Attributes
If you create learners, we recommend also giving them a couple of attributes, for the following reasons:
- If a learner has a sl_lnr_name attribute, then this can automatically be used in the outputs if a name for the learner is left unspecified.
- If a learner has a sl_lnr_type attribute, it will be checked against the output_type argument to super_learner().
To set these attributes when making a new learner, one should run something along the lines of:
lnr_myNewLearner <- function(data, formula, ...) {
  # fit your learner given data, formula, ...
  model <- model_fit(data = data, formula = formula, ...)
  predictor_fn <- function(newdata) {
    predict(model, newdata = newdata)
  }
  return(predictor_fn)
}
attr(lnr_myNewLearner, 'sl_lnr_name') <- 'newLearnerName'
attr(lnr_myNewLearner, 'sl_lnr_type') <- 'continuous' # or c('continuous', 'binary') and similar
# see ?nadir_supported_types
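With those attributes set, the learner's name and type can be picked up automatically. A hedged sketch of what this looks like in use (assuming super_learner() accepts a list of learners via a learners argument, and that lnr_lm is one of nadir's built-in learners):
# names are omitted from the list here, so 'newLearnerName' would be
# read off the sl_lnr_name attribute in the outputs
sl_fit <- super_learner(
  data = mtcars,
  formula = mpg ~ .,
  learners = list(lnr_myNewLearner, lnr_lm),
  output_type = 'continuous' # checked against each learner's sl_lnr_type
)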
Modifying Predictions
Occasionally there are settings in which we would like to enforce some constraint on model behavior. One such example is clipping the predicted values to some pre-specified range. This is often relevant in the logistic regression (i.e., binary outcomes) setting, but may be more broadly applicable.
Here we show how to truncate the predictions of a learner to a specified range as an illustrative example.
truncate_lnr <- function(lnr, min, max) {
  truncate <- function(x, min, max) {
    pmax(pmin(x, max), min)
  }
  # we want to return a modified learner: in other words, a function that
  # takes in a learner's inputs and returns a prediction function, just
  # with the added truncation.
  return(
    function(...) {
      predictor_fn <- lnr(...)
      truncated_predictor_fn <- function(...) {
        truncate(predictor_fn(...), min, max)
      }
      return(truncated_predictor_fn)
    }
  )
}
# create the new, truncated learner
lnr_truncated_hal <- truncate_lnr(lnr = lnr_hal, min = 0, max = 1)
# fit the learner
learned_hal_model <- lnr_truncated_hal(data = mtcars, am ~ .)
# produce predictions, for example
learned_hal_model(mtcars)
#> [1] 0.699742339 0.664010374 0.695709519 0.000000000 0.000000000 0.060095349
#> [7] 0.000000000 0.511040606 0.283893910 0.390855426 0.361627216 0.194663708
#> [13] 0.057058232 0.009506522 0.000000000 0.015485391 0.047048910 0.807818383
#> [19] 0.837018020 0.857143078 0.590793239 0.225453528 0.141459463 0.233407327
#> [25] 0.000000000 0.820616609 0.760721447 0.750925569 0.192624472 0.777093601
#> [31] 0.354637109 0.630647486
# in contrast, if we had not truncated this learner, we could have gotten
# predictions outside of [0, 1]:
learned_hal_not_truncated <- lnr_hal(data = mtcars, am ~ .)
range(learned_hal_not_truncated(mtcars))
#> [1] -0.4658581  1.0268739
With many learners, it is possible to specify
family = binomial(link = 'logit') or a similar argument.
Nonetheless, there are settings where fitting a truncated learner has
benefits: sometimes a truncated linear probability model has lower risk
than a logistic regression model, and sometimes truncated models are
useful for manually expanding the variety of candidate learners used in
the super_learner() algorithm, thereby making it more
likely that the true data generating mechanism is closely estimated by
the fit super_learner().
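For example, a quick sketch (assuming that lnr_glm forwards extra arguments through to stats::glm() and that its predictor returns response-scale probabilities):
# pass family through to the underlying glm() call
learned_logistic <- lnr_glm(data = mtcars, am ~ mpg + hp,
                            family = binomial(link = 'logit'))
range(learned_logistic(mtcars)) # should already lie within [0, 1]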
If you are wondering whether a truncated linear probability model for binary outcomes is misspecified, see the following section of the FAQ: How should we think about “misspecified” learners?
