
Cross Validation Training/Validation Splits with Characters/Factor Columns
Source: R/cv_schemas.R
cv_character_and_factors_schema.Rd
Designed to handle cross-validation for models like randomForest, ranger, and glmnet, where the model matrix of newdata must match exactly the model matrix of the training dataset, this function answers the need: "The training datasets need to have every level of every discrete-type column that appears in the data."
Usage
cv_character_and_factors_schema(
  data,
  n_folds = 5,
  cv_sl_mode = TRUE,
  check_validation_datasets_too = TRUE
)
Arguments
- data
Data to use in training a `super_learner`.
- n_folds
The number of cross-validation folds to use in constructing the `super_learner`.
- cv_sl_mode
A logical (default: TRUE) indicating whether the output training/validation data lists will be used inside another `super_learner` call. If so, the training data needs every level to appear at least twice so that the data can be split into further training/validation folds.
- check_validation_datasets_too
Enforce that the validation datasets produced also have every level of every character/factor type column present. This is particularly useful for learners like `glmnet`, which require that the `newx` matrix have exactly the same shape/structure as the training data, down to binary indicators for every level that appears.
Details
The fundamental idea is to check whether the unique levels of character and/or factor columns are represented in every training dataset.
Above and beyond this, this function is designed to support cv_super_learner, which inherently involves two layers of cross-validation. As a result, more stringent conditions are enforced when `cv_sl_mode` is enabled. For convenience, this mode is enabled by default.
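As a rough sketch of the core check (a hypothetical helper, not part of this package's exported API), verifying that a single training split covers every observed level of every discrete-type column might look like:

```r
# Hypothetical sketch: does a training split cover every level of every
# character/factor column that appears in the full dataset?
levels_covered <- function(full_data, training_data) {
  # identify the discrete-type columns
  discrete_cols <- names(full_data)[sapply(full_data, function(col) {
    is.character(col) || is.factor(col)
  })]
  # every level seen in the full data must appear in the training split
  all(sapply(discrete_cols, function(col) {
    all(unique(as.character(full_data[[col]])) %in%
          as.character(training_data[[col]]))
  }))
}
```

A schema like this one repeats such a check across all folds (and, when `check_validation_datasets_too = TRUE`, across the validation datasets as well), re-drawing splits until the condition holds.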
Examples
if (FALSE) { # \dontrun{
require(palmerpenguins)
training_validation_splits <- cv_character_and_factors_schema(
  palmerpenguins::penguins)

# we can see the population breakdown across all the training splits:
sapply(training_validation_splits$training_data, function(df) {
  table(df$species)
})

# notably, none of them are empty! this is crucial for certain
# types of learning algorithms that must see all levels appear in the
# training data, like random forests.

# certain models like glmnet require that the prediction dataset
# newx have the _exact_ same shape as the training data, so it
# can be important that every level appears in the validation data
# as well. check that by looking into these types of tables:
sapply(training_validation_splits$validation_data, function(df) {
  table(df$species)
})

# if you don't need this level of stringency, but you just want
# to make cv_splits where every level appears in the training_data,
# you can do so using the check_validation_datasets_too = FALSE
# argument.
penguins_small <- palmerpenguins::penguins[c(1:3, 154:156, 277:279), ]
penguins_small <- penguins_small[complete.cases(penguins_small), ]
training_validation_splits <- cv_character_and_factors_schema(
  penguins_small,
  cv_sl_mode = FALSE,
  n_folds = 5,
  check_validation_datasets_too = FALSE)
sapply(training_validation_splits$training_data, function(df) {
  table(df$species)
})

# now you can see plenty of non-appearing levels in the validation data:
sapply(training_validation_splits$validation_data, function(df) {
  table(df$species)
})
} # }