
Cross Validation Training/Validation Splits with Characters/Factor Columns
Source: R/cv_schemas.R
cv_character_and_factors_schema.Rd
Designed to handle cross-validation for models like randomForest, ranger, and glmnet, where the model matrix of newdata must match exactly the model matrix of the training dataset, this function answers the need: "The training datasets need to have every level of every discrete-type column that appears in the data."
Usage
cv_character_and_factors_schema(
  data,
  n_folds = 5,
  cv_sl_mode = TRUE,
  check_validation_datasets_too = TRUE
)
Arguments
- data
Data to use in training a `super_learner`.
- n_folds
The number of cross-validation folds to use in constructing the `super_learner`.
- cv_sl_mode
A logical (default: TRUE) indicating whether the output training/validation data lists will be used inside another `super_learner` call. If so, the training data needs every level to appear at least twice so that the data can be split into further training/validation folds.
- check_validation_datasets_too
Enforce that the validation datasets produced also have every level of every character/factor type column present. This is particularly useful for learners like `glmnet`, which require that the `newx` matrix have exactly the same shape/structure as the training data, down to binary indicators for every level that appears.
Details
The fundamental idea is to check whether the unique levels of character and/or factor columns are represented in every training dataset.
Above and beyond this, this function is designed to support cv_super_learner, which inherently involves two layers of cross-validation. As a result, more stringent conditions are enforced when `cv_sl_mode` is enabled. For convenience, this mode is enabled by default.
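As a rough sketch of the core check (a hypothetical helper, not part of this package's exported API), verifying that a single training split covers every observed level of every discrete-type column might look like:

```r
# Hypothetical sketch: does a training split cover every level of every
# character/factor column that appears in the full dataset?
levels_covered <- function(full_data, training_data) {
  # identify the discrete-type columns
  discrete_cols <- names(full_data)[sapply(full_data, function(col) {
    is.character(col) || is.factor(col)
  })]
  # every level seen in the full data must appear in the training split
  all(sapply(discrete_cols, function(col) {
    all(unique(as.character(full_data[[col]])) %in%
          as.character(training_data[[col]]))
  }))
}
```

A schema like this one repeats such a check across all folds (and, when `check_validation_datasets_too = TRUE`, across the validation datasets as well), re-drawing splits until the condition holds.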
Examples
if (FALSE) { # \dontrun{
require(palmerpenguins)
training_validation_splits <- cv_character_and_factors_schema(
  palmerpenguins::penguins)

# we can see the population breakdown across all the training splits:
sapply(training_validation_splits$training_data, function(df) {
  table(df$species)
})

# notably, none of them are empty! this is crucial for certain
# types of learning algorithms that must see all levels appear in the
# training data, like random forests.

# certain models like glmnet require that the prediction dataset
# newx have the _exact_ same shape as the training data, so it
# can be important that every level appears in the validation data
# as well. check that by looking into these types of tables:
sapply(training_validation_splits$validation_data, function(df) {
  table(df$species)
})

# if you don't need this level of stringency, but you just want
# to make cv_splits where every level appears in the training_data,
# you can do so using the check_validation_datasets_too = FALSE
# argument.
penguins_small <- palmerpenguins::penguins[c(1:3, 154:156, 277:279), ]
penguins_small <- penguins_small[complete.cases(penguins_small), ]
training_validation_splits <- cv_character_and_factors_schema(
  penguins_small,
  cv_sl_mode = FALSE,
  n_folds = 5,
  check_validation_datasets_too = FALSE)
sapply(training_validation_splits$training_data, function(df) {
  table(df$species)
})

# now you can see plenty of non-appearing levels in the validation data:
sapply(training_validation_splits$validation_data, function(df) {
  table(df$species)
})
} # }