
There are several reasons one may want to screen out predictor variables from a regression problem being passed to nadir::super_learner():

  • A more parsimonious model that excludes candidate predictors with no explanatory value may have better predictive performance.
  • nadir::super_learner() can be a computationally expensive algorithm, so removing extraneous predictors with little-to-no predictive value can reduce runtime.
  • Including learners that screen out variables at a variety of thresholds enriches the set of candidate learners, making it more likely that the true data generating process is accurately reflected by one of the trained learners.

We view screening out variables as occurring in either of two settings: 1) on the regression problem as a whole, before it is passed to super_learner(), or 2) on a per-learner basis.

When using the nadir package in R/RStudio, the ?screeners documentation is available for reference.

Screening Variables out of the Regression Problem

The screeners currently available in nadir are:

  • screener_cor — screens out predictors whose correlation with the outcome falls below a threshold.
  • screener_cor_top_n — keeps only the n predictors most correlated with the outcome.
  • screener_t_test — uses the t statistic or p-value from a linear model of the outcome regressed on an intercept and one predictor at a time (brief sketches of the latter two screeners follow this list).
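
Each screener takes the data and formula of a regression problem and returns a modified version of it. As brief sketches of the latter two screeners (this assumes their direct interfaces mirror screener_cor, accept the same argument names — keep_n_terms and p_value_threshold — used with add_screener() later in this article, and return an object with a $formula element as screener_cor does):

library(nadir)

# keep only the 5 predictors most correlated with the outcome mpg
screener_cor_top_n(data = mtcars, formula = mpg ~ ., keep_n_terms = 5)$formula

# keep predictors whose single-predictor linear model has a p-value below 0.05
screener_t_test(data = mtcars, formula = mpg ~ ., p_value_threshold = 0.05)$formula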

To use these screeners to set up a regression problem for super_learner(), we can follow this template:

library(nadir)

# step 1:
# specify the original regression problem, before any screening has been done
# 
# data:    will be the mtcars dataset 
# formula: will be mpg ~ . 
#   as in, mpg regressed on every other column.
# 

# step 2: 
# use a screener to modify the problem, dropping some predictors 
screened_regression_problem <- screener_cor(
  data = mtcars, 
  formula = mpg ~ .,
  threshold = 0.5)
# we require each predictor to have a correlation coefficient of at least 0.5 with the outcome to be kept
# 
# if you want to look, you can see what was kept:
screened_regression_problem$formula
## mpg ~ cyl + disp + hp + drat + wt + vs + am + carb
## <environment: 0x133eed6d8>
screened_regression_problem$failed_to_correlate_names
## [1] "qsec" "gear"
# step 3: 
# now we can use nadir::super_learner() with the modified problem
super_learner(
  data = screened_regression_problem$data,
  formula = screened_regression_problem$formula,
  learners = list(lnr_lm, lnr_earth, lnr_rf),
  verbose = TRUE)
## $sl_predictor
## function(newdata) {
##     # for each model, predict on the newdata and apply the model weights
##     future_lapply(1:length(fit_learners), function(i) {
##       fit_learners[[i]](newdata) * learner_weights[[i]]
##     }, future.seed = TRUE) |>
##       Reduce(`+`, x = _) # aggregate across the weighted model predictions
##   }
## <bytecode: 0x1341def28>
## <environment: 0x1341ef698>
## 
## $y_variable
## [1] "mpg"
## 
## $outcome_type
## [1] "continuous"
## 
## $learner_weights
##         lm      earth         rf 
## 0.08265103 0.00000000 0.91734897 
## 
## $holdout_predictions
## # A tibble: 32 × 5
##    .sl_fold    lm earth    rf   mpg
##       <int> <dbl> <dbl> <dbl> <dbl>
##  1        1 22.6   17.0  19.6  21.4
##  2        1 18.0   15.0  16.3  18.7
##  3        1 21.5   17.6  19.0  18.1
##  4        1 12.7   12.9  13.4  10.4
##  5        1 17.4   21.6  19.9  19.7
##  6        1  9.47  16.2  15.4  15  
##  7        2 21.9   18.6  20.5  21  
##  8        2 21.1   20.4  23.0  24.4
##  9        2 12.3   15.1  16.9  16.4
## 10        2 13.8   16.2  17.0  15.2
## # ℹ 22 more rows
## 
## attr(,"class")
## [1] "list"                    "nadir_sl_verbose_output"
# use super_learner() as you would, just with the updated formula and data
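
Because verbose = TRUE was used, the returned object includes an $sl_predictor function (shown above) that can be applied to new data. A minimal sketch, assuming the verbose output is captured in a variable rather than printed as above:

screened_sl_fit <- super_learner(
  data = screened_regression_problem$data,
  formula = screened_regression_problem$formula,
  learners = list(lnr_lm, lnr_earth, lnr_rf),
  verbose = TRUE)

# predict from the weighted ensemble on new observations
screened_sl_fit$sl_predictor(head(mtcars))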

Note that this functionality should be used with considerable discretion. It is offered because there are settings in which a very large number of candidate predictors is computationally burdensome, but screening out variables in this way may introduce issues around post-selection inference.

Adding a Screening Layer into a Learner

In contrast to the above, another approach is to screen variables out within a given learner. This involves constructing learners that have screening “baked in,” so to speak.

To construct these new learners, one should use the add_screener(learner, screener, screener_extra_args) function, which returns a new learner.

# construct some new learners with varying levels and types of screening
 
# here we just show a small sample of examples manually constructed:
lnr_glm_screened_pearson_cor_50 <- add_screener(lnr_glm, screener_cor, list(threshold = 0.5))
lnr_glm_screened_spearman_cor_50 <- add_screener(
  lnr_glm, screener_cor,
  list(threshold = 0.5, cor... = list(method = 'spearman')))
lnr_rf_screened_cor_top_5 <- add_screener(lnr_rf, screener_cor_top_n, list(keep_n_terms = 5))
lnr_earth_screened_t_test_p_lt_05 <- add_screener(
  lnr_earth, screener_t_test, list(p_value_threshold = 0.05))

# use the learners with built-in screeners in a super_learner():
super_learner(
  data = MASS::Boston, 
  formula = medv ~ .,
  learners = list(
    lnr_glm_screened_pearson_cor_50, lnr_glm_screened_spearman_cor_50,
    lnr_rf_screened_cor_top_5, lnr_earth_screened_t_test_p_lt_05),
  verbose = TRUE)
## $sl_predictor
## function(newdata) {
##     # for each model, predict on the newdata and apply the model weights
##     future_lapply(1:length(fit_learners), function(i) {
##       fit_learners[[i]](newdata) * learner_weights[[i]]
##     }, future.seed = TRUE) |>
##       Reduce(`+`, x = _) # aggregate across the weighted model predictions
##   }
## <bytecode: 0x1341def28>
## <environment: 0x13573ae48>
## 
## $y_variable
## [1] "medv"
## 
## $outcome_type
## [1] "continuous"
## 
## $learner_weights
## cor_threshold_screened_glm_1 cor_threshold_screened_glm_2 
##                    0.0000000                    0.0000000 
##        cor_top_n_screened_rf        t_test_screened_earth 
##                    0.4219593                    0.5780407 
## 
## $holdout_predictions
## # A tibble: 506 × 6
##    .sl_fold cor_threshold_screene…¹ cor_threshold_screen…² cor_top_n_screened_rf
##       <int>                   <dbl>                  <dbl>                 <dbl>
##  1        1                    31.7                   32.1                  33.3
##  2        1                    30.7                   30.4                  34.5
##  3        1                    21.8                   23.1                  22.6
##  4        1                    24.3                   25.2                  21.1
##  5        1                    22.5                   22.0                  20.1
##  6        1                    22.0                   20.9                  19.4
##  7        1                    18.1                   18.8                  17.0
##  8        1                    16.5                   17.5                  16.0
##  9        1                    14.5                   15.4                  15.5
## 10        1                    28.7                   28.0                  30.5
## # ℹ 496 more rows
## # ℹ abbreviated names: ¹​cor_threshold_screened_glm_1,
## #   ²​cor_threshold_screened_glm_2
## # ℹ 2 more variables: t_test_screened_earth <dbl>, medv <dbl>
## 
## attr(,"class")
## [1] "list"                    "nadir_sl_verbose_output"

While the above shows how to construct screened learners manually, there may be settings in which constructing them programmatically is beneficial. Here we show how to produce an array of learners with built-in screeners from a grid of learner and screener combinations:

# construct new learners with builtin screeners
# =============================================

# learners
base_learners <- list(lnr_glm, lnr_hal, lnr_earth, lnr_rf, lnr_glmnet)

# screeners
screeners <- list(screener_cor, screener_cor, screener_cor, screener_cor_top_n)
screener_extra_args <- list(list(threshold = 0.3), 
                            list(threshold = 0.4), 
                            list(threshold = 0.5), 
                            list(keep_n_terms = 10))
# note: screeners and screener_extra_args must have the same length; they are matched element-wise

# set up a grid of combinations of learners and screeners
# we'll refer to them by indices to avoid duplicating objects unnecessarily 
learner_screener_grid <- expand.grid(
  learner = seq_along(base_learners),
  screener = seq_along(screeners))

new_learners <- lapply(1:nrow(learner_screener_grid), \(i) {
  learner_i <- learner_screener_grid[['learner']][i]
  screener_i <- learner_screener_grid[['screener']][i]
  
  new_learner <- add_screener(learner = base_learners[[learner_i]],
                              screener = screeners[[screener_i]],
                              screener_extra_args = screener_extra_args[[screener_i]])
  new_learner
})
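
# sanity check: each row of the grid pairs one base learner with one screener
# configuration, so new_learners should contain 5 x 4 = 20 screened learners
length(new_learners)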


# run super_learner() with the new screeners
# ==========================================

# increase the limit on the size of globals shipped to parallel workers
# (roughly 8 GB) to accommodate a large number of learners
options(future.globals.maxSize = 8000 * 1024^2) 

nadir::super_learner(
  data = MASS::Boston, 
  formula = medv ~ .,
  learners = new_learners,
  verbose = TRUE)
## $sl_predictor
## function(newdata) {
##     # for each model, predict on the newdata and apply the model weights
##     future_lapply(1:length(fit_learners), function(i) {
##       fit_learners[[i]](newdata) * learner_weights[[i]]
##     }, future.seed = TRUE) |>
##       Reduce(`+`, x = _) # aggregate across the weighted model predictions
##   }
## <bytecode: 0x1341def28>
## <environment: 0x1343950e8>
## 
## $y_variable
## [1] "medv"
## 
## $outcome_type
## [1] "continuous"
## 
## $learner_weights
##    cor_threshold_screened_glm_1    cor_threshold_screened_hal_1 
##                      0.00000000                      0.14714937 
##  cor_threshold_screened_earth_1     cor_threshold_screened_rf_1 
##                      0.04341371                      0.00000000 
## cor_threshold_screened_glmnet_1    cor_threshold_screened_glm_2 
##                      0.00000000                      0.00000000 
##    cor_threshold_screened_hal_2  cor_threshold_screened_earth_2 
##                      0.00000000                      0.00000000 
##     cor_threshold_screened_rf_2 cor_threshold_screened_glmnet_2 
##                      0.44293555                      0.00000000 
##    cor_threshold_screened_glm_3    cor_threshold_screened_hal_3 
##                      0.00000000                      0.02000205 
##  cor_threshold_screened_earth_3     cor_threshold_screened_rf_3 
##                      0.00000000                      0.00000000 
## cor_threshold_screened_glmnet_3          cor_top_n_screened_glm 
##                      0.00000000                      0.00000000 
##          cor_top_n_screened_hal        cor_top_n_screened_earth 
##                      0.10746165                      0.00000000 
##           cor_top_n_screened_rf       cor_top_n_screened_glmnet 
##                      0.23903768                      0.00000000 
## 
## $holdout_predictions
## # A tibble: 506 × 22
##    .sl_fold cor_threshold_screen…¹ cor_threshold_screen…² cor_threshold_screen…³
##       <int>                  <dbl>                  <dbl>                  <dbl>
##  1        1                   31.3                   29.1                   36.5
##  2        1                   30.8                   30.2                   32.3
##  3        1                   25.9                   21.6                   22.1
##  4        1                   23.6                   20.4                   20.3
##  5        1                   25.9                   20.5                   21.5
##  6        1                   17.5                   17.4                   17.7
##  7        1                   14.2                   19.5                   19.3
##  8        1                   16.3                   17.3                   15.1
##  9        1                   21.2                   20.2                   19.1
## 10        1                   22.9                   21.2                   20.7
## # ℹ 496 more rows
## # ℹ abbreviated names: ¹​cor_threshold_screened_glm_1,
## #   ²​cor_threshold_screened_hal_1, ³​cor_threshold_screened_earth_1
## # ℹ 18 more variables: cor_threshold_screened_rf_1 <dbl>,
## #   cor_threshold_screened_glmnet_1 <dbl>, cor_threshold_screened_glm_2 <dbl>,
## #   cor_threshold_screened_hal_2 <dbl>, cor_threshold_screened_earth_2 <dbl>,
## #   cor_threshold_screened_rf_2 <dbl>, cor_threshold_screened_glmnet_2 <dbl>, …
## 
## attr(,"class")
## [1] "list"                    "nadir_sl_verbose_output"