Correlation Threshold Based Screening

Usage

screener_cor_top_n(data, formula, keep_n_terms, cor... = NULL)

Arguments

data

A dataframe intended to be used with super_learner()

formula

The formula specifying the regression to be done

keep_n_terms

An integer >= 1. The n terms in the model frame with the greatest absolute correlation with the outcome are kept; all other terms are screened out.

cor...

An optional list of extra arguments to pass to cor(). For example, use list(method = 'spearman') for the Spearman rank-based correlation coefficient.
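As an illustration of why method = 'spearman' can matter, the sketch below compares the Pearson and Spearman coefficients on a monotone but strongly nonlinear relationship. It uses only base R. Forwarding the list with do.call() is an assumption about how a list such as cor... might be passed on to cor(), not a description of this function's internals.

```r
# Toy data: y increases with x, but not linearly.
x <- 1:10
y <- exp(x)

# Default Pearson correlation measures linear association.
rho_pearson <- cor(x, y)

# A list of extra arguments, shaped like the one accepted by `cor...`.
extra_args <- list(method = "spearman")

# Assumed forwarding pattern: splice the list into the call to cor().
rho_spearman <- do.call(cor, c(list(x, y), extra_args))

# Spearman works on ranks, so a strictly increasing relationship scores 1,
# while Pearson is pulled down by the curvature.
print(c(pearson = rho_pearson, spearman = rho_spearman))
```

With a Pearson-based screen, a predictor related to the outcome in this way could be ranked lower than its rank-based association suggests.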

Value

A list containing $data, the data with the screened-out columns removed; $formula, the formula with the screened-out variables removed; and $failed_to_correlate_names, the names of the variables that were screened out because they were not among the keep_n_terms terms most correlated with the outcome.

Details

If a predictor has little correlation with the outcome being predicted, we might want to screen it out of the model.

This matters most in large datasets, where running super_learner() on a very large number of columns can be computationally intractable or frustratingly slow.
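The idea can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: top_n_by_abs_cor() and its arguments are hypothetical names, and the sketch ranks numeric columns directly, whereas screener_cor_top_n() works on the terms of the model frame built from the formula.

```r
# Hypothetical helper: rank predictors by absolute correlation with the
# outcome and keep the top n. Assumes every column is numeric.
top_n_by_abs_cor <- function(data, outcome, n, cor_args = list()) {
  predictors <- setdiff(names(data), outcome)

  # Absolute correlation of each predictor with the outcome; extra
  # arguments (e.g. method = "spearman") are forwarded to cor().
  abs_cor <- vapply(
    predictors,
    function(p) {
      abs(do.call(cor, c(list(data[[p]], data[[outcome]]), cor_args)))
    },
    numeric(1)
  )

  # Sort from strongest to weakest and split into kept and dropped names.
  ranked <- names(sort(abs_cor, decreasing = TRUE))
  kept <- head(ranked, n)
  list(kept = kept, dropped = setdiff(predictors, kept))
}

res <- top_n_by_abs_cor(mtcars, outcome = "mpg", n = 5)
print(res$kept)     # the five predictors most correlated with mpg
print(res$dropped)  # the predictors that would be screened out
```

Here res$dropped plays a role analogous to $failed_to_correlate_names in the returned list.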

Examples

if (FALSE) { # \dontrun{
screener_cor_top_n(
  data = mtcars,
  formula = mpg ~ .,
  keep_n_terms = 5)

# The same call using the Spearman rank-based correlation coefficient,
# which does not assume a linear relationship with the outcome.

screener_cor_top_n(
  data = mtcars,
  formula = mpg ~ .,
  keep_n_terms = 5,
  cor... = list(method = 'spearman')
  )
} # }