---
output: github_document
# Via https://bookdown.org/yihui/bookdown/citations.html
biblio-style: "apalike"
bibliography: references.bib
link-citations: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# Feature Rank: ensemble feature ranking for variable selection
Ensemble feature ranking for variable selection in SuperLearner ensembles [@polley2021package], based on @effrosynidis2021evaluation. Multiple algorithms each estimate a ranking of the strength of the relationship between the predictors and the outcome in the training set, and these rankings are combined into a single ranking via an aggregation method (currently reciprocal ranking). The final ranking can then be cut at a chosen threshold (e.g. the top 10 predictors, the top 70% of predictors, etc.) to create one or more feature selection wrappers for SuperLearner. The result should generally be more robust and stable than feature selection based on a single algorithm. See also @neumann2017efs for a similar method.
## Install
```{r install, eval = FALSE}
# install.packages("remotes")
remotes::install_github("ck37/featurerank")
```
## Algorithms
Currently implemented algorithms are:
* Feature ranking: correlation, glm, glmnet, random forest, bart, xgboost + shap, variance
* Rank aggregation: reciprocal ranking (illustrated in the sketch below)
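
To make the aggregation step concrete, here is a minimal standalone sketch of reciprocal rank aggregation on toy data (an illustration of the idea, not the package's internal implementation): each ranker assigns every variable a rank, each variable is scored by the sum of the reciprocals of its ranks across the rankers, and the variables are then re-ranked by that score.

```{r reciprocal_rank_sketch, eval = FALSE}
# Toy ranks of 4 variables from 3 hypothetical rankers (1 = strongest).
ranks = rbind(ranker1 = c(x1 = 1, x2 = 2, x3 = 3, x4 = 4),
              ranker2 = c(x1 = 2, x2 = 1, x3 = 4, x4 = 3),
              ranker3 = c(x1 = 1, x2 = 3, x3 = 2, x4 = 4))
# Reciprocal rank score per variable: sum of 1 / rank across the rankers.
scores = colSums(1 / ranks)
# Aggregated ranking: the highest-scoring variable gets rank 1.
rank(-scores)
```

The package's `agg_reciprocal_rank()` (used in the stability example below) performs this style of aggregation on a matrix of rankings.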
## Example
A minimal example to demonstrate how the package can be used.
```{r setup, include=FALSE}
# We call library() here so that startup messages are suppressed in the output,
# and again later in the demo so that readers can see which packages are needed.
library(SuperLearner)
library(glmnet)
# https://github.com/cjcarlson/embarcadero
library(embarcadero)
library(dbarts)
library(weights)
library(randomForest)
library(ck37r)
# Ignore warnings, e.g. from glm().
options("warn" = -1)
```
### Prepare dataset
```{r prep_dataset}
# TODO: switch to a less problematic demo dataset.
data(Boston, package = "MASS")
# Use "chas" as our outcome variable, which is binary.
y = Boston$chas
x = subset(Boston, select = -chas)
```
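
Note that `chas` (whether a tract bounds the Charles River) is a fairly imbalanced 0/1 indicator, which is one reason the dataset is flagged as problematic above and why the cross-validation below is stratified on the outcome. A quick check:

```{r outcome_balance, eval = FALSE}
# Distribution of the binary outcome; chas = 1 is relatively rare.
table(y)
prop.table(table(y))
```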
### Create feature ranking library
Specify the feature ranking wrappers for the ensemble library.
```{r create_library}
library(featurerank)
# Modify RF feature ranker to use 100 trees (faster than default of 500).
featrank_randomForest100 =
  function(...) featrank_randomForest(ntree = 100L, ...)
# Specify the set of feature ranking algorithms.
ensemble_rank_custom =
  function(top_vars, ...)
    ensemble_rank(fn_rank = c(featrank_cor, featrank_randomForest100,
                              featrank_glm, featrank_glmnet),
                              # featrank_shap,   # too verbose currently
                              # featrank_dbarts, # skip for speed
                  top_vars = top_vars,
                  ...)
# There are 13 total vars so try dropping 1 of them.
top12 = function(...) ensemble_rank_custom(top_vars = 12, ...)
# Try dropping worst 2 predictors.
top11 = function(...) ensemble_rank_custom(top_vars = 11, ...)
# Drop worst 3 predictors.
top10 = function(...) ensemble_rank_custom(top_vars = 10, ...)
```
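
Each of these wrappers behaves like a SuperLearner screening function: given the outcome and predictors, it is expected to return a logical vector marking which columns to keep (compare `sl$whichScreen` below). A quick standalone check, sketched using the same calling pattern as the stability example further down:

```{r screen_directly, eval = FALSE}
# Call one screening wrapper directly, outside of SuperLearner.
keep = top10(y, x, family = binomial())
# Names of the predictors the screener retains.
names(x)[keep]
```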
### Use in SuperLearner
```{r use_sl}
library(SuperLearner)
set.seed(1)
# Takes 93 seconds with 1 core.
sl = SuperLearner(y, x, family = binomial(),
                  # 10-fold cross-validation stratified on the outcome.
                  cvControl = list(V = 10L, stratifyCV = TRUE),
                  SL.library =
                    list("SL.glm", # Baseline estimator uses all predictors.
                         # Try three ensemble screening options, giving the
                         # screened variable list to logistic regression (SL.glm).
                         c("SL.glm", "top12", "top11", "top10")))
# Review timing.
sl$times$everything
# We do achieve a modest AUC benefit.
ck37r::auc_table(sl, y = y)[, -6]
# Which features were dropped (will show FALSE below)?
t(sl$whichScreen)
```
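
To generate predictions from the fitted ensemble, the standard `predict()` method for SuperLearner objects can be used; a brief sketch, applied to the training predictors purely for illustration:

```{r predict_sl, eval = FALSE}
# Predicted probabilities from the weighted ensemble.
# onlySL = TRUE restricts to learners with non-zero ensemble weight.
pred = predict(sl, newdata = x, onlySL = TRUE)
summary(pred$pred)
```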
### Assess ranking stability
```{r stability}
# Check whether the ensemble ranking is stable across multiple runs,
# especially in comparison to individual feature ranking algorithms.
# (See the stability scores in Table 3 of the paper.)
set.seed(2)
# Takes about 90 seconds using 1 core.
system.time({
  results =
    do.call(rbind.data.frame,
            lapply(1:10,
                   function(i) top12(y, x, family = binomial(),
                                     # Default replications is 3; more replications increases stability.
                                     replications = 10,
                                     detailed_results = TRUE)$ranking))
})
names(results) = names(x)
# Stability looks excellent.
results
# What if we treated each iteration as its own ranking and then aggregated?
agg_reciprocal_rank(t(results))
```
## References
<div id="refs"></div>