
Commit

some edits distinguishing ML/statistics and genomics
jjc2718 committed Sep 2, 2024
1 parent 26498bc commit 068de88
Showing 1 changed file with 5 additions and 3 deletions.
8 changes: 5 additions & 3 deletions content/02.main-text.md
@@ -12,7 +12,7 @@ Clinically, there are many reasons why a smaller gene signature may be preferabl

Behind much of this work, there is an underlying assumption that smaller gene signatures tend to be more robust: that for a new patient or in a new biological context, a smaller gene set or more parsimonious model will be more likely to maintain its predictive performance than a larger one.
Similar ideas are described in the statistics literature, suggesting that simpler models with performance that is comparable to the best model are more likely to perform robustly across datasets or resist overfitting [@pmc:PMC2929880; @pmc:PMC3994246].
-This assumption has rarely been explicitly tested in genomics applications, but it has often been included in guidelines or rules of thumb for applied statistical modeling or machine learning in biology, e.g. [@doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961].
+Although these assumptions have rarely been formally stated or systematically tested in genomics applications, they are often included in guidelines or rules of thumb for applied statistical modeling or machine learning in biology, e.g. [@doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961].

In this study, we sought to test the robustness assumption directly by evaluating model generalization across biological contexts, inspired by previous work on domain adaptation and transfer learning in cancer transcriptomics [@doi:10.1038/s43018-020-00169-2; @doi:10.1038/s42256-021-00408-w; @doi:10.1073/pnas.2106682118].
We used two large, heterogeneous public cancer datasets: The Cancer Genome Atlas (TCGA) for human tumor sample data [@doi:10.1038/ng.2764], and the Cancer Cell Line Encyclopedia (CCLE) for human cell line data [@doi:10.1038/s41586-019-1186-3].
@@ -169,7 +169,8 @@ To accomplish this, we rely on the "`lambda.1se`" heuristic used in the `glmnet`
We first identify models with performance within one standard error of the top-performing model on the holdout dataset.
Then, from this subset of relatively well-performing models, we choose the smallest (i.e., strongest LASSO penalty) to apply to the test data.
In both cases, we exclusively use the holdout data to select a model and only apply the model to out-of-dataset samples to evaluate generalization performance _after_ model selection.
-Applying these criteria to both the TCGA to CCLE and CCLE to TCGA prediction problems, we saw that model sizes (number of nonzero gene expression features) tended to differ by approximately an order of magnitude between model selection approaches, with medians on the order of 100 nonzero features for the "best" models and on the order of 10 nonzero features for the "smallest good" models, although there was considerable variation between target genes, and some best-performing models included substantially more features (Supplementary Figure {@fig:best_smallest_features}).
+Applying these criteria to both the TCGA to CCLE and CCLE to TCGA prediction problems, we saw that model sizes (number of nonzero gene expression features) tended to differ by approximately an order of magnitude between model selection approaches, with medians on the order of 100 nonzero features for the "best" models and on the order of 10 nonzero features for the "smallest good" models (Supplementary Figure {@fig:best_smallest_features}).
+Still, there was considerable variation between target genes, and some best-performing models included substantially more features than the median, including classifiers we have previously observed to perform well, such as those for _TP53_, _PTEN_, and _SETD2_.
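
As a rough illustration of the "smallest good" selection rule described above, the sketch below implements the one-standard-error idea (analogous to `glmnet`'s `lambda.1se`) in Python. The function name, the toy penalty grid, and the toy holdout scores are hypothetical stand-ins for illustration only, not the pipeline used in this study.

```python
import numpy as np

def select_smallest_good(penalties, mean_scores, std_errors):
    """Pick the most-penalized (sparsest) model whose holdout performance is
    within one standard error of the best-performing model.

    penalties: LASSO penalty strengths (larger penalty = fewer nonzero features)
    mean_scores: mean holdout performance per penalty (e.g., AUPR)
    std_errors: standard error of the holdout performance per penalty
    """
    penalties = np.asarray(penalties)
    mean_scores = np.asarray(mean_scores)
    std_errors = np.asarray(std_errors)

    # "best" model: highest mean holdout performance
    best_idx = np.argmax(mean_scores)
    threshold = mean_scores[best_idx] - std_errors[best_idx]

    # candidate set: models within one standard error of the best model
    good = np.where(mean_scores >= threshold)[0]

    # among the candidates, choose the strongest penalty (smallest model)
    return good[np.argmax(penalties[good])]

# toy example: penalties on an increasing grid, holdout scores peaking mid-path
penalties = np.logspace(-3, 1, 20)
mean_scores = np.array([0.70, 0.72, 0.74, 0.75, 0.76, 0.77, 0.78, 0.78,
                        0.79, 0.78, 0.77, 0.75, 0.72, 0.68, 0.60, 0.55,
                        0.50, 0.45, 0.40, 0.35])
std_errors = np.full(20, 0.01)

best = np.argmax(mean_scores)                                     # index 8
smallest_good = select_smallest_good(penalties, mean_scores, std_errors)  # index 9
print(best, smallest_good)
```

In this sketch a stronger penalty corresponds to fewer nonzero features, so the returned index points to the sparsest model whose holdout performance is within one standard error of the best; only the holdout data are used for the choice, and test-set evaluation happens afterward.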

For TCGA to CCLE generalization, 37/71 genes (52.1%) had better performance for the "best" model, and 24/71 genes (33.8%) had better generalization performance with the "smallest good" model.
The other 10 genes had the same "best" and "smallest good" model: in other words, the "smallest good" model was also the best-performing overall, so the performance difference between the two was exactly 0 (Figure {@fig:tcga_ccle_smallest_best}B).
@@ -287,7 +288,8 @@ Overall, however, we believe the size and tissue representation of TCGA and CCLE
## Conclusion

Without directly evaluating model generalization, it is tempting to assume that simpler models will generalize better than more complex models.
-Previous studies and sets of guidelines suggest this rule of thumb [@doi:10.1214/088342306000000060; @doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961], and existing model selection approaches sometimes incorporate information-theoretic or other explicit criteria to encourage simpler models that do not fit the data as closely.
+Studies in the statistics and machine learning literature suggest this rule of thumb [@doi:10.1214/088342306000000060; @doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961], and model selection approaches sometimes incorporate criteria to encourage simpler models that do not fit the data as closely.
+These ideas have taken root in genomics, although they are less commonly stated formally or studied systematically [@doi:10.1007/s00405-021-06717-5; @doi:10.1089/dna.2020.6193; @doi:10.1186/s12859-021-04503-y].
However, we do not observe strong evidence that simpler models inherently generalize more effectively than more complex ones.
There may be other reasons to train small models or to look for the best model of a certain size/sparsity, such as biomarker interpretability or assay cost.
Our results underscore the importance of defining clear goals for each analysis.
