Commit

Merge pull request #24 from jjc2718/r2_changes
Changes from second round of revisions
jjc2718 authored Sep 9, 2024
2 parents 3b665ed + 4f81825 commit a8303ac
Showing 23 changed files with 28,995 additions and 18,583 deletions.
14 changes: 9 additions & 5 deletions content/02.main-text.md
@@ -12,7 +12,7 @@ Clinically, there are many reasons why a smaller gene signature may be preferabl

Behind much of this work, there is an underlying assumption that smaller gene signatures tend to be more robust: that for a new patient or in a new biological context, a smaller gene set or more parsimonious model will be more likely to maintain its predictive performance than a larger one.
Similar ideas are described in the statistics literature, suggesting that simpler models with performance that is comparable to the best model are more likely to perform robustly across datasets or resist overfitting [@pmc:PMC2929880; @pmc:PMC3994246].
-This assumption has rarely been explicitly tested in genomics applications, but it has often been included in guidelines or rules of thumb for applied statistical modeling or machine learning in biology, e.g. [@doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961].
+Although these assumptions have rarely been formally stated or systematically tested in genomics applications, they are often included in guidelines or rules of thumb for applied statistical modeling or machine learning in biology, e.g. [@doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961].

In this study, we sought to test the robustness assumption directly by evaluating model generalization across biological contexts, inspired by previous work on domain adaptation and transfer learning in cancer transcriptomics [@doi:10.1038/s43018-020-00169-2; @doi:10.1038/s42256-021-00408-w; @doi:10.1073/pnas.2106682118].
We used two large, heterogeneous public cancer datasets: The Cancer Genome Atlas (TCGA) for human tumor sample data [@doi:10.1038/ng.2764], and the Cancer Cell Line Encyclopedia (CCLE) for human cell line data [@doi:10.1038/s41586-019-1186-3].
@@ -128,7 +128,7 @@ All neural network analyses were performed on a Ubuntu 18.04 machine with a NVID
### Evaluating model generalization using public cancer data

We collected data from the TCGA Pan-Cancer Atlas and the Cancer Cell Line Encyclopedia to predict the presence or absence of mutations in cancer genes, as a benchmark of cancer-related information content across cancer types and contexts.
-We trained mutation status classifiers across approximately 70 genes involved in cancer development and progression from Vogelstein et al. 2013 [@doi:10.1126/science.1235122], using LASSO logistic regression with gene expression (RNA-seq) values as predictive features.
+We trained mutation status classifiers across approximately 70 genes involved in cancer development and progression from Vogelstein et al. 2013 [@doi:10.1126/science.1235122], using LASSO logistic regression with gene expression (RNA-seq) values as predictive features, and integrating point mutation and copy number data to label each sample as mutated or not mutated in the target gene (Supplementary Note [S1](#supplementary-note-s1)).
We fit each classifier across a range of regularization parameters, resulting in models at sparsity levels spanning the extremes of 0 nonzero features and all features included (Supplementary Figure {@fig:average_sparsity}).
Inspired by the generalization experiments across tissues and model systems in [@doi:10.1038/s43018-020-00169-2], we designed experiments to evaluate the generalization of mutation status classifiers across datasets (TCGA to CCLE and CCLE to TCGA) and across biological contexts (cancer types) within TCGA, relative to a within-dataset baseline (Figure {@fig:overview}).
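
As a concrete illustration of this setup, the sketch below fits one L1-penalized (LASSO) logistic regression classifier per regularization value and records how many gene expression features receive nonzero coefficients. It assumes scikit-learn; the variable names and parameter grid are illustrative rather than the study's actual configuration.

```python
# Minimal sketch: fit LASSO logistic regression over a range of regularization
# strengths and record model sparsity. The parameter grid is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_lasso_path(X_train, y_train, c_values=np.logspace(-3, 2, 11)):
    """Return a list of (C, n_nonzero_features, model) tuples, spanning models
    from nearly empty (strong penalty) to nearly all features retained."""
    results = []
    for c in c_values:
        clf = LogisticRegression(penalty="l1", C=c, solver="liblinear", max_iter=1000)
        clf.fit(X_train, y_train)
        n_nonzero = int(np.count_nonzero(clf.coef_))
        results.append((c, n_nonzero, clf))
    return results
```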

@@ -164,10 +164,13 @@ For generalization from CCLE to TCGA, we observed a more pronounced upward shift
To address whether sparser or more parsimonious models tend to generalize better, we implemented two model selection schemes and compared them for the TCGA to CCLE and CCLE to TCGA mutation prediction problems (Figure {@fig:tcga_ccle_smallest_best}A).
The "best" model selection scheme chooses the top-performing model (LASSO parameter) on the holdout dataset from the same source as the training data and applies it to the test data from the other data source.
The intention of the "smallest good" model selection scheme is to balance parsimony with reasonable performance on the holdout data, since simply selecting the smallest possible model (generally, the dummy regressor/mean predictor) is not likely to generalize well.
-To accomplish this, we rely on the "`lambda.1se" heuristic used in the `glmnet` R package for generalized linear models, as one of the default methods for parameter choice and model selection [@doi:10.18637/jss.v033.i01].
+To accomplish this, we rely on the "`lambda.1se`" heuristic used in the `glmnet` R package for generalized linear models, which is one of the default methods for parameter choice and model selection [@doi:10.18637/jss.v033.i01].
We first identify models with performance within one standard error of the top-performing model on the holdout dataset.
Then, from this subset of relatively well-performing models, we choose the smallest (i.e., strongest LASSO penalty) to apply to the test data.
In both cases, we exclusively use the holdout data to select a model and only apply the model to out-of-dataset samples to evaluate generalization performance _after_ model selection.
Applying these criteria to both the TCGA to CCLE and CCLE to TCGA prediction problems, we saw that model sizes (number of nonzero gene expression features) tended to differ by approximately an order of magnitude between model selection approaches, with medians on the order of 100 nonzero features for the "best" models and on the order of 10 nonzero features for the "smallest good" models (Supplementary Figure {@fig:best_smallest_features}).
Still, there was considerable variation between target genes, and some best-performing models included substantially more features than the median, including classifiers we have previously observed to perform well such as _TP53_, _PTEN_, and _SETD2_.
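
The selection rule itself is simple; a minimal sketch follows, assuming per-parameter holdout performance summaries (mean AUPR and its standard error across splits) are already available. The dictionary keys and the use of AUPR here are illustrative, and the code mirrors the `lambda.1se`-style logic described above rather than reproducing `glmnet` itself.

```python
# "Smallest good" selection: among models within one standard error of the
# best holdout performance, choose the sparsest (strongest LASSO penalty).
def smallest_good_model(models):
    """models: list of dicts with keys 'penalty' (larger = sparser model),
    'mean_aupr', and 'se_aupr', one entry per LASSO parameter setting."""
    best = max(models, key=lambda m: m["mean_aupr"])
    threshold = best["mean_aupr"] - best["se_aupr"]
    good = [m for m in models if m["mean_aupr"] >= threshold]
    # The strongest penalty among the well-performing subset gives the
    # smallest model that is still within one standard error of the best.
    return max(good, key=lambda m: m["penalty"])
```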

For TCGA to CCLE generalization, 37/71 genes (52.1%) had better performance for the "best" model, and 24/71 genes (33.8%) had better generalization performance with the "smallest good" model.
The other 10 genes had the same "best" and "smallest good" model: in other words, the "smallest good" model was also the best-performing overall, so the performance difference between the two was exactly 0 (Figure {@fig:tcga_ccle_smallest_best}B).
@@ -228,7 +231,7 @@ For _EGFR_ mutation status prediction, we saw that performance for small hidden
On average, over all 71 genes from Vogelstein et al., performance on both held-out TCGA data and CCLE data tends to increase until a hidden layer size of 10-50, then flatten (Figure {@fig:tcga_ccle_nn}B).
To explore additional approaches to neural network regularization, we also tried varying dropout and weight decay for _EGFR_ and _KRAS_ mutation status classification while holding the hidden layer size constant.
Results followed a similar trend, with generalization performance generally tracking performance on holdout data (Supplementary Figure {@fig:nn_dropout_wd}).
-We also preprocessed the input gene expression features using PCA, and varied the number of PCA features retained as input to the neural network; for _EGFR_ the best generalization performance and holdout performance both occurred at 1000 PCs, but for _KRAS_ the model generalized better to cell line data for fewer PCs than its peak holdout performance (Supplementary Figure {@fig:nn_dropout_wd}).
+We also preprocessed the input gene expression features using PCA, and varied the number of PCA features retained as input to the neural network; for _EGFR_ the best generalization performance and holdout performance both occurred at 1000 PCs, but for _KRAS_ the model generalized better to cell line data for fewer PCs than its peak holdout performance (Supplementary Figure {@fig:nn_pca}).
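
For reference, a one-hidden-layer classifier exposing the three regularization knobs varied here (hidden layer size, dropout probability, and weight decay) might look like the sketch below. It assumes PyTorch, and the layer sizes, learning rate, and optimizer settings are placeholders rather than the values used in these experiments.

```python
# Illustrative one-hidden-layer mutation status classifier with configurable
# hidden size and dropout; weight decay (L2) is applied via the optimizer.
import torch
import torch.nn as nn

class MutationClassifier(nn.Module):
    def __init__(self, n_features, hidden_size=50, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),  # single logit: mutated vs. not mutated
        )

    def forward(self, x):
        return self.net(x)

model = MutationClassifier(n_features=1000, hidden_size=50, dropout=0.5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```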

It can be challenging to assess which hidden layer sizes tend to perform relatively well or poorly across classifiers, since different genes may have different baseline AUPR values and overall classifier effect sizes.
In order to summarize across genes, for each gene, we ranked the range of hidden layer sizes by the corresponding models' generalization performance on CCLE (Figure {@fig:tcga_ccle_nn}C).
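
In practice, this ranking can be computed directly from a long-format table of results; the sketch below, assuming pandas and using illustrative column names and placeholder AUPR values (not results from the study), ranks hidden layer sizes within each gene and then summarizes the ranks across genes.

```python
# Rank hidden layer sizes within each gene by generalization AUPR, so ranks are
# comparable across genes with different baseline performance. Values below are
# placeholders for illustration only.
import pandas as pd

results = pd.DataFrame({
    "gene":        ["EGFR", "EGFR", "EGFR", "KRAS", "KRAS", "KRAS"],
    "hidden_size": [5, 50, 500, 5, 50, 500],
    "ccle_aupr":   [0.40, 0.55, 0.53, 0.62, 0.70, 0.71],
})

# Rank within each gene (1 = best generalization performance for that gene).
results["rank"] = results.groupby("gene")["ccle_aupr"].rank(ascending=False)

# Median rank per hidden layer size summarizes relative performance across genes.
print(results.groupby("hidden_size")["rank"].median())
```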
@@ -285,7 +288,8 @@ Overall, however, we believe the size and tissue representation of TCGA and CCLE
## Conclusion

Without directly evaluating model generalization, it is tempting to assume that simpler models will generalize better than more complex models.
-Previous studies and sets of guidelines suggest this rule of thumb [@doi:10.1214/088342306000000060; @doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961], and existing model selection approaches sometimes incorporate information-theoretic or other explicit criteria to encourage simpler models that do not fit the data as closely.
+Studies in the statistics and machine learning literature suggest this rule of thumb [@doi:10.1214/088342306000000060; @doi:10/bhfhgd; @doi:10.4137/CIN.S408; @doi:10.1371/journal.pcbi.1004961], and model selection approaches sometimes incorporate criteria to encourage simpler models that do not fit the data as closely.
+These ideas have taken root in genomics, although they are less commonly stated formally or studied systematically [@doi:10.1007/s00405-021-06717-5; @doi:10.1089/dna.2020.6193; @doi:10.1186/s12859-021-04503-y].
However, we do not observe strong evidence that simpler models inherently generalize more effectively than more complex ones.
There may be other reasons to train small models or to look for the best model of a certain size/sparsity, such as biomarker interpretability or assay cost.
Our results underscore the importance of defining clear goals for each analysis.
