From 8abb0c7e97584cd30a02889d8a3d173a4d69cb56 Mon Sep 17 00:00:00 2001
From: LTLA <infinite.monkeys.with.keyboards@gmail.com>
Date: Sun, 24 Nov 2024 15:56:37 -0800
Subject: [PATCH] Cleaned up docs about the expectations on the reference data.

---
 DESCRIPTION              |  4 ++--
 R/SingleR.R              |  2 +-
 R/matchReferences.R      |  2 +-
 R/trainSingleR.R         | 40 ++++++++++++++++++++++++++--------------
 man/SingleR.Rd           |  2 +-
 man/getClassicMarkers.Rd |  2 +-
 man/matchReferences.Rd   |  2 +-
 man/trainSingleR.Rd      | 40 ++++++++++++++++++++++++++--------------
 vignettes/SingleR.Rmd    |  4 ++++
 9 files changed, 63 insertions(+), 35 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 62c7d65c..549ccd30 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,7 +1,7 @@
 Package: SingleR
 Title: Reference-Based Single-Cell RNA-Seq Annotation
-Version: 2.9.1
-Date: 2024-11-17
+Version: 2.9.2
+Date: 2024-11-24
 Authors@R: c(person("Dvir", "Aran", email="dvir.aran@ucsf.edu", role=c("aut", "cph")),
     person("Aaron", "Lun", email="infinite.monkeys.with.keyboards@gmail.com", role=c("ctb", "cre")),
     person("Daniel", "Bunis", role="ctb"),
diff --git a/R/SingleR.R b/R/SingleR.R
index a8bf9311..a458fdba 100644
--- a/R/SingleR.R
+++ b/R/SingleR.R
@@ -6,7 +6,7 @@
 #' @param test A numeric matrix of single-cell expression values where rows are genes and columns are cells.
 #' Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
 #' @inheritParams trainSingleR
-#' @param ref A numeric matrix of (usually log-transformed) expression values from a reference dataset,
+#' @param ref A numeric matrix of (usually normalized and log-transformed) expression values from a reference dataset,
 #' or a \linkS4class{SummarizedExperiment} object containing such a matrix;
 #' see \code{\link{trainSingleR}} for details.
 #'
diff --git a/R/matchReferences.R b/R/matchReferences.R
index 9d7fedb7..32178749 100644
--- a/R/matchReferences.R
+++ b/R/matchReferences.R
@@ -3,7 +3,7 @@
 #' Match labels from a pair of references, corresponding to the same underlying cell type or state
 #' but with differences in nomenclature.
 #' 
-#' @param ref1,ref2 Numeric matrices of single-cell (usually log-transformed) expression values where rows are genes and columns are cells.
+#' @param ref1,ref2 Numeric matrices of single-cell (usually normalized and log-transformed) expression values where rows are genes and columns are cells.
 #' Alternatively, \linkS4class{SummarizedExperiment} objects containing such matrices.
 #' @param labels1,labels2 A character vector or factor of known labels for all cells in \code{ref1} and \code{ref2}, respectively.
 #' @param ... Further arguments to pass to \code{\link{SingleR}}.
diff --git a/R/trainSingleR.R b/R/trainSingleR.R
index 787f36a8..51978619 100644
--- a/R/trainSingleR.R
+++ b/R/trainSingleR.R
@@ -4,7 +4,7 @@
 #' 
 #' @param ref A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples).
 #' Each row should be named with the gene name.
-#' In general, the expression values are expected to be log-transformed, see Details.
+#' In general, the expression values are expected to be normalized and log-transformed, see Details.
 #' 
 #' Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
 #'
@@ -28,8 +28,7 @@
 #' @param sd.thresh Deprecated and ignored.
 #' @param de.method String specifying how DE genes should be detected between pairs of labels.
 #' Defaults to \code{"classic"}, which sorts genes by the log-fold changes and takes the top \code{de.n}.
-#' Setting to \code{"wilcox"} or \code{"t"} will use Wilcoxon ranked sum test or Welch t-test between labels, respectively,
-#' and take the top \code{de.n} upregulated genes per comparison.
+#' Other options are \code{"wilcox"} and \code{"t"}, see Details.
 #' Ignored if \code{genes} is a list of markers/DE genes.
 #' @param de.n An integer scalar specifying the number of DE genes to use when \code{genes="de"}.
 #' If \code{de.method="classic"}, defaults to \code{500 * (2/3) ^ log2(N)} where \code{N} is the number of unique labels.
@@ -73,10 +72,25 @@
 #' The resulting objects can be re-used across multiple classification steps with different test data sets via \code{\link{classifySingleR}}.
 #' This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.
 #' 
-#' The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between labels.
-#' This is done by identifying the median expression within each label, and computing differences between medians for each pair of labels.
-#' For each label, the top \code{de.n} genes with the largest differences compared to another label are chosen as markers to distinguish the two labels.
+#' The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between pairs of labels in the reference dataset.
 #' The expression values are expected to be log-transformed and normalized.
+#' For each pair of labels, the top \code{de.n} genes with strongest upregulation in one label are chosen as markers to distinguish it from the other label.
+#' The exact ranking depends on the \code{de.method=} argument:
+#' \itemize{
+#' \item The default \code{de.method="classic"} will use \code{\link{getClassicMarkers}} to compute the median expression for each label and each gene.
+#' Then, for each pair of labels, the top \code{de.n} genes with the largest positive differences are chosen as markers to distinguish the first label from the second.
+#' This is intended for reference datasets derived from bulk transcriptomic data (e.g., microarrays) with a high density of non-zero values. 
+#' It is less effective for single-cell data, where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
+#' \item \code{de.method="wilcox"} will rank genes based on the area under the curve (AUC) in each pairwise comparison between labels.
+#' The top \code{de.n} genes with the largest AUCs above 0.5 are chosen as markers for the first label compared to the second.
+#' This is analogous to ranking on significance in the Wilcoxon ranked sum test and is intended for use with single-cell data.
+#' The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
+#' \item \code{de.method="t"} will rank genes on the Cohen's d in each pairiwse comparison.
+#' The top \code{de.n} genes with the largest positive Cohen's d are chosen as markers for the first label compared to the second.
+#' This is roughly analogous to ranking on significance in the t-test and is faster than the AUC.
+#' The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
+#' }
+#' Alternatively, users can detect markers externally and pass a list of markers to \code{genes} (see \dQuote{Custom gene specification}).
 #'
 #' Classification with \code{classifySingleR} assumes that the test dataset contains all marker genes that were detected from the reference.
 #' If the test and reference datasets do not have the same genes in the same order, we can set \code{test.genes} to the row names of the test dataset.
@@ -88,7 +102,7 @@
 #' It has the same effect as filtering out undesirable rows from \code{ref} prior to calling \code{trainSingleR}.
 #' Unlike \code{test.genes}, setting \code{restrict} does not introduce further checks on \code{rownames(test)} in \code{classifySingleR}.
 #'
-#' @section Custom feature specification:
+#' @section Custom gene specification:
 #' Rather than relying on the in-built feature selection, users can pass in their own features of interest to \code{genes}.
 #' The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels.
 #' For example:
@@ -115,10 +129,12 @@
 #' This allows the function to handle pre-defined marker lists for specific cell populations.
 #' However, it obviously captures less information than marker sets for the pairwise comparisons.
 #'
-#' If \code{genes} is manually passed, \code{ref} can be the raw counts or any monotonic transformation thereof.
+#' If \code{genes} is manually passed, \code{ref} can contain the raw counts or any monotonic transformation thereof.
 #' There is no need to supply (log-)normalized expression values for the benefit of the automatic marker detection.
 #' Similarly, for manual \code{genes}, the values of \code{de.method}, \code{de.n} and \code{sd.thresh} have no effect.
 #'
+#' Check out the Examples to see how manual \code{genes} can be passed to \code{trainSingleR}.
+#'
 #' @section Dealing with multiple references:
 #' The default \pkg{SingleR} policy for dealing with multiple references is to perform the classification for each reference separately and combine the results 
 #' (see \code{?\link{combineRecomputedResults}} for an explanation).
@@ -131,12 +147,8 @@
 #' where each inner list corresponds to a reference in \code{ref} and can be of any format described in \dQuote{Custom feature specification}.
 #' Thus, it is possible for \code{genes} to be - wait for it - a list (per reference) of lists (per label) of lists (per label) of character vectors.
 #'
-#' @section Note on single-cell references:
-#' The default marker selection is based on log-fold changes between the per-label medians and is very much designed with bulk references in mind.
-#' It may not be effective for single-cell reference data where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
-#' Users are recommended to either set \code{de.method} to another DE ranking method, or detect markers externally and pass a list of markers to \code{genes} (see Examples).
-#'
-#' In addition, it is generally unnecessary to have single-cell resolution on the reference profiles.
+#' @section Aggregating single-cell references:
+#' It is generally unnecessary to have single-cell resolution on the reference profiles.
 #' We can instead set \code{aggr.ref=TRUE} to aggregate per-cell references into a set of pseudo-bulk profiles using \code{\link{aggregateReference}}.
 #' This improves classification speed while using vector quantization to preserve within-label heterogeneity and mitigate the loss of information.
 #' Note that any aggregation is done \emph{after} marker gene detection; this ensures that the relevant tests can appropriately penalize within-label variation.
diff --git a/man/SingleR.Rd b/man/SingleR.Rd
index d706e0f6..ac072107 100644
--- a/man/SingleR.Rd
+++ b/man/SingleR.Rd
@@ -35,7 +35,7 @@ SingleR(
 \item{test}{A numeric matrix of single-cell expression values where rows are genes and columns are cells.
 Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.}
 
-\item{ref}{A numeric matrix of (usually log-transformed) expression values from a reference dataset,
+\item{ref}{A numeric matrix of (usually normalized and log-transformed) expression values from a reference dataset,
 or a \linkS4class{SummarizedExperiment} object containing such a matrix;
 see \code{\link{trainSingleR}} for details.
 
diff --git a/man/getClassicMarkers.Rd b/man/getClassicMarkers.Rd
index 54a79279..e780341b 100644
--- a/man/getClassicMarkers.Rd
+++ b/man/getClassicMarkers.Rd
@@ -17,7 +17,7 @@ getClassicMarkers(
 \arguments{
 \item{ref}{A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples).
 Each row should be named with the gene name.
-In general, the expression values are expected to be log-transformed, see Details.
+In general, the expression values are expected to be normalized and log-transformed, see Details.
 
 Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
 
diff --git a/man/matchReferences.Rd b/man/matchReferences.Rd
index 937975fc..d92c0aff 100644
--- a/man/matchReferences.Rd
+++ b/man/matchReferences.Rd
@@ -7,7 +7,7 @@
 matchReferences(ref1, ref2, labels1, labels2, ...)
 }
 \arguments{
-\item{ref1, ref2}{Numeric matrices of single-cell (usually log-transformed) expression values where rows are genes and columns are cells.
+\item{ref1, ref2}{Numeric matrices of single-cell (usually normalized and log-transformed) expression values where rows are genes and columns are cells.
 Alternatively, \linkS4class{SummarizedExperiment} objects containing such matrices.}
 
 \item{labels1, labels2}{A character vector or factor of known labels for all cells in \code{ref1} and \code{ref2}, respectively.}
diff --git a/man/trainSingleR.Rd b/man/trainSingleR.Rd
index 1ae9cea2..2b6b9e8d 100644
--- a/man/trainSingleR.Rd
+++ b/man/trainSingleR.Rd
@@ -28,7 +28,7 @@ trainSingleR(
 \arguments{
 \item{ref}{A numeric matrix of expression values where rows are genes and columns are reference samples (individual cells or bulk samples).
 Each row should be named with the gene name.
-In general, the expression values are expected to be log-transformed, see Details.
+In general, the expression values are expected to be normalized and log-transformed, see Details.
 
 Alternatively, a \linkS4class{SummarizedExperiment} object containing such a matrix.
 
@@ -59,8 +59,7 @@ containing markers for labels in the corresponding entry of \code{ref}.}
 
 \item{de.method}{String specifying how DE genes should be detected between pairs of labels.
 Defaults to \code{"classic"}, which sorts genes by the log-fold changes and takes the top \code{de.n}.
-Setting to \code{"wilcox"} or \code{"t"} will use Wilcoxon ranked sum test or Welch t-test between labels, respectively,
-and take the top \code{de.n} upregulated genes per comparison.
+Other options are \code{"wilcox"} and \code{"t"}, see Details.
 Ignored if \code{genes} is a list of markers/DE genes.}
 
 \item{de.n}{An integer scalar specifying the number of DE genes to use when \code{genes="de"}.
@@ -117,10 +116,25 @@ This function uses a training data set to select interesting features and constr
 The resulting objects can be re-used across multiple classification steps with different test data sets via \code{\link{classifySingleR}}.
 This improves efficiency by avoiding unnecessary repetition of steps during the downstream analysis.
 
-The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between labels.
-This is done by identifying the median expression within each label, and computing differences between medians for each pair of labels.
-For each label, the top \code{de.n} genes with the largest differences compared to another label are chosen as markers to distinguish the two labels.
+The automatic marker detection (\code{genes="de"}) identifies genes that are differentially expressed between pairs of labels in the reference dataset.
 The expression values are expected to be log-transformed and normalized.
+For each pair of labels, the top \code{de.n} genes with strongest upregulation in one label are chosen as markers to distinguish it from the other label.
+The exact ranking depends on the \code{de.method=} argument:
+\itemize{
+\item The default \code{de.method="classic"} will use \code{\link{getClassicMarkers}} to compute the median expression for each label and each gene.
+Then, for each pair of labels, the top \code{de.n} genes with the largest positive differences are chosen as markers to distinguish the first label from the second.
+This is intended for reference datasets derived from bulk transcriptomic data (e.g., microarrays) with a high density of non-zero values. 
+It is less effective for single-cell data, where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
+\item \code{de.method="wilcox"} will rank genes based on the area under the curve (AUC) in each pairwise comparison between labels.
+The top \code{de.n} genes with the largest AUCs above 0.5 are chosen as markers for the first label compared to the second.
+This is analogous to ranking on significance in the Wilcoxon ranked sum test and is intended for use with single-cell data.
+The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
+\item \code{de.method="t"} will rank genes on the Cohen's d in each pairiwse comparison.
+The top \code{de.n} genes with the largest positive Cohen's d are chosen as markers for the first label compared to the second.
+This is roughly analogous to ranking on significance in the t-test and is faster than the AUC.
+The exact calculaton is performed using the \code{\link[scrapper]{scoreMarkers}} function.
+}
+Alternatively, users can detect markers externally and pass a list of markers to \code{genes} (see \dQuote{Custom gene specification}).
 
 Classification with \code{classifySingleR} assumes that the test dataset contains all marker genes that were detected from the reference.
 If the test and reference datasets do not have the same genes in the same order, we can set \code{test.genes} to the row names of the test dataset.
@@ -132,7 +146,7 @@ This can be convenient for ignoring inappropriate genes like pseudogenes or pred
 It has the same effect as filtering out undesirable rows from \code{ref} prior to calling \code{trainSingleR}.
 Unlike \code{test.genes}, setting \code{restrict} does not introduce further checks on \code{rownames(test)} in \code{classifySingleR}.
 }
-\section{Custom feature specification}{
+\section{Custom gene specification}{
 
 Rather than relying on the in-built feature selection, users can pass in their own features of interest to \code{genes}.
 The function expects a named list of named lists of character vectors, with each vector containing the DE genes between a pair of labels.
@@ -160,9 +174,11 @@ The entry \code{genes$A} represent the genes that are upregulated in \code{A} co
 This allows the function to handle pre-defined marker lists for specific cell populations.
 However, it obviously captures less information than marker sets for the pairwise comparisons.
 
-If \code{genes} is manually passed, \code{ref} can be the raw counts or any monotonic transformation thereof.
+If \code{genes} is manually passed, \code{ref} can contain the raw counts or any monotonic transformation thereof.
 There is no need to supply (log-)normalized expression values for the benefit of the automatic marker detection.
 Similarly, for manual \code{genes}, the values of \code{de.method}, \code{de.n} and \code{sd.thresh} have no effect.
+
+Check out the Examples to see how manual \code{genes} can be passed to \code{trainSingleR}.
 }
 
 \section{Dealing with multiple references}{
@@ -179,13 +195,9 @@ where each inner list corresponds to a reference in \code{ref} and can be of any
 Thus, it is possible for \code{genes} to be - wait for it - a list (per reference) of lists (per label) of lists (per label) of character vectors.
 }
 
-\section{Note on single-cell references}{
-
-The default marker selection is based on log-fold changes between the per-label medians and is very much designed with bulk references in mind.
-It may not be effective for single-cell reference data where it is not uncommon to have more than 50\% zero counts for a given gene such that the median is also zero for each group.
-Users are recommended to either set \code{de.method} to another DE ranking method, or detect markers externally and pass a list of markers to \code{genes} (see Examples).
+\section{Aggregating single-cell references}{
 
-In addition, it is generally unnecessary to have single-cell resolution on the reference profiles.
+It is generally unnecessary to have single-cell resolution on the reference profiles.
 We can instead set \code{aggr.ref=TRUE} to aggregate per-cell references into a set of pseudo-bulk profiles using \code{\link{aggregateReference}}.
 This improves classification speed while using vector quantization to preserve within-label heterogeneity and mitigate the loss of information.
 Note that any aggregation is done \emph{after} marker gene detection; this ensures that the relevant tests can appropriately penalize within-label variation.
diff --git a/vignettes/SingleR.Rmd b/vignettes/SingleR.Rmd
index 3f97a207..9a6be3bc 100644
--- a/vignettes/SingleR.Rmd
+++ b/vignettes/SingleR.Rmd
@@ -55,6 +55,7 @@ hpca.se
 
 Our test dataset consists of some human embryonic stem cells [@lamanno2016molecular] from the `r Biocpkg("scRNAseq")` package. 
 For the sake of speed, we will only label the first 100 cells from this dataset.
+The test dataset does not need to be log-transformed, as only the ranks will be used by `SingleR()`.
 
 ```{r}
 library(scRNAseq)
@@ -110,6 +111,9 @@ To speed up this demonstration, we will subset to the first 100 cells.
 ```{r}
 sceG <- GrunPancreasData()
 sceG <- sceG[,colSums(counts(sceG)) > 0] # Remove libraries with no counts.
+
+# Log-transformation is not actually required for SingleR itself,
+# but we'll do it anyway as we need it for plotting heatmaps later.
 sceG <- logNormCounts(sceG) 
 ```