Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiome keeper cell metrics #1303

Merged
merged 14 commits into from
Jul 1, 2024
6 changes: 6 additions & 0 deletions pipelines/skylab/multiome/Multiome.changelog.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
# 5.1.0
2024-06-28 (Date of Last Commit)

* Updated the STARsolo parameters for estimating cells to Emptydrops_CR
* Added an optional input for expected cells which is used for metric calculation

# 5.0.0
2024-05-20 (Date of Last Commit)

Expand Down
2 changes: 1 addition & 1 deletion pipelines/skylab/multiome/Multiome.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ import "https://raw.githubusercontent.com/broadinstitute/CellBender/v0.3.0/wdl/c

workflow Multiome {

String pipeline_version = "5.0.0"
String pipeline_version = "5.1.0"

input {
String input_id
Expand Down
6 changes: 6 additions & 0 deletions pipelines/skylab/optimus/Optimus.changelog.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
# 7.2.0
2024-06-28 (Date of Last Commit)

* Updated the STARsolo parameters for estimating cells to Emptydrops_CR
* Added an optional input for expected cells which is used for metric calculation

# 7.1.0
2024-05-20 (Date of Last Commit)

Expand Down
6 changes: 4 additions & 2 deletions pipelines/skylab/optimus/Optimus.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ workflow Optimus {
File annotations_gtf
File? mt_genes
String? soloMultiMappers = "Uniform"
Int? expected_cells

# Chemistry options include: 2 or 3
Int tenx_chemistry_version
Expand Down Expand Up @@ -65,7 +66,7 @@ workflow Optimus {
# version of this pipeline


String pipeline_version = "7.1.0"
String pipeline_version = "7.2.0"


# this is used to scatter matched [r1_fastq, r2_fastq, i1_fastq] arrays
Expand Down Expand Up @@ -168,7 +169,8 @@ workflow Optimus {
align_features = STARsoloFastq.align_features,
umipercell = STARsoloFastq.umipercell,
input_id = input_id,
counting_mode = counting_mode
counting_mode = counting_mode,
expected_cells = expected_cells
}
if (counting_mode == "sc_rna"){
call RunEmptyDrops.RunEmptyDrops {
Expand Down
6 changes: 6 additions & 0 deletions pipelines/skylab/paired_tag/PairedTag.changelog.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,9 @@
# 1.1.0
2024-06-28 (Date of Last Commit)

* Updated the STARsolo parameters for estimating cells to Emptydrops_CR
* Added an optional input for expected cells which is used for metric calculation

# 1.0.0
2024-06-26

Expand Down
2 changes: 1 addition & 1 deletion pipelines/skylab/paired_tag/PairedTag.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ import "../../../pipelines/skylab/optimus/Optimus.wdl" as optimus
import "../../../tasks/skylab/H5adUtils.wdl" as H5adUtils
import "../../../tasks/skylab/PairedTagUtils.wdl" as Demultiplexing
workflow PairedTag {
String pipeline_version = "1.0.0"
String pipeline_version = "1.1.0"

input {
String input_id
Expand Down
5 changes: 5 additions & 0 deletions pipelines/skylab/slideseq/SlideSeq.changelog.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
# 3.1.7
2024-06-28 (Date of Last Commit)

* Updated the STARsolo parameters for estimating cells to Emptydrops_CR; this does not affect the slideseq pipeline

# 3.1.6
2024-05-20 (Date of Last Commit)

Expand Down
2 changes: 1 addition & 1 deletion pipelines/skylab/slideseq/SlideSeq.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ import "../../../tasks/skylab/MergeSortBam.wdl" as Merge

workflow SlideSeq {

String pipeline_version = "3.1.6"
String pipeline_version = "3.1.7"

input {
Array[File] r1_fastq
Expand Down
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
# 1.3.5
2024-06-28 (Date of Last Commit)

* Updated the STARsolo parameters for estimating cells to Emptydrops_CR; this does not impact the snSS2 pipeline

# 1.3.4
2024-04-12 (Date of Last Commit)

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ workflow MultiSampleSmartSeq2SingleNucleus {
String? input_id_metadata_field
}
# Version of this pipeline
String pipeline_version = "1.3.4"
String pipeline_version = "1.3.5"

if (false) {
String? none = "None"
Expand Down
11 changes: 7 additions & 4 deletions tasks/skylab/StarAlign.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -327,7 +327,8 @@ task STARsoloFastq {
--soloBarcodeReadLength 0 \
--soloCellReadStats Standard \
~{"--soloMultiMappers " + soloMultiMappers} \
--soloUMIfiltering MultiGeneUMI_CR
--soloUMIfiltering MultiGeneUMI_CR \
--soloCellFilter EmptyDrops_CR

echo "UMI LEN " $UMILen

Expand Down Expand Up @@ -442,11 +443,12 @@ task MergeStarOutput {
String? counting_mode

String input_id
Int expected_cells = 3000
File barcodes_single = barcodes[0]
File features_single = features[0]

#runtime values
String docker = "us.gcr.io/broad-gotc-prod/star-merge-npz:1.1"
String docker = "us.gcr.io/broad-gotc-prod/star-merge-npz:1.2"
Int machine_mem_gb = 20
Int cpu = 1
Int disk = ceil(size(matrix, "Gi") * 2) + 10
Expand Down Expand Up @@ -491,7 +493,7 @@ task MergeStarOutput {

# Running star for combined cell matrix
# outputs will be called outputbarcodes.tsv. outputmatrix.mtx, and outputfeatures.tsv
STAR --runMode soloCellFiltering ./matrix ./output --soloCellFilter CellRanger2.2
STAR --runMode soloCellFiltering ./matrix ./output --soloCellFilter EmptyDrops_CR

#list files
echo "listing files"
Expand Down Expand Up @@ -567,7 +569,8 @@ task MergeStarOutput {
~{counting_mode} \
~{input_id} \
outputbarcodes.tsv \
outputmatrix.mtx
outputmatrix.mtx \
~{expected_cells}
tar -zcvf ~{input_id}.star_metrics.tar *.txt
else
echo "No text files found in the folder."
Expand Down
43 changes: 43 additions & 0 deletions website/docs/Pipelines/Optimus_Pipeline/Library-metrics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
---
sidebar_position: 5
---

# Optimus Library-level metrics

The following table describes the library level metrics of the produced by the Optimus workflow. These are calcuated using custom python scripts available in the warp-tools repository. The Optimus workflow aligns files in shards to parallelize computationally intensive steps. This results in multiple matrix market files and shard-levl library metrics.

To produce the library-level metrics here, the [combined_mtx.py script](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/star-merge-npz/scripts/combined_mtx.py) combines all the shard-level matrix market files into one raw mtx file. Then, STARsolo is run to filter this matrix to only those barcodes that meet STARsolo's criteria of cells (using the Emptydrops_CR parameter). Lastly, the [combine_shard_metrics.py script](https://github.com/broadinstitute/warp-tools/blob/develop/3rd-party-tools/star-merge-npz/scripts/combine_shard_metrics.py) uses the filtered matrix and the all of the shard-level metrics files produced by STARsolo to calculate the metrics below. Each of the scripts are called from [MergeStarOutput task](https://github.com/broadinstitute/warp/blob/develop/tasks/skylab/StarAlign.wdl) of the Optimus workflow.


| Metric | Description |
| ---| --- |
| number_of_reads | Total number of reads.|
| sequencing_saturation | Proportion of unique molecular identifiers (UMIs) observed relative to the total number of possible UMIs. |
| fraction_of_unique_reads_mapped_to_genome | Fraction of unique reads that map to the genome. |
| fraction_of_unique_and_multiple_reads_mapped_to_genome| Fraction of both unique and multiple reads that map to the genome. |
| fraction_of_reads_with_Q30_bases_in_rna | Fraction of reads with base quality score ≥ Q30 in RNA sequences. |
| fraction_of_reads_with_Q30_bases_in_cb_and_umi | Fraction of reads with base quality score ≥ Q30 in cell barcode (CB) and unique molecular identifier (UMI). |
| fraction_of_reads_with_valid_barcodes | Fraction of reads with valid cell barcodes. |
| reads_mapped_antisense_to_gene | Number of reads mapped antisense to gene regions. |
| reads_mapped_confidently_exonic | Number of reads mapped confidently to exonic regions. |
| reads_mapped_confidently_to_genome | Number of reads mapped confidently to the genome. |
| reads_mapped_confidently_to_intronic_regions | Number of reads mapped confidently to intronic regions. |
| reads_mapped_confidently_to_transcriptome | Number of reads mapped confidently to the transcriptome. |
| estimated_cells | Estimated number of cells from STARsolo using the Emptydops_CR parameter. |
| umis_in_cells | Total number of unique molecular identifiers (UMIs) in cells. |
| mean_umi_per_cell | Average number of UMIs per cell. |
| median_umi_per_cell | Median number of UMIs per cell. |
| unique_reads_in_cells_mapped_to_gene | Number of unique reads in cells mapped to genes. |
| fraction_of_unique_reads_in_cells | Fraction of unique reads in cells. |
| mean_reads_per_cell | Average number of reads per cell. |
| median_reads_per_cell | Median number of reads per cell. |
| mean_gene_per_cell | Average number of genes per cell. |
| median_gene_per_cell | Median number of genes per cell. |
| total_genes_unique_detected | Total number of unique genes detected. |
| percent_target | Percentage of target cells. Calculated as: estimated_number_of_cells / barcoded_cell_sample_number_of_expected_cells |
| percent_intronic_reads | Percentage of intronic reads. Calculated as: reads_mapped_confidently_to_intronic_regions / number_of_reads |
| keeper_mean_reads_per_cell | Mean reads per cell for cells with >1500 genes or nuclei with >1000 genes. |
| keeper_median_genes | Median genes per cell for cells with >1500 genes or nuclei with >1000 genes. |
| keeper_cells | Number of cells with >1500 genes or nuclei with >1000 genes.|
| percent_keeper | Percentage of keeper cells. Calculated as: keeper_cells / estimated_cells |
| percent_usable | Percentage of usable cells. Calculated as: keeper_cells / expected_cells |
3 changes: 2 additions & 1 deletion website/docs/Pipelines/Optimus_Pipeline/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -102,6 +102,7 @@ The example configuration files also contain metadata for the reference files, d
| ignore_r1_read_length | Boolean that overrides a check on the 10x chemistry. Default is set to false. If true, the workflow will not ensure that the 10x_chemistry_version input matches the chemistry in the read 1 FASTQ. | "true" or "false" (default) |
| emptydrops_lower | UMI threshold for emptyDrops detection; default is 100. | N/A |
| count_exons | Boolean indicating if the workflow should calculate exon counts **when in single-nucleus (sn_rna) mode**. If true, this option will output an additional layer for the h5ad file. By default, it is set to "false". If the parameter is true and used with sc_rnamode, the workflow will return an error. | "true" or "false" (default) |
| expected_cells | Optional integer input for the expected number of cells, which is used calculate library-level metrics. The default is set to 3,000 |

#### Pseudogene handling
The example Optimus reference files are downloaded directly from GENCODE (see Quickstart table) and are not modified to remove pseudogenes. This is in contrast to the [references created for Cell Ranger](https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/release-notes/references#header) which remove pseudogenes and small RNAs.
Expand Down Expand Up @@ -255,7 +256,7 @@ The following table lists the output files produced from the pipeline. For sampl
| cell_metrics | `<input_id>.cell-metrics.csv.gz` | Matrix of metrics by cells. | Compressed CSV |
| gene_metrics | `<input_id>.gene-metrics.csv.gz` | Matrix of metrics by genes. | Compressed CSV |
| aligner_metrics | `<input_id>.star_metrics.tar` | Tarred metrics files produced by the STARsolo aligner; contains align features, cell reads, summary, and UMI per cell metrics files. | TXT |
| library_metrics | `<input_id>_library_metrics.csv` | Optional CSV file containing all library-level metrics calculated with STARsolo for gene expression data. | CSV |
| library_metrics | `<input_id>_library_metrics.csv` | Optional CSV file containing all library-level metrics calculated with STARsolo for gene expression data. See the [Library-level metrics](./Library-metrics.md) for how metrics are calculated. | CSV |
| multimappers_EM_matrix | `UniqueAndMult-EM.mtx` | Optional output produced when `soloMultiMappers` is "EM"; see STARsolo [documentation](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#multi-gene-reads) for more information. | MTX |
| multimappers_Uniform_matrix | `UniqueAndMult-Uniform.mtx` | Optional output produced when `soloMultiMappers` is "Uniform"; see STARsolo [documentation](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#multi-gene-reads) for more information. | MTX |
| multimappers_Rescue_matrix | `UniqueAndMult-Rescue.mtx` | Optional output produced when `soloMultiMappers` is "Rescue"; see STARsolo [documentation](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md#multi-gene-reads) for more information. | MTX |
Expand Down
Loading