Merge branch 'ar/docs-0.3.2' into 'master'
Docs 0.3.2

See merge request machine-learning/modkit!206
ArtRand committed Aug 15, 2024
2 parents 9d65d2a + d86f367 commit 828697b
Showing 39 changed files with 863 additions and 85 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,15 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [v0.3.2]
### Fixes
- [thresholds] Fix out-of-bounds (OOB) panic (#244)
- [dmr, pair] Allow strand information to be provided in regions (#240)
- [dmr, multi] Fix problem when many pairs are provided (#229)
### Adds
- [sample-probs] Change the output format of the probabilities table to make it easier to parse, and change its schema. Output HTML documents with nicer tables.
- [ci] Build on Ubuntu 16 due to CentOS 7 EOL.

## [v0.3.1]
### Fixes
- [call-mods] Always change model to "explicit", dropped base modification probabilities should not be interpreted as canonical.
2 changes: 2 additions & 0 deletions book/src/SUMMARY.md
@@ -3,6 +3,7 @@
- [Quick Start guides](./quick_start.md)
- [Constructing bedMethyl tables](./intro_bedmethyl.md)
- [Updating and adjusting MM tags](./intro_adjust.md)
- [Inspecting base modification probabilities](./intro_sample_probs.md)
- [Summarizing a modBAM](./intro_summary.md)
- [Making a motif BED file](./intro_motif_bed.md)
- [Extracting read information to a table](./intro_extract.md)
@@ -17,6 +18,7 @@
- [Narrow output to specific positions](./intro_include_bed.md)
- [Extended subcommand help](./advanced_usage.md)
- [Troubleshooting](./troubleshooting.md)
- [Frequently asked questions](./faq.md)
- [Current limitations](./limitations.md)
- [Performance considerations](./perf_considerations.md)
- [Algorithm details](./algo_details.md)
69 changes: 56 additions & 13 deletions book/src/advanced_usage.md
@@ -281,8 +281,8 @@ Usage: modkit adjust-mods [OPTIONS] <IN_BAM> <OUT_BAM>
Arguments:
<IN_BAM>
BAM file to collapse mod call from. Can be a path to a file or one of `-` or `stdin` to
specify a stream from standard input.
Input BAM file, can be a path to a file or one of `-` or `stdin` to specify a stream from
standard input.
<OUT_BAM>
File path to new BAM file to be created. Can be a path to a file or one of `-` or `stdin`
@@ -325,6 +325,58 @@ Options:
--output-sam
Output SAM format instead of BAM.
--filter-probs
Filter out the lowest confidence base modification probabilities.
-n, --num-reads <NUM_READS>
Sample approximately this many reads when estimating the filtering threshold. If
alignments are present reads will be sampled evenly across aligned genome. If a region is
specified, either with the --region option or the --sample-region option, then reads will
be sampled evenly across the region given. This option is useful for large BAM files. In
practice, 10-50 thousand reads is sufficient to estimate the model output distribution and
determine the filtering threshold.
[default: 10042]
--sample-region <SAMPLE_REGION>
Specify a region for sampling reads from when estimating the threshold probability. If
this option is not provided, but --region is provided, the genomic interval passed to
--region will be used. Format should be <chrom_name>:<start>-<end> or <chrom_name>.
--sampling-interval-size <SAMPLING_INTERVAL_SIZE>
Interval chunk size to process concurrently when estimating the threshold probability, can
be larger than the pileup processing interval.
[default: 1000000]
-p, --filter-percentile <FILTER_PERCENTILE>
Filter out modified base calls where the probability of the predicted variant is below
this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence
modification calls.
[default: 0.1]
--filter-threshold <FILTER_THRESHOLD>
Specify the filter threshold globally or per primary base. A global filter threshold can
be specified by a decimal number (e.g. 0.75). Per-base thresholds can be specified by
colon-separated values; for example, C:0.75 specifies a threshold value of 0.75 for
cytosine modification calls. Additional per-base thresholds can be specified by repeating
the option, for example --filter-threshold C:0.75 --filter-threshold A:0.70. You can also
specify a single base option and a default for all other bases: --filter-threshold A:0.70
--filter-threshold 0.9 specifies a threshold value of 0.70 for adenine and 0.9 for all
other base modification calls.
--mod-threshold <MOD_THRESHOLDS>
Specify a passing threshold to use for a base modification, independent of the threshold
for the primary sequence base or the default. For example, to set the pass threshold for
5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as
usual and used for canonical cytosine and other modifications unless the
`--filter-threshold` option is also passed. See the online documentation for more details.
--only-mapped
Only use base modification probabilities from bases that are aligned when estimating the
filter threshold (i.e. ignore soft-clipped and inserted bases).
--suppress-progress
Hide the progress bar.
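The combination rules for repeated `--filter-threshold` values described above can be sketched as a small parser. This is an illustration of the option syntax only, not modkit's actual implementation; the function name is hypothetical:

```python
def parse_filter_thresholds(values):
    """Combine repeated --filter-threshold values: 'C:0.75' sets a
    per-base threshold; a bare number like '0.9' sets the default
    for all other bases."""
    per_base = {}
    default = None
    for value in values:
        if ":" in value:
            base, threshold = value.split(":", 1)
            per_base[base] = float(threshold)
        else:
            default = float(value)
    return per_base, default

# --filter-threshold A:0.70 --filter-threshold 0.9
per_base, default = parse_filter_thresholds(["A:0.70", "0.9"])
```

Here `per_base` is `{"A": 0.7}` and `default` is `0.9`, matching the adenine/other-bases example in the help text.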
@@ -398,13 +450,9 @@ Options:
-o, --out-dir <OUT_DIR>
Directory to deposit result tables into. Required for model probability histogram output.
Creates two files, probabilities.tsv and probabilities.txt. The .txt file contains
ASCII histograms and the .tsv file contains the tab-separated data represented by the
histograms.
--prefix <PREFIX>
Label to prefix output files with. E.g. 'foo' will output foo_thresholds.tsv,
foo_probabilities.tsv, and foo_probabilities.txt.
Label to prefix output files with.
--force
Overwrite results if present.
@@ -431,11 +479,6 @@ Options:
--hist
Output histogram of base modification prediction probabilities.
--buckets <BUCKETS>
Number of buckets for the histogram, if used.
[default: 128]
-n, --num-reads <NUM_READS>
Approximate maximum number of reads to use, especially recommended when using a large BAM
without an index. If an indexed BAM is provided, the reads will be sampled evenly over the
@@ -1397,7 +1440,7 @@ sample is input as a bgzip pileup bedMethyl (produced by pileup, for example) th
tabix index. Output is a BED file with the score column indicating the magnitude of the difference
in methylation between the two samples. See the online documentation for additional details.
Usage: modkit dmr pair [OPTIONS] -a <CONTROL_BED_METHYL> -b <EXP_BED_METHYL> --ref <REFERENCE_FASTA>
Usage: modkit dmr pair [OPTIONS] --ref <REFERENCE_FASTA>
Options:
-a <CONTROL_BED_METHYL>
3 changes: 2 additions & 1 deletion book/src/dmr_scoring_details.md
@@ -118,7 +118,8 @@ The model is a simple 2-state hidden Markov model, shown below, where the two hi
![hmm](./images/hmm2.png "2-state segmenting HMM")

</div>
The model is run over the intersection of the modified positions in a [pileup](https://nanoporetech.github.io/modkit/intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.

The model is run over the intersection of the modified positions in a [pileup](./intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.
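The intersection step described above can be sketched as follows — a minimal illustration assuming per-sample maps of position → valid coverage. The function name and the coverage cutoff are hypothetical, not modkit's internals:

```python
def candidate_positions(samples, min_coverage=5):
    """Keep positions that are modified-base candidates in every sample
    and have at least `min_coverage` valid coverage in each."""
    covered = [
        {pos for pos, cov in sample.items() if cov >= min_coverage}
        for sample in samples
    ]
    return sorted(set.intersection(*covered))

# Two samples: position 200 is under-covered in sample 1,
# position 300 is under-covered in sample 2, so only 100 survives.
positions = candidate_positions([{100: 10, 200: 3, 300: 8}, {100: 7, 300: 2}])
```

The HMM then segments over these surviving positions only, since positions missing or under-covered in any sample carry no usable signal.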

## Transition parameters
There are two transition probability parameters, \\(p\\) and \\(d\\).
40 changes: 40 additions & 0 deletions book/src/faq.md
@@ -0,0 +1,40 @@
# Frequently asked questions

## How are base modification probabilities calculated?

Base modifications are assigned a probability reflecting the confidence the base modification detection algorithm has in making a decision about the modification state of the molecule at a particular position.
The probabilities are parsed from the `ML` tag in the BAM record. These values reflect the probability of the base having a specific modification; `modkit` uses these values to calculate the probability of each modification as well as the probability that the base is canonical:

\\[
P_{\text{canonical}} = 1 - \sum_{m \in \textbf{M}} P_{m}
\\]

where \\(\textbf{M}\\) is the set of all of the potential modifications for the base.

For example, consider an m6A model that predicts m6A or canonical bases at adenine residues: if \\( P_{\text{m6A}} = 0.9 \\), then the probability of canonical \\( \text{A} \\) is \\( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \\).
Or consider a typical case for cytosine modifications where the model predicts 5hmC, 5mC, and canonical cytosine:

\\[
P_{\text{5mC}} = 0.7, \\\\
P_{\text{5hmC}} = 0.2, \\\\
P_{\text{canonical}} = 1 - (P_{\text{5mC}} + P_{\text{5hmC}}) = 0.1 \\\\
\\]

A potential point of confusion: `modkit` does not assume a base is canonical when the probability of modification is close to \\( \frac{1}{N_{\text{classes}}} \\), the lowest probability the algorithm may assign.
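The two worked examples above can be reproduced with a few lines. This is a sketch of the arithmetic only; the single-letter keys are the SAM modification codes (`a` for m6A, `m` for 5mC, `h` for 5hmC):

```python
def canonical_prob(mod_probs):
    """P_canonical = 1 - sum of the modification probabilities
    parsed from the ML tag for this base."""
    return 1.0 - sum(mod_probs.values())

# m6A-only model: P(m6A) = 0.9 -> P(canonical A) = 0.1
p_a = canonical_prob({"a": 0.9})

# 5mC/5hmC model: P(5mC) = 0.7, P(5hmC) = 0.2 -> P(canonical C) = 0.1
p_c = canonical_prob({"m": 0.7, "h": 0.2})
```

Both `p_a` and `p_c` come out to 0.1, matching the equations above.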

## What value for `--filter-threshold` should I use?

Just as you might remove low-quality data as a first step in any processing, `modkit` will filter out the lowest-confidence base modification probabilities.
The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step.
This does not remove the whole read from consideration; only the base modification information attributed to a particular position in the read is removed.
The most common place to encounter filtering is in `pileup`, where base modification probabilities falling below the pass threshold will be tabulated in the \\( \text{N}\_{\text{Fail}} \\) column instead of the \\( \text{N}\_{\text{valid}} \\) column.
For highest accuracy, the general recommendation is to let `modkit` estimate this value for you based on the input data.
The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \\(10^{\text{th}}\\) percentile probability value.
This percentile can be changed with the `--filter-percentile` option.
Passing a value to `--filter-threshold` and/or `--mod-threshold` that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively.
It may be a good idea to inspect the distribution of probability values in your data; the `modkit sample-probs` [command](./intro_sample_probs.md) is designed for this task.
Use the `--hist` and `--out-dir` options to collect a histogram of the prediction probabilities for each canonical base and modification.
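The percentile estimate described above can be sketched with a simple nearest-rank calculation. This is an approximation for illustration, assuming a small sample of per-call probabilities, not modkit's exact estimator:

```python
def estimate_filter_threshold(probs, percentile=0.10):
    """Nearest-rank sketch of the pass-threshold estimate: the
    probability value at the given percentile of the sampled calls."""
    ranked = sorted(probs)
    index = int(percentile * len(ranked))
    return ranked[min(index, len(ranked) - 1)]

# Hypothetical sample of base modification probabilities from the ML tags.
sampled = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.52, 0.51]
threshold = estimate_filter_threshold(sampled)  # 10th percentile by default
```

Calls with probabilities below `threshold` would be tabulated as filtered (N_Fail) rather than passing (N_valid); raising `percentile` excludes more calls, lowering it includes more.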



<!-- ## How can I perform differential methylation analysis? -->
28 changes: 28 additions & 0 deletions book/src/intro_sample_probs.md
@@ -0,0 +1,28 @@
# Inspecting base modification probabilities

> For details on how base modification probabilities are calculated, see the [FAQ page](./faq.html#how-are-base-modification-probabilities-calculated).

For most use cases the automatic filtering enabled in `modkit` will produce nearly ideal results.
However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds.
The `modkit sample-probs` command is designed for this task.
There are two ways to use this command. The first is to simply run `modkit sample-probs $mod_bam` to get a tab-separated file of threshold values for each modified base.
This can save time in downstream steps where you wish to re-use the threshold value: pass it with `--filter-threshold` and skip re-estimating the value.
To generate more advanced output, add `--hist --out-dir $output_dir` to the command to generate per-modification histograms of the output probabilities.
Using the command this way produces 3 files in the `$output_dir`:
1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads.
1. Another HTML document containing the proportions of each probability emitted.
1. A tab-separated table with the same information as the histograms and the percentile rank of each probability value.

The schema of the table is as follows:

| column | name | description | type |
|--------|-----------------|----------------------------------------------------------------------------------------------|--------|
| 1 | code | modification code or '-' for canonical | string |
| 2 | primary base | the primary DNA base for which the code applies | string |
| 3 | range_start | the inclusive start probability of the bin | float |
| 4 | range_end | the exclusive end probability of the bin | float |
| 5 | count | the total count of probabilities falling in this bin | int |
| 6 | frac | the fraction of the total calls for this code/primary base in this bin | float |
| 7 | percentile_rank | the [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) of this probability bin | float |

From these plots and tables you can decide on a pass threshold per-modification code and use `--mod-threshold`/`--filter-threshold` [accordingly](./filtering.md).
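Picking a threshold from the table can be done with a short script. The rows below are invented example values laid out per the schema above (the 0.10 cutoff mirrors the default `--filter-percentile`); this is a sketch, not output from a real run:

```python
import csv
import io

# Hypothetical excerpt of the tab-separated table (columns as in the schema above).
tsv = (
    "code\tprimary base\trange_start\trange_end\tcount\tfrac\tpercentile_rank\n"
    "m\tC\t0.85\t0.90\t900\t0.15\t0.15\n"
    "m\tC\t0.90\t0.95\t1500\t0.25\t0.40\n"
    "m\tC\t0.95\t1.00\t3600\t0.60\t1.00\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Candidate pass threshold for 5mC: the start of the first bin whose
# percentile rank clears the 10th percentile of calls.
candidate = next(
    float(r["range_start"])
    for r in rows
    if r["code"] == "m" and float(r["percentile_rank"]) > 0.10
)
```

With these example rows, `candidate` is 0.85; that value could then be passed as `--mod-threshold m:0.85`.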
