Merge branch 'ar/docs-0.3.2' into 'master'
Docs 0.3.2

See merge request machine-learning/modkit!206
ArtRand committed Aug 15, 2024
2 parents 9d65d2a + d86f367 commit 828697b
Showing 39 changed files with 863 additions and 85 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,15 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [v0.3.2]
### Fixes
- [thresholds] Fix out-of-bounds (OOB) panic (#244)
- [dmr, pair] Allow strand information to be provided in regions (#240)
- [dmr, multi] Fix problem when many pairs are provided (#229)
### Adds
- [sample-probs] Change the output format of the probabilities table to make it easier to parse, and change its schema. Output HTML documents with nicer tables.
- [ci] Build on Ubuntu 16 due to CentOS 7 EOL.

## [v0.3.1]
### Fixes
- [call-mods] Always change model to "explicit", dropped base modification probabilities should not be interpreted as canonical.
2 changes: 2 additions & 0 deletions book/src/SUMMARY.md
@@ -3,6 +3,7 @@
- [Quick Start guides](./quick_start.md)
- [Constructing bedMethyl tables](./intro_bedmethyl.md)
- [Updating and adjusting MM tags](./intro_adjust.md)
- [Inspecting base modification probabilities](./intro_sample_probs.md)
- [Summarizing a modBAM](./intro_summary.md)
- [Making a motif BED file](./intro_motif_bed.md)
- [Extracting read information to a table](./intro_extract.md)
@@ -17,6 +18,7 @@
- [Narrow output to specific positions](./intro_include_bed.md)
- [Extended subcommand help](./advanced_usage.md)
- [Troubleshooting](./troubleshooting.md)
- [Frequently asked questions](./faq.md)
- [Current limitations](./limitations.md)
- [Performance considerations](./perf_considerations.md)
- [Algorithm details](./algo_details.md)
69 changes: 56 additions & 13 deletions book/src/advanced_usage.md
@@ -281,8 +281,8 @@ Usage: modkit adjust-mods [OPTIONS] <IN_BAM> <OUT_BAM>
Arguments:
<IN_BAM>
BAM file to collapse mod call from. Can be a path to a file or one of `-` or `stdin` to
specify a stream from standard input.
Input BAM file, can be a path to a file or one of `-` or `stdin` to specify a stream from
standard input.
<OUT_BAM>
File path to new BAM file to be created. Can be a path to a file or one of `-` or `stdin`
@@ -325,6 +325,58 @@ Options:
--output-sam
Output SAM format instead of BAM.
--filter-probs
Filter out the lowest confidence base modification probabilities.
-n, --num-reads <NUM_READS>
Sample approximately this many reads when estimating the filtering threshold. If
alignments are present reads will be sampled evenly across aligned genome. If a region is
specified, either with the --region option or the --sample-region option, then reads will
be sampled evenly across the region given. This option is useful for large BAM files. In
practice, 10-50 thousand reads is sufficient to estimate the model output distribution and
determine the filtering threshold.
[default: 10042]
--sample-region <SAMPLE_REGION>
Specify a region for sampling reads from when estimating the threshold probability. If
this option is not provided, but --region is provided, the genomic interval passed to
--region will be used. Format should be <chrom_name>:<start>-<end> or <chrom_name>.
--sampling-interval-size <SAMPLING_INTERVAL_SIZE>
Interval chunk size to process concurrently when estimating the threshold probability, can
be larger than the pileup processing interval.
[default: 1000000]
-p, --filter-percentile <FILTER_PERCENTILE>
Filter out modified base calls where the probability of the predicted variant is below
this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence
modification calls.
[default: 0.1]
--filter-threshold <FILTER_THRESHOLD>
Specify the filter threshold globally or per primary base. A global filter threshold can
be specified by a decimal number (e.g. 0.75). Per-base thresholds can be specified by
colon-separated values; for example, C:0.75 specifies a threshold value of 0.75 for
cytosine modification calls. Additional per-base thresholds can be specified by repeating
the option, for example --filter-threshold C:0.75 --filter-threshold A:0.70. You can also
specify a single base option and a default for all other bases: --filter-threshold A:0.70
--filter-threshold 0.9 specifies a threshold value of 0.70 for adenine and 0.9 for all
other base modification calls.
--mod-threshold <MOD_THRESHOLDS>
Specify a passing threshold to use for a base modification, independent of the threshold
for the primary sequence base or the default. For example, to set the pass threshold for
5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as
usual and used for canonical cytosine and other modifications unless the
`--filter-threshold` option is also passed. See the online documentation for more details.
--only-mapped
Only use base modification probabilities from bases that are aligned when estimating the
filter threshold (i.e. ignore soft-clipped and inserted bases).
--suppress-progress
Hide the progress bar.
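The combination rules for repeated `--filter-threshold` values described above can be sketched as a small parser. This is an illustration of the option syntax only, not modkit's actual implementation; the function name is hypothetical:

```python
def parse_filter_thresholds(values):
    """Combine repeated --filter-threshold values: 'C:0.75' sets a
    per-base threshold; a bare number like '0.9' sets the default
    for all other bases."""
    per_base = {}
    default = None
    for value in values:
        if ":" in value:
            base, threshold = value.split(":", 1)
            per_base[base] = float(threshold)
        else:
            default = float(value)
    return per_base, default

# --filter-threshold A:0.70 --filter-threshold 0.9
per_base, default = parse_filter_thresholds(["A:0.70", "0.9"])
```

Here `per_base` is `{"A": 0.7}` and `default` is `0.9`, matching the adenine/other-bases example in the help text.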
@@ -398,13 +450,9 @@ Options:
-o, --out-dir <OUT_DIR>
Directory to deposit result tables into. Required for model probability histogram output.
Creates two files, probabilities.tsv and probabilities.txt. The .txt file contains
ASCII histograms and the .tsv file contains the tab-separated data represented by the
histograms.
--prefix <PREFIX>
Label to prefix output files with. E.g. 'foo' will output foo_thresholds.tsv,
foo_probabilities.tsv, and foo_probabilities.txt.
Label to prefix output files with.
--force
Overwrite results if present.
@@ -431,11 +479,6 @@ Options:
--hist
Output histogram of base modification prediction probabilities.
--buckets <BUCKETS>
Number of buckets for the histogram, if used.
[default: 128]
-n, --num-reads <NUM_READS>
Approximate maximum number of reads to use, especially recommended when using a large BAM
without an index. If an indexed BAM is provided, the reads will be sampled evenly over the
@@ -1397,7 +1440,7 @@ sample is input as a bgzip pileup bedMethyl (produced by pileup, for example) th
tabix index. Output is a BED file with the score column indicating the magnitude of the difference
in methylation between the two samples. See the online documentation for additional details.
Usage: modkit dmr pair [OPTIONS] -a <CONTROL_BED_METHYL> -b <EXP_BED_METHYL> --ref <REFERENCE_FASTA>
Usage: modkit dmr pair [OPTIONS] --ref <REFERENCE_FASTA>
Options:
-a <CONTROL_BED_METHYL>
3 changes: 2 additions & 1 deletion book/src/dmr_scoring_details.md
@@ -118,7 +118,8 @@ The model is a simple 2-state hidden Markov model, shown below, where the two hi
![hmm](./images/hmm2.png "2-state segmenting HMM")

</div>
The model is run over the intersection of the modified positions in a [pileup](https://nanoporetech.github.io/modkit/intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.

The model is run over the intersection of the modified positions in a [pileup](./intro_bedmethyl.html#description-of-bedmethyl-output) for which there is enough coverage, from one or more samples.
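The intersection step described above can be sketched as follows — a minimal illustration assuming per-sample maps of position → valid coverage. The function name and the coverage cutoff are hypothetical, not modkit's internals:

```python
def candidate_positions(samples, min_coverage=5):
    """Keep positions that are modified-base candidates in every sample
    and have at least `min_coverage` valid coverage in each."""
    covered = [
        {pos for pos, cov in sample.items() if cov >= min_coverage}
        for sample in samples
    ]
    return sorted(set.intersection(*covered))

# Two samples: position 200 is under-covered in sample 1,
# position 300 is under-covered in sample 2, so only 100 survives.
positions = candidate_positions([{100: 10, 200: 3, 300: 8}, {100: 7, 300: 2}])
```

The HMM then segments over these surviving positions only, since positions missing or under-covered in any sample carry no usable signal.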

## Transition parameters
There are two transition probability parameters, \\(p\\) and \\(d\\).
40 changes: 40 additions & 0 deletions book/src/faq.md
@@ -0,0 +1,40 @@
# Frequently asked questions

## How are base modification probabilities calculated?

Base modifications are assigned a probability reflecting the confidence the base modification detection algorithm has in making a decision about the modification state of the molecule at a particular position.
The probabilities are parsed from the `ML` tag in the BAM record. These values reflect the probability of the base having a specific modification; `modkit` uses these values to calculate the probability of each modification as well as the probability that the base is canonical:

\\[
P_{\text{canonical}} = 1 - \sum_{m \in \textbf{M}} P_{m}
\\]

where \\(\textbf{M}\\) is the set of all of the potential modifications for the base.

For example, consider an m6A model that predicts m6A or canonical bases at adenine residues: if \\( P_{\text{m6A}} = 0.9 \\), then the probability of canonical \\( \text{A} \\) is \\( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \\).
Or consider a typical case for cytosine modifications where the model predicts 5hmC, 5mC, and canonical cytosine:

\\[
P_{\text{5mC}} = 0.7, \\\\
P_{\text{5hmC}} = 0.2, \\\\
P_{\text{canonical}} = 1 - (P_{\text{5mC}} + P_{\text{5hmC}}) = 0.1 \\\\
\\]

A potential point of confusion: `modkit` does not assume a base is canonical when the probability of modification is close to \\( \frac{1}{N_{\text{classes}}} \\), the lowest probability the algorithm may assign.
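The two worked examples above can be reproduced with a few lines. This is a sketch of the arithmetic only; the single-letter keys are the SAM modification codes (`a` for m6A, `m` for 5mC, `h` for 5hmC):

```python
def canonical_prob(mod_probs):
    """P_canonical = 1 - sum of the modification probabilities
    parsed from the ML tag for this base."""
    return 1.0 - sum(mod_probs.values())

# m6A-only model: P(m6A) = 0.9 -> P(canonical A) = 0.1
p_a = canonical_prob({"a": 0.9})

# 5mC/5hmC model: P(5mC) = 0.7, P(5hmC) = 0.2 -> P(canonical C) = 0.1
p_c = canonical_prob({"m": 0.7, "h": 0.2})
```

Both `p_a` and `p_c` come out to 0.1, matching the equations above.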

## What value for `--filter-threshold` should I use?

Just as you might remove low-quality data as a first step in any processing, `modkit` will filter out the lowest-confidence base modification probabilities.
The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step.
This does not remove the whole read from consideration; only the base modification information attributed to a particular position in the read is removed.
The most common place to encounter filtering is in `pileup`, where base modification probabilities falling below the pass threshold will be tabulated in the \\( \text{N}\_{\text{Fail}} \\) column instead of the \\( \text{N}\_{\text{valid}} \\) column.
For highest accuracy, the general recommendation is to let `modkit` estimate this value for you based on the input data.
The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \\(10^{\text{th}}\\) percentile probability value.
This percentile can be changed with the `--filter-percentile` option.
Passing a value to `--filter-threshold` and/or `--mod-threshold` that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively.
It may be a good idea to inspect the distribution of probability values in your data; the `modkit sample-probs` [command](./intro_sample_probs.md) is designed for this task.
Use the `--hist` and `--out-dir` options to collect a histogram of the prediction probabilities for each canonical base and modification.
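The percentile estimate described above can be sketched with a simple nearest-rank calculation. This is an approximation for illustration, assuming a small sample of per-call probabilities, not modkit's exact estimator:

```python
def estimate_filter_threshold(probs, percentile=0.10):
    """Nearest-rank sketch of the pass-threshold estimate: the
    probability value at the given percentile of the sampled calls."""
    ranked = sorted(probs)
    index = int(percentile * len(ranked))
    return ranked[min(index, len(ranked) - 1)]

# Hypothetical sample of base modification probabilities from the ML tags.
sampled = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.52, 0.51]
threshold = estimate_filter_threshold(sampled)  # 10th percentile by default
```

Calls with probabilities below `threshold` would be tabulated as filtered (N_Fail) rather than passing (N_valid); raising `percentile` excludes more calls, lowering it includes more.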



<!-- ## How can I perform differential methylation analysis? -->
28 changes: 28 additions & 0 deletions book/src/intro_sample_probs.md
@@ -0,0 +1,28 @@
# Inspecting base modification probabilities

> For details on how base modification probabilities are calculated, see the [FAQ page](./faq.html#how-are-base-modification-probabilities-calculated).

For most use cases the automatic filtering enabled in `modkit` will produce nearly ideal results.
However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds.
The `modkit sample-probs` command is designed for this task.
There are two ways to use this command. The first is to simply run `modkit sample-probs $mod_bam` to get a tab-separated file of threshold values for each modified base.
This can save time in downstream steps where you wish to re-use the threshold value: pass it with `--filter-threshold` and skip re-estimating the value.
To generate more advanced output, add `--hist --out-dir $output_dir` to the command to generate per-modification histograms of the output probabilities.
Using the command this way produces 3 files in the `$output_dir`:
1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads.
1. Another HTML document containing the proportions of each probability emitted.
1. A tab-separated table with the same information as the histograms and the percentile rank of each probability value.

The schema of the table is as follows:

| column | name | description | type |
|--------|-----------------|----------------------------------------------------------------------------------------------|--------|
| 1 | code | modification code or '-' for canonical | string |
| 2 | primary base | the primary DNA base for which the code applies | string |
| 3 | range_start | the inclusive start probability of the bin | float |
| 4 | range_end | the exclusive end probability of the bin | float |
| 5 | count | the total count of probabilities falling in this bin | int |
| 6 | frac | the fraction of the total calls for this code/primary base in this bin | float |
| 7 | percentile_rank | the [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) of this probability bin | float |

From these plots and tables you can decide on a pass threshold per-modification code and use `--mod-threshold`/`--filter-threshold` [accordingly](./filtering.md).
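Picking a threshold from the table can be done with a short script. The rows below are invented example values laid out per the schema above (the 0.10 cutoff mirrors the default `--filter-percentile`); this is a sketch, not output from a real run:

```python
import csv
import io

# Hypothetical excerpt of the tab-separated table (columns as in the schema above).
tsv = (
    "code\tprimary base\trange_start\trange_end\tcount\tfrac\tpercentile_rank\n"
    "m\tC\t0.85\t0.90\t900\t0.15\t0.15\n"
    "m\tC\t0.90\t0.95\t1500\t0.25\t0.40\n"
    "m\tC\t0.95\t1.00\t3600\t0.60\t1.00\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Candidate pass threshold for 5mC: the start of the first bin whose
# percentile rank clears the 10th percentile of calls.
candidate = next(
    float(r["range_start"])
    for r in rows
    if r["code"] == "m" and float(r["percentile_rank"]) > 0.10
)
```

With these example rows, `candidate` is 0.85; that value could then be passed as `--mod-threshold m:0.85`.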
