Merge branch 'ar/docs-0.3.2' into 'master'
Docs 0.3.2 See merge request machine-learning/modkit!206
Showing 39 changed files with 863 additions and 85 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# Frequently asked questions | ||
|
||
## How are base modification probabilities calculated? | ||
|
||
Each base modification call is assigned a probability that reflects the confidence of the base modification detection algorithm in its decision about the modification state of the molecule at a particular position. | ||
The probabilities are parsed from the `ML` tag in the BAM record. These values reflect the probability of the base having a specific modification. `modkit` uses them to calculate the probability of each modification as well as the probability that the base is canonical: | ||
|
||
\\[ \ | ||
P_{\text{canonical}} = 1 - \sum_{m \in \textbf{M}} P_{m} \ | ||
\\] | ||
|
||
where \\(\textbf{M}\\) is the set of all of the potential modifications for the base. | ||
|
||
For example, consider an m6A model that predicts m6A or canonical bases at adenine residues. If \\( P_{\text{m6A}} = 0.9 \\), then the probability of canonical \\( \text{A} \\) is \\( P_{\text{canonical}} = 1 - P_{\text{m6A}} = 0.1 \\). | ||
Or consider a typical case for cytosine modifications, where the model predicts 5hmC, 5mC, and canonical cytosine: | ||
|
||
\\[ | ||
P_{\text{5mC}} = 0.7, \\\\ | ||
P_{\text{5hmC}} = 0.2, \\\\ | ||
P_{\text{canonical}} = 1 - (P_{\text{5mC}} + P_{\text{5hmC}}) = 0.1, \\\\ | ||
\\] | ||
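
The arithmetic in the two examples above can be sketched in a few lines of Python. This is only an illustration of the formula, not `modkit`'s implementation; the function name `canonical_probability` and the dictionary layout are assumptions made for this example.

```python
def canonical_probability(mod_probs):
    """Return P_canonical = 1 - sum(P_m) over all modifications m of a base.

    `mod_probs` maps a modification name (e.g. "5mC") to its probability.
    This layout is hypothetical and chosen only for illustration.
    """
    if any(p < 0.0 or p > 1.0 for p in mod_probs.values()):
        raise ValueError("each probability must lie in [0, 1]")
    total = sum(mod_probs.values())
    if total > 1.0 + 1e-9:
        raise ValueError("modification probabilities must sum to at most 1")
    # Clamp at zero to absorb floating-point rounding.
    return max(0.0, 1.0 - total)


# m6A model: one modification class plus canonical A.
p_canonical_a = canonical_probability({"m6A": 0.9})

# Cytosine model: 5mC and 5hmC plus canonical C.
p_canonical_c = canonical_probability({"5mC": 0.7, "5hmC": 0.2})

print(round(p_canonical_a, 3), round(p_canonical_c, 3))
```

Both calls give 0.1 up to floating-point rounding, matching the worked examples.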
|
||
One potential source of confusion is that `modkit` does not assume a base is canonical when the probability of modification is close to \\( \frac{1}{N_{\text{classes}}} \\), the lowest probability the algorithm can assign to its most likely class. | ||
|
||
## What value for `--filter-threshold` should I use? | ||
|
||
Just as you might remove low-quality data as a first step in any processing, `modkit` filters out the lowest-confidence base modification probabilities. | ||
The filter threshold (or pass threshold) defines the minimum probability required for a read's base modification information at a particular position to be used in a downstream step. | ||
Filtering does not remove the whole read from consideration; only the base modification information attributed to that position in the read is removed. | ||
The most common place to encounter filtering is in `pileup`, where base modification probabilities falling below the pass threshold will be tabulated in the \\( \text{N}\_{\text{Fail}} \\) column instead of the \\( \text{N}\_{\text{valid}} \\) column. | ||
For highest accuracy, the general recommendation is to let `modkit` estimate this value for you based on the input data. | ||
The value is calculated by first taking a sample of the base modification probabilities from the input dataset and determining the \\(10^{\text{th}}\\) percentile probability value. | ||
This percentile can be changed with the `--filter-percentile` option. | ||
Passing a value to `--filter-threshold` and/or `--mod-threshold` that is higher or lower than the estimated value will have the effect of excluding or including more probabilities, respectively. | ||
It may be a good idea to inspect the distribution of probability values in your data; the `modkit sample-probs` [command](./intro_sample_probs.md) is designed for this task. | ||
Use the `--hist` and `--out-dir` options to collect a histogram of the prediction probabilities for each canonical base and modification. | ||
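
The percentile-based estimate described above can be illustrated with a short Python sketch. It assumes the sampled probabilities are already in memory and uses the nearest-rank percentile method; how `modkit` samples reads and interpolates the percentile is not specified here, so the function names `estimate_threshold` and `split_calls` and the synthetic input are assumptions for illustration only.

```python
import math


def estimate_threshold(probs, percentile=10.0):
    """Nearest-rank percentile of the sampled probabilities (an assumed method)."""
    if not probs:
        raise ValueError("need at least one sampled probability")
    if not 0.0 < percentile <= 100.0:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(probs)
    # Nearest-rank: smallest value with at least `percentile` % of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


def split_calls(probs, threshold):
    """Count calls that pass (>= threshold) and calls that fail (< threshold)."""
    n_fail = sum(1 for p in probs if p < threshold)
    return len(probs) - n_fail, n_fail


# Synthetic probabilities, evenly spaced from 0.5 to 0.975 (illustrative only).
sampled = [0.5 + 0.025 * i for i in range(20)]
threshold = estimate_threshold(sampled, percentile=10.0)
n_pass, n_fail = split_calls(sampled, threshold)
print(threshold, n_pass, n_fail)
```

Raising `percentile` raises the estimated threshold and moves more calls into the failing group, which mirrors the effect of passing a higher value to `--filter-threshold`.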
|
||
|
||
|
||
<!-- ## How can I perform differential methylation analysis? --> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Inspecting base modification probabilities | ||
|
||
> For details on how base modification probabilities are calculated, see the [FAQ page](./faq.html#how-are-base-modification-probabilities-calculated). | ||
For most use cases the automatic filtering enabled in `modkit` will produce nearly ideal results. | ||
However, in some cases such as exotic organisms or specialized assays, you may want to interrogate the base modification probabilities directly and tune the pass thresholds. | ||
The `modkit sample-probs` command is designed for this task. | ||
There are two ways to use this command. The first is to run `modkit sample-probs $mod_bam`, which produces a tab-separated file of threshold values for each modified base. | ||
This can save time in downstream steps: pass the value to `--filter-threshold` to re-use it and skip re-estimating it. | ||
The second is to add `--hist --out-dir $output_dir` to the command, which generates per-modification histograms of the output probabilities. | ||
Using the command this way produces 3 files in the `$output_dir`: | ||
1. An HTML document containing a histogram of the total counts of each probability emitted for each modification code (including canonical) in the sampled reads. | ||
1. Another HTML document containing the proportions of each probability emitted. | ||
1. A tab-separated table with the same information as the histograms and the percentile rank of each probability value. | ||
|
||
The schema of the table is as follows: | ||
|
||
| column | name | description | type | | ||
|--------|-----------------|----------------------------------------------------------------------------------------------|--------| | ||
| 1 | code | modification code or '-' for canonical | string | | ||
| 2 | primary base | the primary DNA base for which the code applies | string | | ||
| 3 | range_start | the inclusive start probability of the bin | float | | ||
| 4 | range_end | the exclusive end probability of the bin | float | | ||
| 5 | count | the total count of probabilities falling in this bin | int | | ||
| 6 | frac | the fraction of the total calls for this code/primary base in this bin | float | | ||
| 7 | percentile_rank | the [percentile rank](https://en.wikipedia.org/wiki/Percentile_rank) of this probability bin | float | | ||
|
||
From these plots and tables you can decide on a pass threshold per-modification code and use `--mod-threshold`/`--filter-threshold` [accordingly](./filtering.md). |
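
As a rough sketch of how the table could be used programmatically, the Python below parses rows in the column order given by the schema and returns the start of the first bin whose percentile rank reaches a chosen percentile. The names `ProbBin`, `parse_bins`, and `threshold_for`, the optional header handling, and the three example rows are all assumptions made for illustration; they are not real `modkit` output, and the file name that `modkit` writes is not assumed here.

```python
import csv
import io
from typing import NamedTuple


class ProbBin(NamedTuple):
    code: str               # modification code, or '-' for canonical
    primary_base: str
    range_start: float      # inclusive
    range_end: float        # exclusive
    count: int
    frac: float
    percentile_rank: float


def parse_bins(text):
    """Parse tab-separated rows laid out in the documented column order."""
    bins = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines and a header row, if one is present (assumption).
        if not row or row[0] == "code":
            continue
        bins.append(ProbBin(row[0], row[1], float(row[2]), float(row[3]),
                            int(row[4]), float(row[5]), float(row[6])))
    return bins


def threshold_for(bins, code, percentile=10.0):
    """Start of the first bin for `code` whose percentile rank reaches `percentile`."""
    for b in sorted((b for b in bins if b.code == code), key=lambda b: b.range_start):
        if b.percentile_rank >= percentile:
            return b.range_start
    return None


# Made-up rows for modification code 'm' on primary base C (illustrative only).
example = "\n".join([
    "m\tC\t0.50\t0.75\t5\t0.05\t2.5",
    "m\tC\t0.75\t0.90\t15\t0.15\t12.5",
    "m\tC\t0.90\t1.00\t80\t0.80\t60.0",
])
print(threshold_for(parse_bins(example), "m"))
```

A value chosen this way is only as fine-grained as the bin width, so inspecting the histograms alongside the table is still worthwhile.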