clean docs
Don't need to hold onto every previous version of the wiki.
We can just freeze before releases and update here
ACEnglish committed Jan 10, 2025
1 parent 751a207 commit 7380b25
Showing 252 changed files with 62 additions and 22,691 deletions.
Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference which can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.

If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run:

```
truvari anno chunks input.vcf.gz > counts.bed
```

The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 4th column has the number of SVs inside the chunk while the 5th and 6th have comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 5th is after separating by svtype and svlen; the 6th is after re-chunking the svtype/svlen separation by distance.
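
As a quick, hedged sketch of what this looks like (the chunk spans and counts below are fabricated for illustration), high-density chunks can be pulled out of `counts.bed` with `awk`:

```bash
# Fabricated counts.bed excerpt: chrom, start, end, SV count, sub-chunk counts
printf 'chr1\t100\t5000\t120\t60,60\t30,30,30,30\n' > counts.bed
printf 'chr1\t900000\t950000\t45000\t45000\t45000\n' >> counts.bed

# Keep chunks with more than 30k SVs as candidates for exclusion
awk '$4 > 30000' counts.bed > to_exclude.bed
cat to_exclude.bed
```

The resulting `to_exclude.bed` is what the `bedtools complement` step below consumes; the `30000` threshold is a guess that should be tuned on your own data.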

If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset to the regions of interest and then use `bedtools` to create a bed file that will skip them.

```
# exclude regions with more than 30k SVs. You should test this threshold.
bedtools complement -g genome.bed -i to_exclude.bed > to_analyze.bed
truvari collapse --bed to_analyze.bed -i input.vcf.gz
```

When considering what threshold you would like to use, just looking at the 4th column may not be sufficient since the 'sub-chunks' may be smaller and will therefore run faster. There is also no guarantee that a region with high SV density will be slow. For example, if all SVs in a chunk could collapse, it would only take O(N - 1) comparisons to complete the collapse. The problems arise when the SVs are largely non-redundant and therefore require O(N**2) comparisons.
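
To make the scaling concrete, here is a small illustrative calculation (mine, not Truvari's code) of the comparison counts in the two extremes:

```python
def comparisons(n_variants, all_collapse):
    """Comparisons needed to collapse a chunk of n_variants SVs."""
    if all_collapse:
        # One kept variant absorbs every other: N - 1 comparisons
        return n_variants - 1
    # No redundancy: every pair must be checked, N*(N-1)/2
    return n_variants * (n_variants - 1) // 2

print(comparisons(100_000, all_collapse=True))   # 99999
print(comparisons(100_000, all_collapse=False))  # 4999950000
```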

A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded and collapsing repeated until success.
15 changes: 15 additions & 0 deletions docs/v4.3.1/Updates.md → docs/Updates.md
# Truvari 5.0
*in progress*

* Reference context sequence comparison is now deprecated, and the sequence similarity calculation is improved by also checking the lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
* Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>`
* `--sizemax` now defaults to `-1`, meaning all variants ≥ `--sizemin` / `--sizefilt` are compared
* Redundant variants which are collapsed into kept variants (a.k.a. removed variants) are now more clearly labeled (`--removed-output` instead of `--collapsed-output`)
* Fixed 'Unknown error' caused by unset TMPDIR ([#229](https://github.com/ACEnglish/truvari/issues/229) and [#245](https://github.com/ACEnglish/truvari/issues/245))
* Fixes to minor record-keeping bugs in refine/ga4gh to better ensure all variants are counted/preserved
* BND variants are now compared by bench ([details](https://github.com/ACEnglish/truvari/wiki/bench#bnd-comparison))
* Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None`
* Major refactor of Truvari package API for easy reuse of SV comparison functions ([details](https://truvari.readthedocs.io/en/latest/))

# Truvari 4.3.1
*September 9, 2024*
* `bench`
8 changes: 7 additions & 1 deletion docs/v4.2.2/anno.md → docs/anno.md
Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants.
### Details
Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via:
```bash
zcat adotto_TRregions_v1.2.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
tabix anno.trf.bed.gz
```
For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)).
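
For illustration, here is the quoted formula as a worked example (the inputs are made up; the sign convention simply follows the formula as written above):

```python
def copies_changed(ovl_pct, svlen, period):
    # round(-(ovl_pct * svlen) / anno["period"], 1) from the text above
    return round(-(ovl_pct * svlen) / period, 1)

# A 10bp deletion fully overlapping (ovl_pct = 1.0) a motif with period 2
print(copies_changed(1.0, 10, 2))  # -5.0, i.e. five copies removed
```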
```
options:
--debug Verbose logging
```

### Notes about the Adotto catalog
If using an older version of Truvari (≤4.3.1) or adotto catalog v1.2, you may have incompatibility issues due to the 'repeat' field being named 'motif'. This can be fixed by running:
```
zcat adotto_TRregions_v1.2.bed.gz | cut -f1-3,18 | sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' | bgzip > anno.trf.bed.gz
```
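
To see what the `sed` above actually does, here is a hedged demonstration on a single fabricated line mimicking the v1.2 catalog's quoted-JSON annotation column (real catalog files have many more columns and entries):

```bash
printf 'chr1\t100\t200\t"[{""motif"":""AT"",""period"":2}]"\n' > demo.bed
sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' demo.bed
# chr1    100     200     [{"repeat":"AT","period":2}]
```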

# truvari anno grm

For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints.
54 changes: 23 additions & 31 deletions docs/v4.3.1/bench.md → docs/bench.md
See [[refine wiki|refine]] for details.
Comparing Sequences of Variants
===============================

Two SV insertions could theoretically be at the exact same position and the exact same size but have completely different sequences. Truvari's high specificity when matching variants is due in large part to comparing sequences. Truvari uses edlib to estimate the sequence similarity of two alleles. In addition to comparing the direct (a.k.a. unmanipulated) similarity of sequences, Truvari also checks rotations of the sequences. The first rotation is the lexicographically minimum rotation of the sequences, and the second is the 'unrolled' sequences, formally described in [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). These 'rolling' operations can be turned off with `--no-roll`.

The main idea of rotating sequences is that in order to move a variant upstream/downstream, the reference sequence flanking the variant needs to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequence as being circular instead of linear. This means that to move an INS 1bp downstream, we can remove the first base from the ALT and append it to the end. For example, the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2 - 1 = 1` bases from the start to the end. Finally, to increase matching sensitivity, the lexicographically minimum rotation is also compared, as we've found many instances of smaller (<100bp) insertions which should match but fail direct and unrolled sequence similarity, yet are captured by this third rotation.
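
A minimal sketch of these two rotations (my illustration of the idea, not Truvari's implementation):

```python
def unroll(alt, shift):
    """Rotate an insertion's ALT sequence by `shift` bases (circular view)."""
    shift %= len(alt)
    return alt[shift:] + alt[:shift]

def min_rotation(seq):
    """Lexicographically minimum rotation, compared as a third fallback."""
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

# The INS `ab` at POS 2 matches the INS `ba` at POS 1 after rolling 2 - 1 = 1 base
print(unroll("ab", 1))     # ba
print(min_rotation("ba"))  # ab
```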

As of truvari v5.0, the unroll method is the only supported sequence comparison technique.

## Symbolic alleles

As of truvari v5.0, some symbolic alts can be compared with sequence similarity turned on. The sequences of `<DEL>` and `<INV>` variants will be filled in when a `--reference` is provided, if the variant's size is smaller than `--max-resolve` (default 25kbp). Additionally, `<DUP>` variants can have their sequences filled in if `--dup-to-ins` is also provided. The duplications are assumed to be perfect copies of the reference range spanned by the SV and are placed as a point insertion at the start position.
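
As a rough sketch of what 'filling in' a `<DEL>` could look like (VCF-style 1-based coordinates; this is my simplification, not Truvari's actual resolver):

```python
def resolve_del(chrom_seq, pos, end):
    """Resolve a symbolic <DEL> spanning POS..END into REF/ALT sequences:
    REF is the reference span and ALT keeps only the anchor base at POS."""
    ref = chrom_seq[pos - 1:end]
    alt = ref[0]
    return ref, alt

print(resolve_del("ACGTACGTAC", 2, 5))  # ('CGTA', 'C')
```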

Variants that are still symbolic alts (even after attempting to resolve) can be compared with sequence resolved SVs, so there is no need to specify `--pctseq 0`. Note, however, that the `PctSeqSimilarity` field will not be populated.

Symbolic Alt sub-types are ignored e.g. `<DUP:TANDEM>` is considered just `<DUP>`.

Truvari can replace the symbolic alt of resolved SVs in the output VCF with the parameter `--write-resolved`.

BND Comparison
==============
Breakend (BND) variants are compared by checking a few conditions using a single threshold, `--bnddist`, which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist`-buffered boundaries (e.g. `overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`).

The baseline and comparison BNDs' POS and their joined position must both be within `--bnddist` to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example `t[p[` (piece extending to the right of p is joined after t) only matches with `t[p[` and won't match to `[p[t` (reverse comp piece extending right of p is joined before t).

BNDs are annotated in the truvari output with the fields: StartDistance (baseline POS minus comparison POS); EndDistance (baseline join position minus comparison join position); and TruScore, which describes the percent of the allowed distance needed to find this match (`(1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100`). For example, two BNDs 20bp apart with a bnddist of 100 make a score of 90.
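
A hedged sketch of the candidate check and TruScore arithmetic described above (not Truvari's code; `dist` stands in for `--bnddist`, and treating "within `--bnddist`" as a simple absolute-distance check is my assumption):

```python
def bnd_truscore(pos_base, pos_comp, join_base, join_comp, dist=100):
    start_d = pos_base - pos_comp   # StartDistance
    end_d = join_base - join_comp   # EndDistance
    if abs(start_d) > dist or abs(end_d) > dist:
        return None  # not a match candidate
    return (1 - ((abs(start_d) + abs(end_d)) / 2) / (dist * 2)) * 100

# Two BNDs whose breakpoints are each 20bp apart, with dist=100, score 90
print(bnd_truscore(1000, 1020, 5000, 5020))
```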

Another complication for matching BNDs is that they may represent an event which could be 'resolved' in another VCF. For example, a tandem duplication between `start-end` could be represented as two BNDs of `start to N[{chrom}:{end}[` and `end to ]{chrom}:{start}]N`. Therefore, truvari also attempts to compare symbolic alt SVs (ALT = `<DEL>`, `<INV>`, `<DUP>`) to a BND by decomposing the symbolic alt into its breakpoints. These decomposed BNDs are then each checked against a comparison BND and the highest TruScore match is kept.

Note that DUPs are always decomposed to DUP:TANDEM breakpoints. Note that with `--pick single`, a decomposed SV will only match to one BND, so `--pick multi` is recommended to ensure all BNDs will match to a single decomposed SV.

BND comparison can be turned off by setting `--bnddist -1`. Symbolic ALT decomposition can be turned off with `--no-decompose`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.

Controlling the number of matches
=================================
See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.
Methodology
===========
Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.

![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png)

```
* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs
* use `--pick` method to sort and annotate variants with their best match
```
4 changes: 3 additions & 1 deletion docs/v4.3.1/collapse.md → docs/collapse.md
`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/ACEnglish/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample.

Since `collapse` uses the same variant comparison engine as `bench`, there is a lot of overlap in the behavior. Be sure to also be familiar with the [bench documentation](https://github.com/ACEnglish/truvari/wiki/bench).

Example
=======
7 changes: 6 additions & 1 deletion docs/v4.2.2/consistency.md → docs/consistency.md
Below is a consistency report in json format.

```
}
]
}
```

Output TSV
==========

When given a `--output`, a TSV is written which holds the variant key used for comparison (first 5 columns of the VCF entry), the FLAG corresponding to the 'group' in the main output (e.g. FLAG 3 == group 0011), and a COUNT of how many VCFs the key was found in.
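
The FLAG/group relationship can be sketched as a bitmask, assuming one bit per input VCF (my illustration of the convention described above):

```python
def flag_to_group(flag, n_vcfs):
    """Render the TSV FLAG as the report's binary 'group' string."""
    return format(flag, f'0{n_vcfs}b')

print(flag_to_group(3, 4))  # 0011 -> present in the last two VCFs
```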
14 changes: 3 additions & 11 deletions docs/freeze_wiki.sh

set -e

# Pull the current version of the wiki

git clone https://github.com/spiralgenetics/truvari.wiki.git
rm -rf truvari.wiki/.git
mv truvari.wiki/* ./
rm -rf truvari.wiki

echo 'Review, git add, and git push'
