clean docs
Don't need to hold onto every previous version of the wiki.
We can just freeze before releases and update here
ACEnglish committed Jan 10, 2025
1 parent 751a207 commit 7380b25
Showing 252 changed files with 62 additions and 22,691 deletions.
Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference which can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.

If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run:

```
truvari anno chunks input.vcf.gz > counts.bed
```

The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 4th column has the number of SVs inside the chunk while the 5th and 6th have comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 5th is after separating by svtype and svlen; the 6th is after re-chunking the svtype/svlen separation by distance.
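
As a quick, hedged sketch of what this looks like (the chunk spans and counts below are fabricated for illustration), high-density chunks can be pulled out of `counts.bed` with `awk`:

```bash
# Fabricated counts.bed excerpt: chrom, start, end, SV count, sub-chunk counts
printf 'chr1\t100\t5000\t120\t60,60\t30,30,30,30\n' > counts.bed
printf 'chr1\t900000\t950000\t45000\t45000\t45000\n' >> counts.bed

# Keep chunks with more than 30k SVs as candidates for exclusion
awk '$4 > 30000' counts.bed > to_exclude.bed
cat to_exclude.bed
```

The resulting `to_exclude.bed` is what the `bedtools complement` step below consumes; the `30000` threshold is a guess that should be tuned on your own data.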

If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset to the regions of interest and then use `bedtools` to create a bed file that will skip them.

```
# exclude regions with more than 30k SVs. You should test this threshold.
bedtools complement -g genome.bed -i to_exclude.bed > to_analyze.bed
truvari collapse --bed to_analyze.bed -i input.vcf.gz
```

When considering what threshold you would like to use, just looking at the 4th column may not be sufficient since the 'sub-chunks' may be smaller and will therefore run faster. There is also no guarantee that a region with high SV density will be slow. For example, if all SVs in a chunk could collapse, it would only take O(N - 1) comparisons to complete the collapse. The problems arise when the SVs are largely non-redundant and therefore require O(N**2) comparisons.
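
To make the scaling concrete, here is a small illustrative calculation (mine, not Truvari's code) of the comparison counts in the two extremes:

```python
def comparisons(n_variants, all_collapse):
    """Comparisons needed to collapse a chunk of n_variants SVs."""
    if all_collapse:
        # One kept variant absorbs every other: N - 1 comparisons
        return n_variants - 1
    # No redundancy: every pair must be checked, N*(N-1)/2
    return n_variants * (n_variants - 1) // 2

print(comparisons(100_000, all_collapse=True))   # 99999
print(comparisons(100_000, all_collapse=False))  # 4999950000
```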

A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded and collapsing repeated until success.
15 changes: 15 additions & 0 deletions docs/v4.3.1/Updates.md → docs/Updates.md
# Truvari 5.0
*in progress*

* Reference context sequence comparison is now deprecated, and the sequence similarity calculation is improved by also checking the lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
* Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>`
* `--sizemax` now defaults to `-1`, meaning all variants ≥ `--sizemin` / `--sizefilt` are compared
* Redundant variants which are collapsed into kept variants (a.k.a. removed variants) are now more clearly labeled (`--removed-output` instead of `--collapsed-output`)
* Fixed 'Unknown error' caused by unset TMPDIR ([#229](https://github.com/ACEnglish/truvari/issues/229) and [#245](https://github.com/ACEnglish/truvari/issues/245))
* Fixes to minor record-keeping bugs in refine/ga4gh to better ensure all variants are counted/preserved
* BND variants are now compared by bench ([details](https://github.com/ACEnglish/truvari/wiki/bench#bnd-comparison))
* Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None`
* Major refactor of Truvari package API for easy reuse of SV comparison functions ([details](https://truvari.readthedocs.io/en/latest/))

# Truvari 4.3.1
*September 9, 2024*
* `bench`
8 changes: 7 additions & 1 deletion docs/v4.2.2/anno.md → docs/anno.md
Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants.
### Details
Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via:
```bash
zcat adotto_TRregions_v1.2.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
tabix anno.trf.bed.gz
```
For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)).
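
For illustration, here is the quoted formula as a worked example (the inputs are made up; the sign convention simply follows the formula as written above):

```python
def copies_changed(ovl_pct, svlen, period):
    # round(-(ovl_pct * svlen) / anno["period"], 1) from the text above
    return round(-(ovl_pct * svlen) / period, 1)

# A 10bp deletion fully overlapping (ovl_pct = 1.0) a motif with period 2
print(copies_changed(1.0, 10, 2))  # -5.0, i.e. five copies removed
```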
```
options:
--debug Verbose logging
```

### Notes about the Adotto catalog
If using an older version of Truvari (≤4.3.1) or adotto catalog v1.2, you may have incompatibility issues due to the 'repeat' field being named 'motif'. This can be fixed by running:
```
zcat adotto_TRregions_v1.2.bed.gz | cut -f1-3,18 | sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' | bgzip > anno.trf.bed.gz
```
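
To see what the `sed` above actually does, here is a hedged demonstration on a single fabricated line mimicking the v1.2 catalog's quoted-JSON annotation column (real catalog files have many more columns and entries):

```bash
printf 'chr1\t100\t200\t"[{""motif"":""AT"",""period"":2}]"\n' > demo.bed
sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' demo.bed
# chr1    100     200     [{"repeat":"AT","period":2}]
```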

# truvari anno grm

For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints.
54 changes: 23 additions & 31 deletions docs/v4.3.1/bench.md → docs/bench.md
See [[refine wiki|refine]] for details.
Comparing Sequences of Variants
===============================

Two SV insertions could theoretically be at the exact same position and the exact same size but have completely different sequences. Truvari's high specificity when matching variants is due in large part to comparing sequences. Truvari uses edlib to estimate the sequence similarity of two alleles. In addition to comparing the direct (a.k.a. unmanipulated) similarity of sequences, Truvari also checks rotations of the sequences. The first rotation is the lexicographically minimum rotation of the sequences, and the second is the 'unrolled' sequences, formally described in [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). These 'rolling' operations can be turned off with `--no-roll`.

The main idea of rotating sequences is that in order to move a variant upstream/downstream, the reference sequence flanking the variant needs to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequence as being circular instead of linear. This means that to move an INS 1bp downstream, we can remove the first base from the ALT and append it to the end. For example, the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2 - 1 = 1` bases from the start to the end. Finally, to increase matching sensitivity, the lexicographically minimum rotation is also compared, as we've found many instances of smaller (<100bp) insertions which should match but fail direct and unrolled sequence similarity, yet are captured by this third rotation.
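
A minimal sketch of these two rotations (my illustration of the idea, not Truvari's implementation):

```python
def unroll(alt, shift):
    """Rotate an insertion's ALT sequence by `shift` bases (circular view)."""
    shift %= len(alt)
    return alt[shift:] + alt[:shift]

def min_rotation(seq):
    """Lexicographically minimum rotation, compared as a third fallback."""
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

# The INS `ab` at POS 2 matches the INS `ba` at POS 1 after rolling 2 - 1 = 1 base
print(unroll("ab", 1))     # ba
print(min_rotation("ba"))  # ab
```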

As of truvari v5.0, the unroll method is the only supported sequence comparison technique.

## Symbolic alleles

As of truvari v5.0, some symbolic alts can be compared with sequence similarity turned on. The sequences of `<DEL>` and `<INV>` variants will be filled in when a `--reference` is provided, if the variant's size is smaller than `--max-resolve` (default 25kbp). Additionally, `<DUP>` variants can have their sequences filled in if `--dup-to-ins` is also provided. The duplications are assumed to be perfect copies of the reference range spanned by the SV and are placed as a point insertion at the start position.
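
As a rough sketch of what 'filling in' a `<DEL>` could look like (VCF-style 1-based coordinates; this is my simplification, not Truvari's actual resolver):

```python
def resolve_del(chrom_seq, pos, end):
    """Resolve a symbolic <DEL> spanning POS..END into REF/ALT sequences:
    REF is the reference span and ALT keeps only the anchor base at POS."""
    ref = chrom_seq[pos - 1:end]
    alt = ref[0]
    return ref, alt

print(resolve_del("ACGTACGTAC", 2, 5))  # ('CGTA', 'C')
```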

Variants that are still symbolic alts (even after attempting to resolve) can be compared with sequence resolved SVs, so there is no need to specify `--pctseq 0`. Note, however, that the `PctSeqSimilarity` field will not be populated.

Symbolic Alt sub-types are ignored e.g. `<DUP:TANDEM>` is considered just `<DUP>`.

Truvari can replace the symbolic alt of resolved SVs in the output VCF with the parameter `--write-resolved`.

BND Comparison
==============
Breakend (BND) variants are compared by checking a few conditions using a single threshold, `--bnddist`, which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist`-buffered boundaries (e.g. `overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`).

The baseline and comparison BNDs' POS and their joined position must both be within `--bnddist` to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example `t[p[` (piece extending to the right of p is joined after t) only matches with `t[p[` and won't match to `[p[t` (reverse comp piece extending right of p is joined before t).

BNDs are annotated in the truvari output with the fields: StartDistance (baseline POS minus comparison POS); EndDistance (baseline join position minus comparison join position); and TruScore, which describes the percent of the allowed distance needed to find this match (`(1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100`). For example, two BNDs 20bp apart with a bnddist of 100 make a score of 90.
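
A hedged sketch of the candidate check and TruScore arithmetic described above (not Truvari's code; `dist` stands in for `--bnddist`, and treating "within `--bnddist`" as a simple absolute-distance check is my assumption):

```python
def bnd_truscore(pos_base, pos_comp, join_base, join_comp, dist=100):
    start_d = pos_base - pos_comp   # StartDistance
    end_d = join_base - join_comp   # EndDistance
    if abs(start_d) > dist or abs(end_d) > dist:
        return None  # not a match candidate
    return (1 - ((abs(start_d) + abs(end_d)) / 2) / (dist * 2)) * 100

# Two BNDs whose breakpoints are each 20bp apart, with dist=100, score 90
print(bnd_truscore(1000, 1020, 5000, 5020))
```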

Another complication for matching BNDs is that they may represent an event which could be 'resolved' in another VCF. For example, a tandem duplication between `start-end` could be represented as two BNDs of `start to N[{chrom}:{end}[` and `end to ]{chrom}:{start}]N`. Therefore, truvari also attempts to compare symbolic alt SVs (ALT = `<DEL>`, `<INV>`, `<DUP>`) to a BND by decomposing the symbolic alt into its breakpoints. These decomposed BNDs are then each checked against a comparison BND and the highest TruScore match is kept.

Note that DUPs are always decomposed to DUP:TANDEM breakpoints. Note that with `--pick single`, a decomposed SV will only match to one BND, so `--pick multi` is recommended to ensure all BNDs will match to a single decomposed SV.

BND comparison can be turned off by setting `--bnddist -1`. Symbolic ALT decomposition can be turned off with `--no-decompose`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.

Controlling the number of matches
=================================
See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.
Methodology
===========
Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.

![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png)

```
* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs
* use `--pick` method to sort and annotate variants with their best match
```
4 changes: 3 additions & 1 deletion docs/v4.3.1/collapse.md → docs/collapse.md
`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/ACEnglish/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample.

Since `collapse` uses the same variant comparison engine as `bench`, there is a lot of overlap in the behavior. Be sure to also be familiar with the [bench documentation](https://github.com/ACEnglish/truvari/wiki/bench).

Example
=======
7 changes: 6 additions & 1 deletion docs/v4.2.2/consistency.md → docs/consistency.md
Below is a consistency report in json format.

```
}
]
}
```

Output TSV
==========

When given a `--output`, a TSV is written which holds the variant key used for comparison (first 5 columns of the VCF entry), the FLAG corresponding to the 'group' in the main output (e.g. FLAG 3 == group 0011), and a COUNT of how many VCFs the key was found in.
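
The FLAG/group relationship can be sketched as a bitmask, assuming one bit per input VCF (my illustration of the convention described above):

```python
def flag_to_group(flag, n_vcfs):
    """Render the TSV FLAG as the report's binary 'group' string."""
    return format(flag, f'0{n_vcfs}b')

print(flag_to_group(3, 4))  # 0011 -> present in the last two VCFs
```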
14 changes: 3 additions & 11 deletions docs/freeze_wiki.sh

set -e

# Pull the current version of the wiki

git clone https://github.com/spiralgenetics/truvari.wiki.git
rm -rf truvari.wiki/.git
mv truvari.wiki/* ./
rm -rf truvari.wiki

echo 'Review, git add, and git push'
