From 7380b2562cbb96ac66505423ef3839efaddb6889 Mon Sep 17 00:00:00 2001 From: Adam English Date: Thu, 9 Jan 2025 20:55:01 -0500 Subject: [PATCH] clean docs Don't need to hold onto every previous version of the wiki. We can just freeze before releases and update here --- docs/{v4.0.0 => }/Citations.md | 0 ...-with-a-High\342\200\220Density-of-SVs.md" | 10 +- docs/{v4.0.0 => }/Development.md | 0 docs/{v4.3.1 => }/Home.md | 0 docs/{v4.3.0 => }/Installation.md | 0 docs/{v4.2.0 => }/MatchIds.md | 0 docs/{v4.0.0 => }/Multi-allelic-VCFs.md | 0 docs/{v4.3.1 => }/Updates.md | 15 + docs/{v4.2.2 => }/anno.md | 8 +- docs/{v4.3.1 => }/bench.md | 54 +- docs/{v4.3.1 => }/collapse.md | 4 +- docs/{v4.2.2 => }/consistency.md | 7 +- docs/{v3.5.0 => }/divide.md | 0 docs/freeze_wiki.sh | 14 +- docs/{v4.3.1 => }/ga4gh.md | 0 docs/{v4.2.0 => }/phab.md | 0 docs/{v4.2.2 => }/refine.md | 0 docs/{v4.0.0 => }/segment.md | 0 docs/{v4.0.0 => }/stratify.md | 0 docs/v2.1.1/Comparing-two-SV-programs.md | 63 --- docs/v2.1.1/Creating-Annotations.md | 32 -- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 48 -- docs/v2.1.1/Home.md | 28 - docs/v2.1.1/Multi-allelic-VCFs.md | 11 - docs/v2.1.1/The--multimatch-parameter.md | 5 - docs/v2.1.1/Updates.md | 24 - docs/v2.1.1/anno.md | 231 -------- docs/v2.1.1/bench.md | 194 ------- docs/v2.1.1/collapse.md | 112 ---- docs/v2.1.1/consistency.md | 73 --- docs/v2.1.1/stats.md | 30 -- docs/v2.1.1/stratifications.sh.md | 17 - docs/v2.1.1/vcf2df.md | 72 --- docs/v3.0.0/Citations.md | 27 - docs/v3.0.0/Comparing-two-SV-programs.md | 63 --- docs/v3.0.0/Development.md | 62 --- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v3.0.0/GIAB-stratifications.sh.md | 17 - docs/v3.0.0/Home.md | 29 - docs/v3.0.0/Multi-allelic-VCFs.md | 11 - docs/v3.0.0/The--multimatch-parameter.md | 5 - docs/v3.0.0/Updates.md | 52 -- docs/v3.0.0/anno.md | 293 ----------- docs/v3.0.0/bench.md | 186 ------- docs/v3.0.0/collapse.md | 130 ----- docs/v3.0.0/consistency.md | 74 --- 
docs/v3.0.0/vcf2df.md | 70 --- docs/v3.1.0/Citations.md | 29 - docs/v3.1.0/Comparing-two-SV-programs.md | 63 --- docs/v3.1.0/Development.md | 73 --- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v3.1.0/GIAB-stratifications.sh.md | 17 - docs/v3.1.0/Home.md | 32 -- docs/v3.1.0/Installation.md | 37 -- docs/v3.1.0/MatchIds.md | 27 - docs/v3.1.0/Multi-allelic-VCFs.md | 11 - docs/v3.1.0/Resources.md | 3 - docs/v3.1.0/The--multimatch-parameter.md | 3 - docs/v3.1.0/Updates.md | 69 --- docs/v3.1.0/anno.md | 293 ----------- docs/v3.1.0/bench.md | 232 -------- docs/v3.1.0/collapse.md | 140 ----- docs/v3.1.0/consistency.md | 74 --- docs/v3.1.0/segment.md | 18 - docs/v3.1.0/vcf2df.md | 73 --- docs/v3.2.0/Citations.md | 31 -- docs/v3.2.0/Comparing-two-SV-programs.md | 63 --- docs/v3.2.0/Development.md | 94 ---- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v3.2.0/GIAB-stratifications.sh.md | 17 - docs/v3.2.0/Home.md | 35 -- docs/v3.2.0/Installation.md | 37 -- docs/v3.2.0/MatchIds.md | 27 - docs/v3.2.0/Multi-allelic-VCFs.md | 11 - docs/v3.2.0/Resources.md | 3 - docs/v3.2.0/The--multimatch-parameter.md | 3 - docs/v3.2.0/Updates.md | 86 --- docs/v3.2.0/anno.md | 400 -------------- docs/v3.2.0/bench.md | 233 --------- docs/v3.2.0/collapse.md | 147 ------ docs/v3.2.0/consistency.md | 74 --- docs/v3.2.0/divide.md | 39 -- docs/v3.2.0/segment.md | 18 - docs/v3.2.0/vcf2df.md | 73 --- docs/v3.3.0/Citations.md | 31 -- docs/v3.3.0/Comparing-two-SV-programs.md | 63 --- docs/v3.3.0/Development.md | 90 ---- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v3.3.0/GIAB-stratifications.sh.md | 17 - docs/v3.3.0/Home.md | 35 -- docs/v3.3.0/Installation.md | 47 -- docs/v3.3.0/MatchIds.md | 27 - docs/v3.3.0/Multi-allelic-VCFs.md | 11 - docs/v3.3.0/Resources.md | 3 - docs/v3.3.0/The--multimatch-parameter.md | 3 - docs/v3.3.0/Updates.md | 95 ---- docs/v3.3.0/anno.md | 434 --------------- docs/v3.3.0/bench.md | 239 --------- docs/v3.3.0/collapse.md | 147 ------ 
docs/v3.3.0/consistency.md | 74 --- docs/v3.3.0/divide.md | 39 -- docs/v3.3.0/segment.md | 18 - docs/v3.3.0/vcf2df.md | 73 --- docs/v3.4.0/Citations.md | 31 -- docs/v3.4.0/Comparing-two-SV-programs.md | 63 --- docs/v3.4.0/Development.md | 90 ---- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v3.4.0/GIAB-stratifications.sh.md | 17 - docs/v3.4.0/Home.md | 35 -- docs/v3.4.0/Installation.md | 51 -- docs/v3.4.0/MatchIds.md | 27 - docs/v3.4.0/Multi-allelic-VCFs.md | 11 - docs/v3.4.0/Resources.md | 3 - docs/v3.4.0/The--multimatch-parameter.md | 3 - docs/v3.4.0/Updates.md | 110 ---- docs/v3.4.0/anno.md | 434 --------------- docs/v3.4.0/bench.md | 239 --------- docs/v3.4.0/collapse.md | 146 ------ docs/v3.4.0/consistency.md | 74 --- docs/v3.4.0/divide.md | 39 -- docs/v3.4.0/segment.md | 18 - docs/v3.4.0/vcf2df.md | 73 --- docs/v3.5.0/Citations.md | 31 -- docs/v3.5.0/Comparing-two-SV-programs.md | 67 --- docs/v3.5.0/Development.md | 90 ---- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v3.5.0/GIAB-stratifications.sh.md | 17 - docs/v3.5.0/Home.md | 35 -- docs/v3.5.0/Installation.md | 53 -- docs/v3.5.0/MatchIds.md | 27 - docs/v3.5.0/Multi-allelic-VCFs.md | 11 - docs/v3.5.0/Resources.md | 3 - docs/v3.5.0/The--multimatch-parameter.md | 3 - docs/v3.5.0/Updates.md | 134 ----- docs/v3.5.0/anno.md | 428 --------------- docs/v3.5.0/bench.md | 261 --------- docs/v3.5.0/collapse.md | 146 ------ docs/v3.5.0/consistency.md | 75 --- docs/v3.5.0/phab.md | 163 ------ docs/v3.5.0/segment.md | 18 - docs/v3.5.0/vcf2df.md | 75 --- docs/v4.0.0/Comparing-two-SV-programs.md | 67 --- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v4.0.0/Home.md | 36 -- docs/v4.0.0/Installation.md | 56 -- docs/v4.0.0/MatchIds.md | 30 -- docs/v4.0.0/Resources.md | 3 - docs/v4.0.0/Updates.md | 181 ------- docs/v4.0.0/anno.md | 468 ----------------- docs/v4.0.0/bench.md | 278 ---------- docs/v4.0.0/collapse.md | 144 ----- docs/v4.0.0/consistency.md | 170 ------ 
docs/v4.0.0/divide.md | 58 -- docs/v4.0.0/phab.md | 161 ------ docs/v4.0.0/refine.md | 107 ---- docs/v4.1.0/Citations.md | 30 -- docs/v4.1.0/Comparing-two-SV-programs.md | 67 --- docs/v4.1.0/Development.md | 90 ---- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v4.1.0/Home.md | 35 -- docs/v4.1.0/Installation.md | 56 -- docs/v4.1.0/MatchIds.md | 27 - docs/v4.1.0/Multi-allelic-VCFs.md | 11 - docs/v4.1.0/Updates.md | 215 -------- docs/v4.1.0/anno.md | 468 ----------------- docs/v4.1.0/bench.md | 297 ----------- docs/v4.1.0/collapse.md | 144 ----- docs/v4.1.0/consistency.md | 170 ------ docs/v4.1.0/divide.md | 58 -- docs/v4.1.0/phab.md | 155 ------ docs/v4.1.0/refine.md | 110 ---- docs/v4.1.0/segment.md | 18 - docs/v4.1.0/stratify.md | 58 -- docs/v4.1.0/vcf2df.md | 81 --- docs/v4.2.0/Citations.md | 30 -- docs/v4.2.0/Comparing-two-SV-programs.md | 67 --- docs/v4.2.0/Development.md | 90 ---- ...t-Distance-Ratio-vs-Sequence-Similarity.md | 55 -- docs/v4.2.0/Home.md | 35 -- docs/v4.2.0/Installation.md | 56 -- docs/v4.2.0/Multi-allelic-VCFs.md | 11 - docs/v4.2.0/Updates.md | 245 --------- docs/v4.2.0/anno.md | 494 ------------------ docs/v4.2.0/bench.md | 293 ----------- docs/v4.2.0/collapse.md | 163 ------ docs/v4.2.0/consistency.md | 170 ------ docs/v4.2.0/divide.md | 58 -- docs/v4.2.0/refine.md | 114 ---- docs/v4.2.0/segment.md | 18 - docs/v4.2.0/stratify.md | 58 -- docs/v4.2.0/vcf2df.md | 81 --- docs/v4.2.1/Citations.md | 30 -- docs/v4.2.1/Development.md | 90 ---- docs/v4.2.1/Home.md | 35 -- docs/v4.2.1/Installation.md | 56 -- docs/v4.2.1/MatchIds.md | 74 --- docs/v4.2.1/Multi-allelic-VCFs.md | 11 - docs/v4.2.1/Updates.md | 256 --------- docs/v4.2.1/anno.md | 494 ------------------ docs/v4.2.1/bench.md | 293 ----------- docs/v4.2.1/collapse.md | 163 ------ docs/v4.2.1/consistency.md | 179 ------- docs/v4.2.1/divide.md | 58 -- docs/v4.2.1/phab.md | 157 ------ docs/v4.2.1/refine.md | 140 ----- docs/v4.2.1/segment.md | 18 - docs/v4.2.1/stratify.md | 58 -- 
docs/v4.2.1/vcf2df.md | 81 --- docs/v4.2.2/Citations.md | 30 -- docs/v4.2.2/Development.md | 90 ---- docs/v4.2.2/Home.md | 35 -- docs/v4.2.2/Installation.md | 56 -- docs/v4.2.2/MatchIds.md | 74 --- docs/v4.2.2/Multi-allelic-VCFs.md | 11 - docs/v4.2.2/Updates.md | 273 ---------- docs/v4.2.2/bench.md | 293 ----------- docs/v4.2.2/collapse.md | 163 ------ docs/v4.2.2/divide.md | 58 -- docs/v4.2.2/phab.md | 157 ------ docs/v4.2.2/segment.md | 18 - docs/v4.2.2/stratify.md | 58 -- docs/v4.2.2/vcf2df.md | 81 --- docs/v4.3.0/Citations.md | 30 -- docs/v4.3.0/Development.md | 90 ---- docs/v4.3.0/Home.md | 35 -- docs/v4.3.0/MatchIds.md | 74 --- docs/v4.3.0/Multi-allelic-VCFs.md | 11 - docs/v4.3.0/Updates.md | 284 ---------- docs/v4.3.0/anno.md | 494 ------------------ docs/v4.3.0/bench.md | 295 ----------- docs/v4.3.0/collapse.md | 163 ------ docs/v4.3.0/consistency.md | 179 ------- docs/v4.3.0/divide.md | 58 -- docs/v4.3.0/phab.md | 157 ------ docs/v4.3.0/refine.md | 142 ----- docs/v4.3.0/segment.md | 18 - docs/v4.3.0/stratify.md | 58 -- docs/v4.3.0/vcf2df.md | 81 --- docs/v4.3.1/Citations.md | 30 -- docs/v4.3.1/Development.md | 90 ---- docs/v4.3.1/Installation.md | 72 --- docs/v4.3.1/MatchIds.md | 74 --- docs/v4.3.1/Multi-allelic-VCFs.md | 11 - docs/v4.3.1/anno.md | 494 ------------------ docs/v4.3.1/consistency.md | 179 ------- docs/v4.3.1/divide.md | 58 -- docs/v4.3.1/phab.md | 157 ------ docs/v4.3.1/refine.md | 142 ----- docs/v4.3.1/segment.md | 18 - docs/v4.3.1/stratify.md | 58 -- docs/v4.3.1/vcf2df.md | 81 --- docs/{v4.0.0 => }/vcf2df.md | 0 252 files changed, 62 insertions(+), 22691 deletions(-) rename docs/{v4.0.0 => }/Citations.md (100%) rename "docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" => "docs/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" (62%) rename docs/{v4.0.0 => }/Development.md (100%) rename docs/{v4.3.1 => }/Home.md (100%) rename docs/{v4.3.0 => }/Installation.md (100%) rename docs/{v4.2.0 => }/MatchIds.md 
(100%) rename docs/{v4.0.0 => }/Multi-allelic-VCFs.md (100%) rename docs/{v4.3.1 => }/Updates.md (92%) rename docs/{v4.2.2 => }/anno.md (98%) rename docs/{v4.3.1 => }/bench.md (72%) rename docs/{v4.3.1 => }/collapse.md (95%) rename docs/{v4.2.2 => }/consistency.md (95%) rename docs/{v3.5.0 => }/divide.md (100%) rename docs/{v4.3.1 => }/ga4gh.md (100%) rename docs/{v4.2.0 => }/phab.md (100%) rename docs/{v4.2.2 => }/refine.md (100%) rename docs/{v4.0.0 => }/segment.md (100%) rename docs/{v4.0.0 => }/stratify.md (100%) delete mode 100644 docs/v2.1.1/Comparing-two-SV-programs.md delete mode 100644 docs/v2.1.1/Creating-Annotations.md delete mode 100644 docs/v2.1.1/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v2.1.1/Home.md delete mode 100644 docs/v2.1.1/Multi-allelic-VCFs.md delete mode 100644 docs/v2.1.1/The--multimatch-parameter.md delete mode 100644 docs/v2.1.1/Updates.md delete mode 100644 docs/v2.1.1/anno.md delete mode 100644 docs/v2.1.1/bench.md delete mode 100644 docs/v2.1.1/collapse.md delete mode 100644 docs/v2.1.1/consistency.md delete mode 100644 docs/v2.1.1/stats.md delete mode 100644 docs/v2.1.1/stratifications.sh.md delete mode 100644 docs/v2.1.1/vcf2df.md delete mode 100644 docs/v3.0.0/Citations.md delete mode 100644 docs/v3.0.0/Comparing-two-SV-programs.md delete mode 100644 docs/v3.0.0/Development.md delete mode 100644 docs/v3.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v3.0.0/GIAB-stratifications.sh.md delete mode 100644 docs/v3.0.0/Home.md delete mode 100644 docs/v3.0.0/Multi-allelic-VCFs.md delete mode 100644 docs/v3.0.0/The--multimatch-parameter.md delete mode 100644 docs/v3.0.0/Updates.md delete mode 100644 docs/v3.0.0/anno.md delete mode 100644 docs/v3.0.0/bench.md delete mode 100644 docs/v3.0.0/collapse.md delete mode 100644 docs/v3.0.0/consistency.md delete mode 100644 docs/v3.0.0/vcf2df.md delete mode 100644 docs/v3.1.0/Citations.md delete mode 100644 
docs/v3.1.0/Comparing-two-SV-programs.md delete mode 100644 docs/v3.1.0/Development.md delete mode 100644 docs/v3.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v3.1.0/GIAB-stratifications.sh.md delete mode 100644 docs/v3.1.0/Home.md delete mode 100644 docs/v3.1.0/Installation.md delete mode 100644 docs/v3.1.0/MatchIds.md delete mode 100644 docs/v3.1.0/Multi-allelic-VCFs.md delete mode 100644 docs/v3.1.0/Resources.md delete mode 100644 docs/v3.1.0/The--multimatch-parameter.md delete mode 100644 docs/v3.1.0/Updates.md delete mode 100644 docs/v3.1.0/anno.md delete mode 100644 docs/v3.1.0/bench.md delete mode 100644 docs/v3.1.0/collapse.md delete mode 100644 docs/v3.1.0/consistency.md delete mode 100644 docs/v3.1.0/segment.md delete mode 100644 docs/v3.1.0/vcf2df.md delete mode 100644 docs/v3.2.0/Citations.md delete mode 100644 docs/v3.2.0/Comparing-two-SV-programs.md delete mode 100644 docs/v3.2.0/Development.md delete mode 100644 docs/v3.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v3.2.0/GIAB-stratifications.sh.md delete mode 100644 docs/v3.2.0/Home.md delete mode 100644 docs/v3.2.0/Installation.md delete mode 100644 docs/v3.2.0/MatchIds.md delete mode 100644 docs/v3.2.0/Multi-allelic-VCFs.md delete mode 100644 docs/v3.2.0/Resources.md delete mode 100644 docs/v3.2.0/The--multimatch-parameter.md delete mode 100644 docs/v3.2.0/Updates.md delete mode 100644 docs/v3.2.0/anno.md delete mode 100644 docs/v3.2.0/bench.md delete mode 100644 docs/v3.2.0/collapse.md delete mode 100644 docs/v3.2.0/consistency.md delete mode 100644 docs/v3.2.0/divide.md delete mode 100644 docs/v3.2.0/segment.md delete mode 100644 docs/v3.2.0/vcf2df.md delete mode 100644 docs/v3.3.0/Citations.md delete mode 100644 docs/v3.3.0/Comparing-two-SV-programs.md delete mode 100644 docs/v3.3.0/Development.md delete mode 100644 docs/v3.3.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v3.3.0/GIAB-stratifications.sh.md delete 
mode 100644 docs/v3.3.0/Home.md delete mode 100644 docs/v3.3.0/Installation.md delete mode 100644 docs/v3.3.0/MatchIds.md delete mode 100644 docs/v3.3.0/Multi-allelic-VCFs.md delete mode 100644 docs/v3.3.0/Resources.md delete mode 100644 docs/v3.3.0/The--multimatch-parameter.md delete mode 100644 docs/v3.3.0/Updates.md delete mode 100644 docs/v3.3.0/anno.md delete mode 100644 docs/v3.3.0/bench.md delete mode 100644 docs/v3.3.0/collapse.md delete mode 100644 docs/v3.3.0/consistency.md delete mode 100644 docs/v3.3.0/divide.md delete mode 100644 docs/v3.3.0/segment.md delete mode 100644 docs/v3.3.0/vcf2df.md delete mode 100644 docs/v3.4.0/Citations.md delete mode 100644 docs/v3.4.0/Comparing-two-SV-programs.md delete mode 100644 docs/v3.4.0/Development.md delete mode 100644 docs/v3.4.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v3.4.0/GIAB-stratifications.sh.md delete mode 100644 docs/v3.4.0/Home.md delete mode 100644 docs/v3.4.0/Installation.md delete mode 100644 docs/v3.4.0/MatchIds.md delete mode 100644 docs/v3.4.0/Multi-allelic-VCFs.md delete mode 100644 docs/v3.4.0/Resources.md delete mode 100644 docs/v3.4.0/The--multimatch-parameter.md delete mode 100644 docs/v3.4.0/Updates.md delete mode 100644 docs/v3.4.0/anno.md delete mode 100644 docs/v3.4.0/bench.md delete mode 100644 docs/v3.4.0/collapse.md delete mode 100644 docs/v3.4.0/consistency.md delete mode 100644 docs/v3.4.0/divide.md delete mode 100644 docs/v3.4.0/segment.md delete mode 100644 docs/v3.4.0/vcf2df.md delete mode 100644 docs/v3.5.0/Citations.md delete mode 100644 docs/v3.5.0/Comparing-two-SV-programs.md delete mode 100644 docs/v3.5.0/Development.md delete mode 100644 docs/v3.5.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v3.5.0/GIAB-stratifications.sh.md delete mode 100644 docs/v3.5.0/Home.md delete mode 100644 docs/v3.5.0/Installation.md delete mode 100644 docs/v3.5.0/MatchIds.md delete mode 100644 docs/v3.5.0/Multi-allelic-VCFs.md delete mode 
100644 docs/v3.5.0/Resources.md delete mode 100644 docs/v3.5.0/The--multimatch-parameter.md delete mode 100644 docs/v3.5.0/Updates.md delete mode 100644 docs/v3.5.0/anno.md delete mode 100644 docs/v3.5.0/bench.md delete mode 100644 docs/v3.5.0/collapse.md delete mode 100644 docs/v3.5.0/consistency.md delete mode 100644 docs/v3.5.0/phab.md delete mode 100644 docs/v3.5.0/segment.md delete mode 100644 docs/v3.5.0/vcf2df.md delete mode 100644 docs/v4.0.0/Comparing-two-SV-programs.md delete mode 100644 docs/v4.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v4.0.0/Home.md delete mode 100644 docs/v4.0.0/Installation.md delete mode 100644 docs/v4.0.0/MatchIds.md delete mode 100644 docs/v4.0.0/Resources.md delete mode 100644 docs/v4.0.0/Updates.md delete mode 100644 docs/v4.0.0/anno.md delete mode 100644 docs/v4.0.0/bench.md delete mode 100644 docs/v4.0.0/collapse.md delete mode 100644 docs/v4.0.0/consistency.md delete mode 100644 docs/v4.0.0/divide.md delete mode 100644 docs/v4.0.0/phab.md delete mode 100644 docs/v4.0.0/refine.md delete mode 100644 docs/v4.1.0/Citations.md delete mode 100644 docs/v4.1.0/Comparing-two-SV-programs.md delete mode 100644 docs/v4.1.0/Development.md delete mode 100644 docs/v4.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v4.1.0/Home.md delete mode 100644 docs/v4.1.0/Installation.md delete mode 100644 docs/v4.1.0/MatchIds.md delete mode 100644 docs/v4.1.0/Multi-allelic-VCFs.md delete mode 100644 docs/v4.1.0/Updates.md delete mode 100644 docs/v4.1.0/anno.md delete mode 100644 docs/v4.1.0/bench.md delete mode 100644 docs/v4.1.0/collapse.md delete mode 100644 docs/v4.1.0/consistency.md delete mode 100644 docs/v4.1.0/divide.md delete mode 100644 docs/v4.1.0/phab.md delete mode 100644 docs/v4.1.0/refine.md delete mode 100644 docs/v4.1.0/segment.md delete mode 100644 docs/v4.1.0/stratify.md delete mode 100644 docs/v4.1.0/vcf2df.md delete mode 100644 docs/v4.2.0/Citations.md delete mode 100644 
docs/v4.2.0/Comparing-two-SV-programs.md delete mode 100644 docs/v4.2.0/Development.md delete mode 100644 docs/v4.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md delete mode 100644 docs/v4.2.0/Home.md delete mode 100644 docs/v4.2.0/Installation.md delete mode 100644 docs/v4.2.0/Multi-allelic-VCFs.md delete mode 100644 docs/v4.2.0/Updates.md delete mode 100644 docs/v4.2.0/anno.md delete mode 100644 docs/v4.2.0/bench.md delete mode 100644 docs/v4.2.0/collapse.md delete mode 100644 docs/v4.2.0/consistency.md delete mode 100644 docs/v4.2.0/divide.md delete mode 100644 docs/v4.2.0/refine.md delete mode 100644 docs/v4.2.0/segment.md delete mode 100644 docs/v4.2.0/stratify.md delete mode 100644 docs/v4.2.0/vcf2df.md delete mode 100644 docs/v4.2.1/Citations.md delete mode 100644 docs/v4.2.1/Development.md delete mode 100644 docs/v4.2.1/Home.md delete mode 100644 docs/v4.2.1/Installation.md delete mode 100644 docs/v4.2.1/MatchIds.md delete mode 100644 docs/v4.2.1/Multi-allelic-VCFs.md delete mode 100644 docs/v4.2.1/Updates.md delete mode 100644 docs/v4.2.1/anno.md delete mode 100644 docs/v4.2.1/bench.md delete mode 100644 docs/v4.2.1/collapse.md delete mode 100644 docs/v4.2.1/consistency.md delete mode 100644 docs/v4.2.1/divide.md delete mode 100644 docs/v4.2.1/phab.md delete mode 100644 docs/v4.2.1/refine.md delete mode 100644 docs/v4.2.1/segment.md delete mode 100644 docs/v4.2.1/stratify.md delete mode 100644 docs/v4.2.1/vcf2df.md delete mode 100644 docs/v4.2.2/Citations.md delete mode 100644 docs/v4.2.2/Development.md delete mode 100644 docs/v4.2.2/Home.md delete mode 100644 docs/v4.2.2/Installation.md delete mode 100644 docs/v4.2.2/MatchIds.md delete mode 100644 docs/v4.2.2/Multi-allelic-VCFs.md delete mode 100644 docs/v4.2.2/Updates.md delete mode 100644 docs/v4.2.2/bench.md delete mode 100644 docs/v4.2.2/collapse.md delete mode 100644 docs/v4.2.2/divide.md delete mode 100644 docs/v4.2.2/phab.md delete mode 100644 docs/v4.2.2/segment.md delete mode 100644 
docs/v4.2.2/stratify.md delete mode 100644 docs/v4.2.2/vcf2df.md delete mode 100644 docs/v4.3.0/Citations.md delete mode 100644 docs/v4.3.0/Development.md delete mode 100644 docs/v4.3.0/Home.md delete mode 100644 docs/v4.3.0/MatchIds.md delete mode 100644 docs/v4.3.0/Multi-allelic-VCFs.md delete mode 100644 docs/v4.3.0/Updates.md delete mode 100644 docs/v4.3.0/anno.md delete mode 100644 docs/v4.3.0/bench.md delete mode 100644 docs/v4.3.0/collapse.md delete mode 100644 docs/v4.3.0/consistency.md delete mode 100644 docs/v4.3.0/divide.md delete mode 100644 docs/v4.3.0/phab.md delete mode 100644 docs/v4.3.0/refine.md delete mode 100644 docs/v4.3.0/segment.md delete mode 100644 docs/v4.3.0/stratify.md delete mode 100644 docs/v4.3.0/vcf2df.md delete mode 100644 docs/v4.3.1/Citations.md delete mode 100644 docs/v4.3.1/Development.md delete mode 100644 docs/v4.3.1/Installation.md delete mode 100644 docs/v4.3.1/MatchIds.md delete mode 100644 docs/v4.3.1/Multi-allelic-VCFs.md delete mode 100644 docs/v4.3.1/anno.md delete mode 100644 docs/v4.3.1/consistency.md delete mode 100644 docs/v4.3.1/divide.md delete mode 100644 docs/v4.3.1/phab.md delete mode 100644 docs/v4.3.1/refine.md delete mode 100644 docs/v4.3.1/segment.md delete mode 100644 docs/v4.3.1/stratify.md delete mode 100644 docs/v4.3.1/vcf2df.md rename docs/{v4.0.0 => }/vcf2df.md (100%) diff --git a/docs/v4.0.0/Citations.md b/docs/Citations.md similarity index 100% rename from docs/v4.0.0/Citations.md rename to docs/Citations.md diff --git "a/docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" "b/docs/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" similarity index 62% rename from "docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" rename to "docs/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" index 1ea24210..04c2c36c 100644 --- "a/docs/v4.3.1/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" +++ 
"b/docs/Collapse-on-Regions-with-a-High\342\200\220Density-of-SVs.md" @@ -1,4 +1,4 @@ -Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made. +Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference which can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made. If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run: @@ -6,9 +6,9 @@ If you find `truvari collapse` 'freezing' during processing where it is no longe truvari anno chunks input.vcf.gz > counts.bed ``` -The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 3rd column has the number of SVs inside the chunk while the 4th and 5th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 4th is after separating by svtype and svlen, the 5th is re-chunking the svtype/svlen separation by distance. +The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. 
The 4th column has the number of SVs inside the chunk while the 5th and 6th have comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 5th is after separating by svtype and svlen, the 6th is re-chunking the svtype/svlen separation by distance. -If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset the regions of interest and then use `bedtools` to create a bed file that will skip them. +If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset to the regions of interest and then use `bedtools` to create a bed file that will skip them. ``` # exclude regions with more than 30k SVs. You should test this threshold. @@ -18,6 +18,6 @@ bedtools complement -g genome.bed -i to_exclude.bed > to_analyze.bed truvari collapse --bed to_analyze.bed -i input.vcf.gz ``` -When considering what threshold you would like to use, just looking at the 3rd column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if there are 100k SVs in a window, but they all could collapse, it would only take O(N - 1) comparisons to perform the collapse. +When considering what threshold you would like to use, just looking at the 4th column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There is also no guarantee that a region with high SV density will be slow. For example, if all SVs in a chunk could collapse, it would only take O(N - 1) comparisons to complete the collapse.
The problems arise when few of the SVs are redundant, and the collapse therefore requires O(N**2) comparisons. -A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded an a collapsing could be repeated until success. \ No newline at end of file +A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded and collapsing repeated until success. \ No newline at end of file diff --git a/docs/v4.0.0/Development.md b/docs/Development.md similarity index 100% rename from docs/v4.0.0/Development.md rename to docs/Development.md diff --git a/docs/v4.3.1/Home.md b/docs/Home.md similarity index 100% rename from docs/v4.3.1/Home.md rename to docs/Home.md diff --git a/docs/v4.3.0/Installation.md b/docs/Installation.md similarity index 100% rename from docs/v4.3.0/Installation.md rename to docs/Installation.md diff --git a/docs/v4.2.0/MatchIds.md b/docs/MatchIds.md similarity index 100% rename from docs/v4.2.0/MatchIds.md rename to docs/MatchIds.md diff --git a/docs/v4.0.0/Multi-allelic-VCFs.md b/docs/Multi-allelic-VCFs.md similarity index 100% rename from docs/v4.0.0/Multi-allelic-VCFs.md rename to docs/Multi-allelic-VCFs.md diff --git a/docs/v4.3.1/Updates.md b/docs/Updates.md similarity index 92% rename from docs/v4.3.1/Updates.md rename to docs/Updates.md index 1cb41cfb..c812eeab 100644 --- a/docs/v4.3.1/Updates.md +++ b/docs/Updates.md @@ -1,3 +1,18 @@ +# Truvari 5.0 +*in progress* + +* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also
checking the lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants) +* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216) +* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure. +* Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>` +* `--sizemax` now defaults to `-1`, meaning all variants ≥ `--sizemin / --sizefilt` are compared +* Redundant variants which are collapsed into kept variants (a.k.a. 'removed' variants) are now more clearly labeled (`--removed-output` instead of `--collapsed-output`) +* Fixed 'Unknown error' caused by unset TMPDIR ([#229](https://github.com/ACEnglish/truvari/issues/229) and [#245](https://github.com/ACEnglish/truvari/issues/245)) +* Fixes to minor record-keeping bugs in refine/ga4gh better ensure all variants are counted/preserved +* BND variants are now compared by bench ([details](https://github.com/ACEnglish/truvari/wiki/bench#bnd-comparison)) +* Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None` +* Major refactor of Truvari package API for easy reuse of SV comparison functions ([details](https://truvari.readthedocs.io/en/latest/)) + # Truvari 4.3.1 *September 9, 2024* * `bench` diff --git a/docs/v4.2.2/anno.md b/docs/anno.md similarity index 98% rename from docs/v4.2.2/anno.md rename to docs/anno.md index 55835902..d127d813 100644 --- a/docs/v4.2.2/anno.md +++ b/docs/anno.md @@ -74,7 +74,7 @@ Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants ### Details Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions.
The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via: ```bash -zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz +zcat adotto_TRregions_v1.2.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz tabix anno.trf.bed.gz ``` For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)). @@ -116,6 +116,12 @@ options: --debug Verbose logging ``` +### Notes about the Adotto catalog +If using an older version of Truvari (≤4.3.1) or adotto catalog v1.2, you may have incompatibility issues due to the 'repeat' field being named 'motif'. This can be fixed by running: +``` +zcat adotto_TRregions_v1.2.bed.gz | cut -f1-3,18 | sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' | bgzip > anno.trf.bed.gz +``` + # truvari anno grm For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. diff --git a/docs/v4.3.1/bench.md b/docs/bench.md similarity index 72% rename from docs/v4.3.1/bench.md rename to docs/bench.md index 145bf974..e8335955 100644 --- a/docs/v4.3.1/bench.md +++ b/docs/bench.md @@ -160,45 +160,35 @@ See [[refine wiki|refine]] for details. Comparing Sequences of Variants =============================== -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'.
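For illustration, the deletion copy-number formula above can be sketched in Python (a hypothetical example; `anno["period"]` is written as a plain `period` argument, and the input values are made up):

```python
def copies_removed(ovl_pct, svlen, period):
    """Deletion copy-number change: round(-(ovl_pct * svlen) / period, 1).

    SVLEN is negative for deletions (per the VCF convention), so negating
    the product yields a positive count of motif copies removed.
    """
    return round(-(ovl_pct * svlen) / period, 1)

# A 60bp deletion fully contained (ovl_pct=1.0) in a motif with a 20bp
# period removes 3.0 copies of the motif
print(copies_removed(1.0, -60, 20))  # → 3.0
```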
Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. +Two SV insertions could theoretically be at the exact same position and exact same size, but have completely different sequences. Truvari's high specificity when matching variants is due in large part to comparing sequences. Truvari uses edlib to estimate the sequence similarity of two alleles. In addition to comparing the direct (a.k.a. unmanipulated) similarity of sequences, truvari also checks rotations of the sequences. The first rotation is the lexicographically minimum rotation, and the second is the 'unrolled' sequence, which is formally described in [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). These 'rolling' operations can be turned off by using `--no-roll`. -## Unroll -The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). +The main idea of rotating sequences is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. For example, we only need to unroll an insertion by the distance between it and another variant: the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end.
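The two rotations described above can be sketched as follows (a simplified illustration using exact string equality; Truvari actually scores the rotated sequences' similarity with edlib, and these helper names are not Truvari's API):

```python
def unroll(alt_seq, dist):
    """Move an INS downstream by dist bases: pop bases off the start
    of the (circular) ALT sequence and append them to the end."""
    dist %= len(alt_seq)
    return alt_seq[dist:] + alt_seq[:dist]

def min_rotation(seq):
    """Lexicographically minimum rotation (naive O(n^2) version)."""
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

# The INS 'ab' at POS 2 matches the INS 'ba' at POS 1 after
# unrolling 2-1 = 1 base
assert unroll("ab", 2 - 1) == "ba"
# Both sequences also share the same minimum rotation
assert min_rotation("ab") == min_rotation("ba") == "ab"
```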
Finally, to increase matching sensitivity, the lexicographically minimum rotation is also compared, as we've found many instances of smaller (<100bp) insertions which should match yet fail direct and unrolled sequence similarity, but are captured by this third rotation. -The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. +As of truvari v5.0, the unroll method is the only supported sequence comparison technique. -This unroll method has a number of benefits and a couple of considerations, including: -* not needing a `--reference` for comparison, which saves I/O time -* increasing the number of correctly matching SVs -* decreasing the number of 'suspect' matches in smaller size regimes -* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity +## Symbolic alleles -## Reference context -For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: +As of truvari v5.0, some symbolic alts can be compared with sequence similarity turned on. `<DEL>` and `<INV>` sequences will be filled in when a `--reference` is provided if the variant's size is smaller than `--max-resolve` (default 25kbp).
Additionally, `<DUP>` variants can have their sequences filled in if `--dup-to-ins` is also provided. The duplications are assumed to be perfect copies of the reference range spanned by the SV and are placed as a point insertion at the start position. -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` +Variants that are symbolic alts (even after attempting to resolve) can be compared with sequence resolved SVs, so there is no need to specify `--pctseq 0`. Note, however, that the `PctSeqSimilarity` field will not be populated. -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the the sequence change introduced by the variants. In python, this line -looks like: +Symbolic Alt sub-types are ignored, e.g. `<DUP:TANDEM>` is considered just `<DUP>`. -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` +Truvari can replace the symbolic alt of resolved SVs in the output VCF with the parameter `--write-resolved`. + +BND Comparison +============== +Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist` which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`). + +The baseline and comparison BNDs' POS and their joined position must both be within `--bnddist` to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example `t[p[` (piece extending to the right of p is joined after t) only matches with `t[p[` and won't match to `[p[t` (reverse comp piece extending right of p is joined before t). -Where `a1_seq1` is the longer of the REF or ALT allele.
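The buffered-boundary test can be sketched as follows. This is an illustrative example, not Truvari's API; the function names are invented:

```python
def overlaps(start1, end1, start2, end2):
    """True when two half-open intervals intersect."""
    return start1 < end2 and start2 < end1

def bnd_candidate(pos_base, pos_comp, dist):
    """Buffer each breakpoint by `dist` basepairs and test whether the
    buffered ranges intersect, mirroring the --bnddist position check."""
    return overlaps(pos_base - dist, pos_base + dist,
                    pos_comp - dist, pos_comp + dist)

assert bnd_candidate(1000, 1080, dist=100)      # breakpoints 80bp apart
assert not bnd_candidate(1000, 1500, dist=100)  # too far apart to be a candidate
```

Keep in mind that in the real comparison this check must pass for both the POS and the joined position, and the BND direction/strand must also agree.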
+BNDs are annotated in the truvari output with fields: StartDistance (baseline minus comparison POS); EndDistance (baseline minus comparison join position); and TruScore, which converts the fraction of the allowed distance consumed into a match score (`(1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100`). For example, two BNDs 20bp apart with a bnddist of 100 make a score of 90. -## SVs without sequences +Another complication for matching BNDs is that they may represent an event which could be 'resolved' in another VCF. For example, a tandem duplication between `start-end` could be represented as two BNDs of `start to N[{chrom}:{end}[` and `end to ]{chrom}:{start}]N`. Therefore, truvari also attempts to compare symbolic alt SVs (ALT = `<DEL>`, `<DUP>`, `<INV>`) to a BND by decomposing the symbolic alt into its breakpoints. These decomposed BNDs are then each checked against a comparison BND and the highest TruScore match is kept. -If the base or comp vcfs do not have sequence resolved calls (e.g. ``, simply set `--pctseq=0` to turn off -sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If -`--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared. +Note that DUPs are always decomposed to DUP:TANDEM breakpoints. Also, with `--pick single`, a decomposed SV will only match to one BND, so `--pick multi` is recommended to ensure all BNDs will match to a single decomposed SV. + +BND comparison can be turned off by setting `--bnddist -1`. Symbolic ALT decomposition can be turned off with `--no-decompose`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored. Controlling the number of matches ================================= @@ -283,6 +273,9 @@ See this [discussion](https://github.com/ACEnglish/truvari/discussions/99)for de Methodology =========== Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
+ +![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png) + ``` * zip the Base and Comp calls together in sorted order * create chunks of all calls overlapping within ±`--chunksize` basepairs @@ -291,5 +284,4 @@ Here is a high-level pseudocode description of the steps Truvari bench conducts * if the chunk has no Base or Comp calls ** return them all as FNs/FPs * use `--pick` method to sort and annotate variants with their best match -``` -![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file +``` \ No newline at end of file diff --git a/docs/v4.3.1/collapse.md b/docs/collapse.md similarity index 95% rename from docs/v4.3.1/collapse.md rename to docs/collapse.md index 7ac22adb..1b0171fe 100644 --- a/docs/v4.3.1/collapse.md +++ b/docs/collapse.md @@ -1,4 +1,6 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. +`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/ACEnglish/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. + +Since `collapse` uses the same variant comparison engine as `bench`, there is a lot of overlap in the behavior. Be sure to also be familiar with the [bench documentation](https://github.com/ACEnglish/truvari/wiki/bench). 
Example ======= diff --git a/docs/v4.2.2/consistency.md b/docs/consistency.md similarity index 95% rename from docs/v4.2.2/consistency.md rename to docs/consistency.md index 17ba23b7..2f2f0ea0 100644 --- a/docs/v4.2.2/consistency.md +++ b/docs/consistency.md @@ -176,4 +176,9 @@ Below is a consistency report in json format. } ] } -``` \ No newline at end of file +``` + +Output TSV +========== + +When given a `--output`, a TSV is written which holds the variant key used for comparison (first 5 columns of the VCF entry), the FLAG corresponding to the 'group' in the main output (e.g. FLAG 3 == group 0011), and a COUNT of how many VCFs the key was found in. \ No newline at end of file diff --git a/docs/v3.5.0/divide.md b/docs/divide.md similarity index 100% rename from docs/v3.5.0/divide.md rename to docs/divide.md diff --git a/docs/freeze_wiki.sh b/docs/freeze_wiki.sh index 71b07a7a..081a5aa6 100644 --- a/docs/freeze_wiki.sh +++ b/docs/freeze_wiki.sh @@ -1,16 +1,8 @@ -# Pull the current version of the wiki as a VERSION of the docs -VERSION=$1 - -set -e - -if [ -e $1 ] -then - echo 'Must provide version argument' - exit -fi +# Pull the current version of the wiki git clone https://github.com/spiralgenetics/truvari.wiki.git rm -rf truvari.wiki/.git -mv truvari.wiki $VERSION +mv truvari.wiki/* ./ +rm -rf truvari.wiki echo 'Review, git add, and git push' diff --git a/docs/v4.3.1/ga4gh.md b/docs/ga4gh.md similarity index 100% rename from docs/v4.3.1/ga4gh.md rename to docs/ga4gh.md diff --git a/docs/v4.2.0/phab.md b/docs/phab.md similarity index 100% rename from docs/v4.2.0/phab.md rename to docs/phab.md diff --git a/docs/v4.2.2/refine.md b/docs/refine.md similarity index 100% rename from docs/v4.2.2/refine.md rename to docs/refine.md diff --git a/docs/v4.0.0/segment.md b/docs/segment.md similarity index 100% rename from docs/v4.0.0/segment.md rename to docs/segment.md diff --git a/docs/v4.0.0/stratify.md b/docs/stratify.md similarity index 100% rename from 
docs/v4.0.0/stratify.md rename to docs/stratify.md diff --git a/docs/v2.1.1/Comparing-two-SV-programs.md b/docs/v2.1.1/Comparing-two-SV-programs.md deleted file mode 100644 index 7d6025b2..00000000 --- a/docs/v2.1.1/Comparing-two-SV-programs.md +++ /dev/null @@ -1,63 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs aginst base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs. ----------------------------------------------- - -```bash -./truvari.py -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ -./truvari.py -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ -``` - -Combine the TPs within each report. ------------------------------------ - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program. ----------------------------------------- - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference. ------------------------------ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Calculate the overlap. 
----------------------- - -One may wish for summary numbers of how many calls are shared/unique between the two programs. -Truvari has a program to help. See https://github.com/spiralgenetics/truvari/wiki/Consistency-report for details. - -Shared FPs between the programs -------------------------------- - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -./truvari.py -b cmp_A/fp.vcf -c cmp_B/fp.vcf.gz -o shared_fps -``` \ No newline at end of file diff --git a/docs/v2.1.1/Creating-Annotations.md b/docs/v2.1.1/Creating-Annotations.md deleted file mode 100644 index a0fee941..00000000 --- a/docs/v2.1.1/Creating-Annotations.md +++ /dev/null @@ -1,32 +0,0 @@ -## Bed-like reference annotations -Making a reference range annotation - -1) Download the track for your reference. UCSC Genome Browser is pretty good -2) A truvari annotation track has the following columns: -- CHROM : chromosome position of the reference range -- START : start position of the reference range -- END : end position of the reference range -- ANNOS* : One or more annotation columns on each line. These are the values that will be added to the VCF INFO fields - -These should be zero-based half-open intervals - -3) Add the header information -For each ANNOS*, create a header line. These headerlines are the exact same as VCF INFO header lines. See -https://samtools.github.io/hts-specs/VCFv4.2.pdf for details. -These header lines should be in the same order as the columns in each row in the annotation file - -Example -``` -##INFO= -##INFO= -1 10000 10468 6 77.2 -1 10627 10800 29 6 -``` - -So, any entry that intersects with 1:1000-10468 will have the following added to their INFO: - -``` -SREP_periods=6;SREP_copies=77.2 -``` - -4) If some columns are not populated on certain lines, use '.' 
for Null \ No newline at end of file diff --git a/docs/v2.1.1/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v2.1.1/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index cadb3770..00000000 --- a/docs/v2.1.1/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,48 +0,0 @@ -By default, Truvari uses the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) to compute the `--pctsim` between two variants. This measure is different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences.: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign and Levenshtein to compute the `--pctsim` between the two sequences. - - -```python -import swalign -import Levenshtein - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom - -print 'swalign', ident -# swalign 0.966666666667 -print 'levedit', Levenshtein.ratio(seq1, seq2) -# levedit 0.892307692308 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each base as a mismatch, we would se the similarity drop to ~83%. 
- -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. Therefore, Levenshtein is preferred when running Truvari. \ No newline at end of file diff --git a/docs/v2.1.1/Home.md b/docs/v2.1.1/Home.md deleted file mode 100644 index aa6904bb..00000000 --- a/docs/v2.1.1/Home.md +++ /dev/null @@ -1,28 +0,0 @@ -(Work in progress.. some links may not work.. Some details may be missing) - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. To make matters worse, many tools don't require input VCFs to strictly adhere to the format. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description Truvari bench methodology is linked below. 
- -# Index: - -- [Updates](https://github.com/spiralgenetics/truvari/wiki/Updates) -- Truvari Commands: - - [bench](https://github.com/spiralgenetics/truvari/wiki/bench) - - [The multimatch parameter](https://github.com/spiralgenetics/truvari/wiki/The--multimatch-parameter) - - [Edit Distance Ratio vs Sequence Similarity](https://github.com/spiralgenetics/truvari/wiki/Edit-Distance-Ratio-vs-Sequence-Similarity) - - [Multi-allelic VCFs](https://github.com/spiralgenetics/truvari/wiki/Multi-allelic-VCFs) - - [stratifications.sh](https://github.com/spiralgenetics/truvari/wiki/stratifications.sh) - - [Comparing two SV programs](https://github.com/spiralgenetics/truvari/wiki/Comparing-two-SV-programs) - - [consistency](https://github.com/spiralgenetics/truvari/wiki/consistency) - - [anno](https://github.com/spiralgenetics/truvari/wiki/anno) - - [Creating an annotation source](https://github.com/spiralgenetics/truvari/wiki/making-annotations) - - [collapse](https://github.com/spiralgenetics/truvari/wiki/collapse) - - [vcf2df](https://github.com/spiralgenetics/truvari/wiki/vcf2df) \ No newline at end of file diff --git a/docs/v2.1.1/Multi-allelic-VCFs.md b/docs/v2.1.1/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v2.1.1/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only runs on comparing the first alternate allele in the base and comparison calls. If VCFs containing multi-allelic sites: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... 
-``` - -Pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v2.1.1/The--multimatch-parameter.md b/docs/v2.1.1/The--multimatch-parameter.md deleted file mode 100644 index 4662cbcc..00000000 --- a/docs/v2.1.1/The--multimatch-parameter.md +++ /dev/null @@ -1,5 +0,0 @@ -By default, Truvari will only allow a one-to-one match between a base call and comparison call. The motivation for this is be more strict when trying identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, where the rest will be FPs. - -However, Truvari's current method with this default behavior does create a race-condition (see [Issue #7](https://github.com/spiralgenetics/truvari/issues/7)). The race happens because Truvari compares per BASE event and unites to its best COMP event. Looser parameters could allow an upstream BASE call match a COMP call with a lower 'score' than what another downstream BASE neighbor would - therefore matching variants that aren't necessarily the best global match. This may allow many combinations of BASE/COMP calls floating between TP and FN/FP. - -With the `--multimatch` parameter, we can get around this by allowing all of the redundant matches of base and comparison calls. This should have the effect of increasing the reported recall and sensitivity. The final counts of TP-base and TP-call in the `summary.txt` could also have different numbers. Furthermore, when inspecting TPs, some of the comparison calls may not be the most accurate representation of the allele; we have attempted to ensure that strict Truvari parameters produce a subset of the looser parameters over the same base/comparison call sets. 
\ No newline at end of file diff --git a/docs/v2.1.1/Updates.md b/docs/v2.1.1/Updates.md deleted file mode 100644 index bc71ce97..00000000 --- a/docs/v2.1.1/Updates.md +++ /dev/null @@ -1,24 +0,0 @@ -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the o deement of Python 2.7 https://pythonclock.org/ -We're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. 
\ No newline at end of file diff --git a/docs/v2.1.1/anno.md b/docs/v2.1.1/anno.md deleted file mode 100644 index a434c279..00000000 --- a/docs/v2.1.1/anno.md +++ /dev/null @@ -1,231 +0,0 @@ - -Truvari annotations: - gcpct, gtcnt, trf, grm, repmask, remap, hompct, numneigh - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: truvari anno gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: truvari anno gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -This requires [trf executable](http://tandem.bu.edu/trf/trf.download.html), which can either be in your environment `$PATH` or can be directly pointed to through the command line parameters. Every entry will have the deleted or inserted sequence run through TRF, and the returned fields will be added to the entry's INFO fields. - -Use the `--ref-bed` with a reference based truvari annotation track (`resources/*simpleRepeat.truvari.txt.gz`). 
Any intersection to a known tandem repeat in the reference will also be added to the entry's INFO field. - -An additional INFO field `TRF_diffs` is populated where any tandem repeat motifs shared between the reference annotation and the TRF annotation over the entry's sequence are the same, we report the Alternate allele copy number difference from its reference SREP_repeat copy number. These numbers are reported 1-to-1 in the same order of `TRF_repeats`. - -``` -usage: truvari anno trf [-h] [-i INPUT] [-o OUTPUT] [-e EXECUTABLE] [-f] [-m MIN_LENGTH] - [-b BESTN] [-t TRF_PARAMS] [-R REF_BED] [-r REF] [-l] [--debug] - - Wrapper around TRF to annotate a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -f, --full Write full trf output to entries - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -b BESTN, --bestn BESTN - Top trf hits to report (5) - -t TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to tr (2 7 7 80 10 50 500 - -m -f -h -d -ngs) - -R REF_BED, --ref-bed REF_BED - Reference bed of tandem repeat regions - -r REF, --ref REF Reference fasta file (use with -R) - -l, --layered (non-functional) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both reference (r), alternate (a) we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". 
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : compliment strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min) - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q", is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -t THREADS, --threads THREADS - Number of threads (4) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -e hmmer -species human -lcambig -nocut -div 50 - -no_id -s 
{fasta}) - -T THREADS, --threads THREADS - Number of threads to use (4) - --debug Verbose logging -``` - -# truvari anno remap -Note that 'novel' in the context of Deletions means that the deleted sequence has no *other* hits in reference besides the source location. -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] - [-t THRESHOLD] [-d DIST] [--debug] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -novel - Allele has no hits in reference -tandem - Allele's closest hit is within len(allele) bp of the SV's position -interspersed - Allele's closest hit is not tandem - -Which alleles and alignments to consider can be altered with: ---minlength - minimum SV length to considred (50) ---dist - For deletion SVs, do not consider alignments that hit within Nbp of the SV's position - (a.k.a. alignments back to the source sequence) (10) ---threshold - Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's - position to be considered (10)) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [--debug] - -Calcluate the the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs 
up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] -i INPUT [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls over within size boundary -are in the neighborhood - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v2.1.1/bench.md b/docs/v2.1.1/bench.md deleted file mode 100644 index ecc2b416..00000000 --- a/docs/v2.1.1/bench.md +++ /dev/null @@ -1,194 +0,0 @@ - -Quick start -=========== - - $ truvari -b base_calls.vcf -c compare_calls.vcf -o output_dir/ - -Outputs -======= - - * tp-call.vcf -- annotated true positive calls from the COMP - * tp-base.vcf -- anotated true positive calls form the BASE - * fn.vcf -- false negative calls from BASE - * fp.vcf -- false positive calls from COMP - * base-filter.vcf -- size filtered calls from BASE - * call-filter.vcf -- size filtered calls from COMP - * summary.txt -- json output of performance stats - * log.txt -- run log - * giab_report.txt -- (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below. - -summary.txt -=========== - -The following stats are generated for benchmarking your call set. - - - - - - - - - - - - - - - - - - - - - -
MetricDefinition
TP-baseNumber of matching calls from the base vcf
TP-callNumber of matching calls from the comp vcf
FPNumber of non-matching calls from the comp vcf
FNNumber of non-matching calls from the base vcf
precisionTP-call / (TP-call + FP)
recallTP-base / (TP-base + FN)
f12 * ((recall * precision) / (recall + precision))
base cntNumber of calls in the base vcf
call cntNumber of calls in the comp vcf
base size filteredNumber of base vcf calls outside of (sizemin, sizemax)
call size filteredNumber of comp vcf calls outside of (sizemin, sizemax)
base gt filteredNumber of base calls not passing the no-ref parameter filter
call gt filteredNumber of comp calls not passing the no-ref parameter filter
TP-call_TP-gtTP-call entries with a genotype match
TP-call_FP-gtTP-call entries without a genotype match
TP-base_TP-gtTP-base entries with a genotype match
TP-base_FP-gtTP-base entries without a genotype match
gt_precisionTP-call_TP-gt / (TP-call_TP-gt + FP + TP-call_FP-gt)
gt_recallTP-base_TP-gt / (TP-base_TP-gt + FN)
gt_f12 * ((gt_recall * gt_precision) / (gt_recall + gt_precision))
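The precision, recall, and f1 definitions above are plain arithmetic over the four counts; a minimal sketch (the function name and example counts are invented for illustration, not produced by the tool):

```python
def bench_stats(tp_base, tp_call, fp, fn):
    """Compute summary.txt's precision/recall/f1 from the raw counts."""
    precision = tp_call / (tp_call + fp)
    recall = tp_base / (tp_base + fn)
    f1 = 2 * ((recall * precision) / (recall + precision))
    return precision, recall, f1

# Hypothetical run: 3908/3913 matched base/comp calls, 432 FPs, 693 FNs
precision, recall, f1 = bench_stats(3908, 3913, 432, 693)
```

Note that TP-base and TP-call can differ, which is why precision and recall use different numerators.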
- -Methodology -=========== - -``` -Input: - BaseCall - Benchmark TruthSet of SVs - CompCalls - Comparison SVs from another program -Build IntervalTree of CompCalls -For each BaseCall: - Fetch CompCalls overlapping within *refdist*. - If typematch and LevDistRatio >= *pctsim* \ - and SizeRatio >= *pctsize* and PctRecOvl >= *pctovl*: - Add CompCall to list of Neighbors - Sort list of Neighbors by TruScore ((2*sim + 1*size + 1*ovl) / 3.0) - Take CompCall with highest TruScore and BaseCall as TPs - Only use a CompCall once if not --multimatch - If no neighbors: BaseCall is FN -For each CompCall: - If not used: mark as FP -``` - -Matching Parameters --------------------- - - - - - - - - - - -
ParameterDefaultDefinition
refdist500Maximum distance a comparison call can be from the base call's start/end
pctsim0.7Levenshtein distance ratio between the REF/ALT haplotype sequences of base and comparison call. -See "Comparing Haplotype Sequences of Variants" below. -
pctsize0.7Ratio of min(base_size, comp_size)/max(base_size, comp_size)
pctovl0.0Ratio of two calls' (overlapping bases)/(longest span)
typeignoreFalseTypes don't need to match to compare calls.
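As arithmetic, the pctsize and pctovl thresholds and the TruScore ranking above reduce to a few lines; the helper names here are invented for illustration and are not Truvari's actual code:

```python
def pct_size(base_size, comp_size):
    # Ratio of min(base_size, comp_size)/max(base_size, comp_size)
    return min(base_size, comp_size) / max(base_size, comp_size)

def pct_ovl(base_start, base_end, comp_start, comp_end):
    # (overlapping bases) / (longest span)
    overlap = max(0, min(base_end, comp_end) - max(base_start, comp_start))
    return overlap / max(base_end - base_start, comp_end - comp_start)

def tru_score(sim, size, ovl):
    # Ranking used to pick the best neighbor: ((2*sim + 1*size + 1*ovl) / 3.0)
    return (2 * sim + 1 * size + 1 * ovl) / 3.0

# A 49bp call against a 50bp call passes the default pctsize of 0.7
print(pct_size(49, 50))
```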
- -Comparing VCFs without sequence resolved calls ----------------------------------------------- - -If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off -sequence comparison. - -Difference between --sizemin and --sizefilt -------------------------------------------- - -`--sizemin` is the minimum size of a base call or comparison call to be considered. - -`--sizefilt` is the minimum size of a call to be added into the IntervalTree for searching. It should -be less than `sizemin` for edge case variants. - -For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp call at -the same position. - -These two calls should be considered matching. If we instead removed calls less than `sizemin`, we'd -incorrectly classify the 50bp base call as a false negative. - -This does have the side effect of artificially inflating specificity. If that same 49bp call in the -above were below the similarity threshold, it would not be classified as a FP due to the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Definition of annotations added to TP vcfs --------------------------------------------- - - - - - - - - - - - -
Anno Definition
TruScore Truvari score for similarity of match. `((2*sim + 1*size + 1*ovl) / 3.0)`
PctSeqSimilarity Pct sequence similarity between this variant and its closest match
PctSizeSimilarity Pct size similarity between this variant and its closest match
PctRecOverlap Percent reciprocal overlap of the two calls' coordinates
StartDistance Distance of this call's start from matching call's start
EndDistance Distance of this call's end from matching call's end
SizeDiff Difference between size(basecall) and size(compcall)
NumNeighbors Number of comparison calls that were in the neighborhood (REFDIST) of the base call
NumThresholdNeighbors Number of comparison calls that passed threshold matching of the base call
- -NumNeighbors and NumThresholdNeighbors are also added to the FN vcf. - -Using the GIAB Report --------------------- - -When running against the GIAB SV benchmark (link below), you can create a detailed report of -calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. - -To create this report: - -1. Run truvari with the flag `--giabreport`. -2. In your output directory, you will find a file named `giab_report.txt`. -3. Next, make a copy of the -[Truvari Report Template Google Sheet](https://docs.google.com/spreadsheets/d/1T3EdpyLO1Kq-bJ8SDatqJ5nP_wwFKCrH0qhxorvTVd4/edit?usp=sharing). -4. Finally, paste ALL of the information inside `giab_report.txt` into the "RawData" tab. Be careful not -to alter the report text in any way. If successful, the "Formatted" tab will have a fully formatted report. - -While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work -will need to be done to ensure Truvari can parse future GIAB SV releases. - -[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB) - -ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/ - -Include Bed & VCF Header Contigs -------------------------------- - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header.
Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. - - -Comparing Haplotype Sequences of Variants --------------------------------------- - -While many SV overlapping approaches match changes within regions with metrics such as reciprocal overlap and -start/end distances, few account for the possibility of representing the same sequence change in multiple ways. - -For example, consider a tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' -sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python
hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information -between variants in order to allow for more flexibility in complex multi-allelic representations. \ No newline at end of file diff --git a/docs/v2.1.1/collapse.md b/docs/v2.1.1/collapse.md deleted file mode 100644 index 88940556..00000000 --- a/docs/v2.1.1/collapse.md +++ /dev/null @@ -1,112 +0,0 @@ -# truvari collapse - -Will collapse all variants within sizemin/max that match over thresholds. All variants outside size boundaries will be placed into the output. - -Collapse will iterate over a single VCF and 'collapse' variants that match one another above the provided thresholds.
This is useful when removing potentially redundant calls or merging calls between samples/software. - -Merging SVs is a tricky business. Most approaches will simply use distances or reciprocal overlap of calls to merge them. There may be a more informed way to merge calls, especially when they're sequence-resolved breakpoint exact calls. This command helps do a more sophisticated merge. - -To start, we can merge multiple VCFs (each with their own sample) using `bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf`. -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -Creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This single VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. -For example, if we collapsed the above example merge.vcf by matching any calls within 3bp, we'd create: - - >> merge.collapse.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - -When collapsing, the first variant from a matching set of variants will -be written to the output while the others will be placed in the collapsed output. - -Samples with no genotype information in the first variant will be filled by the first -collapsed variant containing genotype information. - -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1 - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - -When using `--chain` mode, instead of collapsing all variants matching the first variant -together, we'll collapse all variants in a matching set together. 
-For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without --chain, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [--debug] [-r REFDIST] [-p PCTSIM] [-B BUFFER] [-P PCTSIZE] - [-O PCTOVL] [-t] [--hap] [--chain] [-s SIZEMIN] [-S SIZEMAX] [--passonly] - -Structural variant collapser - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B BUFFER, --buffer BUFFER - Percent of the reference span to buffer the haplotype sequence created - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Minimum pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v2.1.1/consistency.md b/docs/v2.1.1/consistency.md deleted file mode 100644 index 5c1e52c9..00000000 --- a/docs/v2.1.1/consistency.md +++ /dev/null @@ -1,73 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running ------- - -``` -usage: consistency [-h] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit -``` -# Example - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries ---------------- - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
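A rough illustration of the key-based matching, with made-up VCF body lines (not output from the tool):

```python
def vcf_key(line):
    """Build the CHROM:POS ID REF ALT matching key from a VCF body line."""
    chrom, pos, vid, ref, alt = line.split('\t')[:5]
    return f"{chrom}:{pos} {vid} {ref} {alt}"

file_a = ["chr1\t100\t.\tA\tACGT\t.\tPASS\t.",
          "chr1\t500\t.\tT\t<DEL>\t.\tPASS\t."]
file_b = ["chr1\t100\t.\tA\tACGT\t.\tPASS\t."]

keys_a = {vcf_key(l) for l in file_a}
keys_b = {vcf_key(l) for l in file_b}
shared = keys_a & keys_b  # entries counted as consistent between the two files
```

Any difference in REF/ALT representation breaks the key, which is why comparing each run's Truvari-harmonized tp-base.vcf or fn.vcf outputs is suggested.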
- -Output Report -------------- - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v2.1.1/stats.md b/docs/v2.1.1/stats.md deleted file mode 100644 index 990b8c84..00000000 --- a/docs/v2.1.1/stats.md +++ /dev/null @@ -1,30 +0,0 @@ - -# Running - -``` -usage: stats [-h] [-d DATAFRAME] [-o OUT] [--qmin QMIN] [--qmax QMAX] - VCF [VCF ...] 
- -VCF Stats for SVs with sizebins and svtypes - -positional arguments: - VCF VCFs to annotate (stdin) - -optional arguments: - -h, --help show this help message and exit - -d DATAFRAME, --dataframe DATAFRAME - Write dataframe joblib to file (None) - -o OUT, --out OUT Stats output file (/dev/stdout) - --qmin QMIN Minimum QUAL score found in VCF (0) - --qmax QMAX Maximum QUAL score found in VCF (100) - ``` - -# Example -`$ truvari stats tp-call.vcf.gz` - -## How it works -Counts the number of events found per SVTYPE, SZBIN, QUALBIN, and genotype. - -The text output makes multiple tables for 'total' and per-sample. Per-sample is the subset of total where the sample has a HET or HOM genotype. - -Optionally outputs a joblib dictionary (`--dataframe`) with keys for the total, and per-sample. The total array doesn't have the GT dimension. See this [notebook](https://github.com/spiralgenetics/truvari/blob/develop/TruvariStatsPlots.ipynb) for examples of how to use this data. diff --git a/docs/v2.1.1/stratifications.sh.md b/docs/v2.1.1/stratifications.sh.md deleted file mode 100644 index b1814aa2..00000000 --- a/docs/v2.1.1/stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above by taking a Truvari output directory as the first argument and TR repeat annotations (found in the link above) to create the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment.
- - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v2.1.1/vcf2df.md b/docs/v2.1.1/vcf2df.md deleted file mode 100644 index 95b95f1f..00000000 --- a/docs/v2.1.1/vcf2df.md +++ /dev/null @@ -1,72 +0,0 @@ -# vcf2df - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will take an output Truvari directory and convert each vcf entry into a row of a pandas DataFrame. The resulting DataFrame will always have the columns: -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of whether the filter is empty or PASS -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist creating a union between sets of variants.
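For instance, the shared index turns cross-run comparison into a join; a toy sketch with fabricated frames (real frames would come from joblib-loaded vcf2df output):

```python
import pandas as pd

# Two fabricated "vcf2df"-style frames indexed by the variant description
run1 = pd.DataFrame({"state": ["tp", "fn"]},
                    index=["chr1:100-200.ACGT", "chr1:500-900.T"])
run2 = pd.DataFrame({"state": ["tp", "fp"]},
                    index=["chr1:100-200.ACGT", "chr2:50-60.G"])

# Inner join keeps only variants described in both runs
both = run1.join(run2, how="inner", lsuffix="_run1", rsuffix="_run2")
```

Switching `how` to "outer" would instead give the union of the two variant sets.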
- -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs are single-allele per line, so there are only 2 alleles described per-VCF line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameters used for SV discovery over multiple samples.
After running `truvari bench` on each of the results with the output directories named `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -``` -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -Since these DataFrames may be useful outside of Truvari results, by default a single VCF file is converted to a DataFrame; `--bench-dir` needs to be specified to parse a bench output directory. - -```bash -usage: vcf2df [-h] [-b] [-i] [-f] [-S] [--debug] VCF JL - -Takes a vcf and creates a data frame. Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.0.0/Citations.md b/docs/v3.0.0/Citations.md deleted file mode 100644 index a18d771b..00000000 --- a/docs/v3.0.0/Citations.md +++ /dev/null @@ -1,27 +0,0 @@ -# Citing Truvari -Manuscript in Progress - -For now, pointing to the GitHub repository should be sufficient: [https://github.com/spiralgenetics/truvari](https://github.com/spiralgenetics/truvari) - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari).
Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. -* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection 
from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) diff --git a/docs/v3.0.0/Comparing-two-SV-programs.md b/docs/v3.0.0/Comparing-two-SV-programs.md deleted file mode 100644 index f9ca5e0b..00000000 --- a/docs/v3.0.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,63 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ...
-truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Calculate the overlap -===================== - -One may wish for summary numbers of how many calls are shared/unique between the two programs. -Truvari has a program to help. See [[Consistency-report|consistency]] for details. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf -c cmp_B/fp.vcf.gz -o shared_fps ... -``` \ No newline at end of file diff --git a/docs/v3.0.0/Development.md b/docs/v3.0.0/Development.md deleted file mode 100644 index 1a3da8fc..00000000 --- a/docs/v3.0.0/Development.md +++ /dev/null @@ -1,62 +0,0 @@ -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. 
- -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` branches. - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. - -# conda - -The package management we maintain is via PyPi so that the command `pip install truvari` is available. There is a very old version of Truvari that was uploaded to conda that we'll eventually look into maintaining.
In the meantime, we recommend reading [this](https://www.anaconda.com/blog/using-pip-in-a-conda-environment) when using conda and trying to use Truvari's documented installation procedure. - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -# You'll now be inside the container. Run the FuncTests -bash repo_utils/truvari_ssshtests.sh -# Perform an annotation -truvari anno svinfo -i example.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v3.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v3.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). 
- -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences. - - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`.
\ No newline at end of file diff --git a/docs/v3.0.0/GIAB-stratifications.sh.md b/docs/v3.0.0/GIAB-stratifications.sh.md deleted file mode 100644 index 7666c641..00000000 --- a/docs/v3.0.0/GIAB-stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above by taking a Truvari bench output directory as the first argument and TR repeat annotations (found in the link above) as the second to create the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment. - - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v3.0.0/Home.md b/docs/v3.0.0/Home.md deleted file mode 100644 index 9c1ecd94..00000000 --- a/docs/v3.0.0/Home.md +++ /dev/null @@ -1,29 +0,0 @@ -The wiki holds documentation most relevant for the develop branch. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs.
- -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -# Index - -- [[Updates|Updates]] -- Truvari Commands: - - [[bench|bench]] - - [[The multimatch parameter|The--multimatch-parameter]] - - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]] - - [[Multi-allelic VCFs|Multi-allelic-VCFs]] - - [[stratifications.sh|GIAB-stratifications.sh]] - - [[Comparing two SV programs|Comparing-two-SV-programs]] - - [[consistency|consistency]] - - [[anno|anno]] - - [[collapse|collapse]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v3.0.0/Multi-allelic-VCFs.md b/docs/v3.0.0/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v3.0.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in the base and comparison calls. If VCFs contain multi-allelic sites: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.0.0/The--multimatch-parameter.md b/docs/v3.0.0/The--multimatch-parameter.md deleted file mode 100644 index a5bee8f0..00000000 --- a/docs/v3.0.0/The--multimatch-parameter.md +++ /dev/null @@ -1,5 +0,0 @@ -By default, Truvari bench will only allow a one-to-one match between a base call and comparison call.
The motivation for this is to be more strict when trying to identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, while the rest will be FPs. - -However, Truvari bench's current method with this default behavior does create a race-condition (see [Issue #7](https://github.com/spiralgenetics/truvari/issues/7)). The race happens because Truvari bench compares per BASE event and unites it with its best COMP event. Looser parameters could allow an upstream BASE call to match a COMP call with a lower 'score' than what another downstream BASE neighbor would - therefore matching variants that aren't necessarily the best global match. This may allow many combinations of BASE/COMP calls to float between TP and FN/FP. - -With the `--multimatch` parameter, we can get around this by allowing all of the redundant matches of base and comparison calls. This should have the effect of increasing the reported recall and sensitivity. The final counts of TP-base and TP-call in the `summary.txt` could also have different numbers. Furthermore, when inspecting TPs, some of the comparison calls may not be the most accurate representation of the allele; we have attempted to ensure that strict Truvari bench parameters produce a subset of the looser parameters over the same base/comparison call sets. \ No newline at end of file diff --git a/docs/v3.0.0/Updates.md b/docs/v3.0.0/Updates.md deleted file mode 100644 index 026d3d14..00000000 --- a/docs/v3.0.0/Updates.md +++ /dev/null @@ -1,52 +0,0 @@ -# Truvari 3.0 -*Coming soon* - -As Truvari's adoption and functionality grow, we decided to spend time working on the sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed.
Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported... and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df`, where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and has better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning.
-* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats.
\ No newline at end of file diff --git a/docs/v3.0.0/anno.md b/docs/v3.0.0/anno.md deleted file mode 100644 index 157d7cea..00000000 --- a/docs/v3.0.0/anno.md +++ /dev/null @@ -1,293 +0,0 @@ - -Truvari annotations: - gcpct, gtcnt, trf, grm, repmask, remap, hompct, numneigh, svinfo - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -This requires the [trf executable](http://tandem.bu.edu/trf/trf.download.html), which can either be in your environment `$PATH` or can be directly pointed to through the command line parameters. This also uses SimpleRepeat tracks over the reference genome (e.g. [`resources/`](https://github.com/spiralgenetics/truvari/tree/develop/resources)). - -First, every SV is parsed and intersected with the SimpleRepeat track. If the SV hits a SimpleRepeat region, the SV will be annotated. 
Furthermore, we build the haplotype sequence of the alternate allele and the SimpleRepeat region and feed that sequence into tandem repeat finder. We then intersect the repeats TRF found with the repeats found in the SimpleRepeat track intersection. If the same repeat is found in both, we annotate with the TRF information as well as reporting the difference in the number of copies in alternate allele's haplotype sequence via TRFDiff. - -INFO fields added -* TRF - Entry hits a simple repeat region -* TRFDiff - Simple repeat copy difference -* TRFperiod - Period size of the repeat -* TRFcopies - Number of copies aligned with the consensus pattern. -* TRFscore - TRF Alignment score -* TRFentropy - TRF Entropy measure -* TRFrepeat - TRF Repeat found on entry - -``` -usage: trf [-h] [-i INPUT] [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] - [-s SIMPLE_REPEATS] [-f REFERENCE] [-m MIN_LENGTH] [-M MAX_LENGTH] - [-t THREADS] [-C CHUNK_SIZE] [--debug] - -Intersect vcf with reference simple repeats and report -how the an alternate allele affects the copies using TRF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 - -h -ngs) - -s SIMPLE_REPEATS, --simple-repeats SIMPLE_REPEATS - Simple repeats bed - -f REFERENCE, --reference REFERENCE - Reference fasta file - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of sequence to run through trf (10000) - -t THREADS, --threads THREADS - Number of threads to use (48) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization - (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we 
create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) alleles, we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". - -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score.
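As a quick illustration of this naming scheme, the full set of expected DataFrame columns can be generated from the pieces described above (a sketch based only on this description, not code from Truvari itself):

```python
# Hypothetical reconstruction of the grm DataFrame's column-name grammar.
prefixes = ["rup_", "rdn_", "aup_", "adn_"]     # ref/alt x upstream/downstream
hit_counts = ["nhits", "dir_hits", "com_hits"]  # per-query hit tallies
stats = ["avg", "max", "min"]                   # summary statistics
suffixes = ["q", "ed", "mat", "mis"]            # quality, edit dist, matches, mismatches

columns = [p + h for p in prefixes for h in hit_counts]
columns += [f"{p}{s}_{x}" for p in prefixes for s in stats for x in suffixes]

print(len(columns))            # 4*3 + 4*3*4 = 60
print("aup_avg_q" in columns)  # True
```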
- -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] - [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (48) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] - [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa - {threads} -e hmmer -species human -lcambig -nocut -div - 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (48) - --debug Verbose logging -``` - -# truvari anno remap - -Taking the Allele’s sequence, remap it to the reference and annotate based on the closest alignment. 
- -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] - [-t THRESHOLD] [-d DIST] [--debug] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -novel - Allele has no hits in reference -tandem - Allele's closest hit is within len(allele) bp of the SV's position -interspersed - Allele's closest hit is not tandem -partial - Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: ---minlength - minimum SV length to considred (50) ---dist - For deletion SVs, do not consider alignments that hit within Nbp of the SV's position - (a.k.a. alignments back to the source sequence) (10) ---threshold - Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit - (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's - position to be considered (10)) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] - [--debug] - -Calcluate the the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate - 
-M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-i INPUT] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] - [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, --- is a call, refdist is 2 - - - - - - - -nn: 1 2 1 0 1 1 -id: 0 0 0 1 2 2 - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column.
-- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-i INPUT] [-o OUTPUT] [-m MINSIZE] - -Adds SVTYPE and SVLEN INFO fields - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.0.0/bench.md b/docs/v3.0.0/bench.md deleted file mode 100644 index 31675aae..00000000 --- a/docs/v3.0.0/bench.md +++ /dev/null @@ -1,186 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -f reference.fa -o output_dir/ -``` - -Outputs -======= -Truvari bench creates an output directory with the following files. - - - - - - - - - -
| File | Description |
|------|-------------|
| tp-call.vcf | Annotated true positive calls from the comparison VCF |
| tp-base.vcf | Annotated true positive calls from the base VCF |
| fp.vcf | False positive calls from comparison |
| fn.vcf | False negative calls from base |
| summary.txt | Json output of performance stats |
| log.txt | Run's logging output |
| giab_report.txt | (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below. |
| giab.jl | (optional) DataFrame of benchmark results. |
- -summary.txt -=========== -The following stats are generated for benchmarking your call set and written to summary.txt as a json dump. - - - - - - - - - - - - - - - - -
| Metric | Definition |
|--------|------------|
| TP-base | Number of matching calls from the base vcf |
| TP-call | Number of matching calls from the comp vcf |
| FP | Number of non-matching calls from the comp vcf |
| FN | Number of non-matching calls from the base vcf |
| precision | TP-call / (TP-call + FP) |
| recall | TP-base / (TP-base + FN) |
| f1 | 2 * ((recall * precision) / (recall + precision)) |
| base cnt | Number of calls in the base vcf |
| call cnt | Number of calls in the comp vcf |
| TP-call_TP-gt | TP-call with genotype match |
| TP-call_FP-gt | TP-call without genotype match |
| TP-base_TP-gt | TP-base with genotype match |
| TP-base_FP-gt | TP-base without genotype match |
| gt_concordance | TP-call_TP-gt / (TP-call_TP-gt + TP-call_FP-gt) |
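To make the formulas above concrete, here is a small sketch recomputing the derived metrics from hypothetical counts (the numbers are invented for illustration):

```python
# Invented counts standing in for a summary.txt.
tp_base, tp_call, fp, fn = 90, 92, 8, 10
tp_call_tp_gt, tp_call_fp_gt = 85, 7

precision = tp_call / (tp_call + fp)   # 92 / 100 = 0.92
recall = tp_base / (tp_base + fn)      # 90 / 100 = 0.90
f1 = 2 * (recall * precision) / (recall + precision)
gt_concordance = tp_call_tp_gt / (tp_call_tp_gt + tp_call_fp_gt)

print(round(f1, 4))  # 0.9099
```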
- -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them so that the comparison is fair. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctsim=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt). - -Here is a rundown of each matching parameter. - -
| Parameter | Default | Definition |
|-----------|---------|------------|
| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end |
| pctsim | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Haplotype Sequences of Variants" below. |
| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
| typeignore | False | Types don't need to match to compare calls. |
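Two of these thresholds are simple ratios and can be sketched directly from their definitions above (an illustration only; Truvari's internal implementation may differ):

```python
def pct_size_similarity(base_size, comp_size):
    """pctsize: min(base_size, comp_size) / max(base_size, comp_size)."""
    return min(base_size, comp_size) / max(base_size, comp_size)

def pct_overlap(start1, end1, start2, end2):
    """pctovl: overlapping bases divided by the longest span."""
    overlap = max(0, min(end1, end2) - max(start1, start2))
    return overlap / max(end1 - start1, end2 - start2)

print(pct_size_similarity(49, 50))      # 0.98 - the 49bp/50bp edge case
print(pct_overlap(100, 200, 150, 250))  # 0.5
```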
- -Comparing VCFs without sequence resolved calls -============================================== - -If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off -sequence comparison. The `--reference` does not need to be set when not using sequence comparison. - -Difference between --sizemin and --sizefilt -=========================================== - -`--sizemin` is the minimum size of a base call or comparison call to be considered. - -`--sizefilt` is the minimum size of a call to be added into the IntervalTree for searching. It should -be less than `sizemin` for edge case variants. - -For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp call at -the same position. - -These two calls should be considered matching. If we instead removed calls less than `sizemin`, we'd -incorrectly classify the 50bp base call as a false negative. - -This does have the side effect of artificially inflating specificity. If that same 49bp call in the -above were below the similarity threshold, it would not be classified as a FP due to the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Definition of annotations added to TP vcfs -========================================== - - - - - - - - - - - -
| Anno | Definition |
|------|------------|
| TruScore | Truvari score for similarity of match. `((2*sim + 1*size + 1*ovl) / 3.0)` |
| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
| PctSizeSimilarity | Pct size similarity between this variant and its closest match |
| PctRecOverlap | Percent reciprocal overlap of the two calls' coordinates |
| StartDistance | Distance of this call's start from matching call's start |
| EndDistance | Distance of this call's end from matching call's end |
| SizeDiff | Difference in size(basecall) and size(compcall) |
| NumNeighbors | Number of comparison calls that were in the neighborhood (REFDIST) of the base call |
| NumThresholdNeighbors | Number of comparison calls that passed threshold matching of the base call |
- -NumNeighbors and NumThresholdNeighbors are also added to the FN vcf. - -Using the GIAB Report -===================== - -When running against the GIAB SV benchmark (link below), you can automatically add a text `giab_report.txt` to the output directory containing calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. This also adds a DataFrame `giab.jl` (similar to running `truvari vcf2df --bench`). - -See the [Recipes Discussions](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes) on how make interesting summaries and plots. - -While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work -will need to be done to ensure Truvari can parse future GIAB SV releases. - -[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB) - -ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/ - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. 
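The containment requirement can be sketched as follows (a simplified illustration, not Truvari's actual region-handling code):

```python
def fully_contained(start, end, regions):
    # A call is used only if its start AND end fall inside one include region.
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)

include = [(1000, 2000), (5000, 6000)]
print(fully_contained(1200, 1800, include))  # True - inside a single region
print(fully_contained(1900, 2100, include))  # False - spans a region boundary
```

By contrast, `bedtools intersect -u` would keep the second call because it merely touches a region, which is exactly the difference described above.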
- - -Comparing Haplotype Sequences of Variants -========================================= - -While many SV overlapping approaches match changes within regions with metrics such as reciprocal overlap and -start/end distances, few account for the possibility of representing the same sequence change in multiple ways. - -For example, consider a tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' -sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before applying the sequence change introduced by the variants. In Python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information -between variants in order to allow for more flexibility in complex multi-allelic representations. - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -Input: - BaseCall - Benchmark TruthSet of SVs - CompCalls - Comparison SVs from another program -Build IntervalTree of CompCalls -For each BaseCall: - Fetch CompCalls overlapping within *refdist*.
- If typematch and sequence_similarity >= *pctsim* \ - and size_ratio >= *pctsize* and PctRecOvl >= *pctovl*: - Add CompCall to list of Neighbors - Sort list of Neighbors by TruScore ((2*sim + 1*size + 1*ovl) / 3.0) - Take CompCall with highest TruScore and BaseCall as TPs - Only use a CompCall once if not --multimatch - If no neighbors: BaseCall is FN -For each CompCall: - If not used: mark as FP -``` \ No newline at end of file diff --git a/docs/v3.0.0/collapse.md b/docs/v3.0.0/collapse.md deleted file mode 100644 index 87da439c..00000000 --- a/docs/v3.0.0/collapse.md +++ /dev/null @@ -1,130 +0,0 @@ -Will collapse all variants within sizemin/max that match over thresholds. All variants outside size boundaries will be placed into the output. - -Collapse will iterate over a single VCF and 'collapse' variants that match one another above the provided thresholds. This is useful when removing potentially redundant calls or merging calls between samples/software. - -Merging SVs is a tricky business. Most approaches will simply use distances or reciprocal overlap of calls to merge them. There may be a more informed way to merge calls, especially when they're sequence-resolved breakpoint exact calls. This command helps do a more sophisticated merge. - -To start, we can merge multiple VCFs (each with their own sample) using `bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf`. -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -Creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This single VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. -For example, if we collapsed the above example merge.vcf by matching any calls within 3bp, we'd create: - - >> merge.collapse.vcf - chr1 1 ... 
GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - -When collapsing, the first variant from a matching set of variants will -be written to the output while the others will be placed in the collapsed output. - -Samples with no genotype information in the first variant will be filled by the first -collapsed variant containing genotype information. - -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1 - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - -When using `--chain` mode, instead of collapsing all variants matching the first variant -together, we'll collapse all variants in a matching set together. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without --chain, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. 
- - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] - [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM] - [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] - [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] - [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds -All variants outside size boundaries will be placed into the output - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved - calls (False) - --chain Chain comparisons to extend possible collapsing - (False) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate - into the kept entry by taking the first non-null from - all neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to - ignore. 
(0.95) - -B BUFFER, --buffer BUFFER - Percent of the reference span to buffer the haplotype - sequence created - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) - (0.95) - -O PCTOVL, --pctovl PCTOVL - Minimum pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib - editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison - (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v3.0.0/consistency.md b/docs/v3.0.0/consistency.md deleted file mode 100644 index b8ccda5c..00000000 --- a/docs/v3.0.0/consistency.md +++ /dev/null @@ -1,74 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit -``` -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
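The exact-key matching and the group counts shown in the report below can be sketched in a few lines of Python. This is an illustration of the idea, not Truvari's implementation; the helper names (`entry_key`, `consistency_groups`) are hypothetical.

```python
from collections import Counter

def entry_key(vcf_line):
    """Build the 'CHROM:POS ID REF ALT' key from a tab-delimited VCF body line."""
    chrom, pos, vid, ref, alt = vcf_line.split('\t')[:5]
    return f"{chrom}:{pos} {vid} {ref} {alt}"

def consistency_groups(vcfs):
    """Map each call key to a presence bitmask (one bit per input VCF),
    then count how many keys fall into each group (e.g. '110')."""
    n = len(vcfs)
    presence = {}
    for idx, lines in enumerate(vcfs):
        for line in lines:
            key = entry_key(line)
            presence[key] = presence.get(key, 0) | 1 << (n - idx - 1)
    return Counter(format(mask, f'0{n}b') for mask in presence.values())

# Two toy VCF bodies: one call shared by both files, one unique to the first
vcfs = [["chr1\t100\trs1\tA\tT", "chr1\t200\trs2\tG\tC"],
        ["chr1\t100\trs1\tA\tT"]]
print(consistency_groups(vcfs))  # Counter({'11': 1, '10': 1})
```

A group string like `'11'` reads the same way as the report's `#Group` column: presence (1) or absence (0) of the call in each input VCF, in order.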
- -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v3.0.0/vcf2df.md b/docs/v3.0.0/vcf2df.md deleted file mode 100644 index b2cf8b39..00000000 --- a/docs/v3.0.0/vcf2df.md +++ /dev/null @@ -1,70 +0,0 @@ -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. 
This will take an output Truvari directory and convert each vcf entry into a row of a pandas DataFrame. The resulting DataFrame will always have the columns: -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean indicating whether the filter is empty or PASS -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist in creating a union between sets of variants. - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs are single-allele per line, so there are only 2 alleles described per-VCF line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`.
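The per-allele column expansion described above can be mimicked directly in pandas. This is a sketch of the idea, not `vcf2df`'s actual code, and the depth/likelihood values are made up for illustration.

```python
import pandas as pd

# Toy frame where AD (Number=A) and PL (Number=G) are still lists per row
df = pd.DataFrame({"AD": [[20, 10], [15, 5]],
                   "PL": [[0, 30, 300], [60, 0, 200]]})

# Expand each list-valued column into per-allele / per-genotype columns
df[["AD_ref", "AD_alt"]] = pd.DataFrame(df["AD"].tolist(), index=df.index)
df[["PL_ref", "PL_het", "PL_hom"]] = pd.DataFrame(df["PL"].tolist(), index=df.index)

print(df[["AD_ref", "AD_alt", "PL_ref", "PL_het", "PL_hom"]])
```

Building an intermediate DataFrame from the lists keeps the row index aligned, so the new columns join back onto the original frame positionally.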
- -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameter sets used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -Since these DataFrames may be useful outside of Truvari results, a single VCF file is converted to a DataFrame by default; `--bench-dir` needs to be specified to parse a directory. - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-S] [--debug] VCF JL - -Takes a vcf and creates a data frame.
Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.1.0/Citations.md b/docs/v3.1.0/Citations.md deleted file mode 100644 index cfc5a27a..00000000 --- a/docs/v3.1.0/Citations.md +++ /dev/null @@ -1,29 +0,0 @@ -# Citing Truvari -Manuscript in Progress - -For now, pointing to the GitHub repository should be sufficient [https://github.com/spiralgenetics/truvari](https://github.com/spiralgenetics/truvari) - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
-* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted 
nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) diff --git a/docs/v3.1.0/Comparing-two-SV-programs.md b/docs/v3.1.0/Comparing-two-SV-programs.md deleted file mode 100644 index f9ca5e0b..00000000 --- a/docs/v3.1.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,63 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf.
- -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Calculate the overlap -===================== - -One may wish for summary numbers of how many calls are shared/unique between the two programs. -Truvari has a program to help. See [[Consistency-report|consistency]] for details. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf -c cmp_B/fp.vcf.gz -o shared_fps ... 
-``` \ No newline at end of file diff --git a/docs/v3.1.0/Development.md b/docs/v3.1.0/Development.md deleted file mode 100644 index 09f21e3e..00000000 --- a/docs/v3.1.0/Development.md +++ /dev/null @@ -1,73 +0,0 @@ -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/`. - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPI, no suffix is placed on the version.
- -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. - -# conda - -The package management we maintain is via PyPi so that the command `pip install truvari` is available. There is a very old version of Truvari that was uploaded to conda that we'll eventually look into maintaining. In the meantime, we recommend reading [this](https://www.anaconda.com/blog/using-pip-in-a-conda-environment) when using conda and trying to use Truvari's documented installation procedure. - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# Truvari SDK -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). 
- -Documentation can also be seen using -```python -import truvari -help(truvari) -``` diff --git a/docs/v3.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v3.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v3.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences.: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences. 
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v3.1.0/GIAB-stratifications.sh.md b/docs/v3.1.0/GIAB-stratifications.sh.md deleted file mode 100644 index 7666c641..00000000 --- a/docs/v3.1.0/GIAB-stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. 
See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above by taking a Truvari bench output directory as the first argument and TR repeat annotations (found in the link above) to create the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment. - - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v3.1.0/Home.md b/docs/v3.1.0/Home.md deleted file mode 100644 index 63437a6a..00000000 --- a/docs/v3.1.0/Home.md +++ /dev/null @@ -1,32 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g.
combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[bench|bench]] - - [[The multimatch parameter|The--multimatch-parameter]] - - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]] - - [[Multi-allelic VCFs|Multi-allelic-VCFs]] - - [[stratifications.sh|GIAB-stratifications.sh]] - - [[Comparing two SV programs|Comparing-two-SV-programs]] - - [[consistency|consistency]] - - [[anno|anno]] - - [[collapse|collapse]] - - [[vcf2df|vcf2df]] - - [[segment|segment]] -- [[Development|Development]] -- [[Citations|Citations]] -- [[Resources|Resources]] \ No newline at end of file diff --git a/docs/v3.1.0/Installation.md b/docs/v3.1.0/Installation.md deleted file mode 100644 index 0e06f9d2..00000000 --- a/docs/v3.1.0/Installation.md +++ /dev/null @@ -1,37 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.0.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 setup.py install -``` -To see a list of all available tags, run: -``` -git tag -l -``` -If you have an older clone of the repository, but don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. 
Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v3.1.0/MatchIds.md b/docs/v3.1.0/MatchIds.md deleted file mode 100644 index bdabf9bb..00000000 --- a/docs/v3.1.0/MatchIds.md +++ /dev/null @@ -1,27 +0,0 @@ -Using `bench` MatchId -===================== - -MatchIds are used to tie base/comparison calls together. MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. -Where: - - chunkid - Unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were - compared. - - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't have `--multimatch` is straightforward. For example, to tie tp-base.vcf and tp-call.vcf calls together, they will have identical MatchIds. - -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g. `5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -Tying false positives/negatives to their closest matching counterparts is a little trickier. -For example, for a false negative with MatchId `3.1.0`, we'll need to look in both the `fn.vcf` and `tp-call.vcf` for -`3.*.0`. - -When using `--multimatch` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-call 3.0.0 -tp-call 3.0.1 -``` -Here, the 0th base/comp calls match to one another. Additionally, the 1st comp call `3.0.1` also matches to the 0th base call `tp-base 3.0.0`.
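The chunk/comp lookup described above can be made concrete with a small helper. This is hypothetical illustration code, not part of Truvari; it simply parses the `{chunkid}.{baseid}.{compid}` string and finds entries sharing the same chunk and comp ids.

```python
def parse_match_id(match_id):
    """Split a MatchId into its (chunkid, baseid, compid) parts.
    Placeholders like '_' are kept as-is."""
    chunk, base, comp = match_id.split('.')
    return chunk, base, comp

def find_counterparts(fn_match_id, comp_match_ids):
    """For a false negative's MatchId like '3.1.0', return every MatchId
    from another VCF matching the pattern '3.*.0' (same chunk, same comp)."""
    chunk, _, comp = parse_match_id(fn_match_id)
    return [m for m in comp_match_ids
            if parse_match_id(m)[0] == chunk and parse_match_id(m)[2] == comp]

print(find_counterparts('3.1.0', ['3.0.0', '3.2.0', '4.1.0']))  # ['3.0.0', '3.2.0']
```

The same helper works for `--multimatch` outputs, where a single base or comp id can appear in several MatchIds.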
\ No newline at end of file diff --git a/docs/v3.1.0/Multi-allelic-VCFs.md b/docs/v3.1.0/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v3.1.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in the base and comparison calls. If VCFs contain multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.1.0/Resources.md b/docs/v3.1.0/Resources.md deleted file mode 100644 index 25a76ce1..00000000 --- a/docs/v3.1.0/Resources.md +++ /dev/null @@ -1,3 +0,0 @@ -*page in progress* - -Use `git-lfs` to pull the resources. Currently only useful for `truvari anno trf` \ No newline at end of file diff --git a/docs/v3.1.0/The--multimatch-parameter.md b/docs/v3.1.0/The--multimatch-parameter.md deleted file mode 100644 index bee54a12..00000000 --- a/docs/v3.1.0/The--multimatch-parameter.md +++ /dev/null @@ -1,3 +0,0 @@ -By default, Truvari bench will only allow a one-to-one match between a base call and comparison call. The motivation for this is to be more strict when trying to identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, while the rest will be FPs.
\ No newline at end of file diff --git a/docs/v3.1.0/Updates.md b/docs/v3.1.0/Updates.md deleted file mode 100644 index 856ec155..00000000 --- a/docs/v3.1.0/Updates.md +++ /dev/null @@ -1,69 +0,0 @@ -# Truvari 3.1 -*in progress* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affects the results. 
Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`.
- - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the end of life of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats.
\ No newline at end of file diff --git a/docs/v3.1.0/anno.md b/docs/v3.1.0/anno.md deleted file mode 100644 index 157d7cea..00000000 --- a/docs/v3.1.0/anno.md +++ /dev/null @@ -1,293 +0,0 @@ - -Truvari annotations: - gcpct, gtcnt, trf, grm, repmask, remap, hompct, numneigh, svinfo - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -This requires the [trf executable](http://tandem.bu.edu/trf/trf.download.html), which can either be in your environment `$PATH` or can be directly pointed to through the command line parameters. This also uses SimpleRepeat tracks over the reference genome (e.g. [`resources/`](https://github.com/spiralgenetics/truvari/tree/develop/resources)). - -First, every SV is parsed and intersected with the SimpleRepeat track. If the SV hits a SimpleRepeat region, the SV will be annotated. 
Furthermore, we build the haplotype sequence of the alternate allele and the SimpleRepeat region and feed that sequence into tandem repeat finder. We then intersect the repeats TRF found with the repeats found in the SimpleRepeat track intersection. If the same repeat is found in both, we annotate with the TRF information as well as reporting the difference in the number of copies in alternate allele's haplotype sequence via TRFDiff. - -INFO fields added -* TRF - Entry hits a simple repeat region -* TRFDiff - Simple repeat copy difference -* TRFperiod - Period size of the repeat -* TRFcopies - Number of copies aligned with the consensus pattern. -* TRFscore - TRF Alignment score -* TRFentropy - TRF Entropy measure -* TRFrepeat - TRF Repeat found on entry - -``` -usage: trf [-h] [-i INPUT] [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] - [-s SIMPLE_REPEATS] [-f REFERENCE] [-m MIN_LENGTH] [-M MAX_LENGTH] - [-t THREADS] [-C CHUNK_SIZE] [--debug] - -Intersect vcf with reference simple repeats and report -how the an alternate allele affects the copies using TRF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 - -h -ngs) - -s SIMPLE_REPEATS, --simple-repeats SIMPLE_REPEATS - Simple repeats bed - -f REFERENCE, --reference REFERENCE - Reference fasta file - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of sequence to run through trf (10000) - -t THREADS, --threads THREADS - Number of threads to use (48) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization - (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we 
create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not add traditional annotations to the VCF, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) alleles, we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". - -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score.
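The resulting column grid can be sketched programmatically. This is only an illustration of the naming scheme described above (assuming the hit-count columns use the same four prefixes), not code from Truvari:

```python
from itertools import product

prefixes = ["rup_", "rdn_", "aup_", "adn_"]  # ref/alt x up/downstream queries
hit_counts = ["nhits", "dir_hits", "com_hits"]
stats = ["avg", "max", "min"]
suffixes = ["q", "ed", "mat", "mis"]  # quality, edit distance, matches, mismatches

# Per-query hit counts, then summary statistics for each alignment measure
columns = [p + h for p, h in product(prefixes, hit_counts)]
columns += ["%s%s_%s" % (p, s, x) for p, s, x in product(prefixes, stats, suffixes)]

# 4*3 hit-count columns plus 4*3*4 summary columns = 60 per-SV columns
assert "aup_avg_q" in columns
```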
- -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] - [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (48) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] - [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa - {threads} -e hmmer -species human -lcambig -nocut -div - 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (48) - --debug Verbose logging -``` - -# truvari anno remap - -Taking the Allele’s sequence, remap it to the reference and annotate based on the closest alignment. 
- -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] - [-t THRESHOLD] [-d DIST] [--debug] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -novel - Allele has no hits in reference -tandem - Allele's closest hit is within len(allele) bp of the SV's position -interspersed - Allele's closest hit is not tandem -partial - Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: ---minlength - minimum SV length to considred (50) ---dist - For deletion SVs, do not consider alignments that hit within Nbp of the SV's position - (a.k.a. alignments back to the source sequence) (10) ---threshold - Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit - (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's - position to be considered (10)) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] - [--debug] - -Calcluate the the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate - 
-M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-i INPUT] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] - [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, --- is a call, refdist is 2 - - - - - - - -nn: 1 2 1 0 1 1 -id: 0 0 0 1 2 2 - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is a symbolic SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column. 
-- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-i INPUT] [-o OUTPUT] [-m MINSIZE] - -Adds SVTYPE and SVLEN INFO fields - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.1.0/bench.md b/docs/v3.1.0/bench.md deleted file mode 100644 index fd7e14be..00000000 --- a/docs/v3.1.0/bench.md +++ /dev/null @@ -1,232 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -f reference.fa -o output_dir/ -``` - -Outputs -======= -Truvari bench creates an output directory with the following files. - - - - - - - - - -
| File | Description |
|------|-------------|
| tp-call.vcf | Annotated true positive calls from the comparison VCF |
| tp-base.vcf | Annotated true positive calls from the base VCF |
| fp.vcf | False positive calls from comparison |
| fn.vcf | False negative calls from base |
| summary.txt | Json output of performance stats |
| log.txt | Run's logging output |
| giab_report.txt | (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below. |
| giab.jl | (optional) DataFrame of benchmark results. |
- -summary.txt -=========== -The following stats are generated for benchmarking your call set and written to summary.txt as a json dump. - - - - - - - - - - - - - - - - -
| Metric | Definition |
|--------|------------|
| TP-base | Number of matching calls from the base vcf |
| TP-call | Number of matching calls from the comp vcf |
| FP | Number of non-matching calls from the comp vcf |
| FN | Number of non-matching calls from the base vcf |
| precision | TP-call / (TP-call + FP) |
| recall | TP-base / (TP-base + FN) |
| f1 | 2 * ((recall * precision) / (recall + precision)) |
| base cnt | Number of calls in the base vcf |
| call cnt | Number of calls in the comp vcf |
| TP-call_TP-gt | TP-call with genotype match |
| TP-call_FP-gt | TP-call without genotype match |
| TP-base_TP-gt | TP-base with genotype match |
| TP-base_FP-gt | TP-base without genotype match |
| gt_concordance | TP-call_TP-gt / (TP-call_TP-gt + TP-call_FP-gt) |
- -Matching Parameters =================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that the comparison is fair. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctsim=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt). - -Here is a rundown of each matching parameter.
| Parameter | Default | Definition |
|-----------|---------|------------|
| refdist | 500 | Maximum distance a comparison call's start/end can be from the base call's start/end |
| pctsim | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Haplotype Sequences of Variants" below. |
| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
| typeignore | False | Types don't need to match to compare calls. |
- -Matching Parameter Diagrams =========================== -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctsim PCTSIM (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Comparing VCFs without sequence resolved calls ============================================== - -If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off -sequence comparison. The `--reference` does not need to be set when not using sequence comparison. - -Difference between --sizemin and --sizefilt =========================================== - -`--sizemin` is the minimum size of a base call or comparison call to be considered. - -`--sizefilt` is the minimum size of a call to be added into the IntervalTree for searching. It should -be less than `sizemin` for edge case variants. - -For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp call at -the same position. - -These two calls should be considered matching. If we instead removed calls less than `sizemin`, we'd -incorrectly classify the 50bp base call as a false negative. - -This does have the side effect of artificially inflating specificity. If that same 49bp call in the -above were below the similarity threshold, it would not be classified as a FP due to the `sizemin` -threshold.
So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Definition of annotations added to TP vcfs -========================================== - - - - - - - - - - - - -
| Anno | Definition |
|------|------------|
| TruScore | Truvari score for similarity of match. `((pctsim + pctsize + pctovl) / 3 * 100)` |
| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
| PctSizeSimilarity | Pct size similarity between this variant and its closest match |
| PctRecOverlap | Percent reciprocal overlap of the two calls |
| StartDistance | Distance of the base call's start from the comparison call's start |
| EndDistance | Distance of the base call's end from the comparison call's end |
| SizeDiff | Difference in size of base and comp calls |
| GTMatch | Base/comp calls' genotypes match |
| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} |
| Multi | Call is False due to non-multimatching |
- -See [[MatchIds wiki|MatchIds]] for details. - -Using the GIAB Report -===================== - -When running against the GIAB SV benchmark (link below), you can automatically add a text `giab_report.txt` to the output directory containing calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. This also adds a DataFrame `giab.jl` (similar to running `truvari vcf2df --bench`). - -See the [Recipes Discussions](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes) on how make interesting summaries and plots. - -While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work -will need to be done to ensure Truvari can parse future GIAB SV releases. - -[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB) - -ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/ - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. 
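The containment rule can be sketched as follows, where `entirely_contained` is a hypothetical helper (not Truvari's API) that requires both the start and the end of a call to fall within a single include region:

```python
def entirely_contained(regions, chrom, start, end):
    """True if [start, end] falls entirely within one include region.

    regions: dict mapping chrom -> list of (start, end) tuples.
    """
    return any(r_start <= start and end <= r_end
               for r_start, r_end in regions.get(chrom, []))

include = {"chr1": [(100, 500), (800, 1000)]}
entirely_contained(include, "chr1", 150, 400)  # fully inside one region -> True
entirely_contained(include, "chr1", 450, 850)  # spans two regions -> False
```

Note the second call would be kept by `bedtools intersect -u`, since it overlaps an include region, but excluded under this stricter containment rule.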
- - -Comparing Haplotype Sequences of Variants ========================================= - -While many SV overlapping approaches match changes within regions with metrics such as reciprocal overlap and -start/end distances, few account for the possibility of representing the same sequence change in multiple ways. - -For example, consider a tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' -sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before applying the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information -between variants in order to allow for more flexibility in complex multi-allelic representations. - -Methodology =========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
-``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* if multimatch: -** return the max of each row (base) and of each column (call) from the match matrix -* else: -** ravel and sort the match matrix -** return the first use of each call in its current state (True or False) -** return the second use of each call with state=False and set Multi=True -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v3.1.0/collapse.md b/docs/v3.1.0/collapse.md deleted file mode 100644 index 4ae267aa..00000000 --- a/docs/v3.1.0/collapse.md +++ /dev/null @@ -1,140 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) via: -```bash -bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. 
-For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> output.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - ---choose behavior ================= -When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---hap mode ========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1. - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode ============ -When using `--chain` mode, instead of collapsing all variants with the first variant, -we'll collapse all variants in a matching set together. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations =========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`. -- `CollapseId` - the identifier of the variant when comparing to the collapse outputs.
-- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] - [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM] - [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] - [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] - [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds -All variants outside size boundaries will be placed into the output - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null from all - neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B BUFFER, --buffer BUFFER - Percent of the reference span to buffer the haplotype sequence created - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Minimum pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v3.1.0/consistency.md b/docs/v3.1.0/consistency.md deleted file mode 100644 index b8ccda5c..00000000 --- a/docs/v3.1.0/consistency.md +++ /dev/null @@ -1,74 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running ======= - -``` -usage: consistency [-h] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit -``` -Example ======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries ================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
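The key-based intersection can be sketched with plain Python sets. This is an illustration of how the presence/absence groups are counted, not `truvari consistency` itself:

```python
from collections import Counter

def consistency_groups(per_vcf_keys):
    """Count presence/absence groups across VCFs.

    per_vcf_keys: list of sets of variant keys ("CHROM:POS ID REF ALT").
    Returns a Counter mapping group strings like '101' to call counts.
    """
    groups = Counter()
    for key in set().union(*per_vcf_keys):
        groups["".join("1" if key in keys else "0" for keys in per_vcf_keys)] += 1
    return groups

file_a = {"chr1:10 . A T", "chr1:20 . C G"}
file_b = {"chr1:10 . A T"}
counts = consistency_groups([file_a, file_b])
# '11' -> shared by both files, '10' -> unique to file_a
```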
- -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v3.1.0/segment.md b/docs/v3.1.0/segment.md deleted file mode 100644 index 99514df0..00000000 --- a/docs/v3.1.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. 
- -This tool adds an INFO field `SEGCNT` which holds the number of SVs overlapping the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`. - -In the below example, the new annotations would be: - -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - diff --git a/docs/v3.1.0/vcf2df.md b/docs/v3.1.0/vcf2df.md deleted file mode 100644 index cba5c8a4..00000000 --- a/docs/v3.1.0/vcf2df.md +++ /dev/null @@ -1,73 +0,0 @@ -We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of whether the filter is empty or PASS - -Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist set operations on variants. - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added.
For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. `vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column: -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameters used for SV discovery over multiple samples.
After running `truvari bench` on each of the results with the output directories named `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-m] [-S] [--debug] VCF JL - -Takes a vcf and creates a data frame. Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -m, --multisample Parse multiple samples. Splits -s by comma. Sample column names will be flattened - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.2.0/Citations.md b/docs/v3.2.0/Citations.md deleted file mode 100644 index b72e3817..00000000 --- a/docs/v3.2.0/Citations.md +++ /dev/null @@ -1,31 +0,0 @@ -# Citing Truvari - -Pre-print on Biorxiv while in submission: - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari).
Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. -* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection 
from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) diff --git a/docs/v3.2.0/Comparing-two-SV-programs.md b/docs/v3.2.0/Comparing-two-SV-programs.md deleted file mode 100644 index f9ca5e0b..00000000 --- a/docs/v3.2.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,63 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. 
For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Calculate the overlap -===================== - -One may wish for summary numbers of how many calls are shared/unique between the two programs. -Truvari has a program to help. See [[Consistency-report|consistency]] for details. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf -c cmp_B/fp.vcf.gz -o shared_fps ...
-``` \ No newline at end of file diff --git a/docs/v3.2.0/Development.md b/docs/v3.2.0/Development.md deleted file mode 100644 index 8c63e6f4..00000000 --- a/docs/v3.2.0/Development.md +++ /dev/null @@ -1,94 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. 
- -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` branches. - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Freeze the Wiki -4) Ensure all code is checked in -5) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -6) Use github action to make a testpypi release -7) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -8) Use GitHub action to make a pypi release -9) Change Updates Wiki -10) Download release-tarball.zip from step #8’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge - -# conda - -The package management we maintain is via PyPi so that the command `pip install truvari` is available. There is a very old version of Truvari that was uploaded to conda that we'll eventually look into maintaining. In the meantime, we recommend reading [this](https://www.anaconda.com/blog/using-pip-in-a-conda-environment) when using conda and trying to use Truvari's documented installation procedure. \ No newline at end of file diff --git a/docs/v3.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v3.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v3.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). 
- -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences. - - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratios. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence differences. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`.
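The ~83% figure above can be checked with quick arithmetic: treating the 5 left-trimmed bases as mismatches alongside the 1 substitution gives 29 matches out of 35 compared positions. A minimal sketch:

```python
# Values taken from the alignment diagram above
matches = 29      # 30 aligned bases minus the single substitution
mismatches = 1    # the one substitution in the alignment
trimmed = 5       # left-trimmed bases, counted here as mismatches
similarity = matches / (matches + mismatches + trimmed)
print(f"{similarity:.2%}")  # prints 82.86%
```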
\ No newline at end of file diff --git a/docs/v3.2.0/GIAB-stratifications.sh.md b/docs/v3.2.0/GIAB-stratifications.sh.md deleted file mode 100644 index 7666c641..00000000 --- a/docs/v3.2.0/GIAB-stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above by taking a Truvari bench output directory as the first argument and TR repeat annotations (found in the link above) to create the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment. - - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v3.2.0/Home.md b/docs/v3.2.0/Home.md deleted file mode 100644 index 3bd133bf..00000000 --- a/docs/v3.2.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop.
For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[bench|bench]] - - [[The multimatch parameter|The--multimatch-parameter]] - - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]] - - [[Multi-allelic VCFs|Multi-allelic-VCFs]] - - [[stratifications.sh|GIAB-stratifications.sh]] - - [[Comparing two SV programs|Comparing-two-SV-programs]] - - [[consistency|consistency]] - - [[anno|anno]] - - [[collapse|collapse]] - - [[vcf2df|vcf2df]] - - [[segment|segment]] -- [[Development|Development]] -- [[Citations|Citations]] -- [[Resources|Resources]] \ No newline at end of file diff --git a/docs/v3.2.0/Installation.md b/docs/v3.2.0/Installation.md deleted file mode 100644 index 0e06f9d2..00000000 --- a/docs/v3.2.0/Installation.md +++ /dev/null @@ -1,37 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.0.0 -``` -See 
[pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 setup.py install -``` -To see a list of all available tags, run: -``` -git tag -l -``` -If you have an older clone of the repository, but don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v3.2.0/MatchIds.md b/docs/v3.2.0/MatchIds.md deleted file mode 100644 index 29e070ae..00000000 --- a/docs/v3.2.0/MatchIds.md +++ /dev/null @@ -1,27 +0,0 @@ -Using `bench` MatchId -===================== - -MatchIds are used to tie base/comparison calls together. MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. -Where: - - chunkid - Unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were - compared. 
- - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't have `--multimatch` is straightforward. For example, to tie tp-base.vcf and tp-call.vcf calls together, they will have identical MatchIds. - -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g. `5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -Tying false positives/negatives to their closest matching counterparts is a little trickier. -For example, for a false negative with MatchId `3.1.0`, we'll need to look in both the `fp.vcf` and `tp-call.vcf` for -`3.*.0`. - -When using `--multimatch` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-call 3.0.0 -tp-call 3.0.1 -``` -Here, the 0th base/comp calls match to one another. Additionally, the 1st comp call `3.0.1` also matches to the 0th base call `tp-base 3.0.0`. \ No newline at end of file diff --git a/docs/v3.2.0/Multi-allelic-VCFs.md b/docs/v3.2.0/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v3.2.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in the base and comparison calls. If VCFs contain multi-allelic sites: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.2.0/Resources.md b/docs/v3.2.0/Resources.md deleted file mode 100644 index 25a76ce1..00000000 --- a/docs/v3.2.0/Resources.md +++ /dev/null @@ -1,3 +0,0 @@ -*page in progress* - -Use `git-lfs` to pull the resources.
Currently only useful for `truvari anno trf` \ No newline at end of file diff --git a/docs/v3.2.0/The--multimatch-parameter.md b/docs/v3.2.0/The--multimatch-parameter.md deleted file mode 100644 index bee54a12..00000000 --- a/docs/v3.2.0/The--multimatch-parameter.md +++ /dev/null @@ -1,3 +0,0 @@ -By default, Truvari bench will only allow a one-to-one match between a base call and comparison call. The motivation for this is to be more strict when trying to identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, where the rest will be FPs. - -With the `--multimatch` parameter, we can get around this by allowing all of the redundant matches of base and comparison calls. This should have the effect of increasing the reported recall and specificity as redundantly used calls are only counted once for the final summary metrics. \ No newline at end of file diff --git a/docs/v3.2.0/Updates.md b/docs/v3.2.0/Updates.md deleted file mode 100644 index 9aea7fea..00000000 --- a/docs/v3.2.0/Updates.md +++ /dev/null @@ -1,86 +0,0 @@ -# Truvari 3.2 -*In Progress* - -* Removed `truvari.copy_entry` for `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match.
- -* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively.
- -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported... and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory.
It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. 
In order to keep up with the end-of-life of Python 2.7 https://pythonclock.org/ -We're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v3.2.0/anno.md b/docs/v3.2.0/anno.md deleted file mode 100644 index 3bdd8013..00000000 --- a/docs/v3.2.0/anno.md +++ /dev/null @@ -1,400 +0,0 @@ - -Truvari annotations: - gcpct, gtcnt, trf, grm, repmask, remap, hompct, numneigh, svinfo, bpovl, density - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF.
- -This requires the [trf executable](http://tandem.bu.edu/trf/trf.download.html), which can either be in your environment `$PATH` or can be directly pointed to through the command line parameters. This also uses SimpleRepeat tracks over the reference genome (e.g. [`resources/`](https://github.com/spiralgenetics/truvari/tree/develop/resources)). - -First, every SV is parsed and intersected with the SimpleRepeat track. If the SV hits a SimpleRepeat region, the SV will be annotated. Furthermore, we build the haplotype sequence of the alternate allele and the SimpleRepeat region and feed that sequence into tandem repeat finder. We then intersect the repeats TRF found with the repeats found in the SimpleRepeat track intersection. If the same repeat is found in both, we annotate with the TRF information as well as reporting the difference in the number of copies in alternate allele's haplotype sequence via TRFDiff. - -INFO fields added -* TRF - Entry hits a simple repeat region -* TRFDiff - Simple repeat copy difference -* TRFperiod - Period size of the repeat -* TRFcopies - Number of copies aligned with the consensus pattern. 
-* TRFscore - TRF Alignment score -* TRFentropy - TRF Entropy measure -* TRFrepeat - TRF Repeat found on entry - -``` -usage: trf [-h] [-i INPUT] [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] - [-s SIMPLE_REPEATS] [-f REFERENCE] [-m MIN_LENGTH] [-M MAX_LENGTH] - [-t THREADS] [-C CHUNK_SIZE] [--debug] - -Intersect vcf with reference simple repeats and report -how an alternate allele affects the copies using TRF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 - -h -ngs) - -s SIMPLE_REPEATS, --simple-repeats SIMPLE_REPEATS - Simple repeats bed - -f REFERENCE, --reference REFERENCE - Reference fasta file - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of sequence to run through trf (10000) - -t THREADS, --threads THREADS - Number of threads to use (48) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization - (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not add traditional annotations to the VCF, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) alleles, we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_".
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] - [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (48) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] - [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to
send to RepeatMasker (-pa - {threads} -e hmmer -species human -lcambig -nocut -div - 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (48) - --debug Verbose logging -``` - -# truvari anno remap - -Taking the allele's sequence, remap it to the reference and annotate based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] - [-t THRESHOLD] [-d DIST] [--debug] - -Remap a VCF's allele sequences to the reference to annotate REMAP - -novel - Allele has no hits in reference -tandem - Allele's closest hit is within len(allele) bp of the SV's position -interspersed - Allele's closest hit is not tandem -partial - Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: ---minlength - minimum SV length to be considered (50) ---dist - For deletion SVs, do not consider alignments that hit within Nbp of the SV's position - (a.k.a.
alignments back to the source sequence) (10) ---threshold - Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit - (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's - position to be considered (10) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] - [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs alongside SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-i INPUT] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] - [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, --- is a call, refdist is 2 - - - - - - - -nn: 1 2 1 0 1 1 -id: 0 0 0 1 2 2 - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location
distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-i INPUT] [-o OUTPUT] [-m MINSIZE] - -Adds SVTYPE and SVLEN INFO fields - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similarly to database tables and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(genes.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] [-i INPUT] -a ANNO -o OUTPUT [--spanmin SPANMIN] [--spanmax SPANMAX] - [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] [-e END] [-1] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --spanmin SPANMIN Minimum span of SVs to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines starting with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap each. Annotates -regions with no variants as 'sparse' and regions with greater than or equal to (mean + `--threshold` * standard -deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`.
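The dense/sparse rule described above can be sketched in a few lines. This is an illustrative sketch only; `classify_windows` is a hypothetical helper, not Truvari's implementation, and it labels a list of per-window variant counts rather than reading a VCF.

```python
from statistics import mean, stdev

def classify_windows(counts, threshold=3.0):
    """Label per-window variant counts: 0 variants -> 'sparse',
    >= mean + threshold * stdev -> 'dense', otherwise unlabeled."""
    cutoff = mean(counts) + threshold * stdev(counts)
    labels = []
    for count in counts:
        if count == 0:
            labels.append("sparse")
        elif count >= cutoff:
            labels.append("dense")
        else:
            labels.append("")
    return labels
```

With the default `threshold=3`, only extreme outlier windows are flagged 'dense'; lowering the threshold flags more windows.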
- -``` -usage: density [-h] -g GENOME [-i INPUT] -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-t THRESHOLD] - -Identify 'dense' and 'sparse' variant windows of the genome - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` \ No newline at end of file diff --git a/docs/v3.2.0/bench.md b/docs/v3.2.0/bench.md deleted file mode 100644 index 69f89d80..00000000 --- a/docs/v3.2.0/bench.md +++ /dev/null @@ -1,233 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -f reference.fa -o output_dir/ -``` - -Outputs -======= -Truvari bench creates an output directory with the following files. - - - - - - - - - -
FileDescription
tp-call.vcfAnnotated true positive calls from the comparison VCF
tp-base.vcfAnnotated true positive calls from the base VCF
fp.vcfFalse positive calls from comparison
fn.vcfFalse negative calls from base
summary.txtJson output of performance stats
log.txtRun's logging output
giab_report.txt(optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below.
giab.jl(optional) DataFrame of benchmark results.
- -summary.txt -=========== -The following stats are generated for benchmarking your call set and written to summary.txt as a json dump. - - - - - - - - - - - - - - - - - -
Metric Definition
TP-base Number of matching calls from the base vcf
TP-call Number of matching calls from the comp vcf
FP Number of non-matching calls from the comp vcf
FN Number of non-matching calls from the base vcf
precision TP-call / (TP-call + FP)
recall TP-base / (TP-base + FN)
f1 2 * ((recall * precision) / (recall + precision))
base cnt Number of calls in the base vcf
call cnt Number of calls in the comp vcf
TP-call_TP-gt TP-call with genotype match
TP-call_FP-gt TP-call without genotype match
TP-base_TP-gt TP-base with genotype match
TP-base_FP-gt TP-base without genotype match
gt_concordance TP-call_TP-gt / (TP-call_TP-gt + TP-call_FP-gt)
gt_matrix Matrix of Base GT x Comp GT counts for each base call's best TP match
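The precision, recall, and f1 definitions in the table can be expressed directly. A minimal sketch of computing them from the four call counts (illustrative only; not Truvari's code):

```python
def bench_summary(tp_base, tp_call, fp, fn):
    """Compute performance stats from TP/FP/FN counts,
    following the summary.txt definitions above."""
    precision = tp_call / (tp_call + fp)
    recall = tp_base / (tp_base + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Note that TP-base and TP-call can differ (e.g. one base call matching two comparison calls), which is why precision uses the comp-side count and recall the base-side count.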
- -Matching Parameters =================== Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow so that the comparison is fair. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctsim=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt). - -Here is a rundown of each matching parameter.
ParameterDefaultDefinition
refdist 500 Maximum distance a comparison call's start/end may be from the base call's start/end
pctsim 0.7 Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Haplotype Sequences of Variants" below.
pctsize 0.7 Ratio of min(base_size, comp_size)/max(base_size, comp_size)
pctovl 0.0 Ratio of two calls' (overlapping bases)/(longest span)
typeignore False Types don't need to match to compare calls.
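The size and overlap ratios in the table can be computed directly. A sketch under the definitions above (hypothetical helpers, not Truvari's implementation); the results match the worked numbers in the diagrams that follow:

```python
def size_similarity(base_size, comp_size):
    # pctsize: min(base_size, comp_size) / max(base_size, comp_size)
    return min(base_size, comp_size) / max(base_size, comp_size)

def reciprocal_overlap(b_start, b_end, c_start, c_end):
    # pctovl: overlapping bases divided by the longest span
    overlap = max(0, min(b_end, c_end) - max(b_start, c_start))
    return overlap / max(b_end - b_start, c_end - c_start)
```

For the diagram's example calls, sizes 5bp and 4bp give 0.8 size similarity, and ranges [2,7) and [4,8) give 0.6 reciprocal overlap.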
- -Matching Parameter Diagrams =========================== Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctsim PCTSIM (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Comparing VCFs without sequence resolved calls ============================================== - -If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off -sequence comparison. The `--reference` does not need to be set when not using sequence comparison. - -Difference between --sizemin and --sizefilt =========================================== - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It should -be less than `sizemin` for edge case variants. - -For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls should be considered matching. If we instead removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. - -This does have the side effect of artificially inflating specificity. If that same 49bp call in the -above were below the similarity threshold, it would not be classified as a FP due to the `sizemin` -threshold.
So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Definition of annotations added to TP vcfs -========================================== - - - - - - - - - - - - -
Anno Definition
TruScore Truvari score for similarity of match. `((pctsim + pctsize + pctovl) / 3 * 100)`
PctSeqSimilarity Pct sequence similarity between this variant and its closest match
PctSizeSimilarity Pct size similarity between this variant and its closest match
PctRecOverlap Percent reciprocal overlap of the two calls
StartDistance Distance of the base call's start from comparison call's start
EndDistance Distance of the base call's end from comparison call's end
SizeDiff Difference in size of base and comp calls
GTMatch Base/comp calls' Genotypes match
MatchId Id to help tie base/comp calls together {chunkid}.{baseid}.{compid}
Multi Call is false due to non-multimatching
- -See [[MatchIds wiki|MatchIds]] for details. - -Using the GIAB Report ===================== - -When running against the GIAB SV benchmark (link below), you can automatically add a text file `giab_report.txt` to the output directory containing calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. This also adds a DataFrame `giab.jl` (similar to running `truvari vcf2df --bench-dir`). - -See the [Recipes Discussions](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes) on how to make interesting summaries and plots. - -While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work -will need to be done to ensure Truvari can parse future GIAB SV releases. - -[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB) - -ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/ - -Include Bed & VCF Header Contigs ================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs.
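The containment rule above can be sketched as an interval lookup. This is an illustrative sketch, not Truvari's implementation; `contained` is a hypothetical helper that assumes the include bed's regions are sorted and non-overlapping:

```python
import bisect

def contained(regions, start, end):
    """Return True if [start, end) falls entirely inside one region.
    `regions` is a sorted list of (start, end) tuples from the include bed."""
    starts = [r[0] for r in regions]
    # Rightmost region whose start is at or before the call's start
    i = bisect.bisect_right(starts, start) - 1
    return i >= 0 and end <= regions[i][1]
```

Unlike `bedtools intersect -u`, a call overhanging a region's edge is excluded by this rule, since both its start and end must land in the same region.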
- - -Comparing Haplotype Sequences of Variants ========================================= - -While many SV overlapping approaches match changes within regions with metrics such as reciprocal overlap and -start/end distances, few account for the possibility of representing the same sequence change in multiple ways. - -For example, consider a tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' -sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before applying the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information -between variants in order to allow for more flexibility in complex multi-allelic representations. - -Methodology =========== Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
-``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* if multimatch: -** return the max of each row (base) and of each column (call) from the match matrix -* else: -** ravel and sort the match matrix -** return the first use of each call in its current state (True or False) -** return the second use of each call with state=False and set Multi=True -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v3.2.0/collapse.md b/docs/v3.2.0/collapse.md deleted file mode 100644 index 001e8f69..00000000 --- a/docs/v3.2.0/collapse.md +++ /dev/null @@ -1,147 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) via: -```bash -bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. 
- -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference/ -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior ================= When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---hap mode ========== When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1. - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode ============ When using `--chain` mode, instead of collapsing only the variants which match the first variant, -we'll collapse all variants in a matching set together. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations =========== `collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`.
-- `CollapseId` - the identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] - [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM] - [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] - [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] - [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds -All variants outside size boundaries will be placed into the output - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null from all - neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B BUFFER, --buffer BUFFER - Percent of the reference span to buffer the haplotype sequence created - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Minimum pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v3.2.0/consistency.md b/docs/v3.2.0/consistency.md deleted file mode 100644 index b8ccda5c..00000000 --- a/docs/v3.2.0/consistency.md +++ /dev/null @@ -1,74 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running ======= - -``` -usage: consistency [-h] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit -``` -Example ======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries ================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
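The group-counting described in the report below can be sketched from the matching keys. This is an illustrative sketch, not Truvari's code; it assumes one set of `CHROM:POS ID REF ALT` keys has already been extracted per VCF:

```python
from collections import Counter

def consistency_groups(vcf_keys):
    """Count how many calls fall into each presence/absence group,
    e.g. '101' = present in the 1st and 3rd VCFs only."""
    all_keys = set().union(*vcf_keys)
    groups = Counter()
    for key in all_keys:
        group = "".join("1" if key in keys else "0" for keys in vcf_keys)
        groups[group] += 1
    return groups
```

Summing the counts of groups with a single '1' gives the calls unique to one VCF; the group of all '1's gives the fully shared calls.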
- -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the fewest calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v3.2.0/divide.md b/docs/v3.2.0/divide.md deleted file mode 100644 index 4cedd268..00000000 --- a/docs/v3.2.0/divide.md +++ /dev/null @@ -1,39 +0,0 @@ -Divide a VCF into independent parts. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser.
However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts which can then be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever truvari command we need. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples. First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple script - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands.
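If you'd rather stay in Python, the same fan-out can be sketched with the standard library. This is a hypothetical driver script assuming the directory layout and `reference.fa` file from the bash example above, with `truvari` available on your PATH:

```python
import glob
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_cmd(in_vcf):
    """Build the truvari collapse command and log path for one sub-VCF."""
    base = os.path.basename(in_vcf)[:-len(".vcf.gz")]
    cmd = ["truvari", "collapse", "-i", in_vcf,
           "-o", f"output_vcfs/{base}.vcf",
           "-c", f"collap_vcfs/{base}.vcf",
           "-f", "reference.fa"]
    return cmd, f"logs/{base}.log"

def run_one(in_vcf):
    cmd, log_path = build_cmd(in_vcf)
    with open(log_path, "w") as log:
        return subprocess.call(cmd, stdout=log, stderr=log)

if __name__ == "__main__":
    for dname in ("output_vcfs", "collap_vcfs", "logs"):
        os.makedirs(dname, exist_ok=True)
    sub_vcfs = glob.glob("sub_vcfs_directory/*.vcf.gz")
    # Threads suffice here; the heavy lifting happens in the
    # truvari subprocesses, one per sub-VCF
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        exit_codes = list(pool.map(run_one, sub_vcfs))
```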
\ No newline at end of file diff --git a/docs/v3.2.0/segment.md b/docs/v3.2.0/segment.md deleted file mode 100644 index 99514df0..00000000 --- a/docs/v3.2.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. - -This tool adds an INFO field `SEGCNT` which holds the number of SVs overlapping the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`. - -In the below example, the new annotations would be: - -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - diff --git a/docs/v3.2.0/vcf2df.md b/docs/v3.2.0/vcf2df.md deleted file mode 100644 index cba5c8a4..00000000 --- a/docs/v3.2.0/vcf2df.md +++ /dev/null @@ -1,73 +0,0 @@ -We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib.
The resulting DataFrame will always have the columns: -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of whether the filter is empty or PASS - -Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist set operations on variants. - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per line, so there are only 2 alleles described per line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`.
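The column-naming scheme just described can be sketched as follows. This is an illustration of the documented behavior, not `vcf2df`'s actual source, and the field values are invented:

```python
def expand_columns(field, number, values):
    """Map a FORMAT field's list of values to named DataFrame columns
    based on the header's Number definition."""
    if number == "A":      # one value per allele -> ref/alt pair
        names = [f"{field}_ref", f"{field}_alt"]
    elif number == "G":    # one value per genotype -> ref/het/hom
        names = [f"{field}_ref", f"{field}_het", f"{field}_hom"]
    else:                  # scalar fields keep a single column
        names = [field]
    return dict(zip(names, values))

ad = expand_columns("AD", "A", [20, 12])
# {'AD_ref': 20, 'AD_alt': 12}
pl = expand_columns("PL", "G", [0, 30, 200])
# {'PL_ref': 0, 'PL_het': 30, 'PL_hom': 200}
```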
`vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there's multiple parameters used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-m] [-S] [--debug] VCF JL - -Takes a vcf and creates a data frame. 
Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -m, --multisample Parse multiple samples. Splits -s by comma. Sample column names will be flattened - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.3.0/Citations.md b/docs/v3.3.0/Citations.md deleted file mode 100644 index b72e3817..00000000 --- a/docs/v3.3.0/Citations.md +++ /dev/null @@ -1,31 +0,0 @@ -# Citing Truvari - -Pre-print on Biorxiv while in submission: - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
-* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted 
nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) diff --git a/docs/v3.3.0/Comparing-two-SV-programs.md b/docs/v3.3.0/Comparing-two-SV-programs.md deleted file mode 100644 index f9ca5e0b..00000000 --- a/docs/v3.3.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,63 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf.
- -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Calculate the overlap -===================== - -One may wish for summary numbers of how many calls are shared/unique between the two programs. -Truvari has a program to help. See [[Consistency-report|consistency]] for details. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ... 
-``` \ No newline at end of file diff --git a/docs/v3.3.0/Development.md b/docs/v3.3.0/Development.md deleted file mode 100644 index 0774b2c8..00000000 --- a/docs/v3.3.0/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. 
- -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. - -# docs - -The github wiki serves the documentation most relevant to the `develop` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Freeze the Wiki -4) Ensure all code is checked in -5) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -6) Use github action to make a testpypi release -7) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -8) Use GitHub action to make a pypi release -9) Change Updates Wiki -10) Download release-tarball.zip from step #8’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v3.3.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v3.3.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v3.3.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different from the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v3.3.0/GIAB-stratifications.sh.md b/docs/v3.3.0/GIAB-stratifications.sh.md deleted file mode 100644 index 7666c641..00000000 --- a/docs/v3.3.0/GIAB-stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. 
See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above by taking a Truvari bench output directory as the first argument and TR annotations (found in the link above) to create the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment. - - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v3.3.0/Home.md b/docs/v3.3.0/Home.md deleted file mode 100644 index 3bd133bf..00000000 --- a/docs/v3.3.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble.
- -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[bench|bench]] - - [[The multimatch parameter|The--multimatch-parameter]] - - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]] - - [[Multi-allelic VCFs|Multi-allelic-VCFs]] - - [[stratifications.sh|GIAB-stratifications.sh]] - - [[Comparing two SV programs|Comparing-two-SV-programs]] - - [[consistency|consistency]] - - [[anno|anno]] - - [[collapse|collapse]] - - [[vcf2df|vcf2df]] - - [[segment|segment]] -- [[Development|Development]] -- [[Citations|Citations]] -- [[Resources|Resources]] \ No newline at end of file diff --git a/docs/v3.3.0/Installation.md b/docs/v3.3.0/Installation.md deleted file mode 100644 index bcaeabaa..00000000 --- a/docs/v3.3.0/Installation.md +++ /dev/null @@ -1,47 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.0.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 setup.py install -``` -To see a list of all available tags, run: -``` -git tag -l -``` -If you have an older clone of the repository, but don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba -===== -Users can follow instructions here (https://mamba.readthedocs.io/en/latest/installation.html) to install mamba. 
(A faster, conda-compatible alternative package manager.) - -Creating an environment with Truvari and its dependencies: -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` -Note: as of May 4, 2022 mamba is up to date with Truvari's v3.2.0 release. Until we implement an auto-deployment to our github actions, mamba official releases may lag behind. - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v3.3.0/MatchIds.md b/docs/v3.3.0/MatchIds.md deleted file mode 100644 index 29e070ae..00000000 --- a/docs/v3.3.0/MatchIds.md +++ /dev/null @@ -1,27 +0,0 @@ -Using `bench` MatchId -===================== - -MatchIds are used to tie base/comparison calls together. MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. -Where: - - chunkid - Unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were - compared. - - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't have `--multimatch` is straightforward. For example, to tie tp-base.vcf and tp-call.vcf calls together, they will have identical MatchIds.
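Working with that `{chunkid}.{baseid}.{compid}` structure can be sketched as below. The helper is ours for illustration and not part of the truvari API:

```python
def parse_match_id(match_id):
    """Split a '{chunkid}.{baseid}.{compid}' MatchId into its parts.

    A '_' baseid or compid means the chunk had no counterpart call,
    so those fields come back as None.
    """
    chunk, base, comp = match_id.split(".")
    to_num = lambda field: None if field == "_" else int(field)
    return {"chunk": int(chunk), "base": to_num(base), "comp": to_num(comp)}

# tp-base.vcf and tp-call.vcf entries pair up when MatchIds are identical
first = parse_match_id("3.1.0")
# the first comparison call from the fifth chunk, which had no base calls
orphan = parse_match_id("5._.0")
```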
- -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g. `5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -Tying a false positive/negative to its closest matching counterpart is a little trickier. -For example, for a false negative with MatchId `3.1.0`, we'll need to look in both the `fp.vcf` and `tp-call.vcf` for -`3.*.0`. - -When using `--multimatch` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-call 3.0.0 -tp-call 3.0.1 -``` -Here, the 0th base/comp calls match one another. Additionally, the 1st comp call `3.0.1` also matches the 0th base call `tp-base 3.0.0`. \ No newline at end of file diff --git a/docs/v3.3.0/Multi-allelic-VCFs.md b/docs/v3.3.0/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v3.3.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in the base and comparison calls. If VCFs contain multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.3.0/Resources.md b/docs/v3.3.0/Resources.md deleted file mode 100644 index 25a76ce1..00000000 --- a/docs/v3.3.0/Resources.md +++ /dev/null @@ -1,3 +0,0 @@ -*page in progress* - -Use `git-lfs` to pull the resources. Currently only useful for `truvari anno trf` \ No newline at end of file diff --git a/docs/v3.3.0/The--multimatch-parameter.md b/docs/v3.3.0/The--multimatch-parameter.md deleted file mode 100644 index bee54a12..00000000 --- a/docs/v3.3.0/The--multimatch-parameter.md +++ /dev/null @@ -1,3 +0,0 @@ -By default, Truvari bench will only allow a one-to-one match between a base call and comparison call.
The motivation for this is to be more strict when trying to identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, while the rest will be FPs. - -With the `--multimatch` parameter, we can get around this by allowing all of the redundant matches of base and comparison calls. This should have the effect of increasing the reported recall and specificity as redundantly used calls are only counted once for the final summary metrics. \ No newline at end of file diff --git a/docs/v3.3.0/Updates.md b/docs/v3.3.0/Updates.md deleted file mode 100644 index 19c27944..00000000 --- a/docs/v3.3.0/Updates.md +++ /dev/null @@ -1,95 +0,0 @@ -# Truvari 3.3 -*In Progress* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` for `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly.
-* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match. -* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK.
See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match objects, for prettier code and cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grow, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC.
-* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC.
- -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the end-of-life of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v3.3.0/anno.md b/docs/v3.3.0/anno.md deleted file mode 100644 index 9593dda1..00000000 --- a/docs/v3.3.0/anno.md +++ /dev/null @@ -1,434 +0,0 @@ - -Truvari annotations: - gcpct, gtcnt, trf, grm, repmask, remap, hompct, numneigh, svinfo, bpovl, density, dpcnt, lcr - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF, giving the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed.
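As a rough illustration of what goes into `GCPCT`, the GC percent is just the share of G/C bases in the analyzed sequence (the helper name and rounding below are illustrative, not Truvari's internals):

```python
def gc_pct(seq):
    """Percent of G/C bases in a sequence (illustrative helper,
    not Truvari's actual implementation)."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

# For an insertion, the ALT sequence would be analyzed
print(gc_pct("ACGTGCGC"))  # 75.0
```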
-``` -usage: gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -This requires the [trf executable](http://tandem.bu.edu/trf/trf.download.html), which can either be in your environment `$PATH` or can be directly pointed to through the command line parameters. This also uses SimpleRepeat tracks over the reference genome (e.g. [`resources/`](https://github.com/spiralgenetics/truvari/tree/develop/resources)). - -First, every SV is parsed and intersected with the SimpleRepeat track. If the SV hits a SimpleRepeat region, the SV will be annotated. Furthermore, we build the haplotype sequence of the alternate allele and the SimpleRepeat region and feed that sequence into tandem repeat finder. We then intersect the repeats TRF found with the repeats found in the SimpleRepeat track intersection. If the same repeat is found in both, we annotate with the TRF information as well as reporting the difference in the number of copies in alternate allele's haplotype sequence via TRFDiff. 
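To build intuition for the TRFDiff value, the copy-number difference is approximately the SV's length change divided by the repeat's period size. This is a simplification (TRF derives copies from its own alignment), and the helper below is hypothetical:

```python
def trf_copy_diff(svlen, period):
    """Approximate repeat copy-number change: length change / motif period.
    Simplified illustration; TRF reports copies from its own alignment."""
    return svlen / period

# A 12bp insertion inside a repeat with a 4bp motif adds ~3 copies
print(trf_copy_diff(12, 4))  # 3.0
```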
- -INFO fields added -* TRF - Entry hits a simple repeat region -* TRFDiff - Simple repeat copy difference -* TRFperiod - Period size of the repeat -* TRFcopies - Number of copies aligned with the consensus pattern. -* TRFscore - TRF Alignment score -* TRFentropy - TRF Entropy measure -* TRFrepeat - TRF Repeat found on entry - -``` -usage: trf [-h] [-i INPUT] [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] - [-s SIMPLE_REPEATS] [-f REFERENCE] [-m MIN_LENGTH] [-M MAX_LENGTH] - [-t THREADS] [-C CHUNK_SIZE] [--debug] - -Intersect vcf with reference simple repeats and report -how an alternate allele affects the copies using TRF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 - -h -ngs) - -s SIMPLE_REPEATS, --simple-repeats SIMPLE_REPEATS - Simple repeats bed - -f REFERENCE, --reference REFERENCE - Reference fasta file - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of sequence to run through trf (10000) - -t THREADS, --threads THREADS - Number of threads to use (48) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization - (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both reference (r), alternate (a) we create upstream (up) and downstream (dn) kmers.
-So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". - -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min) - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] - [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (48) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] - [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of
allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa - {threads} -e hmmer -species human -lcambig -nocut -div - 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (48) - --debug Verbose logging -``` - -# truvari anno remap - -Takes the allele's sequence, remaps it to the reference, and annotates based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] - [-t THRESHOLD] [-d DIST] [--debug] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -novel - Allele has no hits in reference -tandem - Allele's closest hit is within len(allele) bp of the SV's position -interspersed - Allele's closest hit is not tandem -partial - Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: ---minlength - minimum SV length to considered (50) ---dist - For deletion SVs, do not consider alignments that hit within Nbp of the SV's position - (a.k.a.
alignments back to the source sequence) (10) ---threshold - Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit - (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's - position to be considered (10) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] - [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs besides SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-i INPUT] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] - [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, --- is a call, refdist is 2 - - - - - - - -nn: 1 2 1 0 1 1 -id: 0 0 0 1 2 2 - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location
distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is a symbolic SV (e.g. `<DEL>`, `<INS>`, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-i INPUT] [-o OUTPUT] [-m MINSIZE] - -Adds SVTYPE and SVLEN INFO fields - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(genes.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] [-i INPUT] -a ANNO -o OUTPUT [--spanmin SPANMIN] [--spanmax SPANMAX] - [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] [-e END] [-1] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --spanmin SPANMIN Minimum span of SVs to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and count how many variants overlap. Annotate -regions with no variants as 'sparse' and with greater than or equal to (mean + `--threshold` * standard -deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`. 
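The sparse/dense labeling can be sketched as follows, assuming the mean and standard deviation are taken over all window counts (a simplified stand-in for the actual implementation):

```python
import statistics

def label_windows(counts, threshold=3):
    """Label windows 'sparse' (no variants) or 'dense'
    (count >= mean + threshold * std); sketch of the density logic."""
    cutoff = statistics.fmean(counts) + threshold * statistics.pstdev(counts)
    return ["sparse" if c == 0 else "dense" if c >= cutoff else ""
            for c in counts]

print(label_windows([0, 1, 1, 1, 100], threshold=1))
```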
- -``` -usage: density [-h] -g GENOME [-i INPUT] -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-t THRESHOLD] - -Identify 'dense' and 'sparse' variant windows of the genome - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` for each of the user-defined bins. Bins are incremented using `bisect`, e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1`. - -``` -usage: dpcnt [-h] [-i INPUT] [-b BINS] [-o OUTPUT] - -Quick utility to count how many samples have >= Nx coverage per-variant - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-i INPUT] [-o OUTPUT] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` \ No newline at end of file diff --git a/docs/v3.3.0/bench.md b/docs/v3.3.0/bench.md deleted file mode 100644 index 5424360e..00000000 --- a/docs/v3.3.0/bench.md +++ /dev/null @@ -1,239 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs.
-```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -f reference.fa -o output_dir/ -``` - -Outputs -======= -Truvari bench creates an output directory with the following files. - - - - - - - - - -
| File | Description |
|------|-------------|
| tp-call.vcf | Annotated true positive calls from the comparison VCF |
| tp-base.vcf | Annotated true positive calls from the base VCF |
| fp.vcf | False positive calls from comparison |
| fn.vcf | False negative calls from base |
| summary.txt | Json output of performance stats |
| log.txt | Run's logging output |
| giab_report.txt | (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below. |
| giab.jl | (optional) DataFrame of benchmark results. |
- -summary.txt -=========== -The following stats are generated for benchmarking your call set and written to summary.txt as a json dump. - - - - - - - - - - - - - - - - - -
| Metric | Definition |
|--------|------------|
| TP-base | Number of matching calls from the base vcf |
| TP-call | Number of matching calls from the comp vcf |
| FP | Number of non-matching calls from the comp vcf |
| FN | Number of non-matching calls from the base vcf |
| precision | TP-call / (TP-call + FP) |
| recall | TP-base / (TP-base + FN) |
| f1 | 2 * ((recall * precision) / (recall + precision)) |
| base cnt | Number of calls in the base vcf |
| call cnt | Number of calls in the comp vcf |
| TP-call_TP-gt | TP-call with genotype match |
| TP-call_FP-gt | TP-call without genotype match |
| TP-base_TP-gt | TP-base with genotype match |
| TP-base_FP-gt | TP-base without genotype match |
| gt_concordance | TP-call_TP-gt / (TP-call_TP-gt + TP-call_FP-gt) |
| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match |
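The derived metrics above can be recomputed directly from the raw counts, for example:

```python
def performance(tp_base, tp_call, fp, fn):
    """Recompute precision/recall/f1 exactly as defined in summary.txt."""
    precision = tp_call / (tp_call + fp)
    recall = tp_base / (tp_base + fn)
    f1 = 2 * ((recall * precision) / (recall + precision))
    return precision, recall, f1

# e.g. 90 of 100 base calls matched; 92 of 100 comp calls matched
precision, recall, f1 = performance(tp_base=90, tp_call=92, fp=8, fn=10)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.92 0.9 0.91
```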
- -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctsim=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt). - -Here is a rundown of each matching parameter. - - - - - - -
| Parameter | Default | Definition |
|-----------|---------|------------|
| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end |
| pctsim | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Haplotype Sequences of Variants" below. |
| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
| typeignore | False | Types don't need to match to compare calls. |
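The size and overlap ratios are simple enough to sketch directly from their definitions (hypothetical helpers, using half-open coordinates):

```python
def pctsize(base_size, comp_size):
    """Size similarity: min(base_size, comp_size) / max(base_size, comp_size)."""
    return min(base_size, comp_size) / max(base_size, comp_size)

def pctovl(b_start, b_end, c_start, c_end):
    """Reciprocal overlap: overlapping bases / longest span."""
    overlap = max(0, min(b_end, c_end) - max(b_start, c_start))
    return overlap / max(b_end - b_start, c_end - c_start)

# A 5bp and a 4bp call at ranges [2,7) and [4,8)
print(pctsize(5, 4))       # 0.8
print(pctovl(2, 7, 4, 8))  # 0.6
```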
- -Matching Parameter Diagrams -=========================== -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctsim PCTSIM (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Comparing VCFs without sequence resolved calls -============================================== - -If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off -sequence comparison. The `--reference` does not need to be set when not using sequence comparison. - -Difference between --sizemin and --sizefilt -=========================================== - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It should -be less than `sizemin` for edge case variants. - -For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls should be considered matching. If we instead removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. - -This does have the side effect of artificially inflating specificity. If that same 49bp call in the -above were below the similarity threshold, it would not be classified as a FP due to the `sizemin` -threshold.
So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Definition of annotations added to vcfs -========================================== - - - - - - - - - - - - -
| Anno | Definition |
|------|------------|
| TruScore | Truvari score for similarity of match. `((pctsim + pctsize + pctovl) / 3 * 100)` |
| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
| PctSizeSimilarity | Pct size similarity between this variant and its closest match |
| PctRecOverlap | Percent reciprocal overlap of the two calls |
| StartDistance | Distance of the base call's start from comparison call's start |
| EndDistance | Distance of the base call's end from comparison call's end |
| SizeDiff | Difference in size of base and comp calls |
| GTMatch | Base/comp calls' genotypes match |
| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} |
| Multi | Call is false due to non-multimatching |
- -See [[MatchIds wiki|MatchIds]] for details. - -Using the GIAB Report -===================== - -When running against the GIAB SV benchmark (link below), you can automatically add a text file `giab_report.txt` to the output directory containing calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. This also adds a DataFrame `giab.jl` (similar to running `truvari vcf2df --bench-dir`). - -See the [Recipes Discussions](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes) on how to make interesting summaries and plots. - -While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work -will need to be done to ensure Truvari can parse future GIAB SV releases. - -[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB) - -ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/ - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs.
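The containment rule can be illustrated with a small sketch (hypothetical helper, not Truvari's code):

```python
def in_include_region(start, end, regions):
    """True only if BOTH start and end fall within a single region,
    mirroring --includebed's containment requirement; bedtools
    intersect -u would keep any call with partial overlap."""
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)

regions = [(1000, 2000)]
print(in_include_region(1500, 1800, regions))  # True - fully contained
print(in_include_region(1900, 2100, regions))  # False - end extends past the region
```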
- -Extending an Include Bed -======================== - -The option `--extend` extends the regions of interest (set in `--includebed` argument) by the given number of bases on each side, allowing truth set variants to match call set variants that are just outside of the original region. If a call set variant is in the extended regions it can potentially match a truth set variant that is in the original regions turning it to TP, but call set variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching - only the truth set variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison set variants of size sizefilt (30 by default) or longer. - -See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details. - -Comparing Haplotype Sequences of Variants -========================================= - -While many SV overlapping approaches match changes within regions with metrics such as reciprocal overlap and -start/end distances, few account for the possibility of representing the same sequence change in multiple ways. - -For example, consider a tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' -sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele.
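The tandem-repeat example above can be checked with a short sketch: all three representations build the same haplotype, so their sequence similarity is 1.0 (here difflib stands in for Truvari's edlib-based edit-distance ratio):

```python
from difflib import SequenceMatcher  # stand-in for edlib's edit-distance ratio


def haplotype(ref, pos, ins):
    """Apply an insertion at 0-based pos across the whole reference window."""
    return ref[:pos] + ins + ref[pos:]


ref = "AB"
# The three equally valid representations of the expansion
haps = [haplotype(ref, 0, "ab"), haplotype(ref, 1, "ba"), haplotype(ref, 2, "ab")]
print(haps)  # ['abAB', 'AbaB', 'ABab']

# Case-insensitively, all three haplotypes are identical sequences
sim = SequenceMatcher(None, haps[0].upper(), haps[1].upper()).ratio()
print(sim)  # 1.0
```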
- -Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information -between variants in order to allow for more flexibility in complex multi-allelic representations. - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* if multimatch: -** return the max of each row (base) and of each column (call) from the match matrix -* else: -** ravel and sort the match matrix -** return the first use of each call in its current state (True or False) -** return the second use of each call with state=False and set Multi=True -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v3.3.0/collapse.md b/docs/v3.3.0/collapse.md deleted file mode 100644 index a9d236e5..00000000 --- a/docs/v3.3.0/collapse.md +++ /dev/null @@ -1,147 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) via: -```bash -bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf.gz -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt.
-For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. - -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior -================= -When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---hap mode -========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1. - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode -============ -When using `--chain` mode, instead of collapsing all variants with the first variant, -we'll collapse all variants in a matching set together. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 ..
- -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`. -- `CollapseId` - the identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] - [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM] - [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] - [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] - [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds -All variants outside size boundaries will be placed into the output - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to 
consolidate into the kept entry by taking the first non-null from all - neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. (0.95) - -B BUFFER, --buffer BUFFER - Percent of the reference span to buffer the haplotype sequence created - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Minimum pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v3.3.0/consistency.md b/docs/v3.3.0/consistency.md deleted file mode 100644 index b8ccda5c..00000000 --- a/docs/v3.3.0/consistency.md +++ /dev/null @@ -1,74 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. 
- -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit -``` -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. - -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf.
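The `Group` bit-keys can be reproduced with a short sketch; the `CHROM:POS ID REF ALT` keys below are made up for illustration:

```python
from collections import Counter

# Made-up match keys per file; real keys come from each VCF's entries
calls = {
    "fileA.vcf": {"chr1:1 . A T", "chr1:5 . G C"},
    "fileB.vcf": {"chr1:1 . A T", "chr1:9 . T A"},
    "fileC.vcf": {"chr1:1 . A T", "chr1:5 . G C"},
}

groups = Counter()
for key in set().union(*calls.values()):
    # One presence/absence bit per file, in the order the files were given
    group = "".join("1" if key in keys else "0" for keys in calls.values())
    groups[group] += 1

print(sorted(groups.items()))  # [('010', 1), ('101', 1), ('111', 1)]
```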
- -Finally, we see that fileA.vcf has the fewest calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v3.3.0/divide.md b/docs/v3.3.0/divide.md deleted file mode 100644 index 4cedd268..00000000 --- a/docs/v3.3.0/divide.md +++ /dev/null @@ -1,39 +0,0 @@ -Divide a VCF into independent parts. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts which can then be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently with whatever truvari command we need. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples.
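The splitting rule described above can be modeled with a toy sketch; real `truvari divide` operates on VCF records and handles edge cases (such as a short final batch) that this sketch ignores:

```python
def divide(positions, buffer=1000, min_count=100):
    """Group sorted variant positions into sub-VCF batches (toy model of the split rule)."""
    parts, current = [], []
    for pos in sorted(positions):
        # Start a new batch only at a gap wider than buffer, once the batch is big enough
        if current and pos - current[-1] > buffer and len(current) >= min_count:
            parts.append(current)
            current = []
        current.append(pos)
    if current:
        parts.append(current)
    return parts

print(divide([1, 5, 2000, 2010, 9000], buffer=1000, min_count=2))
# [[1, 5], [2000, 2010], [9000]]
```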
First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple script: - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands. \ No newline at end of file diff --git a/docs/v3.3.0/segment.md b/docs/v3.3.0/segment.md deleted file mode 100644 index 99514df0..00000000 --- a/docs/v3.3.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. - -This tool adds an INFO field `SEGCNT` which holds the number of SVs overlapping the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`.
- -In the below example, the new annotations would be: - -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - diff --git a/docs/v3.3.0/vcf2df.md b/docs/v3.3.0/vcf2df.md deleted file mode 100644 index cba5c8a4..00000000 --- a/docs/v3.3.0/vcf2df.md +++ /dev/null @@ -1,73 +0,0 @@ -We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of if the filter is empty or PASS - -Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist set operations on variants. - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. 
Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt` corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. `vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column: -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameters used for SV discovery over multiple samples.
After running `truvari bench` on each of the results with the output directories named `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-m] [-S] [--debug] VCF JL - -Takes a vcf and creates a data frame. Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -m, --multisample Parse multiple samples. Splits -s by comma. Sample column names will be flattened - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.4.0/Citations.md b/docs/v3.4.0/Citations.md deleted file mode 100644 index b72e3817..00000000 --- a/docs/v3.4.0/Citations.md +++ /dev/null @@ -1,31 +0,0 @@ -# Citing Truvari - -Pre-print on Biorxiv while in submission: - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari).
Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. -* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection 
from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) diff --git a/docs/v3.4.0/Comparing-two-SV-programs.md b/docs/v3.4.0/Comparing-two-SV-programs.md deleted file mode 100644 index f9ca5e0b..00000000 --- a/docs/v3.4.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,63 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. 
For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Calculate the overlap -===================== - -One may wish for summary numbers of how many calls are shared/unique between the two programs. -Truvari has a program to help. See [[Consistency-report|consistency]] for details. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf -c cmp_B/fp.vcf.gz -o shared_fps ...
-``` \ No newline at end of file diff --git a/docs/v3.4.0/Development.md b/docs/v3.4.0/Development.md deleted file mode 100644 index 0774b2c8..00000000 --- a/docs/v3.4.0/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. 
- -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hangs through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. 
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Freeze the Wiki -4) Ensure all code is checked in -5) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -6) Use github action to make a testpypi release -7) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -8) Use GitHub action to make a pypi release -9) Change Updates Wiki -10) Download release-tarball.zip from step #8’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v3.4.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v3.4.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v3.4.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v3.4.0/GIAB-stratifications.sh.md b/docs/v3.4.0/GIAB-stratifications.sh.md deleted file mode 100644 index 7666c641..00000000 --- a/docs/v3.4.0/GIAB-stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. 
See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above; it takes a Truvari bench output directory as the first argument and TR repeat annotations (found in the link above), then creates the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment. - - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v3.4.0/Home.md b/docs/v3.4.0/Home.md deleted file mode 100644 index 3bd133bf..00000000 --- a/docs/v3.4.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble.
- -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[bench|bench]] - - [[The multimatch parameter|The--multimatch-parameter]] - - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]] - - [[Multi-allelic VCFs|Multi-allelic-VCFs]] - - [[stratifications.sh|GIAB-stratifications.sh]] - - [[Comparing two SV programs|Comparing-two-SV-programs]] - - [[consistency|consistency]] - - [[anno|anno]] - - [[collapse|collapse]] - - [[vcf2df|vcf2df]] - - [[segment|segment]] -- [[Development|Development]] -- [[Citations|Citations]] -- [[Resources|Resources]] \ No newline at end of file diff --git a/docs/v3.4.0/Installation.md b/docs/v3.4.0/Installation.md deleted file mode 100644 index 59ec0505..00000000 --- a/docs/v3.4.0/Installation.md +++ /dev/null @@ -1,51 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.0.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 setup.py install -``` -To see a list of all available tags, run: -``` -git tag -l -``` -If you have an older clone of the repository, but don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Conda / Mamba -============= -Truvari releases are automatically deployed to bioconda. 
See the [conda page](https://anaconda.org/bioconda/truvari) for details. -``` -conda install -c bioconda truvari -``` - -Alternatively, users can follow the instructions at https://mamba.readthedocs.io/en/latest/installation.html to install mamba, a faster conda-compatible package manager. - -Create an environment with Truvari and its dependencies: -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. the current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See the [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v3.4.0/MatchIds.md b/docs/v3.4.0/MatchIds.md deleted file mode 100644 index 29e070ae..00000000 --- a/docs/v3.4.0/MatchIds.md +++ /dev/null @@ -1,27 +0,0 @@ -Using `bench` MatchId -===================== - -MatchIds are used to tie base/comparison calls together. MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. -Where: - - chunkid - Unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were - compared. - - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't have `--multimatch` is straightforward. For example, to tie tp-base.vcf and tp-call.vcf calls together, matched calls will have identical MatchIds.
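That pairing rule can be sketched with plain string handling. This is a minimal illustration, not Truvari code, and the MatchId values below are invented rather than taken from a real run:

```python
# Sketch of pairing bench calls by MatchId ({chunkid}.{baseid}.{compid}).
# The ids below are invented for illustration.
tp_base = ["0.0.0", "1.0.0", "2.1.0"]  # MatchIds seen in tp-base.vcf
tp_call = ["0.0.0", "1.0.0", "5._.0"]  # MatchIds seen in tp-call.vcf

# Without --multimatch, a base call and its comparison call share an identical MatchId
pairs = sorted(set(tp_base) & set(tp_call))
print(pairs)  # ['0.0.0', '1.0.0']

# A '_' marks a call whose chunk had no counterpart,
# e.g. '5._.0' is the first comp call of a chunk that had no base calls
unpaired = [m for m in tp_call if m.split(".")[1] == "_"]
print(unpaired)  # ['5._.0']

# Tying a FN such as '3.1.0' to its closest match means scanning for '3.*.0'
fn = "3.1.0"
chunkid, _, compid = fn.split(".")
candidates = [m for m in tp_call + tp_base
              if m.split(".")[0] == chunkid and m.split(".")[2] == compid]
```

In practice the ids would be read from the VCFs' INFO fields rather than hard-coded lists.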
- -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g. `5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -Tying false positives/negatives to their closest matching counterparts is a little trickier. -For example, for a false negative with MatchId `3.1.0`, we'll need to look in both the `fp.vcf` and `tp-call.vcf` for -`3.*.0`. - -When using `--multimatch` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-call 3.0.0 -tp-call 3.0.1 -``` -Here, the 0th base/comp calls match one another. Additionally, the 1st comp call `3.0.1` also matches the 0th base call `tp-base 3.0.0`. \ No newline at end of file diff --git a/docs/v3.4.0/Multi-allelic-VCFs.md b/docs/v3.4.0/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v3.4.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in the base and comparison calls. If your VCFs contain multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.4.0/Resources.md b/docs/v3.4.0/Resources.md deleted file mode 100644 index 25a76ce1..00000000 --- a/docs/v3.4.0/Resources.md +++ /dev/null @@ -1,3 +0,0 @@ -*page in progress* - -Use `git-lfs` to pull the resources. Currently only useful for `truvari anno trf` \ No newline at end of file diff --git a/docs/v3.4.0/The--multimatch-parameter.md b/docs/v3.4.0/The--multimatch-parameter.md deleted file mode 100644 index bee54a12..00000000 --- a/docs/v3.4.0/The--multimatch-parameter.md +++ /dev/null @@ -1,3 +0,0 @@ -By default, Truvari bench will only allow a one-to-one match between a base call and comparison call.
The motivation for this is to be more strict when trying to identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, while the rest will be FPs. - -With the `--multimatch` parameter, we can get around this by allowing all of the redundant matches of base and comparison calls. This should have the effect of increasing the reported recall and specificity as redundantly used calls are only counted once for the final summary metrics. \ No newline at end of file diff --git a/docs/v3.4.0/Updates.md b/docs/v3.4.0/Updates.md deleted file mode 100644 index 5fccac19..00000000 --- a/docs/v3.4.0/Updates.md +++ /dev/null @@ -1,110 +0,0 @@ -# Truvari 3.4 -*In Progress* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of the `consistency` report -* Allow GT to be missing, which is allowed by the VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly, which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see #128) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned.
If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Replaced `truvari.copy_entry` with `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building a `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and comparing REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match.
-* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match objects for prettier code and cleaner output. -* `anno remap` now optionally adds an `INFO` field with the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively.
- -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grow, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results, with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df`, where the default is vcf conversion with options to run on a `truvari bench` output directory.
It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. 
In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v3.4.0/anno.md b/docs/v3.4.0/anno.md deleted file mode 100644 index 9593dda1..00000000 --- a/docs/v3.4.0/anno.md +++ /dev/null @@ -1,434 +0,0 @@ - -Truvari annotations: - gcpct, gtcnt, trf, grm, repmask, remap, hompct, numneigh, svinfo, bpovl, density, dpcnt, lcr - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF.
- -This requires the [trf executable](http://tandem.bu.edu/trf/trf.download.html), which can either be in your environment `$PATH` or can be directly pointed to through the command line parameters. This also uses SimpleRepeat tracks over the reference genome (e.g. [`resources/`](https://github.com/spiralgenetics/truvari/tree/develop/resources)). - -First, every SV is parsed and intersected with the SimpleRepeat track. If the SV hits a SimpleRepeat region, the SV will be annotated. Furthermore, we build the haplotype sequence of the alternate allele and the SimpleRepeat region and feed that sequence into tandem repeat finder. We then intersect the repeats TRF found with the repeats found in the SimpleRepeat track intersection. If the same repeat is found in both, we annotate with the TRF information and report the difference in the number of copies in the alternate allele's haplotype sequence via TRFDiff. - -INFO fields added -* TRF - Entry hits a simple repeat region -* TRFDiff - Simple repeat copy difference -* TRFperiod - Period size of the repeat -* TRFcopies - Number of copies aligned with the consensus pattern.
-TRFscore - TRF Alignment score -TRFentropy - TRF Entropy measure -TRFrepeat - TRF Repeat found on entry - -``` -usage: trf [-h] [-i INPUT] [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] - [-s SIMPLE_REPEATS] [-f REFERENCE] [-m MIN_LENGTH] [-M MAX_LENGTH] - [-t THREADS] [-C CHUNK_SIZE] [--debug] - -Intersect vcf with reference simple repeats and report -how an alternate allele affects the copies using TRF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 - -h -ngs) - -s SIMPLE_REPEATS, --simple-repeats SIMPLE_REPEATS - Simple repeats bed - -f REFERENCE, --reference REFERENCE - Reference fasta file - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of sequence to run through trf (10000) - -t THREADS, --threads THREADS - Number of threads to use (48) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization - (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_".
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min) - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] - [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (48) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] - [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to
send to RepeatMasker (-pa - {threads} -e hmmer -species human -lcambig -nocut -div - 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (48) - --debug Verbose logging -``` - -# truvari anno remap - -Takes the allele's sequence, remaps it to the reference, and annotates based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] - [-t THRESHOLD] [-d DIST] [--debug] - -Remap a VCF's alleles' sequence to the reference to annotate REMAP - -novel - Allele has no hits in reference -tandem - Allele's closest hit is within len(allele) bp of the SV's position -interspersed - Allele's closest hit is not tandem -partial - Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: ---minlength - minimum SV length to be considered (50) ---dist - For deletion SVs, do not consider alignments that hit within Nbp of the SV's position - (a.k.a.
alignments back to the source sequence) (10) ---threshold - Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit - (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's - position to be considered (10) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] - [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs alongside SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-i INPUT] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] - [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, --- is a call, refdist is 2 - - - - - - - -nn: 1 2 1 0 1 1 -id: 0 0 0 1 2 2 - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location
distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-i INPUT] [-o OUTPUT] [-m MINSIZE] - -Adds SVTYPE and SVLEN INFO fields - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(genes.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] [-i INPUT] -a ANNO -o OUTPUT [--spanmin SPANMIN] [--spanmax SPANMAX] - [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] [-e END] [-1] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --spanmin SPANMIN Minimum span of SVs to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap each. Annotates -regions with no variants as 'sparse' and regions with greater than or equal to (mean + `--threshold` * standard -deviation) variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`.
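The dense/sparse rule can be sketched with the standard library. The per-window counts below are invented, and the use of population standard deviation (`pstdev`) is an assumption of this sketch rather than a statement about the command's internals:

```python
import statistics

# Invented per-window variant counts for one contig
counts = [0, 4, 5, 3, 0, 50, 4, 5, 6, 4, 5, 3, 4, 6, 5, 4, 0, 5, 4, 6]
threshold = 3  # --threshold: std deviations above the mean

cutoff = statistics.mean(counts) + threshold * statistics.pstdev(counts)

def label(count):
    # No variants -> 'sparse'; >= mean + threshold*std -> 'dense'; otherwise unannotated
    if count == 0:
        return "sparse"
    if count >= cutoff:
        return "dense"
    return ""

annos = [label(c) for c in counts]
print(annos[0], annos[5])  # sparse dense
```

With these counts the cutoff works out to roughly 37 variants per window, so only the 50-variant outlier is flagged 'dense' while the empty windows are 'sparse'.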
- -``` -usage: density [-h] -g GENOME [-i INPUT] -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-t THRESHOLD] - -Identify 'dense' and 'sparse' variant windows of the genome - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` in each of the user-defined bins. Bins are incremented using `bisect`, e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1`. - -``` -usage: dpcnt [-h] [-i INPUT] [-b BINS] [-o OUTPUT] - -Quick utility to count how many samples have >= Nx coverage per-variant - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-i INPUT] [-o OUTPUT] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` \ No newline at end of file diff --git a/docs/v3.4.0/bench.md b/docs/v3.4.0/bench.md deleted file mode 100644 index 5424360e..00000000 --- a/docs/v3.4.0/bench.md +++ /dev/null @@ -1,239 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs.
-```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -f reference.fa -o output_dir/ -``` - -Outputs -======= -Truvari bench creates an output directory with the following files. - - - - - - - - - -
| File | Description |
|-----------------|--------------------------------------------------------------------------------|
| tp-call.vcf | Annotated true positive calls from the comparison VCF |
| tp-base.vcf | Annotated true positive calls from the base VCF |
| fp.vcf | False positive calls from comparison |
| fn.vcf | False negative calls from base |
| summary.txt | Json output of performance stats |
| log.txt | Run's logging output |
| giab_report.txt | (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below. |
| giab.jl | (optional) DataFrame of benchmark results. |
summary.txt
===========
The following stats are generated for benchmarking your call set and written to summary.txt as a JSON dump.
| Metric | Definition |
|--------|------------|
| TP-base | Number of matching calls from the base vcf |
| TP-call | Number of matching calls from the comp vcf |
| FP | Number of non-matching calls from the comp vcf |
| FN | Number of non-matching calls from the base vcf |
| precision | TP-call / (TP-call + FP) |
| recall | TP-base / (TP-base + FN) |
| f1 | 2 * ((recall * precision) / (recall + precision)) |
| base cnt | Number of calls in the base vcf |
| call cnt | Number of calls in the comp vcf |
| TP-call_TP-gt | TP-call with genotype match |
| TP-call_FP-gt | TP-call without genotype match |
| TP-base_TP-gt | TP-base with genotype match |
| TP-base_FP-gt | TP-base without genotype match |
| gt_concordance | TP-call_TP-gt / (TP-call_TP-gt + TP-call_FP-gt) |
| gt_matrix | Base GT x Comp GT matrix of all base calls' best TP match |
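The derived metrics follow directly from the raw counts. Here is a minimal sketch of the arithmetic (not Truvari's implementation; the counts below are made up for illustration):

```python
def bench_summary(tp_base, tp_call, fp, fn, tp_call_tp_gt, tp_call_fp_gt):
    """Recompute the summary.txt performance metrics from raw counts."""
    precision = tp_call / (tp_call + fp)
    recall = tp_base / (tp_base + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    gt_concordance = tp_call_tp_gt / (tp_call_tp_gt + tp_call_fp_gt)
    return {"precision": precision, "recall": recall,
            "f1": f1, "gt_concordance": gt_concordance}

# Hypothetical run: 90 TPs, 10 FPs, 30 FNs; 80 of the TPs also match genotype
stats = bench_summary(tp_base=90, tp_call=90, fp=10, fn=30,
                      tp_call_tp_gt=80, tp_call_fp_gt=10)
print(stats["precision"], stats["recall"])  # 0.9 0.75
```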
Matching Parameters
===================
Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and how much tolerance you wish to allow them while still keeping the comparison fair.

For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctsim=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair.

[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds.

If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find the level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt).

Here is a rundown of each matching parameter.
| Parameter | Default | Definition |
|-----------|---------|------------|
| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end |
| pctsim | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Haplotype Sequences of Variants" below. |
| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
| typeignore | False | Types don't need to match to compare calls. |
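The size and overlap thresholds reduce to simple ratios. A rough sketch, assuming half-open `[start, end)` coordinates (illustrative only, not Truvari's internal code):

```python
def pctsize(base_size, comp_size):
    """Size similarity: min/max ratio of the two allele sizes."""
    return min(base_size, comp_size) / max(base_size, comp_size)

def pctovl(base_start, base_end, comp_start, comp_end):
    """Reciprocal overlap: shared bases over the longest span."""
    ovl = max(0, min(base_end, comp_end) - max(base_start, comp_start))
    longest = max(base_end - base_start, comp_end - comp_start)
    return ovl / longest

print(pctsize(5, 4))        # 0.8
print(pctovl(2, 7, 4, 8))   # 0.6
```

These reproduce the values shown in the diagrams below: a 5bp vs 4bp pair has 0.8 size similarity, and spans [2,7) vs [4,8) have 0.6 reciprocal overlap.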
Matching Parameter Diagrams
===========================
Below are matching parameter diagrams to illustrate (approximately) how they work.

```
 █ = Deletion ^ = Insertion

--refdist REFDIST (500)
    Max reference location distance

    ACTGATCATGAT
     |--████--|
         █████

    Calls are within reference distance of 2

--pctsize PCTSIZE (0.7)
    Min pct allele size similarity

    ACTGATCATGA   sizes
      █████       -> 5bp
       ████       -> 4bp

    variants have 0.8 size similarity

--pctovl PCTOVL (0.0)
    Min pct reciprocal overlap

    ACTGATCATGA   ranges
      █████       [2,7)
        ████      [4,8)

    variants have 0.6 reciprocal overlap

--pctsim PCTSIM (0.7)
    Min percent allele sequence similarity

    A-CTG-ACTG
     ^    ^     haplotypes
     |    └ACTG -> CTGACTGA
     └CTGA      -> CTGACTGA

    haplotypes have 100% sequence similarity
```

Comparing VCFs without sequence resolved calls
==============================================

If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off sequence comparison. The `--reference` does not need to be set when not using sequence comparison.

Difference between --sizemin and --sizefilt
===========================================

`--sizemin` is the minimum size of a base call to be considered.

`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It should be less than `sizemin` to handle edge case variants.

For example, say `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp comparison call at the same position. These two calls should be considered matching. If we instead removed comparison calls smaller than `sizemin`, we'd incorrectly classify the 50bp base call as a false negative.

This does have the side effect of artificially inflating specificity. If that same 49bp call in the above example were below the similarity threshold, it would not be classified as a FP due to the `sizemin` threshold.
So we're giving the call a better chance to be useful and less chance to be detrimental to final statistics.

Definition of annotations added to vcfs
==========================================
| Anno | Definition |
|------|------------|
| TruScore | Truvari score for similarity of match. `((pctsim + pctsize + pctovl) / 3 * 100)` |
| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
| PctSizeSimilarity | Pct size similarity between this variant and its closest match |
| PctRecOverlap | Percent reciprocal overlap of the two calls |
| StartDistance | Distance of the base call's start from the comparison call's start |
| EndDistance | Distance of the base call's end from the comparison call's end |
| SizeDiff | Difference in size of the base and comp calls |
| GTMatch | Whether the base and comp calls' genotypes match |
| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} |
| Multi | Call is false due to non-multimatching |
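TruScore is simply the mean of the three similarity ratios scaled to 0-100. A one-line sketch of that formula (a hypothetical helper, not Truvari's API):

```python
def truscore(pctsim, pctsize, pctovl):
    """Mean of the three similarity ratios, scaled to a 0-100 score."""
    return (pctsim + pctsize + pctovl) / 3 * 100

# A perfect match scores 100
print(truscore(1.0, 1.0, 1.0))  # 100.0
```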
See [[MatchIds wiki|MatchIds]] for details.

Using the GIAB Report
=====================

When running against the GIAB SV benchmark (link below), you can automatically add a text `giab_report.txt` to the output directory containing calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. This also adds a DataFrame `giab.jl` (similar to running `truvari vcf2df --bench-dir`).

See the [Recipes Discussions](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes) on how to make interesting summaries and plots.

While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work will need to be done to ensure Truvari can parse future GIAB SV releases.

[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB)

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/

Include Bed & VCF Header Contigs
================================

If an `--includebed` is provided, only base and comp calls contained within the defined regions are used for comparison. This is similar to pre-filtering your base/comp calls using:

```bash
(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz
```

with the exception that Truvari requires both the start and the end of a call to be contained in the same includebed region, whereas `bedtools intersect` does not.

If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.
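The containment rule can be sketched as a simple interval check (illustrative only; Truvari's actual region handling lives in its own code):

```python
def contained(call_start, call_end, regions):
    """True if both the start and end fall inside the SAME region
    (half-open [start, end) intervals, matching BED semantics)."""
    return any(r_start <= call_start and call_end <= r_end
               for r_start, r_end in regions)

regions = [(100, 200), (500, 800)]
print(contained(150, 180, regions))  # True: fully inside one region
print(contained(180, 550, regions))  # False: spans two regions
```

A call spanning two regions fails the check even though `bedtools intersect -u` would keep it.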
Extending an Include Bed
========================

The option `--extend` expands the regions of interest (set via the `--includebed` argument) by the given number of bases on each side, allowing truth set variants to match call set variants that are just outside of the original regions. If a call set variant in the extended regions matches a truth set variant in the original regions, it becomes a TP; however, call set variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching: only truth set variants longer than sizemin (50 by default) are considered, but they are allowed to match shorter comparison set variants of size sizefilt (30 by default) or longer.

See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.

Comparing Haplotype Sequences of Variants
=========================================

While many SV overlapping approaches match changes within regions with metrics such as reciprocal overlap and start/end distances, few account for the possibility of representing the same sequence change in multiple ways.

For example, consider a tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this variant would be:

```text
#POS INS  Haplotype
  0  ab   abAB
  1  ba   AbaB
  2  ab   ABab
```

Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' `min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line looks like:

```python
hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq
```

Where `a1_seq` is the longer of the REF or ALT allele.
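To see why the equivalent representations compare as identical, here is a toy sketch of the idea; the stdlib `difflib` ratio stands in for the edit-distance ratio Truvari actually computes:

```python
import difflib

def build_hap(ref_window, pos, ins_seq):
    """Apply an insertion at `pos` within the reference window."""
    return ref_window[:pos] + ins_seq + ref_window[pos:]

# The tandem repeat 'AB' expanded by one copy, described three ways
ref = "AB"
haps = [build_hap(ref, 0, "AB"), build_hap(ref, 1, "BA"), build_hap(ref, 2, "AB")]
print(haps)  # ['ABAB', 'ABAB', 'ABAB'] -> all identical

sim = difflib.SequenceMatcher(None, haps[0], haps[1]).ratio()
print(sim)  # 1.0
```

Comparing the raw INS sequences ('AB' vs 'BA') would score them as different, but the built haplotypes are identical.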
- -Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information -between variants in order to allow for more flexibility in complex multi-allelic representations. - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* if multimatch: -** return the max of each row (base) and of each column (call) from the match matrix -* else: -** ravel and sort the match matrix -** return the first use of each call in its current state (True or False) -** return the second use of each call with state=False and set Multi=True -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v3.4.0/collapse.md b/docs/v3.4.0/collapse.md deleted file mode 100644 index 95b474c8..00000000 --- a/docs/v3.4.0/collapse.md +++ /dev/null @@ -1,146 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) via: -```bash -bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. 
For example, consider two vcfs:

    >> one.vcf:
    chr1 1 ... GT 0/1
    chr1 5 ... GT 1/1
    >> two.vcf:
    chr1 1 ... GT 1/1
    chr1 7 ... GT 0/1

`bcftools merge` creates:

    >> merge.vcf:
    chr1 1 ... GT 0/1 1/1
    chr1 5 ... GT 1/1 ./.
    chr1 7 ... GT ./. 0/1

This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed.

```bash
truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa
```

For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create:

    >> truvari_merge.vcf
    chr1 1 ... GT 0/1 1/1
    chr1 5 ... GT 1/1 0/1
    >> truvari_collapsed.vcf
    chr1 7 ... GT ./. 0/1

--choose behavior
=================
When collapsing, the default `--choose` behavior is to take the first variant from a cluster to be written to the output, while the others will be placed in the collapsed output. Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count).

Samples with no genotype information in the kept variant will be filled by the first collapsed variant containing genotype information.

--hap mode
==========
When using `--hap`, we assume phased variants from a single individual. Only the single best matching call from the other haplotype will be collapsed, and the consolidated genotype will become 1/1.

For example, if we collapse anything at the same position:

    chr1 1 .. GT 0|1
    chr1 1 .. GT 1|0
    chr1 2 .. GT 1|0

will become:

    chr1 1 .. GT 1/1
    chr1 2 .. GT 1|0

--chain mode
============
Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that each variant needs to match only at least one other variant in the set.
In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`. -- `CollapseId` - the identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. 
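The difference between the two modes can be sketched as clustering strategies over variant positions. This toy model treats two variants as matching when they are within `maxdist` bp of each other; it is not Truvari's actual matcher:

```python
def collapse_positions(positions, maxdist, chain):
    """Greedy clustering: with chain=True a variant joins a cluster if it
    matches ANY member (transitive); otherwise it must match ALL members."""
    clusters = []
    for pos in sorted(positions):
        for cluster in clusters:
            hits = [abs(pos - p) <= maxdist for p in cluster]
            if (any(hits) if chain else all(hits)):
                cluster.append(pos)
                break
        else:
            clusters.append([pos])
    return clusters

print(collapse_positions([5, 7, 9], 2, chain=False))  # [[5, 7], [9]]
print(collapse_positions([5, 7, 9], 2, chain=True))   # [[5, 7, 9]]
```

Keeping the first member of each cluster reproduces the example above: two variants (5 and 9) without `--chain`, one variant (5) with it.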
- -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] - [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM] - [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] - [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] - [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds -All variants outside size boundaries will be placed into the output - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null from all - neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95)
  -B BUFFER, --buffer BUFFER
                        Percent of the reference span to buffer the haplotype sequence created
  -P PCTSIZE, --pctsize PCTSIZE
                        Min pct allele size similarity (minvarsize/maxvarsize) (0.95)
  -O PCTOVL, --pctovl PCTOVL
                        Minimum pct reciprocal overlap (0.0) for DEL events
  -t, --typeignore      Variant types don't need to match to compare (False)
  --use-lev             Use the Levenshtein distance ratio instead of edlib editDistance ratio (False)

Filtering Arguments:
  -s SIZEMIN, --sizemin SIZEMIN
                        Minimum variant size to consider for comparison (50)
  -S SIZEMAX, --sizemax SIZEMAX
                        Maximum variant size to consider for comparison (50000)
  --passonly            Only consider calls with FILTER == PASS
```

\ No newline at end of file
diff --git a/docs/v3.4.0/consistency.md b/docs/v3.4.0/consistency.md
deleted file mode 100644
index b8ccda5c..00000000
--- a/docs/v3.4.0/consistency.md
+++ /dev/null
@@ -1,74 +0,0 @@

In addition to looking at the performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result.

Running
=======

```
usage: consistency [-h] VCFs [VCFs ...]

Over multiple vcfs, calculate their intersection/consistency.

Calls will match between VCFs if they have a matching key of:
    CHROM:POS ID REF ALT

positional arguments:
  VCFs        VCFs to intersect

optional arguments:
  -h, --help  show this help message and exit
```

Example
=======

```bash
truvari consistency fileA.vcf fileB.vcf fileC.vcf
```

Matching Entries
================

VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
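The presence/absence grouping over these keys can be sketched with a bitmask per key. This is an illustrative stand-in for `truvari consistency`'s counting, using made-up keys:

```python
from collections import Counter

def consistency_groups(vcfs):
    """vcfs: list of sets of 'CHROM:POS ID REF ALT' keys.
    Returns a Counter of presence/absence bit-strings, e.g. '101'."""
    all_keys = set().union(*vcfs)
    groups = Counter()
    for key in all_keys:
        groups["".join('1' if key in vcf else '0' for vcf in vcfs)] += 1
    return groups

file_a = {"chr1:10 . A T", "chr1:50 . C G"}
file_b = {"chr1:10 . A T"}
groups = consistency_groups([file_a, file_b])
print(groups["11"], groups["10"])  # 1 1
```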
- -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v3.4.0/divide.md b/docs/v3.4.0/divide.md deleted file mode 100644 index 4cedd268..00000000 --- a/docs/v3.4.0/divide.md +++ /dev/null @@ -1,39 +0,0 @@ -Divide a VCF into independent parts. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. 
However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts which can then be processed over multiple processes.

`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs of each other, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever truvari command we need.

For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples. First, we divide the VCF:

```bash
truvari divide big_input.vcf.gz sub_vcfs_directory/
```

Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple script:

```bash
NJOBS=$(nproc) # use all processors by default
mkdir -p output_vcfs/
mkdir -p collap_vcfs/
mkdir -p logs/

for in_vcf in sub_vcfs_directory/*.vcf.gz
do
    # Setup file names
    base_name=$(basename $in_vcf)
    base_name=${base_name%.vcf.gz}
    output_name=output_vcfs/${base_name}.vcf
    collap_name=collap_vcfs/${base_name}.vcf
    log_name=logs/${base_name}.log
    # Run the command and send it to the background
    truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} &
    # If too many jobs are running, wait for one to finish
    while [ $( jobs | wc -l ) -ge ${NJOBS} ]
    do
        sleep 5
    done
done
```

Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands.
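The grouping rule (keep variants within `--buffer` base-pairs together, and don't start a new sub-VCF until the current one reaches `--min` variants) can be sketched over positions on a single contig. Parameter names here just mirror the CLI flags; this is not `truvari divide`'s code:

```python
def divide(positions, buffer=1000, min_count=100):
    """Split sorted variant positions into chunks; variants within
    `buffer` bp stay together, and chunks are padded to `min_count`."""
    chunks, current = [], []
    last = None
    for pos in sorted(positions):
        if current and pos - last > buffer and len(current) >= min_count:
            chunks.append(current)
            current = []
        current.append(pos)
        last = pos
    if current:
        chunks.append(current)
    return chunks

print(divide([1, 5, 5000, 5002], buffer=1000, min_count=1))
# [[1, 5], [5000, 5002]]
```

Because a chunk only closes at a gap wider than `buffer`, no pair of variants that would need comparing is ever split across sub-VCFs.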
\ No newline at end of file
diff --git a/docs/v3.4.0/segment.md b/docs/v3.4.0/segment.md
deleted file mode 100644
index 99514df0..00000000
--- a/docs/v3.4.0/segment.md
+++ /dev/null
@@ -1,18 +0,0 @@

Segmentation: Normalization of SVs into disjointed genomic regions

For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis.

This tool adds an INFO field `SEGCNT` which holds the number of SVs overlapping the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous, then `SEG=4`.

In the below example, the new annotations would be:

| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG |
|--------|-------------|--------|--------|--------|
| A | 1 | 2 | 0 | 0 |
| B | 2 | 2 | 1 | 0 |
| C | 3 | 2 | 2 | 2 |
| D | 2 | 2 | 1 | 0 |
| E | 1 | 0 | 1 | 0 |

![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png)

diff --git a/docs/v3.4.0/vcf2df.md b/docs/v3.4.0/vcf2df.md
deleted file mode 100644
index cba5c8a4..00000000
--- a/docs/v3.4.0/vcf2df.md
+++ /dev/null
@@ -1,73 +0,0 @@

We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib.
The resulting DataFrame will always have the columns:
* id : VCF column ID
* svtype : SVTYPE as determined by `truvari.entry_variant_type`
* svlen : SVLEN as determined by `truvari.entry_size`
* szbin : SVLEN's size bin as determined by `truvari.get_sizebin`
* qual : VCF column QUAL
* filter : VCF column FILTER
* is_pass : boolean of whether the filter is empty or PASS

Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist set operations on variants.

Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per line, so there are only 2 alleles described per line: the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`.
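The per-`Number` column expansion described above can be sketched as follows (a hypothetical helper, not `vcf2df`'s internals):

```python
def expand_field(name, number, values):
    """Expand a FORMAT field's list of values into named columns,
    following the Number=A / Number=G conventions described above."""
    if number == "A":    # one value per allele -> ref/alt
        suffixes = ("ref", "alt")
    elif number == "G":  # one value per genotype -> ref/het/hom
        suffixes = ("ref", "het", "hom")
    else:                # scalar field, single column
        return {name: values[0]}
    return {f"{name}_{suffix}": val for suffix, val in zip(suffixes, values)}

print(expand_field("AD", "A", [20, 12]))
# {'AD_ref': 20, 'AD_alt': 12}
print(expand_field("PL", "G", [0, 30, 200]))
# {'PL_ref': 0, 'PL_het': 30, 'PL_hom': 200}
```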
`vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there's multiple parameters used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-m] [-S] [--debug] VCF JL - -Takes a vcf and creates a data frame. 
Can parse a bench output directory

positional arguments:
  VCF                   VCF to parse
  JL                    Output joblib to save

optional arguments:
  -h, --help            show this help message and exit
  -b, --bench-dir       Input is a truvari bench directory
  -i, --info            Attempt to put the INFO fields into the dataframe
  -f, --format          Attempt to put the FORMAT fields into the dataframe
  -s SAMPLE, --sample SAMPLE
                        SAMPLE name to parse when building columns for --format
  -m, --multisample     Parse multiple samples. Splits -s by comma. Sample column names will be flattened
  -S, --skip-compression
                        Skip the attempt to optimize the dataframe's size
  --debug               Verbose logging
```

\ No newline at end of file
diff --git a/docs/v3.5.0/Citations.md b/docs/v3.5.0/Citations.md
deleted file mode 100644
index b72e3817..00000000
--- a/docs/v3.5.0/Citations.md
+++ /dev/null
@@ -1,31 +0,0 @@

# Citing Truvari

Pre-print on bioRxiv while in submission:

Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity
doi: https://doi.org/10.1101/2022.02.21.481353

# Citations

List of publications using Truvari. Most of these are just pulled from a [Google Scholar search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
-* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted 
nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) diff --git a/docs/v3.5.0/Comparing-two-SV-programs.md b/docs/v3.5.0/Comparing-two-SV-programs.md deleted file mode 100644 index c90f58eb..00000000 --- a/docs/v3.5.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,67 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs aginst base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. 
- -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` -Consistency -=========== -The simplest way to compare the programs is to get the intersection of TPbase calls from the two reports. -```bash -truvari consistency cmp_A/tp-base.vcf cmp_B/tp-base.vcf -``` -See [consistency wiki](../consistency) for details on the report created. - -Below are older notes to manually create a similar report to what one can make using `truvari consistency` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-call.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ... 
-``` \ No newline at end of file diff --git a/docs/v3.5.0/Development.md b/docs/v3.5.0/Development.md deleted file mode 100644 index 0774b2c8..00000000 --- a/docs/v3.5.0/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. 
 - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.
 - -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Freeze the Wiki -4) Ensure all code is checked in -5) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -6) Use github action to make a testpypi release -7) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -8) Use GitHub action to make a pypi release -9) Change Updates Wiki -10) Download release-tarball.zip from step #8's action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README 'commits since' badge \ No newline at end of file diff --git a/docs/v3.5.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v3.5.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v3.5.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different from the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v3.5.0/GIAB-stratifications.sh.md b/docs/v3.5.0/GIAB-stratifications.sh.md deleted file mode 100644 index 7666c641..00000000 --- a/docs/v3.5.0/GIAB-stratifications.sh.md +++ /dev/null @@ -1,17 +0,0 @@ -As part of the GIAB analysis, tandem repeat (TR) stratifications for false positives (FP), false negatives (FN), and true positives (TP) were analyzed. 
See [this discussion](https://groups.google.com/d/msg/giab-analysis-team/tAtVBm9Fdrw/I2qeazh8AwAJ) for details. - -The stratifications.sh script automates the procedure above by taking a Truvari bench output directory as the first argument and TR annotations (found in the link above) to create the following files in the output directory. - -Running this script requires `bedtools`, `vcf-sort`, and `bgzip` to be in your environment. - - -| FileName | Description | -|---------------------|---------------------------------------| -| fn_ins_nonTR.vcf.gz | FN insertions not found in TR regions | -| fn_ins_TR.vcf.gz | FN insertions found in TR regions | -| fp_ins_nonTR.vcf.gz | FP insertions not found in TR regions | -| fp_ins_TR.vcf.gz | FP insertions found in TR regions | -| fn_del_nonTR.vcf.gz | FN deletions not found in TR regions | -| fn_del_TR.vcf.gz | FN deletions found in TR regions | -| fp_del_nonTR.vcf.gz | FP deletions not found in TR regions | -| fp_del_TR.vcf.gz | FP deletions found in TR regions | \ No newline at end of file diff --git a/docs/v3.5.0/Home.md b/docs/v3.5.0/Home.md deleted file mode 100644 index 3bd133bf..00000000 --- a/docs/v3.5.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Truvari: Refined Structural Variant Comparison Preserves Allelic Diversity -doi: https://doi.org/10.1101/2022.02.21.481353 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble.
- -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[bench|bench]] - - [[The multimatch parameter|The--multimatch-parameter]] - - [[Edit Distance Ratio vs Sequence Similarity|Edit-Distance-Ratio-vs-Sequence-Similarity]] - - [[Multi-allelic VCFs|Multi-allelic-VCFs]] - - [[stratifications.sh|GIAB-stratifications.sh]] - - [[Comparing two SV programs|Comparing-two-SV-programs]] - - [[consistency|consistency]] - - [[anno|anno]] - - [[collapse|collapse]] - - [[vcf2df|vcf2df]] - - [[segment|segment]] -- [[Development|Development]] -- [[Citations|Citations]] -- [[Resources|Resources]] \ No newline at end of file diff --git a/docs/v3.5.0/Installation.md b/docs/v3.5.0/Installation.md deleted file mode 100644 index c026f08e..00000000 --- a/docs/v3.5.0/Installation.md +++ /dev/null @@ -1,53 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.0.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 setup.py install -``` -To see a list of all available tags, run: -``` -git tag -l -``` -If you have an older clone of the repository, but don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Conda / Mamba -============= -NOTE!! 
There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. See the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Alternatively, users can follow the instructions at https://mamba.readthedocs.io/en/latest/installation.html to install mamba, a faster conda-compatible package manager. - -Create an environment with Truvari and its dependencies: -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v3.5.0/MatchIds.md b/docs/v3.5.0/MatchIds.md deleted file mode 100644 index 29e070ae..00000000 --- a/docs/v3.5.0/MatchIds.md +++ /dev/null @@ -1,27 +0,0 @@ -Using `bench` MatchId -===================== - -MatchIds are used to tie base/comparison calls together. MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. -Where: - - chunkid - Unique id per-chunk of calls.
All calls sharing chunkid were within `--chunksize` distance and were - compared. - - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't have `--multimatch` is straightforward. For example, to tie tp-base.vcf and tp-call.vcf calls together, they will have identical MatchIds. - -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g. `5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -Tying false positives/negatives to their closest matching counterparts is a little trickier. -For example, for a false negative with MatchId `3.1.0`, we'll need to look in both the `fp.vcf` and `tp-call.vcf` for -`3.*.0`. - -When using `--multimatch` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-call 3.0.0 -tp-call 3.0.1 -``` -Here, the 0th base and comp calls match one another. Additionally, the 1st comp call `3.0.1` also matches to the 0th base call `tp-base 3.0.0`. \ No newline at end of file diff --git a/docs/v3.5.0/Multi-allelic-VCFs.md b/docs/v3.5.0/Multi-allelic-VCFs.md deleted file mode 100644 index 7a5f7876..00000000 --- a/docs/v3.5.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in the base and comparison calls. If VCFs contain multi-allelic sites: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Pre-process the VCFs with bcftools to get a more complete result: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v3.5.0/Resources.md b/docs/v3.5.0/Resources.md deleted file mode 100644 index 25a76ce1..00000000 --- a/docs/v3.5.0/Resources.md +++ /dev/null @@ -1,3 +0,0 @@ -*page in progress* - -Use `git-lfs` to pull the resources.
Currently only useful for `truvari anno trf` \ No newline at end of file diff --git a/docs/v3.5.0/The--multimatch-parameter.md b/docs/v3.5.0/The--multimatch-parameter.md deleted file mode 100644 index bee54a12..00000000 --- a/docs/v3.5.0/The--multimatch-parameter.md +++ /dev/null @@ -1,3 +0,0 @@ -By default, Truvari bench will only allow a one-to-one match between a base call and comparison call. The motivation for this is to be more strict when trying to identify FPs in comparison calls. For example, if you have a redundant variant reported in the comparison calls, only the single best representation will become a TP, where the rest will be FPs. - -With the `--multimatch` parameter, we can get around this by allowing all of the redundant matches of base and comparison calls. This should have the effect of increasing the reported recall and specificity as redundantly used calls are only counted once for the final summary metrics. \ No newline at end of file diff --git a/docs/v3.5.0/Updates.md b/docs/v3.5.0/Updates.md deleted file mode 100644 index a702f216..00000000 --- a/docs/v3.5.0/Updates.md +++ /dev/null @@ -1,134 +0,0 @@ -# Truvari 3.5 (4.0?) -*in progress* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT) - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g.
SNPs entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](../issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see [#128](../issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. 
If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` for `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best TP match.
-* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` was added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively.
-* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported... and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory.
It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. 
In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v3.5.0/anno.md b/docs/v3.5.0/anno.md deleted file mode 100644 index 4410e5c1..00000000 --- a/docs/v3.5.0/anno.md +++ /dev/null @@ -1,428 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed.
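As a sketch of the metric (illustrative only, not Truvari's actual code), GCPCT is just the fraction of G/C bases in the relevant sequence:

```python
# Illustrative sketch of the GCPCT computation: the percent of bases in the
# considered sequence (deletion's reference range or insertion's ALT) that
# are G or C. Not Truvari's implementation.
def gc_pct(seq):
    """Percent of bases in seq that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

# For an insertion, the ALT sequence would be analyzed:
print(gc_pct("ACGTGCGC"))  # 75.0
```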
-``` -usage: gcpct [-h] [-i INPUT] [-o OUTPUT] -r REFERENCE - -Annotates GC content of SVs - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-i INPUT] [-o OUTPUT] - -Annotates GTCounts of alleles - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -Recently refactored in v3.5-dev. Documentation in progress - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both reference (r) and alternate (a) we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_".
 - -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min) - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] - [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (48) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] - [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to
send to RepeatMasker (-pa - {threads} -e hmmer -species human -lcambig -nocut -div - 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (48) - --debug Verbose logging -``` - -# truvari anno remap - -Take the allele's sequence, remap it to the reference, and annotate based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] [-i INPUT] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to be considered (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a.
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10)) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calcluate the the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-i INPUT] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -optional arguments: - -h, --help show 
this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and ‘SVTYPE’ annotations to a VCF’s INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and ALT field is an SV (e.g. \, \, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if entry is a insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the \, \, etc from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-i INPUT] [-o OUTPUT] [-m MINSIZE] - -Adds SVTYPE and SVLEN INFO fields - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib. 
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoints hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N variant to annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, if we have a VCF of SVs and a GTF of genes/gene features from Ensmbl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(annos.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes. 
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] [-i INPUT] -a ANNO -o OUTPUT [--spanmin SPANMIN] [--spanmax SPANMAX] - [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] [-e END] [-1] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --spanmin SPANMIN Minimum span of SVs to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and count how many variants overlap. Annotate -regions with no variants as 'sparse' and with greater than or equal to (mean + `--threshold` * standard -deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`. 
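The dense/sparse rule described above can be sketched in a few lines. This is a simplified stand-in for `truvari anno density` (the genome windowing itself is omitted and the per-window counts are invented):

```python
import statistics

def label_windows(counts, threshold=3):
    """Label window variant counts: 'sparse' when zero, 'dense' when
    >= mean + threshold * stdev, otherwise unannotated (None)."""
    cutoff = statistics.mean(counts) + threshold * statistics.pstdev(counts)
    labels = []
    for count in counts:
        if count == 0:
            labels.append("sparse")
        elif count >= cutoff:
            labels.append("dense")
        else:
            labels.append(None)
    return labels

# Invented per-window counts: one empty window, one outlier
counts = [0] + [5] * 18 + [60]
labels = label_windows(counts)
print(labels[0], labels[-1])  # -> sparse dense
```

Note that with few windows a single outlier inflates the standard deviation, so a `--threshold` of 3 only flags truly extreme windows.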
- -``` -usage: density [-h] -g GENOME [-i INPUT] -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-t THRESHOLD] - -Identify 'dense' and 'sparse' variant windows of the genome - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -i INPUT, --input INPUT - Input VCF (/dev/stdin) - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds a `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` for each of the user-defined bins. Bins are incremented using `bisect` e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1; - -``` -usage: dpcnt [-h] [-i INPUT] [-b BINS] [--no-ad] [-p] [-o OUTPUT] - -Quick utility to count how many samples have >= Nx coverage per-variant - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-i INPUT] [-o OUTPUT] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (stdin) - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples. For every group in `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. 
Adds INFO tags with suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. - -``` -usage: grpaf [-h] -i INPUT [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] - -Add allele frequency annotations for subsets of samples - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v3.5.0/bench.md b/docs/v3.5.0/bench.md deleted file mode 100644 index c48977c0..00000000 --- a/docs/v3.5.0/bench.md +++ /dev/null @@ -1,261 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -f reference.fa -o output_dir/ -``` - -Outputs -======= -Truvari bench creates an output directory with the following files. - - - - - - - - - -
| File | Description |
|------|-------------|
| tp-call.vcf | Annotated true positive calls from the comparison VCF |
| tp-base.vcf | Annotated true positive calls from the base VCF |
| fp.vcf | False positive calls from comparison |
| fn.vcf | False negative calls from base |
| summary.txt | JSON output of performance stats |
| log.txt | Run's logging output |
| giab_report.txt | (optional) Summary of GIAB benchmark calls. See "Using the GIAB Report" below. |
| giab.jl | (optional) DataFrame of benchmark results. |
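Because summary.txt is a JSON dump, the stats can be consumed programmatically. The payload below is an invented, trimmed stand-in; in practice you would `json.load` the file from the output directory:

```python
import json

# Invented stand-in for the contents of output_dir/summary.txt
payload = '{"TP-base": 90, "TP-call": 92, "FP": 8, "FN": 10, "f1": 0.9099}'
stats = json.loads(payload)
print(f"f1 = {stats['f1']:.4f}")  # -> f1 = 0.9099
```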
summary.txt
===========
The following stats are generated for benchmarking your call set and written to summary.txt as a JSON dump.
| Metric | Definition |
|--------|------------|
| TP-base | Number of matching calls from the base vcf |
| TP-call | Number of matching calls from the comp vcf |
| FP | Number of non-matching calls from the comp vcf |
| FN | Number of non-matching calls from the base vcf |
| precision | TP-call / (TP-call + FP) |
| recall | TP-base / (TP-base + FN) |
| f1 | 2 * ((recall * precision) / (recall + precision)) |
| base cnt | Number of calls in the base vcf |
| call cnt | Number of calls in the comp vcf |
| TP-call_TP-gt | TP-call with genotype match |
| TP-call_FP-gt | TP-call without genotype match |
| TP-base_TP-gt | TP-base with genotype match |
| TP-base_FP-gt | TP-base without genotype match |
| gt_concordance | TP-call_TP-gt / (TP-call_TP-gt + TP-call_FP-gt) |
| gt_matrix | Base GT x Comp GT matrix of all base calls' best TP match |
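The precision, recall, and f1 definitions above translate directly to code. The counts here are invented for illustration:

```python
def bench_stats(tp_base, tp_call, fp, fn):
    """Recompute the summary.txt performance metrics from raw counts."""
    precision = tp_call / (tp_call + fp)
    recall = tp_base / (tp_base + fn)
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Invented counts for illustration
p, r, f1 = bench_stats(tp_base=90, tp_call=92, fp=8, fn=10)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Note that TP-base and TP-call can differ (e.g. two comp calls matching one base call under `--multimatch`), which is why precision and recall are computed from different TP counts.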
Matching Parameters
===================
Picking matching parameters can be more of an art than a science. It depends on the precision of your callers and how much tolerance you wish to grant them while keeping the comparison fair.

For example, depth-of-coverage callers (such as CNVnator) have very 'fuzzy' boundaries and don't report the exact deleted sequence, only varying regions. So thresholds of `pctsim=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair.

[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds.

If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find the level you consider acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility (see the json at the top of your log.txt).

Here is a rundown of each matching parameter.
| Parameter | Default | Definition |
|-----------|---------|------------|
| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end |
| pctsim | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and comparison call. See "Comparing Sequences of Variants" below. |
| unroll | False | Use the unroll technique for sequence comparison. See "Comparing Sequences of Variants" below. |
| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) |
| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) |
| typeignore | False | Types don't need to match to compare calls. |
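A minimal sketch of how the size and overlap thresholds interact. This is a simplification, not Truvari's real matcher (which also handles sequence similarity, type checks, and genotypes); the numbers follow the [2,7) and [4,8) deletions used in the diagrams:

```python
def size_similarity(size_a, size_b):
    """pctsize: min/max ratio of the two variant sizes."""
    return min(size_a, size_b) / max(size_a, size_b)

def reciprocal_overlap(s1, e1, s2, e2):
    """pctovl: overlapping bases over the longest span."""
    overlap = max(0, min(e1, e2) - max(s1, s2))
    return overlap / max(e1 - s1, e2 - s2)

def passes(base, comp, refdist=500, pctsize=0.7, pctovl=0.0):
    """base/comp are (start, end) pairs on the same contig."""
    if abs(base[0] - comp[0]) > refdist or abs(base[1] - comp[1]) > refdist:
        return False
    if size_similarity(base[1] - base[0], comp[1] - comp[0]) < pctsize:
        return False
    return reciprocal_overlap(*base, *comp) >= pctovl

# Deletions spanning [2,7) and [4,8): 0.8 size similarity, 0.6 overlap
print(passes((2, 7), (4, 8)))  # -> True
```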
Matching Parameter Diagrams
===========================
Below are matching parameter diagrams to illustrate (approximately) how they work.

```
 █ = Deletion   ^ = Insertion

--refdist REFDIST (500)
    Max reference location distance

     ACTGATCATGAT
       |--████--|
            █████

    Calls are within reference distance of 2

--pctsize PCTSIZE (0.7)
    Min pct allele size similarity

     ACTGATCATGA   sizes
       █████    -> 5bp
        ████    -> 4bp

    variants have 0.8 size similarity

--pctovl PCTOVL (0.0)
    Min pct reciprocal overlap

     ACTGATCATGA   ranges
       █████       [2,7)
         ████      [4,8)

    variants have 0.6 reciprocal overlap

--pctsim PCTSIM (0.7)
    Min percent allele sequence similarity

     A-CTG-ACTG
      ^    ^        haplotypes
      |    └ACTG -> CTGACTGA
      └CTGA      -> CTGACTGA

    haplotypes have 100% sequence similarity
```

Comparing VCFs without sequence resolved calls
==============================================

If the base or comp vcfs do not have sequence resolved calls, simply set `--pctsim=0` to turn off sequence comparison. The `--reference` does not need to be set when not using sequence comparison.

Difference between --sizemin and --sizefilt
===========================================

`--sizemin` is the minimum size of a base call to be considered.

`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It should be less than `sizemin` for edge case variants.

For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp comparison call at the same position.

These two calls should be considered matching. If we instead removed comparison calls less than `sizemin`, we'd incorrectly classify the 50bp base call as a false negative.

This does have the side effect of artificially inflating specificity. If that same 49bp call in the above were below the similarity threshold, it would not be classified as a FP due to the `sizemin` threshold.
So we're giving the call a better chance to be useful and less chance to be detrimental to final statistics.

Definition of annotations added to vcfs
=======================================
| Anno | Definition |
|------|------------|
| TruScore | Truvari score for similarity of match. `((pctsim + pctsize + pctovl) / 3 * 100)` |
| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match |
| PctSizeSimilarity | Pct size similarity between this variant and its closest match |
| PctRecOverlap | Percent reciprocal overlap of the two calls |
| StartDistance | Distance of the base call's start from comparison call's start |
| EndDistance | Distance of the base call's end from comparison call's end |
| SizeDiff | Difference in size of base and comp calls |
| GTMatch | Base/comp calls' genotypes match |
| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} |
| Multi | Call is false due to non-multimatching |
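The TruScore formula in the table above is a plain average of the three similarity measures, scaled to 0-100:

```python
def tru_score(pctsim, pctsize, pctovl):
    """TruScore: mean of the three similarity measures, scaled to 0-100."""
    return (pctsim + pctsize + pctovl) / 3 * 100

print(round(tru_score(0.9, 0.8, 0.7), 1))  # -> 80.0
```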
See [[MatchIds wiki|MatchIds]] for details.

Using the GIAB Report
=====================

When running against the GIAB SV benchmark (link below), you can automatically add a text `giab_report.txt` to the output directory containing calls summarized by the GIAB VCF's SVTYPE, SVLEN, Technology, and Repeat annotations. This also adds a DataFrame `giab.jl` (similar to running `truvari vcf2df --bench-dir`).

See the [Recipes Discussions](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes) on how to make interesting summaries and plots.

While Truvari can use other benchmark sets, this formatted report currently only works with GIAB SV v0.5 and v0.6. Work will need to be done to ensure Truvari can parse future GIAB SV releases.

[GIAB v0.6 Download Link](https://bit.ly/2SlWYbB)

ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/

Include Bed & VCF Header Contigs
================================

If an `--includebed` is provided, only base and comp calls contained within the defined regions are used for comparison. This is similar to pre-filtering your base/comp calls using:

```bash
(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz
```

with the exception that Truvari requires the start and the end to be contained in the same includebed region whereas `bedtools intersect` does not.

If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.
Extending an Include Bed
========================

The option `--extend` extends the regions of interest (set with the `--includebed` argument) by the given number of bases on each side, allowing truth set variants to match call set variants that are just outside of the original region. If a call set variant is in the extended regions it can potentially match a truth set variant that is in the original regions, turning it into a TP, but call set variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching: only the truth set variants longer than sizemin (50 by default) are considered, but they are allowed to match shorter comparison set variants of size sizefilt (30 by default) or longer.

See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.

Comparing Sequences of Variants
===============================

While many SV overlapping approaches match variants with metrics such as reciprocal overlap and start/end distances, few account for the possibility of representing the same sequence change in multiple ways.

Truvari has implemented two approaches to compare variant sequences. The original method uses the reference context of a pair of variants. The new method is called 'unroll'.

## Reference context
For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this variant would be:

```text
#POS INS  Haplotype
  0  ab   abAB
  1  ba   AbaB
  2  ab   ABab
```

Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' `min(starts):max(ends)` before making the sequence change introduced by the variants.
In python, this line looks like:

```python
hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq
```

Where `a1_seq` is the longer of the REF or ALT allele.

Future versions of Truvari may take this concept further to reconstruct haplotypes based on phasing information between variants in order to allow for more flexibility in complex multi-allelic representations.

## Unroll
The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c).

The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant needs to be moved downstream/upstream respectively. Put another way, we can think of the alternate sequences as circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in our 'ab' example above, we only need to unroll the insertion at a position by the distance between it and another variant, e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end.

This unroll method has a number of benefits and a couple of considerations.
* Unroll does not need a `--reference` for comparison, which saves I/O time.
* In the limited testing performed so far, unroll increases the number of matching SVs
* Unroll seems to decrease the number of 'suspect' matches in smaller size regimes
* The pattern between PctSizeSimilarity and PctSeqSimilarity is simpler
* The PctSeqSimilarity of variant pairs reported by the reference context method is higher than the unroll method. This is mainly due to reference context 'inflating' the similarity (i.e.
adding identical sequence to both variants in the pair naturally increases their identity). However, a majority of pairs seem to have identical similarity reported by either method. - - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* if multimatch: -** return the max of each row (base) and of each column (call) from the match matrix -* else: -** ravel and sort the match matrix -** return the first use of each call in its current state (True or False) -** return the second use of each call with state=False and set Multi=True -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) - ---dup-to-ins -============ - -Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP to SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctsim 0` \ No newline at end of file diff --git a/docs/v3.5.0/collapse.md b/docs/v3.5.0/collapse.md deleted file mode 100644 index 95b474c8..00000000 --- a/docs/v3.5.0/collapse.md +++ /dev/null @@ -1,146 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. 
- -Example -======= -To start, we merge multiple VCFs (each with their own sample) via: -```bash -bcftools merge -m none one.vcf two.vcf | bgzip > merge.vcf -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. - -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior -================= -When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---hap mode -========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1 - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. 
GT 1|0 - ---chain mode -============ -Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that all variants match to only at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`. -- `CollapseId` - the identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. 
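The transitive matching that `--chain` enables can be modeled as a one-pass clustering over sorted positions. This is a simplified sketch using the positions from the example above, not collapse's actual implementation:

```python
def chain_clusters(positions, window=2):
    """Transitively group sorted positions: a variant joins the open
    cluster when it's within `window` of the cluster's last member."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= window:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# 5 and 9 are more than 2bp apart, but both match 7,
# so --chain-style matching merges all three
print(chain_clusters([5, 7, 9]))  # -> [[5, 7, 9]]
```

Without chaining, the same three positions would yield two clusters because 5 and 9 don't match each other directly.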
- -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] - [-k {first,maxqual,common}] [--debug] [-r REFDIST] [-p PCTSIM] - [-B BUFFER] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] - [--chain] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] - [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds -All variants outside size boundaries will be placed into the output - -optional arguments: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null from all - neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B BUFFER, --buffer BUFFER - Percent of the reference span to buffer the haplotype sequence created - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Minimum pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v3.5.0/consistency.md b/docs/v3.5.0/consistency.md deleted file mode 100644 index 0b38c515..00000000 --- a/docs/v3.5.0/consistency.md +++ /dev/null @@ -1,75 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] [-j] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit - -j, --json Output report in json format -``` -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have an exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommend that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. 
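The key-based matching described above amounts to a presence/absence intersection over exact keys. A toy model with invented records (real keys are `CHROM:POS ID REF ALT`):

```python
from collections import Counter

# Invented call keys per VCF
vcfs = {
    "fileA.vcf": {"chr1:100 . A T", "chr1:200 . G C"},
    "fileB.vcf": {"chr1:100 . A T", "chr1:300 . T TA"},
    "fileC.vcf": {"chr1:100 . A T", "chr1:200 . G C"},
}

# One presence/absence bit-string per unique key, like the 'Group' column
groups = Counter()
for key in set.union(*vcfs.values()):
    group = "".join("1" if key in calls else "0" for calls in vcfs.values())
    groups[group] += 1

for group, total in sorted(groups.items(), reverse=True):
    print(group, total)
```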
- -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. \ No newline at end of file diff --git a/docs/v3.5.0/phab.md b/docs/v3.5.0/phab.md deleted file mode 100644 index 488eca3f..00000000 --- a/docs/v3.5.0/phab.md +++ /dev/null @@ -1,163 +0,0 @@ -Introduction ------------- - -Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations. 
We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. - -This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--multimatch --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there's plenty of distinct alleles with >= 50% homology. - -So how do we deal with inconsistent representations? Perhaps we could build [dynamic-thresholds](https://github.com/ACEnglish/truvari/discussions/94) that would increase or decrease depending on measurements (e.g. variant distance, sequence context); but building a reliable model would require an expansive training dataset. - -In an ideal world, we would simply get rid of inconsistent variant representations. - -`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. - -`truvari phab` works by wrapping a number of trusted bioinformatics tools. Reference from the region being analyzed is fetched with `samtools`. Haplotypes are reconstructed from VCFs using `bcftools consensus`. The sequences are then aligned using [MAFFT](https://mafft.cbrc.jp/alignment/software/) - a fast multiple-sequence alignment software. Truvari parses the resultant MSA and calls variants to reproduce the region's VCF. 
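The intuition can be shown with a toy example (a conceptual sketch only; `apply_dels` is a hypothetical helper, not part of Truvari): two different sets of variant records that spell out the same haplotype become indistinguishable once the haplotype itself is rebuilt, which is why recalling variants from the MSA yields one consistent representation.

```python
def apply_dels(ref, dels):
    """Apply a list of (start, length) deletions to a reference string."""
    out, last = [], 0
    for start, length in sorted(dels):
        out.append(ref[last:start])  # keep sequence up to the deletion
        last = start + length        # skip the deleted bases
    out.append(ref[last:])
    return "".join(out)

ref = "ACGT" * 50  # 200bp toy reference
one_100bp_del = apply_dels(ref, [(10, 100)])            # a single 100bp DEL
two_50bp_dels = apply_dels(ref, [(10, 50), (60, 50)])   # the same allele, split
# Identical haplotypes -> a single representation after harmonization
assert one_100bp_del == two_50bp_dels
```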
- -Requirements ------------ -Since `truvari phab` uses existing tools, it expects them to be found in the environment path. The tools `phab` will call are: - -- bcftools 1.10.2 -- vcf-sort 0.1.16 -- tabix / bgzip - 1.10.2-3 -- samtools 1.10 -- mafft v7.505 - -You can install each of these or you can build a Truvari [Docker container](Development#docker) - -Example ------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region. -```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --unroll \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants 1bp and larger (`--sizemin 1 --sizefilt 1`, which includes SNPs) from the `HG002` sample to the `syndip` sample with the [`--unroll`](bench#unroll) method of calculating sequence similarity. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`.
The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-call": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235 -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region. -```bash -truvari phab --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --bSample HG002 \ - --cSample syndip \ - --reference phab_ref.fa \ - --region chr1:700-900 \ - -o phab_result/ -``` - -The main file `phab` produces is the `phab_result/output.vcf.gz` which holds our harmonized variants. In it we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after running `phab`. - -```bash -truvari bench -b phab_result/output.vcf.gz \ - -c phab_result/output.vcf.gz \ - -S 1 -s 1 \ - --unroll \ - --no-ref a \ - --bSample HG002 \ - --cSample syndip \ - -o harmonized_bench/ -``` -Looking at `harmonized_bench/summary.txt` shows: -```json -{ - "TP-base": 8, - "TP-call": 8, - "FP": 0, - "FN": 0, - "precision": 1.0, - "recall": 1.0, - "f1": 1.0 -} -``` -Now there is no difference between our two sets of variants in this region. - -For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes. -```bash -$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c - 1 0/1 1/0 - 1 1/0 0/1 - 6 1/1 1/1 -``` -(We can ignore the phasing differences (`0/1` vs. `1/0`).
These pipelines reported the parental alleles in a different order) - -MSA ---- - -If you read the `truvari phab --help`, you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example/output.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example/output.vcf.gz | sort | uniq | wc -l -``` -This shows the variants previously occupied `98` distinct positions but now sit at just `16`. - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example/output.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50, which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... - 3 69 1 35 - 1 89 1 40 - 8 109 1 53 - 1 132 1 56 - 1 150 1 81 -``` - -(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`.
I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) - -Limitations ------------ -* MAFFT, while very fast, is impractical for very long sequences and may be impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions. -* By giving the variants new representations, it becomes harder to count how many TP/FP calls the original pipeline created. -* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try unphased variants out, but proceed with caution. - diff --git a/docs/v3.5.0/segment.md b/docs/v3.5.0/segment.md deleted file mode 100644 index 99514df0..00000000 --- a/docs/v3.5.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. - -This tool adds an INFO field `SEGCNT` which holds the number of SVs overlapping the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`.
- -In the below example, the new annotations would be: - -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - diff --git a/docs/v3.5.0/vcf2df.md b/docs/v3.5.0/vcf2df.md deleted file mode 100644 index 6a0e5925..00000000 --- a/docs/v3.5.0/vcf2df.md +++ /dev/null @@ -1,75 +0,0 @@ -We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of if the filter is empty or PASS - -Additionally, the DataFrame is assigned an index of a description of the variant `CHROM:START-STOP.ALT`. This enables operations such as `pandas.DataFrame.join` to assist set operations on variants. - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. 
Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt` corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. `vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column: -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-call.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameters used for SV discovery over multiple samples.
After running `truvari bench` on each of the results with the output directories named `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-m] [-S] [-c {0,1,2,3,4,5,6,7,8}] [--debug] VCF JL - -Takes a vcf and creates a data frame. Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -m, --multisample Parse multiple samples. Splits -s by comma. Sample column names will be flattened - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - -c LVL, --compress LVL - Compression level for joblib 0-9 (3) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.0.0/Comparing-two-SV-programs.md b/docs/v4.0.0/Comparing-two-SV-programs.md deleted file mode 100644 index 6463f90d..00000000 --- a/docs/v4.0.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,67 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results.
For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` -Consistency -=========== -The simplest way to compare the programs is to get the intersection of TPbase calls from the two reports. -```bash -truvari consistency cmp_A/tp-base.vcf cmp_B/tp-base.vcf -``` -See [[consistency wiki|consistency]] for details on the report created. - -Below are older notes to manually create a similar report to what one can make using `truvari consistency`. - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again.
- -```bash -bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ... -``` \ No newline at end of file diff --git a/docs/v4.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v4.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v4.0.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences.: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences. 
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v4.0.0/Home.md b/docs/v4.0.0/Home.md deleted file mode 100644 index 312fa193..00000000 --- a/docs/v4.0.0/Home.md +++ /dev/null @@ -1,36 +0,0 @@ -The wiki holds documentation most relevant for develop. 
For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Citation: -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. 
- -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[anno|anno]] - - [[bench|bench]] - - [[collapse|collapse]] - - [[consistency|consistency]] - - [[divide|divide]] - - [[phab|phab]] - - [[refine|refine]] - - [[segment|segment]] - - [[stratify|stratify]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] -- [[Resources|Resources]] \ No newline at end of file diff --git a/docs/v4.0.0/Installation.md b/docs/v4.0.0/Installation.md deleted file mode 100644 index a929d245..00000000 --- a/docs/v4.0.0/Installation.md +++ /dev/null @@ -1,56 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.2.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 -m pip install . -``` - -To see a list of all available tags, run: -``` -git tag -l -``` - -If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba / Conda -============= -NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. -Users can follow instructions here (https://mamba.readthedocs.io/en/latest/installation.html) to install mamba. (A faster alternative conda compatible package manager.) 
- -Creating an environment with Truvari and its dependencies. -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. if the current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v4.0.0/MatchIds.md b/docs/v4.0.0/MatchIds.md deleted file mode 100644 index f7bcfd4f..00000000 --- a/docs/v4.0.0/MatchIds.md +++ /dev/null @@ -1,30 +0,0 @@ -Using MatchId -============= - -MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploration. MatchIds are in `bench` and `collapse` outputs. In the `collapse` main output, the field is named 'CollapseId'. - -MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. - -Where: - - chunkid - Unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were - compared. - - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't use `--pick multi` is straightforward. For example, to tie tp-base.vcf and tp-comp.vcf calls together, they will have identical MatchIds.
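Pairing records that share a MatchId amounts to a dictionary join (a hypothetical sketch; the record payloads are placeholders for whatever you parse out of the VCFs' INFO fields):

```python
# MatchId -> record, as might be parsed from tp-base.vcf and tp-comp.vcf
base = {"3.0.0": "base_record_A", "4.0._": "base_record_B"}  # `_` = no comp partner
comp = {"3.0.0": "comp_record_X", "5._.0": "comp_record_Y"}  # `_` = no base partner

# Records with identical MatchIds in both files are pairs
paired = {mid: (base[mid], comp[mid]) for mid in base.keys() & comp.keys()}
print(paired)  # {'3.0.0': ('base_record_A', 'comp_record_X')}
```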
- -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g. `5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -When using `--pick multi` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-comp 3.0.0 -tp-comp 3.0.1 -``` -Here, the 0th base/comp calls match to one another. Additionally, the 1st comp call `3.0.1` also matches to the 0th base call `tp-base 3.0.0`. - -This makes tying false positives/negatives to their closest matching counterpart a little trickier. -For example, if we have a false negative with MatchId `3.1.0`, we'll need to look in both the `fp.vcf` and `tp-comp.vcf` for -`3.*.0`. This search will find the 0th comp call in the third chunk which may have matched to a different baseid than 1. \ No newline at end of file diff --git a/docs/v4.0.0/Resources.md b/docs/v4.0.0/Resources.md deleted file mode 100644 index 25a76ce1..00000000 --- a/docs/v4.0.0/Resources.md +++ /dev/null @@ -1,3 +0,0 @@ -*page in progress* - -Use `git-lfs` to pull the resources. Currently only useful for `truvari anno trf` \ No newline at end of file diff --git a/docs/v4.0.0/Updates.md b/docs/v4.0.0/Updates.md deleted file mode 100644 index 350dbf01..00000000 --- a/docs/v4.0.0/Updates.md +++ /dev/null @@ -1,181 +0,0 @@ -# Truvari 4.0 -*March 13, 2023* - -As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools, and how a couple of existing tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. - -* New `refine` command for refining benchmarking results.
[Details](refine) -* `bench` - * [Unroll](bench#unroll) is now the default sequence comparison approach. - * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) - * The `summary.txt` is now named `summary.json` - * Outputs parameters to `params.json` - * Output VCFs are sorted, compressed, and indexed - * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) - * Renamed `--pctsim` parameter to `--pctseq` - * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported - * Fixed bug where `INFO/Multi` wasn't being properly applied - * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. - * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` -* `collapse` - * Renamed `--pctsim` parameter to `--pctseq` - * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` - * Better output sorting which may allow pipelines to be a little faster. -* `vcf2df` - * More granular sizebins for `[0,50)` including better handling of SNPs - * `--multisample` is removed. Now automatically add all samples with `--format` - * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges -* `anno` - * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. - * Added `addid` -* `consistency` - * Slight speed improvement - * Better json output format -* `segment` - * Added `--passonly` flag - * Changed UI, including writing to stdout by default - * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN -* API - * Began a focused effort on improving re-usability of Truvari code. 
- * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). - * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) - * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) -* general - * msa2vcf now left-trims and decomposes variants into indels - * Functional tests reorganization - * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) - * Removed progressbar and Levenshtein dependencies as they are no longer used. - -# Truvari 3.5 -*August 27, 2022* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](../phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. 
`ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. 
See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` for `pysam.VariantRecord.translate` a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match. -* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. 
-* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use the Match object for prettier code and cleaner output. -* `anno remap` now optionally adds an `INFO` field with the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grow, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype.
`--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is VCF conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now produces MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* GitHub actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC.
- -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats.
\ No newline at end of file diff --git a/docs/v4.0.0/anno.md b/docs/v4.0.0/anno.md deleted file mode 100644 index beb60812..00000000 --- a/docs/v4.0.0/anno.md +++ /dev/null @@ -1,468 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF with the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] - -Annotates GC content of SVs - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of counts of genotypes for the allele (UNK, REF, HET, HOM). This is most useful for pVCFs.
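The (UNK, REF, HET, HOM) tally can be sketched in a few lines of Python over pysam-style genotype tuples. This is an illustration of the counting idea, not the tool's actual code, and `gt_class`/`gtcnt` are hypothetical names:

```python
def gt_class(gt):
    """Classify a genotype tuple as UNK, REF, HET, or HOM."""
    alleles = [a for a in gt if a is not None]
    if not alleles:
        return "UNK"          # fully missing, e.g. ./.
    if all(a == 0 for a in alleles):
        return "REF"          # all reference alleles
    if len(set(alleles)) == 1:
        return "HOM"          # same non-ref allele on both haplotypes
    return "HET"

def gtcnt(genotypes):
    """Return [UNK, REF, HET, HOM] counts across samples."""
    order = ["UNK", "REF", "HET", "HOM"]
    counts = dict.fromkeys(order, 0)
    for gt in genotypes:
        counts[gt_class(gt)] += 1
    return [counts[k] for k in order]

# Three samples: missing, hom-ref, het
print(gtcnt([(None, None), (0, 0), (0, 1)]))  # [1, 1, 1, 0]
```

How partially-missing genotypes (e.g. ./1) are binned is an assumption of this sketch.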
- -``` -usage: gtcnt [-h] [-o OUTPUT] [input] - -Annotates GTCounts of alleles - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -Recently refactored in v3.5-dev; documentation in progress. -``` -usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] - [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] - -Intersect vcf with reference tandem repeats and annotate -variants with the best fitting repeat motif and its copy number -relative to the reference - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) - -r REPEATS, --repeats REPEATS - Reference repeat annotations - -f REFERENCE, --reference REFERENCE - Reference fasta file - -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY - Motif similarity threshold (0.9) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -R, --regions-only Only write variants within --repeats regions (False) - --no-estimate Skip INS estimation procedure and run everything through TRF. (False) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization (5) - -t THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information.
-This does not alter the VCF's annotations, but instead creates a pandas -DataFrame and saves it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a), we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". - -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -R REGIONS, --regions REGIONS - Bed file of regions to parse (None) - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (1) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename 
(/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig - -nocut -div 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno remap - -Taking the allele's sequence, remap it to the reference and annotate based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] - -Remap VCF's alleles sequence to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to be considered (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a. 
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs alongside SNPs/Indels - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -positional arguments: - input VCF to annotate - -options: - 
-h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add `SVLEN` and `SVTYPE` annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is a symbolic SV allele (e.g. `\<DEL\>`, `\<INS\>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer whether the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `\<DEL\>`, `\<INS\>`, etc. from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] - -Adds SVTYPE and SVLEN INFO fields - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib. 
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(genes.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes. 
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] - [-e END] [-1] - [input] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --sizemin SIZEMIN Minimum size of variant to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap each one. Annotates -regions with no variants as 'sparse' and regions with at least (mean + `--threshold` * standard -deviation) variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`. 
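The 'sparse'/'dense' labeling rule can be mimicked with a short Python sketch over hypothetical per-window counts (`label_windows` is an illustrative helper, not Truvari's code; with only six windows, a threshold of 1 is used so the outlier can clear the mean + threshold * stdev cutoff):

```python
from statistics import mean, stdev

def label_windows(counts, threshold=3):
    """Label window variant counts: 'sparse' for 0 variants,
    'dense' for counts >= mean + threshold * stdev, else ''."""
    cutoff = mean(counts) + threshold * stdev(counts)
    return ["sparse" if c == 0 else "dense" if c >= cutoff else ""
            for c in counts]

# Hypothetical per-window variant counts
print(label_windows([0, 4, 5, 5, 6, 40], threshold=1))
# ['sparse', '', '', '', '', 'dense']
```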
- -``` -usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-t THRESHOLD] [input] - -Identify 'dense' and 'sparse' variant windows of the genome - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` in each of the user-defined bins. Bins are incremented using `bisect`, e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1` - -``` -usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] - -Quick utility to count how many samples have >= Nx coverage per-variant - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-o OUTPUT] [input] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples. For every group in the `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. 
Adds INFO tags with suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. - -``` -usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] - -Add allele frequency annotations for subsets of samples - -positional arguments: - input VCF to annotate - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.0.0/bench.md b/docs/v4.0.0/bench.md deleted file mode 100644 index d97096a3..00000000 --- a/docs/v4.0.0/bench.md +++ /dev/null @@ -1,278 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ -``` - -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. 
If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility. - -Here is a rundown of each matching parameter. -| Parameter | Default | Definition | -|------------|---------|------------| -| refdist | 500 | Maximum distance allowed between a comparison call's and a base call's start/end | -| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | -| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | -| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | -| typeignore | False | Types don't need to match to compare calls. | - -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctseq PCTSEQ (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Outputs -======= -Truvari bench writes the following files to the `--output` directory. -| File | Description | -|----------------|---------------------------------------------| -| tp-base.vcf.gz | True positive calls from the base VCF | -| tp-comp.vcf.gz | True positive calls from the comparison VCF | -| fp.vcf.gz | False positive calls from comparison | -| fn.vcf.gz | False negative calls from base | -| summary.json | Json output of performance stats | -| params.json | Json output of parameters used | -| log.txt | Run's log | - -summary.json ------------- -Stats generated by benchmarking are written to `summary.json`. 
- -| Metric | Definition | -|----------------|------------------------------------------------------------| -| TP-base | Number of matching calls from the base vcf | -| TP-comp | Number of matching calls from the comp vcf | -| FP | Number of non-matching calls from the comp vcf | -| FN | Number of non-matching calls from the base vcf | -| precision | TP-comp / (TP-comp + FP) | -| recall | TP-base / (TP-base + FN) | -| f1 | 2 * ((recall * precision) / (recall + precision)) | -| base cnt | Number of calls in the base vcf | -| comp cnt | Number of calls in the comp vcf | -| TP-comp_TP-gt | TP-comp with genotype match | -| TP-comp_FP-gt | TP-comp without genotype match | -| TP-base_TP-gt | TP-base with genotype match | -| TP-base_FP-gt | TP-base without genotype match | -| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | -| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | - -The `gt_matrix` is a table. For example: -```json -"gt_matrix": { - "(0, 1)": { - "(0, 1)": 500, - "(1, 1)": 10 - }, - "(1, 1)": { - "(1, 1)": 800, - "(0, 1)": 20 - } - } -``` -Represents -> -``` -comp (0,1) (1,1) -base -(0,1) 500 10 -(1,1) 20 800 -``` - - -Added annotations ------------------ -The output vcfs are annotated with the INFO fields described below and then sorted, compressed, and indexed inside the output directory. - -| Anno | Definition | -|-------------------|-----------------------------------------------------------------------------------------------------------------| -| TruScore | Truvari score for similarity of match. 
`((pctseq + pctsize + pctovl) / 3 * 100)` | -| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | -| PctSizeSimilarity | Pct size similarity between this variant and its closest match | -| PctRecOverlap | Percent reciprocal overlap of the two calls | -| StartDistance | Distance of the base call's start from comparison call's start | -| EndDistance | Distance of the base call's end from comparison call's end | -| SizeDiff | Difference in size of base and comp calls | -| GTMatch | Base/comp calls' Genotypes match | -| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | - -Comparing Sequences of Variants -=============================== - -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. - -## Reference context -For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -## Unroll -The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. 
For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). - -The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream, respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in our 'ab' example above, we only need to unroll the insertion at a position by the distance between it and another variant, e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. - -This unroll method has a number of benefits and a couple of considerations. -* Unroll does not need a `--reference` for comparison, which saves I/O time. -* In the limited testing performed so far, unroll increases the number of matching SVs -* Unroll seems to decrease the number of 'suspect' matches in smaller size regimes -* The pattern between PctSizeSimilarity and PctSeqSimilarity is simpler - -## SVs without sequences - -If the base or comp vcfs do not have sequence resolved calls (e.g. `<DEL>`), simply set `--pctseq=0` to turn off -sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. - -Controlling the number of matches -================================= - -How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`. - -* `single` (the default option) allows each variant to participate in up to one match. -* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). 
A heterozygous variant can only participate in one match (GT 0/1, AC=1). And a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC, e.g. (GT ./1, AC=1). -* `multi` allows variants to participate in all available matches. - -As an example, imagine we have three variants in a pVCF with two samples we want to compare. - -``` -CHROM POS ID REF ALT base comp -chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0 -chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1 -chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1 -``` - -To compare samples inside the same vcf, we would use the command: -```bash -truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a -``` - -This VCF produces different results depending on the `--pick` parameter: - -| Parameter | ID1 State | ID2 State | ID3 State | -|-----------|-----------|-----------|-----------| -| single | TP | FP | FP | -| ac | TP | TP | FP | -| multi | TP | TP | TP | - ---dup-to-ins -============ - -Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0`. - ---sizemin and --sizefilt -======================== - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It should -be less than `sizemin` for edge case variants. - -For example: `sizemin` is 50 and `sizefilt` is 30. A 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls should be considered matching. If we instead removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. 
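The 98% figure can be checked directly; size similarity is the smaller size over the larger, as defined in the matching parameter table (a sketch of the definition, not Truvari's internals):

```python
def pct_size_similarity(base_size, comp_size):
    """min(base_size, comp_size) / max(base_size, comp_size)"""
    return min(base_size, comp_size) / max(base_size, comp_size)

# A 50bp base call vs a 49bp comparison call clears the default
# 0.7 pctsize threshold, which is why sizefilt (30) sits below
# sizemin (50): the smaller call must survive filtering to match.
print(round(pct_size_similarity(50, 49), 2))  # 0.98
```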
- -This does have the side effect of artificially inflating specificity. If that same 49bp call in the -above example were below the similarity threshold, it would not be classified as a FP due to the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. - -Extending an Include Bed ------------------------- -The option `--extend` extends the regions of interest (set with the `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant in the extended regions matches a base variant in the original regions, the base variant becomes a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only the base variants longer than sizemin (50 by default) are considered, but they are allowed to match shorter comparison variants of sizefilt (30bp by default) or longer. - -See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.
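The containment rule described above (both the variant's start and end must fall inside a single include region) can be sketched as follows. This is a hypothetical helper for illustration, not Truvari's implementation:

```python
def in_include_region(var_start, var_end, regions):
    """Return True only if one region fully contains the variant,
    mirroring the start-and-end containment Truvari requires.
    (bedtools intersect -u would also accept partial overlaps.)"""
    return any(r_start <= var_start and var_end <= r_end
               for r_start, r_end in regions)

regions = [(100, 200), (500, 800)]
print(in_include_region(120, 180, regions))  # True: fully contained
print(in_include_region(180, 250, regions))  # False: end spills past 200
```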
- -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* use `--pick` method to sort and annotate variants with their best match -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) diff --git a/docs/v4.0.0/collapse.md b/docs/v4.0.0/collapse.md deleted file mode 100644 index aa29a74d..00000000 --- a/docs/v4.0.0/collapse.md +++ /dev/null @@ -1,144 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via: -```bash -bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. 
- -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior -================= -When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others are placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---hap mode -========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1 - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode -============ -Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that each variant needs to match only at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 ..
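The difference between the two modes can be sketched as a toy clustering over positions. This is our own simplified illustration of the idea, not Truvari's matching engine, which compares full variants over all thresholds:

```python
def cluster(positions, dist, chain=False):
    """Toy clustering of variant positions. Without chain, a position
    joins a cluster only if it is within `dist` of EVERY member;
    with chain, being within `dist` of ANY member suffices
    (transitive matching)."""
    clusters = []
    for pos in sorted(positions):
        for members in clusters:
            close = [abs(pos - p) <= dist for p in members]
            if any(close) if chain else all(close):
                members.append(pos)
                break
        else:
            clusters.append([pos])
    return clusters

# The chr1 5/7/9 example above, collapsing anything within 2bp:
print(cluster([5, 7, 9], 2, chain=False))  # [[5, 7], [9]] -> two kept variants
print(cluster([5, 7, 9], 2, chain=True))   # [[5, 7, 9]]   -> one kept variant
```

Position 9 is within 2bp of 7 but not of 5, so only `--chain`'s transitive rule pulls it into the same cluster.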
- -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`. -- `CollapseId` - the identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug] - [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain] - [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --no-consolidate Skip consolidation of sample genotype fields (True) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null - from all 
neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. (0.95) - -B MINHAPLEN, --minhaplen MINHAPLEN - Minimum haplotype sequence length to create (50) - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Min pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v4.0.0/consistency.md b/docs/v4.0.0/consistency.md deleted file mode 100644 index 41fd599c..00000000 --- a/docs/v4.0.0/consistency.md +++ /dev/null @@ -1,170 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] [-j] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit - -j, --json Output report in json format -``` -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have an exact same key of `CHROM:POS ID REF ALT`. 
Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. - -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the fewest calls unique to it on the `Group 100` line. - -Json -==== -Below is a consistency report in json format.
-```json -{ - "vcfs": [ - "repo_utils/test_files/variants/input1.vcf.gz", - "repo_utils/test_files/variants/input2.vcf.gz", - "repo_utils/test_files/variants/input3.vcf.gz" - ], - "total_calls": 3513, - "num_vcfs": 3, - "vcf_counts": { - "repo_utils/test_files/variants/input1.vcf.gz": 2151, - "repo_utils/test_files/variants/input2.vcf.gz": 1783, - "repo_utils/test_files/variants/input3.vcf.gz": 2065 - }, - "shared": [ - { - "vcf_count": 3, - "num_calls": 701, - "call_pct": 0.1995445488186735 - }, - { - "vcf_count": 2, - "num_calls": 1084, - "call_pct": 0.3085681753487048 - }, - { - "vcf_count": 1, - "num_calls": 1728, - "call_pct": 0.4918872758326217 - } - ], - "detailed": [ - { - "group": "111", - "total": 701, - "total_pct": 0.1995445488186735, - "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326, - "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318, - "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683 - }, - { - "group": "001", - "total": 645, - "total_pct": 0.18360375747224594, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166 - }, - { - "group": "100", - "total": 598, - "total_pct": 0.17022487902077996, - "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "101", - "total": 487, - "total_pct": 0.1386279533162539, - "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884 - }, - { - "group": "010", - "total": 485, - "total_pct": 0.13805863933959578, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905, - "repo_utils/test_files/variants/input3.vcf.gz": 
0 - }, - { - "group": "110", - "total": 365, - "total_pct": 0.10389980074010817, - "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517, - "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "011", - "total": 232, - "total_pct": 0.06604042129234272, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667, - "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167 - } - ] -} -``` \ No newline at end of file diff --git a/docs/v4.0.0/divide.md b/docs/v4.0.0/divide.md deleted file mode 100644 index 3bbad3cb..00000000 --- a/docs/v4.0.0/divide.md +++ /dev/null @@ -1,58 +0,0 @@ -Divide a VCF into independent shards. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can then be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whichever truvari command we need. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples.
First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands. - -``` -usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR - -Divide a VCF into independent parts - -positional arguments: - VCF VCF to split - DIR Output directory to save parts - -options: - -h, --help show this help message and exit - -b BUFFER, --buffer BUFFER - Buffer to make mini-clusters (1000) - -m MIN, --min MIN Minimum number of entries per-vcf (100) - --no-compress Don't attempt to compress/index sub-VCFs - -T THREADS, --threads THREADS - Number of threads for compressing/indexing sub-VCFs (1) -``` \ No newline at end of file diff --git a/docs/v4.0.0/phab.md b/docs/v4.0.0/phab.md deleted file mode 100644 index 2fdc57fb..00000000 --- a/docs/v4.0.0/phab.md +++ /dev/null @@ -1,161 +0,0 @@ -Introduction ------------- - -Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations.
We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. - -This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there are plenty of distinct alleles with >= 50% homology. - -So how do we deal with inconsistent representations? Perhaps we could build [dynamic-thresholds](https://github.com/ACEnglish/truvari/discussions/94) that would increase or decrease depending on measurements (e.g. variant distance, sequence context); but building a reliable model would require an expansive training dataset. - -In an ideal world, we would simply get rid of inconsistent variant representations. - -`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. - -`truvari phab` works by wrapping a number of trusted bioinformatics tools. Reference from the region being analyzed is fetched with `samtools`. Haplotypes are reconstructed from VCFs using `bcftools consensus`. The sequences are then aligned using [MAFFT](https://mafft.cbrc.jp/alignment/software/) - a fast multiple-sequence alignment software. Truvari parses the resultant MSA and calls variants to reproduce the region's VCF.
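The last step—recalling variants from an alignment—can be sketched for a single haplotype. This is a toy pairwise version for illustration only; `phab` itself parses a full MSA produced by MAFFT:

```python
def call_variants(ref_aln, hap_aln):
    """Toy re-caller: walk two gapped, aligned sequences column by
    column and emit (ref_pos, ref_allele, alt_allele) tuples.
    Mismatched columns become SNPs; runs of gaps become indels.
    Positions are 1-based over the ungapped reference; pure
    insertions report the reference position they precede."""
    assert len(ref_aln) == len(hap_aln)
    variants, pos, i, n = [], 0, 0, len(ref_aln)
    while i < n:
        r, h = ref_aln[i], hap_aln[i]
        if r != '-' and h != '-':
            pos += 1
            if r != h:                      # SNP
                variants.append((pos, r, h))
            i += 1
        else:                               # consume the whole gap run
            start = pos + 1
            ref_run, alt_run = "", ""
            while i < n and (ref_aln[i] == '-' or hap_aln[i] == '-'):
                if ref_aln[i] != '-':
                    pos += 1
                    ref_run += ref_aln[i]
                if hap_aln[i] != '-':
                    alt_run += hap_aln[i]
                i += 1
            variants.append((start, ref_run, alt_run))
    return variants

# Reference:  A C G T A C      1bp deletion of the G plus a SNP:
print(call_variants("ACGTAC", "AC-TAG"))  # [(3, 'G', ''), (6, 'C', 'G')]
# 1bp insertion before reference position 3:
print(call_variants("AC-TA", "ACGTA"))    # [(3, '', 'G')]
```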
- -Requirements ------------- -Since `truvari phab` uses existing tools, it expects them to be found in the environment path. The tools `phab` will call are: - -- bcftools 1.10.2 -- vcf-sort 0.1.16 -- tabix / bgzip - 1.10.2-3 -- samtools 1.10 -- mafft v7.505 - -You can install each of these or you can build a Truvari [Docker container](Development#docker) - -Example ------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region. -```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants 1bp and larger (`-S 1 -s 1`, which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-comp": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235 -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment.
We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region. -```bash -truvari phab --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --bSample HG002 \ - --cSample syndip \ - --reference phab_ref.fa \ - --region chr1:700-900 \ - -o phab_result/ -``` - -The main file `phab` produces is the `phab_result/output.vcf.gz` which holds our harmonized variants. In it we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after running `phab`. - -```bash -truvari bench -b phab_result/output.vcf.gz \ - -c phab_result/output.vcf.gz \ - -S 1 -s 1 \ - --no-ref a \ - --bSample HG002 \ - --cSample syndip \ - -o harmonized_bench/ -``` -Looking at `harmonized_bench/summary.txt` shows: -```json -{ - "TP-base": 8, - "TP-comp": 8, - "FP": 0, - "FN": 0, - "precision": 1.0, - "recall": 1.0, - "f1": 1.0 -} -``` -Now there is no difference between our two sets of variants in this region. - -For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes. -```bash -$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c - 1 0/1 1/0 - 1 1/0 0/1 - 6 1/1 1/1 -``` -(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order) - -MSA ---- - -If you read the `truvari phab --help`, you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example/output.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example/output.vcf.gz | sort | uniq | wc -l -``` -This shows that the variants previously spanned `98` positions but now sit at just `16`. - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example/output.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50 which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... - 3 69 1 35 - 1 89 1 40 - 8 109 1 53 - 1 132 1 56 - 1 150 1 81 -``` - -(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) - -Limitations ------------ -* MAFFT, while very fast, is impractical for very long sequences and maybe impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions. -* By giving the variants new representations, it becomes harder to count how many TP/FP calls the original pipeline created.
- -* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try unphased variants out, but proceed with caution. - diff --git a/docs/v4.0.0/refine.md b/docs/v4.0.0/refine.md deleted file mode 100644 index 6e08c730..00000000 --- a/docs/v4.0.0/refine.md +++ /dev/null @@ -1,107 +0,0 @@ -As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result, we present the tool `refine`. - -Quick Start -=========== - -After making a `bench` result: -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -``` -Use `refine` on the `result/` -```bash -truvari refine -r subset.bed -f ref.fa result/ -``` - -Output -====== -* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions.
Same structure as [[summary.json|bench#summaryjson]] -* `refine.regions.txt` - Tab-delimited file with per-region variant counts -* `refine.region_summary.json` - Per-region performance metrics -* `phab/` - Per-region results from variant re-evaluation (optional) -* `phab_bench/` - Bench results on the harmonized variants - -To see an example output, look at [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one) - -Using `refine.regions.txt` -========================== -| Column | Description | -| ----------------- | --------------------------------------- | -| chrom | Region's chromosome | -| start | Region's start | -| end | Region's end | -| in_tpbase | Input's True Positive base count | -| in_tp | Input's True Positive comparison count | -| in_fp | Input's false positive count | -| in_fn | Input's false negative count | -| refined | Boolean for if region was re-evaluated | -| out_tpbase | Output's true positive base count | -| out_tp | Output's true positive comparison count | -| out_fn | Output's false negative count | -| out_fp | Output's false positive count | -| state | True/False state of the region | - - -Performance by Regions -====================== - -Because `truvari phab` can alter variant counts during harmonization, one may wish to assess the performance on a per-region basis rather than the per-variant basis. In the `refine.regions.txt`, a column `state` will have a TP/FN/FP value as defined by the following rules: - -```python -false_pos = (data['out_fp'] != 0) -false_neg = (data['out_fn'] != 0) -any_false = false_pos | false_neg - -true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false - -true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1) - -baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0) -compP = (data['out_tp'] != 0) | (data['out_fp'] != 0) -``` - -This logic has two edge cases to consider.
1) a region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp, but no other variants because a neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions. - -These by-region state counts are summarized and written to `refine.region_summary.json`. The definition of metrics inside this json are: -| Key | Definition | Formula | -|--------|----------------------------------------------|---------------------------------| -| TP | True Positive region count | | -| TN | True Negative region count | | -| FP | False Positive region count | | -| FN | False Negative region count | | -| base P | Regions with base variant(s) | | -| base N | Regions without base variant(s) | | -| comp P | Regions with comparison variant(s) | | -| comp N | Regions without comparison variant(s) | | -| PPV | Positive Predictive Value (a.k.a. precision) | TP / comp P | -| TPR | True Positive Rate (a.k.a. recall) | TP / base P | -| TNR | True Negative Rate (a.k.a. specificity) | TN / base N | -| NPV | Negative Predictive Value | TN / comp N | -| ACC | Accuracy | (TP + TN) / (base P + base N) | -| BA | Balanced Accuracy | (TPR + TNR) / 2 | -| F1 | f1 score | 2 * ((PPV * TPR) / (PPV + TPR)) | -| UND | Regions with an undetermined state | | - -Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate from the by-variant performance reports.
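The metric formulas from the table can be sketched directly in Python. This is a hypothetical helper with made-up counts, approximating base P as TP + FN and comp P as TP + FP, which only holds when every region has a single clean state (no "FP,FN" double-counts or 'UNK' regions):

```python
def region_metrics(tp, tn, fp, fn):
    """Per-region summary metrics following the formulas above."""
    base_p, base_n = tp + fn, tn + fp   # regions with/without base variants
    comp_p, comp_n = tp + fp, tn + fn   # regions with/without comp variants
    ppv = tp / comp_p                   # precision
    tpr = tp / base_p                   # recall
    tnr = tn / base_n                   # specificity
    return {"PPV": ppv, "TPR": tpr, "TNR": tnr,
            "NPV": tn / comp_n,
            "ACC": (tp + tn) / (base_p + base_n),
            "BA": (tpr + tnr) / 2,
            "F1": 2 * (ppv * tpr) / (ppv + tpr)}

m = region_metrics(tp=80, tn=10, fp=5, fn=5)
print(round(m["PPV"], 3), round(m["TPR"], 3), round(m["F1"], 3))
```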
- -`--use-original` -================ - -By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, this typically contains a filtered subset of the variants originally provided to `bench`, since `bench` applies filters such as `--sizemin` and `--passonly`. With the `--use-original` parameter, all of the original calls from the input vcfs are fetched. - - -`--regions` -=========== - -This parameter specifies which regions to re-evaluate. If this is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--regions` are subset to only those intersecting `--includebed`. - -Note that the larger these regions are the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--region` partially overlaps an `--includebed` region, you may not be analyzing a subset of calls looked at during the original `bench` run. Therefore, the `*summary.json` should be compared with caution. - -`--use-includebed` -================== - -When intersecting includebed with regions, use includebed coordinates. This is helpful when the original bench result's `--includebed` boundaries should be used instead of the `--regions` - -`--reference` -============= - -By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants. \ No newline at end of file diff --git a/docs/v4.1.0/Citations.md b/docs/v4.1.0/Citations.md deleted file mode 100644 index d600b860..00000000 --- a/docs/v4.1.0/Citations.md +++ /dev/null @@ -1,30 +0,0 @@ -# Citing Truvari - -English, A.C., Menon, V.K., Gibbs, R.A. et al.
Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. -* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and 
assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) -* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) -* [Automated
filtering of genome-wide large deletions through an ensemble deep learning framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git a/docs/v4.1.0/Comparing-two-SV-programs.md b/docs/v4.1.0/Comparing-two-SV-programs.md deleted file mode 100644 index 6463f90d..00000000 --- a/docs/v4.1.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,67 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` -Consistency -=========== -The simplest way to compare the programs is to get the intersection of TPbase calls from the two reports. -```bash -truvari consistency cmp_A/tp-base.vcf cmp_B/tp-base.vcf -``` -See [[consistency wiki|consistency]] for details on the report created.
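Conceptually, `consistency` tallies which inputs contain each call and counts calls per presence/absence pattern; below is a minimal sketch of that idea (the call keys and set logic here are illustrative only, not Truvari's actual implementation):

```python
from collections import Counter

def consistency_counts(call_sets):
    """Count calls by their presence/absence pattern across inputs.
    call_sets: a list of sets of call keys (e.g. variant IDs)."""
    patterns = Counter()
    for call in set().union(*call_sets):
        # '1' if the call is present in that input, '0' otherwise
        flag = "".join("1" if call in s else "0" for s in call_sets)
        patterns[flag] += 1
    return patterns

a_tps = {"id1", "id2", "id3"}  # TP-base IDs recovered via Asvs
b_tps = {"id1", "id3"}         # TP-base IDs recovered via Bsvs
shared = consistency_counts([a_tps, b_tps])
print(shared["11"])  # base calls recovered by both programs -> 2
```

The `"10"` and `"01"` buckets are the calls only one program recovered, which is the same partition the manual `grep`/`paste` recipe below produces.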
- -Below are older notes to manually create a similar report to what one can make using `truvari consistency` - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ... -``` \ No newline at end of file diff --git a/docs/v4.1.0/Development.md b/docs/v4.1.0/Development.md deleted file mode 100644 index ccb38493..00000000 --- a/docs/v4.1.0/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). 
- -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage.
- -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant to [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. 
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Change Updates Wiki -4) Freeze the Wiki -5) Ensure all code is checked in -6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -7) Use GitHub action to make a testpypi release -8) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -9) Use GitHub action to make a pypi release -10) Download release-tarball.zip from step #9’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v4.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v4.1.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different from the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v4.1.0/Home.md b/docs/v4.1.0/Home.md deleted file mode 100644 index 47fbc626..00000000 --- a/docs/v4.1.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. 
For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Citation: -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. 
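In miniature, that splitting turns each ALT allele of a multi-allelic record into its own bi-allelic record; the sketch below is illustrative only (`bcftools norm` additionally realigns and updates INFO/FORMAT fields, and the site shown is hypothetical):

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Emit one (chrom, pos, ref, alt) record per ALT allele."""
    return [(chrom, pos, ref, alt) for alt in alts]

# A hypothetical multi-allelic insertion site with two ALT alleles
records = split_multiallelic("chr2", 1948201, "T",
                             ["TACAACACGTACGATCAGTAGAC",
                              "TCAACACACAACACGTACGATCAGTAGAC"])
for rec in records:
    print(rec)
```

After splitting, every record carries a single ALT, which is the form Truvari compares.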
- -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[anno|anno]] - - [[bench|bench]] - - [[collapse|collapse]] - - [[consistency|consistency]] - - [[divide|divide]] - - [[phab|phab]] - - [[refine|refine]] - - [[segment|segment]] - - [[stratify|stratify]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v4.1.0/Installation.md b/docs/v4.1.0/Installation.md deleted file mode 100644 index a929d245..00000000 --- a/docs/v4.1.0/Installation.md +++ /dev/null @@ -1,56 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.2.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 -m pip install . -``` - -To see a list of all available tags, run: -``` -git tag -l -``` - -If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba / Conda -============= -NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. -Users can follow the instructions at https://mamba.readthedocs.io/en/latest/installation.html to install mamba, a faster conda-compatible package manager. - -Creating an environment with Truvari and its dependencies.
-``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v4.1.0/MatchIds.md b/docs/v4.1.0/MatchIds.md deleted file mode 100644 index 14250f57..00000000 --- a/docs/v4.1.0/MatchIds.md +++ /dev/null @@ -1,27 +0,0 @@ -MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploration. MatchIds are in `bench` and `collapse` outputs. In the `collapse` main output, the field is named 'CollapseId'. - -MatchIds have a structure of `{chunkid}.{baseid}.{compid}`. - -Where: - - chunkid - Unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were - compared. - - baseid - The number of the base call from this chunk - - compid - The number of the comp call from this chunk - -Doing basic pairing with MatchIds when results didn't use `--pick multi` is straightforward. For example, to tie tp-base.vcf and tp-comp.vcf calls together, they will have identical MatchIds. - -MatchIds with `_` for the baseid or compid are calls that were in a chunk without a counterpart (e.g.
`5._.0` is the -first comparison call from the fifth chunk, which had no base calls.) - -When using `--pick multi` there can be instances such as: -``` -VCF MatchId -tp-base 3.0.0 -tp-comp 3.0.0 -tp-comp 3.0.1 -``` -Here, the 0th base/comp calls match one another. Additionally, the 1st comp call `3.0.1` also matches to the 0th base call `tp-base 3.0.0`. - -This makes tying false positives/negatives to their closest matching counterpart a little trickier. -For example, if we have a false negative with MatchId `3.1.0`, we'll need to look in both the `fp.vcf` and `tp-comp.vcf` for -`3.*.0`. This search will find the 0th comp call in the third chunk which may have matched to a different baseid than 1. \ No newline at end of file diff --git a/docs/v4.1.0/Multi-allelic-VCFs.md b/docs/v4.1.0/Multi-allelic-VCFs.md deleted file mode 100644 index fd0eb23e..00000000 --- a/docs/v4.1.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Then pre-process the VCFs with bcftools: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v4.1.0/Updates.md b/docs/v4.1.0/Updates.md deleted file mode 100644 index b26c8bbb..00000000 --- a/docs/v4.1.0/Updates.md +++ /dev/null @@ -1,215 +0,0 @@ -# Truvari 4.1 -*August 7, 2023* - -* `bench` - * Creates `candidate.refine.bed` which hooks into `refine` on whole-genome VCFs [details](https://github.com/ACEnglish/truvari/wiki/bench#refining-bench-output) - * `--recount` for correctly assessing whole-genome refinement results - * experimental 'weighted' summary metrics [details](https://github.com/ACEnglish/truvari/wiki/bench#weighted-performance) - * Unresolved SVs (e.g.
`ALT == `) are filtered when `--pctseq != 0` -* `phab` - * ~2x faster via reduced IO from operating in stages instead of per-region - * Removed most external calls (e.g. samtools doesn't need to be in the environment anymore) - * new `--align wfa` allows much faster (but slightly less accurate) variant harmonization - * increased determinism of results [details](https://github.com/ACEnglish/truvari/commit/81a9ab85b91b0c530f9faeedfa4e7e0d68a5e8c2) -* `refine` - * Faster bed file intersection of `--includebed` and `--regions` - * Refine pre-flight check - * Correct refine.regions.txt end position from IntervalTree correction - * Better refine region selection with `--use-original` - * `--use-includebed` switched to `--use-region-coords` so that default behavior is to prefer the includebed's coordinates - * `--use-original-vcfs` to use the original pre-bench VCFs - * `refine.variant_summary.json` is cleaned of uninformative metrics -* `stratify` - * parallel parsing of truvari directory to make processing ~4x faster -* `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes -* `anno grpaf` - expanded annotation info fields -* `anno density` - new parameter `--stepsize` for sliding windows -* `collapse` - * New optional `--median-info` fields [#146](https://github.com/ACEnglish/truvari/issues/146) -* Minor updates - * Fix some `anno` threading on macOS [#154](https://github.com/ACEnglish/truvari/issues/154) - * Monomorphic/multiallelic check fix in `bench` - * `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key - * Slightly slimmer docker container - -# Truvari 4.0 -*March 13, 2023* - -As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs.
There are also a few new tools and how a couple of tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. - -* New `refine` command for refining benchmarking results. [Details](refine) -* `bench` - * [Unroll](bench#unroll) is now the default sequence comparison approach. - * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) - * The `summary.txt` is now named `summary.json` - * Outputs parameters to `params.json` - * Output VCFs are sorted, compressed, and indexed - * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) - * Renamed `--pctsim` parameter to `--pctseq` - * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported - * Fixed bug where `INFO/Multi` wasn't being properly applied - * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. - * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` -* `collapse` - * Renamed `--pctsim` parameter to `--pctseq` - * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` - * Better output sorting which may allow pipelines to be a little faster. -* `vcf2df` - * More granular sizebins for `[0,50)` including better handling of SNPs - * `--multisample` is removed. Now automatically add all samples with `--format` - * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges -* `anno` - * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. 
- -Added `addid` -* `consistency` - * Slight speed improvement - * Better json output format -* `segment` - * Added `--passonly` flag - * Changed UI, including writing to stdout by default - * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN -* API - * Began a focused effort on improving re-usability of Truvari code. - * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). - * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) - * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) -* general - * msa2vcf now left-trims and decomposes variants into indels - * Functional tests reorganization - * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) - * Removed progressbar and Levenshtein dependencies as they are no longer used. - -# Truvari 3.5 -*August 27, 2022* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT) - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](../phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g.
PctSizeSimilarity) are rounded to precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. 
See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` in favor of `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just compare REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match. -* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details.
-* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grow, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype.
`--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported... and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* GitHub Actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC.
- -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV-relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats.
\ No newline at end of file diff --git a/docs/v4.1.0/anno.md b/docs/v4.1.0/anno.md deleted file mode 100644 index beb60812..00000000 --- a/docs/v4.1.0/anno.md +++ /dev/null @@ -1,468 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] - -Annotates GC content of SVs - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. 
- -``` -usage: gtcnt [-h] [-o OUTPUT] [input] - -Annotates GTCounts of alleles - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem-repeat annotation to each entry in your VCF. - -Recently refactored in v3.5-dev. Documentation in progress. -``` -usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] - [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] - -Intersect vcf with reference tandem repeats and annotate -variants with the best fitting repeat motif and its copy number -relative to the reference - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) - -r REPEATS, --repeats REPEATS - Reference repeat annotations - -f REFERENCE, --reference REFERENCE - Reference fasta file - -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY - Motif similarity threshold (0.9) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -R, --regions-only Only write variants within --repeats regions (False) - --no-estimate Skip INS estimation procedure and run everything through TRF. (False) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization (5) - -t THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information.
-This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) alleles, we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". - -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -R REGIONS, --regions REGIONS - Bed file of regions to parse (None) - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (1) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename
(/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - -p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig - -nocut -div 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno remap - -Takes the allele's sequence, remaps it to the reference, and annotates based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to be considered (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a.
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -positional arguments: - input VCF to annotate - -options: -
-h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is a symbolic SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] - -Adds SVTYPE and SVLEN INFO fields - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N variant-to-annotation relationship. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, imagine we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(annos.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] - [-e END] [-1] - [input] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --sizemin SIZEMIN Minimum size of variant to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines starting with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap each. Annotates -regions with no variants as 'sparse' and regions with greater than or equal to (mean + `--threshold` * standard -deviation) variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`.
- -``` -usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-t THRESHOLD] [input] - -Identify 'dense' and 'sparse' variant windows of the genome - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` in each of the user-defined bins. Bins are incremented using `bisect`, e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1`. - -``` -usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] - -Quick utility to count how many samples have >= Nx coverage per-variant - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-o OUTPUT] [input] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples. For every group in the `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group.
Adds INFO tags with suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. - -``` -usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] - -Add allele frequency annotations for subsets of samples - -positional arguments: - input VCF to annotate - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.1.0/bench.md b/docs/v4.1.0/bench.md deleted file mode 100644 index 40665125..00000000 --- a/docs/v4.1.0/bench.md +++ /dev/null @@ -1,297 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ -``` - -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. 
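To make the ratio-style thresholds above concrete, here is a minimal sketch of how size similarity (`pctsize`) and reciprocal overlap (`pctovl`) can be computed from the definitions in the parameter table. The helper names are illustrative, not Truvari's API:

```python
def pct_size_similarity(base_size, comp_size):
    # Ratio of the smaller size to the larger size (the pctsize parameter)
    return min(base_size, comp_size) / max(base_size, comp_size)

def pct_recip_overlap(base_start, base_end, comp_start, comp_end):
    # Overlapping bases divided by the longest span (the pctovl parameter)
    overlap = max(0, min(base_end, comp_end) - max(base_start, comp_start))
    return overlap / max(base_end - base_start, comp_end - comp_start)

print(pct_size_similarity(5, 4))      # 5bp vs 4bp deletions -> 0.8
print(pct_recip_overlap(2, 7, 4, 8))  # ranges [2,7) and [4,8) -> 0.6
```

A CNVnator-style run with `pctsize=.5`, `pctovl=.5` would accept any pair scoring at least 0.5 on both ratios.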
- -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility. - -Here is a rundown of each matching parameter. -| Parameter | Default | Definition | -|------------|---------|------------| -| refdist | 500 | Maximum distance a comparison call can be from the base call's start/end | -| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | -| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | -| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | -| typeignore | False | Types don't need to match to compare calls. | - -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctseq PCTSEQ (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Outputs -======= -Truvari bench writes the following files to the `--output` directory. -| File | Description | -|----------------------|---------------------------------------------| -| tp-base.vcf.gz | True positive calls from the base VCF | -| tp-comp.vcf.gz | True positive calls from the comparison VCF | -| fp.vcf.gz | False positive calls from comparison | -| fn.vcf.gz | False negative calls from base | -| summary.json | Json output of performance stats | -| params.json | Json output of parameters used | -| candidate.refine.bed | Bed file of regions for `refine` | -| log.txt | Run's log | - -summary.json ------------- -Stats generated by benchmarking are written to `summary.json`.
- -| Metric | Definition | -|----------------|------------------------------------------------------------| -| TP-base | Number of matching calls from the base vcf | -| TP-comp | Number of matching calls from the comp vcf | -| FP | Number of non-matching calls from the comp vcf | -| FN | Number of non-matching calls from the base vcf | -| precision | TP-comp / (TP-comp + FP) | -| recall | TP-base / (TP-base + FN) | -| f1 | 2 * ((recall * precision) / (recall + precision)) | -| base cnt | Number of calls in the base vcf | -| comp cnt | Number of calls in the comp vcf | -| TP-comp_TP-gt | TP-comp with genotype match | -| TP-comp_FP-gt | TP-comp without genotype match | -| TP-base_TP-gt | TP-base with genotype match | -| TP-base_FP-gt | TP-base without genotype match | -| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | -| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | -| weighted | Metrics weighted by variant sequence/size similarity | - -The `gt_matrix` is a table. For example: -```json -"gt_matrix": { - "(0, 1)": { - "(0, 1)": 500, - "(1, 1)": 10 - }, - "(1, 1)": { - "(1, 1)": 800, - "(0, 1)": 20 - } -} -``` -Represents -> -``` -comp (0,1) (1,1) -base -(0,1) 500 10 -(1,1) 20 800 -``` - -Added annotations ------------------ -The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory. - -| Anno | Definition | -|-------------------|-----------------------------------------------------------------------------------------------------------------| -| TruScore | Truvari score for similarity of match.
`((pctseq + pctsize + pctovl) / 3 * 100)` | -| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | -| PctSizeSimilarity | Pct size similarity between this variant and its closest match | -| PctRecOverlap | Percent reciprocal overlap of the two calls | -| StartDistance | Distance of the base call's start from comparison call's start | -| EndDistance | Distance of the base call's end from comparison call's end | -| SizeDiff | Difference in size of base and comp calls | -| GTMatch | Base/comp calls' Genotypes match | -| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | - - -Refining bench output -===================== -As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -truvari refine --regions result/candidate.refine.bed \ - --reference reference.fasta \ - --recount --use-region-coords \ - result/ -``` -See [[refine wiki|refine]] for details. - -Comparing Sequences of Variants -=============================== - -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. - -## Unroll -The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c).
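In code, the unroll operation amounts to a rotation of the ALT sequence. A simplified sketch, not Truvari's actual implementation:

```python
def unroll(alt_seq, dist):
    # Treat the ALT as circular: rotating its first `dist` bases to the end
    # gives the representation of the same insertion `dist` bases away.
    dist %= len(alt_seq)
    return alt_seq[dist:] + alt_seq[:dist]

# The INS 'ab' at POS 2 becomes identical to the INS 'ba' at POS 1
# after rolling 2 - 1 = 1 base.
print(unroll("ab", 1))  # -> ba
```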
- -The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. - -This unroll method has a number of benefits and a couple of considerations, including: -* not needing a `--reference` for comparison, which saves I/O time -* increasing the number of correctly matching SVs -* decreasing the number of 'suspect' matches in smaller size regimes -* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity - -## Reference context -For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -## SVs without sequences - -If the base or comp vcfs do not have sequence resolved calls (e.g.
`<DEL>`), simply set `--pctseq=0` to turn off -sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If -`--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared. - -Controlling the number of matches -================================= - -How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`. - -* `single` (the default option) allows each variant to participate in up to one match. -* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1). -* `multi` allows variants to participate in all available matches. - -As an example, imagine we have three variants in a pVCF with two samples we want to compare. - -``` -CHROM POS ID REF ALT base comp -chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0 -chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1 -chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1 -``` - -To compare samples inside the same vcf, we would use the command: -```bash -truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a -``` - -This VCF produces different results depending on the `--pick` parameter - -| Parameter | ID1 State | ID2 State | ID3 State | -|-----------|-----------|-----------|-----------| -| single | TP | FP | FP | -| ac | TP | TP | FP | -| multi | TP | TP | TP | - ---dup-to-ins -============ - -Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e.
the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0` - ---sizemin and --sizefilt -======================== - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It can -be less than `sizemin` for edge case variants. - -For example: Imagine `sizemin` is set at 50 and `sizefilt` at 30, and a 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls could be considered matching. However, if we removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between `[sizefilt,sizemin)` to find matches. - -This has the side effect of artificially inflating specificity. For example, if that same 49bp call described -above were below the similarity threshold, it would not be classified as a FP since it is below the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. 
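The containment rule described above can be sketched as follows (an illustrative helper, not Truvari's actual implementation):

```python
def in_include_region(var_start, var_end, region_start, region_end):
    # Truvari-style: both the start AND the end must fall inside the same
    # include region, whereas `bedtools intersect -u` keeps any overlap.
    return region_start <= var_start and var_end <= region_end

print(in_include_region(100, 150, 90, 200))  # fully contained -> True
print(in_include_region(100, 250, 90, 200))  # end spills past the region -> False
```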
- -Extending an Include Bed ------------------------- -The option `--extend` extends the regions of interest (set in the `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant in the extended regions matches a base variant in the original regions, it becomes a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only the base variants longer than sizemin (50 by default) are considered, but they are allowed to match comparison variants that are sizefilt (30bp by default) or longer. - -See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details. - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* use `--pick` method to sort and annotate variants with their best match -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/TruvariBenchMethod.png) - -Weighted Performance -==================== -An experimental feature is weighted performance metrics. For normal performance metrics, every variant has a weight of 1 i.e. a single True Positive is worth +1 to the TP counts. With 'weighted' performance, variants' worth is equivalent to the `PctSeqSimilarity` or `PctSizeSimilarity`. For example, if a TP base variant has 90% sequence similarity to its counterpart, we add +0.9 to the TP-base and +0.1 to the FN.
This creates counts that approximate how many variant bases are shared between the base/comparison VCFs. This feature is interesting, but it has yet to be proven to have any practical value. diff --git a/docs/v4.1.0/collapse.md b/docs/v4.1.0/collapse.md deleted file mode 100644 index aa29a74d..00000000 --- a/docs/v4.1.0/collapse.md +++ /dev/null @@ -1,144 +0,0 @@ -`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via: -```bash -bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. - -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior -================= -When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. 
-Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---hap mode -========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1. - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode -============ -Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means each variant needs to match only at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has only two annotations added to the `INFO`. -- `CollapseId` - the identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - the number of variants collapsed into this variant - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]].
Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. - -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug] - [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain] - [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --no-consolidate Skip consolidation of sample genotype fields (True) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null - from all neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B MINHAPLEN, --minhaplen MINHAPLEN - Minimum haplotype sequence length to create (50) - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Min pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v4.1.0/consistency.md b/docs/v4.1.0/consistency.md deleted file mode 100644 index 41fd599c..00000000 --- a/docs/v4.1.0/consistency.md +++ /dev/null @@ -1,170 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] [-j] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM:POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit - -j, --json Output report in json format -``` -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
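To make the matching and grouping concrete, here's a minimal sketch (with hypothetical helper names, not the actual implementation) of how a call's key and its presence/absence `Group` string can be derived, assuming each VCF has been reduced to a set of keys:

```python
def call_key(chrom, pos, vid, ref, alt):
    # The exact-match key described above: CHROM:POS ID REF ALT
    return f"{chrom}:{pos} {vid} {ref} {alt}"

def presence_group(key, call_sets):
    # One character per VCF in input order: 1 if present, 0 if absent
    return "".join('1' if key in calls else '0' for calls in call_sets)

file_a = {call_key("chr1", 100, ".", "A", "ATT")}
file_b = set()
file_c = {call_key("chr1", 100, ".", "A", "ATT")}
key = call_key("chr1", 100, ".", "A", "ATT")
print(presence_group(key, [file_a, file_b, file_c]))  # -> 101
```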
 - -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the fewest calls unique to it, shown on the `Group 100` line. - -Json -==== -Below is a consistency report in json format.
-```json -{ - "vcfs": [ - "repo_utils/test_files/variants/input1.vcf.gz", - "repo_utils/test_files/variants/input2.vcf.gz", - "repo_utils/test_files/variants/input3.vcf.gz" - ], - "total_calls": 3513, - "num_vcfs": 3, - "vcf_counts": { - "repo_utils/test_files/variants/input1.vcf.gz": 2151, - "repo_utils/test_files/variants/input2.vcf.gz": 1783, - "repo_utils/test_files/variants/input3.vcf.gz": 2065 - }, - "shared": [ - { - "vcf_count": 3, - "num_calls": 701, - "call_pct": 0.1995445488186735 - }, - { - "vcf_count": 2, - "num_calls": 1084, - "call_pct": 0.3085681753487048 - }, - { - "vcf_count": 1, - "num_calls": 1728, - "call_pct": 0.4918872758326217 - } - ], - "detailed": [ - { - "group": "111", - "total": 701, - "total_pct": 0.1995445488186735, - "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326, - "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318, - "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683 - }, - { - "group": "001", - "total": 645, - "total_pct": 0.18360375747224594, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166 - }, - { - "group": "100", - "total": 598, - "total_pct": 0.17022487902077996, - "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "101", - "total": 487, - "total_pct": 0.1386279533162539, - "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884 - }, - { - "group": "010", - "total": 485, - "total_pct": 0.13805863933959578, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905, - "repo_utils/test_files/variants/input3.vcf.gz": 
0 - }, - { - "group": "110", - "total": 365, - "total_pct": 0.10389980074010817, - "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517, - "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "011", - "total": 232, - "total_pct": 0.06604042129234272, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667, - "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167 - } - ] -} -``` \ No newline at end of file diff --git a/docs/v4.1.0/divide.md b/docs/v4.1.0/divide.md deleted file mode 100644 index 3bbad3cb..00000000 --- a/docs/v4.1.0/divide.md +++ /dev/null @@ -1,58 +0,0 @@ -Divide a VCF into independent shards. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can then be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever Truvari command we need. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples.
First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script: - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands. - -``` -usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR - -Divide a VCF into independent parts - -positional arguments: - VCF VCF to split - DIR Output directory to save parts - -options: - -h, --help show this help message and exit - -b BUFFER, --buffer BUFFER - Buffer to make mini-clusters (1000) - -m MIN, --min MIN Minimum number of entries per-vcf (100) - --no-compress Don't attempt to compress/index sub-VCFs - -T THREADS, --threads THREADS - Number of threads for compressing/indexing sub-VCFs (1) -``` \ No newline at end of file diff --git a/docs/v4.1.0/phab.md b/docs/v4.1.0/phab.md deleted file mode 100644 index 48ed817a..00000000 --- a/docs/v4.1.0/phab.md +++ /dev/null @@ -1,155 +0,0 @@ -Introduction ------------- - -Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations.
We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. - -This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there are plenty of distinct alleles with >= 50% homology. - -So how do we deal with inconsistent representations? Perhaps we could build [dynamic-thresholds](https://github.com/ACEnglish/truvari/discussions/94) that would increase or decrease depending on measurements (e.g. variant distance, sequence context); but building a reliable model would require an expansive training dataset. - -In an ideal world, we would simply get rid of inconsistent variant representations. - -`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. - -`truvari phab` works by wrapping a number of trusted bioinformatics tools. Reference from the region being analyzed is fetched with `samtools`. Haplotypes are reconstructed from VCFs using `bcftools consensus`. The sequences are then aligned using [MAFFT](https://mafft.cbrc.jp/alignment/software/) - a fast multiple-sequence alignment software. Truvari parses the resultant MSA and calls variants to reproduce the region's VCF.
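The last step - turning an MSA back into variants - can be illustrated with a toy pairwise version. This `msa_to_variants` helper is hypothetical and deliberately simplified (it reports column-by-column events and doesn't merge adjacent gap columns the way a real caller would):

```python
def msa_to_variants(ref_row, hap_row):
    """Call variants from two aligned MSA rows: a gap ('-') in the
    reference row is an insertion, a gap in the haplotype row is a
    deletion, and a base mismatch is a SNP."""
    variants = []
    ref_pos = 0  # 0-based position on the ungapped reference
    for r, h in zip(ref_row, hap_row):
        if r == '-':
            variants.append((ref_pos, 'INS', h))
        elif h == '-':
            variants.append((ref_pos, 'DEL', r))
        elif r != h:
            variants.append((ref_pos, 'SNP', h))
        if r != '-':
            ref_pos += 1
    return variants

# Reference ACGT vs. a haplotype with the C deleted and a C inserted later
print(msa_to_variants("ACG-T", "A-GCT"))  # -> [(1, 'DEL', 'C'), (3, 'INS', 'C')]
```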
 - -Requirements ------------- -Since `truvari phab` calls mafft (v7.505) on the command line, the [mafft](https://mafft.cbrc.jp/alignment/software/) executable must be downloaded and available in the `$PATH`. - -Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use. - -Example ------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region. -```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants 1bp and larger (`--sizemin 1 --sizefilt 1`, which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-comp": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235 -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment.
We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region. -```bash -truvari phab --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --bSample HG002 \ - --cSample syndip \ - --reference phab_ref.fa \ - --region chr1:700-900 \ - -o phab_result/ -``` - -The main file `phab` produces is the `phab_result/output.vcf.gz` which holds our harmonized variants. In it we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after running `phab`. - -```bash -truvari bench -b phab_result/output.vcf.gz \ - -c phab_result/output.vcf.gz \ - -S 1 -s 1 \ - --no-ref a \ - --bSample HG002 \ - --cSample syndip \ - -o harmonized_bench/ -``` -Looking at `harmonized_bench/summary.txt` shows: -```json -{ - "TP-base": 8, - "TP-comp": 8, - "FP": 0, - "FN": 0, - "precision": 1.0, - "recall": 1.0, - "f1": 1.0 -} -``` -Now there is no difference between our two sets of variants in this region. - -For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes. -```bash -$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c - 1 0/1 1/0 - 1 1/0 0/1 - 6 1/1 1/1 -``` -(We can ignore the phasing differences (`0/1` vs. `1/0`); these pipelines simply reported the parental alleles in a different order.) - -MSA --- - -If you read the `truvari phab --help`, you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example/output.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example/output.vcf.gz | sort | uniq | wc -l -``` -This shows that the variants previously sat at `98` distinct positions but now occupy just `16`. - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example/output.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50, which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... - 3 69 1 35 - 1 89 1 40 - 8 109 1 53 - 1 132 1 56 - 1 150 1 81 -``` - -(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) - -Limitations ------------ -* MAFFT, while very fast, is impractical for very long sequences and maybe impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions. -* By giving the variants new representations, it becomes harder to count how many TP/FP calls the original pipeline created.
 -* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try unphased variants out, but proceed with caution. - diff --git a/docs/v4.1.0/refine.md b/docs/v4.1.0/refine.md deleted file mode 100644 index 823e8cdb..00000000 --- a/docs/v4.1.0/refine.md +++ /dev/null @@ -1,110 +0,0 @@ -As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result, we present the tool `refine`. - -Quick Start -=========== - -After making a `bench` result: -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -``` -Use `refine` on the `result/` -```bash -truvari refine -r subset.bed -f ref.fa result/ -``` - -Output -====== -* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions.
Same structure as [[summary.json|bench#summaryjson]] -* `refine.regions.txt` - Tab-delimited file with per-region variant counts -* `refine.region_summary.json` - Per-region performance metrics -* `phab_bench/` - Bench results on the harmonized variants - -To see an example output, look at [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one) - -Using `refine.regions.txt` -========================== -| Column | Description | -| ----------------- | --------------------------------------- | -| chrom | Region's chromosome | -| start | Region's start | -| end | Region's end | -| in_tpbase | Input's True Positive base count | -| in_tp | Input's True Positive comparison count | -| in_fp | Input's false positive count | -| in_fn | Input's false negative count | -| refined | Boolean for whether region was re-evaluated | -| out_tpbase | Output's true positive base count | -| out_tp | Output's true positive comparison count | -| out_fn | Output's false negative count | -| out_fp | Output's false positive count | -| state | True/False state of the region | - - -Performance by Regions -====================== - -Because `truvari phab` can alter variant counts during harmonization, one may wish to assess the performance on a per-region basis rather than the per-variant basis. In the `refine.regions.txt`, a column `state` will have a TP/FN/FP value as defined by the following rules: - -```python -false_pos = (data['out_fp'] != 0) -false_neg = (data['out_fn'] != 0) -any_false = false_pos | false_neg - -true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false - -true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1) - -baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0) -compP = (data['out_tp'] != 0) | (data['out_fp'] != 0) -``` - -This logic has two edge cases to consider.
1) a region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp, but no other variants because a neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions. - -These by-region state counts are summarized and written to `refine.region_summary.json`. The definitions of metrics inside this json are: -| Key | Definition | Formula | -|--------|----------------------------------------------|---------------------------------| -| TP | True Positive region count | | -| TN | True Negative region count | | -| FP | False Positive region count | | -| FN | False Negative region count | | -| base P | Regions with base variant(s) | | -| base N | Regions without base variant(s) | | -| comp P | Regions with comparison variant(s) | | -| comp N | Regions without comparison variant(s) | | -| PPV | Positive Predictive Value (a.k.a. precision) | TP / comp P | -| TPR | True Positive Rate (a.k.a. recall) | TP / base P | -| TNR | True Negative Rate (a.k.a. specificity) | TN / base N | -| NPV | Negative Predictive Value | TN / comp N | -| ACC | Accuracy | (TP + TN) / (base P + base N) | -| BA | Balanced Accuracy | (TPR + TNR) / 2 | -| F1 | f1 score | 2 * ((PPV * TPR) / (PPV + TPR)) | -| UND | Regions with an undetermined state | | - -Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate them from the by-variant performance reports.
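The rules above can be condensed into a per-region classifier. This sketch (a hypothetical `region_state` helper in plain Python, rather than the pandas expressions quoted above) shows how a single region's counts map to a state:

```python
def region_state(out_tpbase, out_tp, out_fn, out_fp):
    """Map one region's output variant counts to its state."""
    if out_fp and out_fn:
        return "FP,FN"  # edge case 1: counted as both FP and FN
    if out_fp:
        return "FP"
    if out_fn:
        return "FN"
    if out_tp and out_tpbase:
        return "TP"
    if not (out_tpbase or out_tp):
        return "TN"  # no variants at all in the region
    return "UNK"  # edge case 2: 'variant bleed' from a neighboring region

print(region_state(1, 1, 0, 0))  # -> TP
print(region_state(0, 1, 0, 0))  # -> UNK
```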
 - -`--use-original-vcfs` -===================== - -By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, this contains a filtered subset of the variants originally provided to `bench` since it removes variants e.g. below `--sizemin` or not `--passonly`. - -With the `--use-original-vcfs` parameter, all of the original calls from the input VCFs are fetched. This parameter is useful for recovering matches in situations where variants in one call set are split into two variants which are smaller than the minimum size analyzed by `bench`. For example, imagine a base VCF with a 20bp DEL, a comp VCF with two 10bp DELs, and `bench --sizemin 20` was used. `--use-original-vcfs` will consider the two 10bp comp variants during phab harmonization with the 20bp base DEL. - - -`--regions` -=========== - -This parameter specifies which regions to re-evaluate. If this is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--includebed` is subset to only those intersecting `--regions`. - -This parameter is helpful for cases when the `--includebed` is not the same set of regions that a caller analyzes. For example, if a TR caller only discovers short tandem repeats (STR), but a benchmark has TRs of all lengths, it isn't useful to benchmark against the non-STR variants. Therefore, you can run `bench` on the full benchmark's regions (`--includebed`), and automatically subset to only the regions analyzed by the caller with `refine --regions`. - -Note that the larger these regions are, the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`.
For example, if a `--region` partially overlaps an `--includebed` region, the calls you analyze may not be a subset of those examined during the original `bench` run. Therefore, the `*summary.json` should be compared with caution. - -`--use-region-coords` -===================== - -When intersecting `--includebed` with `--regions`, use the `--regions` coordinates. By default, `refine` will prefer the `--includebed` coordinates. This is helpful for when the `--regions` boundaries should be used instead of the original bench result's `--includebed`. - -`--reference` -============= - -By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants. \ No newline at end of file diff --git a/docs/v4.1.0/segment.md b/docs/v4.1.0/segment.md deleted file mode 100644 index 953e5822..00000000 --- a/docs/v4.1.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. - -This tool adds an INFO field `SEGCNT` which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`. - -In the below example, we have three deletions found across three samples.
 - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - -The `segment` added annotations for the regions would then be: -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | \ No newline at end of file diff --git a/docs/v4.1.0/stratify.md b/docs/v4.1.0/stratify.md deleted file mode 100644 index 353dcab5..00000000 --- a/docs/v4.1.0/stratify.md +++ /dev/null @@ -1,58 +0,0 @@ -`stratify` is a helper utility for counting variants within bed regions, which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are four VCFs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts. - -For example: -```bash -$ truvari stratify input.bed bench/ -chrom start end tpbase tp fn fp -chr20 280300 280900 0 0 0 0 -chr20 100000 200000 1 1 0 0 -chr20 642000 642350 1 1 2 1 -``` - -The output from this can then be parsed to generate more details: - -```python -import pandas as pd -import truvari - -df = pd.read_csv("stratify.output.txt", sep='\t') - -# The input.bed didn't have a header (so we couldn't use the `--header` parameter); name the columns -df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp'] - -# Create the precision, recall, and f1 for each row -metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1) - -# metrics is now a DataFrame with a single column of tuples, let's separate them into columns -metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"]) - -# Extend the dataframe's columns -df = df.join(metrics) -df.head() -``` -Which gives the result: -``` - chrom start end tpbase tp fn fp precision recall
f1 -0 chr20 135221 239308 1 1 0 0 1.00 1.00 1.000000 -1 chr20 260797 465632 3 3 3 1 0.75 0.50 0.600000 -2 chr20 465866 622410 1 1 0 0 1.00 1.00 1.000000 -3 chr20 623134 655257 1 1 3 1 0.50 0.25 0.333333 -4 chr20 708338 732041 1 1 1 0 1.00 0.50 0.666667 -``` - -``` -usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF - -Count variants per-region in vcf - -positional arguments: - BED Regions to process - VCF Truvari bench result directory or a single VCF - -optional arguments: - -h, --help show this help message and exit - -o OUT, --output OUT Output bed-like file - --header Input regions have header to preserve in output - -w, --within Only count variants contained completely within region boundaries - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.1.0/vcf2df.md b/docs/v4.1.0/vcf2df.md deleted file mode 100644 index 14d9f06c..00000000 --- a/docs/v4.1.0/vcf2df.md +++ /dev/null @@ -1,81 +0,0 @@ -We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* chrom: variant chromosome -* start: 0-based start from pysam.VariantRecord.start -* end: 0-based end from pysam.VariantRecord.stop -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of whether the filter is empty or PASS - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added.
For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. `vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column: -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-comp.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameters used for SV discovery over multiple samples.
After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with the command: -```python -result = pyranges.PyRanges(df.rename(columns={'chrom':"Chromosome", "start":"Start", "end":"End"})) -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL - -Takes a vcf and creates a data frame. Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -n, --no-prefix Don't prepend sample name to format columns - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - -c LVL, --compress LVL - Compression level for joblib 0-9 (3) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.0/Citations.md b/docs/v4.2.0/Citations.md deleted file mode 100644 index d600b860..00000000 --- a/docs/v4.2.0/Citations.md +++ /dev/null @@ -1,30 +0,0 @@ -# Citing Truvari - -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022).
https://doi.org/10.1186/s13059-022-02840-6 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. -* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A 
simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) -* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) -* [Automated filtering of genome-wide large deletions through an ensemble deep learning
framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git a/docs/v4.2.0/Comparing-two-SV-programs.md b/docs/v4.2.0/Comparing-two-SV-programs.md deleted file mode 100644 index 6463f90d..00000000 --- a/docs/v4.2.0/Comparing-two-SV-programs.md +++ /dev/null @@ -1,67 +0,0 @@ -A frequent application of comparing SVs is to perform a 'bakeoff' of performance -between two SV programs against a single set of base calls. - -Beyond looking at the Truvari results/report, you may like to investigate what calls -are different between the programs. - -Below is a set of scripts that may help you generate those results. For our examples, -we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs. - -*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf. - -Generate the Truvari report for Asvs and Bsvs -============================================= - -```bash -truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ... -truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ... -``` -Consistency -=========== -The simplest way to compare the programs is to get the intersection of TPbase calls from the two reports. -```bash -truvari consistency cmp_A/tp-base.vcf cmp_B/tp-base.vcf -``` -See [[consistency wiki|consistency]] for details on the report created.
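At its core, this check is an intersection of record IDs. Below is a minimal, self-contained Python sketch of that idea, operating on inline strings that stand in for the two `tp-base.vcf` files (a real run would read the files, and `truvari consistency` produces a much fuller report); it relies on the assumption noted above that each Gsvs record has a unique ID.

```python
# Sketch: intersect record IDs from two tp-base.vcf files.
# Inline strings below are hypothetical stand-ins for cmp_A/tp-base.vcf
# and cmp_B/tp-base.vcf; assumes every record ID is unique.
def vcf_ids(vcf_text):
    """Return the set of IDs (3rd column) from non-header VCF records."""
    return {line.split('\t')[2]
            for line in vcf_text.strip().splitlines()
            if line and not line.startswith('#')}

tp_a = "chr1\t100\tsv1\tA\t<DEL>\t.\tPASS\t.\nchr1\t500\tsv2\tA\t<INS>\t.\tPASS\t."
tp_b = "chr1\t100\tsv1\tA\t<DEL>\t.\tPASS\t.\nchr1\t900\tsv3\tA\t<DEL>\t.\tPASS\t."

shared = vcf_ids(tp_a) & vcf_ids(tp_b)   # base calls recovered by both programs
only_a = vcf_ids(tp_a) - vcf_ids(tp_b)   # base calls recovered only by Asvs
print(sorted(shared), sorted(only_a))    # ['sv1'] ['sv2']
```

Set operations on the ID column are all the manual recipes below are doing with `grep`/`cut`/`uniq`.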
 - -Below are older notes to manually create a similar report to what one can make using `truvari consistency`. - -Combine the TPs within each report -================================== - -```bash -cd cmp_A/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt -cd ../cmp_B/ -paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt -``` - -Grab the FNs missed by only one program -======================================= - -```bash -(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt -``` - -Pull the TP sets' difference -============================ - -```bash -cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt -cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt -``` - -To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`. - -To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`. - -Shared FPs between the programs -=============================== - -All of the work above has been about how to analyze the true positives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again. - -```bash -bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz -bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz -truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ... -``` \ No newline at end of file diff --git a/docs/v4.2.0/Development.md b/docs/v4.2.0/Development.md deleted file mode 100644 index ccb38493..00000000 --- a/docs/v4.2.0/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/).
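As one example of what those helpers compute, the sketch below re-implements the precision/recall/F1 arithmetic that `truvari.performance_metrics` reports (argument order follows the `stratify` example elsewhere in these docs). This is an illustrative re-implementation, not the packaged helper; prefer the real function, whose zero-denominator behavior may differ.

```python
# Illustrative re-implementation of the precision/recall/F1 arithmetic
# reported by truvari.performance_metrics. Counts follow bench's
# tp-base / tp / fn / fp convention; not the packaged helper itself.
def performance_metrics(tpbase, tp, fn, fp):
    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tpbase / (tpbase + fn) if (tpbase + fn) else 0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0
    return precision, recall, f1

# Matches the stratify example row with 3 TP-base, 3 TP, 3 FN, 1 FP
print(performance_metrics(3, 3, 3, 1))  # (0.75, 0.5, 0.6)
```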
 - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts, -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage.
- -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant to [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. 
 - -# Creating a release -Follow these steps to create a release: - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Change Updates Wiki -4) Freeze the Wiki -5) Ensure all code is checked in -6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -7) Use GitHub action to make a testpypi release -8) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -9) Use GitHub action to make a pypi release -10) Download release-tarball.zip from step #9’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md b/docs/v4.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md deleted file mode 100644 index eb45ce30..00000000 --- a/docs/v4.2.0/Edit-Distance-Ratio-vs-Sequence-Similarity.md +++ /dev/null @@ -1,55 +0,0 @@ -By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different from the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm). - -To show this difference, consider the following two sequences: - -``` - AGATACAGGAGTACGAACAGTACAGTACGA - |||||||||||||||*|||||||||||||| -ATCACAGATACAGGAGTACGTACAGTACAGTACGA - -30bp Aligned -1bp Mismatched (96% similarity) -5bp Left-Trimmed (~14% of the bottom sequence) -``` - -The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.
- - -```python -import swalign -import Levenshtein -import edlib - -seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA" -seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA" - -scoring = swalign.NucleotideScoringMatrix(2, -1) -alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5) -aln = alner.align(seq1, seq2) -mat_tot = aln.matches -mis_tot = aln.mismatches -denom = float(mis_tot + mat_tot) -if denom == 0: - ident = 0 -else: - ident = mat_tot / denom -scr = edlib.align(seq1, seq2) -totlen = len(seq1) + len(seq2) - -print('swalign', ident) -# swalign 0.966666666667 -print('levedit', Levenshtein.ratio(seq1, seq2)) -# levedit 0.892307692308 -print('edlib', (totlen - scr["editDistance"]) / totlen) -# edlib 0.9076923076923077 -``` - -Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio. - -If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%. - -[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed. - -The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference. - -However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`. \ No newline at end of file diff --git a/docs/v4.2.0/Home.md b/docs/v4.2.0/Home.md deleted file mode 100644 index 47fbc626..00000000 --- a/docs/v4.2.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. 
For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Citation: -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. - -Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. 
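Since v4.0 errors out on such records, it can help to spot them before running. Multi-allelic records are identifiable by a comma in the ALT column; below is a hypothetical pre-flight sketch of that check in Python (plain-text parsing over stand-in records, including the example from the multi-allelic wiki page). The actual splitting should still be done with `bcftools norm`.

```python
# Hypothetical pre-flight check: report multi-allelic records, which
# Truvari v4.0+ rejects. A comma in the ALT column (field 5) marks a
# multi-allelic site; split these with `bcftools norm -m-any`.
def multiallelic_sites(vcf_lines):
    for line in vcf_lines:
        if line.startswith('#'):
            continue
        fields = line.split('\t')
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if ',' in alt:
            yield f"{chrom}:{pos}"

# Stand-in records; the first is the multi-allelic example from the wiki
records = [
    "chr2\t1948201\t.\tT\tTACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC\t.\tPASS\t.",
    "chr2\t2000000\t.\tA\tAGTC\t.\tPASS\t.",
]
print(list(multiallelic_sites(records)))  # ['chr2:1948201']
```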
 - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[anno|anno]] - - [[bench|bench]] - - [[collapse|collapse]] - - [[consistency|consistency]] - - [[divide|divide]] - - [[phab|phab]] - - [[refine|refine]] - - [[segment|segment]] - - [[stratify|stratify]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v4.2.0/Installation.md b/docs/v4.2.0/Installation.md deleted file mode 100644 index a929d245..00000000 --- a/docs/v4.2.0/Installation.md +++ /dev/null @@ -1,56 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.2.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 -m pip install . -``` - -To see a list of all available tags, run: -``` -git tag -l -``` - -If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba / Conda -============= -NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. -Users can follow the instructions here (https://mamba.readthedocs.io/en/latest/installation.html) to install mamba, a faster conda-compatible package manager. - -Create an environment with Truvari and its dependencies:
-``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. Current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v4.2.0/Multi-allelic-VCFs.md b/docs/v4.2.0/Multi-allelic-VCFs.md deleted file mode 100644 index fd0eb23e..00000000 --- a/docs/v4.2.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... 
-``` - -Then pre-process the VCFs with bcftools: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v4.2.0/Updates.md b/docs/v4.2.0/Updates.md deleted file mode 100644 index 6f126054..00000000 --- a/docs/v4.2.0/Updates.md +++ /dev/null @@ -1,245 +0,0 @@ -# Truvari 4.2 -*January 12, 2024* -* `collapse` - * New parameter `--gt` disallows intra-sample events to collapse ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--gt)) - * New parameter `--intra` for consolidating SAMPLE information during intra-sample collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--intra)) - * Preserve phasing information when available - * Faster O(n-1) algorithm instead of O(n^2) - * Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons - * Fixed rare non-determinism error in cases where multiple variants are at the same position and equal qual/ac could be ordered differently. -* `phab` - * Correct sample handling with `--bSamples` `--cSamples` parameters - * Faster generation of consensus sequence - * Resolved 'overlapping' variant issue causing variants to be dropped - * New `poa` approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate. -* `bench` - * New, easier `MatchId` field to track which baseline/comparison variants match up [details](https://github.com/ACEnglish/truvari/wiki/MatchIds) - * `entry_is_present` method now considers partial missing variants (e.g. `./1`) as present - * Removed the 'weighted' metrics from `summary.json` -* `consistency` - * Fixed issue with counting duplicate records - * Added flag to optionally ignore duplicate records -* `anno svinfo` now overwrites existing SVLEN/SVTYPE info fields -* general - * Reduced fn matches for unroll sequence similarity by reporting maximum of multiple manipulations of variant sequence (roll up/down/none). 
Comes at a small, but reasonable, expense of some more fp matches. - * Bump pysam version - * Fixed bug in `unroll` sequence similarity that sometimes rolled from the wrong end - * Fixed bug for handling of None in ALT field - * `truvari.compress_index_vcf` forces overwriting of tabix index to prevent annoying crashes - - -# Truvari 4.1 -*August 7, 2023* - -* `bench` - * Creates `candidate.refine.bed` which hooks into `refine` on whole-genome VCFs [details](https://github.com/ACEnglish/truvari/wiki/bench#refining-bench-output) - * `--recount` for correctly assessing whole-genome refinement results - * experimental 'weighted' summary metrics [details](https://github.com/ACEnglish/truvari/wiki/bench#weighted-performance) - * Unresolved SVs (e.g. `ALT == `) are filtered when `--pctseq != 0` -* `phab` - * ~2x faster via reduced IO from operating in stages instead of per-region - * Removed most external calls (e.g. samtools doesn't need to be in the environment anymore) - * new `--align wfa` allows much faster (but slightly less accurate) variant harmonization - * increased determinism of results [details](https://github.com/ACEnglish/truvari/commit/81a9ab85b91b0c530f9faeedfa4e7e0d68a5e8c2) -* `refine` - * Faster bed file intersection of `--includebed` and `--regions` - * Refine pre-flight check - * Correct refine.regions.txt end position from IntervalTree correction - * Better refine region selection with `--use-original` - * `--use-includebed` switched to `--use-region-coords` so that default behavior is to prefer the includebed's coordinates - * `--use-original-vcfs` to use the original pre-bench VCFs - * `refine.variant_summary.json` is cleaned of uninformative metrics -* `stratify` - * parallel parsing of truvari directory to make processing ~4x faster -* `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes -* `anno grpaf` - expanded annotation info fields -* `anno density` - new parameter `--stepsize` for sliding windows -* `collapse` - * New optional
`--median-info` fields [#146](https://github.com/ACEnglish/truvari/issues/146) -* Minor updates - * Fix some `anno` threading on macOS [#154](https://github.com/ACEnglish/truvari/issues/154) - * Monomorphic/multiallelic check fix in `bench` - * `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key - * Slightly slimmer docker container - -# Truvari 4.0 -*March 13, 2023* - -As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools and how a couple of tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. - -* New `refine` command for refining benchmarking results. [Details](refine) -* `bench` - * [Unroll](bench#unroll) is now the default sequence comparison approach. - * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) - * The `summary.txt` is now named `summary.json` - * Outputs parameters to `params.json` - * Output VCFs are sorted, compressed, and indexed - * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) - * Renamed `--pctsim` parameter to `--pctseq` - * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported - * Fixed bug where `INFO/Multi` wasn't being properly applied - * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. 
- * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` -* `collapse` - * Renamed `--pctsim` parameter to `--pctseq` - * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` - * Better output sorting which may allow pipelines to be a little faster. -* `vcf2df` - * More granular sizebins for `[0,50)` including better handling of SNPs - * `--multisample` is removed. Now automatically add all samples with `--format` - * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges -* `anno` - * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. - * Added `addid` -* `consistency` - * Slight speed improvement - * Better json output format -* `segment` - * Added `--passonly` flag - * Changed UI, including writing to stdout by default - * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN -* API - * Began a focused effort on improving re-usability of Truvari code. - * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). - * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) - * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) -* general - * msa2vcf now left-trims and decomposes variants into indels - * Functional tests reorganization - * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) - * Removed progressbar and Levenshtein dependencies as they are no longer used. 
-# Truvari 3.5 -*August 27, 2022* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT) - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](../phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g.
2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` in favor of `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just comparing REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match.
-* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively.
-* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported... and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory.
It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. 
In order to keep up with the retirement of Python 2.7 https://pythonclock.org/ -We're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v4.2.0/anno.md deleted file mode 100644 index 55835902..00000000 --- a/docs/v4.2.0/anno.md +++ /dev/null @@ -1,494 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed.
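The GCPCT value reduces to counting G/C bases over the relevant sequence. A minimal sketch of that idea, assuming a plain string input (this is illustrative, not the tool's code; `truvari anno gcpct` itself reads the VCF/reference and writes the INFO tag):

```python
def gc_percent(seq):
    """Percent of G/C bases in a sequence, as an int."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    return int(gc / len(seq) * 100)

# For a deletion, pass the deleted reference sequence;
# for an insertion, pass the ALT sequence.
print(gc_percent("ACGTGC"))  # → 66
```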
-``` -usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] - -Annotates GC content of SVs - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-o OUTPUT] [input] - -Annotates GTCounts of alleles - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants in a VCF. - -### Annotations added -| Field Name | Description | -|------------|-------------------------------------------------------------| -| TRF | Entry hits a tandem repeat region | -| TRFdiff | ALT TR copy difference from reference | -| TRFrepeat | Repeat motif | -| TRFovl | Percent of ALT covered by TRF annotation | -| TRFstart | Start position of discovered repeat | -| TRFend | End position of discovered repeat | -| TRFperiod | Period size of the repeat | -| TRFcopies | Number of copies aligned with the consensus pattern | -| TRFscore | Alignment score | -| TRFentropy | Entropy measure | -| TRFsim | Similarity of ALT sequence to generated motif faux sequence | - -### Details -Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)).
This file can be formatted for `truvari anno trf` via: -```bash -zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz -tabix anno.trf.bed.gz -``` -For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)). - -For insertions, by default the tool first tries to estimate which motif is contained in the alternate sequence. For each overlapping annotation, the copy number difference of the motif is calculated via `copy_diff = len(entry.alts[0][1:]) / anno["period"]`. Next, a 'faux sequence' is made from `copy_diff` copies of the motif. If the faux sequence's similarity to the insertion sequence is above `--motif-similarity`, that motif is considered the insertion's motif. If no estimate is above the `--motif-similarity`, the insertion is incorporated into the reference and TRF is run. If the discovered TRF hits match a motif in the tandem repeat regions file, that annotation is used. If the highest scoring TRF hit doesn't match the tandem repeats region file, the nominally de novo annotation is added to the insertion's vcf entry.
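The insertion estimation step can be sketched with standard-library pieces. This is an illustration, not Truvari's code: `difflib.SequenceMatcher` stands in for the edit-distance similarity the tool actually uses, the anchor-base handling of `entry.alts[0][1:]` is elided, and all names here are hypothetical.

```python
from difflib import SequenceMatcher

def estimate_motif(ins_seq, motif, period, similarity=0.9):
    """Build a faux sequence from copy_diff motif copies and score it
    against the insertion sequence (illustrative stand-in for the
    tool's sequence-similarity check)."""
    copy_diff = len(ins_seq) / period
    # Tile the motif and clip to the insertion's length
    faux = (motif * (int(copy_diff) + 1))[:len(ins_seq)]
    score = SequenceMatcher(None, faux, ins_seq).ratio()
    return copy_diff, score >= similarity

copy_diff, is_hit = estimate_motif("ATATATAT", "AT", period=2)
print(copy_diff, is_hit)  # → 4.0 True
```

When no candidate motif passes the threshold, the real tool falls back to running TRF on the insertion placed into the reference, as described above.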
- -``` -usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] - [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] - -Intersect vcf with reference tandem repeats and annotate -variants with the best fitting repeat motif and its copy number -relative to the reference - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) - -r REPEATS, --repeats REPEATS - Reference repeat annotations - -f REFERENCE, --reference REFERENCE - Reference fasta file - -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY - Motif similarity threshold (0.9) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -R, --regions-only Only write variants within --repeats regions (False) - --no-estimate Skip INS estimation procedure and run everything through TRF. (False) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization (5) - -t THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both reference (r), alternate (a) we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_".
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -R REGIONS, --regions REGIONS - Bed file of regions to parse (None) - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (1) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) -
-p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig - -nocut -div 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno remap - -Takes the allele's sequence, remaps it to the reference, and annotates based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to considered (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a.
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -positional arguments: - input VCF to annotate - -options: -
-h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] - -Adds SVTYPE and SVLEN INFO fields - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship of variants to annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, if we have a VCF of SVs and a GTF of genes/gene features from Ensembl, any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(annos.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] - [-e END] [-1] - [input] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --sizemin SIZEMIN Minimum size of variant to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap. Annotates -regions with no variants as 'sparse' and with greater than or equal to (mean + `--threshold` * standard -deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`.
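The dense/sparse classification reduces to a mean/standard-deviation cutoff over per-window variant counts. A small stand-alone sketch of that logic (illustrative only, not the tool's code; `classify_windows` is a hypothetical name):

```python
from statistics import mean, stdev

def classify_windows(counts, threshold=3):
    """Label each window's variant count as 'sparse' (zero variants),
    'dense' (>= mean + threshold * stdev), or unannotated ('')."""
    cutoff = mean(counts) + threshold * stdev(counts)
    labels = []
    for count in counts:
        if count == 0:
            labels.append("sparse")
        elif count >= cutoff:
            labels.append("dense")
        else:
            labels.append("")
    return labels

counts = [0, 4, 5, 6, 5, 40]
print(classify_windows(counts, threshold=1))
# → ['sparse', '', '', '', '', 'dense']
```

The real tool additionally handles the `--mask` bed and sliding windows (`--stepsize`), which are omitted here.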
- -``` -usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-s STEPSIZE] [-t THRESHOLD] [input] - -Identify 'dense' and 'sparse' variant windows of the genome - -positional arguments: - input Input VCF (/dev/stdin) - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -s STEPSIZE, --stepsize STEPSIZE - Window step size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have a `FORMAT/DP` falling into each of the user-defined bins. Bins are incremented using `bisect` e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1` - -``` -usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] - -Quick utility to count how many samples have >= Nx coverage per-variant - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-o OUTPUT] [input] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples.
For every group in `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. Adds INFO tags with suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. - -``` -usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] - -Add allele frequency annotations for subsets of samples - -positional arguments: - input VCF to annotate - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.0/bench.md b/docs/v4.2.0/bench.md deleted file mode 100644 index 0d027cb6..00000000 --- a/docs/v4.2.0/bench.md +++ /dev/null @@ -1,293 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ -``` - -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow them such that it is a fair comparison. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. 
When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility. - -Here is a rundown of each matching parameter. -| Parameter | Default | Definition | -|------------|---------|------------| -| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end | -| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | -| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | -| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | -| typeignore | False | Types don't need to match to compare calls. | - -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctseq PCTSEQ (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Outputs -======= -Truvari bench writes the following files to the `--output` directory. -| File | Description | -|----------------------|---------------------------------------------| -| tp-base.vcf.gz | True positive calls from the base VCF | -| tp-comp.vcf.gz | True positive calls from the comparison VCF | -| fp.vcf.gz | False positive calls from comparison | -| fn.vcf.gz | False negative calls from base | -| summary.json | Json output of performance stats | -| params.json | Json output of parameters used | -| candidate.refine.bed | Bed file of regions for `refine` | -| log.txt | Run's log | - -summary.json ------------- -Stats generated by benchmarking are written to `summary.json`.
- -| Metric | Definition | -|----------------|------------------------------------------------------------| -| TP-base | Number of matching calls from the base vcf | -| TP-comp | Number of matching calls from the comp vcf | -| FP | Number of non-matching calls from the comp vcf | -| FN | Number of non-matching calls from the base vcf | -| precision | TP-comp / (TP-comp + FP) | -| recall | TP-base / (TP-base + FN) | -| f1 | 2 * ((recall * precision) / (recall + precision)) | -| base cnt | Number of calls in the base vcf | -| comp cnt | Number of calls in the comp vcf | -| TP-comp_TP-gt | TP-comp with genotype match | -| TP-comp_FP-gt | TP-comp without genotype match | -| TP-base_TP-gt | TP-base with genotype match | -| TP-base_FP-gt | TP-base without genotype match | -| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | -| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | -| weighted | Metrics weighted by variant sequence/size similarity | - -The `gt_matrix` is a table. For example: -```json -"gt_matrix": { - "(0, 1)": { - "(0, 1)": 500, - "(1, 1)": 10 - }, - "(1, 1)": { - "(1, 1)": 800, - "(0, 1)": 20 - } -} -``` -Represents -> -``` -comp (0,1) (1,1) -base -(0,1) 500 10 -(1,1) 20 800 -``` - -Added annotations ------------------ -The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory. - -| Anno | Definition | -|-------------------|-----------------------------------------------------------------------------------------------------------------| -| TruScore | Truvari score for similarity of match.
`((pctseq + pctsize + pctovl) / 3 * 100)` | -| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | -| PctSizeSimilarity | Pct size similarity between this variant and its closest match | -| PctRecOverlap | Percent reciprocal overlap percent of the two calls | -| StartDistance | Distance of the base call's start from comparison call's start | -| EndDistance | Distance of the base call's end from comparison call's end | -| SizeDiff | Difference in size of base and comp calls | -| GTMatch | Base/comp calls' Genotypes match | -| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | - - -Refining bench output -===================== -As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -truvari refine --regions result/candidate.refine.bed \ - --reference reference.fasta \ - --recount --use-region-coords \ - result/ -``` -See [[refine wiki|refine]] for details. - -Comparing Sequences of Variants -=============================== - -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. - -## Unroll -The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c). 
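The rotation at the heart of unrolling can be sketched in a few lines. This is an illustrative toy, not Truvari's implementation; `unroll` is a hypothetical helper name, and a real comparison would still score the rotated sequences for similarity:

```python
def unroll(alt_seq, shift):
    """Rotate an insertion's ALT sequence, treating it as circular.

    Moving an INS `shift` bases along the reference moves `shift`
    leading bases of the ALT to its end, giving the equivalent
    representation at the new position.
    """
    shift %= len(alt_seq)
    return alt_seq[shift:] + alt_seq[:shift]

# The INS 'ab' at POS 2 is equivalent to the INS 'ba' at POS 1:
# roll 2 - 1 = 1 base from the start to the end
print(unroll("ab", 1))  # -> ba
```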
-
-The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end.
-
-This unroll method has a number of benefits and a couple of considerations, including:
-* not needing a `--reference` for comparison, which saves I/O time
-* increasing the number of correctly matching SVs
-* decreasing the number of 'suspect' matches in smaller size regimes
-* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity
-
-## Reference context
-For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this variant would be:
-
-```text
-#POS INS  Haplotype
-  0  ab   abAB
-  1  ba   AbaB
-  2  ab   ABab
-```
-
-Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' `min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line looks like:
-
-```python
-hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq
-```
-
-Where `a1_seq` is the longer of the REF or ALT allele.
-
-## SVs without sequences
-
-If the base or comp vcfs do not have sequence resolved calls (e.g. `<DEL>`), simply set `--pctseq=0` to turn off sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If `--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared.
-
-Controlling the number of matches
-=================================
-
-How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`.
-
-* `single` (the default option) allows each variant to participate in up to one match.
-* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1).
-* `multi` allows variants to participate in all available matches.
-
-As an example, imagine we have three variants in a pVCF with two samples we want to compare.
-
-```
-CHROM  POS       ID   REF  ALT            base  comp
-chr20  17785968  ID1  A    ACGCGCGCGCG    1/1   1/0
-chr20  17785968  ID2  A    ACGCGCGCGCGCG  0/0   0/1
-chr20  17785969  ID3  C    CGCGCGCGCGCGC  0/0   1/1
-```
-
-To compare samples inside the same vcf, we would use the command:
-```bash
-truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a
-```
-
-This VCF produces different results depending on the `--pick` parameter:
-
-| Parameter | ID1 State | ID2 State | ID3 State |
-|-----------|-----------|-----------|-----------|
-| single    | TP        | FP        | FP        |
-| ac        | TP        | TP        | FP        |
-| multi     | TP        | TP        | TP        |
-
---dup-to-ins
-============
-
-Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e. the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0`.
-
---sizemin and --sizefilt
-========================
-
-`--sizemin` is the minimum size of a base call to be considered.
-
-`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It can be less than `sizemin` for edge case variants.
-
-For example: Imagine `sizemin` is set at 50 and `sizefilt` at 30, and a 50bp base call is 98% similar to a 49bp comparison call at the same position.
-
-These two calls could be considered matching. However, if we removed comparison calls less than `sizemin`, we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between `[sizefilt,sizemin)` to find matches.
-
-This has the side effect of artificially inflating specificity. For example, if that same 49bp call described above were below the similarity threshold, it would not be classified as a FP since it is below the `sizemin` threshold. So we're giving the call a better chance to be useful and less chance to be detrimental to final statistics.
-
-Include Bed & VCF Header Contigs
-================================
-
-If an `--includebed` is provided, only base and comp calls contained within the defined regions are used for comparison. This is similar to pre-filtering your base/comp calls using:
-
-```bash
-(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz
-```
-
-with the exception that Truvari requires the start and the end to be contained in the same includebed region whereas `bedtools intersect` does not.
-
-If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.
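The containment rule can be sketched as follows. This is an illustrative helper (not Truvari's actual implementation) assuming half-open `[start, end)` bed intervals; a partial overlap that `bedtools intersect` would keep is excluded here:

```python
def fully_contained(regions, chrom, start, end):
    """True if the variant's [start, end) span sits entirely inside a
    single include region. `regions` maps chrom -> list of (start, end)."""
    return any(r_start <= start and end <= r_end
               for r_start, r_end in regions.get(chrom, []))

regions = {"chr20": [(100000, 200000)]}
print(fully_contained(regions, "chr20", 120000, 130000))  # True  - kept
print(fully_contained(regions, "chr20", 190000, 210000))  # False - end outside
```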
-
-Extending an Include Bed
-------------------------
-The option `--extend` extends the regions of interest (set in `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions it can potentially match a base variant that is in the original regions, turning it into a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match comparison variants as short as sizefilt (30bp by default).
-
-See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.
-
-Methodology
-===========
-Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
-```
-* zip the Base and Comp calls together in sorted order
-* create chunks of all calls overlapping within ±`--chunksize` basepairs
-* make a |BaseCall| x |CompCall| match matrix for each chunk
-* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds
-* if the chunk has no Base or Comp calls
-** return them all as FNs/FPs
-* use `--pick` method to sort and annotate variants with their best match
-```
-![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png)
\ No newline at end of file
diff --git a/docs/v4.2.0/collapse.md b/docs/v4.2.0/collapse.md
deleted file mode 100644
index f660f340..00000000
--- a/docs/v4.2.0/collapse.md
+++ /dev/null
@@ -1,163 +0,0 @@
-`collapse` is Truvari's approach to SV merging. After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample.
-
-Example
-=======
-To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via:
-```bash
-bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz
-```
-
-This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt.
-For example, consider two vcfs:
-
-    >> one.vcf:
-    chr1 1 ... GT 0/1
-    chr1 5 ... GT 1/1
-    >> two.vcf:
-    chr1 1 ... GT 1/1
-    chr1 7 ... GT 0/1
-
-`bcftools merge` creates:
-
-    >> merge.vcf:
-    chr1 1 ... GT 0/1 1/1
-    chr1 5 ... GT 1/1 ./.
-    chr1 7 ... GT ./. 0/1
-
-This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed.
-
-```bash
-truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa
-```
-
-For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create:
-
-    >> truvari_merge.vcf
-    chr1 1 ... GT 0/1 1/1
-    chr1 5 ... GT 1/1 0/1
-    >> truvari_collapsed.vcf
-    chr1 7 ... GT ./. 0/1
-
---choose behavior
-=================
-When collapsing, the default `--choose` behavior is to take the first variant from a cluster to be written to the output while the others will be placed in the collapsed output. Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count).
-
-Samples with no genotype information in the kept variant will be filled by the first collapsed variant containing genotype information.
-
---gt
-====
-For some results, one may not want to collapse variants with conflicting genotypes from a single sample. With the `--gt all` parameter, variants which are present (non `0/0` or `./.`) in the same sample are not collapsed. With the `--gt het` parameter, only variants which are both heterozygous in a sample (e.g. `0/1` and `0/1`) are prevented from collapsing. The `--gt het` is useful for some SV callers which will redundantly call variants and typically genotype them all as `1/1`.
-
---intra
-=======
-When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
-```
-FORMAT    RESULT1    RESULT2
-GT:GQ:AD  ./.:.:3,0  1/1:20:0,30
-```
-The `--intra` output would be:
-```
-FORMAT    RESULT1
-GT:GQ:AD  1/1:20:3,0
-```
-As you can see in this example, 1) the first sample name is the only one preserved and 2) conflicting FORMAT fields can be consolidated in a non-useful way (here the AD of `3,0` isn't informative to a `1/1` genotype). We're working to provide an API to help users write custom intra-sample consolidation scripts.
-
---hap mode
-==========
-When using `--hap`, we assume phased variants from a single individual. Only the single best matching call from the other haplotype will be collapsed, and the consolidated genotype will become 1/1.
-
-For example, if we collapse anything at the same position:
-
-    chr1 1 .. GT 0|1
-    chr1 1 .. GT 1|0
-    chr1 2 .. GT 1|0
-
-will become:
-
-    chr1 1 .. GT 1/1
-    chr1 2 .. GT 1|0
-
---chain mode
-============
-Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that each variant needs to match only at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants, whereas using `--chain` will produce one.
-For example, if we have
-
-    chr1 5 ..
-    chr1 7 ..
-    chr1 9 ..
-
-When we collapse anything within 2bp of each other, without `--chain`, we output:
-
-    chr1 5 ..
-    chr1 9 ..
-
-With `--chain`, we would collapse `chr1 9` as well, producing
-
-    chr1 5 ..
-
-Annotations
-===========
-`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants.
-
-The output file has three annotations added to the `INFO`.
-- `CollapseId` - Identifier of the variant when comparing to the collapse outputs.
-- `NumCollapsed` - Number of variants collapsed into this variant
-- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes
-
-The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.
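The `--chain` distinction can be made concrete with a simplified positional clustering sketch. This is illustrative only: `cluster` is a hypothetical helper, it clusters on position alone rather than the full matching thresholds, and the no-chain case is approximated by requiring a match to the cluster's first variant:

```python
def cluster(positions, dist, chain):
    """Group sorted positions within `dist` bp; each group yields one
    kept variant. With chaining, linking to the previous member is
    enough; without it, a variant must sit near the cluster's anchor."""
    clusters = []
    for pos in sorted(positions):
        if clusters and (
                (chain and pos - clusters[-1][-1] <= dist)
                or (not chain and pos - clusters[-1][0] <= dist)):
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# chr1 positions 5, 7, 9 collapsed within 2bp
print(cluster([5, 7, 9], 2, chain=False))  # [[5, 7], [9]] -> two kept variants
print(cluster([5, 7, 9], 2, chain=True))   # [[5, 7, 9]]   -> one kept variant
```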
-
-```
-usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug]
-                [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain]
-                [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly]
-
-Structural variant collapser
-
-Will collapse all variants within sizemin/max that match over thresholds
-
-options:
-  -h, --help            show this help message and exit
-  -i INPUT, --input INPUT
-                        Comparison set of calls
-  -o OUTPUT, --output OUTPUT
-                        Output vcf (stdout)
-  -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT
-                        Where collapsed variants are written (collapsed.vcf)
-  -f REFERENCE, --reference REFERENCE
-                        Indexed fasta used to call variants
-  -k {first,maxqual,common}, --keep {first,maxqual,common}
-                        When collapsing calls, which one to keep (first)
-  --debug               Verbose logging
-  --hap                 Collapsing a single individual's haplotype resolved calls (False)
-  --chain               Chain comparisons to extend possible collapsing (False)
-  --no-consolidate      Skip consolidation of sample genotype fields (True)
-  --null-consolidate NULL_CONSOLIDATE
-                        Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null
-                        from all neighbors (None)
-
-Comparison Threshold Arguments:
-  -r REFDIST, --refdist REFDIST
-                        Max reference location distance (500)
-  -p PCTSIM, --pctsim PCTSIM
-                        Min percent allele sequence similarity. Set to 0 to ignore. (0.95)
-  -B MINHAPLEN, --minhaplen MINHAPLEN
-                        Minimum haplotype sequence length to create (50)
-  -P PCTSIZE, --pctsize PCTSIZE
-                        Min pct allele size similarity (minvarsize/maxvarsize) (0.95)
-  -O PCTOVL, --pctovl PCTOVL
-                        Min pct reciprocal overlap (0.0) for DEL events
-  -t, --typeignore      Variant types don't need to match to compare (False)
-  --use-lev             Use the Levenshtein distance ratio instead of edlib editDistance ratio (False)
-
-Filtering Arguments:
-  -s SIZEMIN, --sizemin SIZEMIN
-                        Minimum variant size to consider for comparison (50)
-  -S SIZEMAX, --sizemax SIZEMAX
-                        Maximum variant size to consider for comparison (50000)
-  --passonly            Only consider calls with FILTER == PASS
-```
\ No newline at end of file
diff --git a/docs/v4.2.0/consistency.md b/docs/v4.2.0/consistency.md
deleted file mode 100644
index 41fd599c..00000000
--- a/docs/v4.2.0/consistency.md
+++ /dev/null
@@ -1,170 +0,0 @@
-
-In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result.
-
-Running
-=======
-
-```
-usage: consistency [-h] [-j] VCFs [VCFs ...]
-
-Over multiple vcfs, calculate their intersection/consistency.
-
-Calls will match between VCFs if they have a matching key of:
-    CHROM:POS ID REF ALT
-
-positional arguments:
-  VCFs        VCFs to intersect
-
-optional arguments:
-  -h, --help  show this help message and exit
-  -j, --json  Output report in json format
-```
-Example
-=======
-
-```bash
-truvari consistency fileA.vcf fileB.vcf fileC.vcf
-```
-
-Matching Entries
-================
-
-VCF entries will be considered matching if and only if they have an exact same key of `CHROM:POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output.
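The exact-match key described above might be built along these lines (a sketch with made-up entries; `entry_key` is a hypothetical helper, while `truvari consistency` reads real VCF records):

```python
def entry_key(chrom, pos, vid, ref, alt):
    """Exact key used to intersect entries across VCFs; any difference
    in position, ID, REF, or ALT prevents a match."""
    return f"{chrom}:{pos} {vid} {ref} {alt}"

calls_a = {entry_key("chr20", 100, ".", "A", "ACGT")}
calls_b = {entry_key("chr20", 100, ".", "A", "ACGT"),
           entry_key("chr20", 100, ".", "A", "ACGG")}
# Only the identical entry intersects
print(len(calls_a & calls_b))  # -> 1
```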
-
-Output Report
-=============
-
-Below is an example report:
-
-```text
-#
-# Total 5534 calls across 3 VCFs
-#
-#File NumCalls
-fileA.vcf 4706
-fileB.vcf 4827
-fileC.vcf 4882
-#
-# Summary of consistency
-#
-#VCFs Calls Pct
-3 3973 71.79%
-2 935 16.90%
-1 626 11.31%
-#
-# Breakdown of VCFs' consistency
-#
-#Group Total TotalPct PctOfFileCalls
-111 3973 71.79% 84.42% 82.31% 81.38%
-011 351 6.34% 7.27% 7.19%
-101 308 5.57% 6.54% 6.31%
-110 276 4.99% 5.86% 5.72%
-001 250 4.52% 5.12%
-010 227 4.10% 4.70%
-100 149 2.69% 3.17%
-```
-
-At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882.
-
-The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs.
-
-Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf).
-
-We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf.
-
-Finally, we see that fileA.vcf has the fewest calls unique to it on the `Group 100` line.
-
-Json
-====
-Below is a consistency report in json format.
-```json
-{
-    "vcfs": [
-        "repo_utils/test_files/variants/input1.vcf.gz",
-        "repo_utils/test_files/variants/input2.vcf.gz",
-        "repo_utils/test_files/variants/input3.vcf.gz"
-    ],
-    "total_calls": 3513,
-    "num_vcfs": 3,
-    "vcf_counts": {
-        "repo_utils/test_files/variants/input1.vcf.gz": 2151,
-        "repo_utils/test_files/variants/input2.vcf.gz": 1783,
-        "repo_utils/test_files/variants/input3.vcf.gz": 2065
-    },
-    "shared": [
-        {
-            "vcf_count": 3,
-            "num_calls": 701,
-            "call_pct": 0.1995445488186735
-        },
-        {
-            "vcf_count": 2,
-            "num_calls": 1084,
-            "call_pct": 0.3085681753487048
-        },
-        {
-            "vcf_count": 1,
-            "num_calls": 1728,
-            "call_pct": 0.4918872758326217
-        }
-    ],
-    "detailed": [
-        {
-            "group": "111",
-            "total": 701,
-            "total_pct": 0.1995445488186735,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683
-        },
-        {
-            "group": "001",
-            "total": 645,
-            "total_pct": 0.18360375747224594,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166
-        },
-        {
-            "group": "100",
-            "total": 598,
-            "total_pct": 0.17022487902077996,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0
-        },
-        {
-            "group": "101",
-            "total": 487,
-            "total_pct": 0.1386279533162539,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884
-        },
-        {
-            "group": "010",
-            "total": 485,
-            "total_pct": 0.13805863933959578,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0
-        },
-        {
-            "group": "110",
-            "total": 365,
-            "total_pct": 0.10389980074010817,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0
-        },
-        {
-            "group": "011",
-            "total": 232,
-            "total_pct": 0.06604042129234272,
-            "repo_utils/test_files/variants/input1.vcf.gz": 0,
-            "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667,
-            "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167
-        }
-    ]
-}
-```
\ No newline at end of file
diff --git a/docs/v4.2.0/divide.md b/docs/v4.2.0/divide.md
deleted file mode 100644
index 3bbad3cb..00000000
--- a/docs/v4.2.0/divide.md
+++ /dev/null
@@ -1,58 +0,0 @@
-Divide a VCF into independent shards.
-
-Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can then be processed over multiple processes.
-
-`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever truvari command.
-
-For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples.
First, we divide the VCF:
-
-```bash
-truvari divide big_input.vcf.gz sub_vcfs_directory/
-```
-
-Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script
-
-```bash
-NJOBS=$(nproc) # use all processors by default
-mkdir -p output_vcfs/
-mkdir -p collap_vcfs/
-mkdir -p logs/
-
-for in_vcf in sub_vcfs_directory/*.vcf.gz
-do
-    # Setup file names
-    base_name=$(basename $in_vcf)
-    base_name=${base_name%.vcf.gz}
-    output_name=output_vcfs/${base_name}.vcf
-    collap_name=collap_vcfs/${base_name}.vcf
-    log_name=logs/${base_name}.log
-    # Run the command and send it to the background
-    truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} &
-    # If too many jobs are running, wait for one to finish
-    while [ $( jobs | wc -l ) -ge ${NJOBS} ]
-    do
-        sleep 5
-    done
-done
-```
-
-Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands.
-
-```
-usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR
-
-Divide a VCF into independent parts
-
-positional arguments:
-  VCF                   VCF to split
-  DIR                   Output directory to save parts
-
-options:
-  -h, --help            show this help message and exit
-  -b BUFFER, --buffer BUFFER
-                        Buffer to make mini-clusters (1000)
-  -m MIN, --min MIN     Minimum number of entries per-vcf (100)
-  --no-compress         Don't attempt to compress/index sub-VCFs
-  -T THREADS, --threads THREADS
-                        Number of threads for compressing/indexing sub-VCFs (1)
-```
\ No newline at end of file
diff --git a/docs/v4.2.0/refine.md b/docs/v4.2.0/refine.md
deleted file mode 100644
index 3fa818e7..00000000
--- a/docs/v4.2.0/refine.md
+++ /dev/null
@@ -1,114 +0,0 @@
-As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result and recomputing benchmarking performance on harmonized variants, we present the tool `refine`.
-
-Quick Start
-===========
-
-After making a `bench` result:
-```bash
-truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
-```
-Use `refine` on the `result/`
-```bash
-truvari refine -r subset.bed -f ref.fa result/
-```
-
-Output
-======
-* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions. Same structure as [[summary.json|bench#summaryjson]]
-* `refine.regions.txt` - Tab-delimited file with per-region variant counts
-* `refine.region_summary.json` - Per-region performance metrics
-* `phab_bench/` - Bench results on the subset of variants harmonized
-
-To see an example output, look at [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one)
-
-Using `refine.regions.txt`
-==========================
-| Column     | Description                             |
-| ---------- | --------------------------------------- |
-| chrom      | Region's chromosome                     |
-| start      | Region's start                          |
-| end        | Region's end                            |
-| in_tpbase  | Input's True Positive base count        |
-| in_tp      | Input's True Positive comparison count  |
-| in_fp      | Input's false positive count            |
-| in_fn      | Input's false negative count            |
-| refined    | Boolean for if region was re-evaluated  |
-| out_tpbase | Output's true positive base count       |
-| out_tp     | Output's true positive comparison count |
-| out_fn     | Output's false negative count           |
-| out_fp     | Output's false positive count           |
-| state      | True/False state of the region          |
-
-
-Performance by Regions
-======================
-
-Because `truvari phab` can alter variant counts during harmonization, one may wish to assess the performance on a per-region basis rather than the per-variant basis. In the `refine.regions.txt`, a column `state` will have a TP/FN/FP value as defined by the following rules:
-
-```python
-false_pos = (data['out_fp'] != 0)
-false_neg = (data['out_fn'] != 0)
-any_false = false_pos | false_neg
-
-true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false
-
-true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1)
-
-baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0)
-compP = (data['out_tp'] != 0) | (data['out_fp'] != 0)
-```
-
-This logic has two edge cases to consider. 1) A region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp, but no other variants because a neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions.
-
-These by-region state counts are summarized and written to `refine.region_summary.json`. The definition of metrics inside this json are:
-| Key    | Definition                                   | Formula                         |
-|--------|----------------------------------------------|---------------------------------|
-| TP     | True Positive region count                   |                                 |
-| TN     | True Negative region count                   |                                 |
-| FP     | False Positive region count                  |                                 |
-| FN     | False Negative region count                  |                                 |
-| base P | Regions with base variant(s)                 |                                 |
-| base N | Regions without base variant(s)              |                                 |
-| comp P | Regions with comparison variant(s)           |                                 |
-| comp N | Regions without comparison variant(s)        |                                 |
-| PPV    | Positive Predictive Value (a.k.a. precision) | TP / comp P                     |
-| TPR    | True Positive Rate (a.k.a. recall)           | TP / base P                     |
-| TNR    | True Negative Rate (a.k.a. specificity)      | TN / base N                     |
-| NPV    | Negative Predictive Value                    | TN / comp N                     |
-| ACC    | Accuracy                                     | (TP + TN) / (base P + base N)   |
-| BA     | Balanced Accuracy                            | (TPR + TNR) / 2                 |
-| F1     | f1 score                                     | 2 * ((PPV * TPR) / (PPV + TPR)) |
-| UND    | Regions with an undetermined state           |                                 |
-
-Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate from the by-variant performance reports.
-
-`--align`
-=========
-By default, Truvari will make the haplotypes and use an external call to `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than mafft but less accurate, and slower than wfa but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes.
-
-`--use-original-vcfs`
-=====================
-
-By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, this contains a filtered subset of variants originally provided to `bench` since it removes variants e.g. below `--sizemin` or not `--passonly`.
-
-With the `--use-original-vcfs` parameter, all of the original calls from the input vcfs are fetched. This parameter is useful in recovering matches in situations when variants in one call set are split into two variants which are smaller than the minimum size analyzed by `bench`. For example, imagine a base VCF with a 20bp DEL, a comp VCF with two 10bp DEL, and `bench --sizemin 20` was used. `--use-original-vcfs` will consider the two 10bp comp variants during phab harmonization with the 20bp base DEL.
-
-
-`--regions`
-===========
-
-This parameter specifies which regions to re-evaluate. If this is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--includebed` is subset to only those intersecting `--regions`.
-
-This parameter is helpful for cases when the `--includebed` is not the same set of regions that a caller analyzes. For example, if a TR caller only discovers short tandem repeats (STR), but a benchmark has TRs of all lengths, it isn't useful to benchmark against the non-STR variants. Therefore, you can run `bench` on the full benchmark's regions (`--includebed`), and automatically subset to only the regions analyzed by the caller with `refine --regions`.
-
-Note that the larger these regions are the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--region` partially overlaps an `--includebed` region, you may not be analyzing a subset of calls looked at during the original `bench` run. Therefore, the `*summary.json` should be compared with caution.
-
-`--use-region-coords`
-=====================
-
-When intersecting `--includebed` with `--regions`, use `--regions` coordinates. By default, `refine` will prefer the `--includebed` coordinates. This is helpful for when the `--regions` boundaries should be used instead of the original bench result's `--includebed`.
-
-`--reference`
-=============
-
-By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants.
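The per-region formulas above can be checked with a small sketch. The counts are hypothetical, and for simplicity this assumes every region has exactly one determined state (so base P = TP + FN and comp P = TP + FP, ignoring "FP,FN" and 'UNK' regions):

```python
def region_metrics(tp, tn, fp, fn):
    """Compute PPV/TPR/F1 over region counts per the table's formulas."""
    base_p = tp + fn  # regions with base variant(s)
    comp_p = tp + fp  # regions with comparison variant(s)
    ppv = tp / comp_p
    tpr = tp / base_p
    f1 = 2 * (ppv * tpr) / (ppv + tpr)
    return ppv, tpr, f1

ppv, tpr, f1 = region_metrics(tp=90, tn=50, fp=10, fn=30)
print(round(ppv, 2), round(tpr, 2))  # -> 0.9 0.75
```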
\ No newline at end of file diff --git a/docs/v4.2.0/segment.md b/docs/v4.2.0/segment.md deleted file mode 100644 index 953e5822..00000000 --- a/docs/v4.2.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjoint genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjoint regions. This is an experimental tool that explores the possibility of assisting SV association analysis. - -This tool adds an INFO field `SEGCNT`, which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous, then `SEG=4`. - -In the below example, we have three deletions found across three samples. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - -The annotations `segment` adds for the regions would then be: -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | \ No newline at end of file diff --git a/docs/v4.2.0/stratify.md b/docs/v4.2.0/stratify.md deleted file mode 100644 index 353dcab5..00000000 --- a/docs/v4.2.0/stratify.md +++ /dev/null @@ -1,58 +0,0 @@ -`stratify` is a helper utility for counting variants within bed regions, which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are four VCFs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts.
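The counting `stratify` performs amounts to interval overlap (or containment, with `--within`); a rough pure-Python sketch of the idea, not Truvari's code:

```python
def count_per_region(regions, variants, within=False):
    """Count variants per bed region, akin to `bedtools intersect -c`.

    regions and variants are (chrom, start, end) tuples, half-open.
    within=True mimics the --within flag: only variants fully contained
    in a region are counted. Hypothetical sketch for illustration.
    """
    rows = []
    for chrom, start, end in regions:
        count = 0
        for v_chrom, v_start, v_end in variants:
            if v_chrom != chrom:
                continue
            if within:
                hit = start <= v_start and v_end <= end
            else:
                hit = v_start < end and start < v_end
            if hit:
                count += 1
        rows.append((chrom, start, end, count))
    return rows

variants = [("chr20", 150, 160), ("chr20", 190, 250)]
print(count_per_region([("chr20", 100, 200)], variants))               # counts both
print(count_per_region([("chr20", 100, 200)], variants, within=True))  # counts only the contained one
```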
- -For example: -```bash -$ truvari stratify input.bed bench/ -chrom start end tpbase tp fn fp -chr20 280300 280900 0 0 0 0 -chr20 100000 200000 1 1 0 0 -chr20 642000 642350 1 1 2 1 -``` - -The output from this can then be parsed to generate more details: - -```python -import pandas as pd -import truvari - -df = pd.read_csv("stratify.output.txt", sep='\t') - -# Since input.bed didn't have a header (so we couldn't use the `--header` parameter), we need to name the columns -df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp'] - -# Create the precision, recall, and f1 for each row -metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1) - -# metrics is now a DataFrame with a single column of tuples, let's separate them into columns -metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"]) - -# Extend the dataframe's columns -df = df.join(metrics) -df.head() -``` -Which gives the result: -``` - chrom start end tpbase tp fn fp precision recall f1 -0 chr20 135221 239308 1 1 0 0 1.00 1.00 1.000000 -1 chr20 260797 465632 3 3 3 1 0.75 0.50 0.600000 -2 chr20 465866 622410 1 1 0 0 1.00 1.00 1.000000 -3 chr20 623134 655257 1 1 3 1 0.50 0.25 0.333333 -4 chr20 708338 732041 1 1 1 0 1.00 0.50 0.666667 -``` - -``` -usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF - -Count variants per-region in vcf - -positional arguments: - BED Regions to process - VCF Truvari bench result directory or a single VCF - -optional arguments: - -h, --help show this help message and exit - -o OUT, --output OUT Output bed-like file - --header Input regions have header to preserve in output - -w, --within Only count variants contained completely within region boundaries - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.0/vcf2df.md b/docs/v4.2.0/vcf2df.md deleted file mode 100644 index 14d9f06c..00000000 --- a/docs/v4.2.0/vcf2df.md +++ /dev/null @@ -1,81 +0,0 @@ -We enjoy using
[pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* chrom: variant chromosome -* start: 0-based start from pysam.VariantRecord.start -* end: 0-based end from pysam.VariantRecord.stop -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of whether the filter is empty or PASS - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per line, so there are only two alleles described per line: the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values, and the columns created are named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results.
`vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-comp.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there's multiple parameters used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with the command: -```python -result = pyranges.PyRanges(df.rename(columns={'chrom':"Chromosome", "start":"Start", "end":"End"})) -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL - -Takes a vcf and creates a data frame. 
Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fields into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -n, --no-prefix Don't prepend sample name to format columns - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - -c LVL, --compress LVL - Compression level for joblib 0-9 (3) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.1/Citations.md b/docs/v4.2.1/Citations.md deleted file mode 100644 index d600b860..00000000 --- a/docs/v4.2.1/Citations.md +++ /dev/null @@ -1,30 +0,0 @@ -# Citing Truvari - -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
-* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted 
nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) -* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) -* [Automated filtering of genome-wide large deletions through an ensemble deep learning framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git a/docs/v4.2.1/Development.md b/docs/v4.2.1/Development.md deleted file mode 100644 index ccb38493..00000000 --- a/docs/v4.2.1/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own
code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the container at the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report, which you can download to see a line-by-line accounting of test coverage.
- -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant to [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. 
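The version-suffix scheme described under "versioning" above can be illustrated with a short sketch (the function name, argument layout, and the `+` separator for the git hash are hypothetical; the real version lives in `truvari/__init__.__version__`):

```python
def format_version(base, suffix=None, git_hash=None, unstaged=False):
    """Build a version string following the scheme described above.

    base     -- e.g. '3.1.0'
    suffix   -- 'dev' on the develop branch, 'rc' for release candidates,
                None for a full release pushed to PyPi
    git_hash -- appended for installs from the develop branch
    unstaged -- adds '.uc' when the repo has un-staged commits
    Hypothetical illustration, not Truvari's actual code.
    """
    version = base
    if suffix:
        version += f"-{suffix}"
    if git_hash:
        version += f"+{git_hash}"
        if unstaged:
            version += ".uc"
    return version

print(format_version("3.1.0", suffix="dev"))                     # 3.1.0-dev
print(format_version("3.1.0", "dev", "a1b2c3d", unstaged=True))  # 3.1.0-dev+a1b2c3d.uc
print(format_version("3.0.0"))                                   # 3.0.0
```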
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Change Updates Wiki -4) Freeze the Wiki -5) Ensure all code is checked in -6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -7) Use github action to make a testpypi release -8) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -9) Use GitHub action to make a pypi release -10) Download release-tarball.zip from step #9’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.2.1/Home.md b/docs/v4.2.1/Home.md deleted file mode 100644 index 47fbc626..00000000 --- a/docs/v4.2.1/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Citation: -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. 
- -Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[anno|anno]] - - [[bench|bench]] - - [[collapse|collapse]] - - [[consistency|consistency]] - - [[divide|divide]] - - [[phab|phab]] - - [[refine|refine]] - - [[segment|segment]] - - [[stratify|stratify]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v4.2.1/Installation.md b/docs/v4.2.1/Installation.md deleted file mode 100644 index a929d245..00000000 --- a/docs/v4.2.1/Installation.md +++ /dev/null @@ -1,56 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.2.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/spiralgenetics/truvari.git -git checkout tags/v3.0.0 -python3 -m pip install . -``` - -To see a list of all available tags, run: -``` -git tag -l -``` - -If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba / Conda -============= -NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. 
-Users can follow the instructions at https://mamba.readthedocs.io/en/latest/installation.html to install mamba, a faster, conda-compatible package manager. - -Creating an environment with Truvari and its dependencies: -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See the [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v4.2.1/MatchIds.md b/docs/v4.2.1/MatchIds.md deleted file mode 100644 index ca52076f..00000000 --- a/docs/v4.2.1/MatchIds.md +++ /dev/null @@ -1,74 +0,0 @@ -MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploration. MatchIds have a structure of `{chunkid}.{callid}`. The chunkid is a unique id per chunk of calls. All calls sharing a chunkid were within `--chunksize` distance and were compared. The callid is unique to a call in a chunk for each VCF. Because `bench` processes two VCFs (the base and comparison VCFs), the `MatchId` has two values: the first is the base variant's MatchId and the second is the comparison variant's MatchId. - -For `--pick single`, the two MatchIds will be identical in e.g. the tp-base.vcf.gz and tp-comp.vcf.gz.
However, for `--pick ac|multi`, it's possible to have cases such as one base variant matching to multiple comparison variants. That would give us MatchIds like: - -``` -# tp-base.vcf -MatchId=4.0,4.1 - -# tp-comp.vcf -MatchId=4.0,4.1 -MatchId=4.0,4.2 -``` - -This example tells us that the tp-comp variants are both pointing to `4.0` in tp-base. The tp-base variant has a higher match to the tp-comp `4.1` variant. - -One easy way to combine matched variants is to use `truvari vcf2df` to convert a benchmarking result to a pandas DataFrame and leverage pandas' merge operation. First, we convert the `truvari bench` result. - -```bash -truvari vcf2df --info --bench-dir bench_result/ data.jl -``` - -Next, we combine rows of matched variants: -```python -import joblib -import pandas as pd - -# Load the data -data = joblib.load("data.jl") - -# Separate out the variants from the base VCF and add new columns of the base/comp ids -base = data[data['state'].isin(['tpbase', 'fn'])].copy() -base['base_id'] = base['MatchId'].apply(lambda x: x[0]) -base['comp_id'] = base['MatchId'].apply(lambda x: x[1]) - -# Separate out the variants from the comparison VCF and add new columns of the base/comp ids -comp = data[data['state'].isin(['tp', 'fp'])].copy() -comp['base_id'] = comp['MatchId'].apply(lambda x: x[0]) -comp['comp_id'] = comp['MatchId'].apply(lambda x: x[1]) - -# Merge the base/comparison variants -combined = pd.merge(base, comp, left_on='base_id', right_on='comp_id', suffixes=('_base', '_comp')) - -# How many comp variants matched to multiple base variants? -counts1 = combined['base_id_comp'].value_counts() -print('multi-matched comp count', (counts1 != 1).sum()) - -# How many base variants matched to multiple comp variants? -counts2 = combined['comp_id_base'].value_counts() -print('multi-matched base count', (counts2 != 1).sum()) -``` - -The `MatchId` is also used by `truvari collapse`. However there are two differences. 
First, in the main `collapse` output, the relevant INFO field is named `CollapseId`. Second, because collapse only has a single input VCF, it is much easier to merge DataFrames. To merge a collapse result's kept variants with those that were removed, we again need to convert the VCFs to DataFrames: - -```bash -truvari vcf2df -i kept.vcf.gz kept.jl -truvari vcf2df -i removed.vcf.gz remov.jl -``` - -Then we combine them: -```python -import joblib -import pandas as pd - -# Load the kept variants and set the index. -kept = joblib.load("kept.jl").set_index('CollapseId') - -# Load the removed variants and set the index. -remov = joblib.load("remov.jl") -remov['CollapseId'] = remov['MatchId'].apply(lambda x: x[0]) -remov.set_index('CollapseId', inplace=True) - -# Join the two sets of variants -result_df = kept.join(remov, how='right', rsuffix='_removed') -``` \ No newline at end of file diff --git a/docs/v4.2.1/Multi-allelic-VCFs.md b/docs/v4.2.1/Multi-allelic-VCFs.md deleted file mode 100644 index fd0eb23e..00000000 --- a/docs/v4.2.1/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Then pre-process the VCFs with bcftools: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v4.2.1/Updates.md b/docs/v4.2.1/Updates.md deleted file mode 100644 index 9f3d2e4e..00000000 --- a/docs/v4.2.1/Updates.md +++ /dev/null @@ -1,256 +0,0 @@ -# Truvari 4.2.1 -*February 6, 2024* -* `collapse` - * Faster handling of genotype data for `--gt` and `--keep common` -* general - * Fix to bed end position bug for including variants ([details](https://github.com/ACEnglish/truvari/issues/193)) - * Fix to Dockerfile -* `refine` - * Changes to `--recount` that accompany the fix to bed end positions.
-* New command `ga4gh` to convert Truvari results into GA4GH truth/query VCFs with intermediate tags - -# Truvari 4.2 -*January 12, 2024* -* `collapse` - * New parameter `--gt` disallows intra-sample events from collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--gt)) - * New parameter `--intra` for consolidating SAMPLE information during intra-sample collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--intra)) - * Preserve phasing information when available - * Faster O(n-1) algorithm instead of O(n^2) - * Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons - * Fixed rare non-determinism error in cases where multiple variants are at the same position and equal qual/ac could be ordered differently. -* `phab` - * Correct sample handling with `--bSamples` `--cSamples` parameters - * Faster generation of consensus sequence - * Resolved 'overlapping' variant issue causing variants to be dropped - * New `poa` approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate. -* `bench` - * New, easier `MatchId` field to track which baseline/comparison variants match up [details](https://github.com/ACEnglish/truvari/wiki/MatchIds) - * `entry_is_present` method now considers partial missing variants (e.g. `./1`) as present - * Removed the 'weighted' metrics from `summary.json` -* `consistency` - * Fixed issue with counting duplicate records - * Added flag to optionally ignore duplicate records -* `anno svinfo` now overwrites existing SVLEN/SVTYPE info fields -* general - * Reduced fn matches for unroll sequence similarity by reporting the maximum of multiple manipulations of variant sequence (roll up/down/none). Comes at a small, but reasonable, expense of some more fp matches.
- * Bump pysam version - * Fixed bug in `unroll` sequence similarity that sometimes rolled from the wrong end - * Fixed bug for handling of None in ALT field - * `truvari.compress_index_vcf` forces overwriting of tabix index to prevent annoying crashes - - -# Truvari 4.1 -*August 7, 2023* - -* `bench` - * Creates `candidate.refine.bed` which hooks into `refine` on whole-genome VCFs [details](https://github.com/ACEnglish/truvari/wiki/bench#refining-bench-output) - * `--recount` for correctly assessing whole-genome refinement results - * experimental 'weighted' summary metrics [details](https://github.com/ACEnglish/truvari/wiki/bench#weighted-performance) - * Unresolved SVs (e.g. `ALT == `) are filtered when `--pctseq != 0` -* `phab` - * ~2x faster via reduced IO from operating in stages instead of per-region - * Removed most external calls (e.g. samtools doesn't need to be in the environment anymore) - * new `--align wfa` allows much faster (but slightly less accurate) variant harmonization - * increased determinism of results [details](https://github.com/ACEnglish/truvari/commit/81a9ab85b91b0c530f9faeedfa4e7e0d68a5e8c2) -* `refine` - * Faster bed file intersection of `--includebed` and `--regions` - * Refine pre-flight check - * Correct refine.regions.txt end position from IntervalTree correction - * Better refine region selection with `--use-original` - * `--use-includebed` switched to `--use-region-coords` so that default behavior is to prefer the includebed's coordinates - * `--use-original-vcfs` to use the original pre-bench VCFs - * `refine.variant_summary.json` is cleaned of uninformative metrics -* `stratify` - * parallel parsing of truvari directory to make processing ~4x faster -* `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes -* `anno grpaf` - expanded annotation info fields -* `anno density` - new parameter `--stepsize` for sliding windows -* `collapse` - * New optional `--median-info` fields
[#146](https://github.com/ACEnglish/truvari/issues/146) -* Minor updates - * Fix some `anno` threading on macOS [#154](https://github.com/ACEnglish/truvari/issues/154) - * Monomorphic/multiallelic check fix in `bench` - * `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key - * Slightly slimmer docker container - -# Truvari 4.0 -*March 13, 2023* - -As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools and how a couple of tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. - -* New `refine` command for refining benchmarking results. [Details](refine) -* `bench` - * [Unroll](bench#unroll) is now the default sequence comparison approach. - * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) - * The `summary.txt` is now named `summary.json` - * Outputs parameters to `params.json` - * Output VCFs are sorted, compressed, and indexed - * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) - * Renamed `--pctsim` parameter to `--pctseq` - * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported - * Fixed bug where `INFO/Multi` wasn't being properly applied - * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. 
- * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` -* `collapse` - * Renamed `--pctsim` parameter to `--pctseq` - * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` - * Better output sorting which may allow pipelines to be a little faster. -* `vcf2df` - * More granular sizebins for `[0,50)` including better handling of SNPs - * `--multisample` is removed. Now automatically add all samples with `--format` - * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges -* `anno` - * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. - * Added `addid` -* `consistency` - * Slight speed improvement - * Better json output format -* `segment` - * Added `--passonly` flag - * Changed UI, including writing to stdout by default - * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN -* API - * Began a focused effort on improving re-usability of Truvari code. - * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). - * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) - * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) -* general - * msa2vcf now left-trims and decomposes variants into indels - * Functional tests reorganization - * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) - * Removed progressbar and Levenshtein dependencies as they are no longer used. 
- -# Truvari 3.5 -*August 27, 2022* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT) - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](../phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly, which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g. 
2+), which is a valid description -* `truvari.seqsim` is now case-insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` in favor of `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just comparing REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match. 
-* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence-resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds an `INFO` field with the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. 
-* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. 
It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The `af_calc` method now computes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* GitHub Actions for automated pylint, testing, and releases to PyPI. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV-relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. 
In order to keep up with the end of life of Python 2.7 (https://pythonclock.org/), -we now only support Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v4.2.1/anno.md b/docs/v4.2.1/anno.md deleted file mode 100644 index 55835902..00000000 --- a/docs/v4.2.1/anno.md +++ /dev/null @@ -1,494 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF with the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. 
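Conceptually, the per-call computation is just the GC fraction of the relevant sequence. The sketch below is an illustration of that idea, not Truvari's actual code:

```python
def gc_percent(seq):
    """Fraction of G/C bases in a sequence (the idea behind gcpct's GCPCT tag)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

print(gc_percent("GGCCAT"))  # -> 0.666... (4 of 6 bases are G/C)
```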
-``` -usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] - -Annotates GC content of SVs - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-o OUTPUT] [input] - -Annotates GTCounts of alleles - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem repeat annotation to sequence-resolved insertion/deletion variants in a VCF. - -### Annotations added -| Field Name | Description | -|------------|-------------------------------------------------------------| -| TRF | Entry hits a tandem repeat region | -| TRFdiff | ALT TR copy difference from reference | -| TRFrepeat | Repeat motif | -| TRFovl | Percent of ALT covered by TRF annotation | -| TRFstart | Start position of discovered repeat | -| TRFend | End position of discovered repeat | -| TRFperiod | Period size of the repeat | -| TRFcopies | Number of copies aligned with the consensus pattern | -| TRFscore | Alignment score | -| TRFentropy | Entropy measure | -| TRFsim | Similarity of ALT sequence to generated motif faux sequence | - -### Details -Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). 
This file can be formatted for `truvari anno trf` via: -```bash -zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz -tabix anno.trf.bed.gz -``` -For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)). - -For insertions, by default the tool first tries to estimate which motif is contained in the alternate sequence. For each overlapping annotation, the copy number difference of the motif is calculated via `copy_diff = len(entry.alts[0][1:]) / anno["period"]`. Next, a 'faux sequence' is made from `copy_diff` copies of the motif. If that sequence's similarity to the insertion sequence is above the `--motif-similarity`, that is considered the insertion's motif. If no estimate is above the `--motif-similarity`, the insertion is incorporated into the reference and TRF is run. If the discovered TRF hits match a motif in the tandem repeat regions file, that annotation is used. If the highest scoring TRF hit doesn't match the tandem repeats region file, the nominally de novo annotation is added to the insertion's vcf entry. 
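The insertion motif-estimation step described above can be sketched as follows. This is a simplified illustration, not Truvari's implementation: the function name is hypothetical and `difflib.SequenceMatcher` stands in for the actual sequence-similarity measure.

```python
from difflib import SequenceMatcher

def estimate_motif(ins_seq, annotations, motif_similarity=0.9):
    """Pick the annotated motif whose 'faux sequence' best matches an insertion.

    ins_seq     -- inserted bases (ALT without the anchor base)
    annotations -- overlapping repeat annotations, each with 'repeat' and 'period'
    Returns (similarity, motif, copy_diff) or None when no motif passes the
    threshold (at which point the real tool falls back to running TRF).
    """
    best = None
    for anno in annotations:
        copy_diff = len(ins_seq) / anno["period"]
        # Build a faux sequence from ~copy_diff copies of the motif
        faux = (anno["repeat"] * (int(copy_diff) + 1))[:len(ins_seq)]
        sim = SequenceMatcher(None, ins_seq, faux).ratio()
        if sim >= motif_similarity and (best is None or sim > best[0]):
            best = (sim, anno["repeat"], round(copy_diff, 1))
    return best

# A perfect 3 extra copies of the motif "ACGT"
print(estimate_motif("ACGTACGTACGT", [{"repeat": "ACGT", "period": 4}]))
# -> (1.0, 'ACGT', 3.0)
```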
- -``` -usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] - [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] - -Intersect vcf with reference tandem repeats and annotate -variants with the best fitting repeat motif and its copy number -relative to the reference - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) - -r REPEATS, --repeats REPEATS - Reference repeat annotations - -f REFERENCE, --reference REFERENCE - Reference fasta file - -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY - Motif similarity threshold (0.9) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -R, --regions-only Only write variants within --repeats regions (False) - --no-estimate Skip INS estimation procedure and run everything through TRF. (False) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization (5) - -t THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not add annotations to the VCF; instead, it creates a pandas -DataFrame and saves it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) alleles, we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_". 
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -R REGIONS, --regions REGIONS - Bed file of regions to parse (None) - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (1) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) - 
-p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig - -nocut -div 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno remap - -Takes the allele's sequence, remaps it to the reference, and annotates based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] - -Remap VCF'S alleles sequence to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to considered (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a. 
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs besides SNPs/Indels - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -positional arguments: - input VCF to annotate - -options: - 
-h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence-resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] - -Adds SVTYPE and SVLEN INFO fields - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib. 
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(annos.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes. 
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] - [-e END] [-1] - [input] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --sizemin SIZEMIN Minimum size of variant to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap. Annotates -regions with no variants as 'sparse' and with greater than or equal to (mean + `--threshold` * standard -deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`. 
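The sparse/dense labeling rule described above can be sketched in a few lines. This is a simplified stdlib illustration of the rule, not the tool's actual implementation:

```python
from statistics import mean, stdev

def label_windows(counts, threshold=3):
    """Label per-window variant counts as 'sparse', 'dense', or 'normal'."""
    cutoff = mean(counts) + threshold * stdev(counts)
    labels = []
    for c in counts:
        if c == 0:
            labels.append("sparse")   # no variants in the window
        elif c >= cutoff:
            labels.append("dense")    # unusually many variants
        else:
            labels.append("normal")
    return labels

print(label_windows([0, 2, 3, 2, 50], threshold=1))
# -> ['sparse', 'normal', 'normal', 'normal', 'dense']
```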
- -``` -usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-s STEPSIZE] [-t THRESHOLD] [input] - -Identify 'dense' and 'sparse' variant windows of the genome - -positional arguments: - input Input VCF (/dev/stdin) - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -s STEPSIZE, --stepsize STEPSIZE - Window step size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` within each of the user-defined bins. Bins are incremented using `bisect`, e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1`. - -``` -usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] - -Quick utility to count how many samples have >= Nx coverage per-variant - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-o OUTPUT] [input] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples. 
For every group in the `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. Adds INFO tags suffixed with the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in `--labels` that are not present in the VCF header. - -``` -usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] - -Add allele frequency annotations for subsets of samples - -positional arguments: - input VCF to annotate - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.1/bench.md b/docs/v4.2.1/bench.md deleted file mode 100644 index 0d027cb6..00000000 --- a/docs/v4.2.1/bench.md +++ /dev/null @@ -1,293 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ -``` - -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It depends on the precision of your callers and how much tolerance you wish to allow them while keeping the comparison fair. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. 
When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility. - -Here is a rundown of each matching parameter. -| Parameter | Default | Definition | -|------------|---------|------------| -| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end | -| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | -| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | -| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | -| typeignore | False | Types don't need to match to compare calls. | - -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctseq PCTSEQ (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Outputs -======= -Truvari bench writes the following files to the `--output` directory. -| File | Description | -|----------------------|---------------------------------------------| -| tp-base.vcf.gz | True positive calls from the base VCF | -| tp-comp.vcf.gz | True positive calls from the comparison VCF | -| fp.vcf.gz | False positive calls from comparison | -| fn.vcf.gz | False negative calls from base | -| summary.json | Json output of performance stats | -| params.json | Json output of parameters used | -| candidate.refine.bed | Bed file of regions for `refine` | -| log.txt | Run's log | - -summary.json ------------- -Stats generated by benchmarking are written to `summary.json`. 
- -| Metric | Definition | -|----------------|------------------------------------------------------------| -| TP-base | Number of matching calls from the base vcf | -| TP-comp | Number of matching calls from the comp vcf | -| FP | Number of non-matching calls from the comp vcf | -| FN | Number of non-matching calls from the base vcf | -| precision | TP-comp / (TP-comp + FP) | -| recall | TP-base / (TP-base + FN) | -| f1 | 2 * ((recall * precision) / (recall + precision)) | -| base cnt | Number of calls in the base vcf | -| comp cnt | Number of calls in the comp vcf | -| TP-comp_TP-gt | TP-comp with genotype match | -| TP-comp_FP-gt | TP-comp without genotype match | -| TP-base_TP-gt | TP-base with genotype match | -| TP-base_FP-gt | TP-base without genotype match | -| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | -| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | -| weighted | Metrics weighted by variant sequence/size similarity | - -The `gt_matrix` is a table. For example: -```json -"gt_matrix": { - "(0, 1)": { - "(0, 1)": 500, - "(1, 1)": 10 - }, - "(1, 1)": { - "(1, 1)": 800, - "(0, 1)": 20 - } -} -``` -Represents -> -``` -comp (0,1) (1,1) -base -(0,1) 500 10 -(1,1) 20 800 -``` - -Added annotations ------------------ -The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside the output directory. - -| Anno | Definition | -|-------------------|-----------------------------------------------------------------------------------------------------------------| -| TruScore | Truvari score for similarity of match.
`((pctseq + pctsize + pctovl) / 3 * 100)` | -| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | -| PctSizeSimilarity | Pct size similarity between this variant and its closest match | -| PctRecOverlap | Percent reciprocal overlap of the two calls | -| StartDistance | Distance of the base call's start from comparison call's start | -| EndDistance | Distance of the base call's end from comparison call's end | -| SizeDiff | Difference in size of base and comp calls | -| GTMatch | Base/comp calls' Genotypes match | -| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | - - -Refining bench output -===================== -As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -truvari refine --regions result/candidate.refine.bed \ - --reference reference.fasta \ - --recount --use-region-coords \ - result/ -``` -See [[refine wiki|refine]] for details. - -Comparing Sequences of Variants -=============================== - -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. - -## Unroll -The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c).
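As a toy sketch of the idea (illustrative only — the function name and shape are ours, not Truvari's internal API), unrolling an insertion's ALT sequence by the distance between two candidate positions is just a string rotation:

```python
def unroll(alt_seq: str, dist: int) -> str:
    """Rotate an insertion's ALT sequence: move `dist` leading bases to the end.
    Toy illustration of the unroll concept, not Truvari's implementation."""
    d = dist % len(alt_seq)
    return alt_seq[d:] + alt_seq[:d]

# An INS of 'ab' at POS 2 and an INS of 'ba' at POS 1 describe the same
# allele; rolling 2 - 1 = 1 base makes the sequences directly comparable.
print(unroll("ab", 2 - 1))  # -> ba
```

After rotation, the two ALT sequences can be compared directly with an edit-distance ratio, which is how they reach 100% sequence similarity.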
- -The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. - -This unroll method has a number of benefits, including: -* not needing a `--reference` for comparison, which saves I/O time -* increasing the number of correctly matching SVs -* decreasing the number of 'suspect' matches in smaller size regimes -* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity - -## Reference context -For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -## SVs without sequences - -If the base or comp vcfs do not have sequence resolved calls (e.g.
`<DEL>`), simply set `--pctseq=0` to turn off -sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If -`--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared. - -Controlling the number of matches -================================= - -How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`. - -* `single` (the default option) allows each variant to participate in up to one match. -* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1). -* `multi` allows variants to participate in all available matches. - -As an example, imagine we have three variants in a pVCF with two samples we want to compare. - -``` -CHROM POS ID REF ALT base comp -chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0 -chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1 -chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1 -``` - -To compare samples inside the same vcf, we would use the command: -```bash -truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a -``` - -This VCF produces different results depending on the `--pick` parameter - -| Parameter | ID1 State | ID2 State | ID3 State | -|-----------|-----------|-----------|-----------| -| single | TP | FP | FP | -| ac | TP | TP | FP | -| multi | TP | TP | TP | - ---dup-to-ins -============ - -Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e.
the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0` - ---sizemin and --sizefilt -======================== - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It can -be less than `sizemin` for edge case variants. - -For example: Imagine `sizemin` is set at 50 and `sizefilt` at 30, and a 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls could be considered matching. However, if we removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between `[sizefilt,sizemin)` to find matches. - -This has the side effect of artificially inflating specificity. For example, if that same 49bp call described -above were below the similarity threshold, it would not be classified as a FP since it is below the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. 
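The containment rule can be sketched as a small helper (a hypothetical function for illustration, not Truvari's API): a variant passes only when both its start and end fall within a single include region.

```python
def in_include_bed(start: int, end: int, regions) -> bool:
    """Truvari-style containment: start AND end must land in the same region.
    `regions` is a list of (region_start, region_end) pairs.
    Illustrative helper only; bedtools intersect would keep any overlap."""
    return any(r_start <= start and end <= r_end for r_start, r_end in regions)

regions = [(100, 200), (500, 800)]
print(in_include_bed(150, 190, regions))  # True: fully contained in one region
print(in_include_bed(150, 250, regions))  # False: end spills past the region
```

The second call is the case where `bedtools intersect -u` would have kept the variant but Truvari excludes it.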
- -Extending an Include Bed ------------------------- -The option `--extend` extends the regions of interest (set with the `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions, it can potentially match a base variant that is in the original regions, turning it into a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match shorter comparison variants that are sizefilt (30bp by default) or longer. - -See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details. - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* use `--pick` method to sort and annotate variants with their best match -``` -![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v4.2.1/collapse.md b/docs/v4.2.1/collapse.md deleted file mode 100644 index f660f340..00000000 --- a/docs/v4.2.1/collapse.md +++ /dev/null @@ -1,163 +0,0 @@ -`collapse` is Truvari's approach to SV merging.
After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via: -```bash -bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. - -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---keep behavior -================= -When collapsing, the default `--keep` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other `--keep` options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---gt -==== -For some results, one may not want to collapse variants with conflicting genotypes from a single sample.
With the `--gt all` parameter, variants which are present (non `0/0` or `./.`) in the same sample are not collapsed. With the `--gt het` parameter, only variants which are both heterozygous in a sample (e.g. `0/1` and `0/1`) are prevented from collapsing. The `--gt het` option is useful for some SV callers which will redundantly call variants and typically genotype them all as `1/1`. - ---intra -======= -When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then the second, etc. For example, if we have an entry with: -``` -FORMAT RESULT1 RESULT2 -GT:GQ:AD ./.:.:3,0 1/1:20:0,30 -``` -The `--intra` output would be: -``` -FORMAT RESULT1 -GT:GQ:AD 1/1:20:3,0 -``` -As you can see in this example, 1) the first sample name is the only one preserved and 2) conflicting FORMAT fields can be consolidated in a non-useful way (here the AD of `3,0` isn't informative to a `1/1` genotype). We're working to provide an API to help users write custom intra-sample consolidation scripts. - ---hap mode -========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1 - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode -============ -Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'.
This means that each variant needs to match at least one other variant in the set, rather than every other variant. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has three annotations added to the `INFO`. -- `CollapseId` - Identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - Number of variants collapsed into this variant -- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.
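The chr1 5/7/9 example above can be modeled with a toy clustering sketch. This is an illustration of the grouping behavior only — positions stand in for full matches, and real collapse applies all of its matching thresholds, not just distance:

```python
def cluster(positions, thresh, chain=False):
    """Toy model of collapse grouping where 'matching' is just position distance.
    Default: a variant joins a group only if it matches the kept (first) member.
    chain=True: matching the most recently added member is enough (transitive)."""
    clusters = []
    for pos in sorted(positions):
        for group in clusters:
            ref = group[-1] if chain else group[0]
            if abs(pos - ref) <= thresh:
                group.append(pos)
                break
        else:
            clusters.append([pos])
    return clusters

print(cluster([5, 7, 9], 2))              # [[5, 7], [9]] -> two kept variants
print(cluster([5, 7, 9], 2, chain=True))  # [[5, 7, 9]]   -> one kept variant
```

Each inner list is one locus in the output; its first element is the variant kept under the default `--keep first`.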
- -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug] - [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain] - [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --no-consolidate Skip consolidation of sample genotype fields (True) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null - from all neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B MINHAPLEN, --minhaplen MINHAPLEN - Minimum haplotype sequence length to create (50) - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Min pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v4.2.1/consistency.md b/docs/v4.2.1/consistency.md deleted file mode 100644 index 17ba23b7..00000000 --- a/docs/v4.2.1/consistency.md +++ /dev/null @@ -1,179 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] [-d] [-j] [-o OUTPUT] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit - -d, --no-dups Disallow duplicate SVs - -j, --json Output report in json format - -o OUTPUT, --output OUTPUT - Write tsv of variant keys and their flag -``` - -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM POS ID REF ALT`.
Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. The key and flags can be written with the `--output` option. - -Duplicates -========== - -If there are VCFs with duplicate keys, they are handled by appending a suffix (e.g. `.1`) to the key. If you'd like to ignore duplicates, just add `--no-dups` - -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the fewest calls unique to it on the `Group 100` line. - -Json -==== -Below is a consistency report in json format.
-```json -{ - "vcfs": [ - "repo_utils/test_files/variants/input1.vcf.gz", - "repo_utils/test_files/variants/input2.vcf.gz", - "repo_utils/test_files/variants/input3.vcf.gz" - ], - "total_calls": 3513, - "num_vcfs": 3, - "vcf_counts": { - "repo_utils/test_files/variants/input1.vcf.gz": 2151, - "repo_utils/test_files/variants/input2.vcf.gz": 1783, - "repo_utils/test_files/variants/input3.vcf.gz": 2065 - }, - "shared": [ - { - "vcf_count": 3, - "num_calls": 701, - "call_pct": 0.1995445488186735 - }, - { - "vcf_count": 2, - "num_calls": 1084, - "call_pct": 0.3085681753487048 - }, - { - "vcf_count": 1, - "num_calls": 1728, - "call_pct": 0.4918872758326217 - } - ], - "detailed": [ - { - "group": "111", - "total": 701, - "total_pct": 0.1995445488186735, - "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326, - "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318, - "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683 - }, - { - "group": "001", - "total": 645, - "total_pct": 0.18360375747224594, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166 - }, - { - "group": "100", - "total": 598, - "total_pct": 0.17022487902077996, - "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "101", - "total": 487, - "total_pct": 0.1386279533162539, - "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884 - }, - { - "group": "010", - "total": 485, - "total_pct": 0.13805863933959578, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905, - "repo_utils/test_files/variants/input3.vcf.gz": 
0 - }, - { - "group": "110", - "total": 365, - "total_pct": 0.10389980074010817, - "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517, - "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "011", - "total": 232, - "total_pct": 0.06604042129234272, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667, - "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167 - } - ] -} -``` \ No newline at end of file diff --git a/docs/v4.2.1/divide.md b/docs/v4.2.1/divide.md deleted file mode 100644 index 3bbad3cb..00000000 --- a/docs/v4.2.1/divide.md +++ /dev/null @@ -1,58 +0,0 @@ -Divide a VCF into independent shards. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can the be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever truvari command. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples. 
First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands. - -``` -usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR - -Divide a VCF into independent parts - -positional arguments: - VCF VCF to split - DIR Output directory to save parts - -options: - -h, --help show this help message and exit - -b BUFFER, --buffer BUFFER - Buffer to make mini-clusters (1000) - -m MIN, --min MIN Minimum number of entries per-vcf (100) - --no-compress Don't attempt to compress/index sub-VCFs - -T THREADS, --threads THREADS - Number of threads for compressing/indexing sub-VCFs (1) -``` \ No newline at end of file diff --git a/docs/v4.2.1/phab.md b/docs/v4.2.1/phab.md deleted file mode 100644 index bdcaec9e..00000000 --- a/docs/v4.2.1/phab.md +++ /dev/null @@ -1,157 +0,0 @@ -Introduction ------------- - -Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations.
We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. - -This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there are plenty of distinct alleles with >= 50% homology. - -So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` - -`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. - -Requirements ------------- -Since `truvari phab` uses mafft v7.505 via a command-line call, it expects it to be in the environment path. Download [mafft](https://mafft.cbrc.jp/alignment/software/) and make its executable available in the `$PATH`. - -Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use. - -Also, you can use the wavefront aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonious alignments. And poa sits in the middle: more accurate than wfa and faster than mafft, but less accurate than mafft.
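A minimal pre-flight sketch for the requirement above — an illustrative helper of our own, not part of Truvari — checks whether mafft is on the `$PATH` and falls back to the built-in wfa aligner otherwise:

```python
import shutil

def pick_aligner(prefer: str = "mafft") -> str:
    """Return a usable --align backend: keep the preference unless it is
    mafft and mafft isn't installed, in which case fall back to wfa.
    Illustrative only; Truvari does not perform this fallback itself."""
    if prefer == "mafft" and shutil.which("mafft") is None:
        return "wfa"
    return prefer

print(pick_aligner())  # 'mafft' when installed, otherwise 'wfa'
```

A wrapper script could pass the returned value to `truvari phab --align` to stay runnable on machines without mafft.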
- -Example ------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region. -```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants 1bp and larger (`--sizemin 1 --sizefilt 1`, which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-comp": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235 -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region.
-```bash -truvari phab --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --bSample HG002 \ - --cSample syndip \ - --reference phab_ref.fa \ - --region chr1:700-900 \ - -o harmonized.vcf.gz -``` - -In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized. - -```bash -truvari bench -b harmonized.vcf.gz \ - -c harmonized.vcf.gz \ - -S 1 -s 1 \ - --no-ref a \ - --bSample HG002 \ - --cSample syndip \ - -o harmonized_bench/ -``` -Looking at `harmonized_bench/summary.txt` shows: -```json -{ - "TP-base": 8, - "TP-comp": 8, - "FP": 0, - "FN": 0, - "precision": 1.0, - "recall": 1.0, - "f1": 1.0 -} -``` -Now there is no difference between our two sets of variants in this region. - -For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes. -```bash -$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c - 1 0/1 1/0 - 1 1/0 0/1 - 6 1/1 1/1 -``` -(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order) - -MSA ---- - -If you read the `truvari phab --help`, you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example.vcf.gz -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l -``` -This shows the variants were spread over `98` positions but now sit at just `16`. - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50, which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... - 3 69 1 35 - 1 89 1 40 - 8 109 1 53 - 1 132 1 56 - 1 150 1 81 -``` - -(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) - -`--align` -========= -By default, `phab` will make the haplotypes and call the external tool `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. 
If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than mafft but less accurate, and slower than wfa but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes. - -Limitations ------------ -* Creating and aligning haplotypes is impractical for very long sequences and may be practically impossible for entire human chromosomes. Therefore, `truvari phab` is only recommended to be run on sub-regions. -* By giving the variants new representations, variant counts will likely change. -* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution. - diff --git a/docs/v4.2.1/refine.md b/docs/v4.2.1/refine.md deleted file mode 100644 index c43e0c0b..00000000 --- a/docs/v4.2.1/refine.md +++ /dev/null @@ -1,140 +0,0 @@ -As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result and recomputing benchmarking performance on harmonized variants, we present the tool `refine`. 
- -Quick Start -=========== - -Basic ----- -After making a `bench` result: -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -``` -Use `refine` on the `result/` -```bash -truvari refine -r subset.bed -f ref.fa result/ -``` - -Whole genome ------------- -After making a `bench` result: -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/ -``` -Use `refine` on the `result/` analyzing only the regions with putative FP/FN that would benefit from harmonization -```bash -truvari refine -R -U -f ref.fa --regions result/candidate.refine.bed result/ -``` - -Tandem Repeats -------------- -For benchmarks such as the [GIAB TR](https://www.biorxiv.org/content/10.1101/2023.10.29.564632v1), a TR caller may analyze a different subset of regions. In order to avoid unnecessarily penalizing the performance with FNs from unanalyzed regions: - -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/ -``` - -Use `refine` on the `result/` analyzing only the `hc_regions.bed` covered by the TR caller's `tool_regions.bed` -```bash -truvari refine -f ref.fa --regions tool_regions.bed result/ -``` - -Output -====== -* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions. 
Same structure as [[summary.json|bench#summaryjson]] -* `refine.regions.txt` - Tab-delimited file with per-region variant counts -* `refine.region_summary.json` - Per-region performance metrics -* `phab_bench/` - Bench results on the subset of variants harmonized - -To see an example output, look at [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one) - -Using `refine.regions.txt` -========================== -| Column | Description | -| ----------------- | --------------------------------------- | -| chrom | Region's chromosome | -| start | Region's start | -| end | Region's end | -| in_tpbase | Input's True Positive base count | -| in_tp | Input's True Positive comparison count | -| in_fp | Input's false positive count | -| in_fn | Input's false negative count | -| refined | Boolean indicating whether the region was re-evaluated | -| out_tpbase | Output's true positive base count | -| out_tp | Output's true positive comparison count | -| out_fn | Output's false negative count | -| out_fp | Output's false positive count | -| state | True/False state of the region | - - -Performance by Regions -====================== - -Because `truvari phab` can alter variant counts during harmonization, one may wish to assess the performance on a per-region basis rather than the per-variant basis. In the `refine.regions.txt`, a column `state` will have a TP/FN/FP value as defined by the following rules: - -```python -false_pos = (data['out_fp'] != 0) -false_neg = (data['out_fn'] != 0) -any_false = false_pos | false_neg - -true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false - -true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1) - -baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0) -compP = (data['out_tp'] != 0) | (data['out_fp'] != 0) -``` - -This logic has two edge cases to consider. 
1) a region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp, but no other variants because a neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions. - -These by-region state counts are summarized and written to `refine.region_summary.json`. The definitions of the metrics inside this json are: -| Key | Definition | Formula | -|--------|----------------------------------------------|---------------------------------| -| TP | True Positive region count | | -| TN | True Negative region count | | -| FP | False Positive region count | | -| FN | False Negative region count | | -| base P | Regions with base variant(s) | | -| base N | Regions without base variant(s) | | -| comp P | Regions with comparison variant(s) | | -| comp N | Regions without comparison variant(s) | | -| PPV | Positive Predictive Value (a.k.a. precision) | TP / comp P | -| TPR | True Positive Rate (a.k.a. recall) | TP / base P | -| TNR | True Negative Rate (a.k.a. specificity) | TN / base N | -| NPV | Negative Predictive Value | TN / comp N | -| ACC | Accuracy | (TP + TN) / (base P + base N) | -| BA | Balanced Accuracy | (TPR + TNR) / 2 | -| F1 | f1 score | 2 * ((PPV * TPR) / (PPV + TPR)) | -| UND | Regions with an undetermined state | | - -Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate from the by-variant performance reports. 
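The state rules and summary formulas above can be condensed into a small pure-python sketch. This is an illustrative re-implementation, not Truvari's actual code; it assumes each region carries a single state, so that e.g. base P = TP + FN.

```python
def region_state(out_tpbase, out_tp, out_fn, out_fp):
    """Assign a per-region state from harmonized variant counts."""
    false_pos, false_neg = out_fp != 0, out_fn != 0
    if false_pos and false_neg:
        return "FP,FN"  # edge case 1: counted as both FP and FN
    if false_pos:
        return "FP"
    if false_neg:
        return "FN"
    if out_tp != 0 and out_tpbase != 0:
        return "TP"
    if out_tpbase == 0 and out_tp == 0:
        return "TN"  # all four counts are zero
    return "UNK"  # edge case 2: 'variant bleed' from a neighboring region

def region_summary(tp, tn, fp, fn):
    """Compute the by-region metrics from state counts (UNK regions excluded)."""
    base_p, base_n = tp + fn, tn + fp
    comp_p, comp_n = tp + fp, tn + fn
    ppv, tpr = tp / comp_p, tp / base_p
    tnr, npv = tn / base_n, tn / comp_n
    return {
        "PPV": ppv,
        "TPR": tpr,
        "TNR": tnr,
        "NPV": npv,
        "ACC": (tp + tn) / (base_p + base_n),
        "BA": (tpr + tnr) / 2,
        "F1": 2 * (ppv * tpr) / (ppv + tpr),
    }
```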
- -`--align` -========= -By default, Truvari will make the haplotypes and call the external tool `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than mafft but less accurate, and slower than wfa but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes. - -`--use-original-vcfs` -===================== - -By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, these files contain a filtered subset of the variants originally provided to `bench`, since `bench` removes variants e.g. below `--sizemin` or not `--passonly`. - -With the `--use-original-vcfs` parameter, all of the original calls from the input vcfs are fetched. This parameter is useful in recovering matches in situations when variants in one call set are split into two variants which are smaller than the minimum size analyzed by `bench`. For example, imagine a base VCF with a 20bp DEL, a comp VCF with two 10bp DEL, and `bench --sizemin 20` was used. `--use-original-vcfs` will consider the two 10bp comp variants during phab harmonization with the 20bp base DEL. - - -`--regions` -=========== - -This parameter specifies which regions to re-evaluate. If this is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--includebed` is subset to only those intersecting `--regions`. - -This parameter is helpful for cases when the `--includebed` is not the same set of regions that a caller analyzes. 
For example, if a TR caller only discovers short tandem repeats (STR), but a benchmark has TRs of all lengths, it isn't useful to benchmark against the non-STR variants. Therefore, you can run `bench` on the full benchmark's regions (`--includebed`), and automatically subset to only the regions analyzed by the caller with `refine --regions`. - -Note that the larger these regions are, the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--region` partially overlaps an `--includebed` region, you may no longer be analyzing a strict subset of the calls looked at during the original `bench` run. Therefore, the `*summary.json` should be compared with caution. - -`--use-region-coords` -===================== - -When intersecting `--includebed` with `--regions`, use `--regions` coordinates. By default, `refine` will prefer the `--includebed` coordinates. However, the region's coordinates should be used when using the `candidate.refine.bed` to limit analysis to only the regions with putative FP/FN that would benefit from harmonization - for example, when performing whole genome benchmarking. - -`--reference` -============= - -By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants. \ No newline at end of file diff --git a/docs/v4.2.1/segment.md b/docs/v4.2.1/segment.md deleted file mode 100644 index 953e5822..00000000 --- a/docs/v4.2.1/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis. 
- -This tool adds an INFO field `SEGCNT` which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`. - -In the below example, we have three deletions found across three samples. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - -The annotations `segment` adds for the regions would then be: -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | \ No newline at end of file diff --git a/docs/v4.2.1/stratify.md b/docs/v4.2.1/stratify.md deleted file mode 100644 index 353dcab5..00000000 --- a/docs/v4.2.1/stratify.md +++ /dev/null @@ -1,58 +0,0 @@ -`stratify` is a helper utility for counting variants within bed regions which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are four vcfs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts. 
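Conceptually, this per-region counting is plain region/variant overlap counting; a minimal pure-python sketch of the idea (illustrative only, not Truvari's implementation) might look like:

```python
def count_per_region(regions, variants):
    """Count variants whose position falls within each region.

    regions: iterable of (chrom, start, end); variants: iterable of (chrom, pos).
    A real implementation would use interval trees or bedtools for speed.
    """
    counts = []
    for chrom, start, end in regions:
        n = sum(1 for vchrom, vpos in variants
                if vchrom == chrom and start <= vpos < end)
        counts.append(n)
    return counts
```

`stratify` effectively does this once per VCF in the `bench` directory (tp-base, tp-comp, fn, fp) and joins the four counts into one table.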
- -For example: -```bash -$ truvari stratify input.bed bench/ -chrom start end tpbase tp fn fp -chr20 280300 280900 0 0 0 0 -chr20 100000 200000 1 1 0 0 -chr20 642000 642350 1 1 2 1 -``` - -The output from this can then be parsed to generate more details: - -```python -import pandas as pd -import truvari - -df = pd.read_csv("stratify.output.txt", sep='\t') - -# The input.bed didn't have a header (so we couldn't use the `--header` parameter); name the columns manually -df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp'] - -# Create the precision, recall, and f1 for each row -metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1) - -# metrics is now a Series of tuples; let's separate them into columns -metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"]) - -# Extend the dataframe's columns -df = df.join(metrics) -df.head() -``` -Which gives the result: -``` - chrom start end tpbase tp fn fp precision recall f1 -0 chr20 135221 239308 1 1 0 0 1.00 1.00 1.000000 -1 chr20 260797 465632 3 3 3 1 0.75 0.50 0.600000 -2 chr20 465866 622410 1 1 0 0 1.00 1.00 1.000000 -3 chr20 623134 655257 1 1 3 1 0.50 0.25 0.333333 -4 chr20 708338 732041 1 1 1 0 1.00 0.50 0.666667 -``` - -``` -usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF - -Count variants per-region in vcf - -positional arguments: - BED Regions to process - VCF Truvari bench result directory or a single VCF - -optional arguments: - -h, --help show this help message and exit - -o OUT, --output OUT Output bed-like file - --header Input regions have header to preserve in output - -w, --within Only count variants contained completely within region boundaries - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.1/vcf2df.md b/docs/v4.2.1/vcf2df.md deleted file mode 100644 index 14d9f06c..00000000 --- a/docs/v4.2.1/vcf2df.md +++ /dev/null @@ -1,81 +0,0 @@ -We enjoy using 
[pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* chrom: variant chromosome -* start: 0-based start from pysam.VariantRecord.start -* end: 0-based end from pysam.VariantRecord.stop -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean for whether the filter is empty or PASS - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, which typically holds Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values, and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. 
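The `Number`-based column expansion described above can be sketched as follows. This is a simplified, hypothetical illustration of the naming scheme, not `vcf2df`'s actual code:

```python
# Column suffixes for the one-ALT-per-line assumption:
# Number=A yields ref/alt columns, Number=G yields ref/het/hom columns
SUFFIXES = {"A": ["_ref", "_alt"], "G": ["_ref", "_het", "_hom"]}

def expand_format_field(name, number, values):
    """Turn a multi-valued FORMAT field into named DataFrame-style columns."""
    return {name + suf: val for suf, val in zip(SUFFIXES[number], values)}
```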
`vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-comp.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameter sets used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named `params/sample/` and each converted to a DataFrame with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with the command: -```python -result = pyranges.PyRanges(df.rename(columns={'chrom':"Chromosome", "start":"Start", "end":"End"})) -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL - -Takes a vcf and creates a data frame. 
Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fileds into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -n, --no-prefix Don't prepend sample name to format columns - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - -c LVL, --compress LVL - Compression level for joblib 0-9 (3) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.2/Citations.md b/docs/v4.2.2/Citations.md deleted file mode 100644 index d600b860..00000000 --- a/docs/v4.2.2/Citations.md +++ /dev/null @@ -1,30 +0,0 @@ -# Citing Truvari - -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. 
-* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted 
nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) -* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) -* [Automated filtering of genome-wide large deletions through an ensemble deep learning framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git a/docs/v4.2.2/Development.md b/docs/v4.2.2/Development.md deleted file mode 100644 index ccb38493..00000000 --- a/docs/v4.2.2/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own 
code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage. 
- -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant to [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. 
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Change Updates Wiki -4) Freeze the Wiki -5) Ensure all code is checked in -6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -7) Use github action to make a testpypi release -8) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -9) Use GitHub action to make a pypi release -10) Download release-tarball.zip from step #9’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.2.2/Home.md b/docs/v4.2.2/Home.md deleted file mode 100644 index 47fbc626..00000000 --- a/docs/v4.2.2/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Citation: -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. 
- -Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[anno|anno]] - - [[bench|bench]] - - [[collapse|collapse]] - - [[consistency|consistency]] - - [[divide|divide]] - - [[phab|phab]] - - [[refine|refine]] - - [[segment|segment]] - - [[stratify|stratify]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v4.2.2/Installation.md b/docs/v4.2.2/Installation.md deleted file mode 100644 index 6d2a5658..00000000 --- a/docs/v4.2.2/Installation.md +++ /dev/null @@ -1,56 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.2.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/ACEnglish/truvari.git -git checkout tags/v3.0.0 -python3 -m pip install . -``` - -To see a list of all available tags, run: -``` -git tag -l -``` - -If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba / Conda -============= -NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. 
-Users can follow the instructions here (https://mamba.readthedocs.io/en/latest/installation.html) to install mamba, a faster conda-compatible package manager. - -Create an environment with Truvari and its dependencies: -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See the [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. diff --git a/docs/v4.2.2/MatchIds.md b/docs/v4.2.2/MatchIds.md deleted file mode 100644 index ca52076f..00000000 --- a/docs/v4.2.2/MatchIds.md +++ /dev/null @@ -1,74 +0,0 @@ -MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploration. MatchIds have a structure of `{chunkid}.{callid}`. The chunkid is a unique id per chunk of calls. All calls sharing a chunkid were within `--chunksize` distance and were compared. The callid is unique to a call in a chunk for each VCF. Because `bench` processes two VCFs (the base and comparison VCFs), the `MatchId` has two values: the first is the base variant's MatchId and the second is the comparison variant's MatchId. - -For `--pick single`, the two MatchIds will be identical in, e.g., the tp-base.vcf.gz and tp-comp.vcf.gz.
However, for `--pick ac|multi`, it's possible to have cases such as one base variant matching to multiple comparison variants. That would give us MatchIds like: - -``` -# tp-base.vcf -MatchId=4.0,4.1 - -# tp-comp.vcf -MatchId=4.0,4.1 -MatchId=4.0,4.2 -``` - -This example tells us that the tp-comp variants are both pointing to `4.0` in tp-base. The tp-base variant has a higher match to the tp-comp `4.1` variant. - -One easy way to combine matched variants is to use `truvari vcf2df` to convert a benchmarking result to a pandas DataFrame and leverage pandas' merge operation. First, we convert the `truvari bench` result. - -```bash -truvari vcf2df --info --bench-dir bench_result/ data.jl -``` - -Next, we combine rows of matched variants: -```python -import joblib -import pandas as pd - -# Load the data -data = joblib.load("data.jl") - -# Separate out the variants from the base VCF and add new columns of the base/comp ids -base = data[data['state'].isin(['tpbase', 'fn'])].copy() -base['base_id'] = base['MatchId'].apply(lambda x: x[0]) -base['comp_id'] = base['MatchId'].apply(lambda x: x[1]) - -# Separate out the variants from the comparison VCF and add new columns of the base/comp ids -comp = data[data['state'].isin(['tp', 'fp'])].copy() -comp['base_id'] = comp['MatchId'].apply(lambda x: x[0]) -comp['comp_id'] = comp['MatchId'].apply(lambda x: x[1]) - -# Merge the base/comparison variants -combined = pd.merge(base, comp, left_on='base_id', right_on='comp_id', suffixes=('_base', '_comp')) - -# How many comp variants matched to multiple base variants? -counts1 = combined['base_id_comp'].value_counts() -print('multi-matched comp count', (counts1 != 1).sum()) - -# How many base variants matched to multiple comp variants? -counts2 = combined['comp_id_base'].value_counts() -print('multi-matched base count', (counts2 != 1).sum()) -``` - -The `MatchId` is also used by `truvari collapse`. However there are two differences. 
First, in the main `collapse` output, the relevant INFO field is named `CollapseId`. Second, because collapse only has a single input VCF, it is much easier to merge DataFrames. To merge a collapse result's kept variants with those that were removed, we again need to convert the VCFs to DataFrames: - -```bash -truvari vcf2df -i kept.vcf.gz kept.jl -truvari vcf2df -i removed.vcf.gz remov.jl -``` - -Then we combine them: -```python -import joblib -import pandas as pd - -# Load the kept variants and set the index. -kept = joblib.load("kept.jl").set_index('CollapseId') - -# Load the removed variants and set the index. -remov = joblib.load("remov.jl") -remov['CollapseId'] = remov['MatchId'].apply(lambda x: x[0]) -remov.set_index('CollapseId', inplace=True) - -# Join the two sets of variants -result_df = kept.join(remov, how='right', rsuffix='_removed') -``` \ No newline at end of file diff --git a/docs/v4.2.2/Multi-allelic-VCFs.md b/docs/v4.2.2/Multi-allelic-VCFs.md deleted file mode 100644 index fd0eb23e..00000000 --- a/docs/v4.2.2/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC ....
-``` - -Then pre-process the VCFs with bcftools: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v4.2.2/Updates.md b/docs/v4.2.2/Updates.md deleted file mode 100644 index 88ad1db1..00000000 --- a/docs/v4.2.2/Updates.md +++ /dev/null @@ -1,273 +0,0 @@ -# Truvari 4.2.2 -*March 28, 2024* - -* `collapse` - * Fewer comparisons needed per-chunk on average - * Fixed `--chain` functionality ([details](https://github.com/ACEnglish/truvari/issues/196)) - * Fixed `--gt` consolidation of format fields -* `bench` - * Faster results at the cost of less complete annotations with the `--short` flag -* `refine` - * Ensures variants are sequence resolved before incorporating them into the consensus - * `bench --passonly --sizemax` parameters are used when building consensus for a region. Useful for `refine --use-original-vcfs` - * When a refined region has more than 5k variants, it is skipped and a warning for the region is written to the log - * Flag `--use-region-coords` now expands `--region` coordinates by 100bp (`phab --buffer` default) to allow variants to harmonize out of regions. -* general - * Dynamic bed/vcf parsing tries to choose the faster of streaming/fetching variants - -# Truvari 4.2.1 -*February 6, 2024* -* `collapse` - * Faster handling of genotype data for `--gt` and `--keep common` -* general - * Fix to bed end position bug for including variants ([details](https://github.com/ACEnglish/truvari/issues/193)) - * Fix to Dockerfile -* `refine` - * Changes to `--recount` that accompany the fix to bed end positions.
- -* New command `ga4gh` to convert Truvari results into GA4GH truth/query VCFs with intermediate tags - -# Truvari 4.2 -*January 12, 2024* -* `collapse` - * New parameter `--gt` disallows intra-sample events from collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--gt)) - * New parameter `--intra` for consolidating SAMPLE information during intra-sample collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--intra)) - * Preserve phasing information when available - * Faster O(n-1) algorithm instead of O(n^2) - * Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons - * Fixed a rare non-determinism in cases where multiple variants at the same position with equal qual/ac could be ordered differently -* `phab` - * Correct sample handling with `--bSamples` `--cSamples` parameters - * Faster generation of consensus sequence - * Resolved 'overlapping' variant issue causing variants to be dropped - * New `poa` approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate. -* `bench` - * New, easier `MatchId` field to track which baseline/comparison variants match up [details](https://github.com/ACEnglish/truvari/wiki/MatchIds) - * `entry_is_present` method now considers partially missing variants (e.g. `./1`) as present - * Removed the 'weighted' metrics from `summary.json` -* `consistency` - * Fixed issue with counting duplicate records - * Added flag to optionally ignore duplicate records -* `anno svinfo` now overwrites existing SVLEN/SVTYPE info fields -* general - * Reduced fn matches for unroll sequence similarity by reporting the maximum of multiple manipulations of the variant sequence (roll up/down/none). Comes at a small, but reasonable, expense of some more fp matches.
- * Bump pysam version - * Fixed bug in `unroll` sequence similarity that sometimes rolled from the wrong end - * Fixed bug for handling of None in ALT field - * `truvari.compress_index_vcf` forces overwriting of the tabix index to prevent annoying crashes - - -# Truvari 4.1 -*August 7, 2023* - -* `bench` - * Creates `candidate.refine.bed` which hooks into `refine` on whole-genome VCFs [details](https://github.com/ACEnglish/truvari/wiki/bench#refining-bench-output) - * `--recount` for correctly assessing whole-genome refinement results - * experimental 'weighted' summary metrics [details](https://github.com/ACEnglish/truvari/wiki/bench#weighted-performance) - * Unresolved SVs (e.g. `ALT == <DEL>`) are filtered when `--pctseq != 0` -* `phab` - * ~2x faster via reduced IO from operating in stages instead of per-region - * Removed most external calls (e.g. samtools doesn't need to be in the environment anymore) - * new `--align wfa` allows much faster (but slightly less accurate) variant harmonization - * increased determinism of results [details](https://github.com/ACEnglish/truvari/commit/81a9ab85b91b0c530f9faeedfa4e7e0d68a5e8c2) -* `refine` - * Faster bed file intersection of `--includebed` and `--regions` - * Refine pre-flight check - * Correct refine.regions.txt end position from IntervalTree correction - * Better refine region selection with `--use-original` - * `--use-includebed` switched to `--use-region-coords` so that the default behavior is to prefer the includebed's coordinates - * `--use-original-vcfs` to use the original pre-bench VCFs - * `refine.variant_summary.json` is cleaned of uninformative metrics -* `stratify` - * parallel parsing of truvari directory to make processing ~4x faster -* `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes -* `anno grpaf` - expanded annotation info fields -* `anno density` - new parameter `--stepsize` for sliding windows -* `collapse` - * New optional `--median-info` fields
[#146](https://github.com/ACEnglish/truvari/issues/146) -* Minor updates - * Fix some `anno` threading on macOS [#154](https://github.com/ACEnglish/truvari/issues/154) - * Monomorphic/multiallelic check fix in `bench` - * `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key - * Slightly slimmer docker container - -# Truvari 4.0 -*March 13, 2023* - -As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools and how a couple of tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. - -* New `refine` command for refining benchmarking results. [Details](refine) -* `bench` - * [Unroll](bench#unroll) is now the default sequence comparison approach. - * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) - * The `summary.txt` is now named `summary.json` - * Outputs parameters to `params.json` - * Output VCFs are sorted, compressed, and indexed - * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) - * Renamed `--pctsim` parameter to `--pctseq` - * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported - * Fixed bug where `INFO/Multi` wasn't being properly applied - * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. 
- * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` -* `collapse` - * Renamed `--pctsim` parameter to `--pctseq` - * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` - * Better output sorting which may allow pipelines to be a little faster. -* `vcf2df` - * More granular sizebins for `[0,50)` including better handling of SNPs - * `--multisample` is removed. Now automatically add all samples with `--format` - * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges -* `anno` - * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. - * Added `addid` -* `consistency` - * Slight speed improvement - * Better json output format -* `segment` - * Added `--passonly` flag - * Changed UI, including writing to stdout by default - * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN -* API - * Began a focused effort on improving re-usability of Truvari code. - * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). - * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) - * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) -* general - * msa2vcf now left-trims and decomposes variants into indels - * Functional tests reorganization - * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) - * Removed progressbar and Levenshtein dependencies as they are no longer used. 
- -# Truvari 3.5 -*August 27, 2022* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT) - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](../phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. a SNP's entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to a precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by the VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly, which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g.
2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non-./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` in favor of `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building a `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and comparing REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match.
-* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details. -* Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use the Match object for prettier code and cleaner output. -* `anno remap` now optionally adds an `INFO` field with the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively.
-* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grows, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results, with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory.
It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC. - -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. 
In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we now only support Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats. \ No newline at end of file diff --git a/docs/v4.2.2/bench.md b/docs/v4.2.2/bench.md deleted file mode 100644 index 0d027cb6..00000000 --- a/docs/v4.2.2/bench.md +++ /dev/null @@ -1,293 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ -``` - -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It depends on the precision of your callers and how much tolerance you wish to allow them while keeping the comparison fair. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries and don't report the exact deleted sequence, only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences. When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find what level you find acceptable. Truvari is meant to be flexible for comparison.
More importantly, Truvari helps one clearly report the thresholds used for reproducibility. - -Here is a rundown of each matching parameter. -| Parameter | Default | Definition | -|------------|---------|------------| -| refdist | 500 | Maximum distance a comparison call's start/end can be from the base call's start/end | -| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | -| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | -| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | -| typeignore | False | Types don't need to match to compare calls. | - -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctseq PCTSEQ (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Outputs -======= -Truvari bench writes the following files to the `--output` directory. -| File | Description | -|----------------------|---------------------------------------------| -| tp-base.vcf.gz | True positive calls from the base VCF | -| tp-comp.vcf.gz | True positive calls from the comparison VCF | -| fp.vcf.gz | False positive calls from comparison | -| fn.vcf.gz | False negative calls from base | -| summary.json | Json output of performance stats | -| params.json | Json output of parameters used | -| candidate.refine.bed | Bed file of regions for `refine` | -| log.txt | Run's log | - -summary.json ------------- -Stats generated by benchmarking are written to `summary.json`.
- -| Metric | Definition | -|----------------|------------------------------------------------------------| -| TP-base | Number of matching calls from the base vcf | -| TP-comp | Number of matching calls from the comp vcf | -| FP | Number of non-matching calls from the comp vcf | -| FN | Number of non-matching calls from the base vcf | -| precision | TP-comp / (TP-comp + FP) | -| recall | TP-base / (TP-base + FN) | -| f1 | 2 * ((recall * precision) / (recall + precision)) | -| base cnt | Number of calls in the base vcf | -| comp cnt | Number of calls in the comp vcf | -| TP-comp_TP-gt | TP-comp with genotype match | -| TP-comp_FP-gt | TP-comp without genotype match | -| TP-base_TP-gt | TP-base with genotype match | -| TP-base_FP-gt | TP-base without genotype match | -| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | -| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | -| weighted | Metrics weighed by variant sequence/size similarity | - -The `gt_matrix` is a table. For example: -```json -"gt_matrix": { - "(0, 1)": { - "(0, 1)": 500, - "(1, 1)": 10 - }, - "(1, 1)": { - "(1, 1)": 800, - "(0, 1)": 20 - } -} -``` -Represents -> -``` -comp (0,1) (1,1) -base -(0,1) 500 10 -(1,1) 20 800 -``` - -Added annotations ------------------ -The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory. - -| Anno | Definition | -|-------------------|-----------------------------------------------------------------------------------------------------------------| -| TruScore | Truvari score for similarity of match. 
`((pctseq + pctsize + pctovl) / 3 * 100)` | -| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | -| PctSizeSimilarity | Pct size similarity between this variant and its closest match | -| PctRecOverlap | Percent reciprocal overlap of the two calls | -| StartDistance | Distance of the base call's start from the comparison call's start | -| EndDistance | Distance of the base call's end from the comparison call's end | -| SizeDiff | Difference in size of base and comp calls | -| GTMatch | Whether the base/comp calls' genotypes match | -| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | - - -Refining bench output -===================== -As described in the [[refine wiki|refine]], a limitation of Truvari bench is its 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -truvari refine --regions result/candidate.refine.bed \ - --reference reference.fasta \ - --recount --use-region-coords \ - result/ -``` -See [[refine wiki|refine]] for details. - -Comparing Sequences of Variants -=============================== - -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. - -## Unroll -Giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c).
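As a toy illustration of this rolling idea, an insertion's ALT can be treated as circular and rotated by the distance between two variant positions before measuring similarity. This sketch is not Truvari's implementation (Truvari computes an edit-distance ratio with edlib); `difflib` stands in for the similarity measure here:

```python
import difflib

def unroll(alt_seq, dist):
    """Rotate an ALT sequence by dist bases, treating it as circular."""
    d = dist % len(alt_seq)
    return alt_seq[d:] + alt_seq[:d]

def similarity(a, b):
    """Stand-in similarity ratio; Truvari uses an edit-distance ratio instead."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# An INS of 'AB' at POS 2 unrolled by 2 - 1 = 1 base matches an INS of 'BA' at POS 1
rolled = unroll("AB", 2 - 1)
print(rolled, similarity(rolled, "BA"))
```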
- -The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream, respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant, e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. - -This unroll method has a number of benefits and a couple of considerations, including: -* not needing a `--reference` for comparison, which saves I/O time -* increasing the number of correctly matching SVs -* decreasing the number of 'suspect' matches in smaller size regimes -* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity - -## Reference context -For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -## SVs without sequences - -If the base or comp vcfs do not have sequence resolved calls (e.g.
``, simply set `--pctseq=0` to turn off -sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If -`--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared. - -Controlling the number of matches -================================= - -How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`. - -* `single` (the default option) allows each variant to participate in up to one match. -* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). A homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1). -* `multi` allows variants to participate in all available matches. - -As an example, imagine we have three variants in a pVCF with two samples we want to compare. - -``` -CHROM POS ID REF ALT base comp -chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0 -chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1 -chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1 -``` - -To compare samples inside the same vcf, we would use the command: -```bash -truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a -``` - -This VCF yields different results depending on the `--pick` parameter: - -| Parameter | ID1 State | ID2 State | ID3 State | -|-----------|-----------|-----------|-----------| -| single | TP | FP | FP | -| ac | TP | TP | FP | -| multi | TP | TP | TP | - ---dup-to-ins -============ - -Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e.
the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0` - ---sizemin and --sizefilt -======================== - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It can -be less than `sizemin` for edge case variants. - -For example: Imagine `sizemin` is set at 50 and `sizefilt` at 30, and a 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls could be considered matching. However, if we removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between `[sizefilt,sizemin)` to find matches. - -This has the side effect of artificially inflating specificity. For example, if that same 49bp call described -above were below the similarity threshold, it would not be classified as a FP since it is below the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary -statistics and will not be present in any output vcfs. 
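The containment rule can be illustrated with a short sketch (a hypothetical helper, not Truvari's API): a variant passes only when both its start and end fall inside a single include region.

```python
def in_same_region(var_start, var_end, regions):
    """True when the variant's start AND end fall within one include region."""
    return any(start <= var_start and var_end <= end
               for start, end in regions)

regions = [(100, 200), (500, 600)]
print(in_same_region(150, 180, regions))  # fully contained -> True
print(in_same_region(190, 510, regions))  # spans two regions -> False
```

A `bedtools intersect`-style test would accept the second variant because it touches a region; Truvari's stricter containment test does not.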
- -Extending an Include Bed ------------------------ -The option `--extend` extends the regions of interest (set via the `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant is in the extended regions it can potentially match a base variant that is in the original regions, turning it into a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only the base variants longer than sizemin (equal to 50 by default) are considered, but they are allowed to match comparison variants as short as sizefilt (30bp by default). - -See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details. - -Methodology -=========== -Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs. -``` -* zip the Base and Comp calls together in sorted order -* create chunks of all calls overlapping within ±`--chunksize` basepairs -* make a |BaseCall| x |CompCall| match matrix for each chunk -* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds -* if the chunk has no Base or Comp calls -** return them all as FNs/FPs -* use `--pick` method to sort and annotate variants with their best match -``` -![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png) \ No newline at end of file diff --git a/docs/v4.2.2/collapse.md b/docs/v4.2.2/collapse.md deleted file mode 100644 index f660f340..00000000 --- a/docs/v4.2.2/collapse.md +++ /dev/null @@ -1,163 +0,0 @@ -`collapse` is Truvari's approach to SV merging.
After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via: -```bash -bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. - -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -f /path/to/reference.fa -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior -================= -When collapsing, the default `--choose` behavior is to take the first variant from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---gt -==== -For some results, one may not want to collapse variants with conflicting genotypes from a single sample. 
With the `--gt all` parameter, variants which are present (non `0/0` or `./.`) in the same sample are not collapsed. With the `--gt het` parameter, only variants which are both heterozygous in a sample (e.g. `0/1` and `0/1`) are prevented from collapsing. The `--gt het` is useful for some SV callers which will redundantly call variants and typically genotype them all as `1/1`. - ---intra -======= -When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` fills each FORMAT field with the first non-null value it finds, checking the first sample column, then the second, etc. For example, if we have an entry with: -``` -FORMAT RESULT1 RESULT2 -GT:GQ:AD ./.:.:3,0 1/1:20:0,30 -``` -The `--intra` output would be: -``` -FORMAT RESULT1 -GT:GQ:AD 1/1:20:3,0 -``` -As you can see in this example, 1) the first sample name is the only one preserved, and 2) conflicting FORMAT fields can be consolidated in a non-useful way (here the AD of `3,0` isn't informative to a `1/1` genotype). We're working to provide an API to help users write custom intra-sample consolidation scripts. - ---hap mode -========== -When using `--hap`, we assume phased variants from a single individual. Only the -single best matching call from the other haplotype will be collapsed, -and the consolidated genotype will become 1/1. - -For example, if we collapse anything at the same position: - - chr1 1 .. GT 0|1 - chr1 1 .. GT 1|0 - chr1 2 .. GT 1|0 - -will become: - - chr1 1 .. GT 1/1 - chr1 2 .. GT 1|0 - ---chain mode -============ -Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'.
This means that a variant needs to match only at least one other variant in the set, rather than all of them. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one. -For example, if we have - - chr1 5 .. - chr1 7 .. - chr1 9 .. - -When we collapse anything within 2bp of each other, without `--chain`, we output: - - chr1 5 .. - chr1 9 .. - -With `--chain`, we would collapse `chr1 9` as well, producing - - chr1 5 .. - -Annotations -=========== -`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants. - -The output file has three annotations added to the `INFO`. -- `CollapseId` - Identifier of the variant when comparing to the collapse outputs. -- `NumCollapsed` - Number of variants collapsed into this variant -- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes - -The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.
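The difference between default all-pairs matching and `--chain` transitive matching can be illustrated with a toy clustering function over variant positions (illustrative only, not Truvari's implementation):

```python
def cluster(positions, refdist, chain=False):
    """Group positions whose distances are within refdist.
    chain=False requires a position to match every member of a group;
    chain=True requires it to match at least one member (transitive)."""
    clusters = []
    for pos in sorted(positions):
        for grp in clusters:
            matcher = any if chain else all
            if matcher(abs(pos - p) <= refdist for p in grp):
                grp.append(pos)
                break
        else:
            clusters.append([pos])
    return clusters

print(cluster([5, 7, 9], 2))              # [[5, 7], [9]] -> two kept variants
print(cluster([5, 7, 9], 2, chain=True))  # [[5, 7, 9]]   -> one kept variant
```

Keeping the first member of each cluster reproduces the `chr1 5`/`chr1 9` vs. `chr1 5` outcome described above.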
- -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug] - [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain] - [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --no-consolidate Skip consolidation of sample genotype fields (True) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null - from all neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B MINHAPLEN, --minhaplen MINHAPLEN - Minimum haplotype sequence length to create (50) - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Min pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v4.2.2/divide.md b/docs/v4.2.2/divide.md deleted file mode 100644 index 3bbad3cb..00000000 --- a/docs/v4.2.2/divide.md +++ /dev/null @@ -1,58 +0,0 @@ -Divide a VCF into independent shards. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can then be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever Truvari command we choose. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples.
First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands. - -``` -usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR - -Divide a VCF into independent parts - -positional arguments: - VCF VCF to split - DIR Output directory to save parts - -options: - -h, --help show this help message and exit - -b BUFFER, --buffer BUFFER - Buffer to make mini-clusters (1000) - -m MIN, --min MIN Minimum number of entries per-vcf (100) - --no-compress Don't attempt to compress/index sub-VCFs - -T THREADS, --threads THREADS - Number of threads for compressing/indexing sub-VCFs (1) -``` \ No newline at end of file diff --git a/docs/v4.2.2/phab.md b/docs/v4.2.2/phab.md deleted file mode 100644 index bdcaec9e..00000000 --- a/docs/v4.2.2/phab.md +++ /dev/null @@ -1,157 +0,0 @@ -Introduction ------------- - -Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations.
We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. - -This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there are plenty of distinct alleles with >= 50% homology. - -So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` - -`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. - -Requirements ------------- -Since `truvari phab` runs mafft v7.505 via a command-line call, the mafft executable must be available on the `$PATH`. Download it from the [mafft website](https://mafft.cbrc.jp/alignment/software/). - -Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use. - -Also, you can use the wavefront aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonious alignments. And while poa is more accurate than wfa and faster than mafft, it is less accurate than mafft.
- -Example ------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region. -```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants 1bp and larger (`--sizemin 1 --sizefilt 1`, which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-comp": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235 -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region.
-```bash -truvari phab --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --bSample HG002 \ - --cSample syndip \ - --reference phab_ref.fa \ - --region chr1:700-900 \ - -o harmonized.vcf.gz -``` - -In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized. - -```bash -truvari bench -b harmonized.vcf.gz \ - -c harmonized.vcf.gz \ - -S 1 -s 1 \ - --no-ref a \ - --bSample HG002 \ - --cSample syndip \ - -o harmonized_bench/ -``` -Looking at `harmonized_bench/summary.txt` shows: -```json -{ - "TP-base": 8, - "TP-comp": 8, - "FP": 0, - "FN": 0, - "precision": 1.0, - "recall": 1.0, - "f1": 1.0 -} -``` -Now there is no difference between our two sets of variants in this region. - -For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes. -```bash -$ bcftools query -f "[%GT ]\n" harmonized.vcf.gz | sort | uniq -c - 1 0/1 1/0 - 1 1/0 0/1 - 6 1/1 1/1 -``` -(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order) - -MSA ---- - -If you read the `truvari phab --help`, you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example.vcf.gz -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l -``` -This shows that the variants were spread over `98` positions but now sit at just `16`. - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50 which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... - 3 69 1 35 - 1 89 1 40 - 8 109 1 53 - 1 132 1 56 - 1 150 1 81 -``` - -(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) - -`--align` -========= -By default, `phab` will make the haplotypes and use an external call to `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast.
If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than mafft but less accurate, and slower than wfa but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes. - -Limitations ----------- -* Creating and aligning haplotypes is impractical for very long sequences and may be practically impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions. -* By giving the variants new representations, variant counts will likely change. -* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution. - diff --git a/docs/v4.2.2/segment.md b/docs/v4.2.2/segment.md deleted file mode 100644 index 953e5822..00000000 --- a/docs/v4.2.2/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjoint genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjoint regions. This is an experimental tool that explores the possibility of assisting SV association analysis. - -This tool adds an INFO field `SEGCNT` which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous then `SEG=4`. - -In the below example, we have three deletions found across three samples.
- -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - -The annotations `segment` adds for these regions would then be: -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | \ No newline at end of file diff --git a/docs/v4.2.2/stratify.md b/docs/v4.2.2/stratify.md deleted file mode 100644 index 353dcab5..00000000 --- a/docs/v4.2.2/stratify.md +++ /dev/null @@ -1,58 +0,0 @@ -`stratify` is a helper utility for counting variants within bed regions, which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are four vcfs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts. - -For example: -```bash -$ truvari stratify input.bed bench/ -chrom start end tpbase tp fn fp -chr20 280300 280900 0 0 0 0 -chr20 100000 200000 1 1 0 0 -chr20 642000 642350 1 1 2 1 -``` - -The output from this can then be parsed to generate more details: - -```python -import pandas as pd -import truvari - -df = pd.read_csv("stratify.output.txt", sep='\t') - -# Since input.bed had no header (so we couldn't use the `--header` parameter), we need to name the columns -df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp'] - -# Create the precision, recall, and f1 for each row -metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1) - -# metrics is now a Series of tuples, let's separate them into columns -metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"]) - -# Extend the dataframe's columns -df = df.join(metrics) -df.head() -``` -Which gives the result: -``` - chrom start end tpbase tp fn fp precision recall
f1 -0 chr20 135221 239308 1 1 0 0 1.00 1.00 1.000000 -1 chr20 260797 465632 3 3 3 1 0.75 0.50 0.600000 -2 chr20 465866 622410 1 1 0 0 1.00 1.00 1.000000 -3 chr20 623134 655257 1 1 3 1 0.50 0.25 0.333333 -4 chr20 708338 732041 1 1 1 0 1.00 0.50 0.666667 -``` - -``` -usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF - -Count variants per-region in vcf - -positional arguments: - BED Regions to process - VCF Truvari bench result directory or a single VCF - -optional arguments: - -h, --help show this help message and exit - -o OUT, --output OUT Output bed-like file - --header Input regions have header to preserve in output - -w, --within Only count variants contained completely within region boundaries - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.2.2/vcf2df.md b/docs/v4.2.2/vcf2df.md deleted file mode 100644 index 14d9f06c..00000000 --- a/docs/v4.2.2/vcf2df.md +++ /dev/null @@ -1,81 +0,0 @@ -We enjoy using [pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* chrom: variant chromosome -* start: 0-based start from pysam.VariantRecord.start -* end: 0-based end from pysam.VariantRecord.stop -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of if the filter is empty or PASS - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. 
For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results. `vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column: -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-comp.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there are multiple parameters used for SV discovery over multiple samples.
After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with the command: -```python -result = pyranges.PyRanges(df.rename(columns={'chrom':"Chromosome", "start":"Start", "end":"End"})) -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL - -Takes a vcf and creates a data frame. Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fileds into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -n, --no-prefix Don't prepend sample name to format columns - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - -c LVL, --compress LVL - Compression level for joblib 0-9 (3) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.3.0/Citations.md b/docs/v4.3.0/Citations.md deleted file mode 100644 index d600b860..00000000 --- a/docs/v4.3.0/Citations.md +++ /dev/null @@ -1,30 +0,0 @@ -# Citing Truvari - -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). 
https://doi.org/10.1186/s13059-022-02840-6 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. -* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A 
simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) -* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) -* [Automated filtering of genome-wide large deletions through an ensemble deep learning
framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git a/docs/v4.3.0/Development.md b/docs/v4.3.0/Development.md deleted file mode 100644 index ccb38493..00000000 --- a/docs/v4.3.0/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount in the docker to the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. 
- -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report, which you can download to see a line-by-line accounting of test coverage. - -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` branches. - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant with [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPI, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. - -# docs - -The GitHub wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.
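The `-dev`/`-rc` suffix conventions described above follow PEP440 ordering, which can be sanity-checked with the third-party `packaging` library. The version strings below are illustrative, not actual Truvari releases:

```python
from packaging.version import Version

# Illustrative versions following the dev -> rc -> final convention
dev = Version("4.3.1.dev0")    # develop-branch build
rc = Version("4.3.1rc1")       # release candidate
final = Version("4.3.1")       # full release pushed to PyPI

# PEP440 orders dev releases before pre-releases before the final release
assert dev < rc < final
print(final.is_prerelease, rc.is_prerelease)  # False True
```

This is why a `-dev`-suffixed install never shadows the matching PyPI release of the same version number.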
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Change Updates Wiki -4) Freeze the Wiki -5) Ensure all code is checked in -6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -7) Use github action to make a testpypi release -8) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -9) Use GitHub action to make a pypi release -10) Download release-tarball.zip from step #9’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.3.0/Home.md b/docs/v4.3.0/Home.md deleted file mode 100644 index 47fbc626..00000000 --- a/docs/v4.3.0/Home.md +++ /dev/null @@ -1,35 +0,0 @@ -The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs) - -Citation: -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Before you start -VCFs aren't always created with a strong adherence to the format's specification. - -Truvari expects input VCFs to be valid so that it will only output valid VCFs. - -We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF. - -Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble. - -Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below. 
- -Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries. - -# Index - -- [[Updates|Updates]] -- [[Installation|Installation]] -- Truvari Commands: - - [[anno|anno]] - - [[bench|bench]] - - [[collapse|collapse]] - - [[consistency|consistency]] - - [[divide|divide]] - - [[phab|phab]] - - [[refine|refine]] - - [[segment|segment]] - - [[stratify|stratify]] - - [[vcf2df|vcf2df]] -- [[Development|Development]] -- [[Citations|Citations]] \ No newline at end of file diff --git a/docs/v4.3.0/MatchIds.md b/docs/v4.3.0/MatchIds.md deleted file mode 100644 index ca52076f..00000000 --- a/docs/v4.3.0/MatchIds.md +++ /dev/null @@ -1,74 +0,0 @@ -MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploring. MatchIds have a structure of `{chunkid}.{callid}`. The chunkid is unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were compared. The callid is unique to a call in a chunk for each VCF. Because `bench` processes two VCFs (the base and comparison VCFs), the `MatchId` has two values: the first is the base variant's MatchId and the second the comparison variant's MatchId. - -For `--pick single`, the two MatchIds will be identical in the e.g. tp-base.vcf.gz and tp-comp.vcf.gz. However, for `--pick ac|multi`, it's possible to have cases such as one base variant matching to multiple comparison variants. That would give us MatchIds like: - -``` -# tp-base.vcf -MatchId=4.0,4.1 - -# tp-comp.vcf -MatchId=4.0,4.1 -MatchId=4.0,4.2 -``` - -This example tells us that the tp-comp variants are both pointing to `4.0` in tp-base. The tp-base variant has a higher match to the tp-comp `4.1` variant. - -One easy way to combine matched variants is to use `truvari vcf2df` to convert a benchmarking result to a pandas DataFrame and leverage pandas' merge operation. 
First, we convert the `truvari bench` result. - -```bash -truvari vcf2df --info --bench-dir bench_result/ data.jl -``` - -Next, we combine rows of matched variants: -```python -import joblib -import pandas as pd - -# Load the data -data = joblib.load("data.jl") - -# Separate out the variants from the base VCF and add new columns of the base/comp ids -base = data[data['state'].isin(['tpbase', 'fn'])].copy() -base['base_id'] = base['MatchId'].apply(lambda x: x[0]) -base['comp_id'] = base['MatchId'].apply(lambda x: x[1]) - -# Separate out the variants from the comparison VCF and add new columns of the base/comp ids -comp = data[data['state'].isin(['tp', 'fp'])].copy() -comp['base_id'] = comp['MatchId'].apply(lambda x: x[0]) -comp['comp_id'] = comp['MatchId'].apply(lambda x: x[1]) - -# Merge the base/comparison variants -combined = pd.merge(base, comp, left_on='base_id', right_on='comp_id', suffixes=('_base', '_comp')) - -# How many comp variants matched to multiple base variants? -counts1 = combined['base_id_comp'].value_counts() -print('multi-matched comp count', (counts1 != 1).sum()) - -# How many base variants matched to multiple comp variants? -counts2 = combined['comp_id_base'].value_counts() -print('multi-matched base count', (counts2 != 1).sum()) -``` - -The `MatchId` is also used by `truvari collapse`. However, there are two differences. First, in the main `collapse` output, the relevant INFO field is named `CollapseId`. Second, because collapse only has a single input VCF, it is much easier to merge DataFrames. To merge a collapse result's kept variants with those that were removed, we again need to convert the VCFs to DataFrames: - -```bash -truvari vcf2df -i kept.vcf.gz kept.jl -truvari vcf2df -i removed.vcf.gz remov.jl -``` - -Then we combine them: -```python -import joblib -import pandas as pd - -# Load the kept variants and set the index. -kept = joblib.load("kept.jl").set_index('CollapseId') - -# Load the removed variants and set the index.
-remov = joblib.load("remov.jl") -remov['CollapseId'] = remov['MatchId'].apply(lambda x: x[0]) -remov.set_index('CollapseId', inplace=True) - -# Join the two sets of variants -result_df = kept.join(remov, how='right', rsuffix='_removed') -``` \ No newline at end of file diff --git a/docs/v4.3.0/Multi-allelic-VCFs.md b/docs/v4.3.0/Multi-allelic-VCFs.md deleted file mode 100644 index fd0eb23e..00000000 --- a/docs/v4.3.0/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Then pre-process the VCFs with bcftools: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v4.3.0/Updates.md b/docs/v4.3.0/Updates.md deleted file mode 100644 index e6dc4c54..00000000 --- a/docs/v4.3.0/Updates.md +++ /dev/null @@ -1,284 +0,0 @@ -# Truvari 4.3.0 -*in progress* -* `refine` & `stratify` - * Fixed variant and bed boundary overlapping issue -* general - * Definition of variants within a region now includes replacement style from TR callers having an anchor base 1bp upstream of catalog/includebed regions - * Propagating MAFFT errors ([#204](https://github.com/ACEnglish/truvari/issues/204)) - * FIPS compliance ([#205](https://github.com/ACEnglish/truvari/issues/205)) - * Allow csi-indexed vcf files ([#209](https://github.com/ACEnglish/truvari/issues/209)) - * bcftools sort tempfile ([#213](https://github.com/ACEnglish/truvari/issues/213)) - -# Truvari 4.2.2 -*March 28, 2024* - -* `collapse` - * Fewer comparisons needed per-chunk on average - * Fixed `--chain` functionality ([details](https://github.com/ACEnglish/truvari/issues/196)) - * Fixed `--gt` consolidation of format fields -* `bench` - * Faster result at the cost of less complete annotations with `--short` flag -* `refine` - * Assures variants are 
sequence resolved before incorporating into consensus - * `bench --passonly --sizemax` parameters are used when building consensus for a region. Useful for `refine --use-original-vcfs` - * When a refined region has more than 5k variants, it is skipped and a warning for the region is written to the log - * Flag `--use-region-coords` now expands `--region` coordinates by 100bp (`phab --buffer` default) to allow variants to harmonize out of regions. -* general - * Dynamic bed/vcf parsing tries to choose the faster of streaming/fetching variants - -# Truvari 4.2.1 -*February 6, 2024* -* `collapse` - * Faster handling of genotype data for `--gt` and `--keep common` -* general - * Fix to bed end position bug for including variants ([details](https://github.com/ACEnglish/truvari/issues/193)) - * Fix to Dockerfile -* `refine` - * Changes to `--recount` that accompany the fix to bed end positions. -* New command `ga4gh` to convert Truvari results into GA4GH truth/query VCFs with intermediate tags - -# Truvari 4.2 -*January 12, 2024* -* `collapse` - * New parameter `--gt` prevents intra-sample events from collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--gt)) - * New parameter `--intra` for consolidating SAMPLE information during intra-sample collapsing ([details](https://github.com/ACEnglish/truvari/wiki/collapse#--intra)) - * Preserve phasing information when available - * Faster O(n-1) algorithm instead of O(n^2) - * Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons - * Fixed rare non-determinism error in cases where multiple variants are at the same position and equal qual/ac could be ordered differently. -* `phab` - * Correct sample handling with `--bSamples` `--cSamples` parameters - * Faster generation of consensus sequence - * Resolved 'overlapping' variant issue causing variants to be dropped - * New `poa` approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate.
-* `bench` - * New, easier `MatchId` field to track which baseline/comparison variants match up [details](https://github.com/ACEnglish/truvari/wiki/MatchIds) - * `entry_is_present` method now considers partial missing variants (e.g. `./1`) as present - * Removed the 'weighted' metrics from `summary.json` -* `consistency` - * Fixed issue with counting duplicate records - * Added flag to optionally ignore duplicate records -* `anno svinfo` now overwrites existing SVLEN/SVTYPE info fields -* general - * Reduced fn matches for unroll sequence similarity by reporting maximum of multiple manipulations of variant sequence (roll up/down/none). Comes at a small, but reasonable, expense of some more fp matches. - * Bump pysam version - * Fixed bug in `unroll` sequence similarity that sometimes rolled from the wrong end - * Fixed bug for handling of None in ALT field - * `truvari.compress_index_vcf` forces overwriting of tabix index to prevent annoying crashes - - -# Truvari 4.1 -*August 7, 2023* - -* `bench` - * Creates `candidate.refine.bed` which hooks into `refine` on whole-genome VCFs [details](https://github.com/ACEnglish/truvari/wiki/bench#refining-bench-output) - * `--recount` for correctly assessing whole-genome refinement results - * experimental 'weighted' summary metrics [details](https://github.com/ACEnglish/truvari/wiki/bench#weighted-performance) - * Unresolved SVs (e.g. `ALT == `) are filtered when `--pctseq != 0` -* `phab` - * ~2x faster via reduced IO from operating in stages instead of per-region - * Removed most external calls (e.g. 
samtools doesn't need to be in the environment anymore) - * new `--align wfa` allows much faster (but slightly less accurate) variant harmonization - * increased determinism of results [details](https://github.com/ACEnglish/truvari/commit/81a9ab85b91b0c530f9faeedfa4e7e0d68a5e8c2) -* `refine` - * Faster bed file intersection of `--includebed` and `--regions` - * Refine pre-flight check - * Correct refine.regions.txt end position from IntervalTree correction - * Better refine region selection with `--use-original` - * `--use-includebed` switched to `--use-region-coords` so that default behavior is to prefer the includebed's coordinates - * `--use-original-vcfs` to use the original pre-bench VCFs - * `refine.variant_summary.json` is cleaned of uninformative metrics -* `stratify` - * parallel parsing of truvari directory to make processing ~4x faster -* `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes -* `anno grpaf` - expanded annotation info fields -* `anno density` - new parameter `--stepsize` for sliding windows -* `collapse` - * New optional `--median-info` fields [#146](https://github.com/ACEnglish/truvari/issues/146) -* Minor updates - * Fix some `anno` threading on macOS [#154](https://github.com/ACEnglish/truvari/issues/154) - * Monomorphic/multiallelic check fix in `bench` - * `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key - * Slightly slimmer docker container - -# Truvari 4.0 -*March 13, 2023* - -As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly, we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools, and the behavior of a couple of existing tools has changed. Therefore, we decided to bump to a new major release.
If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0. - -* New `refine` command for refining benchmarking results. [Details](refine) -* `bench` - * [Unroll](bench#unroll) is now the default sequence comparison approach. - * New `--pick` parameter to control the number of matches a variant can participate in [details](bench#controlling-the-number-of-matches) - * The `summary.txt` is now named `summary.json` - * Outputs parameters to `params.json` - * Output VCFs are sorted, compressed, and indexed - * Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`) - * Renamed `--pctsim` parameter to `--pctseq` - * Fixed bug where FP/FN weren't getting the correct, highest scoring match reported - * Fixed bug where `INFO/Multi` wasn't being properly applied - * Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted. - * Removed parameters: `--giabreport`, `--gtcomp`,`--multimatch`, `--use-lev`, `--prog`, `--unroll` -* `collapse` - * Renamed `--pctsim` parameter to `--pctseq` - * Runtime reduction by ~40% with short-circuiting during `Matcher.build_match` - * Better output sorting which may allow pipelines to be a little faster. -* `vcf2df` - * More granular sizebins for `[0,50)` including better handling of SNPs - * `--multisample` is removed. Now automatically add all samples with `--format` - * key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges -* `anno` - * Simplified ui. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument. 
- * Added `addid` -* `consistency` - * Slight speed improvement - * Better json output format -* `segment` - * Added `--passonly` flag - * Changed UI, including writing to stdout by default - * Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN -* API - * Began a focused effort on improving re-usability of Truvari code. - * Entry point to run benchmarking programmatically with [Bench object](https://truvari.readthedocs.io/en/latest/truvari.html#bench). - * Better development version tracking. [details](https://github.com/ACEnglish/truvari/commit/4bbf8d9a5be3b6a3f935afbd3a9b323811b676a0) - * Improved developer documentation. See [readthedocs](https://truvari.readthedocs.io/) -* general - * msa2vcf now left-trims and decomposes variants into indels - * Functional tests reorganization - * Fix for off-by-one errors when using pyintervaltree. See [ticket](https://github.com/ACEnglish/truvari/issues/137) - * Removed progressbar and Levenshtein dependencies as they are no longer used. - -# Truvari 3.5 -*August 27, 2022* - -* `bench` - * `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks - * New `--unroll` sequence comparison method for `bench` and `collapse` ([details](bench#unroll)) -* Major `anno trf` refactor (TODO write docs) including: - * annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT - * allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations -* New `anno grpaf` annotates vcf with allele frequency info for groups of samples -* New `phab` for variant harmonization ([details](../phab)) -* backend - * `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1) - * New key utility for `truvari.build_anno_trees` -* general - * Float metrics written to the VCF (e.g. 
PctSizeSimilarity) are rounded to precision of 4 - * Nice colors in some `--help` with [rich](https://github.com/Textualize/rich/) -* `divide` - * output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made) - * compression/indexing of sub-VCFs in separate threads, reducing runtime -* user issues - * Monomorphic reference ALT alleles no longer throw an error in `bench` ([#131](https://github.com/ACEnglish/truvari/issues/131)) - * `SVLEN Number=A` fix ([#132](https://github.com/ACEnglish/truvari/issues/132)) - -# Truvari 3.4 -*July 7, 2022* - -* Improved performance of `consistency` (see [#127](https://github.com/ACEnglish/truvari/pull/127)) -* Added optional json output of `consistency` report -* Allow GT to be missing, which is allowed by VCF format specification -* TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` -directly which allows greater flexibility. -* vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description -* `truvari.seqsim` is now case insensitive (see [#128](https://github.com/ACEnglish/truvari/issues/128)) -* Collapse option to skip consolidation of genotype information so kept alleles are unaltered -* `truvari anno dpcnt --present` will only count the depths of non ./. variants -* New collapse annotation `NumConsolidate` records how many FORMATs were consolidated -* Official [conda](https://anaconda.org/bioconda/truvari) support - -# Truvari 3.3 -*May 25, 2022* - -* New utilities `vcf_ranges` and `make_temp_filename` -* New annotations `dpcnt` and `lcr` -* Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working -* Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF -* New `truvari bench --extend` functionality. 
See [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details - -# Truvari 3.2 -*Apr 1, 2022* - -* Removed `truvari.copy_entry` in favor of `pysam.VariantRecord.translate`, a 10x faster operation -* Faster `truvari collapse` ([@c8b319b](https://github.com/ACEnglish/truvari/commit/c8b319b0e717a9e342f52e4a5e927f154eeb0e4a)) -* When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just comparing REFs/ALTs directly. -* Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353 -* New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-density)) -* Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match. -* New `truvari anno bpovl` for intersecting against tab-delimited files ([details](https://github.com/spiralgenetics/truvari/wiki/anno#truvari-anno-bpovl)) -* New `truvari divide` command to split VCFs into independent parts ([details](https://github.com/ACEnglish/truvari/wiki/divide)) -* Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity -* Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions -* Bugfix - `collapse` MatchId/CollapseId annotation wasn't working -* Bugfixes - from [wwliao](https://github.com/wwliao) ([@4dd9968](https://github.com/ACEnglish/truvari/commit/4dd99683912236f433166889bb0b5667e9fa936d) [@ef2cfb3](https://github.com/ACEnglish/truvari/commit/ef2cfb366b60a5af4671d65d3ed987b08da72227)) -* Bugfixes - Issues [#107](https://github.com/ACEnglish/truvari/issues/107), [#108](https://github.com/ACEnglish/truvari/issues/108) - -# Truvari 3.1 -*Dec 22, 2021* - -* `bench` now annotates FPs by working a little differently. See [[bench|bench#methodology]] for details.
-Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` ([details](https://github.com/spiralgenetics/truvari/discussions/92)) -* Match objects are now usable via the SDK. See [#94](https://github.com/spiralgenetics/truvari/discussions/94) for an example of using Truvari programmatically -* `file_zipper` VCF iteration strategy (`GenomeTree` -> `RegionVCFIterator`) that improves speed, particularly when using `--includebed` -* `collapse` refactored to use Match object and for prettier code, cleaner output. -* `anno remap` now optionally adds `INFO` field of the location of the top N hits. -* An experimental tool `truvari segment` added to help SV association analysis. -* `vcf2df` now supports pulling `FORMAT` fields from multiple samples. -* `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively. -* Improved documentation, including http://truvari.readthedocs.io/ for developers. -* Increasing/diversifying test coverage exposed minor bugs which were fixed. -* `bench --no-ref --cSample` bug fixes. -* Minor usability feature implemented in `help_unknown_cmd`. - -# Truvari 3.0 -*Sep 15, 2021* - -As Truvari's adoption and functionality grow, we decided to spend time working on sustainability and performance of the tool. Multiple [Actions](https://github.com/spiralgenetics/truvari/actions) for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version. - -* Working on speed improvements -* Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`). -* `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype.
`--no-ref` has better functionality. `--giabreport` is different. -* Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC. -* `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being -properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best -match was chosen. The latter is a reporting problem, but the former had an effect on the results with -~3% of collapsed calls being mis-collapsed. -* `truvari anno trf` is now faster and simpler in its approach and what's reported... and hopefully more useful. -* `truvari anno grm` has min_size and regions arguments added. -* truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling. -* NeighId added to `truvari anno numneigh`, which works like bedtools cluster. -* The method af_calc now makes MAC/AC. -* Added 'partial' to `truvari anno remap`. -* Added `truvari anno svinfo`. -* Removed `truvari stats` as `truvari vcf2df` is better and began building [community-driven summaries](https://github.com/spiralgenetics/truvari/discussions/categories/vcf2df-recipes). -* Ubiquitous single version. -* Added a Dockerfile and instructions for making a Truvari docker container. -* Code and repository cleaning. -* Github actions for automated pylint, testing, and releases to pypi. -* Preserving per-version documentation from the wiki in `docs/`. - - -# Truvari 2.1 -*Jan 27, 2021* - -We've expanded and improved Truvari's [annotations](https://github.com/spiralgenetics/truvari/wiki/anno). We've added an [SV "collapsing" tool](https://github.com/spiralgenetics/truvari/wiki/collapse). And we've added a way to [turn VCFs into pandas DataFrames](https://github.com/spiralgenetics/truvari/wiki/truv2df) easily for downstream analysis/QC.
- -# Truvari 2.0 -*May 14, 2020* - -After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis. - -Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (`truvari bench`), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the [WIKI](https://github.com/spiralgenetics/truvari/wiki). - -We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation. - -# Truvari 1.3 -*September 25th, 2019* - -Truvari has some big changes. In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), -we're now only supporting Python 3. - -Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation -below. This will enable us to refactor the code for easier maintenance and reusability. - -Finally, we now automatically report genotype comparisons in the summary stats.
\ No newline at end of file diff --git a/docs/v4.3.0/anno.md b/docs/v4.3.0/anno.md deleted file mode 100644 index 55835902..00000000 --- a/docs/v4.3.0/anno.md +++ /dev/null @@ -1,494 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF of the GC percent of the call's sequence. - -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] - -Annotates GC content of SVs - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of genotype counts for the allele, in the order (UNK, REF, HET, HOM). This is most useful for pVCFs.
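The (UNK, REF, HET, HOM) binning can be sketched in a few lines. This is an illustrative sketch with hypothetical helper names, not Truvari's actual implementation; genotypes are represented as tuples the way pysam-style records provide them:

```python
from collections import Counter

def gt_class(gt):
    """Classify one genotype tuple; any missing allele makes it UNK (hypothetical helper)."""
    if not gt or any(a is None for a in gt):
        return "UNK"
    if all(a == 0 for a in gt):
        return "REF"
    if all(a == gt[0] for a in gt):
        return "HOM"
    return "HET"

def gtcnt(genotypes):
    """Counts across samples for a single variant, in (UNK, REF, HET, HOM) order."""
    counts = Counter(gt_class(g) for g in genotypes)
    return [counts[k] for k in ("UNK", "REF", "HET", "HOM")]

print(gtcnt([(0, 0), (0, 1), (1, 1), (None, None)]))  # [1, 1, 1, 1]
```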
- -``` -usage: gtcnt [-h] [-o OUTPUT] [input] - -Annotates GTCounts of alleles - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants in a VCF. - -### Annotations added -| Field Name | Description | -|------------|-------------------------------------------------------------| -| TRF | Entry hits a tandem repeat region | -| TRFdiff | ALT TR copy difference from reference | -| TRFrepeat | Repeat motif | -| TRFovl | Percent of ALT covered by TRF annotation | -| TRFstart | Start position of discovered repeat | -| TRFend | End position of discovered repeat | -| TRFperiod | Period size of the repeat | -| TRFcopies | Number of copies aligned with the consensus pattern | -| TRFscore | Alignment score | -| TRFentropy | Entropy measure | -| TRFsim | Similarity of ALT sequence to generated motif faux sequence | - -### Details -Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via: -```bash -zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz -tabix anno.trf.bed.gz -``` -For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`.
If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)). - -For insertions, by default the tool first tries to estimate which motif is contained in the alternate sequence. For each overlapping annotation, the copy number difference of the motif is calculated via `copy_diff = len(entry.alts[0][1:]) / anno["period"]`. Next, a 'faux sequence' is made from `copy_diff` number of the motif. If the faux sequence's similarity to the insertion sequence is above `--motif-similarity`, that motif is considered the insertion's motif. If no estimate is above the `--motif-similarity`, the insertion is incorporated into the reference and TRF is run. If the discovered TRF hits match a motif in the tandem repeat regions file, that annotation is used. If the highest scoring TRF hit doesn't match the tandem repeats region file, the nominally de novo annotation is added to the insertion's vcf entry.
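Both copy-number calculations above can be sketched directly. This is a simplified illustration of the quoted formulas only; the variable names are stand-ins, and the real tool evaluates them per overlapping annotation:

```python
def del_copy_diff(ovl_pct, svlen, period):
    """Copies of the motif removed by a deletion (negative value = contraction)."""
    return round(-(ovl_pct * svlen) / period, 1)

def ins_copy_diff(alt_seq, period):
    """Estimated motif copies gained by an insertion (first base is the VCF anchor)."""
    return len(alt_seq[1:]) / period

# A 100bp deletion fully overlapping (ovl_pct=1.0) a 4bp-period repeat removes ~25 copies
print(del_copy_diff(1.0, 100, 4))  # -25.0
# A 9bp ALT (1 anchor base + 8 inserted bases) over a 2bp motif is ~4 extra copies
print(ins_copy_diff("ACGCGCGCG", 2))  # 4.0
```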
- -``` -usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] - [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] - -Intersect vcf with reference tandem repeats and annotate -variants with the best fitting repeat motif and its copy number -relative to the reference - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) - -r REPEATS, --repeats REPEATS - Reference repeat annotations - -f REFERENCE, --reference REFERENCE - Reference fasta file - -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY - Motif similarity threshold (0.9) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -R, --regions-only Only write variants within --repeats regions (False) - --no-estimate Skip INS estimation procedure and run everything through TRF. (False) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization (5) - -t THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not add traditional annotations to the VCF; instead it will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both the reference (r) and alternate (a) sequences, we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_".
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -R REGIONS, --regions REGIONS - Bed file of regions to parse (None) - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (1) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) -
-p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig - -nocut -div 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno remap - -Taking the allele's sequence, remap it to the reference and annotate based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] - -Remap a VCF's allele sequences to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to consider (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a.
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs alongside SNPs/Indels - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -positional arguments: - input VCF to annotate - -options: -
-h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] - -Adds SVTYPE and SVLEN INFO fields - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files similar to a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(annos.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] - [-e END] [-1] - [input] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --sizemin SIZEMIN Minimum size of variant to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap each. Annotates -regions with no variants as 'sparse' and with greater than or equal to (mean + `--threshold` * standard -deviation) number of variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`.
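The dense/sparse labeling can be sketched as follows. This is a sketch that assumes the population standard deviation; whether Truvari uses population or sample stdev is not stated here:

```python
import statistics

def classify_windows(counts, threshold=3):
    """Label window variant counts: 0 -> 'sparse', >= mean + threshold*stdev -> 'dense'."""
    cutoff = statistics.mean(counts) + threshold * statistics.pstdev(counts)
    return ["sparse" if c == 0 else "dense" if c >= cutoff else "" for c in counts]

print(classify_windows([0, 2, 3, 2, 50], threshold=1))  # ['sparse', '', '', '', 'dense']
```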
- -``` -usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-s STEPSIZE] [-t THRESHOLD] [input] - -Identify 'dense' and 'sparse' variant windows of the genome - -positional arguments: - input Input VCF (/dev/stdin) - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -s STEPSIZE, --stepsize STEPSIZE - Window step size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` for each of the user-defined bins. Bins are incremented using `bisect` e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1`. - -``` -usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] - -Quick utility to count how many samples have >= Nx coverage per-variant - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-o OUTPUT] [input] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples.
For every group in the `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. Adds INFO tags with a suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. - -``` -usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] - -Add allele frequency annotations for subsets of samples - -positional arguments: - input VCF to annotate - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.3.0/bench.md b/docs/v4.3.0/bench.md deleted file mode 100644 index 145bf974..00000000 --- a/docs/v4.3.0/bench.md +++ /dev/null @@ -1,295 +0,0 @@ - -Quick start -=========== -Run this command where base is your 'truth set' SVs and comp is the comparison set of SVs. -```bash -truvari bench -b base_calls.vcf -c comp_calls.vcf -o output_dir/ -``` - -Matching Parameters -=================== -Picking matching parameters can be more of an art than a science. It really depends on the precision of your callers and the tolerance you wish to allow so that the comparison is fair. - -For example, depth of coverage callers (such as CNVnator) will have very 'fuzzy' boundaries, and don't report the exact deleted sequence but only varying regions. So thresholds of `pctseq=0`, `pctsize=.5`, `pctovl=.5`, `refdist=1000` may seem fair. - -[BioGraph](https://github.com/spiralgenetics/biograph) and many long-read callers report precise breakpoints and full alternate allele sequences.
When benchmarking those results, we want to ensure our accuracy by using the stricter default thresholds. - -If you're still having trouble picking thresholds, it may be beneficial to do a few runs of Truvari bench over different values. Start with the strict defaults and gradually increase the leniency. From there, you can look at the performance metrics and manually inspect differences between the runs to find out what level you find acceptable. Truvari is meant to be flexible for comparison. More importantly, Truvari helps one clearly report the thresholds used for reproducibility. - -Here is a rundown of each matching parameter. -| Parameter | Default | Definition | -|------------|---------|------------| -| refdist | 500 | Maximum distance comparison calls must be within from base call's start/end | -| pctseq | 0.7 | Edit distance ratio between the REF/ALT haplotype sequences of base and
comparison call. See "Comparing Sequences of Variants" below. | -| pctsize | 0.7 | Ratio of min(base_size, comp_size)/max(base_size, comp_size) | -| pctovl | 0.0 | Ratio of two calls' (overlapping bases)/(longest span) | -| typeignore | False | Types don't need to match to compare calls. | - -Below are matching parameter diagrams to illustrate (approximately) how they work. - -``` - █ = Deletion ^ = Insertion - ---refdist REFDIST (500) - Max reference location distance - - ACTGATCATGAT - |--████--| - █████ - - Calls are within reference distance of 2 - ---pctsize PCTSIZE (0.7) - Min pct allele size similarity - - ACTGATCATGA sizes - █████ -> 5bp - ████ -> 4bp - - variants have 0.8 size similarity - - ---pctovl PCTOVL (0.0) - Min pct reciprocal overlap - - ACTGATCATGA ranges - █████ [2,7) - ████ [4,8) - - variants have 0.6 reciprocal overlap - - ---pctseq PCTSEQ (0.7) - Min percent allele sequence similarity - - A-CTG-ACTG - ^ ^ haplotypes - | └ACTG -> CTGACTGA - └CTGA -> CTGACTGA - - haplotypes have 100% sequence similarity -``` - -Outputs -======= -Truvari bench writes the following files to the `--output` directory. -| File | Description | -|----------------------|---------------------------------------------| -| tp-base.vcf.gz | True positive calls from the base VCF | -| tp-comp.vcf.gz | True positive calls from the comparison VCF | -| fp.vcf.gz | False positive calls from comparison | -| fn.vcf.gz | False negative calls from base | -| summary.json | Json output of performance stats | -| params.json | Json output of parameters used | -| candidate.refine.bed | Bed file of regions for `refine` | -| log.txt | Run's log | - -summary.json ------------- -Stats generated by benchmarking are written to `summary.json`.
- -| Metric | Definition | -|----------------|------------------------------------------------------------| -| TP-base | Number of matching calls from the base vcf | -| TP-comp | Number of matching calls from the comp vcf | -| FP | Number of non-matching calls from the comp vcf | -| FN | Number of non-matching calls from the base vcf | -| precision | TP-comp / (TP-comp + FP) | -| recall | TP-base / (TP-base + FN) | -| f1 | 2 * ((recall * precision) / (recall + precision)) | -| base cnt | Number of calls in the base vcf | -| comp cnt | Number of calls in the comp vcf | -| TP-comp_TP-gt | TP-comp with genotype match | -| TP-comp_FP-gt | TP-comp without genotype match | -| TP-base_TP-gt | TP-base with genotype match | -| TP-base_FP-gt | TP-base without genotype match | -| gt_concordance | TP-comp_TP-gt / (TP-comp_TP-gt + TP-comp_FP-gt) | -| gt_matrix | Base GT x Comp GT Matrix of all Base calls' best, TP match | -| weighted | Metrics weighed by variant sequence/size similarity | - -The `gt_matrix` is a table. For example: -```json -"gt_matrix": { - "(0, 1)": { - "(0, 1)": 500, - "(1, 1)": 10 - }, - "(1, 1)": { - "(1, 1)": 800, - "(0, 1)": 20 - } -} -``` -Represents -> -``` -comp (0,1) (1,1) -base -(0,1) 500 10 -(1,1) 20 800 -``` - -Added annotations ------------------ -The output vcfs are annotated with INFO fields and then sorted, compressed, and indexed inside of the output directory. - -| Anno | Definition | -|-------------------|-----------------------------------------------------------------------------------------------------------------| -| TruScore | Truvari score for similarity of match. 
`((pctseq + pctsize + pctovl) / 3 * 100)` | -| PctSeqSimilarity | Pct sequence similarity between this variant and its closest match | -| PctSizeSimilarity | Pct size similarity between this variant and its closest match | -| PctRecOverlap | Percent reciprocal overlap of the two calls | -| StartDistance | Distance of the base call's start from comparison call's start | -| EndDistance | Distance of the base call's end from comparison call's end | -| SizeDiff | Difference in size of base and comp calls | -| GTMatch | Base/comp calls' Genotypes match | -| MatchId | Id to help tie base/comp calls together {chunkid}.{baseid}.{compid} See [[MatchIds wiki\|MatchIds]] for details. | - - -Refining bench output -===================== -As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run -```bash -truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/ -truvari refine --regions result/candidate.refine.bed \ - --reference reference.fasta \ - --recount --use-region-coords \ - result/ -``` -See [[refine wiki|refine]] for details. - -Comparing Sequences of Variants -=============================== - -Truvari has implemented two approaches to compare variant sequences. The default comparison is called 'unroll'. Optionally, a `--reference` can be provided and Truvari will use the reference context of a pair of variants for comparison. - -## Unroll -The method of giving a pair of calls the same reference context can be achieved using an 'unroll' method. For a formal description, see [this gist](https://gist.github.com/ACEnglish/1e7421c46ee10c71bee4c03982e5df6c).
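The core of the unroll operation is just a rotation of the ALT sequence by the positional offset between the two calls. This is a minimal sketch of that rotation only; the actual comparison then scores the rolled sequences with an edit-distance ratio:

```python
def unroll(alt_seq, dist):
    """Rotate an insertion's ALT left by `dist` bases, treating the sequence as circular."""
    dist %= len(alt_seq)
    return alt_seq[dist:] + alt_seq[:dist]

# An INS 'ab' at POS 2 matches an INS 'ba' at POS 1 after rolling 2-1 = 1 base
print(unroll("ab", 1))  # 'ba'
```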
- -The main idea is that in order to move variants upstream/downstream, the reference sequence flanking the variant will need to be moved downstream/upstream respectively. Or, to say this another way, we can think of the alternate sequences as being circular instead of linear. This means that in order to move the variant e.g. 1bp downstream for an INS, we could remove the first base from the ALT and append it to the end. So in the 'ab' example used to describe "Reference context" below, we only need to unroll the insertion at a position by the distance between it and another variant e.g. the INS `ab` at POS 2 becomes identical to the INS `ba` at POS 1 by rolling `2-1 = 1` bases from the start to the end. - -This unroll method has a number of benefits and a couple of considerations, including: -* not needing a `--reference` for comparison, which saves I/O time -* increasing the number of correctly matching SVs -* decreasing the number of 'suspect' matches in smaller size regimes -* providing a simpler pattern between PctSizeSimilarity and PctSeqSimilarity - -## Reference context -For the reference context method, consider a hypothetical tandem repeat expansion of the reference sequence 'AB'. Here, we'll represent the 'insertion' sequence as lower-case 'ab', though it should be considered equivalent to 'AB'. Three equally valid descriptions of this -variant would be: - -```text -#POS INS Haplotype - 0 ab abAB - 1 ba AbaB - 2 ab ABab -``` - -Therefore, to compare the sequence similarity, Truvari builds the haplotypes over the range of a pair of calls' -`min(starts):max(ends)` before making the sequence change introduced by the variants. In python, this line -looks like: - -``` python -hap1_seq = ref.get_seq(a1_chrom, start + 1, a1_start).seq + a1_seq + ref.get_seq(a1_chrom, a1_end + 1, end).seq -``` - -Where `a1_seq` is the longer of the REF or ALT allele. - -## SVs without sequences - -If the base or comp vcfs do not have sequence resolved calls (e.g.
`<DEL>`), simply set `--pctseq=0` to turn off -sequence comparison. The `--reference` does not need to be provided when not using sequence comparison. If -`--pctseq != 0` and an unresolved SV is encountered, a warning will be raised and the variant will not be compared. - -Controlling the number of matches -================================= - -How many matches a variant is allowed to participate in is controlled by the `--pick` parameter. The available pickers are `single`, `ac`, and `multi`. - -* `single` (the default option) allows each variant to participate in up to one match. -* `ac` uses the genotype allele count to control how many matches a variant can have. This means a homozygous alternate variant can participate in two matches (its GT is 1/1 so AC=2). A heterozygous variant can only participate in one match (GT 0/1, AC=1). And, a homozygous reference variant cannot be matched. Note that missing genotypes are considered reference alleles and do not add to the AC e.g. (GT ./1, AC=1). -* `multi` lets variants participate in all available matches. - -As an example, imagine we have three variants in a pVCF with two samples we want to compare. - -``` -CHROM POS ID REF ALT base comp -chr20 17785968 ID1 A ACGCGCGCGCG 1/1 1/0 -chr20 17785968 ID2 A ACGCGCGCGCGCG 0/0 0/1 -chr20 17785969 ID3 C CGCGCGCGCGCGC 0/0 1/1 -``` - -To compare samples inside the same vcf, we would use the command: -```bash -truvari bench -b input.vcf.gz -c input.vcf.gz -o output/ --bSample base --cSample comp --no-ref a -``` - -This VCF produces different results depending on the `--pick` parameter: - -| Parameter | ID1 State | ID2 State | ID3 State | -|-----------|-----------|-----------|-----------| -| single | TP | FP | FP | -| ac | TP | TP | FP | -| multi | TP | TP | TP | - ---dup-to-ins -============ - -Most SV benchmarks only report DEL and INS SVTYPEs. The flag `--dup-to-ins` will interpret SVs with SVTYPE == DUP as SVTYPE == INS. Note that DUPs generally aren't sequence resolved (i.e.
the ALT isn't a sequence) like INS. Therefore, `--dup-to-ins` typically should be used without sequence comparison via `--pctseq 0` - -Size filtering -============== - -`--sizemax` is the maximum size of a base or comparison call to be considered. - -`--sizemin` is the minimum size of a base call to be considered. - -`--sizefilt` is the minimum size of a comparison call that will be matched to base calls. It can -be less than `sizemin` for edge case variants. - -For example: Imagine `sizemin` is set at 50 and `sizefilt` at 30, and a 50bp base call is 98% similar to a 49bp comparison -call at the same position. - -These two calls could be considered matching. However, if we removed comparison calls less than `sizemin`, -we'd incorrectly classify the 50bp base call as a false negative. Instead, we allow comparison calls between `[sizefilt,sizemin)` to find matches. - -This has the side effect of artificially inflating specificity. For example, if that same 49bp call described -above were below the similarity threshold, it would not be classified as a FP since it is below the `sizemin` -threshold. So we're giving the call a better chance to be useful and less chance to be detrimental -to final statistics. - -Include Bed & VCF Header Contigs -================================ - -If an `--includebed` is provided, only base and comp calls contained within the defined regions are used -for comparison. This is similar to pre-filtering your base/comp calls using: - -```bash -(zgrep "#" my_calls.vcf.gz && bedtools intersect -u -a my_calls.vcf.gz -b include.bed) | bgzip > filtered.vcf.gz -``` - -with the exception that Truvari requires the start and the end to be contained in the same includebed region -whereas `bedtools intersect` does not. - -If an `--includebed` is not provided, the comparison is restricted to only the contigs present in the base VCF -header. 
Therefore, any comparison calls on contigs not in the base calls will not be counted toward summary statistics and will not be present in any output vcfs.

Extending an Include Bed
------------------------
The option `--extend` extends the regions of interest (set by the `--includebed` argument) by the given number of bases on each side, allowing base variants to match comparison variants that are just outside of the original region. If a comparison variant in the extended regions matches a base variant in the original regions, that base variant becomes a TP. Comparison variants in the extended regions that don't have a match are not counted as FP. This strategy is similar to the one implemented for size matching, where only base variants longer than sizemin (50 by default) are considered, but they are allowed to match shorter comparison variants down to sizefilt (30bp by default).

See this [discussion](https://github.com/ACEnglish/truvari/discussions/99) for details.

Methodology
===========
Here is a high-level pseudocode description of the steps Truvari bench conducts to compare the two VCFs.
```
* zip the Base and Comp calls together in sorted order
* create chunks of all calls overlapping within ±`--chunksize` basepairs
* make a |BaseCall| x |CompCall| match matrix for each chunk
* build a Match for each call pair in the chunk - annotate as TP if >= all thresholds
* if the chunk has no Base or Comp calls
** return them all as FNs/FPs
* use `--pick` method to sort and annotate variants with their best match
```
![](https://github.com/acenglish/truvari/blob/develop/imgs/TruvariBenchMethod.png)
\ No newline at end of file diff --git a/docs/v4.3.0/collapse.md b/docs/v4.3.0/collapse.md deleted file mode 100644 index 7ac22adb..00000000 --- a/docs/v4.3.0/collapse.md +++ /dev/null @@ -1,163 +0,0 @@
`collapse` is Truvari's approach to SV merging. 
After leveraging `bcftools` to merge VCFs, `truvari collapse` can then iterate over the calls and create clusters of SVs that match over the [provided thresholds](https://github.com/spiralgenetics/truvari/wiki/bench#matching-parameters). This is also useful when removing potentially redundant calls within a single sample. - -Example -======= -To start, we merge multiple VCFs (each with their own sample) and ensure there are no multi-allelic entries via: -```bash -bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz -``` - -This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt. -For example, consider two vcfs: - - >> one.vcf: - chr1 1 ... GT 0/1 - chr1 5 ... GT 1/1 - >> two.vcf: - chr1 1 ... GT 1/1 - chr1 7 ... GT 0/1 - -`bcftools merge` creates: - - >> merge.vcf: - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 ./. - chr1 7 ... GT ./. 0/1 - -This VCF can then be collapsed to allow 'fuzzier' matching than the exact merge just performed. - -```bash -truvari collapse -i merge.vcf.gz -o truvari_merge.vcf -c truvari_collapsed.vcf -``` - -For example, if we collapsed our example merge.vcf by matching any calls within 3bp, we'd create: - - >> truvari_merge.vcf - chr1 1 ... GT 0/1 1/1 - chr1 5 ... GT 1/1 0/1 - >> truvari_collapsed.vcf - chr1 7 ... GT ./. 0/1 - ---choose behavior -================= -When collapsing, the default `--choose` behavior is to take the `first` variant by position from a cluster to -be written to the output while the others will be placed in the collapsed output. -Other choosing options are `maxqual` (the call with the highest quality score) or `common` (the call with the highest minor allele count). - -Samples with no genotype information in the kept variant will be filled by the first -collapsed variant containing genotype information. - ---gt -==== -For some results, one may not want to collapse variants with conflicting genotypes from a single sample. 
With the `--gt all` parameter, variants which are present (non `0/0` or `./.`) in the same sample are not collapsed. With the `--gt het` parameter, only variants which are both heterozygous in a sample (e.g. `0/1` and `0/1`) are prevented from collapsing. The `--gt het` option is useful for some SV callers which will redundantly call variants and typically genotype them all as `1/1`.

--intra
=======
When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then the second, etc. For example, if we have an entry with:
```
FORMAT   RESULT1   RESULT2
GT:GQ:AD ./.:.:3,0 1/1:20:0,30
```
The `--intra` output would be:
```
FORMAT   RESULT1
GT:GQ:AD 1/1:20:3,0
```
As you can see in this example, 1) the first sample name is the only one preserved, and 2) conflicting FORMAT fields can be consolidated in a non-useful way (here the AD of `3,0` isn't informative to a `1/1` genotype). We're working to provide an API to help users write custom intra-sample consolidation scripts.

--hap mode
==========
When using `--hap`, we assume phased variants from a single individual. Only the single best matching call from the other haplotype will be collapsed, and the consolidated genotype will become `1/1`.

For example, if we collapse anything at the same position:

    chr1 1 .. GT 0|1
    chr1 1 .. GT 1|0
    chr1 2 .. GT 1|0

will become:

    chr1 1 .. GT 1/1
    chr1 2 .. GT 1|0

--chain mode
============
Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. 
This means each variant needs to match only at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants, whereas using `--chain` will produce one.
For example, if we have

    chr1 5 ..
    chr1 7 ..
    chr1 9 ..

When we collapse anything within 2bp of each other, without `--chain`, we output:

    chr1 5 ..
    chr1 9 ..

With `--chain`, we would collapse `chr1 9` as well, producing

    chr1 5 ..

Annotations
===========
`collapse` produces two files. The output file has kept variants along with unanalyzed (< sizemin) variants. The collapsed file contains the variants that were collapsed into the kept variants.

The output file has three annotations added to the `INFO`.
- `CollapseId` - Identifier of the variant when comparing to the collapse outputs.
- `NumCollapsed` - Number of variants collapsed into this variant
- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes

The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details. 
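The transitive matching that `--chain` enables can be sketched with a toy clustering function. This is purely illustrative, not Truvari's implementation; the `cluster` helper, its greedy strategy, and the use of bare positions in place of real variant matches are all assumptions for demonstration.

```python
# Illustrative sketch of how --chain changes clustering.
# Positions within `dist` bp of each other are considered matching.

def cluster(positions, dist, chain=False):
    """Greedily group positions; chain=True allows transitive matches."""
    clusters = []
    for pos in sorted(positions):
        placed = False
        for grp in clusters:
            if chain:
                # --chain: match any one member of the cluster
                ok = any(abs(pos - p) <= dist for p in grp)
            else:
                # default: must match every member of the cluster
                ok = all(abs(pos - p) <= dist for p in grp)
            if ok:
                grp.append(pos)
                placed = True
                break
        if not placed:
            clusters.append([pos])
    return clusters

print(cluster([5, 7, 9], 2))             # [[5, 7], [9]] -> two kept variants (5 and 9)
print(cluster([5, 7, 9], 2, chain=True))  # [[5, 7, 9]]   -> one kept variant (5)
```

With the default behavior, position 9 cannot join the `[5, 7]` cluster because it is more than 2bp from 5; with `chain=True`, matching 7 alone is enough, reproducing the `chr1 5`/`chr1 9` versus `chr1 5` outputs described above.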
- -``` -usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug] - [-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain] - [--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly] - -Structural variant collapser - -Will collapse all variants within sizemin/max that match over thresholds - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Comparison set of calls - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT - Where collapsed variants are written (collapsed.vcf) - -f REFERENCE, --reference REFERENCE - Indexed fasta used to call variants - -k {first,maxqual,common}, --keep {first,maxqual,common} - When collapsing calls, which one to keep (first) - --debug Verbose logging - --hap Collapsing a single individual's haplotype resolved calls (False) - --chain Chain comparisons to extend possible collapsing (False) - --no-consolidate Skip consolidation of sample genotype fields (True) - --null-consolidate NULL_CONSOLIDATE - Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null - from all neighbors (None) - -Comparison Threshold Arguments: - -r REFDIST, --refdist REFDIST - Max reference location distance (500) - -p PCTSIM, --pctsim PCTSIM - Min percent allele sequence similarity. Set to 0 to ignore. 
(0.95) - -B MINHAPLEN, --minhaplen MINHAPLEN - Minimum haplotype sequence length to create (50) - -P PCTSIZE, --pctsize PCTSIZE - Min pct allele size similarity (minvarsize/maxvarsize) (0.95) - -O PCTOVL, --pctovl PCTOVL - Min pct reciprocal overlap (0.0) for DEL events - -t, --typeignore Variant types don't need to match to compare (False) - --use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False) - -Filtering Arguments: - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for comparison (50) - -S SIZEMAX, --sizemax SIZEMAX - Maximum variant size to consider for comparison (50000) - --passonly Only consider calls with FILTER == PASS -``` \ No newline at end of file diff --git a/docs/v4.3.0/consistency.md b/docs/v4.3.0/consistency.md deleted file mode 100644 index 17ba23b7..00000000 --- a/docs/v4.3.0/consistency.md +++ /dev/null @@ -1,179 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] [-d] [-j] [-o OUTPUT] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. - -Calls will match between VCFs if they have a matching key of: - CHROM POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit - -d, --no-dups Disallow duplicate SVs - -j, --json Output report in json format - -o OUTPUT, --output OUTPUT - Write tsv of variant keys and their flag -``` - -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have an exact same key of `CHROM POS ID REF ALT`. 
Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. The key and flags can be written with the `--output` option.

Duplicates
==========

If there are VCFs with duplicate keys, they are handled by appending a suffix, e.g. `.1`, to the key. If you'd like to ignore duplicates, just add `--no-dups`.

Output Report
=============

Below is an example report:

```text
#
# Total 5534 calls across 3 VCFs
#
#File NumCalls
fileA.vcf 4706
fileB.vcf 4827
fileC.vcf 4882
#
# Summary of consistency
#
#VCFs Calls Pct
3 3973 71.79%
2 935 16.90%
1 626 11.31%
#
# Breakdown of VCFs' consistency
#
#Group Total TotalPct PctOfFileCalls
111 3973 71.79% 84.42% 82.31% 81.38%
011 351 6.34% 7.27% 7.19%
101 308 5.57% 6.54% 6.31%
110 276 4.99% 5.86% 5.72%
001 250 4.52% 5.12%
010 227 4.10% 4.70%
100 149 2.69% 3.17%
```

At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882.

The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs.

Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`. For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf).

We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf.

Finally, we see that fileA.vcf has the fewest calls unique to it on the `Group 100` line.

Json
====
Below is a consistency report in json format. 
-```json -{ - "vcfs": [ - "repo_utils/test_files/variants/input1.vcf.gz", - "repo_utils/test_files/variants/input2.vcf.gz", - "repo_utils/test_files/variants/input3.vcf.gz" - ], - "total_calls": 3513, - "num_vcfs": 3, - "vcf_counts": { - "repo_utils/test_files/variants/input1.vcf.gz": 2151, - "repo_utils/test_files/variants/input2.vcf.gz": 1783, - "repo_utils/test_files/variants/input3.vcf.gz": 2065 - }, - "shared": [ - { - "vcf_count": 3, - "num_calls": 701, - "call_pct": 0.1995445488186735 - }, - { - "vcf_count": 2, - "num_calls": 1084, - "call_pct": 0.3085681753487048 - }, - { - "vcf_count": 1, - "num_calls": 1728, - "call_pct": 0.4918872758326217 - } - ], - "detailed": [ - { - "group": "111", - "total": 701, - "total_pct": 0.1995445488186735, - "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326, - "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318, - "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683 - }, - { - "group": "001", - "total": 645, - "total_pct": 0.18360375747224594, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166 - }, - { - "group": "100", - "total": 598, - "total_pct": 0.17022487902077996, - "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "101", - "total": 487, - "total_pct": 0.1386279533162539, - "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884 - }, - { - "group": "010", - "total": 485, - "total_pct": 0.13805863933959578, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905, - "repo_utils/test_files/variants/input3.vcf.gz": 
0
    },
    {
      "group": "110",
      "total": 365,
      "total_pct": 0.10389980074010817,
      "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517,
      "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663,
      "repo_utils/test_files/variants/input3.vcf.gz": 0
    },
    {
      "group": "011",
      "total": 232,
      "total_pct": 0.06604042129234272,
      "repo_utils/test_files/variants/input1.vcf.gz": 0,
      "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667,
      "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167
    }
  ]
}
```
\ No newline at end of file diff --git a/docs/v4.3.0/divide.md b/docs/v4.3.0/divide.md deleted file mode 100644 index 3bbad3cb..00000000 --- a/docs/v4.3.0/divide.md +++ /dev/null @@ -1,58 +0,0 @@
Divide a VCF into independent shards.

Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing, we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can then be processed over multiple processes.

`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. This allows variants in the same region which would need to be compared to one another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whichever truvari command we need.

For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples. 
First, we divide the VCF:

```bash
truvari divide big_input.vcf.gz sub_vcfs_directory/
```

Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script

```bash
NJOBS=$(nproc) # use all processors by default
mkdir -p output_vcfs/
mkdir -p collap_vcfs/
mkdir -p logs/

for in_vcf in sub_vcfs_directory/*.vcf.gz
do
    # Setup file names
    base_name=$(basename $in_vcf)
    base_name=${base_name%.vcf.gz}
    output_name=output_vcfs/${base_name}.vcf
    collap_name=collap_vcfs/${base_name}.vcf
    log_name=logs/${base_name}.log
    # Run the command and send it to the background
    truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} &
    # If too many jobs are running, wait for one to finish
    while [ $( jobs | wc -l ) -ge ${NJOBS} ]
    do
        sleep 5
    done
done
```

Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands.

```
usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR

Divide a VCF into independent parts

positional arguments:
  VCF                   VCF to split
  DIR                   Output directory to save parts

options:
  -h, --help            show this help message and exit
  -b BUFFER, --buffer BUFFER
                        Buffer to make mini-clusters (1000)
  -m MIN, --min MIN     Minimum number of entries per-vcf (100)
  --no-compress         Don't attempt to compress/index sub-VCFs
  -T THREADS, --threads THREADS
                        Number of threads for compressing/indexing sub-VCFs (1)
```
\ No newline at end of file diff --git a/docs/v4.3.0/phab.md b/docs/v4.3.0/phab.md deleted file mode 100644 index bdcaec9e..00000000 --- a/docs/v4.3.0/phab.md +++ /dev/null @@ -1,157 +0,0 @@
Introduction
------------

Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations. 
We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched.

This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL, we would have to lower our thresholds even further. These thresholds are already too low because there are plenty of distinct alleles with >= 50% homology.

So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab`.

`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running a multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison.

Requirements
------------
Since `truvari phab` uses mafft v7.505 via a command-line call, the [mafft](https://mafft.cbrc.jp/alignment/software/) executable must be available in the `$PATH`.

Alternatively, you can use the Truvari [Docker container](Development#docker), which already has mafft ready for use.

You can also use the wave front aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it aligns haplotypes independently and therefore may produce less parsimonious alignments. poa is faster than mafft and more accurate than wfa, but less accurate than mafft. 
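The split-variant arithmetic above can be made concrete with a minimal sketch of size similarity computed as min/max. The `pct_size` helper is an illustrative assumption mirroring the idea behind Truvari's PctSizeSimilarity, not Truvari's actual matching code.

```python
# Toy illustration of why a split representation (one 100bp DEL vs.
# two 50bp DELs) forces thresholds down to 0.50.

def pct_size(size_a, size_b):
    """Size similarity as smaller/larger, in [0, 1]."""
    return min(size_a, size_b) / max(size_a, size_b)

print(pct_size(100, 50))  # 0.5  -> only matches at --pctsize 0.50
print(pct_size(101, 50))  # ~0.495 -> a 1bp caller error breaks even that
```

Any distinct allele pair whose sizes happen to fall within a factor of two would clear the same 0.50 bar, which is why `phab` harmonizes representations instead of loosening thresholds.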
- -Example -------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region. -```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-comp": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235, -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region. 
```bash
truvari phab --base phab_base.vcf.gz \
    --comp phab_comp.vcf.gz \
    --bSample HG002 \
    --cSample syndip \
    --reference phab_ref.fa \
    --region chr1:700-900 \
    -o harmonized.vcf.gz
```

In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized.

```bash
truvari bench -b harmonized.vcf.gz \
    -c harmonized.vcf.gz \
    -S 1 -s 1 \
    --no-ref a \
    --bSample HG002 \
    --cSample syndip \
    -o harmonized_bench/
```
Looking at `harmonized_bench/summary.txt` shows:
```json
{
    "TP-base": 8,
    "TP-comp": 8,
    "FP": 0,
    "FN": 0,
    "precision": 1.0,
    "recall": 1.0,
    "f1": 1.0,
}
```
Now there is no difference between our two sets of variants in this region.

For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes.
```bash
$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c
      1 0/1 1/0
      1 1/0 0/1
      6 1/1 1/1
```
(We can ignore the phasing differences (`0/1` vs. `1/0`); these pipelines reported the parental alleles in a different order.)

MSA
---

If you read the `truvari phab --help`, you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. 
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example.vcf.gz -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash - -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l -``` -This returns that the variants were over `98` positions but now sit at just `16` - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50 which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... - 3 69 1 35 - 1 89 1 40 - 8 109 1 53 - 1 132 1 56 - 1 150 1 81 -``` - -(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.) - -`--align` -========= -By default, `phab` will make the haplotypes and use an external call `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. 
If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than mafft but less accurate, and more accurate than wfa but slower. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes.

Limitations
-----------
* Creating and aligning haplotypes is impractical for very long sequences and may be practically impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions.
* By giving the variants new representations, variant counts will likely change.
* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution.

diff --git a/docs/v4.3.0/refine.md b/docs/v4.3.0/refine.md deleted file mode 100644 index eb2ef07a..00000000 --- a/docs/v4.3.0/refine.md +++ /dev/null @@ -1,142 +0,0 @@
As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented. To help automate the process of running Truvari `phab` on a benchmarking result and recomputing benchmarking performance on harmonized variants, we present the tool `refine`.

Quick Start
===========

Basic
-----
After making a `bench` result:
```bash
truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
```
Use `refine` on the `result/`
```bash
truvari refine -r subset.bed -f ref.fa result/
```

The regions spanned by `subset.bed` should be shorter and focused around the breakpoints of putative FNs/FPs. 
Haplotypes from these boundaries are fed into a realignment procedure which can take an extremely long time on e.g. entire chromosomes. Also, the genotypes within these regions must be phased.

Whole genome
------------
After making a `bench` result:
```bash
truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/
```
Use `refine` on the `result/`, analyzing only the regions with putative FP/FN that would benefit from harmonization
```bash
truvari refine -R -U -f ref.fa --regions result/candidate.refine.bed result/
```

Tandem Repeats
--------------
For benchmarks such as the [GIAB TR](https://www.biorxiv.org/content/10.1101/2023.10.29.564632v1), a TR caller may analyze a different subset of regions. In order to avoid unnecessarily penalizing the performance with FNs from unanalyzed regions:

```bash
truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/
```

Use `refine` on the `result/`, analyzing only the `hc_regions.bed` covered by the TR caller's `tool_regions.bed`
```
truvari refine -f ref.fa --regions tool_regions.bed result/
```

Output
======
* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions. 
Same structure as [[summary.json|bench#summaryjson]]
* `refine.regions.txt` - Tab-delimited file with per-region variant counts
* `refine.region_summary.json` - Per-region performance metrics
* `phab_bench/` - Bench results on the subset of variants harmonized

To see an example output, look at [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one)

Using `refine.regions.txt`
==========================
| Column     | Description                             |
| ---------- | --------------------------------------- |
| chrom      | Region's chromosome                     |
| start      | Region's start                          |
| end        | Region's end                            |
| in_tpbase  | Input's true positive base count        |
| in_tp      | Input's true positive comparison count  |
| in_fp      | Input's false positive count            |
| in_fn      | Input's false negative count            |
| refined    | Boolean for if region was re-evaluated  |
| out_tpbase | Output's true positive base count       |
| out_tp     | Output's true positive comparison count |
| out_fn     | Output's false negative count           |
| out_fp     | Output's false positive count           |
| state      | True/False state of the region          |


Performance by Regions
======================

Because `truvari phab` can alter variant counts during harmonization, one may wish to assess the performance on a per-region basis rather than the per-variant basis. In the `refine.regions.txt`, a column `state` will have a TP/FN/FP value as defined by the following rules:

```python
false_pos = (data['out_fp'] != 0)
false_neg = (data['out_fn'] != 0)
any_false = false_pos | false_neg

true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false

true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1)

baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0)
compP = (data['out_tp'] != 0) | (data['out_fp'] != 0)
```

This logic has two edge cases to consider. 
1) A region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp, but no other variants because a neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions. - -These by-region state counts are summarized and written to `refine.region_summary.json`. The definitions of the metrics inside this json are: -| Key | Definition | Formula | -|--------|----------------------------------------------|---------------------------------| -| TP | True Positive region count | | -| TN | True Negative region count | | -| FP | False Positive region count | | -| FN | False Negative region count | | -| base P | Regions with base variant(s) | | -| base N | Regions without base variant(s) | | -| comp P | Regions with comparison variant(s) | | -| comp N | Regions without comparison variant(s) | | -| PPV | Positive Predictive Value (a.k.a. precision) | TP / comp P | -| TPR | True Positive Rate (a.k.a. recall) | TP / base P | -| TNR | True Negative Rate (a.k.a. specificity) | TN / base N | -| NPV | Negative Predictive Value | TN / comp N | -| ACC | Accuracy | (TP + TN) / (base P + base N) | -| BA | Balanced Accuracy | (TPR + TNR) / 2 | -| F1 | f1 score | 2 * ((PPV * TPR) / (PPV + TPR)) | -| UND | Regions with an undetermined ('UNK') state | | - -Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate from the by-variant performance reports.
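As an illustrative sketch (not Truvari's own code), the formulas in this table can be computed from the four state counts, assuming every region contributes to exactly one state (i.e. setting aside the "FP,FN" double-counted and 'UNK' regions discussed above) and that no denominator is zero:

```python
def region_metrics(tp, tn, fp, fn):
    """Compute the per-region metrics defined above from state counts.

    Assumes each region contributes to exactly one state, so the
    P/N totals can be derived directly from TP/TN/FP/FN.
    No zero-division guards; a sketch only.
    """
    base_p = tp + fn   # regions with base variant(s)
    base_n = tn + fp   # regions without base variant(s)
    comp_p = tp + fp   # regions with comparison variant(s)
    comp_n = tn + fn   # regions without comparison variant(s)
    ppv = tp / comp_p  # precision
    tpr = tp / base_p  # recall
    tnr = tn / base_n  # specificity
    npv = tn / comp_n
    return {
        "PPV": ppv,
        "TPR": tpr,
        "TNR": tnr,
        "NPV": npv,
        "ACC": (tp + tn) / (base_p + base_n),
        "BA": (tpr + tnr) / 2,
        "F1": 2 * (ppv * tpr) / (ppv + tpr),
    }

print(region_metrics(tp=90, tn=50, fp=10, fn=20))
```

With these example counts, PPV (0.9) and TPR (~0.82) differ, and F1 lands between them, which is why both are reported alongside the region counts.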
- -`--align` -========= -By default, Truvari will make the haplotypes and use an external call to `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than mafft but less accurate, and slower than wfa but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes. - -`--use-original-vcfs` -===================== - -By default, `refine` will use the base/comparison variants from the `bench` results `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, this contains a filtered subset of the variants originally provided to `bench`, since it removes variants e.g. those below `--sizemin` or excluded by `--passonly`. - -With the `--use-original-vcfs` parameter, all of the original calls from the input vcfs are fetched. This parameter is useful for recovering matches in situations where variants in one call set are split into two variants which are smaller than the minimum size analyzed by `bench`. For example, imagine a base VCF with a 20bp DEL, a comp VCF with two 10bp DELs, and `bench --sizemin 20` was used. `--use-original-vcfs` will consider the two 10bp comp variants during phab harmonization with the 20bp base DEL. - - -`--regions` -=========== - -This parameter specifies which regions to re-evaluate. If this is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--includebed` is subset to only those intersecting `--regions`. - -This parameter is helpful for cases when the `--includebed` is not the same set of regions that a caller analyzes.
For example, if a TR caller only discovers short tandem repeats (STR), but a benchmark has TRs of all lengths, it isn't useful to benchmark against the non-STR variants. Therefore, you can run `bench` on the full benchmark's regions (`--includebed`), and automatically subset to only the regions analyzed by the caller with `refine --regions`. - -Note that the larger these regions are, the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--region` partially overlaps an `--includebed` region, you may not be analyzing a subset of calls looked at during the original `bench` run. Therefore, the `*summary.json` should be compared with caution. - -`--use-region-coords` -===================== - -When intersecting `--includebed` with `--regions`, use `--regions` coordinates. By default, `refine` will prefer the `--includebed` coordinates. However, the region's coordinates should be used when using the `candidate.refine.bed` to limit analysis to only the regions with putative FP/FN that would benefit from harmonization - for example, when performing whole genome benchmarking. - -`--reference` -============= - -By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants. \ No newline at end of file diff --git a/docs/v4.3.0/segment.md b/docs/v4.3.0/segment.md deleted file mode 100644 index 953e5822..00000000 --- a/docs/v4.3.0/segment.md +++ /dev/null @@ -1,18 +0,0 @@ -Segmentation: Normalization of SVs into disjointed genomic regions - -For SVs with a span in the genome (currently only DELs), split the overlaps into disjointed regions. This is an experimental tool that explores the possibility of assisting SV association analysis.
- -This tool adds an INFO field `SEGCNT` which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous, then `SEG=4`. - -In the below example, we have three deletions found across three samples. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png) - -The `segment` added annotations for the regions would then be: -| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG | -|--------|-------------|--------|--------|--------| -| A | 1 | 2 | 0 | 0 | -| B | 2 | 2 | 1 | 0 | -| C | 3 | 2 | 2 | 2 | -| D | 2 | 2 | 1 | 0 | -| E | 1 | 0 | 1 | 0 | \ No newline at end of file diff --git a/docs/v4.3.0/stratify.md b/docs/v4.3.0/stratify.md deleted file mode 100644 index 353dcab5..00000000 --- a/docs/v4.3.0/stratify.md +++ /dev/null @@ -1,58 +0,0 @@ -`stratify` is a helper utility for counting variants within bed regions which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are four vcfs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts.
- -For example: -```bash -$ truvari stratify input.bed bench/ -chrom start end tpbase tp fn fp -chr20 280300 280900 0 0 0 0 -chr20 100000 200000 1 1 0 0 -chr20 642000 642350 1 1 2 1 -``` - -The output from this can then be parsed to generate more details: - -```python -import pandas as pd -import truvari - -df = pd.read_csv("stratify.output.txt", sep='\t') - -# If the input.bed didn't have a header (so we couldn't use the `--header` parameter), we need to name the columns -df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp'] - -# Create the precision, recall, and f1 for each row -metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1) - -# metrics is now a Series of tuples; let's separate them into columns -metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"]) - -# Extend the dataframe's columns -df = df.join(metrics) -df.head() -``` -Which gives the result: -``` - chrom start end tpbase tp fn fp precision recall f1 -0 chr20 135221 239308 1 1 0 0 1.00 1.00 1.000000 -1 chr20 260797 465632 3 3 3 1 0.75 0.50 0.600000 -2 chr20 465866 622410 1 1 0 0 1.00 1.00 1.000000 -3 chr20 623134 655257 1 1 3 1 0.50 0.25 0.333333 -4 chr20 708338 732041 1 1 1 0 1.00 0.50 0.666667 -``` - -``` -usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF - -Count variants per-region in vcf - -positional arguments: - BED Regions to process - VCF Truvari bench result directory or a single VCF - -optional arguments: - -h, --help show this help message and exit - -o OUT, --output OUT Output bed-like file - --header Input regions have header to preserve in output - -w, --within Only count variants contained completely within region boundaries - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.3.0/vcf2df.md b/docs/v4.3.0/vcf2df.md deleted file mode 100644 index 14d9f06c..00000000 --- a/docs/v4.3.0/vcf2df.md +++ /dev/null @@ -1,81 +0,0 @@ -We enjoy using
[pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns: -* chrom: variant chromosome -* start: 0-based start from pysam.VariantRecord.start -* end: 0-based end from pysam.VariantRecord.stop -* id : VCF column ID -* svtype : SVTYPE as determined by `truvari.entry_variant_type` -* svlen : SVLEN as determined by `truvari.entry_size` -* szbin : SVLEN's size bin as determined by `truvari.get_sizebin` -* qual : VCF column QUAL -* filter : VCF column FILTER -* is_pass : boolean of if the filter is empty or PASS - -Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding Allele Depth, has `Number=A`, indicating that there will be one value for each allele. Truvari assumes that all VCFs hold one allele per-line, so there are only 2 alleles described per-line, the reference and alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and columns are created named `PL_ref`, `PL_het`, `PL_hom`. - -After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results.
`vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column -* state : The truvari state assigned to the variant - * tpbase : Parsed from the tp-base.vcf - * tp : Parsed from the tp-comp.vcf - * fp : Parsed from the fp.vcf - * fn : Parsed from the fn.vcf - -The created DataFrame is saved into a joblib file, which can then be plotted as simply as: -```python -import joblib -import seaborn as sb -import matplotlib.pyplot as plt - -data = joblib.load("test.jl") -p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"]) -plt.xticks(rotation=45, ha='right') -p.set(title="True Positives by svtype and szbin") -``` -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png) - -This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparison. For example, imagine there's multiple parameters used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named to `params/sample/` and each converted to DataFrames with `truvari vcf2df`, we can expand/concatenate the saved joblib DataFrames with: - -```python -import glob -import joblib -import pandas as pd - -files = glob.glob("*/*/data.jl") -dfs = [] -for f in files: - params, sample, frame = f.split('/') - d = joblib.load(f) - d["params"] = params - d["sample"] = sample - dfs.append(d) -df = pd.concat(dfs) -joblib.dump(df, "results.jl") -``` - -To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with the command: -```python -result = pyranges.PyRanges(df.rename(columns={'chrom':"Chromosome", "start":"Start", "end":"End"})) -``` - -``` -usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL - -Takes a vcf and creates a data frame. 
Can parse a bench output directory - -positional arguments: - VCF VCF to parse - JL Output joblib to save - -optional arguments: - -h, --help show this help message and exit - -b, --bench-dir Input is a truvari bench directory - -i, --info Attempt to put the INFO fields into the dataframe - -f, --format Attempt to put the FORMAT fileds into the dataframe - -s SAMPLE, --sample SAMPLE - SAMPLE name to parse when building columns for --format - -n, --no-prefix Don't prepend sample name to format columns - -S, --skip-compression - Skip the attempt to optimize the dataframe's size - -c LVL, --compress LVL - Compression level for joblib 0-9 (3) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.3.1/Citations.md b/docs/v4.3.1/Citations.md deleted file mode 100644 index d600b860..00000000 --- a/docs/v4.3.1/Citations.md +++ /dev/null @@ -1,30 +0,0 @@ -# Citing Truvari - -English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6 - -# Citations - -List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list. 
-* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8) -* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract) -* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true) -* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728) -* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full) -* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0) -* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078) -* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5) -* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract) -* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0) -* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9) -* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452) -* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true) -* [Readfish enables targeted 
nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x) -* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full) -* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y) -* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246) -* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7) -* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9) -* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract) -* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract) -* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170) -* [Automated filtering of genome-wide large deletions through an ensemble deep learning framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110) diff --git a/docs/v4.3.1/Development.md b/docs/v4.3.1/Development.md deleted file mode 100644 index ccb38493..00000000 --- a/docs/v4.3.1/Development.md +++ /dev/null @@ -1,90 +0,0 @@ -# Truvari API -Many of the helper methods/objects are documented such that developers can reuse truvari in their own
code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/). - -Documentation can also be seen using -```python -import truvari -help(truvari) -``` - -# docker - -A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run -```bash -docker build -t truvari . -``` - -You can then run Truvari through docker using -```bash -docker run -v `pwd`:/data -it truvari -``` -Where `pwd` can be whatever directory you'd like to mount into the container at the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point. -```bash -docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz -``` - -If you'd like to interact within the docker container for things like running the CI/CD scripts -```bash -docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari -``` -You'll now be inside the container and can run FuncTests or run Truvari directly -```bash -bash repo_utils/truvari_ssshtests.sh -truvari anno svinfo -i example.vcf.gz -``` - -# CI/CD - -Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts. - -```bash -pip install pylint anybadge coverage -``` - -Check code formatting with -```bash -python repo_utils/pylint_maker.py -``` -We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting. - -Test the code and generate a coverage report with -```bash -bash repo_utils/truvari_ssshtests.sh -``` - -Truvari leverages github actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, it uploads an artifact of the `coverage html` report which you can download to see a line-by-line accounting of test coverage.
- -# git flow - -To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/` - -When contributing to the code, be sure you're working off of develop and have run `git flow init`. - -# versioning - -Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant to [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPi, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version as well as '.uc' if there are un-staged commits in the repo. - -# docs - -The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`. 
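As a toy illustration of the version ordering described above (the parser and version strings here are hypothetical, not Truvari's release tooling), a `-dev` or `-rc` suffix should sort below the corresponding final release:

```python
import re

def parse_version(version):
    """Parse a simplified 'X.Y.Z' or 'X.Y.Z-dev'/'X.Y.Z-rc' string into a
    sortable tuple. A pre-release suffix ranks below the final release.
    This is a sketch of the scheme described above, not Truvari's code."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:-(dev|rc))?", version)
    if not match:
        raise ValueError(f"unrecognized version: {version}")
    major, minor, patch, suffix = match.groups()
    # dev sorts below rc, which sorts below the final release
    rank = {"dev": 0, "rc": 1, None: 2}[suffix]
    return (int(major), int(minor), int(patch), rank)

# develop ('3.1.0-dev') is newer than the stable release but older than 3.1.0
assert parse_version("3.0.0") < parse_version("3.1.0-dev") < parse_version("3.1.0")
```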
- -# Creating a release -Follow these steps to create a release - -0) Bump release version -1) Run tests locally -2) Update API Docs -3) Change Updates Wiki -4) Freeze the Wiki -5) Ensure all code is checked in -6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/) -7) Use github action to make a testpypi release -8) Check test release -```bash -python3 -m venv test_truvari -python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari -``` -9) Use GitHub action to make a pypi release -10) Download release-tarball.zip from step #9’s action -11) Create release (include #9) from the tag -12) Checkout develop and Bump to dev version and README ‘commits since’ badge \ No newline at end of file diff --git a/docs/v4.3.1/Installation.md b/docs/v4.3.1/Installation.md deleted file mode 100644 index a8aae743..00000000 --- a/docs/v4.3.1/Installation.md +++ /dev/null @@ -1,72 +0,0 @@ -Recommended -=========== -For stable versions of Truvari, use pip -``` -python3 -m pip install truvari -``` -Specific versions can be installed via -``` -python3 -m pip install truvari==3.2.0 -``` -See [pypi](https://pypi.org/project/Truvari/#history) for a history of all distributed releases. - -Manual Installation -=================== -To build Truvari directly, clone the repository and switch to a specific tag. -``` -git clone https://github.com/ACEnglish/truvari.git -git checkout tags/v3.0.0 -python3 -m pip install . -``` - -To see a list of all available tags, run: -``` -git tag -l -``` - -If you have an older clone of the repository and don't see the version you're looking for in tags, make sure to pull the latest changes: -``` -git pull -git fetch --all --tags -``` - -Mamba / Conda -============= -NOTE!! There is a very old version of Truvari on bioconda that - for unknown reasons - supersedes the newer, supported versions. Users may need to specify to conda which release to build. 
See [this ticket](https://github.com/ACEnglish/truvari/issues/130#issuecomment-1196607866) for details. - -Truvari releases are automatically deployed to bioconda. -Users can follow the instructions here (https://mamba.readthedocs.io/en/latest/installation.html) to install mamba, a faster, conda-compatible alternative package manager. - -Create an environment with Truvari and its dependencies: -``` -mamba create -c conda-forge -c bioconda -n truvari truvari -``` - -Alternatively, see the [conda page](https://anaconda.org/bioconda/truvari) for details -``` -conda install -c bioconda truvari -``` - -Building from develop -===================== -The default branch is `develop`, which holds in-development changes. This is for developers or those wishing to try experimental features and is not recommended for production. Development is versioned higher than the most recent stable release with an added suffix (e.g. if the current stable release is `3.0.0`, develop holds `3.1.0-dev`). If you'd like to install develop, repeat the steps above but without `git checkout tags/v3.0.0`. See [wiki](https://github.com/spiralgenetics/truvari/wiki/Development#git-flow) for details on how branching is handled. - -Docker -====== -See [Development](https://github.com/spiralgenetics/truvari/wiki/Development#docker) for details on building a docker container. - -edlib error -=========== -Some environments have a hard time installing edlib due to a possible cython bug ([source](https://bugs.launchpad.net/ubuntu/+source/cython/+bug/2006404)).
If you see an error such as the following: -``` -edlib.bycython.cpp:198:12: fatal error: longintrepr.h: No such file or directory - 198 | #include "longintrepr.h" -``` - -One method to prevent the error message is to run the following commands ([source](https://github.com/Martinsos/edlib/issues/212)) -```bash -python3 -m venv my_env -source my_env/bin/activate -python3 -m pip install --no-cache-dir cython setuptools wheel -EDLIB_USE_CYTHON=1 python3 -m pip install truvari --no-cache -``` \ No newline at end of file diff --git a/docs/v4.3.1/MatchIds.md b/docs/v4.3.1/MatchIds.md deleted file mode 100644 index ca52076f..00000000 --- a/docs/v4.3.1/MatchIds.md +++ /dev/null @@ -1,74 +0,0 @@ -MatchIds are used to tie base/comparison calls together in post-processing for debugging or other exploration. MatchIds have a structure of `{chunkid}.{callid}`. The chunkid is a unique id per-chunk of calls. All calls sharing chunkid were within `--chunksize` distance and were compared. The callid is unique to a call in a chunk for each VCF. Because `bench` processes two VCFs (the base and comparison VCFs), the `MatchId` has two values: the first is the base variant's MatchId and the second the comparison variant's MatchId. - -For `--pick single`, the two MatchIds will be identical in, e.g., the tp-base.vcf.gz and tp-comp.vcf.gz. However, for `--pick ac|multi`, it's possible to have cases such as one base variant matching to multiple comparison variants. That would give us MatchIds like: - -``` -# tp-base.vcf -MatchId=4.0,4.1 - -# tp-comp.vcf -MatchId=4.0,4.1 -MatchId=4.0,4.2 -``` - -This example tells us that the tp-comp variants are both pointing to `4.0` in tp-base. The tp-base variant has a higher match to the tp-comp `4.1` variant. - -One easy way to combine matched variants is to use `truvari vcf2df` to convert a benchmarking result to a pandas DataFrame and leverage pandas' merge operation. First, we convert the `truvari bench` result.
- -```bash -truvari vcf2df --info --bench-dir bench_result/ data.jl -``` - -Next, we combine rows of matched variants: -```python -import joblib -import pandas as pd - -# Load the data -data = joblib.load("data.jl") - -# Separate out the variants from the base VCF and add new columns of the base/comp ids -base = data[data['state'].isin(['tpbase', 'fn'])].copy() -base['base_id'] = base['MatchId'].apply(lambda x: x[0]) -base['comp_id'] = base['MatchId'].apply(lambda x: x[1]) - -# Separate out the variants from the comparison VCF and add new columns of the base/comp ids -comp = data[data['state'].isin(['tp', 'fp'])].copy() -comp['base_id'] = comp['MatchId'].apply(lambda x: x[0]) -comp['comp_id'] = comp['MatchId'].apply(lambda x: x[1]) - -# Merge the base/comparison variants -combined = pd.merge(base, comp, left_on='base_id', right_on='comp_id', suffixes=('_base', '_comp')) - -# How many comp variants matched to multiple base variants? -counts1 = combined['base_id_comp'].value_counts() -print('multi-matched comp count', (counts1 != 1).sum()) - -# How many base variants matched to multiple comp variants? -counts2 = combined['comp_id_base'].value_counts() -print('multi-matched base count', (counts2 != 1).sum()) -``` - -The `MatchId` is also used by `truvari collapse`. However, there are two differences. First, in the main `collapse` output, the relevant INFO field is named `CollapseId`. Second, because collapse only has a single input VCF, it is much easier to merge DataFrames. To merge a collapse result's kept variants with those that were removed, we again need to convert the VCFs to DataFrames: - -```bash -truvari vcf2df -i kept.vcf.gz kept.jl -truvari vcf2df -i removed.vcf.gz remov.jl -``` - -Then we combine them: -```python -import joblib -import pandas as pd - -# Load the kept variants and set the index. -kept = joblib.load("kept.jl").set_index('CollapseId') - -# Load the removed variants and set the index.
- -remov = joblib.load("remov.jl") -remov['CollapseId'] = remov['MatchId'].apply(lambda x: x[0]) -remov.set_index('CollapseId', inplace=True) - -# Join the two sets of variants -result_df = kept.join(remov, how='right', rsuffix='_removed') -``` \ No newline at end of file diff --git a/docs/v4.3.1/Multi-allelic-VCFs.md b/docs/v4.3.1/Multi-allelic-VCFs.md deleted file mode 100644 index fd0eb23e..00000000 --- a/docs/v4.3.1/Multi-allelic-VCFs.md +++ /dev/null @@ -1,11 +0,0 @@ -Truvari only compares the first alternate allele in VCFs. If a VCF contains multi-allelic sites such as: - -``` -chr2 1948201 . T TACAACACGTACGATCAGTAGAC,TCAACACACAACACGTACGATCAGTAGAC .... -``` - -Then pre-process the VCFs with bcftools: - -```bash -bcftools norm -m-any base_calls.vcf.gz | bgzip > base_calls_split.vcf.gz -``` \ No newline at end of file diff --git a/docs/v4.3.1/anno.md b/docs/v4.3.1/anno.md deleted file mode 100644 index 55835902..00000000 --- a/docs/v4.3.1/anno.md +++ /dev/null @@ -1,494 +0,0 @@ - -Truvari annotations: -* [gcpct](anno#truvari-anno-gcpct) - GC Percent -* [gtcnt](anno#truvari-anno-gtcnt) - Genotype Counts -* [trf](anno#truvari-anno-trf) - Tandem Repeats -* [grm](anno#truvari-anno-grm) - Mappability -* [repmask](anno#truvari-anno-repmask) - Repeats -* [remap](anno#truvari-anno-remap) - Allele Remapping -* [hompct](anno#truvari-anno-hompct) - Homozygous Percent -* [numneigh](anno#truvari-anno-numneigh) - Number of Neighbors -* [svinfo](anno#truvari-anno-svinfo) - SVINFO Fields -* [bpovl](anno#truvari-anno-bpovl) - Annotation Intersection -* [density](anno#truvari-anno-density) - Call Density -* [dpcnt](anno#truvari-anno-dpcnt) - Depth (DP) and Alt-Depth (AD) Counts -* [lcr](anno#truvari-anno-lcr) - Low-complexity Regions -* [grpaf](anno#truvari-anno-grpaf) - Sample Group Allele-Frequency Annotations - -# truvari anno gcpct - -This will add an INFO tag `GCPCT` to each element in a VCF, giving the GC percent of the call's sequence.
- -For deletions, this is the GC percent of the reference range of the call. For insertions, the ALT sequence is analyzed. -``` -usage: gcpct [-h] [-o OUTPUT] -r REFERENCE [input] - -Annotates GC content of SVs - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -r REFERENCE, --reference REFERENCE - Reference fasta -``` - -# truvari anno gtcnt -This will add an INFO tag `GTCNT` to each element in a VCF with the count of genotypes found across all samples. The value is a list of Counts of genotypes for the allele across all samples (UNK, REF, HET, HOM). This is most useful for pVCFs. - -``` -usage: gtcnt [-h] [-o OUTPUT] [input] - -Annotates GTCounts of alleles - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno trf -Adds a tandem repeat annotation to sequence-resolved insertion/deletion variants in a VCF. - -### Annotations added -| Field Name | Description | -|------------|-------------------------------------------------------------| -| TRF | Entry hits a tandem repeat region | -| TRFdiff | ALT TR copy difference from reference | -| TRFrepeat | Repeat motif | -| TRFovl | Percent of ALT covered by TRF annotation | -| TRFstart | Start position of discovered repeat | -| TRFend | End position of discovered repeat | -| TRFperiod | Period size of the repeat | -| TRFcopies | Number of copies aligned with the consensus pattern | -| TRFscore | Alignment score | -| TRFentropy | Entropy measure | -| TRFsim | Similarity of ALT sequence to generated motif faux sequence | - -### Details -Given a set of tandem repeat regions and a VCF, this annotates the tandem repeat motif and copy number change of insertions and deletions as expansions/contractions.
The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via: -```bash -zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz -tabix anno.trf.bed.gz -``` -For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)). - -For insertions, by default the tool first tries to estimate which motif is contained in the alternate sequence. For each overlapping annotation, the copy number difference of the motif is calculated via `copy_diff = len(entry.alts[0][1:]) / anno["period"]`. Next, a 'faux sequence' is made from `copy_diff` copies of the motif. If the similarity of this sequence to the insertion sequence is above the `--motif-similarity`, that is considered the insertion's motif. If no estimate is above the `--motif-similarity`, the insertion is incorporated into the reference and TRF is run. If the discovered TRF hits match a motif in the tandem repeat regions file, that annotation is used. If the highest scoring TRF hit doesn't match the tandem repeats region file, the nominally de novo annotation is added to the insertion's vcf entry.
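The insertion motif-estimation step above can be sketched in a few lines. This is a simplified, hypothetical illustration, not Truvari's implementation: it uses `difflib` as a stand-in for the tool's sequence-similarity routine, and the `anno` dictionaries only mirror the catalog fields named in the text.

```python
from difflib import SequenceMatcher

def estimate_motif(ins_seq, annotations, similarity=0.9):
    """For each overlapping catalog annotation, build a 'faux sequence' from
    copy_diff copies of the motif and accept the first one that is similar
    enough to the insertion sequence."""
    for anno in annotations:  # e.g. {"repeat": "ACGT", "period": 4}
        copy_diff = len(ins_seq) / anno["period"]
        faux = (anno["repeat"] * (int(copy_diff) + 1))[:len(ins_seq)]
        if SequenceMatcher(None, ins_seq, faux).ratio() >= similarity:
            return anno["repeat"], round(copy_diff, 1)
    return None  # fall back to incorporating the INS into the reference and running TRF

print(estimate_motif("ACGTACGTACGT", [{"repeat": "ACGT", "period": 4}]))
# → ('ACGT', 3.0)
```

A `None` result corresponds to the fallback path described above, where TRF itself is run on the altered reference.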
- -``` -usage: trf [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-T TRF_PARAMS] -r REPEATS -f REFERENCE [-s MOTIF_SIMILARITY] - [-m MIN_LENGTH] [-R] [--no-estimate] [-C CHUNK_SIZE] [-t THREADS] [--debug] - -Intersect vcf with reference tandem repeats and annotate -variants with the best fitting repeat motif and its copy number -relative to the reference - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to tandem repeat finder (trf409.linux64) - -T TRF_PARAMS, --trf-params TRF_PARAMS - Default parameters to send to trf (3 7 7 80 5 40 500 -h -ngs) - -r REPEATS, --repeats REPEATS - Reference repeat annotations - -f REFERENCE, --reference REFERENCE - Reference fasta file - -s MOTIF_SIMILARITY, --motif-similarity MOTIF_SIMILARITY - Motif similarity threshold (0.9) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -R, --regions-only Only write variants within --repeats regions (False) - --no-estimate Skip INS estimation procedure and run everything through TRF. (False) - -C CHUNK_SIZE, --chunk-size CHUNK_SIZE - Size (in mbs) of reference chunks for parallelization (5) - -t THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno grm - -For every SV, we create a kmer over the upstream and downstream reference and alternate breakpoints. -We then remap that kmer to the reference genome and report alignment information. -This does not alter the VCF's traditional annotations, but instead will create a pandas -DataFrame and save it to a joblib object. - -There are four queries made per-SV. For both reference (r) and alternate (a) we create upstream (up) and downstream (dn) kmers. -So the columns are all prefixed with one of "rup_", "rdn_", "aup_", "adn_".
- -In the alignment information per-query, there are three 'hit' counts: -- nhits : number of query hits -- dir_hits : direct strand hit count -- com_hits : complement strand hit count - -The rest of the alignment information is reported by average (avg), maximum (max), and minimum (min). - -The suffixes are: -- q : mapping quality score of the hits -- ed : edit distance of the hits -- mat : number of matches -- mis : number of mismatches - -For example, "aup_avg_q" is the alternate's upstream breakend kmer's average mapping quality score. - -``` -usage: grm [-h] -i INPUT -r REFERENCE [-R REGIONS] [-o OUTPUT] [-k KMERSIZE] [-m MIN_SIZE] [-t THREADS] [--debug] - -Maps graph edge kmers with BWA to assess Graph Reference Mappability - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Input VCF - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -R REGIONS, --regions REGIONS - Bed file of regions to parse (None) - -o OUTPUT, --output OUTPUT - Output dataframe (results.jl) - -k KMERSIZE, --kmersize KMERSIZE - Size of kmer to map (50) - -m MIN_SIZE, --min-size MIN_SIZE - Minimum size of variants to map (25) - -t THREADS, --threads THREADS - Number of threads (1) - --debug Verbose logging -``` - -# truvari anno repmask - -``` -usage: repmask [-h] -i INPUT [-o OUTPUT] [-e EXECUTABLE] [-m MIN_LENGTH] [-M MAX_LENGTH] [-t THRESHOLD] [-p PARAMS] [-T THREADS] - [--debug] - - Wrapper around RepeatMasker to annotate insertion sequences in a VCF - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - VCF to annotate (None) - -o OUTPUT, --output OUTPUT - Output filename (/dev/stdout) - -e EXECUTABLE, --executable EXECUTABLE - Path to RepeatMasker (RepeatMasker) - -m MIN_LENGTH, --min-length MIN_LENGTH - Minimum size of entry to annotate (50) - -M MAX_LENGTH, --max-length MAX_LENGTH - Maximum size of entry to annotate (50000) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered (0.8) -
-p PARAMS, --params PARAMS - Default parameter string to send to RepeatMasker (-pa {threads} -qq -e hmmer -species human -lcambig - -nocut -div 50 -no_id -s {fasta}) - -T THREADS, --threads THREADS - Number of threads to use (1) - --debug Verbose logging -``` - -# truvari anno remap - -Taking the Allele's sequence, remap it to the reference and annotate based on the closest alignment. - -![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/remap_example.png) - -``` -usage: remap [-h] -r REFERENCE [-o OUTPUT] [-m MINLENGTH] [-t THRESHOLD] [-d DIST] [-H HITS] [--debug] [input] - -Remap VCF's allele sequences to the reference to annotate REMAP - -- novel : Allele has no hits in reference -- tandem : Allele's closest hit is within len(allele) bp of the SV's position -- interspersed : Allele's closest hit is not tandem -- partial : Allele only has partial hit(s) less than --threshold - -Which alleles and alignments to consider can be altered with: -- --minlength : minimum SV length to consider (50) -- --dist : For deletion SVs, do not consider alignments that hit within Nbp of the SV's position -(a.k.a.
alignments back to the source sequence) (10) -- --threshold : Minimum percent of allele's sequence used by alignment to be considered (.8) - -positional arguments: - input Input VCF (/dev/stdin) - -options: - -h, --help show this help message and exit - -r REFERENCE, --reference REFERENCE - BWA indexed reference - -o OUTPUT, --output OUTPUT - Output VCF (/dev/stdout) - -m MINLENGTH, --minlength MINLENGTH - Smallest length of allele to remap (50) - -t THRESHOLD, --threshold THRESHOLD - Threshold for pct of allele covered to consider hit (0.8) - -d DIST, --dist DIST Minimum distance an alignment must be from a DEL's position to be considered (10) - -H HITS, --hits HITS Report top hits as chr:start-end.pct (max 0) - --debug Verbose logging -``` -# truvari anno hompct - -``` -usage: hompct [-h] -i INPUT [-o OUTPUT] [-b BUFFER] [-m MINANNO] [-M MAXGT] [-c MINCOUNT] [--debug] - -Calculate the Hom / (Het + Hom) of variants in the region of SVs -Requires the VCF to contain SVs beside SNPs/Indels - -options: - -h, --help show this help message and exit - -i INPUT, --input INPUT - Compressed, indexed VCF to annotate - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -b BUFFER, --buffer BUFFER - Number of base-pairs up/dn-stream to query (5000) - -m MINANNO, --minanno MINANNO - Minimum size of event to annotate (50) - -M MAXGT, --maxgt MAXGT - Largest event size to count for genotyping (1) - -c MINCOUNT, --mincount MINCOUNT - Minimum number of genotyping events to report HOMPCT (0) - --debug Verbose logging -``` - -# truvari anno numneigh - -``` -usage: numneigh [-h] [-o OUTPUT] [-r REFDIST] [-s SIZEMIN] [--passonly] [--debug] [input] - -For every call within size boundaries, -Add NumNeighbors info field of how many calls are within the distance -Add NeighId clustering field in the same chained neighborhood -For example, -:: - -- is a call, refdist is 2 - - - - - - - - nn: 1 2 1 0 1 1 - id: 0 0 0 1 2 2 - -positional arguments: - input VCF to annotate - -options: -
-h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output vcf (stdout) - -r REFDIST, --refdist REFDIST - Max reference location distance (1000) - -s SIZEMIN, --sizemin SIZEMIN - Minimum variant size to consider for annotation (50) - --passonly Only count calls with FILTER == PASS - --debug Verbose logging -``` - -# truvari anno svinfo - -Uses `truvari.entry_size` and `truvari.entry_variant_type` on entries >= `args.minsize` to add 'SVLEN' and 'SVTYPE' annotations to a VCF's INFO. - -How SVLEN is determined: -- Starts by trying to use INFO/SVLEN -- If SVLEN is unavailable and the ALT field is an SV (e.g. `<DEL>`, `<INS>`, etc.), use abs(vcf.start - vcf.end). The INFO/END tag needs to be available, especially for INS. -- Otherwise, return the size difference of the sequence resolved call using abs(len(vcf.REF) - len(str(vcf.ALT[0]))) - -How SVTYPE is determined: -- Starts by trying to use INFO/SVTYPE -- If SVTYPE is unavailable, infer if the entry is an insertion or deletion by looking at the REF/ALT sequence size differences -- If REF/ALT sequences are not available, try to parse the `<DEL>`, `<INS>`, etc. from the ALT column. -- Otherwise, assume 'UNK' - -``` -usage: svinfo [-h] [-o OUTPUT] [-m MINSIZE] [input] - -Adds SVTYPE and SVLEN INFO fields - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -m MINSIZE, --minsize MINSIZE - Minimum size of entry to annotate (50) -``` - -# truvari anno bpovl - -After turning a tab-delimited annotation file into an IntervalTree, intersect the start/end and overlap of SVs. -The output is a light-weight pandas DataFrame saved with joblib.
The columns in the output are: - -- vcf_key : Variant key from `truvari.entry_to_key` -- intersection : Type of intersection between the SV and the annotation - - start_bnd - SV's start breakpoint hits the annotation - - end_bnd - SV's end breakpoint hits the annotation - - overlaps - Annotation's start/end boundaries are completely within the SV - - contains - Annotation's start/end boundaries span the entire SV -- anno_key : Annotation file's line index - -The idea with this tool is to annotate variants against tab-delimited files, especially when there's a 1-to-N relationship between variants and annotations. This tool is useful when used in conjunction with `truvari vcf2df` and pandas DataFrames. - -For example, suppose we have a VCF of SVs and a GTF of genes/gene features from Ensembl. Any SV may intersect multiple features, which doesn't lend itself well to directly annotating the VCF's INFO. After using `bpovl`, we'll use Truvari to convert the SVs to a DataFrame. - -```bash -truvari anno bpovl -i variants.vcf.gz -a genes.gtf.gz -o anno.jl -p gff -truvari vcf2df variants.vcf.gz variants.jl -``` - -We can then treat the files like a database and do queries and joins to report which variants intersect which annotations. - -```python -import joblib -from gtfparse import read_gtf -variants = joblib.load("variants.jl") -genes = read_gtf("genes.gtf.gz") -annos = joblib.load("anno.jl") -to_check = annos.iloc[0] - -print(to_check) -# vcf_key chr20:958486-958487.A -# intersection start_bnd -# anno_key 11 - -print(variants.loc[to_check['vcf_key']]) -# id None -# svtype INS -# ... etc - -print(annos.loc[to_check['anno_key']]) -# seqname chr20 -# source ensembl_havana -# feature exon -# start 958452 -# ... etc -``` - -Similar to tabix, `bpovl` has presets for known file types like bed and gff. But any tab-delimited file with sequence/chromosome, start position, and end position can be parsed. Just set the "Annotation File Arguments" to the 0-based column indexes.
For example, a bed file -has arguments `-s 0 -b 1 -e 2 -c #`. - -``` -usage: bpovl [-h] -a ANNO -o OUTPUT [--sizemin SIZEMIN] [--spanmax SPANMAX] [-p {bed,gff}] [-c COMMENT] [-s SEQUENCE] [-b BEGIN] - [-e END] [-1] - [input] - -Creates intersection of features in an annotation file with SVs' breakpoints and overlap - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -a ANNO, --anno ANNO Tab-delimited annotation file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - --sizemin SIZEMIN Minimum size of variant to annotate (50) - --spanmax SPANMAX Maximum span of SVs to annotate (50000) - -Annotation File Arguments: - -p {bed,gff}, --preset {bed,gff} - Annotation format. This option overwrites -s, -b, -e, -c and -1 (None) - -c COMMENT, --comment COMMENT - Skip lines started with character. (#) - -s SEQUENCE, --sequence SEQUENCE - Column of sequence/chromosome name. (0) - -b BEGIN, --begin BEGIN - Column of start chromosomal position. (1) - -e END, --end END Column of end chromosomal position. (2) - -1, --one-based The position in the anno file is 1-based rather than 0-based. (False) -``` -# truvari anno density -Partitions a `--genome` into `--windowsize` regions and counts how many variants overlap each. Annotates -regions with no variants as 'sparse' and regions with greater than or equal to (mean + `--threshold` * standard -deviation) variants as 'dense'. Outputs a joblib DataFrame with columns -`chrom, start, end, count, anno`.
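The labeling rule above can be sketched as follows. This is a hypothetical illustration of the thresholding only, not the tool's implementation; `counts` stands in for the per-window variant counts from the `count` column.

```python
import statistics

def label_windows(counts, threshold=3):
    """Windows with no variants are 'sparse'; windows with
    count >= mean + threshold * stdev are 'dense'."""
    cutoff = statistics.mean(counts) + threshold * statistics.stdev(counts)
    return ["sparse" if c == 0 else ("dense" if c >= cutoff else "") for c in counts]

print(label_windows([0, 2, 3, 2, 50], threshold=1))
# → ['sparse', '', '', '', 'dense']
```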
- -``` -usage: density [-h] -g GENOME -o OUTPUT [-m MASK] [-w WINDOWSIZE] [-s STEPSIZE] [-t THRESHOLD] [input] - -Identify 'dense' and 'sparse' variant windows of the genome - -positional arguments: - input Input VCF (/dev/stdin) - -optional arguments: - -h, --help show this help message and exit - -g GENOME, --genome GENOME - Genome bed file - -o OUTPUT, --output OUTPUT - Output joblib DataFrame - -m MASK, --mask MASK Mask bed file - -w WINDOWSIZE, --windowsize WINDOWSIZE - Window size (10000) - -s STEPSIZE, --stepsize STEPSIZE - Window step size (10000) - -t THRESHOLD, --threshold THRESHOLD - std for identifying 'dense' regions (3) -``` - -# truvari anno dpcnt - -For multi-sample VCFs, it is often useful to have summarized depth (DP) information across samples per-variant. This adds an `INFO/DPCNT` with counts of how many samples have `FORMAT/DP` for each of the user-defined bins. Bins are incremented using `bisect`, e.g. `pos = bisect.bisect(bins, dp); bins[pos] += 1`. - -``` -usage: dpcnt [-h] [-b BINS] [--no-ad] [-p] [-o OUTPUT] [input] - -Quick utility to count how many samples have >= Nx coverage per-variant - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -b BINS, --bins BINS Coverage bins to bisect left the counts (0,5,10,15) - --no-ad Skip adding ADCNT bins - -p, --present Only count sites with present (non ./.) genotypes - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno lcr - -``` -usage: lcr [-h] [-o OUTPUT] [input] - -Annotate low complexity region entropy score for variants -Credit: https://jszym.com/blog/dna_protein_complexity/ - -positional arguments: - input VCF to annotate (stdin) - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) -``` - -# truvari anno grpaf - -Add INFO tags of allele frequency annotations for groups of samples.
For every group in `--labels` tab-delimited file, calculate the AF,MAF,ExcHet,HWE,MAC,AC for the samples in the group. Adds INFO tags with suffix of the group identifier (e.g. `AF_EAS`). `--strict` will hard fail if there are samples in the `--labels` not present in the vcf header. - -``` -usage: grpaf [-h] [-o OUTPUT] -l LABELS [-t TAGS] [--strict] [--debug] [input] - -Add allele frequency annotations for subsets of samples - -positional arguments: - input VCF to annotate - -options: - -h, --help show this help message and exit - -o OUTPUT, --output OUTPUT - Output filename (stdout) - -l LABELS, --labels LABELS - Tab-delimited file of sample and group - -t TAGS, --tags TAGS Comma-separated list of tags to add from AF,MAF,ExcHet,HWE,MAC,AC,AN (all) - --strict Exit if sample listed in labels is not present in VCF (False) - --debug Verbose logging -``` \ No newline at end of file diff --git a/docs/v4.3.1/consistency.md b/docs/v4.3.1/consistency.md deleted file mode 100644 index 17ba23b7..00000000 --- a/docs/v4.3.1/consistency.md +++ /dev/null @@ -1,179 +0,0 @@ - -In addition to looking at performance of a single set of variation against a baseline, one may wish to measure the consistency between multiple sets of variation. The tool `truvari consistency` can automatically create that result. - -Running -======= - -``` -usage: consistency [-h] [-d] [-j] [-o OUTPUT] VCFs [VCFs ...] - -Over multiple vcfs, calculate their intersection/consistency. 
- -Calls will match between VCFs if they have a matching key of: - CHROM POS ID REF ALT - -positional arguments: - VCFs VCFs to intersect - -optional arguments: - -h, --help show this help message and exit - -d, --no-dups Disallow duplicate SVs - -j, --json Output report in json format - -o OUTPUT, --output OUTPUT - Write tsv of variant keys and their flag -``` - -Example -======= - -```bash -truvari consistency fileA.vcf fileB.vcf fileC.vcf -``` - -Matching Entries -================ - -VCF entries will be considered matching if and only if they have the exact same key of `CHROM POS ID REF ALT`. Because of this stringency, it is recommended that you compare the tp-base.vcf or fn.vcf results from each individual VCF's Truvari output. The key and flags can be written with the `--output` option. - -Duplicates -========== - -If there are VCFs with duplicate keys, they are handled by appending a suffix, e.g. `.1`, to the key. If you'd like to ignore duplicates, just add `--no-dups`. - -Output Report -============= - -Below is an example report: - -```text -# -# Total 5534 calls across 3 VCFs -# -#File NumCalls -fileA.vcf 4706 -fileB.vcf 4827 -fileC.vcf 4882 -# -# Summary of consistency -# -#VCFs Calls Pct -3 3973 71.79% -2 935 16.90% -1 626 11.31% -# -# Breakdown of VCFs' consistency -# -#Group Total TotalPct PctOfFileCalls -111 3973 71.79% 84.42% 82.31% 81.38% -011 351 6.34% 7.27% 7.19% -101 308 5.57% 6.54% 6.31% -110 276 4.99% 5.86% 5.72% -001 250 4.52% 5.12% -010 227 4.10% 4.70% -100 149 2.69% 3.17% -``` - -At the top we see that we compared 5,534 unique variants between the 3 files, with fileC.vcf having the most calls at 4,882. - -The "Summary of consistency" shows us that 3,973 (71.79%) of all the calls are shared between the 3 VCFs, while 626 (11.31%) are only found in one of the VCFs. - -Reading the "Breakdown of VCFs' consistency", a `Group` is a unique key for presence (1) or absence (0) of a call within each of the listed `#Files`.
For example: `Group 111` is calls present in all VCFs; `Group 011` is calls present in only the 2nd and 3rd VCFs (i.e. fileB.vcf and fileC.vcf). - -We see that `Group 101` has calls belonging to the 1st and 3rd `#Files` (i.e. fileA.vcf and fileC.vcf). This group has a total of 308 calls that intersect, or 5.57% of all calls in all VCFs. This 308 represents 6.54% of calls in fileA.vcf and 6.31% of calls in fileC.vcf. - -Finally, we see that fileA.vcf has the least amount of calls unique to it on the `Group 100` line. - -Json -==== -Below is a consistency report in json format. -```json -{ - "vcfs": [ - "repo_utils/test_files/variants/input1.vcf.gz", - "repo_utils/test_files/variants/input2.vcf.gz", - "repo_utils/test_files/variants/input3.vcf.gz" - ], - "total_calls": 3513, - "num_vcfs": 3, - "vcf_counts": { - "repo_utils/test_files/variants/input1.vcf.gz": 2151, - "repo_utils/test_files/variants/input2.vcf.gz": 1783, - "repo_utils/test_files/variants/input3.vcf.gz": 2065 - }, - "shared": [ - { - "vcf_count": 3, - "num_calls": 701, - "call_pct": 0.1995445488186735 - }, - { - "vcf_count": 2, - "num_calls": 1084, - "call_pct": 0.3085681753487048 - }, - { - "vcf_count": 1, - "num_calls": 1728, - "call_pct": 0.4918872758326217 - } - ], - "detailed": [ - { - "group": "111", - "total": 701, - "total_pct": 0.1995445488186735, - "repo_utils/test_files/variants/input1.vcf.gz": 0.32589493258949326, - "repo_utils/test_files/variants/input2.vcf.gz": 0.393157599551318, - "repo_utils/test_files/variants/input3.vcf.gz": 0.3394673123486683 - }, - { - "group": "001", - "total": 645, - "total_pct": 0.18360375747224594, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.31234866828087166 - }, - { - "group": "100", - "total": 598, - "total_pct": 0.17022487902077996, - "repo_utils/test_files/variants/input1.vcf.gz": 0.2780102278010228, - 
"repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "101", - "total": 487, - "total_pct": 0.1386279533162539, - "repo_utils/test_files/variants/input1.vcf.gz": 0.22640632264063226, - "repo_utils/test_files/variants/input2.vcf.gz": 0, - "repo_utils/test_files/variants/input3.vcf.gz": 0.2358353510895884 - }, - { - "group": "010", - "total": 485, - "total_pct": 0.13805863933959578, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.27201346045989905, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "110", - "total": 365, - "total_pct": 0.10389980074010817, - "repo_utils/test_files/variants/input1.vcf.gz": 0.1696885169688517, - "repo_utils/test_files/variants/input2.vcf.gz": 0.2047111609646663, - "repo_utils/test_files/variants/input3.vcf.gz": 0 - }, - { - "group": "011", - "total": 232, - "total_pct": 0.06604042129234272, - "repo_utils/test_files/variants/input1.vcf.gz": 0, - "repo_utils/test_files/variants/input2.vcf.gz": 0.13011777902411667, - "repo_utils/test_files/variants/input3.vcf.gz": 0.11234866828087167 - } - ] -} -``` \ No newline at end of file diff --git a/docs/v4.3.1/divide.md b/docs/v4.3.1/divide.md deleted file mode 100644 index 3bbad3cb..00000000 --- a/docs/v4.3.1/divide.md +++ /dev/null @@ -1,58 +0,0 @@ -Divide a VCF into independent shards. - -Unfortunately, `pysam.VariantRecord` objects aren't pickle-able. This means that if we wanted to have Truvari leverage python's multiprocessing we'd need to make a custom VCF parser. However, the command `truvari divide` allows us to take an input VCF and divide it into multiple independent parts (or shards) which can the be processed over multiple processes. - -`truvari divide` works by parsing a VCF and splitting it into multiple, smaller sub-VCFs. If any variants are within `--buffer` base-pairs, they're output to the same sub-VCF. 
This allows variants in the same region which would need to be compared to one-another (see `--refdist`) to stay in the same sub-VCF. The `--min` parameter allows us to control the minimum number of variants per sub-VCF so that we don't make too many tiny VCFs. Once the sub-VCFs are created, we can process each independently through whatever Truvari command we need. - -For example, let's say we want to run `truvari collapse` on a very large VCF with many variants and many samples. First, we divide the VCF: - -```bash -truvari divide big_input.vcf.gz sub_vcfs_directory/ -``` - -Inside of `sub_vcfs_directory/` we'll have multiple VCFs, which we can process with a simple bash script - -```bash -NJOBS=$(nproc) # use all processors by default -mkdir -p output_vcfs/ -mkdir -p collap_vcfs/ -mkdir -p logs/ - -for in_vcf in sub_vcfs_directory/*.vcf.gz -do - # Setup file names - base_name=$(basename $in_vcf) - base_name=${base_name%.vcf.gz} - output_name=output_vcfs/${base_name}.vcf - collap_name=collap_vcfs/${base_name}.vcf - log_name=logs/${base_name}.log - # Run the command and send it to the background - truvari collapse -i $in_vcf -o $output_name -c $collap_name -f reference.fa &> ${log_name} & - # If too many jobs are running, wait for one to finish - while [ $( jobs | wc -l ) -ge ${NJOBS} ] - do - sleep 5 - done -done -``` - -Obviously the logs and `while` loop are tricks for running on a single machine. If you have access to a cluster, I'm sure you can imagine how to create/submit the commands.
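The grouping rule can be illustrated with a small sketch. This is a hypothetical simplification of the clustering described above (positions within `--buffer` stay together, and a shard only closes once it holds at least `--min` entries), not `truvari divide` itself:

```python
def shard_positions(positions, buffer=1000, min_entries=100):
    """Group sorted variant positions into shards; start a new shard only
    when the gap to the previous variant exceeds `buffer` and the current
    shard already holds `min_entries` variants."""
    shards, current = [], []
    last = None
    for pos in sorted(positions):
        if last is not None and pos - last > buffer and len(current) >= min_entries:
            shards.append(current)
            current = []
        current.append(pos)
        last = pos
    if current:
        shards.append(current)
    return shards

# Two clusters separated by more than the buffer; a low --min lets them split
print(len(shard_positions([100, 200, 5000, 5100], buffer=1000, min_entries=2)))
# → 2
```

Note how raising `min_entries` merges nearby clusters rather than emitting many tiny shards, which mirrors the purpose of `--min`.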
- -``` -usage: divide [-h] [-b BUFFER] [-m MIN] [--no-compress] [-T THREADS] VCF DIR - -Divide a VCF into independent parts - -positional arguments: - VCF VCF to split - DIR Output directory to save parts - -options: - -h, --help show this help message and exit - -b BUFFER, --buffer BUFFER - Buffer to make mini-clusters (1000) - -m MIN, --min MIN Minimum number of entries per-vcf (100) - --no-compress Don't attempt to compress/index sub-VCFs - -T THREADS, --threads THREADS - Number of threads for compressing/indexing sub-VCFs (1) -``` \ No newline at end of file diff --git a/docs/v4.3.1/phab.md b/docs/v4.3.1/phab.md deleted file mode 100644 index bdcaec9e..00000000 --- a/docs/v4.3.1/phab.md +++ /dev/null @@ -1,157 +0,0 @@ -Introduction ------------- - -Truvari's comparison engine can match variants using a wide range of thresholds. However, some alleles can produce radically different variant representations. We could dramatically lower our thresholds to identify the match, but this would cause variants from unidentical alleles to be falsely matched. - -This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there's plenty of distinct alleles with >= 50% homology. - -So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` - -`truvari phab` is designed to remove variant representation inconsistencies through harmonization. 
By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison. - -Requirements ------------ -`truvari phab` runs mafft v7.505 via a command-line call, so the mafft executable must be available on the `$PATH`. Download it from [mafft](https://mafft.cbrc.jp/alignment/software/). - -Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use. - -Also, you can use the wavefront aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonious alignments. And while poa is more accurate than wfa and faster than mafft, it is less accurate than mafft. - -Example ------- -As an example, we'll use Truvari's test files in `repo_utils/test_files/phab*` which were created from real data over a tandem repeat at GRCh38 chr1:26399065-26401053 and translated to a small test genome with coordinates chr1:1-1988. - -* `phab_base.vcf.gz` - an 86 sample squared-off pVCF -* `phab_comp.vcf.gz` - a single sample's VCF -* `phab_ref.fa` - a subset of the GRCh38 reference - -This dataset is interesting because the `HG002` sample in `phab_base.vcf.gz` uses the same sequencing experiment ([HPRC](https://github.com/human-pangenomics/HPP_Year1_Assemblies)) as the sample `syndip` in `phab_comp.vcf.gz`, but processed with a different pipeline. And as we will see, the pipeline can make all the difference. - -To start, let's use `truvari bench` to see how similar the variant calls are in this region.
-```bash -truvari bench --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --sizemin 1 --sizefilt 1 \ - --bSample HG002 \ - --cSample syndip \ - --no-ref a \ - --output initial_bench -``` -This will compare all variants 1bp and larger (`--sizemin 1 --sizefilt 1`, which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows: -```json -{ - "TP-base": 5, - "TP-comp": 5, - "FP": 2, - "FN": 22, - "precision": 0.7142857142857143, - "recall": 0.18518518518518517, - "f1": 0.2941176470588235, -} -``` - -These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region. -```bash -truvari phab --base phab_base.vcf.gz \ - --comp phab_comp.vcf.gz \ - --bSample HG002 \ - --cSample syndip \ - --reference phab_ref.fa \ - --region chr1:700-900 \ - -o harmonized.vcf.gz -``` - -In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized. - -```bash -truvari bench -b harmonized.vcf.gz \ - -c harmonized.vcf.gz \ - -S 1 -s 1 \ - --no-ref a \ - --bSample HG002 \ - --cSample syndip \ - -o harmonized_bench/ -``` -Looking at `harmonized_bench/summary.txt` shows: -```json -{ - "TP-base": 8, - "TP-comp": 8, - "FP": 0, - "FN": 0, - "precision": 1.0, - "recall": 1.0, - "f1": 1.0, -} -``` -Now there is no difference between our two sets of variants in this region. - -For this variant call-set, `truvari phab` makes `truvari bench` overkill since the variants create identical haplotypes. In fact, we can benchmark simply by counting the genotypes.
-```bash -$ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c - 1 0/1 1/0 - 1 1/0 0/1 - 6 1/1 1/1 -``` -(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order) - -MSA ---- - -If you read the `truvari phab --help` , you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz` -```bash -truvari phab -b phab_base.vcf.gz \ - -f phab_ref.fa \ - -r chr1:700-900 \ - -o msa_example.vcf.gz -``` - -As a simple check, we can count the number of variants before/after `phab`: -```bash -bcftools view -r chr1:700-900 phab_base.vcf.gz | grep -vc "#" -bcftools view -r chr1:700-900 msa_example.vcf.gz | grep -vc "#" -``` -The `160` original variants given to `phab` became just `60`. - -Better yet, these fewer variants occur on fewer positions: -```bash - -bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l -bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l -``` -This returns that the variants were over `98` positions but now sit at just `16` - -We can also observe changes in the allele frequency after running `phab`: -```bash -bcftools +fill-tags -r chr1:700-900 phab_base.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -bcftools +fill-tags -r chr1:700-900 msa_example.vcf.gz | bcftools query -f "%AC\n" | sort -n | uniq -c -``` -The allele-count (AC) shows a 15% reduction in singletons and removal of all variants with an AF > 0.50 which would have suggested the reference holds a minor allele. -```txt - original phab - # AC # AC - 39 1 33 1 - 18 2 4 2 - 3 3 2 3 - 3 4 2 4 - 2 5 1 5 - ... 
-  3  69      1  35
-  1  89      1  40
-  8 109      1  53
-  1 132      1  56
-  1 150      1  81
-```
-
-(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.)
-
-`--align`
-=========
-By default, `phab` will build the haplotypes and use an external call to `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than `mafft` but less accurate, and slower than `wfa` but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes.
-
-Limitations
------------
-* Creating and aligning haplotypes is impractical for very long sequences and may be practically impossible for entire human chromosomes. Therefore, `truvari phab` should only be run on sub-regions.
-* By giving the variants new representations, variant counts will likely change.
-* Early testing of `phab` has been on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution.
-
diff --git a/docs/v4.3.1/refine.md b/docs/v4.3.1/refine.md
deleted file mode 100644
index eb2ef07a..00000000
--- a/docs/v4.3.1/refine.md
+++ /dev/null
@@ -1,142 +0,0 @@
-As described in the [[phab|phab]] documentation, a constraint on Truvari `bench` finding matches is that there needs to be some consistency in how the variants are represented.
To help automate the process of running Truvari `phab` on a benchmarking result and recomputing benchmarking performance on the harmonized variants, we present the tool `refine`.
-
-Quick Start
-===========
-
-Basic
------
-After making a `bench` result:
-```bash
-truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
-```
-Use `refine` on the `result/`:
-```bash
-truvari refine -r subset.bed -f ref.fa result/
-```
-
-The regions spanned by `subset.bed` should be short and focused around the breakpoints of putative FNs/FPs. Haplotypes from these boundaries are fed into a realignment procedure, which can take an extremely long time on e.g. entire chromosomes. Also, the genotypes within these regions must be phased.
-
-Whole genome
-------------
-After making a `bench` result:
-```bash
-truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/
-```
-Use `refine` on the `result/`, analyzing only the regions with putative FPs/FNs that would benefit from harmonization:
-```bash
-truvari refine -R -U -f ref.fa --regions result/candidate.refine.bed result/
-```
-
-Tandem Repeats
---------------
-For benchmarks such as the [GIAB TR](https://www.biorxiv.org/content/10.1101/2023.10.29.564632v1), a TR caller may analyze a different subset of regions. In order to avoid unnecessarily penalizing the performance with FNs from unanalyzed regions:
-
-```bash
-truvari bench -b base.vcf.gz -c comp.vcf.gz --includebed hc_regions.bed -o result/
-```
-
-Use `refine` on the `result/`, analyzing only the `hc_regions.bed` covered by the TR caller's `tool_regions.bed`:
-```bash
-truvari refine -f ref.fa --regions tool_regions.bed result/
-```
-
-Output
-======
-* `refine.variant_summary.json` - result of re-evaluating calls within the specified regions.
Same structure as [[summary.json|bench#summaryjson]]
-* `refine.regions.txt` - Tab-delimited file with per-region variant counts
-* `refine.region_summary.json` - Per-region performance metrics
-* `phab_bench/` - Bench results on the subset of variants harmonized
-
-To see an example output, look at the [test data](https://github.com/ACEnglish/truvari/tree/develop/answer_key/refine/refine_output_one)
-
-Using `refine.regions.txt`
-==========================
-| Column     | Description                             |
-| ---------- | --------------------------------------- |
-| chrom      | Region's chromosome                     |
-| start      | Region's start                          |
-| end        | Region's end                            |
-| in_tpbase  | Input's true positive base count        |
-| in_tp      | Input's true positive comparison count  |
-| in_fp      | Input's false positive count            |
-| in_fn      | Input's false negative count            |
-| refined    | Boolean for if region was re-evaluated  |
-| out_tpbase | Output's true positive base count       |
-| out_tp     | Output's true positive comparison count |
-| out_fn     | Output's false negative count           |
-| out_fp     | Output's false positive count           |
-| state      | True/False state of the region          |
-
-
-Performance by Regions
-======================
-
-Because `truvari phab` can alter variant counts during harmonization, one may wish to assess performance on a per-region basis rather than a per-variant basis. In the `refine.regions.txt`, the `state` column will have a TP/FN/FP value as defined by the following rules:
-
-```python
-false_pos = (data['out_fp'] != 0)
-false_neg = (data['out_fn'] != 0)
-any_false = false_pos | false_neg
-
-true_positives = (data['out_tp'] != 0) & (data['out_tpbase'] != 0) & ~any_false
-
-true_negatives = (data[['out_tpbase', 'out_tp', 'out_fn', 'out_fp']] == 0).all(axis=1)
-
-baseP = (data['out_tpbase'] != 0) | (data['out_fn'] != 0)
-compP = (data['out_tp'] != 0) | (data['out_fp'] != 0)
-```
-
-This logic has two edge cases to consider.
1) A region with at least one false-positive and one false-negative will be counted as both a false-positive and a false-negative. 2) Regions within `--refdist` may experience 'variant bleed' where they e.g. have an out_tp but no other variants because a neighboring region actually contains the corresponding `out_tpbase`. For the first case, we simply count the region twice and set its state in `refine.regions.txt` to "FP,FN". For the second case, we set the state to 'UNK' and ignore it when calculating the region summary. Future versions may figure out exactly how to handle (prevent?) 'UNK' regions.
-
-These by-region state counts are summarized and written to `refine.region_summary.json`. The definitions of the metrics inside this json are:
-| Key    | Definition                                   | Formula                         |
-|--------|----------------------------------------------|---------------------------------|
-| TP     | True Positive region count                   |                                 |
-| TN     | True Negative region count                   |                                 |
-| FP     | False Positive region count                  |                                 |
-| FN     | False Negative region count                  |                                 |
-| base P | Regions with base variant(s)                 |                                 |
-| base N | Regions without base variant(s)              |                                 |
-| comp P | Regions with comparison variant(s)           |                                 |
-| comp N | Regions without comparison variant(s)        |                                 |
-| PPV    | Positive Predictive Value (a.k.a. precision) | TP / comp P                     |
-| TPR    | True Positive Rate (a.k.a. recall)           | TP / base P                     |
-| TNR    | True Negative Rate (a.k.a. specificity)      | TN / base N                     |
-| NPV    | Negative Predictive Value                    | TN / comp N                     |
-| ACC    | Accuracy                                     | (TP + TN) / (base P + base N)   |
-| BA     | Balanced Accuracy                            | (TPR + TNR) / 2                 |
-| F1     | f1 score                                     | 2 * ((PPV * TPR) / (PPV + TPR)) |
-| UND    | Regions with an undetermined state           |                                 |
-
-Even though PPV is synonymous with precision, we use these abbreviated names when dealing with per-region performance in order to help users differentiate them from the by-variant performance reports.
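As a worked example of the table's formulas, here is a small sketch in plain Python. The counts are hypothetical, and each region is assumed to carry exactly one state (i.e. ignoring the "FP,FN" and 'UNK' edge cases described above):

```python
def region_summary(tp, tn, fp, fn):
    """Per-region metrics following the table's formulas (illustrative sketch)."""
    base_p = tp + fn  # regions with base variant(s)
    base_n = tn + fp  # regions without base variant(s)
    comp_p = tp + fp  # regions with comparison variant(s)
    comp_n = tn + fn  # regions without comparison variant(s)
    ppv = tp / comp_p
    tpr = tp / base_p
    tnr = tn / base_n
    return {"PPV": ppv,
            "TPR": tpr,
            "TNR": tnr,
            "NPV": tn / comp_n,
            "ACC": (tp + tn) / (base_p + base_n),
            "BA": (tpr + tnr) / 2,
            "F1": 2 * ((ppv * tpr) / (ppv + tpr))}

# Hypothetical counts: 90 TP, 5 TN, 5 FP, 10 FN regions
m = region_summary(tp=90, tn=5, fp=5, fn=10)
print(round(m["PPV"], 3), round(m["TPR"], 3), round(m["F1"], 3))  # 0.947 0.9 0.923
```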
-
-`--align`
-=========
-By default, Truvari will build the haplotypes and use an external call to `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa`, which performs a partial order alignment; it is faster than `mafft` but less accurate, and slower than `wfa` but more accurate. However, `poa` appears to be non-deterministic, which is not ideal for some benchmarking purposes.
-
-`--use-original-vcfs`
-=====================
-
-By default, `refine` will use the base/comparison variants from the `bench` result's `tp-base.vcf.gz`, `fn.vcf.gz`, `tp-comp.vcf.gz`, and `fp.vcf.gz` as input for `phab`. However, this is a filtered subset of the variants originally provided to `bench`, since it removes variants e.g. below `--sizemin` or excluded by `--passonly`.
-
-With the `--use-original-vcfs` parameter, all of the original calls from the input VCFs are fetched. This parameter is useful for recovering matches in situations where variants in one call set are split into two variants which are smaller than the minimum size analyzed by `bench`. For example, imagine a base VCF with a 20bp DEL, a comp VCF with two 10bp DELs, and `bench --sizemin 20`. `--use-original-vcfs` will consider the two 10bp comp variants during `phab` harmonization with the 20bp base DEL.
-
-
-`--regions`
-===========
-
-This parameter specifies which regions to re-evaluate. If it is not provided, the original `bench` result's `--includebed` is used. If both `--regions` and `--includebed` are provided, the `--includebed` is subset to only those regions intersecting `--regions`.
-
-This parameter is helpful for cases when the `--includebed` is not the same set of regions that a caller analyzes.
For example, if a TR caller only discovers short tandem repeats (STRs), but a benchmark has TRs of all lengths, it isn't useful to benchmark against the non-STR variants. Therefore, you can run `bench` on the full benchmark's regions (`--includebed`) and automatically subset to only the regions analyzed by the caller with `refine --regions`.
-
-Note that the larger these regions are, the slower MAFFT (used by `phab`) will run. Also, when performing the intersection as described above, there may be edge effects in the reported `refine.variant_summary.json`. For example, if a `--regions` entry partially overlaps an `--includebed` region, you may not be analyzing a subset of the calls looked at during the original `bench` run. Therefore, the `*summary.json` should be compared with caution.
-
-`--use-region-coords`
-=====================
-
-When intersecting `--includebed` with `--regions`, use the `--regions` coordinates. By default, `refine` will prefer the `--includebed` coordinates. However, the region's coordinates should be used when using the `candidate.refine.bed` to limit analysis to only the regions with putative FPs/FNs that would benefit from harmonization - for example, when performing whole genome benchmarking.
-
-`--reference`
-=============
-
-By default, the reference is pulled from the original `bench` result's `params.json`. If a reference wasn't used with `bench`, it must be specified with `refine` as it's used by `phab` to realign variants.
\ No newline at end of file
diff --git a/docs/v4.3.1/segment.md b/docs/v4.3.1/segment.md
deleted file mode 100644
index 953e5822..00000000
--- a/docs/v4.3.1/segment.md
+++ /dev/null
@@ -1,18 +0,0 @@
-Segmentation: Normalization of SVs into disjoint genomic regions
-
-For SVs with a span in the genome (currently only DELs), split the overlaps into disjoint regions. This is an experimental tool that explores the possibility of assisting SV association analysis.
-
-This tool adds an INFO field `SEGCNT`, which holds the number of original SVs that overlap the newly reported region. It also adds a FORMAT field `SEG`, which is the 'allele coverage' per-sample. For example, if a sample has two overlapping heterozygous deletions, the shared region will have `SEG=2`. If the two deletions were homozygous, then `SEG=4`.
-
-In the below example, we have three deletions found across three samples.
-
-![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/segment_example.png)
-
-The `segment` added annotations for the regions would then be:
-| Region | INFO/SEGCNT | S1/SEG | S2/SEG | S3/SEG |
-|--------|-------------|--------|--------|--------|
-| A      | 1           | 2      | 0      | 0      |
-| B      | 2           | 2      | 1      | 0      |
-| C      | 3           | 2      | 2      | 2      |
-| D      | 2           | 2      | 1      | 0      |
-| E      | 1           | 0      | 1      | 0      |
\ No newline at end of file
diff --git a/docs/v4.3.1/stratify.md b/docs/v4.3.1/stratify.md
deleted file mode 100644
index 353dcab5..00000000
--- a/docs/v4.3.1/stratify.md
+++ /dev/null
@@ -1,58 +0,0 @@
-`stratify` is a helper utility for counting variants within bed regions, which is essentially the same as running `bedtools intersect -c`. When working with benchmarking results, there are four VCFs to count (tp-base, tp-comp, fn, fp). Instead of running bedtools four times and collating the results, `stratify` can be given a single `bench` result directory to generate the counts.
-
-For example:
-```bash
-$ truvari stratify input.bed bench/
-chrom  start   end     tpbase  tp  fn  fp
-chr20  280300  280900  0       0   0   0
-chr20  100000  200000  1       1   0   0
-chr20  642000  642350  1       1   2   1
-```
-
-The output from this can then be parsed to generate more details:
-
-```python
-import pandas as pd
-import truvari
-
-df = pd.read_csv("stratify.output.txt", sep='\t')
-
-# The input.bed didn't have a header, so we couldn't use the `--header` parameter and need to name the columns
-df.columns = ['chrom', 'start', 'end', 'tpbase', 'tp', 'fn', 'fp']
-
-# Create the precision, recall, and f1 for each row
-metrics = df[["tpbase", "tp", "fn", "fp"]].apply((lambda x: truvari.performance_metrics(*x)), axis=1)
-
-# metrics is now a DataFrame with a single column of tuples, let's separate them into columns
-metrics = pd.DataFrame(metrics.to_list(), columns=["precision", "recall", "f1"])
-
-# Extend the dataframe's columns
-df = df.join(metrics)
-df.head()
-```
-Which gives the result:
-```
-   chrom   start     end  tpbase  tp  fn  fp  precision  recall        f1
-0  chr20  135221  239308       1   1   0   0       1.00    1.00  1.000000
-1  chr20  260797  465632       3   3   3   1       0.75    0.50  0.600000
-2  chr20  465866  622410       1   1   0   0       1.00    1.00  1.000000
-3  chr20  623134  655257       1   1   3   1       0.50    0.25  0.333333
-4  chr20  708338  732041       1   1   1   0       1.00    0.50  0.666667
-```
-
-```
-usage: stratify [-h] [-o OUT] [--header] [-w] [--debug] BED VCF
-
-Count variants per-region in vcf
-
-positional arguments:
-  BED                   Regions to process
-  VCF                   Truvari bench result directory or a single VCF
-
-optional arguments:
-  -h, --help            show this help message and exit
-  -o OUT, --output OUT  Output bed-like file
-  --header              Input regions have header to preserve in output
-  -w, --within          Only count variants contained completely within region boundaries
-  --debug               Verbose logging
-```
\ No newline at end of file
diff --git a/docs/v4.3.1/vcf2df.md b/docs/v4.3.1/vcf2df.md
deleted file mode 100644
index 14d9f06c..00000000
--- a/docs/v4.3.1/vcf2df.md
+++ /dev/null
@@ -1,81 +0,0 @@
-We enjoy using
[pandas](https://pandas.pydata.org/)/[seaborn](https://seaborn.pydata.org/) for python plotting, so we've made the command `truvari vcf2df`. This will turn a VCF into a pandas DataFrame and save it to a file using joblib. The resulting DataFrame will always have the columns:
-* chrom : variant chromosome
-* start : 0-based start from pysam.VariantRecord.start
-* end : 0-based end from pysam.VariantRecord.stop
-* id : VCF column ID
-* svtype : SVTYPE as determined by `truvari.entry_variant_type`
-* svlen : SVLEN as determined by `truvari.entry_size`
-* szbin : SVLEN's size bin as determined by `truvari.get_sizebin`
-* qual : VCF column QUAL
-* filter : VCF column FILTER
-* is_pass : boolean of whether the filter is empty or PASS
-
-Optionally, `vcf2df` can attempt to pull `INFO` and `FORMAT` fields from the VCF and put each field into the DataFrame as a new column. For FORMAT fields, the VCF header definition's `Number` is considered and multiple columns may be added. For example, the `AD` field, typically holding allele depth, has `Number=A`, indicating that there will be one value per allele. Truvari assumes that all VCFs hold one allele per line, so only 2 alleles are described per line: the reference and the alternate allele. Therefore, two columns are added to the DataFrame, `AD_ref` and `AD_alt`, corresponding to the 0th and 1st values from the AD field's list of values. Similarly, for PL (genotype likelihood) with `Number=G`, there are three values and the columns created are named `PL_ref`, `PL_het`, `PL_hom`.
-
-After you've created your benchmarking results with `truvari bench`, you'll often want to plot different views of your results.
`vcf2df --bench-dir` can parse a truvari output directory's multiple VCF files and add a 'state' column:
-* state : The truvari state assigned to the variant
-  * tpbase : Parsed from the tp-base.vcf
-  * tp : Parsed from the tp-comp.vcf
-  * fp : Parsed from the fp.vcf
-  * fn : Parsed from the fn.vcf
-
-The created DataFrame is saved into a joblib file, which can then be plotted as simply as:
-```python
-import joblib
-import seaborn as sb
-import matplotlib.pyplot as plt
-
-data = joblib.load("test.jl")
-p = sb.countplot(data=data[data["state"] == 'tp'], x="szbin", hue="svtype", hue_order=["DEL", "INS"])
-plt.xticks(rotation=45, ha='right')
-p.set(title="True Positives by svtype and szbin")
-```
-![](https://github.com/spiralgenetics/truvari/blob/develop/imgs/truv2df_example.png)
-
-This enables concatenation of Truvari results across multiple benchmarking experiments for advanced comparisons. For example, imagine there are multiple parameter sets used for SV discovery over multiple samples. After running `truvari bench` on each of the results with the output directories named `params/sample/` and converting each to a DataFrame with `truvari vcf2df`, we can concatenate the saved joblib DataFrames with:
-
-```python
-import glob
-import joblib
-import pandas as pd
-
-files = glob.glob("*/*/data.jl")
-dfs = []
-for f in files:
-    params, sample, frame = f.split('/')
-    d = joblib.load(f)
-    d["params"] = params
-    d["sample"] = sample
-    dfs.append(d)
-df = pd.concat(dfs)
-joblib.dump(df, "results.jl")
-```
-
-To facilitate range queries, PyRanges is helpful. `vcf2df` results can be parsed quickly by pyranges with:
-```python
-import pyranges
-
-result = pyranges.PyRanges(df.rename(columns={'chrom': "Chromosome", "start": "Start", "end": "End"}))
-```
-
-```
-usage: vcf2df [-h] [-b] [-i] [-f] [-s SAMPLE] [-n] [-S] [-c LVL] [--debug] VCF JL
-
-Takes a vcf and creates a data frame.
Can parse a bench output directory
-
-positional arguments:
-  VCF                   VCF to parse
-  JL                    Output joblib to save
-
-optional arguments:
-  -h, --help            show this help message and exit
-  -b, --bench-dir       Input is a truvari bench directory
-  -i, --info            Attempt to put the INFO fields into the dataframe
-  -f, --format          Attempt to put the FORMAT fields into the dataframe
-  -s SAMPLE, --sample SAMPLE
-                        SAMPLE name to parse when building columns for --format
-  -n, --no-prefix       Don't prepend sample name to format columns
-  -S, --skip-compression
-                        Skip the attempt to optimize the dataframe's size
-  -c LVL, --compress LVL
-                        Compression level for joblib 0-9 (3)
-  --debug               Verbose logging
-```
\ No newline at end of file
diff --git a/docs/v4.0.0/vcf2df.md b/docs/vcf2df.md
similarity index 100%
rename from docs/v4.0.0/vcf2df.md
rename to docs/vcf2df.md