
Multiallelics

David A. Parry edited this page Jun 23, 2020 · 2 revisions

Multiallelic Variants

VASE does not require that multiallelic variants be decomposed and in some cases doing so would lose information that might be used by some of VASE’s filtering methods. Whether you choose to decompose multiallelic variants before running VASE will depend on the type of analysis you are doing.

Generally speaking, if you are doing analysis which looks at sample genotypes and filters within the same VCF (e.g. familial segregation analysis, case-control analysis) my recommendation would be not to decompose variants; instead vase_reporter can be used to separate out the alleles of interest once you have finished your segregation filtering analysis. However, if you are only interested in site-level information, variant decomposition may be useful to separate out e.g. a common SNP at the same site as a rare variant.

Regardless, it is useful to understand how these variants are handled by VASE.

Let’s assume our input VCF contains the following variant:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 1254772 . C G,T 9137.92 PASS AC=3,1;AF=0.5,0.167;AN=6;VASE_gnomAD_AF_nfe=0.12,3e-6;…

One ALT allele (‘G’) is common in non-Finnish Europeans in gnomAD and has previously been annotated as such by VASE (VASE_gnomAD_AF_nfe=0.12). The other ALT allele (‘T’) is rare in this population (VASE_gnomAD_AF_nfe=3e-6). We could run VASE with the option to remove variants with an allele frequency >0.01 as follows:

vase -i input.vcf --freq 0.01

…but this variant is retained because one of the alleles (‘T’) is under our filtering threshold. However, if we are using vase to perform segregation filtering (e.g. looking for recessive variants), using this non-decomposed VCF should be fine. Assume this variant is part of a larger VCF and we run the following command:

vase -i input.vcf --freq 0.01 --csq default --recessive --ped family.ped

The ‘G’ allele will be ignored when looking for recessive alleles because it is above our frequency filtering threshold. However, if the ‘T’ allele has a CSQ annotation from VEP matching our default values and fits a recessive inheritance model according to our sample genotypes and PED file, the variant will be retained and a label added to indicate that the ‘T’ allele is part of a recessive (e.g. compound heterozygous or homozygous) combination of alleles. Running vase_reporter on the output can identify and output this variant for you, making it clear which allele fits your specified recessive inheritance model.
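The per-allele behaviour described above can be sketched in a few lines of Python. This is only an illustration of the logic, not VASE's actual code: the helper names `parse_info` and `alleles_passing_freq` are hypothetical, and how VASE treats a frequency exactly equal to the threshold, or a missing annotation, is assumed here rather than taken from its source.

```python
# Illustrative sketch only: this is NOT VASE's implementation.
# parse_info and alleles_passing_freq are hypothetical helper names.

def parse_info(info):
    """Turn a VCF INFO string into a dict of key -> raw string value."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
    return fields


def alleles_passing_freq(alts, info, freq_key, max_freq):
    """Return the ALT alleles whose annotated frequency does not exceed max_freq."""
    freqs = parse_info(info)[freq_key].split(",")
    passing = []
    for alt, freq in zip(alts.split(","), freqs):
        # Assumption: "." (no annotation) counts as passing, and a frequency
        # exactly equal to the threshold is kept.
        if freq == "." or float(freq) <= max_freq:
            passing.append(alt)
    return passing


# The multiallelic record from the example above (trailing "…" omitted).
info = "AC=3,1;AF=0.5,0.167;AN=6;VASE_gnomAD_AF_nfe=0.12,3e-6"
kept = alleles_passing_freq("G,T", info, "VASE_gnomAD_AF_nfe", 0.01)

print(kept)           # ALT alleles still eligible for segregation filtering
print(len(kept) > 0)  # the record is retained if at least one allele remains
```

Here only ‘T’ survives the frequency check, so the record as a whole is kept while ‘G’ plays no part in any later inheritance test.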

Of course, if we were to have a decomposed VCF input like below:

#CHROM POS ID REF ALT QUAL FILTER INFO
1 1254772 . C G 9137.92 PASS AC=3;AF=0.5;AN=6;VASE_gnomAD_AF_nfe=0.12;…
1 1254772 . C T 9137.92 PASS AC=1;AF=0.167;AN=6;VASE_gnomAD_AF_nfe=3e-6;…

…the same command would only output the second, rare allele/variant. However, consider variants like the following:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Child Mum Dad
1 22247679 . CGC C,CGCG,CGCGC,CGCGCGCG 137.92 PASS GT 0/2 0/3 1/4

Multiallelic indels like this are often sources of false positives. In this simplified example, the child appears to have a de novo occurrence of ALT allele 2 (genotype 0/2), as neither parent carries this allele. In reality, this looks like a variant in a repetitive sequence, and it is likely that a genotyping error has called the same insertion slightly differently in the child and a parent. We can easily ignore variants like this by adding ‘--max_alts 2’ to our VASE command. The multiallelic origin of the call, however, is much less obvious in a decomposed and normalised version of this record:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Child Mum Dad
1 22247679 . CGC C 137.92 PASS GT 0/. 0/. 1/.
1 22247679 . C CGC 137.92 PASS GT 0/. 0/1 ./.
1 22247681 . C CGCGCG 137.92 PASS GT 0/. 0/. ./1
1 22247681 . C CG 137.92 PASS GT 0/1 0/. ./.
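To make the contrast concrete, the sketch below applies two naive checks to the records shown above: an ALT allele count and a search for alleles called in the child but in neither parent. It is an illustration only, not how VASE works internally. The function names are hypothetical, and it assumes that a threshold such as ‘--max_alts 2’ excludes records with more than two ALT alleles.

```python
# Illustrative sketch only: NOT VASE's implementation; all names are hypothetical.

def count_alts(alt_field):
    """Number of ALT alleles listed in a VCF ALT column."""
    return len(alt_field.split(","))


def called_alleles(gt):
    """Set of allele indices in a GT string, ignoring missing ('.') calls."""
    return {a for a in gt.replace("|", "/").split("/") if a != "."}


def apparent_de_novo(child_gt, parent_gts):
    """ALT allele indices called in the child but in neither parent."""
    parental = set()
    for gt in parent_gts:
        parental |= called_alleles(gt)
    return sorted(a for a in called_alleles(child_gt)
                  if a != "0" and a not in parental)


MAX_ALTS = 2  # assumed meaning: skip records with more than this many ALT alleles

# (ALT column, child GT, [mum GT, dad GT]) copied from the records above
multiallelic = ("C,CGCG,CGCGC,CGCGCGCG", "0/2", ["0/3", "1/4"])
decomposed = [
    ("C", "0/.", ["0/.", "1/."]),
    ("CGC", "0/.", ["0/1", "./."]),
    ("CGCGCG", "0/.", ["0/.", "./1"]),
    ("CG", "0/1", ["0/.", "./."]),
]

for alt, child, parents in [multiallelic] + decomposed:
    n_alts = count_alts(alt)
    print(alt, n_alts, n_alts > MAX_ALTS, apparent_de_novo(child, parents))
```

The original record has four ALT alleles, so an ALT count threshold flags it. After decomposition every record has a single ALT allele, and the last one still shows an apparent de novo call in the child with nothing left to indicate that it came from a messy multiallelic site.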

Moreover, INFO annotations may become inaccurate upon decomposition. By avoiding decomposition we avoid losing information. Even if we do not filter using the ‘--max_alts’ option, retaining the variant in the format in which it was originally genotyped means we will know that an allele originated from a multiallelic site should it appear in our filtered output, and can interpret it accordingly.
