Home
VASE stands for Variant Annotation, Segregation and Exclusion. It provides a program to filter and annotate variant data (in VCF/BCF format) according to user-specified criteria. Example use cases include:
- Annotating/filtering variants on frequency data from public databases (e.g. gnomAD, dbSNP)
- Selecting variants based on VEP consequence annotations
- Filtering variants based on presence in cases vs control samples
- Filtering variants based on familial segregation/inheritance patterns
- Filtering variants/genotypes based on depth data and quality metrics
VASE requires python3 and can be installed using pip or the setup.py script included in the repo. Assuming your system has git and pip installed, the easiest way to install VASE with full functionality is as follows:
pip3 install git+https://github.com/david-a-parry/vase.git#egg=project[BGZIP,REPORTER,MYGENE] --user
For more detailed installation instructions see the README.
Basic features of VASE should work with any correctly formatted VCF. However, assuming you want to analyze genotype data in cohorts and families, you should ensure that you genotype samples together. For example, if you want to detect de novo variants in a trio, it is essential that parents and children are genotyped at the same time. Joint-genotyping also presents more opportunities to filter out artefacts, for example by looking at allele depth data in control samples called as homozygous reference (0/0).
VASE should work with joint genotyping output from GATK, freebayes, Strelka2 and bcftools. If using a tool that does not by default output genotype quality (GQ) scores (such as freebayes and bcftools) or AD FORMAT fields (such as bcftools) it is recommended to turn on these annotations - for example:
# example freebayes command to output GQ scores
freebayes -f ref.fasta --standard-filters --genotype-qualities child.bam mum.bam dad.bam | bcftools view -O b -o variants.bcf
# example bcftools command to output GQ and AD fields
bcftools mpileup -Ou -a FMT/AD -f ref.fasta child.bam mum.bam dad.bam | bcftools call -mv -f GQ -O b -o variants.bcf
Standard GATK and Strelka workflows should output the necessary annotations by default.
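Before running family-based filters it can be useful to confirm that a VCF actually declares the GQ and AD FORMAT fields mentioned above. The following is a minimal sketch, not part of VASE: the function name `missing_format_fields` and the example header are hypothetical, and it only inspects header text that you have already read into a string (for example from `bcftools view -h`).

```python
import re

def missing_format_fields(header_text, required=("GQ", "AD")):
    """Return the required FORMAT IDs that the VCF header does not declare."""
    declared = set(
        re.findall(r"^##FORMAT=<ID=([^,>]+)", header_text, flags=re.MULTILINE)
    )
    return [field for field in required if field not in declared]

# Hypothetical header from a caller that was run without GQ output.
HEADER = """\
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths">
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tchild\tmum\tdad
"""

missing = missing_format_fields(HEADER)
print(missing)  # ['GQ']
```

A non-empty result suggests re-running the variant caller with the relevant annotation options switched on.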
VASE has limited support for structural variants, currently only supporting output from Manta. The default settings for Manta will produce a VCF compatible with VASE. For family studies, as for short-variant genotypers, all samples should be genotyped together.
VASE supports filtering using functional annotations from Ensembl's Variant Effect Predictor (VEP). To have all the relevant annotations, please run VEP with the --everything flag and --fasta argument along with any desired plugins. You must also use the --vcf flag to make VEP produce VCF format output. While VASE should be able to decipher consequences correctly at multiallelic variant sites, it is also a good idea to run VEP with the --allele_number flag to prevent any ambiguity.
Use exome data from gnomAD to remove variants with an allele frequency of 1% or higher (in any gnomAD population).
vase -i input.bcf --freq 0.01 -g gnomad.exomes.r2.1.1.sites.vcf.bgz -o rare.bcf
Annotate gnomAD and dbSNP frequencies but do not filter.
vase -i input.bcf \
-g gnomad.exomes.r2.1.1.sites.vcf.bgz gnomad.genomes.r2.1.1.sites.vcf.bgz \
-d dbSNP151.vcf.gz \
-o annotated.bcf
Annotating with large data files such as gnomAD can be slow. The frequency information in an annotated VCF can be used to filter after annotation, which is faster if you are likely to apply several different filters to the same VCF.
vase -i annotated.bcf --freq 0.01 -o one_pc_filter.bcf
vase -i annotated.bcf --freq 0.05 -o five_pc_filter.bcf
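The idea behind the two commands above (annotate once, then apply different thresholds) can be illustrated with a small sketch. This is not VASE's implementation: the function name, the population keys, and the frequencies are hypothetical, and treating a missing frequency as "not observed" is an assumption of this example.

```python
from typing import Mapping, Optional

def passes_freq_filter(pop_freqs: Mapping[str, Optional[float]],
                       max_freq: float) -> bool:
    """Keep a variant only if it is rarer than max_freq in every population.

    A frequency of None (no record for that population) does not exclude
    the variant; this is an assumption made for the sketch.
    """
    return all(af is None or af < max_freq for af in pop_freqs.values())

# Hypothetical per-population allele frequencies for two annotated variants.
variants = {
    "common": {"AF_afr": 0.002, "AF_nfe": 0.03, "AF_eas": None},
    "rare": {"AF_afr": 0.002, "AF_nfe": 0.0004, "AF_eas": None},
}

# The same annotated records can be filtered at several thresholds.
one_pc = [name for name, freqs in variants.items()
          if passes_freq_filter(freqs, 0.01)]
five_pc = [name for name, freqs in variants.items()
           if passes_freq_filter(freqs, 0.05)]
```

Here the "common" variant is dropped at the 1% threshold because one population reaches 3%, but it is retained at 5%.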
Several features of VASE rely on VEP annotations. It is recommended to run VEP with the '--everything' flag (and optionally LoF, dbNSFP and dbscSNV plugins) for best use of VASE features.
To output only variants with a HIGH impact consequence for at least one overlapping transcript:
vase -i annotated.vep.bcf --impact HIGH
As above but only if the HIGH impact variant is in a canonical transcript:
vase -i annotated.vep.bcf --impact HIGH --canonical
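For readers unfamiliar with VEP's VCF output, the sketch below shows roughly what an impact/canonical check involves. VEP writes one pipe-delimited entry per transcript into the CSQ INFO field, with the field order given in the VCF header. The function names, the shortened field list, and the example annotation are hypothetical, and this is an illustration rather than VASE's own code.

```python
def parse_csq(csq_value, csq_format):
    """Split a CSQ INFO value into one dict per transcript annotation."""
    fields = csq_format.split("|")
    return [dict(zip(fields, entry.split("|")))
            for entry in csq_value.split(",")]

def has_impact(csq_value, csq_format, impacts, canonical_only=False):
    """True if any (optionally canonical) transcript has one of the impacts."""
    for annotation in parse_csq(csq_value, csq_format):
        if canonical_only and annotation.get("CANONICAL") != "YES":
            continue
        if annotation.get("IMPACT") in impacts:
            return True
    return False

# Shortened, hypothetical CSQ format; real headers list many more fields.
CSQ_FORMAT = "Allele|Consequence|IMPACT|SYMBOL|Feature|CANONICAL"
# Two transcripts: a HIGH impact non-canonical one and a MODERATE canonical one.
csq = ("T|stop_gained|HIGH|GENE1|ENST0001|,"
       "T|missense_variant|MODERATE|GENE1|ENST0002|YES")

any_high = has_impact(csq, CSQ_FORMAT, {"HIGH"})
canonical_high = has_impact(csq, CSQ_FORMAT, {"HIGH"}, canonical_only=True)
```

In this example `any_high` is true but `canonical_high` is false, because the HIGH impact consequence falls in a non-canonical transcript.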
Output variants occurring de novo in the affected child(ren) of one or more parent-child trios:
vase -i input.bcf --ped trios.ped --de_novo -o naive_dnms.bcf
As above, but using some sensible filters to reduce false-positives:
vase -i input.bcf \
--ped trios.ped \
--de_novo \
--het_ab 0.27 \
--control_het_ab 0.05 \
--dp 10 \
--gq 20 \
--freq 1e-5 \
-o dnms.bcf
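To give a feel for what genotype-level filters of this kind do, here is a minimal sketch of a de novo check on a single biallelic site. It does not reproduce VASE's implementation. The class, function, and parameter names are hypothetical, and the way the thresholds are applied (a minimum ALT fraction in the child, a maximum ALT fraction in homozygous reference parents) is an assumed interpretation; in particular, VASE may apply --control_het_ab differently from the parent check used here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Call:
    gt: Tuple[int, int]  # allele indices, e.g. (0, 1) for a heterozygote
    ad: Tuple[int, int]  # (ref_reads, alt_reads) from the AD FORMAT field
    dp: int              # read depth
    gq: int              # genotype quality

def alt_fraction(call: Call) -> float:
    """Fraction of reads supporting the ALT allele (0.0 if no reads)."""
    total = sum(call.ad)
    return call.ad[1] / total if total else 0.0

def looks_de_novo(child: Call, mother: Call, father: Call,
                  min_child_ab: float = 0.27, max_parent_ab: float = 0.05,
                  min_dp: int = 10, min_gq: int = 20) -> bool:
    """Assumed logic: confident het child, confident hom-ref parents."""
    if any(c.dp < min_dp or c.gq < min_gq for c in (child, mother, father)):
        return False
    if sorted(child.gt) != [0, 1] or alt_fraction(child) < min_child_ab:
        return False
    for parent in (mother, father):
        # Reject if a parent is not 0/0 or shows too many ALT reads.
        if parent.gt != (0, 0) or alt_fraction(parent) > max_parent_ab:
            return False
    return True

# Hypothetical calls for one site in a trio.
child = Call(gt=(0, 1), ad=(12, 10), dp=22, gq=99)
mum = Call(gt=(0, 0), ad=(30, 0), dp=30, gq=60)
dad_clean = Call(gt=(0, 0), ad=(25, 0), dp=25, gq=50)
dad_alt_reads = Call(gt=(0, 0), ad=(20, 3), dp=23, gq=40)
```

With these values the site passes when the father has no ALT reads, but is rejected when 3 of his 23 reads support the ALT allele, which is the kind of artefact that joint genotyping makes visible.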
Output rare variants that have HIGH or MODERATE VEP impact in canonical transcripts and match recessive inheritance in families:
vase -i annotated.vep.bcf \
--freq 0.005 \
--impact HIGH MODERATE \
--canonical \
--ped trios.ped \
--recessive \
-o recessives.bcf
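As a rough illustration of segregation filtering, the sketch below reads affection status from PED-formatted text (six whitespace-separated columns: family, individual, father, mother, sex, phenotype, where phenotype 2 means affected and 1 unaffected) and tests the simplest recessive pattern: every affected sample is homozygous for the ALT allele and no unaffected sample is. The function names and sample IDs are hypothetical. Compound heterozygosity is not covered, and this is not a description of how VASE itself evaluates recessive inheritance.

```python
def read_phenotypes(ped_text):
    """Return (affected, unaffected) sample ID lists from PED-formatted text."""
    affected, unaffected = [], []
    for line in ped_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # Assumes at least six columns per row.
        _fam, sample, _father, _mother, _sex, pheno = line.split()[:6]
        if pheno == "2":
            affected.append(sample)
        elif pheno == "1":
            unaffected.append(sample)
    return affected, unaffected

def fits_homozygous_recessive(alt_counts, affected, unaffected):
    """alt_counts maps sample ID to ALT allele count (0, 1, 2) or None.

    No-calls in affected samples fail the test; no-calls in unaffected
    samples are tolerated. Both choices are assumptions of this sketch.
    """
    if not affected or any(alt_counts.get(s) != 2 for s in affected):
        return False
    return all(alt_counts.get(s) != 2 for s in unaffected)

# Hypothetical trio in PED format.
PED = """\
#family individual father mother sex phenotype
fam1 child1 dad1 mum1 1 2
fam1 dad1 0 0 1 1
fam1 mum1 0 0 2 1
"""

affected, unaffected = read_phenotypes(PED)
carrier_parents = {"child1": 2, "mum1": 1, "dad1": 1}
hom_parent = {"child1": 2, "mum1": 2, "dad1": 1}
```

Here `carrier_parents` fits the pattern, while `hom_parent` does not because an unaffected parent is also homozygous for the ALT allele.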
Optionally combine variants and write report in either XLSX or JSON format:
#concat variants to save running the reporter on both recessive and de novo outputs separately
bcftools concat -O b -o dnms_and_recessives.bcf dnms.bcf recessives.bcf
#write report in Excel format
vase_reporter dnms_and_recessives.bcf dnms_and_recessives.report.xlsx --ped trios.ped
#alternatively, write report in JSON format
vase_reporter dnms_and_recessives.bcf dnms_and_recessives.report.json -o json --ped trios.ped