VASE (Variant Annotation, Segregation and Exclusion) is a program for family or cohort based rare-disease sequencing studies.
VASE can be used to filter VCF files based on allele frequency data, functional consequences from VEP, presence/absence of variants in cases vs controls and inheritance patterns within families. It is designed primarily for use in rare disease cohort or familial studies.
In order to make the most of the functions VASE provides, you will require a multi-sample, VEP annotated VCF. In order to confidently identify variants segregating within families consistent with dominant/recessive/de novo inheritance patterns, your VCF should have been made by calling all of your samples simultaneously (e.g. using the GATK joint-calling workflow).
Detailed instructions and examples to follow in the VASE wiki.
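As a quick sanity check before running VASE, the sketch below inspects VCF header lines to count samples and look for a VEP annotation field. It is not part of VASE. The function name `summarise_vcf_header` is hypothetical, and it assumes VEP wrote its annotations to the default `CSQ` INFO field.

```python
def summarise_vcf_header(header_lines):
    """Return (number of samples, True if a CSQ INFO field is declared)."""
    n_samples = 0
    has_csq = False
    for line in header_lines:
        line = line.rstrip("\n")
        # Assumption: VEP annotations use the default 'CSQ' INFO field name.
        if line.startswith("##INFO=<ID=CSQ,"):
            has_csq = True
        elif line.startswith("#CHROM"):
            # Nine fixed columns (CHROM to FORMAT) precede the sample columns.
            columns = line.split("\t")
            n_samples = max(0, len(columns) - 9)
            break
    return n_samples, has_csq


header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tchild\tmother\tfather",
]
n_samples, has_csq = summarise_vcf_header(header)
print("samples:", n_samples, "VEP annotated:", has_csq)
```

A header with two or more samples and a CSQ field suggests the file is suitable for segregation filtering, although it cannot tell you whether the samples were jointly called.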
VASE requires Python 3 and has been tested with Python 3.5 and 3.6. The modules 'pysam' and 'natsort' from PyPI are required and should be installed for you if you follow the instructions below. You may also wish to install biopython, which is required if you want to write missing CADD/SpliceAI scores to bgzipped output.
To install the vase script to $HOME/.local/bin (or possibly on Mac OS /Users/$USER/Library/Python/3.*/bin/) the simplest way is to use pip:
pip3 install git+https://github.com/david-a-parry/vase.git --user
To install with the extra modules required for bgzip output and vase_reporter functionality (recommended) use the following:
pip3 install git+https://github.com/david-a-parry/vase.git#egg=vase[BGZIP,REPORTER,MYGENEINFO] --user
To install system-wide, remove the --user flag and ensure you have root privileges (e.g. using sudo).
Alternatively, you may first clone this repository:
git clone https://github.com/david-a-parry/vase.git
Alternatively use the 'Clone or download' button above. From the newly created vase directory you may install either by running the setup.py script as follows:
python3 setup.py install --user
or by using pip, if installed:
#without extras
pip3 install . --user
#with extras (recommended)
pip3 install .[BGZIP,REPORTER,MYGENEINFO] --user
If you have root privileges you can install system wide as follows:
sudo python3 setup.py install
or:
sudo pip3 install .
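After installing, one way to confirm that the modules named above can be found is a short standard-library check such as the sketch below. The helper name `missing_modules` is hypothetical; biopython is looked up under its import name `Bio`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["pysam", "natsort"]
optional = ["Bio"]  # biopython is imported as 'Bio'

print("missing required modules:", missing_modules(required))
print("missing optional modules:", missing_modules(optional))
```

An empty list for the required modules means both were found in the current Python environment.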
usage: vase -i VCF [-o OUTPUT] [-r REPORT_PREFIX]
[-burden_counts BURDEN_COUNTS] [-gnomad_burden] [-v QUAL]
[-p | --keep_filters KEEP_FILTERS [KEEP_FILTERS ...]]
[--exclude_filters EXCLUDE_FILTERS [EXCLUDE_FILTERS ...]]
[-t TYPE [TYPE ...]] [-max_alts MAX_ALT_ALLELES]
[--filter_asterisk_only_calls] [-af AF] [-min_af MIN_AF]
[-filtering_an FILTERING_AN] [-min_an MIN_AN] [-ac AC]
[-min_ac MIN_AC] [--info_filters INFO_FILTERS [INFO_FILTERS ...]]
[-c [CSQ [CSQ ...]]] [--impact IMPACT [IMPACT ...]] [--canonical]
[--flagged_features] [--biotypes BIOTYPE [BIOTYPE ...]]
[--feature_blacklist FEATURE_BLACKLIST] [--loftee]
[-m MISSENSE_FILTERS [MISSENSE_FILTERS ...]]
[--filter_unpredicted] [--keep_if_any_damaging]
[--splice_filters SPLICE_FILTERS [SPLICE_FILTERS ...]]
[--splice_filter_unpredicted] [--splice_keep_if_any_damaging]
[--retain_labels Label=Value [Label=Value ...]] [--no_vep_freq]
[--vep_af VEP_AF [VEP_AF ...]] [--pathogenic] [--no_conflicted]
[--g2p G2P] [--check_g2p_consequence] [--check_g2p_inheritance]
[--region REGION [REGION ...] | --bed BED | --gene_bed BED]
[--stream] [--exclude_regions] [--cadd_files FILE [FILE ...]]
[-cadd_dir DIR] [--missing_cadd_scores FILE] [--cadd_phred FLOAT]
[--cadd_raw FLOAT] [-d VCF [VCF ...]] [-g VCF [VCF ...]]
[--gnomad_pops POP [POP ...]]
[--vcf_filter VCF,ID[,INFO_FIELD ...] [VCF,ID[,INFO_FIELD ...]
...]] [--dng_vcf DNG_VCF [DNG_VCF ...]] [-f FREQ]
[--min_freq MIN_FREQ]
[--max_gnomad_homozygotes MAX_GNOMAD_HOMOZYGOTES] [-b dbSNP_build]
[--max_build dbSNP_build] [--filter_known] [--filter_novel]
[--clinvar_path] [-ignore_existing]
[--splice_ai_vcfs VCF [VCF ...]] [--splice_ai_min_delta DELTA]
[--splice_ai_max_delta DELTA] [--missing_splice_ai_scores FILE]
[--cases SAMPLE_ID [SAMPLE_ID ...]]
[--controls SAMPLE_ID [SAMPLE_ID ...]] [-ped PED] [-gq GQ]
[-dp DP] [-max_dp MAX_DP] [-het_ab AB] [-hom_ab AB]
[-con_gq CONTROL_GQ] [-con_dp CONTROL_DP]
[-con_max_dp CONTROL_MAX_DP] [-con_het_ab AB] [-con_hom_ab AB]
[-con_ref_ab AB] [-sv_gq SV_GQ] [-sv_dp SV_DP]
[-sv_max_dp SV_MAX_DP] [-sv_het_ab AB] [-sv_hom_ab AB]
[-sv_con_gq SV_CONTROL_GQ] [-sv_con_dp SV_CONTROL_DP]
[-sv_con_max_dp SV_CONTROL_MAX_DP] [-sv_con_het_ab AB]
[-sv_con_hom_ab AB] [-sv_con_ref_ab AB]
[--duphold_del_dhffc DHFFC] [--duphold_dup_dhbfc DHBFC]
[--control_duphold_del_dhffc DHFFC]
[--control_duphold_dup_dhbfc DHBFC] [--n_cases N_CASES]
[--n_controls N_CONTROLS] [--confirm_control_gts] [--biallelic]
[--de_novo] [--dominant] [--min_families MIN_FAMILIES]
[--singleton_recessive SAMPLE_ID [SAMPLE_ID ...]]
[--singleton_dominant SAMPLE_ID [SAMPLE_ID ...]]
[--seg_controls SAMPLE_ID [SAMPLE_ID ...]] [--strict_recessive]
[--prog_interval N] [--log_progress] [--no_progress] [--quiet]
[--debug] [--no_warnings] [--silent] [-h]
Variant annotation, segregation and exclusion.
Required Arguments:
-i VCF, --input VCF Input VCF filename
Output Arguments:
-o OUTPUT, --output OUTPUT
Filename for VCF output. If this ends in .gz or
.bgz the output will be BGZIP compressed.
Default = STDOUT
-r REPORT_PREFIX, --report_prefix REPORT_PREFIX
DEPRECATED - use the 'vase_reporter' program
provided alongside vase instead.
Prefix for segregation summary report output
files. If either --biallelic, --de_novo or
--dominant options are in effect this option will
write summaries for segregating variants to files
with the respective suffixes of
'_recessive.report.tsv', '_de_novo.report.tsv' and
'_dominant.report.tsv'.
-burden_counts BURDEN_COUNTS, --burden_counts BURDEN_COUNTS
File for outputting 'burden counts' per
transcript. If specified, the number of alleles
passing specified filters will be counted for
each transcript identified. Requires your VCF
input to be annotated with Ensembl's VEP. Note
that if --cases or --controls are specified when
using this argument, variants will not be filtered
on presence in cases/controls; instead counts will
be written for cases and controls to this file.
-gnomad_burden, --gnomad_burden
If using --burden_counts, use this flag to
indicate that the input is from gnomAD and should
be parsed per population.
Annotation File Arguments:
--cadd_files FILE [FILE ...], -cadd_files FILE [FILE ...]
One or more tabix indexed CADD annotation files
(such as those found at
http://cadd.gs.washington.edu/download). Variants
in your input that match any scored variant in
these files will have the CADD RawScore and PHRED
values added to the INFO field, one per ALT
allele. Alleles/variants can be filtered on these
scores using the --cadd_phred or --cadd_raw
options.
-cadd_dir DIR, --cadd_directory DIR
Directory containing one or more tabix indexed
CADD annotation files to be used as above. Only
files with '.gz' or '.bgz' extensions will be
included.
--missing_cadd_scores FILE
Filename to output variants that are not found
in CADD annotation files. Output will be gzip
compressed and in a format suitable for uploading
to https://cadd.gs.washington.edu/score for
scoring (or for scoring locally).
--cadd_phred FLOAT, -cadd_phred FLOAT
CADD PHRED score cutoff. Variants with a CADD
PHRED score below this value will be filtered.
Only used with annotations from files supplied to
--cadd_files or --cadd_dir arguments or a
pre-annotated CADD_PHRED_score INFO field. To
filter on CADD scores annotated using the VEP
dbNSFP plugin use the --missense_filters option.
--cadd_raw FLOAT, -cadd_raw FLOAT
CADD RawScore cutoff. Variants with a CADD
RawScore below this value will be filtered.
Only used with annotations from files supplied to
--cadd_files or --cadd_dir arguments or a
pre-annotated CADD_raw_score INFO field. To filter
on CADD scores annotated using the VEP dbNSFP
plugin use the --missense_filters option.
-d VCF [VCF ...], --dbsnp VCF [VCF ...], --clinvar VCF [VCF ...]
dbSNP or ClinVar VCF file for variant
annotating/filtering.
-g VCF [VCF ...], --gnomad VCF [VCF ...], --exac VCF [VCF ...]
gnomAD/ExAC file for variant annotating/filtering
using population allele frequencies. By default
allele frequencies from AFR, AMR, EAS, FIN, NFE
and SAS populations are used. Populations to use
can be chosen with the --gnomad_pops argument.
--gnomad_pops POP [POP ...]
Populations to use for annotating/filtering from
gnomAD VCFs. The defaults are AFR, AMR, EAS, FIN,
NFE and SAS. Any combination of these plus "ASJ"
and "POPMAX" can be chosen.
--vcf_filter VCF,ID[,INFO_FIELD ...] [VCF,ID[,INFO_FIELD ...] ...], -vcf_filter VCF,ID[,INFO_FIELD ...] [VCF,ID[,INFO_FIELD ...] ...]
VCF file(s) and name(s) to use in INFO fields
for frequency annotation and/or filtering. Each
file and its associated annotation ID should be
given in pairs separated with commas. INFO fields
will be added to your output for the AN and AF
fields with the field names of VASE_<ID>_AN and
VASE_<ID>_AF. If --freq or --min_freq arguments
are set then matching variants in your input will
be filtered using AF values found in these files.
You may also add additional INFO fields to extract
and annotate your matching variants with by
including additional comma-separated fields after
the ID.
--dng_vcf DNG_VCF [DNG_VCF ...]
One or more VCFs created by DeNovoGear for adding
PP_DNM and PP_NULL fields to sample calls.
-f FREQ, --freq FREQ, --max_freq FREQ
Allele frequency cutoff (between 0 and 1). Used
for external allele frequency sources such as
--dbsnp or --gnomad files. Alleles/variants with
an allele frequency equal to or greater than
this value in these sources will be filtered
from your input. VEP annotated allele frequencies
will also be used for filtering if '--csq' or
'--impact' options are used (annotations from VEP
v90 or higher required). This can be disabled with
the --no_vep_freq option.
--min_freq MIN_FREQ, -min_freq MIN_FREQ
Minimum allele frequency cutoff (between 0 and 1).
Used for external allele frequency sources such as
--dbsnp or --gnomad files. Alleles/variants with
a frequency lower than this value will be filtered.
VEP annotated allele frequencies will also be used
for filtering if '--csq' option is used (VEP v90
or higher required). This can be disabled with the
--no_vep_freq option.
--max_gnomad_homozygotes MAX_GNOMAD_HOMOZYGOTES
Filter alleles if the total number of homozygotes
or hemizygotes in any provided gnomAD VCF is equal
to or greater than this value.
-b dbSNP_build, --build dbSNP_build
dbSNP build version cutoff. For use with --dbsnp
files. Alleles/variants present in this dbSNP
build or earlier will be filtered from your input.
--max_build dbSNP_build, -max_build dbSNP_build
Maximum dbSNP build version cutoff. For use with
--dbsnp files. Alleles/variants present in dbSNP
builds later than this version will be filtered.
--filter_known, -filter_known
Filter any allele/variant present in any of the
files supplied to --gnomad, --dbsnp or
--vcf_filter arguments, or if using '--csq' if any
allele frequency is recorded for any of VEP's AF
annotations. This will also filter
alleles/variants if an annotation from --gnomad or
--dbsnp is present from a previous run unless the
--ignore_existing_annotations option is given.
--filter_novel, -filter_novel
Filter any allele/variant NOT present in
any of the files supplied to --gnomad or --dbsnp or
--vcf_filter arguments, or if using '--csq' if no
allele frequency is recorded for any of VEP's AF
annotations.
--clinvar_path, -path
Retain variants with ClinVar 'likely pathogenic'
or 'pathogenic' flags regardless of frequency or
other settings provided to other Annotation File
Arguments. This requires one of the files
provided to --dbsnp to have CLNSIG annotations
from ClinVar.
-ignore_existing, --ignore_existing_annotations
Ignore previously added annotations from
dbSNP/gnomAD/CADD files that may be present in the
input VCF. Default behaviour is to use these
annotations for filtering if present and the
relevant arguments (e.g. --freq) are given.
--splice_ai_vcfs VCF [VCF ...], -splice_ai_vcfs VCF [VCF ...]
One or more tabix indexed VCFs containing SpliceAI
delta scores with which to filter or annotate
records. SpliceAI INFO fields must be present in
the format produced for pre-scored variants as
downloaded from Jaganathan et al. Cell (2018) or
else as generated by the SpliceAI program
(https://github.com/Illumina/SpliceAI).
Alleles/variants can be retained on these scores
using the --splice_ai_min_delta or
--splice_ai_max_delta options.
--splice_ai_min_delta DELTA, -splice_ai_min_delta DELTA
Retain alleles/consequences with a SpliceAI delta
score equal to or greater than this threshold. If
using filtering on VEP consequence (--csq or
--impact options) VEP consequences for genes
with symbols matching the SpliceAI gene symbol
annotation will be marked for retention also (e.g.
if using segregation filtering). Note that allele
frequency filters will still be applied.
--splice_ai_max_delta DELTA, -splice_ai_max_delta DELTA
Same as --splice_ai_min_delta but
alleles/consequences will be retained only if
SpliceAI delta scores are equal to or below this
threshold.
--missing_splice_ai_scores FILE
Filename to output variants that are not found in
SpliceAI annotation files. Output will be gzip
compressed VCFs suitable for scoring with the
SpliceAI program
(https://github.com/Illumina/SpliceAI).
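The boundary behaviour of the --freq and --min_freq cutoffs described above can be illustrated with a small sketch. This is not VASE's code: the function name `filtered_by_frequency` is hypothetical and it only considers a single known allele frequency, ignoring how VASE combines several sources or handles alleles absent from them.

```python
def filtered_by_frequency(af, max_freq=None, min_freq=None):
    """Return True if an allele with frequency `af` would be filtered.

    max_freq mirrors --freq: frequencies equal to or greater than it are filtered.
    min_freq mirrors --min_freq: frequencies lower than it are filtered.
    """
    if max_freq is not None and af >= max_freq:
        return True
    if min_freq is not None and af < min_freq:
        return True
    return False


# An allele exactly at the --freq cutoff is filtered.
print(filtered_by_frequency(0.01, max_freq=0.01))
# An allele exactly at the --min_freq cutoff is kept.
print(filtered_by_frequency(0.001, min_freq=0.001))
```

The asymmetry matters when choosing cutoffs: a --freq of 0.01 removes alleles at exactly 1%, whereas a --min_freq of 0.01 keeps them.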
Variant Filtering Arguments:
Arguments for filtering based on variant features
-v QUAL, --variant_quality QUAL
Minimum variant quality score ('QUAL' field).
Variants with a QUAL score below this value will be
filtered/ignored.
-p, --pass_filters Only keep variants that have passed filters
(i.e. FILTER field must be "PASS")
--keep_filters KEEP_FILTERS [KEEP_FILTERS ...]
Only keep variants that have these FILTER Fields.
Cannot be used with --pass_filters but you can
use 'pass' as one of your arguments here to retain
variants that pass filters in addition to variants
with a FILTER Field matching the values specified.
If multiple filter annotations are given for a
variant all must match one of these fields or it
will be filtered.
--exclude_filters EXCLUDE_FILTERS [EXCLUDE_FILTERS ...]
Filter variants that have these FILTER Fields.
If multiple filter annotations are given for a
variant it will be excluded if any match one of
the given fields.
-t TYPE [TYPE ...], --var_types TYPE [TYPE ...]
Keep variants of the following type(s). Valid
types are 'SNV' (single nucleotide variants),
'MNV' (multi-nucleotide variants excluding
indels), 'INSERTION' (insertions or duplications
relative to the reference), 'DELETION' (deletions
relative to the reference), 'INDEL' (shorthand for
both insertions and deletions) and 'SV'
(structural variants). If a site is multiallelic
it will be retained if any ALT allele matches one
of these types, but per-allele filtering for
segregation filtering will only consider ALT
alleles of the appropriate types.
-max_alts MAX_ALT_ALLELES, --max_alt_alleles MAX_ALT_ALLELES
Filter variants at sites with more than this
many ALT alleles. For example, using
'--max_alt_alleles 1' would retain biallelic sites
only ('*' alleles are not counted for this
purpose).
--filter_asterisk_only_calls
Filter variants where the only ALT allele is '*'.
-af AF, --af AF Maximum AF value in input VCF. Any allele with an
AF greater than this value will be filtered.
-min_af MIN_AF, --min_af MIN_AF
Minimum AF value in input VCF. Any allele with an
AF less than this value will be filtered.
-filtering_an FILTERING_AN, --filtering_an FILTERING_AN
Require at least this number of allele calls
before filtering with --af or --min_af options.
Useful to avoid filtering at sites with many
uncalled genotypes. If AN field is missing from a
record and this value is > 0, --af/--min_af
filtering will not occur for that record.
Default=0.
-min_an MIN_AN, --min_an MIN_AN
Minimum number of allele calls as given by the
'AN' INFO field. Variants with an AN value below
this threshold or a missing AN field will be
filtered. Default=0.
-ac AC, --ac AC Maximum AC value in input VCF. Any allele with an
AC greater than this value will be filtered.
-min_ac MIN_AC, --min_ac MIN_AC
Minimum AC value in input VCF. Any allele with an
AC less than this value will be filtered.
--info_filters INFO_FILTERS [INFO_FILTERS ...]
Custom filter expressions for filtering on fields
in the INFO field of each record. Must be in the
format '<INFO_FIELD> <comparator> <value>'.
Variants will be retained if they meet the given
criteria. For example, to only keep records with a
QD score greater than 4, you would pass the
expression "QD > 4". To only keep records with the
"DB" flag present you would pass the expression
"DB == True".
Standard python style operators (">", "<", ">=",
"<=", "==", "!=") are supported. Comparisons will
be performed using the types specified for the
given field in the VCF header (e.g. Float, Integer
or String) or as booleans for Flags.
-c [CSQ [CSQ ...]], --csq [CSQ [CSQ ...]]
One or more VEP consequence classes to retain.
Variants which do not result in one of these VEP
consequence classes will be filtered. If this
option is used with no values then the following
default classes will be used:
TFBS_ablation
TFBS_amplification
inframe_deletion
inframe_insertion
frameshift_variant
initiator_codon_variant
missense_variant
protein_altering_variant
regulatory_region_ablation
regulatory_region_amplification
splice_acceptor_variant
splice_donor_variant
start_lost
stop_gained
stop_lost
transcript_ablation
transcript_amplification
You may also pass the value "default" in order to
include these default classes in addition to other
specified classes. Alternatively, you may specify
'all' to include all consequence types if, for
example, you want to filter on other VEP
annotations (e.g. allele frequency or biotype)
irrespective of consequence.
Note that using the --csq option automatically
turns on biotype filtering (see the --biotypes
option below).
--impact IMPACT [IMPACT ...]
One or more VEP 'IMPACT' types to retain. Valid
values are 'HIGH', 'MODERATE', 'LOW' and
'MODIFIER'. Any consequence classes specified by
the '--csq' argument will still be retained
irrespective of values specified here.
Note that using the --impact option automatically
turns on biotype filtering (see the --biotypes
option below).
--canonical, -canonical
When used in conjunction with --csq argument,
ignore consequences for non-canonical transcripts.
--flagged_features, -flagged_features
When used in conjunction with --csq argument,
ignore consequences for flagged
transcripts/features (i.e. with a non-empty
'FLAGS' CSQ field).
--biotypes BIOTYPE [BIOTYPE ...], -biotypes BIOTYPE [BIOTYPE ...]
When used in conjunction with --csq argument,
ignore consequences in biotypes other than those
specified here. By default only consequences in
features with the following biotypes are
considered:
3prime_overlapping_ncrna
antisense
CTCF_binding_site
enhancer
IG_C_gene
IG_D_gene
IG_J_gene
IG_V_gene
lincRNA
miRNA
misc_RNA
Mt_rRNA
Mt_tRNA
open_chromatin_region
polymorphic_pseudogene
processed_transcript
promoter
promoter_flanking_region
protein_coding
rRNA
sense_intronic
sense_overlapping
snoRNA
snRNA
TF_binding_site
translated_processed_pseudogene
TR_C_gene
TR_D_gene
TR_J_gene
TR_V_gene
Use this argument to specify one or more biotypes
to consider instead of those listed above. You may
also include the value 'default' in your list to
include the default values listed above in
addition to others provided to this argument.
Alternatively you may use the value 'all' to
disable filtering on biotypes.
--feature_blacklist FEATURE_BLACKLIST, --blacklist FEATURE_BLACKLIST
A file containing a list of Features (e.g. Ensembl
transcript IDs) to ignore. These must correspond
to the IDs in the 'Feature' field annotated by
VEP.
--loftee Retain LoF (stop_gained, frameshift_variant,
splice_acceptor_variant and splice_donor_variant)
classes only if the LoF annotation from loftee is
'HC'.
-m MISSENSE_FILTERS [MISSENSE_FILTERS ...], --missense_filters MISSENSE_FILTERS [MISSENSE_FILTERS ...]
A list of in silico prediction programs to use
for filtering missense variants (must be used in
conjunction with --csq argument). The programs
provided here must have been annotated on the
input VCF file either directly by VEP or via the
dbNSFP VEP plugin. Recognised program names and
default 'damaging' values are provided in the
"data/vep_insilico_pred.tsv" file.
You may optionally specify score criteria for
filtering as in the following examples:
FATHMM_pred=D
MutationTaster_pred=A
MetaSVM_rankscore=0.8
Or you may just provide the program names
and the default 'damaging' prediction values
will be used, as listed in the file
"vase/data/vep_insilico_pred.tsv".
By default, a missense consequence is filtered
unless each of the programs listed here have an
appropriate or missing prediction/score. This
behaviour can be changed using the
--filter_unpredicted or --keep_if_any_damaging
flags.
--filter_unpredicted, -filter_unpredicted
For use in conjunction with --missense_filters.
The default behaviour when using
--missense_filters is to ignore a program if
there is no prediction given (i.e. the score/pred
is empty). That is, if there are no predictions
for any of the programs annotating a missense
consequence, it will not be filtered, while if
predictions are missing for only some, filtering
will proceed as normal with the other programs. If
this option is given, missense variants will be
filtered if any program does not have a
prediction/score.
--keep_if_any_damaging, -keep_if_any_damaging
For use in conjunction with --missense_filters.
If this option is provided, a missense consequence
is only filtered if ALL of the programs provided
to --missense_filters do not have an appropriate
prediction/score - that is, the missense
consequence will be retained if ANY of the given
programs has an appropriate value for the
prediction/score. This behaviour is overridden by
'--filter_unpredicted' when a prediction/score is
missing for any program.
--splice_filters SPLICE_FILTERS [SPLICE_FILTERS ...]
Similar to --missense_filters except only splice
consequences (splice_donor_variant,
splice_acceptor_variant and splice_region_variant)
are checked versus the given in silico prediction
programs. Currently only dbscSNV (rf_score and
ada_score), MaxEntScan and SpliceDistance
(https://github.com/david-a-parry/SpliceDistance)
plugins are supported.
For example '--splice_filters ada_score' will
filter splice region variants with a dbscSNV
ada_score cutoff below the default value (0.7).
Alternatively, '--splice_filters ada_score=0.9'
would filter on a higher threshold of 0.9 or
above.
--splice_filter_unpredicted
Same as --filter_unpredicted but for
--splice_filters only.
--splice_keep_if_any_damaging
Same as --keep_if_any_damaging but for
--splice_filters only.
--retain_labels Label=Value [Label=Value ...]
Retain consequence annotations if there is a
matching annotation for the given label. For
example, to retain any consequence where there is
a VEP annotation for 'FOO' matching 'BAR' use
"--retain_labels FOO=BAR".
--no_vep_freq, -no_vep_freq
Use this option if you want to ignore VEP
annotated allele frequencies when using --freq and
--csq/--impact options.
--vep_af VEP_AF [VEP_AF ...], -vep_af VEP_AF [VEP_AF ...]
One or more VEP allele frequency annotations to
use for frequency filtering. Default is to use the
following (assuming --csq/--impact and --freq or
--min_freq arguments are in effect):
MAX_AF
AFR_AF
AMR_AF
EAS_AF
EUR_AF
SAS_AF
AA_AF
EA_AF
gnomAD_AF
gnomAD_AFR_AF
gnomAD_AMR_AF
gnomAD_ASJ_AF
gnomAD_EAS_AF
gnomAD_FIN_AF
gnomAD_NFE_AF
gnomAD_OTH_AF
gnomAD_SAS_AF
gnomADg_AF_AFR
gnomADg_AF_AMR
gnomADg_AF_ASJ
gnomADg_AF_EAS
gnomADg_AF_FIN
gnomADg_AF_NFE
gnomADg_AF_OTH
--pathogenic When used in conjunction with --csq argument,
retain variants flagged as pathogenic by either
'CLIN_SIG' or 'clinvar_clnsig' VEP annotations
even if the consequence class is not included in
those selected using the --csq argument. Note that
this only alters filtering as specified by --csq
and --missense_filters options; frequency,
canonical transcript, flagged_features and biotype
filtering will still occur as normal.
--no_conflicted When used in conjunction with --pathogenic
argument, variants labelled as pathogenic will
only be retained if there are no conflicting
'benign' or 'likely benign' assertions.
--g2p G2P A G2P CSV file for filtering variants based on G2P
annotations. Requires your VCF to be annotated
with VEP. Only variants with consequences
affecting genes in this file will be retained.
--check_g2p_consequence
If using --g2p option, use this flag to require
that the observed consequence matches the
'mutation consequence' in the specified G2P file.
--check_g2p_inheritance
If using --g2p option, use this flag to require
that the observed inheritance or
hetero/hemi/homozygosity of alleles match the
requirement specified in the specified G2P file.
Requires at least one of --recessive/--de_novo/
--dominant/--singleton_recessive/--singleton_dominant
arguments.
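The '<INFO_FIELD> <comparator> <value>' format accepted by --info_filters can be sketched as follows. This is an illustration of the expression format only, not VASE's implementation. The names `parse_info_filter` and `record_passes` are hypothetical, header types are passed in as a plain dictionary, and treating a missing non-Flag field as a failed comparison is an assumption.

```python
import operator

OPERATORS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}


def parse_info_filter(expression, field_types):
    """Split an expression and cast its value using the VCF header type."""
    parts = expression.split(None, 2)
    if len(parts) != 3 or parts[1] not in OPERATORS:
        raise ValueError("expected '<INFO_FIELD> <comparator> <value>': " + expression)
    field, comparator, raw_value = parts
    vcf_type = field_types[field]
    if vcf_type == "Float":
        value = float(raw_value)
    elif vcf_type == "Integer":
        value = int(raw_value)
    elif vcf_type == "Flag":
        value = raw_value.strip().lower() == "true"
    else:
        value = raw_value
    return field, OPERATORS[comparator], value


def record_passes(info, expression, field_types):
    """Return True if the INFO dictionary of a record meets the expression."""
    field, compare, value = parse_info_filter(expression, field_types)
    if field_types[field] == "Flag":
        observed = field in info  # Flags are compared as booleans
    elif field not in info:
        return False  # assumption: a missing field fails the comparison
    else:
        observed = info[field]
    return compare(observed, value)


types = {"QD": "Float", "DB": "Flag"}
print(record_passes({"QD": 5.2}, "QD > 4", types))
print(record_passes({"QD": 5.2}, "DB == True", types))
```

On the command line each expression contains spaces, so it needs to be quoted as a single argument, for example "QD > 4".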
Region Filtering Arguments:
Arguments for filtering variants on genomic regions. These arguments are mutually exclusive.
--region REGION [REGION ...]
Only include variants overlapping these intervals
(in the format chr1:1000-2000).
--bed BED Only include variants overlapping the intervals in
the provided BED file.
--gene_bed BED Only include variants overlapping the intervals in
the provided BED file and with a VEP annotation
for the provided gene/transcript/protein
identifiers. The fourth column of the provided BED
file should contain gene symbols and/or Ensembl
gene/transcript/protein identifiers (multiple IDs
should be separated with '/' characters).
Requires input to be annotated with VEP.
A suitably formatted BED can be created using the
'coordinates_from_genes' program installed with
vase.
--stream When using region filtering arguments, read all
variants in your VCF and filter out all that do
not overlap your regions of interest instead of
index-jumping. This allows processing of unindexed
VCFs and potentially speeds up processing of VCFs
with large structural variants that otherwise
severely slow-down tabix-style variant retrieval.
--exclude_regions When using region filtering arguments, output
variants that do NOT overlap regions instead of
those that do. This forces streaming rather than
index-jumping retrieval.
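The region format and the overlap test behind these options can be sketched in a few lines. This is not VASE's code: the names `parse_region` and `overlaps` are hypothetical, coordinates are assumed to be 1-based and inclusive, and a variant is assumed to span its REF allele only, so structural variants with an END annotation are not handled.

```python
def parse_region(region):
    """Parse 'chr1:1000-2000' into (chrom, start, end)."""
    chrom, colon, span = region.rpartition(":")
    start, dash, end = span.partition("-")
    if not colon or not dash:
        raise ValueError("expected format chr1:1000-2000, got " + repr(region))
    return chrom, int(start), int(end)


def overlaps(region, chrom, pos, ref):
    """Return True if a variant at chrom:pos with the given REF overlaps region."""
    r_chrom, r_start, r_end = parse_region(region)
    var_end = pos + len(ref) - 1  # last reference base covered by the variant
    return chrom == r_chrom and pos <= r_end and var_end >= r_start


region = "chr1:1000-2000"
# A deletion starting before the region but extending into it still overlaps.
print(overlaps(region, "chr1", 998, "ACGT"))
# With --exclude_regions the decision is inverted.
print(not overlaps(region, "chr1", 1500, "A"))
```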
Sample Based Filtering Arguments:
Arguments for filtering variants based on presence/absence in samples and/or
inheritance patterns.
--cases SAMPLE_ID [SAMPLE_ID ...], -cases SAMPLE_ID [SAMPLE_ID ...]
One or more sample IDs to treat as cases. Default
behaviour is to retain variants/alleles present in
all of these samples as long as they are not
present in any sample specified using the
'--controls' option. This behaviour can be
adjusted using other options detailed below.
--controls SAMPLE_ID [SAMPLE_ID ...], -controls SAMPLE_ID [SAMPLE_ID ...]
One or more sample IDs to treat as controls.
Default behaviour is to filter variants/alleles
present in any of these samples. This behaviour
can be adjusted using other options detailed
below.
-ped PED, --ped PED A ped file containing information about samples in
your VCF for use in filtering on affection
status and inheritance patterns.
A PED file is a white-space (space or tab)
delimited file with the first six mandatory
columns:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype
Affection status should be coded:
-9 missing
0 missing
1 unaffected
2 affected
All individuals of interest, including parents,
should be specified in this file so that
affection status can be read and dominant versus
recessive/de novo inheritance models can be
inferred.
--n_cases N_CASES, -n_cases N_CASES
Instead of requiring a variant to be present in
ALL samples specified by --cases, require at least
this many cases.
--n_controls N_CONTROLS, -n_controls N_CONTROLS
Instead of filtering an allele/variant if present
in ANY sample specified by --controls, require at
least this many controls to carry a variant before
it is filtered.
--confirm_control_gts
If using the --controls argument, also filter
variants if any control sample is either a no-call
or fails specified genotype quality, depth or
allele balance thresholds. If used in conjunction
with the --n_controls option, control samples with
no-call genotypes or genotypes failing the above
thresholds will be counted towards the number of
controls with an allele/variant.
--biallelic, -biallelic, --recessive
Identify variants matching a recessive inheritance
pattern in cases present in the PED file specified
by the --ped argument. Input must be VEP
annotated. If the --csq argument is given, only
variants/alleles resulting in the given functional
consequences will be used to identify qualifying
variants/alleles, otherwise the default set of
VEP consequences (see --csq argument for details)
will be used.
--de_novo, -de_novo Identify apparent de novo variants in cases
present in the PED file specified by the --ped
argument. This requires that at least one
parent-child trio exists in the given PED file.
--dominant, -dominant
Identify variants segregating in a manner matching
dominant inheritance in cases present in the PED
file specified by the --ped argument.
--min_families MIN_FAMILIES, -min_families MIN_FAMILIES
Minimum number of families (or unrelated samples)
required to contain a qualifying dominant/de novo
or biallelic combination of variants in a feature
before they are output. Default = 1.
--singleton_recessive SAMPLE_ID [SAMPLE_ID ...], -singleton_recessive SAMPLE_ID [SAMPLE_ID ...]
One or more samples to treat as unrelated
individuals and identify variants matching a
recessive inheritance pattern.
--singleton_dominant SAMPLE_ID [SAMPLE_ID ...], -singleton_dominant SAMPLE_ID [SAMPLE_ID ...]
One or more samples to treat as unrelated
individuals and identify variants matching a
dominant inheritance pattern.
--seg_controls SAMPLE_ID [SAMPLE_ID ...], -seg_controls SAMPLE_ID [SAMPLE_ID ...]
One or more sample IDs to treat as controls for
segregation analysis only. Useful if you want to
specify controls to use for rejecting compound
heterozygous combinations of variants or
homozygous variants when using --biallelic option.
Unlike the --controls option, alleles/variants
present in these samples will only be used for
filtering when looking at inheritance patterns in
families present in a PED file or samples
specified with --singleton_recessive or
--singleton_dominant options. This option is not
necessary if your unaffected samples are already
present in your PED file specified with --ped.
--strict_recessive When using the --biallelic/--recessive option,
for any affected sample with parents, require
confirmation of parental genotypes. If either
parent genotype is a no-call or fails genotype
filters then a potential biallelic variant will be
ignored.
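To make the six-column PED layout described under --ped concrete, the sketch below parses PED lines and lists affected individuals and complete parent-child trios. It is an illustration rather than VASE's parser. The names `Individual`, `parse_ped`, `affected_ids` and `complete_trios` are hypothetical, and skipping lines that start with '#' is an assumption.

```python
from collections import namedtuple

Individual = namedtuple("Individual", "family id father mother sex phenotype")


def parse_ped(lines):
    """Parse whitespace-delimited PED lines into Individual records."""
    individuals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # assumption: '#' marks a comment
            continue
        fields = line.split()
        if len(fields) < 6:
            raise ValueError("expected at least 6 columns: " + line)
        individuals.append(Individual(*fields[:6]))
    return individuals


def affected_ids(individuals):
    """Return IDs of individuals coded as affected (phenotype 2)."""
    return [ind.id for ind in individuals if ind.phenotype == "2"]


def complete_trios(individuals):
    """Return (child, father, mother) for affected children with both parents listed."""
    ids = {ind.id for ind in individuals}
    return [(ind.id, ind.father, ind.mother) for ind in individuals
            if ind.phenotype == "2" and ind.father in ids and ind.mother in ids]


ped = [
    "FAM1 child father mother 1 2",
    "FAM1 father 0 0 1 1",
    "FAM1 mother 0 0 2 1",
]
family = parse_ped(ped)
print(affected_ids(family))
print(complete_trios(family))
```

A PED like this one, with an affected child and both unaffected parents present, is the kind of input that --de_novo requires.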
Genotype Filtering Arguments:
Arguments for filtering genotypes when using 'Sample Based Filtering
Arguments' to filter on presence/absence in samples and/or inheritance
patterns.
-gq GQ, --gq GQ Minimum genotype quality score threshold. Sample
genotype calls with a score lower than this
threshold will be treated as no-calls.
Default = 20.
-dp DP, --dp DP Minimum genotype depth threshold. Sample genotype
calls with a read depth lower than this threshold
will be treated as no-calls. Default = 0.
-max_dp MAX_DP, --max_dp MAX_DP
Maximum genotype depth threshold. Sample genotype
calls with a read depth higher than this threshold
will be treated as no-calls. Default = 0 (i.e. not
used).
-het_ab AB, --het_ab AB
Minimum genotype allele balance for heterozygous
genotypes. Heterozygous sample genotype calls
with a ratio of the alternate allele vs total
depth lower than this threshold will be treated as
no-calls. Default = 0.
-hom_ab AB, --hom_ab AB
Minimum genotype allele balance for homozygous
genotypes. Homozygous sample genotype calls
with a ratio of the alternate allele vs total
depth lower than this threshold will be treated as
no-calls. Default = 0.
-con_gq CONTROL_GQ, --control_gq CONTROL_GQ
Minimum genotype quality score threshold for
parents/unaffecteds/controls when filtering
variants. Defaults to the same value as --gq but
you may wish to set this to a lower value if, for
example, you require less evidence from
controls/unaffected in order to filter a variant
or from parental genotype calls when confirming
a potential de novo variant.
-con_dp CONTROL_DP, --control_dp CONTROL_DP
Minimum depth threshold for
parents/unaffecteds/controls when filtering
variants. Defaults to the same value as --dp but
you may wish to set this to a lower value if, for
example, you require less evidence from
controls/unaffected in order to filter a variant
or from parental genotype calls when confirming
a potential de novo variant.
-con_max_dp CONTROL_MAX_DP, --control_max_dp CONTROL_MAX_DP
Maximum depth threshold for
parents/unaffecteds/controls when filtering
variants. Defaults to the same value as --max_dp.
-con_het_ab AB, --control_het_ab AB
Minimum genotype allele balance for heterozygous
genotypes in parents/unaffecteds/controls.
Heterozygous genotype calls in these samples
with a ratio of the alternate allele vs total
depth lower than this threshold will be treated as
no-calls. Defaults to the same as --het_ab but
you may wish to set this to a lower value if, for
example, you require less evidence from
controls/unaffected in order to filter a variant.
-con_hom_ab AB, --control_hom_ab AB
Minimum genotype allele balance for homozygous
genotypes in parents/unaffecteds/controls.
Homozygous genotype calls in these samples
with a ratio of the alternate allele vs total
depth lower than this threshold will be treated as
no-calls. Defaults to the same as --hom_ab but
you may wish to set this to a lower value if, for
example, you require less evidence from
controls/unaffected in order to filter a variant.
-con_ref_ab AB, --control_max_ref_ab AB
Maximum genotype allele balance for
parents/unaffecteds/controls with reference (0/0)
genotypes when filtering variants. If you wish to
count/exclude variants where controls/unaffecteds
are called as homozygous reference but still have a
low proportion of ALT alleles, specify a suitable
cutoff here.
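The thresholds described above can be illustrated with a minimal Python sketch. It is not VASE's code: the GtThresholds class, the function names and the genotype representation (a tuple of allele indices with None for a missing allele) are all hypothetical, and the sketch assumes the allele balance minimums apply only to calls carrying an ALT allele.

```python
from dataclasses import dataclass, replace


@dataclass
class GtThresholds:
    """Hypothetical holder for --gq, --dp, --max_dp, --het_ab, --hom_ab."""
    gq: int = 20
    dp: int = 0
    max_dp: int = 0        # 0 means the maximum depth filter is not used
    het_ab: float = 0.0
    hom_ab: float = 0.0


def control_thresholds(case, **overrides):
    """Control settings default to the case settings unless overridden."""
    given = {k: v for k, v in overrides.items() if v is not None}
    return replace(case, **given)


def is_no_call(gt, gq, dp, alt_depth, t):
    """Return True if a genotype call would be treated as a no-call."""
    if None in gt:
        return True
    if gq < t.gq or dp < t.dp:
        return True
    if t.max_dp and dp > t.max_dp:
        return True
    if all(allele == 0 for allele in gt):
        return False       # assumption: AB minimums skip reference calls
    ab = alt_depth / dp if dp else 0.0
    min_ab = t.het_ab if len(set(gt)) > 1 else t.hom_ab
    return ab < min_ab


def ref_call_exceeds_ab(alt_depth, dp, max_ref_ab=None):
    """Flag a 0/0 control call whose ALT fraction is above the cutoff."""
    if max_ref_ab is None or dp == 0:
        return False
    return alt_depth / dp > max_ref_ab
```

With the values gq=20, dp=10 and het_ab=0.25, a heterozygous call with GQ 15 is a no-call for a case sample, but the same call passes for a control whose GQ threshold has been lowered to 10 via control_thresholds(case, gq=10).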
Structural Variant Genotype Filtering Arguments:
Arguments for filtering genotypes for Structural Variant calls when using
'Sample Based Filtering Arguments' to filter on presence/absence in samples
and/or inheritance patterns. Only output from Manta is currently supported.
-sv_gq SV_GQ, --sv_gq SV_GQ
Minimum genotype quality score threshold for
structural variants. Sample genotype calls with a
score lower than this threshold will be treated as
no-calls. Default = 20.
-sv_dp SV_DP, --sv_dp SV_DP
Minimum genotype 'depth' threshold for structural
variants. Sample genotype calls with fewer than
this number of supporting reads will be treated
as no-calls. Default = 0.
-sv_max_dp SV_MAX_DP, --sv_max_dp SV_MAX_DP
Maximum genotype 'depth' threshold for structural
variants. Sample genotype calls with more than
this number of supporting reads will be treated
as no-calls. Default = 0 (i.e. not used).
-sv_het_ab AB, --sv_het_ab AB
Minimum genotype allele balance for heterozygous
genotypes for structural variants. Heterozygous
sample genotype calls with a ratio of reads
supporting the alternate allele vs total
supporting reads lower than this threshold will
be treated as no-calls. Default = 0.
-sv_hom_ab AB, --sv_hom_ab AB
Minimum genotype allele balance for homozygous
genotypes for structural variants. Homozygous
sample genotype calls with a ratio of reads
supporting the alternate allele vs total
supporting reads lower than this threshold will be
treated as no-calls. Default = 0.
-sv_con_gq SV_CONTROL_GQ, --sv_control_gq SV_CONTROL_GQ
Minimum genotype quality score threshold for
parents/unaffecteds/controls when filtering
structural variants. Defaults to the same value as
--sv_gq but you may wish to set this to a lower
value if, for example, you require less evidence
from controls/unaffected in order to filter a
variant or from parental genotype calls when
confirming a potential de novo variant.
-sv_con_dp SV_CONTROL_DP, --sv_control_dp SV_CONTROL_DP
Minimum supporting read threshold for
parents/unaffecteds/controls when filtering
structural variants. Defaults to the same value as
--sv_dp but you may wish to set this to a lower
value if, for example, you require less evidence
from controls/unaffected in order to filter a
variant or from parental genotype calls when
confirming a potential de novo variant.
-sv_con_max_dp SV_CONTROL_MAX_DP, --sv_control_max_dp SV_CONTROL_MAX_DP
Maximum supporting read threshold for
parents/unaffecteds/controls when filtering
structural variants. Defaults to the same value as
--sv_max_dp.
-sv_con_het_ab AB, --sv_control_het_ab AB
Minimum genotype allele balance for heterozygous
structural variant genotypes in
parents/unaffecteds/controls. Heterozygous
genotype calls in these samples with a ratio of
reads supporting the alternate allele vs total
supporting reads lower than this threshold
will be treated as no-calls. Defaults to the same
as --sv_het_ab but you may wish to set this to a
lower value if, for example, you require less
evidence from controls/unaffected in order to
filter a variant.
-sv_con_hom_ab AB, --sv_control_hom_ab AB
Minimum genotype allele balance for homozygous
structural variant genotypes in
parents/unaffecteds/controls. Homozygous
genotype calls in these samples with a ratio of
reads supporting the alternate allele vs total
supporting reads lower than this threshold
will be treated as no-calls. Defaults to the same
as --sv_hom_ab but you may wish to set this to a
lower value if, for example, you require less
evidence from controls/unaffected in order to
filter a variant.
-sv_con_ref_ab AB, --sv_control_max_ref_ab AB
Maximum genotype allele balance for
parents/unaffecteds/controls with reference (0/0)
genotypes when filtering structural variants. If
you wish to count/exclude variants where
controls/unaffecteds are called as homozygous
reference but still have a low proportion of ALT
alleles, specify a suitable cutoff here.
--duphold_del_dhffc DHFFC
Maximum fold-change for deletion calls relative to
flanking regions as annotated by duphold
(https://github.com/brentp/duphold). Deletion
calls will be filtered if the DHFFC annotation
from duphold is greater than this value.
--duphold_dup_dhbfc DHBFC
Minimum fold-change for duplication calls relative
to flanking regions as annotated by duphold
(https://github.com/brentp/duphold). Duplication
calls will be filtered if the DHBFC annotation
from duphold is less than this value.
--control_duphold_del_dhffc DHFFC
Maximum fold-change for deletion calls relative to
flanking regions as annotated by duphold for
parent/unaffected/control sample het/homozygous
alternative calls. Defaults to the same value as
--duphold_del_dhffc but you may wish to set this
to a higher value if, for example, you require
less evidence from controls/unaffected in order to
filter a variant or from parental genotype calls
when confirming a potential de novo variant.
--control_duphold_dup_dhbfc DHBFC
Minimum fold-change for duplication calls relative
to flanking regions as annotated by duphold for
parent/unaffected/control sample het/homozygous
alternative calls. Defaults to the same value as
--duphold_dup_dhbfc but you may wish to set this
to a lower value if, for example, you require
less evidence from controls/unaffected in order to
filter a variant or from parental genotype calls
when confirming a potential de novo variant.
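The direction of the two duphold cutoffs is easy to mix up, so here is a small sketch of the comparison described above. The function name and arguments are hypothetical and this is not VASE's implementation; it assumes a call is left unfiltered when the relevant threshold or annotation is absent, and the numeric values in the usage note are purely illustrative.

```python
def passes_duphold(svtype, dhffc=None, dhbfc=None,
                   del_dhffc=None, dup_dhbfc=None):
    """Return False if a call would be filtered on duphold annotations.

    svtype: "DEL" or "DUP" (other types are not filtered here).
    dhffc/dhbfc: the call's duphold annotations, if present.
    del_dhffc/dup_dhbfc: thresholds as given on the command line.
    """
    if svtype == "DEL" and del_dhffc is not None and dhffc is not None:
        # Deletions are filtered when DHFFC is greater than the cutoff.
        return dhffc <= del_dhffc
    if svtype == "DUP" and dup_dhbfc is not None and dhbfc is not None:
        # Duplications are filtered when DHBFC is less than the cutoff.
        return dhbfc >= dup_dhbfc
    return True
```

With del_dhffc=0.7, a deletion annotated with DHFFC 0.5 is kept and one with DHFFC 0.9 is filtered; with dup_dhbfc=1.3, a duplication with DHBFC 1.1 is filtered.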
Help/Logging Arguments:
--prog_interval N, -prog_interval N
Report progress information every N variants.
Default=1000.
--log_progress, -log_progress
Use logging output for progress rather than wiping
progress line after each update.
--no_progress Do not output progress information to STDERR.
--quiet Do not output INFO messages to STDERR. Warnings
will still be shown.
--debug Output debugging level information to STDERR.
--no_warnings Do not output INFO or WARN messages to
STDERR. Only program-ending errors will appear.
--silent Equivalent to specifying both '--no_progress' and
'--no_warnings' options.
-h, --help Show this help message and exit
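How the logging options relate to one another can be sketched with Python's standard logging levels. The function names and the precedence between flags given together are assumptions for illustration; VASE's actual handling may differ.

```python
import logging


def log_level(quiet=False, debug=False, no_warnings=False, silent=False):
    """Map the logging flags to a standard logging level (assumed order)."""
    if silent or no_warnings:
        return logging.ERROR      # only program-ending errors
    if debug:
        return logging.DEBUG      # debugging-level information
    if quiet:
        return logging.WARNING    # no INFO messages, warnings still shown
    return logging.INFO


def show_progress(no_progress=False, silent=False):
    """Progress output is suppressed by --no_progress or --silent."""
    return not (no_progress or silent)
```

Under these assumptions, --silent gives the same result as passing --no_progress and --no_warnings together.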
Written by David A. Parry at the University of Edinburgh.
MIT License
Copyright (c) 2017-2020 David A. Parry
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.