PREREQUISITE PREPARATION

This guide follows the GATK4 best practices workflow: GATK_Germline_Short_Variant_Discovery

Tools

1. bcftools

Please download and install bcftools from here. It is used to manipulate VCF files.

Please also refer to this htslib_Guide or the latest release guide for how to build and install it correctly.

2. htslib

Please download and install htslib from here.

Please have a look at this htslib_latest guide on how to install it correctly.

3. GATK

This variant discovery practice uses GATK v4.4.0.0.

The zip file for this version can be downloaded here.

Please have a look at this in-depth tutorial in the GATK GitHub repository: broadinstitute/gatk.

  1. Requirements
  2. Downloading GATK
  3. Building GATK
  4. Running GATK
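Once the tools are installed, it can help to confirm that they are reachable from the shell before moving on. The following is a minimal sketch, not part of the original guide: the function name `check_tools` is hypothetical, and the list of commands (including `tabix` and `bgzip`, which ship with htslib, and `java`, which GATK needs to run) is an assumption about what a typical installation puts on `PATH`.

```shell
#!/bin/sh
# check_tools (hypothetical helper): report which commands are on PATH.
# Returns 0 when every command is found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed command names; adjust them to match your installation.
check_tools bcftools tabix bgzip gatk java \
    || echo "Install the missing tools before continuing."
```

If a tool is reported as missing even though it was built, its install directory is probably not on `PATH` yet.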

Reference genome for human chromosome 21 and necessary databases

Please download the reference genome and databases required for this variant calling practice. Always download both the VCF file and its .tbi index file.
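After the downloads in this section have finished, a quick check can confirm that no index file was skipped. This is a sketch under stated assumptions: the function name `check_vcf_indexes` is hypothetical, and it assumes all `.vcf.gz` files sit in one directory with their indexes named `<file>.vcf.gz.tbi`, as in the download links in this section.

```shell
#!/bin/sh
# check_vcf_indexes (hypothetical helper): list every *.vcf.gz in a directory
# that has no matching .tbi index next to it.
# Returns 0 when all indexes are present, 1 otherwise.
check_vcf_indexes() {
    dir=${1:-.}
    status=0
    for vcf in "$dir"/*.vcf.gz; do
        [ -e "$vcf" ] || continue   # the glob matched nothing
        if [ ! -f "$vcf.tbi" ]; then
            echo "missing index: $vcf.tbi"
            status=1
        fi
    done
    return "$status"
}

# Check the current directory.
check_vcf_indexes . || echo "Download the missing .tbi files listed above."
```

If an index cannot be downloaded, `tabix -p vcf <file>.vcf.gz` from htslib can usually rebuild it, provided the VCF is bgzip-compressed.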

1. Reference genome

Navigate to the UCSC sequence data by chromosome using this link: UCSC_hg38_sequence_data_by_chromosome, then download the "chr21.fa.gz" reference FASTA file.

Or download it directly from the command line:

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz

2. dbsnp_146 database

  • VCF file
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz
  • INDEX file
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi

3. 1000G_omni2.5.hg38 known SNPs database

  • VCF file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz
  • INDEX file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_omni2.5.hg38.vcf.gz.tbi

4. 1000G_phase1 high-confidence SNPs hg38 database

  • VCF file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz
  • INDEX file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi

5. Homo sapiens assembly38 known indels database

  • VCF file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz
  • INDEX file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi

6. Mills and 1000G gold standard for known indels hg38 database

  • VCF file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
  • INDEX file
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi
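The four databases hosted on Google Cloud Storage (items 3 to 6) share one base URL, so they can also be fetched in a single loop. The sketch below is one possible way to do that; the function names `resource_urls` and `download_resources` are hypothetical, and the dbsnp_146 files are left out because they live on a different server.

```shell
#!/bin/sh
# Base URL shared by the resources in items 3 to 6.
BASE_URL="https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0"

# resource_urls (hypothetical helper): print the VCF and index URL
# for each resource, one URL per line.
resource_urls() {
    for name in \
        1000G_omni2.5.hg38 \
        1000G_phase1.snps.high_confidence.hg38 \
        Homo_sapiens_assembly38.known_indels \
        Mills_and_1000G_gold_standard.indels.hg38
    do
        echo "$BASE_URL/$name.vcf.gz"
        echo "$BASE_URL/$name.vcf.gz.tbi"
    done
}

# download_resources (hypothetical helper): fetch every URL into the
# current directory; -c resumes a partially downloaded file.
download_resources() {
    resource_urls | while read -r url; do
        wget -c "$url"
    done
}
```

Running `resource_urls` alone prints the eight URLs for review; calling `download_resources` starts the downloads.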

Dataset for practice

Please download both the BAM file and its BAI index file to follow this practice. Both files can be found in the test_data folder.