Tutorial
- Build a basic understanding of Princess workflow
- Analyze long reads such as PacBio HiFi or Oxford Nanopore (ONT)
- Sequence data QC
- Call single-nucleotide variants (SNVs) and short insertions and deletions (indels)
- Call Structural Variants (SVs)
- Phase SNVs and SVs
- Get familiar with VCF files
Starting with a capture sample, we will identify SNVs, indels, and SVs and phase them into haplotypes, with minimal installation and no prior expertise in tuning bioinformatics tools.
- Basic experience with the Linux system and command line will be helpful.
Princess was tested on CentOS release 6.7 with Conda version 4.7.12 installed. For more information, see the Conda installation instructions; the same Conda version is also available for download.
mkdir princess_tutorial
cd princess_tutorial
conda create --name princess_env python=3.7
conda activate princess_env
conda install snakemake=5.7.1
conda install pyyaml
conda install bcftools
# for SNVs and SVs in-depth analysis
conda install bedtools
# for bed files intersection
git clone https://github.com/MeHelmy/princess.git
# install.sh lives inside the cloned repository
cd princess
chmod +x install.sh
./install.sh
python princess -h
- downloading the fastq file
wget https://bcm.box.com/shared/static/hqv7ghnncroxcvfzz1k45z12du99v4i0 --output-document MdaMb231_brPanel_MinION.12.fastq.gz
- If you could not run Princess to completion, you can download a copy of the final Princess output and continue with the tutorial
wget https://bcm.box.com/shared/static/x9ena1gd41p4x12etk60h6j2l8qg29sb --output-document analysis_finall.tar.gz
- Extracting files
tar -xf analysis_finall.tar.gz
Note: We used the GRCh37 reference.
python princess all -r ont -d analysis -f hs37d5_mainchr.fa -s $PWD/MdaMb231_brPanel_MinION.12.fastq.gz -e --printshellcmds --dry-run
benchmark log PrincessLog.txt result snake_log statistics
awk '{print $2}' stat.benchmark.txt
awk '{print $2}' sv.benchmark.txt
awk '{print $(NF)}' sv.benchmark.txt
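The commands above pick benchmark columns by position. A minimal sketch of selecting a column by its header name instead, assuming the benchmark files are tab-separated tables with a header row (as Snakemake benchmark files normally are). The helper `column_by_name` and the sample table are hypothetical and are not real Princess output.

```shell
# Hypothetical helper: print one column of a tab-separated table by header name.
# Usage: column_by_name NAME < file
column_by_name() {
  awk -F'\t' -v name="$1" '
    NR == 1 {
      # Find the index of the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
      next
    }
    { print $col }
  '
}

# Made-up benchmark table, for illustration only.
printf 's\th:m:s\tmax_rss\n12.5\t0:00:12\t150.2\n' | column_by_name 'h:m:s'
```

On a real file this could be used as `column_by_name 'h:m:s' < sv.benchmark.txt`, provided that column name appears in the file's header.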
cd statistics/raw_reads
cat reads_stat.txt
Reads: 28396
Bases: 409038723
Mean read length: 14404.800781800253
Median: 7733.5
Max: 178579
N50: 29994
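To make the N50 value concrete: it is the read length at which reads of that length or longer contain at least half of all bases. The sketch below computes it with standard tools, assuming uncompressed FASTQ with exactly four lines per record. The helper names `fastq_lengths` and `n50` are hypothetical, and Princess may compute its statistics differently.

```shell
# Print the length of each read: the sequence is line 2 of every 4-line FASTQ record.
fastq_lengths() {
  awk 'NR % 4 == 2 { print length($0) }'
}

# Compute N50 from one read length per line on stdin.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      # Walk from the longest read down until half of all bases are covered.
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) { print len[i]; exit }
      }
    }
  '
}

# Toy lengths: total is 20 bases, and reads of length >= 5 cover 11 of them.
printf '%s\n' 2 3 4 5 6 | n50   # prints 5
```

For the tutorial data, something like `zcat MdaMb231_brPanel_MinION.12.fastq.gz | fastq_lengths | n50` should give a value comparable to the N50 reported above.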
cd statistics/minimap
grep '^SN' data.stat | cut -f 2-
What is the number of “reads mapped”? Compare to the number of raw reads from statistics/raw_reads/reads_stat.txt.
What is the “average quality” of mapped reads? What is the number of reads with a mapping quality of zero (“reads MQ0”)?
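One way to answer the first question is to compute the mapped fraction directly. This is a minimal sketch, assuming `data.stat` follows the `samtools stats` layout, where summary lines look like `SN<TAB>key:<TAB>value`. The helper `sn_value` is hypothetical and the numbers are made up.

```shell
# Hypothetical helper: print the value of one summary-number (SN) line.
# Usage: sn_value 'reads mapped' < data.stat
sn_value() {
  grep '^SN' | awk -F'\t' -v key="$1:" '$2 == key { print $3 }'
}

# Made-up statistics, for illustration only.
stats=$(printf 'SN\traw total sequences:\t1000\nSN\treads mapped:\t950\nSN\treads MQ0:\t19\t# mapped and MQ=0\n')

total=$(printf '%s\n' "$stats" | sn_value 'raw total sequences')
mapped=$(printf '%s\n' "$stats" | sn_value 'reads mapped')

# Report the share of reads that mapped.
awk -v m="$mapped" -v t="$total" 'BEGIN { printf "%.1f%% of reads mapped\n", 100 * m / t }'
```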
cd statistics/sv
cat data.stat
cat data.stat_CHR
grep "^SN" snp.txt
So far, all of the results have come directly from Princess. Now let us use other tools to learn more about them.
cd analysis/result
bcftools view -v snps -i 'GT="het"' minimap.phased.SNVs.vcf.gz | wc -l
# ~25079
bcftools view -p minimap.phased.SNVs.vcf.gz | wc -l
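Note that `bcftools view` prints the VCF header unless `-H` is given, so piping its output to `wc -l` counts header lines as well as variants. To show what the two filters above distinguish, here is a minimal awk sketch that classifies genotypes in a single-sample VCF: heterozygous genotypes have two different alleles, and phased genotypes use `|` rather than `/` as the allele separator. The helper `classify_gt` and the tiny VCF are hypothetical; for real data, bcftools remains the better tool.

```shell
# Made-up single-sample VCF, for illustration only.
sample_vcf() {
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf '16\t100\t.\tA\tG\t30\tPASS\t.\tGT:PS\t0|1:100\n'
  printf '16\t200\t.\tC\tT\t30\tPASS\t.\tGT:PS\t1|0:100\n'
  printf '16\t300\t.\tG\tA\t30\tPASS\t.\tGT\t1/1\n'
  printf '16\t400\t.\tT\tC\t30\tPASS\t.\tGT\t0/1\n'
}

# Hypothetical helper: count records by zygosity and phasing of the first sample.
classify_gt() {
  awk -F'\t' '
    /^#/ { next }                       # skip header lines
    {
      split($10, sample, ":")           # GT is the first FORMAT key when present
      gt = sample[1]
      if (gt ~ /\./) next               # skip missing genotypes
      phase = (index(gt, "|") > 0) ? "phased" : "unphased"
      n = split(gt, allele, "[/|]")
      zyg = (n == 2 && allele[1] != allele[2]) ? "het" : "hom"
      count[zyg " " phase]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k2
}

sample_vcf | classify_gt
```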
Note: the gene is located on chromosome 16, starting at 67952369 and ending at 67976758.
bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz
bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools view -H -P | wc -l
bcftools view -r 16:67952369-67976758 minimap.phased.SNVs.vcf.gz | bcftools query -f '[%PS]\n' | sort | uniq -c
# 15 67965575
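Each line of the `uniq -c` output is a count followed by a phase-set (PS) identifier, so the result above means that 15 variants in the region share phase set 67965575. The same grouping can be sketched in plain awk by locating `PS` in the FORMAT column. The helper `phase_set_counts` and the sample records are hypothetical.

```shell
# Made-up single-sample VCF, for illustration only.
sample_vcf() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n'
  printf '16\t100\t.\tA\tG\t30\tPASS\t.\tGT:PS\t0|1:100\n'
  printf '16\t200\t.\tC\tT\t30\tPASS\t.\tGT:PS\t1|0:100\n'
  printf '16\t500\t.\tG\tA\t30\tPASS\t.\tGT:PS\t0|1:500\n'
  printf '16\t900\t.\tT\tC\t30\tPASS\t.\tGT\t0/1\n'
}

# Hypothetical helper: count variants per phase set for the first sample.
phase_set_counts() {
  awk -F'\t' '
    /^#/ { next }
    {
      # FORMAT keys and sample values are colon-separated and share positions.
      n = split($9, key, ":"); split($10, val, ":")
      for (i = 1; i <= n; i++)
        if (key[i] == "PS" && val[i] != ".") count[val[i]]++
    }
    END { for (ps in count) print count[ps], ps }
  ' | sort -k2,2n
}

sample_vcf | phase_set_counts
```

Records without a `PS` value, like the last one here, are unphased and are left out of the counts.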
bcftools view -r 5:57679993-57686103 minimap.SVs.phased.vcf.gz | less
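When scanning the region output for SVs, the fields of interest are usually the start position, the `END` value, and the `SVTYPE` from the INFO column. The sketch below pulls those out for records overlapping a region, assuming the SV caller writes `END` and `SVTYPE` INFO keys as defined in the VCF specification. The helper `svs_in_region` and the sample records are hypothetical.

```shell
# Made-up SV records, for illustration only.
sample_svs() {
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '5\t57600000\t.\tN\t<INS>\t.\tPASS\tSVTYPE=INS;END=57600500\n'
  printf '5\t57680000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=57681000\n'
  printf '16\t57680000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=57681000\n'
}

# Hypothetical helper: print "POS END SVTYPE" for SVs overlapping CHROM:LO-HI.
# Usage: svs_in_region CHROM LO HI < file.vcf
svs_in_region() {
  awk -F'\t' -v chrom="$1" -v lo="$2" -v hi="$3" '
    /^#/ { next }
    $1 != chrom { next }
    {
      svend = $2; svtype = "."          # fall back to POS when END is absent
      n = split($8, info, ";")
      for (i = 1; i <= n; i++) {
        if (info[i] ~ /^END=/)    svend  = substr(info[i], 5) + 0
        if (info[i] ~ /^SVTYPE=/) svtype = substr(info[i], 8)
      }
      # Keep the record if [POS, END] overlaps [lo, hi].
      if ($2 + 0 <= hi + 0 && svend + 0 >= lo + 0) print $2, svend, svtype
    }
  '
}

sample_svs | svs_in_region 5 57679993 57686103
```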
Bonus:
If you identified an SV in the previous location, take a look at it in IGV using the BAM file minimap.hap.bam and color the reads by the HP tag.
- Installing Princess
- Running Princess for a comprehensive sample analysis
- QC for raw reads (number of reads, bases, max read length, N50, etc.)
- Variant QC (Number of SVs, SNVs, and Indels)
- Calling variants, SVs, SNVs, and indels
- Phasing variants
- Count heterozygous vs. homozygous variants
- Select only phased variants
- Count the variants per gene and how many of them are phased
- Detecting SVs per region/gene