- Software Requirements
- Quality Control
- Mapping and Filtering
- Variant Calling
- Variant Annotation
- Variant Filtering
- Visualization
Required packages for running the ATAC-seq pipeline.
The following software can be installed on the cluster either from source code or via the conda platform:
conda install -c bioconda samtools
conda install -c bioconda bamtools
conda install -c bioconda bedtools
conda install -c bioconda vcflib
conda install -c bioconda bcftools
conda install -c bioconda tabix
conda install -c bioconda htseq
conda install -c bioconda qualimap
It is generally a good idea to generate some quality metrics for raw sequence data using FastQC.
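A minimal FastQC invocation; the FASTQ file names and output directory below are placeholders:
mkdir -p fastqc_out   # FastQC expects the output directory to exist
fastqc -o fastqc_out sample_R1.fastq.gz sample_R2.fastq.gz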
Quality-based trimming as well as adapter removal can be done with Flexbar.
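A possible Flexbar call for paired-end data, assuming Flexbar 3.x option names, placeholder read file names, and an adapters.fasta file containing the library's adapter sequences:
flexbar -r sample_R1.fastq.gz -p sample_R2.fastq.gz -a adapters.fasta -q TAIL -qf sanger -t sample_trimmed -z GZ
# -r/-p paired-end reads, -a adapter file, -q TAIL quality trimming from the 3' end,
# -qf quality score format, -t output prefix, -z GZ gzip-compressed output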
The next step is to align the reads to a reference genome. There are many programs available to perform the alignment. BWA aligns reads with high efficiency and accuracy without allowing large gaps such as splice junctions.
The genome index can be built using bwa index:
bwa index [-a bwtsw|is] [-p index_prefix] reference.fasta
bwa index -p hg19bwaidx -a bwtsw hg19.fa
-p   index prefix (change this to whatever you want)
-a   index algorithm (bwtsw for long genomes and is for short genomes)
bwa mem -M
-M   flag shorter split hits as secondary. This is optional for Picard compatibility, as MarkDuplicates can process BWA's alignments directly.
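A sketch of a full alignment command, assuming the hg19bwaidx index built above and placeholder paired-end FASTQ names; the output is piped into samtools sort so the BAM is coordinate-sorted:
bwa mem -M -t 8 hg19bwaidx sample_R1.fastq.gz sample_R2.fastq.gz | samtools sort -o sample.sorted.bam -   # -t sets the thread count
samtools index sample.sorted.bam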
Sequencing errors sometimes occur during sample or library preparation. Reads that come from the exact same input DNA template accumulate at the same start position on the reference genome due to PCR duplication, and any sequencing error in them can lead to artefacts in the downstream variant calling, because it is impossible to distinguish the true DNA material from the PCR artefacts. To decrease this harmful effect, the duplicates are marked beforehand with the Picard tool MarkDuplicates. Picard MarkDuplicates uses the start coordinates and orientations of both reads of a read pair; based on identical 5' mapping coordinates it flags all duplicates except the "best" copy.
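A minimal MarkDuplicates sketch with placeholder file names, using the same Picard argument style as the CollectAlignmentSummaryMetrics example further below:
picard MarkDuplicates \
I=sample.sorted.bam \
O=sample.dedup.bam \
M=sample.dup_metrics.txt   # M= is the required duplication metrics report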
It is best practice to validate the BAM file, to make sure there were no issues or mistakes in the previous analyses. This is done with Picard ValidateSamFile.
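For example (file name is a placeholder); MODE=SUMMARY prints a count of each error type instead of every problematic record:
picard ValidateSamFile \
I=sample.dedup.bam \
MODE=SUMMARY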
A nucleotide difference versus some reference at a given position in an individual genome or transcriptome is usually referred to as a Single Nucleotide Polymorphism (SNP). A variant call is usually accompanied by an estimate of the variant frequency and some measure of confidence. There are a number of tools available for variant calling; bcftools is one of the most widely used. The following steps are performed for variant calling (a command sketch follows the list):
- Calculate the read coverage of positions in the genome
- Detect the single nucleotide polymorphisms (SNPs)
- Filter and report the SNP variants in variant calling format (VCF)
- VCF indexing and merging into one file
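A sketch of these steps with bcftools; the reference, sample names and quality threshold are placeholders, and the merge step assumes several bgzipped, tabix-indexed per-sample VCFs:
bcftools mpileup -Ou -f genome.fasta sample.dedup.bam | bcftools call -mv -Oz -o sample.vcf.gz   # read coverage/genotype likelihoods, then variant calling
tabix -p vcf sample.vcf.gz   # index the compressed VCF
bcftools filter -e 'QUAL<20' sample.vcf.gz -Oz -o sample.filtered.vcf.gz   # drop low-quality calls (example threshold)
tabix -p vcf sample.filtered.vcf.gz
bcftools merge sample1.filtered.vcf.gz sample2.filtered.vcf.gz -Oz -o merged.vcf.gz   # merge per-sample VCFs into one file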
Annotate the SNP variants with their functional consequences using the Variant Effect Predictor (VEP). The annotation fields that VEP adds are described in the VCF file header.
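A minimal VEP sketch, assuming Ensembl VEP is installed (e.g. the ensembl-vep conda package) with a local offline cache, and placeholder file names; the header of the annotated VCF can then be inspected with bcftools:
vep -i sample.filtered.vcf.gz --cache --offline --vcf -o sample.vep.vcf   # write annotated output as VCF
bcftools view -h sample.vep.vcf   # print only the header, including the VEP annotation description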
Create a bigWig file for visualizing the coverage using bamCoverage from deepTools. An alternative visualization tool is the Integrative Genomics Viewer (IGV). Peak files can be loaded directly (File → Load from File). Viewing BAM files in IGV requires that they are sorted by coordinate and indexed with SAMtools. For plotting, a BAM file can also be converted to BED (bamtobed) using bedtools and loaded into IGV.
samtools view -f 0x2 test.sorted.bam | wc -l
samtools view -F 0x2 test.sorted.bam
samtools sort test.bam -o test.sorted.bam
samtools index test.sorted.bam
samtools view -h -F 4 -b test.sorted.bam > test_only_mapped.bam
samtools view -b test.bam chr2 > test_chr2.bam
bamtools stats -in test_sort.bam -insert
samtools view -S -b test.sam > test.bam
bamtools convert -format json -in test_sort.bam -out myData.json
bamtools convert -format fasta -in test_sort.bam -out myData.fasta
bamtools convert -format fastq -in test_sort.bam -out myData.fastq
bamtools convert -format sam -in test_sort.bam -out myData.sam
bamtools convert -format pileup -in test_sort.bam -out myData.pileup
bamtools convert -format yaml -in test_sort.bam -out myData.yaml
bamtools convert -format bed -in test_sort.bam -out myData.bed
bedtools bamtobed -i test_sort.bam >test_sort.bed
bamCoverage -b test_sort.bam -o test_sort.bigWig
bam2bed < test_sort.bam | cut -f1-3,5 > test_sort.bg
samtools depth test.bam > test.coverage
genomeCoverageBed -ibam test.bam > coverage.txt
awk '$1 == 10 {print $0}' test.coverage > chr10_test.coverage
awk '$1 == "chr10" {print $0}' test.coverage > chr10_test.coverage
samtools stats test.bam | grep ^SN | cut -f 2-
samtools view test_sorted.bam | wc -l
bedtools intersect compares two or more BED/BAM/VCF/GFF files and identifies all the regions in the genome where the features in the files overlap.
bedtools intersect -a test1.bed -b test2.bed | head -5
bedtools intersect -a test1.bed -b test2.bed -wa -wb | head -15
bedtools intersect -a test1.bed -b test2.bed -wo | head -10
bedtools intersect -a test1.bed -b test2.bed -c | head -10
bedtools intersect -a test1.bed -b test2.bed -v >final.bed
bedtools intersect -a test1.bed -b test2.bed test3.bed test4.bed -sorted >final.bed
bedtools intersect -a test1.bed -b test2.bed test3.bed test4.bed -sorted -wa -wb >final.bed
bedtools intersect -a test1.bed -b test2.bed test3.bed test4.bed -sorted -wa -wb -names test test2 chrom
sort -k1,1 -k2,2n test.bed > test.sort.bed
samtools mpileup -r 'VII:3,120,957-3,147,147' accepted_hits.bam | awk 'BEGIN{C=0}; {C=C+$4}; END{print C "\t" C/NR}'
bedtools genomecov -i test.bed -g genome.txt
multiBamSummary bins --bamfiles test1.bam test2.bam test3.bam --minMappingQuality 30 -out readCounts.npz --outRawCounts readCounts.tab
Filter the BAM coverage for chr19 only.
awk '$1 == "19"' readCounts.tab >all_bam_covrage.txt
bedtools merge combines overlapping or nearby features in a sorted BED file; -d sets the maximum distance in base pairs between features that will still be merged. For example, to merge features that are no more than 100 bp apart, one would run:
bedtools merge -i test.bed -d 100 -c 1 -o count | head -10
bedtools jaccard -a test1.bed -b test2.bed
picard CollectAlignmentSummaryMetrics \
R=genome.fasta \
I=input.bam \
O=output.txt
wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' - \
| convert2bed -i gff - \
> genes.bed
test.chr10 <- read.table("~/data/test.coverage",header=FALSE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
library(reshape)
test.chr10<-rename(test.chr10,c(V1="Chr", V2="locus", V3="depth")) # renames the header
plot(test.chr10$locus,test.chr10$depth)
library(lattice, pos=10)
xyplot(depth ~ locus, type="p", pch=16, auto.key=list(border=TRUE), par.settings=simpleTheme(pch=16),
scales=list(x=list(relation='same'), y=list(relation='same')), data=test.chr10, main="depth by locus - Chr10")   # lattice scatter plot of depth along chr10