Skip to content

Public Benchmark of Long-Read Structural Variant Caller on PacBio CCS HG002 Data

Notifications You must be signed in to change notification settings

PacificBiosciences/sv-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 

Repository files navigation

Structural Variant Calling Benchmark

This benchmark is based on the publicly available PacBio CCS 15kb dataset of the Ashkenazim son HG002/NA24385. We provide each step how to reproduce the final metrics with publicly available tools.

Please let us know, if you find mistakes or want your tool added.

Comparison

At 28-fold coverage:

Run F1^ % Precision % Recall % FP FN FP+FN
pbsv 96.29 94.61 98.02 538 191 729
svim 94.12 93.76 94.48 606 532 1138
sniffles 93.56 93.30 93.83 650 595 1245

Coverage titration for 5, 10, and 28-fold:

Get tools

Information how to install conda and add the bioconda channel is available on https://bioconda.github.io/.

conda create --name sv python=3.6
source activate sv
conda install pbmm2 pbsv==2.2 svim==0.4.3 sniffles==1.0.10

Get data

  1. Create directory structure:
mkdir -p fastqs ref alns alns_merged tools/{pbsv,sniffles} giab
  1. Download genome in a bottle annotations:
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/
curl -s ${FTPDIR}/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.bed > giab/HG002_SVs_Tier1_v0.6.bed
curl -s ${FTPDIR}/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz > giab/HG002_SVs_Tier1_v0.6.vcf.gz
  1. Make sure that you have exact truvari version (check that all python dependencies are met, not described):
git clone https://github.com/spiralgenetics/truvari
(cd truvari; git reset --hard 600b4ed7)
  1. Download hg19 reference with decoys and map non-ACGT characters to N:
curl -s ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz > ref/human_hs37d5.fasta.gz
gunzip ref/human_hs37d5.fasta.gz
sed -i '/^[^>]/ y/BDEFHIJKLMNOPQRSUVWXYZbdefhijklmnopqrsuvwxyz/NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN/' ref/human_hs37d5.fasta
  1. Download hg19 tandem repeat annotations:
curl -s https://raw.githubusercontent.com/PacificBiosciences/pbsv/master/annotations/human_hs37d5.trf.bed > ref/human_hs37d5.trf.bed
  1. Download all .fastq files:
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/HG002_NA24385_son/PacBio_CCS_15kb/
for fastq in $(curl -s -l ${FTPDIR} | grep -E '.fastq$'); do curl -s ${FTPDIR}${fastq} > fastqs/${fastq}; done

Alignment

  1. Index reference:
pbmm2 index ref/human_hs37d5.fasta ref/human_hs37d5.mmi --preset CCS
  1. Align each movie, can be done in parallel:
for i in fastqs/*.fastq; do
    FILENAME="${i#*fastqs/}"
    FILEPREFIX="${FILENAME%.*}"
    pbmm2 align ref/human_hs37d5.mmi $i "alns/hg19.${FILEPREFIX}.bam" --preset CCS \
                --sort --rg '@RG\tID:${FILEPREFIX}' --sample HG2
done

Run pbsv

8a) Discover SV signatures for each alignment, can be done in parallel:

for i in alns/*.bam; do
    FILENAME="${i#*alns/}"
    FILEPREFIX="${FILENAME%.*}"
    pbsv discover $i "tools/pbsv/${FILEPREFIX}.svsig.gz" --tandem-repeats ref/human_hs37d5.trf.bed
done

8b) Call and polish SVs:

pbsv call ref/human_hs37d5.fasta tools/pbsv/*.svsig.gz tools/pbsv/hg2.pbsv.vcf --ccs -t INS,DEL
bgzip tools/pbsv/hg2.pbsv.vcf
tabix tools/pbsv/hg2.pbsv.vcf.gz

Run svim

9a) Merge and re-sort alignments:

ls alns/*.bam > alns_merged/bam_list.fn
samtools merge -b alns_merged/bam_list.fn alns_merged/hs37d5.all.bam
samtools sort -n alns_merged/hs37d5.all.bam > alns_merged/hs37d5.all.querysorted.bam

9b) Run svim:

svim alignment --cluster_max_distance 1.4 tools/svim alns_merged/hs37d5.all.querysorted.bam

9c) Prepare for truvari:

cat tools/svim/final_results.vcf \
    | sed 's/INS:NOVEL/INS/g' \
    | sed 's/DUP:INT/INS/g' \
    | sed 's/DUP:TANDEM/INS/g' \
    | awk '{ if($1 ~ /^#/) { print $0 } else { if(($5=="<DEL>" || $5=="<INS>") && $6>40) { print $0 } } }' \
    > tools/svim/hg2.svim.vcf
bgzip tools/svim/hg2.svim.vcf
tabix tools/svim/hg2.svim.vcf.gz

Run sniffles

10a) Add MD tag to BAM:

samtools calmd -bS alns_merged/hs37d5.all.bam ref/human_hs37d5.fasta > alns_merged/hs37d5.all.md.bam

10b) Run sniffles:

sniffles -s 3 --skip_parameter_estimation -m alns_merged/hs37d5.all.md.bam -v tools/sniffles/hg2.sniffles.vcf --report_seq

10c) Prepare for truvari:

cat <(cat tools/sniffles/hg2.sniffles.vcf | grep "^#") \
    <(cat tools/sniffles/hg2.sniffles.vcf | grep -vE "^#" | \
      grep 'DUP\|INS\|DEL' | sed 's/DUP/INS/g' | sort -k1,1 -k2,2g) \
    | bgzip -c > tools/sniffles/hg2.sniffles.vcf.gz
tabix tools/sniffles/hg2.sniffles.vcf.gz

Final comparison

  1. Compare to ground truth:
truvari/truvari.py -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz\
                   --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-pbsv --passonly\
                   --giabreport -r 1000 -p 0.00 -c tools/pbsv/hg2.pbsv.vcf.gz
truvari/truvari.py -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz\
                   --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-svim --passonly\
                   --giabreport -r 1000 -p 0.00 -c tools/svim/hg2.svim.vcf.gz
truvari/truvari.py -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz\
                   --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-sniffles --passonly\
                   --giabreport -r 1000 -p 0.00 -c tools/sniffles/hg2.sniffles.vcf.gz
  1. Parse results:
function sumsv() { cat $1 | grep ':' | tr -d ',' |sed "s/^[ \t]*//"| tr -d '"' |\
                   tr -d ' ' | tr ':' '\t' | awk '{ for (i=1; i<=NF; i++)  {
                     a[NR,i] = $i } } NF>p { p = NF } END { for(j=1; j<=p; j++)
                     { str=a[1,j]; for(i=2; i<=NR; i++){ str=str" "a[i,j]; } print str } }' |\
                   tail -n 1 | awk '{ printf "%1.4f\t%1.4f\t%1.4f\t%10.0f\t%10.0f\t%10.0f\n", $2,$4,$11,$1,$8,$1+$8 }';}
cat <(echo -e "Run\tF1\tPrecision\tRecall\tFP\tFN\tFP+FN")\
    <(cat <(for i in bench*; do printf $i"\t";sumsv $i/summary.txt;done) |\
    sed 's/bench//g;s/-//g' | sort -k 2 -n -r) | column -t

Or for markdown:

cat <(echo -e "Run\tF1\tPrecision\tRecall\tFP\tFN\tFP+FN")\
    <(echo -e ":-:\t:-:\t:-:\t:-:\t:-:\t:-:\t:-:")\
    <(cat <(for i in bench*; do printf $i"\t";sumsv $i/summary.txt;done) |\
    sed 's/bench//g;s/-//g' | sort -k 2 -n -r) | tr '\t' ' ' | tr -s ' ' | tr ' ' '|' | awk '{print "|"$0"|"}'

Coverage titrations

  1. Create a new base folder for each coverage and subsample the merged BAM with MD:
samtools view -bS -s 0.35 alns_merged/hs37d5.all.md.bam > ../10x/alns_merged/hs37d5.all.md.bam
samtools view -bS -s 0.18 alns_merged/hs37d5.all.md.bam > ../5x/alns_merged/hs37d5.all.md.bam

Repeat steps to run pbsv, sniffles, and svim. For pbsv, run discover only once for this merged BAM.

Plot

  1. Extract TP information:
 cat bench-pbsv/tp-base.vcf| grep -v '#' | awk '{ split($8,a,";"); for(i in a) { split(a[i],b,"="); if (b[1] == "SVLEN" || b[1] == "PctSizeSimilarity" || b[1] == "SizeDiff") printf(b[2] "\t");  } print "pbsv" }' > pbsv.tpbase
 cat bench-svim/tp-base.vcf| grep -v '#' | awk '{ split($8,a,";"); for(i in a) { split(a[i],b,"="); if (b[1] == "SVLEN" || b[1] == "PctSizeSimilarity" || b[1] == "SizeDiff") printf(b[2] "\t");  } print "svim" }' > svim.tpbase
 cat bench-sniffles/tp-base.vcf| grep -v '#' | awk '{ split($8,a,";"); for(i in a) { split(a[i],b,"="); if (b[1] == "SVLEN" || b[1] == "PctSizeSimilarity" || b[1] == "SizeDiff") printf(b[2] "\t");  } print "sniffles" }' > sniffles.tpbase
  1. Run truvari, computing sequence similarity (unsupported by svim):
truvari/truvari.py -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz\
                   --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-seq-pbsv --passonly\
                   --giabreport -r 1000 -p 0.01 -c tools/pbsv/hg2.pbsv.vcf.gz
truvari/truvari.py -f ref/human_hs37d5.fasta -b giab/HG002_SVs_Tier1_v0.6.vcf.gz\
                   --includebed giab/HG002_SVs_Tier1_v0.6.bed -o bench-seq-sniffles --passonly\
                   --giabreport -r 1000 -p 0.01 -c tools/sniffles/hg2.sniffles.vcf.gz

Extract seq info

 cat bench-pbsv/tp-base.vcf | grep -v '#' | awk '{ split($8,a,";"); for(i in a) { split(a[i],b,"="); if (b[1] == "SVLEN" || b[1] == "PctSeqSimilarity") printf(b[2] "\t");  } print "pbsv" }' > pbsv.tpbaseseq
 cat bench-sniffles/tp-base.vcf | grep -v '#' | awk '{ split($8,a,";"); for(i in a) { split(a[i],b,"="); if (b[1] == "SVLEN" || b[1] == "PctSeqSimilarity") printf(b[2] "\t");  } print "sniffles" }' > sniffles.tpbaseseq

16a) Compute perfect SV size matches:

tp1 = fread("~/pbsv.tpbase")
tp2 = fread("~/svim.tpbase")
tp3 = fread("~/sniffles.tpbase")
tp = bind_rows(tp1,tp2,tp3)
names(tp) = c("length", "pct", "diff", "tool")
tp = tp %>% mutate(type=ifelse(length<0,"DEL","INS"))
tp$length = abs(tp$length)

tp %>% mutate(bin=abs(diff)) %>% group_by(tool,bin) %>% summarize(count=n()) %>% filter(bin == 0) %>% group_by(tool) %>% summarize(sum(count)/9641)

16b) Plot TP by bins

x = bind_rows(tp %>% filter(type=="INS") %>% filter(length<1000) %>% mutate(bin=ceiling(abs(length)/100)*100) %>% group_by(tool,bin,type) %>% summarize(count=n()),
              tp %>% filter(type=="DEL") %>% filter(length<1000) %>% mutate(bin=ceiling(abs(length)/100)*100) %>% group_by(tool,bin) %>% summarize(count=n()) %>% mutate(bin=-1*bin))

m=1.5
gs=ggplot() +
  geom_rect(data=x%>%filter(tool=="pbsv"), aes(xmin=bin-30*m,xmax=bin-10*m,ymin=10,ymax=count,fill=tool))+
  geom_rect(data=x%>%filter(tool=="svim"), aes(xmin=bin-10*m,xmax=bin+10*m,ymin=10,ymax=count,fill=tool))+
  geom_rect(data=x%>%filter(tool=="sniffles"), aes(xmin=bin+10*m,xmax=bin+30*m,ymin=10,ymax=count,fill=tool))+
  scale_x_continuous(breaks=seq(-1000,1000,200),limits=c(-1050,1050))+
  ylab("Counts")+
  theme_minimal()

y = bind_rows(tp %>% filter(type=="INS") %>% filter(length<10000 & length>=1000) %>% mutate(bin=round(abs(length)/1000)*1000) %>% group_by(tool,bin) %>% summarize(count=n()),
              tp %>% filter(type=="DEL") %>% filter(length<10000 & length>=1000) %>% mutate(bin=round(abs(length)/1000)*1000) %>% group_by(tool,bin) %>% summarize(count=n()) %>% mutate(bin=-1*bin))
gl=ggplot() +
  geom_rect(data=y%>%filter(tool=="pbsv"), aes(xmin=bin-300*n,xmax=bin-100*n,ymin=0,ymax=count,fill=tool))+
  geom_rect(data=y%>%filter(tool=="svim"), aes(xmin=bin-100*n,xmax=bin+100*n,ymin=0,ymax=count,fill=tool))+
  geom_rect(data=y%>%filter(tool=="sniffles"), aes(xmin=bin+100*n,xmax=bin+300*n,ymin=0,ymax=count,fill=tool))+
  scale_x_continuous(breaks=seq(-20000,20000,2000),limits=c(-10500,10500))+
  xlab("True SV size")+
  ylab("Counts")+
  theme_minimal()


g = ggarrange(gs,gl,ncol=1,
          nrow = 2,
          common.legend = TRUE,
          legend="bottom",labels=c("A","B"), align='hv'
)

ggsave("tp.pdf",g,width=20,height=10,dpi=200,units="cm")

16c) Plot sequence similarity for insertions

tpseq1 = fread("~/pbsv.tpbaseseq")
tpseq3 = fread("~/sniffles.tpbaseseq")
tpseq = bind_rows(tpseq1,tpseq3)
names(tpseq) = c("length", "seqpct", "tool")
tpseq = tpseq %>% mutate(type=ifelse(length<0,"DEL","INS"))
tpseq$length = abs(tpseq$length)

x = tpseq %>% group_by(tool, type, seqpct) %>% summarize(count=n()) %>% arrange(desc(seqpct)) %>% mutate(s=cumsum(count))
g = ggplot(x %>% filter(type == "INS")) +
  geom_line(aes(seqpct, s,col=tool),geom="line")+
  xlim(c(1.0,0.8)) +
  xlab("Sequence similarity")+
  ylab("Cumulative insertion counts")+
  theme_minimal()
ggsave("sequence_similarity.pdf",g,width=20,height=10,dpi=200,units="cm")

About

Public Benchmark of Long-Read Structural Variant Caller on PacBio CCS HG002 Data

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published