
Bisulfite sequencing

Xin He edited this page Apr 10, 2019 · 20 revisions

WHAT?

  • Goal: determine which cytosines (C) in a DNA sequence are methylated.
  • Unmethylated C is converted to uracil (U), which pairs with A and is therefore read as T after PCR. C -> U:A
  • Methylated C remains intact.
    • before: C C C C G G G C G G A A G C T G C G G G C G G
    • after : T T T T G G G T G G A A G T T G C G G G C G G
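The before/after example above can be reproduced with a short in-silico conversion. This is a minimal sketch, not a model of the chemistry: the function names and the `methylated` position set are hypothetical, and positions are 0-based.

```python
def bisulfite_convert(seq, methylated):
    """Return the sequence as it is read after bisulfite treatment and PCR.

    seq        -- original DNA sequence (string of A/C/G/T)
    methylated -- set of 0-based positions of methylated cytosines
                  (hypothetical input, chosen for this illustration)
    """
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            out.append("T")  # unmethylated C -> U, read as T after PCR
        else:
            out.append(base)  # methylated C and all other bases are unchanged
    return "".join(out)


def call_methylated(before, after):
    """Positions where a C in the untreated sequence is still a C after treatment."""
    return [i for i, (b, a) in enumerate(zip(before, after)) if b == "C" and a == "C"]


before = "CCCCGGGCGGAAGCTGCGGGCGG"
after = bisulfite_convert(before, methylated={16, 20})
print(after)                           # TTTTGGGTGGAAGTTGCGGGCGG
print(call_methylated(before, after))  # [16, 20]
```

Comparing the treated sequence with the untreated one recovers the two methylated positions, which is the idea behind methylation calling.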


  • Limitations
    • 5-Hydroxymethylcytosine: 5-hydroxymethylcytosine, a mammalian DNA modification, converts to cytosine-5-methylsulfonate upon bisulfite treatment, which then reads as a C when sequenced. Therefore, bisulfite sequencing cannot discriminate between 5-methylcytosine and 5-hydroxymethylcytosine.
    • Incomplete conversion: Bisulfite sequencing relies on the conversion of every single unmethylated cytosine residue to uracil. If conversion is incomplete, the subsequent analysis will interpret the unconverted unmethylated cytosines as methylated cytosines, producing false positive methylation calls.
    • Degradation of DNA during bisulfite treatment: A major challenge in bisulfite sequencing is the degradation of DNA that takes place concurrently with the conversion. The conditions necessary for complete conversion, such as long incubation times, elevated temperature, and high bisulfite concentration, can degrade about 90% of the incubated DNA. Given that the starting amount of DNA is often limited, such extensive degradation can be problematic.
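To make the incomplete-conversion problem concrete, here is a small sketch under a simplified model that is an assumption of this example, not something stated in the sources: every unmethylated C converts independently with probability `conversion_rate`, and methylated Cs never convert. The function names are hypothetical.

```python
def observed_methylation(true_level, conversion_rate):
    """Fraction of reads showing C at a site under the simple model above."""
    unconverted = (1.0 - true_level) * (1.0 - conversion_rate)
    return true_level + unconverted


def corrected_methylation(observed, conversion_rate):
    """Invert the model to estimate the true level, clamped to [0, 1]."""
    if conversion_rate <= 0.0:
        raise ValueError("conversion_rate must be positive")
    estimate = (observed - (1.0 - conversion_rate)) / conversion_rate
    return min(1.0, max(0.0, estimate))


# A completely unmethylated site looks partially methylated
# as soon as conversion is less than 100% efficient.
for rate in (1.0, 0.99, 0.95):
    apparent = observed_methylation(0.0, rate)
    print(f"conversion {rate:.2f}: apparent methylation {apparent:.3f}")
```

Under this model, a 95% conversion rate alone makes an unmethylated site appear about 5% methylated.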

Mapping

GEM(7), BSMAP(488), VAliBS(1), BS-Seeker2(155), BISMARK(1363)


Bismark Bisulfite-converted duplexes for the strand-specific detection and quantification of rare mutations

  • First published in 2011; the latest documentation is from 2016.
  • This is a Perl program that uses Bowtie 2.



VAliBS: a visual aligner for bisulfite sequences

  • Published in 2017. This is a GUI program written in C.

  • To obtain methylation information, the DNA is first separated into two single strands (the underlined letter C marks a methylated cytosine). After bisulfite treatment, non-methylated cytosine (C) is converted to uracil (U). PCR then replaces U with thymine (T), and at the same time a double strand is synthesized from each single strand (as shown in step 2 of Fig. 1). Unlike normal mapping, bisulfite mapping allows a T in the read to match a C in the reference, and an A to match a G. By comparing the bisulfite-treated sequences with the untreated sequence, we can identify which cytosines are methylated.

  • Existing mapping tools for bisulfite-treated sequences fall into two groups: wild-card aligners and three-letter aligners. Wild-card aligners replace cytosines in the reference with the wild-card nucleotide Y, which matches both C and T in the reads, so that bisulfite conversions are not counted as mismatches.

  • Three-letter aligners, on the other hand, convert C to T in both the sequenced reads and the reference genome before mapping the reads with a modified conventional aligner. The three-letter strategy makes it easier to reuse a non-bisulfite aligner as an internal module, and as non-bisulfite aligners improve, the internal module can simply be replaced.
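The three-letter strategy can be sketched in a few lines. This is an illustration only: exact substring search stands in for the internal aligner, only forward-strand reads are handled (reads from the opposite strand would need the analogous G -> A conversion), and all names and sequences are made up for the example.

```python
def c_to_t(seq):
    """Collapse the alphabet to three letters by converting every C to T."""
    return seq.replace("C", "T")


def three_letter_map(read, reference):
    """Return every 0-based position where the converted read matches
    the converted reference exactly."""
    conv_read = c_to_t(read)
    conv_ref = c_to_t(reference)
    hits = []
    start = conv_ref.find(conv_read)
    while start != -1:
        hits.append(start)
        start = conv_ref.find(conv_read, start + 1)
    return hits


def methylation_calls(read, reference, pos):
    """For each reference C covered by the read, report whether the
    original (unconverted) read still shows a C there."""
    calls = {}
    for offset, base in enumerate(read):
        if reference[pos + offset] == "C":
            calls[pos + offset] = (base == "C")
    return calls


reference = "AAGCTGCGGGCGGATTACA"
read = "GTTGCGGGCGG"  # toy bisulfite read from the forward strand

hits = three_letter_map(read, reference)
print(hits)                                       # [2]
print(methylation_calls(read, reference, hits[0]))  # {3: False, 6: True, 10: True}
```

Mapping happens entirely in the converted alphabet; the methylation state is recovered afterwards by going back to the original read and reference.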


Ref

Genestack tutorial:

  • Setting up a WGBS experiment
  • Quality control of bisulfite sequencing reads
  • Preprocessing of the raw reads: trimming adaptors, contaminants and low quality bases
  • Bisulfite sequencing mapping of the preprocessed reads onto a reference genome
  • Merging the mapped reads
  • Quality control of the mapped reads
  • Methylation ratio analysis
    • Trim N end-repairing fill-in bases: set to “3”. This option trims 3 bases from the read end to remove the DNA overhangs created during end repair in library preparation. This matters because end repair may introduce artefacts if the repaired bases contain methylated cytosines.
    • Report only unique mappings
    • Discard duplicated reads: removes duplicated reads, which have identical sequences and could be the result of library preparation. Such reads could map to the same position and distort the results of downstream analysis.
  • Exploring the genome methylation levels in Genome Browser
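The options listed under methylation ratio analysis can be sketched as follows. This is not the Genestack implementation: the record layout, the function names, the choice to trim from the end of the read string, and the ratio definition (C reads over C + T reads at a cytosine) are all assumptions made for illustration.

```python
def trim_fill_in(seq, qual, n=3):
    """Trim n bases, and their quality values, from the end of a read."""
    if n <= 0:
        return seq, qual
    return seq[:-n], qual[:-n]


def discard_duplicates(reads):
    """Keep only the first read seen for each distinct sequence.

    reads -- list of (read_id, seq, qual) tuples (hypothetical record layout)
    """
    seen = set()
    kept = []
    for read_id, seq, qual in reads:
        if seq in seen:
            continue  # identical sequence already kept once
        seen.add(seq)
        kept.append((read_id, seq, qual))
    return kept


def methylation_ratio(c_count, t_count):
    """Methylated fraction at one cytosine: C reads over C + T reads."""
    total = c_count + t_count
    return c_count / total if total else None


reads = [
    ("r1", "TTGCGGATTA", "IIIIIIIIII"),
    ("r2", "TTGCGGATTA", "IIIIIIIIII"),  # duplicate of r1
    ("r3", "TTGTGGATTA", "IIIIIIIIII"),
]
unique = discard_duplicates(reads)
trimmed = [(rid, *trim_fill_in(seq, qual)) for rid, seq, qual in unique]
print(trimmed)
print(methylation_ratio(c_count=1, t_count=1))
```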

Optimized Workflow for Bisulfite Sequencing Data Analysis

  • The Bismark toolkit uses a three-letter mapping algorithm, with Bowtie as its internal aligner.

Checking bsmap

  • paper
    • In the mammalian genome, although ~19% of the bases are Cs and another 19% are Gs, only ~1.8% of dinucleotides are CpG dinucleotides.
    • As a result, we expect the overall C content of bisulfite reads to be reduced by ~50%.
    • We used the general premise that all the C positions in the genome, where the asymmetric C/T transition can occur, are already known and can be used to guide the mapping of bisulfite reads. BSMAP masks Ts in the bisulfite reads as Cs (i.e., reverse bisulfite conversion) only at C positions in the original reference while keeping all other Ts in the bisulfite reads unchanged. BSMAP then maps the masked BS read directly to the reference. The asymmetric C/T conversion is achieved through position-specific bitwise masking of the bisulfite reads.
    • If the reference base is C, a T or C in the read does not count as a mismatch.
    • BSMAP can also report non-unique multiple hits with a user-defined maximum number of mismatches.
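The asymmetric matching rule can be illustrated with simple bit masks. This is a sketch of the idea only, not BSMAP's actual data structures: the one-hot base encoding, the brute-force scan over all positions, and the function names are assumptions of this example.

```python
# One-hot bit codes for the four bases (an encoding chosen for this sketch).
A, C, G, T = 1, 2, 4, 8
CODE = {"A": A, "C": C, "G": G, "T": T}


def reference_masks(reference):
    """Bit mask of the read bases accepted at each reference position."""
    masks = []
    for base in reference:
        mask = CODE[base]
        if base == "C":
            mask |= T  # a reference C also accepts a read T (bisulfite conversion)
        masks.append(mask)
    return masks


def count_mismatches(read, masks, pos):
    """Number of read bases not accepted by the reference masks at pos."""
    return sum(
        1 for offset, base in enumerate(read)
        if not (CODE[base] & masks[pos + offset])
    )


def map_read(read, reference, max_mismatches=0):
    """Report every (position, mismatches) hit within the mismatch limit,
    including non-unique hits."""
    masks = reference_masks(reference)
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = count_mismatches(read, masks, pos)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


print(map_read("TGTT", "ACGTTCGA"))  # [(1, 0)]: read T matches reference C
print(map_read("CG", "TG"))          # []: read C does not match reference T
```

The second call shows the asymmetry: a T in the read is tolerated at a reference C, but a C in the read is still a mismatch at a reference T.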

checking


Checking bismark

We decided to try the bismark program. It is one of the most cited tools for bisulfite sequencing analysis, and it is still actively being maintained.

  • It turns out that Bismark only reports uniquely mapped reads, for this reason. This is somewhat problematic because Sargasso uses multi-mapped reads in its separation decision making.

We can try to implement the mapping ourselves.

  • bs_genome
  • bs_read
  • bowtie2
  • use read_id to map each bs_read back to its original read

The bismark_genome_preparation script from Bismark creates two Bowtie 2 genome indexes, one for the C->T converted genome and one for the G->A converted genome. We need to write scripts to:

  1. convert original reads to bs_reads
  2. convert mapped bs_read back to original reads.
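The two scripts could start from something like the sketch below. The record layouts, the function names, and the parsed form of the aligner output (`read_id`, `position`) are hypothetical; only the C -> T read conversion is shown, and the sketch does not run or parse Bowtie 2 itself.

```python
def convert_reads(reads):
    """Step 1: convert original reads to bs_reads.

    reads -- iterable of (read_id, seq, qual) tuples (hypothetical layout)
    Returns the converted reads and a read_id -> original sequence lookup.
    """
    bs_reads = []
    originals = {}
    for read_id, seq, qual in reads:
        originals[read_id] = seq  # remember the unconverted sequence
        bs_reads.append((read_id, seq.replace("C", "T"), qual))
    return bs_reads, originals


def restore_reads(mapped, originals):
    """Step 2: attach the original sequence to each mapped bs_read.

    mapped -- iterable of (read_id, position) pairs, an assumed
              simplification of the aligner output
    """
    return [(read_id, pos, originals[read_id]) for read_id, pos in mapped]


reads = [("read_1", "ACGTCC", "IIIIII")]
bs_reads, originals = convert_reads(reads)
print(bs_reads)  # [('read_1', 'ATGTTT', 'IIIIII')]

mapped = [("read_1", 120)]  # made-up mapping result for illustration
print(restore_reads(mapped, originals))  # [('read_1', 120, 'ACGTCC')]
```

Keying the lookup on read_id keeps the conversion reversible without having to infer the original bases from the alignment.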

questions:

  • Bisulfite treatment reduces the sequence complexity, which makes reads more likely to map to multiple locations. How does this affect the separation?
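The loss of complexity can be seen on a toy sequence by counting how many k-mers stop being unique after C -> T conversion. This only illustrates the effect; it does not answer the question for real genomes, and the sequence, k, and function names are made up for the example.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ambiguous_positions(seq, k):
    """Number of k-mer start positions whose k-mer occurs more than once."""
    counts = Counter(kmers(seq, k))
    return sum(1 for kmer in kmers(seq, k) if counts[kmer] > 1)


genome = "ACGATTGATCGA"
converted = genome.replace("C", "T")

print(ambiguous_positions(genome, 4))     # 0: every 4-mer is unique
print(ambiguous_positions(converted, 4))  # 8: most 4-mers now occur twice
```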