Skip to content

Lab 3: De Novo Assembly

Ricky Woo edited this page Dec 2, 2018 · 5 revisions

Learning Goals

After this lab, you shoud be able to

  1. Understand the algorithms underlying the de novo genome assembly.
  2. Know how to program the shortest common superstring (SCS) algorithm.
  3. Know how to program the De Bruijn graph assembly algorithm.
  4. Perform a simple assembly for a small genome using Velvet.
  5. Grasp the trick for selecting the appropriate parameters and performing assessment.
  6. Visualize the assembly graph using Bandage.

Theorectical Exercise

Given that the average read length is $d$, number of reads is $m$, the total read length is $N$. And the hash length is set to $k$. For the reference genome, the size is $s$.

  1. What are the differences between Overlap-Layout-Consensus and De Bruijn graph algorithm?
  2. Why do we say that De Bruijn graph algorithm is not an overlap-based algorithm for sequence assembly?
  3. What is the time complexity for naive shortest common superstring and heuristic SCS algorithms? What about the space complexity?
  4. What are the differences between Eulerian and Hamiltonian paths in the two algorithms for short read assembly?

Programming Exercises

  1. Write a function overlap to compute the maximal overlap length between two reads. The function has three arguments: the first two are reads; the third is an integer $k \in \mathbb{N}$ to denote the minimal overlap size.
  2. Use the function overlap to write a function overlapGraph that takes two arguments: a collection (a list) of reads and an integer $k \in \mathbb{N}$. This function will compute the overlap graph for this collection of reads. YOu can use a singly linked-list to represent a graph.
  3. Write a function naiveSCS() to conduct the brute-force SCS.
  4. Implement greedy shortest common superstring (greedySCS) to reach the final assembled contigs with the sequences as well as the lengths.
  5. Write a function N50 to compute the N50 for the obtained contigs.
  6. (Optional)Use a simulation starting from a small genome to generate the reads (single-end or paired-end) to illustrate the usage of the above functions.

Velvet Exercise

Tools

Data

The data used here are compiled from SRA accession SRR2054105: https://www.ncbi.nlm.nih.gov/sra/?term=SRR2054105, and now reposited in

Introduction

In this session we will start from raw sequencing reads and use de novo assembly to organize them into contigs. We will also explore the internal graph structure to aid our understanding of the assembly approach.

$\S$Question 1: Data check

Copy the data to your directory, decompress them, and then check the data:

  1. How many reads are in this data set?
  2. How many bases?
  3. Assuming an average bacterial genome size of 4Mb, what depth of coverage do we have?

$\S$Question 2: Quality checking (QC) and read trimming

Use fastqc to check the quality of the reads in this dataset:

  • Low quality bases typically occur towards the 3'-end of Illumina reads. The lower the quality score, the higher the chance that the base is an error. This may introduce false $k$-mers into the assembly process. A good assembler should handle these gracefully.
  • Sequencing adapters are artificial sequence that can occur at the end of reads that came from fragments of DNA that were shorter than desired. The existence of the artifacts will confuse the assemblers.
  1. Using a median quality threshold of 20, how long of the 3'-end of the reads should be trimmed?
  2. Do we have adapter sequences in these reads?
  3. After trimming the reads with Trimmomatic, use fastqc to recheck the dataset again. What happens to the data?

Assemble the reads using Velvet

Choose a $K$ value, and fill the spreadsheet denovo_assembly_results.xlsx.

1. Run velveth

velveth DIR K -shortPaired -fastq.gz -separate R1.fastq.gz R2.fastq.gz

where

  • DIR: Directory name you choose to write the results to.
  • K: $k$-mer value, must be odd number.
  • -shortPaired: short ($<300$) and paired-end reads.
  • -fastq.gz: format of the input files.
  • -separate: R1 and R2 reads are in distinct files.

2. Run velvetg

time -f "%e" velvetg DIR -exp_cov auto -cov_cutoff auto

where

  • time: capturing the velvetg run time.
  • DIR: output directory, as in velveth.
  • -exp_cov: expected coverage.
  • -cov_cutoff: coverage cutoff.

$\S$Question 3: Add your results to the spreadsheet

Column Where to find the value
K-mer size Chosen by yourself.
Run time Final output line in seconds: NN.N
Average K-mer coverage Look for Estimated Coverage = NN.N
Number of contigs Look for Final graph has NNN nodes
N50 contig size Look for n50 of NNNNN
Largest contig size Look for max NNNNN
Contig length sum Look for total NNNNNNN

$\S$Question 4: Effect of $K$ on assembly

  1. Use either R or Matplotlib to draw the scatterplot of N50 versus different $K$ values.
  2. How does $K$ affect the other statistics of the assembly result? Which value of $K$, in your opinion, is doing the best job? Specify your reason.
  3. Let's examine the stats.txt file and look at the short1_cov column which is the $k$-mer coverage of each contig. What do you notice the distribution of the $k$-mer coverage (Hint: histogram)? What do the outliers correspond to?
  4. Have a look at the contigs.fa file, how many N letters occur in the assembly? What are they?

Visualize the assembly using Bandage

The final graph used by velvetg is stored in the file LastGraph. The tool, Bandage, can be used to view and explore the assembly graph.

Use Bandage to load the LastGraph file, draw the graph and change the option Random colours to Colour by read depth.

$\S$Question 5: $K$ versus the assembly graph

Since you and your classmates are running using different $K$ values, you can make comparison with your classmates, and see how $K$ affect the graph.

You can refer to the example from Bandage website on how $K$ can affect assembly graph.

$\S$Question 6: Optimal $K$

Use VelvetOptimiser to choose the optimal K value, using the N50 as the objective function.

What is the optimal $k$ value? Under such $k$ value. Also, store the statistcs in this Velvet run for the dataset.

Conclusion

Up to now, you have

  • known how the algorithms work for de novo genome assembly
  • known how to use Velvet to assemble a small genome from Illumina short reads
  • understood the role of $k$ in the assembly process
  • been able to relate the graph structure to the final contigs
  • realized the limitations of short read sequences w.r.t. genome assembly.

Enjoy yourself with the assembly journey.

A bioinformatics wiki for the course BI462.

Clone this wiki locally