Lab 3: De Novo Assembly

Learning Goals

After this lab, you shoud be able to

Understand the algorithms underlying the de novo genome assembly.
Know how to program the shortest common superstring (SCS) algorithm.
Know how to program the De Bruijn graph assembly algorithm.
Perform a simple assembly for a small genome using Velvet.
Grasp the trick for selecting the appropriate parameters and performing assessment.
Visualize the assembly graph using Bandage.

Theorectical Exercise

Given that the average read length is $d$, number of reads is $m$, the total read length is $N$. And the hash length is set to $k$. For the reference genome, the size is $s$.

What are the differences between Overlap-Layout-Consensus and De Bruijn graph algorithm?
Why do we say that De Bruijn graph algorithm is not an overlap-based algorithm for sequence assembly?
What is the time complexity for naive shortest common superstring and heuristic SCS algorithms? What about the space complexity?
What are the differences between Eulerian and Hamiltonian paths in the two algorithms for short read assembly?

Programming Exercises

Write a function overlap to compute the maximal overlap length between two reads. The function has three arguments: the first two are reads; the third is an integer $k \in \mathbb{N}$ to denote the minimal overlap size.
Use the function overlap to write a function overlapGraph that takes two arguments: a collection (a list) of reads and an integer $k \in \mathbb{N}$. This function will compute the overlap graph for this collection of reads. YOu can use a singly linked-list to represent a graph.
Write a function naiveSCS() to conduct the brute-force SCS.
Implement greedy shortest common superstring (greedySCS) to reach the final assembled contigs with the sequences as well as the lengths.
Write a function N50 to compute the N50 for the obtained contigs.
(Optional)Use a simulation starting from a small genome to generate the reads (single-end or paired-end) to illustrate the usage of the above functions.

Velvet Exercise

Tools

FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic
Velvet: https://www.ebi.ac.uk/~zerbino/velvet/
Bandage: https://rrwick.github.io/Bandage/

Data

The data used here are compiled from SRA accession SRR2054105: https://www.ncbi.nlm.nih.gov/sra/?term=SRR2054105, and now reposited in

/share/data/course/bi462/lab3.tgz

Introduction

In this session we will start from raw sequencing reads and use de novo assembly to organize them into contigs. We will also explore the internal graph structure to aid our understanding of the assembly approach.

$\S$Question 1: Data check

Copy the data to your directory, decompress them, and then check the data:

How many reads are in this data set?
How many bases?
Assuming an average bacterial genome size of 4Mb, what depth of coverage do we have?

$\S$Question 2: Quality checking (QC) and read trimming

Use fastqc to check the quality of the reads in this dataset:

Low quality bases typically occur towards the 3'-end of Illumina reads. The lower the quality score, the higher the chance that the base is an error. This may introduce false $k$-mers into the assembly process. A good assembler should handle these gracefully.
Sequencing adapters are artificial sequence that can occur at the end of reads that came from fragments of DNA that were shorter than desired. The existence of the artifacts will confuse the assemblers.

Using a median quality threshold of 20, how long of the 3'-end of the reads should be trimmed?
Do we have adapter sequences in these reads?
After trimming the reads with Trimmomatic, use fastqc to recheck the dataset again. What happens to the data?

Assemble the reads using `Velvet`

Choose a $K$ value, and fill the spreadsheet denovo_assembly_results.xlsx.

1. Run `velveth`

velveth DIR K -shortPaired -fastq.gz -separate R1.fastq.gz R2.fastq.gz

where

DIR: Directory name you choose to write the results to.
K: $k$-mer value, must be odd number.
-shortPaired: short ($<300$) and paired-end reads.
-fastq.gz: format of the input files.
-separate: R1 and R2 reads are in distinct files.

2. Run `velvetg`

time -f "%e" velvetg DIR -exp_cov auto -cov_cutoff auto

where

time: capturing the velvetg run time.
DIR: output directory, as in velveth.
-exp_cov: expected coverage.
-cov_cutoff: coverage cutoff.

$\S$Question 3: Add your results to the spreadsheet

Column	Where to find the value
K-mer size	Chosen by yourself.
Run time	Final output line in seconds: NN.N
Average K-mer coverage	Look for `Estimated Coverage = NN.N`
Number of contigs	Look for `Final graph has NNN nodes`
N50 contig size	Look for `n50 of NNNNN`
Largest contig size	Look for `max NNNNN`
Contig length sum	Look for `total NNNNNNN`

$\S$Question 4: Effect of $K$ on assembly

Use either R or Matplotlib to draw the scatterplot of N50 versus different $K$ values.
How does $K$ affect the other statistics of the assembly result? Which value of $K$, in your opinion, is doing the best job? Specify your reason.
Let's examine the stats.txt file and look at the short1_cov column which is the $k$-mer coverage of each contig. What do you notice the distribution of the $k$-mer coverage (Hint: histogram)? What do the outliers correspond to?
Have a look at the contigs.fa file, how many N letters occur in the assembly? What are they?

Visualize the assembly using `Bandage`

The final graph used by velvetg is stored in the file LastGraph. The tool, Bandage, can be used to view and explore the assembly graph.

Use Bandage to load the LastGraph file, draw the graph and change the option Random colours to Colour by read depth.

$\S$Question 5: $K$ versus the assembly graph

Since you and your classmates are running using different $K$ values, you can make comparison with your classmates, and see how $K$ affect the graph.

You can refer to the example from Bandage website on how $K$ can affect assembly graph.

$\S$Question 6: Optimal $K$

Use VelvetOptimiser to choose the optimal K value, using the N50 as the objective function.

What is the optimal $k$ value? Under such $k$ value. Also, store the statistcs in this Velvet run for the dataset.

Conclusion

Up to now, you have

known how the algorithms work for de novo genome assembly
known how to use Velvet to assemble a small genome from Illumina short reads
understood the role of $k$ in the assembly process
been able to relate the graph structure to the final contigs
realized the limitations of short read sequences w.r.t. genome assembly.

Enjoy yourself with the assembly journey.

On the way to the garden of bioinformatics.

A bioinformatics wiki for the course BI462.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lab 3: De Novo Assembly

Learning Goals

Theorectical Exercise

Programming Exercises

Velvet Exercise

Tools

Data

Introduction

$\S$Question 1: Data check

$\S$Question 2: Quality checking (QC) and read trimming

Assemble the reads using `Velvet`

1. Run `velveth`

2. Run `velvetg`

$\S$Question 3: Add your results to the spreadsheet

$\S$Question 4: Effect of $K$ on assembly

Visualize the assembly using `Bandage`

$\S$Question 5: $K$ versus the assembly graph

$\S$Question 6: Optimal $K$

Conclusion

Clone this wiki locally

Lab 3: De Novo Assembly

Learning Goals

Theorectical Exercise

Programming Exercises

Velvet Exercise

Tools

Data

Introduction

$\S$Question 1: Data check

$\S$Question 2: Quality checking (QC) and read trimming

Assemble the reads using Velvet

1. Run velveth

2. Run velvetg

$\S$Question 3: Add your results to the spreadsheet

$\S$Question 4: Effect of $K$ on assembly

Visualize the assembly using Bandage

$\S$Question 5: $K$ versus the assembly graph

$\S$Question 6: Optimal $K$

Conclusion

Clone this wiki locally

Assemble the reads using `Velvet`

1. Run `velveth`

2. Run `velvetg`

Visualize the assembly using `Bandage`