Lab 3: De Novo Assembly
After this lab, you should be able to
- Understand the algorithms underlying de novo genome assembly.
- Know how to program the shortest common superstring (SCS) algorithm.
- Know how to program the De Bruijn graph assembly algorithm.
- Perform a simple assembly for a small genome using `Velvet`.
- Understand how to select appropriate parameters and assess the resulting assembly.
- Visualize the assembly graph using `Bandage`.
Given that the average read length is
- What are the differences between the Overlap-Layout-Consensus and De Bruijn graph algorithms?
- Why do we say that the De Bruijn graph algorithm is not an overlap-based algorithm for sequence assembly?
- What are the time complexities of the naive shortest common superstring (SCS) and heuristic SCS algorithms? What about their space complexities?
- What are the differences between Eulerian and Hamiltonian paths in the two algorithms for short read assembly?
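To make the Eulerian-path view concrete, here is a minimal Python sketch of De Bruijn graph assembly. The function names (`de_bruijn_graph`, `eulerian_path`, `spell`) are illustrative and not part of the lab. The sketch assumes error-free reads, keeps only distinct $k$-mers (real assemblers also track how often each $k$-mer occurs), and assumes that an Eulerian path exists.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; every distinct k-mer contributes one directed edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Walk every edge exactly once (Hierholzer's algorithm).

    Assumption: the graph actually has an Eulerian path or cycle.
    """
    in_deg = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_deg[node] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1),
        next(iter(graph)),
    )
    edges = {u: list(vs) for u, vs in graph.items()}  # working copy
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last letter of every later node."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For example, `spell(eulerian_path(de_bruijn_graph(["ATGCG", "GCGTA", "GTAC"], 3)))` reconstructs `ATGCGTAC`. Note that no pairwise read overlaps are computed anywhere: the reads only serve as a source of $k$-mers.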
- Write a function `overlap` to compute the maximal overlap length between two reads. The function has three arguments: the first two are reads; the third is an integer $k \in \mathbb{N}$ that denotes the minimal overlap size.
- Use the function `overlap` to write a function `overlapGraph` that takes two arguments: a collection (a list) of reads and an integer $k \in \mathbb{N}$. This function computes the overlap graph for this collection of reads. You can use a singly linked list to represent the graph.
- Write a function `naiveSCS()` to conduct the brute-force SCS.
- Implement greedy shortest common superstring (`greedySCS`) to obtain the final assembled contigs, reporting their sequences as well as their lengths.
- Write a function `N50` to compute the N50 for the obtained contigs.
- (Optional) Use a simulation starting from a small genome to generate reads (single-end or paired-end) to illustrate the usage of the above functions.
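As a starting point for the exercises above, here is one possible Python sketch of the overlap, overlap graph, and N50 computations. It is not a reference solution: the snake_case names, the dictionary-of-lists graph representation (used here instead of a linked list), and the convention of returning 0 when no overlap of at least $k$ exists are all assumptions made for illustration.

```python
def overlap(a, b, k):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of length >= k exists.
    """
    for length in range(min(len(a), len(b)), k - 1, -1):
        if length > 0 and a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads, k):
    """Adjacency lists: graph[a] holds (b, overlap_length) for every edge a -> b."""
    graph = {read: [] for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                olen = overlap(a, b, k)
                if olen > 0:
                    graph[a].append((b, olen))
    return graph


def n50(lengths):
    """Largest length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input
```

The all-pairs loop in `overlap_graph` is quadratic in the number of reads, which is acceptable for a toy example but is one reason overlap-based assembly becomes expensive for large short-read datasets.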
- `FastQC`: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- `Trimmomatic`: http://www.usadellab.org/cms/?page=trimmomatic
- `Velvet`: https://www.ebi.ac.uk/~zerbino/velvet/
- `Bandage`: https://rrwick.github.io/Bandage/
The data used here are compiled from SRA accession SRR2054105: https://www.ncbi.nlm.nih.gov/sra/?term=SRR2054105, and now reposited in
In this session we will start from raw sequencing reads and use de novo assembly to organize them into contigs. We will also explore the internal graph structure to aid our understanding of the assembly approach.
Copy the data to your directory, decompress them, and then check the data:
- How many reads are in this data set?
- How many bases?
- Assuming an average bacterial genome size of 4Mb, what depth of coverage do we have?
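The three questions above can be answered with standard command-line tools, or with a short script. The following Python sketch counts reads and bases and derives the depth of coverage as total bases divided by genome size. The function names are hypothetical, and the code assumes well-formed FASTQ with exactly four lines per record (no wrapped sequences).

```python
import gzip


def fastq_stats(path):
    """Return (number of reads, number of bases) for a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    reads = bases = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases


def depth_of_coverage(total_bases, genome_size):
    """Average number of times each genome position is covered by sequenced bases."""
    return total_bases / genome_size
```

For paired-end data, sum the base counts of both files (for example `R1.fastq.gz` and `R2.fastq.gz`) before dividing by the assumed genome size of 4,000,000.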
Use `fastqc` to check the quality of the reads in this dataset:
- Low quality bases typically occur towards the 3'-end of Illumina reads. The lower the quality score, the higher the chance that the base is an error. This may introduce false $k$-mers into the assembly process. A good assembler should handle these gracefully.
- Sequencing adapters are artificial sequences that can occur at the end of reads that came from DNA fragments shorter than desired. These artifacts will confuse the assemblers.
- Using a median quality threshold of 20, how many bases should be trimmed from the 3'-end of the reads?
- Do we have adapter sequences in these reads?
- After trimming the reads with `Trimmomatic`, use `fastqc` to check the dataset again. What happens to the data?
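The relationship between quality score and error probability mentioned above is $P = 10^{-Q/10}$, so a score of 20 corresponds to a 1% chance of a wrong base call. The Python sketch below decodes quality strings and finds the first read position whose median quality falls below a threshold, which mirrors what one reads off the per-base quality plot in `fastqc`. The function names are hypothetical, and the Phred+33 encoding is an assumption that should be checked against the encoding `fastqc` reports for this dataset.

```python
from statistics import median


def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores (assumes Phred+33 by default)."""
    return [ord(char) - offset for char in quality_string]


def first_low_quality_position(quality_strings, threshold=20, offset=33):
    """0-based first position whose median quality across reads is below `threshold`.

    Returns None if the median never drops below the threshold.
    """
    longest = max(len(q) for q in quality_strings)
    for pos in range(longest):
        column = [ord(q[pos]) - offset for q in quality_strings if pos < len(q)]
        if median(column) < threshold:
            return pos
    return None
```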
Choose a denovo_assembly_results.xlsx
.
velveth DIR K -shortPaired -fastq.gz -separate R1.fastq.gz R2.fastq.gz
where
- `DIR`: the directory name you choose to write the results to.
- `K`: the $k$-mer value, which must be an odd number.
- `-shortPaired`: short ($<300$) and paired-end reads.
- `-fastq.gz`: format of the input files.
- `-separate`: R1 and R2 reads are in distinct files.
time -f "%e" velvetg DIR -exp_cov auto -cov_cutoff auto
where
- `time`: captures the `velvetg` run time.
- `DIR`: output directory, as in `velveth`.
- `-exp_cov`: expected coverage.
- `-cov_cutoff`: coverage cutoff.
Column | Where to find the value |
---|---|
K-mer size | Chosen by yourself. |
Run time | Final output line in seconds: `NN.N` |
Average K-mer coverage | Look for `Estimated Coverage = NN.N` |
Number of contigs | Look for `Final graph has NNN nodes` |
N50 contig size | Look for `n50 of NNNNN` |
Largest contig size | Look for `max NNNNN` |
Contig length sum | Look for `total NNNNNNN` |
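If you save the `velvetg` output to a file for each run, the values in the table can be collected automatically. The Python sketch below searches for the phrases listed in the table. The function name and dictionary keys are hypothetical, and the exact wording of the log may differ between Velvet versions, so check the patterns against your own output.

```python
import re

# One pattern per table row; the wording is taken from the table above.
PATTERNS = {
    "kmer_coverage": (r"Estimated Coverage = (\d+(?:\.\d+)?)", float),
    "n_contigs": (r"Final graph has (\d+) nodes", int),
    "n50": (r"n50 of (\d+)", int),
    "largest_contig": (r"max (\d+)", int),
    "total_length": (r"total (\d+)", int),
}


def parse_velvetg_log(text):
    """Extract assembly statistics from captured velvetg output.

    Uses the last match of each pattern; statistics that are not found are omitted.
    """
    stats = {}
    for name, (pattern, convert) in PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            stats[name] = convert(matches[-1])
    return stats
```

Collecting one dictionary per $K$ value gives the rows needed to fill in the results spreadsheet.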
- Use either R or Matplotlib to draw a scatterplot of `N50` versus the different $K$ values.
- How does $K$ affect the other statistics of the assembly result? Which value of $K$, in your opinion, does the best job? Give your reasons.
- Let's examine the `stats.txt` file and look at the `short1_cov` column, which is the $k$-mer coverage of each contig. What do you notice about the distribution of the $k$-mer coverage (Hint: histogram)? What do the outliers correspond to?
- Have a look at the `contigs.fa` file. How many `N` letters occur in the assembly? What are they?
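For the last two questions, a small script can do the counting. The Python sketch below bins the `short1_cov` column of `stats.txt` into a histogram and counts `N` letters in `contigs.fa`. The function names are hypothetical, and the code assumes that `stats.txt` is tab-delimited with a header row; entries that are not finite numbers are skipped.

```python
import csv
import math
from collections import Counter


def coverage_histogram(stats_path, column="short1_cov", bin_width=5):
    """Map the lower edge of each coverage bin to the number of nodes in that bin."""
    counts = Counter()
    with open(stats_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            try:
                value = float(row[column])
            except ValueError:
                continue  # not a number at all
            if math.isfinite(value):  # skip inf / nan entries
                counts[int(value // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


def count_n(fasta_path):
    """Count N/n letters in the sequence lines of a FASTA file (headers are ignored)."""
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += line.upper().count("N")
    return total
```

The dictionary returned by `coverage_histogram` can be passed to a bar plot in Matplotlib, or simply printed to spot bins far away from the main peak.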
The final graph used by `velvetg` is stored in the file `LastGraph`. The tool `Bandage` can be used to view and explore the assembly graph.
Use Bandage to load the `LastGraph` file, draw the graph, and change the option `Random colours` to `Colour by read depth`.
Since you and your classmates are running using different
You can refer to the example from the `Bandage` website on how
Use `VelvetOptimiser` to choose the optimal `K` value, using the `N50` as the objective function.
What is the optimal `Velvet` run for the dataset?
By now, you have
- learned how the algorithms for de novo genome assembly work
- learned how to use `Velvet` to assemble a small genome from Illumina short reads
- understood the role of $k$ in the assembly process
- been able to relate the graph structure to the final contigs
- realized the limitations of short read sequences w.r.t. genome assembly.
Enjoy yourself with the assembly journey.
On the way to the garden of bioinformatics.
A bioinformatics wiki for the course BI462.