v1.0.0 – Hefty mûmakil
Overview
The pipeline takes a CSV file that lists, for each species, the assembly accession number, the Ensembl species name (which may differ from the Tree of Life one), the output directory, and the geneset version.
Assembly accession numbers are optional. If one is missing, the pipeline assumes it can be retrieved from a file named ACCESSION in the standard location on disk.
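The fallback for a missing accession can be illustrated with a minimal Python sketch. This is not the pipeline's own code: the column names (outdir, ensembl_species_name, geneset_version, assembly_accession) are hypothetical, and the sketch assumes that the "standard location" is the output directory itself, which the text above does not confirm.

```python
import csv
from pathlib import Path


def resolve_accession(row, accession_filename="ACCESSION"):
    """Return the accession given in the row, or read it from a file on disk."""
    accession = (row.get("assembly_accession") or "").strip()
    if accession:
        return accession
    # Assumption: the fallback file sits directly inside the output directory.
    fallback = Path(row["outdir"]) / accession_filename
    if not fallback.is_file():
        raise FileNotFoundError(
            f"No accession in the samplesheet and {fallback} does not exist"
        )
    return fallback.read_text().strip()


def read_samplesheet(handle):
    """Parse the CSV and fill in any missing accession numbers."""
    rows = []
    for row in csv.DictReader(handle):
        row["assembly_accession"] = resolve_accession(row)
        rows.append(row)
    return rows
```

Calling read_samplesheet(open("samplesheet.csv", newline="")) would return one dictionary per species, each with a non-empty accession, or raise an error when neither source provides one.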
The pipeline downloads the Fasta files of the genes (cdna, cds, and protein sequences) as well as the GFF3 file.
All files are compressed with bgzip, and indexed with samtools faidx or tabix.
Steps involved:
- Download from Ensembl the GFF3 file, and the sequences of the genes in Fasta format.
- Compress and index all Fasta files with bgzip, samtools faidx, and samtools dict.
- Compress and index the GFF3 file with bgzip and tabix.
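The compression and indexing steps can be sketched as the command lines below, built in Python. This is an illustration rather than the pipeline's actual modules: the function names and the output name chosen for the sequence dictionary are assumptions, and tabix is assumed to receive a GFF3 file that is already sorted by coordinate.

```python
import subprocess
from pathlib import Path


def fasta_commands(fasta):
    """Commands that compress a Fasta file and build its indices."""
    gz = fasta + ".gz"
    # Assumption: the dictionary is named after the Fasta file, with a .dict suffix.
    dict_path = str(Path(fasta).with_suffix(".dict"))
    return [
        ["bgzip", fasta],                           # replaces <fasta> with <fasta>.gz
        ["samtools", "faidx", gz],                  # writes <fasta>.gz.fai and <fasta>.gz.gzi
        ["samtools", "dict", "-o", dict_path, gz],  # writes the sequence dictionary
    ]


def gff3_commands(gff3):
    """Commands that compress a GFF3 file and build its tabix index."""
    gz = gff3 + ".gz"
    return [
        ["bgzip", gff3],             # replaces <gff3> with <gff3>.gz
        ["tabix", "-p", "gff", gz],  # writes <gff3>.gz.tbi; needs sorted input
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

For example, run_all(fasta_commands("genes.cds.fa") + gff3_commands("genes.gff3")) would process one Fasta file and one GFF3 file, provided bgzip, samtools, and tabix are on the PATH.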
Dependencies
All dependencies are automatically fetched by Singularity.
- bgzip
- samtools
- tabix
- python3
- wget
- awk
- gzip