Add support for generating taxprofiler/funcscan input samplesheets for preprocessed FASTQs/FASTAs #688

Draft: wants to merge 19 commits into base: `dev` (changes shown from 6 commits)
4 changes: 4 additions & 0 deletions conf/test_hybrid.config
@@ -27,4 +27,8 @@ params {
skip_gtdbtk = true
gtdbtk_min_completeness = 0
skip_concoct = true

// Generate downstream samplesheets
generate_downstream_samplesheets = true
generate_pipeline_samplesheets = "funcscan,taxprofiler"
}
25 changes: 25 additions & 0 deletions docs/output.md
@@ -707,6 +707,9 @@ Because of aDNA damage, _de novo_ assemblers sometimes struggle to call a correc

</details>

The pipeline can also generate downstream pipeline input samplesheets.
These are stored in `<outdir>/downstream_samplesheets`.

### MultiQC

<details markdown="1">
@@ -751,3 +754,25 @@ Summary tool-specific plots and tables of following tools are currently displaye
</details>

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

### Downstream samplesheets

The pipeline can also generate input files for the following downstream pipelines:

- [nf-core/funcscan](https://nf-co.re/funcscan)
- [nf-core/taxprofiler](https://nf-co.re/taxprofiler)

<details markdown="1">
<summary>Output files</summary>

- `downstream_samplesheets/`
- `funcscan.csv`: Filled out nf-core/funcscan `--input` csv with absolute paths to the assembly FASTA files produced by MAG (MEGAHIT, SPAdes, SPAdesHybrid)
- `taxprofiler.csv`: Partially filled out nf-core/taxprofiler `--input` csv with paths to the preprocessed short-read `.fastq.gz` files relative to the results directory

</details>

:::warning
Any generated downstream samplesheet is provided as 'best effort' and is not guaranteed to work straight out of the box!
It may not be complete (e.g. some columns may need to be manually filled in).
:::
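Since some columns are left for the user to complete, a small post-processing step can finish the job. A minimal Python sketch, assuming the taxprofiler column layout the subworkflow emits (`sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta`) and a hypothetical all-Illumina run:

```python
import csv
import io

def fill_missing_platform(csv_text: str, platform: str = "ILLUMINA") -> str:
    """Fill the empty instrument_platform column of a generated taxprofiler samplesheet."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if not row.get("instrument_platform"):
            row["instrument_platform"] = platform
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Illustrative input resembling what the subworkflow would write
sheet = (
    "sample,run_accession,instrument_platform,fastq_1,fastq_2,fasta\n"
    "s1,s1,,/results/QC_shortreads/fastp/s1/s1_1.fastq.gz,/results/QC_shortreads/fastp/s1/s1_2.fastq.gz,\n"
)
print(fill_missing_platform(sheet))
```

The sample name, paths, and `ILLUMINA` default here are illustrative; check your actual samplesheet against the taxprofiler usage docs before launching.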
3 changes: 3 additions & 0 deletions nextflow.config
@@ -194,6 +194,9 @@ params {
validationShowHiddenParams = false
validate_params = true

// Generate downstream samplesheets
generate_downstream_samplesheets = false
generate_pipeline_samplesheets = "funcscan,taxprofiler"
Contributor: Should this default to `null` so that users have to opt in to samplesheet generation?

Member: Yes, that's a good idea! I will set that in createtaxdb too.

Reply: What's the difference? In both cases (null/false as default) it would be an opt-in by the user, no?

Reply: Ah, maybe I misunderstood Carson... not sure :D

}

// Load base.config by default for all pipelines
22 changes: 22 additions & 0 deletions nextflow_schema.json
@@ -83,6 +83,25 @@
}
}
},
"generate_samplesheet_options": {
"title": "Downstream pipeline samplesheet generation options",
"type": "object",
"fa_icon": "fas fa-align-justify",
"description": "Options for generating input samplesheets for complementary downstream pipelines.",
"properties": {
"generate_downstream_samplesheets": {
"type": "boolean",
"description": "Turn on generation of samplesheets for downstream pipelines.",
"fa_icon": "fas fa-toggle-on"
},
"generate_pipeline_samplesheets": {
"type": "string",
"default": "funcscan,taxprofiler",
"description": "Specify which pipeline(s) to generate samplesheets for, as a comma-separated list.",
"fa_icon": "fas fa-toolbox"
}
}
},
"institutional_config_options": {
"title": "Institutional config options",
"type": "object",
@@ -914,6 +933,9 @@
{
"$ref": "#/definitions/reference_genome_options"
},
{
"$ref": "#/definitions/generate_samplesheet_options"
},
{
"$ref": "#/definitions/institutional_config_options"
},
86 changes: 86 additions & 0 deletions subworkflows/local/generate_downstream_samplesheets/main.nf
Contributor: It looks like @jfy133 used only one workflow, which will selectively generate samplesheets based on `params.generate_pipeline_samplesheets`. Do you think it would be best to keep that consistent?

Contributor: Also, since FastQ files are being pulled from the publishDir, it might be a good idea to include options that override user inputs for `params.publish_dir_mode` (so that it is always 'copy' if a samplesheet is generated) and `params.save_clipped_reads`, `params.save_phixremoved_reads` ...etc, so that the preprocessed FastQ files are published to `params.outdir` if a downstream samplesheet is generated.
@@ -0,0 +1,86 @@
//
// Subworkflow with functionality specific to the nf-core/mag pipeline
//

workflow SAMPLESHEET_TAXPROFILER {
take:
ch_reads

main:
def fastq_rel_path = '/'
if (params.bbnorm) {
    fastq_rel_path = '/bbmap/bbnorm/'
} else if (!params.keep_phix) {
    fastq_rel_path = '/QC_shortreads/remove_phix/'
} else if (params.host_fasta) {
    fastq_rel_path = '/QC_shortreads/remove_host/'
} else if (!params.skip_clipping) {
    fastq_rel_path = '/QC_shortreads/fastp/'
}
ch_list_for_samplesheet = ch_reads
.map {
meta, fastq ->
def sample = meta.id
def run_accession = meta.id
def instrument_platform = ""
def fastq_1 = file(params.outdir).toString() + fastq_rel_path + meta.id + '/' + fastq[0].getName()
def fastq_2 = file(params.outdir).toString() + fastq_rel_path + meta.id + '/' + fastq[1].getName()
def fasta = ""
[ sample: sample, run_accession: run_accession, instrument_platform: instrument_platform, fastq_1: fastq_1, fastq_2: fastq_2, fasta: fasta ]
}
.tap{ ch_header }

ch_header
.first()
.map{ it.keySet().join(",") }
.concat( ch_list_for_samplesheet.map{ it.values().join(",") })
.collectFile(
name:"${params.outdir}/downstream_samplesheets/taxprofiler.csv",
newLine: true,
sort: false
)
}
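The header-then-rows trick above (header from the first map's `keySet()`, remaining lines from comma-joined values, written via `collectFile`) can be sketched in plain Python. Row contents here are illustrative, and, like the Groovy version, this assumes no field contains a comma:

```python
def rows_to_csv(rows):
    """Header from the first row's keys, then one comma-joined line per row."""
    header = ",".join(rows[0].keys())
    body = [",".join(str(v) for v in row.values()) for row in rows]
    return "\n".join([header] + body) + "\n"

# Illustrative row matching the taxprofiler column layout used in the workflow
rows = [
    {
        "sample": "s1",
        "run_accession": "s1",
        "instrument_platform": "",
        "fastq_1": "/out/QC_shortreads/fastp/s1/s1_1.fastq.gz",
        "fastq_2": "/out/QC_shortreads/fastp/s1/s1_2.fastq.gz",
        "fasta": "",
    },
]
print(rows_to_csv(rows))
```

A proper CSV writer would quote fields containing commas; the `keySet()`/`values()` approach trades that robustness for brevity.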

workflow SAMPLESHEET_FUNCSCAN {
take:
ch_assemblies

main:
ch_list_for_samplesheet = ch_assemblies
Member: Next thing, which I don't think will be so complicated, is to add another input channel for bins, and make an if/else statement here depending on whether they want to send just the raw assemblies (all contigs) or binned contigs to the samplesheet. It will need another pipeline-level parameter too though, `--generate_samplesheet_funcscan_seqtype` or something.

.map {
meta, filename ->
def sample = meta.id
def fasta = file(params.outdir).toString() + '/Assembly/' + meta.assembler + '/' + filename.getName()
[ sample: sample, fasta: fasta ]
}
.tap{ ch_header }

ch_header
.first()
.map{ it.keySet().join(",") }
.concat( ch_list_for_samplesheet.map{ it.values().join(",") })
.collectFile(
name:"${params.outdir}/downstream_samplesheets/funcscan.csv",
newLine: true,
sort: false
)
}

workflow GENERATE_DOWNSTREAM_SAMPLESHEETS {
take:
ch_reads
ch_assemblies

main:
def downstreampipeline_names = params.generate_pipeline_samplesheets.split(",")
Member: I've also implemented the same system in createtaxdb now, but with an additional input validation step that you should also adopt here (i.e., to check that someone doesn't add an unsupported pipeline, or makes a typo). Check the utils_nfcore_createtaxdb_pipeline file there.

Author: Done

if ( downstreampipeline_names.contains('taxprofiler') && params.save_clipped_reads ) { // save_clipped_reads must be true
SAMPLESHEET_TAXPROFILER(ch_reads)
}

if ( downstreampipeline_names.contains('funcscan') ) {
SAMPLESHEET_FUNCSCAN(ch_assemblies)
}
}
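The input validation the reviewer suggests (rejecting unsupported pipeline names or typos in `params.generate_pipeline_samplesheets`) can be sketched as follows; this is a Python stand-in for the Groovy check, and the error wording is illustrative:

```python
# Pipelines this subworkflow knows how to generate samplesheets for
SUPPORTED_PIPELINES = {"funcscan", "taxprofiler"}

def validate_pipeline_samplesheets(param: str):
    """Split the comma-separated parameter and fail on unsupported names."""
    requested = [p.strip() for p in param.split(",") if p.strip()]
    unsupported = sorted(set(requested) - SUPPORTED_PIPELINES)
    if unsupported:
        raise ValueError(
            "Unsupported pipeline(s) in --generate_pipeline_samplesheets: "
            + ", ".join(unsupported)
        )
    return requested

print(validate_pipeline_samplesheets("funcscan,taxprofiler"))
```

Running the check before any samplesheet workflow is invoked turns a silent no-op (a typo that matches nothing) into an immediate, actionable error.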
30 changes: 19 additions & 11 deletions workflows/mag.nf
@@ -13,17 +13,18 @@ include { methodsDescriptionText } from '../subworkflows/local/utils_nfcore_mag_
//
// SUBWORKFLOW: Consisting of a mix of local and nf-core/modules
//
include { BINNING_PREPARATION } from '../subworkflows/local/binning_preparation'
include { BINNING } from '../subworkflows/local/binning'
include { BINNING_REFINEMENT } from '../subworkflows/local/binning_refinement'
include { BUSCO_QC } from '../subworkflows/local/busco_qc'
include { VIRUS_IDENTIFICATION } from '../subworkflows/local/virus_identification'
include { CHECKM_QC } from '../subworkflows/local/checkm_qc'
include { GUNC_QC } from '../subworkflows/local/gunc_qc'
include { GTDBTK } from '../subworkflows/local/gtdbtk'
include { ANCIENT_DNA_ASSEMBLY_VALIDATION } from '../subworkflows/local/ancient_dna'
include { DOMAIN_CLASSIFICATION } from '../subworkflows/local/domain_classification'
include { DEPTHS } from '../subworkflows/local/depths'
include { BINNING_PREPARATION } from '../subworkflows/local/binning_preparation'
include { BINNING } from '../subworkflows/local/binning'
include { BINNING_REFINEMENT } from '../subworkflows/local/binning_refinement'
include { BUSCO_QC } from '../subworkflows/local/busco_qc'
include { VIRUS_IDENTIFICATION } from '../subworkflows/local/virus_identification'
include { CHECKM_QC } from '../subworkflows/local/checkm_qc'
include { GUNC_QC } from '../subworkflows/local/gunc_qc'
include { GTDBTK } from '../subworkflows/local/gtdbtk'
include { ANCIENT_DNA_ASSEMBLY_VALIDATION } from '../subworkflows/local/ancient_dna'
include { DOMAIN_CLASSIFICATION } from '../subworkflows/local/domain_classification'
include { DEPTHS } from '../subworkflows/local/depths'
include { GENERATE_DOWNSTREAM_SAMPLESHEETS } from '../subworkflows/local/generate_downstream_samplesheets/main.nf'

//
// MODULE: Installed directly from nf-core/modules
@@ -1002,6 +1003,13 @@
}
}

//
// Samplesheet generation
//
if ( params.generate_downstream_samplesheets ) {
GENERATE_DOWNSTREAM_SAMPLESHEETS ( ch_short_reads_assembly, ch_assemblies )
}

//
// Collate and save software versions
//