Merge pull request #96 from peterk87/force-clair3-full-aln

Fix Clair3 sometimes missing variants
peterk87 · Dec 13, 2024 · d4c5f42
2 parents 92f1f2e + 7b298d2
Showing 11 changed files with 83 additions and 34 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,17 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [[3.6.1](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.6.1)] - 2024-12-13
+
+This patch release fixes an issue with Clair3 not producing variant calls for some regions due to full-alignment not being triggered. This issue was resolved by adding `--var_pct_phasing=1`, `--var_pct_full=1` and `--ref_pct_full=1` to the Clair3 command line.
+
+### Changes
+
+* fix: Added `--var_pct_phasing=1`, `--var_pct_full=1` and `--ref_pct_full=1` to Clair3 command line to ensure full-alignment is triggered for all reads to avoid missing variant calls in some regions.
+* fix: Added `stageAs: "input*/*"` to the `CAT_NANOPORE_FASTQ` process input channels so that input files are staged in their own subdirectories. In rare cases, input files were concatenated with themselves in an infinite loop until disk space was exhausted.
+* feat: Don't save NCBI Influenza reference sequences, metadata CSV and BLAST DB to the output directory by default. Added `--save_ncbi_db` and `--save_blastdb` workflow params to save these files to the output directory if desired.
+* docs: Updated README.md to mention Apptainer. Updated `usage.md` to describe new workflow params. Updated `output.md` to better describe BLAST subtyping results.
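
The effect of setting the three Clair3 percentages to `1` can be pictured with a small sketch. It assumes, as a simplification, that each percentage is the fraction of lowest-quality pileup candidates passed on to the next calling stage; the `Candidate` class and `select_for_full_alignment` function are hypothetical names for illustration and are not part of Clair3 or nf-flu.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical pileup-stage candidate site."""
    pos: int
    qual: float  # lower quality means a less certain pileup call


def select_for_full_alignment(candidates, pct):
    """Return the lowest-quality `pct` fraction of candidates (assumed behaviour)."""
    if not 0.0 <= pct <= 1.0:
        raise ValueError("pct must be between 0 and 1")
    ranked = sorted(candidates, key=lambda c: c.qual)
    n_selected = int(len(ranked) * pct + 0.5)  # round half up
    return ranked[:n_selected]


candidates = [
    Candidate(101, 2.0),
    Candidate(250, 30.0),
    Candidate(512, 8.5),
    Candidate(777, 25.0),
]

# With a fraction below 1, the higher-quality candidates never reach the second stage.
partial = select_for_full_alignment(candidates, 0.5)
# With a fraction of 1, every candidate is passed on.
everything = select_for_full_alignment(candidates, 1.0)
```

Under this assumption, a value of `1` removes the quality-based cut-off entirely, which matches the intent stated in the changelog entry: no region is skipped by the full-alignment step.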
+
 ## [[3.6.0](https://github.com/CFIA-NCFAD/nf-flu/releases/tag/3.6.0)] - 2024-12-02
 
 This minor release adds [FluMut](https://github.com/izsvenezie-virology/FluMut) "to search for molecular markers with potential impact on the biological characteristics of Influenza A viruses of the A(H5N1) subtype."

diff --git a/README.md b/README.md
@@ -1,12 +1,14 @@
 # CFIA-NCFAD/nf-flu - Influenza A and B Virus Genome Assembly Nextflow Workflow
 
-[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13892044.svg)](https://doi.org/10.5281/zenodo.13892044)
+[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14268099.svg)](https://doi.org/10.5281/zenodo.14268099)
 [![CI](https://github.com/CFIA-NCFAD/nf-flu/actions/workflows/ci.yml/badge.svg)](https://github.com/CFIA-NCFAD/nf-flu/actions/workflows/ci.yml)
 
 [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A521.04.0-23aa62.svg?labelColor=000000)](https://www.nextflow.io/)
 [![run with conda](http://img.shields.io/badge/run%20with-conda-3EB049?labelColor=000000&logo=anaconda)](https://docs.conda.io/en/latest/)
 [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/)
+[![run with apptainer](https://img.shields.io/badge/run%20with-apptainer-1d355c.svg?labelColor=000000)](https://apptainer.org/)
 [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/)
+[![run with podman](https://img.shields.io/badge/run%20with-podman-1d355c.svg?labelColor=000000)](https://podman.io/)
 
 ## Introduction
 
@@ -32,25 +34,25 @@ After reference sequence selection, the pipeline performs read mapping to each r
 
 ## Quick Start
 
-1. Install [`Nextflow`](https://www.nextflow.io/docs/latest/getstarted.html#installation) (`>=21.04.0`).
-2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort)_
+1. Install [`Nextflow`](https://www.nextflow.io/docs/latest/getstarted.html#installation) (`>=22.10.1`; latest stable release recommended!).
+2. Install any of [`Docker`](https://docs.docker.com/engine/installation/), [`Apptainer`](https://apptainer.org/), [`Singularity`](https://www.sylabs.io/guides/3.0/user-guide/), [`Podman`](https://podman.io/), [`Shifter`](https://nersc.gitlab.io/development/shifter/how-to-use/) or [`Charliecloud`](https://hpc.github.io/charliecloud/) for full pipeline reproducibility _(please only use [`Conda`](https://conda.io/miniconda.html) as a last resort)_
 3. Download the pipeline and test it on a minimal dataset with a single command:
 
     For Illumina workflow test:
 
     ```bash
-    nextflow run CFIA-NCFAD/nf-flu -profile test_illumina,<docker/singularity/podman/shifter/charliecloud/conda> \
+    nextflow run CFIA-NCFAD/nf-flu -profile test_illumina,<docker/apptainer/singularity/podman/shifter/charliecloud/conda> \
       --max_cpus $(nproc) # use all available CPUs; default is 2
     ```
 
     For Nanopore workflow test:
 
     ```bash
-    nextflow run CFIA-NCFAD/nf-flu -profile test_nanopore,<docker/singularity/podman/shifter/charliecloud/conda> \
+    nextflow run CFIA-NCFAD/nf-flu -profile test_nanopore,<docker/apptainer/singularity/podman/shifter/charliecloud/conda> \
       --max_cpus $(nproc) # use all available CPUs; default is 2
     ```
 
-    > * If you are using `singularity` then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
+    > * If you are using `apptainer`/`singularity` then the pipeline will auto-detect this and attempt to download the Apptainer/Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Apptainer/Singularity images directly due to timeout or network issues then please use the `--singularity_pull_docker_container` parameter to pull and convert the Docker image instead. Alternatively, it is highly recommended to use the [`nf-core download`](https://nf-co.re/tools/#downloading-pipelines-for-offline-use) command to pre-download all of the required containers before running the pipeline and to set the [`NXF_SINGULARITY_CACHEDIR` or `singularity.cacheDir`](https://www.nextflow.io/docs/latest/singularity.html?#singularity-docker-hub) Nextflow options to be able to store and re-use the images from a central location for future pipeline runs.
     > * If you are using `conda`, it is highly recommended to use the [`NXF_CONDA_CACHEDIR` or `conda.cacheDir`](https://www.nextflow.io/docs/latest/conda.html) settings to store the environments in a central location for future pipeline runs.
 
 4. Run your own analysis
@@ -69,7 +71,7 @@ After reference sequence selection, the pipeline performs read mapping to each r
         nextflow run CFIA-NCFAD/nf-flu \
           --input samplesheet.csv \
           --platform illumina \
-          --profile <docker/singularity/podman/shifter/charliecloud/conda>
+          -profile <docker/apptainer/singularity/podman/shifter/charliecloud/conda>
         ```
 
     * Typical command for Nanopore Platform
@@ -78,7 +80,7 @@ After reference sequence selection, the pipeline performs read mapping to each r
       nextflow run CFIA-NCFAD/nf-flu \
         --input samplesheet.csv \
         --platform nanopore \
-        --profile <docker/singularity/conda>
+        -profile <docker/apptainer/singularity/conda>
       ```
 
 ## Documentation
@@ -223,8 +225,9 @@ Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney
 * [nf-core](https://nf-co.re) project for establishing Nextflow workflow development best-practices, [nf-core tools](https://nf-co.re/tools-docs/) and [nf-core modules](https://github.com/nf-core/modules)
 * [nf-core/viralrecon](https://github.com/nf-core/viralrecon) for inspiration and setting a high standard for viral sequence data analysis pipelines
 * [Conda](https://docs.conda.io/projects/conda/en/latest/) and [Bioconda](https://bioconda.github.io/) project for making it easy to install, distribute and use bioinformatics software.
-* [Biocontainers](https://biocontainers.pro/) for automatic creation of [Docker] and [Singularity] containers for bioinformatics software in [Bioconda]
+* [Biocontainers](https://biocontainers.pro/) for automatic creation of [Docker] and [Apptainer]/[Singularity] containers for bioinformatics software in [Bioconda]
 
+[Apptainer]: https://apptainer.org/
 [BcfTools]: https://samtools.github.io/bcftools/
 [BLAST]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
 [Clair3]: https://github.com/HKU-BAL/Clair3

diff --git a/conf/modules.config b/conf/modules.config
@@ -33,12 +33,12 @@ process {
   }
   withName: 'BLAST_MAKEBLASTDB' {
     ext.args = '-dbtype nucl'
-    publishDir = [
+    publishDir = [ params.save_blastdb ?
       [
         path: { "${params.outdir}/blast/db/ncbi"},
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
         mode: params.publish_dir_mode
-      ]
+      ] : []
     ]
   }
   withName: 'BLAST_BLASTN.*' {
@@ -99,12 +99,12 @@ process {
     ]
   }
   withName: 'ZSTD_DECOMPRESS_.*' {
-    publishDir = [
+    publishDir = [ params.save_ncbi_db ?
       [
         path: { "${params.outdir}/ncbi-influenza-db"},
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
         mode: params.publish_dir_mode
-      ]
+      ] : []
     ]
   }
   withName: 'MQC_VERSIONS_TABLE' {

diff --git a/conf/modules_illumina.config b/conf/modules_illumina.config
@@ -12,12 +12,12 @@ process {
 
   withName: 'BLAST_MAKEBLASTDB_NCBI' {
     ext.args = '-dbtype nucl'
-    publishDir = [
+    publishDir = [ params.save_blastdb ?
       [
         path: { "${params.outdir}/blast/db/ncbi"},
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
         mode: params.publish_dir_mode
-      ]
+      ] : []
     ]
   }
 

diff --git a/conf/modules_nanopore.config b/conf/modules_nanopore.config
@@ -12,12 +12,12 @@ process {
   }
   withName: 'BLAST_MAKEBLASTDB_REFDB' {
     ext.args  = '-dbtype nucl'
-    publishDir = [
+    publishDir = [ params.save_blastdb ?
       [
         path: { "${params.outdir}/blast/db/ref_db" },
         saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
         mode: params.publish_dir_mode
-      ]
+      ] : []
     ]
   }
   withName: 'BLAST_BLASTN_IRMA' {
@@ -213,16 +213,6 @@ process {
     ]
   }
 
-  withName: 'ZSTD_DECOMPRESS_.*' {
-    publishDir = [
-      [
-        path: { "${params.outdir}/ncbi-influenza-db"},
-        saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
-        mode: params.publish_dir_mode
-      ]
-    ]
-  }
-
   withName: 'READ_COUNT_FAIL_TSV' {
     publishDir = [
       [

diff --git a/docs/output.md b/docs/output.md
@@ -158,11 +158,11 @@ The report contains 2 sheets:
 <details markdown="1">
 <summary>Output files</summary>
 
-- H/N subtyping Excel report: `iav-subtyping-report.xlsx`
+- H/N subtyping Excel report: `nf-flu-subtyping-report.xlsx`
 
 </details>
 
-A H/N subtyping Excel report is generated from all [BLAST analysis](#blast-analysis) results for all samples and final assembled gene segments.
+An H/N subtyping Excel report is generated from all [BLAST analysis](#blast-analysis) results for all samples and final assembled gene segments. The H and N subtypes are based on the proportion of high-quality BLAST matches that support the subtype prediction. The top BLAST match for the HA and NA segments does not by itself determine the subtype, since the metadata for the top match could have been entered incorrectly into NCBI.
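
The proportion-based subtype call described above can be sketched as a simple vote. This is a minimal illustration and not the pipeline's implementation: `predict_subtype` and the hit tuples are hypothetical, and the default thresholds are taken from the `pident_threshold` and `min_aln_length` values in `nextflow.config` on the assumption that they define a "high-quality" match.

```python
from collections import Counter


def predict_subtype(hits, pident_threshold=0.85, min_aln_length=700):
    """Pick the subtype supported by the largest share of high-quality hits.

    `hits` is a list of (subtype, identity as a fraction, alignment length)
    tuples, ordered from best to worst BLAST match.
    """
    votes = Counter(
        subtype
        for subtype, pident, aln_length in hits
        if pident >= pident_threshold and aln_length >= min_aln_length
    )
    if not votes:
        return None, 0.0
    subtype, count = votes.most_common(1)[0]
    return subtype, count / sum(votes.values())


# The top hit is annotated H3 (for example, a metadata entry error),
# but most high-quality hits agree on H5. The H7 hit fails the identity filter.
ha_hits = [
    ("H3", 0.999, 1700),
    ("H5", 0.992, 1701),
    ("H5", 0.990, 1698),
    ("H5", 0.989, 1700),
    ("H7", 0.800, 1650),
]

prediction = predict_subtype(ha_hits)
```

Here a single mislabelled top match is outvoted by the remaining high-quality matches, which is the failure mode the report is designed to avoid.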
 
 The subtyping report spreadsheet contains four sheets:
 
@@ -190,7 +190,7 @@ This sheet contains the H/N subtype prediction results for each sample along wit
 
 #### Sheet: 2_Top Segment Matches
 
-This sheet contains the top 3 Influenza DB sequence matches for each segment of each sample along with BLASTN hit values and reference sequence metadata.
+This sheet contains the top N Influenza DB sequence matches for each segment of each sample along with BLASTN hit values and reference sequence metadata.
 
 | Field | Description | Example |
 |-------|-------------|---------|

diff --git a/docs/usage.md b/docs/usage.md
@@ -133,6 +133,20 @@ Reference database in fasta file, sequence ID must be in format `SequenceName_se
 
 The output directory where the results will be saved.
 
+#### `--save_ncbi_db`
+
+- Type: boolean
+- Default: `false`
+
+Save the NCBI Influenza database FASTA and metadata CSV to the output directory.
+
+#### `--save_blastdb`
+
+- Type: boolean
+- Default: `false`
+
+Save the BLAST database to the output directory.
+
 ### IRMA assembly options
 
 #### `--irma_module`

diff --git a/modules/local/clair3.nf b/modules/local/clair3.nf
@@ -64,7 +64,10 @@ process CLAIR3 {
     --haploid_sensitive \\
     --enable_long_indel \\
     --keep_iupac_bases \\
-    --include_all_ctgs
+    --include_all_ctgs \\
+    --var_pct_phasing=1 \\
+    --var_pct_full=1 \\
+    --ref_pct_full=1
 
   ln -s ${clair3_dir}/merge_output.vcf.gz ${vcf}
 

diff --git a/modules/local/misc.nf b/modules/local/misc.nf
@@ -10,7 +10,7 @@ process CAT_NANOPORE_FASTQ {
   }
 
   input:
-  tuple val(meta), path(fqgz), path(fq)
+  tuple val(meta), path(fqgz, stageAs: "input*/*"), path(fq, stageAs: "input*/*")
 
   output:
   tuple val(meta), path(merged_fqgz), emit: reads
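
Why staging inputs under `input*/` avoids self-concatenation can be shown with a short simulation. It assumes that Nextflow expands `input*/*` to numbered subdirectories (`input1/`, `input2/`, ...) when several files are staged, and that the merged output is written to the top of the task work directory; `stage_inputs` and `concatenate` are hypothetical helpers, not nf-flu code.

```python
import tempfile
from pathlib import Path


def stage_inputs(work_dir, sources):
    """Copy each source into its own numbered subdirectory (input1/, input2/, ...)."""
    staged = []
    for index, source in enumerate(sources, start=1):
        target_dir = work_dir / f"input{index}"
        target_dir.mkdir(parents=True)
        target = target_dir / source.name
        target.write_bytes(source.read_bytes())
        staged.append(target)
    return staged


def concatenate(staged, merged):
    """Write all staged files into `merged`, refusing to read the output itself."""
    if merged in staged:
        raise ValueError(f"{merged} is both an input and the output")
    with merged.open("wb") as out:
        for path in staged:
            out.write(path.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sources = []
    for run, record in (("run1", "@r1\nACGT\n+\n!!!!\n"), ("run2", "@r2\nTTGA\n+\n!!!!\n")):
        run_dir = root / run
        run_dir.mkdir()
        # Hypothetical worst case: every input already has the output's file name.
        source = run_dir / "sample.merged.fastq"
        source.write_text(record)
        sources.append(source)

    work_dir = root / "work"
    work_dir.mkdir()
    staged = stage_inputs(work_dir, sources)
    merged = work_dir / "sample.merged.fastq"
    concatenate(staged, merged)

    staged_names = [path.relative_to(work_dir).as_posix() for path in staged]
    merged_text = merged.read_text()
```

Because every input lives one directory below the output, identically named inputs cannot collide with each other or with the merged file, so the output is never read back as an input.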

diff --git a/nextflow.config b/nextflow.config
@@ -18,6 +18,8 @@ params {
   irma_module                       = ''
   keep_ref_deletions                = true
   skip_irma_subtyping_report        = true
+  save_ncbi_db                      = false
+  save_blastdb                      = false
   // H/N subtyping options
   pident_threshold                  = 0.85
   min_aln_length                    = 700
@@ -68,15 +70,26 @@ params {
 includeConfig 'conf/base.config'
 
 profiles {
+  apptainer {
+    apptainer.enabled      = true
+    apptainer.autoMounts   = true
+    charliecloud.enabled   = false
+    docker.enabled         = false
+    singularity.enabled    = false
+    podman.enabled         = false
+    shifter.enabled        = false
+  }
   charliecloud {
     charliecloud.enabled   = true
     docker.enabled         = false
+    apptainer.enabled      = false
     singularity.enabled    = false
     podman.enabled         = false
     shifter.enabled        = false
   }
   conda {
     conda.enabled          = true
+    apptainer.enabled      = false
     docker.enabled         = false
     singularity.enabled    = false
     podman.enabled         = false
@@ -89,6 +102,7 @@ profiles {
   mamba {
     conda.enabled          = true
     conda.useMamba         = true
+    apptainer.enabled      = false
     docker.enabled         = false
     singularity.enabled    = false
     podman.enabled         = false
@@ -101,20 +115,22 @@ profiles {
   docker {
     docker.enabled         = true
     docker.userEmulation   = true
+    apptainer.enabled      = false
     singularity.enabled    = false
     podman.enabled         = false
     shifter.enabled        = false
     charliecloud.enabled   = false
   }
   podman {
     podman.enabled         = true
+    apptainer.enabled      = false
     docker.enabled         = false
     singularity.enabled    = false
     shifter.enabled        = false
     charliecloud.enabled   = false
   }
   singularity {
-    singularity.enabled = true
+    singularity.enabled    = true
     singularity.autoMounts = true
     docker.enabled         = false
     podman.enabled         = false
@@ -155,7 +171,7 @@ manifest {
   description     = 'Influenza A virus genome assembly pipeline'
   homePage        = 'https://github.com/CFIA-NCFAD/nf-flu'
   author          = 'Peter Kruczkiewicz, Hai Nguyen'
-  version         = '3.6.0'
+  version         = '3.6.1'
   nextflowVersion = '!>=22.10.1'
   mainScript      = 'main.nf'
   doi             = '10.5281/zenodo.13892044'

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -36,6 +36,18 @@
                     "description": "The output directory where the results will be saved.",
                     "default": "./results",
                     "fa_icon": "fas fa-folder-open"
+                },
+                "save_ncbi_db": {
+                    "type": "boolean",
+                    "description": "Save the NCBI Influenza database FASTA and metadata CSV to the output directory.",
+                    "default": false,
+                    "fa_icon": "fas fa-database"
+                },
+                "save_blastdb": {
+                    "type": "boolean",
+                    "description": "Save the BLAST database to the output directory.",
+                    "default": false,
+                    "fa_icon": "fas fa-database"
                 }
             },
             "required": [