Add InterProScan to Pipeline and integrate in AMPcombi #428

Open
wants to merge 15 commits into
base: dev
12 changes: 7 additions & 5 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#421](https://github.com/nf-core/funcscan/pull/421) Updated to nf-core template 3.0.2. (by @jfy133)
- [#427](https://github.com/nf-core/funcscan/pull/427) AMPcombi now can use multiple other databases for classifications. (by @darcy220606)
- [#429](https://github.com/nf-core/funcscan/pull/429) Updated to nf-core template 3.1.0. (by @jfy133 and @jasmezz)
- [#428](https://github.com/nf-core/funcscan/pull/XXX) Added InterProScan annotation workflow to the pipeline. The results are coupled to AMPcombi final table. (by @darcy220606)
Review comment (Member):

Suggested change
- [#428](https://github.com/nf-core/funcscan/pull/XXX) Added InterProScan annotation workflow to the pipeline. The results are coupled to AMPcombi final table. (by @darcy220606)
- [#428](https://github.com/nf-core/funcscan/pull/428) Added InterProScan annotation workflow to the pipeline. The results are coupled to AMPcombi final table. (by @darcy220606)


### `Fixed`

@@ -18,11 +19,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### `Dependencies`

| Tool | Previous version | New version |
| -------- | ---------------- | ----------- |
| AMPcombi | 0.2.2 | 2.0.1 |
| Macrel | 1.2.0 | 1.4.0 |
| MultiQC | 1.24.0 | 1.25.1 |
| Tool | Previous version | New version |
| ------------ | ---------------- | ----------- |
| AMPcombi | 0.2.2 | 2.0.1 |
| Macrel | 1.2.0 | 1.4.0 |
| MultiQC | 1.24.0 | 1.25.1 |
| InterProScan | - | 5.59_91.0 |

### `Deprecated`

8 changes: 8 additions & 0 deletions CITATIONS.md
@@ -70,6 +70,14 @@

> Eddy S. R. (2011). Accelerated Profile HMM Searches. PLoS computational biology, 7(10), e1002195. [DOI: 10.1371/journal.pcbi.1002195](https://doi.org/10.1371/journal.pcbi.1002195)

- [InterPro](https://doi.org/10.1093/nar/gkaa977)

> Blum, M., Chang, H-Y., Chuguransky, S., Grego, T., Kandasaamy, S., Mitchell, A., Nuka, G., Paysan-Lafosse, T., Qureshi, M., Raj, S., Richardson, L., Salazar, G.A., Williams, L., Bork, P., Bridge, A., Gough, J., Haft, D.H., Letunic, I., Marchler-Bauer, A., Mi, H., Natale, D.A., Necci, M., Orengo, C.A., Pandurangan, A.P., Rivoire, C., Sigrist, C.A., Sillitoe, I., Thanki, N., Thomas, P.D., Tosatto, S.C.E, Wu, C.H., Bateman, A., Finn, R.D. (2021) The InterPro protein families and domains database: 20 years on, Nucleic Acids Research, 49(D1), D344–D354. [DOI: 10.1093/nar/gkaa977](https://doi.org/10.1093/nar/gkaa977).

- [InterProScan](https://doi.org/10.1093/bioinformatics/btu031)

> Jones, P., Binns, D., Chang, H-Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., Pesseat, S., Quinn, A.F., Sangrador-Vegas, A., Scheremetjew, M., Yong, S-Y., Lopez, R., Hunter, S. (2014) InterProScan 5: genome-scale protein function classification, Bioinformatics, 30(9), 1236–1240. [DOI: 10.1093/bioinformatics/btu031](https://doi.org/10.1093/bioinformatics/btu031)

- [Macrel](https://doi.org/10.7717/peerj.10555)

> Santos-Júnior, C. D., Pan, S., Zhao, X. M., & Coelho, L. P. (2020). Macrel: antimicrobial peptide screening in genomes and metagenomes. PeerJ, 8, e10555. [DOI: 10.7717/peerj.10555](https://doi.org/10.7717/peerj.10555)
7 changes: 7 additions & 0 deletions conf/base.config
@@ -230,4 +230,11 @@ process {
memory = { 6.GB * task.attempt }
time = { 2.h * task.attempt }
}

withName: INTERPROSCAN_DATABASE {
memory = { 6.GB * task.attempt }
time = { 4.h * task.attempt } // Download might take longer with some Bandwidth!
Review comment (Member):

Suggested change
time = { 4.h * task.attempt } // Download might take longer with some Bandwidth!
time = { 4.h * task.attempt }

cpus = { 6 * task.attempt }
}

}
43 changes: 41 additions & 2 deletions conf/modules.config
@@ -83,7 +83,7 @@ process {
]
}

withName: SEQKIT_SEQ {
withName: SEQKIT_SEQ_LENGTH {
ext.prefix = { "${meta.id}_long" }
publishDir = [
path: { "${params.outdir}/bgc/seqkit/" },
@@ -96,6 +96,45 @@ process {
].join(' ').trim()
}

withName: SEQKIT_SEQ_FILTER {
ext.prefix = { "${meta.id}_cleaned.faa" }
publishDir = [
path: { "${params.outdir}/function/interproscan/" },
mode: params.publish_dir_mode,
enabled: { params.run_function_interproscan },
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
ext.args = [
"--gap-letters '* \t.' --remove-gaps"
].join(' ').trim()
}

withName: INTERPROSCAN_DATABASE {
publishDir = [
path: { "${params.outdir}/databases/interproscan/" },
mode: params.publish_dir_mode,
enabled: params.save_db,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: INTERPROSCAN {
ext.prefix = { "${meta.id}_interproscan.faa" }
Review comment (Member):

Is this meant to have the file suffix at the end?

publishDir = [
path: { "${params.outdir}/function/interproscan/" },
mode: params.publish_dir_mode,
enabled: params.run_function_interproscan,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
ext.args = [
"--applications ${params.function_interproscan_applications}",
params.function_interproscan_enableprecalc ? '' : '--disable-precalc',
params.function_interproscan_enableresidueannot ? '' : '--disable-residue-annot',
params.function_interproscan_disableresidueannottsv ? '--enable-tsv-residue-annot' : '',
"--formats tsv"
].join(' ').trim()
}
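
Editor's illustration (not part of the diff): the `ext.args` list above resolves to a single InterProScan argument string. A minimal bash sketch of that resolution follows, assuming the four toggles arrive as the strings `true`/`false`. The function name and its positional arguments are hypothetical stand-ins for the pipeline parameters; unlike the Groovy `join(' ')`, the sketch simply skips empty entries. The last toggle keeps the mapping exactly as written in the diff, where a parameter named `disableresidueannottsv` adds `--enable-tsv-residue-annot`.

```shell
#!/usr/bin/env bash
# Sketch: mirror the ternaries of the INTERPROSCAN ext.args list.
# build_interproscan_args is a hypothetical helper, not pipeline code.
build_interproscan_args() {
    local applications="$1"        # params.function_interproscan_applications
    local enable_precalc="$2"      # params.function_interproscan_enableprecalc
    local enable_residue="$3"      # params.function_interproscan_enableresidueannot
    local residue_tsv_toggle="$4"  # params.function_interproscan_disableresidueannottsv

    local args=("--applications" "${applications}")
    # A flag is added only when the matching "enable" toggle is not true.
    [[ "${enable_precalc}" == "true" ]] || args+=("--disable-precalc")
    [[ "${enable_residue}" == "true" ]] || args+=("--disable-residue-annot")
    # Mapping kept as in the diff: this toggle switches the TSV flag on.
    [[ "${residue_tsv_toggle}" == "true" ]] && args+=("--enable-tsv-residue-annot")
    args+=("--formats" "tsv")

    # Join the collected arguments with single spaces.
    echo "${args[*]}"
}

# Example call with the application list named in docs/usage.md.
build_interproscan_args 'PANTHER,ProSiteProfiles,ProSitePatterns,Pfam' false false false
```

With all three toggles `false`, the resolved string carries both `--disable-*` flags and no `--enable-tsv-residue-annot`; the actual parameter defaults are not shown in this diff.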

withName: PROKKA {
ext.prefix = { "${meta.id}_prokka" }
publishDir = [
@@ -676,7 +715,7 @@ process {

withName: AMP_DATABASE_DOWNLOAD {
publishDir = [
path: { "${params.outdir}/databases/${params.amp_ampcombi_db}" },
path: { "${params.outdir}/databases/" },
Review comment (Member):

Suggested change
path: { "${params.outdir}/databases/" },
path: { "${params.outdir}/databases/ampcombi" },

If we use the interproscan example above?

mode: params.publish_dir_mode,
enabled: params.save_db,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
23 changes: 23 additions & 0 deletions docs/output.md
@@ -25,6 +25,8 @@ results/
| ├── prodigal/
| ├── prokka/
| └── pyrodigal/
├── function/
Review comment (Member):

What exactly do you mean here by function? I don't find that particularly descriptive and wouldn't necessarily know what you mean by that.

| └── interproscan/
├── amp/
| ├── ampir/
| ├── amplify/
@@ -74,6 +76,10 @@ ORF prediction and annotation with any of:
- [Prokka](#prokka) – open reading frame prediction and functional protein annotation.
- [Bakta](#bakta) – open reading frame prediction and functional protein annotation.

CDS domain annotation:

- [InterProScan](#interproscan) (default) – for open reading frame protein and domain predictions.

Antimicrobial Resistance Genes (ARGs):

- [ABRicate](#abricate) – antimicrobial resistance gene detection, based on alignment to one of several databases.
@@ -216,6 +222,23 @@ Output Summaries:

[Bakta](https://github.com/oschwengers/bakta) is a tool for the rapid & standardised annotation of bacterial genomes and plasmids from both isolates and MAGs. It provides dbxref-rich, sORF-including and taxon-independent annotations in machine-readable JSON & bioinformatics standard file formats for automated downstream analysis. The output is used by some of the functional screening tools.

### Functional classifications
Review comment (Member):

What about 'domainprediction' or something like that?


[InterProScan](#interproscan)

#### InterProScan

<details markdown="1">
<summary>Output files</summary>

- `interproscan/`
- `<samplename>_cleaned.faa`: clean version of the fasta files (amino acids) generated by one of the annotated tools. These contain sequences with no special character
Review comment (Member):

Suggested change
- `<samplename>_cleaned.faa`: clean version of the fasta files (amino acids) generated by one of the annotated tools. These contain sequences with no special character
- `<samplename>_cleaned.faa`: clean version of the fasta files (amino acids) generated by one of the annotated tools. These contain sequences with no special character.

What is an 'annotated tool'? Is this a tool within InterProScan?
What is a special character?

- `<samplename>_interproscan_faa.tsv`: predicted proteins and domains using the InterPro database in TSV format

</details>

[InterProScan](https://academic.oup.com/bioinformatics/article/30/9/1236/237988?login=true) (**a**nti**m**icrobial **p**eptide **p**rediction **i**n **r**) was designed to predict the protein function and and provide possible domain and motif information for the coding regions. It utilizes the InterPro database that consists of multiple sister databases such as PANTHER, ProSite, Pfam, etc. More details can be found in the [documentation](https://interproscan-docs.readthedocs.io/en/latest/index.html).
Review comment (Member):

Suggested change
[InterProScan](https://academic.oup.com/bioinformatics/article/30/9/1236/237988?login=true) (**a**nti**m**icrobial **p**eptide **p**rediction **i**n **r**) was designed to predict the protein function and and provide possible domain and motif information for the coding regions. It utilizes the InterPro database that consists of multiple sister databases such as PANTHER, ProSite, Pfam, etc. More details can be found in the [documentation](https://interproscan-docs.readthedocs.io/en/latest/index.html).
[InterProScan](https://academic.oup.com/bioinformatics/article/30/9/1236/237988?login=true) is designed to predict protein function and provide possible domain and motif information for the coding regions. It utilizes the InterPro database that consists of multiple sister databases such as PANTHER, ProSite, Pfam, etc. More details can be found in the [documentation](https://interproscan-docs.readthedocs.io/en/latest/index.html).


### AMP detection tools

[ampir](#ampir), [AMPlify](#amplify), [hmmsearch](#hmmsearch), [Macrel](#macrel)
18 changes: 17 additions & 1 deletion docs/usage.md
@@ -111,7 +111,7 @@ We highly recommend performing quality control on input contigs before running t
For example, ideally BGC screening requires contigs of at least 3,000 bp else downstream tools may crash.
:::

## Notes on screening tools and taxonomic classification
## Notes on screening tools, taxonomic and functional classifications

The implementation of some tools in the pipeline may have some particular behaviours that you should be aware of before you run the pipeline.

@@ -133,6 +133,18 @@ MMseqs2 is currently the only taxonomic classification tool used in the pipeline
--taxa_classification_mmseqs_db_id 'Kalamari'
```

### InterProScan

[InterProScan](https://github.com/ebi-pf-team/interproscan) is currently the only functional classification tool that gives a snapshot of the protein families and domains for each coding region. By runnning this tool `--run_function_interproscan`, the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/) v5.67-99.0 is by default downloaded and prepared. This can be changed by downloading and extracting the files from any [InterPro version](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/) and the path to the folder assigned.
Review comment (Member):

Suggested change
[InterProScan](https://github.com/ebi-pf-team/interproscan) is currently the only functional classification tool that gives a snapshot of the protein families and domains for each coding region. By runnning this tool `--run_function_interproscan`, the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/) v5.67-99.0 is by default downloaded and prepared. This can be changed by downloading and extracting the files from any [InterPro version](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/) and the path to the folder assigned.
[InterProScan](https://github.com/ebi-pf-team/interproscan) is currently the only functional classification tool that gives a snapshot of the protein families and domains for each coding region. By giving `--run_function_interproscan`, the [InterPro database](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.67-99.0/) v5.67-99.0 is by default downloaded and prepared and the input sequences will be screened against the database. You can skip database downloading by the pipeline on each run by manually downloading and extracting the files from any [InterPro version](http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/) and giving the resulting directory path to `--??????`.

Review comment (Member):

Please also give the rough instructions on how to do this manually at the end of the USAGE section in the corresponding section, as we do for all other tools.


```bash
--function_interproscan_db 'path/to/InterPro_directory/'
```
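
Editor's illustration (not part of the diff): a rough sketch of the manual route described above, which fetches one InterPro release with the same `curl` and `tar` steps the local download module uses, then hands the directory to the pipeline. The tarball name `interproscan-<version>-64-bit.tar.gz` and the local directory `interproscan_db` are assumptions, and the diff does not state which level of the extracted tree the pipeline expects, so treat the final path as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: version, tarball name, and directory are assumptions.
IPR_VERSION="5.67-99.0"                                   # default version named in the text
IPR_TARBALL="interproscan-${IPR_VERSION}-64-bit.tar.gz"   # assumed file name on the FTP mirror
IPR_URL="http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/${IPR_VERSION}/${IPR_TARBALL}"
DB_DIR="interproscan_db"                                  # hypothetical local directory

# Download and unpack the release, as the INTERPROSCAN_DATABASE process does.
fetch_interproscan_db() {
    mkdir -p "${DB_DIR}"
    curl -L "${IPR_URL}" -o "${DB_DIR}/${IPR_TARBALL}"
    tar -xzf "${DB_DIR}/${IPR_TARBALL}" -C "${DB_DIR}"
}

# The archive is large, so the download only runs when explicitly requested.
if [[ "${FETCH_DB:-no}" == "yes" ]]; then
    fetch_interproscan_db
fi

# Flags to append to the usual pipeline command line.
PIPELINE_FLAGS="--run_function_interproscan --function_interproscan_db '${DB_DIR}/'"
echo "${PIPELINE_FLAGS}"
```

Running the script with `FETCH_DB=yes` performs the download; without it, the script only prints the two flags so the paths can be checked first.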

:::info
By default, the databases used to assign the nearest protein domain are set to `PANTHER,ProSiteProfiles,ProSitePatterns,Pfam`. Adding other applications to the list does not guarantee that the results will be integrated correctly within `AMPcombi`.
:::

### antiSMASH

antiSMASH has a minimum contig parameter, in which only contigs of a certain length (or longer) will be screened. In cases where no hits are found in these, the tool ends successfully without hits. However if no contigs in an input file reach that minimum threshold, the tool will end with a 'failure' code, and cause the pipeline to crash.
@@ -258,6 +270,10 @@ The pipeline will automatically run Pyrodigal instead of Prodigal if the paramet
This is due to an incompatibility issue of Prodigal's output `.gbk` file with multiple downstream tools.
:::

:::tip
If the `run_function_interproscan` is activated, protein and domain classifications of the coding regions are generated and the output is then integrated into the `AMPcombi parsetables` resulting table for every sample and the complete summary files e.g., `Ampcombi_summary.tsv`.
Review comment (Member):

Suggested change
If the `run_function_interproscan` is activated, protein and domain classifications of the coding regions are generated and the output is then integrated into the `AMPcombi parsetables` resulting table for every sample and the complete summary files e.g., `Ampcombi_summary.tsv`.
If `--run_function_interproscan` is given, protein and domain classifications of the coding regions are generated and the output is then integrated into the `AMPcombi parsetables` resulting table for every sample and the complete summary files e.g., `Ampcombi_summary.tsv`.

:::

### Abricate

The default ABRicate installation comes with a series of 'default' databases:
5 changes: 5 additions & 0 deletions modules.json
@@ -140,6 +140,11 @@
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"interproscan": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
"installed_by": ["modules"]
},
"macrel/contigs": {
"branch": "master",
"git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
35 changes: 35 additions & 0 deletions modules/local/interproscan_download.nf
@@ -0,0 +1,35 @@
process INTERPROSCAN_DATABASE {
tag "interproscan_database_download"
label 'process_medium'

conda "conda-forge::sed=4.7"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/curl:7.80.0' :
'biocontainers/curl:7.80.0' }"

input:
val database_url

output:
path("interproscan_db/*") , emit: db
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
"""
mkdir -p interproscan_db/

filename=\$(basename ${database_url})

curl -L ${database_url} -o interproscan_db/\$filename
tar -xzf interproscan_db/\$filename -C interproscan_db/

cat <<-END_VERSIONS > versions.yml
"${task.process}":
tar: \$(tar --version 2>&1 | sed -n '1s/tar (busybox) //p')
curl: "\$(curl --version 2>&1 | sed -n '1s/^curl \\([0-9.]*\\).*/\\1/p')"
END_VERSIONS
"""
}
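
Editor's illustration (not part of the diff): the two `sed` expressions in the `versions.yml` heredoc can be checked outside Nextflow. In plain shell the capture group is written `\(...\)`; the doubled backslashes in the process script are there only because the command sits inside a Groovy string. The sample `--version` lines below are made-up examples, not output captured from the container.

```shell
#!/usr/bin/env bash
# Illustrative first lines of `--version` output; the strings are examples.
curl_line='curl 7.80.0 (x86_64-pc-linux-gnu) libcurl/7.80.0'
busybox_tar_line='tar (busybox) 1.32.1'
gnu_tar_line='tar (GNU tar) 1.34'

# Same expressions as in the process script, with single backslashes.
curl_version=$(printf '%s\n' "${curl_line}" | sed -n '1s/^curl \([0-9.]*\).*/\1/p')
busybox_tar_version=$(printf '%s\n' "${busybox_tar_line}" | sed -n '1s/tar (busybox) //p')
# No "(busybox)" prefix to substitute, so sed -n prints nothing here.
gnu_tar_version=$(printf '%s\n' "${gnu_tar_line}" | sed -n '1s/tar (busybox) //p')

echo "curl: ${curl_version}"
echo "tar (busybox): ${busybox_tar_version}"
echo "tar (GNU): '${gnu_tar_version}'"
```

The GNU case shows that the `tar` entry would be left empty wherever `tar` is not the BusyBox build, which may be worth confirming against the container actually used.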
5 changes: 5 additions & 0 deletions modules/nf-core/interproscan/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

66 changes: 66 additions & 0 deletions modules/nf-core/interproscan/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
