Add concat fastqs from SRA manifest #227

lldelisle · 2023-10-09T20:47:35Z

Here is a new workflow for cases where the samples have been sequenced in several runs.

Points to discuss:

SRA list vs. manifest, etc.: I need more columns than just the SRA name.
The parallel download seems to fail, probably because there are too many levels of nesting in the lists.

lldelisle · 2023-10-10T06:46:41Z

Also, I don't like this hard-coded column 6, but I don't know how to avoid it because it is a column parameter.

wm75 · 2023-10-10T11:39:56Z

I've incorporated a couple of suggestions into https://usegalaxy.eu/u/wolfgang-maier/w/sralisttoconcatenatedfastqs-imported-from-url, specifically:

perform splitting on the input dataset directly and make sure the header line gets propagated to each collection element
the output at this stage will always be in tabular format -> change this to sra_manifest.tabular and feed it directly to the faster download tool (saves cut jobs on collection elements)
I've also changed the tool for splitting to toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.0 because it's the more versatile tool and is also used in at least one other iwc workflow (covid-19 variation reporting) already, so we might want to promote use of it, but feel free to ignore me
the Select last, Cut and Unique steps can all be replaced with a single Datamash step; the only catch is that this now requires hard-coding the final sample name column instead of column 6, again because we cannot create a param of type data_column from a WF input parameter.
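
As a rough illustration of the first and last suggestions (this is not the Galaxy tools themselves), the sketch below splits a manifest into one-row tables that each repeat the header line, then collapses the sample-name column to its unique values, which is roughly the job the Datamash step takes over. The column names `Run` and `sample`, the accessions, and both helper names are hypothetical.

```python
import csv
import io


def split_with_header(manifest_text, delimiter="\t"):
    """Return one small table per data row, each repeating the header line."""
    rows = list(csv.reader(io.StringIO(manifest_text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    return [[header, row] for row in data]


def unique_samples(manifest_text, sample_column, delimiter="\t"):
    """Return sample names in first-seen order; sample_column is 1-based."""
    rows = list(csv.reader(io.StringIO(manifest_text), delimiter=delimiter))
    seen = {}  # dict keys keep insertion order, so this de-duplicates stably
    for row in rows[1:]:
        seen.setdefault(row[sample_column - 1], None)
    return list(seen)


# Hypothetical two-column manifest: run accession, then sample name.
MANIFEST = (
    "Run\tsample\n"
    "SRR0000001\tsampleA\n"
    "SRR0000002\tsampleA\n"
    "SRR0000003\tsampleB\n"
)

chunks = split_with_header(MANIFEST)
samples = unique_samples(MANIFEST, sample_column=2)
```

Note that `sample_column` still has to be passed as a plain integer here, which mirrors the hard-coding problem described above.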

I can confirm the remaining problem besides the data_column issue: parallel faster download jobs don't seem to organize their outputs correctly, but produce empty nested list structures

lldelisle · 2023-10-10T13:04:01Z

Meanwhile, I was wondering whether we should start by using the 'cut' tool to keep only the first column (with the SRA IDs) and the column with the IDs the user wants. The sample column would then always be column 2. What do you think?
I will have a look at your workflow.

lldelisle · 2023-10-10T13:05:44Z

Datamash is indeed a good improvement.

lldelisle · 2023-10-11T05:11:12Z

@wm75 I have an issue with 'sra_manifest.tabular'. I cannot get this datatype out of 'split_file_to_collection':
In the workflow:

In the history:

And it is the same with the tool split_file_on_column...
I also tried to put an 'apply rule' step between 'split_file_to_collection' and 'fasterqdump' to change the datatype.
While in theory it worked:

It seems that something went wrong with the column name:

That's why I need to add a 'cut1' step after 'split_file_to_collection'.

lldelisle · 2023-10-11T05:30:17Z

The tests pass with the master branch of planemo, should we release planemo?

lldelisle · 2023-10-11T05:33:19Z

@PierreOsteil for your information

wm75 · 2023-10-11T06:07:09Z

> @wm75 I have an issue with 'sra_manifest.tabular'. I do not manage to have this datatype out of 'split_file_to_collection'

Yes, splitting to a collection will always produce tabular format. It's ok not to care about the input format until after that step, but then change the datatype of its output to sra_manifest.tabular.

lldelisle · 2023-10-12T06:09:37Z

> > @wm75 I have an issue with 'sra_manifest.tabular'. I do not manage to have this datatype out of 'split_file_to_collection'
>
> Yes, splitting to a collection will always produce tabular format. It's ok not to care about the input format until after that step, but then change the datatype of its output to sra_manifest.tabular.

I don't understand. Do you mean this or something else:

lldelisle · 2023-10-18T14:40:51Z

The workflow is working. However, fasterq-dump returns a list:paired among its outputs, whether it is given a single accession number or a list of accession numbers, so when we run it in parallel we get a list:list:paired and run into galaxyproject/galaxy#16878.
One way to solve this would be to change the fasterq-dump wrapper so that it outputs a 'paired' when a single accession number is given. @mvdbeek @wm75 what do you think?
Ideally, I would like to have this workflow in IWC for a training next Thursday...
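
To make the nesting problem concrete, here is a small sketch that models collections as nested dicts. It only mimics the shapes involved, not Galaxy's implementation; `fake_fasterq_dump`, `collection_type`, `flatten_outer` and the accessions are hypothetical.

```python
def collection_type(node):
    """Infer a Galaxy-style collection type string from a nested dict model."""
    if not isinstance(node, dict):
        return ""  # a plain dataset, not a collection
    if not node:
        return "list"  # an empty list has no children to inspect
    if set(node) == {"forward", "reverse"}:
        return "paired"
    child = collection_type(next(iter(node.values())))
    return "list:" + child if child else "list"


def fake_fasterq_dump(accessions):
    """Stand-in for the download job: always returns a list:paired."""
    return {
        acc: {"forward": acc + "_1.fastq", "reverse": acc + "_2.fastq"}
        for acc in accessions
    }


def flatten_outer(nested):
    """Drop the outer level: list:list:paired -> list:paired."""
    return {
        inner_id: pair
        for inner in nested.values()
        for inner_id, pair in inner.items()
    }


ACCESSIONS = ["SRR0000001", "SRR0000002"]

# One job for all accessions gives a list:paired.
single_job = fake_fasterq_dump(ACCESSIONS)

# One job per accession: each job still returns a list:paired,
# so mapping over the list adds an outer level.
parallel_jobs = {acc: fake_fasterq_dump([acc]) for acc in ACCESSIONS}
```

In this model, `collection_type(parallel_jobs)` is `list:list:paired`, and `flatten_outer(parallel_jobs)` recovers the same structure as `single_job`. Having the wrapper emit a plain 'paired' for a single accession would avoid the extra level in the first place.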

mvdbeek · 2023-10-18T15:22:53Z

I'm pretty sure we can fix the Galaxy side before this ... and in either case I'm happy to merge the workflow now if you're happy with it.

lldelisle · 2023-10-18T15:27:35Z

We still need to solve an issue @wm75 identified: we must make sure the identifiers we use pass the 'relabel' step.

lldelisle · 2023-10-18T15:29:40Z

I think I can include a fix into my awk step:
https://github.com/galaxyproject/galaxy/blob/d8d27e8e86ec56199bdfda00e81400c02a1776c2/lib/galaxy/tools/relabel_from_file.xml#L201C145-L201C145

lldelisle · 2023-10-18T21:01:27Z

@wm75, tell me if it is OK for you like this (keeping the awk step) or if you prefer that I change it.

wm75 · 2023-10-20T09:55:51Z

Since https://github.com/galaxyproject/galaxy/pull/8757/files, the allowed chars for the Relabel tool also include comma, period and space, so there is no need to replace these with awk.
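
For checking sample names before they reach the relabel step, something like the following could be used. The allow-list is an assumption based on the comment above (word characters and hyphen, plus comma, period and space); it is not copied from the tool, and `invalid_identifiers` is a hypothetical helper.

```python
import re

# Assumed allow-list, for illustration only: word characters and hyphen,
# plus comma, period and space. Not the pattern taken from the Relabel tool.
ASSUMED_ALLOWED = re.compile(r"^[\w\-., ]+$")


def invalid_identifiers(identifiers):
    """Return the identifiers that fall outside the assumed allow-list."""
    return [name for name in identifiers if not ASSUMED_ALLOWED.match(name)]


# Hypothetical sample names; "a/b" and the empty string are rejected.
rejected = invalid_identifiers(["sample A", "rep.1,2", "a/b", ""])
```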

wm75 · 2023-10-20T09:58:03Z

Let me check how complicated it would be to replace awk with Cut and Replace ...

lldelisle · 2023-10-20T11:50:41Z

> Since https://github.com/galaxyproject/galaxy/pull/8757/files, the allowed chars for the Relabel tool also include ,. and space so no need to replace these with awk.

Good point. The documentation had not been updated. I've just written a PR to fix it.

Tell me if you manage with Cut and Replace.

wm75

I've adjusted the README a bit and I think we can handle hypothetical sample names with triple underscores in them correctly with just a minimal change to the APPLY_RULES regex (untested though).

workflows/data-fetching/sra-manifest-to-concatenated-fastqs/CHANGELOG.md

workflows/data-fetching/sra-manifest-to-concatenated-fastqs/README.md

...ows/data-fetching/sra-manifest-to-concatenated-fastqs/sra-manifest-to-concatenated-fastqs.ga

Accept sample names with '___' Co-authored-by: Wolfgang Maier <[email protected]>
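
The idea behind accepting '___' inside sample names can be sketched as follows. This assumes identifiers of the form `<run accession>___<sample name>` in which the accession itself never contains `___`; the actual APPLY_RULES regex in the workflow is not reproduced here, and the patterns and helper below are hypothetical.

```python
import re

# Lazy first group: split at the FIRST "___", so the sample name keeps any
# later "___" it contains.
FIRST_SEPARATOR = re.compile(r"^(.+?)___(.*)$")

# Greedy first group: split at the LAST "___", which truncates such names.
LAST_SEPARATOR = re.compile(r"^(.+)___(.*)$")


def split_identifier(identifier, pattern=FIRST_SEPARATOR):
    """Split a combined identifier into (accession, sample name)."""
    match = pattern.match(identifier)
    if match is None:
        raise ValueError("no '___' separator in " + repr(identifier))
    return match.group(1), match.group(2)


# Hypothetical identifier whose sample name itself contains "___".
combined = "SRR0000001___my___sample"
lazy_split = split_identifier(combined)
greedy_split = split_identifier(combined, LAST_SEPARATOR)
```

With the lazy pattern the sample name comes back intact as `my___sample`; with the greedy one it is cut down to `sample`.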

lldelisle · 2023-10-23T09:09:12Z

Thanks. I updated the tests... If they pass, we are ready.

lldelisle · 2023-10-23T09:44:05Z

Youhou! May I click on merge?

wm75 · 2023-10-23T09:52:12Z

Great work @lldelisle !

lldelisle · 2023-10-23T09:52:49Z

I would rather say: great collaboration! Thanks @wm75

wm75 · 2023-10-23T10:32:03Z

Hmm, the merge failed with a failing test now:

Output collection 'paired_output': failed to find identifier 'GSM461177-' in the tool generated elements []
Output collection 'single_output': failed to find identifier 'GSM461176.-' in the tool generated elements []

@mvdbeek any idea why the results would be different from the within-PR testing?

lldelisle · 2023-10-23T12:38:03Z

I've relaunched the CI and then I will check on eu if I can reproduce the error...

lldelisle · 2023-10-23T13:27:53Z

It seems fixed.

mvdbeek · 2023-10-23T15:34:38Z

Yes, I assume a temporary job error. We'll have to rework the planemo testing code a little so we always get a report ...

lldelisle added 3 commits October 9, 2023 22:32

first version

aebc895

clean collections + relabel

5702a5b

generate missing files for PR

a2f3ba5

fix test labels

62e83b7

lldelisle added 3 commits October 10, 2023 21:37

add 2 co-authors + use column number at start + fix filtering

026d65e

remove unnecessary tail step

38b67b8

use split_file_to_collection

5b4b969

relabel list to manifest + update README and dockstore

d1e3b8e

update workflow to make it parallel

04d5bbb

wm75 mentioned this pull request Oct 18, 2023

WF Change datatype post-job action not working for collections with discovered elements galaxyproject/galaxy#16876

Closed

lldelisle added 4 commits October 18, 2023 11:53

hide intermediate results

b647946

Add info in README about ___

35e124e

change order between cut and split by @wm75

fa8ba0f

use default value for SRA ID column

8ffb5fb

lldelisle added 3 commits October 18, 2023 17:41

change awk to ensure RELABEL

ced8e7f

fake sample id to better test

c30ff89

fix awk command

d4c01b2

lldelisle force-pushed the add_concat_fastqs branch from 4b20c63 to d4c01b2 Compare October 18, 2023 20:28

update warning [skip ci]

3da71ec

replace awk by cut and replace by @wm75

e140253

wm75 reviewed Oct 23, 2023

lldelisle and others added 2 commits October 23, 2023 11:05

Include suggestions from @wm75

74cb7b1

Accept sample names with '___' Co-authored-by: Wolfgang Maier <[email protected]>

use sample names with non regular chars

f069303

wm75 approved these changes Oct 23, 2023

wm75 merged commit da6e63e into galaxyproject:main Oct 23, 2023
5 checks passed

lldelisle deleted the add_concat_fastqs branch October 23, 2023 09:52
