Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add concat fastqs from SRA manifest #227

Merged
merged 20 commits into from
Oct 23, 2023

Conversation

lldelisle
Copy link
Contributor

Here is a new workflow when the samples has been sequenced in different run.

Points to discuss:

  • SRA list vs manifest etc... I need more columns than just SRA name
  • The use of parallel download seems to fail, probably because there are too many levels of nested in lists.

@lldelisle
Copy link
Contributor Author

Also I don't like this hard written column 6 buy I don't know how to do because it is a column parameter.

@wm75
Copy link
Contributor

wm75 commented Oct 10, 2023

I've incorporated a couple of suggestions into https://usegalaxy.eu/u/wolfgang-maier/w/sralisttoconcatenatedfastqs-imported-from-url, specifically:

  • perform splitting on the input dataset directly and make sure the header line gets propagated to each collection element
  • the output at this stage will always be of tabular format ->change this to sra_manifest.tabular and feed it directly to the faster download tool (saves cut jobs on collection elements)
  • I've also changed the tool for splitting to toolshed.g2.bx.psu.edu/repos/bgruening/split_file_to_collection/split_file_to_collection/0.5.0 because it's the more versatile tool and is also used in at least one other iwc workflow (covid-19 variation reporting) already, so we might want to promote use of it, but feel free to ignore me
  • the Select last, Cut and Unique steps can all be replaced with a single Datamash step - just that this requires now hard-coding of the final sample name column instead of column 6, again because we cannot create a param of type data_column from a WF input parameter.

I can confirm the remaining problem besides the data_column issue: parallel faster download jobs don't seem to organize their outputs correctly, but produce empty nested list structures

@lldelisle
Copy link
Contributor Author

Meanwhile, I was wondering if we should not start by using the 'cut' tool to keep only the first column with SRA and the column with IDs the user wants. Then we could set column 2. What do you think?
I will have a look at your workflow.

@lldelisle
Copy link
Contributor Author

Datamash is indeed a good improvement.

@lldelisle
Copy link
Contributor Author

@wm75 I have an issue with 'sra_manifest.tabular'. I do not manage to have this datatype out of 'split_file_to_collection':
In the workflow:
image
In the history:
image

And it is the same with the tool split_file_on_column...
I also tried to put a 'apply rule' step between the 'split_file_to_collection' and the 'fasterqdump' to change the datatype.
While in theory it worked:
image
It seems that something went wrong with the column name:
image
That's why I need to add a 'cut1' step after 'split_file_to_collection'.

@lldelisle
Copy link
Contributor Author

The tests pass with the master branch of planemo, should we release planemo?

@lldelisle
Copy link
Contributor Author

@PierreOsteil for your information

@wm75
Copy link
Contributor

wm75 commented Oct 11, 2023

@wm75 I have an issue with 'sra_manifest.tabular'. I do not manage to have this datatype out of 'split_file_to_collection'

Yes, splitting to a collection will always produce tabular format. It's ok not to care about the input format until after that step, but then change the datatype of its output to sra_manifest.tabular.

@lldelisle
Copy link
Contributor Author

@wm75 I have an issue with 'sra_manifest.tabular'. I do not manage to have this datatype out of 'split_file_to_collection'

Yes, splitting to a collection will always produce tabular format. It's ok not to care about the input format until after that step, but then change the datatype of its output to sra_manifest.tabular.

I don't understand. Do you mean this or something else:
image

@lldelisle
Copy link
Contributor Author

The workflow is working. However, because one of the fasterq-dump output for an accession number or for a list of accession number is a list:paired, when we run this in parallel we get a list:list:paired and then we face galaxyproject/galaxy#16878.
A way to solve this would be to change the fasterq-dump wrapper to output a 'paired' when an accession number is given. @mvdbeek @wm75 what do you think?
Ideally, I would like to have this workflow in IWC for a training next Thursday...

@mvdbeek
Copy link
Member

mvdbeek commented Oct 18, 2023

I'm pretty sure we can fix the Galaxy side before this ... and in either case I'm happy to merge the workflow now if you're happy with it.

@lldelisle
Copy link
Contributor Author

We still need to solve an issue @wm75 identified which is to be sure to use identifier which passes the 'relabel' step.

@lldelisle
Copy link
Contributor Author

@lldelisle
Copy link
Contributor Author

@wm75, tell me if it is OK for you like this (keeping the awk step) or if you prefer that I change it.

@wm75
Copy link
Contributor

wm75 commented Oct 20, 2023

Since https://github.com/galaxyproject/galaxy/pull/8757/files, the allowed chars for the Relabel tool also include ,. and space so no need to replace these with awk.

@wm75
Copy link
Contributor

wm75 commented Oct 20, 2023

Let me check how complicated it would be to replace awk with Cut and Replace ...

@lldelisle
Copy link
Contributor Author

Since https://github.com/galaxyproject/galaxy/pull/8757/files, the allowed chars for the Relabel tool also include ,. and space so no need to replace these with awk.

Good point. The documentation had not been updated. I've just wrote a PR to fix it.

Tell me if you manage with a cut replace.

Copy link
Contributor

@wm75 wm75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've adjusted the README a bit and I think we can handle hypothetical sample names with triple underscores in them correctly with just a minimal change to the APPLY_RULES regex (untested though).

lldelisle and others added 2 commits October 23, 2023 11:05
Accept sample names with '___'

Co-authored-by: Wolfgang Maier <[email protected]>
@lldelisle
Copy link
Contributor Author

Thanks. I updated the tests... If they pass, we are ready.

@lldelisle
Copy link
Contributor Author

Youhou! May I click on merge?

@wm75 wm75 merged commit da6e63e into galaxyproject:main Oct 23, 2023
5 checks passed
@wm75
Copy link
Contributor

wm75 commented Oct 23, 2023

Great work @lldelisle !

@lldelisle
Copy link
Contributor Author

I would say. Great collaboration! Thanks @wm75

@lldelisle lldelisle deleted the add_concat_fastqs branch October 23, 2023 09:52
@wm75
Copy link
Contributor

wm75 commented Oct 23, 2023

Hmm, the merge failed with a failing test now:

Output collection 'paired_output': failed to find identifier 'GSM461177-' in the tool generated elements []
Output collection 'single_output': failed to find identifier 'GSM461176.-' in the tool generated elements []

@mvdbeek any idea why the results would be different from the within-PR testing?

@lldelisle
Copy link
Contributor Author

lldelisle commented Oct 23, 2023

I've relaunched the CI and then I will check on eu if I can reproduce the error...

@lldelisle
Copy link
Contributor Author

It seems fixed.

@mvdbeek
Copy link
Member

mvdbeek commented Oct 23, 2023

Yes, I assume a temporary job error. We'll have to rework the planemo testing code a little so we always get a report ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants