
Update sanger.config #724

Closed

wants to merge 1 commit into from

Conversation

@maxozo (Member) commented Jul 30, 2024

I think the Sanger small queue limit is 30 min; it might be good to adjust this.


name: New Config
about: A new cluster config

Please follow these steps before submitting your PR:

  • If your PR is a work in progress, include [WIP] in its title
  • Your PR targets the master branch
  • You've included links to relevant issues, if any

Steps for adding a new config profile:

  • Add your custom config file to the conf/ directory
  • Add your documentation file to the docs/ directory
  • Add your custom profile to the nfcore_custom.config file in the top-level directory
  • Add your custom profile to the README.md file in the top-level directory
  • Add your profile name to the profile: scope in .github/workflows/main.yml

@maxozo requested a review from priyanka-surana as a code owner, July 30, 2024 10:02
@muffato (Member) commented Jul 30, 2024

True, but small also batches jobs together. I'm not sure how that's handled.

@priyanka-surana (Contributor) left a review comment

Check with @muffato if you should investigate the small queue batching before merging.

@maxozo (Member, Author) commented Jul 30, 2024

Hmm, let me try it on one pipeline that spins up loads of small jobs and see if it causes any issues.

@muffato (Member) left a review comment

@maxozo

  • ISG think the 30 min limit applies to each job within the batch. I'm running a test now and will confirm later.
  • However, once a job has been committed to a batch, it has to wait for all previous jobs of the batch to start. Say Nextflow submits one or more jobs to the small queue: LSF may take them, together with other jobs of the user, into a batch that is dispatched on a machine somewhere. Because of batching, the n-th job may wait up to (n-1) x 30 min before even starting, which can be an unnecessary wait. (Without batching, each job has the chance of starting somewhere as soon as resources become available.) It's not something I want on the Tree of Life cluster without further testing. Can you add a test based on clustername.startsWith("tol") so that we still use a threshold of 1 min on the ToL farm?
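
A minimal sketch of what that could look like, assuming a time-based queue selection closure and a `clustername` variable already resolved elsewhere in the config; the thresholds and queue names here are illustrative, not the actual sanger.config:

```groovy
// Sketch only, not the merged sanger.config: keep the original 1 min cut-off
// for the small queue on the Tree of Life cluster, and use the 30 min cut-off
// proposed in this PR elsewhere. `clustername` is assumed to be defined
// earlier in the config.
process {
    queue = {
        def smallLimit = clustername.startsWith('tol') ? 1.min : 30.min
        if( task.time < smallLimit ) return 'small'
        if( task.time > 12.h )       return 'long'   // illustrative upper tier
        return 'normal'
    }
}
```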

@maxozo (Member, Author) commented Jul 30, 2024

I agree that if this is what it truly does, I don't think we want that in Humgen either.
Let me know, and thanks for testing this. Let's ignore this request if the small queue really does that.

@SamD28 commented Jul 30, 2024

It seems like CHUNK_JOB_SIZE is set to 10. I agree it's probably not best to have jobs that take the full 30 minutes tied into chunks of 10. Maybe there is an intermediate limit, such as 10 or 15 minutes max, that wouldn't cause much delay but would actually get use out of the small queue? Not sure how often nf-core pipelines request that sort of time!
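
To make the trade-off concrete, here is a rough back-of-envelope sketch (plain Groovy, not part of any config) of the worst-case start delay for the last job in a chunk, assuming the chunk size of 10 mentioned above and that chunked jobs run one after another:

```groovy
// Rough illustration only: worst-case wait before the last job in a chunk
// starts, assuming CHUNK_JOB_SIZE = 10 and sequential execution within a chunk.
def chunkSize = 10
[30, 15, 10, 5].each { limitMin ->
    def worstWaitMin = (chunkSize - 1) * limitMin
    println "per-job limit ${limitMin} min -> last job may wait up to ${worstWaitMin} min"
}
```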

@muffato (Member) commented Jul 31, 2024

Hi. I can confirm that the 30 min RUNLIMIT applies to the jobs within the batch, not to the entire batch. In my test, each job ran for 20 min, and the batch ran for 200 min in total.

@maxozo (Member, Author) commented Jul 31, 2024

Hmm, yes, that's not good and defeats the purpose. Thanks for checking this, @muffato.
I might mention this to Systems and see if they can change this so the small queue is utilised more.

@maxozo (Member, Author) commented Jul 31, 2024

Some of the comments I got from ISG, @maxozo. Just sharing them here:

I haven't tested the behaviour or measured how long the last/incomplete batch takes to get started, but as mentioned earlier this use is intended for LOTS of small jobs, so the scheduler saving would (hopefully) outweigh the delayed start of the last batch. If you're running a small number of small jobs (finger in the air: <100) you are probably more interested in the overall runtime, and the potential scheduler savings is smaller, so you might as well just put them in "normal" (which shares the fairshare policy with "small" and uses the same execution hosts, so there is little difference from your point of view).

if a job takes a minute then about 25% of the time is wasted on the scheduler cycles

I presume that in ToL you have pipelines with jobs that request <30 min and there are not too many of them?
What do you think about @SamD28's suggestion to make it a 15 min limit? Or perhaps a 10 min limit?

@muffato (Member) commented Jul 31, 2024

Systems also told me this:

The idea is that the batching helps reduce scheduler overhead for the properly short jobs (<5 minutes). The 30 minutes is a "backstop" for any outliers that happen to run a little longer than expected, to allow them to succeed, while still having a net to catch long jobs which shouldn't be there at all.

I think 10 minutes may even be too high a threshold for the small queue; 5 min may be fine? But how many jobs declare a time under 10 minutes? Across our production pipelines, we only have one job that may declare time = 10.min or time = 20.min. I use time = 30.min as the default minimum otherwise (that's why I was uneasy with using 30 min as the threshold).

@sb10 commented Jul 31, 2024

You should probably change the config to just remove the fall-through to the small queue.
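
As a rough sketch of that suggestion (illustrative thresholds and queue names, not the actual sanger.config), the queue selection closure would simply stop routing anything to small:

```groovy
// Sketch only: no fall-through to 'small'; short jobs just go to 'normal'.
// Thresholds and queue names are assumptions for illustration.
process {
    queue = { task.time > 12.h ? 'long' : 'normal' }
}
```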

If you ever care about the efficiency of your jobs and want to get them run as quickly as possible, especially if you have a workflow with thousands of <5 min jobs, you'll want to switch to wr as your scheduler.

wr nextflow.pdf

@muffato closed this Aug 6, 2024