
Update sanger.config #724

Closed

wants to merge 1 commit into from

Conversation

@maxozo (Member) commented Jul 30, 2024

I think the Sanger small queue limit is 30 min; it might be good to adjust this.


name: New Config
about: A new cluster config

Please follow these steps before submitting your PR:

  • If your PR is a work in progress, include [WIP] in its title
  • Your PR targets the master branch
  • You've included links to relevant issues, if any

Steps for adding a new config profile:

  • Add your custom config file to the conf/ directory
  • Add your documentation file to the docs/ directory
  • Add your custom profile to the nfcore_custom.config file in the top-level directory
  • Add your custom profile to the README.md file in the top-level directory
  • Add your profile name to the profile: scope in .github/workflows/main.yml

@maxozo requested a review from priyanka-surana as a code owner, July 30, 2024 10:02
@muffato (Member) commented Jul 30, 2024

True, but small also batches jobs together. I'm not sure how that's handled.

@priyanka-surana (Contributor) left a review comment

Check with @muffato if you should investigate the small queue batching before merging.

@maxozo (Member, Author) commented Jul 30, 2024

Hmm, let me try it on one pipeline that spins up loads of small jobs and see if it causes any issues.

@muffato (Member) left a review comment

@maxozo

  • ISG think the 30 min limit applies to each job within the batch. I'm running a test now and will confirm later.
  • However, once a job has been committed to a batch, it has to wait for all previous jobs of the batch to start. Say Nextflow submits one or more jobs to the small queue: LSF may take them, together with other jobs of the user, into a batch that is dispatched on a machine somewhere. Because of batching, the n-th job may wait up to (n-1) x 30 min before even starting, which can be an unnecessary wait. (Without batching, each job has the chance of starting somewhere as soon as resources become available.) It's not something I want on the Tree of Life cluster without further testing. Can you add a test based on clustername.startsWith("tol") so that we still use a threshold of 1 min on the ToL farm?
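
A minimal sketch of what that could look like, assuming a time-based queue selection closure and a `clustername` variable already resolved elsewhere in the config; the thresholds and queue names here are illustrative, not the actual sanger.config:

```groovy
// Sketch only, not the merged sanger.config: keep the original 1 min cut-off
// for the small queue on the Tree of Life cluster, and use the 30 min cut-off
// proposed in this PR elsewhere. `clustername` is assumed to be defined
// earlier in the config.
process {
    queue = {
        def smallLimit = clustername.startsWith('tol') ? 1.min : 30.min
        if( task.time < smallLimit ) return 'small'
        if( task.time > 12.h )       return 'long'   // illustrative upper tier
        return 'normal'
    }
}
```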

@maxozo (Member, Author) commented Jul 30, 2024

I agree that if this is what it truly does, I don't think we want that in Humgen either.
Let me know, and thanks for testing this. Let's ignore this request if the small queue really does that.

@SamD28 commented Jul 30, 2024

It seems like CHUNK_JOB_SIZE is set to 10. I agree it's probably not best to have jobs that take the full 30 minutes tied into chunks of 10. Maybe there is an intermediate limit, such as 10 or 15 minutes max, that wouldn't cause much delay but would actually get use out of the small queue? Not sure how often nf-core pipelines request that sort of time!
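
To make the trade-off concrete, here is a rough back-of-envelope sketch (plain Groovy, not part of any config) of the worst-case start delay for the last job in a chunk, assuming the chunk size of 10 mentioned above and that chunked jobs run one after another:

```groovy
// Rough illustration only: worst-case wait before the last job in a chunk
// starts, assuming CHUNK_JOB_SIZE = 10 and sequential execution within a chunk.
def chunkSize = 10
[30, 15, 10, 5].each { limitMin ->
    def worstWaitMin = (chunkSize - 1) * limitMin
    println "per-job limit ${limitMin} min -> last job may wait up to ${worstWaitMin} min"
}
```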

@muffato (Member) commented Jul 31, 2024

Hi. I can confirm that the 30 min RUNLIMIT applies to the jobs within the batch, not to the entire batch. In my test, each job ran for 20 min, and the batch ran for 200 min in total.

@maxozo (Member, Author) commented Jul 31, 2024

Hmm, yes, that's not good and defeats the purpose. Thanks for checking this, @muffato.
I might mention this to Systems and see if they can change this so the small queue is utilised more.

@maxozo (Member, Author) commented Jul 31, 2024

Some of the comments I got from ISG, @maxozo. Just sharing them here:

I haven't tested the behaviour or measured how long the last/incomplete batch takes to get started, but as mentioned earlier this use is intended for LOTS of small jobs, so the scheduler saving would (hopefully) outweigh the delayed start of the last batch. If you're running a small number of small jobs (finger in the air: <100) you are probably more interested in the overall runtime, and the potential scheduler savings is smaller, so you might as well just put them in "normal" (which shares the fairshare policy with "small" and uses the same execution hosts, so there is little difference from your point of view).

if a job takes a minute then about 25% of the time is wasted on the scheduler cycles

I presume that in ToL you have pipelines with jobs that request <30 min and there are not too many of them?
What do you think about @SamD28's suggestion to make it a 15 min limit? Or perhaps a 10 min limit?

@muffato (Member) commented Jul 31, 2024

Systems also told me this:

The idea is that the batching helps reduce scheduler overhead for the properly short jobs (<5 minutes). The 30 minutes is a "backstop" for any outliers that happen to run a little longer than expected, to allow them to succeed, while still having a net to catch long jobs which shouldn't be there at all.

I think 10 minutes may even be too high a threshold for the small queue; 5 min may be fine? But how many jobs declare a time under 10 minutes? Across our production pipelines, we only have one job that may declare time = 10.min or time = 20.min. I use time = 30.min as the default minimum otherwise (that's why I was uneasy with using 30 min as the threshold).

@sb10 commented Jul 31, 2024

You should probably change the config to just remove the fall-through to the small queue.
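
As a rough sketch of that suggestion (illustrative thresholds and queue names, not the actual sanger.config), the queue selection closure would simply stop routing anything to small:

```groovy
// Sketch only: no fall-through to 'small'; short jobs just go to 'normal'.
// Thresholds and queue names are assumptions for illustration.
process {
    queue = { task.time > 12.h ? 'long' : 'normal' }
}
```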

If you ever care about the efficiency of your jobs and want to get them run as quickly as possible, especially if you have a workflow with thousands of <5 min jobs, you'll want to switch to wr as your scheduler.

wr nextflow.pdf

@muffato closed this Aug 6, 2024