Not all workers are restarted on cluster #57

Open
gilrrei opened this issue Dec 20, 2024 · 0 comments
Labels
  • status: open (No solution for this issue has been provided)
  • topic: cluster (Issue/PR related to cluster computations)
  • topic: dask (Issue/PR related to dask)
  • topic: scheduler (Issue/PR related to the schedulers)
  • type: bug report (Issue/PR to highlight/fix a bug)

Comments

gilrrei (Contributor) commented Dec 20, 2024

This issue was originally created by @maxdinkel on gitlab.lrz.de on 2024-01-24.

Motivation and Context

The restart_worker flag should enable the restart of a dask worker after it has finished a job.

Current Behavior

If the ClusterScheduler is used with restart_worker=True, a worker is sometimes not killed after finishing its job and accepts a new one (in roughly 5-10% of cases). Since part of that worker's walltime has already been used by the previous job, the subsequent job running on it might be killed when the walltime limit is reached. The scheduler then retries the killed job on a different worker.

In summary: using restart_worker=True on a ClusterScheduler should still produce reliable results, but it can lead to inefficient use of computational resources.
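To get a feeling for the size of the inefficiency, here is a rough back-of-the-envelope sketch. It is not taken from the project's code: the function names, the one-hour job, and the walltime values are hypothetical, and it assumes that a job killed at the walltime limit loses all of its progress and that worker startup time is negligible. Only the 5%-10% miss rate comes from the observation above.

```python
def wasted_seconds(walltime_s: float, job_s: float) -> float:
    """Time lost when a worker skips its restart and accepts a second job.

    Assumption (not from the issue): the first job used job_s of the
    worker's walltime, and a job killed at the limit loses all its progress.
    """
    remaining = walltime_s - job_s
    if remaining >= job_s:
        return 0.0  # the second job still fits, nothing is lost
    return max(remaining, 0.0)  # the second job runs until the worker is killed


def expected_overhead(miss_rate: float, walltime_s: float, job_s: float) -> float:
    """Expected wasted time per job, as a fraction of one job's duration."""
    return miss_rate * wasted_seconds(walltime_s, job_s) / job_s


if __name__ == "__main__":
    job_s = 3600  # hypothetical one-hour job
    for walltime_s in (4320, 7200):  # 1.2x and 2x the job duration
        for miss_rate in (0.05, 0.10):  # range reported in this issue
            print(walltime_s, miss_rate, expected_overhead(miss_rate, walltime_s, job_s))
```

Under these assumptions the loss depends mainly on how much walltime is left after the first job, which is why the walltime setting matters more than the miss rate itself.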

Possible Solution

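Setting the walltime to about 2-3 times the duration of one job might help, since a worker that misses its restart would then still have enough time left to finish a second job.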
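As a small illustration of this workaround, the sketch below turns an estimated job duration into a walltime string. The helper name `suggested_walltime` and the default factor of 2.5 are hypothetical, and the HH:MM:SS format is an assumption based on what batch systems such as Slurm commonly accept. How the value is actually passed to the ClusterScheduler is not shown in this issue.

```python
import math


def suggested_walltime(job_seconds: float, factor: float = 2.5) -> str:
    """Return a walltime of `factor` times the job duration as HH:MM:SS.

    `factor` should be in the 2-3 range suggested above; 2.5 is only an
    illustrative default, not a value taken from the project.
    """
    total = math.ceil(job_seconds * factor)  # round up to whole seconds
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # A hypothetical 30-minute job with the default factor of 2.5
    print(suggested_walltime(30 * 60))
```

The trade-off is that a longer requested walltime may increase queueing time on the cluster, so the factor should probably not be chosen larger than needed.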

Related Issues

  • Blocks
  • Is blocked by
  • Follows
  • Precedes
  • Related to 666 (the number of the beast)
  • Part of
  • Composed of

Additional Information

Interested Parties

gilrrei added the labels topic: dask, topic: cluster, topic: scheduler, type: bug report, status::unresolved, and status: open, and removed the label status: unresolved on Dec 20, 2024