Not all workers are restarted on cluster #57

gilrrei · 2024-12-20T16:18:20Z

This issue was originally created by @maxdinkel on gitlab.lrz.de on 2024-01-24.

Motivation and Context

The restart_worker flag should enable the restart of a dask worker after it has finished a job.

Current Behavior

If the ClusterScheduler is used with restart_worker=True, a worker is sometimes not killed after finishing its job and instead accepts a new one (in roughly 5%-10% of cases). The subsequent job running on this worker may then be killed by the walltime constraint, after which the scheduler retries the job on a different worker.

In summary: using restart_worker=True with a ClusterScheduler should still produce reliable results, but it can lead to inefficient use of computational resources.

Possible Solution

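Setting the walltime to about 2-3 times the duration of one job might help, since a worker that was not restarted would then have enough time left to finish a second job before hitting the walltime limit.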
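
As a rough illustration of the 2-3x rule of thumb, here is a minimal sketch that derives a walltime string from the expected duration of a single job. The function name `suggested_walltime`, its parameters, and the `HH:MM:SS` output format are assumptions made for this example. They are not part of the scheduler's API, and the format should be adapted to whatever the cluster's job system expects.

```python
import math


def suggested_walltime(job_duration_s: float, safety_factor: float = 3.0) -> str:
    """Return a walltime as HH:MM:SS, scaled from one job's duration.

    Hypothetical helper: safety_factor follows the 2-3x suggestion above.
    """
    if job_duration_s <= 0:
        raise ValueError("job_duration_s must be positive")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")

    # Round up to whole seconds so the walltime is never shorter than intended.
    total_seconds = math.ceil(job_duration_s * safety_factor)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# Example: a job that takes about 20 minutes, with a factor of 3.
print(suggested_walltime(20 * 60, safety_factor=3.0))
```

This only reduces the chance that a second job on a non-restarted worker is cut off. It does not address why the worker was not restarted in the first place.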

Related Issues

Blocks
Is blocked by
Follows
Precedes
Related to 666 (the number of the beast)
Part of
Composed of

Additional Information

Interested Parties
