Not all workers are restarted on cluster #57
Labels
status: open
No solution for this issue has been provided
topic: cluster
Issue/PR related to cluster computations
topic: dask
Issue/PR related to dask
topic: scheduler
Issue/PR related to the schedulers
type: bug report
Issue/PR to highlight/fix a bug
Motivation and Context
The restart_worker flag should enable the restart of a dask worker after it has finished a job.
Current Behavior
If the ClusterScheduler is used with restart_worker=True, a worker is sometimes not killed and accepts a new job (in around 5%-10% of cases). The subsequent job that runs on this worker might then be killed due to the walltime constraint, after which the scheduler retries the job on a different worker.
To sum it up: using restart_worker=True on a ClusterScheduler should still produce reliable results, but it can lead to inefficient use of computational resources.
Possible Solution
Setting the walltime to about 2-3 times the duration of one job might help.
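As a rough sketch of that workaround, the walltime could be derived from an estimated job duration and a safety factor, so that a worker which was not restarted still has time to finish a second job. The helper name walltime_for, the default factor of 2.5, and the HH:MM:SS output format are assumptions made for illustration; they are not part of this project's API.

```python
import math
from datetime import timedelta


def walltime_for(job_duration: timedelta, safety_factor: float = 2.5) -> str:
    """Return an HH:MM:SS walltime string covering `safety_factor` job durations.

    Hypothetical helper: the name, the default factor, and the output format
    are assumptions and not part of the scheduler's API.
    """
    if job_duration <= timedelta(0):
        raise ValueError("job_duration must be positive")
    if safety_factor < 1:
        raise ValueError("safety_factor must be at least 1")

    # Round up so the walltime never undercuts the requested margin.
    total_seconds = math.ceil(job_duration.total_seconds() * safety_factor)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# A job that takes about 20 minutes, with the default factor of 2.5.
print(walltime_for(timedelta(minutes=20)))
```

The resulting string would then be passed to whatever walltime setting the cluster configuration exposes. A larger factor wastes less work when a worker is not restarted, but it may increase queue waiting time on busy clusters.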
Related Issues
Additional Information
Interested Parties