Deal with HELD jobs in a better way #34

pmendes · 2022-06-13T12:43:28Z

When HTCondor reports jobs as "held", cloud-copasi marks the whole task as "Error" and never checks it again.

It would be better if the cloud-copasi-daemon periodically checked those tasks and, if their jobs are no longer held, moved the task to either the RUNNING or the FINISHED status. This would cover the situation where the admin has released the jobs in HTCondor, allowing the held jobs to run again.
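A minimal sketch of what that periodic re-check could decide, assuming the daemon can obtain the current HTCondor `JobStatus` code of every job in a task. The function name `reconcile_task_status` and the returned status strings are hypothetical and not taken from the cloud-copasi code; the numeric codes are HTCondor's standard `JobStatus` values. Treating removed jobs as an error is also an assumption.

```python
# HTCondor JobStatus codes (standard values of the JobStatus ClassAd attribute).
IDLE, RUNNING, REMOVED, COMPLETED, HELD = 1, 2, 3, 4, 5


def reconcile_task_status(job_statuses):
    """Return the task status implied by the jobs' current JobStatus codes.

    Hypothetical helper: the names "ERROR", "RUNNING" and "FINISHED" stand in
    for whatever status values cloud-copasi actually stores.
    """
    statuses = list(job_statuses)
    if not statuses:
        raise ValueError("a task must have at least one job")
    # Any job still held (or removed, by assumption) keeps the task in error.
    if any(s in (HELD, REMOVED) for s in statuses):
        return "ERROR"
    # Every job completed: the task is done.
    if all(s == COMPLETED for s in statuses):
        return "FINISHED"
    # Otherwise at least one job is idle or running again.
    return "RUNNING"
```

With this, a task previously marked "Error" would return to RUNNING as soon as none of its jobs is held, for example after an admin released them.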

An even better solution would be for the daemon to attempt (once or twice, maybe a configurable number of times) to release the jobs automatically.

One of the most common reasons for jobs to be held in the HTCondor queue is that the Slurm system goes offline or reboots. After that, it is quite possible that releasing the jobs allows them to finish. This is not always the case, so if we do this there needs to be some way to avoid an infinite loop of releasing and re-holding jobs; that is why it would be good to have a setting for the number of release attempts.
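The release limit could be sketched roughly as follows. This is only an illustration of the idea: `HeldJobTracker`, `max_release_attempts` and `handle_held_job` are hypothetical names, and the `release` callable is a placeholder for whatever the daemon would use to release a job (for instance a wrapper around HTCondor's `condor_release` command).

```python
from dataclasses import dataclass


@dataclass
class HeldJobTracker:
    """Counts automatic release attempts for one job (hypothetical helper)."""
    max_release_attempts: int = 2  # assumed to come from a configurable setting
    attempts: int = 0

    def should_release(self) -> bool:
        # Stop once the configured number of attempts has been used up.
        return self.attempts < self.max_release_attempts

    def record_release(self) -> None:
        self.attempts += 1


def handle_held_job(tracker, release):
    """Release a held job if attempts remain; return True if a release was tried.

    `release` is any zero-argument callable that performs the actual release.
    """
    if not tracker.should_release():
        return False  # give up; the task stays in the error state
    release()
    tracker.record_release()
    return True
```

Because the counter only ever increases, a job that keeps falling back into the held state is released at most `max_release_attempts` times, which avoids the release/hold loop.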

pmendes added the enhancement label Jun 13, 2022
