Deal with HELD jobs in a better way #34

Open
pmendes opened this issue Jun 13, 2022 · 0 comments

pmendes commented Jun 13, 2022

When HTCondor reports jobs as "held", cloud-copasi marks the whole task as "Error" and never checks it again.

It would be better if the cloud-copasi-daemon periodically re-checked those tasks and, if their jobs are no longer held, moved the task to either the RUNNING or FINISHED status. This would cover the case where an admin has released the jobs in HTCondor, allowing the held jobs to run again.
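A rough sketch of how the re-check could map job states back to a task status. The numeric values are HTCondor's standard `JobStatus` codes; the function name `derive_task_status` and the returned status strings are hypothetical and are not taken from the cloud-copasi code base.

```python
from enum import IntEnum
from typing import Iterable


class JobStatus(IntEnum):
    """HTCondor JobStatus ClassAd values."""
    IDLE = 1
    RUNNING = 2
    REMOVED = 3
    COMPLETED = 4
    HELD = 5
    TRANSFERRING_OUTPUT = 6
    SUSPENDED = 7


def derive_task_status(job_statuses: Iterable[int]) -> str:
    """Hypothetical helper: summarise a task from the states of its jobs.

    Returns "held", "error", "finished" or "running" (illustrative names).
    """
    statuses = [JobStatus(s) for s in job_statuses]
    if not statuses:
        raise ValueError("task has no jobs to inspect")
    # Any held job keeps the task in a re-checkable "held" state
    # instead of a terminal error.
    if any(s is JobStatus.HELD for s in statuses):
        return "held"
    # Assumption: a removed job means the task cannot complete.
    if any(s is JobStatus.REMOVED for s in statuses):
        return "error"
    if all(s is JobStatus.COMPLETED for s in statuses):
        return "finished"
    # Idle, running, transferring or suspended jobs: still in progress.
    return "running"
```

With this, a task whose jobs were held and later released would come back as "running" or "finished" on the next daemon pass instead of staying in "Error".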

An even better solution would be for the daemon to attempt to release the jobs automatically (once or twice, or a configurable number of times).

One of the most common reasons for jobs to be held in the HTCondor queue is the Slurm system going offline or rebooting. After that, releasing the jobs will quite possibly allow them to finish. This is not always the case, though, so if we do this there must be a way to avoid an infinite loop of releasing and re-holding jobs; that is why a setting for the maximum number of release attempts would be useful.
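A minimal sketch of the bounded release idea, assuming a per-job attempt counter. `HeldJob`, `try_release`, `release_fn` and `max_attempts` are hypothetical names for illustration; in practice `release_fn` might wrap HTCondor's `condor_release` command, and `max_attempts` would come from a cloud-copasi setting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HeldJob:
    """Hypothetical record of a held job and how often we tried to release it."""
    job_id: str
    release_attempts: int = 0


def try_release(job: HeldJob,
                release_fn: Callable[[str], None],
                max_attempts: int = 2) -> bool:
    """Release the job unless the attempt budget is exhausted.

    Returns True if a release was attempted, False if the job should
    be left held (and the task reported as an error).
    """
    if job.release_attempts >= max_attempts:
        return False
    # Count the attempt before calling out, so a failing release_fn
    # still consumes budget and cannot cause an endless retry loop.
    job.release_attempts += 1
    release_fn(job.job_id)
    return True


# Usage: the third call is refused because the budget of two is spent.
released = []
job = HeldJob("12.0")
outcomes = [try_release(job, released.append, max_attempts=2) for _ in range(3)]
```

Persisting the counter with the job (rather than in daemon memory) would keep the limit effective across daemon restarts.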
