When HTCondor shows jobs as "held", cloud-copasi marks the whole task as "Error" and never checks it again.
It would be better if the cloud-copasi-daemon periodically checked those tasks and, if their jobs are no longer held, moved the task to either the RUNNING or FINISHED status. This would help in the situation where an admin has released the jobs in HTCondor, allowing the held jobs to run again.
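A minimal sketch of how the daemon could re-derive a task's status from its jobs on each periodic check. The numeric codes are HTCondor's standard `JobStatus` values; the function name `task_status_from_jobs`, the status strings, and the rule that a removed job or an empty job list counts as an error are assumptions for illustration, not existing cloud-copasi code.

```python
# HTCondor JobStatus codes as reported in the job ClassAd.
IDLE, RUNNING, REMOVED, COMPLETED, HELD = 1, 2, 3, 4, 5


def task_status_from_jobs(job_statuses):
    """Derive a task-level status from per-job HTCondor JobStatus codes.

    Hypothetical helper: the returned strings only mirror the status
    names mentioned in this issue.
    """
    statuses = list(job_statuses)
    if not statuses:
        # Assumption: a task with no visible jobs is treated as an error.
        return "ERROR"
    if any(s in (HELD, REMOVED) for s in statuses):
        # Still held (or removed): keep the task in the error state,
        # but keep checking it on later passes.
        return "ERROR"
    if all(s == COMPLETED for s in statuses):
        return "FINISHED"
    # Any mix of idle, running, and completed jobs is still in progress.
    return "RUNNING"
```

Called on every daemon pass, this would let a task that was marked "Error" move back to RUNNING or FINISHED once its jobs have been released.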
An even better solution would be for the daemon to attempt to release the jobs automatically (once or twice, or perhaps a configurable number of times).
One of the most common reasons for jobs to be held in the HTCondor queue is the Slurm system going offline or rebooting. After that, it is quite possible that releasing the jobs allows them to finish. This is not always the case, so if we do this there needs to be some way to avoid an infinite loop of releasing and holding jobs. That is why it would be good to have a setting for the number of times to try to release.
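A rough sketch of the capped release logic, assuming the daemon stores a per-job release counter. The names `handle_held_job`, `release_attempts`, and `max_release_attempts` are hypothetical, and `release` is an injected callable (in a real deployment it might wrap HTCondor's `condor_release` command) so the sketch does not depend on any particular HTCondor binding.

```python
def handle_held_job(job_id, release_attempts, max_release_attempts, release):
    """Try to release a held job unless the attempt cap has been reached.

    Returns (new_attempt_count, released). When released is False the
    caller should leave the task in the error state and stop retrying,
    which prevents an endless release/hold loop.
    """
    if release_attempts >= max_release_attempts:
        return release_attempts, False
    release(job_id)  # hypothetical hook for the actual release call
    return release_attempts + 1, True


# Usage with a stand-in release function that only records the calls.
released_ids = []
attempts = 0
for _ in range(3):  # the job is found held on three consecutive checks
    attempts, released = handle_held_job("12.0", attempts, 2, released_ids.append)
```

With a cap of 2, the third check leaves the job held instead of releasing it again.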