Resume from crashed head job caused process and its downstream dependency to run at the same time #5595
Comments
Hi Øyvind When you say
Are you sure that it was Nextflow that relaunched the task? If a task directory already exists, Nextflow will not resubmit a task to that same directory. Is it possible that LSF re-submitted the task when it came back online?
Hi Robert,
I don't think that's possible. The resume happened many hours after the cluster came back up, and the resume was submitted through Seqera Platform. I've got traces from Nextflow showing that it thought the resumed process had completed, and logs from the process showing that it continued running. The logs in the process working directory also show two separate runs: one on December 4th and one on the 5th. I've verified through Seqera Platform that the process in question has the same working directory in both the original and the resumed run.
Thanks for the clarification, Øyvind. Nextflow should not re-use the same task working directory on a resumed task, so it's very unexpected to see the same directory being re-used on resume. In the checkCachedOrLaunchTask method, Nextflow checks to see if the work directory exists and if so, avoids submitting to that directory. In pseudocode:
Essentially, Nextflow spins around this loop, generating a deterministic series of hashes, looking for a directory that does not yet exist in which to submit the job. @bentsherman - do you have any idea how a task workDir could possibly be re-used on resume?
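The loop described in this comment can be illustrated with a minimal shell sketch. This is not Nextflow's code: the function names candidate_dir and find_free_workdir, the use of cksum as the hash, and the way the attempt number is mixed into the key are all assumptions made for illustration. It only shows the idea of walking a deterministic series of directory names until one is found that does not exist.

```shell
#!/usr/bin/env bash
# Hypothetical illustration only; names and hashing are NOT Nextflow's.

# Derive a deterministic directory name from a task key and an attempt number.
candidate_dir() {
  local base="$1" task_key="$2" attempt="$3"
  local hash
  hash=$(printf '%s:%s' "$task_key" "$attempt" | cksum | cut -d' ' -f1)
  printf '%s/%s\n' "$base" "$hash"
}

# Walk the deterministic series until a directory that does not exist is found.
find_free_workdir() {
  local base="$1" task_key="$2" attempt=1 dir
  while :; do
    dir=$(candidate_dir "$base" "$task_key" "$attempt")
    if [ ! -e "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    attempt=$((attempt + 1))
  done
}
```

Because the series is deterministic, the same task key always yields the same first candidate, so a second launch is only pushed to a different directory if the existence check is actually performed.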
Indeed, Nextflow should never submit a new task to an existing directory. It should only reuse the directory if the task is completely cached and doesn't need to be re-executed. At first glance, I'm not sure how this situation would happen.
I'm having trouble reproducing the example. When you say
Does this require forcibly terminating the scheduler (LSF), or simply killing the Nextflow process?
Thanks both! And apologies for not having a clean repro. I've reviewed the logs from our cluster, and I think the situation is more complex than I initially realised.
Looking at the code you linked, I could see Nextflow launching a job in the same work directory if one or both of the two variables below end up as null.
That would mean that the first try block completes fine with no existence check on the directory. The block below does check existence, but does not abort a launch if the directory exists.
My best guess is that the failed relaunch (not resume) did something to the cache, though I'm not sure if that's possible. The resume definitely skipped over all the tasks that had completed before the outage, but then somehow decided to re-launch the two processes that were running at the time of the network outage.
Indeed, if there is no cache entry for a given task hash, Nextflow will proceed to execute the task in that directory, without checking whether the directory already exists. It is assuming that if there is no cache entry then there must be no directory either. Meanwhile, the cache entry for a task is not saved until Nextflow sees that the task is complete: nextflow/modules/nextflow/src/main/groovy/nextflow/Session.groovy Lines 1065 to 1068 in b23e42c
Here is a plausible scenario:
So you might be right after all. Let me see if I can reproduce this with the local executor. I think I know what we need to change but I will feel more confident if we also have a red-green test case.
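The state discussed in this comment, a work directory that exists while no cache entry was ever saved for it, can be modelled with a small shell sketch. Everything here is hypothetical: the function classify_task, the idea of a cache entry as a marker file named after the task hash, and the labels it prints are illustrative assumptions, not Nextflow's cache format or the change the maintainers have in mind.

```shell
#!/usr/bin/env bash
# Hypothetical model: a "cache entry" is a marker file named after the task hash.
# This is NOT how Nextflow stores its cache.

# Classify a task hash by comparing the cache with the work directory tree.
classify_task() {
  local cache_dir="$1" work_root="$2" hash="$3"
  if [ -f "$cache_dir/$hash" ]; then
    echo cached      # task completed and its entry was saved
  elif [ -d "$work_root/$hash" ]; then
    echo orphaned    # directory exists, but no cache entry was saved
  else
    echo fresh       # neither cache entry nor directory exists
  fi
}
```

The "orphaned" case is the one at issue: a task was launched, the head job died before the entry was saved, and the directory was left behind, possibly with a job still running in it.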
There is also this minor failsafe to make the job script delete any existing input files: nextflow/modules/nextflow/src/main/groovy/nextflow/executor/SimpleFileCopyStrategy.groovy Lines 132 to 133 in b23e42c
But it's only intended for debugging purposes. I think during normal execution, Nextflow would see the
Bug report
Due to an LSF compute cluster outage, two of our pipelines ended up in a state where the nextflow process was unable to submit jobs to the cluster and crashed. However, the running processes completed successfully.
When we resumed the pipeline after the cluster had recovered from the crash, we saw nextflow
The first launched process remained running for another ten minutes and wrote output files while its dependent process ran, leading to truncated output and a false negative result.
After some digging, I believe that what happened is related to the behaviour of the GridTaskHandler, specifically this line:
https://github.com/nextflow-io/nextflow/blob/18f7de132c0342102f7160ce97b2e68dc6b67[…]xtflow/src/main/groovy/nextflow/executor/GridTaskHandler.groovy
The function readExitStatus() looks for a file called .exitcode and reads the exit code of the completed process from there. Since nextflow resumed a process that had successfully completed after the head job crashed, that file was already present in the work directory when the pipeline resumed, and this caused a premature launch of the downstream process.
I think an easy fix would be to make sure the .exitcode file is deleted before submitting a process, possibly just with an
rm -f .exitcode
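A minimal shell sketch of the proposed fix, assuming the exit file is named .exitcode as in the reproduction steps. The function names read_exit_status and prepare_work_dir are hypothetical and do not correspond to Nextflow's implementation; the sketch only shows why a leftover exit file makes a freshly submitted job look finished, and how removing it before submission avoids that.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; function names are illustrative, not Nextflow's.

# Report the recorded exit status if the exit file exists and is non-empty,
# otherwise report that the job is still running.
read_exit_status() {
  local work_dir="$1"
  if [ -s "$work_dir/.exitcode" ]; then
    cat "$work_dir/.exitcode"
  else
    echo running
  fi
}

# Remove any exit file left by an earlier run before the job is submitted.
prepare_work_dir() {
  local work_dir="$1"
  mkdir -p "$work_dir"
  rm -f "$work_dir/.exitcode"
}
```

Without the call to prepare_work_dir, a poller built on read_exit_status would report the status written by the previous run as soon as the resubmitted job is queued.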
Expected behavior and actual behavior
Expected behaviour: when resuming a pipeline, nextflow should wait for the resumed process to complete executing on the cluster.
Actual behaviour: nextflow immediately launches a downstream process while the resumed process is still running.
Steps to reproduce the problem
The following script illustrates the problem. It has a different output on a grid scheduler than when run locally.
A similar result can be achieved by removing the line echo 0 > .exitcode, launching nextflow on a grid scheduler, killing nextflow when the process has started, and then relaunching with -resume.
Program output
On a grid scheduler, the above will only print
Environment
Additional context
N/A