Crash after weeks with errors about fastas with bad headers and empty fasta sequences #1525
Yes, this does seem to be the same issue as #1466. Unfortunately, I was never able to reproduce it properly or figure it out, and even worse, I don't think there's a way to recover from it. All you can do is run step-by-step, which makes issues like this much easier to recover from (though I realize this doesn't help you recover the time lost from your current run). But there does seem to be something going on where fasta files get corrupted going through the pipeline. I've seen it happen once myself in the pangenome pipeline (only with docker binaries), and it seemed to be a filesystem-related error. If you're somehow able to reproduce the error by aligning just two genomes, that would help a lot in tracking it down.
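A minimal sketch of what such a two-genome reproduction could look like; the seqFile contents, genome names, and paths below are placeholders rather than anything from this run:

```sh
# Hypothetical two-genome seqFile: Newick tree on the first line,
# then one "name path" pair per line (all names and paths are placeholders).
cat > two_genomes.txt <<EOF
(genomeA:1.0,genomeB:1.0);
genomeA /path/to/genomeA.fa
genomeB /path/to/genomeB.fa
EOF

# Align just this pair in a fresh jobstore and see whether the
# bad-header / empty-sequence errors come back.
cactus ./jobstore-two two_genomes.txt two_genomes.hal
```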
As an aside, do you know which file system you're using? We've had trouble with NFS before...
Thanks for responding! It's certainly suspicious: I am getting errors with other species as well, including ones where I have successfully run cactus on the exact same genomes before, which makes me suspect that something is getting corrupted during the course of the run. We're on the XFS file system, I believe! There have been periods of very heavy I/O on the system during this run; I'm not sure whether slow reads/writes on their own could cause this issue.
@diekhans @adamnovak do we know of any XFS-related toil issues?
I don't think so; XFS is a local filesystem, and I'm not finding anything about it having any important deviations from POSIX guarantees on reading other processes' writes.
Thanks for your help here, I'll try to rerun using the step-by-step protocol unless there are any other possible fixes you can think of!
Yeah, the step-by-step protocol is a bit more of a hassle, but it gives you full agency to recover from any errors, switching parameters or cactus versions if necessary, so it is definitely worth it here. I put a few checks into the most recent version of cactus (2.9.3), as well as some (long-shot) attempts to correct these errors, so it would be ideal to use that version if possible.
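For reference, a sketch of how the step-by-step command list can be generated with cactus-prepare, following the usage pattern from the cactus documentation; the seqFile name and output paths are placeholders:

```sh
# Print the sequence of step-by-step commands (cactus-preprocess,
# cactus-blast, cactus-align, halAppendSubtree) without running them.
# "mammals.txt" and the output paths below are placeholders.
cactus-prepare mammals.txt \
    --outDir steps-output \
    --outSeqFile steps-output/mammals.txt \
    --outHal steps-output/mammals.hal \
    --jobStore jobstore
```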
Actually, I do have a question about that. I've looked at the documentation and tried running cactus-prepare and cactus-prepare-toil. cactus-prepare gives you the sequence of commands, but none of them use the job submitter to parallelize the work, and I don't see a way to tell cactus-prepare to use slurm. I figured that cactus-prepare-toil would do this, but it seems to do essentially the same thing as the full cactus alignment: it starts running everything and doesn't output the commands as far as I can see. Is there a way to run cactus-prepare where each step is parallelized, but the jobstore is still the nice straightforward jobstore/0 etc., and then also check whether each job completed successfully, in case I need to restart the whole thing and pick up where I left off?
Right, you can run any individual command on slurm by adding Toil's --batchSystem slurm option to it. What I've done in the past for larger alignments is use the WDL output from cactus-prepare. This is what I'm planning on doing for an upcoming large alignment here, and I'll add documentation to the cactus README as I go.
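A sketch of what that might look like, assuming the cactus-blast line below stands in for one of the commands printed by cactus-prepare (the ancestor name, jobstore index, and paths are placeholders):

```sh
# Run one prepared step on slurm by appending Toil's batch-system option
# (the command itself is a placeholder taken from a cactus-prepare listing).
cactus-blast jobstore/0 steps-output/mammals.txt steps-output/Anc0.cigar \
    --root Anc0 \
    --batchSystem slurm

# Alternatively, have cactus-prepare emit a WDL workflow instead of a
# plain command list, so a WDL runner can handle the parallelization.
cactus-prepare mammals.txt \
    --outDir steps-output \
    --outHal steps-output/mammals.hal \
    --wdl > mammals.wdl
```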
This error appears to be similar to #1466
After running for a few weeks on ~50 mammalian fastas from NCBI, cactus crashes with two seemingly related errors from a number of toil jobs. I'm wondering if there's any way to fix these errors and restart (just restarting leads to the same errors), or whether fixing them means going back to the start and doing some sort of sanitization of the fastas myself. (I tried setting checkAssemblyHub="0" in the config on the restart, but I expect there's a copy of the config stored somewhere else already, or that it can't be changed mid-run.) Any help you could give here would be much appreciated, as I've really run up my slurm sshare usage and it will take very long to restart :) Thanks!
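As a rough starting point for that kind of fasta sanitization, a small shell sketch (not a cactus tool) that scans the inputs for records with no sequence lines and for headers with no name; the fasta directory is a placeholder:

```sh
# Flag any fasta record whose header is followed directly by another header
# or by end-of-file, i.e. an empty sequence.
for f in /path/to/fastas/*.fa; do   # directory is a placeholder
    awk -v file="$f" '
        /^>/ { if (prev_header) print file ": empty sequence after " prev;
               prev_header = 1; prev = $0; next }
        NF   { prev_header = 0 }
        END  { if (prev_header) print file ": empty sequence after " prev }
    ' "$f"
done

# Flag headers that are nothing but ">" (no sequence name).
grep -Hn '^>[[:space:]]*$' /path/to/fastas/*.fa
```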
These are the seemingly relevant errors: