
Crash after weeks with errors about fastas with bad headers and empty fasta sequences #1525

Open
mtvector opened this issue Nov 15, 2024 · 9 comments



mtvector commented Nov 15, 2024

This error appears to be similar to #1466

After running for a few weeks on ~50 mammalian fastas from NCBI, cactus crashes with two seemingly related errors from a bunch of toil jobs. I'm wondering whether there's any way to fix these errors and restart (simply restarting leads to the same errors), or whether I'd have to go back to the start and do some sort of sanitization of the fastas myself (I tried setting checkAssemblyHub="0" in the config on the restart, but I expect a copy of the config is already stored somewhere, or that it can't be changed mid-run). Any help you could give here would be much appreciated, as I've really run up my slurm sshare usage and restarting will take a very long time :) Thanks!

These are the seemingly relevant errors:

<=========
Log from job "'run_lastz' kind-run_lastz/a/instance-p4oz3d0s v24" follows:
=========>
        [2024-11-12T15:16:49-0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2024-11-12T15:16:49-0800] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host n239.
        [2024-11-12T15:16:50-0800] [MainThread] [I] [toil.worker] Working on job 'run_lastz' kind-run_lastz/a/instance-p4oz3d0s v22
        [2024-11-12T15:16:51-0800] [MainThread] [I] [toil.worker] Loaded body Job('run_lastz' kind-run_lastz/a/instance-p4oz3d0s v22) from description 'run_lastz' kind-run_lastz/a/instance-p4oz3d0s v22
        [2024-11-12T15:17:00-0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
        [2024-11-12T15:17:00-0800] [MainThread] [C] [toil.worker] Worker crashed with traceback:
        Traceback (most recent call last):
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/worker.py", line 438, in workerScript
            job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/job.py", line 2984, in _runner
            returnValues = self._run(jobGraph=None, fileStore=fileStore)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/job.py", line 2895, in _run
            return self.run(fileStore)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/job.py", line 3158, in run
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/cactus/paf/local_alignment.py", line 67, in run_lastz
            kegalign_messages = cactus_call(parameters=lastz_cmd, outfile=alignment_file, work_dir=work_dir, returnStdErr=gpu>0, gpus=gpu,
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/cactus/shared/common.py", line 910, in cactus_call
            raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
        RuntimeError: Command ['lastz', 'Equus_quagga_5.fa[multiple][nameparse=darkspace]', 'Rousettus_aegyptiacus_6.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=1', '--ambiguous=iupac,100,100', '--ydrop=3000'] exited 1: stderr=FAILURE: bad fasta character in Rousettus_aegyptiacus_6.fa, >id=Rousettus_aegyptiacus|NW_023416290.1|60676320|0 (greater than sign ">")
        remove or replace non-ACGTN characters or consider using --ambiguous=iupac
"er2.log" 76751L, 9051973C
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
        remove or replace non-ACGTN characters or consider using --ambiguous=iupac


        [2024-11-12T15:21:18-0800] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host n107
<=========
Log from job "'run_lastz' kind-run_lastz/instance-5hydikq5 v24" follows:
=========>
        [2024-11-12T15:24:51-0800] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2024-11-12T15:24:51-0800] [MainThread] [I] [toil] Running Toil version 7.0.0-d569ea5711eb310ffd5703803f7250ebf7c19576 on host n260.
        [2024-11-12T15:24:51-0800] [MainThread] [I] [toil.worker] Working on job 'run_lastz' kind-run_lastz/instance-5hydikq5 v22
        [2024-11-12T15:24:53-0800] [MainThread] [I] [toil.worker] Loaded body Job('run_lastz' kind-run_lastz/instance-5hydikq5 v22) from description 'run_lastz' kind-run_lastz/instance-5hydikq5 v22
        [2024-11-12T15:24:54-0800] [MainThread] [I] [toil-rt] 2024-11-12 15:24:54.291271: Running the command: "lastz Sarcophilus_harrisii_1.fa[multiple][nameparse=darkspace] Ornithorhynchus_anatinus_6.fa[nameparse=darkspace] --format=paf:minimap2 --step=1 --ambiguous=iupac,100,100 --ydrop=3000"
        [2024-11-12T15:24:54-0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:
        [2024-11-12T15:24:54-0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-mh20pmkj/cleanup/file-0e6e828d1f2d4c668210a77f891ddf01/1.fa' to path '/scratch/fast/21879688/toilwf-0801d94f683b5ec49f533a5216cef17b/402c/job/tmpxaa2qbk_/Sarcophilus_harrisii_1.fa'
        [2024-11-12T15:24:54-0800] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-make_chunked_alignments/instance-mh20pmkj/cleanup/file-f812457ea7ab45edbb84ea36ed064420/6.fa' to path '/scratch/fast/21879688/toilwf-0801d94f683b5ec49f533a5216cef17b/402c/job/tmpxaa2qbk_/Ornithorhynchus_anatinus_6.fa'
        [2024-11-12T15:24:54-0800] [MainThread] [C] [toil.worker] Worker crashed with traceback:
        Traceback (most recent call last):
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/worker.py", line 438, in workerScript
            job._runner(jobGraph=None, jobStore=job_store, fileStore=fileStore, defer=defer)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/job.py", line 2984, in _runner
            returnValues = self._run(jobGraph=None, fileStore=fileStore)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/job.py", line 2895, in _run
            return self.run(fileStore)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/toil/job.py", line 3158, in run
            rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/cactus/paf/local_alignment.py", line 67, in run_lastz
            kegalign_messages = cactus_call(parameters=lastz_cmd, outfile=alignment_file, work_dir=work_dir, returnStdErr=gpu>0, gpus=gpu,
          File "/home/matthew.schmitz/Matthew/utils/miniforge3/envs/cactus2/lib/python3.8/site-packages/cactus/shared/common.py", line 910, in cactus_call
            raise RuntimeError("{}Command {} exited {}: {}".format(sigill_msg, call, process.returncode, out))
        RuntimeError: Command ['lastz', 'Sarcophilus_harrisii_1.fa[multiple][nameparse=darkspace]', 'Ornithorhynchus_anatinus_6.fa[nameparse=darkspace]', '--format=paf:minimap2', '--step=1', '--ambiguous=iupac,100,100', '--ydrop=3000'] exited 1: stderr=FAILURE: target file Sarcophilus_harrisii_1.fa contains no sequence
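In case it helps anyone hitting the same thing, this is roughly how I'd scan fastas for the two problems lastz reports (empty records, and a ">" showing up inside sequence data) -- just a rough shell sketch, nothing cactus-specific, with the *.fa glob standing in for wherever the chunked fastas live:

    for f in *.fa; do
        # flag any header that is never followed by a sequence line
        awk '/^>/ { if (h != "" && !seen) print FILENAME ": empty record " h
                    h = $0; seen = 0; next }
             NF   { seen = 1 }
             END  { if (h != "" && !seen) print FILENAME ": empty record " h }' "$f"
        # a ">" anywhere but column 1 means a header got glued onto a sequence line
        grep -n '.>' "$f" | head -n 5
    done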
@glennhickey
Collaborator

Yes, this does seem to be the same issue as #1466. Unfortunately, I was never able to reproduce it properly or figure it out, and worse, I don't think there's a way to recover from it. All you can do is run step by step, which makes issues like this much easier to recover from (though I realize that doesn't help you recover the time lost from your current run).

But there does seem to be something going on where fasta files get corrupted as they move through the pipeline. I've seen it happen once myself in the pangenome pipeline (only with docker binaries), and it seemed to be a filesystem-related error.

If you're somehow able to reproduce the error by aligning just two genomes, Sarcophilus_harrisii.fa and Ornithorhynchus_anatinus.fa, and can share those genomes with me, I'd very much like to try to reproduce it here. Otherwise, I'm afraid I don't really know what to suggest.
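For what it's worth, a minimal two-genome reproduction is just a seqfile with those two genomes and a tiny two-leaf tree -- something like this sketch, with the paths as placeholders:

    # repro.txt: newick tree on the first line, then one "name path" per genome
    cat > repro.txt <<'EOF'
    (Sarcophilus_harrisii:1.0,Ornithorhynchus_anatinus:1.0);
    Sarcophilus_harrisii /path/to/Sarcophilus_harrisii.fa
    Ornithorhynchus_anatinus /path/to/Ornithorhynchus_anatinus.fa
    EOF
    cactus ./repro-jobstore repro.txt repro.hal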

@glennhickey
Collaborator

As an aside, do you know which file system you're using? We've had trouble with NFS before...

@mtvector
Author

Thanks for responding!

It's certainly suspicious: I'm getting errors with other species as well, including ones where I've successfully run cactus using the exact same genome before. That makes me suspect that something is being corrupted during the course of the run.

We're on the XFS file system, I believe! There have been periods of very heavy I/O on the system during this run; I'm not sure whether slow reads/writes alone could cause this issue.

@glennhickey
Collaborator

@diekhans @adamnovak do we know of any XFS-related toil issues?

@adamnovak
Collaborator

I don't think so; XFS is a local filesystem, and I'm not finding anything about it having any important deviations from POSIX guarantees on reading other processes' writes.

@mtvector
Author

Thanks for your help here! I'll try to rerun using the step-by-step protocol unless there are any other possible fixes you can think of.

@glennhickey
Collaborator

Yeah, the step-by-step approach is a bit more of a hassle, but it gives you full agency to recover from any errors, switching parameters or cactus versions if necessary, so it's definitely worth it here.

I put a few checks in the most recent version of cactus (2.9.3) as well as some (long-shot) attempts to correct these errors, so it would be ideal to use that version if possible.
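For reference, the step-by-step flow is just cactus-prepare printing an ordered list of commands that you then run (and can re-run) yourself; the shape is roughly this, with the file names as placeholders:

    # print the plan; --outDir holds intermediates and the numbered jobstores
    cactus-prepare mySeqFile.txt --outDir steps --outSeqFile steps/mySeqFile.txt
    # it emits ordered commands along the lines of
    #   cactus-preprocess ./jobstore/0 mySeqFile.txt steps/mySeqFile.txt ...
    #   cactus-blast      ./jobstore/1 steps/mySeqFile.txt steps/Anc0.paf --root Anc0
    #   cactus-align      ./jobstore/2 steps/mySeqFile.txt steps/Anc0.paf steps/Anc0.hal --root Anc0
    # each finished step leaves its outputs in steps/, so a crash only costs one step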

@mtvector
Author

Actually, I do have a question about that. I've looked at the documentation and tried running cactus-prepare and cactus-prepare-toil. cactus-prepare gives you the sequence of commands, but none of them use the job submitter to parallelize the work, and I don't see a way to tell cactus-prepare to use slurm. I figured cactus-prepare-toil would do this, but it seems to do essentially the same thing as the full cactus alignment: it starts running everything and doesn't output the commands, as far as I can see.

Is there a way to run cactus-prepare where each step is parallelized, but the jobstore is still the nice straightforward jobstore/0 etc., and also be able to check whether each job completed successfully, in case I need to restart the whole thing and pick up where I left off?

@glennhickey
Collaborator

Right, you can run any individual command on slurm using --batchSystem slurm (which you can add by hand to the prepare output, or use --cactusOptions to add it). But if you run the output directly, you'd only ever align one tree node at a time, which isn't ideal. For 50 genomes it still might be manageable -- you'd have to break it into different scripts at the beginning, where you'd run a few in parallel.
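Concretely, that looks something like this (the exact command lines come from your own cactus-prepare output; the flag placement here is a sketch):

    # hand-edit one emitted command so its own toil jobs go to slurm
    cactus-blast ./jobstore/1 steps/mySeqFile.txt steps/Anc0.paf --root Anc0 \
        --batchSystem slurm
    # or bake the flag into everything cactus-prepare emits up front
    cactus-prepare mySeqFile.txt --outDir steps --cactusOptions '--batchSystem slurm'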

What I've done in the past for larger alignments is use the WDL output from cactus-prepare. There's some documentation on that in the context of Terra, but since then, WDL support has been added to Toil itself. So in theory you can use cactus-prepare to make a WDL and then Toil to execute it on slurm. If something fails, recovery isn't as trivial as fiddling with the sequential script that comes out of cactus-prepare, but modifying the WDL isn't that bad -- and this way you get full parallelism with proper dependency checking.
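Roughly like this (the --wdl flag is the one the Terra docs use; the toil-wdl-runner options are on the Toil side and worth double-checking against its --help):

    # emit the whole plan as a single WDL workflow instead of a shell script
    cactus-prepare mySeqFile.txt --outDir steps --wdl > cactus.wdl
    # then let Toil execute it, farming each task out to slurm
    toil-wdl-runner cactus.wdl --jobStore ./wdl-jobstore --batchSystem slurm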

This is what I'm planning on doing for an upcoming large alignment here, and I'll add documentation to the cactus README as I go.

About cactus-prepare-toil, I originally made that as a hack to get cactus working on a kubernetes cluster, but we've since switched to slurm...
