Add checks for invalid sequence characters when writing temporary FASTA files #1529

Merged
merged 1 commit into from
Nov 18, 2024


glennhickey
Collaborator

There seem to be gremlins in Cactus causing invalid FASTA files. In cactus-pangenome, this has manifested as slightly truncated files being imported into chromosome alignment jobs. And in progressive Cactus, there are a couple of issues, #1466 and #1525, reporting corrupt FASTA chunks (missing newline?) going into lastz.

I still don't know what the underlying problem is, though on the pangenome side it really looks like the corruption is happening at the filesystem level.

This PR just adds some asserts to try to catch these errors a little earlier, to (hopefully) make debugging easier if / when they come up again.

  • cactus_sanitizeFastaHeaders now checks the sequence in addition to headers and reports an error if a non-ACGTN character is found.
  • faffy chunk and faffy extract now check (with an assertion) that they only write valid sequence characters, because in #1466 ("stderr=FAILURE: bad fasta character") it looks like the invalid sequence is coming out of faffy chunk.
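
For readers who want a feel for the kind of check described in the list above, here is a minimal Python sketch. It is not the code in this PR (the actual tools are part of Cactus itself); the function name `check_fasta` and the choice to accept lowercase (soft-masked) bases are assumptions made for illustration only.

```python
import io

# Assumption: lowercase (soft-masked) bases are treated as valid alongside ACGTN.
VALID_BASES = frozenset("ACGTNacgtn")


def check_fasta(handle):
    """Scan a FASTA stream and return the number of sequence characters checked.

    Raises ValueError at the first character outside VALID_BASES.
    Header lines (starting with '>') and empty lines are skipped.
    """
    checked = 0
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith(">"):
            continue
        for col, ch in enumerate(line, start=1):
            if ch not in VALID_BASES:
                raise ValueError(
                    f"bad fasta character {ch!r} at line {line_no}, column {col}"
                )
        checked += len(line)
    return checked


if __name__ == "__main__":
    # A header fused onto a sequence line (missing newline) is caught,
    # because '>' is not a valid sequence character.
    try:
        check_fasta(io.StringIO(">s1\nAC>s2\nGT\n"))
    except ValueError as err:
        print(err)
```

A check like this would flag a missing newline between a sequence and the next header, since the `>` ends up in the middle of a sequence line. It would not detect a file that was truncated cleanly at a line boundary.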

@glennhickey glennhickey merged commit d07cb23 into master Nov 18, 2024
1 check passed