Add checks for invalid sequence characters when writing temporary FASTA files #1529
There seem to be gremlins in Cactus causing invalid FASTA files. In `cactus-pangenome`, this has manifested as slightly truncated files being imported into chromosome alignment jobs. And in progressive Cactus, there are a couple of issues, #1466 and #1525, reporting corrupt FASTA chunks (missing newline?) going into `lastz`.

I still don't know what the underlying problem is, though on the pangenome side it really looks like the corruption is happening at the filesystem level.
This PR just adds some asserts to try to catch these errors a little earlier, to (hopefully) make debugging easier if / when they come up again.
- `cactus_sanitizeFastaHeaders` now checks the sequence in addition to headers and reports an error if a non-ACGTN character is found.
- `faffy chunk` and `faffy extract` changed to check (with an assertion) that they only write valid sequence characters, because in #1466 ("stderr=FAILURE: bad fasta character") it looks like the invalid sequence is coming out of `faffy chunk`.
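
For illustration only, here is a minimal Python sketch of the kind of check described above. It is not the code in this PR: the function names `find_invalid_fasta_char` and `check_fasta` are hypothetical, and accepting lowercase (soft-masked) bases is an assumption. It scans sequence lines for anything outside ACGTN and reports where the first offending character is, which would also flag a header glued onto a sequence line by a missing newline.

```python
import io

# Assumption: soft-masked (lowercase) bases are accepted alongside uppercase.
VALID_BASES = frozenset("ACGTNacgtn")


def find_invalid_fasta_char(handle):
    """Return (line_number, column, char) for the first invalid
    sequence character in a FASTA stream, or None if all are valid."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Header lines are not sequence; skip them.
            continue
        for column, char in enumerate(line, start=1):
            if char not in VALID_BASES:
                return (line_number, column, char)
    return None


def check_fasta(handle):
    """Raise ValueError on the first non-ACGTN sequence character."""
    bad = find_invalid_fasta_char(handle)
    if bad is not None:
        line_number, column, char = bad
        raise ValueError(
            f"bad fasta character {char!r} at line {line_number}, column {column}"
        )


if __name__ == "__main__":
    # A header fused onto the previous sequence line (missing newline).
    corrupt = io.StringIO(">s1\nACGT>s2\nAC\n")
    try:
        check_fasta(corrupt)
    except ValueError as err:
        print(err)
```

Returning the position rather than only failing keeps the error message specific enough to locate the corruption in a large chunk.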