Very long headers in the FASTA are not parsed correctly #87

Open
CorinYeatsCGPS opened this issue Oct 15, 2024 · 2 comments
Comments

@CorinYeatsCGPS

I'm not sure what the length limit is, but I have a few FASTAs with headers longer than 100 characters, and these seem to cause Kleborate to fall over during the MLST stage. When I replaced the original headers with shortened versions, the FASTA was processed correctly. Simply putting a long run of digits in the header was enough to trigger the issue. It may also be worth noting that, in the FASTA that triggered this issue, the first 300 characters of every record's header were the same, so the headers couldn't just be truncated.

strain  species N50     ST      virulence_score resistance_score        num_resistance_classes  num_resistance_genes
Traceback (most recent call last):
  File "/usr/local/bin/kleborate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/__main__.py", line 154, in main
    module_results = modules[module].get_results(unzipped_assembly, minimap2_index, args, results)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/modules/klebsiella_pneumo_complex__mlst/klebsiella_pneumo_complex__mlst.py", line 73, in get_results
    st, _, alleles = mlst(assembly, minimap2_index, profiles, alleles, genes, None,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/mlst.py", line 44, in mlst
    hits_per_gene = {g: align_query_to_ref(allele_paths[g], assembly_path,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/mlst.py", line 44, in <dictcomp>
    hits_per_gene = {g: align_query_to_ref(allele_paths[g], assembly_path,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 134, in align_query_to_ref
    alignments = [Alignment(x, query_seqs=query_seqs, ref_seqs=ref_seqs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 134, in <listcomp>
    alignments = [Alignment(x, query_seqs=query_seqs, ref_seqs=ref_seqs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 51, in __init__
    self.set_sequences(query_seqs, ref_seqs)
  File "/usr/local/lib/python3.11/site-packages/kleborate/shared/alignment.py", line 88, in set_sequences
    self.ref_seq = ref_seqs[self.ref_name][self.ref_start:self.ref_end]
                   ~~~~~~~~^^^^^^^^^^^^^^^
KeyError: '22222222222222222222222222222222222222222222222222222222222'
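
For anyone else hitting this, the workaround described above (replacing the long headers with short ones before running Kleborate) could be sketched roughly as follows. This is a minimal illustration, not part of Kleborate: the function name `shorten_fasta_headers`, the `prefix` parameter, and the `contig_N` naming scheme are all hypothetical, and it assumes an uncompressed, plain-text FASTA. Sequential IDs are used rather than truncation because the long headers may share a common leading stretch.

```python
def shorten_fasta_headers(src, dst, prefix="contig"):
    """Copy a FASTA file, replacing each header with a short unique ID.

    Hypothetical helper (not a Kleborate function). Returns a dict that
    maps each new short ID back to the original header text, so results
    can be traced to the original records afterwards.
    """
    mapping = {}
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                # Number records in order of appearance: contig_1, contig_2, ...
                new_id = f"{prefix}_{len(mapping) + 1}"
                mapping[new_id] = line[1:].rstrip("\n")
                fout.write(f">{new_id}\n")
            else:
                # Sequence lines are copied through unchanged.
                fout.write(line)
    return mapping
```

Calling `shorten_fasta_headers("original.fasta", "short_headers.fasta")` (file names are placeholders) writes the renamed copy, which can then be passed to Kleborate in place of the original, and the returned mapping recovers the original header for each record.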
@Marysteph self-assigned this Oct 15, 2024
@Marysteph
Collaborator

Thanks @CorinYeatsCGPS. I will address this.

@CorinYeatsCGPS
Author

After final review I found only one instance of this in almost 300,000 FASTA files, so it's not a big problem! Thanks.
