Issues integrating GPU-accelerated search in colabfold alignment protocol #904

Open
clami66 opened this issue Nov 22, 2024 · 3 comments
clami66 commented Nov 22, 2024

I am trying to integrate the new GPU-accelerated search in colabfold_search. From what I can see, only search and easy-search are GPU-accelerated. However, the colabfold_search alignment protocol also includes an expandaln step (among others).

Unfortunately, it seems like expandaln is incompatible with the padded sequence DB generated and indexed for GPU, as running mmseqs expandaln on this database will cause it to crash. I think this is because the database .idx.index file lacks rows 24-25, i.e. ALNINDEX, ALNDATA as defined here: https://github.com/soedinglab/MMseqs2/blob/266c894c117a9bd650450974747424ce51124bf5/src/prefiltering/PrefilteringIndexReader.cpp#L33C1-L34C52

I thought that this was due to using the --index-subset 2 flag when running mmseqs createindex as recommended in the guide, but even using --index-subset 0 doesn't fix the issue for me.

Now I am wondering if the whole alignment protocol should change (e.g. by removing expandaln altogether) or perhaps there is something I am doing incorrectly when setting the database up? Thanks for any help on this!

Steps to Reproduce (for bugs)

  1. Generate the padded DB:
    mmseqs makepaddedseqdb uniref30_2302_db uniref30_2302_db_gpu

  2. Generate the index (either with --index-subset 0 or --index-subset 2)

$ mmseqs createindex uniref30_2302_db_gpu tmp --split 0 --index-subset 0
...
Write VERSION (0)
Write META (1)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write GENERATOR (22)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
...
Write ENTRIES (9)
Write ENTRIESOFFSETS (10)
Write SEQINDEXDATASIZE (15)
Write SEQINDEXSEQOFFSET (16)
Write SEQINDEXDATA (14)
Write ENTRIESNUM (12)
Write SEQCOUNT (13)

  3. The resulting .idx.index file lacks rows 24-25:
$ tail uniref30_2302_db_gpu.idx.index
...
21      10770190336     105711065
22      20480   41
23      16384   1
  4. Run mmseqs expandaln
mmseqs expandaln ./example/qdb colabfold_databases/uniref30_2302_db_gpu.idx ./example/res colabfold_databases/uniref30_2302_db_gpu.idx ./res_exp
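
The check for missing rows in the .idx.index file can be scripted rather than read off the tail output. The following is a minimal sketch, not part of MMseqs2: missing_index_keys is a hypothetical helper name, and it assumes the .index layout visible in the tail output above (key, offset, length, separated by whitespace).

```shell
#!/bin/sh
# missing_index_keys FILE KEY...
# Hypothetical helper: prints each requested key that does not appear in the
# first column of FILE. Assumes whitespace-separated rows of key, offset, length.
missing_index_keys() {
    index_file=$1
    shift
    for key in "$@"; do
        # awk exits 0 only if some row has the requested key in column 1
        if ! awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$index_file"; then
            printf '%s\n' "$key"
        fi
    done
}

# Example call (file name taken from the steps above):
# missing_index_keys uniref30_2302_db_gpu.idx.index 24 25
```

If the function prints 24 and 25, the ALNINDEX and ALNDATA entries are absent from the index, which matches what is described above.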

MMseqs Output

expandaln crashes while attempting to load the index:

MMseqs Version:                 dc7395810db17ec7de8adf32599562452b0c4d78
Expansion mode                  0
Substitution matrix             aa:blosum62.out,nucl:nucleotide.out
Gap open cost                   aa:11,nucl:5
Gap extension cost              aa:1,nucl:2
Max sequence length             65535
Score bias                      0
Compositional bias              1
Compositional bias              1
E-value threshold               0.001
Seq. id. threshold              0
Coverage threshold              0
Coverage mode                   0
Pseudo count mode               0
Pseudo count a                  substitution:1.100,context:1.400
Pseudo count b                  substitution:4.100,context:5.800
Expand filter clusters          0
Use filter only at N seqs       0
Maximum seq. id. threshold      0.9
Minimum seq. id.                0.0
Minimum score per column        -20
Minimum coverage                0
Select N most diverse seqs      1000
Preload mode                    0
Compressed                      0
Threads                         128
Verbosity                       3

Index version: 16
Generated by:  dc7395810db17ec7de8adf32599562452b0c4d78
ScoreMatrix:  VTML80.out
Index version: 16
Generated by:  dc7395810db17ec7de8adf32599562452b0c4d78
ScoreMatrix:  VTML80.out
Invalid database read for database data file=colabfold_databases/uniref30_2302_db_gpu.idx, database index=colabfold_databases/uniref30_2302_db_gpu.idx.index
getData: local id (4294967295) >= db size (22)
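
A note on the last line of the log: 4294967295 is 2^32 - 1, the largest unsigned 32-bit value. That would be consistent with a key lookup that found nothing and returned an "invalid id" sentinel, although this interpretation is an assumption and not something the log itself states. Below is a small sketch for flagging such lines in saved logs; check_getdata_id is a hypothetical helper name, and the pattern it matches is copied from the error line above.

```shell
#!/bin/sh
# check_getdata_id: reads log text on stdin and reports every
# "getData: local id (N) >= db size (M)" line.
# Assumption: an id equal to 2^32 - 1 indicates a failed key lookup.
check_getdata_id() {
    uint_max=4294967295   # 2^32 - 1
    # Extract the id and the db size from matching lines only
    sed -n 's/^getData: local id (\([0-9][0-9]*\)) >= db size (\([0-9][0-9]*\)).*/\1 \2/p' |
    while read -r id size; do
        # Compare as strings to avoid overflow in shells with 32-bit arithmetic
        if [ "$id" = "$uint_max" ]; then
            echo "id $id is 2^32-1 (db size $size): likely a failed key lookup"
        else
            echo "id $id is out of range for db size $size"
        fi
    done
}

# Example call (log file name is hypothetical):
# check_getdata_id < expandaln.log
```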

Your Environment

  • MMseqs2 commit: dc73958
  • Compiled with -DENABLE_CUDA=1 -DCMAKE_CUDA_ARCHITECTURES="75;80;86;89;90"
  • CUDA environment spec: gcccuda/12.1.1-gcc12.3.0
  • System: NVIDIA SuperPOD/DGX-A100 - Linux
@milot-mirdita
Member

Still working on it, we'll likely release the changes to do ColabFold with MMseqs2-GPU this weekend. colabfold_search doesn't actually require any changes directly. The new protocol can be activated with environment variables only, after building GPU databases.

clami66 commented Nov 22, 2024

Thanks for responding so quickly. I will keep an eye out for the updates.

jbderoo commented Dec 3, 2024

Thanks for all your hard work Milot. Any headway on the MMseqs2-GPU integration with colabfold_search? Or perhaps a guideline on how to make colabfold_search use MMseqs2-GPU with only environment variables?

Thanks again.
