
Is there any way to reduce memory consumption? #14

Open
glennhickey opened this issue Oct 20, 2020 · 24 comments

@glennhickey
Contributor

Hello, I'm experimenting with adding abPOA as an option within cactus (manuscript). Thanks for making a great tool -- it's amazingly fast.

I was wondering, however, if there's a way to reduce memory consumption in order to increase the sequence lengths I can run on. Right now memory seems roughly quadratic in the sequence length, which is expected from your manuscript. Are there any options I can use to reduce this, and/or have you thought about using the banding to reduce the DP table size (as far as I can tell, it's only used to reduce computation)?
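
(Back-of-envelope, just to illustrate the scale, and assuming on the order of 2 bytes per DP cell: two ~100 kb sequences already imply about 100,000 × 100,000 = 10^10 cells, i.e. roughly 20 GB for a full table, and pushing toward 1 Mb sequences takes that into the terabyte range.)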

@yangao07
Owner

Hi Glenn, you are right.
Right now, the banding in abPOA only reduces the time, not the memory, so it is still quadratic.
I do plan to reduce the memory consumption in different ways, but I haven't implemented anything yet.
I will let you know when I make progress.

Yan

@rlorigro

I would also like to express my interest in resolving this issue 👍 It would be really nice to be able to take full advantage of the banding.

@yangao07
Owner

Hi Glenn,

In the latest version of abPOA (v1.2.0), I implemented minimizer-based seeding before POA, which can reduce the memory usage for long input sequences.
Most of the time, it produces nearly the same or even better alignment results.

Please try it out and let me know if this works for you.

Yan

@glennhickey
Contributor Author

Great! This is perfect timing since I was just about to review some of what I'd been doing for stitching alignments together. I'll try it out on Monday.
Thanks!

@yangao07
Owner

@glennhickey, I just updated abPOA to v1.2.1, which removes a redundant and very time-consuming sorting step.

@glennhickey
Contributor Author

Thanks for letting me know. I'm switching to 1.2.1 now. My 1.2.0 tests have been okay so far: it passes all Cactus unit tests, and it let me disable our current stitching logic on a bigger run. I'll do a much bigger test this week and report the results.

@glennhickey
Contributor Author

Do you have a sense of the maximum sequence lengths I can pass in while using the seeding? I just got an error

    [SIMDMalloc] mm_Malloc fail!
    Size: 549755813888

when I allowed sequences up to 1 Mb. Thanks.

@ekg
Contributor

ekg commented May 19, 2021

@glennhickey what alignment parameters are you using?

@glennhickey
Contributor Author

glennhickey commented May 19, 2021

@ekg I'm still using defaults for everything (edit: except wb/wf, which I increased from 10/0.01 to 25/0.025). I haven't explored the parameter space much yet, despite meaning to for a while, especially in the context of alignments between more distant species.

Until now, I've been capping abpoa jobs at 10 kb (using an overlapping sliding window and stitching the results together). Bumping this up to 1 Mb with the latest abpoa seemed to work on smaller tests but not on a bigger job.

@glennhickey
Contributor Author

I'm getting failures even on datasets that ran before (without the seeding)

    ...
    == 05-19-2021 22:50:05 == [abpoa_anchor_poa] Performing POA between anchors ...
    == 05-19-2021 22:50:07 == [abpoa_anchor_poa] Performing POA between anchors done.
    == 05-19-2021 22:50:07 == [abpoa_build_guide_tree_partition] Seeding and chaining ...
    == 05-19-2021 22:50:07 == [abpoa_build_guide_tree_partition] Seeding and chaining done!
    == 05-19-2021 22:50:07 == [abpoa_anchor_poa] Performing POA between anchors ...
    == 05-19-2021 22:50:07 == [abpoa_anchor_poa] Performing POA between anchors done.
    Command terminated by signal 11

@yangao07
Owner

@glennhickey Can you share the dataset that causes the error/failure?

@glennhickey
Contributor Author

Sure. I'll need to hack cactus a bit to spit it out, but should be able to do that soon.

@glennhickey
Contributor Author

glennhickey commented May 26, 2021

I was just about to send another segfault I got without seeding:

wget http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_may26.fa
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -N -b 100 -f 0.025 -M 96 -X 90 -O 400,1200 -E 30,1

But I realized that when I built with a newer -march, it worked. More specifically, upgrading -march=nehalem to -march=haswell fixed it. (Cactus had previously built against nehalem to maximize portability for releases.) I think it's pretty likely the problem I mentioned above is related to this.

@yangao07
Owner

@glennhickey I did not get any error on my computer for this data.
However, I did notice a big difference when using the scoring parameters you mentioned.
They not only produce different MSA output but also use more memory and take longer to run.

abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -N -b 100 -f 0.025 -M 96 -X 90 -O 400,1200 -E 30,1
[abpoa_main] Real time: 111.849 sec; CPU: 111.208 sec; Peak RSS: 17.228 GB.
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -N -b 100 -f 0.025
[abpoa_main] Real time: 28.047 sec; CPU: 27.946 sec; Peak RSS: 3.135 GB.

For seeding mode:

abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -b 100 -f 0.025 -M 96 -X 90 -O 400,1200 -E 30,1
[abpoa_main] Real time: 94.547 sec; CPU: 70.398 sec; Peak RSS: 13.474 GB.
abpoa ./abpoa_fail_may26.fa -m 0 -o out.msa -r 1 -b 100 -f 0.025 
[abpoa_main] Real time: 35.114 sec; CPU: 26.085 sec; Peak RSS: 4.830 GB.

@glennhickey
Contributor Author

Yes, that data works for me now too. I just thought it was interesting because that command line was the first I found that did not work on architectures older than Haswell, so to reproduce the crash you'd have to build with -march=nehalem instead of -march=native (or use a computer that's more than 7 years old).

While the scoring parameters make a big difference in runtime, they also seem to help accuracy considerably when aligning different species together. The best we've found for this has been the default HOXD70 matrix from lastz:

       A     C     G     T
  A   91  -114   -31  -123
  C -114   100  -125   -31
  G  -31  -125   100  -114
  T -123   -31  -114    91

On a simulation test, this matrix (which I override in abpt->mat) brings accuracy up by around 7% versus the abpoa defaults. On less divergent sequences there is also an improvement, but it is much smaller.
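
For reference, here is roughly how the override looks on our side (a minimal sketch, not the exact Cactus code; it assumes abpt->m is 5 for the ACGTN alphabet, abpt->mat is an m*m int array, and that abpoa_post_set_para() fills mat from the scalar match/mismatch scores, so the override has to come after that call -- please check those assumptions against abpoa.h):

    /* Minimal sketch: override abPOA's substitution scores with lastz's
     * HOXD70 matrix. Assumptions to verify against abpoa.h: abpt->m == 5
     * (ACGTN alphabet), abpt->mat is an m*m int array, and it is filled by
     * abpoa_post_set_para(), so the override happens after that call. If
     * abPOA caches min/max scores at post-set time, those may need updating too. */
    #include <abpoa.h>

    static void set_hoxd70(abpoa_para_t *abpt) {
        /* HOXD70, row/column order A, C, G, T; N rows/columns left neutral */
        static const int hoxd70[4][4] = {
            {   91, -114,  -31, -123 },
            { -114,  100, -125,  -31 },
            {  -31, -125,  100, -114 },
            { -123,  -31, -114,   91 },
        };
        int i, j, m = abpt->m;  /* expected to be 5 */
        for (i = 0; i < m; ++i)
            for (j = 0; j < m; ++j)
                abpt->mat[i * m + j] = (i < 4 && j < 4) ? hoxd70[i][j] : 0;
    }

    /* Usage:
     *   abpoa_para_t *abpt = abpoa_init_para();
     *   // ... set gap penalties, banding, etc. ...
     *   abpoa_post_set_para(abpt);
     *   set_hoxd70(abpt);
     */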

@yangao07
Owner

Thanks for the information!
I am working on changing the scoring parameter options.

Yan

@glennhickey
Contributor Author

Hi @yangao07, I've been experimenting with the seeding option to run on longer sequences. It works pretty well, but I get segfaults from time to time. Here are some examples:

wget http://public.gi.ucsc.edu/~hickey/debug/abpoa_fail_mar17.tar.gz
tar xzf abpoa_fail_mar17.tar.gz
for f in abpoa_fail_mar17/*.fa; do abpoa $f -m 0 -r 1 -S ; done

If I understand correctly, the memory with seeding is much lower, but only if abpoa can find enough seeds. If the sequences are too diverged, the memory can still explode.

If this is correct, do you think there would be a way to change the API to fail more gracefully in these cases? For example, if there are not enough seeds and the memory would exceed a given threshold, return an error code. Or provide a function that checks the seeds in the input and estimates the memory requirement? Either would allow the user to use seeding when possible and fall back on another approach when it won't work.

Thanks as always for your wonderful tool!

@yangao07
Owner

but only if abpoa can find enough seeds. If the sequences are too diverged, the memory can still explode.

You are right: for divergent sequences, especially ones with very different lengths, the memory can still be very large.

The memory size is simply dependent on the graph size and the sequence length, so it can be estimated.
I can try to add a pre-calculation step for this.
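
As a very rough sketch of what the pre-calculation could look like (the real cell width and number of score matrices depend on the gap mode and the SIMD lane width, so these are placeholder parameters):

    /* Rough sketch of a memory pre-calculation: the DP table between the
     * current graph and the next sequence has about n_graph_nodes * seq_len
     * cells. bytes_per_cell and n_matrices are placeholders -- the real
     * values depend on the gap mode (linear/affine/convex) and the SIMD
     * cell width -- so treat the result as an order-of-magnitude estimate. */
    #include <stdint.h>

    static uint64_t estimate_dp_bytes(uint64_t n_graph_nodes, uint64_t seq_len,
                                      uint64_t bytes_per_cell, uint64_t n_matrices) {
        return n_graph_nodes * seq_len * bytes_per_cell * n_matrices;
    }

    /* e.g. a 1M-node graph vs. a 1 Mb sequence at 2 bytes/cell and 3 matrices
     * is already ~6e12 bytes, so a caller would want to bail out (or fall back
     * to another strategy) long before attempting the allocation. */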

Yan

@glennhickey
Contributor Author

Thanks, that would be amazing. Or even some kind of interface where the user passes in a MAX_SIZE parameter and abpoa exits 1 instead of trying to allocate >MAX_SIZE would be very helpful.

@glennhickey
Contributor Author

exits 1 instead of trying to allocate >MAX_SIZE would be very helpful.

oops, exit isn't much better than running out of memory -- would have to be a return code or exception.

@yangao07
Owner

yangao07 commented Jun 10, 2022

Hey @glennhickey , I am working on adding some interfaces related to memory usage by abPOA.
Here is what I have done for now:
I added two variables to abpoa_t: status and req_mem.
For status, 0 means success, 1 means not enough memory, and 2 means other errors.
req_mem records the size of the allocation that abPOA attempted but could not make.

This way, by checking the status variable, users can choose to re-run abpoa with adjusted parameters.
What do you think?
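
The intended caller-side pattern would be something like this (a sketch only: the fields are the proposed ones described above and are not in a release yet, and the abpoa_msa() call is elided because its exact argument list differs between abPOA versions):

    /* Sketch of checking the proposed status/req_mem fields after a run.
     * These fields follow the description above; the abpoa_msa() arguments
     * are omitted because they vary across abPOA versions. */
    #include <stdio.h>
    #include <abpoa.h>

    int run_with_status_check(abpoa_t *ab, abpoa_para_t *abpt) {
        /* ... call abpoa_msa(ab, abpt, ...) here as usual ... */

        if (ab->status == 0) return 0;   /* success */
        if (ab->status == 1) {           /* an allocation failed */
            fprintf(stderr, "abPOA needed %llu bytes; re-run with adjusted parameters\n",
                    (unsigned long long)ab->req_mem);
            return 1;
        }
        return 2;                        /* other error */
    }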

@glennhickey
Contributor Author

Wow, I'm happy to hear that you're thinking about this!

If calling abpoa fails a malloc and, instead of crashing, it sets status to 1 and returns normally, that would be a big help indeed, and I'd definitely like to try it out.

I would still be a bit worried, though, because I run abpoa in a bunch of different threads on cloud instances. I could imagine a case where a big malloc succeeds and abpoa takes 100% of the resources on the system; then all concurrent threads would crash, which would effectively bring down the job I was running anyway, even if it's not directly abpoa's fault.

Do you think there could be any way for me to give abpoa a limit, and ask it to set status 1 if it ever tries to allocate more than that amount of memory at one time?

Thanks again for all your help.

@yangao07
Owner

I have thought about using a size limit.
The concern is that the size abPOA allocates is the virtual size, not the resident size, and the virtual size can be much larger than the physical memory of the computer.
I am not sure how to properly set the size limit; do you have any ideas?

@glennhickey
Contributor Author

OK, I think I understand better now, thanks. I was (probably naively) hoping it would be simple for you to detect how far outside the band it had gotten and abort before overrunning the memory. I'll check with some of my colleagues who know more about virtual address spaces than I do.
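
One thing we may experiment with on our side in the meantime (just a sketch, and not abPOA-specific): capping the process's virtual address space with setrlimit(RLIMIT_AS, ...), so that an oversized allocation fails immediately (malloc returns NULL) and the status check above can catch it, rather than the machine overcommitting and the OOM killer taking down every worker.

    /* Not abPOA-specific: cap the calling process's virtual address space so
     * huge allocations fail fast instead of overcommitting. Combined with the
     * proposed status/req_mem interface, one oversized job could then fail
     * gracefully rather than bringing down the whole instance. */
    #include <sys/resource.h>

    static int cap_virtual_memory(rlim_t max_bytes) {
        struct rlimit rl;
        if (getrlimit(RLIMIT_AS, &rl) != 0) return -1;
        if (max_bytes < rl.rlim_max) rl.rlim_max = max_bytes;  /* hard limit can only be lowered */
        rl.rlim_cur = (max_bytes < rl.rlim_max) ? max_bytes : rl.rlim_max;  /* soft <= hard */
        return setrlimit(RLIMIT_AS, &rl);
    }

    /* e.g. cap_virtual_memory((rlim_t)64 << 30);  // 64 GiB */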
