Is there any way to reduce memory consumption? #14
Hi Glenn, you are right. Yan
I would also like to express my interest in resolving this issue 👍 It would be really nice to be able to take full advantage of the banding.
Hi Glenn, In the latest version of abPOA (v1.2.0), I implemented minimizer-based seeding before POA, which can reduce the memory usage for long input sequences. Please try it out and let me know if this works for you. Yan
Great! This is perfect timing since I was just about to review some of what I'd been doing for stitching alignments together. I'll try it out on Monday.
@glennhickey, just updated abPOA to v1.2.1, which removes a redundant, very time-consuming sorting step.
Thanks for letting me know. I'm switching to 1.2.1 now. My 1.2.0 tests have been okay so far: it passes all Cactus unit tests, and let me disable our current stitching logic on a bigger run. I'll do a much bigger test this week and report the results.
Do you have a sense of the maximum sequence lengths I can pass in while using the seeding? I just got an error when I allowed up to 1Mb. Thanks.
@glennhickey what alignment parameters are you using? |
@ekg I'm still using default everything (edit -- except wb/wf, which I increased from 10/0.01 to 25/0.025). I haven't yet explored the parameter space much despite meaning to for a while, especially in the context of alignments between more distant species. Until now, I've been capping abpoa jobs at 10kb (and using an overlapping sliding window and stitching the results together). Bumping this up to 1Mb with the latest abpoa seemed to work on smaller tests but not on a bigger job.
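The capped, overlapping sliding-window scheme mentioned above can be sketched as follows. This is a hypothetical illustration, not Cactus's actual code: long sequences are cut into windows of at most `cap` bases that overlap by `overlap` bases, each window is aligned separately (e.g. with abPOA), and the per-window results are stitched back together (the stitching logic itself is tool-specific and omitted here).

```python
# Hypothetical sketch of the 10kb-cap, overlapping-window approach
# described above. Only the window boundaries are computed; each
# (start, end) interval would be aligned independently and then
# stitched to its neighbor across the `overlap` region.

def windows(seq_len, cap=10_000, overlap=1_000):
    """Yield half-open (start, end) intervals covering [0, seq_len),
    each at most `cap` long, consecutive ones sharing `overlap` bases."""
    if seq_len <= cap:
        yield (0, seq_len)
        return
    step = cap - overlap
    start = 0
    while start + cap < seq_len:
        yield (start, start + cap)
        start += step
    yield (start, seq_len)  # final (possibly shorter) window

print(list(windows(25_000)))
# → [(0, 10000), (9000, 19000), (18000, 25000)]
```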
I'm getting failures even on datasets that ran before (without the seeding).
@glennhickey Can you share the dataset that causes the error/failure?
Sure. I'll need to hack cactus a bit to spit it out, but should be able to do that soon.
I was just about to send another segfault I got without seeding, but realized when I built with a newer …
@glennhickey I did not get any error on my computer for this data.
For seeding mode: …
Yes, that data works for me now too. I just thought it was interesting, as that command line was the first I found that did not work on architectures older than Haswell; to reproduce the crash you'd have to build with …

While the scoring parameters make a big difference in runtime, they also seem to help accuracy considerably when aligning different species together. The best we've found for this has been the default HOXD70 matrix from lastz. On a simulation test, this matrix (which I override in abpt->mat) brings accuracy up by around 7% vs the abpoa defaults. On less divergent sequences there is also an improvement, but it is much smaller.
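For reference, here is a sketch of the HOXD70 scoring matrix mentioned above, with the values as published by Chiaromonte, Yap & Miller (2002) and shipped as lastz's default. The flattened row-major layout is an assumption about how one might fill a C scoring array such as `abpt->mat`; the exact layout abPOA expects should be checked against its headers.

```python
# HOXD70 substitution scores (Chiaromonte, Yap & Miller 2002), lastz's
# default nucleotide matrix. Row/column order A, C, G, T. The row-major
# flattening below mirrors a typical C int[16] scoring array; this is an
# illustrative sketch, not the abPOA API itself.

HOXD70 = [
    #   A     C     G     T
    [  91, -114,  -31, -123],  # A
    [-114,  100, -125,  -31],  # C
    [ -31, -125,  100, -114],  # G
    [-123,  -31, -114,   91],  # T
]

flat = [s for row in HOXD70 for s in row]  # row-major, 16 entries
print(flat[:4])
# → [91, -114, -31, -123]
```

Note that the matrix is symmetric, as substitution matrices generally are, so the row-major and column-major flattenings are identical here.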
Thanks for the information! Yan
Hi @yangao07 , I've been experimenting with the seeding option to run on longer sequence sizes. It works pretty well, but I get segfaults from time to time. Here are some examples: …
If I understand correctly, the memory with seeding is much lower... but only if abpoa can find enough seeds. If the sequences are too diverged, the memory can still explode. If this is correct, do you think there would be a way to change the API to fail more gracefully in these cases? For example, if there are not enough seeds and the memory would exceed a given threshold, return an error code. Or a function that checks the seeds in the input and estimates the memory requirement? Either of these would allow the user to use seeding when possible and fall back on another approach when it won't work. Thanks as always for your wonderful tool!
You are right, for divergent sequences, especially ones with greatly different lengths, the memory can still be very large. The memory size is simply dependent on the graph size and the sequence length, so it can be estimated. Yan
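The estimate described above (memory proportional to graph size times sequence length) can be sketched like this. The bytes-per-cell figure is a hypothetical placeholder: abPOA's real per-cell cost depends on its SIMD layout and banding, so this is only the kind of pre-check being discussed, not abPOA's actual accounting.

```python
# Rough DP-memory estimate per the statement above: the table scales
# with (number of graph nodes) x (sequence length). bytes_per_cell is
# an assumed placeholder, not abPOA's real per-cell footprint.

def estimate_dp_bytes(n_graph_nodes, seq_len, bytes_per_cell=8):
    return n_graph_nodes * seq_len * bytes_per_cell

def fits_in_budget(n_graph_nodes, seq_len, budget_bytes):
    """The graceful-failure idea discussed above: refuse work that
    would exceed a caller-supplied budget instead of allocating."""
    return estimate_dp_bytes(n_graph_nodes, seq_len) <= budget_bytes

# e.g. a 1 Mb sequence against a 1.2M-node graph vs a 16 GiB budget:
print(fits_in_budget(1_200_000, 1_000_000, 16 * 2**30))
# → False  (the full table would be on the order of terabytes)
```

A caller could run this check before invoking the aligner and fall back to the windowed approach when it returns `False`.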
Thanks, that would be amazing. Or even some kind of interface where the user passes in a MAX_SIZE parameter and abpoa exits 1 instead of trying to allocate >MAX_SIZE would be very helpful.
oops, exit isn't much better than running out of memory -- would have to be a return code or exception.
Hey @glennhickey , I am working on adding some interfaces related to memory usage in abPOA. This way, by checking the …
Wow, I'm happy to hear that you're thinking about this! If I call abpoa and it fails a malloc, and instead of crashing it sets status to 1 and returns normally, that would be a big help indeed, and I'd definitely like to try it out. I would still be a bit worried, though, because I run abpoa in a bunch of different threads on cloud instances. I could imagine a case where a big malloc succeeds and abpoa takes 100% of the resources on a system; then all concurrent threads would crash, and that would effectively bring down the job I was running anyway, even if it's not directly abpoa's fault. Do you think there could be any way for me to give abpoa a limit, and ask it to set status 1 if it ever tries to allocate more than that amount of memory at one time? Thanks again for all your help.
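The budgeted-allocation behavior being requested above can be sketched as a small tracker that returns an error status instead of allocating past a caller-supplied limit. This is purely illustrative and is not abPOA's API; the class name and status codes are invented for the sketch.

```python
# Sketch of the "set status 1 instead of allocating past a limit" idea
# discussed above. Illustrative only -- not abPOA's interface.

class AllocBudget:
    OK, OVER_BUDGET = 0, 1

    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def request(self, nbytes):
        """Record the allocation and return OK, or return OVER_BUDGET
        (without recording anything) if it would exceed the limit."""
        if self.used + nbytes > self.limit:
            return self.OVER_BUDGET
        self.used += nbytes
        return self.OK

budget = AllocBudget(limit_bytes=1 * 2**30)  # 1 GiB cap
print(budget.request(512 * 2**20))           # fits within budget → 0
print(budget.request(768 * 2**20))           # would exceed 1 GiB → 1
```

Because the check happens before the allocation, a caller sharing a machine with other threads never sees the process balloon past the cap; it just gets a status it can handle.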
I have thought about using the size limit.
ok, I think I understand better now, thanks. I was (probably naively) hoping it was simple for you to detect how far from the band it'd gotten and abort before overrunning the memory. I'll try to see with some of my colleagues who know more about virtual address spaces than I do.
Hello, I'm experimenting with adding abPOA as an option within cactus (manuscript). Thanks for making a great tool -- it's amazingly fast.
I was wondering, however, if there's a way to reduce memory consumption in order to increase the sequence lengths I can run on. Right now it seems roughly quadratic in the sequence length, which is as expected from your manuscript. I'm curious to know whether there are any options I can use to reduce this, and/or whether you've thought about using the banding to reduce the DP table size (as far as I can tell, banding is only used to reduce computation)?
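The banding idea raised above can be illustrated with a quick cell count: a full pairwise DP table holds on the order of n·m cells (quadratic for comparable lengths), while storing only a band of width w around the diagonal needs roughly n·(2w+1) cells. The numbers below count cells only and say nothing about abPOA's actual memory layout.

```python
# Cell-count illustration of the question above: full DP table vs a
# band of w cells either side of the diagonal. Counts only -- not a
# model of abPOA's real (SIMD, adaptively banded) storage.

def full_cells(n, m):
    return n * m

def banded_cells(n, w):
    return n * (2 * w + 1)

n = 100_000
ratio = full_cells(n, n) // banded_cells(n, 500)
print(ratio)  # → 99  (the full table holds ~100x more cells)
```

This is why banding the *storage*, not just the computation, would let much longer sequences fit in the same memory footprint.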