-
Notifications
You must be signed in to change notification settings - Fork 464
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Segmentation faults - Inconsistent errors with lots of async workers and objects #1174
Comments
Hi @kelvinhammond , We discussed this in the 13 May Node API meeting. Is it possible to provide a standalone repository that reproduces this issue? This will help with our investigations. Thanks, Kevin |
@KevinEady The branch here openzim/node-libzim#72 demonstrates the problem, you can npm install and run For gdb I run I can then access the backtrace. |
@KevinEady Let me know if there is anymore info I can provide for you. |
Hi @kelvinhammond , I will take a look at your example repository and get back to you as soon as possible. Thanks for the quick response! Kevin |
Hi @kelvinhammond , Can I ask which Node version you are testing with? Thanks, Kevin |
@KevinEady v16.15.0 |
So I tried your makeLargeZim test with a debug version of node 16.15.0 and I can receive one of three different segfaults:
or
or
Unfortunately I am not too familiar with any of these V8 internals. @legendecas or @gabrielschulhof are you able to provide any insight? Or some steps I can take to further troubleshoot? Thanks, Kevin |
Is there a more descriptive error available? I'm wondering if I can somehow work around this and its blocking a major release. @KevinEady Any ideas? From what I see, this happens sometimes when creating a threadsafe function and other times when doing some random operation :/ |
Where can I find the debug version of node? |
I assume Kevin built it directly from |
I tried working around this by calling
Digging through some of the nodejs docs and code. It is my understanding that |
We discussed in the 10 June Node API meeting to try to make a standalone reproduction of this problem that would be a little easier to debug / troubleshoot. If we mimic the several-thousand creation of I will continue working on this standalone reproduction in order to try to get a similar crash or rule out core issues. |
For some reason my persistent api seems to not be storing my function correct. I check and store the function here, it is called in the
GDB
Maybe this is what causes some of the issues downstream? |
@KevinEady Should the above be a new issue? I still can't figure out what I might be doing wrong. |
Hi @kelvinhammond , I have not been able to replicate the above error in my own libzim environment. Looking at the code, everythings looks correct: Runtime type-checking if (!provider.Get("feed").IsFunction()) {
throw std::runtime_error("ContentProvider.feed must be a function.");
}
provider_ = Napi::Persistent(provider);
feed_ = Napi::Persistent(provider.Get("feed").As<Napi::Function>()); Function call: auto blobObj = feed_.Call(provider_.Value(), {}); This looks correct to me. I am a little confused: You have some checks in I'm looking at the stack traces I posted, and there seems to be a new TSFN creation called after |
I have changed some things locally trying to find a work around to this issue, experimenting with things and changing the class MyContentProvider {
get contentProvider() { ... }
} to a function For the ThreadSafeFunction, there is a path where a new one will be created during the callback. |
I have been pulling my hair out trying to understand what's going on. In a previous meeting, there was a suggestion that maybe there is some buffer overflow in the stack / heap, but I'm really not sure how to verify this. I just wanted to see what would happen in a later version of node, and using 18.4.0 I get different stack trace errors, eg:
(lldb) fr s 0
frame #0: 0x00000001123257b3 zim_binding.node`StringProvider::feed(this=0x0000600004eb36c1, info=0x00007ff7bfef6258) at contentProvider.h:151:30
148 Napi::Value feed(const Napi::CallbackInfo &info) {
149 try {
150 // TODO(kelvinhammond): need a way to move this to avoid copying
-> 151 auto blob = provider_->feed();
152 return Blob::New(info.Env(), blob);
153 } catch (const std::exception &err) {
154 throw Napi::Error::New(info.Env(), err.what()); Unfortunately I had to wipe my machine and lost my debug build of node 16.15.0 ... I am going to build debug version of 18.14.0 and see if there are any significant changes |
In the 1 July 2022 Node API meeting, we discussed trying to use the MemorySanitizer and AddressSanitizer feature in Clang to find a possible memory issue. The documentation is here: Also,
I will try using these above methods to help troubleshoot issues. |
@KevinEady Thank you. |
@KevinEady Have you had a chance to try them out? If so, what were your results? |
This may be fixed on libzim side with openzim/libzim#684 |
Hi @kelvinhammond , Apologies for the delay. I was on a traveling vacation and should be getting back into the swing of things this weekend. I see that openzim/node-libzim#72 was merged into |
This comment was marked as spam.
This comment was marked as spam.
Sorry, forgot to answer this question. We merged but we are exactly at the same stage and have open a dedicated ticket at openzim/node-libzim#80 |
Any update here? This problem is a blocker for future versions of Wikipedia scraper. Would a bounty helps to identify the bug? |
@KevinEady Sorry I missed your comment. No, I was not able to figure out what was causing this issue. I tried many different methods to work around it too but non of them worked because I need to use |
Just tested it against node v18. I'm still getting the below errors.
In the main branch for node-libzim (or libzim8-build-fix1 if its not been merged yet). I edit Errors in gdb when running it like so: |
One of the inconsistent errors I keep seeing is that the ThreadSafeNativeFunction's callback function does not appear to store the function for some reason. Not sure if this is a node issue or node-addon-api issue.
I get the following error if that function is called.
And I can't reliably check the function is valid without getting it's value first and then checking if the function is undefined for some reason. Also the function should not be undefined. |
Hi @kelvinhammond , Is this still an issue? I noticed there has been some work after merging the initial branch. I forked the repository and enabled the |
@KevinEady Yes, the code that triggers it is commented out in the |
So the latest update I have... Running with The valgrind warnings are of this fashion:
Not sure if these memory warnings could cause the behavior witnessed with the segfaulting |
@KevinEady Thank you. Any hint for you in this Valgrind report? |
Hi @kelvinhammond , @kelson42 , I'm not too sure with the internal workings of the addon or libzim, but looking at the stacktrace it seems like a use-after-free error. The addon's I'm running valgrind again with a higher
So it seems like there is some invalid thread safety regarding the Does this provide any information / clues? EDIT: Here's a link to the latest action run: https://github.com/KevinEady/node-libzim/actions/runs/4128364660/jobs/7132716816 |
@KevinEady Thanks for taking a look at this. I'll take a look into this too and see what I can change and if it helps. |
@kelvinhammond I can help you on that if you want. |
I can't actually connect to this. Can someone create me an account? |
@KevinEady Thank you very much for your last comment. This is pretty much the guess of the libzim dev @mgautierfr as well. @kelvinhammond We have made an improvement with openzim/node-libzim#93. How is the status now of this ticket? Does it has evolved? |
No, this bug still exists. It usually happens when I create the thread safe native function. |
@KevinEady mentioned that there have been further changes merged since the last comment. Can this now be closed? |
AFAIK after talking to @kelvinhammond, the bug/problem is still there. |
Since the PR mentioned above was merged, I performed the following steps on f61346fa4 of the main branch: nvm use 16 # (v16.20.0)
npm install
npm run test-mem-leak
npm i --debug
npm run bundle # This put libzim.so.8 into Release rather than Debug, so
cp build/Release/libzim.so.8 build/Debug
node -r ts-node/register test/makeLargeZim.ts
nvm use 18 # (v18.15.0)
node -r ts-node/register test/makeLargeZim.ts There was no segfault. |
I'll take a look when I'm available to do so. |
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made. |
Hello,
I'm working on a pull request for a project using node-addon-api.
When I create a lot of objects via AsyncWorker I get Segmentation faults.
I traced some of them down to create a fresh
Napi::Env
viainfo.Env()
vs the persistent object'sEnv()
call. Is there a difference here or a reason why I can't do that?I'm now running into other issues. These problems only occur after over 10,000 objects are created and destroyed.
https://github.com/openzim/node-libzim/pull/72/files#diff-f28f98b9b15ae6a552a5b00f2b80c65a701a42b1e75f81aff149ac16e480b238R15
This occurs when I call
makeLargeZim.ts
which callscreator.addItem
.addItem
creates anAsyncWorker
which stores a promise which is resolved once the libzim call completes. This is done because libzim will wait until other threads within its library are finished in the queue which blocks the nodejs / libuv main event thread.This works fine most of the time with 10, 100, 1000, 10000, but when I try to do ~1,000,000 I get an error. Its not a memory issue, plenty of ram. Any guidance would be helpful. Thank you.
The backtrace from gdb is here
The text was updated successfully, but these errors were encountered: