Coordinator OOM caused by new jackson version regression #16282
Comments
We have downgraded to Jackson 2.13.5, and our cluster has been running for more than 24 hours without OOM. I will update if that changes.

We upgraded to Trino 409 and went back to Jackson 2.14.1. Now we can't reproduce the issue; our test cluster has been working for about a week without memory issues. We are going to test it on a bigger cluster under heavier load and I will post an update with the results.

We can't reproduce the issue on Trino 409 on a big cluster under load. I'm closing the ticket and will reopen it if we face the problem again and I have more information.

Hit the same issue twice. Upgrading Jackson from 2.14 to 2.15 fixes it (or upgrade to Trino 419, which ships Jackson 2.15).
Our load changed and we started to experience this issue again on Trino. Some technical details, just to give an idea of the level of collisions: the cache map contains ~120K items, and the map's hash table has 260K buckets.
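For scale, a quick back-of-the-envelope calculation from those numbers (a sketch only; it assumes uniform hashing over a plain chained table, which may not match `PrivateMaxEntriesMap`'s internals):

```java
public class CacheSizeEstimate
{
    public static void main(String[] args)
    {
        // Rough arithmetic on the figures above; purely illustrative.
        double items = 120_000;
        double buckets = 260_000;
        double loadFactor = items / buckets;          // ~0.46 entries per bucket on average
        double emptyShare = Math.exp(-loadFactor);    // under uniform hashing, ~63% of buckets stay empty
        System.out.printf("load factor %.2f, ~%.0f%% empty buckets%n", loadFactor, emptyShare * 100);
    }
}
```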
Thanks for following up and figuring out the cause and fix, @davseitsev. Do you plan to send a pull request with the updated version once you have verified it internally? EDIT: Never mind, I see I already updated to 2.15.x in the past. Thanks for letting us know about the source of the issue anyway.
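When verifying a downgrade or upgrade like the ones described above, it can help to confirm which jackson-databind version actually ends up on the classpath. A minimal standalone check (not part of the original thread) is to print the mapper's reported version:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

// Prints the jackson-databind version loaded at runtime, e.g. "2.13.5" or "2.15.0".
public class JacksonVersionCheck
{
    public static void main(String[] args)
    {
        System.out.println(new ObjectMapper().version());
    }
}
```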
We are upgrading Trino from 367 to 407. Under load, the coordinator fails with OutOfMemoryError.
Heap dump analysis shows that 96% of the heap is taken by a single instance of `io.airlift.concurrent.BoundedExecutor` whose thread factory produces `remote-task-callback-%s` threads. As it's difficult to get the `LinkedQueue` size from the heap, I estimated the number of instances of the classes referenced by the queue:
- `io.airlift.http.client.jetty.JettyResponseFuture`: 3.45M
- `io.trino.server.remotetask.SimpleHttpResponseHandler`: 3.42M

Probably the size of the queue is about 3.5M items.
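To make that mechanism concrete, here is a minimal, self-contained sketch (not Trino code; all names are made up for illustration) of how a bounded pool of callback workers plus an unbounded queue can accumulate millions of pending tasks once every task serializes on one shared lock:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.locks.ReentrantLock;

// Minimal model of the failure mode: a fixed set of workers drains an unbounded queue,
// but each task holds a shared lock, so producers outpace the workers and the queue grows.
public class UnboundedQueueBacklog
{
    private static final LinkedBlockingQueue<Runnable> QUEUE = new LinkedBlockingQueue<>();
    private static final ReentrantLock CONTENDED_LOCK = new ReentrantLock();

    public static void main(String[] args) throws InterruptedException
    {
        // Bounded pool of workers, all waiting on the same lock
        // (standing in for the contended Jackson type cache).
        for (int i = 0; i < 8; i++) {
            startDaemon(() -> {
                while (true) {
                    try {
                        Runnable task = QUEUE.take();
                        CONTENDED_LOCK.lock();
                        try {
                            task.run();
                            Thread.sleep(100);   // each task holds the lock far too long
                        }
                        finally {
                            CONTENDED_LOCK.unlock();
                        }
                    }
                    catch (InterruptedException e) {
                        return;
                    }
                }
            });
        }

        // Producers (the remote-task callbacks in the real system) keep enqueuing work.
        for (int i = 0; i < 4; i++) {
            startDaemon(() -> {
                while (true) {
                    QUEUE.add(() -> { /* pretend response handling */ });
                }
            });
        }

        while (true) {
            Thread.sleep(1_000);
            System.out.println("queued tasks: " + QUEUE.size());   // climbs steadily toward OOM
        }
    }

    private static void startDaemon(Runnable body)
    {
        Thread thread = new Thread(body);
        thread.setDaemon(true);
        thread.start();
    }
}
```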
I've checked the thread dump and noticed that 998 `remote-task-callback-%s` threads are blocked inside Jackson's `PrivateMaxEntriesMap`. A search in the FasterXML repo showed that the `TypeFactory` cache implementation was changed in version 2.14: FasterXML/jackson-databind#3531.
There is also an open issue which is very similar to what I see: FasterXML/jackson-module-scala#428.
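For anyone trying to reproduce this outside Trino, the call pattern that exercises that cache looks roughly like the following. This is a hypothetical stress sketch, not code from the report: the class name, thread count, and payload are invented, and it only approximates the real workload, but every `readValue` with a parameterized target type goes through `TypeFactory`'s type cache.

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.List;
import java.util.Map;

// Hypothetical stress test: many threads deserializing parameterized types through one
// shared ObjectMapper, hammering the type cache that the blocked threads were waiting on.
public class TypeCacheContention
{
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private static final String JSON = "{\"a\":[1,2,3]}";

    public static void main(String[] args)
    {
        for (int i = 0; i < 64; i++) {
            new Thread(() -> {
                while (true) {
                    try {
                        // Parameterized target type -> a TypeFactory cache lookup on every call
                        Map<String, List<Integer>> ignored = MAPPER.readValue(
                                JSON, new TypeReference<Map<String, List<Integer>>>() {});
                    }
                    catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            }).start();
        }
    }
}
```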
I'm not sure it's exactly the same case, because in the heap dump I see that the `LRUMap` capacity is 200 while the actual size is 169, but I could be wrong. Maybe the issue is the low configured concurrency level in https://github.com/FasterXML/jackson-databind/blob/jackson-databind-2.14.1/src/main/java/com/fasterxml/jackson/databind/util/LRUMap.java#L39
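If upgrading Jackson were not an option (an assumption on my part; the thread above resolved this by moving to Jackson 2.15), one conceivable workaround is to hand the `ObjectMapper` a `TypeFactory` whose type cache is larger than the default ~200-entry `LRUMap`, reducing eviction churn on the shared cache. The sizes below are illustrative only, not tuned values:

```java
import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.type.TypeFactory;
import com.fasterxml.jackson.databind.util.LRUMap;
import com.fasterxml.jackson.databind.util.LookupCache;

// Sketch of a possible workaround (not what this thread did): replace the default type
// cache with a larger one so the ~200-entry LRUMap is not the shared bottleneck.
public class LargerTypeCache
{
    public static ObjectMapper newMapperWithLargerTypeCache()
    {
        LookupCache<Object, JavaType> cache = new LRUMap<>(256, 4_000);
        TypeFactory typeFactory = TypeFactory.defaultInstance().withCache(cache);
        return new ObjectMapper().setTypeFactory(typeFactory);
    }
}
```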