Coordinator OOM caused by new jackson version regression #16282

Closed
davseitsev opened this issue Feb 27, 2023 · 8 comments

@davseitsev

davseitsev commented Feb 27, 2023

We are upgrading Trino from 367 to 407. Under load, the coordinator fails with OutOfMemoryError.
Heap dump analysis shows that 96% of the heap is retained by a single instance of io.airlift.concurrent.BoundedExecutor whose thread factory produces remote-task-callback-%s threads.
Since it is difficult to read the LinkedQueue size directly from the heap dump, I estimated it from the instance counts of the classes referenced by the queue:
io.airlift.http.client.jetty.JettyResponseFuture: 3.45M
io.trino.server.remotetask.SimpleHttpResponseHandler: 3.42M
So the queue probably holds about 3.5M items.

I checked the thread dump and noticed that 998 remote-task-callback-%s threads are blocked inside Jackson's PrivateMaxEntriesMap:

stackTrace:
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.concurrent.ConcurrentHashMap.putVal([email protected]/ConcurrentHashMap.java:1031)
- waiting to lock <0x00007f0306e29600> (a java.util.concurrent.ConcurrentHashMap$TreeBin)
at java.util.concurrent.ConcurrentHashMap.putIfAbsent([email protected]/ConcurrentHashMap.java:1541)
at com.fasterxml.jackson.databind.util.internal.PrivateMaxEntriesMap.put(PrivateMaxEntriesMap.java:655)
at com.fasterxml.jackson.databind.util.internal.PrivateMaxEntriesMap.putIfAbsent(PrivateMaxEntriesMap.java:633)
at com.fasterxml.jackson.databind.util.LRUMap.putIfAbsent(LRUMap.java:53)
at com.fasterxml.jackson.databind.type.TypeFactory._fromClass(TypeFactory.java:1523)
at com.fasterxml.jackson.databind.type.TypeFactory._fromParamType(TypeFactory.java:1632)
at com.fasterxml.jackson.databind.type.TypeFactory._fromAny(TypeFactory.java:1395)
at com.fasterxml.jackson.databind.type.TypeFactory._resolveSuperInterfaces(TypeFactory.java:1547)
at com.fasterxml.jackson.databind.type.TypeFactory._fromClass(TypeFactory.java:1494)
at com.fasterxml.jackson.databind.type.TypeFactory._fromParamType(TypeFactory.java:1632)
at com.fasterxml.jackson.databind.type.TypeFactory._fromAny(TypeFactory.java:1395)
at com.fasterxml.jackson.databind.type.TypeFactory._resolveSuperClass(TypeFactory.java:1534)
at com.fasterxml.jackson.databind.type.TypeFactory._fromClass(TypeFactory.java:1493)
at com.fasterxml.jackson.databind.type.TypeFactory._bindingsForSubtype(TypeFactory.java:531)
at com.fasterxml.jackson.databind.type.TypeFactory.constructSpecializedType(TypeFactory.java:510)
at com.fasterxml.jackson.databind.SerializerProvider.constructSpecializedType(SerializerProvider.java:347)
at com.fasterxml.jackson.databind.ser.std.ReferenceTypeSerializer._findCachedSerializer(ReferenceTypeSerializer.java:461)
at com.fasterxml.jackson.databind.ser.std.ReferenceTypeSerializer.serialize(ReferenceTypeSerializer.java:381)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:733)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:733)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeWithType(BeanSerializerBase.java:657)
at io.trino.metadata.AbstractTypedJacksonModule$InternalTypeSerializer.serialize(AbstractTypedJacksonModule.java:116)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:733)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:733)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeWithType(BeanSerializerBase.java:657)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:735)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeWithType(BeanSerializerBase.java:657)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:735)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeWithType(BeanSerializerBase.java:657)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:735)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
at com.fasterxml.jackson.databind.ser.std.ReferenceTypeSerializer.serialize(ReferenceTypeSerializer.java:386)
at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:733)
at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:774)
at com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
at com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:400)
at com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1568)
at com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1273)
at com.fasterxml.jackson.databind.ObjectWriter.writeValueAsBytes(ObjectWriter.java:1163)
at io.airlift.json.JsonCodec.toJsonBytes(JsonCodec.java:211)
at io.trino.server.remotetask.HttpRemoteTask.sendUpdate(HttpRemoteTask.java:703)
at io.trino.server.remotetask.HttpRemoteTask$$Lambda$11871/0x00000008039c9d08.run(Unknown Source)
at io.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:80)
at io.airlift.concurrent.BoundedExecutor$$Lambda$11850/0x00000008039c76a8.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
at java.lang.Thread.run([email protected]/Thread.java:833)
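
For readers who want to repeat the "how many callback threads are blocked" check without taking a full thread dump, here is a minimal sketch that runs inside the JVM being inspected. It uses only standard `java.lang.Thread` APIs. The class name `BlockedThreadCount` and the helper `blockedUnderContention` are hypothetical and exist only for this illustration; they are not part of Trino or airlift. The helper reproduces the situation in miniature by starting a few threads with the same name prefix that all wait on one monitor.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

final class BlockedThreadCount {
    // Counts live threads whose name starts with the prefix and that are
    // currently waiting to enter a monitor (Thread.State.BLOCKED).
    static long countBlocked(String namePrefix) {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.getName().startsWith(namePrefix))
                .filter(t -> t.getState() == Thread.State.BLOCKED)
                .count();
    }

    // Hypothetical demo: starts `threads` workers that all contend on one lock
    // held by the caller, then reports how many of them are seen as BLOCKED.
    static long blockedUnderContention(int threads) {
        String prefix = "remote-task-callback-";
        Object lock = new Object();
        Thread[] started = new Thread[threads];
        long observed;
        synchronized (lock) {
            for (int i = 0; i < threads; i++) {
                started[i] = new Thread(() -> {
                    synchronized (lock) {
                        // stand-in for the contended cache update
                    }
                }, prefix + i);
                started[i].start();
            }
            // Poll (for at most 10 seconds) until every worker has reached the monitor.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
            while (countBlocked(prefix) < threads && System.nanoTime() < deadline) {
                LockSupport.parkNanos(TimeUnit.MILLISECONDS.toNanos(10));
            }
            observed = countBlocked(prefix);
        }
        // The lock is released here, so the workers can finish.
        for (Thread t : started) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("blocked callback threads: " + blockedUnderContention(3));
    }
}
```

Against a live coordinator only `countBlocked` would be of interest; the full stack traces still have to come from a thread dump such as the one above.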

A search in the FasterXML repo showed that the TypeFactory cache implementation was changed in version 2.14: FasterXML/jackson-databind#3531.
There is also an open issue that looks very similar to what I see: FasterXML/jackson-module-scala#428.
I'm not sure it is exactly the same case, because in the heap dump the LRUMap capacity is 200 while its actual size is 169, but I could be wrong. Maybe the issue is the low configured concurrency level in https://github.com/FasterXML/jackson-databind/blob/jackson-databind-2.14.1/src/main/java/com/fasterxml/jackson/databind/util/LRUMap.java#L39

@davseitsev
Author

We have downgraded to Jackson 2.13.5, and our cluster has been running for more than 24 hours without an OOM. I will post an update if that changes.

@raunaqmorarka
Member

cc: @dain @electrum

@davseitsev
Author

davseitsev commented Mar 14, 2023

We upgraded to Trino 409 and went back to Jackson 2.14.1. Now we can't reproduce the issue; our test cluster has been running for about a week without memory problems. We are going to test it on a bigger cluster under higher load. I will post an update with the results.

@sopel39
Member

sopel39 commented Mar 17, 2023

Cc @electrum @dain. This is probably a release blocker.

@davseitsev
Author

We can't reproduce the issue on Trino 409 on a big cluster under load. I'm closing the ticket and will reopen it if we hit the problem again and I have more information.

@jinyangli34
Contributor

Hit the same issue twice. Upgrading Jackson from 2.14 to 2.15 fixes it (or upgrade to Trino 419, which ships Jackson 2.15).

@davseitsev
Author

Our load changed and we started to experience this issue again on Trino 409 with Jackson 2.14.1. We upgraded Jackson to 2.15.3 and hope it will help.

Some technical details:
Under heavy load the remote-task-callback executor slows down and SimpleHttpResponseHandler instances start piling up in its queue. We get an OOM when the queue grows to 5M+ items.
The slowness is caused by heavy cache collisions inside com.fasterxml.jackson.databind.util.LRUMap. Here is the ticket: FasterXML/jackson-databind#3876. It should be fixed in 2.15.0.
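
The backlog mechanism can be reproduced in miniature. The sketch below is not airlift's BoundedExecutor; it assumes a plain `java.util.concurrent.ThreadPoolExecutor` with an unbounded `LinkedBlockingQueue` as a stand-in, and a single shared monitor as a stand-in for the contended cache. The class and method names are hypothetical. While every worker is stuck on the monitor, each further submission simply lands in the queue, which is how an unbounded queue can grow until the heap is exhausted.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BacklogSketch {
    // Submits `tasks` callbacks to a pool of `workers` threads while a shared
    // lock is held, and returns how many callbacks were left waiting in the queue.
    static int backlogWhileWorkersBlocked(int workers, int tasks) {
        Object contendedLock = new Object();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>()); // unbounded: nothing pushes back on producers
        int queued;
        synchronized (contendedLock) {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    synchronized (contendedLock) {
                        // stand-in for the slow, contended cache update
                    }
                });
            }
            // All workers are stuck on the lock, so the remaining tasks are still queued.
            queued = pool.getQueue().size();
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return queued;
    }

    public static void main(String[] args) {
        System.out.println("queued callbacks: " + backlogWhileWorkersBlocked(4, 1_000));
    }
}
```

With 4 workers and 1,000 submissions, the first 4 tasks are handed straight to new worker threads and the other 996 stay in the queue until the lock is released.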

To give a sense of the level of collisions: the cache map contains ~120K items, and its hash table has 260K buckets.
But only 5 buckets are non-empty. So 120K items are spread among 5 buckets, and each time we read or update the cache we iterate over a huge linked list to find the right element. As it's a concurrent map, many threads are blocked.
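
To make the bucket arithmetic concrete, here is a small sketch that counts how many table slots are occupied for a given hash function. It assumes a table of 2^18 = 262,144 slots (roughly the 260K buckets mentioned above) and uses the same bit-spreading step that `java.util.concurrent.ConcurrentHashMap` applies to `hashCode()`. The two hash functions are hypothetical stand-ins; the real keys in Jackson's cache and their hash codes are not reproduced here.

```java
import java.util.BitSet;
import java.util.function.IntUnaryOperator;

final class BucketOccupancy {
    // Bit-spreading step that ConcurrentHashMap applies to every hashCode().
    static int spread(int h) {
        return (h ^ (h >>> 16)) & 0x7fffffff;
    }

    // Counts the distinct slots hit by `keys` synthetic keys (ids 0..keys-1)
    // in a table of `tableSize` slots; tableSize must be a power of two.
    static int occupiedBuckets(int keys, IntUnaryOperator hashCodeOf, int tableSize) {
        BitSet occupied = new BitSet(tableSize);
        for (int id = 0; id < keys; id++) {
            occupied.set(spread(hashCodeOf.applyAsInt(id)) & (tableSize - 1));
        }
        return occupied.cardinality();
    }

    public static void main(String[] args) {
        int tableSize = 1 << 18; // 262,144 slots
        int keys = 120_000;
        // Hypothetical degenerate hash: only 5 distinct values across all keys.
        System.out.println(occupiedBuckets(keys, id -> id % 5, tableSize));
        // Hypothetical well-distributed hash: one distinct value per key.
        System.out.println(occupiedBuckets(keys, id -> id, tableSize));
    }
}
```

The first call reports 5 occupied slots, so each slot holds about 24,000 entries on average, while the second reports 120,000 occupied slots with one entry each. Every lookup or insert into one of the 5 crowded slots has to search that slot's entries, and writers to the same slot serialize on its lock.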

@hashhar
Member

hashhar commented Jan 11, 2024

Thanks for following up and figuring out the cause and fix @davseitsev. Do you plan to send a pull-request with the updated version once you have verified internally?

EDIT: Never mind, I see I already updated to 2.15.x in the past. Thanks for letting us know about the source of the issue anyway.
