Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure to call close() on Scope returned from io.opentelemetry.context.Context.makeCurrent() #924

Open
wants to merge 12 commits into
base: series/2.x
Choose a base branch
from

Conversation

joost-de-vries
Copy link

@joost-de-vries joost-de-vries commented Dec 30, 2024

The documentation for io.opentelemetry.context.Context.makeCurrent() states
Every makeCurrent() must be followed by a Scope#close(). Breaking these rules may lead to memory leaks and incorrect scoping.

Running zio-opentelemetry with -Dio.opentelemetry.context.enableStrictContext=true leads to an error

Dec 30, 2024 11:08:42 AM io.opentelemetry.context.StrictContextStorage$PendingScopes run
SEVERE: Scope garbage collected before being closed.
java.lang.AssertionError: Thread [ZScheduler-Worker-2] opened a scope of OtelContext{currentSpan=datadog.opentelemetry.shim.trace.OtelSpan$NoopSpanContext@1f7a631e, rootSpan=datadog.opentelemetry.shim.trace.OtelSpan$NoopSpanContext@1f7a631e} here:
	at datadog.opentelemetry.shim.context.OtelContext.makeCurrent(OtelContext.java:90)
	at zio.telemetry.opentelemetry.context.ContextStorage$Native$.set$$anonfun$1(ContextStorage.scala:63)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1027)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1063)
	at zio.internal.FiberRuntime.runLoop(FiberRuntime.scala:1090)
	at zio.internal.FiberRuntime.evaluateEffect(FiberRuntime.scala:412)
	at zio.internal.FiberRuntime.evaluateMessageWhileSuspended(FiberRuntime.scala:488)
	at zio.internal.FiberRuntime.drainQueueOnCurrentThread(FiberRuntime.scala:249)
	at zio.internal.FiberRuntime.run(FiberRuntime.scala:137)
	at zio.internal.ZScheduler$$anon$3.run(ZScheduler.scala:380)

This PR uses zio.Scope to make sure that the otel Scope returned by Context.makeCurrent is always closed.

@joost-de-vries joost-de-vries requested a review from a team as a code owner December 30, 2024 10:21
@CLAassistant
Copy link

CLAassistant commented Dec 30, 2024

CLA assistant check
All committers have signed the CLA.

@grouzen
Copy link
Contributor

grouzen commented Dec 30, 2024

@joost-de-vries Thanks for spotting this! I'll look into it when I have time and tell you what I think about your solution. The only concern I have is that these changes are not backward-compatible.

@joost-de-vries
Copy link
Author

@grouzen thank you for your quick response.

Yes, too bad that the Baggage api is affected. I couldn't think of a solution that doesn't. Maybe you have an idea?

I've tested this fix on datadog.
In our code we call after a http response Tracing.setAttribute("http.status_code", response.status.code.toString).
That status code was always missing in the span.
When I deployed this fix it started showing up. You can see the exact moment of the deploy
result-of-fix-on-datadog

@joost-de-vries
Copy link
Author

There's still a failing assertion java.lang.IllegalStateException: Thread [ZScheduler-Worker-7] opened scope, but thread [ZScheduler-Worker-3] closed it
The assertion occurs at the end of the 2nd (zio) argument of Tracing.root(...)(zio)
So the otel Scope is getting closed. But from a different thread.
I'm not sure whether that is an issue for zio code or no. The otel strict code expects to be called by a blocking java call.

@grouzen
Copy link
Contributor

grouzen commented Dec 30, 2024

Now I realize that it is required to change the Baggage API this way to reflect the .makeCurrent semantics.

@grouzen
Copy link
Contributor

grouzen commented Dec 30, 2024

@joost-de-vries Please pull the changes from series/2.x branch to fix the CI error.

@joost-de-vries
Copy link
Author

@grouzen I wonder how to test the fix. One way would be to mock de java OpenTelemetry Context.makeCurrent. And check that close() is called.

@joost-de-vries
Copy link
Author

joost-de-vries commented Jan 2, 2025

Thinking about the changed api of Baggage: I guess I can introduce a new trait. And add deprecation warnings to the existing.
Of course that raises the question what to call the new trait. And what to call the alternative to the OpenTelemetry function

 def baggage(logAnnotated: Boolean = false): URLayer[ContextStorage, Baggage] =
    Baggage.live(logAnnotated)

I guess we could call the existing trait BaggageDeprecated

@grouzen
Copy link
Contributor

grouzen commented Jan 2, 2025

Thinking about the changed api of Baggage: I guess I can introduce a new trait. And add deprecation warnings to the existing. Of course that raises the question what to call the new trait. And what to call the alternative to the OpenTelemetry function

 def baggage(logAnnotated: Boolean = false): URLayer[ContextStorage, Baggage] =
    Baggage.live(logAnnotated)

I guess we could call the existing trait BaggageDeprecated

Frankly, I'm still thinking about how to avoid having scoped baggage API. I need to find some time to read the Baggage OTEL spec. I assume it is fine to have nested OTEL scopes when dealing with baggage data because, as far as I remember, according to the spec, it stores the baggage data in the trace context rather than the span, which makes perfect sense to me.
Sorry for this stream of consciousness if it doesn't make any sense ;) Just don't have enough time at the moment to dig deeply into it.

@grouzen
Copy link
Contributor

grouzen commented Jan 2, 2025

@grouzen I wonder how to test the fix. One way would be to mock de java OpenTelemetry Context.makeCurrent. And check that close() is called.

Sounds good. I think it is a sane way of testing this kind of stuff.

@joost-de-vries
Copy link
Author

Thinking about the changed api of Baggage: I guess I can introduce a new trait. And add deprecation warnings to the existing. Of course that raises the question what to call the new trait. And what to call the alternative to the OpenTelemetry function

 def baggage(logAnnotated: Boolean = false): URLayer[ContextStorage, Baggage] =
    Baggage.live(logAnnotated)

I guess we could call the existing trait BaggageDeprecated

Frankly, I'm still thinking about how to avoid having scoped baggage API. I need to find some time to read the Baggage OTEL spec. I assume it is fine to have nested OTEL scopes when dealing with baggage data because, as far as I remember, according to the spec, it stores the baggage data in the trace context rather than the span, which makes perfect sense to me. Sorry for this stream of consciousness if it doesn't make any sense ;) Just don't have enough time at the moment to dig deeply into it.

Yes, I'm chewing on that too.
In my mind the Tracing working correctly is much more important than Baggage. Thankfully that api is not affected. So I would like the fix to Tracing to not get hold up by the Baggage api.
We can leave the Baggage api as is. (Unless we think of a transparent solution.) And introduce a ScopedBaggage trait. Or for the affected methods a scoped variant. Like def setScoped(name: String, value: String)(implicit trace: Trace): URIO[Scope, Unit]. That way current usage is unaffected.

Come to think of it: the latter solution is simplest probably.

@joost-de-vries
Copy link
Author

@grouzen I wonder how to test the fix. One way would be to mock de java OpenTelemetry Context.makeCurrent. And check that close() is called.

Sounds good. I think it is a sane way of testing this kind of stuff.

I've looked into this:
It seems the test for ContextStorage.native doesn't initialize the otel OpenTelemetry. So I don't know if the strict jvm argument will work.
Also for a jvm argument to be applied to the test it probably needs to run forked in a separate jvm.
I don't think we can mock the ThreadLocal.

@joost-de-vries
Copy link
Author

@grouzen I've implemented the latter approach. Existing code will compile unchanged.

(of course that raises the question what the implementation of the existing methods should be. I've implemented it with a local ZIO.scoped. If we want to keep the old behaviour we should provide a scope that is never finalized. )

@joost-de-vries
Copy link
Author

I've added a test to verify that ContextStorage.native adheres to the contract of otel Context.makeCurrent:

Context prevCtx = Context.current(); 
try (Scope ignored = ctx.makeCurrent()) {   
  assert Context.current() == ctx; 
  ... 
} 
assert Context.current() == prevCtx;

Most of those tests fail on master. But not all.

@grouzen
Copy link
Contributor

grouzen commented Jan 7, 2025

Hey! Sorry for being silent, don't have time to work on this atm.
I want to take a timeout to think more since I don't quite like the breaking changes and the scoped API in general. However, I still hope it may be possible to avoid this.
Here are some links to have a look if you are interested:

@joost-de-vries
Copy link
Author

joost-de-vries commented Jan 7, 2025

@grouzen I understand.
I agree it would be nice if we don't need to change the api.

Are you on zio discord? I can share our experiences with zio + datadog.

@grouzen
Copy link
Contributor

grouzen commented Jan 7, 2025

@grouzen I understand. I agree it would be nice if we don't need to change the api.

Are you on zio discord? I can share our experiences with zio + datadog.

Yeah, you can find me on the zio-telemetry channel as well. It would be lovely to hear about your experience!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants