[DO NOT REVIEW] Add Otel/STEF project #2492
Otel/STEF (Sequential Tabular Encoding Format) is a new data format and network protocol for OpenTelemetry data. For the target use-cases Otel/STEF outperforms both OTLP and Otel Arrow: Otel/STEF is smaller and/or faster. See stef.md for details: benchmarks and comparisons to other formats, links to prototypes and description of project goals.
I believe there is a legitimate need in OpenTelemetry for a stateful protocol (or even something as simple as adding dictionaries to OTLP itself).
As discussed before, I'm supportive of investigating STEF and bringing it to a usable state.
However - I think we need to understand our end-game here. When we evaluate "Why not Arrow", "Why STEF", or "Why not OTLP", I think this proposal is still lacking our primary goal/scope. It lists projects, but not the implications of delivering those projects.
I'd like to make sure we align on where this will be used. I called out two areas I think could benefit dramatically from some low-level, stateful, efficient protocols. I don't think these are the only targets, but I'm also not sure these were on your radar either.
Let's confirm we agree on end-state then I'm in.
A draft specification and a prototype implementation of Otel/STEF is attached to this proposal.

### Goals, objectives, and requirements
I think there should be two additional goals here (possibly one if you phrase it well).
- We should evaluate an efficient file-based protocol for direct-export from API with out-of-band "Collector-style" SDK. That is, imagine an SDK implementation that can serialize events out-of-band quickly and efficiently, and that offers more resilience on process death. This is what I was working my way towards with https://github.com/jsuereth/otlp-mmap/ and I think if we invest in a stateless protocol, this should be a use case it can support.
- Providing guidance to eBPF based telemetry extraction. If we are able to define structures and buffers and efficient stateful communication, we might be able to provide a good set of primitives for eBPF based event-extraction (complementary to my first bullet point). This is an avenue I think may be worth exploring.
> We should evaluate an efficient file-based protocol for direct-export from API with out-of-band "Collector-style" SDK.
That's certainly something we can extend STEF to do. An mmap-ed ring buffer of STEF frames can serve that purpose. Since STEF optionally allows full state resets between frames, it essentially has a stateless mode built in (with resets happening every frame).
> Providing guidance to eBPF based telemetry extraction.
I am not sure I understand this one.
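To make the "stateless mode via resets" point concrete, here is a minimal Go sketch of a string dictionary that can be reset before every frame. It is an illustration only and does not reflect the actual STEF wire format: the `dictEncoder` type, the `new:`/`ref:` token notation, and the `encodeFrames` helper are all hypothetical names invented for this example.

```go
package main

import "fmt"

// dictEncoder is a hypothetical string dictionary: the first time a value is
// seen it is emitted in full and assigned the next index; later occurrences
// are emitted as the index only. This is NOT the STEF encoding.
type dictEncoder struct {
	index map[string]int
}

func newDictEncoder() *dictEncoder {
	return &dictEncoder{index: map[string]int{}}
}

// reset drops all accumulated state so the next frame is self-contained.
func (e *dictEncoder) reset() {
	e.index = map[string]int{}
}

// encodeFrame returns one token per input value: "new:<value>" for a first
// occurrence or "ref:<index>" for a repeat.
func (e *dictEncoder) encodeFrame(values []string) []string {
	out := make([]string, 0, len(values))
	for _, v := range values {
		if i, ok := e.index[v]; ok {
			out = append(out, fmt.Sprintf("ref:%d", i))
			continue
		}
		e.index[v] = len(e.index)
		out = append(out, "new:"+v)
	}
	return out
}

// encodeFrames encodes several frames, optionally resetting before each one.
func encodeFrames(frames [][]string, resetEveryFrame bool) [][]string {
	enc := newDictEncoder()
	out := make([][]string, 0, len(frames))
	for _, f := range frames {
		if resetEveryFrame {
			enc.reset()
		}
		out = append(out, enc.encodeFrame(f))
	}
	return out
}

func main() {
	frames := [][]string{
		{"service.name", "http.method", "service.name"},
		{"service.name", "http.method"},
	}
	fmt.Println("stateful: ", encodeFrames(frames, false))
	fmt.Println("stateless:", encodeFrames(frames, true))
}
```

Without resets, the second frame consists only of references and cannot be decoded without the first. With a reset before every frame, each frame repeats its values in full and can be read independently, at the cost of a larger frame.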
> I think there should be two additional goals here (possibly one if you phrase it well).
I added a more generic goal that says that the project should evaluate additional use-cases.
Project non-goals:

- We do not plan to offer Otel/STEF as a general-purpose replacement for OTLP or for Otel
I think I disagree with this. I understand the concern around limiting investment and engineering resources. However - I think you need to better clarify the target use case in this situation. Is this something only the Go SDK would be able to use with a collector? Is this something for just Collector->Collector communication?
That isn't really answered in this proposal, and I think it's critical.
My $.02 is that if we invest in a stateless protocol, it should be an open standard (e.g. Arrow/Parquet) or have an OpenTelemetry use case not well served by existing open standards and OTLP.
Here, you demonstrate that STEF can outperform Arrow in efficiency for transmitting telemetry data. Where in OpenTelemetry do we use that?
TL;DR: this should have a targeted set of use cases for the protocol.
> Is this something only the Go SDK would be able to use with a collector? Is this something for just Collector->Collector communication?
Technically, nothing prevents any language SDK from having a STEF exporter. I am only making this claim to avoid placing additional burden on language maintainers. Should we (Otel) find it desirable to have STEF protocol support in SDKs, that's certainly doable.
I think the highest ROI is going to be in Collector->Collector or Collector->backend communication and that's why I suggest to start with that.
Can we extend the project to allow the use-cases you mentioned? Absolutely, and I would love to see that happen. I am just intentionally defining the initial scope to be small enough that we can deliver it quickly.
I agree that direct-export use case you mentioned is also very interesting (not for just performance but for crash-resilience reasons).
I am happy to extend the scope if you think you (or some other Otel contributor) could invest time in that extended scope.
> have an OpenTelemetry use case not well served by existing open standards and OTLP.
This is my motivation. OTLP is inefficient size-wise. Otel Arrow Phase 1 is better size-wise, but it is also quite expensive CPU-wise, and even smaller sizes are possible (as the benchmarks show). STEF significantly advances the performance of our network protocols on multiple dimensions, with relatively small investment.
The first use case is as described in the proposal: smaller wire size (network cost savings) and less CPU consumption by the Collector (compute cost savings).
Do you think additional explanation is needed for this use case? Or do you want the additional use cases to be described (e.g. the mmap-ed direct-export)?
I think we should showcase where stateful protocols are viable (e.g. how does it work through a load balancer?).
While I think stateful protocols have a lot of good use cases, from a broad sense - I have concerns that we may be optimising network overhead at the expense of inflexible network architecture and memory overhead in storing dictionaries from N clients.
I.e. for small, contextual cases this protocol is amazing and should be optimised for such cases. In broad, highly distributed, cases, we may still need to invest in OTLP optimisations.
To be clear - I think STEF shows a lot of potential and is worth investing in. I want this proposal to be explicit about where we see the biggest benefit and where the trade-offs stop paying off. Let's have a target architecture in mind for benchmarks, comparison and 'success'. I think you have that in your head (and were able to elucidate it when we talked about this), but it's just not written in the proposal.
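The load-balancer and per-client memory concerns can be illustrated with a small Go sketch. It assumes a hypothetical token notation in which `new:<value>` introduces a dictionary entry and `ref:<index>` refers back to one. The `receiver` type and its methods are invented for this example and are not part of STEF or the Collector.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// receiver keeps one dictionary per client connection (hypothetical model).
type receiver struct {
	dicts map[string][]string // connection ID -> values seen so far
}

func newReceiver() *receiver {
	return &receiver{dicts: map[string][]string{}}
}

// decode resolves "new:<value>" and "ref:<index>" tokens for one connection.
func (r *receiver) decode(conn string, tokens []string) ([]string, error) {
	out := make([]string, 0, len(tokens))
	for _, t := range tokens {
		switch {
		case strings.HasPrefix(t, "new:"):
			v := strings.TrimPrefix(t, "new:")
			r.dicts[conn] = append(r.dicts[conn], v)
			out = append(out, v)
		case strings.HasPrefix(t, "ref:"):
			i, err := strconv.Atoi(strings.TrimPrefix(t, "ref:"))
			if err != nil || i < 0 || i >= len(r.dicts[conn]) {
				return nil, fmt.Errorf("connection %q: unknown dictionary reference %q", conn, t)
			}
			out = append(out, r.dicts[conn][i])
		default:
			return nil, fmt.Errorf("malformed token %q", t)
		}
	}
	return out, nil
}

// entries reports how many dictionary values the receiver holds in total.
func (r *receiver) entries() int {
	n := 0
	for _, d := range r.dicts {
		n += len(d)
	}
	return n
}

func main() {
	frame1 := []string{"new:service.name", "new:http.method"}
	frame2 := []string{"ref:0", "ref:1"}

	a, b := newReceiver(), newReceiver()
	_, _ = a.decode("client-1", frame1)

	_, errSticky := a.decode("client-1", frame2)   // same receiver: state is present
	_, errRerouted := b.decode("client-1", frame2) // rerouted: state is missing
	fmt.Println("same receiver:", errSticky)
	fmt.Println("rerouted:     ", errRerouted)
	fmt.Println("entries held by a:", a.entries())
}
```

In this model, a frame that is rerouted to a receiver without the connection's state cannot be decoded, so a load balancer would need to keep each stream pinned to one backend or the sender would need to reset state. The `entries` count also grows with the number of connected clients, which is the memory trade-off raised above.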
@jsuereth I completely agree with you, we need answers to that. I think this line in the list of goals touches that:
> Publish benchmark-justified guidelines on applicability of Otel/STEF vs OTLP vs Otel Arrow.
If this is not enough I can call it out more explicitly.
Added more in 88ede99
I am supportive, but would like to see more clarity on the goals, design principles, and trade-offs. For example, this could be explicitly limited to and optimized for a wire transmission protocol at the expense of other usage patterns, but I did not find these goals clearly stated.
SIG meeting to be scheduled once the project is approved.

## FAQ
Has there been a comparative analysis done of other existing protocols besides Arrow?
Benchmarks include comparisons with OTLP, Parquet and Otel Arrow. If you think there are other interesting formats to compare to, we can add them to the project goals.
SIG meeting to be scheduled once the project is approved.

## FAQ
In the spec I did not see a list of design principles for the STEF protocol. Does such a list exist? What is the protocol optimized for? For instance, this doc illustrates speed and size benchmarks, but do they come with trade-offs? What about memory layout, zero-copy capabilities, ability to append data, efficiency of query execution against the data, etc.?
Added design principles here 2aa7487
Let me know if you would like more details.
I'm a bit sceptical. Arrow has extension types; how much have they been explored? It sounds like this format is, in philosophy, 90% identical to Arrow. This is intended as a mostly internal protocol (since only Go is expected to have an implementation), so even the weirder things we've seen in Arrow would be doable (e.g. use a binary array and store custom bytes in it).
Much like @yurishkuro, I'd like to see more on tradeoffs.
Added some more benchmark results in an appendix https://github.com/open-telemetry/community/blob/24716f9c027d51c793bea404354454434579f89d/projects/stef.md#appendix-a-benchmark-results
I think this is a question for Otel Arrow SIG.
Correct, in many ways it is similar to Arrow.
Added a section with design principles that explains the tradeoffs.
Otel/STEF targets a narrower niche than OTLP or Otel Arrow and is more efficient for that niche. Otel/STEF is optimized for payload size and fast serialization
What about the ecosystem and interoperability?
| batch_size: 1024 | 24764 (total: 1.1 MB) | 5622 (x 4.40) (total: 242 kB) | 10773 (x 2.30) (total: 463 kB) |
| batch_size: 2048 | 39325 (total: 865 kB) | 9209 (x 4.27) (total: 203 kB) | 17808 (x 2.21) (total: 392 kB) |
| batch_size: 4096 | 64824 (total: 713 kB) | 15501 (x 4.18) (total: 170 kB) | 29421 (x 2.20) (total: 324 kB) |
| batch_size: 16384 | 196877 (total: 591 kB) | 38376 (x 5.13) (total: 115 kB) | 86299 (x 2.28) (total: 259 kB) |
What do you think about updating these to show the result of using OTLP + various compression algorithms? gzip is the default, but I wrote benchmarks for a variety of others and had good results. I wonder how Otel/STEF and Otel Arrow + Stream Mode stack up.
Heya, super interesting stuff. I have a couple of questions as a GC member --
Appreciate any responses to these points.
This is also a concern of mine: why, and why now? We are constantly saying that we have too many things on our plates: what are we dropping in favor of STEF? Given the Collector's focus on v1, I'd like to get a SIG Collector collective "stamp" before committing to more work involving the Collector (code owners, maintenance, ...). I think the tech is really cool and the compression case is very compelling, but I see this as an optimization to be done after we have v1 out of the door.
Thanks all for comments. I am converting this to a draft for now until I have more clarity on the answers.
Otel/STEF (Sequential Tabular Encoding Format) is a new data format and network protocol for OpenTelemetry data.
For the target use-cases Otel/STEF outperforms both OTLP and Otel Arrow (phase 1): Otel/STEF is smaller and/or faster.
See stef.md for details: benchmarks and comparisons to other formats, links to prototypes and description of project goals.