DogStatsdSink lowers performance at higher CPU + concurrency #37831
Some additional info... Full Envoy config YAML used during testing, including the DogStatsdSink (there is some envsubst done beforehand to set variables):

node:
  id: allnodes
  cluster: $NODE_CLUSTER
admin:
  access_log:
  - name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /dev/null
  address:
    socket_address: { "address": "0.0.0.0", "port_value": 9090 }
dynamic_resources:
  lds_config:
    resource_api_version: V3
    api_config_source:
      api_type: DELTA_GRPC
      transport_api_version: V3
      grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
  cds_config:
    resource_api_version: V3
    api_config_source:
      api_type: DELTA_GRPC
      transport_api_version: V3
      grpc_services:
      - envoy_grpc:
          cluster_name: xds_cluster
static_resources:
  clusters:
  - name: xds_cluster
    alt_stat_name: cluster=envoy-xds;server_group=envoy-xds-v001;service=envoy-xds/envoy-xds-v001;application=envoy;
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    circuit_breakers:
      thresholds:
      - priority: high
        max_connections: 100000
        max_pending_requests: 100000
        max_requests: 60000000
        max_retries: 50
      - priority: default
        max_connections: 100000
        max_pending_requests: 100000
        max_requests: 60000000
        max_retries: 50
    upstream_connection_options:
      # https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/address.proto#envoy-api-msg-core-tcpkeepalive
      tcp_keepalive:
        keepalive_probes: 3
        keepalive_time: 30
        keepalive_interval: 5
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: $CONTROL_PLANE_IP
                port_value: 5678
  - name: logs
    alt_stat_name: cluster=envoy-logs;server_group=envoy-logs-v001;service=envoy-logs/envoy-logs-v001;application=envoy;
    connect_timeout: 5s
    type: STATIC
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: logs
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: $ALS_SERVER_IP
                port_value: 8080
watchdogs:
  main_thread_watchdog:
    miss_timeout: 0.5s
    megamiss_timeout: 2s
  worker_watchdog:
    miss_timeout: 0.05s
    megamiss_timeout: 0.5s
stats_sinks:
- name: envoy.stat_sinks.open_telemetry
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.stat_sinks.open_telemetry.v3.SinkConfig
    grpc_service:
      google_grpc:
        target_uri: "127.0.0.1:4317"
        stat_prefix: envoy_grpc_otel
    report_counters_as_deltas: false
    report_histograms_as_deltas: true
    emit_tags_as_attributes: true
    use_tag_extracted_name: true
    prefix: envoy_otel
- name: envoy.stat_sinks.dog_statsd
  typed_config:
    "@type": type.googleapis.com/envoy.config.metrics.v3.DogStatsdSink
    address:
      socket_address:
        address: $STATSD_SERVICE_IP
        port_value: 8125
    prefix: envoy
stats_config:
  stats_tags:
  - tag_name: service
    regex: ^cluster\..*?(service=(.+?);).*$
  - tag_name: server_group
    regex: ^cluster\..*?(server_group=(.+?);).*$
  - tag_name: application
    regex: ^cluster\..*?(application=(.+?);).*$
  - tag_name: cluster
    regex: ^cluster\..*?(cluster=(.+?);).*
  - tag_name: host
    fixed_value: $POD_NAME
  - tag_name: lb_type
    fixed_value: envoy
  - tag_name: traffic_source
    fixed_value: $TRAFFIC_SOURCE

Envoy startup script looks like:

envoy -c /etc/envoy/config/envoy.yaml \
  --restart-epoch $RESTART_EPOCH \
  --concurrency $CONCURRENCY \
  --service-zone $zone \
  --service-cluster $NODE_CLUSTER \
  --service-node $NODE_ID
One possible cause of lower performance with a higher number of cores is mutex contention. For the DataDog devs that might be a good place to look. One particular place where mutex contention can arise is the symbolization of string-based stat names. See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats-recentlookups for more discussion and a mechanism built into Envoy to help detect that case. This is just a guess, though, and there could be mutexes unrelated to stat names. https://github.com/envoyproxy/envoy/blob/fd618c36c5c821ebd5697e322a3c842bb5eb1faf/source/docs/stats.md#performance-and-thread-local-storage is also worth a quick read.
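To make that concrete, here is a minimal C++ sketch of the kind of contention being described (the class and names are illustrative assumptions, not Envoy's actual SymbolTable API): a single string-to-symbol map guarded by one mutex, which every worker thread hits whenever it symbolizes a stat name on the hot path.

#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Illustrative only: a global name->symbol map behind a single lock.
class NaiveSymbolTable {
public:
  uint64_t symbolize(const std::string& name) {
    std::lock_guard<std::mutex> lock(mutex_);  // every caller serializes here
    auto it = symbols_.find(name);
    if (it != symbols_.end()) {
      return it->second;
    }
    const uint64_t sym = next_symbol_++;
    symbols_.emplace(name, sym);
    return sym;
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, uint64_t> symbols_;
  uint64_t next_symbol_{1};
};

If a sink symbolizes every stat name on every flush (or on every data point), all workers funnel through that lock at that rate; caching the symbolized handles so the lock is only taken when a new name first appears is the usual way to avoid this.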
I was trying to find the origins of this sink, and it dates back 7 years: #2158. There has been a lot of DataDog-related work since then, but I think it's in the tracer.
@dgoffredo may be able to point us to someone currently working on Envoy at DataDog who could help look for points of contention within its sink library.
Hi @jmarantz. The tracer is not responsible for sending metrics. The DogStatsD implementation is simply the StatsD sink over UDP owned by Envoy.
Yeah, I figured that out. Anyway, I thought someone from Datadog could drive the investigation and remedy, per my suggestions above. It might be contention related to the UDP sink, but fixing that would need to be driven by a use case like Datadog's.
Reading what @jmarantz said about mutex contention, I tried some of the things he mentioned:

recentlookups

See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats-recentlookups. No extra lookups noted; it seems about the same with or without the DogStatsdSink. Around 40-50 lookups per second, which is more than I expected, but by my understanding not enough to have a noticeable impact (since each one takes at most ~100ns). The lookups seem to be a combination of metric names, label/attribute keys, and label/attribute values.

contention

Running the perf test with the DogStatsdSink configured, Envoy shows:

curl http://127.0.0.1:9090/contention
{
 "num_contentions": "18272760",
 "current_wait_cycles": "44097",
 "lifetime_wait_cycles": "1932193659414"
}

Before the test started, the numbers were much lower, e.g.:

curl http://127.0.0.1:9090/contention
{
 "num_contentions": "1156",
 "current_wait_cycles": "2100",
 "lifetime_wait_cycles": "18656083"
}

Without the DogStatsdSink, we see numbers like the following, even at a peak of traffic to an Envoy instance, and it is fairly stable over time:

$ curl http://127.0.0.1:9090/contention
{
 "num_contentions": "65",
 "current_wait_cycles": "14573",
 "lifetime_wait_cycles": "4377304"
}

These much lower and more stable numbers lead me to conclude that mutex contention is a primary cause of the slowness, somehow introduced by the DogStatsdSink, but probably not contention over the string lookups for stat names.
This makes sense; stat name lookups are just one form of contention that we've seen in the past, so we have special instrumentation for it. I guess we need to examine the mutexes taken during the DataDog sink flow, which is probably mostly the same as the UDP sink flow.
I have another thought -- I think it's worth looking into why histograms are flushed to the stats sink on every change, whereas gauges/counters are flushed every 5s (typically). I think this might date back to before we had a mechanism to collate the histogram data collected across threads. Consider this call-stack:
Internally we found it made things faster to make our stats sink's
I think the code at source/common/stats/symbol_table.cc, line 319 (commit 871e410) is worth looking at.
Thinking about this more, the difficulty of maintaining that map correctly without leaking or adding contention seems daunting. It would be highly preferable if we could just access the histogram's current state on flush, like we do for counters and gauges. If that is not possible, we could have the histogram updates post to the main thread so that the name lookup is not contended. There might still be contention on the main thread's dispatcher mutex, so you might just be moving the contention around, but it might be a little better because that operation only appends a closure to a list.
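As a rough sketch of that "post to the main thread" idea (hypothetical types; in Envoy the analogue would be posting a closure to the main thread's dispatcher), the worker-side critical section shrinks to a vector push_back, while the expensive name lookup and sink write happen later on one thread:

#include <functional>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a dispatcher's post queue.
class MainThreadQueue {
public:
  // Called from worker threads on every histogram sample.
  void post(std::function<void()> cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.push_back(std::move(cb));  // cheap critical section
  }

  // Called periodically from the main thread (e.g. on the flush timer).
  void drain() {
    std::vector<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch.swap(pending_);
    }
    for (auto& cb : batch) {
      cb();  // name lookup / sink write happens here, off the worker threads
    }
  }

private:
  std::mutex mutex_;
  std::vector<std::function<void()>> pending_;
};

The queue mutex is still shared, so this moves contention rather than removing it, but the time spent inside the lock is much smaller than a map lookup plus a possible insert.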
We've had performance issues with StatsD sinks as well, since they flush histograms on every call, which causes lots of UDP traffic and in turn high CPU usage in StatsD. Fairly recently Datadog added support for StatsD client-side aggregation, which means you can send a metric with multiple data points, though this adds a lot of complexity on the client side (the sink). It could also help on the sink side to buffer metrics into as large a packet as can be supported (according to the MTU) and send multiple metrics together. But IMHO this sink should probably be deprecated, and users of Datadog should use Prometheus scraping, which is also supported and has support for Distribution metric types, which are much better than standard gauge-based histograms.
I think it would make sense to batch updates. I don't know a lot about histograms in statsd; I just had a quick look at https://github.com/statsd/statsd/blob/master/docs/metric_types.md and I guess the concept is that the histogram aggregation is done in the statsd process, whereas Envoy is also doing it. It would be preferable to send aggregated results to statsd every 5s instead of sending every data point; this is how we solved it internally. However, we can definitely improve things a lot by buffering the data points and sending them as a batch, either during the sink flush or when the buffer fills. We would only need to translate the StatNames to strings when emitting the buffered data, and could use a locally scoped map to reduce the number of translations needed.
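A hedged sketch of that buffering idea (the class name, the send callback, and the 1400-byte budget are assumptions, not Envoy's UdpStatsdSink API): accumulate newline-separated statsd lines and emit a datagram only when adding another line would exceed the packet budget, or when the periodic flush finishes.

#include <cstddef>
#include <functional>
#include <string>

class StatsdBatcher {
public:
  // send_datagram writes one UDP packet; max_bytes stays under a typical MTU
  // so the datagram is not fragmented.
  StatsdBatcher(std::function<void(const std::string&)> send_datagram,
                size_t max_bytes = 1400)
      : send_(std::move(send_datagram)), max_bytes_(max_bytes) {}

  // line is a rendered statsd metric such as "envoy.foo:1|c".
  void add(const std::string& line) {
    if (!buffer_.empty() && buffer_.size() + 1 + line.size() > max_bytes_) {
      flush();
    }
    if (!buffer_.empty()) {
      buffer_ += '\n';  // statsd accepts newline-separated metrics per packet
    }
    buffer_ += line;
  }

  // Call at the end of the sink flush so no partial batch is left behind.
  void flush() {
    if (!buffer_.empty()) {
      send_(buffer_);
      buffer_.clear();
    }
  }

private:
  std::function<void(const std::string&)> send_;
  const size_t max_bytes_;
  std::string buffer_;
};

Combined with a small flush-local map from StatName to its rendered string, the symbol-table translation would happen once per distinct name per flush instead of once per data point.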
I'm going to try to repro with a benchmark, but my compilation setup got stale.
I have a benchmark to repro, as well as a very basic attempt to alleviate the bottleneck. I'm not sure if this is worth pursuing, though, per @sschepens's comment that the statsd sink should be deprecated in favor of Prometheus scraping. I would be curious to hear @DukeyToo's thoughts on whether that's an appropriate solution.
Title: DogStatsdSink lowers performance at higher CPU + concurrency
Description:
Repro steps:
Admin and Stats Output:
Config:
Logs:
Metrics:
Each of the panels below is from the same two perf runs, the first with the DogStatsdSink included and the second without. There were no other changes between the two tests/runs. This was also reproduced multiple times on different days. The perf test:
The above panel shows mean latency as measured by the app (yellow line, an OpenTelemetry histogram metric) and at the load balancer (green line, also an OpenTelemetry histogram metric). Note that the green line jumps significantly in the first perf test, representing latency that Envoy is adding. (This is the mean, so even a small bump represents a significant change.) In the second perf test there is a smaller bump. That bump is more "normal": we expect some small loss of performance when scaling out under increasing perf loads. In both cases, the yellow line (self-reported upstream cluster latency) is stable, even dropping at the beginning as it warms up.
This next panel shows the Envoy TPS (rate derived from the http.downstream_rq_http1_total counter) during the 2 tests. The pattern is significantly different between the two tests. In test 1, Envoy reaches a max of around 16K ops per instance; as autoscaling happens, the overall rate drops to a steady state of around 10K ops per instance. In test 2, the first instances get all the way up to 27K ops per instance before dropping to a steady state of about 16K ops per instance (i.e. the steady state in test 2 was the same as the peak in test 1). Also, the final instance count for Envoy was 12 in test 1 and 8 in test 2, i.e. test 1 used 50% more Envoy instances to support the same traffic.
In summary, we were able to reproduce the performance degradation we saw in prod with an artificial performance test, but only when configuring the DogStatsdSink. Thus we concluded that the source of the performance problems is something in the DogStatsdSink. We don't observe the same issues when using Prometheus scraping or the OpenTelemetry sink.