DogStatsdSink lowers performance at higher CPU + concurrency #37831

Open
DukeyToo opened this issue Dec 27, 2024 · 15 comments

@DukeyToo

Title: DogStatsdSink lowers performance at higher CPU + concurrency

Description:

We have scaled up some Envoy edge instances (added more CPU + increased concurrency), and noticed that these new, larger 16-32 core Envoy instances did not perform as well as expected. Specifically, latency increased well before reaching high CPU usage, and we saw more watchdog misses on the main thread. After a lot of investigation, we found that removing the DogStatsdSink fixed the problem for us.

Repro steps:

Run an Envoy instance with 16 cores and 16 concurrency. Include the DogStatsdSink. It is not necessary to have a real destination, since the sink uses UDP and an unreachable destination will not affect anything. Run enough load against the instance to push it above 60% mean CPU. Note that response times are worse, and CPU usage higher, compared to the same test without the DogStatsdSink configured. (You can probably reproduce with fewer cores and lower concurrency, but this is what we used in our tests.)

Admin and Stats Output:

I don't have sanitized stats to share.

Config:

stats_sinks:
  - name: envoy.stat_sinks.dog_statsd
    typed_config:
      "@type": type.googleapis.com/envoy.config.metrics.v3.DogStatsdSink
      address:
        socket_address:
          address: $STATSD_SERVICE_IP
          port_value: 8125
      prefix: envoy

Logs:

No relevant logs

Metrics:

Each of the panels below is from the same two perf runs, the first with the DogStatsdSink included and the second without. There were no other changes between the two tests/runs. This was also reproduced multiple times on different days. The perf test:

  • runs http1.1 requests against Envoy listener on port 443, routing based on host header
  • Envoy itself autoscales out at 40% cpu, from an initial 3 instances, using k8s HPA
  • lbtester-child is the name of one of the backend/upstream workloads. It is part of our standard suite we use for testing load balancer behavior, and its performance is well-known and predictable
  • lbtester-child was configured to respond to GET queries in 10ms. Any latency above that is generally attributable elsewhere

The first perf run (left side) shows the problem behavior, and the second one (right side) shows what we consider to be normal, well-performing behavior

[screenshot: mean latency panel, app vs. load balancer]

The above panel shows mean latency as measured by the app (yellow line, opentelemetry histogram metric) and at the load balancer (green line, also an opentelemetry histogram metric). Note the green line jumps significantly in the first perf test, representing latency that Envoy is adding. (This is the mean, so even a small bump represents a significant change). In the second perf test, there is a smaller bump. This bump is more "normal" - we expect some small loss of performance when scaling out under increasing perf loads. In both cases, the yellow line (self-reported upstream cluster latency) is stable, even dropping in the beginning as it warms up.

This next panel shows the Envoy TPS (rate derived from http.downstream_rq_http1_total counter) during the 2 tests:
[screenshot: Envoy TPS per instance]

The pattern is significantly different in the 2 tests. In test 1, Envoy reaches a max of around 16K ops per instance. As autoscaling happens, the overall rate drops to a steady state of around 10K ops per instance. In test 2, the first instances get all the way up to 27K ops per instance, before dropping to a steady state of about 16K ops per instance (i.e. steady state in test 2 was the same as the peak in test 1). Also, the final instance count for Envoy was 12 in test 1, and 8 in test 2, i.e. test 1 used 50% more envoy instances to support the same traffic.

In summary, we were able to reproduce performance we saw in prod with an artificial performance test, but only when configuring the DogStatsdSink sink. Thus we concluded that the source of the performance problems is something in the DogStatsdSink. We don't observe the same issues when using Prometheus scraping, or the OpenTelemetry sink.

@DukeyToo added the bug and triage labels Dec 27, 2024
@DukeyToo (Author)

Some additional info...

Full Envoy config YAML used during testing, including the DogStatsdSink

(there is some envsubst done beforehand to set variables)

node:
  id: allnodes
  cluster: $NODE_CLUSTER
admin:
  access_log:
    name: envoy.access_loggers.file
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
      path: /dev/null
  address:
    socket_address: { "address": "0.0.0.0", "port_value": 9090 }
dynamic_resources:
  lds_config:
    resource_api_version: V3
    api_config_source:
      api_type: DELTA_GRPC
      transport_api_version: V3
      grpc_services:
        envoy_grpc:
          cluster_name: xds_cluster
  cds_config:
    resource_api_version: V3
    api_config_source:
      api_type: DELTA_GRPC
      transport_api_version: V3
      grpc_services:
        envoy_grpc:
          cluster_name: xds_cluster
static_resources:
  clusters:
  - name: xds_cluster
    alt_stat_name: cluster=envoy-xds;server_group=envoy-xds-v001;service=envoy-xds/envoy-xds-v001;application=envoy;
    connect_timeout: 0.25s
    type: STATIC
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    circuit_breakers:
      thresholds:
        - priority: high
          max_connections: 100000
          max_pending_requests: 100000
          max_requests: 60000000
          max_retries: 50
        - priority: default
          max_connections: 100000
          max_pending_requests: 100000
          max_requests: 60000000
          max_retries: 50
    upstream_connection_options:
      # https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/address.proto#envoy-api-msg-core-tcpkeepalive
      tcp_keepalive:
        keepalive_probes: 3
        keepalive_time: 30
        keepalive_interval: 5
    load_assignment:
      cluster_name: xds_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: $CONTROL_PLANE_IP
                port_value: 5678
  - name: logs
    alt_stat_name: cluster=envoy-logs;server_group=envoy-logs-v001;service=envoy-logs/envoy-logs-v001;application=envoy;
    connect_timeout: 5s
    type: STATIC
    lb_policy: ROUND_ROBIN
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http2_protocol_options: {}
    load_assignment:
      cluster_name: logs
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: $ALS_SERVER_IP
                port_value: 8080
               
watchdogs:
  main_thread_watchdog: 
    miss_timeout: 0.5s
    megamiss_timeout: 2s
  worker_watchdog:
    miss_timeout: 0.05s
    megamiss_timeout: 0.5s

stats_sinks:
  - name: envoy.stat_sinks.open_telemetry
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.stat_sinks.open_telemetry.v3.SinkConfig
      grpc_service:
        google_grpc:
          target_uri: "127.0.0.1:4317"
          stat_prefix: envoy_grpc_otel
      report_counters_as_deltas: false
      report_histograms_as_deltas: true
      emit_tags_as_attributes: true
      use_tag_extracted_name: true
      prefix: envoy_otel
  - name: envoy.stat_sinks.dog_statsd
    typed_config:
      "@type": type.googleapis.com/envoy.config.metrics.v3.DogStatsdSink
      address:
        socket_address:
          address: $STATSD_SERVICE_IP
          port_value: 8125
      prefix: envoy

stats_config:
  stats_tags:
    - tag_name: service
      regex: ^cluster\..*?(service=(.+?);).*$
    - tag_name: server_group
      regex: ^cluster\..*?(server_group=(.+?);).*$
    - tag_name: application
      regex: ^cluster\..*?(application=(.+?);).*$
    - tag_name: cluster
      regex: ^cluster\..*?(cluster=(.+?);).*
    - tag_name: host
      fixed_value: $POD_NAME
    - tag_name: lb_type
      fixed_value: envoy
    - tag_name: traffic_source
      fixed_value: $TRAFFIC_SOURCE

Envoy startup script looks like:

envoy -c /etc/envoy/config/envoy.yaml \
  --restart-epoch $RESTART_EPOCH \
  --concurrency $CONCURRENCY \
  --service-zone $zone \
  --service-cluster $NODE_CLUSTER \
  --service-node $NODE_ID

@phlax (Member) commented Dec 29, 2024

cc @taiki45 @jmarantz

@phlax added the area/datadog label and removed the triage label Dec 29, 2024
@jmarantz (Contributor)

One possible cause of lower performance with a higher number of cores is mutex contention. For the DataDog devs that might be a good place to look.

One particular place where mutex contention can arise is with the symbolization of string-based stat names. See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats-recentlookups for more discussion, and for a mechanism built into Envoy to help detect that case. This is just a guess, though, and there could be mutexes unrelated to stat names.

https://github.com/envoyproxy/envoy/blob/fd618c36c5c821ebd5697e322a3c842bb5eb1faf/source/docs/stats.md#performance-and-thread-local-storage is also worth a quick read.

@jmarantz (Contributor)

I was trying to find the origins of this sink, and it dates back 7 years: #2158

There has been a lot of DataDog-related work since then but I think it's in the tracer.

@jmarantz (Contributor) commented Dec 29, 2024

@dgoffredo may be able to point us to someone currently working on Envoy at DataDog who could help look for points of contention within its sink library

@dmehala (Contributor) commented Dec 30, 2024

Hi @jmarantz

The tracer is not responsible for sending metrics. The DogStatsD implementation is simply the StatsD-over-UDP sink owned by Envoy.

@jmarantz (Contributor)

Yeah, I figured that out. Anyway, I thought someone from Datadog could drive the investigation and remedy, per my suggestions above. It might be contention related to the UDP sink, but fixing that would need to be driven from a use case like Datadog's.

@DukeyToo (Author)

Reading what @jmarantz said about mutex contention, I tried some of the things he mentioned:

recentlookups

See https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats-recentlookups

No extra lookups noted; things seem about the same with or without the DogStatsdSink. Around 40-50 lookups per second, which is more than I expected, but by my understanding not enough to have a noticeable impact (since each one is at most ~100ns). The lookups seem to be a combination of metric names, label/attribute keys, and label/attribute values.

contention

Running the perf test with the DogStatsdSink and Envoy started with the --enable-mutex-tracing flag, we see very large contention numbers once the test starts, e.g. on one of the Envoy instances with the DogStatsdSink configured:

curl http://127.0.0.1:9090/contention
{
 "num_contentions": "18272760",
 "current_wait_cycles": "44097",
 "lifetime_wait_cycles": "1932193659414"
}

Before the test started, numbers were much lower, like:

curl http://127.0.0.1:9090/contention
{
 "num_contentions": "1156",
 "current_wait_cycles": "2100",
 "lifetime_wait_cycles": "18656083"
}

Without the DogStatsdSink, we see numbers like this even at peak traffic to an Envoy instance, and they are fairly stable over time:

$ curl http://127.0.0.1:9090/contention
{
 "num_contentions": "65",
 "current_wait_cycles": "14573",
 "lifetime_wait_cycles": "4377304"
}

These much lower and more stable numbers lead me to conclude that mutex contention is a primary cause of the slowness, somehow introduced by the DogStatsdSink, but probably not contention over the string lookups for stat names.

@jmarantz (Contributor)

This makes sense; stat name lookups are just one form of contention that we've seen in the past, so we have special instrumentation for it.

I guess we need to examine the mutexes taken during the DogStatsd sink flow, which is probably mostly the same as the UDP statsd sink flow.

@jmarantz (Contributor)

I have another thought -- I think it's worth looking into why histograms are flushed to the stats sink on every change, whereas gauges/counters are flushed every 5s (typically). I think this might date back to before we had a mechanism to collate the histogram data collected across threads. Consider this call stack:

ParentHistogramImpl::recordValue(uint64_t value)
  -> sink.onHistogramComplete(histogram, value)
     -> UdpStatsdSink::onHistogramComplete(const Stats::Histogram& histogram, uint64_t value)
        -> switch (tag_format_.tag_position) { ... }

Internally we found it made things faster to make our stats sink's onHistogramComplete (bad name: should be onHistogramChange) method completely empty, and only write out histogram contents on flush. I don't know whether this would work for statsd, though; it might need every value recorded. In that case, I think keeping a map to the stringified stat names, held in UdpStatsdSink, would help. It's a little tricky to keep such a map: you'd want to avoid leaking memory if stats keep getting created and destroyed, so it might need to be cleared periodically or act as an LRU cache, and onHistogramComplete can be called from any thread, so you need to be really careful about locks and adding more contention.
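
To illustrate the kind of cache described above, here is a rough standalone sketch (hypothetical names throughout, not Envoy code): a mutex-guarded map from an opaque stat id to its stringified name, cleared when it grows too large so it cannot leak.

#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: memoize expensive stat-name stringification so that a
// per-value callback does not rebuild the name every time. Not Envoy code;
// names and sizes are illustrative only.
class StatNameStringCache {
public:
  using Stringify = std::function<std::string(uint64_t stat_id)>;

  explicit StatNameStringCache(Stringify stringify, size_t max_entries = 4096)
      : stringify_(std::move(stringify)), max_entries_(max_entries) {}

  // Called from any worker thread, so the map is guarded by a mutex. This is
  // exactly the trade-off discussed above: the cache removes repeated string
  // building but introduces its own (hopefully short) critical section.
  std::string lookup(uint64_t stat_id) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(stat_id);
    if (it != cache_.end()) {
      return it->second;
    }
    // Crude memory bound: clear everything instead of maintaining a real LRU.
    if (cache_.size() >= max_entries_) {
      cache_.clear();
    }
    return cache_.emplace(stat_id, stringify_(stat_id)).first->second;
  }

private:
  std::mutex mutex_;
  std::unordered_map<uint64_t, std::string> cache_;
  Stringify stringify_;
  const size_t max_entries_;
};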

I think the /stats/recentlookups endpoint might be checking for when you have a string and need a StatName (recent_lookups_.lookup(name)), whereas UdpStatsdSink::buildMessage is doing the opposite: starting with a StatName and getting a string out. That's not recorded in the "recent_lookups_" table.

@jmarantz (Contributor) commented Jan 1, 2025

Thinking about this more, the difficulty of maintaining that map correctly without leaking or adding contention seems daunting.

It would be highly preferable if we could just access the histogram's current state on flush, like we do for counters or gauges.

If that is not possible, we could have the histogram updates post to the main thread so that the name lookup is not contentious, though there might be contention on the main thread's dispatcher mutex; you might just be moving the contention around. It might be a little better, because that operation is only appending a closure to a list.
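
As a rough standalone sketch of that idea (hypothetical, not Envoy's Event::Dispatcher), posting work to the main thread only needs to append a closure under a short lock; the expensive name translation then runs on a single thread.

#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of "post to the main thread": workers only append a
// closure under a short lock; the main thread drains and runs the closures,
// so the expensive name lookup happens on one thread only.
class MainThreadQueue {
public:
  void post(std::function<void()> work) {
    std::lock_guard<std::mutex> lock(mutex_); // contention is limited to a push_back
    pending_.push_back(std::move(work));
  }

  // Called periodically from the main thread's event loop.
  void drain() {
    std::vector<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      batch.swap(pending_);
    }
    for (auto& work : batch) {
      work();
    }
  }

private:
  std::mutex mutex_;
  std::vector<std::function<void()>> pending_;
};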

@sschepens (Contributor)

We've had performance issues with StatsD sinks as well: they flush histograms on every call, which generates lots of UDP traffic and in turn high CPU usage in StatsD.

Fairly recently, Datadog added support for StatsD client-side aggregation, which means you can send a metric with multiple data points, though this adds a lot of complexity on the client side (the sink). It could also help on the sink side to buffer metrics into as large a packet as the MTU supports and send multiple metrics together, as sketched below.
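
A rough standalone sketch of that packet-batching idea (hypothetical names; the ~1400-byte payload budget is an assumption, not a measured MTU): append newline-separated statsd lines and flush whenever the next line would overflow the datagram.

#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch: coalesce multiple statsd lines into one UDP datagram,
// flushing whenever the next line would push the payload past the MTU budget.
class StatsdBatcher {
public:
  using SendDatagram = std::function<void(const std::string& payload)>;

  explicit StatsdBatcher(SendDatagram send, size_t max_payload_bytes = 1400)
      : send_(std::move(send)), max_payload_bytes_(max_payload_bytes) {}

  // line is a single statsd metric, e.g. "envoy.foo:123|ms" (no trailing '\n').
  void add(const std::string& line) {
    const size_t needed = line.size() + (buffer_.empty() ? 0 : 1); // +1 for '\n'
    if (!buffer_.empty() && buffer_.size() + needed > max_payload_bytes_) {
      flush();
    }
    if (!buffer_.empty()) {
      buffer_ += '\n';
    }
    buffer_ += line;
  }

  void flush() {
    if (!buffer_.empty()) {
      send_(buffer_); // one datagram carrying many metrics
      buffer_.clear();
    }
  }

private:
  SendDatagram send_;
  const size_t max_payload_bytes_;
  std::string buffer_;
};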

But IMHO this sink should probably be deprecated, and Datadog users should use Prometheus scraping, which is also supported and offers Distribution metric types that are much better than standard gauge-based histograms.

@jmarantz (Contributor) commented Jan 1, 2025

I think it would make sense to batch updates. I don't know a lot about histograms in statsd; I just had a quick look at https://github.com/statsd/statsd/blob/master/docs/metric_types.md and I guess the concept is that the histogram aggregation is done in the statsd process, whereas Envoy is also doing it. It would be preferable to send aggregated results to statsd every 5s instead of sending every data point. This is how we solved this internally.

However, we can definitely improve things a lot by buffering the data points and sending them as a batch, either during the flush or when the buffer fills. We would only need to translate the StatNames to strings when emitting the buffered data, and could use a locally scoped map to reduce the number of translations needed.
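
A rough standalone sketch of that aggregation approach (hypothetical, not the actual sink code): accumulate the values recorded for each stat between flushes, then emit a single summary per stat on the periodic flush instead of one datagram per sample.

#include <algorithm>
#include <cstdint>
#include <limits>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: aggregate histogram samples per stat name between
// flushes and emit one summary per stat on flush, rather than one UDP write
// per recorded value.
class HistogramAggregator {
public:
  struct Summary {
    uint64_t count = 0;
    uint64_t sum = 0;
    uint64_t min = std::numeric_limits<uint64_t>::max();
    uint64_t max = 0;
  };

  // Called on every recorded value, possibly from many worker threads.
  void record(const std::string& name, uint64_t value) {
    std::lock_guard<std::mutex> lock(mutex_);
    Summary& s = summaries_[name];
    s.count++;
    s.sum += value;
    s.min = std::min(s.min, value);
    s.max = std::max(s.max, value);
  }

  // Called from the periodic flush; returns and resets the accumulated data.
  std::unordered_map<std::string, Summary> takeSnapshot() {
    std::unordered_map<std::string, Summary> out;
    std::lock_guard<std::mutex> lock(mutex_);
    out.swap(summaries_);
    return out;
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, Summary> summaries_;
};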

@jmarantz (Contributor) commented Jan 2, 2025

I'm going to try to repro with a benchmark but my compilation setup got stale.

@jmarantz (Contributor) commented Jan 3, 2025

I have a benchmark to repro, as well as a very basic attempt to alleviate the bottleneck.

#37859

I'm not sure if this is worth pursuing though, per @sschepens' comment that the statsd sink should be deprecated in favor of Prometheus scraping.

I would be curious to hear the thoughts of @DukeyToo about whether that's an appropriate solution.
