
Lua HTTP Filter - httpCall bottleneck during burst of traffic #37796

Open
laurodd opened this issue Dec 23, 2024 · 5 comments
Labels
area/circuit_breaker · area/lua · question (Questions that are neither investigations, bugs, nor enhancements)

Comments


laurodd commented Dec 23, 2024

Title: Lua HTTP Filter - httpCall bottleneck during burst of traffic

Description:
We are currently facing a burst of traffic (from 3k RPS up to 50k RPS).

We use the Lua HTTP filter to call a bot protection API for each request, and we are getting a high number of 503s from the upstream when calling the cluster.

We tried to tune the Envoy upstream cluster configuration to handle this volume of connections to the upstream, without success.

However, we found that having multiple clusters with the same configuration (only adding a number at the end of the name) and calling them "randomly" from the Lua script improved performance considerably and avoided a substantial number of 503s.

  local clusterName = "apicluster" .. math.random(1,8)

  local headers, response_body = request_handle:httpCall(
    clusterName,
    request_header,
    payload_body, API_TIMEOUT
  )

We would like to ask for your help to identify whether we are missing a configuration option, whether we did something wrong, or whether what we did makes sense from your point of view.

Below we share the relevant information; if you need anything else from us, let us know.

Thanks!

Context

  • Our Envoy setup has a Lua HTTP filter that performs an httpCall to an external bot detection API to decide whether to block the incoming request before it reaches the origin

  • When the burst happens, our configuration appears overwhelmed by the number of requests and starts sending 503s

  • We increased the timeouts (on the cluster and on the httpCall in the Lua filter); this improved the situation but did not fully absorb the burst

  • We noticed that vertical scaling also helps, but it did not solve the problem either

Reproduction

We created a dedicated machine to launch the requests with 10K connections:

  • wrk (apt install wrk)
  • ulimit -n 65535
  • wrk -c 10000 -t 4 -d 60s http://{ENVOY_IP}:${ENVOY_PORT}

The relevant cluster configuration (excerpt):

  clusters:
  - name: apicluster
    connect_timeout: 0.75s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: apicluster

In our case, we can see that our Envoy server (16 vCPU, 64 GB) reaches 90+% CPU usage and the stats show a lot of 503s: 1797585 out of 2071053 (86.79%)

  wrk -t 4 -c 10000 -d 60s
  Running 1m test
  4 threads and 10000 connections
  2071053 requests in 1.00m, 1.16GB read
cluster.apicluster.upstream_rq_403: 281907
cluster.apicluster.upstream_rq_4xx: 281907
cluster.apicluster.upstream_rq_503: 1797585
cluster.apicluster.upstream_rq_504: 1209
cluster.apicluster.upstream_rq_5xx: 1798794

Investigation

  • As said above, increasing the timeouts and scaling vertically helped, but did not fully resolve the situation

  • During the reproduction, we noticed that our Envoy server was receiving the 10K connections from the wrk server, but the number of connections to the API could not grow beyond a certain level, which we believe was causing the issue

  • We used ss to monitor this:


ss -t -a | grep ESTAB | grep ${WRK_SERVER_IP} | wc -l
10000

ss -t -a | grep ESTAB | grep ${API_SERVER_IPS} | wc -l
3092


  • As previously mentioned, we tried multiple configurations to increase how many connections we could open to the API, without much success

Workaround

  • We noticed that Envoy spawns one worker thread per vCPU and distributes the requests across those workers

  • So we had the idea to replicate the cluster and, in the Lua code, have the httpCall do a "round robin" across the copies (in this case via math.random)

  local clusterName = "apicluster" .. math.random(1,8)

  local headers, response_body = request_handle:httpCall(
    clusterName,
    request_header,
    payload_body, API_TIMEOUT
  )
  clusters:
  - name: apicluster1
    connect_timeout: 0.75s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: apicluster1
...

  clusters:
  - name: apicluster2
    connect_timeout: 0.75s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: apicluster2
...
  • After that, the 503 errors are almost gone and we are able to ingest and handle the 10K connections: 1006 / 1587717 (0.06% of 503s)
wrk -t 4 -c 10000 -d 60s 

Running 1m test
  4 threads and 10000 connections
  1587717 requests in 1.00m, 3.04GB read
cluster.apicluster1.upstream_rq_503: 81
cluster.apicluster2.upstream_rq_503: 97
cluster.apicluster3.upstream_rq_503: 160
cluster.apicluster4.upstream_rq_503: 89
cluster.apicluster5.upstream_rq_503: 68
cluster.apicluster6.upstream_rq_503: 242
cluster.apicluster7.upstream_rq_503: 187
cluster.apicluster8.upstream_rq_503: 82
ss -t -a | grep ESTAB | grep ${WRK_SERVER_IP}  | wc -l
10000
ss -t -a | grep ESTAB | grep ${API_SERVER_IPS} | wc -l
9016

Other information:

We took a look at the source code (lua_filter.cc) and it seems there is a thread_local_cluster for each cluster; maybe this is why it scales better than having just one cluster deal with everything?

const auto thread_local_cluster = filter.clusterManager().getThreadLocalCluster(cluster);

WRK server

  • Ubuntu 24.04
  • 4 vCPU 16G
  • ulimit -n 65535

Envoy Server

  • EC2 m4.4xlarge 16 vCPU 64 GB
  • Docker version 27.1.2, build d01f264
  • docker run -dit --name envoy-container --network "host" -p 9901:9901
  • Envoy image: v1.31-latest
  • Envoy version: 688c4bb/1.31.5/Clean/RELEASE/BoringSSL
  • Debian GNU/Linux 12 (bookworm)

Envoy Configuration

static_resources:
  listeners:
  - name: main
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8080
    filter_chains:
    - filters:
      - name: envoy.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: auto
          use_remote_address: true
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: web_service
          http_filters:
          - name: envoy.lua
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
              inline_code: |
                  assert(loadfile("/apicluster.lua"))({})
          - name: envoy.router
            typed_config:
              "@type": [type.googleapis.com/envoy.extensions.filters.http.router.v3.Router](http://type.googleapis.com/envoy.extensions.filters.http.router.v3.Router)

  clusters:
  - name: apicluster
    connect_timeout: 0.75s
    type: strict_dns
    lb_policy: round_robin
    load_assignment:
      cluster_name: apicluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: api.example.co
                port_value: 443
    circuit_breakers:
      thresholds:
        - max_connections: 10000


Lua file

-- we do some header manipulation and ...


function envoy_on_request(request_handle)
  local headers = request_handle:headers()

  -- some code

  local clusterName = "apicluster" .. math.random(1,8)

  local headers, response_body = request_handle:httpCall(
    clusterName,
    request_header,
    payload_body, API_TIMEOUT
  )

-- some more code
end

laurodd added the triage (Issue requires triage) label on Dec 23, 2024

KBaichoo commented Dec 24, 2024

Hey @laurodd ,

It seems to me that you are running into circuit breakers tripping. Your workaround of adding additional clusters adds additional circuit breakers, thus "working around" the issue.

See https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds for configuring circuit breakers; you likely need to tune max_requests and max_pending_requests.
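
For illustration, one way to raise these limits is a single thresholds entry on the existing cluster; the values below are only placeholders to tune for your traffic, not a recommendation:

    circuit_breakers:
      thresholds:
        # single DEFAULT-priority entry holding all the limits
        - max_connections: 20000
          max_requests: 20000
          max_pending_requests: 20000

Each entry in thresholds applies to one routing priority (DEFAULT if unset).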

You can validate that this was the issue by checking whether the circuit breaker stats for the cluster show trips:
https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats#circuit-breakers-statistics

KBaichoo added the question, area/lua and area/circuit_breaker labels and removed the triage label on Dec 24, 2024

laurodd commented Dec 26, 2024

Thanks a lot for your message, @KBaichoo.

  • The first approach we tried was exactly that: we played with those values (max_connections, max_requests, max_pending_requests) to see whether they had any impact on the 503s we were getting, with no luck.

  • I relaunched the tests and, checking the circuit breaker stats, everything is at zero (however, there is a correlation between cluster.apicluster.upstream_rq_503 and cluster.apicluster.upstream_rq_pending_overflow).

  • Even when increasing max_connections, max_requests and max_pending_requests, the 503s stay the same (and the number of connections to the API does not change, i.e., around 3-4k connections)

  • When decreasing the values (to 1k), we also get 503s and now we only see 1k connections to the API.

Maybe we are not configuring the pool as we should, since it is overflowing (upstream_rq_pending_overflow)?
Thanks a lot for your help.

# example of the values we changed
    circuit_breakers:
      thresholds:
        - max_connections: 10000
        - max_requests: 10000
        - max_pending_requests: 10000
        # max_connection_pools is unlimited by default, we did not change it
# example of the wrk command we are launching
wrk -t 4 -c 10000 -d 60s
Running 1m test
  4 threads and 10000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   299.17ms  192.17ms   2.00s    88.55%
    Req/Sec     8.89k     1.53k   13.70k    70.45%
  2103842 requests in 1.00m, 1.10GB read
Requests/sec:  35007.46
Transfer/sec:     18.77MB
# circuit breakers stats
cluster.apicluster.circuit_breakers.default.cx_open: 0
cluster.apicluster.circuit_breakers.default.cx_pool_open: 0
cluster.apicluster.circuit_breakers.default.rq_open: 0
cluster.apicluster.circuit_breakers.default.rq_pending_open: 0
cluster.apicluster.circuit_breakers.default.rq_retry_open: 0
cluster.apicluster.circuit_breakers.high.cx_open: 0
cluster.apicluster.circuit_breakers.high.cx_pool_open: 0
cluster.apicluster.circuit_breakers.high.rq_open: 0
cluster.apicluster.circuit_breakers.high.rq_pending_open: 0
cluster.apicluster.circuit_breakers.high.rq_retry_open: 0

#upstream stats
cluster.apicluster.upstream_rq_403: 261869
cluster.apicluster.upstream_rq_4xx: 261869
cluster.apicluster.upstream_rq_503: 1849610
cluster.apicluster.upstream_rq_504: 2117
cluster.apicluster.upstream_rq_5xx: 1851727
cluster.apicluster.upstream_rq_pending_overflow: 1849596

P.S.: the API we are connecting to uses HTTP/1.1


KBaichoo commented Jan 6, 2025

Sorry for the delay.

Maybe we are not configuring the pool as it should since it is overflowing (upstream_rq_pending_overflow)?

Yes, your instincts are correct here: the overflow from the circuit breakers tripping is causing some of the cluster.apicluster.upstream_rq_5xx.

I think you're still tripping the circuit breaker for max_requests.

Try effectively disabling circuit breakers as described in https://www.envoyproxy.io/docs/envoy/latest/faq/load_balancing/disable_circuit_breaking#faq-disable-circuit-breaking and the overflow should go away.
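
In the spirit of that FAQ, "effectively disabling" means setting very large limits for both priorities; a rough sketch (values illustrative):

    circuit_breakers:
      thresholds:
        # DEFAULT priority
        - priority: DEFAULT
          max_connections: 1000000000
          max_pending_requests: 1000000000
          max_requests: 1000000000
          max_retries: 1000000000
        # HIGH priority
        - priority: HIGH
          max_connections: 1000000000
          max_pending_requests: 1000000000
          max_requests: 1000000000
          max_retries: 1000000000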


laurodd commented Jan 9, 2025

Hello @KBaichoo, thanks a lot!
Just wanted to confirm that, with your suggestion, we were able to get past the problem.

After some investigation, we identified the root cause of why our circuit_breakers configuration did not work: we mistakenly put a '-' before each threshold value and, as a consequence, those values were ignored (with no warning/error message in the logs).

Out of curiosity, we tried using a bogus field name such as foo_max_connection_pools and got the error below.
So it seems to us that:

  • there is some kind of JSON validation of the config; it checks the field names but does not flag the extra '-' characters
  • since the validation passes, Envoy starts
  • however, the values are not loaded (because, at the end of the day, our config is wrong), so circuit_breakers uses the default values and starts tripping

thanks again

Protobuf message (type envoy.config.bootstrap.v3.Bootstrap reason INVALID_ARGUMENT: invalid JSON in envoy.config.bootstrap.v3.Bootstrap @ static_resources.clusters[0].circuit_breakers.thresholds[3]: message envoy.config.cluster.v3.CircuitBreakers.Thresholds, near 1:1343 (offset 1342): no such field: 'foo_max_connection_pools') has unknown fields

circuit_breakers uses the default values, not the ones below, and there is no warning/error message:

    circuit_breakers:
      thresholds:
        - max_connections: 10000
        - max_requests: 10000
        - max_pending_requests: 10000
        - max_connection_pools: 10000

circuit_breakers works as expected:

    circuit_breakers:
      thresholds:
        max_connections: 10000
        max_requests: 10000
        max_pending_requests: 10000
        max_connection_pools: 10000
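
For reference, because thresholds is a repeated field, the same limits can also be expressed as a single list entry (a sketch reusing the values above):

    circuit_breakers:
      thresholds:
        # one list item ('-' only once), all limits as sibling keys
        - max_connections: 10000
          max_requests: 10000
          max_pending_requests: 10000
          max_connection_pools: 10000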


KBaichoo commented Jan 9, 2025

Yeah, there are a few conversions that happen to the config:

JSON -> .. -> Protobuf
Yaml -> ... -> Protobuf

The "..." stands for pieces that I don't recall off the top of my head, e.g. we might convert JSON -> YAML and then to protobuf, or YAML -> JSON and then to protobuf.

Protobuf warnings about unknown fields are common, since protobuf needs to solve the compatibility problem of receiving a newer protobuf message while the running instance only has older protobuf code.

Something might be getting lost in translation, but I'm happy to see the core issue is resolved.
