Lua HTTP Filter - httpCall bottleneck during burst of traffic #37796
Comments
Hey @laurodd, seems to me that you are running into circuit breakers tripping. Your workaround of adding additional clusters adds additional circuit breakers, thus "working around" the issue. See https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/circuit_breaker.proto#config-cluster-v3-circuitbreakers-thresholds for configuring circuit breakers, you likely need to tune You can validate that this was the issue by checking whether the circuit breaker stats are tripping for the cluster.
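For readers following along, a minimal sketch of what raising the per-cluster thresholds could look like. The cluster name and the numbers are assumptions chosen for illustration, not values from this thread. The field names come from the linked circuit breaker thresholds reference, and the defaults noted in the comments are the documented ones.

```yaml
# Illustrative fragment only: cluster name and limits are assumptions.
clusters:
- name: bot_protection_api          # hypothetical cluster name
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 20000        # documented default: 1024
      max_pending_requests: 20000   # documented default: 1024
      max_requests: 50000           # documented default: 1024
      track_remaining: true         # also publish remaining_* gauges
```

With `track_remaining` enabled, the `cluster.<name>.circuit_breakers.default.remaining_cx` and `remaining_pending` gauges should show how close the cluster is to each limit, alongside the `cx_open` and `rq_pending_open` gauges and the `upstream_rq_pending_overflow` counter.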
thanks a lot for your message, @KBaichoo
Maybe we are not configuring the pool as we should, since it is overflowing (upstream_rq_pending_overflow)?
P.S.: the API I am connecting to uses HTTP/1.1
Sorry for the delay.
Yes, your instincts are correct here, e.g. the overflow from circuit breakers tripping as causing some of the I think you're still tripping the circuit breaker for Try effectively disabling circuit breakers as described in https://www.envoyproxy.io/docs/envoy/latest/faq/load_balancing/disable_circuit_breaking#faq-disable-circuit-breaking and the overflow should go away.
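As a sketch of what "effectively disabling" means in practice, the linked FAQ suggests setting every threshold to a very large value for both priorities. The cluster name below is an assumption; the rest mirrors the approach described in that FAQ rather than any configuration posted in this thread.

```yaml
# Illustrative fragment only: the cluster name is hypothetical.
clusters:
- name: bot_protection_api
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 1000000000
      max_pending_requests: 1000000000
      max_requests: 1000000000
      max_retries: 1000000000
    - priority: HIGH
      max_connections: 1000000000
      max_pending_requests: 1000000000
      max_requests: 1000000000
      max_retries: 1000000000
```

This is mainly useful as a diagnostic step: if the overflow counters stop increasing with these limits in place, the circuit breakers were the bottleneck and can then be tuned to realistic values.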
Hello @KBaichoo, thanks a lot! After some investigation, we were able to identify the root cause of why our Out of curiosity, we tried to use a weird variable name such as
thanks again
circuit_breakers uses the default values, not the ones below, and there is no warning/error message
circuit_breakers work as expected
Yeah, there are a few conversions that happen to the config: JSON -> .. -> Protobuf. The Protobuf warnings about unknown fields are common, as it needs to solve the compatibility problem of receiving a newer protobuf message while the running instance only has older protobuf code. Something might be getting lost in translation, but happy to see the core issue is resolved.
Title: Lua HTTP Filter - httpCall bottleneck during burst of traffic
Description:
We are currently facing bursts of traffic (from 3K RPS up to 50K RPS).
We use the Lua HTTP filter to call a bot protection API for each request, and we are getting a high number of 503s from the upstream when calling the cluster.
We tried to tune the envoy upstream cluster via configuration to deal with this volume of connections to the upstream, without success.
However, we found that having multiple clusters with the same configuration (only adding a number at the end of the name) and calling them "randomly" from the Lua file improved performance considerably and avoided a significant number of 503s.
We would like to ask for your help to identify whether we are missing a configuration, whether we did something wrong, or whether what we did makes sense from your point of view.
Below we share the relevant information; if you need anything else from us, let us know.
Thanks!
Context
Our Envoy setup has a Lua HTTP filter that does an httpCall to an external API (bot detection) to decide whether to block the incoming request before it reaches the origin
When the burst happens, our configuration seems overwhelmed by the number of requests and starts to send 503s
We increased the timeouts (on the cluster and on the httpCall in the Lua filter); this improved the situation, but it did not fully deal with the burst
We noticed that vertical scaling also helps, but it did not solve the problem either
Reproduction
We created a dedicated machine to launch the requests with 10K connections:
In our case, we can see that our Envoy server (16 vCPU, 64 GB) reaches 90+% CPU usage and the stats show a lot of 503s: 1797585 out of 2071053 (86.8%)
Investigation
As previously said, increasing the timeout and vertically scaling helped, but did not fully resolve the situation
During the reproduction, we noticed that our Envoy server was receiving 10K connections from the wrk server, but the number of connections to the API was not able to grow beyond a certain level, which for us was causing the issue:
We used ss to monitor that:
Workaround
We noticed that Envoy uses a worker for each vCPU and it deals with the requests based on that
So we had the idea to replicate the clusters and have the httpCall in the Lua code spread requests across them in a "round robin" fashion (in this case we used math.random)
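The original Lua file is not reproduced here, so the following is only a rough sketch of the workaround as described: several identically configured clusters, with one picked at random per request. The cluster names, cluster count, path, authority, timeout and the block-on-403 rule are all assumptions made for illustration.

```yaml
# Sketch only: names, count, path, authority and timeout are hypothetical.
http_filters:
- name: envoy.filters.http.lua
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
    default_source_code:
      inline_string: |
        -- Number of identical clusters defined elsewhere in the config
        -- as bot_api_1 .. bot_api_N (hypothetical names).
        local CLUSTER_COUNT = 4

        function envoy_on_request(request_handle)
          -- math.random(n) returns an integer in the range [1, n].
          local cluster = "bot_api_" .. math.random(CLUSTER_COUNT)
          local headers, body = request_handle:httpCall(
            cluster,
            {
              [":method"] = "POST",
              [":path"] = "/check",                   -- hypothetical path
              [":authority"] = "bot-api.example.com"  -- hypothetical host
            },
            "",     -- request body
            1000)   -- timeout in milliseconds
          -- Assumed convention: the API answers 403 for requests to block.
          if headers[":status"] == "403" then
            request_handle:respond({[":status"] = "403"}, "blocked")
          end
        end
- name: envoy.filters.http.router
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

Because each cluster carries its own circuit breaker limits, spreading calls over N clusters multiplies the effective limit by roughly N, which is consistent with the explanation given in the comments.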
Other information:
We took a look at the source code (lua_filter.cc) and it seems we have a thread_local_cluster for each cluster; maybe this scales better than having just one cluster deal with everything?
WRK server
Envoy Server
Envoy Configuration
Lua file