Title: ActiveTcpClient destructor deletes the wrong connection pool
Description:
We're seeing cases where an upstream connection is closed locally for no reason. If retries are configured, then every new connection is also immediately closed until max retries is reached.
The issue happens when a connection is being closed at the same time as a new connection to the same cluster is being established. If these two events are processed in the same call to event_base_loop, then the subsequent clearing of the deferred deletion list will destroy the wrong connection pool (the new one instead of the old one). If retries are configured, then yet another new pool and connection are created, but these are again destroyed immediately. This then continues for every retry.
Here's what happens:
1. The ActiveTcpClient for the upstream connection is closed and added to the deferred deletion list.
2. The connection pool is removed from the pool map.
3. A new connection pool is created for the same cluster (and with the same hash_key), since the previous one is no longer in the map.
4. A new upstream connection is created in the new connection pool.
5. The deferred deletion list is then cleared, which invokes the old ActiveTcpClient's destructor.
6. The destructor invokes the old pool's idle callbacks.
7. This is where things go wrong: the idle callback erroneously erases the new pool, because tcpConnPoolIsIdle() finds and erases the pool by hash_key from the pools map, and that map now contains the new pool (the old one was already removed in step 2). A minimal sketch of this sequence follows below.
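To make the sequence concrete, here is a minimal, self-contained C++ sketch. The types below (ConnPool, ActiveTcpClient, the pools map and the deferred deletion list) are simplified stand-ins for illustration, not Envoy's actual classes; the point is only that an idle callback which erases by hash_key removes whatever pool currently sits under that key:

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins (not Envoy's real classes) for the objects involved.
struct ConnPool {
  explicit ConnPool(int id) : id(id) {}
  ~ConnPool() { std::printf("pool %d destroyed\n", id); }
  int id;
};

struct ActiveTcpClient {
  // In the real code the destructor ends up invoking the owning pool's idle
  // callbacks; here that path is modeled as a plain std::function.
  std::function<void()> on_destroy;
  ~ActiveTcpClient() { if (on_destroy) on_destroy(); }
};

int main() {
  std::map<std::string, std::unique_ptr<ConnPool>> pools;           // stand-in for the pools map keyed by hash_key
  std::vector<std::unique_ptr<ActiveTcpClient>> deferred_deletion;  // stand-in for the deferred deletion list
  const std::string hash_key = "cluster_a";

  // The idle callback erases whatever pool is currently stored under hash_key.
  auto tcpConnPoolIsIdle = [&pools, hash_key]() { pools.erase(hash_key); };

  // Steps 1-2: the old connection closes; its client is deferred and the old
  // pool is removed from the map.
  pools[hash_key] = std::make_unique<ConnPool>(/*id=*/1);
  auto old_pool = std::move(pools[hash_key]);
  pools.erase(hash_key);
  auto client = std::make_unique<ActiveTcpClient>();
  client->on_destroy = tcpConnPoolIsIdle;  // destructor will run the idle callback
  deferred_deletion.push_back(std::move(client));

  // Steps 3-4: a new pool is created for the same hash_key before the deferred
  // deletion list is cleared.
  pools[hash_key] = std::make_unique<ConnPool>(/*id=*/2);

  // Steps 5-7: clearing the deferred list runs the old client's destructor; its
  // idle callback looks the pool up by hash_key and erases the *new* pool (id 2).
  deferred_deletion.clear();
  std::printf("pools left under %s: %zu\n", hash_key.c_str(), pools.count(hash_key));  // prints 0
  return 0;
}
```

Running this prints "pool 2 destroyed" before "pool 1 destroyed", i.e. the pool created for the new connection is the one that gets torn down, which matches the behavior described above.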
Repro steps:
We haven't been able to reproduce this ourselves, but our customer is seeing it in their test cluster. However, as explained earlier, this happens when a new connection arrives at the same time as a previous connection is being closed (perhaps a key factor is also that the old upstream connection is being drained). To reproduce this, you must ensure that the following two events are processed in the same invocation of event_base_loop():
- a connection close event that causes the connection pool to become idle and therefore erased
- a new connection event that causes a new connection pool to be created (with the same hash_key)
After these two events are processed, the ThreadLocalClusterManagerImpl will contain the new connection pool in its pools_ map instead of the old one; at the same time, the deferred deletion list will contain the ActiveTcpClient of the previous connection. When this ActiveTcpClient's destructor runs, it calls isIdleImpl() on the old connection pool and then invokes the idle callbacks on that same (old) pool. However, the idle callback itself (tcpConnPoolIsIdle) deletes the new connection pool.
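To illustrate the timing precondition, here is a small libevent sketch (again just an illustration, not Envoy code; the event names are made up) showing that two events made active before the loop runs are both dispatched within a single call to event_base_loop():

```cpp
#include <event2/event.h>
#include <cstdio>

// Both callbacks run inside the same event_base_loop() call below, which is the
// interleaving the repro needs: the close of the old connection and the arrival
// of the new one handled in one loop iteration.
static void on_close(evutil_socket_t, short, void*) { std::printf("old connection closed, pool removed\n"); }
static void on_new_conn(evutil_socket_t, short, void*) { std::printf("new connection, new pool created\n"); }

int main() {
  event_base* base = event_base_new();
  event* close_ev = event_new(base, -1, 0, on_close, nullptr);
  event* new_conn_ev = event_new(base, -1, 0, on_new_conn, nullptr);

  // Make both events active before entering the loop.
  event_active(close_ev, 0, 0);
  event_active(new_conn_ev, 0, 0);

  // A single EVLOOP_ONCE pass dispatches both active events.
  event_base_loop(base, EVLOOP_ONCE);

  event_free(close_ev);
  event_free(new_conn_ev);
  event_base_free(base);
  return 0;
}
```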
Note: The Envoy_collect tool gathers a tarball with debug logs, config and the following admin endpoints: /stats, /clusters and /server_info. Please note: if there are privacy concerns, sanitize the data prior to sharing the tarball/pasting.
Admin and Stats Output:
Include the admin output for the following endpoints: /stats, /clusters, /routes, /server_info. For more information, refer to the admin endpoint documentation. Note: if there are privacy concerns, sanitize the data prior to sharing.
Config:
(this is just the relevant part of the config)
Logs:
Notes: