fix(gossipsub): gracefully disable handler on stream errors #3625
Conversation
@thomaseizinger, will you please comment on the open questions in the first message?
bdf4a47
to
153b4cc
Compare
153b4cc
to
1264345
Compare
Meta-note: Please don't force push, it makes review unnecessarily difficult.
Nice, we are making progress in the right direction here.
ConnectionHandlerEvent::Close
Thanks!
Some more thoughts.
Thanks for pushing this forward!
I did not expect to make harmonization changes. So in the first commits, I kept the surface as minimal as possible.
I figured while we are at it, might as well tidy up the code a bit :)
This is good from my end. I would like @mxinden to also have a look.
Overall change looks good to me. Thanks for the work.
//CC @divagant-martian and @AgeManning to make sure this does not break any assumptions in lighthouse.
…eprecate/gossipsub-close-event
Thank you Max!
Yes, thank you for doing this. I think this makes a lot of sense. I am liking the direction this is going. Eventually, we might be able to add first-class support for "disabling" a handler to the swarm.
Co-authored-by: Thomas Eizinger <[email protected]>
@thomaseizinger can you give this pull request another review?
LGTM. Should we extract the changes around `void` and `ConnectionEvent` into separate PRs?
if event.is_outbound() {
    handler.outbound_substream_establishing = false;
ConnectionEvent::FullyNegotiatedInbound(FullyNegotiatedInbound {
    protocol,
    ..
}) => match protocol {
Could flatten this `match` into the outer one.
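To illustrate the suggestion, here is a minimal sketch of flattening a nested `match` into the outer one. The `Event` and `Protocol` types are hypothetical stand-ins, not the real `ConnectionEvent` or negotiated protocol types from libp2p; only the shape of the pattern is the point.

```rust
/// Hypothetical stand-in for the negotiated protocol output.
pub enum Protocol {
    Stream(u32),
    Unreachable,
}

/// Hypothetical stand-in for `ConnectionEvent`.
pub enum Event {
    Inbound { protocol: Protocol },
    Other,
}

/// Nested form: the inner `match` inspects `protocol` separately.
pub fn nested(event: Event) -> Option<u32> {
    match event {
        Event::Inbound { protocol } => match protocol {
            Protocol::Stream(id) => Some(id),
            Protocol::Unreachable => None,
        },
        Event::Other => None,
    }
}

/// Flattened form: the inner patterns are written directly in the outer arms.
pub fn flattened(event: Event) -> Option<u32> {
    match event {
        Event::Inbound { protocol: Protocol::Stream(id) } => Some(id),
        Event::Inbound { protocol: Protocol::Unreachable } => None,
        Event::Other => None,
    }
}
```

Both functions behave the same; the flattened version just removes one level of indentation.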
@vnermolaev I am sorry that we hijacked your PR like this 🙈 I hope you don't mind. We discovered several issues around handler errors with gossipsub and this PR helped us uncover those. Thanks for kicking it off and I hope it doesn't scare you back from contributing in the future :)
    KeepAlive::Until(handler.last_io_activity + handler.idle_timeout)
}
Handler::Disabled(_) => KeepAlive::No,
I think this way it's possible we don't send the `PeerKind::UnsupportedProtocol` "on time" unless there is some kind of guarantee that `poll` will be called first (?). This event is important since we might want to get rid of (ban, or otherwise prevent connections to) this peer if it doesn't support what we need.
A handler is always completely drained of its events first before we even ask whether or not it should be shut down.
See
rust-libp2p/swarm/src/connection.rs
Lines 216 to 303 in 06ea8ce
loop {
    match requested_substreams.poll_next_unpin(cx) {
        Poll::Ready(Some(Ok(()))) => continue,
        Poll::Ready(Some(Err(info))) => {
            handler.on_connection_event(ConnectionEvent::DialUpgradeError(
                DialUpgradeError {
                    info,
                    error: ConnectionHandlerUpgrErr::Timeout,
                },
            ));
            continue;
        }
        Poll::Ready(None) | Poll::Pending => {}
    }
    // Poll the [`ConnectionHandler`].
    match handler.poll(cx) {
        Poll::Pending => {}
        Poll::Ready(ConnectionHandlerEvent::OutboundSubstreamRequest { protocol }) => {
            let timeout = *protocol.timeout();
            let (upgrade, user_data) = protocol.into_upgrade();
            requested_substreams.push(SubstreamRequested::new(user_data, timeout, upgrade));
            continue; // Poll handler until exhausted.
        }
        Poll::Ready(ConnectionHandlerEvent::Custom(event)) => {
            return Poll::Ready(Ok(Event::Handler(event)));
        }
        Poll::Ready(ConnectionHandlerEvent::Close(err)) => {
            return Poll::Ready(Err(ConnectionError::Handler(err)));
        }
    }
    // In case the [`ConnectionHandler`] can not make any more progress, poll the negotiating outbound streams.
    match negotiating_out.poll_next_unpin(cx) {
        Poll::Pending | Poll::Ready(None) => {}
        Poll::Ready(Some((info, Ok(protocol)))) => {
            handler.on_connection_event(ConnectionEvent::FullyNegotiatedOutbound(
                FullyNegotiatedOutbound { protocol, info },
            ));
            continue;
        }
        Poll::Ready(Some((info, Err(error)))) => {
            handler.on_connection_event(ConnectionEvent::DialUpgradeError(
                DialUpgradeError { info, error },
            ));
            continue;
        }
    }
    // In case both the [`ConnectionHandler`] and the negotiating outbound streams can not
    // make any more progress, poll the negotiating inbound streams.
    match negotiating_in.poll_next_unpin(cx) {
        Poll::Pending | Poll::Ready(None) => {}
        Poll::Ready(Some((info, Ok(protocol)))) => {
            handler.on_connection_event(ConnectionEvent::FullyNegotiatedInbound(
                FullyNegotiatedInbound { protocol, info },
            ));
            continue;
        }
        Poll::Ready(Some((info, Err(error)))) => {
            handler.on_connection_event(ConnectionEvent::ListenUpgradeError(
                ListenUpgradeError { info, error },
            ));
            continue;
        }
    }
    // Ask the handler whether it wants the connection (and the handler itself)
    // to be kept alive, which determines the planned shutdown, if any.
    let keep_alive = handler.connection_keep_alive();
    match (&mut *shutdown, keep_alive) {
        (Shutdown::Later(timer, deadline), KeepAlive::Until(t)) => {
            if *deadline != t {
                *deadline = t;
                if let Some(dur) = deadline.checked_duration_since(Instant::now()) {
                    timer.reset(dur)
                }
            }
        }
        (_, KeepAlive::Until(t)) => {
            if let Some(dur) = t.checked_duration_since(Instant::now()) {
                *shutdown = Shutdown::Later(Delay::new(dur), t)
            }
        }
        (_, KeepAlive::No) => *shutdown = Shutdown::Asap,
        (_, KeepAlive::Yes) => *shutdown = Shutdown::None,
    };
Adding to the above, we first "drain" (i.e. `poll`) the `ConnectionHandler` till it returns `Poll::Pending`:
rust-libp2p/swarm/src/connection.rs
Lines 231 to 247 in 06ea8ce
// Poll the [`ConnectionHandler`].
match handler.poll(cx) {
    Poll::Pending => {}
    Poll::Ready(ConnectionHandlerEvent::OutboundSubstreamRequest { protocol }) => {
        let timeout = *protocol.timeout();
        let (upgrade, user_data) = protocol.into_upgrade();
        requested_substreams.push(SubstreamRequested::new(user_data, timeout, upgrade));
        continue; // Poll handler until exhausted.
    }
    Poll::Ready(ConnectionHandlerEvent::Custom(event)) => {
        return Poll::Ready(Ok(Event::Handler(event)));
    }
    Poll::Ready(ConnectionHandlerEvent::Close(err)) => {
        return Poll::Ready(Err(ConnectionError::Handler(err)));
    }
}
Only then do we check the `connection_keep_alive` status:
rust-libp2p/swarm/src/connection.rs
Lines 284 to 303 in 06ea8ce
// Ask the handler whether it wants the connection (and the handler itself)
// to be kept alive, which determines the planned shutdown, if any.
let keep_alive = handler.connection_keep_alive();
match (&mut *shutdown, keep_alive) {
    (Shutdown::Later(timer, deadline), KeepAlive::Until(t)) => {
        if *deadline != t {
            *deadline = t;
            if let Some(dur) = deadline.checked_duration_since(Instant::now()) {
                timer.reset(dur)
            }
        }
    }
    (_, KeepAlive::Until(t)) => {
        if let Some(dur) = t.checked_duration_since(Instant::now()) {
            *shutdown = Shutdown::Later(Delay::new(dur), t)
        }
    }
    (_, KeepAlive::No) => *shutdown = Shutdown::Asap,
    (_, KeepAlive::Yes) => *shutdown = Shutdown::None,
};
@divagant-martian @AgeManning does either of you mind giving this another review?
@p-shahi with #3765 and #3767 merged into |
LGTM, management of handler state as an enum seems like a great code quality improvement on top, ty @mxinden
@Mergifyio refresh
✅ Pull request refreshed
Approvals have been dismissed because the PR was updated after the `send-it` label was applied.
Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake which is completely isolated from other protocols running on the same connection. Instead of closing the connection, set `KeepAlive::No`. Related: libp2p#3591. Resolves: libp2p#3690. Pull-Request: libp2p#3625.
@mxinden The changelog from this PR may have to get updated since libp2p-gossipsub 0.44.3 didn't include this change before the 0.51.3 patch release.
Description
Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake which is completely isolated from other protocols running on the same connection.
Instead of closing the connection, set `KeepAlive::No`.
Related: #3591.
Resolves: #3690.
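As a rough illustration of the behaviour described above, the sketch below counts upgrade errors and, once a limit is reached, disables the handler and reports `KeepAlive::No` instead of closing the connection. The `Enabled` variant's shape, `DisabledReason`, `on_upgrade_error`, and the `MAX_UPGRADE_ERRORS` value are assumptions made for this example; only `Handler::Disabled(_) => KeepAlive::No` is taken from the diff discussed in this PR.

```rust
/// Hypothetical limit; the threshold used by the real handler may differ.
pub const MAX_UPGRADE_ERRORS: usize = 5;

/// Stand-in for `KeepAlive`; the real enum also has an `Until` variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeepAlive {
    Yes,
    No,
}

/// Hypothetical reason attached to a disabled handler.
#[derive(Debug, PartialEq, Eq)]
pub enum DisabledReason {
    TooManyUpgradeErrors,
}

/// Handler state modelled as an enum, as in the review discussion.
#[derive(Debug, PartialEq, Eq)]
pub enum Handler {
    Enabled { upgrade_errors: usize },
    Disabled(DisabledReason),
}

impl Handler {
    pub fn new() -> Self {
        Handler::Enabled { upgrade_errors: 0 }
    }

    /// Record one upgrade error; disable the handler once the limit is hit.
    pub fn on_upgrade_error(&mut self) {
        let exceeded = match self {
            Handler::Enabled { upgrade_errors } => {
                *upgrade_errors += 1;
                *upgrade_errors >= MAX_UPGRADE_ERRORS
            }
            Handler::Disabled(_) => false,
        };
        if exceeded {
            *self = Handler::Disabled(DisabledReason::TooManyUpgradeErrors);
        }
    }

    /// A disabled handler stops keeping the connection alive, but other
    /// protocols on the same connection can still do so.
    pub fn connection_keep_alive(&self) -> KeepAlive {
        match self {
            Handler::Enabled { .. } => KeepAlive::Yes,
            Handler::Disabled(_) => KeepAlive::No,
        }
    }
}
```

The design point is that the handler only withdraws its own vote for keeping the connection open; whether the connection actually closes depends on the other handlers sharing it.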
Notes & open questions
Few items I'd like to raise.
- `self.keep_alive = KeepAlive::No` and the error report. Unfortunately, I cannot meaningfully return from the `poll` function because it returns `ConnectionHandlerEvent` and not an `Option` thereof.
- `self.keep_alive = KeepAlive::No` and the error report. Again, it is impossible to return meaningfully; I just `break`; such an approach is also observed in the `poll` function.
- `warn!` level which is inconsistent with the proposed `info!` level
Change checklist