sync-server: Fix infinite loop caused by accept error #271

Open
wants to merge 1 commit into
base: master

Conversation

Tim-Zhang
Member

Since poll is level-triggered, an uncorrected error can lead to an infinite loop, so we sleep for a while and wait for the error to be corrected.

Fixes: #270
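
The retry-with-sleep idea described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `accept_with_backoff`, its parameters, and the `max_attempts` cap are hypothetical names introduced here, and the `accept` closure stands in for the listener's accept call that a level-triggered poll keeps reporting as ready.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for an accept loop driven by a level-triggered poll.
/// `accept` is retried after every failure, but only after sleeping for
/// `retry_interval`, so a persistent error cannot spin the CPU or flood the log.
/// Returns the accepted value (if any) and the number of failed attempts.
fn accept_with_backoff<T>(
    mut accept: impl FnMut() -> io::Result<T>,
    retry_interval: Duration,
    max_attempts: usize, // assumption: a cap so the sketch always terminates
) -> (Option<T>, usize) {
    let mut failures = 0;
    for _ in 0..max_attempts {
        match accept() {
            Ok(conn) => return (Some(conn), failures),
            Err(_e) => {
                failures += 1;
                // Without this sleep the loop would retry immediately,
                // because the listener is still reported as readable.
                thread::sleep(retry_interval);
            }
        }
    }
    (None, failures)
}
```

A real server loop would run until shutdown rather than stop at a fixed attempt count; the cap is only there to keep the example bounded.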

@@ -373,6 +374,9 @@ impl Server {
}
Err(e) => {
error!("listener accept got {:?}", e);
// Since poll is level-triggered, an uncorrected error can lead to an infinite loop,
// so we sleep for a while and wait for the error to be corrected.
thread::sleep(Duration::from_secs(10));
Collaborator

10 seconds seems like a long time?

Member Author

@Tim-Zhang Tim-Zhang Jan 9, 2025

Yes, it's a long time for a program, but accept errors such as EMFILE and ENOMEM can't be recovered from in a short time.

Member Author

I think I could add a method, server.set_accept_retry_interval, so users can customize the sleep duration.

Member Author

Hi @jsturtevant, I have added the method set_accept_retry_interval.
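
For readers without the diff at hand, a configurable interval of this kind might look roughly like the sketch below. Only the method name `set_accept_retry_interval` comes from the conversation; the pared-down `Server` struct, the field name, the builder-style signature, and the 10-second default (taken from the hard-coded sleep in the first revision) are assumptions made for illustration.

```rust
use std::time::Duration;

/// Hypothetical, pared-down stand-in for the PR's `Server`;
/// the real struct carries many more fields.
#[derive(Debug)]
struct Server {
    accept_retry_interval: Duration,
}

impl Default for Server {
    fn default() -> Self {
        Server {
            // Assumed default: the 10 s sleep from the first revision of the patch.
            accept_retry_interval: Duration::from_secs(10),
        }
    }
}

impl Server {
    /// Builder-style setter. The signature is an assumption, not taken from the PR.
    fn set_accept_retry_interval(mut self, interval: Duration) -> Self {
        self.accept_retry_interval = interval;
        self
    }
}
```

A caller that finds 10 seconds too long could then write `Server::default().set_accept_retry_interval(Duration::from_millis(500))`.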

Collaborator

@jsturtevant jsturtevant Jan 9, 2025

Thanks for making it configurable.

If it wasn't resolved in 10s, wouldn't the continue loop still get stuck? In either case, once EMFILE is eventually resolved, wouldn't the loop finish?

Is the sleep there to save CPU cycles while those errors occur? Should this only be done for a subset of errors? It seems like a transient error would slow things down significantly.

Member Author

I updated the code. Now the sleep only affects resource-limit errors: libc::EMFILE, libc::ENFILE, libc::ENOBUFS, libc::ENOMEM.
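
A check of this kind could be written as below. This is a sketch rather than the PR's implementation: the function name `is_resource_limit_error` is hypothetical, and the errno values are hard-coded Linux numbers so the example needs only the standard library (the PR itself refers to the `libc` constants).

```rust
use std::io;

// Linux errno values, hard-coded so the sketch depends only on std.
// Code that links the libc crate would use libc::EMFILE and friends instead.
const ENOMEM: i32 = 12;
const ENFILE: i32 = 23;
const EMFILE: i32 = 24;
const ENOBUFS: i32 = 105;

/// Returns true only for the four resource-limit errors named above.
/// Errors without an OS error code (e.g. custom io::Error values) return false.
fn is_resource_limit_error(err: &io::Error) -> bool {
    matches!(
        err.raw_os_error(),
        Some(EMFILE | ENFILE | ENOBUFS | ENOMEM)
    )
}
```

With such a predicate, the accept loop would sleep only when it returns true, so transient errors are retried immediately.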

@Tim-Zhang Tim-Zhang force-pushed the fix-infinite-accept-loop-master branch 2 times, most recently from a845820 to 47f828a Compare January 9, 2025 07:00
Also add the method set_accept_retry_interval.

Since poll is level-triggered, an uncorrected error can lead to an infinite loop,
so we sleep for a while and wait for the error to be corrected.

Fixes: containerd#270

Signed-off-by: Tim Zhang <[email protected]>
@Tim-Zhang Tim-Zhang force-pushed the fix-infinite-accept-loop-master branch from 47f828a to c351f0b Compare January 10, 2025 06:12
Development

Successfully merging this pull request may close these issues.

The error message: "failed to accept error EMFILE" filled up the hard disk.
4 participants