-
Notifications
You must be signed in to change notification settings - Fork 33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-16008: Reconciler pending fix [WIP] #170
OCPBUGS-16008: Reconciler pending fix [WIP] #170
Conversation
@nicklesimba: This pull request references Jira Issue OCPBUGS-16008, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: nicklesimba The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
5e631f2
to
9e7e1be
Compare
@nicklesimba: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/jira refresh |
@nicklesimba: This pull request references Jira Issue OCPBUGS-16008, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@nicklesimba: This pull request references Jira Issue OCPBUGS-16008. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This fix resolves the issue where, after a forceful node reboot, force deleting a pod in a stateful set causes the pod to be recreated and remain indefinitely in the Pending state.
Note: The fix was authored originally by user xagent003. Currently, to resolve bug 16008, we have made this patch available for 4.12 - but we cannot merge this PR until upstream/4.14/4.13 are merged first. Nonetheless, 4.12 users who are affected by the bug described by OCPBUGS-16008 can apply this patch to resolve the issue.
Solution description, as written on xagent003's upstream PR:
"We shouldn't treat all pending Pods as "alive" and skip the check. The list of all Pods fetch'd earlier may be stale, and as observed in some scenarios, several seconds before the ip-reconciler does the isPodAlive check.
Instead, can we retry a Get on an individual Pod, with the hopes that it has final IP/network annoations? So we try to refetch the pod a few times if it is Pending state and initial IP check fails. After that, just do the IP matching check like before"
Note that xagent003's upstream PR is stale and has since been rebased by dougbtv. You can find the current upstream PR here.