☂️ Improve health checks based on node conditions #604

Open
10 tasks
prashanth26 opened this issue May 6, 2021 · 0 comments
Labels
area/ops-productivity Operator productivity related (how to improve operations) effort/1m Effort for issue is around 1 month kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/3 Priority (lower number equals higher priority)

Comments

@prashanth26
Contributor

prashanth26 commented May 6, 2021

Context

Currently, node conditions are added to a node by different actors such as the kubelet, Node Problem Detector (NPD), and Network Problem Detector. However, MCM acts on only a few of them by default, and not very effectively.

Customers can update the conditions at the shoot level from this section.
The current shoot defaults are:

- ReadonlyFilesystem (NPD)
- KernelDeadlock (NPD)
- DiskPressure (kubelet)

MCM has its own defaults, but they are overridden by the shoot defaults (or by values provided by the customer).
The MCM defaults are:

- KernelDeadlock (NPD)
- ReadonlyFilesystem (NPD)
- DiskPressure (kubelet)
- NetworkUnavailable (kubelet)
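
To make the condition-based check concrete, here is a minimal sketch of how a controller could match a node's conditions against a watched list such as the MCM defaults above. It is not MCM's actual implementation: `NodeCondition` is a simplified stand-in for the Kubernetes node condition type, `unhealthyConditions` is a hypothetical helper, and treating a status of "True" as unhealthy is an assumption made for illustration.

```go
package main

import "fmt"

// NodeCondition is a simplified stand-in for the Kubernetes node condition
// type; only the fields needed for this sketch are included.
type NodeCondition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// mcmDefaultConditions mirrors the MCM defaults listed in this issue.
var mcmDefaultConditions = []string{
	"KernelDeadlock",
	"ReadonlyFilesystem",
	"DiskPressure",
	"NetworkUnavailable",
}

// unhealthyConditions returns the types of all watched conditions whose
// status is "True". Hypothetical helper, assumed for illustration only.
func unhealthyConditions(conds []NodeCondition, watched []string) []string {
	watchedSet := make(map[string]struct{}, len(watched))
	for _, w := range watched {
		watchedSet[w] = struct{}{}
	}
	var hits []string
	for _, c := range conds {
		if _, ok := watchedSet[c.Type]; ok && c.Status == "True" {
			hits = append(hits, c.Type)
		}
	}
	return hits
}

func main() {
	conds := []NodeCondition{
		{Type: "DiskPressure", Status: "True"},
		{Type: "ClusterNetworkProblem", Status: "True"}, // not in the watched list
		{Type: "KernelDeadlock", Status: "False"},
	}
	fmt.Println(unhealthyConditions(conds, mcmDefaultConditions)) // prints [DiskPressure]
}
```

In this sketch a condition that is not in the watched list is ignored even when it is "True", which is why conditions from other detectors would have no effect unless they are added to the list.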

Examples of conditions added by the Network Problem Detector:

  - lastHeartbeatTime: "2023-02-22T07:27:28Z"
    lastTransitionTime: "2023-02-20T14:18:15Z"
    message: no cluster network problems
    reason: NoNetworkProblems
    status: "False"
    type: ClusterNetworkProblem
  - lastHeartbeatTime: "2023-02-22T07:27:23Z"
    lastTransitionTime: "2023-02-21T13:59:48Z"
    message: no host network problems
    reason: NoNetworkProblems
    status: "False"
    type: HostNetworkProblem

Goal

MCM should use node conditions more effectively and replace a node when its conditions indicate that it is unhealthy.

Quick Improvements

Research

  • collect metrics on the node conditions that are currently acted upon
  • relevance of the KernelDeadlock condition. Is it just for Docker? Gardener doesn't support Docker. Is it a permanent condition?
  • look into the effectiveness of tainting nodes by condition. See the similar NPD issue "Add option flag to taint node for permanent problems" (kubernetes/node-problem-detector#457). Currently the tainting is done only by KCM, and only for node conditions added by the kubelet
  • feasibility of triggering NPD's remediation process
  • edit the NPD config to introduce new node conditions that could be beneficial for Gardener scenarios
  • feasibility of different timeouts for different conditions, or would that be too much fine-tuning? See live issue #2653
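
The last research item, different timeouts for different conditions, could look roughly like the sketch below. Everything here is an assumption for illustration: the `conditionTimeouts` map, its values, the `defaultTimeout` fallback, and the `exceededTimeout` helper are hypothetical and do not exist in MCM.

```go
package main

import (
	"fmt"
	"time"
)

// NodeCondition is a simplified stand-in for the Kubernetes node condition type.
type NodeCondition struct {
	Type               string
	Status             string
	LastTransitionTime time.Time
}

// conditionTimeouts holds hypothetical per-condition timeouts; the values
// are made up for this sketch and are not MCM settings.
var conditionTimeouts = map[string]time.Duration{
	"KernelDeadlock": 2 * time.Minute,
	"DiskPressure":   10 * time.Minute,
}

// defaultTimeout is a hypothetical fallback for conditions without an entry.
const defaultTimeout = 10 * time.Minute

// mustParse parses an RFC 3339 timestamp and panics on malformed input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// exceededTimeout reports whether a condition has been "True" for longer
// than the timeout configured for its type.
func exceededTimeout(c NodeCondition, now time.Time) bool {
	if c.Status != "True" {
		return false
	}
	timeout, ok := conditionTimeouts[c.Type]
	if !ok {
		timeout = defaultTimeout
	}
	return now.Sub(c.LastTransitionTime) > timeout
}

func main() {
	now := mustParse("2023-02-22T07:30:00Z")
	c := NodeCondition{
		Type:               "KernelDeadlock",
		Status:             "True",
		LastTransitionTime: mustParse("2023-02-22T07:27:00Z"),
	}
	fmt.Println(exceededTimeout(c, now)) // 3m > 2m, prints true

	c.Type = "DiskPressure"
	fmt.Println(exceededTimeout(c, now)) // 3m < 10m, prints false
}
```

A single map keeps the tuning surface small; whether even that is too much fine-tuning is exactly the open question in the item above.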

Why is this needed:

To achieve better, more reliable recovery of nodes and less downtime for customer workloads.

@prashanth26 prashanth26 added kind/enhancement Enhancement, improvement, extension effort/1m Effort for issue is around 1 month priority/3 Priority (lower number equals higher priority) area/ops-productivity Operator productivity related (how to improve operations) labels May 6, 2021
@prashanth26 prashanth26 changed the title Investiage the possibility of handling more errors on machines using NPD Investigate the possibility of handling more errors on machines using NPD Jul 21, 2021
@prashanth26 prashanth26 added this to the 2022-Q2 milestone Jul 21, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 18, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 17, 2022
@himanshu-kun himanshu-kun removed this from the 2022-Q2 milestone Feb 22, 2023
@himanshu-kun himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 22, 2023
@himanshu-kun himanshu-kun changed the title Investigate the possibility of handling more errors on machines using NPD Improve health checks based on node conditions Apr 11, 2023
@himanshu-kun himanshu-kun changed the title Improve health checks based on node conditions ☂️ Improve health checks based on node conditions Apr 11, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 19, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 27, 2024
Projects
None yet
Development

No branches or pull requests

3 participants