☂️ Improve health checks based on node conditions #604

prashanth26 · 2021-05-06T03:51:52Z

Context

Currently, several actors add conditions to a node: kubelet, Node Problem Detector (NPD), and Network Problem Detector. MCM acts on only a few of them by default, and not very effectively.

Customers can update the conditions at shoot level from this section.
Current shoot defaults are:

- ReadonlyFilesystem (NPD)
- KernelDeadlock (NPD)
- DiskPressure (kubelet)

MCM has its own defaults, but they are overridden by the shoot defaults (or by values provided by the customer).
MCM defaults are:

- KernelDeadlock (NPD)
- ReadonlyFilesystem (NPD)
- DiskPressure (kubelet)
- NetworkUnavailable (kubelet)
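
The override order described above can be sketched as follows. This is only an illustration, not MCM's or Gardener's actual code: the function name `effective_conditions` and the constant names are hypothetical, and the precedence (customer value, then shoot default, then MCM default) is taken from the description above.

```python
# Hypothetical sketch: names are illustrative, not taken from the MCM code base.
MCM_DEFAULT_CONDITIONS = (
    "KernelDeadlock",
    "ReadonlyFilesystem",
    "DiskPressure",
    "NetworkUnavailable",
)
SHOOT_DEFAULT_CONDITIONS = (
    "ReadonlyFilesystem",
    "KernelDeadlock",
    "DiskPressure",
)


def effective_conditions(customer=None,
                         shoot_defaults=SHOOT_DEFAULT_CONDITIONS,
                         mcm_defaults=MCM_DEFAULT_CONDITIONS):
    """Return the condition types that end up being watched.

    The first non-empty source wins: customer value, then shoot
    default, then MCM default.
    """
    for candidate in (customer, shoot_defaults, mcm_defaults):
        if candidate:
            return frozenset(candidate)
    return frozenset()


if __name__ == "__main__":
    # With shoot defaults in place, NetworkUnavailable is no longer watched.
    print(sorted(effective_conditions()))
```

One consequence worth noting: because the shoot defaults replace the MCM defaults rather than extend them, NetworkUnavailable drops out of the watched set unless the customer adds it back.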

Examples of conditions added by Network Problem Detector:

  - lastHeartbeatTime: "2023-02-22T07:27:28Z"
    lastTransitionTime: "2023-02-20T14:18:15Z"
    message: no cluster network problems
    reason: NoNetworkProblems
    status: "False"
    type: ClusterNetworkProblem
  - lastHeartbeatTime: "2023-02-22T07:27:23Z"
    lastTransitionTime: "2023-02-21T13:59:48Z"
    message: no host network problems
    reason: NoNetworkProblems
    type: HostNetworkProblem
    status: "False"
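
As a rough sketch of how such conditions could be matched against a watched set, the snippet below treats a watched condition as a problem when its status is "True". That rule is an assumption for illustration (it fits problem-style conditions like the two above), and `unhealthy_conditions` is a hypothetical helper, not an MCM function.

```python
# Hypothetical sketch: the conditions mirror the example above as plain dicts.
NODE_CONDITIONS = [
    {
        "type": "ClusterNetworkProblem",
        "status": "False",
        "reason": "NoNetworkProblems",
        "message": "no cluster network problems",
    },
    {
        "type": "HostNetworkProblem",
        "status": "False",
        "reason": "NoNetworkProblems",
        "message": "no host network problems",
    },
]


def unhealthy_conditions(conditions, watched):
    """Return the types of watched conditions whose status is "True".

    Assumption: for problem-style conditions, "True" means the problem
    is present; "False" and "Unknown" are not treated as unhealthy here.
    """
    return [
        c["type"]
        for c in conditions
        if c["type"] in watched and c.get("status") == "True"
    ]


if __name__ == "__main__":
    watched = {"ClusterNetworkProblem", "HostNetworkProblem"}
    print(unhealthy_conditions(NODE_CONDITIONS, watched))
```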

Goal

MCM should use node conditions more effectively and replace a node when its conditions indicate that it is unhealthy.

Quick Improvements

- Identify unrecoverable node conditions (e.g. ReadonlyFilesystem) and mark the machine Failed immediately instead of waiting for healthTimeout.
  - This could be deceiving, as the condition is also reported if some other file system is read-only. See "Gauge for FilesystemIsReadOnly not downgraded to 0 after fixing the problem" kubernetes/node-problem-detector#474.
  - Taint-based evictions may not respect PDBs (to be confirmed; if so, use a health timeout of 0). See "Documentation on taint-based pod evictions implies respect of PDBs, which may not be the case" kubernetes/website#7829.
- MCM should also consider node conditions added by Network Problem Detector.
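
A minimal sketch of the first idea, assuming the machine stays in an intermediate phase (called "Unknown" here) until healthTimeout expires and then becomes "Failed". The function `machine_phase`, the phase strings, and the contents of `UNRECOVERABLE` are illustrative assumptions, not the existing MCM implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: which conditions count as unrecoverable is exactly what
# this issue proposes to identify; ReadonlyFilesystem is the example given.
UNRECOVERABLE = frozenset({"ReadonlyFilesystem"})


def machine_phase(condition_type, unhealthy_since, now, health_timeout,
                  unrecoverable=UNRECOVERABLE):
    """Decide the machine phase for a node with one unhealthy condition.

    Unrecoverable conditions fail the machine immediately; all others
    fail it only once health_timeout has elapsed.
    """
    if condition_type in unrecoverable:
        return "Failed"
    if now - unhealthy_since >= health_timeout:
        return "Failed"
    return "Unknown"


if __name__ == "__main__":
    now = datetime(2023, 2, 22, 7, 30, tzinfo=timezone.utc)
    since = now - timedelta(minutes=2)
    timeout = timedelta(minutes=10)
    print(machine_phase("ReadonlyFilesystem", since, now, timeout))
    print(machine_phase("DiskPressure", since, now, timeout))
```

Given the caveat about ReadonlyFilesystem being reported for unrelated file systems, the unrecoverable set would need to be chosen conservatively.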

Research

- Collect metrics on the node conditions currently acted upon.
- Relevance of the KernelDeadlock condition: is it only for Docker? Gardener doesn't support Docker. Is it a permanent condition?
- Look into the effectiveness of taint node by condition. Refer to the similar NPD issue "Add option flag to taint node for permanent problems" kubernetes/node-problem-detector#457. Currently the tainting is done only by KCM, and only for node conditions added by kubelet.
- Feasibility of triggering the remediation process of NPD.
- Edit the NPD config to introduce new node conditions that could be beneficial for the Gardener scenario.
- Feasibility of different timeouts for different conditions, or would that be too much fine-tuning? See live issue # 2653.
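
To make the last item concrete, per-condition timeouts could look roughly like the sketch below. Everything here is hypothetical: the timeout values are placeholders, and the choice to use the shortest timeout among the active conditions is one possible policy, not a decision recorded in this issue.

```python
from datetime import timedelta

# Placeholder values for illustration only.
DEFAULT_HEALTH_TIMEOUT = timedelta(minutes=10)
PER_CONDITION_TIMEOUT = {
    "ReadonlyFilesystem": timedelta(0),      # treated as unrecoverable
    "DiskPressure": timedelta(minutes=5),
}


def health_timeout_for(active_conditions,
                       overrides=PER_CONDITION_TIMEOUT,
                       default=DEFAULT_HEALTH_TIMEOUT):
    """Return the shortest timeout among the active unhealthy conditions.

    Conditions without an override use the default; with no active
    conditions the default is returned.
    """
    return min(
        (overrides.get(condition, default) for condition in active_conditions),
        default=default,
    )


if __name__ == "__main__":
    print(health_timeout_for(["DiskPressure", "KernelDeadlock"]))
```

A lookup table like this keeps the fine-tuning in one place, which may help in judging whether per-condition values are worth the extra configuration.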

Why is this needed:

To have better, more reliable recovery of nodes and less downtime for customer workloads.


prashanth26 added kind/enhancement Enhancement, improvement, extension effort/1m Effort for issue is around 1 month priority/3 Priority (lower number equals higher priority) area/ops-productivity Operator productivity related (how to improve operations) labels May 6, 2021

prashanth26 changed the title ~~Investiage the possibility of handling more errors on machines using NPD~~ Investigate the possibility of handling more errors on machines using NPD Jul 21, 2021

prashanth26 added this to the 2022-Q2 milestone Jul 21, 2021

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jan 18, 2022

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 17, 2022

himanshu-kun removed this from the 2022-Q2 milestone Feb 22, 2023

himanshu-kun added needs/planning Needs (more) planning with other MCM maintainers and removed lifecycle/rotten Nobody worked on this for 12 months (final aging stage) labels Feb 22, 2023

himanshu-kun changed the title ~~Investigate the possibility of handling more errors on machines using NPD~~ Improve health checks based on node conditions Apr 11, 2023

himanshu-kun changed the title ~~Improve health checks based on node conditions~~ ☂️ Improve health checks based on node conditions Apr 11, 2023

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 19, 2023

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Aug 27, 2024
