☂️ Improve health checks based on node conditions #604
Labels
area/ops-productivity: Operator productivity related (how to improve operations)
effort/1m: Effort for issue is around 1 month
kind/enhancement: Enhancement, improvement, extension
lifecycle/rotten: Nobody worked on this for 12 months (final aging stage)
needs/planning: Needs (more) planning with other MCM maintainers
priority/3: Priority (lower number equals higher priority)
Context
Currently, different actors such as the kubelet, Node Problem Detector (NPD), and Network Problem Detector add conditions to the node. MCM, however, acts on only a few of them by default, and not very effectively.
Customers can update the conditions at the shoot level from this section.
Current shoot defaults are:
MCM has its own defaults, but of course they are overridden by the shoot defaults (or by values provided by the customer).
MCM defaults are:
Examples of conditions added by the Network Problem Detector are:
Goal
MCM should use the node conditions more effectively and replace a node when its conditions indicate that it is unhealthy.
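As a rough illustration of what "unhealthy according to the condition" could mean, here is a minimal Go sketch. It assumes the monitored conditions are configured as a comma-separated list and that a monitored condition with status "True" signals a problem. The `NodeCondition` type and the functions `ParseConditionList` and `ActiveConditions` are hypothetical names for this example, not MCM's actual code.

```go
package main

import "strings"

// NodeCondition is a simplified, hypothetical stand-in for a Kubernetes
// node condition; only the fields needed for this sketch are modelled.
type NodeCondition struct {
	Type   string
	Status string // "True", "False" or "Unknown"
}

// ParseConditionList turns a comma-separated list of condition types
// (assumed configuration format) into a set, ignoring blanks.
func ParseConditionList(list string) map[string]bool {
	set := make(map[string]bool)
	for _, name := range strings.Split(list, ",") {
		if name = strings.TrimSpace(name); name != "" {
			set[name] = true
		}
	}
	return set
}

// ActiveConditions returns the monitored condition types that are
// currently "True" on the node, in the order they appear.
func ActiveConditions(conds []NodeCondition, monitored map[string]bool) []string {
	var active []string
	for _, c := range conds {
		if monitored[c.Type] && c.Status == "True" {
			active = append(active, c.Type)
		}
	}
	return active
}
```

A node for which `ActiveConditions` returns a non-empty slice would be a candidate for replacement; conditions that are not in the monitored set are ignored.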
Quick Improvements
For conditions such as ReadonlyFilesystem, don't wait for healthTimeout; mark the machine as Failed immediately.
Research the KernelDeadlock condition. Is it only relevant for Docker? Gardener doesn't support Docker. Is it a permanent condition?
Why is this needed:
To achieve better, more reliable recovery of nodes and less downtime for customer workloads.