replace kill_all_container_shims with remove_all_containers even in classic #4693

Open: wants to merge 1 commit into base: master

@gaelgatelement gaelgatelement commented Oct 2, 2024

Summary

  1. We've seen containers not being stopped by microk8s stop. See "microk8s stop command does not shutdown previous deployed containers" (#3969).
  2. We mostly hit this during snap refresh of microk8s. The same problem is probably also behind "Microk8s restart during snap refresh did not remove pods" (#4691).

The current behaviour is very dangerous: we've seen data corruption and file-lock issues after microk8s upgrades because container processes end up duplicated. (Processes from before the upgrade are still present, and the upgraded microk8s starts new containers that write to the same resources.)

Changes

  • We replace kill_all_container_shims with remove_all_containers, regardless of classic or strict confinement

Testing

  • I ran the patched microk8s stop command and verified that no containers were left running.

Before:

microk8s stop
sleep 60s
# list orphaned processes
for cgroup_tasks in `find /sys/fs/cgroup/pids/kubepods/ -name tasks`; do head -n 1 "$cgroup_tasks"; done | xargs ps
    PID TTY      STAT   TIME COMMAND
  42522 ?        Ss     0:00 /pause
  44183 ?        Ss     0:00 /pause
  44307 ?        Ss     0:00 /pause
  44868 ?        Ss     0:00 /pause
  44870 ?        Ss     0:00 /usr/local/bin/runsvdir -P /etc/service/enabled
  44913 ?        Ss     0:00 /opt/bitnami/postgresql/bin/postgres -D /bitnami/postgresql/data --config-file=/opt/bitnami/postgresql/conf/postgresql.conf --external_pid_file=/opt/bitnami/postgresql/tmp/postgresql.pid --hba_file=/opt
  45138 ?        Ss     0:00 /pause
  45317 ?        Ss     0:00 /pause
...

After:

microk8s stop
Stopped.
Stopped.
for cgroup_tasks in `find /sys/fs/cgroup/pids/kubepods/ -name tasks`; do head -n 1 "$cgroup_tasks"; done | xargs ps
    PID TTY          TIME CMD
   1048 pts/0    00:00:00 bash
  23334 pts/0    00:00:00 xargs
  23373 pts/0    00:00:00 ps
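
A note on the check itself: /sys/fs/cgroup/pids/kubepods/ and the tasks file are cgroup v1 conventions, and xargs runs ps once even when it receives no PIDs, which is why the "After" output lists the shell's own processes. Below is a minimal sketch of a more portable check. The helper name list_kubepods_pids is hypothetical (it is not part of microk8s), and on a cgroup v2 host the caller has to pass the kubepods directory that applies to that system:

```shell
# Print the first PID of every cgroup found under the given root.
# cgroup v1 exposes "tasks", cgroup v2 exposes "cgroup.procs"; look for both.
# The default root is the cgroup v1 path used in the PR description.
list_kubepods_pids() {
    root="${1:-/sys/fs/cgroup/pids/kubepods}"
    # A missing directory simply means there is nothing to report.
    [ -d "$root" ] || return 0
    find "$root" \( -name tasks -o -name cgroup.procs \) -type f 2>/dev/null |
    while IFS= read -r f; do
        head -n 1 "$f" 2>/dev/null
    done | grep -E '^[0-9]+$' | sort -un
}
```

Usage, with GNU xargs: `list_kubepods_pids | xargs -r ps`. The `-r` flag skips ps entirely when no PIDs are found, so an empty result means no container processes are left.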

Possible Regressions

Checklist

  • Read the contributions page.
  • Submitted the CLA form, if you are a first time contributor.
  • The introduced changes are covered by unit and/or integration tests.

Notes

@gaelgatelement gaelgatelement changed the title replace kill_all_container_shims with remove_all_containers even in c… replace kill_all_container_shims with remove_all_containers even in classic Oct 2, 2024
@claudiubelu claudiubelu self-assigned this Oct 29, 2024
@claudiubelu
Contributor

Hello.

Make sure to sign the CLA.

I've been looking into this and experimenting with what's reported and proposed here. I got different results, so it would be useful to know your setup (node OS, microk8s version, etc.) so we can compare.

Here are my setup and my findings.

I ran these tests in Hyper-V VMs (Windows 11 Pro), using Ubuntu 22.04 nodes, and I've been trying out v1.31 (including building the snap from this PR):

cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

As for my results, I first tried the strict channel to see whether any containerd-related resources would leak after removing the snap. Interestingly, I did find leaked resources (which is itself a problem, IMO), contrary to what this PR would suggest:

# Install strict microk8s.
sudo snap install microk8s --channel=1.31-strict
microk8s (1.31-strict/stable) v1.31.1 from Canonical✓ installed

snap list | grep microk8s | cut -d' ' -f1 | xargs -i bash -c 'echo -n "{} = " && snap info --verbose {} | grep confinement  | rev | cut -d" " -f1  |rev'
microk8s = strict

# List containerd and containers.
ps aux | grep containerd
root     2295879  2.1  0.8 2079412 51428 ?       Ssl  22:25   0:16 /snap/microk8s/7234/bin/containerd --config /var/snap/microk8s/7234/args/containerd.toml --root /var/snap/microk8s/common/var/lib/containerd --state /var/snap/microk8s/common/run/containerd --address /var/snap/microk8s/common/run/containerd.sock
root     2297278  0.1  0.1 1236368 11000 ?       Sl   22:25   0:00 /snap/microk8s/7234/bin/containerd-shim-runc-v2 -namespace k8s.io -id edd82a4e43fee5bda3f478faa7520931ea6b680ab64d6c25461228e3de1d787d -address /var/snap/microk8s/common/run/containerd.sock
root     2298229  0.1  0.1 1236112 10764 ?       Sl   22:26   0:00 /snap/microk8s/7234/bin/containerd-shim-runc-v2 -namespace k8s.io -id d68be3e66706b2963003038a73e950bfa47432ce07b148dbdd9abc74a0563f9b -address /var/snap/microk8s/common/run/containerd.sock
root     2298672  0.0  0.1 1236112 10492 ?       Sl   22:26   0:00 /snap/microk8s/7234/bin/containerd-shim-runc-v2 -namespace k8s.io -id 962a3c3fb2b86a721d37d12961dbc50621ee92b0abf1b1f2aaab66943be558d3 -address /var/snap/microk8s/common/run/containerd.sock

# Remove microk8s.
sudo snap remove --purge microk8s
Disconnect microk8s:docker-unprivileged from snapd:docker-support
microk8s removed

# List containerd and containers.
ps aux | grep containerd
root     2297278  0.1  0.1 1236368 10812 ?       Sl   22:25   0:01 /snap/microk8s/7234/bin/containerd-shim-runc-v2 -namespace k8s.io -id edd82a4e43fee5bda3f478faa7520931ea6b680ab64d6c25461228e3de1d787d -address /var/snap/microk8s/common/run/containerd.sock
root     2298229  0.0  0.1 1236112 10916 ?       Sl   22:26   0:00 /snap/microk8s/7234/bin/containerd-shim-runc-v2 -namespace k8s.io -id d68be3e66706b2963003038a73e950bfa47432ce07b148dbdd9abc74a0563f9b -address /var/snap/microk8s/common/run/containerd.sock
root     2298672  0.0  0.1 1236112 10844 ?       Sl   22:26   0:00 /snap/microk8s/7234/bin/containerd-shim-runc-v2 -namespace k8s.io -id 962a3c3fb2b86a721d37d12961dbc50621ee92b0abf1b1f2aaab66943be558d3 -address /var/snap/microk8s/common/run/containerd.sock

# Kill zombie containers.
ps aux | grep containerd | awk '{print $2}' | sudo xargs kill -9
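
As a side note on the cleanup step above, `ps aux | grep containerd` also matches the grep process itself and anything else with containerd in its command line. A minimal sketch of a narrower filter follows. The helper name shim_pids is hypothetical, and the path pattern is an assumption based on the shim paths shown in the listings above:

```shell
# Read "PID COMMAND..." lines on stdin (for example from `ps -eo pid=,args=`)
# and print only the PIDs whose executable is a microk8s containerd shim.
shim_pids() {
    awk '$2 ~ "^/snap/microk8s/[^/]+/bin/containerd-shim-runc-v2$" { print $1 }'
}
```

Usage: `ps -eo pid=,args= | shim_pids` lists the leftover shims. Piping that into `sudo xargs -r kill -9` would mirror the kill step above.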

I've then tried the classic variant as well, and I did not see any leaked resources there:

# Install classic microk8s.
sudo snap install microk8s --classic --channel=1.31
microk8s (1.31/stable) v1.31.1 from Canonical✓ installed

snap list | grep microk8s | cut -d' ' -f1 | xargs -i bash -c 'echo -n "{} = " && snap info --verbose {} | grep confinement  | rev | cut -d" " -f1  |rev'
microk8s = classic

# List containerd and containers.
ps aux | grep containerd
root     2320646 15.5  0.9 2079156 55720 ?       Ssl  22:45   0:13 /snap/microk8s/7229/bin/containerd --config /var/snap/microk8s/7229/args/containerd.toml --root /var/snap/microk8s/common/var/lib/containerd --state /var/snap/microk8s/common/run/containerd --address /var/snap/microk8s/common/run/containerd.sock
root     2322012  0.1  0.1 1236368 11256 ?       Sl   22:45   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id cc2e9a248b398e9dad730e95cbe3c6063d369f771807c296ad59960025d2bb7e -address /var/snap/microk8s/common/run/containerd.sock
root     2322933  0.1  0.1 1236112 11380 ?       Sl   22:46   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 32e87a796c2d8c974b6fbd9c470154e76089df630e66467924465f89599f819c -address /var/snap/microk8s/common/run/containerd.sock
root     2323485  0.0  0.1 1236048 9128 ?        Sl   22:46   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1298c8a3695019ea06ff7bf87c521755c9804d8503e7ae966fc7acb939843b54 -address /var/snap/microk8s/common/run/containerd.sock

# Remove microk8s.
sudo snap remove --purge microk8s
microk8s removed

# List containerd and containers.
ps aux | grep containerd
ubuntu   2325445  0.0  0.0   6612  2300 pts/0    S+   22:48   0:00 grep --color=auto containerd

From the looks of it, remove_all_containers doesn't do what it's supposed to. Would have to investigate further.

Another interesting note: it seems that systemctl kill snap.microk8s.daemon-containerd.service --signal=SIGKILL &>/dev/null || true actually kills the containers as well, but snap.microk8s.daemon-containerd.service is not stopped (in /etc/systemd/system/snap.microk8s.daemon-containerd.service it is set to Restart=always). This means that containerd comes back and recreates / restarts the containers, which may relate to your issue. If we stop the service beforehand, however, the containers are not recreated:

# Reinstall classic microk8s.
sudo snap install microk8s --classic --channel=1.31
microk8s (1.31/stable) v1.31.1 from Canonical✓ installed

snap list | grep microk8s | cut -d' ' -f1 | xargs -i bash -c 'echo -n "{} = " && snap info --verbose {} | grep confinement  | rev | cut -d" " -f1  |rev'
microk8s = classic

# List containerd and containers.
ps aux | grep containerd
root     2331542  1.2  0.7 2373828 46332 ?       Ssl  22:51   0:00 /snap/microk8s/7229/bin/containerd --config /var/snap/microk8s/7229/args/containerd.toml --root /var/snap/microk8s/common/var/lib/containerd --state /var/snap/microk8s/common/run/containerd --address /var/snap/microk8s/common/run/containerd.sock
root     2331779  0.1  0.1 1236368 11088 ?       Sl   22:51   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 85441c744ee93962c191156a3cc1aeac0f0086fc1ecf11fcbf9885e5c8a92486 -address /var/snap/microk8s/common/run/containerd.sock
root     2331950  0.1  0.1 1236368 10640 ?       Sl   22:51   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 7208cb129230066e437e16b9dd14c69ecb846d1491fd038613e178f13994acf6 -address /var/snap/microk8s/common/run/containerd.sock
root     2332004  0.0  0.1 1236368 10572 ?       Sl   22:51   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 17e86cc665cc4867b5c28fe0aa1d478009407a641dc3cfcd60c2311b1a3417a6 -address /var/snap/microk8s/common/run/containerd.sock
ubuntu   2333994  0.0  0.0   6612  2296 pts/0    S+   22:52   0:00 grep --color=auto containerd


# Stop containerd. Observe that the containers are still running, even though containerd is not.
sudo systemctl stop snap.microk8s.daemon-containerd.service --signal=SIGKILL
ps aux | grep containerd
root     2334412  0.1  0.1 1236624 10340 ?       Sl   22:52   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3818a967f17d17569a00a6e424cac1d1c9d9c0e2c5bbe92ad6e9c9bf30ad3678 -address /var/snap/microk8s/common/run/containerd.sock
root     2334642  0.0  0.1 1236112 9668 ?        Sl   22:52   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id f1da747132080c7744eba4cb63908f3adcf2e94141e631139c3c0b88c15338bb -address /var/snap/microk8s/common/run/containerd.sock
root     2334705  0.0  0.1 1236368 10512 ?       Sl   22:52   0:00 /snap/microk8s/7229/bin/containerd-shim-runc-v2 -namespace k8s.io -id 70aacf5526b88f60a4365822d3e14dad35acac9cc257d991978847e681c4ecd0 -address /var/snap/microk8s/common/run/containerd.sock
ubuntu   2337261  0.0  0.0   6612  2236 pts/0    S+   22:54   0:00 grep --color=auto containerd

# Kill containerd (even though it is stopped). The containers will also disappear.
sudo systemctl kill snap.microk8s.daemon-containerd.service --signal=SIGKILL &>/dev/null || true
ps aux | grep containerd
ubuntu   2338052  0.0  0.0   6612  2224 pts/0    S+   22:55   0:00 grep --color=auto containerd
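
The ordering shown in this experiment can be sketched as a small helper. This is only an illustration of the stop-then-kill sequence observed above, not what microk8s stop actually does. The function name stop_then_kill and the SYSTEMCTL override (used to dry-run the sketch) are hypothetical:

```shell
# Stop the unit first so Restart=always cannot bring containerd back,
# then send SIGKILL to whatever is still left in the unit.
# SYSTEMCTL is a hypothetical override, e.g. SYSTEMCTL=echo for a dry run.
stop_then_kill() {
    unit="$1"
    ${SYSTEMCTL:-systemctl} stop "$unit" || true
    ${SYSTEMCTL:-systemctl} kill --signal=SIGKILL "$unit" 2>/dev/null || true
}
```

Usage: `sudo sh -c '. ./sketch.sh; stop_then_kill snap.microk8s.daemon-containerd.service'`, where sketch.sh is a hypothetical file holding the function. Running the kill without the preceding stop is the case where containerd gets restarted and the containers come back.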

As mentioned, I've also built the snap myself using this PR and tried it out. I suspected that, because of the changes proposed here, the classic confinement would start having the issues seen in strict as well:

git log

commit 5eb2959932d58fb7cd8bf82e99bd5851a1c8dff6 (HEAD)
Author: Gaël Goinvic <[email protected]>
Date:   Wed Oct 2 14:54:42 2024 +0200

    replace kill_all_container_shims with remove_all_containers even in classic mode

...

# Build snap.
snapcraft --use-lxd

Launching a container.
Waiting for container to be ready
Waiting for network to be ready...
snapcraft 8.4.4 from Canonical✓ installed
"snapcraft" switched to the "latest/stable" channel
...
Snapping |
Snapped microk8s_v1.31.2_amd64.snap

# Install built snap.
sudo snap install microk8s_*_amd64.snap --classic --dangerous
[sudo] password for ubuntu:
microk8s v1.31.2 installed

snap list | grep microk8s | cut -d' ' -f1 | xargs -i bash -c 'echo -n "{} = " && snap info --verbose {} | grep confinement  | rev | cut -d" " -f1  |rev'
microk8s = classic

ps aux | grep containerd
root     1075059 18.9  0.6 2153128 54956 ?       Ssl  11:49   0:14 /snap/microk8s/x1/bin/containerd --config /var/snap/microk8s/x1/args/containerd.toml --root /var/snap/microk8s/common/var/lib/containerd --state /var/snap/microk8s/common/run/containerd --address /var/snap/microk8s/common/run/containerd.sock
root     1076586  0.2  0.1 1236364 12384 ?       Sl   11:50   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3d95b0780d69c6d53eae4526a1264c070f1f7688695f5dca8d885089fb6b7d85 -address /var/snap/microk8s/common/run/containerd.sock
root     1078236  0.3  0.1 1236044 10624 ?       Sl   11:51   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 7569f6fa1f3b27d8afa9c7802a0c30caa5011ce41e29ee666069c867a11eb755 -address /var/snap/microk8s/common/run/containerd.sock
root     1078350  0.2  0.1 1235532 10368 ?       Sl   11:51   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 774fa766c47b3efe1e3f01f314b1af4bc3c2f0ca1a52025280c72b31db52fd30 -address /var/snap/microk8s/common/run/containerd.sock
ubuntu   1078683  0.0  0.0   6544  2304 pts/2    S+   11:51   0:00 grep --color=auto containerd

# Remove the snap.
sudo snap remove --purge microk8s
microk8s removed

ps aux | grep containerd
root     1076586  0.1  0.1 1236364 12100 ?       Sl   11:50   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3d95b0780d69c6d53eae4526a1264c070f1f7688695f5dca8d885089fb6b7d85 -address /var/snap/microk8s/common/run/containerd.sock
root     1078236  0.0  0.1 1236364 10844 ?       Sl   11:51   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 7569f6fa1f3b27d8afa9c7802a0c30caa5011ce41e29ee666069c867a11eb755 -address /var/snap/microk8s/common/run/containerd.sock
root     1078350  0.1  0.1 1236108 11508 ?       Sl   11:51   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 774fa766c47b3efe1e3f01f314b1af4bc3c2f0ca1a52025280c72b31db52fd30 -address /var/snap/microk8s/common/run/containerd.sock
ubuntu   1079535  0.0  0.0   6544  2304 pts/2    S+   11:52   0:00 grep --color=auto containerd

As for microk8s stop, I haven't observed this issue with either the strict or classic modes.

Best regards,

Claudiu

@jcjveraa

jcjveraa commented Oct 30, 2024

Got here due to observing that the kubelite process was apparently leaking memory at a rate of 0.1 GB/day. When running microk8s stop, as expected, the kubelite process stopped, but I can still see all services (argocd, calico, ...) running in the background in top.

The kubelite process and some others were stopped/killed, but that did not stop the containers. I have a DIY memory/CPU resource consumption monitor running (a Python script running as a DaemonSet), which kept reporting memory consumption: it showed a large drop when the kubelite process went down and a slight increase again when I ran microk8s start a minute ago.

I'm running the snap in --classic mode. This is on microk8s 1.30.5, running on bare metal ubuntu-server 24.04.01.

Happy to provide any requested debugging info.

$ sudo systemctl kill snap.microk8s.daemon-kubelite.service --signal=SIGKILL
Failed to kill unit snap.microk8s.daemon-kubelite.service: Unit snap.microk8s.daemon-kubelite.service not loaded.
$ sudo systemctl kill snap.microk8s.daemon-containerd.service --signal=SIGKILL
Failed to kill unit snap.microk8s.daemon-containerd.service: Unit snap.microk8s.daemon-containerd.service not loaded.

image

@gaelgatelement
Author

gaelgatelement commented Oct 31, 2024

If you try with microk8s 1.30/candidate, which has #4710 backported, it should just work. As @claudiubelu demonstrated, v1.31 seems to be fixed with regard to this issue, I suppose because it has #4710 merged in.

Are you able to give it a try by any chance?

@jcjveraa

jcjveraa commented Oct 31, 2024 via email

@gaelgatelement
Author

Do you mean 1.31/candidate? I don't see a 1.35.

On Thu, Oct 31, 2024 at 09:48, Gaël Goinvic wrote:

If you try with microk8s 1.35/candidate, which has the #4710 backported, it should just work just as @claudiubelu explained. Are you able to give it a try by any chance?

Yes, sorry, typo: v1.30/candidate is on v1.30.6, which is supposed to have the fix from #4710.

@jcjveraa

jcjveraa commented Oct 31, 2024

Not seeing much difference unfortunately after upgrading one node to 1.30.6.

jelle@lenovo-01:~$ microk8s kubectl get no -o wide
NAME        STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
lenovo-01   Ready    <none>   38d   v1.30.6   192.168.178.210   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-02   Ready    <none>   32d   v1.30.5   192.168.178.211   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-03   Ready    <none>   32d   v1.30.5   192.168.178.212   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
jelle@lenovo-01:~$ microk8s kubectl get po --all-namespaces -o wide | grep mqtt
kube-metrics-mqtt     node-metrics-mqtt-fwsv9                            1/1     Running   2 (2d22h ago)    4d9h    10.1.125.104      lenovo-03   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running   4 (52s ago)      4d9h    10.1.153.198      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-sqdjh                            1/1     Running   2 (2d22h ago)    4d9h    10.1.190.9        lenovo-02   <none>           <none>
jelle@lenovo-01:~$ ps aux | grep mqtt
root     4105984  1.1  0.1  26844 19964 ?        Ssl  18:55   0:00 python ./kube-metrics-mqtt.py
jelle    4106521  0.0  0.0   3956  2048 pts/1    S+   18:55   0:00 grep --color=auto mqtt
jelle@lenovo-01:~$ microk8s stop
Stopped.
jelle@lenovo-01:~$ sleep 60 && ps aux | grep mqtt
root     4105984  0.2  0.1  26844 19964 ?        Ssl  18:55   0:00 python ./kube-metrics-mqtt.py
jelle    4108591  0.0  0.0   3956  2048 pts/1    S+   18:57   0:00 grep --color=auto mqtt

I had already tried this once before capturing this 'pretty' output, which is why the pod reports running for 52 seconds. It restarts once microk8s start is called, but it does not stop after microk8s stop. The same holds for other pods like argocd, but because they run only 2 replicas and I drained the node before upgrading, they got scheduled on nodes 02 and 03.
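
The "sleep 60, then ps" check above can be made a bit more explicit by polling a specific PID. Here is a minimal sketch, assuming a Linux host with /proc mounted. The helper name wait_for_exit is hypothetical. It checks /proc rather than using kill -0, because the container process in the listing above is owned by root and kill -0 from an unprivileged user would fail with a permission error even while the process is alive:

```shell
# Poll until the given PID has disappeared or the timeout (seconds) elapses.
# Returns 0 if the process is gone, 1 if it is still present at the deadline.
wait_for_exit() {
    pid="$1"
    timeout="${2:-60}"
    elapsed=0
    while [ -d "/proc/$pid" ]; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

Usage: `microk8s stop; wait_for_exit 4105984 60 || echo "still running"`, where 4105984 is the PID of the python process from the listing above.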

While typing this I got the idea to check from another node what the status is - and it correctly (but to me unexpectedly) reports a few pods still running on "stopped" node lenovo-01.

jelle@lenovo-02:~$ microk8s kubectl get po -o wide --all-namespaces | grep lenovo-01
argocd                argocd-redis-ha-haproxy-94df6548c-hmrk2            1/1     Terminating   2 (9m1s ago)     35m     10.1.153.242      lenovo-01   <none>           <none>
argocd                argocd-redis-ha-server-0                           3/3     Terminating   6 (9m1s ago)     34m     10.1.153.251      lenovo-01   <none>           <none>
ingress               nginx-ingress-microk8s-controller-v8tdn            1/1     Running       18 (9m1s ago)    30d     10.1.153.193      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running       4 (9m1s ago)     4d9h    10.1.153.198      lenovo-01   <none>           <none>
kube-system           calico-node-vnpm5                                  1/1     Running       2 (9m1s ago)     30m     192.168.178.210   lenovo-01   <none>           <none>
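
Instead of grepping the wide output for the node name, the same list can be requested with a field selector on spec.nodeName, which avoids accidental matches in pod names. A minimal sketch follows. The function name pods_on_node and the KUBECTL override are hypothetical and only there to make the example easy to dry-run:

```shell
# List all pods scheduled on one node, across namespaces.
# KUBECTL is a hypothetical override; the default matches the commands above.
pods_on_node() {
    node="$1"
    # Left unquoted on purpose so "microk8s kubectl" splits into two words.
    ${KUBECTL:-microk8s kubectl} get pods --all-namespaces -o wide \
        --field-selector "spec.nodeName=$node"
}
```

Usage: `pods_on_node lenovo-01`.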

Trying to drain the node hung for ~3 minutes until I cancelled it. The same drain completed in ~45 seconds earlier, when the node had not been stopped with microk8s stop, during the upgrade to 1.30.6.

jelle@lenovo-02:~$ microk8s kubectl drain lenovo-01 --delete-emptydir-data --ignore-daemonsets
node/lenovo-01 already cordoned
Warning: ignoring DaemonSet-managed Pods: ingress/nginx-ingress-microk8s-controller-v8tdn, kube-metrics-mqtt/node-metrics-mqtt-jfjqh, kube-system/calico-node-vnpm5
evicting pod argocd/argocd-redis-ha-server-0
evicting pod argocd/argocd-redis-ha-haproxy-94df6548c-hmrk2
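
For reproducing this without cancelling by hand, kubectl drain accepts a --timeout duration after which it gives up. A minimal sketch follows. The function name drain_node, the 120s default, and the KUBECTL override are assumptions made for illustration:

```shell
# Drain a node with the same flags as above, but give up after a time limit.
# KUBECTL is a hypothetical override; the 120s default is an arbitrary choice.
drain_node() {
    node="$1"
    limit="${2:-120s}"
    # Left unquoted on purpose so "microk8s kubectl" splits into two words.
    ${KUBECTL:-microk8s kubectl} drain "$node" \
        --delete-emptydir-data --ignore-daemonsets --timeout="$limit"
}
```

Usage: `drain_node lenovo-01 180s`. A non-zero exit status after the limit would indicate the hang described here.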

After running microk8s start on lenovo-01 again, the draining finished immediately.

I then thought "maybe because my mqtt app is running as a DaemonSet", as I saw these kept running normally after draining a node anyway. However the argocd ReplicaSet, which terminates with a drain, keeps running too after a microk8s stop.

jelle@lenovo-01:~$ microk8s stop
[sudo] password for jelle: 
Stopped.
jelle@lenovo-01:~$ ps aux | grep redis
jelle    4129779  0.5  0.0  34856  9160 ?        Ssl  19:16   0:00 redis-server 0.0.0.0:6379
jelle    4129854  0.6  0.0  32376  7296 ?        Ssl  19:16   0:00 redis-sentinel 0.0.0.0:26379 [sentinel]
jelle    4131552  0.0  0.0   3956  2048 pts/0    S+   19:17   0:00 grep --color=auto redis

jelle@lenovo-01:~$ microk8s status
microk8s is not running, try microk8s start

and the pods are visible from the lenovo-02 node:

jelle@lenovo-02:~$ microk8s kubectl get po -o wide --all-namespaces | grep lenovo-01
argocd                argocd-redis-ha-haproxy-94df6548c-lbcvh            1/1     Running   0                20m     10.1.153.195      lenovo-01   <none>           <none>
argocd                argocd-redis-ha-server-0                           3/3     Running   0                10m     10.1.153.245      lenovo-01   <none>           <none>
ingress               nginx-ingress-microk8s-controller-v8tdn            1/1     Running   19 (10m ago)     30d     10.1.153.194      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running   5 (10m ago)      4d9h    10.1.153.239      lenovo-01   <none>           <none>
kube-system           calico-node-vnpm5                                  1/1     Running   3 (10m ago)      48m     192.168.178.210   lenovo-01   <none>           <none>

jelle@lenovo-02:~$ microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 192.168.178.210:19001 192.168.178.211:19001 192.168.178.212:19001
  datastore standby nodes: none
(clipped away all the other stuff)

jelle@lenovo-02:~$ microk8s kubectl get no -o wide
NAME        STATUS     ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
lenovo-01   NotReady   <none>   38d   v1.30.6   192.168.178.210   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-02   Ready      <none>   32d   v1.30.5   192.168.178.211   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-03   Ready      <none>   32d   v1.30.5   192.168.178.212   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28

After restarting lenovo-01, as before, the pods all terminate and get restarted. Interestingly, the DaemonSet pods truly 'restart' (as registered by the get po command), while the ReplicaSet pods fully terminate and then start again (not showing a "restart").

jelle@lenovo-01:~$ microk8s status
microk8s is not running, try microk8s start
jelle@lenovo-01:~$ microk8s start
jelle@lenovo-01:~$ 
jelle@lenovo-02:~$ microk8s kubectl get po -o wide --all-namespaces | grep lenovo-01
argocd                argocd-redis-ha-haproxy-94df6548c-42dnv            1/1     Running   0                2m17s   10.1.153.201      lenovo-01   <none>           <none>
argocd                argocd-redis-ha-server-0                           1/3     Running   0                24s     10.1.153.250      lenovo-01   <none>           <none>
ingress               nginx-ingress-microk8s-controller-v8tdn            1/1     Running   20 (42s ago)     30d     10.1.153.246      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running   6 (41s ago)      4d9h    10.1.153.197      lenovo-01   <none>           <none>
kube-system           calico-node-vnpm5                                  1/1     Running   4 (42s ago)      52m     192.168.178.210   lenovo-01   <none>           <none>

@jcjveraa

jcjveraa commented Oct 31, 2024

To be honest, I'm not sure how #4710 would do anything for this in my case. It seems to fix conflicting PATHs, but I had never upgraded before today, so I do not have an 'old snap' to conflict with anything; I installed microk8s using the Ubuntu Server install wizard. That's also why I'm on 1.30 instead of 1.31, which I now see had already been out for a while when I first installed.

@gaelgatelement
Author

Not seeing much difference unfortunately after upgrading one node to 1.30.6.

jelle@lenovo-01:~$ microk8s kubectl get no -o wide
NAME        STATUS   ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
lenovo-01   Ready    <none>   38d   v1.30.6   192.168.178.210   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-02   Ready    <none>   32d   v1.30.5   192.168.178.211   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-03   Ready    <none>   32d   v1.30.5   192.168.178.212   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
jelle@lenovo-01:~$ microk8s kubectl get po --all-namespaces -o wide | grep mqtt
kube-metrics-mqtt     node-metrics-mqtt-fwsv9                            1/1     Running   2 (2d22h ago)    4d9h    10.1.125.104      lenovo-03   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running   4 (52s ago)      4d9h    10.1.153.198      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-sqdjh                            1/1     Running   2 (2d22h ago)    4d9h    10.1.190.9        lenovo-02   <none>           <none>
jelle@lenovo-01:~$ ps aux | grep mqtt
root     4105984  1.1  0.1  26844 19964 ?        Ssl  18:55   0:00 python ./kube-metrics-mqtt.py
jelle    4106521  0.0  0.0   3956  2048 pts/1    S+   18:55   0:00 grep --color=auto mqtt
jelle@lenovo-01:~$ microk8s stop
Stopped.
jelle@lenovo-01:~$ sleep 60 && ps aux | grep mqtt
root     4105984  0.2  0.1  26844 19964 ?        Ssl  18:55   0:00 python ./kube-metrics-mqtt.py
jelle    4108591  0.0  0.0   3956  2048 pts/1    S+   18:57   0:00 grep --color=auto mqtt

I tried once before getting this 'pretty' output so that's why the pod reports running for 52 seconds. It restarts once microk8s start is called, but it does not stop after microk8s stop. The same holds for other pods like argocd, but due to runing only 2 replicas and draining the node before upgrading they got scheduled on nodes 02 and 03.

While typing this I got the idea to check from another node what the status is - and it correctly (but to me unexpectedly) reports a few pods still running on "stopped" node lenovo-01.

jelle@lenovo-02:~$ microk8s kubectl get po -o wide --all-namespaces | grep lenovo-01
argocd                argocd-redis-ha-haproxy-94df6548c-hmrk2            1/1     Terminating   2 (9m1s ago)     35m     10.1.153.242      lenovo-01   <none>           <none>
argocd                argocd-redis-ha-server-0                           3/3     Terminating   6 (9m1s ago)     34m     10.1.153.251      lenovo-01   <none>           <none>
ingress               nginx-ingress-microk8s-controller-v8tdn            1/1     Running       18 (9m1s ago)    30d     10.1.153.193      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running       4 (9m1s ago)     4d9h    10.1.153.198      lenovo-01   <none>           <none>
kube-system           calico-node-vnpm5                                  1/1     Running       2 (9m1s ago)     30m     192.168.178.210   lenovo-01   <none>           <none>

Trying to drain the node hung for ~3 minutes until I cancelled it. The same drain completed in ~45 seconds earlier, when I drained the ("not microk8s stop-ped") node while upgrading to 1.30.6.

jelle@lenovo-02:~$ microk8s kubectl drain lenovo-01 --delete-emptydir-data --ignore-daemonsets
node/lenovo-01 already cordoned
Warning: ignoring DaemonSet-managed Pods: ingress/nginx-ingress-microk8s-controller-v8tdn, kube-metrics-mqtt/node-metrics-mqtt-jfjqh, kube-system/calico-node-vnpm5
evicting pod argocd/argocd-redis-ha-server-0
evicting pod argocd/argocd-redis-ha-haproxy-94df6548c-hmrk2

After microk8s start-ing the node on lenovo-01 again, the draining finished immediately.

I then thought "maybe because my mqtt app is running as a DaemonSet", as I saw these kept running normally after draining a node anyway. However the argocd ReplicaSet, which terminates with a drain, keeps running too after a microk8s stop.

jelle@lenovo-01:~$ microk8s stop
[sudo] password for jelle: 
Stopped.
jelle@lenovo-01:~$ ps aux | grep redis
jelle    4129779  0.5  0.0  34856  9160 ?        Ssl  19:16   0:00 redis-server 0.0.0.0:6379
jelle    4129854  0.6  0.0  32376  7296 ?        Ssl  19:16   0:00 redis-sentinel 0.0.0.0:26379 [sentinel]
jelle    4131552  0.0  0.0   3956  2048 pts/0    S+   19:17   0:00 grep --color=auto redis

jelle@lenovo-01:~$ microk8s status
microk8s is not running, try microk8s start

and the pods are visible from the lenovo-02 node:

jelle@lenovo-02:~$ microk8s kubectl get po -o wide --all-namespaces | grep lenovo-01
argocd                argocd-redis-ha-haproxy-94df6548c-lbcvh            1/1     Running   0                20m     10.1.153.195      lenovo-01   <none>           <none>
argocd                argocd-redis-ha-server-0                           3/3     Running   0                10m     10.1.153.245      lenovo-01   <none>           <none>
ingress               nginx-ingress-microk8s-controller-v8tdn            1/1     Running   19 (10m ago)     30d     10.1.153.194      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running   5 (10m ago)      4d9h    10.1.153.239      lenovo-01   <none>           <none>
kube-system           calico-node-vnpm5                                  1/1     Running   3 (10m ago)      48m     192.168.178.210   lenovo-01   <none>           <none>

jelle@lenovo-02:~$ microk8s status
microk8s is running
high-availability: yes
  datastore master nodes: 192.168.178.210:19001 192.168.178.211:19001 192.168.178.212:19001
  datastore standby nodes: none
(clipped away all the other stuff)

jelle@lenovo-02:~$ microk8s kubectl get no -o wide
NAME        STATUS     ROLES    AGE   VERSION   INTERNAL-IP       EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
lenovo-01   NotReady   <none>   38d   v1.30.6   192.168.178.210   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-02   Ready      <none>   32d   v1.30.5   192.168.178.211   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28
lenovo-03   Ready      <none>   32d   v1.30.5   192.168.178.212   <none>        Ubuntu 24.04.1 LTS   6.8.0-47-generic   containerd://1.6.28

After restarting lenovo-01, as before, the pods all terminate and get restarted. Interestingly, the DaemonSet pods truly 'restart' (as registered by the get po command), while the ReplicaSet pods fully terminate and then start again (not showing a "restart").

jelle@lenovo-01:~$ microk8s status
microk8s is not running, try microk8s start
jelle@lenovo-01:~$ microk8s start
jelle@lenovo-01:~$ 
jelle@lenovo-02:~$ microk8s kubectl get po -o wide --all-namespaces | grep lenovo-01
argocd                argocd-redis-ha-haproxy-94df6548c-42dnv            1/1     Running   0                2m17s   10.1.153.201      lenovo-01   <none>           <none>
argocd                argocd-redis-ha-server-0                           1/3     Running   0                24s     10.1.153.250      lenovo-01   <none>           <none>
ingress               nginx-ingress-microk8s-controller-v8tdn            1/1     Running   20 (42s ago)     30d     10.1.153.246      lenovo-01   <none>           <none>
kube-metrics-mqtt     node-metrics-mqtt-jfjqh                            1/1     Running   6 (41s ago)      4d9h    10.1.153.197      lenovo-01   <none>           <none>
kube-system           calico-node-vnpm5                                  1/1     Running   4 (42s ago)      52m     192.168.178.210   lenovo-01   <none>           <none>

This seems very worrisome. I don't have a multi-node cluster handy. I think both #4691 and #3969 should be re-opened as there are clearly still issues with microk8s stop and node draining.

@claudiubelu I'm a bit uncertain about why you can't reproduce the issue we are seeing here. Are you able to give it another try, with the classic version? Is it somehow getting better between v1.30 and v1.31 (though I don't see why it would)?

Also, I'm not sure why the strict microk8s stop version would not stop containers properly, as it's directly calling against each of them :

"${SNAP}/microk8s-ctr.wrapper" container delete --force $container &>/dev/null || true

@claudiubelu
Contributor

Hello,

Addressing things as I go through them.

Got here due to observing that the kubelite process was apparently leaking memory at a rate of 0.1 GB/day. When running microk8s stop, as expected, the kubelite process stopped, but I can still see all services (argocd, calico, ...) running in the background in top.

Just a quick note here. From what I can see, kubelite is basically just an all-in-one Kubernetes process which includes the kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, and kube-proxy, as you can see here: [1]. And... Kubernetes itself is not really a simple process, so some resource consumption is to be expected to a degree. Regarding the leak, it might not be caused by kubelite specifically, since it's quite small, but it can be due to some of the components it contains. Knowing which one, though, might be a bit tricky due to how kubelite bundles them. After a quick search around, I have found an issue in kube-scheduler which might be related to yours: [2]. A fix for something like this was merged in September, so it should be available in v1.32. As for fixing this issue for previous versions, the best approach would be to backport the fix in upstream Kubernetes first, after which I think it will be included in microk8s.

As for minimizing your memory leak issue, there might be something you can do right now, depending on your environment. For example, if you have a multinode cluster, you can have 3 nodes joined as a control plane (for HA), and have any node beyond that join only as a worker node. As can be seen in kubelite [1], the kube-apiserver, kube-controller-manager, and kube-scheduler components are only started if opts.StartControlPlane is set. Worker nodes are going to be much lighter.

[1]

cmd/kubelite/app/options/options.go | 79 +++++++++++++++++++++++++++

[2] kubernetes/kubernetes#122725
[3] kubernetes/kubernetes#126962

The kubelite process and some others were stopped/killed, but that did not stop the containers. I have a DIY memory/cpu resource consumption monitor running (a python script running as a Daemonset) which kept reporting memory consumption, showing a large drop when the kubelite process went down and then showing a slight increase again when I ran microk8s start a minute ago.

Another small note here. Stopping / killing kubelite is unrelated to containerd and its containers. It's an entirely separate, independent process from it. Basically, the kubelet component from kubelite is responsible for setting up all the necessary bits from the Kubernetes side and ultimately delegating the container creation process to the container runtime (containerd); it itself does not create the containers.

Now, going a bit further to the containerd level. Technically, stopping the containerd process will not / should not stop any of the containers, and that's intended, as also mentioned in the original issue message. Interestingly, in my environment, systemctl kill snap.microk8s.daemon-containerd.service --signal=SIGKILL &>/dev/null || true also kills the containers, whether that's intentional or not.

Also, another important note here: if you want to kill all the containers, you have to stop the containerd service, and kill it:

sudo systemctl stop snap.microk8s.daemon-containerd.service
sudo systemctl kill snap.microk8s.daemon-containerd.service --signal=SIGKILL &>/dev/null || true

The reason is simple: if you stop the service, the containers will keep running as previously mentioned; if you only kill the service, due to the Restart=always option in /etc/systemd/system/snap.microk8s.daemon-containerd.service, it will start up once again, and it will start all the containers again. Of course, you can update the service config file to not do that.
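To check the restart policy described above before relying on a plain kill, a minimal sketch follows. It assumes the unit file lives at the path mentioned in this comment and only reads that one file, so drop-in overrides are ignored; the helper name restart_policy is hypothetical. On a live system, `systemctl show -p Restart <unit>` reports the effective value instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Restart= policy found in a systemd unit file.
# If Restart= appears more than once, the last assignment wins; if it is
# absent, print systemd's default policy, "no".
restart_policy() {
  local unit_file="$1"
  awk -F= '
    {
      key = $1
      gsub(/[ \t]/, "", key)          # tolerate "Restart = always"
    }
    key == "Restart" {
      v = $2
      gsub(/^[ \t]+|[ \t]+$/, "", v)  # trim surrounding whitespace
      val = v
    }
    END { print (val != "" ? val : "no") }
  ' "$unit_file"
}
```

Usage, assuming the path from the comment above: `restart_policy /etc/systemd/system/snap.microk8s.daemon-containerd.service`. A result of `always` means a SIGKILL alone will be followed by systemd starting containerd again, which is why the stop has to come first.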

I'm running the snap in --classic mode. This is on microk8s 1.30.5, running on bare metal ubuntu-server 24.04.01.

Happy to provide any requested debugging info.

$ sudo systemctl kill snap.microk8s.daemon-kubelite.service --signal=SIGKILL
Failed to kill unit snap.microk8s.daemon-kubelite.service: Unit snap.microk8s.daemon-kubelite.service not loaded.
$ sudo systemctl kill snap.microk8s.daemon-containerd.service --signal=SIGKILL
Failed to kill unit snap.microk8s.daemon-containerd.service: Unit snap.microk8s.daemon-containerd.service not loaded.

Those are some weird error messages, and it's probably why the kill thing didn't work as expected. At a first glance, I would expect something to be wrong in their systemd files. Have you modified them by any chance? For example, if I have some mistakes in /etc/systemd/system/snap.microk8s.daemon-containerd.service, I get a similar error:

sudo systemctl daemon-reload
sudo systemctl stop snap.microk8s.daemon-containerd.service
Failed to stop snap.microk8s.daemon-containerd.service: Unit snap.microk8s.daemon-containerd.service not loaded.

That error goes away if I fix the mistakes and run sudo systemctl daemon-reload.

I tried once before getting this 'pretty' output, so that's why the pod reports running for 52 seconds. It restarts once microk8s start is called, but it does not stop after microk8s stop. The same holds for other pods like argocd, but because they run only 2 replicas and I drained the node before upgrading, they got scheduled on nodes 02 and 03.

While typing this I got the idea to check from another node what the status is - and it correctly (but to me unexpectedly) reports a few pods still running on "stopped" node lenovo-01.

Hmm, a bit of a wild guess here, but are you just checking whether a Pod is Running or not solely on the output of Kubernetes itself? If so, that may not be entirely correct. Let's test this out. I have a 2 node environment here:

microk8s kubectl get nodes
NAME      STATUS   ROLES    AGE   VERSION
ubuntu    Ready    <none>   48m   v1.30.6
ubuntu2   Ready    <none>   7s    v1.30.6

They're in --classic mode, installed with: sudo snap install microk8s --classic --channel=1.30. And I have everything running:

microk8s kubectl get pods -A -o wide
NAMESPACE     NAME                                      READY   STATUS    RESTARTS      AGE     IP              NODE      NOMINATED NODE   READINESS GATES
default       nginx-deployment-77d8468669-26j5x         1/1     Running   1 (81s ago)   4m12s   10.1.152.66     ubuntu2   <none>           <none>
default       nginx-deployment-77d8468669-rh842         1/1     Running   0             4m12s   10.1.243.197    ubuntu    <none>           <none>
default       nginx-deployment-77d8468669-wlb6s         1/1     Running   0             4m12s   10.1.243.198    ubuntu    <none>           <none>
kube-system   calico-kube-controllers-796fb75cc-227w2   1/1     Running   1 (93m ago)   129m    10.1.243.195    ubuntu    <none>           <none>
kube-system   calico-node-kdk97                         1/1     Running   2 (81s ago)   80m     172.17.195.72   ubuntu2   <none>           <none>
kube-system   calico-node-ppq74                         1/1     Running   0             81m     172.17.193.79   ubuntu    <none>           <none>
kube-system   coredns-5986966c54-h6cvk                  1/1     Running   1 (93m ago)   129m    10.1.243.196    ubuntu    <none>           <none>

The output above also includes an nginx deployment, which was created with the following command:

microk8s kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml

Now, on ubuntu2, I'm going to stop microk8s, and check that Kubernetes still reports the Pod as Running:

# on node 2
sudo microk8s stop

# on node 1
microk8s kubectl get nodes
NAME      STATUS   ROLES    AGE    VERSION
ubuntu    Ready    <none>   130m   v1.30.6
ubuntu2   Ready    <none>   81m    v1.30.6


microk8s kubectl get pods -A -o wide
NAMESPACE     NAME                                      READY   STATUS    RESTARTS        AGE    IP              NODE      NOMINATED NODE   READINESS GATES
default       nginx-deployment-77d8468669-26j5x         1/1     Running   1 (2m11s ago)   5m2s   10.1.152.66     ubuntu2   <none>           <none>
default       nginx-deployment-77d8468669-rh842         1/1     Running   0               5m2s   10.1.243.197    ubuntu    <none>           <none>
default       nginx-deployment-77d8468669-wlb6s         1/1     Running   0               5m2s   10.1.243.198    ubuntu    <none>           <none>
kube-system   calico-kube-controllers-796fb75cc-227w2   1/1     Running   1 (94m ago)     130m   10.1.243.195    ubuntu    <none>           <none>
kube-system   calico-node-kdk97                         1/1     Running   2 (2m11s ago)   81m    172.17.195.72   ubuntu2   <none>           <none>
kube-system   calico-node-ppq74                         1/1     Running   0               82m    172.17.193.79   ubuntu    <none>           <none>
kube-system   coredns-5986966c54-h6cvk                  1/1     Running   1 (94m ago)     130m   10.1.243.196    ubuntu    <none>           <none>

This happens because Kubernetes reports what it thinks the state of the world is, and relies on services such as kubelet to report to it whether or not things are still as expected. This also includes the liveness probes Pods may have, or even the kubelets themselves reporting that they're still alive. After a while, because ubuntu2's kubelet does not report its state anymore, Kubernetes will consider that node as NotReady... but the Pods may still appear to be Running... even after a long time has passed since the node went down. Fortunately, workloads are still relocated to different nodes:

microk8s kubectl get nodes
NAME      STATUS     ROLES    AGE    VERSION
ubuntu    Ready      <none>   133m   v1.30.6
ubuntu2   NotReady   <none>   85m    v1.30.6

microk8s kubectl get pods -A -o wide
NAMESPACE     NAME                                      READY   STATUS        RESTARTS       AGE    IP              NODE      NOMINATED NODE   READINESS GATES
default       nginx-deployment-77d8468669-26j5x         1/1     Terminating   1 (54m ago)    57m    10.1.152.66     ubuntu2   <none>           <none>
default       nginx-deployment-77d8468669-4sw9r         1/1     Running       0              46m    10.1.243.199    ubuntu    <none>           <none>
default       nginx-deployment-77d8468669-rh842         1/1     Running       0              57m    10.1.243.197    ubuntu    <none>           <none>
default       nginx-deployment-77d8468669-wlb6s         1/1     Running       0              57m    10.1.243.198    ubuntu    <none>           <none>
kube-system   calico-kube-controllers-796fb75cc-227w2   1/1     Running       1 (146m ago)   3h2m   10.1.243.195    ubuntu    <none>           <none>
kube-system   calico-node-kdk97                         1/1     Running       2 (54m ago)    133m   172.17.195.72   ubuntu2   <none>           <none>
kube-system   calico-node-ppq74                         1/1     Running       0              134m   172.17.193.79   ubuntu    <none>           <none>
kube-system   coredns-5986966c54-h6cvk                  1/1     Running       1 (146m ago)   3h2m   10.1.243.196    ubuntu    <none>           <none>

As you can see, there's a new nginx Pod running on a different node, and the old Pod is stuck in a Terminating state (there's no kubelet on that node to resolve this state and clean it up).
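Since the Pod status for a NotReady node is only the last state its kubelet reported, it can help to flag those Pods explicitly. The following is a rough sketch, not part of microk8s: the function name pods_on_notready_nodes is hypothetical, and it assumes the default column layout of `kubectl get nodes` and `kubectl get pods -A -o wide` with header lines present. The NODE column is located by counting from the end of the line, because the RESTARTS column may contain spaces (for example "1 (54m ago)").

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Pods scheduled on nodes whose STATUS is not Ready.
#   $1: file holding the output of `kubectl get nodes`
#   $2: file holding the output of `kubectl get pods -A -o wide`
# Prints "namespace/pod status node" for each match.
pods_on_notready_nodes() {
  local nodes_file="$1" pods_file="$2"
  awk '
    FNR == 1 { next }                       # skip the header of each file
    NR == FNR {                             # first file: node list
      if ($2 !~ /^Ready/) notready[$1] = 1
      next
    }
    NF >= 3 && ($(NF-2) in notready) {      # second file: Pod list
      print $1 "/" $2, $4, $(NF-2)
    }
  ' "$nodes_file" "$pods_file"
}
```

Usage with process substitution: `pods_on_notready_nodes <(microk8s kubectl get nodes) <(microk8s kubectl get pods -A -o wide)`. Any Pod it prints should be treated as "last known state", and checked on the node itself.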

On the ubuntu2 node, there's no containerd / containerd-shim processes running as expected... but indeed, you're right, the workloads running in the containers are still running, which they shouldn't.

ubuntu@ubuntu2:~$ ps aux | grep nginx
root       59467  0.0  0.0  32636  5760 ?        Ss   14:58   0:00 nginx: master process nginx -g daemon off;
message+   59479  0.0  0.0  33084  2664 ?        S    14:58   0:00 nginx: worker process
ubuntu    103151  0.0  0.0   6544  2304 pts/0    S+   15:55   0:00 grep --color=auto nginx
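The manual `ps aux | grep` check above also matches the grep process itself. A small filter can make the check scriptable; this is only a sketch, the name leftover_procs is hypothetical, and it assumes the standard 11-column `ps aux` layout where the command starts at field 11.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `ps aux`-style lines on stdin and print
# "PID COMMAND" for every process whose command matches the given
# extended regex. The header line and grep processes are skipped.
leftover_procs() {
  local pattern="$1"
  awk -v pat="$pattern" '
    NR == 1 && $1 == "USER" { next }
    {
      cmd = ""
      for (i = 11; i <= NF; i++) cmd = cmd (i > 11 ? " " : "") $i
      if (cmd ~ /^grep /) next
      if (cmd ~ pat) print $2, cmd
    }
  '
}
```

Usage after a stop: `ps aux | leftover_procs 'nginx|containerd-shim'`. Empty output means no matching workload or shim processes are left on the node.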

This issue could probably be resolved by the proposal made in this PR (I'd have to test), but this PR would instead leak containerd and containerd-shims as previously mentioned, which isn't great either. So, here's an idea: how about we call both kill_all_container_shims and remove_all_containers? That should solve both aspects of this issue. Wdyt?

Trying to drain the node hung for ~3 minutes until I cancelled it. The same drain completed in ~45 seconds earlier, when I drained the ("not microk8s stop-ped") node while upgrading to 1.30.6.

After microk8s start-ing the node on lenovo-01 again, the draining finished immediately.

I might not be understanding this properly, it isn't clear to me: were you trying to drain a node which was stopped? (with microk8s stop). It seems so, since you're mentioning that you've started the node again and it drained immediately. If so, that's perfectly normal: there's no kubelet service (part of kubelite) running on that node to resolve the draining process, similarly to how we've seen my nginx Pod above stuck in Terminating.

I then thought "maybe because my mqtt app is running as a DaemonSet", as I saw these kept running normally after draining a node anyway. However the argocd ReplicaSet, which terminates with a drain, keeps running too after a microk8s stop.

Ye, daemonsets are not drained / evicted, since by definition they're meant to run as one instance per node that respects the scheduling limits imposed on it (labels, taints, tolerations).

After restarting lenovo-01, as before, the pods all terminate and get restarted. Interestingly, the DaemonSet pods truly 'restart' (as registered by the get po command), while the ReplicaSet pods fully terminate and then start again (not showing a "restart").

That's because they haven't been restarted; they've been recreated instead. That's why you'll see that those Pods are now young (their age is 2m17s and 24s respectively). And that's pretty normal. Pods / containers (or more precisely, "stateless" containers) are typically meant to be ephemeral in the first place, so losing one and getting another to replace it should be a trivial matter with no consequence. It may be a different story for StatefulSets though.

This seems very worrisome. I don't have a multi-node cluster handy. I think both #4691 and #3969 should be re-opened as there are clearly still issues with microk8s stop and node draining.

@claudiubelu I'm a bit uncertain about why you can't reproduce the issue we are seeing here. Are you able to give it another try, with the classic version? Is it somehow getting better between v1.30 and v1.31 (though I don't see why it would)?

Also, I'm not sure why the strict microk8s stop version would not stop containers properly, as it's directly calling against each of them :

"${SNAP}/microk8s-ctr.wrapper" container delete --force $container &>/dev/null || true

I only checked against containerd and containerd-shims, and did not include the binaries executed in the containers in my analysis. That's my bad.

Will try a variant in which both kill_all_container_shims and remove_all_containers are called when needed. That should solve both issues.

Best regards,

Claudiu Belu

@jcjveraa

@claudiubelu Thanks, learned a few things from your extensive response :-) I've been using K8s (OpenShift specifically) professionally as a developer for 6 months now, and only recently started experimenting with a local cluster to understand more of what's under the hood. I was not aware that issuing e.g. kubectl get pods essentially triggers a 'cached response', rather than querying the live state of the API server on the node (and then of course discovering the node to be offline/NotReady). That explains a lot of my unexpected findings.

Regardless, I take it that we agree that issuing microk8s stop should not just stop the kubernetes processes, but also stop all pods similar to draining a node. Hope that your solution works!

@claudiubelu
Contributor

Hello,

I've tried the following changes (on top of this PR): claudiubelu@3e2aba6, and I've built microk8s classic snap and tried it:

sudo snap install microk8s_*_amd64.snap --classic --dangerous
microk8s v1.31.2 installed

microk8s kubectl apply -f https://k8s.io/examples/controllers/nginx-deployment.yaml
deployment.apps/nginx-deployment created

microk8s kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
default       nginx-deployment-d556bf558-48vtm           1/1     Running   0          93s
default       nginx-deployment-d556bf558-gtgvt           1/1     Running   0          93s
default       nginx-deployment-d556bf558-xx2hp           1/1     Running   0          93s
kube-system   calico-kube-controllers-7fbd86d5c5-vgncc   1/1     Running   0          93s
kube-system   calico-node-6tw57                          1/1     Running   0          93s
kube-system   coredns-7896dbf49-7mtd4                    1/1     Running   0          93s

ps aux | grep nginx
root     1866993  0.0  0.0  32636  5632 ?        Ss   08:20   0:00 nginx: master process nginx -g daemon off;
root     1867019  0.0  0.0  32636  5760 ?        Ss   08:20   0:00 nginx: master process nginx -g daemon off;
message+ 1867040  0.0  0.0  33084  2408 ?        S    08:20   0:00 nginx: worker process
root     1867042  0.0  0.0  32636  5760 ?        Ss   08:20   0:00 nginx: master process nginx -g daemon off;
message+ 1867053  0.0  0.0  33084  2664 ?        S    08:20   0:00 nginx: worker process
message+ 1867060  0.0  0.0  33084  2792 ?        S    08:20   0:00 nginx: worker process
ubuntu   1868092  0.0  0.0   6544  2176 pts/4    S+   08:21   0:00 grep --color=auto nginx

ps aux | grep containerd
root     1862500 16.6  0.6 2448184 56384 ?       Ssl  08:19   0:23 /snap/microk8s/x1/bin/containerd --config /var/snap/microk8s/x1/args/containerd.toml --root /var/snap/microk8s/common/var/lib/containerd --state /var/snap/microk8s/common/run/containerd --address /var/snap/microk8s/common/run/containerd.sock
root     1864174  0.2  0.1 1236108 12440 ?       Sl   08:19   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 5eb9de11636de940337f37e187f307e0315e088cd794a86020a5e3b8c9874b25 -address /var/snap/microk8s/common/run/containerd.sock
root     1866075  0.0  0.1 1236364 11232 ?       Sl   08:20   0:00 /snap/microk8s/x1/bin/containerd-shim-runc-v2 -namespace k8s.io -id 22450486889deeb3a7ba38752206504dc295587c04c274aff348155d79b583dc -address /var/snap/microk8s/common/run/containerd.sock
...

# Test stop and start.
sudo microk8s stop
Stopped.
Stopped.

ps aux | grep nginx
ubuntu   1870373  0.0  0.0   6544  2304 pts/4    S+   08:21   0:00 grep --color=auto nginx

ps aux | grep containerd
ubuntu   1870375  0.0  0.0   6544  2304 pts/4    S+   08:21   0:00 grep --color=auto containerd

microk8s start
microk8s kubectl get pods -A
NAMESPACE     NAME                                       READY   STATUS    RESTARTS        AGE
default       nginx-deployment-d556bf558-48vtm           1/1     Running   1 (3m29s ago)   5m36s
default       nginx-deployment-d556bf558-gtgvt           1/1     Running   1 (3m29s ago)   5m36s
default       nginx-deployment-d556bf558-xx2hp           1/1     Running   1 (3m30s ago)   5m36s
kube-system   calico-kube-controllers-7fbd86d5c5-vgncc   1/1     Running   1 (3m28s ago)   5m36s
kube-system   calico-node-6tw57                          1/1     Running   1 (3m28s ago)   5m36s
kube-system   coredns-7896dbf49-7mtd4                    1/1     Running   1 (3m29s ago)   5m36s

# Remove it, we shouldn't have any containers left.
sudo snap remove --purge microk8s
microk8s removed

ps aux | grep nginx
ubuntu   2297198  0.0  0.0   6544  2304 pts/4    S+   14:44   0:00 grep --color=auto nginx

ubuntu@ubuntu:~/workdir/microk8s$ ps aux | grep containerd
ubuntu   2300428  0.0  0.0   6544  2304 pts/4    S+   17:09   0:00 grep --color=auto containerd

There were a few more changes than I originally thought, but it seems that there are no more leaked containers / containerd shims with the changes I've mentioned above. The change also contains an additional test which checks that microk8s stop does indeed remove the containers as well. @gaelgatelement, could you cherry-pick it onto this PR, and also sign the CLA? The CI should pass with it as well.

Even with my changes, on upgrade (snap refresh), the previous containers will still continue running. I haven't treated the upgrade scenario, and I don't think we should. In the best case scenario, an upgrade should not impact the workload Pods / containers and their uptime, thanks to the fact that containerd dying does not lead to the containers dying as well. Changing that behaviour may be undesirable for some people, without giving them an alternative to let them keep their uptime. Instead, if someone has some concerns before upgrading, they may choose to run microk8s stop beforehand, and the microk8s-related containers / container shims will be stopped.

@claudiubelu Thanks, learned a few things from your extensive response :-) I've been using K8s (OpenShift specifically) professionally as a developer for 6 months now, and only recently started experimenting with a local cluster to understand more of what's under the hood. I was not aware that issuing e.g. kubectl get pods essentially triggers a 'cached response', rather than querying the live state of the API server on the node (and then of course discovering the node to be offline/NotReady). That explains a lot of my unexpected findings.

You're welcome. :) Regarding the 'cached response' part... I agree, that can be misleading at first, but it's a reasonable implementation if you think about it: you can easily have hundreds of Pods deployed even in a small cluster, and checking the actual state of each and every one of them (plus a reasonable timeout for each Pod) would make something like kubectl get all -A take forever, making it unusable. Plus, it can happen that a node is "down" (in a NotReady state) while its Pods are still running. Kubernetes is designed to support up to 5k nodes and 150k Pods (https://kubernetes.io/docs/setup/best-practices/cluster-large/)... it has to scale. :)

Best regards,

Claudiu Belu
