docker checkpoint/restore is slow #2519

Open
hanwen-flow opened this issue Nov 18, 2024 · 8 comments

@hanwen-flow

Description

Checkpoint/restore inside docker is slow.

(Apologies if this is the wrong place to report, but even a closed bug report about this would have saved me quite some time.)

Steps to reproduce the issue:

I wrote a program that allocates memory (here) and tried to checkpoint/restore it both when running directly on the host and when running inside a Docker container.
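
For readers without access to the linked program, a minimal stand-in might look like the sketch below. It is not the program used for the measurements; the function name `allocate_heap` and the 4096-byte page size are assumptions made for illustration. It writes to every page because memory that is allocated but never touched may not be backed by physical pages, which would make a dump look unrealistically small.

```python
def allocate_heap(size_bytes: int, page_size: int = 4096) -> bytearray:
    """Allocate size_bytes and write one byte per page so the memory is resident.

    Hypothetical stand-in for the reporter's test program; page_size is an
    assumed value, not something read from the system.
    """
    buf = bytearray(size_bytes)
    # Touch the first byte of every page so the kernel has to back it.
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1
    return buf


# Example use (commented out so importing this sketch allocates nothing):
#   heap = allocate_heap(5 * 1024**3)   # roughly the 5 GB heap from the report
#   input("heap allocated; checkpoint this process now")
```

The process has to stay alive while the checkpoint is taken, hence the blocking `input` call in the commented example.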

Dumping the program with a 5 GB heap, running directly on Linux, took about 5 seconds (1 GB/s).
Restoring it took about 1 second.

Dumping the same program running in a Docker container with docker checkpoint create took about 40 seconds; restoring the checkpoint took 20 seconds.
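
To put the numbers above side by side, here is a small sketch of the arithmetic. The timings are the approximate values from this report, so the results are rough; the dictionary keys are just labels chosen for this example.

```python
def throughput_gb_per_s(size_gb: float, seconds: float) -> float:
    """Effective throughput for moving size_gb in the given wall-clock time."""
    return size_gb / seconds


HEAP_GB = 5.0  # heap size from the report

# Approximate wall-clock timings from the report, in seconds.
timings = {
    "criu dump": 5.0,
    "criu restore": 1.0,
    "docker checkpoint create": 40.0,
    "docker restore": 20.0,
}

rates = {name: throughput_gb_per_s(HEAP_GB, secs) for name, secs in timings.items()}

# How many times slower the Docker path is than plain CRIU.
dump_slowdown = timings["docker checkpoint create"] / timings["criu dump"]
restore_slowdown = timings["docker restore"] / timings["criu restore"]

for name, rate in rates.items():
    print(f"{name}: {rate:.3f} GB/s")
print(f"dump slowdown: {dump_slowdown:.0f}x, restore slowdown: {restore_slowdown:.0f}x")
```

With these figures the Docker dump is roughly 8 times slower than the plain CRIU dump, and the Docker restore roughly 20 times slower than the plain CRIU restore.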

CRIU logs and information:

There seem to be no logs under
/var/lib/docker/containers/$containerID/checkpoints/$cpID
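
One way to double-check what a checkpoint directory contains is to walk it and list every file with its size. The sketch below assumes that log files, if any, end in `.log`; that suffix and both function names are assumptions for illustration, and reading under /var/lib/docker normally requires root.

```python
import os


def list_checkpoint_files(checkpoint_dir: str) -> list:
    """Return sorted (relative_path, size_in_bytes) pairs for every file found."""
    entries = []
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, checkpoint_dir)
            entries.append((rel, os.path.getsize(path)))
    return sorted(entries)


def find_logs(entries: list) -> list:
    """Pick out entries that look like log files (assumed '.log' suffix)."""
    return [rel for rel, _size in entries if rel.endswith(".log")]


# Example use, with the placeholders from the path above filled in by hand:
#   entries = list_checkpoint_files(
#       "/var/lib/docker/containers/<containerID>/checkpoints/<cpID>")
#   print(find_logs(entries) or "no log files found")
```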

version info:

$ docker version
Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:41:00 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.3.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.7
  Git commit:       41ca978
  Built:            Fri Sep 20 11:41:00 2024
  OS/Arch:          linux/amd64
  Experimental:     true
 containerd:
  Version:          1.7.22
  GitCommit:        7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc:
  Version:          1.1.14
  GitCommit:        v1.1.14-0-g2c9f560
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ criu --version
Version: 3.18
GitID: v3.18-320-gdfb56eed6

$ uname -a 
Linux hanwen-flow 6.8.0-47-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct  2 16:16:55 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

$ sudo criu check --all
sudo: mon_handle_sigchld: waitpid: No child processes
Looks good.
@adrianreber
Member

Sounds like you already made the necessary measurements without Docker, so it does look like a Docker problem, although it is not obvious why it would be so much slower.

Have you tried it with Podman? Just as an additional data point.

@rst0git
Member

rst0git commented Nov 18, 2024

@hanwen-flow I was able to replicate these results locally. It looks like docker checkpoint create is slow because it uses containerd to create an OCI image. This image includes a tar archive that contains the CRIU images. Docker then extracts the archive into the checkpoint directory and deletes the image. As a result, the checkpoint data is copied multiple times, which explains the slow performance.
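
The copy pattern described above can be illustrated with a toy model. This is not Docker's or containerd's code: it only mimics the three writes (dump to disk, pack into a tar archive, extract into a final directory) using Python's standard library, and the file name `pages-1.img` and the function name are chosen for illustration. The real path may copy the data more or fewer times.

```python
import os
import shutil
import tarfile
import tempfile


def simulate_checkpoint_copies(payload_size: int) -> dict:
    """Count bytes written when a dump is archived and then extracted again."""
    with tempfile.TemporaryDirectory() as work:
        dump_dir = os.path.join(work, "dump")
        final_dir = os.path.join(work, "final")
        os.mkdir(dump_dir)
        os.mkdir(final_dir)

        # Write 1: the dump itself (stand-in for the CRIU image files).
        image_path = os.path.join(dump_dir, "pages-1.img")
        with open(image_path, "wb") as f:
            f.write(os.urandom(payload_size))

        # Write 2: pack the dump into a tar archive (stand-in for the image layer).
        archive = os.path.join(work, "layer.tar")
        with tarfile.open(archive, "w") as tar:
            tar.add(image_path, arcname="pages-1.img")
        archive_size = os.path.getsize(archive)

        # Write 3: extract the archive into the final checkpoint directory.
        with tarfile.open(archive, "r") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                dest = os.path.join(final_dir, os.path.basename(member.name))
                with tar.extractfile(member) as src, open(dest, "wb") as out:
                    shutil.copyfileobj(src, out)

        # The intermediate archive is deleted once the data has been extracted.
        os.remove(archive)
        extracted = sum(
            os.path.getsize(os.path.join(final_dir, name))
            for name in os.listdir(final_dir)
        )

    total = payload_size + archive_size + extracted
    return {
        "payload_bytes": payload_size,
        "archive_bytes": archive_size,
        "extracted_bytes": extracted,
        "bytes_written": total,
        "write_amplification": total / payload_size,
    }
```

In this model every byte of checkpoint data is written at least three times, and each of the later writes also requires reading the previous copy back, so a slowdown of several times over a plain dump is plausible even before any other overhead.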

> Have you tried it with Podman?

Checkpoint/restore with Podman is significantly faster.

@hanwen-flow
Author

@rst0git thanks for the analysis. I will look into Podman. However, we run a SaaS company, and it is not clear whether we can push this change onto our customers.

@adrianreber
Member

The main reason to try Podman is to see whether this is a Docker problem or a CRIU problem. If it is just a Docker problem, you can provide a patch to Docker and fix it there.

@rst0git
Member

rst0git commented Nov 19, 2024

@adrianreber I believe this problem is related to the migration to the v2 shim (moby/moby#41546), and the implementation is similar to the CheckpointContainer function introduced with containerd/containerd#6965. I am not sure if there is an easy way to fix it.

@hanwen-flow
Author

FYI, I've been toying with Podman. While the CRIU part of it is plenty fast, the way the rootfs diff is handled seems clumsy and somewhat slow. I'll open a separate issue with Podman.

@hanwen-flow
Author

containers/podman#24826


A friendly reminder that this issue had no activity for 30 days.

3 participants