Docker container of the cloned task crashes/stucks. #189

Open
MattBlue92 opened this issue Feb 21, 2024 · 13 comments

Comments

@MattBlue92

Hello everyone,
I installed ClearML Server and ClearML Agent locally with Docker on an Ubuntu Linux system, following the documentation guide. My problem is with cloning a task after its code has been written and logged. If I clone a task and assign it to a queue that uses virtual environment mode for execution, the clone executes all the code correctly. However, if I clone a task and assign it to a queue that uses Docker for execution, the container starts and downloads packages but does not execute the task code. Where am I going wrong?

PS: To be clear, the cloned task's container does not crash or die: it is still possible to enter it with docker exec -it id_container /bin/bash. So it is as if ClearML were merely creating the container.

@MattBlue92 MattBlue92 changed the title Docker container of the cloned task crashes. Docker container of the cloned task crashes/stucks. Feb 21, 2024
@blinor

blinor commented Apr 17, 2024

Hey there,
I have the exact same problem. It installs everything with apt and pip and then stops working. Running the same command directly in the Docker container doesn't work either.

@jkhenning
Member

Hi,

Can you include a full log of the task execution?

@blinor

blinor commented Apr 18, 2024

Thanks for the quick response.
I started the daemon with:
clearml-agent daemon --queue "4gb" --docker clearml/fractional-gpu:u22-cu11.7-4gb --force-current-version
This is my clearml.conf on the server:

agent {
    # Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
    git_user:"XXX"
    git_pass:"XXX"
    # all other domains will use public access (no user/pass). Default: always send user/pass for any VCS domain
    git_host:""
    package_manager: {
        type: pip,
        pip_version: [""]
        pytorch_resolve: none
        extra_pip_install_flags: ["--user"]
        extra_index_url: ["XXX"]
    }
    # Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
    force_git_ssh_protocol: false

    # unique name of this worker, if None, created based on hostname:process_id
    # Overridden with os environment: CLEARML_WORKER_NAME
    worker_id: ""
    docker_use_activated_venv: false
    extra_docker_arguments: ["--pid=host","-e","http_proxy=XXX", "-e","https_proxy=XXX"]
}

I have tried changing more or less every parameter used in this config.

Besides, I tried a completely new setup on a different computer with the default config and got the same result. I also tried an older version of the agent (1.6), but that didn't work either.
log_txt.txt

@jkhenning
Member

It looks like the execution inside the container cannot reach the ClearML Server. Can you add -e CLEARML_AGENT__AGENT__DEBUG=1 to the task's container arguments (in the execution section) and see if you get more logs from the agent? Also, if you can exec into the container, you can check the contents of the clearml.conf file mapped inside it; this might provide some clues.

@blinor

blinor commented Apr 18, 2024

Sure thing.
default_conf.txt
task_33470eb457b94a578784674546d3d397.log
It doesn't seem to me that the logs are any different, and the default_conf also looks correct.

My first thought was that a proxy setting was causing the problem, but on a different machine without any proxies the logs and the problem are the same.

@jkhenning
Member

jkhenning commented Apr 18, 2024

"api_server": "http://localhost:8008"

Is this reachable from inside the container? It seems to me this won't resolve to anything...
Try adding --ipc=host to the task container arguments (it would make more sense to put this in the agent's default docker extra args)
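
The reachability question above can be checked from inside the container (for example after entering it with docker exec, as described in the first comment). The following is a minimal sketch, assuming Python is available in the container; the helper name `can_reach` is hypothetical and not part of ClearML. Note that inside a container `localhost` normally refers to the container itself, not to the Docker host, unless the container shares the host's network.

```python
import socket
from urllib.parse import urlparse


def can_reach(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    Hypothetical helper for illustration; it only tests that the port
    accepts connections, not that a ClearML API server answers on it.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return False
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # The api_server value quoted in the comment above.
    url = "http://localhost:8008"
    print(f"{url} reachable: {can_reach(url)}")
```

Running the same script once on the host and once inside the container shows whether the address is only valid on the host side.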

@blinor

blinor commented Apr 18, 2024

You are correct, I can't reach http://localhost:8008. Which setting do you mean by task container arguments? Is this agent.extra_docker_arguments?

@jkhenning
Member

Yes, that would work

@blinor

blinor commented Apr 18, 2024

Sadly I still cannot reach the API.

@jkhenning
Member

Where is the server running?

@blinor

blinor commented Apr 22, 2024

The server runs on the same machine from which I try to execute my task. I also tried it on both Windows and Linux.

@blinor

blinor commented Apr 22, 2024

Small update: I didn't change anything, but I tried again to start an agent in Docker mode and got different output. I now get the following output before nothing more happens:
`
Installing collected packages: distlib, zipp, urllib3, six, rpds-py, PyYAML, pyparsing, pyjwt, psutil, platformdirs, pkgutil-resolve-name, idna, filelock, charset-normalizer, certifi, attrs, virtualenv, requests, referencing, python-dateutil, pathlib2, orderedmultidict, importlib-resources, jsonschema-specifications, furl, jsonschema, clearml-agent

1713784929586 DLB1:gpu1 DEBUG Successfully installed PyYAML-6.0.1 attrs-23.2.0 certifi-2024.2.2 charset-normalizer-3.3.2 clearml-agent-1.8.0 distlib-0.3.8 filelock-3.13.4 furl-2.1.3 idna-3.7 importlib-resources-6.4.0 jsonschema-4.21.1 jsonschema-specifications-2023.12.1 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 pkgutil-resolve-name-1.3.10 platformdirs-4.2.0 psutil-5.9.8 pyjwt-2.8.0 pyparsing-3.1.2 python-dateutil-2.8.2 referencing-0.34.0 requests-2.31.0 rpds-py-0.18.0 six-1.16.0 urllib3-1.26.18 virtualenv-20.25.3 zipp-3.18.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

The following additional packages will be installed:
libblas3 libfreetype6 libgfortran5 libimagequant0 libjbig0 libjpeg-turbo8
libjpeg8 liblapack3 liblbfgsb0 liblcms2-2 libpng16-16 libtiff5 libwebp6
libwebpdemux2 libwebpmux3 python3-decorator python3-numpy python3-olefile
python3-pil
Suggested packages:
liblcms2-utils gfortran python-numpy-doc python3-pytest python3-numpy-dbg
python-pil-doc python3-pil-dbg python-scipy-doc
The following NEW packages will be installed:
libblas3 libfreetype6 libgfortran5 libimagequant0 libjbig0 libjpeg-turbo8
libjpeg8 liblapack3 liblbfgsb0 liblcms2-2 libpng16-16 libtiff5 libwebp6
libwebpdemux2 libwebpmux3 python3-decorator python3-numpy python3-olefile
python3-pil python3-scipy
0 upgraded, 20 newly installed, 0 to remove and 32 not upgraded.
Need to get 18.5 MB of archives.
After this operation, 77.4 MB of additional disk space will be used.
Do you want to continue? [Y/n]

`
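
The log above ends at apt's confirmation prompt, so one possible reading is that the install is waiting for interactive input that never arrives. The thread does not say whether this install is issued by the agent or by a user-supplied setup script, so the following is only a sketch of how an apt-get install can be made non-interactive wherever that command is under your control. The helper name `build_apt_install` is hypothetical; the `-y` flag and the `DEBIAN_FRONTEND` variable are standard apt/debconf options.

```python
import os
from typing import Dict, List, Tuple


def build_apt_install(packages: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Build a non-interactive apt-get install command and its environment.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    # -y answers "yes" to the "Do you want to continue? [Y/n]" prompt.
    cmd = ["apt-get", "install", "-y", *packages]
    # DEBIAN_FRONTEND=noninteractive suppresses debconf configuration dialogs.
    env = dict(os.environ, DEBIAN_FRONTEND="noninteractive")
    return cmd, env


if __name__ == "__main__":
    # python3-scipy is the package named in the log above.
    cmd, env = build_apt_install(["python3-scipy"])
    print(" ".join(cmd))
```

The returned pair can be passed to `subprocess.run(cmd, env=env, check=True)` inside a Debian or Ubuntu container running as root.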

@rdyzakya

rdyzakya commented Jan 2, 2025

I have a similar issue. I then tried running the ClearML server and the ClearML agent on different machines, and somehow it works.
