When the libfabric mlx provider is used together with UCX 1.8.1, it fails with the following error if the processes run in different user namespaces (using Singularity). A similar issue was addressed in #4511 for OpenMPI, and I wonder whether the libfabric interface calls a different function that requires the same fix to disable CMA when user namespaces differ.
Setting either UCX_POSIX_USE_PROC_LINK=n or UCX_TLS=tcp,self provides a workaround.
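For anyone who wants to try the workaround, here is a minimal sketch of setting it in the launching shell. Only the two variable settings come from this report; the commented-out launch line (image name, application name, process count) is a hypothetical placeholder, and whether the variables reach the ranks depends on how your launcher and container runtime forward the environment.

```shell
# Option 1: keep the UCX shared-memory transport, but stop it from opening
# the peer's segment through /proc/<pid>/fd/<n>.
export UCX_POSIX_USE_PROC_LINK=n

# Option 2 (alternative): restrict UCX to the TCP and loopback transports.
# This avoids shared memory entirely, so expect lower performance.
# export UCX_TLS=tcp,self

# Hypothetical launch line, adjust to your own setup:
# mpirun -np 2 singularity exec image.sif ./app
```

Option 1 is the narrower change, since it leaves the other transports untouched.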
OS RHEL 8.1
[0] MPI startup(): libfabric version: 1.10.0a1-impi
[0] MPI startup(): libfabric provider: mlx
[0] MPI startup(): detected mlx provider, set device name to "mlx"
[0] MPI startup(): max_ch4_vcis: 1, max_reg_eps 1, enable_sep 0, enable_shared_ctxs 0, do_av_insert 1
[0] MPI startup(): addrname_len: 512, addrname_firstlen: 512
[1597347557.520494] [r2i7n3:64774:0] mm_posix.c:195 UCX ERROR open(file_name=/proc/64773/fd/21 flags=0x0) failed: Permission denied
[1597347557.520503] [r2i7n3:64774:0] mm_ep.c:149 UCX ERROR mm ep failed to connect to remote FIFO id 0xc00000054000fd05: Shared memory error
@jappa the PROC_LINK method should be disabled for a non-default PID namespace, but currently we don't check the USER namespace.
This is a missing feature in UCX shared memory for containers: "support different user namespaces"
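To check whether two ranks really sit in different namespaces, one can compare the `/proc/<pid>/ns/<type>` links, which point at the same target when two processes share a namespace. The helper below is only a diagnostic sketch: `same_ns` is a made-up name, and this is not the check UCX performs internally. Reading another process's `ns` links may itself require sufficient permissions.

```shell
# same_ns TYPE PID1 PID2
# Returns 0 if both processes are in the same namespace of the given type
# (e.g. "user" or "pid"), 1 if they differ, 2 if a link cannot be read.
same_ns() {
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$a" = "$b" ]
}

# Example with the PIDs from the error log above (run on the same node):
# same_ns user 64773 64774 || echo "different user namespace or unreadable"
# same_ns pid  64773 64774 || echo "different PID namespace or unreadable"
```

If the `user` check reports a difference while the `pid` check does not, that matches the case described here, where only the PID namespace is currently taken into account.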
@yosefe I can confirm: I ran into the same issue when using bwrap, and it went away either by passing --unshare-pid to bwrap or by setting UCX_POSIX_USE_PROC_LINK=n.
ucx_info -v
# UCT version=1.8.1 revision 6b29558
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --prefix=/lustre/sw/ucx/1.8.1/gcc_820 --enable-shared --enable-static --enable-numa