Add workaround for MPI/UCX environment #196

Merged
merged 1 commit into from
Jul 19, 2023

panda1100
Contributor

Description of the Pull Request (PR):

Add workaround for MPI/UCX environment.
See

This fixes or addresses the following GitHub issues:

  • Workaround for #769

@panda1100 changed the title from "Add workaround for running MPI/UCX environment" to "Add workaround for MPI/UCX environment" on Jun 19, 2023
@DrDaveD
Contributor

DrDaveD commented Jun 20, 2023

In today's community meeting you agreed to do some more investigation into this. Let me know when that's done; I do have some ideas for improvements to this documentation change, but I want to first make sure we know what we want to recommend.

@DrDaveD
Contributor

DrDaveD commented Jun 30, 2023

@panda1100 Reminder, this is waiting on you.

@panda1100
Contributor Author

Thank you, @DrDaveD -san! I got access to an MPI/UCX environment yesterday. I'll keep this thread updated.

@DrDaveD added this to the 1.2.0 milestone on Jul 6, 2023
@panda1100
Contributor Author

panda1100 commented Jul 7, 2023

@DrDaveD @gmkurtzer
I ran some tests and discussed the results with @cclerget.
I also got a response from @hoopoepg -san in the UCX repo:
openucx/ucx#4511 (comment)

The issue that users (including me) have hit is related to the posix transport (a shared-memory transport).

I reproduced the issue with the following two commands and got the same error from both. The only difference is whether the posix transport is requested explicitly with UCX_TLS=posix,cma,ib.

mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 -x UCX_TLS=posix,cma,ib apptainer run hello.sif

and

mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 apptainer run hello.sif

In a rootless container environment, access to a neighboring process through the /proc filesystem is somehow prohibited, which causes initialization of the posix transport to fail.

Setting UCX_POSIX_USE_PROC_LINK=n avoids this restriction, and it works well.

UCX_POSIX_USE_PROC_LINK=n: this forces UCX to create the SHM object outside the /proc filesystem (in /dev/shm or similar). The /proc filesystem is used by default because the shared memory segment is then deleted automatically if a process terminates abnormally; on a "classic" filesystem, such a termination may lead to a resource leak.

UCX_POSIX_USE_PROC_LINK=n \
mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 -x UCX_TLS=posix,cma,ib apptainer run hello.sif
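
The variable can also be passed with Open MPI's `-x` option, like the other UCX settings in this command, so that mpirun exports it to the ranks instead of relying on the launching shell's environment. The following is only a sketch under the same assumptions as the command above (the `hello.sif` image and the `hfi1_0:1` device); the `ucx_mpirun_cmd` helper is hypothetical and just prints the command so it can be inspected before running it:

```shell
#!/bin/sh
# Hypothetical helper (not part of Apptainer, Open MPI, or UCX):
# print an mpirun command line that passes the posix-transport
# workaround through -x. Nothing is executed here.
ucx_mpirun_cmd() {
    image="$1"   # container image, e.g. hello.sif
    printf '%s\n' "mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 -x UCX_TLS=posix,cma,ib -x UCX_POSIX_USE_PROC_LINK=n apptainer run $image"
}

ucx_mpirun_cmd hello.sif
```

Whether the explicit `-x` matters in practice likely depends on how the launcher propagates the environment to other nodes; on a single node, the environment-prefix form above should behave the same way.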

There is another transport, the sysv transport, and it works as well.

The sysv transport provides the same (or very similar) performance, but uses a different API to establish connections between processes.

mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 -x UCX_TLS=sysv,cma,ib apptainer run hello.sif

The apptainer options --ipc and --pid also work with the posix transport, but this is a somewhat counter-intuitive solution: for other rootless container runtimes, sharing the same IPC namespace is one of the fixes for this issue. (Apptainer shares those namespaces with the host by default.)
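
One way to see which namespaces a process is in is to read the links under `/proc/self/ns` on Linux. The sketch below is an illustration only; the `show_ns` helper is a hypothetical name, and it simply prints the IPC and PID namespace identifiers of the calling process so the host and container values can be compared:

```shell
#!/bin/sh
# Hypothetical helper: print the IPC and PID namespace identifiers of
# the calling process. Run it on the host and inside the container,
# then compare the values.
show_ns() {
    for ns in ipc pid; do
        if [ -L "/proc/self/ns/$ns" ]; then
            printf '%s=%s\n' "$ns" "$(readlink "/proc/self/ns/$ns")"
        else
            # /proc is missing or this is not a Linux system
            printf '%s=unavailable\n' "$ns"
        fi
    done
}

show_ns
```

For example, comparing `readlink /proc/self/ns/ipc` on the host with `apptainer exec hello.sif readlink /proc/self/ns/ipc` and `apptainer exec --ipc hello.sif readlink /proc/self/ns/ipc` should show the same identifier in the first two cases and a different one in the third, if the default behavior is as described above.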

I will run a quick performance test of the following cases:

  • posix transport with workaround: UCX_POSIX_USE_PROC_LINK=n and UCX_TLS=posix,cma,ib
  • sysv transport: UCX_TLS=sysv,cma,ib
  • posix transport with --ipc option: UCX_TLS=posix,cma,ib
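
The three cases above can be sketched as command lines built from the commands earlier in this comment. This is a dry-run sketch only: the `case_cmd` and `print_cases` helpers and the case labels are hypothetical, the image and device names are the ones used above, and the script prints the commands rather than running them:

```shell
#!/bin/sh
# Dry-run sketch: print the mpirun invocation for each planned test case.
# Common part of the command, copied from the commands in this thread.
BASE='mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1'

# case_cmd LABEL UCX_TLS EXTRA_MPIRUN_ARGS APPTAINER_ARGS
# Empty EXTRA_MPIRUN_ARGS / APPTAINER_ARGS are simply left out.
case_cmd() {
    printf '%s: %s -x UCX_TLS=%s%s apptainer run%s hello.sif\n' \
        "$1" "$BASE" "$2" "${3:+ $3}" "${4:+ $4}"
}

print_cases() {
    case_cmd posix-no-proc-link posix,cma,ib '-x UCX_POSIX_USE_PROC_LINK=n' ''
    case_cmd sysv               sysv,cma,ib  ''                             ''
    case_cmd posix-ipc          posix,cma,ib ''                             '--ipc'
}

print_cases
```

Printing the commands first makes it easy to confirm that the cases differ only in the transport list, the workaround variable, and the `--ipc` option before timing anything.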

To be continued...

@panda1100
Contributor Author

openucx/ucx#9213

@panda1100
Contributor Author

@DrDaveD -san, I have updated the workaround based on the investigation results. I kept it as simple as possible for now and wrote up the details in apptainer/apptainer#769 (comment).

@DrDaveD merged commit 78cf3a0 into apptainer:main on Jul 19, 2023