Test UCX-py on Azure #369
Comments
For reference here are some docs on enabling Infiniband on Azure and some info on VM sizes. Only the |
I took the liberty to edit your post @quasiben, what you wrote about |
Thanks @pentschev
As this will involve using UCX in containers, you may need the PR openucx/ucx#4511, which was added so that processes using UCX in different containers can recognize each other and determine the best transports to use for communication.
We've received some feedback that NVLINK testing is a higher priority and we should probably start with a |
A single-node test with two workers could use:
Vanilla TCP
With IB
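As a rough sketch of how the TCP and IB variants of such a test might differ, assuming the transport is selected through UCX's `UCX_TLS` and `UCX_NET_DEVICES` environment variables. The helper name `ucx_env`, the specific transport lists, and the device name `mlx5_0:1` are illustrative assumptions, not the settings used in this thread:

```python
import os


def ucx_env(use_infiniband, net_device=None, base=None):
    """Build an environment dict for a worker process.

    Hypothetical helper: the transport lists below are assumptions
    chosen for illustration, not values taken from this issue.
    """
    env = dict(os.environ if base is None else base)
    if use_infiniband:
        # rc is UCX's reliable-connected InfiniBand transport.
        env["UCX_TLS"] = "rc,tcp,cuda_copy"
        if net_device is not None:
            # e.g. "mlx5_0:1" (assumed device:port name)
            env["UCX_NET_DEVICES"] = net_device
    else:
        # Plain TCP, with cuda_copy for staging GPU buffers.
        env["UCX_TLS"] = "tcp,cuda_copy"
    return env


if __name__ == "__main__":
    print(ucx_env(False, base={}))
    print(ucx_env(True, net_device="mlx5_0:1", base={}))
```

The resulting dict could be passed as `env=` to `subprocess.run` when launching each worker, so the two variants differ only in their environment.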
For a multi-node or multi-IB test we could do the following:
Server
Client
On a DGX-1 I am seeing the following with this test:
TCP
IB
Here are my notes from running through these benchmarks on Azure: https://gist.github.com/jacobtomlinson/6242a3547d13d4e7a00cc768b8b475c8
Overview
Is there something more to be done here or should this be closed?
I think we can close this. Thank you @jacobtomlinson for doing the work here.
I believe Azure has compute instances with multiple GPUs connected by NVLink and with InfiniBand enabled. It would be great to test UCX-py in this environment. I would suggest creating an environment with the following setup:
NVLINK TESTS
Then run and record the dask-cuda benchmarks:
With the following runs:
Note: 1,2 refers to GPUs 1 and 2
IB TESTS
Note: the second test pins CPU affinity.
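For readers unfamiliar with CPU pinning, here is a minimal sketch of what it means at the process level, using Python's Linux-only `os.sched_setaffinity`. The helper name `pin_to_cpus` is hypothetical, and the benchmark itself may well pin affinity by other means (for example an external tool):

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict process `pid` (0 means the calling process) to `cpus`.

    Returns the resulting affinity set, or None on platforms that do
    not provide sched_setaffinity (it is only available on Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


# Example (not executed here): keep the current process on CPUs 0 and 1,
# assuming both are in the set the process is allowed to use.
# pin_to_cpus([0, 1])
```

Pinning a worker to cores close to its GPU or NIC avoids cross-socket traffic, which is usually why an affinity-pinned run is compared against an unpinned one.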
Monitor packets received with:
You should observe the values ticking up during the test. If others have ideas, please chime in.
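One way to watch such a counter from Python is to poll it and print the increase per interval. This is a sketch under assumptions: it reads the standard Linux sysfs counter `port_rcv_packets`, and the device name `mlx5_0` and port `1` are placeholders that will differ per machine. It is not necessarily the command originally suggested here.

```python
import time
from pathlib import Path


def read_counter(path):
    """Return the integer stored in a sysfs-style counter file."""
    return int(Path(path).read_text().strip())


def sample_deltas(path, samples=3, interval=1.0, sleep=time.sleep):
    """Poll the counter and return its increase over each interval."""
    previous = read_counter(path)
    deltas = []
    for _ in range(samples):
        sleep(interval)
        current = read_counter(path)
        deltas.append(current - previous)
        previous = current
    return deltas


if __name__ == "__main__":
    # Assumed device and port; list /sys/class/infiniband to find yours.
    counter = "/sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_packets"
    if Path(counter).exists():
        print(sample_deltas(counter))
    else:
        print("counter file not found:", counter)
```

Non-zero deltas while the benchmark runs indicate that traffic is actually going over the InfiniBand port; all-zero deltas suggest the test fell back to another transport.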
These machines are rather expensive, so the idea is that @jacobtomlinson will provision, configure, install, run, and record the tests, then shut the machines down to minimize costs.