Test UCX-py on Azure #369

Closed
quasiben opened this issue Dec 12, 2019 · 10 comments

@quasiben
Member

quasiben commented Dec 12, 2019

I believe Azure has compute instances with multiple GPUs connected by NVLink and with InfiniBand enabled. It would be great to test UCX-py in this environment. I would suggest creating an env with the following setup:

conda create -n ucx-test -c rapidsai-nightly -c nvidia -c conda-forge \
ucx-proc=*=gpu ucx ucx-py python=3.7 cudf=0.12 dask-cudf dask-cuda \
pytest-asyncio cudatoolkit=<CUDA version> 

NVLINK TESTS

Then run and record the dask-cuda benchmarks with the following runs:

  • UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm python local_cupy_transpose_sum.py -p ucx -d 1,2 --size 40000
  • UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm python local_cudf_merge.py -p ucx -d 1,2 -c 100000000
  • UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=ALL python local_cudf_merge.py -p ucx -d 1,2 -c 100000000

Note: 1,2 refers to GPUs 1 and 2
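
The environment-variable-plus-command pattern above is easy to script when recording many runs. Here is a minimal sketch: the script names and flags come from the commands above, while `benchmark_cmd` itself is a hypothetical convenience helper, not part of ucx-py or dask-cuda.

```python
import os
import shlex

def benchmark_cmd(script, ucx_tls, extra_args):
    """Build the command and environment for one dask-cuda benchmark run.

    Returns (cmd, env) suitable for subprocess.run(cmd, env=env).
    Hypothetical helper, not part of ucx-py or dask-cuda.
    """
    env = dict(os.environ)
    env["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"
    env["UCX_TLS"] = ucx_tls
    cmd = ["python", script] + shlex.split(extra_args)
    return cmd, env

# e.g. the first NVLink run above:
cmd, env = benchmark_cmd(
    "local_cupy_transpose_sum.py",
    "tcp,cuda_copy,cuda_ipc,sockcm",
    "-p ucx -d 1,2 --size 40000",
)
# subprocess.run(cmd, env=env)  # uncomment on a machine with the benchmarks
```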

IB TESTS

  • UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest -t tag_bw -m host -s 10000000 -n 10 -p 9999 & UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest HOSTNAME -t tag_bw -m host -s 100000000 -n 10 -p 9999
  • UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest -t tag_bw -m host -s 10000000 -n 10 -p 9999 -c 0 & UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest HOSTNAME -t tag_bw -m host -s 100000000 -n 10 -p 9999 -c 1

Note: the second test pins CPU affinity (via -c 0 and -c 1)

Monitor packets received with:

watch -d 'cat /sys/class/infiniband/mlx5_*/ports/1/counters/port_rcv_packets'

You should observe the values ticking up during the test. If others have ideas, please chime in.

These machines are rather expensive, so the idea is that @jacobtomlinson will provision, configure, install, run, and record the tests, then shut everything down to minimize costs.

@jacobtomlinson
Member

For reference, here are some docs on enabling InfiniBand on Azure and some info on VM sizes.

Only the Standard_ND24rs is RDMA-capable, and it comes in at $10/hour.

@pentschev
Member

I took the liberty of editing your post @quasiben. What you wrote about ucx_perftest was wrong: you had -m HOSTNAME, but -m is the memory type and its value should be host (or cuda, etc.). The actual HOSTNAME is the first positional argument to ucx_perftest.

@quasiben
Member Author

Thanks @pentschev

@jakirkham
Member

As this will involve using UCX in containers, you may need PR ( openucx/ucx#4511 ), which was added to make sure processes using UCX in different containers can recognize each other and determine the best transports to use to communicate.

@quasiben
Member Author

We've received some feedback that NVLink testing is a higher priority, and we should probably start with a Standard_NC24rs_v2. These instances have 4 P100s.

@quasiben
Member Author

quasiben commented Feb 3, 2020

A single-node test with two workers could use local-send-recv.py.

Vanilla TCP

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm  UCX_TLS=tcp,cuda_copy,sockcm python local-send-recv.py -o cupy  -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc
[1580500435.185974] [dgx13:43834:0]          mpool.c:43   UCX  WARN  object 0x55fef7de4140 was not returned to mpool ucp_am_bufs
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 100.00 MB
object      | cupy
reuse alloc | True
==========================
Device(s)   | 1, 2
Average     | 422.53 MB/s
--------------------------
Iterations
--------------------------
000         |407.11 MB/s
001         |423.23 MB/s
002         |425.08 MB/s
003         |422.31 MB/s
004         |421.46 MB/s
005         |425.57 MB/s
006         |425.80 MB/s
007         |424.83 MB/s
008         |424.73 MB/s
009         |425.86 MB/s

With IB

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm  UCX_TLS=tcp,cuda_copy,sockcm,rc python local-send-recv.py -o cupy  -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 100.00 MB
object      | cupy
reuse alloc | True
==========================
Device(s)   | 1, 2
Average     | 2.91 GB/s
--------------------------
Iterations
--------------------------
000         |  2.28 GB/s
001         |  3.00 GB/s
002         |  3.01 GB/s
003         |  3.02 GB/s
004         |  3.02 GB/s
005         |  3.00 GB/s
006         |  2.99 GB/s
007         |  3.00 GB/s
008         |  3.00 GB/s
009         |  3.01 GB/s
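
As a sanity check on figures like these, throughput is just bytes moved divided by elapsed time. A tiny helper — decimal MB/GB as in the benchmark output; otherwise nothing here comes from the benchmark source:

```python
def throughput(n_bytes, seconds, unit="MB"):
    """Return throughput in `unit` per second for `n_bytes` moved in `seconds`."""
    scale = {"MB": 1e6, "GB": 1e9}[unit]
    return n_bytes / scale / seconds

# e.g. 100 MB moved in ~0.237 s is about 422 MB/s,
# roughly the average reported for the TCP run above.
```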

cc @jacobtomlinson

@quasiben
Member Author

quasiben commented Feb 3, 2020

For a multinode or multi IB test we could do the following with the recv-into-client.py benchmark.

Server

UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=5,1 UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,rc python recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13337 --n-iter 100

Client

UCX_NET_DEVICES=mlx5_2:1 CUDA_VISIBLE_DEVICES=5,1 UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,rc python recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13337 -s 10.33.227.163 --n-iter 100

On a DGX-1 I am seeing the following with this test:

TCP

CUDA RUNTIME DEVICE:  0
Roundtrip benchmark
-------------------
n_iter   | 10
n_bytes  | 1000.00 MB
recv     | recv_into
object   | cupy
inc      | False

===================
400.30 MB / s
===================

IB

CUDA RUNTIME DEVICE:  0
True
Roundtrip benchmark
-------------------
n_iter   | 10
n_bytes  | 1000.00 MB
recv     | recv_into
object   | cupy
inc      | False

===================
10.30 GB / s
===================

@jacobtomlinson
Member

Here are my notes from running through these benchmarks on Azure.

https://gist.github.com/jacobtomlinson/6242a3547d13d4e7a00cc768b8b475c8

Overview

  • Ran on an ND40_rs_v2 running Ubuntu 18.04
  • Installed drivers
    • NVIDIA 440.33.01
    • Mellanox MOFED 4.7-3.2.9.0
    • Mellanox GPUDirect RDMA nvidia-peer-memory_1.0-8
  • nvidia-smi topo -m shows 8 V100s and 1 IB NIC
  • local-send-recv.py single node benchmark results
    • TCP 395.07 MB/s
    • IB 2.12 GB/s (~600 MB/s without GPUDirect RDMA)
    • NVLINK NV1 17.08 GB/s
    • NVLINK NV2 26.12 GB/s
  • local-send-recv.py multi node results
    • IB 2.74 GB/s

@pentschev
Member

Is there something more to be done here or should this be closed?

@quasiben
Member Author

quasiben commented Jun 1, 2020

I think we can close this. Thank you @jacobtomlinson for doing the work here.
