Test UCX-py on Azure #369

Closed
quasiben opened this issue Dec 12, 2019 · 10 comments

@quasiben
Member

quasiben commented Dec 12, 2019

I believe Azure has compute instances with multiple GPUs connected by NVLink and with InfiniBand enabled. It would be great to test UCX-py in this environment. I would suggest creating an env with the following setup:

conda create -n ucx-test -c rapidsai-nightly -c nvidia -c conda-forge \
ucx-proc=*=gpu ucx ucx-py python=3.7 cudf=0.12 dask-cudf dask-cuda \
pytest-asyncio cudatoolkit=<CUDA version> 

NVLINK TESTS

Then run and record the dask-cuda benchmarks with the following runs:

  • UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm python local_cupy_transpose_sum.py -p ucx -d 1,2 --size 40000
  • UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,cuda_ipc,sockcm python local_cudf_merge.py -p ucx -d 1,2 -c 100000000
  • UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=ALL python local_cudf_merge.py -p ucx -d 1,2 -c 100000000

Note: 1,2 refers to GPUs 1 and 2
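
The environment-variable-plus-command pattern above is easy to script when recording many runs. Here is a minimal sketch: the script names and flags come from the commands above, while `benchmark_cmd` itself is a hypothetical convenience helper, not part of ucx-py or dask-cuda.

```python
import os
import shlex

def benchmark_cmd(script, ucx_tls, extra_args):
    """Build the command and environment for one dask-cuda benchmark run.

    Returns (cmd, env) suitable for subprocess.run(cmd, env=env).
    Hypothetical helper, not part of ucx-py or dask-cuda.
    """
    env = dict(os.environ)
    env["UCX_SOCKADDR_TLS_PRIORITY"] = "sockcm"
    env["UCX_TLS"] = ucx_tls
    cmd = ["python", script] + shlex.split(extra_args)
    return cmd, env

# e.g. the first NVLink run above:
cmd, env = benchmark_cmd(
    "local_cupy_transpose_sum.py",
    "tcp,cuda_copy,cuda_ipc,sockcm",
    "-p ucx -d 1,2 --size 40000",
)
# subprocess.run(cmd, env=env)  # uncomment on a machine with the benchmarks
```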

IB TESTS

  • UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest -t tag_bw -m host -s 10000000 -n 10 -p 9999 & UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest HOSTNAME -t tag_bw -m host -s 100000000 -n 10 -p 9999
  • UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest -t tag_bw -m host -s 10000000 -n 10 -p 9999 -c 0 & UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=rc ucx_perftest HOSTNAME -t tag_bw -m host -s 100000000 -n 10 -p 9999 -c 1

Note: the second test pins CPU affinity (via -c 0 and -c 1)

Monitor packets received with:

watch -d 'cat /sys/class/infiniband/mlx5_*/ports/1/counters/port_rcv_packets'

You should observe the values ticking up during the test. If others have ideas, please chime in.

These machines are rather expensive, so the idea is that @jacobtomlinson will provision, configure, install, run, and record the tests, then shut everything down to minimize costs.

@jacobtomlinson
Member

For reference, here are some docs on enabling InfiniBand on Azure and some info on VM sizes.

Only the Standard_ND24rs is RDMA-capable, and it comes in at $10/hour.

@pentschev
Member

I took the liberty of editing your post @quasiben. What you wrote about ucx_perftest was wrong: you had -m HOSTNAME, but -m is the memory type and its value should be host (or cuda, etc.). The actual HOSTNAME is the first positional argument to ucx_perftest.

@quasiben
Member Author

Thanks @pentschev

@jakirkham
Member

As this will involve using UCX in containers, you may need PR ( openucx/ucx#4511 ), which was added to make sure processes using UCX in different containers can recognize each other and determine the best transports to use to communicate.

@quasiben
Member Author

We've received some feedback that NVLink testing is a higher priority, and we should probably start with a Standard_NC24rs_v2. These instances have 4 P100s.

@quasiben
Member Author

quasiben commented Feb 3, 2020

A single-node test with two workers could use local-send-recv.py.

Vanilla TCP

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm  UCX_TLS=tcp,cuda_copy,sockcm python local-send-recv.py -o cupy  -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc
[1580500435.185974] [dgx13:43834:0]          mpool.c:43   UCX  WARN  object 0x55fef7de4140 was not returned to mpool ucp_am_bufs
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 100.00 MB
object      | cupy
reuse alloc | True
==========================
Device(s)   | 1, 2
Average     | 422.53 MB/s
--------------------------
Iterations
--------------------------
000         |407.11 MB/s
001         |423.23 MB/s
002         |425.08 MB/s
003         |422.31 MB/s
004         |421.46 MB/s
005         |425.57 MB/s
006         |425.80 MB/s
007         |424.83 MB/s
008         |424.73 MB/s
009         |425.86 MB/s

With IB

(rapidsai-latest) bzaitlen@dgx13:~/GitRepos/ucx-py/benchmarks$ UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm  UCX_TLS=tcp,cuda_copy,sockcm,rc python local-send-recv.py -o cupy  -n "100MB" --server-dev 1 --client-dev 2 --reuse-alloc
Roundtrip benchmark
--------------------------
n_iter      | 10
n_bytes     | 100.00 MB
object      | cupy
reuse alloc | True
==========================
Device(s)   | 1, 2
Average     | 2.91 GB/s
--------------------------
Iterations
--------------------------
000         |  2.28 GB/s
001         |  3.00 GB/s
002         |  3.01 GB/s
003         |  3.02 GB/s
004         |  3.02 GB/s
005         |  3.00 GB/s
006         |  2.99 GB/s
007         |  3.00 GB/s
008         |  3.00 GB/s
009         |  3.01 GB/s
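
As a sanity check on figures like these, throughput is just bytes moved divided by elapsed time. A tiny helper — decimal MB/GB as in the benchmark output; otherwise nothing here comes from the benchmark source:

```python
def throughput(n_bytes, seconds, unit="MB"):
    """Return throughput in `unit` per second for `n_bytes` moved in `seconds`."""
    scale = {"MB": 1e6, "GB": 1e9}[unit]
    return n_bytes / scale / seconds

# e.g. 100 MB moved in ~0.237 s is about 422 MB/s,
# roughly the average reported for the TCP run above.
```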

cc @jacobtomlinson

@quasiben
Member Author

quasiben commented Feb 3, 2020

For a multinode or multi IB test we could do the following with the recv-into-client.py benchmark.

Server

UCX_NET_DEVICES=mlx5_0:1 CUDA_VISIBLE_DEVICES=5,1 UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,rc python recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13337 --n-iter 100

Client

UCX_NET_DEVICES=mlx5_2:1 CUDA_VISIBLE_DEVICES=5,1 UCX_MEMTYPE_CACHE=n UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=tcp,cuda_copy,sockcm,rc python recv-into-client.py -r recv_into -o cupy --n-bytes 1000Mb -p 13337 -s 10.33.227.163 --n-iter 100

On a DGX-1 I am seeing the following with this test:

TCP

CUDA RUNTIME DEVICE:  0
Roundtrip benchmark
-------------------
n_iter   | 10
n_bytes  | 1000.00 MB
recv     | recv_into
object   | cupy
inc      | False

===================
400.30 MB / s
===================

IB

CUDA RUNTIME DEVICE:  0
True
Roundtrip benchmark
-------------------
n_iter   | 10
n_bytes  | 1000.00 MB
recv     | recv_into
object   | cupy
inc      | False

===================
10.30 GB / s
===================

@jacobtomlinson
Member

Here are my notes from running through these benchmarks on Azure.

https://gist.github.com/jacobtomlinson/6242a3547d13d4e7a00cc768b8b475c8

Overview

  • Ran on an ND40_rs_v2 running Ubuntu 18.04
  • Installed drivers
    • NVIDIA 440.33.01
    • Mellanox MOFED 4.7-3.2.9.0
    • Mellanox GPUDirect RDMA nvidia-peer-memory_1.0-8
  • nvidia-smi topo -m shows 8 V100s and 1 IB NIC
  • local-send-recv.py single node benchmark results
    • TCP 395.07 MB/s
    • IB 2.12 GB/s (~600 MB/s without GPUDirect RDMA)
    • NVLINK NV1 17.08 GB/s
    • NVLINK NV2 26.12 GB/s
  • local-send-recv.py multi node results
    • IB 2.74 GB/s

@pentschev
Member

Is there something more to be done here or should this be closed?

@quasiben
Member Author

quasiben commented Jun 1, 2020

I think we can close this. Thank you @jacobtomlinson for doing the work here.
