Performance measurement
Vasily Philipov edited this page Sep 22, 2021 · 5 revisions
The performance measurement library allows running performance tests (in the current thread) on the various UCX communication APIs. Its purpose is to let a developer make optimizations to the code and immediately test their effect.
The infrastructure provides both an API (libperf.h) and a command-line utility, ucx_perftest.
The API is tested as part of the unit tests.
Location in the code tree: src/tools/perf
Features of the API:
- uct_perf_test_run() is the function which runs the test (currently only the UCT API is supported).
- No need to do any resource allocation: just pass the testing parameters to the API.
- Requires running the function on 2 threads/processes/nodes - by passing RTE callbacks which are used to bootstrap the connections.
- Two testing modes - ping-pong and unidirectional stream (TBD bi-directional stream)
- Configurable message size and data layout (short/bcopy/zcopy)
- Supports: warmup cycles, unlimited iterations.
- UCT Active-messages stream is measured with simple flow-control.
- The test driver is written in C++ (with C linkage), to take advantage of templates.
- Results are reported to a callback function at the specified intervals, and are also returned from the API call.
  - Including: latency, message rate, and bandwidth, each as an iteration average and an overall average.
Features of ucx_perftest:
- Has a pre-defined list of tests, which are valid combinations of operation and testing mode.
- Can be run either as client-server application, as MPI application, or using libRTE.
- Supports: CSV output, numeric formatting.
- Supports "batch mode": write the list of tests to run to a text file (see example in contrib/perf) and run them one after another. Every line is a list of arguments that the tool would normally read as command-line options. They are appended to the other command-line arguments, if any were passed.
  - "Cartesian" mode: if several batch files are specified, all possible combinations are executed.
- Can be compiled with MPI and use mpirun as a launcher. To do so, add --with-mpi to the UCX ./configure command line.
- Supports loopback mode, in which the process communicates with itself, so passing a server hostname is not allowed.
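The batch and "Cartesian" modes described above can be sketched as follows. This is a minimal illustration, not a copy of the files in contrib/perf: the file names and the labels at the start of each line are hypothetical, and the line format (a test name followed by ordinary options) follows the description of -b in the help text. The script only writes the files and prints the command it would run, so it does not need ucx_perftest to be installed.

```shell
#!/bin/sh
# Hypothetical batch files: first word is a test name (a label),
# the rest of the line is regular ucx_perftest options.
workdir=$(mktemp -d)

cat > "$workdir/tests.txt" <<'EOF'
tag_latency   -t tag_lat
tag_bandwidth -t tag_bw
EOF

cat > "$workdir/sizes.txt" <<'EOF'
small -s 8
large -s 8192
EOF

# With two batch files, "Cartesian" mode would be expected to run
# every combination of one line from each file: 2 x 2 = 4 tests.
combos=$(( $(wc -l < "$workdir/tests.txt") * $(wc -l < "$workdir/sizes.txt") ))
echo "expected combinations: $combos"

# Dry run: print the client command instead of executing it.
# <server-hostname> is a placeholder to be replaced by the real host.
echo "ucx_perftest <server-hostname> -b $workdir/tests.txt -b $workdir/sizes.txt"
```

Any options given directly on the command line would be combined with those read from each batch line, as described in the list above.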
$ ucx_perftest -h
Note: test can be also launched as an MPI application
Usage: lt-ucx_perftest [ server-hostname ] [ options ]
Common options:
-t <test> test to run:
am_lat - active message latency
put_lat - put latency
add_lat - atomic add latency
get - get latency / bandwidth / message rate
fadd - atomic fetch-and-add latency / message rate
swap - atomic swap latency / message rate
cswap - atomic compare-and-swap latency / message rate
am_bw - active message bandwidth / message rate
put_bw - put bandwidth / message rate
add_mr - atomic add message rate
tag_lat - UCP tag match latency
tag_bw - UCP tag match bandwidth
tag_sync_lat - UCP tag sync match latency
ucp_put_lat - UCP put latency
ucp_put_bw - UCP put bandwidth
ucp_get - UCP get latency / bandwidth / message rate
ucp_add - UCP atomic add bandwidth / message rate
ucp_fadd - UCP atomic fetch-and-add latency / bandwidth / message rate
ucp_swap - UCP atomic swap latency / bandwidth / message rate
ucp_cswap - UCP atomic compare-and-swap latency / bandwidth / message rate
stream_bw - UCP stream bandwidth
stream_lat - UCP stream latency
-s <size> list of scatter-gather sizes for single message (8)
for example: "-s 16,48,8192,8192,14"
-n <iters> number of iterations to run (1000000)
-w <iters> number of warm-up iterations (10000)
-c <cpu> set affinity to this CPU (off)
-O <count> maximal number of uncompleted outstanding sends (1)
-i <offset> distance between consecutive scatter-gather entries (0)
-l <loopback> use loopback connection, in this case,
the process will communicate with itself,
so passing server hostname is not allowed
-T <threads> number of threads in the test (1), if >1 implies "-M multi" for UCP
-B register memory with NONBLOCK flag
-b <file> read and execute tests from a batch file: every line in the
file is a test to run, first word is test name, the rest of
the line is command-line arguments for the test.
-p <port> TCP port to use for data exchange (13337)
-P <0|1> disable/enable MPI mode (0)
-m <mem type> memory type of messages
host - system memory(default)
-h show this help message
Output format:
-N use numeric formatting (thousands separator)
-f print only final numbers
-v print CSV-formatted output
UCT only:
-d <device> device to use for testing
-x <tl> transport to use for testing
-D <layout> data layout for sender side:
short - short messages API (default, cannot be used for get)
bcopy - copy-out API (cannot be used for atomics)
zcopy - zero-copy API (cannot be used for atomics)
iov - scatter-gather list (iovec)
-W <count> flow control window size, for active messages (127)
-H <size> active message header size (8)
-A <mode> asynchronous progress mode (thread)
thread - separate progress thread
signal - signal-based timer
UCP only:
-M <thread> thread support level for progress engine (single)
single - only the master thread can access
serialized - one thread can access at a time
multi - multiple threads can access
-D <layout>[,<layout>]
data layout for sender and receiver side (contig)
contig - Continuous datatype
iov - Scatter-gather list
-C use wild-card tag for tag tests
-U force unexpected flow by using tag probe
-r <mode> receive mode for stream tests (recv)
recv : Use ucp_stream_recv_nb
recv_data : Use ucp_stream_recv_data_nb
Start server:
$ ucx_perftest -c 0
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: UCP tag match latency |
| Data layout: (automatic) |
| Message size: 8 |
+------------------------------------------------------------------------------------------+
Connect client:
$ ucx_perftest vegas08 -t tag_lat -c 0
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
592840 0.843 0.843 0.843 9.05 9.05 1185680 1185680
1000000 0.840 0.843 0.843 9.05 9.05 1185782 1185721
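A client invocation like the one above could be repeated over several message sizes. The following is a dry-run sketch that only prints the commands: the host name, the size list and the iteration count are placeholder assumptions, while -t, -s, -n, -f and -v are taken from the help text. It assumes a server is waiting for a connection before each client run.

```shell
#!/bin/sh
# Dry-run sketch: prints one client command per message size.
SERVER="server-hostname"      # placeholder host name (assumption)
TEST="tag_bw"                 # any test name from the -t list
SIZES="8 64 1024 65536"       # message sizes in bytes (assumption)

build_cmd() {
    # $1 = message size in bytes
    # -f prints only final numbers, -v selects CSV-formatted output
    echo "ucx_perftest $SERVER -t $TEST -s $1 -n 100000 -f -v"
}

for size in $SIZES; do
    build_cmd "$size"
done
```

Replacing the echo inside build_cmd with the command itself would turn the dry run into a real sweep.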
Run as an MPI application:
$ salloc -N2 --ntasks-per-node=1 mpirun --bind-to core --display-map ucx_perftest -d mlx5_1:1 \
-x rc_mlx5 -t put_lat
salloc: Granted job allocation 6991
salloc: Waiting for resource configuration
salloc: Nodes clx-orion-[001-002] are ready for job
Data for JOB [62403,1] offset 0
======================== JOB MAP ========================
Data for node: clx-orion-001 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [62403,1] App: 0 Process rank: 0
Data for node: clx-orion-002 Num slots: 1 Max slots: 0 Num procs: 1
Process OMPI jobid: [62403,1] App: 0 Process rank: 1
=============================================================
+--------------+-----------------------------+---------------------+-----------------------+
| | latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
586527 0.845 0.852 0.852 4.47 4.47 586527 586527
1000000 0.844 0.848 0.851 4.50 4.48 589339 587686
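The bandwidth and message rate columns in the two sample runs appear to be related through the message size. The sketch below checks this, assuming the default 8-byte message (-s default) and assuming that MB/s in the table means 2^20 bytes per second. Both are inferences from the printed numbers rather than something the output states, and bw_from_rate is a hypothetical helper name.

```shell
#!/bin/sh
# bandwidth (MB/s) = message rate (msg/s) * message size (bytes) / 2^20
bw_from_rate() {
    # $1 = message rate in msg/s, $2 = message size in bytes
    awk -v rate="$1" -v size="$2" \
        'BEGIN { printf "%.2f\n", rate * size / (1024 * 1024) }'
}

bw_from_rate 1185680 8   # tag_lat run: prints 9.05
bw_from_rate 586527 8    # put_lat run: prints 4.47
```

Both results match the bandwidth columns reported for the corresponding rows in the tables.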