This repository has been archived by the owner on Dec 16, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtest_run_allreduce_mpi_basic.log
83 lines (82 loc) · 5.68 KB
/
test_run_allreduce_mpi_basic.log
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
/# mpirun --allow-run-as-root -np 2 -H <rocm_ip>,<cuda_ip> -mca pml ucx -mca coll_ucc_enable 1 -mca coll_ucc_priority 100 -mca coll_ucc_verbose 3 -mca pml_ucx_verbose 3 /test_allreduce
[ip-cuda:00463] common_ucx.c:183 using OPAL memory hooks as external events
[ip-cuda:00463] pml_ucx.c:207 mca_pml_ucx_open: UCX version 1.18.0
[ip-rocm:01539] common_ucx.c:183 using OPAL memory hooks as external events
[ip-rocm:01539] pml_ucx.c:207 mca_pml_ucx_open: UCX version 1.18.0
[ip-rocm:01539] common_ucx.c:342 self/memory: did not match transport list
[ip-rocm:01539] common_ucx.c:342 tcp/ens5: did not match transport list
[ip-rocm:01539] common_ucx.c:342 tcp/lo: did not match transport list
[ip-rocm:01539] common_ucx.c:342 sysv/memory: did not match transport list
[ip-rocm:01539] common_ucx.c:342 posix/memory: did not match transport list
[ip-rocm:01539] common_ucx.c:342 rocm_copy/rocm_cpy: did not match transport list
[ip-rocm:01539] common_ucx.c:227 readlink(/sys/class/infiniband/rocm_ipc/device/driver) failed: No such file or directory
[ip-rocm:01539] common_ucx.c:337 rocm_ipc/rocm_ipc: matched transport list but not device list
[ip-rocm:01539] common_ucx.c:342 cma/memory: did not match transport list
[ip-rocm:01539] common_ucx.c:347 support level is transports only
[ip-rocm:01539] pml_ucx.c:293 mca_pml_ucx_init
[ip-rocm:01539] pml_ucx.c:124 Pack remote worker address, size 82
[ip-rocm:01539] pml_ucx.c:124 Pack local worker address, size 302
[ip-rocm:01539] pml_ucx.c:358 created ucp context 0x154e040, worker 0x158aed0
[ip-rocm:01539] pml_ucx_component.c:147 returning priority 19
[ip-cuda:00463] common_ucx.c:342 self/memory: did not match transport list
[ip-cuda:00463] common_ucx.c:342 tcp/ens5: did not match transport list
[ip-cuda:00463] common_ucx.c:342 tcp/lo: did not match transport list
[ip-cuda:00463] common_ucx.c:342 sysv/memory: did not match transport list
[ip-cuda:00463] common_ucx.c:342 posix/memory: did not match transport list
[ip-cuda:00463] common_ucx.c:342 cuda_copy/cuda: did not match transport list
[ip-cuda:00463] common_ucx.c:227 readlink(/sys/class/infiniband/cuda/device/driver) failed: No such file or directory
[ip-cuda:00463] common_ucx.c:337 cuda_ipc/cuda: matched transport list but not device list
[ip-cuda:00463] common_ucx.c:342 cma/memory: did not match transport list
[ip-cuda:00463] common_ucx.c:347 support level is transports only
[ip-cuda:00463] pml_ucx.c:293 mca_pml_ucx_init
[ip-cuda:00463] pml_ucx.c:124 Pack remote worker address, size 82
[ip-cuda:00463] pml_ucx.c:124 Pack local worker address, size 300
[ip-cuda:00463] pml_ucx.c:358 created ucp context 0x55ee2ea19a00, worker 0x55ee2ea91a60
[ip-cuda:00463] pml_ucx_component.c:147 returning priority 19
[ip-cuda:00463] pml_ucx.c:192 Got proc 1 address, size 300
[ip-cuda:00463] pml_ucx.c:423 connecting to proc. 1
[ip-rocm:01539] pml_ucx.c:192 Got proc 0 address, size 302
[ip-rocm:01539] pml_ucx.c:423 connecting to proc. 0
[ip-rocm:01539] pml_ucx.c:192 Got proc 1 address, size 82
[ip-rocm:01539] pml_ucx.c:423 connecting to proc. 1
[ip-cuda:00463] pml_ucx.c:192 Got proc 0 address, size 82
[ip-cuda:00463] pml_ucx.c:423 connecting to proc. 0
[ip-cuda:00463] coll_ucc_module.c:383 - mca_coll_ucc_init_ctx() initialized ucc context
[ip-rocm:01539] coll_ucc_module.c:383 - mca_coll_ucc_init_ctx() initialized ucc context
[ip-cuda:00463] coll_ucc_module.c:469 - mca_coll_ucc_module_enable() creating ucc_team for comm 0x55ee2e65a880, comm_id 0, comm_size 2
[ip-rocm:01539] coll_ucc_module.c:469 - mca_coll_ucc_module_enable() creating ucc_team for comm 0x2045a0, comm_id 0, comm_size 2
ROCm Rank 0
CUDA Rank 1
[ip-cuda:00463] coll_ucc_allreduce.c:69 - mca_coll_ucc_allreduce() running ucc allreduce
[ip-rocm:01539] coll_ucc_allreduce.c:69 - mca_coll_ucc_allreduce() running ucc allreduce
[ip-rocm:1539 :0:1539] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7037a3801020)
==== backtrace (tid: 1539) ====
0 /usr/lib/libucs.so.0(ucs_handle_error+0x2dc) [0x7038bbac3aac]
1 /usr/lib/libucs.so.0(+0x3cc8f) [0x7038bbac3c8f]
2 /usr/lib/libucs.so.0(+0x3cfc4) [0x7038bbac3fc4]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7038bbd59520]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x1a0977) [0x7038bbeb7977]
5 /usr/lib/ucx/libuct_rocm.so.0(uct_rocm_copy_ep_put_short+0x37) [0x7038b20bf7c7]
6 /usr/lib/libucp.so.0(ucp_mem_type_unpack+0x17a) [0x7038bbb8d57a]
7 /usr/lib/libucp.so.0(ucp_eager_only_handler+0xfa6) [0x7038bbc0eee6]
8 /usr/lib/libuct.so.0(+0x29f2e) [0x7038b2601f2e]
9 /usr/lib/libuct.so.0(+0x2ac78) [0x7038b2602c78]
10 /usr/lib/libuct.so.0(+0x2eda4) [0x7038b2606da4]
11 /usr/lib/libucs.so.0(ucs_event_set_wait+0x141) [0x7038bbad5951]
12 /usr/lib/libuct.so.0(uct_tcp_iface_progress+0x90) [0x7038b2606e90]
13 /usr/lib/libucs.so.0(+0x2cf27) [0x7038bbab3f27]
14 /usr/lib/libucs.so.0(+0x2d47b) [0x7038bbab447b]
15 /usr/lib/libucp.so.0(ucp_worker_progress+0x7a) [0x7038bbb86bda]
16 /usr/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_reduce_scatter_knomial_progress+0x225) [0x7037b089a6c5]
17 /usr/lib/ucc/libucc_tl_ucp.so(ucc_tl_ucp_reduce_scatter_knomial_start+0x190) [0x7037b089c230]
18 /usr/lib/libucc.so.1(ucc_dependency_handler+0x5c) [0x7038bbcf16dc]
19 /usr/lib/libucc.so.1(ucc_event_manager_notify+0x63) [0x7038bbcf03b3]
20 /usr/lib/libucc.so.1(ucc_event_manager_notify+0x63) [0x7038bbcf03b3]
21 /usr/lib/libucc.so.1(ucc_collective_post+0xab) [0x7038bbced33b]
22 /usr/lib/libmpi.so.40(mca_coll_ucc_allreduce+0xf2) [0x7038bdc838e2]
23 /usr/lib/libmpi.so.40(PMPI_Allreduce+0x104) [0x7038bdbf6e44]
24 /test_allreduce() [0x201f8c]
25 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7038bbd40d90]
26 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7038bbd40e40]
27 /test_allreduce() [0x201d95]
=================================