This repository has been archived by the owner on Dec 16, 2024. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathtest_run_bidirectional_send_recv_mpi_ucc.log
115 lines (114 loc) · 11.9 KB
/
test_run_bidirectional_send_recv_mpi_ucc.log
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
/# mpirun --allow-run-as-root -np 2 -H <cuda_ip>,<rocm_ip> -x MASTER_ADDR=<rocm_ip> -x MASTER_PORT=1234 -mca pml ucx -mca coll_ucc_enable 1 -mca coll_ucc_priority 100 -x UCX_ROCM_COPY_D2H_THRESH=0 -x UCX_ROCM_COPY_H2D_THRESH=0 -x UCC_EC_ROCM_REDUCE_HOST_LIMIT=0 -x UCC_EC_ROCM_COPY_HOST_LIMIT=0 -x OMPI_MCA_mpi_accelerator_rocm_memcpyD2H_limit=0 -x OMPI_MCA_mpi_accelerator_rocm_memcpyH2D_limit=0 -x UCC_TL_UCP_TUNE=inf -x UCC_COLL_TRACE=DEBUG -x UCC_LOG_LEVEL=DEBUG /opt/conda/envs/py_3.12/bin/python /test_bidirectional_send_recv.py
[1730464018.476660] [ip-cuda:24601:0] ucc_proc_info.c:309 UCC DEBUG proc pid 24601, host ip-cuda, host_hash 761723772129539007, sockid 0, numaid 0
[1730464018.476683] [ip-cuda:24601:0] ucc_constructor.c:188 UCC INFO version: 1.4.0, loaded from: /usr/lib/libucc.so.1, cfg file: n/a
[1730464018.476705] [ip-cuda:24601:0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1730464018.480151] [ip-rocm:1654 :0] ucc_proc_info.c:309 UCC DEBUG proc pid 1654, host ip-rocm, host_hash 6740075926527859685, sockid 0, numaid 0
[1730464018.480183] [ip-rocm:1654 :0] ucc_constructor.c:188 UCC INFO version: 1.4.0, loaded from: /usr/lib/libucc.so.1, cfg file: n/a
[1730464018.480214] [ip-rocm:1654 :0] ucc_mc.c:67 UCC DEBUG mc cpu mc initialized
[1730464018.480237] [ip-rocm:1654 :0] ucc_mc.c:67 UCC DEBUG mc rocm mc initialized
[1730464018.480248] [ip-rocm:1654 :0] ucc_ec.c:63 UCC DEBUG ec cpu ec initialized
[1730464018.480271] [ip-rocm:1654 :0] ucc_ec.c:63 UCC DEBUG ec rocm ec initialized
[1730464018.480298] [ip-rocm:1654 :0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x8283590
[1730464018.480319] [ip-rocm:1654 :0] ucc_lib.c:150 UCC DEBUG lib_prefix "OMPI_UCC_": initialized component "basic" score 10
[1730464018.480364] [ip-rocm:1654 :0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x7bb8cc0
[1730464018.482467] [ip-cuda:24601:0] mc_cuda.c:65 cuda mc DEBUG driver version 12040
[1730464018.482485] [ip-cuda:24601:0] mc_cuda.c:77 cuda mc DEBUG cuCtxGetDevice() failed: invalid device context
[1730464018.482498] [ip-cuda:24601:0] ucc_mc.c:67 UCC DEBUG mc cuda mc initialized
[1730464018.482515] [ip-cuda:24601:0] ucc_ec.c:63 UCC DEBUG ec cpu ec initialized
[1730464018.485616] [ip-cuda:24601:0] ucc_ec.c:63 UCC DEBUG ec cuda ec initialized
[1730464018.485660] [ip-cuda:24601:0] cl_basic_lib.c:20 CL_BASIC DEBUG initialized lib object: 0x45d1e90
[1730464018.485680] [ip-cuda:24601:0] ucc_lib.c:150 UCC DEBUG lib_prefix "OMPI_UCC_": initialized component "basic" score 10
[1730464018.485728] [ip-cuda:24601:0] tl_ucp_lib.c:69 TL_UCP DEBUG initialized lib object: 0x6230e20
[1730464018.491303] [ip-rocm:1654 :0] tl_ucp_context.c:281 TL_UCP DEBUG initialized tl context: 0x81234e0
[1730464018.491345] [ip-rocm:1654 :0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x829f9a0
[1730464018.494350] [ip-cuda:24601:0] tl_ucp_context.c:281 TL_UCP DEBUG initialized tl context: 0x64e52a0
[1730464018.494378] [ip-cuda:24601:0] cl_basic_context.c:50 CL_BASIC DEBUG initialized cl context: 0x66b6e40
[1730464018.506930] [ip-cuda:24601:0] tl_ucp_team.c:103 TL_UCP DEBUG posted tl team: 0x6794600
[1730464018.506942] [ip-cuda:24601:0] tl_ucp_team.c:202 TL_UCP DEBUG initialized tl team: 0x6794600
[1730464018.506954] [ip-cuda:24601:0] ucc_context.c:839 UCC DEBUG created ucc context 0x45d1f90 for lib OMPI_UCC_
[1730464018.506818] [ip-rocm:1654 :0] tl_ucp_team.c:103 TL_UCP DEBUG posted tl team: 0x834a860
[1730464018.506848] [ip-rocm:1654 :0] tl_ucp_team.c:202 TL_UCP DEBUG initialized tl team: 0x834a860
[1730464018.506854] [ip-rocm:1654 :0] ucc_context.c:839 UCC DEBUG created ucc context 0x8283780 for lib OMPI_UCC_
[1730464018.506990] [ip-cuda:24601:0] ucc_team.c:369 UCC DEBUG team 0x659eab0 rank 0, ctx_rank 0, map_type 1
[1730464018.506887] [ip-rocm:1654 :0] ucc_team.c:369 UCC DEBUG team 0x8199b40 rank 1, ctx_rank 1, map_type 1
[1730464018.506914] [ip-rocm:1654 :0] tl_ucp_team.c:100 TL_UCP DEBUG opt knomial radix: 2
[1730464018.506918] [ip-rocm:1654 :0] tl_ucp_team.c:103 TL_UCP DEBUG posted tl team: 0x834c840
[1730464018.506925] [ip-rocm:1654 :0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x834c5a0
[1730464018.506930] [ip-rocm:1654 :0] tl_ucp_team.c:202 TL_UCP DEBUG initialized tl team: 0x834c840
[1730464018.507018] [ip-cuda:24601:0] tl_ucp_team.c:100 TL_UCP DEBUG opt knomial radix: 2
[1730464018.506940] [ip-rocm:1654 :0] cl_basic_team.c:120 CL_BASIC DEBUG initialized tl ucp team
[1730464018.506955] [ip-rocm:1654 :0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host
[1730464018.506960] [ip-rocm:1654 :0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type rocm
[1730464018.507027] [ip-cuda:24601:0] tl_ucp_team.c:103 TL_UCP DEBUG posted tl team: 0x6796210
[1730464018.507036] [ip-cuda:24601:0] cl_basic_team.c:52 CL_BASIC DEBUG posted cl team: 0x795ba0070550
[1730464018.507046] [ip-cuda:24601:0] tl_ucp_team.c:202 TL_UCP DEBUG initialized tl team: 0x6796210
[1730464018.507058] [ip-cuda:24601:0] cl_basic_team.c:120 CL_BASIC DEBUG initialized tl ucp team
[1730464018.507070] [ip-cuda:24601:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type host
[1730464018.507078] [ip-cuda:24601:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda
[1730464018.507085] [ip-cuda:24601:0] tl_ucp_team.c:230 TL_UCP DEBUG enable support for memory type cuda-managed
[1730464018.507201] [ip-cuda:24601:0] ucc_team.c:471 UCC INFO ===== COLL_SCORE_MAP (team_id 32768, size 2) =====
[1730464018.507228] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Allgather:
[1730464018.507228] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1730464018.507228] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1730464018.507228] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1730464018.507267] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Allgatherv:
[1730464018.507267] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507267] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507267] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507316] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Allreduce:
[1730464018.507316] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1730464018.507316] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1730464018.507316] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
[1730464018.507362] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Alltoall:
[1730464018.507362] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..257}:TL_UCP:10 {258..inf}:TL_UCP:10
[1730464018.507362] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507362] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507400] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Alltoallv:
[1730464018.507400] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507400] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507400] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507441] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Barrier:
[1730464018.507441] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507441] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507441] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507483] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Bcast:
[1730464018.507483] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507483] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507483] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507524] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Fanin:
[1730464018.507524] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507524] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507524] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507561] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Fanout:
[1730464018.507561] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507561] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507561] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507596] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Gather:
[1730464018.507596] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507596] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507596] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507864] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Gatherv:
[1730464018.507864] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507864] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507864] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507899] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Reduce:
[1730464018.507899] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507899] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507899] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507942] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Reduce_scatter:
[1730464018.507942] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507942] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507942] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.507992] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Reduce_scatterv:
[1730464018.507992] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.507992] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.507992] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.508035] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Scatterv:
[1730464018.508035] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Host: {0..inf}:TL_UCP:10
[1730464018.508035] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO Cuda: {0..inf}:TL_UCP:10
[1730464018.508035] [ip-cuda:24601:0] ucc_coll_score_map.c:203 UCC INFO CudaManaged: {0..inf}:TL_UCP:10
[1730464018.508065] [ip-cuda:24601:0] ucc_team.c:474 UCC INFO ================================================
CUDA Rank 0 sent data to Rank 1: 0.0
ROCM Rank 1 received data from Rank 0: 0.0
ROCM Rank 1 sent data to Rank 0: 1.0
CUDA Rank 0 received data from Rank 1: 1.0