
SHM/MM: use real SHM access for reachable test #4511

Merged: 1 commit, Dec 9, 2019

Conversation

hoopoepg
Contributor

  • use real access to the shared memory segment for the is_reachable test
    on iface initialization (instead of the GUID-based check)

@hoopoepg hoopoepg changed the title from "SHM/MM: use real SHM access for reacjable test" to "SHM/MM: use real SHM access for reachable test" on Nov 26, 2019
@@ -88,11 +115,36 @@ UCS_CLASS_INIT_FUNC(uct_sm_iface_t, uct_iface_ops_t *ops, uct_md_h md,

self->config.bandwidth = sm_config->bandwidth;

self->shmid = shmget(IPC_PRIVATE, sizeof(uint64_t),
Contributor

so if SysV is not supported, posix will not work as well?

Contributor Author

right, but without SysV most Unix systems are not operable;
as far as I remember, SysV is supported by all Linux, FreeBSD, and macOS systems

Contributor

if we already have support for POSIX without a SysV dependency, let's keep it that way

Contributor

please use the transport itself and its mappers to test reachability, not direct calls to SysV
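For context, here is a minimal sketch of the SysV-based probe under discussion: the iface creates a small private segment at init and advertises its id, and is_reachable tries to attach the peer's segment. This is illustrative only; the function names and permission bits are hypothetical, not the UCX API.

#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdint.h>
#include <stdio.h>

/* At iface init: create a small private segment whose id can be advertised
 * in the iface address. */
static int probe_shm_create(void)
{
    int shmid = shmget(IPC_PRIVATE, sizeof(uint64_t), IPC_CREAT | 0600);
    if (shmid < 0) {
        perror("shmget");
    }
    return shmid;
}

/* In is_reachable: try to attach the remote segment; success means the two
 * processes really share the same SysV IPC domain. */
static int probe_shm_reachable(int remote_shmid)
{
    void *ptr = shmat(remote_shmid, NULL, SHM_RDONLY);
    if (ptr == (void*)-1) {
        return 0;   /* not reachable via SysV SHM */
    }
    shmdt(ptr);
    return 1;
}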

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 03-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@yosefe yosefe added the WIP-DNM Work in progress / Do not review label Nov 27, 2019
@hoopoepg hoopoepg force-pushed the topic/sm-reachable-on-mem-access branch 2 times, most recently from f647522 to 0f09df5 on November 30, 2019 05:10
@mellanox-github
Contributor

Mellanox CI: FAILED on 10 of 25 workers (click for details)

Note: the logs will be deleted after 07-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ❌ FAILURE
hpc-arm-cavium-jenkins_W1 ❌ FAILURE
hpc-arm-cavium-jenkins_W2 ❌ FAILURE
hpc-arm-cavium-jenkins_W3 ❌ FAILURE
hpc-arm-hwi-jenkins_W0 ❌ FAILURE
hpc-arm-hwi-jenkins_W1 ❌ FAILURE
hpc-arm-hwi-jenkins_W2 ❌ FAILURE
hpc-arm-hwi-jenkins_W3 ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg hoopoepg force-pushed the topic/sm-reachable-on-mem-access branch from 0f09df5 to 39fb657 on November 30, 2019 12:33
@mellanox-github
Contributor

Mellanox CI: FAILED on 6 of 25 workers (click for details)

Note: the logs will be deleted after 07-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W1 ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-test-node-new_W3 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg
Contributor Author

hoopoepg commented Dec 2, 2019

bot:mlx:retest

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 09-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg
Contributor Author

hoopoepg commented Dec 2, 2019

bot:pipe:retest

@mellanox-github
Contributor

Mellanox CI: FAILED on 25 of 25 workers (click for details)

Note: the logs will be deleted after 09-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ❌ FAILURE
hpc-arm-cavium-jenkins_W1 ❌ FAILURE
hpc-arm-cavium-jenkins_W2 ❌ FAILURE
hpc-arm-cavium-jenkins_W3 ❌ FAILURE
hpc-arm-hwi-jenkins_W0 ❌ FAILURE
hpc-arm-hwi-jenkins_W1 ❌ FAILURE
hpc-arm-hwi-jenkins_W2 ❌ FAILURE
hpc-arm-hwi-jenkins_W3 ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W1 ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-test-node-new_W0 ❌ FAILURE
hpc-test-node-new_W1 ❌ FAILURE
hpc-test-node-new_W2 ❌ FAILURE
hpc-test-node-new_W3 ❌ FAILURE
r-vmb-ppc-jenkins_W0 ❌ FAILURE
r-vmb-ppc-jenkins_W1 ❌ FAILURE
r-vmb-ppc-jenkins_W2 ❌ FAILURE
r-vmb-ppc-jenkins_W3 ❌ FAILURE

@mellanox-github
Contributor

Mellanox CI: FAILED on 2 of 25 workers (click for details)

Note: the logs will be deleted after 10-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-hwi-jenkins_W0 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg
Contributor Author

hoopoepg commented Dec 3, 2019

out-of-memory
bot:pipe:retest

@hoopoepg
Contributor Author

hoopoepg commented Dec 3, 2019

bot:mlx:retest

@mellanox-github
Contributor

Mellanox CI: FAILED on 2 of 25 workers (click for details)

Note: the logs will be deleted after 10-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ❌ FAILURE
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg hoopoepg force-pushed the topic/sm-reachable-on-mem-access branch from 10a58a0 to a9ebe98 on December 4, 2019 14:04
@hoopoepg
Contributor Author

hoopoepg commented Dec 4, 2019

resolved conflicts, rebased to master

@mellanox-github
Contributor

Mellanox CI: FAILED on 7 of 25 workers (click for details)

Note: the logs will be deleted after 11-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-cavium-jenkins_W1 ❌ FAILURE
hpc-arm-hwi-jenkins_W1 ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-new_W2 ❌ FAILURE
r-vmb-ppc-jenkins_W1 ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 11-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg hoopoepg removed the WIP-DNM Work in progress / Do not review label Dec 5, 2019
src/ucs/sys/sys.h (outdated, resolved)
src/ucs/sys/sys.c (outdated, resolved)
src/ucs/sys/sys.h (outdated, resolved)
src/ucs/sys/sys.c (outdated, resolved)
src/ucs/sys/sys.h (resolved)
src/uct/sm/cma/cma_iface.c (outdated, resolved)
src/uct/sm/cma/cma_iface.c (outdated, resolved)
src/uct/sm/cma/cma_iface.c (outdated, resolved)
src/uct/sm/mm/base/mm_iface.c (outdated, resolved)
src/uct/sm/mm/posix/mm_posix.c (outdated, resolved)
#define UCS_PROCESS_NS_NET_DFLT 0xF0000080U



Contributor

extra line

Contributor Author

yep, removed

if (res == 0) {
    ucs_sys_namespace_info[ns].ino = (ucs_sys_ns_t)st.st_ino;
} else {
    ucs_sys_namespace_info[ns].ino = ucs_sys_namespace_info[ns].dflt;
Contributor

maybe also print a warning with errno?

Contributor Author

not sure if FreeBSD or macOS supports such an ID; it could be OK if such a file is missing
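To make the fallback concrete, here is a minimal sketch (not the exact UCX code) of how a namespace id can be read as the inode of the corresponding /proc file, with a default constant used when the file is missing; the constant and function name below are hypothetical.

#include <sys/stat.h>
#include <stdint.h>

#define NS_IPC_DFLT 0xF0000080U   /* hypothetical default, mirroring the pattern above */

static uint64_t get_ipc_ns_id(void)
{
    struct stat st;

    if (stat("/proc/self/ns/ipc", &st) == 0) {
        return (uint64_t)st.st_ino;   /* namespace identified by its inode */
    }
    return NS_IPC_DFLT;               /* no namespace file: assume the default namespace */
}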

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 12-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

*(pid_t*)addr = getpid();
ucs_cma_iface_ext_device_addr_t *iface_addr = (void*)addr;

ucs_assert(!(getpid() & UCT_CMA_IFACE_ADDR_FLAG_PID_NS));
Contributor

maybe assert_always? then we can do iface_addr->super.id = getpid();
it looks like we always need to detect this error, and it is not a fast path

Contributor Author

ucs_assert is used in other places to evaluate the PID; I just used the same logic
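For readers following along, the point of the assert is that the address id field packs the PID together with a high flag bit, so the PID itself must never have that bit set. A minimal sketch with hypothetical names and flag value:

#include <assert.h>
#include <stdint.h>
#include <unistd.h>

#define ADDR_FLAG_PID_NS  (1u << 31)   /* hypothetical flag bit in the id field */

static uint32_t pack_addr_id(int non_default_pid_ns)
{
    uint32_t pid = (uint32_t)getpid();

    /* If the PID ever used the flag bit, unpacking on the remote side would be
     * ambiguous; hence the assert in the code above. */
    assert(!(pid & ADDR_FLAG_PID_NS));

    return pid | (non_default_pid_ns ? ADDR_FLAG_PID_NS : 0);
}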



typedef struct {
    uint64_t id;
Contributor

minor: can align with struct below

Contributor Author

yep, fixed both

@@ -12,6 +12,16 @@
#include <ucs/sys/string.h>


typedef struct {
    pid_t id;
Contributor

minor: can align with struct below

return ucs_sys_get_ns(UCS_SYS_NS_TYPE_PID) == *(const ucs_sys_ns_t*)iface_addr;
}

return ucs_sys_ns_is_default(UCS_SYS_NS_TYPE_PID);
Contributor

extra whitespace

Contributor Author

yep, will fix on squash

@mellanox-github
Contributor

Mellanox CI: FAILED on 25 of 25 workers (click for details)

Note: the logs will be deleted after 13-Dec-2019

Agent/Stage Status
_main ❌ FAILURE
hpc-arm-cavium-jenkins_W0 ❌ FAILURE
hpc-arm-cavium-jenkins_W1 ❌ FAILURE
hpc-arm-cavium-jenkins_W2 ❌ FAILURE
hpc-arm-cavium-jenkins_W3 ❌ FAILURE
hpc-arm-hwi-jenkins_W0 ❌ FAILURE
hpc-arm-hwi-jenkins_W1 ❌ FAILURE
hpc-arm-hwi-jenkins_W2 ❌ FAILURE
hpc-arm-hwi-jenkins_W3 ❌ FAILURE
hpc-test-node-gpu_W0 ❌ FAILURE
hpc-test-node-gpu_W1 ❌ FAILURE
hpc-test-node-gpu_W2 ❌ FAILURE
hpc-test-node-gpu_W3 ❌ FAILURE
hpc-test-node-legacy_W0 ❌ FAILURE
hpc-test-node-legacy_W1 ❌ FAILURE
hpc-test-node-legacy_W2 ❌ FAILURE
hpc-test-node-legacy_W3 ❌ FAILURE
hpc-test-node-new_W0 ❌ FAILURE
hpc-test-node-new_W1 ❌ FAILURE
hpc-test-node-new_W2 ❌ FAILURE
hpc-test-node-new_W3 ❌ FAILURE
r-vmb-ppc-jenkins_W0 ❌ FAILURE
r-vmb-ppc-jenkins_W1 ❌ FAILURE
r-vmb-ppc-jenkins_W2 ❌ FAILURE
r-vmb-ppc-jenkins_W3 ❌ FAILURE

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 13-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 14-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

src/ucs/sys/sys.c (outdated, resolved)
src/ucs/sys/sys.c (outdated, resolved)
return ext_addr->ipc_ns == my_addr.ipc_ns;
return (ext_addr->super.id == my_addr.super.id) &&
       (!(ext_addr->super.id & UCS_SM_IFACE_ADDR_FLAG_EXT) ||
        (ext_addr->ipc_ns == my_addr.ipc_ns));
Contributor

unhandled?
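To spell out the rule in the diff above: the base id is always compared, and the IPC namespace is compared only when the peer advertises the extended-address flag, which keeps old and new address formats interoperable. A self-contained sketch with illustrative types and a hypothetical flag value:

#include <stdint.h>

#define SM_IFACE_ADDR_FLAG_EXT  (1ull << 63)   /* hypothetical flag bit */

typedef struct {
    uint64_t id;       /* base id, may carry the EXT flag          */
    uint64_t ipc_ns;   /* IPC namespace inode, valid when EXT set  */
} sm_iface_addr_t;

static int sm_iface_is_reachable(const sm_iface_addr_t *ext_addr,
                                 const sm_iface_addr_t *my_addr)
{
    return (ext_addr->id == my_addr->id) &&
           (!(ext_addr->id & SM_IFACE_ADDR_FLAG_EXT) ||
            (ext_addr->ipc_ns == my_addr->ipc_ns));
}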

@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 15-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

Contributor

@yosefe yosefe left a comment

please squash

- use namespace ID to evaluate iface reachability
- use IPC namespace for sysv & knem ifaces
- use IPC + PID namespaces for posix + cma ifaces
@hoopoepg hoopoepg force-pushed the topic/sm-reachable-on-mem-access branch from 072968a to d222b72 on December 8, 2019 16:41
@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers (click for details)

Note: the logs will be deleted after 15-Dec-2019

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@hoopoepg
Contributor Author

hoopoepg commented Dec 9, 2019

bot:pipe:retest

@yosefe
Contributor

yosefe commented Dec 9, 2019

r-vmb-rhel7-u0-beta-x86-64 failure
bot:pipe:retest

@hoopoepg
Contributor Author

hoopoepg commented Dec 9, 2019

@yosefe ok to merge?

@panda1100
Contributor

panda1100 commented Jul 7, 2023

Hi @yosefe -san and @hoopoepg -san,

I'm still facing issue #4224 with OMPI 4.1.5 + UCX v1.10.1; I have a working test environment.

The following workarounds did work for me, but could you please help me understand why posix doesn't work?

  • UCX_TLS=sysv,cma,ib works
  • UCX_POSIX_USE_PROC_LINK=n works
  • UCX_POSIX_USE_PROC_LINK=n and UCX_TLS=posix,cma,ib works
  • UCX_TLS=posix,cma,ib didn't work
$ /home/ciq/ysenda/opt/ompi/bin/mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 -x UCX_TLS=posix,cma,ib apptainer run hello.sif
[1688710000.495329] [c7:2019617:0]       mm_posix.c:195  UCX  ERROR open(file_name=/proc/2019618/fd/20 flags=0x0) failed: Permission denied
[1688710000.495329] [c7:2019618:0]       mm_posix.c:195  UCX  ERROR open(file_name=/proc/2019617/fd/20 flags=0x0) failed: Permission denied
[1688710000.495362] [c7:2019618:0]          mm_ep.c:155  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc0000005001ed121: Shared memory error
[1688710000.495363] [c7:2019617:0]          mm_ep.c:155  UCX  ERROR mm ep failed to connect to remote FIFO id 0xc0000005001ed122: Shared memory error
[c7:2019618] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:424  Error: ucp_ep_create(proc=1) failed: Shared memory error
[c7:2019617] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:424  Error: ucp_ep_create(proc=0) failed: Shared memory error
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[c7:2019617] *** An error occurred in MPI_Init
[c7:2019617] *** reported by process [2623275009,1]
[c7:2019617] *** on a NULL communicator
[c7:2019617] *** Unknown error
[c7:2019617] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c7:2019617] ***    and potentially your MPI job)
[c7:2019530] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[c7:2019530] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[c7:2019530] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

$ /home/ciq/ysenda/opt/ucx/bin/ucx_info -v
# UCT version=1.10.1 revision 6a5856e
# configured with: --prefix=/home/ciq/ysenda/opt/ucx

@hoopoepg
Contributor Author

hoopoepg commented Jul 7, 2023

hi @panda1100
The posix transport is a shared-memory transport built on the POSIX API; it works only for processes located on the same host, but provides good performance. For some reason, access to the neighbor process via the /proc filesystem is prohibited on your system, which causes the posix transport initialization to fail. As an alternative you may use the sysv transport, which provides the same (or very similar) performance but uses a different API to establish the connection between processes.

BTW, one of the reasons why posix may fail to initialize is a restriction on the ptrace API: could you check whether the file /proc/sys/kernel/yama/ptrace_scope has the value 0? If it does not, can you try running

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

and check whether the posix transport recovers?
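For illustration, the attach path failing in the log above roughly corresponds to the following sketch (not the actual UCX code): the peer opens the creator's file descriptor through /proc, which only works if that /proc entry is visible and accessible (same user, same PID namespace, and ptrace-style access allowed).

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static void *attach_via_proc_link(pid_t owner_pid, int owner_fd, size_t len)
{
    char  path[64];
    int   fd;
    void *ptr;

    snprintf(path, sizeof(path), "/proc/%d/fd/%d", (int)owner_pid, owner_fd);
    fd = open(path, O_RDWR);   /* the open() that fails with EACCES in the log above */
    if (fd < 0) {
        return NULL;
    }

    ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (ptr == MAP_FAILED) ? NULL : ptr;
}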

@panda1100
Contributor

panda1100 commented Jul 7, 2023

@hoopoepg Thank you for your detailed explanation!!
It looks like my system already has value 0.

$ cat /proc/sys/kernel/yama/ptrace_scope
0

@hoopoepg
Contributor Author

hoopoepg commented Jul 7, 2023

Hmm, OK.
Do you use any kind of container? Docker or another one?

@panda1100
Contributor

panda1100 commented Jul 7, 2023

@hoopoepg Yes, I use Apptainer (formerly Singularity).

So, I believe ptrace is not involved.
https://apptainer.org/docs/user/main/fakeroot.html

It shares IPC and PID with host by default.

@hoopoepg
Contributor Author

hoopoepg commented Jul 7, 2023

Yes, I use Apptainer (formerly Singularity).

it could be the root of the issue. I'm not familiar with that system, but for Docker we have to set a parameter so that all processes use the same IPC namespace in order to enable posix over the /proc filesystem (and of course all processes must run with the same credentials)

So, I believe ptrace is not involved. https://apptainer.org/docs/user/main/fakeroot.html

yes, it is not involved directly.

@panda1100
Contributor

There is an --ipc option for the Apptainer run command that creates a separate IPC namespace.
With that option, somehow UCX_TLS=posix,cma,ib works.
That is somewhat unintuitive though...

/home/ciq/ysenda/opt/ompi/bin/mpirun -np 2 -mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=hfi1_0:1 -x UCX_TLS=posix,cma,ib apptainer run --ipc hello.sif

@hoopoepg
Contributor Author

hoopoepg commented Jul 7, 2023

That is somewhat unintuitive though...

agree, such configurations are not easy to understand :)
glad to see you found a solution!!!

@panda1100
Contributor

@hoopoepg Thank you!
I have multiple solutions and wanted to understand which one is the best one.
Your explanation definitely helped. Thank you again for your support.

@panda1100
Contributor

@hoopoepg The workaround UCX_POSIX_USE_PROC_LINK=n with UCX_TLS=posix,cma,ib works in my environment. Does it actually use the posix transport? Why does this option work? I'm not sure how it avoids the restriction imposed by the system.

I did check with UCX_LOG_LEVEL=debug, and it looks like it uses posix shared memory.

@hoopoepg
Contributor Author

hoopoepg commented Jul 7, 2023

The trick here is the variable UCX_POSIX_USE_PROC_LINK=n: it forces the SHM object to be created not via the /proc filesystem (it uses /dev/shm or similar instead). Originally the /proc filesystem is used because, in case of abnormal process termination, the shared memory segment is deleted automatically, whereas with a "classic" filesystem such a termination may lead to a resource leak.
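In other words, with UCX_POSIX_USE_PROC_LINK=n the peer attaches a named object under /dev/shm instead of going through /proc/<pid>/. A rough sketch, with an illustrative segment name rather than the real UCX naming scheme:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *attach_via_shm_name(const char *name, size_t len)
{
    int   fd = shm_open(name, O_RDWR, 0600);   /* e.g. "/example_segment_12345" */
    void *ptr;

    if (fd < 0) {
        return NULL;
    }

    ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return (ptr == MAP_FAILED) ? NULL : ptr;
}

The tradeoff is exactly the one described above: a named segment outlives an abnormally terminated owner until someone calls shm_unlink, whereas the /proc-link variant disappears with its owner.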

@panda1100
Contributor

@hoopoepg Thank you very much!! That makes perfect sense.

@tvegas1
Contributor

tvegas1 commented Jul 13, 2023

Started PR #9213 to discuss making things work out of the box, likely without the need to add --ipc or disable the proc-link mode.

@panda1100
Contributor

@tvegas1 -san, Thank you for letting us know about the PR. I will try it with Apptainer on our test environment.
