-
Notifications
You must be signed in to change notification settings - Fork 432
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SHM/MM: use real SHM access for reachable test #4511
SHM/MM: use real SHM access for reachable test #4511
Conversation
src/uct/sm/base/sm_iface.c
Outdated
@@ -88,11 +115,36 @@ UCS_CLASS_INIT_FUNC(uct_sm_iface_t, uct_iface_ops_t *ops, uct_md_h md, | |||
|
|||
self->config.bandwidth = sm_config->bandwidth; | |||
|
|||
self->shmid = shmget(IPC_PRIVATE, sizeof(uint64_t), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so if SysV is not supported, posix will not work as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, but without sysV - most of unix systems are not operable
as I remember sysV supported by all linux & freeBSD/MacOS systems
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we already have support for posix without sysv dependency - let's keep it that way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls use the transport itself and its mappers to test reachability, not direct calls to sysv
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 03-Dec-2019
|
f647522
to
0f09df5
Compare
Mellanox CI: FAILED on 10 of 25 workers (click for details)Note: the logs will be deleted after 07-Dec-2019
|
0f09df5
to
39fb657
Compare
Mellanox CI: FAILED on 6 of 25 workers (click for details)Note: the logs will be deleted after 07-Dec-2019
|
bot:mlx:retest |
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 09-Dec-2019
|
bot:pipe:retest |
Mellanox CI: FAILED on 25 of 25 workers (click for details)Note: the logs will be deleted after 09-Dec-2019
|
Mellanox CI: FAILED on 2 of 25 workers (click for details)Note: the logs will be deleted after 10-Dec-2019
|
out-of-memory |
bot:mlx:retest |
Mellanox CI: FAILED on 2 of 25 workers (click for details)Note: the logs will be deleted after 10-Dec-2019
|
10a58a0
to
a9ebe98
Compare
resolved conflicts, rebased to master |
Mellanox CI: FAILED on 7 of 25 workers (click for details)Note: the logs will be deleted after 11-Dec-2019
|
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 11-Dec-2019
|
src/ucs/sys/sys.c
Outdated
#define UCS_PROCESS_NS_NET_DFLT 0xF0000080U | ||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, removed
src/ucs/sys/sys.c
Outdated
if (res == 0) { | ||
ucs_sys_namespace_info[ns].ino = (ucs_sys_ns_t)st.st_ino; | ||
} else { | ||
ucs_sys_namespace_info[ns].ino = ucs_sys_namespace_info[ns].dflt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe also print a warning with errno?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if freeBSD or MacOS support such bood ID - it could be ok if such file is missing
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 12-Dec-2019
|
*(pid_t*)addr = getpid(); | ||
ucs_cma_iface_ext_device_addr_t *iface_addr = (void*)addr; | ||
|
||
ucs_assert(!(getpid() & UCT_CMA_IFACE_ADDR_FLAG_PID_NS)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe assert_always
?, then can do iface_addr->super.id = getpid()
looks like we always need to detect this error and it is not a fast path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are ucs_assert used in other places to evaluate pid, I just used same logic
src/uct/sm/base/sm_iface.c
Outdated
|
||
|
||
typedef struct { | ||
uint64_t id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: can align with struct below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, fixed both
src/uct/sm/cma/cma_iface.c
Outdated
@@ -12,6 +12,16 @@ | |||
#include <ucs/sys/string.h> | |||
|
|||
|
|||
typedef struct { | |||
pid_t id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: can align with struct below
src/uct/sm/mm/posix/mm_posix.c
Outdated
return ucs_sys_get_ns(UCS_SYS_NS_TYPE_PID) == *(const ucs_sys_ns_t*)iface_addr; | ||
} | ||
|
||
return ucs_sys_ns_is_default(UCS_SYS_NS_TYPE_PID); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra whitespace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, will fix on squash
Mellanox CI: FAILED on 25 of 25 workers (click for details)Note: the logs will be deleted after 13-Dec-2019
|
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 13-Dec-2019
|
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 14-Dec-2019
|
src/uct/sm/base/sm_iface.c
Outdated
return ext_addr->ipc_ns == my_addr.ipc_ns; | ||
return (ext_addr->super.id == my_addr.super.id) && | ||
(!(ext_addr->super.id & UCS_SM_IFACE_ADDR_FLAG_EXT) || | ||
(ext_addr->ipc_ns == my_addr.ipc_ns)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unhandled?
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 15-Dec-2019
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls squash
- use namespace ID to evaluate reachable iface - used IPC namespace for sysV & knem ifaces - IPC + PID namespaces for posix + cma ifaces
072968a
to
d222b72
Compare
Mellanox CI: PASSED on 25 workers (click for details)Note: the logs will be deleted after 15-Dec-2019
|
bot:pipe:retest |
r-vmb-rhel7-u0-beta-x86-64 failure |
@yosefe ok to merge? |
Hi @yosefe -san and @hoopoepg -san, I'm still facing the issue #4224 The following workaround did work for me but could you please help me to understand why
|
hi @panda1100 BTW, one of the reasons why posix may fail to initialize is restriction on ptrace API: could you check if file /proc/sys/kernel/yama/ptrace_scope has value
and check if posix transport is recovered |
@hoopoepg Thank you for your detailed explanation!!
|
hmmm, ok. |
@hoopoepg Yes, I use Apptainer (formerly Singularity). So, I believe It shares IPC and PID with host by default. |
it could be root of issue. I'm not familiar with such system, but for docker we have to set parameter to use same IPC namespace across all processes to enable posix over /proc filesystem (and of-course all processes must be run using same credentials)
yes, it is not involved directly. |
There is
|
agree, such configurations are not easy to understand :) |
@hoopoepg Thank you! |
@hoopoepg The workaround I did check with |
the trick here is in variable |
@hoopoepg Thank you very much!! That makes perfect sense. |
Started PR #9213 to discuss making things work out of the box, likely without need to add |
@tvegas1 -san, Thank you for letting us know about PR. I will try that PR with Apptainer on our test environment. |
on iface initialization (instead of GUID based)