We are using Tracy to profile our distributed GPU runtime system Celerity, and it's mostly working great. However, during some recent benchmarking runs on the Leonardo supercomputer we've noticed that traces often contain very long spikes for MPI non-blocking send / receive operations, with some transfers taking several thousand times longer than they should (e.g. 30ms instead of 10us, sometimes even > 100ms).
Here's an example trace for a run with 32 ranks. Notice how there are several small gaps throughout the run and a few very long ones towards the end, caused by long transfers (in the "p2p" fibers at the very bottom). The application is a simple stencil code executed over 10000 iterations, with each iteration performing exactly the same set of operations (point to point transfers between ranks, some copies as well as GPU kernel executions).
Long story short, it turns out that those spikes only happen while profiling with Tracy, and therefore seem to be due to some unfortunate interaction between the Tracy client and MPI.
What is very curious is that the gaps happen at seemingly predictable phases of the program's execution.
Here is another trace of the same application / configuration. Notice how the pattern of gaps looks very similar, although in this case the long gap towards the end is quite a bit shorter.
I've managed to create a small-ish reproducer program, in case anyone is interested:
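In essence it looks roughly like this (a simplified sketch rather than the exact listing; message size, tag, the pairing of ranks and the reporting threshold are placeholders):

```cpp
// Rough shape of the reproducer (placeholders only): each rank exchanges a
// small message with a neighboring rank via MPI_Isend / MPI_Irecv and
// busy-waits on MPI_Testall, opening a Tracy zone in every polling
// iteration. Compile with -DTRACY_ENABLE and link against the Tracy client.
#include <mpi.h>
#include <tracy/Tracy.hpp>  // or "Tracy.hpp", depending on how Tracy is vendored
#include <chrono>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if(size % 2 != 0) MPI_Abort(MPI_COMM_WORLD, 1);  // pairing below assumes an even rank count
    const int peer = rank ^ 1;  // exchange with the neighboring rank

    std::vector<char> send_buf(1024), recv_buf(1024);
    for(int i = 0; i < 10000; ++i) {
        MPI_Request reqs[2];
        MPI_Irecv(recv_buf.data(), (int)recv_buf.size(), MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send_buf.data(), (int)send_buf.size(), MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        const auto start = std::chrono::steady_clock::now();
        int done = 0;
        while(!done) {
            ZoneScopedN("poll");  // zone inside the busy loop -- needed to trigger the spikes
            MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
        }
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start).count();
        if(us > 1000) std::printf("rank %d, iteration %d: transfer took %lld us\n", rank, i, (long long)us);
    }

    MPI_Finalize();
    return 0;
}
```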
Obviously creating zones in a busy loop is not ideal, but this was the only way I could reproduce the effect in this small example. In our real application zones are submitted by different threads, including the thread that calls MPI_Test, but not for each iteration as is done here.
Here's the output when running on 32 ranks on Leonardo, first with the ZoneScopedN enabled and then with the ZoneScopedN removed:
I realize that this is a rather difficult issue to reproduce; I'm mainly opening it to see if anybody has any ideas as to what might be causing these spikes, or any suggestions for how to investigate this further.
One hypothesis we had was that somewhere inside MPI an OS / hardware interaction sometimes causes a thread to be scheduled out, and the Tracy worker thread would get scheduled in (which could result in delays on the order of milliseconds). However, it is unlikely that this would produce such a consistent gap pattern. Furthermore, we've tried explicitly setting the thread affinity for Tracy and all other application threads to ensure no overlap, but this does not seem to change anything (or at least not consistently; we've seen a couple of instances where it seemed to eliminate the gaps, but we couldn't reproduce that reliably).
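For reference, the pinning itself is unspectacular; a minimal sketch using pthread_setaffinity_np (the CPU indices and the partitioning of cores between the application threads and the Tracy worker are placeholders, and how Tracy's own thread ends up pinned is not shown here):

```cpp
// Minimal sketch of pinning the calling thread to a single logical CPU on
// Linux. The actual core assignment depends on the node topology and on how
// many cores each MPI rank gets.
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static bool pin_current_thread_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pthread_setaffinity_np returns 0 on success
    if(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        std::fprintf(stderr, "failed to pin thread to CPU %d\n", cpu);
        return false;
    }
    return true;
}

// Each application thread calls this with its own core, leaving at least one
// core unused so the Tracy worker thread never shares a core with them.
```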
Here are some additional things we've determined:
It only seems to happen for transfers over the actual network; it does not occur for shared-memory transfers within a single node.
Reproducible for both OpenMPI and Intel MPI (MPICH).
The spikes actually happen somewhere inside calls to MPI_Test et al.; pre-loading a dummy MPI library that replaces MPI_Isend / MPI_Irecv / MPI_Test with no-ops eliminates the gaps (a sketch of such a shim is shown right after this list).
It does not matter whether the trace is actually being consumed (e.g. via tracy-capture) or not.
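The preloaded dummy library is roughly of this shape (a sketch, not the exact library we used; it obviously breaks the program's semantics, since no data is ever transferred, and only serves to take the MPI progress engine out of the timing):

```cpp
// LD_PRELOAD shim that turns the non-blocking p2p calls into no-ops.
#include <mpi.h>

extern "C" int MPI_Isend(const void*, int, MPI_Datatype, int, int, MPI_Comm,
                         MPI_Request* request) {
    *request = MPI_REQUEST_NULL;
    return MPI_SUCCESS;
}

extern "C" int MPI_Irecv(void*, int, MPI_Datatype, int, int, MPI_Comm,
                         MPI_Request* request) {
    *request = MPI_REQUEST_NULL;
    return MPI_SUCCESS;
}

extern "C" int MPI_Test(MPI_Request* request, int* flag, MPI_Status*) {
    *request = MPI_REQUEST_NULL;
    *flag = 1;  // always report completion immediately
    return MPI_SUCCESS;
}

// Build and preload, e.g. with Open MPI:
//   mpicxx -shared -fPIC -o libdummy_mpi.so dummy_mpi.cpp
//   mpirun -n 32 -x LD_PRELOAD=./libdummy_mpi.so ./reproducer
```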
You are using async / fiber functionality, and the current implementation switches everything to be fully serialized in such case. Maybe this is the reason why you see this behavior?
Yes, Celerity uses the fibers API to render concurrent tasks in our runtime. However, the reproducer code does not, it only uses a single ZoneScopedN!
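For context, the fiber markup is roughly of this form (a minimal illustration, not the actual Celerity code; it assumes the client is built with TRACY_FIBERS):

```cpp
// Illustration of the Tracy fiber API: zones recorded between
// TracyFiberEnter and TracyFiberLeave show up on a virtual lane (e.g. the
// "p2p" lanes visible at the bottom of the traces) instead of the calling
// thread's own timeline.
#include <tracy/Tracy.hpp>

void record_transfer_zone() {
    TracyFiberEnter("p2p");             // switch the calling thread onto the virtual "p2p" lane
    {
        ZoneScopedN("receive payload"); // zone is attributed to the "p2p" lane
        // ... poll / wait for the transfer ...
    }
    TracyFiberLeave;                    // back to the thread's own timeline
}
```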