Profiling MPI applications with Tracy can cause long spikes in non-blocking send/receive operations #966

Open
psalz opened this issue Jan 7, 2025 · 2 comments

psalz commented Jan 7, 2025

We are using Tracy to profile our distributed GPU runtime system Celerity, and it's mostly working great. However, during some recent benchmarking runs on the Leonardo supercomputer we've noticed that traces often contain very long spikes for MPI non-blocking send / receive operations, with some transfers taking several thousand times longer than they should (e.g. 30ms instead of 10us, sometimes even > 100ms).

Here's an example trace for a run with 32 ranks. Notice how there are several small gaps throughout the run and a few very long ones towards the end, caused by long transfers (in the "p2p" fibers at the very bottom). The application is a simple stencil code executed over 10000 iterations, with each iteration performing exactly the same set of operations (point-to-point transfers between ranks, some copies, and GPU kernel executions).

[Screenshot: Tracy trace of the 32-rank run, showing small gaps throughout and long gaps towards the end]

Long story short, it turns out that those spikes only happen while profiling with Tracy, and therefore seem to be due to some unfortunate interaction between the Tracy client and MPI.

What is very curious is that the gaps happen at seemingly predictable phases of the program's execution.

Here is another trace of the same application / configuration. Notice how the pattern of gaps looks very similar, although in this case the long gap towards the end is quite a bit shorter.

[Screenshot: second Tracy trace of the same run, with a similar gap pattern]

I've managed to create a small-ish reproducer program, in case anyone is interested:

#include <numeric>
#include <optional>
#include <functional>
#include <algorithm>
#include <array>
#include <vector>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <deque>
#include <random>

#include <mpi.h>
#include <tracy/Tracy.hpp>
#include <tracy/TracyC.h>

using clk = std::chrono::steady_clock;
using namespace std::chrono_literals;

int main(int argc, char* argv[]) {
	const size_t transfer_bytes = argc > 1 ? std::atol(argv[1]) : 16384 * 4;
	const size_t iterations = 10000;
	const size_t warmup = 100;

	int provided = -1;
	MPI_Init_thread(nullptr, nullptr, MPI_THREAD_MULTIPLE, &provided);

	int rank = -1;
	int size = -1;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	if(rank == 0) {
		printf("Transferring %zu bytes %zu times, %zu warmup iterations\n", transfer_bytes, iterations, warmup);
		fflush(stdout);
	}
	MPI_Barrier(MPI_COMM_WORLD);

	std::vector<unsigned char> sendbuf_up(transfer_bytes);
	std::vector<unsigned char> recvbuf_up(transfer_bytes);
	std::vector<unsigned char> sendbuf_down(transfer_bytes);
	std::vector<unsigned char> recvbuf_down(transfer_bytes);

	std::vector<clk::duration> times;
	times.reserve(iterations);

	const int up = rank > 0 ? rank - 1 : size - 1;
	const int down = rank < size - 1 ? rank + 1 : 0;
	const int tag_up = 0;
	const int tag_down = 1;
	std::random_device rd;
	std::mt19937 g(rd());
	for(size_t i = 0; i < iterations + warmup; ++i) {
		const auto before = clk::now();
		MPI_Request reqs[4];

		MPI_Irecv(recvbuf_up.data(), transfer_bytes, MPI_BYTE, up, tag_down + i, MPI_COMM_WORLD, &reqs[0]);
		MPI_Irecv(recvbuf_down.data(), transfer_bytes, MPI_BYTE, down, tag_up + i, MPI_COMM_WORLD, &reqs[1]);
		MPI_Isend(sendbuf_up.data(), transfer_bytes, MPI_BYTE, up, tag_up + i, MPI_COMM_WORLD, &reqs[2]);
		MPI_Isend(sendbuf_down.data(), transfer_bytes, MPI_BYTE, down, tag_down + i, MPI_COMM_WORLD, &reqs[3]);

		bool done[4] = {false, false, false, false};
		bool all_done = false;
		while(!all_done) {
			ZoneScopedN("make things slow"); // <--- comment out this zone to remove spikes

			all_done = true;
			for(size_t j = 0; j < 4; ++j) {
				if(done[j]) continue;
				int flag = -1;
				MPI_Test(&reqs[j], &flag, MPI_STATUS_IGNORE);
				done[j] = flag != 0;
				all_done = all_done && done[j];
			}
		}

		const auto after = clk::now();
		if(i >= warmup) {
			times.push_back(after - before);
		}
	}

	MPI_Finalize();

	const auto sum = std::accumulate(times.begin(), times.end(), clk::duration{});
	const auto min = *std::min_element(times.begin(), times.end());
	const auto max = *std::max_element(times.begin(), times.end());

	printf("Rank %2d mean: %4zuus, min: %4zuus, max: %4zuus\n", rank, sum / 1us / iterations, min / 1us, max / 1us);

	return 0;
}

Obviously creating zones in a busy loop is not ideal, but this was the only way I could reproduce the effect in this small example. In our real application zones are submitted by different threads, including the thread that calls MPI_Test, but not for each iteration as is done here.
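
To illustrate the difference, the pattern in the real application is conceptually closer to the following sketch (hypothetical code, not the actual Celerity implementation): a worker thread emits zones while the polling thread opens no zones of its own.

#include <atomic>
#include <chrono>
#include <thread>

#include <tracy/Tracy.hpp>

using namespace std::chrono_literals;

int main() {
	std::atomic<bool> running{true};

	// Zone traffic comes from a separate worker thread...
	std::thread worker([&] {
		while(running.load(std::memory_order_relaxed)) {
			ZoneScopedN("worker task");
			std::this_thread::sleep_for(100us);
		}
	});

	// ...while this thread would run the MPI_Isend / MPI_Irecv / MPI_Test
	// polling loop from the reproducer above, without any zone of its own.
	std::this_thread::sleep_for(1s);

	running.store(false);
	worker.join();
	return 0;
}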

Here's the output when running on 32 ranks on Leonardo, with the ZoneScopedN enabled:

Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  7 mean:   20us, min:   13us, max: 26901us
Rank  4 mean:   20us, min:   13us, max: 31456us
Rank 17 mean:   20us, min:   12us, max: 25444us
Rank 14 mean:   20us, min:   12us, max: 21204us
Rank 19 mean:   20us, min:   13us, max: 25211us
Rank 31 mean:   20us, min:   13us, max: 31455us
Rank 10 mean:   20us, min:   13us, max: 20106us
Rank  5 mean:   20us, min:   13us, max: 31455us
Rank  1 mean:   20us, min:   12us, max: 31458us
Rank 11 mean:   20us, min:   13us, max: 18880us
Rank 18 mean:   20us, min:   13us, max: 25231us
Rank 28 mean:   20us, min:   13us, max: 31472us
Rank  3 mean:   20us, min:   13us, max: 31455us
Rank  2 mean:   20us, min:   13us, max: 31458us
Rank 23 mean:   20us, min:   12us, max: 25189us
Rank 27 mean:   20us, min:   13us, max: 25188us
Rank  0 mean:   20us, min:   13us, max: 31460us
Rank  6 mean:   20us, min:   12us, max: 26901us
Rank 16 mean:   20us, min:   12us, max: 25554us
Rank 24 mean:   20us, min:   12us, max: 25186us
Rank  8 mean:   20us, min:   12us, max: 24177us
Rank  9 mean:   20us, min:   12us, max: 19973us
Rank 13 mean:   20us, min:   12us, max: 21199us
Rank 20 mean:   20us, min:   12us, max: 25191us
Rank 29 mean:   20us, min:   12us, max: 31467us
Rank 30 mean:   20us, min:   12us, max: 31461us
Rank 25 mean:   20us, min:   12us, max: 25186us
Rank 15 mean:   20us, min:   12us, max: 25550us
Rank 21 mean:   20us, min:   12us, max: 25186us
Rank 26 mean:   20us, min:   12us, max: 25184us
Rank 12 mean:   20us, min:   12us, max: 16532us
Rank 22 mean:   20us, min:   12us, max: 25188us

And here's the output without the ZoneScopedN:

Transferring 65536 bytes 10000 times, 100 warmup iterations
Rank  8 mean:   15us, min:   12us, max:  250us
Rank 22 mean:   15us, min:   13us, max:  216us
Rank 20 mean:   15us, min:   12us, max:  216us
Rank 24 mean:   15us, min:   12us, max:  216us
Rank 15 mean:   15us, min:   12us, max:  249us
Rank 23 mean:   15us, min:   12us, max:  216us
Rank  0 mean:   15us, min:   13us, max:  251us
Rank 12 mean:   15us, min:   13us, max:  249us
Rank 25 mean:   15us, min:   12us, max:  217us
Rank 21 mean:   15us, min:   12us, max:  216us
Rank 19 mean:   15us, min:   12us, max:  215us
Rank  5 mean:   15us, min:   13us, max:  250us
Rank 14 mean:   15us, min:   12us, max:  249us
Rank  7 mean:   15us, min:   12us, max:  251us
Rank 29 mean:   15us, min:   12us, max:  217us
Rank 26 mean:   15us, min:   12us, max:  217us
Rank 17 mean:   15us, min:   12us, max:  216us
Rank  9 mean:   15us, min:   12us, max:  250us
Rank 18 mean:   15us, min:   12us, max:  217us
Rank  2 mean:   15us, min:   12us, max:  251us
Rank 11 mean:   15us, min:   12us, max:  249us
Rank 13 mean:   15us, min:   12us, max:  249us
Rank  4 mean:   15us, min:   13us, max:  251us
Rank  6 mean:   15us, min:   13us, max:  251us
Rank 28 mean:   15us, min:   12us, max:  216us
Rank 31 mean:   15us, min:   13us, max:  216us
Rank  3 mean:   15us, min:   13us, max:  251us
Rank  1 mean:   15us, min:   12us, max:  251us
Rank 27 mean:   15us, min:   12us, max:  216us
Rank 16 mean:   15us, min:   13us, max:  250us
Rank 10 mean:   15us, min:   12us, max:  249us
Rank 30 mean:   15us, min:   12us, max:  216us

I realize that this is a rather difficult issue to reproduce; I'm mainly opening it to see if anybody has any ideas as to what might be causing these spikes, or any suggestions for how to investigate this further.

One hypothesis we had was that somewhere inside MPI an OS / hardware interaction sometimes causes a thread to be scheduled out, and the Tracy thread gets scheduled in instead (which could result in delays on the order of milliseconds). However, it is unlikely that this would produce such a consistent gap pattern. Furthermore, we've tried explicitly setting the thread affinity for Tracy and all other application threads to ensure they don't overlap, but this does not seem to change anything (or at least not consistently; we've seen a couple of instances where it seemed to eliminate the gaps, but that wasn't reproducible).
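
For reference, the per-thread pinning we tried is along the lines of the following minimal sketch (Linux-only; the core id is a placeholder rather than the actual layout we used):

#include <pthread.h>
#include <sched.h>

// Pin the calling thread to a single core so that application threads and the
// Tracy worker thread cannot end up sharing (or migrating onto) the same core.
void pin_current_thread_to_core(const int core) {
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(core, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}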

Here are some additional things we've determined:

  • It only seems to happen for transfers over the actual network (not shared-memory transfers on a single node).
  • It is reproducible with both OpenMPI and Intel MPI (MPICH-based).
  • The spikes actually happen somewhere inside calls to MPI_Test et al.; pre-loading a dummy MPI library that replaces MPI_Isend / MPI_Irecv / MPI_Test with no-ops eliminates the gaps (see the sketch after this list).
  • It does not matter whether the trace is actually being consumed (e.g. via tracy-capture) or not.
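
For completeness, a minimal sketch of what such an LD_PRELOAD shim can look like (the file name and build / run lines are illustrative, not our exact setup):

// mpi_noop_shim.cpp
// Build: mpicxx -shared -fPIC -o libmpinoop.so mpi_noop_shim.cpp
// Run:   LD_PRELOAD=./libmpinoop.so mpirun -n 32 ./reproducer
#include <mpi.h>

extern "C" {

// Turn the point-to-point calls into no-ops; all other MPI functions still
// resolve to the real library.
int MPI_Isend(const void*, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request* request) {
	*request = MPI_REQUEST_NULL;
	return MPI_SUCCESS;
}

int MPI_Irecv(void*, int, MPI_Datatype, int, int, MPI_Comm, MPI_Request* request) {
	*request = MPI_REQUEST_NULL;
	return MPI_SUCCESS;
}

// Report every request as immediately complete.
int MPI_Test(MPI_Request*, int* flag, MPI_Status*) {
	*flag = 1;
	return MPI_SUCCESS;
}

} // extern "C"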

wolfpld (Owner) commented Jan 7, 2025

You are using the async / fiber functionality, and the current implementation switches everything to be fully serialized in that case. Maybe this is the reason why you see this behavior?

psalz (Author) commented Jan 7, 2025

You are using the async / fiber functionality, and the current implementation switches everything to be fully serialized in that case. Maybe this is the reason why you see this behavior?

Yes, Celerity uses the fibers API to render concurrent tasks in our runtime. However, the reproducer code does not; it only uses a single ZoneScopedN!
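
For context, the way the fibers API is used there is conceptually along these lines (a rough sketch, not the actual Celerity code; requires Tracy to be built with TRACY_FIBERS defined):

#include <tracy/Tracy.hpp>

// Each logical task (e.g. an in-flight p2p transfer) is rendered as its own
// fiber lane in the trace.
void poll_transfer_once() {
	TracyFiberEnter("p2p"); // fiber name must be a persistent string
	{
		ZoneScopedN("await transfer");
		// ... poll MPI_Test for this transfer here ...
	}
	TracyFiberLeave;
}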
