Hi folks,

[the following discussion is based on v1.8.8]

suppose you have an MPI one-sided program using general active target
synchronization (GATS). In that program, a single origin process
performs two rounds of communication, i.e. two access epochs, to
different target process groups. The target processes synchronize
accordingly with the single origin process.

Suppose further that, for whatever reason, there is process skew that
delays the target processes of the first group but does not affect
the second group. Thus, the processes in the second group issue their
post operations earlier than those of the first group.

IMO, this should have no effect on the origin process. It should first
complete its access epoch to the first group of targets and then the
one to the second group.
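
For reference, here is a condensed sketch of that pattern (the full,
compilable program pscw_epochs.c is attached below); the function and
group names are only illustrative:

#include <mpi.h>

/* Origin: two consecutive access epochs to two disjoint target groups. */
static void origin_two_epochs(MPI_Win win, MPI_Group first_targets, MPI_Group second_targets)
{
	MPI_Win_start(first_targets, 0, win);    /* 1st epoch: delayed targets  */
	/* ... MPI_Put to the first group ... */
	MPI_Win_complete(win);

	MPI_Win_start(second_targets, 0, win);   /* 2nd epoch: punctual targets */
	/* ... MPI_Put to the second group ... */
	MPI_Win_complete(win);
}

/* Target: a single exposure epoch for the origin; the targets of the
 * first group issue their POST later than those of the second group. */
static void target_one_epoch(MPI_Win win, MPI_Group origin_group)
{
	MPI_Win_post(origin_group, 0, win);
	MPI_Win_wait(win);
}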

Things work as expected with the osc/rdma component but not with
osc/sm. To get osc/sm involved, compile the attached program with the
-DUSE_WIN_ALLOCATE compiler flag. Specifically, I used

mpicc -O0 -g -Wall -std=c99 -DUSE_WIN_ALLOCATE pscw_epochs.c -o pscw_epochs

Run the compiled program on a shared memory system (e.g. your
workstation) with more than 2 processes and either use --mca osc sm or
do not specify any mca parameter at all (the sm component is used for
windows automatically on shared memory systems if the window is created
by MPI_Win_allocate and friends).

This will give a deadlock (timestamps removed from the output):
mpiexec -n 3 ./pscw_epochs

[2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
[0 @ pscw_epochs.c:41]: putting to 1
[0 @ pscw_epochs.c:44]: done with put/complete
[1 @ pscw_epochs.c:61]: sleeping...
[0 @ pscw_epochs.c:53]: putting value 2 to rank 2
[1 @ pscw_epochs.c:63]: woke up.
[1 @ pscw_epochs.c:66]: window buffer modified before sync'ed
[1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
[2 @ pscw_epochs.c:75]: target done got 2 -> success
^C

Note that this does not only cause a deadlock, it also puts data into
the window of a process that has not synchronized yet (rank 1).

If I run the program with more than 3 processes, the wrong data no
longer shows up in the window, but the deadlock remains:

mpiexec -n 4 ./pscw_epochs
[1 @ pscw_epochs.c:61]: sleeping...
[2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
[3 @ pscw_epochs.c:72]: posted, waiting for wait to return...
[1 @ pscw_epochs.c:63]: woke up.
[1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
^C

The reason for this seems to be that the implementation uses a counter
to check whether all processes given in START have issued matching
POST operations. START only checks whether the counter's value matches
the number of processes in the start group. That way, it is prone to
modifications by target processes from "future" epochs.
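
To make the suspected problem concrete, here is a minimal sketch of
such a counter-based scheme. This is not the actual osc/sm code, just
my reading of its behaviour; the names and atomics used are made up:

#include <stdatomic.h>

/* Hypothetical counter-based POST/START tracking (illustration only). */
typedef struct {
	atomic_int post_count;            /* single shared counter per window */
} toy_win_state;

/* Target side: a POST from *any* exposure epoch bumps the same counter. */
static void toy_post(toy_win_state *w)
{
	atomic_fetch_add(&w->post_count, 1);
}

/* Origin side: START only compares the counter against the start-group
 * size, so it cannot tell *which* targets have posted. In the example
 * above, the early POSTs of ranks 2..n-1 already satisfy the first
 * epoch's start-group size of 1, so rank 0 proceeds to put into rank
 * 1's window although rank 1 has not posted yet. */
static void toy_start(toy_win_state *w, int start_group_size)
{
	while (atomic_load(&w->post_count) < start_group_size)
		; /* spin */
	atomic_fetch_sub(&w->post_count, start_group_size);
}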

IMO, the counter is simply not a good solution for implementing START,
as it cannot track which processes have performed POST. I suppose a
solution for this would be to use a list or bit vector, as proposed
in [1].
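
A rough sketch of what I have in mind (again only an illustration, not
a patch; a bit vector limits the scheme to one pending POST per rank,
per-rank counters would lift that restriction):

#include <stdatomic.h>

/* Hypothetical bit-vector-based POST/START tracking (illustration only,
 * limited to 64 ranks per window for simplicity). */
typedef struct {
	_Atomic unsigned long long posted;   /* bit i set <=> rank i has posted */
} toy_win_state_bv;

/* Target side: POST marks exactly the posting rank. */
static void toy_post_bv(toy_win_state_bv *w, int my_rank)
{
	atomic_fetch_or(&w->posted, 1ULL << my_rank);
}

/* Origin side: START waits until every rank of the start group has
 * posted; POSTs from ranks outside the start group no longer count. */
static void toy_start_bv(toy_win_state_bv *w, const int *start_ranks, int n)
{
	unsigned long long needed = 0;
	int i;

	for (i = 0; i < n; i++)
		needed |= 1ULL << start_ranks[i];

	while ((atomic_load(&w->posted) & needed) != needed)
		; /* spin */

	atomic_fetch_and(&w->posted, ~needed);   /* consume exactly these POSTs */
}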

Looking forward to a discussion (maybe at EuroMPI or the MPI Forum next week).


Kind regards, Steffen

[1] Ping Lai, Sayantan Sur, and Dhabaleswar K. Panda. “Designing truly
one-sided MPI-2 RMA intra-node communication on multi-core systems”.
In: Computer Science - R&D 25.1-2 (2010), pp. 3–14. DOI:
10.1007/s00450-010-0115-3

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <mpi.h>

#define WIN_BUFFER_INIT_VALUE 0xFFFFFFFF
#define DPRINT(fmt, ...) printf("[%d @ %.6f %s:%d]: " fmt "\n", comm_rank, MPI_Wtime(), __FILE__, __LINE__, ##__VA_ARGS__)

int main(int argc, char** argv)
{
	int comm_rank, comm_size, i, buffer;
	int* win_buffer;
	int exclude_targets[2] = { 0, 1 };
	MPI_Win win;
	MPI_Group world_group, start_group, post_group;

	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
	MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

#ifdef USE_WIN_ALLOCATE
	MPI_Win_allocate(sizeof(*win_buffer), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win_buffer, &win);
#else
	MPI_Alloc_mem(sizeof(*win_buffer), MPI_INFO_NULL, &win_buffer);
	MPI_Win_create(win_buffer, sizeof(*win_buffer), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
#endif
	*win_buffer = WIN_BUFFER_INIT_VALUE;

	MPI_Comm_group(MPI_COMM_WORLD, &world_group);

	if (comm_rank == 0) {
		/* origin */

		/* 1st round: access window of rank 1 */
		MPI_Group_incl(world_group, 1, exclude_targets + 1, &start_group);

		buffer = 1;
		MPI_Win_start(start_group, 0, win);
		DPRINT("putting to 1");
		MPI_Put(&buffer, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
		MPI_Win_complete(win);
		DPRINT("done with put/complete");

		MPI_Group_free(&start_group);

		/* 2nd round: access everyone else */
		MPI_Group_excl(world_group, sizeof(exclude_targets) / sizeof(*exclude_targets), exclude_targets, &start_group);
		buffer = 2;
		MPI_Win_start(start_group, 0, win);
		for (i = 2; i < comm_size; i++) {
			DPRINT("putting value %d to rank %d", buffer, i);
			MPI_Put(&buffer, 1, MPI_INT, i, 0, 1, MPI_INT, win);
		}
		MPI_Win_complete(win);
		MPI_Group_free(&start_group);
	} else {
		/* target */
		if (comm_rank == 1) {
			/* simulate process skew: delay rank 1's POST */
			DPRINT("sleeping...");
			sleep(2);
			DPRINT("woke up.");

			if (*win_buffer != WIN_BUFFER_INIT_VALUE) {
				DPRINT("window buffer modified before sync'ed");
			}
		}

		MPI_Group_incl(world_group, 1, exclude_targets, &post_group);
		MPI_Win_post(post_group, 0, win);
		DPRINT("posted, waiting for wait to return...");
		/* nop */
		MPI_Win_wait(win);
		DPRINT("target done got %d -> %s", *win_buffer, (*win_buffer != (comm_rank == 1 ? 1 : 2)) ? "FAILURE" : "success");
		MPI_Group_free(&post_group);
	}

	MPI_Group_free(&world_group);
	MPI_Win_free(&win);
#ifndef USE_WIN_ALLOCATE
	MPI_Free_mem(win_buffer);
#endif
	MPI_Finalize();

	return 0;
}
