Hi folks, [the following discussion is based on v1.8.8]
suppose you have an MPI one-sided program using general active target
synchronization (GATS). In that program, a single origin process performs two
rounds of communication, i.e. two access epochs, to different target process
groups. The target processes synchronize accordingly with the single origin
process. Suppose further that, for whatever reason, there is process skew
which delays the target processes of the first group but does not affect the
second group. Thus, the processes in the second group issue their post
operation earlier than the first group. IMO, this should have no effect on
the origin process: it should first complete its access epoch to the first
group of targets, then the one to the second group.

Things do work as expected with the osc/rdma component but do not with
osc/sm. To get osc/sm involved, compile the attached program with the
-DUSE_WIN_ALLOCATE compiler flag. In detail, I used

  mpicc -O0 -g -Wall -std=c99 -DUSE_WIN_ALLOCATE pscw_epochs.c -o pscw_epochs

Run the compiled program on a shared memory system (e.g. your workstation)
with more than 2 processes and either use --mca osc sm or do not specify any
mca parameter at all (the sm component is used automatically for windows on
shared memory systems if the window is created by MPI_Win_allocate and
friends). This gives a deadlock (timestamps removed from the output):

  mpiexec -n 3 ./pscw_epochs
  [2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
  [0 @ pscw_epochs.c:41]: putting to 1
  [0 @ pscw_epochs.c:44]: done with put/complete
  [1 @ pscw_epochs.c:61]: sleeping...
  [0 @ pscw_epochs.c:53]: putting value 2 to rank 2
  [1 @ pscw_epochs.c:63]: woke up.
  [1 @ pscw_epochs.c:66]: window buffer modified before sync'ed
  [1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
  [2 @ pscw_epochs.c:75]: target done got 2 -> success
  ^C

Note that this not only causes a deadlock but also puts data into the window
of a process that has not synchronized yet (rank 1). If I run the program
with more than 3 processes, the effect of wrong data in the window
disappears, but the deadlock still manifests:

  mpiexec -n 4 ./pscw_epochs
  [1 @ pscw_epochs.c:61]: sleeping...
  [2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
  [3 @ pscw_epochs.c:72]: posted, waiting for wait to return...
  [1 @ pscw_epochs.c:63]: woke up.
  [1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
  ^C

The reason for this seems to be that the implementation uses a counter to
check whether all processes given in START have issued the corresponding POST
operations. START only checks whether the counter's value matches the number
of processes in the start group; that way, it is prone to modifications by
other target processes from "future" epochs. IMO, the counter is simply not a
good solution for implementing START, as it is not capable of tracking which
process has performed POST. I suppose a solution for this would be a list or
a bit vector, as proposed in [1] (a small sketch contrasting the two schemes
follows below).

Looking forward to a discussion (maybe at EuroMPI or the MPI Forum next
week).

Kind regards,
Steffen

[1] Ping Lai, Sayantan Sur, and Dhabaleswar K. Panda. "Designing truly
one-sided MPI-2 RMA intra-node communication on multi-core systems". In:
Computer Science - R&D 25.1-2 (2010), pp. 3-14. DOI: 10.1007/s00450-010-0115-3
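For illustration, here is a minimal, self-contained sketch (plain C, no MPI)
contrasting the two accounting schemes. It is not the actual osc/sm code; all
names in it (post_counter, post_bits, post_from, ...) are invented. It models
the 3-process reproducer: the origin's first access epoch targets only rank
1, but rank 2 posts first. With a single counter, that early POST already
"matches" the size of the first start group, so START({1}) would proceed even
though rank 1 has never posted; with per-peer tracking it would not.

/*
 * Illustration only -- NOT the osc/sm source, just a toy model of the two
 * accounting schemes. All names are invented for this sketch.
 */
#include <stdio.h>
#include <stdint.h>

static int      post_counter = 0;  /* counter scheme: one value per window */
static uint32_t post_bits    = 0;  /* bit-vector scheme: one bit per peer  */

/* What a target's POST would record on the origin side in either scheme. */
static void post_from(int rank)
{
    post_counter += 1;             /* loses the identity of the poster     */
    post_bits    |= 1u << rank;    /* remembers exactly which rank posted  */
}

/* Counter-based START: "ready" once the count matches the group size. */
static int start_ready_counter(int group_size)
{
    return post_counter == group_size;
}

/* Bit-vector-based START: ready only when every listed peer has posted. */
static int start_ready_bits(const int *group, int group_size)
{
    for (int i = 0; i < group_size; i++)
        if (!(post_bits & (1u << group[i])))
            return 0;
    return 1;
}

int main(void)
{
    int first_group[] = { 1 };     /* first access epoch targets rank 1    */

    post_from(2);                  /* process skew: rank 2 posts early     */

    printf("counter scheme:    START({1}) ready? %s\n",
           start_ready_counter(1) ? "yes (wrong, rank 1 never posted)" : "no");
    printf("bit-vector scheme: START({1}) ready? %s\n",
           start_ready_bits(first_group, 1) ? "yes" : "no (correct)");
    return 0;
}

With per-peer tracking, START could also clear only the bits of the peers it
actually consumed, so a POST intended for a later epoch can never satisfy an
earlier one.

The attached reproducer (pscw_epochs.c):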
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

#define WIN_BUFFER_INIT_VALUE 0xFFFFFFFF

#define DPRINT(fmt, ...) \
    printf("[%d @ %.6f %s:%d]: " fmt "\n", comm_rank, MPI_Wtime(), \
           __FILE__, __LINE__, ##__VA_ARGS__)

int main(int argc, char** argv)
{
    int comm_rank, comm_size, i, buffer;
    int* win_buffer;
    int exclude_targets[2] = { 0, 1 };
    MPI_Win win;
    MPI_Group world_group, start_group, post_group;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

#ifdef USE_WIN_ALLOCATE
    MPI_Win_allocate(sizeof(*win_buffer), 1, MPI_INFO_NULL, MPI_COMM_WORLD,
                     &win_buffer, &win);
#else
    MPI_Alloc_mem(sizeof(*win_buffer), MPI_INFO_NULL, &win_buffer);
    MPI_Win_create(win_buffer, sizeof(*win_buffer), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
#endif
    *win_buffer = WIN_BUFFER_INIT_VALUE;

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    if (comm_rank == 0) {
        /* origin */

        /* 1st round: access window of rank 1 */
        MPI_Group_incl(world_group, 1, exclude_targets + 1, &start_group);
        buffer = 1;
        MPI_Win_start(start_group, 0, win);
        DPRINT("putting to 1");
        MPI_Put(&buffer, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);
        DPRINT("done with put/complete");
        MPI_Group_free(&start_group);

        /* 2nd round: access everyone else */
        MPI_Group_excl(world_group,
                       sizeof(exclude_targets) / sizeof(*exclude_targets),
                       exclude_targets, &start_group);
        buffer = 2;
        MPI_Win_start(start_group, 0, win);
        for (i = 2; i < comm_size; i++) {
            DPRINT("putting value %d to rank %d", buffer, i);
            MPI_Put(&buffer, 1, MPI_INT, i, 0, 1, MPI_INT, win);
        }
        MPI_Win_complete(win);
        MPI_Group_free(&start_group);
    } else {
        /* target */
        if (comm_rank == 1) {
            DPRINT("sleeping...");
            sleep(2);
            DPRINT("woke up.");
            if (*win_buffer != WIN_BUFFER_INIT_VALUE) {
                DPRINT("window buffer modified before sync'ed");
            }
        }
        MPI_Group_incl(world_group, 1, exclude_targets, &post_group);
        MPI_Win_post(post_group, 0, win);
        DPRINT("posted, waiting for wait to return...");
        /* nop */
        MPI_Win_wait(win);
        DPRINT("target done got %d -> %s", *win_buffer,
               (*win_buffer != (comm_rank == 1 ? 1 : 2)) ? "FAILURE" : "success");
        MPI_Group_free(&post_group);
    }

    MPI_Win_free(&win);
    MPI_Finalize();

    return 0;
}