I maintain the osc/sm component but did not write the pscw synchronization. I agree that a counter is not sufficient. I have a fix in mind and will probably create a PR for it later this week. The fix will need to be applied to 1.10, 2.x, and master.
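To see concretely why a bare counter cannot work, here is a small stand-alone model of the race described in the report below (plain C11 threads and atomics, no MPI; compile with something like gcc -std=c11 -pthread; every name in it is made up and nothing is taken from the osc/sm sources). A POST is modeled as an anonymous counter increment, so a post coming from a target that belongs to a later epoch is enough to release the origin's first START:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int post_count;           /* models the shared per-window counter */

static void *target(void *arg)
{
    sleep(*(int *)arg);                 /* process skew: the 1st-epoch target is late */
    atomic_fetch_add(&post_count, 1);   /* "POST": bump the counter, no identity attached */
    return NULL;
}

int main(void)
{
    pthread_t first_epoch_target, second_epoch_target;
    int late = 2, early = 0;

    atomic_init(&post_count, 0);
    pthread_create(&first_epoch_target, NULL, target, &late);
    pthread_create(&second_epoch_target, NULL, target, &early);

    /* Origin's first START: "wait until one POST has arrived".  The early
     * second-epoch target satisfies this even though the intended target has
     * not posted yet -- which is exactly the premature PUT seen in the log. */
    while (atomic_load(&post_count) < 1)
        ;
    printf("first START released by *some* POST, not necessarily the right one\n");

    pthread_join(first_epoch_target, NULL);
    pthread_join(second_epoch_target, NULL);
    return 0;
}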
-Nathan

On Fri, Sep 18, 2015 at 10:33:18AM +0200, Steffen Christgau wrote:
> Hi folks,
>
> [the following discussion is based on v1.8.8]
>
> Suppose you have an MPI one-sided program using general active target
> synchronization (GATS). In that program, a single origin process performs
> two rounds of communication, i.e. two access epochs, to different target
> process groups. The target processes synchronize accordingly with the
> single origin process.
>
> Suppose further that, for whatever reason, there is process skew that
> delays the target processes of the first group but does not affect the
> second group. Thus, the processes in the second group issue a post
> operation earlier than those of the first group.
>
> IMO, this should have no effect on the origin process. It should first
> complete its access epoch to the first group of targets, then the one to
> the second group.
>
> Things work as expected with the osc/rdma component but not with osc/sm.
> To get osc/sm involved, compile the attached program with the
> -DUSE_WIN_ALLOCATE compiler flag. In detail, I used
>
> mpicc -O0 -g -Wall -std=c99 -DUSE_WIN_ALLOCATE pscw_epochs.c -o pscw_epochs
>
> Run the compiled program on a shared-memory system (e.g. your workstation)
> with more than 2 processes and either use --mca osc sm or do not specify
> any mca parameter at all (the sm component is picked automatically for
> windows on shared-memory systems if the window is created by
> MPI_Win_allocate and friends).
>
> This gives a deadlock (timestamps removed from the output):
>
> mpiexec -n 3 ./pscw_epochs
>
> [2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [0 @ pscw_epochs.c:41]: putting to 1
> [0 @ pscw_epochs.c:44]: done with put/complete
> [1 @ pscw_epochs.c:61]: sleeping...
> [0 @ pscw_epochs.c:53]: putting value 2 to rank 2
> [1 @ pscw_epochs.c:63]: woke up.
> [1 @ pscw_epochs.c:66]: window buffer modified before sync'ed
> [1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [2 @ pscw_epochs.c:75]: target done got 2 -> success
> ^C
>
> Note that this not only causes a deadlock but also puts data into the
> window of a process that has not synchronized yet (rank 1).
>
> If I run the program with more than 3 processes, the wrong data in the
> window disappears, but the deadlock still manifests:
>
> mpiexec -n 4 ./pscw_epochs
> [1 @ pscw_epochs.c:61]: sleeping...
> [2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [3 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [1 @ pscw_epochs.c:63]: woke up.
> [1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> ^C
>
> The reason seems to be that the implementation uses a counter to check
> whether all processes given in START have issued matching POST operations.
> START only checks whether the counter's value matches the number of
> processes in the start group. That way, it is prone to modifications by
> other target processes from "future" epochs.
>
> IMO, the counter is simply not a good solution for implementing START, as
> it cannot track which process has performed POST. I suppose a solution
> would be a list or bit vector, as proposed in [1].
>
> Looking forward to a discussion (maybe at EuroMPI or the MPI Forum next
> week).
>
> Kind regards, Steffen
>
> [1] Ping Lai, Sayantan Sur, and Dhabaleswar K. Panda. "Designing truly
> one-sided MPI-2 RMA intra-node communication on multi-core systems". In:
> Computer Science - R&D 25.1-2 (2010), pp. 3–14. DOI:
> 10.1007/s00450-010-0115-3
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> #include <mpi.h>
>
> #define WIN_BUFFER_INIT_VALUE 0xFFFFFFFF
> #define DPRINT(fmt, ...) printf("[%d @ %.6f %s:%d]: " fmt "\n", comm_rank, \
>         MPI_Wtime(), __FILE__, __LINE__, ##__VA_ARGS__)
>
> int main(int argc, char** argv)
> {
>     int comm_rank, comm_size, i, buffer;
>     int* win_buffer;
>     int exclude_targets[2] = { 0, 1 };
>     MPI_Win win;
>     MPI_Group world_group, start_group, post_group;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
>
> #ifdef USE_WIN_ALLOCATE
>     MPI_Win_allocate(sizeof(*win_buffer), 1, MPI_INFO_NULL, MPI_COMM_WORLD,
>                      &win_buffer, &win);
> #else
>     MPI_Alloc_mem(sizeof(*win_buffer), MPI_INFO_NULL, &win_buffer);
>     MPI_Win_create(win_buffer, sizeof(*win_buffer), 1, MPI_INFO_NULL,
>                    MPI_COMM_WORLD, &win);
> #endif
>     *win_buffer = WIN_BUFFER_INIT_VALUE;
>
>     MPI_Comm_group(MPI_COMM_WORLD, &world_group);
>
>     if (comm_rank == 0) {
>         /* origin */
>
>         /* 1st round: access window of rank 1 */
>         MPI_Group_incl(world_group, 1, exclude_targets + 1, &start_group);
>
>         buffer = 1;
>         MPI_Win_start(start_group, 0, win);
>         DPRINT("putting to 1");
>         MPI_Put(&buffer, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
>         MPI_Win_complete(win);
>         DPRINT("done with put/complete");
>
>         MPI_Group_free(&start_group);
>
>         /* 2nd round: access everyone else */
>         MPI_Group_excl(world_group,
>                        sizeof(exclude_targets) / sizeof(*exclude_targets),
>                        exclude_targets, &start_group);
>         buffer = 2;
>         MPI_Win_start(start_group, 0, win);
>         for (i = 2; i < comm_size; i++) {
>             DPRINT("putting value %d to rank %d", buffer, i);
>             MPI_Put(&buffer, 1, MPI_INT, i, 0, 1, MPI_INT, win);
>         }
>         MPI_Win_complete(win);
>         MPI_Group_free(&start_group);
>     } else {
>         /* target */
>         if (comm_rank == 1) {
>             DPRINT("sleeping...");
>             sleep(2);
>             DPRINT("woke up.");
>
>             if (*win_buffer != WIN_BUFFER_INIT_VALUE) {
>                 DPRINT("window buffer modified before sync'ed");
>             }
>         }
>
>         MPI_Group_incl(world_group, 1, exclude_targets, &post_group);
>         MPI_Win_post(post_group, 0, win);
>         DPRINT("posted, waiting for wait to return...");
>         /* nop */
>         MPI_Win_wait(win);
>         DPRINT("target done got %d -> %s", *win_buffer,
>                (*win_buffer != (comm_rank == 1 ? 1 : 2)) ? "FAILURE" : "success");
>         MPI_Group_free(&post_group);
>     }
>
>     MPI_Win_free(&win);
>     MPI_Finalize();
> }
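One way to make START robust is to track per peer who has posted, along the lines of the bit vector proposed in [1]. The sketch below is an illustration under that assumption only; shm_sync_t, record_post, and wait_for_posts are invented names, and this is neither the existing osc/sm data structure nor necessarily what the upcoming PR will do. The point is that START consumes exactly the bits of its own start group, so POSTs that belong to future epochs stay pending:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    /* bit i set => rank i has called POST; assumes <= 64 ranks share the window */
    atomic_uint_fast64_t post_bits;
} shm_sync_t;

/* Target side of MPI_Win_post: record *which* rank posted, not just a count. */
static void record_post(shm_sync_t *state, int my_rank)
{
    atomic_fetch_or(&state->post_bits, (uint_fast64_t)1 << my_rank);
}

/* Origin side of MPI_Win_start: wait until every rank of the start group has
 * posted, then clear exactly those bits, leaving POSTs that belong to other
 * (future) epochs untouched. */
static void wait_for_posts(shm_sync_t *state, const int *group_ranks, int n)
{
    uint_fast64_t needed = 0;
    for (int i = 0; i < n; i++)
        needed |= (uint_fast64_t)1 << group_ranks[i];

    while ((atomic_load(&state->post_bits) & needed) != needed)
        ;                               /* spin; a real implementation would yield/progress */

    atomic_fetch_and(&state->post_bits, ~needed);
}

int main(void)
{
    shm_sync_t state;
    int first_group[] = { 1 };

    atomic_init(&state.post_bits, 0);
    record_post(&state, 2);             /* 2nd-epoch target posts early...          */
    record_post(&state, 1);             /* ...and the 1st-epoch target posts later. */

    wait_for_posts(&state, first_group, 1);   /* matches rank 1 only */
    printf("rank 2's post still pending: %s\n",
           atomic_load(&state.post_bits) ? "yes" : "no");
    return 0;
}

A single 64-bit word obviously caps the number of ranks per node; a real fix would size the vector by the communicator size or keep one flag per peer, but the bookkeeping idea is the same.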