I maintain the osc/sm component but did not write the pscw
synchronization. I agree that a counter is not sufficient. I have a fix
in mind and will probably create a PR for it later this week. The fix
will need to be applied to 1.10, 2.x, and master.

-Nathan

On Fri, Sep 18, 2015 at 10:33:18AM +0200, Steffen Christgau wrote:
> Hi folks,
> 
> [the following discussion is based on v1.8.8]
> 
> suppose you have an MPI one-sided program that uses general active
> target synchronization (GATS). In that program, a single origin process
> performs two rounds of communication, i.e. two access epochs, directed
> at two different groups of target processes. The target processes
> synchronize accordingly with the single origin process.
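> 
> Schematically, each round pairs up as follows (a fragment only; the
> group arguments are MPI_Group handles built beforehand, and the two
> halves run on different processes):
> 
> 	/* origin process: access epoch on win */
> 	MPI_Win_start(target_group, 0, win);
> 	MPI_Put(&val, 1, MPI_INT, target_rank, 0, 1, MPI_INT, win);
> 	MPI_Win_complete(win);  /* close the access epoch */
> 
> 	/* target process: exposure epoch on win */
> 	MPI_Win_post(origin_group, 0, win);  /* open the exposure epoch */
> 	MPI_Win_wait(win);  /* return once all origins called complete */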
> 
> Suppose further that, for whatever reason, there is process skew that
> delays the target processes of the first group but does not affect the
> second group. Thus, the processes in the second group issue their POST
> operations earlier than those of the first group.
> 
> IMO, this should have no effect on the origin process. It should first
> complete its access epoch to the first group of targets, then to the
> other one.
> 
> Things work as expected with the osc/rdma component, but not with
> osc/sm. To get osc/sm involved, compile the attached program with the
> -DUSE_WIN_ALLOCATE compiler flag. In detail, I used
> 
> mpicc -O0 -g -Wall -std=c99 -DUSE_WIN_ALLOCATE pscw_epochs.c -o pscw_epochs
> 
> Run the compiled program on a shared-memory system (e.g. your
> workstation) with more than 2 processes, and either pass --mca osc sm
> or do not specify any MCA parameter at all (the sm component is
> selected automatically for windows on shared-memory systems if the
> window is created by MPI_Win_allocate and friends).
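> 
> For example, forcing the sm component explicitly:
> 
> mpiexec --mca osc sm -n 3 ./pscw_epochs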
> 
> This will give a deadlock (timestamps from output removed):
> mpiexec -n 3 ./pscw_epochs
> 
> [2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [0 @ pscw_epochs.c:41]: putting to 1
> [0 @ pscw_epochs.c:44]: done with put/complete
> [1 @ pscw_epochs.c:61]: sleeping...
> [0 @ pscw_epochs.c:53]: putting value 2 to rank 2
> [1 @ pscw_epochs.c:63]: woke up.
> [1 @ pscw_epochs.c:66]: window buffer modified before sync'ed
> [1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [2 @ pscw_epochs.c:75]: target done got 2 -> success
> ^C
> 
> Note that this does not only cause a deadlock, but also puts data into
> the window of a process that has not yet synchronized (rank 1).
> 
> If I run the program with more than 3 processes, the wrong data in the
> window no longer appears, but the deadlock still manifests:
> 
> mpiexec -n 4 ./pscw_epochs
> [1 @ pscw_epochs.c:61]: sleeping...
> [2 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [3 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> [1 @ pscw_epochs.c:63]: woke up.
> [1 @ pscw_epochs.c:72]: posted, waiting for wait to return...
> ^C
> 
> The reason for this seems to be that the implementation uses a counter
> to check whether all processes given in START have issued matching POST
> operations. START only checks whether the counter's value equals the
> number of processes in the start group. That way, it is vulnerable to
> increments by target processes from "future" epochs.
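> 
> To illustrate (names and layout are hypothetical; this is not the
> actual osc/sm source), the counter-based logic boils down to something
> like:
> 
> 	#include <stdatomic.h>
> 
> 	/* hypothetical per-window state: one post counter per process */
> 	typedef struct { atomic_int num_posts; } win_state_t;
> 
> 	/* target side of MPI_Win_post: bump the counter of every origin
> 	 * named in the post group */
> 	static void post_to(win_state_t *origin_state)
> 	{
> 		atomic_fetch_add(&origin_state->num_posts, 1);
> 	}
> 
> 	/* origin side of MPI_Win_start: wait until the counter reaches
> 	 * the size of the start group. The counter cannot tell WHO
> 	 * posted, so an early post from a target of a later epoch is
> 	 * counted towards the current one; the bookkeeping of the
> 	 * following epochs then goes wrong, which produces the early
> 	 * put and the hang shown above. */
> 	static void start_wait(win_state_t *my_state, int group_size)
> 	{
> 		while (atomic_load(&my_state->num_posts) < group_size)
> 			; /* a real implementation drives progress here */
> 		atomic_fetch_sub(&my_state->num_posts, group_size);
> 	}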
> 
> IMO, the counter is simply not a good solution for implementing START,
> as it is not capable of tracking which particular processes have
> performed a POST. I suppose a solution would be to keep a list or bit
> vector instead, as proposed in [1].
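> 
> A sketch of that idea (again with hypothetical names; a single word
> stands in for a full bit vector, so it only covers up to 64 ranks):
> 
> 	#include <stdatomic.h>
> 
> 	/* hypothetical per-window state: one "posted" bit per rank */
> 	typedef struct { atomic_ulong posted; } win_state_t;
> 
> 	/* target side of MPI_Win_post: set this rank's own bit at every
> 	 * origin named in the post group */
> 	static void post_to(win_state_t *origin_state, int my_rank)
> 	{
> 		atomic_fetch_or(&origin_state->posted, 1UL << my_rank);
> 	}
> 
> 	/* origin side of MPI_Win_start: wait until exactly the bits of
> 	 * the start group are set, then clear only those bits. A post
> 	 * from a target of a future epoch keeps its own distinct bit
> 	 * and can no longer satisfy the current epoch. */
> 	static void start_wait(win_state_t *my_state, unsigned long group_mask)
> 	{
> 		while ((atomic_load(&my_state->posted) & group_mask)
> 		       != group_mask)
> 			; /* a real implementation drives progress here */
> 		atomic_fetch_and(&my_state->posted, ~group_mask);
> 	}
> 
> For more than 64 processes, this generalizes to an array of words or a
> list, which is essentially what [1] proposes.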
> 
> Looking forward to a discussion (maybe at EuroMPI or the MPI Forum next
> week).
> 
> 
> Kind regards, Steffen
> 
> [1] Ping Lai, Sayantan Sur, and Dhabaleswar K. Panda. “Designing truly
> one-sided MPI-2 RMA intra-node communication on multi-core systems”.
> In: Computer Science - R&D 25.1-2 (2010), pp. 3–14. DOI:
> 10.1007/s00450-010-0115-3
> 

> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #include <mpi.h>
> 
> #define WIN_BUFFER_INIT_VALUE 0xFFFFFFFF
> #define DPRINT(fmt, ...) \
> 	printf("[%d @ %.6f %s:%d]: " fmt "\n", comm_rank, MPI_Wtime(), \
> 	       __FILE__, __LINE__, ##__VA_ARGS__)
> 
> int main(int argc, char** argv)
> {
>       int comm_rank, comm_size, i, buffer;
>       int* win_buffer;
>       int exclude_targets[2] = { 0, 1 };
>       MPI_Win win;
>       MPI_Group world_group, start_group, post_group;
> 
>       MPI_Init(&argc, &argv);
>       MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
>       MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
> 
> #ifdef USE_WIN_ALLOCATE
> 	MPI_Win_allocate(sizeof(*win_buffer), 1, MPI_INFO_NULL, MPI_COMM_WORLD,
> 	                 &win_buffer, &win);
> #else
>       MPI_Alloc_mem(sizeof(*win_buffer), MPI_INFO_NULL, &win_buffer);
> 	MPI_Win_create(win_buffer, sizeof(*win_buffer), 1, MPI_INFO_NULL,
> 	               MPI_COMM_WORLD, &win);
> #endif
>       *win_buffer = WIN_BUFFER_INIT_VALUE;
> 
>       MPI_Comm_group(MPI_COMM_WORLD, &world_group);
> 
>       if (comm_rank == 0) {
>               /* origin */
> 
>               /* 1st round: access window of rank 1 */
> 		MPI_Group_incl(world_group, 1, exclude_targets + 1, &start_group);
> 
>               buffer = 1;
>               MPI_Win_start(start_group, 0, win);
>               DPRINT("putting to 1");
>               MPI_Put(&buffer, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
>               MPI_Win_complete(win);
>               DPRINT("done with put/complete");
> 
>               MPI_Group_free(&start_group);
> 
>               /* 2nd round: access everyone else */
> 		MPI_Group_excl(world_group,
> 		               sizeof(exclude_targets) / sizeof(*exclude_targets),
> 		               exclude_targets, &start_group);
>               buffer = 2;
>               MPI_Win_start(start_group, 0, win);
>               for (i = 2; i < comm_size; i++) {
>                       DPRINT("putting value %d to rank %d", buffer, i);
>                       MPI_Put(&buffer, 1, MPI_INT, i, 0, 1, MPI_INT, win);
>               }
>               MPI_Win_complete(win);
>               MPI_Group_free(&start_group);
>       } else {
>               /* target */
>               if (comm_rank == 1) {
>                       DPRINT("sleeping...");
>                       sleep(2);
>                       DPRINT("woke up.");
> 
>                       if (*win_buffer != WIN_BUFFER_INIT_VALUE) {
>                               DPRINT("window buffer modified before sync'ed");
>                       }
>               }
> 
>               MPI_Group_incl(world_group, 1, exclude_targets, &post_group);
>               MPI_Win_post(post_group, 0, win);
>               DPRINT("posted, waiting for wait to return...");
> 		/* nothing happens locally between post and wait */
>               MPI_Win_wait(win);
> 		DPRINT("target done got %d -> %s", *win_buffer,
> 		       (*win_buffer != (comm_rank == 1 ? 1 : 2))
> 		               ? "FAILURE" : "success");
>               MPI_Group_free(&post_group);
>       }
> 
> 	MPI_Group_free(&world_group);
> 	MPI_Win_free(&win);
> #ifndef USE_WIN_ALLOCATE
> 	MPI_Free_mem(win_buffer);	/* window memory came from MPI_Alloc_mem */
> #endif
> 	MPI_Finalize();
> }
> 
