I tried replacing the MPI_Alloc_mem/MPI_Win_create pair with MPI_Win_allocate, 
but it doesn't seem to have much of an effect. It may improve the success rate 
in some cases, but overall it seems a wash. I haven't done any timing; I've 
mostly been focusing on correctness, so I don't know whether this replacement 
improved performance. I also tried using MPI_Win_flush_local paired with the 
standard MPI_Put/MPI_Get and, again, did not see a significant improvement in 
the success rate for this test.

I'm not surprised that RMA-based implementations work for applications, at 
least most of the time, since the tests usually get through at least one loop 
before dying. Although ARMCI specifies that operations to the same remote 
processor complete in the order they are issued, this particular access 
pattern does not appear to be relied on in practical applications. I checked 
with one of the NWChem developers and he does not think it likely that this 
particular motif is used anywhere in NWChem. On the other hand, unless I'm 
doing something wrong, this should work according to the MPI-3 standard (at 
least in the case where an appropriate synchronization or flush operation is 
inserted between the put and get operations), and the fact that it does not is 
worrisome.
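
For concreteness, here is a minimal sketch of the motif I'm describing, 
assuming the window is already in a passive-target epoch opened with 
MPI_Win_lock_all and that a, b, dtype, rproc, and displ are set up as in the 
attached test program (the per-target MPI_Win_flush is just one possible 
choice of synchronization; the attached program uses MPI_Win_flush_all plus a 
barrier at that point):

    MPI_Put(a,1,dtype,rproc,displ,1,dtype,win);
    MPI_Win_flush(rproc,win);        /* complete the put at the target */
    MPI_Get(b,1,dtype,rproc,displ,1,dtype,win);
    MPI_Win_flush_local(rproc,win);  /* b can now be read locally */

With the flush in place, b should contain the values written by the put.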

I modified the test program so that MPI_Win_allocate is used and added code for 
using MPI_Put/MPI_Get plus an MPI_Win_flush_local operation. This can be turned 
on and off by defining a preprocessor flag at the top of the program. As far as 
I can tell, this matches the usage in the ARMCI-MPI code. Running on an 
InfiniBand cluster using 2 processors on 2 separate nodes, I got the following 
results:

OpenMPI-1.8.3
Request-based implementation without synchronization: 11 successes out of 20 tests
Request-based implementation with synchronization: 17 successes out of 20 tests
Flush local implementation without synchronization: 18 successes out of 20 tests
Flush local implementation with synchronization: 19 successes out of 20 tests
Lock-based implementation without synchronization: 7 successes out of 20 tests
Lock-based implementation with synchronization: 10 successes out of 20 tests

OpenMPI-1.10.1
Request-based implementation without synchronization: 11 successes out of 20 tests
Request-based implementation with synchronization: 15 successes out of 20 tests
Flush local implementation without synchronization: 8 successes out of 20 tests
Flush local implementation with synchronization: 18 successes out of 20 tests
Lock-based implementation without synchronization: 6 successes out of 20 tests
Lock-based implementation with synchronization: 8 successes out of 20 tests

Bruce

Date: Fri, 8 Jan 2016 14:01:15 -0800
From: Jeff Hammond <jeff.scie...@gmail.com>
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Put/Get semantics

Instead of MPI_Alloc_mem and MPI_Win_create, you should use MPI_Win_allocate.  
This will make it much easier for the implementation to optimize with 
interprocess shared memory and exploit scalability features such as symmetric 
globally addressable memory.  It also obviates the need to do both MPI_Win_free 
and MPI_Free_mem.
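
For example, here is a minimal sketch of the replacement, mirroring the 
allocation toggle in the attached test program (with tsize, buf, and win as in 
your test program):

    /* instead of: */
    MPI_Alloc_mem(tsize,MPI_INFO_NULL,&buf);
    MPI_Win_create(buf,tsize,1,MPI_INFO_NULL,MPI_COMM_WORLD,&win);
    /* ... with MPI_Win_free(&win) followed by MPI_Free_mem(buf) at the end, use: */

    MPI_Win_allocate(tsize,1,MPI_INFO_NULL,MPI_COMM_WORLD,&buf,&win);
    /* ... and a single MPI_Win_free(&win) releases the memory as well */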

Based upon what I've seen recently 
(https://travis-ci.org/jeffhammond/armci-mpi), using MPI_Win_allocate may fix 
some unresolved Open MPI RMA bugs 
(https://github.com/open-mpi/ompi/issues/1275).

As for your synchronization question, instead of

    MPI_Rget(b,1,dtype,rproc,displ,1,dtype,win,&request);
    MPI_Wait(&request,&status);

and

    MPI_Rput(a,1,dtype,rproc,displ,1,dtype,win,&request);
    MPI_Wait(&request,&status);

you should use

    MPI_Get(b,1,dtype,rproc,displ,1,dtype,win);
    MPI_Win_flush_local(rproc,win);

and

    MPI_Put(a,1,dtype,rproc,displ,1,dtype,win);
    MPI_Win_flush_local(rproc,win);

as there is no need to create a request for this usage model.
Request-based RMA entails some implementation overhead in some cases, and is 
more likely to be broken since it is not heavily tested. On the other hand, the 
non-request RMA has been tested extensively thanks to the thousands of NWChem 
jobs I've run using ARMCI-MPI on Cray, InfiniBand, and other systems.

As I think I've said before on some list, one of the best ways to understand 
the mapping between ARMCI and MPI RMA is to look at ARMCI-MPI.

Jeff

On Wed, Jan 6, 2016 at 8:51 AM, Palmer, Bruce J <bruce.pal...@pnnl.gov> wrote:
> Hi,
>
> I'm trying to compare the semantics of MPI RMA with those of ARMCI. I've
> written a small test program that writes data to a remote processor and then
> reads the data back to the original processor. In ARMCI, you should be able
> to do this since operations to the same remote processor are completed in the
> same order that they are requested on the calling processor. I've implemented
> this two different ways using MPI RMA. The first is to call MPI_Win_lock to
> create a shared lock on the remote processor, then MPI_Put/MPI_Get to
> initiate the data transfer and finally MPI_Win_unlock to force completion of
> the data transfer. My understanding is that this should allow you to write
> and then read data to the same process, since the first triplet
>
> MPI_Win_lock
> MPI_Put
> MPI_Win_unlock
>
> must be completed both locally and remotely before the unlock call completes.
> The calls in the second triplet
>
> MPI_Win_lock
> MPI_Get
> MPI_Win_unlock
>
> cannot start until the first triplet is done, so if both the put and the get
> refer to the same data on the same remote processor, then it should work.
>
> The second implementation uses request-based RMA and starts by calling
> MPI_Win_lock_all collectively on the window when it is created and
> MPI_Win_unlock_all when it is destroyed, so that the window is always in a
> passive synchronization epoch. The put is implemented by calling MPI_Rput
> followed by calling MPI_Wait on the handle returned from the MPI_Rput call.
> Similarly, get is implemented by calling MPI_Rget followed by MPI_Wait. The
> wait call guarantees that the operation is completed locally and the data can
> then be used. However, from what I understand of the standard, it doesn't say
> anything about the ordering of the operations, so conceivably the put could
> execute remotely before the get. Inserting an MPI_Win_flush_all between the
> MPI_Rput and MPI_Rget should guarantee that the operations are ordered.
>
> I've written the test program so that it can use either the lock- or
> request-based implementations and I've also included an option that inserts a
> fence/flush plus barrier operation between put and get. The different
> configurations can be set up by defining some preprocessor symbols at the top
> of the program. The program loops over the test repeatedly and the current
> number of loops is set at 2000. The results I get running on a Linux cluster
> with an InfiniBand network using OpenMPI-1.10.1 on 2 processors on 2
> different SMP nodes are as follows:
>
> Using OpenMPI-1.8.3:
> Request-based implementation without synchronization: 9 successes out of 10 runs
> Request-based implementation with synchronization: 19 successes out of 20 runs
> Lock-based implementation without synchronization: 1 success out of 10 runs
> Lock-based implementation with synchronization: 1 success out of 10 runs
>
> Using OpenMPI-1.10.1:
> Request-based implementation without synchronization: 2 successes out of 10 runs
> Request-based implementation with synchronization: 8 successes out of 10 runs
> Lock-based implementation without synchronization: 4 successes out of 10 runs
> Lock-based implementation with synchronization: 2 successes out of 10 runs
>
> Except for the request-based implementation without synchronization (in this
> case a call to MPI_Win_flush_all), I would expect these to all succeed. Is
> there some fault to my thinking here? I've attached the test program.
>
> Bruce Palmer




--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>

#include <mpi.h>

/* Configuration flags:
 *   MPI_USE_REQUESTS    - use MPI_Rput/MPI_Rget + MPI_Wait inside a passive
 *                         epoch opened with MPI_Win_lock_all
 *   MPI_USE_FLUSH_LOCAL - use MPI_Put/MPI_Get + MPI_Win_flush_local instead
 *                         (implies MPI_USE_REQUESTS)
 *   USE_SYNC            - insert a flush/fence plus barrier between the put
 *                         and the get
 * If neither MPI_USE_REQUESTS nor MPI_USE_FLUSH_LOCAL is defined, the
 * lock-based implementation (MPI_Win_lock/MPI_Win_unlock around each
 * operation) is used. */
/*
#define MPI_USE_REQUESTS
#define MPI_USE_FLUSH_LOCAL
*/
#define USE_SYNC
#define NDIM 500
#define LOOP 2000
#ifdef MPI_USE_FLUSH_LOCAL
#define MPI_USE_REQUESTS
#endif

int main(int argc, char *argv[])
{
  int nvals = NDIM * NDIM;
  MPI_Win win;
  MPI_Aint displ;
  MPI_Datatype dtype;
  void *buf;
  double *a, *b, *c;
  int tsize, lo[2], hi[2], rlo[2], rhi[2], ldims[2], dims[2], starts[2];
  int i, j, k, me, nproc, rproc, loop;
  MPI_Request request;
  MPI_Status status;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  /* Allocate window. Change the condition below to 0 to use the
   * MPI_Alloc_mem/MPI_Win_create pair instead of MPI_Win_allocate. */
  tsize = sizeof(double)*nvals;
#if 1
  MPI_Win_allocate(tsize,1,MPI_INFO_NULL,MPI_COMM_WORLD,&buf,&win);
#else
  MPI_Alloc_mem(tsize,MPI_INFO_NULL,&buf);
  MPI_Win_create(buf,tsize,1,MPI_INFO_NULL,MPI_COMM_WORLD,&win);
#endif
  c = (double*)buf;
#ifdef MPI_USE_REQUESTS
  MPI_Win_lock_all(0,win);
#endif

  /* Create local buffers*/
  a = (double*)malloc(nvals*sizeof(double));
  b = (double*)malloc(nvals*sizeof(double));

  /* set up blocks for local and remote requests */
  lo[0] = 0;
  lo[1] = 0;
  hi[0] = NDIM/2-1;
  hi[1] = NDIM/2-1;

  rlo[0] = NDIM/2;
  rlo[1] = NDIM/2;
  rhi[0] = NDIM-1;
  rhi[1] = NDIM-1;

  /* Evaluate displacement on remote processor */
  displ = rlo[0] + rlo[1]*NDIM;
  displ = displ*sizeof(double);

  /* loop over tests */
  for (loop=0; loop<LOOP; loop++) {

    /* Fill local buffer with unique values */
    for (j=0; j<NDIM; j++) {
      for (i=0; i<NDIM; i++) {
        k = i+j*NDIM;
        a[k] = (double)(k+(me+loop)*nvals);
        b[k] = 0.0;
        c[k] = 0.0;
      }
    }

    /* Construct data type. For this test, we can use the same data type for
     * both local and remote buffers */
    dims[0] = NDIM;
    dims[1] = NDIM;
    ldims[0] = NDIM/2;
    ldims[1] = NDIM/2;
    starts[0] = 0;
    starts[1] = 0;
    MPI_Type_create_subarray(2,dims,ldims,starts,MPI_ORDER_FORTRAN,
        MPI_DOUBLE,&dtype);

    /* Put data in remote buffer */
    rproc = (me+1)%nproc;
    MPI_Type_commit(&dtype);
#ifdef MPI_USE_REQUESTS
#ifdef MPI_USE_FLUSH_LOCAL
    MPI_Put(a,1,dtype,rproc,displ,1,dtype,win); 
    MPI_Win_flush_local(rproc,win);
#else
    MPI_Rput(a,1,dtype,rproc,displ,1,dtype,win,&request); 
    MPI_Wait(&request,&status);
#endif
#else
    MPI_Win_lock(MPI_LOCK_SHARED,rproc,0,win);
    MPI_Put(a,1,dtype,rproc,displ,1,dtype,win); 
    MPI_Win_unlock(rproc,win);
#endif

#ifdef USE_SYNC
#ifdef MPI_USE_REQUESTS
    MPI_Win_flush_all(win);
#else
    MPI_Win_fence(0,win);
#endif
    MPI_Barrier(MPI_COMM_WORLD);
#endif

    /* Get data from remote buffer */
#ifdef MPI_USE_REQUESTS
#ifdef MPI_USE_FLUSH_LOCAL
    MPI_Get(b,1,dtype,rproc,displ,1,dtype,win); 
    MPI_Win_flush_local(rproc,win);
#else
    MPI_Rget(b,1,dtype,rproc,displ,1,dtype,win,&request); 
    MPI_Wait(&request,&status);
#endif
#else
    MPI_Win_lock(MPI_LOCK_SHARED,rproc,0,win);
    MPI_Get(b,1,dtype,rproc,displ,1,dtype,win); 
    MPI_Win_unlock(rproc,win);
#endif

    /* Compare values in a and b */
    for (j=0; j<NDIM/2; j++) {
      for (i=0; i<NDIM/2; i++) {
        k = i+j*NDIM;
        if (a[k] != b[k]) {
          fprintf(stderr,"p[%d] loop: %d value a[%d,%d]: %f actual b[%d,%d]: 
%f\n",
              me,loop,i,j,a[k],i,j,b[k]);
          assert(0);
        }
      }
    }
    MPI_Type_free(&dtype);
    if (me==0) printf("Test passed for loop %d\n",loop);
  }
  if (me==0) printf("\nAll tests successful");

#ifdef MPI_USE_REQUESTS
  MPI_Win_unlock_all(win);
#endif
  MPI_Win_free(&win);
#if 0
  /* Only needed if the window memory came from MPI_Alloc_mem; it must be
   * freed after the window has been freed. */
  MPI_Free_mem(buf);
#endif
  MPI_Finalize();
  return (0);
}
