Hi,
I'm trying to compare the semantics of MPI RMA with those of ARMCI. I've 
written a small test program that writes data to a remote processor and then 
reads the data back to the original processor. In ARMCI, you should be able to 
do this since operations to the same remote processor are completed in the same 
order in which they are requested on the calling processor. I've implemented 
this in two different ways using MPI RMA. The first is to call MPI_Win_lock to 
create a shared lock on the remote processor, then MPI_Put/MPI_Get to initiate 
the data transfer, and finally MPI_Win_unlock to force completion of the 
transfer. My understanding is that this should allow you to write data to and 
then read data from the same process, since the first triplet
MPI_Win_lock
MPI_Put
MPI_Win_unlock
must be completed both locally and remotely before the unlock call completes. 
The calls in the second triplet
MPI_Win_lock
MPI_Get
MPI_Win_unlock
cannot start until the first triplet is done, so if both the put and the get 
refer to the same data on the same remote processor, then it should work.
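To be concrete, the sequence I have in mind looks like this (just a sketch; 
win, rproc, displ, dtype, a and b are the same as in the attached program):

MPI_Win_lock(MPI_LOCK_SHARED, rproc, 0, win);
MPI_Put(a, 1, dtype, rproc, displ, 1, dtype, win);
MPI_Win_unlock(rproc, win);   /* put complete locally and at the target */

MPI_Win_lock(MPI_LOCK_SHARED, rproc, 0, win);
MPI_Get(b, 1, dtype, rproc, displ, 1, dtype, win);
MPI_Win_unlock(rproc, win);   /* get complete; b should now match a */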
The second implementation uses request-based RMA. It starts by calling 
MPI_Win_lock_all on every process when the window is created and 
MPI_Win_unlock_all just before it is destroyed, so that the window is always 
in a passive-target synchronization epoch. The put is implemented by calling 
MPI_Rput followed by MPI_Wait on the request handle returned from the MPI_Rput 
call. Similarly, the get is implemented by calling MPI_Rget followed by 
MPI_Wait. The wait call guarantees that the operation is completed locally and 
the data can then be used. However, from what I understand of the standard, it 
doesn't say anything about the ordering of the operations, so conceivably the 
get could be satisfied at the target before the put has completed there. 
Inserting an MPI_Win_flush_all between the MPI_Rput and the MPI_Rget should 
guarantee that the operations are ordered.
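In code, the request-based version with the flush would look something like 
this (again just a sketch, reusing the same variables as in the attached 
program; the window has already been locked with MPI_Win_lock_all):

MPI_Request req;
MPI_Rput(a, 1, dtype, rproc, displ, 1, dtype, win, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* local completion of the put only */

MPI_Win_flush_all(win);              /* should force remote completion of the put */

MPI_Rget(b, 1, dtype, rproc, displ, 1, dtype, win, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* b can be used after this */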
I've written the test program so that it can use either the lock-based or the 
request-based implementation, and I've also included an option that inserts a 
fence/flush plus a barrier between the put and the get. The different 
configurations can be selected by defining preprocessor symbols at the top of 
the program. The program repeats the test in a loop; the number of iterations 
is currently set to 2000. The results I get running on a Linux cluster with an 
InfiniBand network, using 2 processes on 2 different SMP nodes, are as 
follows:
Using OpenMPI-1.8.3:
Request-based implementation without synchronization: 9 successes out of 10 runs
Request-based implementation with synchronization: 19 successes out of 20 runs
Lock-based implementation without synchronization: 1 success out of 10 runs
Lock-based implementation with synchronization: 1 success out of 10 runs
Using OpenMPI-1.10.1:
Request-based implementation without synchronization: 2 successes out of 10 runs
Request-based implementation with synchronization: 8 successes out of 10 runs
Lock-based implementation without synchronization: 4 successes out of 10 runs
Lock-based implementation with synchronization: 2 successes out of 10 runs
Except for the request-based implementation without synchronization (in this 
case, a call to MPI_Win_flush_all), I would expect all of these to succeed. Is 
there some fault in my thinking here? I've attached the test program.
Bruce Palmer


#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>

#include <mpi.h>

/*
 * Uncomment MPI_USE_REQUESTS to use the request-based (MPI_Rput/MPI_Rget)
 * implementation instead of lock/unlock, and USE_SYNC to insert a
 * fence/flush plus barrier between the put and the get.
#define USE_SYNC
#define MPI_USE_REQUESTS
*/
#define NDIM 500
#define LOOP 2000

int main(int argc, char *argv[])
{
  int nvals = NDIM * NDIM;
  MPI_Win win;
  MPI_Aint displ;
  MPI_Datatype dtype;
  void *buf;
  double *a, *b, *c;
  int tsize, lo[2], hi[2], rlo[2], rhi[2], ldims[2], dims[2], starts[2];
  int i, j, k, me, nproc, rproc, loop;
  MPI_Request request;
  MPI_Status status;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);
  /* Allocate window */
  tsize = sizeof(double)*nvals;
  MPI_Alloc_mem(tsize,MPI_INFO_NULL,&buf);
  MPI_Win_create(buf,tsize,1,MPI_INFO_NULL,MPI_COMM_WORLD,&win);
  c = (double*)buf;
#ifdef MPI_USE_REQUESTS
  MPI_Win_lock_all(0,win);
#endif

  /* Create local buffers*/
  a = (double*)malloc(nvals*sizeof(double));
  b = (double*)malloc(nvals*sizeof(double));

  /* set up blocks for local and remote requests */
  lo[0] = 0;
  lo[1] = 0;
  hi[0] = NDIM/2-1;
  hi[1] = NDIM/2-1;

  rlo[0] = NDIM/2;
  rlo[1] = NDIM/2;
  rhi[0] = NDIM-1;
  rhi[1] = NDIM-1;

  /* Evaluate displacement on remote processor */
  displ = rlo[0] + rlo[1]*NDIM;
  displ = displ*sizeof(double);

  /* loop over tests */
  for (loop=0; loop<LOOP; loop++) {

    /* Fill local buffer with unique values */
    for (j=0; j<NDIM; j++) {
      for (i=0; i<NDIM; i++) {
        k = i+j*NDIM;
        a[k] = (double)(k+(me+loop)*nvals);
        b[k] = 0.0;
        c[k] = 0.0;
      }
    }

    /* Construct data type. For this test, we can use the same data type for
     * both local and remote buffers */
    dims[0] = NDIM;
    dims[1] = NDIM;
    ldims[0] = NDIM/2;
    ldims[1] = NDIM/2;
    starts[0] = 0;
    starts[1] = 0;
    MPI_Type_create_subarray(2,dims,ldims,starts,MPI_ORDER_FORTRAN,
        MPI_DOUBLE,&dtype);

    /* Put data in remote buffer */
    rproc = (me+1)%nproc;
    MPI_Type_commit(&dtype);
#ifdef MPI_USE_REQUESTS
    MPI_Rput(a,1,dtype,rproc,displ,1,dtype,win,&request); 
    MPI_Wait(&request,&status);
#else
    MPI_Win_lock(MPI_LOCK_SHARED,rproc,0,win);
    MPI_Put(a,1,dtype,rproc,displ,1,dtype,win); 
    MPI_Win_unlock(rproc,win);
#endif

#ifdef USE_SYNC
#ifdef MPI_USE_REQUESTS
    MPI_Win_flush_all(win);
#else
    MPI_Win_fence(0,win);
#endif
    MPI_Barrier(MPI_COMM_WORLD);
#endif

    /* Get data from remote buffer */
#ifdef MPI_USE_REQUESTS
    MPI_Rget(b,1,dtype,rproc,displ,1,dtype,win,&request); 
    MPI_Wait(&request,&status);
#else
    MPI_Win_lock(MPI_LOCK_SHARED,rproc,0,win);
    MPI_Get(b,1,dtype,rproc,displ,1,dtype,win); 
    MPI_Win_unlock(rproc,win);
#endif

    /* Compare values in a and b */
    for (j=0; j<NDIM/2; j++) {
      for (i=0; i<NDIM/2; i++) {
        k = i+j*NDIM;
        if (a[k] != b[k]) {
          fprintf(stderr,"p[%d] loop: %d value a[%d,%d]: %f actual b[%d,%d]: %f\n",
              me,loop,i,j,a[k],i,j,b[k]);
          assert(0);
        }
      }
    }
    MPI_Type_free(&dtype);
    if (me==0) printf("Test passed for loop %d\n",loop);
  }
  if (me==0) printf("\nAll tests successful\n");

#ifdef MPI_USE_REQUESTS
  MPI_Win_unlock_all(win);
#endif
  /* Clean up the window and buffers before finalizing */
  MPI_Win_free(&win);
  MPI_Free_mem(buf);
  free(a);
  free(b);
  MPI_Finalize();
  return (0);
}
