Instead of MPI_Alloc_mem and MPI_Win_create, you should use MPI_Win_allocate. This makes it much easier for the implementation to optimize with interprocess shared memory and to exploit scalability features such as symmetric, globally addressable memory. It also obviates the need to call both MPI_Win_free and MPI_Free_mem; a single MPI_Win_free suffices.
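Roughly, the change looks like this (just a sketch; bufsize and comm stand in for whatever you are already using):

    double *buf;
    MPI_Win win;

    /* before: allocate memory and create the window separately */
    MPI_Alloc_mem((MPI_Aint)bufsize * sizeof(double), MPI_INFO_NULL, &buf);
    MPI_Win_create(buf, (MPI_Aint)bufsize * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, comm, &win);
    /* ... */
    MPI_Win_free(&win);
    MPI_Free_mem(buf);

    /* after: one call allocates the memory along with the window */
    MPI_Win_allocate((MPI_Aint)bufsize * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, comm, &buf, &win);
    /* ... */
    MPI_Win_free(&win);   /* frees the memory too */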
Based upon what I've seen recently ( https://travis-ci.org/jeffhammond/armci-mpi ), using MPI_Win_allocate may also fix some unresolved Open MPI RMA bugs ( https://github.com/open-mpi/ompi/issues/1275 ).

As for your synchronization question, instead of

    MPI_Rget(b, 1, dtype, rproc, displ, 1, dtype, win, &request);
    MPI_Wait(&request, &status);

and

    MPI_Rput(a, 1, dtype, rproc, displ, 1, dtype, win, &request);
    MPI_Wait(&request, &status);

you should use

    MPI_Get(b, 1, dtype, rproc, displ, 1, dtype, win);
    MPI_Win_flush_local(rproc, win);

and

    MPI_Put(a, 1, dtype, rproc, displ, 1, dtype, win);
    MPI_Win_flush_local(rproc, win);

as there is no need to create a request for this usage model. Request-based RMA entails some implementation overhead in some cases, and it is more likely to be broken since it is not heavily tested. The non-request RMA path, on the other hand, has been tested extensively, thanks to the thousands of NWChem jobs I've run using ARMCI-MPI on Cray, InfiniBand, and other systems.

As I think I've said before on some list, one of the best ways to understand the mapping between ARMCI and MPI RMA is to look at ARMCI-MPI.

Jeff

On Wed, Jan 6, 2016 at 8:51 AM, Palmer, Bruce J <bruce.pal...@pnnl.gov> wrote:
>
> Hi,
>
> I’m trying to compare the semantics of MPI RMA with those of ARMCI. I’ve written a small test program that writes data to a remote processor and then reads the data back to the original processor. In ARMCI this should work, since operations to the same remote processor are completed in the same order that they are requested on the calling processor. I’ve implemented this in two different ways using MPI RMA.
>
> The first is to call MPI_Win_lock to create a shared lock on the remote processor, then MPI_Put/MPI_Get to initiate the data transfer, and finally MPI_Win_unlock to force completion of the data transfer. My understanding is that this should allow you to write and then read data to the same process, since the first triplet
>
>   MPI_Win_lock
>   MPI_Put
>   MPI_Win_unlock
>
> must be completed both locally and remotely before the unlock call completes. The calls in the second triplet
>
>   MPI_Win_lock
>   MPI_Get
>   MPI_Win_unlock
>
> cannot start until the first triplet is done, so if both the put and the get refer to the same data on the same remote processor, then it should work.
>
> The second implementation uses request-based RMA. It starts by calling MPI_Win_lock_all collectively on the window when it is created and MPI_Win_unlock_all when it is destroyed, so that the window is always in a passive synchronization epoch. The put is implemented by calling MPI_Rput followed by MPI_Wait on the handle returned from the MPI_Rput call; similarly, the get is implemented by calling MPI_Rget followed by MPI_Wait. The wait call guarantees that the operation is completed locally and the data can then be used. However, from what I understand of the standard, it says nothing about the ordering of the operations, so conceivably the get could execute remotely before the put. Inserting an MPI_Win_flush_all between the MPI_Rput and the MPI_Rget should guarantee that the operations are ordered.
>
> I’ve written the test program so that it can use either the lock-based or the request-based implementation, and I’ve also included an option that inserts a fence/flush plus a barrier operation between the put and the get. The different configurations can be selected by defining some preprocessor symbols at the top of the program. The program loops over the test repeatedly; the number of loops is currently set at 2000.
> The results I get running on a Linux cluster with an InfiniBand network, using OpenMPI-1.8.3 and OpenMPI-1.10.1 on 2 processors on 2 different SMP nodes, are as follows:
>
> Using OpenMPI-1.8.3:
>   Request-based implementation without synchronization: 9 successes out of 10 runs
>   Request-based implementation with synchronization: 19 successes out of 20 runs
>   Lock-based implementation without synchronization: 1 success out of 10 runs
>   Lock-based implementation with synchronization: 1 success out of 10 runs
>
> Using OpenMPI-1.10.1:
>   Request-based implementation without synchronization: 2 successes out of 10 runs
>   Request-based implementation with synchronization: 8 successes out of 10 runs
>   Lock-based implementation without synchronization: 4 successes out of 10 runs
>   Lock-based implementation with synchronization: 2 successes out of 10 runs
>
> Except for the request-based implementation without synchronization (here the synchronization is a call to MPI_Win_flush_all), I would expect all of these to succeed. Is there some fault in my thinking here? I've attached the test program.
>
> Bruce Palmer
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2016/01/28216.php

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/