[OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart
All, I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am observing. The numbers below are measurements from both an IB-based cluster and our Cray XC40. The benchmarks look like the following snippet: `if (rank == 1) { uint64_t re` […]
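The snippet is cut off in the archive. A minimal sketch of this kind of latency benchmark, assuming rank 1 drives MPI_Fetch_and_op against a counter exposed by rank 0 (the iteration count, lock type, and names are illustrative, not the original code):

```
/* Sketch of a fetch-and-op latency benchmark: rank 1 repeatedly
 * increments a counter in rank 0's window and reports the average
 * per-operation latency. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    if (rank == 0) *baseptr = 0;
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        const int iters = 100000;
        uint64_t result, one = 1;
        /* MPI_LOCK_SHARED exercises the shared-lock path instead. */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Fetch_and_op(&one, &result, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);  /* complete each operation remotely */
        }
        double t1 = MPI_Wtime();
        MPI_Win_unlock(0, win);
        printf("avg latency: %.2f us\n", (t1 - t0) / iters * 1e6);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```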

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Nathan Hjelm via users
All of this is completely expected. Due to the requirements of the standard, it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter: osc_rdma_acc_single_intrinsic=true
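For reference, MCA parameters can be passed on the mpirun command line or exported through the environment; the program name below is illustrative:

```
# Set the parameter for a single run:
mpirun --mca osc_rdma_acc_single_intrinsic true -n 2 ./atomic_bench

# Or export it through the environment for all subsequent runs:
export OMPI_MCA_osc_rdma_acc_single_intrinsic=true
```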

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart
Thanks a lot for the quick reply. Setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings the latency down to <2 µs per operation. I hope that the info key will make it into the next version of the standard; I certainly have use for it :) Cheers, Joseph
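As an aside, window-creation hints of this kind are conveyed through MPI_Info. A minimal sketch using the standard MPI-3 "accumulate_ops" key; the info-key form of osc_rdma_acc_single_intrinsic discussed in this thread is not standardized, but would be supplied the same way:

```
/* Sketch: passing accumulate hints at window creation. Asserting
 * "same_op_no_op" promises that concurrent accumulates to a location
 * all use the same operation (or no-op), which can allow the
 * implementation to use hardware atomics. */
#include <mpi.h>
#include <stdint.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "accumulate_ops", "same_op_no_op");

    uint64_t *baseptr;
    MPI_Win win;
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), info,
                     MPI_COMM_WORLD, &baseptr, &win);
    MPI_Info_free(&info);

    /* ... RMA epoch as in the benchmark above ... */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```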

[OMPI users] Mapping and Ranking in 3.1.3

2018-11-06 Thread Ben Menadue
Hi, Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per node, running with OMP_NUM_THREADS=4. A common placement policy we see is to have rank 0 on the first 4 cores of the first socket, rank 1 on the second 4 cores, rank 2 on the first 4 cores of the second socket, and so on. […]
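One plausible way to request that layout with Open MPI's mpirun (syntax as in the 3.1 series; the application name is illustrative, and the exact rank ordering may additionally need a --rank-by option):

```
# Two ranks per socket, each bound to 4 consecutive cores:
export OMP_NUM_THREADS=4
mpirun -n 4 --map-by ppr:2:socket:PE=4 --bind-to core --report-bindings ./hybrid_app
```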