All,
I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:
```
if (rank == 1) {
  uint64_t res, val = 1;
  MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
  MPI_Win_flush(0, win);
}
```
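To make this reproducible, a self-contained version of the benchmark looks roughly like the program below; the window setup, iteration count, and timing output are illustrative choices on my side rather than the exact code we run:

```
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

#define NUM_REPS 100000  /* iteration count chosen for illustration */

int main(int argc, char **argv)
{
    int rank;
    uint64_t *baseptr;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one uint64_t counter exposed by every rank */
    MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &baseptr, &win);
    *baseptr = 0;

    MPI_Win_lock_all(0, win);  /* passive-target (shared) lock on all ranks */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 1) {
        uint64_t res, val = 1;
        double start = MPI_Wtime();
        for (int i = 0; i < NUM_REPS; ++i) {
            /* atomic fetch-and-add on rank 0's counter */
            MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, 0, 0, MPI_SUM, win);
            MPI_Win_flush(0, win);
        }
        printf("MPI_Fetch_and_op: %.2f us/op\n",
               (MPI_Wtime() - start) * 1e6 / NUM_REPS);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```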
All of this is completely expected. Due to the requirements of the standard, it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:
osc_rdma_acc_single_intrinsic=true
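You can set it either on the mpirun command line or through the environment; the executable name below is just a placeholder:

```
mpirun --mca osc_rdma_acc_single_intrinsic true -n 2 ./fetch_op_bench

# equivalently, via the environment
export OMPI_MCA_osc_rdma_acc_single_intrinsic=true
```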
Thanks a lot for the quick reply! Setting
osc_rdma_acc_single_intrinsic=true does the trick for both shared and
exclusive locks and brings the latency down to <2us per operation. I hope
that the info key will make it into the next version of the standard; I
certainly have a use for it :)
Cheers,
Joseph
Hi,
Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per
node, running with OMP_NUM_THREADS=4. A common placement policy we see is to
have rank 0 on the first 4 cores of the first socket, rank 1 on the second 4
cores, rank 2 on the first 4 cores of the second socket, and rank 3 on the
remaining 4 cores.
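To make that layout concrete, a small hybrid test like the sketch below prints which core each rank's OpenMP threads actually land on (the use of sched_getcpu() assumes Linux/glibc, and the program structure is mine, not from any particular application):

```
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>   /* sched_getcpu() */
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* with OMP_NUM_THREADS=4 each rank spawns 4 threads; each thread
       reports the core it is currently executing on */
    #pragma omp parallel
    {
        printf("rank %d thread %d runs on core %d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}
```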