All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am seeing. The numbers below are measurements from both an InfiniBand-based cluster and our Cray XC40. The benchmark loop looks like the following snippet:

```
  if (rank == 1) {
    const int target = 0;      /* rank 0 owns the window memory */
    uint32_t res, val = 1;     /* origin/result buffers; uint64_t for the 64-bit runs */
    for (size_t i = 0; i < NUM_REPS; ++i) {
      MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, target, 0, MPI_SUM, win);
      MPI_Win_flush(target, win);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations while rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and checking that communication still progresses). Fetch_op is of particular interest for my use case, but I am including the other operations as well (see the sketch below); the measured latencies follow.
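
For reference, the rest of the benchmark looks roughly like this; it is a simplified sketch (the MPI_Wtime timing and the 32-bit variants are omitted, and the names are only illustrative):

```
  /* Rank 0 exposes a single 64-bit counter; rank 1 operates on it. */
  uint64_t *baseptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);
  if (rank == 0) *baseptr = 0;
  MPI_Barrier(MPI_COMM_WORLD);   /* make sure the counter is initialized */

  if (rank == 1) {
    /* Exclusive-lock variant; the shared-lock runs use
       MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win) instead. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);

    uint64_t val = 1, res, cmp = 0;

    /* fetch_op: timed loop as in the snippet above. The other
       operations are measured the same way, one flush per call: */
    MPI_Compare_and_swap(&val, &cmp, &res, MPI_UINT64_T, 0, 0, win);
    MPI_Win_flush(0, win);

    MPI_Accumulate(&val, 1, MPI_UINT64_T, 0, 0, 1, MPI_UINT64_T,
                   MPI_SUM, win);
    MPI_Win_flush(0, win);

    MPI_Get_accumulate(&val, 1, MPI_UINT64_T, &res, 1, MPI_UINT64_T,
                       0, 0, 1, MPI_UINT64_T, MPI_SUM, win);
    MPI_Win_flush(0, win);

    MPI_Win_unlock(0, win);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Win_free(&win);
```

The lock is taken once outside the timed loop, so only the operation plus the flush should show up in the numbers.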

* Linux Cluster, IB QDR *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) Operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit operands.
b) Using an exclusive lock leads to lower latencies than using a shared lock.

Overall, there is a factor of almost 3 between shared lock + uint32_t and exclusive lock + uint64_t for fetch_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between the exclusive and the shared lock is about the same as with IB, and the latencies for 32-bit and 64-bit operands are roughly the same (again with compare_exchange under a shared lock as the outlier, it seems).

So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
