All,

I am currently experimenting with MPI atomic operations and wanted to share some interesting results I am seeing. The numbers below are measurements from both an InfiniBand-based cluster and our Cray XC40. The benchmark loop looks like the following snippet:

```
  if (rank == 1) {
    const int target = 0;      /* rank 0 owns the window memory */
    uint32_t res, val = 1;     /* origin/result buffers; uint64_t for the 64-bit runs */
    for (size_t i = 0; i < NUM_REPS; ++i) {
      MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, target, 0, MPI_SUM, win);
      MPI_Win_flush(target, win);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations while rank 0 waits in a barrier (I have tried to confirm that the operations are done in hardware by letting rank 0 sleep for a while and checking that communication still progresses). Fetch_op is of particular interest for my use case, but I am including the other operations as well (see the sketch below); the measured latencies follow.
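
For reference, the rest of the benchmark looks roughly like this; it is a simplified sketch (the MPI_Wtime timing and the 32-bit variants are omitted, and the names are only illustrative):

```
  /* Rank 0 exposes a single 64-bit counter; rank 1 operates on it. */
  uint64_t *baseptr;
  MPI_Win win;
  MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);
  if (rank == 0) *baseptr = 0;
  MPI_Barrier(MPI_COMM_WORLD);   /* make sure the counter is initialized */

  if (rank == 1) {
    /* Exclusive-lock variant; the shared-lock runs use
       MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win) instead. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);

    uint64_t val = 1, res, cmp = 0;

    /* fetch_op: timed loop as in the snippet above. The other
       operations are measured the same way, one flush per call: */
    MPI_Compare_and_swap(&val, &cmp, &res, MPI_UINT64_T, 0, 0, win);
    MPI_Win_flush(0, win);

    MPI_Accumulate(&val, 1, MPI_UINT64_T, 0, 0, 1, MPI_UINT64_T,
                   MPI_SUM, win);
    MPI_Win_flush(0, win);

    MPI_Get_accumulate(&val, 1, MPI_UINT64_T, &res, 1, MPI_UINT64_T,
                       0, 0, 1, MPI_UINT64_T, MPI_SUM, win);
    MPI_Win_flush(0, win);

    MPI_Win_unlock(0, win);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Win_free(&win);
```

The lock is taken once outside the timed loop, so only the operation plus the flush should show up in the numbers.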

* Linux Cluster, IB QDR *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) Operations on 64-bit operands generally seem to have lower latencies than operations on 32-bit operands.
b) Using an exclusive lock leads to lower latencies than using a shared lock.

Overall, there is a factor of almost 3 between shared lock + uint32_t and exclusive lock + uint64_t for fetch_op, accumulate, and get_accumulate (compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between the exclusive and the shared lock is about the same as with IB, and the latencies for 32-bit and 64-bit operands are roughly the same (again with compare_exchange under a shared lock as the outlier, it seems).

So my question is: is this to be expected? Is the higher latency when using a shared lock caused by an internal lock being acquired because the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
