Re: [OMPI users] Latencies of atomic operations on high-performance networks

Joseph Schuchart Thu, 08 Nov 2018 10:10:32 -0800

While using the MCA parameter in a real application I noticed a strange effect, which took me a while to figure out: it appears that on the Aries network the accumulate operations are not atomic anymore. I am attaching a test program that shows the problem: all but one of the processes continuously increment a counter while rank 0 continuously subtracts a large value and adds it again, eventually checking for the correct number of increments. Without the MCA parameter the test at the end succeeds, as all increments are accounted for:


```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```


When setting the mca parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 ./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM, so I assume that the test in combination with the MCA flag is correct. I cannot reproduce this issue on our IB cluster.

Is that an issue in Open MPI or is there some problem in the test case that I am missing?


Thanks in advance,
Joseph


On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply. Setting osc_rdma_acc_single_intrinsic=true does the trick for both shared and exclusive locks and brings it down to <2us per operation. I hope that the info key will make it into the next version of the standard, I certainly have use for it :)


Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:

All of this is completely expected. Due to the requirements of the standard it is difficult to make use of network atomics even for MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want MPI_Fetch_and_op to be fast, set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an extra lock step as part of the accumulate (it isn't needed if there is an exclusive lock). When setting the above parameter you are telling the implementation that you will only be using a single count and we can optimize that with the hardware. The RMA working group is working on an info key that will essentially do the same thing.

Note the above parameter won't help you with IB if you are using UCX unless you set this (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx

Though there may be a way to get osc/ucx to enable the same sort of optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

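```
if (rank == 1) {
  const int target = 0;    // rank 0 owns the memory being updated
  uint32_t res, val = 1;   // operand width matches MPI_UINT32_T
  for (size_t i = 0; i < NUM_REPS; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);  // wait for completion at the target
  }
}
MPI_Barrier(MPI_COMM_WORLD);
```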

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 100000 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as
with IB and the latencies for 32bit vs 64bit are roughly the same
(except for compare_exchange, it seems).

So my question is: is this to be expected? Is the higher latency when
using a shared lock caused by an internal lock being acquired because
the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users



#include <mpi.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>

#define NUM_ITER 1000

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  void *baseptr;
  MPI_Win win;
  int comm_size;
  int comm_rank;
  const int64_t one  = 1;
  const int64_t mone = -one;

  MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

  // a single value that is atomically updated by all processes
  int win_size = sizeof(int64_t);

  MPI_Info win_info;
  MPI_Info_create(&win_info);
  MPI_Info_set(win_info, "accumulate_ordering", "none");
  MPI_Info_set(win_info, "same_size"          , "true");
  MPI_Info_set(win_info, "same_disp_unit"     , "true");
  MPI_Info_set(win_info, "accumulate_ops"     , "same_op_no_op");

  MPI_Win_allocate(
      win_size,
      1,
      win_info,
      MPI_COMM_WORLD,
      &baseptr,
      &win);
  MPI_Info_free(&win_info);

  MPI_Win_lock_all(0, win);
  memset(baseptr, 0, win_size);
  MPI_Barrier(MPI_COMM_WORLD);

  if (comm_rank > 0) {
    for (int i = 0; i < NUM_ITER; ++i) {
      int64_t result;
      // increment by one
      MPI_Fetch_and_op(&one, &result, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
    }

    // signal completion
    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  } else {
    int flag;
    int64_t sum = 0;
    const int64_t neg_value = -((int64_t)UINT32_MAX);
    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);
    do {
      int64_t value;
      int64_t update = neg_value;

      // fetch value and set to large negative value
      MPI_Fetch_and_op(&update, &value, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);
      //printf("value: %ld\n", value);
      // the value should be positive as we have reset it in the previous iteration
      // Note: this assert triggers on Cray XC40
      //assert(value >= 0);
    
      // reset
      update = -neg_value;
      MPI_Fetch_and_op(&update, &value, MPI_INT64_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(0, win);

      // check for barrier to complete
      MPI_Test(&req, &flag, MPI_STATUS_IGNORE);

    } while (flag == 0);

    // read the final value
    MPI_Fetch_and_op(NULL, &sum, MPI_INT64_T, 0, 0, MPI_NO_OP, win);
    MPI_Win_flush(0, win);

    printf("result:%lld\n", (long long)sum);
    assert(sum == NUM_ITER*(comm_size-1));
  }

  MPI_Win_unlock_all(win);

  MPI_Win_free(&win);

  MPI_Finalize();

  return 0;
}
