[OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart

All,

I am currently experimenting with MPI atomic operations and wanted to 
share some interesting results I am observing. The numbers below are 
measurements from both an IB-based cluster and our Cray XC40. The 
benchmarks look like the following snippet:


```
  if (rank == 1) {
    uint64_t res, val = 1;              /* origin and result buffers */
    const int target = 0;               /* all operations target rank 0 */
    for (size_t i = 0; i < NUM_REPS; ++i) {
      MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, target, 0, MPI_SUM, win);
      MPI_Win_flush(target, win);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);
```
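
(To be explicit about what "exclusive lock" and "shared lock" mean in the 
results below: each variant is timed roughly like the following sketch. It is 
not the exact benchmark code; window setup is omitted and the datatype and 
operation are swapped for the other measurements.)

```c
#include <mpi.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch of one timed variant: lock_type is MPI_LOCK_EXCLUSIVE or
 * MPI_LOCK_SHARED. Returns the average latency per operation. */
static double time_fetch_op(MPI_Win win, int target, size_t num_reps,
                            int lock_type)
{
  uint64_t val = 1, res = 0;
  MPI_Win_lock(lock_type, target, 0, win);
  double start = MPI_Wtime();
  for (size_t i = 0; i < num_reps; ++i) {
    MPI_Fetch_and_op(&val, &res, MPI_UINT64_T, target, 0, MPI_SUM, win);
    MPI_Win_flush(target, win);
  }
  double elapsed = MPI_Wtime() - start;
  MPI_Win_unlock(target, win);
  return elapsed / num_reps;
}
```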

Only rank 1 performs atomic operations while rank 0 waits in a barrier (I 
have tried to confirm that the operations are done in hardware by 
letting rank 0 sleep for a while and ensuring that communication still 
progresses). Of particular interest for my use case is fetch_op, but I am 
including other operations here nevertheless:


* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64-bit operands generally seem to have lower latencies 
than operations on 32-bit operands
b) using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and 
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate 
(compare_exchange seems to be somewhat of an outlier).


* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared locks is about the same as 
with IB, and the latencies for 32-bit vs. 64-bit operands are roughly the 
same (except for compare_exchange, it seems).


So my question is: is this to be expected? Is the higher latency when 
using a shared lock caused by an internal lock being acquired because 
the hardware operations are not actually atomic?


I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Nathan Hjelm via users



All of this is completely expected. Due to the requirements of the standard, it 
is difficult to make use of network atomics even for MPI_Compare_and_swap 
(MPI_Accumulate and MPI_Get_accumulate spoil the party). If you want 
MPI_Fetch_and_op to be fast, set this MCA parameter:


osc_rdma_acc_single_intrinsic=true
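
(For completeness, a sketch of how that parameter can be passed, assuming an 
mpirun launch; the binary name is just a placeholder:)

```
# on the mpirun command line:
mpirun --mca osc_rdma_acc_single_intrinsic true ./your_benchmark
# or exported in the environment before launching:
export OMPI_MCA_osc_rdma_acc_single_intrinsic=true
```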


A shared lock is slower than an exclusive lock because there is an extra lock 
step as part of the accumulate (it isn't needed when there is an exclusive 
lock). By setting the above parameter you are telling the implementation that 
you will only be using a single count, so we can optimize that with the 
hardware. The RMA working group is working on an info key that will 
essentially do the same thing.
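
(Purely as an illustration of how such a hint would be attached to a window; 
the key name below is made up, since the actual key is still under discussion:)

```c
/* Hypothetical sketch: "acc_single_intrinsic" is an invented key name,
 * used only to show the standard MPI_Info mechanism for window hints. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "acc_single_intrinsic", "true");

uint64_t *baseptr;
MPI_Win win;
MPI_Win_allocate(sizeof(uint64_t), sizeof(uint64_t), info,
                 MPI_COMM_WORLD, &baseptr, &win);
MPI_Info_free(&info);
```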


Note that the above parameter won't help you with IB if you are using UCX 
unless you also set these (master only right now):

btl_uct_transports=dc_mlx5
btl=self,vader,uct
osc=^ucx
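
(For reference, a sketch of how all of the above could be combined on a single 
mpirun command line; the binary name is a placeholder:)

```
mpirun --mca osc_rdma_acc_single_intrinsic true \
       --mca btl self,vader,uct --mca btl_uct_transports dc_mlx5 \
       --mca osc ^ucx ./your_benchmark
```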




There may be a way to get osc/ucx to enable the same sort of optimization, 
though; I don't know.



-Nathan




Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart

Thanks a lot for the quick reply. Setting 
osc_rdma_acc_single_intrinsic=true does the trick for both shared and 
exclusive locks and brings the latency down to <2us per operation. I hope 
that the info key will make it into the next version of the standard; I 
certainly have a use for it :)


Cheers,
Joseph


[OMPI users] Mapping and Ranking in 3.1.3

2018-11-06 Thread Ben Menadue
Hi,

Consider a hybrid MPI + OpenMP code on a system with 2 x 8-core processors per 
node, running with OMP_NUM_THREADS=4. A common placement policy we see is to 
have rank 0 on the first 4 cores of the first socket, rank 1 on the second 4 
cores, rank 2 on the first 4 cores of the second socket, and so on. In 3.1.2 
this is easily accomplished with

$ mpirun --map-by ppr:2:socket:PE=4 --report-bindings
[raijin1:07173] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 
0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]: 
[BB/BB/BB/BB/../../../..][../../../../../../../..]
[raijin1:07173] MCW rank 1 bound to socket 0[core 4[hwt 0-1]], socket 
0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: 
[../../../../BB/BB/BB/BB][../../../../../../../..]
[raijin1:07173] MCW rank 2 bound to socket 1[core 8[hwt 0-1]], socket 
1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]]: 
[../../../../../../../..][BB/BB/BB/BB/../../../..]
[raijin1:07173] MCW rank 3 bound to socket 1[core 12[hwt 0-1]], socket 
1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: 
[../../../../../../../..][../../../../BB/BB/BB/BB]


although looking at the man page now it seems like this is an invalid construct 
(even though it worked).

However, it looks like this (mis)use no longer works in Open MPI 3.1.3:


--
An invalid value was given for the number of processes
per resource (ppr) to be mapped on each node:

  PPR:  2:socket:PE=4

The specification must be a comma-separated list containing
combinations of number, followed by a colon, followed
by the resource type. For example, a value of "1:socket" indicates that
one process is to be mapped onto each socket. Values are supported
for hwthread, core, L1-3 caches, socket, numa, and node. Note that
enough characters must be provided to clearly specify the desired
resource (e.g., "nu" for "numa").

--

We’ve come up with an equivalent but it needs both --map-by and --rank-by:

$ mpirun --map-by node:PE=4 --rank-by core

(Without the --rank-by it round-robins the ranks across nodes first, as 
expected, instead of placing consecutive ranks on each node.) Is this the 
correct approach for getting this distribution?
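
(For reference, the full hybrid launch with this workaround would look roughly 
like the following sketch; the rank count and binary name are placeholders for 
a 2-node, 2 x 8-core case:)

```
$ OMP_NUM_THREADS=4 mpirun -np 8 -x OMP_NUM_THREADS \
      --map-by node:PE=4 --rank-by core --report-bindings ./hybrid_app
```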

As an aside, I’m not sure if this is the expected behaviour, but using --map-by 
socket:PE=4 fails because it tries to put rank 4 on the first socket of the 
first node even though there are no free cores left there (because of the 
PE=4), instead of moving to the next node. But we’d still need to use the 
--rank-by option in this case, anyway.

Cheers,
Ben
