> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users 
> <users@lists.open-mpi.org> wrote:
> 
> Nathan,
> 
> Over the last couple of weeks I made some more interesting observations 
> regarding the latencies of accumulate operations on both Aries and InfiniBand 
> systems:
> 
> 1) There seems to be a significant difference between 64-bit and 32-bit 
> operations: on Aries, the average latency of a compare-exchange on 64-bit 
> values is about 1.8us, while on 32-bit values it is 3.9us, a factor of 
> more than 2x. On the IB cluster, fetch-and-op, compare-exchange, and 
> accumulate all show a similar difference between 32-bit and 64-bit. There is 
> no such difference between 32-bit and 64-bit puts and gets on either system.


1) On Aries, 32-bit and 64-bit CAS operations should have similar performance. 
This looks like a bug; I will try to track it down now.

2) On InfiniBand with verbs we only have access to 64-bit atomic memory 
operations (a limitation of the now-dead btl/openib component). I think UCX may 
support 32-bit AMOs, but that support is not implemented in Open MPI (at least 
not in btl/uct). I can take a look at btl/uct and see what I find.

> 2) On both systems, the latency of a single-value atomic load using 
> MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
> 64-bit values, roughly matching the latency of the 32-bit compare-exchange 
> operations.

This is expected given the current implementation. When the operation is 
MPI_NO_OP it falls back to lock + get. I suppose I could change it to use 
MPI_SUM with an operand of 0. Will investigate.


-Nathan
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users