> On May 9, 2019, at 12:37 AM, Joseph Schuchart via users
> <users@lists.open-mpi.org> wrote:
>
> Nathan,
>
> Over the last couple of weeks I made some more interesting observations
> regarding the latencies of accumulate operations on both Aries and InfiniBand
> systems:
>
> 1) There seems to be a significant difference between 64-bit and 32-bit
> operations: on Aries, the average latency for compare-exchange on 64-bit
> values is about 1.8us while on 32-bit values it is 3.9us, a factor of
> more than 2x. On the IB cluster, fetch-and-op, compare-exchange, and
> accumulate all show a similar difference between 32-bit and 64-bit. There
> are no differences between 32-bit and 64-bit puts and gets on these systems.
1) On Aries, 32-bit and 64-bit CAS operations should have similar performance.
This looks like a bug; I will try to track it down now.
2) On InfiniBand when using verbs we only have access to 64-bit atomic memory
operations (a limitation of the now-dead btl/openib component). I think UCX may
support 32-bit AMOs, but that support is not implemented in Open MPI (at least
not in btl/uct). I can take a look at btl/uct and see what I find.
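For anyone who wants to reproduce the 32- vs 64-bit CAS comparison, a minimal
benchmark along these lines should show the gap. This is an illustrative
sketch, not the benchmark Joseph used: the iteration count and the choice of
rank 1 as the target are arbitrary, and it assumes at least 2 ranks (e.g.
`mpirun -n 2`).

```c
/* Sketch: compare average latency of 64-bit vs 32-bit
 * MPI_Compare_and_swap against a window on rank 1.
 * Build with mpicc; run with at least 2 ranks. */
#include <mpi.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int64_t *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int64_t), sizeof(int64_t), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = 0;
    MPI_Win_lock_all(0, win);
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        const int iters = 1000;   /* illustrative iteration count */
        int64_t cmp64 = 0, new64 = 0, res64;
        int32_t cmp32 = 0, new32 = 0, res32;

        /* 64-bit compare-exchange, flushed each iteration so we
         * measure per-operation completion latency */
        double t = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Compare_and_swap(&new64, &cmp64, &res64, MPI_INT64_T,
                                 1, 0, win);
            MPI_Win_flush(1, win);
        }
        printf("64-bit CAS: %.2f us\n", (MPI_Wtime() - t) / iters * 1e6);

        /* same loop with a 32-bit datatype */
        t = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Compare_and_swap(&new32, &cmp32, &res32, MPI_INT32_T,
                                 1, 0, win);
            MPI_Win_flush(1, win);
        }
        printf("32-bit CAS: %.2f us\n", (MPI_Wtime() - t) / iters * 1e6);
    }

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```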
> 2) On both systems, the latency for a single-value atomic load using
> MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on
> 64-bit values, roughly matching the latency of 32-bit compare-exchange
> operations.
This is expected given the current implementation: with MPI_NO_OP it falls
back to lock + get. I suppose I could change it to use MPI_SUM with an
operand of 0. Will investigate.
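In application code, the MPI_SUM-with-zero workaround described above can be
applied directly; both calls below perform the same atomic 64-bit read from
the target, but the MPI_SUM form goes through the fetch-and-add path rather
than the lock + get fallback. A sketch, assuming an existing window `win`
exposed by rank 1 within a passive-target epoch:

```c
/* Two equivalent atomic 64-bit loads from displacement 0 on rank 1. */
int64_t result;
int64_t dummy = 0;   /* origin buffer is ignored for MPI_NO_OP */
int64_t zero  = 0;   /* adding 0 leaves the target value unchanged */

/* atomic load via MPI_NO_OP (currently lock + get internally) */
MPI_Fetch_and_op(&dummy, &result, MPI_INT64_T, 1, 0, MPI_NO_OP, win);
MPI_Win_flush(1, win);

/* same load expressed as a fetch-and-add of 0 */
MPI_Fetch_and_op(&zero, &result, MPI_INT64_T, 1, 0, MPI_SUM, win);
MPI_Win_flush(1, win);
```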
-Nathan