I would recommend reading the following tech report; it should shed
some light on how these things work:
http://www.cs.unm.edu/research/search_technical_reports_by_keyword/?string=infiniband
1 - It does not seem that mvapich does RDMA for small messages. It will
do RDMA for any message that is too big to send eagerly, but the
threshold is not that low and cannot be lowered to apply to 0-byte
messages anyway (nothing lower than 128 bytes or so will work).
mvapich does do RDMA for small messages: they preallocate a buffer for
each peer and then poll each of these buffers for completion. Take a
look at the paper "High Performance RDMA-Based MPI Implementation over
InfiniBand" by Jiuxing Liu et al. Also try compiling mvapich without
-D RDMA_FAST_PATH; I am pretty sure this is the flag that tells mvapich
to compile with small-message RDMA. Removing this flag will force
mvapich to use send/receive.
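To make the fast-path scheme above concrete, here is a minimal sketch in plain C. It is a simulation in ordinary memory, not real verbs code: the names (`rdma_slot`, `fastpath_send`, `fastpath_poll`) and the 128-byte slot size are illustrative assumptions, and where the sketch does a memcpy a real implementation would issue an RDMA write into the peer's preallocated, registered buffer. The key idea is that the receiver polls a flag byte in the per-peer slot instead of polling a completion queue.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define SLOT_PAYLOAD 128   /* eager-size slot; assumption for illustration */

typedef struct {
    uint8_t payload[SLOT_PAYLOAD];
    size_t  len;
    volatile uint8_t ready;  /* flag written last; the receiver polls it */
} rdma_slot;

/* "Initiator" side: in real code this memcpy would be an RDMA write
 * (e.g. ibv_post_send with IBV_WR_RDMA_WRITE) into the peer's slot. */
static void fastpath_send(rdma_slot *peer_slot, const void *msg, size_t len)
{
    memcpy(peer_slot->payload, msg, len);
    peer_slot->len = len;
    peer_slot->ready = 1;    /* set only after the payload has landed */
}

/* "Target" side: poll the per-peer slot rather than a completion queue.
 * Returns 1 and copies the message out if one has arrived, else 0. */
static int fastpath_poll(rdma_slot *slot, void *out, size_t *len)
{
    if (!slot->ready)
        return 0;            /* nothing arrived yet */
    memcpy(out, slot->payload, slot->len);
    *len = slot->len;
    slot->ready = 0;         /* recycle the slot */
    return 1;
}
```

Polling one slot per peer is why the scheme is cheap at small scale but does not scale: every peer costs a preallocated buffer and a poll per progress loop.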
2 - I do not see any raw performance benefit in insisting on RDMA for
small messages anyway, so it does not seem to be a tradeoff between
scalability and optimal latency. In fact, if I force ompi or mvapich to
use RDMA for smaller messages (at least as far as it seems it will go),
the latency for these sizes actually goes up, which does not hurt my
intuition. In mvapich I saw an incompressible 13 us penalty for doing
RDMA.
What you are seeing is a general RDMA protocol, which requires that the
initiator obtain the target's memory address and r-key prior to the
RDMA operation; additionally, the initiator must inform the target when
the RDMA operation completes. This incurs the overhead of control
messages sent using either send/receive or small-message RDMA.
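The control-message exchange just described can be sketched as follows. This is a hypothetical simulation, not mvapich's actual protocol code: the message names (RTS/CTS/FIN) and the struct layout are illustrative assumptions, the memcpy stands in for the actual RDMA write, and the dummy rkey value is made up. The point is simply that every rendezvous transfer pays for three control messages on top of the data movement, which is where the fixed latency penalty for small messages comes from.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

typedef enum { MSG_RTS, MSG_CTS, MSG_FIN } ctrl_type;

typedef struct {
    ctrl_type type;
    uint64_t  remote_addr;  /* target buffer address (carried by CTS) */
    uint32_t  rkey;         /* remote key for the RDMA op (carried by CTS) */
} ctrl_msg;

/* Perform one simulated rendezvous transfer and return the number of
 * control messages it cost; each one travels over send/receive (or
 * small-message RDMA) and adds round-trip latency. */
static int rendezvous_transfer(uint8_t *dst, const uint8_t *src, size_t len)
{
    int ctrl_msgs = 0;

    /* initiator -> target: request to send */
    ctrl_msg rts = { MSG_RTS, 0, 0 };
    (void)rts;
    ctrl_msgs++;

    /* target -> initiator: here is my address and r-key (rkey is a dummy) */
    ctrl_msg cts = { MSG_CTS, (uint64_t)(uintptr_t)dst, 0x1234u };
    ctrl_msgs++;

    /* the RDMA write itself; real code would post an RDMA-write work request */
    memcpy((void *)(uintptr_t)cts.remote_addr, src, len);

    /* initiator -> target: the RDMA operation has completed */
    ctrl_msg fin = { MSG_FIN, 0, 0 };
    (void)fin;
    ctrl_msgs++;

    return ctrl_msgs;
}
```

For a large message this handshake is amortized away; for a tiny one it dominates, which is consistent with the fixed penalty you measured when forcing RDMA at small sizes.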
So far, the best latency I got from ompi is 5.24 us, and the best I
got from mvapich is 3.15 us.
I am perfectly ready to accept that ompi scales better and that this
may be more important (except to the marketing dept :-) ), but I do not
understand your explanation based on small-message RDMA. Either I
misunderstood something badly (my best guess), or the 2 us are lost to
something other than an RDMA-size tradeoff.
Again, this is small-message RDMA with polling versus send/receive
semantics. We will be adding small-message RDMA and should then have
performance equal to that of mvapich for small messages, but it is only
relevant for a small working set of peers / micro-benchmarks.
Thanks,
Galen