I would recommend reading the following tech report; it should shed
some light on how these things work:
http://www.cs.unm.edu/research/search_technical_reports_by_keyword/?string=infiniband
1 - It does not seem that mvapich does RDMA for small messages. It will
do RDMA for any message that is too big to send eagerly, but the
threshold is not that low and cannot be lowered to apply to 0-byte
messages anyway (nothing lower than 128 bytes or so will work).
mvapich does do RDMA for small messages: they preallocate a buffer for
each peer and then poll each of these buffers for completion. Take a
look at the paper "High Performance RDMA-Based MPI Implementation over
InfiniBand" by Jiuxing Liu et al. Also try compiling mvapich without
-D RDMA_FAST_PATH; I am pretty sure this is the flag that tells mvapich
to compile with small-message RDMA. Removing this flag will force
mvapich to use send/receive.
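To make the fast-path scheme above concrete, here is a minimal sketch in plain C. It is a simulation in ordinary memory, not real verbs code: the names (`rdma_slot`, `fastpath_send`, `fastpath_poll`) and the 128-byte slot size are illustrative assumptions, and where the sketch does a memcpy a real implementation would issue an RDMA write into the peer's preallocated, registered buffer. The key idea is that the receiver polls a flag byte in the per-peer slot instead of polling a completion queue.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define SLOT_PAYLOAD 128   /* eager-size slot; assumption for illustration */

typedef struct {
    uint8_t payload[SLOT_PAYLOAD];
    size_t  len;
    volatile uint8_t ready;  /* flag written last; the receiver polls it */
} rdma_slot;

/* "Initiator" side: in real code this memcpy would be an RDMA write
 * (e.g. ibv_post_send with IBV_WR_RDMA_WRITE) into the peer's slot. */
static void fastpath_send(rdma_slot *peer_slot, const void *msg, size_t len)
{
    memcpy(peer_slot->payload, msg, len);
    peer_slot->len = len;
    peer_slot->ready = 1;    /* set only after the payload has landed */
}

/* "Target" side: poll the per-peer slot rather than a completion queue.
 * Returns 1 and copies the message out if one has arrived, else 0. */
static int fastpath_poll(rdma_slot *slot, void *out, size_t *len)
{
    if (!slot->ready)
        return 0;            /* nothing arrived yet */
    memcpy(out, slot->payload, slot->len);
    *len = slot->len;
    slot->ready = 0;         /* recycle the slot */
    return 1;
}
```

Polling one slot per peer is why the scheme is cheap at small scale but does not scale: every peer costs a preallocated buffer and a poll per progress loop.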
2 - I do not see any raw performance benefit in insisting on RDMA for
small messages anyway, so it does not seem to be a tradeoff between
scalability and optimal latency. In fact, if I force ompi or mvapich to
use RDMA for smaller messages (at least as far as it seems it will go),
the latency for these sizes actually goes up, which does not hurt my
intuition. In mvapich I saw an incompressible 13 us penalty for doing
RDMA.
What you are seeing is a general RDMA protocol, which requires that the
initiator obtain the target's memory address and r-key prior to the
RDMA operation; additionally, the initiator must inform the target when
the RDMA operation completes. This incurs the overhead of control
messages sent using either send/receive or small-message RDMA.
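The control-message exchange just described can be sketched as follows. This is a hypothetical simulation, not mvapich's actual protocol code: the message names (RTS/CTS/FIN) and the struct layout are illustrative assumptions, the memcpy stands in for the actual RDMA write, and the dummy rkey value is made up. The point is simply that every rendezvous transfer pays for three control messages on top of the data movement, which is where the fixed latency penalty for small messages comes from.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

typedef enum { MSG_RTS, MSG_CTS, MSG_FIN } ctrl_type;

typedef struct {
    ctrl_type type;
    uint64_t  remote_addr;  /* target buffer address (carried by CTS) */
    uint32_t  rkey;         /* remote key for the RDMA op (carried by CTS) */
} ctrl_msg;

/* Perform one simulated rendezvous transfer and return the number of
 * control messages it cost; each one travels over send/receive (or
 * small-message RDMA) and adds round-trip latency. */
static int rendezvous_transfer(uint8_t *dst, const uint8_t *src, size_t len)
{
    int ctrl_msgs = 0;

    /* initiator -> target: request to send */
    ctrl_msg rts = { MSG_RTS, 0, 0 };
    (void)rts;
    ctrl_msgs++;

    /* target -> initiator: here is my address and r-key (rkey is a dummy) */
    ctrl_msg cts = { MSG_CTS, (uint64_t)(uintptr_t)dst, 0x1234u };
    ctrl_msgs++;

    /* the RDMA write itself; real code would post an RDMA-write work request */
    memcpy((void *)(uintptr_t)cts.remote_addr, src, len);

    /* initiator -> target: the RDMA operation has completed */
    ctrl_msg fin = { MSG_FIN, 0, 0 };
    (void)fin;
    ctrl_msgs++;

    return ctrl_msgs;
}
```

For a large message this handshake is amortized away; for a tiny one it dominates, which is consistent with the fixed penalty you measured when forcing RDMA at small sizes.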
So far, the best latency I got from ompi is 5.24 us, and the best I
got from mvapich is 3.15 us.
I am perfectly ready to accept that ompi scales better and that this
may be more important (except to the marketing dept :-) ), but I do not
understand your explanation based on small-message RDMA. Either I
misunderstood something badly (my best guess), or the 2 us are lost to
something other than an RDMA-size tradeoff.
Again, this is small-message RDMA with polling versus send/receive
semantics. We will be adding small-message RDMA and should then have
performance equal to that of mvapich for small messages, but it is only
relevant for a small working set of peers / micro-benchmarks.
Thanks,
Galen