Hi OMPI_Users and OMPI_Developers,

I would like someone to verify whether my understanding is correct concerning
Open MPI's ability to overlap communication and computation over InfiniBand
when using the non-blocking MPI_Isend() and MPI_Irecv() functions (i.e. the
computation is done between the non-blocking MPI_Isend() on the sender, or
MPI_Irecv() on the receiver, and the corresponding MPI_Wait()).
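
To make sure we are talking about the same pattern, here is a minimal sketch
of what I mean (the message size, tag and the compute_something() placeholder
are made up for the example):

   #include <mpi.h>
   #include <stdlib.h>

   #define N (1 << 22)   /* large message, so a rendezvous protocol is used */

   static void compute_something(void) { /* local work we hope to overlap */ }

   int main(int argc, char **argv)
   {
       MPI_Init(&argc, &argv);

       int rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       double *buf = malloc(N * sizeof(double));
       MPI_Request req;

       if (rank == 0)
           MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
       else if (rank == 1)
           MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

       compute_something();   /* computation between the non-blocking call... */

       if (rank < 2)
           MPI_Wait(&req, MPI_STATUS_IGNORE);   /* ...and the matching wait */

       free(buf);
       MPI_Finalize();
       return 0;
   }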

After reading the following FAQ entries:

   https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
   https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3

and the paper:

   https://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/

about the algorithm used over OpenFabrics to send large messages, my
understanding is that:

1-      When the "RDMA Direct" message protocol is used, the communication is
done by an RDMA read on the receiver side. So if the receiver calls MPI_Irecv()
after it has received a matching message envelope (tag, communicator) from the
sender, the receiver can start the RDMA read, let the InfiniBand HCA operate,
and return from MPI_Irecv() so the receiving process can compute. Then the next
time the MPI library is called on the receiver side (or maybe in the
corresponding MPI_Wait() call), the receiver sends a short ACK message to tell
the sender that the receive is complete and that it is now free to do whatever
it wants with the send buffer. When things happen this way (i.e. the sender's
envelope is received before MPI_Irecv() is called on the receiver side), the
overlap potential is great on both the receiver and the sender side (because
the sender's MPI_Isend() only has to send the envelope eagerly and its
MPI_Wait() only waits for the ACK).

However, when the receiver calls MPI_Irecv() before the sender's envelope is
received, the RDMA read cannot start until the envelope arrives and the MPI
library realizes it can start the transfer. If the receiver only realizes this
in the corresponding MPI_Wait(), there will be no overlap on the receiver side.
The overlap potential is still good on the sender side, for the same reason as
in the previous case.

2-      When the "RDMA Pipeline" protocol is used, both the sender and the
receiver side have to actively cooperate to transfer the data using multiple
InfiniBand send/receive and RDMA write operations. On the receiver side, as the
paper says, the "protocol effectively overlaps the cost of
registration/deregistration with RDMA writes". This lets the receiver overlap
communication with registration overhead, but not with computation. On the
sender side I don't see how overlap with computation could be possible either.
In practice, when this protocol is used between a pair of MPI_Isend() and
MPI_Irecv() calls, I fear that all the communication will happen once the
sender and receiver reach their corresponding MPI_Wait() calls (which means no
overlap), unless the application calls into the MPI library during the
computation; a small sketch of what I mean is shown right after this list.
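
Here is the receiver-side variant I alluded to above, where the computation is
broken into slices and MPI_Test() is called between them to give the library a
chance to progress the transfer before MPI_Wait() is reached. This is only a
sketch of my assumption; nsteps, do_some_work() and req are placeholders:

   /* Receiver side, after MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0,
    * MPI_COMM_WORLD, &req) has been posted. */
   int done = 0;
   for (int i = 0; i < nsteps; i++) {
       do_some_work(i);                              /* one slice of local work */
       if (!done)
           MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* let the library progress */
   }
   if (!done)
       MPI_Wait(&req, MPI_STATUS_IGNORE);            /* finish the transfer */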

So if someone could tell me whether this is correct or not, I would greatly
appreciate it.

I guess that the two protocols above correspond to the basic BTL/openib
framework/component.

When a more modern MTL/mxm or PML/yalla framework/component is used, I hope
things are different and result in more communication/computation overlap
potential.
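
For what it is worth, this is how I was planning to compare the different
paths, assuming my Open MPI build actually contains these components (the
executable name is only an example):

   # openib BTL through the ob1 PML ("RDMA Direct" / "RDMA Pipeline" case)
   mpirun --mca pml ob1 --mca btl self,openib -np 2 ./overlap_test

   # MXM MTL through the cm PML
   mpirun --mca pml cm --mca mtl mxm -np 2 ./overlap_test

   # yalla PML
   mpirun --mca pml yalla -np 2 ./overlap_test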

Thanks in advance,

Martin Audet
