Hi OMPI_Users and OMPI_Developers, I would like someone to verify whether my understanding is correct concerning Open MPI's ability to overlap communication and computation on InfiniBand when using the non-blocking MPI_Isend() and MPI_Irecv() functions (i.e. the computation is done between the non-blocking MPI_Isend() on the sender, or MPI_Irecv() on the receiver, and the corresponding MPI_Wait()).
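To make sure we are talking about the same pattern, here is a minimal sketch of what I have in mind (the message size N and the compute() function are just placeholders, not a real benchmark):

#include <stdlib.h>
#include <mpi.h>

#define N (1 << 22)                 /* "large" message, well above the eager limit */

static void compute(void)           /* placeholder for the application's real work */
{
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += 1.0;
}

int main(int argc, char **argv)
{
    int rank;
    double *buf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = calloc(N, sizeof(double));

    if (rank == 0) {
        /* post the send, then compute while the transfer is (hopefully) in flight */
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        compute();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        /* post the receive, then compute while the transfer is (hopefully) in flight */
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
        compute();
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

The question is how much of compute() can actually run while the 32 MiB transfer is in progress on each side.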
After reading the following FAQ entries:

https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3

and the paper:

https://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/

about the algorithms used on OpenFabrics to send large messages, my understanding is the following:

1- When the "RDMA Direct" protocol is used, the transfer is done by an RDMA read on the receiver side. So if the receiver calls MPI_Irecv() after it has received a matching message envelope (tag, communicator) from the sender, it can start the RDMA read, let the InfiniBand HCA operate, and return from MPI_Irecv() so that the receiving process can compute. Then, the next time the MPI library is called on the receiver side (or maybe only in the corresponding MPI_Wait() call; see also the MPI_Test() polling fragment in the P.S. below), the receiver sends a short ACK message to tell the sender that the receive is complete and that it is now free to do whatever it wants with its send buffer. When things happen this way (i.e. the sender's envelope arrives before MPI_Irecv() is called on the receiver side), the overlap potential is great on both sides (for the sender, MPI_Isend() only has to send the envelope eagerly and its MPI_Wait() waits for the ACK). However, when the receiver calls MPI_Irecv() before the sender's envelope has arrived, the RDMA read cannot start until the envelope is received and the MPI library realizes it can start the read. If the receiver only realizes this in the corresponding MPI_Wait(), there will be no overlap on the receiver side. The overlap potential on the sender side remains good, for the same reason as in the previous case.

2- When the "RDMA Pipeline" protocol is used, both the sender and the receiver have to cooperate actively to transfer the data using multiple InfiniBand send/receive operations and RDMA writes. On the receiver side, as the paper says, the "protocol effectively overlaps the cost of registration/deregistration with RDMA writes". This overlaps communication with the registration overhead on the receiver side, but not with computation. On the sender side, I don't see how overlap with computation could be possible either. In practice, when this protocol is used between a pair of MPI_Isend() and MPI_Irecv(), I fear that all of the communication happens only when the sender and the receiver reach their corresponding MPI_Wait() calls (which means no overlap).

If someone could tell me whether this is correct or not, I would appreciate it greatly.

I guess that the two protocols above correspond to the basic BTL/openib framework/component. When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential.

Thanks in advance,

Martin Audet
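P.S. Regarding "the next time the MPI library is called" in point 1: the only way I see to give the library a chance to notice the envelope and start the RDMA read during the computation is to poll it explicitly from the application, e.g. a receive-side fragment like the one below (NCHUNKS and compute_one_chunk() are placeholders, buf, N and req are as in the sketch above, and I don't know whether such polling is actually needed, or sufficient, with the openib BTL):

/* receive side only: slice the work and poke the progress engine between slices */
MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
for (int chunk = 0; chunk < NCHUNKS; chunk++) {
    int done;
    compute_one_chunk(chunk);                  /* placeholder: one slice of the work */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* lets the MPI progress engine run */
}
MPI_Wait(&req, MPI_STATUS_IGNORE);             /* returns immediately if already complete */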