> > When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential.
> Others will need to comment on that; the cm PML (i.e., all MTLs) and PML/yalla are super-thin shims to get to the underlying communication libraries. So it's up to the progression models of those underlying libraries as to how the communication overlap occurs.

[Josh] MXM fully supports asynchronous progress in both MPI and OSHMEM applications. Network progress happens independently of calls made into the MPI/OSHMEM library.

On Mon, Aug 1, 2016 at 11:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> On Jul 8, 2016, at 4:26 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
> >
> > Hi OMPI_Users and OMPI_Developers,
>
> Sorry for the delay in answering, Martin.
>
> > I would like someone to verify whether my understanding is correct concerning Open MPI's ability to overlap communication and computation on InfiniBand when using the non-blocking MPI_Isend() and MPI_Irecv() functions (i.e., the computation is done between the non-blocking MPI_Isend() on the sender or MPI_Irecv() on the receiver and the corresponding MPI_Wait()).
> >
> > After reading the following FAQ entries:
> >
> > https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
> > https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3
> >
> > and the paper:
> >
> > https://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/
> >
> > about the algorithm used on OpenFabrics to send large messages, my understanding is that:
> >
> > • When the “RDMA Direct” message protocol is used, the communication is done by an RDMA read on the receiver side, so if the receiver calls MPI_Irecv() after it has received a matching message envelope (tag, communicator) from the sender, then the receiver can start the RDMA read, let the InfiniBand HCA operate, and return from MPI_Irecv() to let the receiving process compute. Then the next time the MPI library is called
>
> ...and if the RDMA transaction has completed...
>
> > on the receiver side (or maybe in the corresponding MPI_Wait() call), the receiver sends a short ACK message to the sender to tell the sender that the receive is complete and it is now free to do whatever it wants with the send buffer. When things happen this way (e.g., the sender's envelope is received before MPI_Irecv() is called on the receiver side), it offers great overlap potential on both the receiver and sender sides (because the sender's MPI_Isend() only has to send the envelope eagerly and its MPI_Wait() waits for the ACK).
>
> More or less, yes. Note that Open MPI does send at least a little bit of data with the envelope, just because... well, why waste a message transfer? :-)
>
> > However, when the receiver calls MPI_Irecv() before the sender's envelope is received, the RDMA read transfer cannot start until the envelope is received and the MPI library realizes it can start the RDMA read. If the receiver only realizes this in the corresponding MPI_Wait(), there will be no overlap on the receiver side.
>
> This isn't entirely correct.
>
> Many of Open MPI's MPI API calls will dip into the internal progression engine, which will attempt to progress any outstanding MPI requests. E.g., even if you invoke an unrelated MPI_Isend, if the envelope arrives for your previously-executed MPI_Irecv, Open MPI will progress that MPI_Irecv's request, meaning that it will initiate the protocol to start receiving the full message.
> Meaning: if you are dipping into the MPI library, it'll likely progress all your underlying message passing. It's not "true" overlap (because you're dipping into the Open MPI progression engine -- it's not happening automatically); it's a compromise between having progression threads taking CPU cycles away from your main application and continuing to provide progress on long-running communication operations.
>
> > The overlap potential is still good on the sender side for the same reason as in the previous case.
> >
> > • When the “RDMA Pipeline” protocol is used, both the sender and receiver sides have to actively cooperate to transfer data using multiple InfiniBand send/receive operations and RDMA writes. On the receiver side, as the article says, the “protocol effectively overlaps the cost of registration/deregistration with RDMA writes”. This allows communication to be overlapped with registration overhead on the receiver side, but not with computation. On the sender side I don’t see how overlap with computation could be possible either. In practice, when this protocol is used between a pair of MPI_Isend() and MPI_Irecv() calls, I fear that all the communication will happen when the sender and receiver reach their corresponding MPI_Wait() calls (which means no overlap).
>
> Not quite. Open MPI's progression engine will kick in here, too. I.e., any time you dip into the Open MPI progression engine, the state of all pending requests -- including multiple pending registration and/or communication operations for a single MPI request -- will be progressed.
>
> Bluntly: if you call MPI_Isend()/MPI_Irecv() for a large message and then don't make any MPI calls for 10 minutes, no progress will occur. However, if you call MPI_Isend()/MPI_Irecv() for a large message and periodically make other communication MPI API calls (including the various Test/Wait flavors), progress will continue to occur under the covers.
>
> There is continuing experimentation on true asynchronous progress, but it's tricky to get just right: you don't want to steal too many cycles and/or cause too much jitter for the main application thread(s). With MPI_THREAD_MULTIPLE, particularly when you might have multiple cores available in a single MPI process, the situation gets a bit more complex, and potentially more favorable for software-based progress. It's something we're actively discussing in the developer community.
>
> > So if someone could tell me whether this is correct or not, I would appreciate it greatly.
> >
> > I guess that the two above protocols correspond to the basic BTL/openib framework/component.
>
> Yes, and other BTL components. I.e., what you have described is how the ob1 PML operates.
>
> > When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential.
>
> Others will need to comment on that; the cm PML (i.e., all MTLs) and PML/yalla are super-thin shims to get to the underlying communication libraries. So it's up to the progression models of those underlying libraries as to how the communication overlap occurs.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
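P.S. To make the large-message MPI_Isend()/MPI_Irecv()/MPI_Wait() pattern discussed in the quoted thread concrete, here is a minimal sketch in C. The message size, the two-rank layout, and the do_local_work() placeholder are illustrative assumptions only; whether the do_local_work() phase actually overlaps the transfer depends on the PML/BTL progression behavior described above.

#include <mpi.h>
#include <stdlib.h>

/* Large enough that the transfer would normally take the rendezvous/RDMA
   path rather than the eager path (size is an illustrative assumption). */
#define COUNT (4 * 1024 * 1024)

static void do_local_work(void)
{
    /* placeholder for the application's computation */
}

int main(int argc, char **argv)
{
    int rank;
    double *buf;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(COUNT * sizeof(double));

    if (rank == 0) {
        for (int i = 0; i < COUNT; i++) {
            buf[i] = (double)i;                      /* something to send */
        }
        MPI_Isend(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    do_local_work();   /* the computation we hope overlaps the transfer */

    /* With ob1, any remaining protocol steps are driven here at the latest. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}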
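And here is a sketch of Jeff's point about periodically "dipping into" the library: interleaving the computation with MPI_Test() calls gives Open MPI's progression engine regular chances to advance the pending request (e.g. the registration/RDMA-write steps of the pipeline protocol) before MPI_Wait() is reached. The chunked work_chunk()/NCHUNKS structure is again an assumption for illustration, not something prescribed by Open MPI; the routine below would replace the single do_local_work() call in the previous sketch.

#include <mpi.h>

#define NCHUNKS 64   /* illustrative: number of slices the computation is split into */

static void work_chunk(int i)
{
    (void)i;   /* placeholder: one slice of the application's computation */
}

/* Interleave computation with MPI_Test() so the progression engine gets
   regular chances to advance the pending request. */
static void compute_and_progress(MPI_Request *req)
{
    int done = 0;

    for (int i = 0; i < NCHUNKS; i++) {
        work_chunk(i);
        if (!done) {
            MPI_Test(req, &done, MPI_STATUS_IGNORE);   /* dips into the progression engine */
        }
    }

    if (!done) {
        MPI_Wait(req, MPI_STATUS_IGNORE);   /* finish whatever is left */
    }
}

With the ob1 PML and no progress thread, dropping the MPI_Test() calls from this loop would defer most of the transfer to the final MPI_Wait(), which is exactly the scenario Martin was worried about.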
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users