> > When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential.
> Others will need to comment on that; the cm PML (i.e., all MTLs) and PML/yalla are super-thin shims to get to the underlying communication libraries. So it's up to the progression models of those underlying libraries as to how the communication overlap occurs.

[Josh] MXM fully supports asynchronous progress in both MPI and OSHMEM applications. Network progress happens independently of calls made into the MPI/OSHMEM library.

On Mon, Aug 1, 2016 at 11:00 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> On Jul 8, 2016, at 4:26 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
> >
> > Hi OMPI_Users and OMPI_Developers,
>
> Sorry for the delay in answering, Martin.
>
> > I would like someone to verify whether my understanding is correct concerning Open MPI's ability to overlap communication and computation on InfiniBand when using the non-blocking MPI_Isend() and MPI_Irecv() functions (i.e., the computation is done between the non-blocking MPI_Isend() on the sender or MPI_Irecv() on the receiver and the corresponding MPI_Wait()).
> >
> > After reading the following FAQ entries:
> >
> > https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
> > https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3
> >
> > and the paper:
> >
> > https://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/
> >
> > about the algorithm used on OpenFabrics to send large messages, my understanding is that:
> >
> > • When the “RDMA Direct” message protocol is used, the communication is done by an RDMA read on the receiver side, so if the receiver calls MPI_Irecv() after it has received a matching message envelope (tag, communicator) from the sender, then the receiver can start the RDMA read, let the InfiniBand HCA operate, and return from MPI_Irecv() to let the receiving process compute. Then the next time the MPI library is called
>
> ...and if the RDMA transaction has completed...
>
> > on the receiver side (or maybe in the corresponding MPI_Wait() call), the receiver sends a short ACK message to the sender to tell the sender that the receive is complete and it is now free to do whatever it wants with the send buffer. When things happen this way (e.g., the sender's envelope is received before MPI_Irecv() is called on the receiver side), it offers great overlap potential on both the receiver and sender sides (because the sender's MPI_Isend() only has to send the envelope eagerly and its MPI_Wait() waits for the ACK).
>
> More or less, yes. Note that Open MPI does send at least a little bit of data with the envelope, just because... well, why waste a message transfer? :-)
>
> > However, when the receiver calls MPI_Irecv() before the sender's envelope is received, the RDMA read transfer cannot start until the envelope is received and the MPI library realizes it can start the RDMA read. If the receiver only realizes this in the corresponding MPI_Wait(), there will be no overlap on the receiver side.
>
> This isn't entirely correct.
>
> Many of Open MPI's MPI API calls will dip into the internal progression engine, which will attempt to progress any outstanding MPI requests. E.g., even if you invoke an unrelated MPI_Isend, if the envelope arrives for your previously-executed MPI_Irecv, Open MPI will progress that MPI_Irecv's request, meaning that it will initiate the protocol to start receiving the full message.
> Meaning: if you are dipping into the MPI library, it'll likely progress all your underlying message passing. It's not "true" overlap (because you're dipping into the Open MPI progression engine -- it's not happening automatically); it's a compromise between having progression threads taking CPU cycles away from your main application and continuing to provide progress on long-running communication operations.
>
> > The overlap potential is still good on the sender side for the same reason as in the previous case.
> >
> > • When the “RDMA Pipeline” protocol is used, both the sender and receiver sides have to actively cooperate to transfer data using multiple InfiniBand send/receive operations and RDMA writes. On the receiver side, as the article says, the “protocol effectively overlaps the cost of registration/deregistration with RDMA writes”. This allows communication to be overlapped with registration overhead on the receiver side, but not with computation. On the sender side I don’t see how overlap with computation could be possible either. In practice, when this protocol is used between a pair of MPI_Isend() and MPI_Irecv() calls, I fear that all the communication will happen when the sender and receiver reach their corresponding MPI_Wait() calls (which means no overlap).
>
> Not quite. Open MPI's progression engine will kick in here, too. I.e., any time you dip into the Open MPI progression engine, the state of all pending requests -- including multiple pending registration and/or communication operations for a single MPI request -- will be progressed.
>
> Bluntly: if you call MPI_Isend()/MPI_Irecv() for a large message and then don't make any MPI calls for 10 minutes, no progress will occur. However, if you call MPI_Isend()/MPI_Irecv() for a large message and periodically make other communication MPI API calls (including the various Test/Wait flavors), progress will continue to occur under the covers.
>
> There is continuing experimentation on true asynchronous progress, but it's tricky to get just right: you don't want to steal too many cycles and/or cause too much jitter for the main application thread(s). With MPI_THREAD_MULTIPLE, particularly when you might have multiple cores available in a single MPI process, the situation gets a bit more complex, and potentially more favorable for software-based progress. It's something we're actively discussing in the developer community.
>
> > So if someone could tell me whether this is correct or not, I would appreciate it greatly.
> >
> > I guess that the two above protocols correspond to the basic BTL/openib framework/component.
>
> Yes, and other BTL components. I.e., what you have described is how the ob1 PML operates.
>
> > When a more modern MTL/mxm or PML/yalla framework/component is used, I hope things are different and result in more communication/computation overlap potential.
>
> Others will need to comment on that; the cm PML (i.e., all MTLs) and PML/yalla are super-thin shims to get to the underlying communication libraries. So it's up to the progression models of those underlying libraries as to how the communication overlap occurs.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
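P.S. To make the large-message MPI_Isend()/MPI_Irecv()/MPI_Wait() pattern discussed in the quoted thread concrete, here is a minimal sketch in C. The message size, the two-rank layout, and the do_local_work() placeholder are illustrative assumptions only; whether the do_local_work() phase actually overlaps the transfer depends on the PML/BTL progression behavior described above.

#include <mpi.h>
#include <stdlib.h>

/* Large enough that the transfer would normally take the rendezvous/RDMA
   path rather than the eager path (size is an illustrative assumption). */
#define COUNT (4 * 1024 * 1024)

static void do_local_work(void)
{
    /* placeholder for the application's computation */
}

int main(int argc, char **argv)
{
    int rank;
    double *buf;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(COUNT * sizeof(double));

    if (rank == 0) {
        for (int i = 0; i < COUNT; i++) {
            buf[i] = (double)i;                      /* something to send */
        }
        MPI_Isend(buf, COUNT, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    do_local_work();   /* the computation we hope overlaps the transfer */

    /* With ob1, any remaining protocol steps are driven here at the latest. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}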
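And here is a sketch of Jeff's point about periodically "dipping into" the library: interleaving the computation with MPI_Test() calls gives Open MPI's progression engine regular chances to advance the pending request (e.g. the registration/RDMA-write steps of the pipeline protocol) before MPI_Wait() is reached. The chunked work_chunk()/NCHUNKS structure is again an assumption for illustration, not something prescribed by Open MPI; the routine below would replace the single do_local_work() call in the previous sketch.

#include <mpi.h>

#define NCHUNKS 64   /* illustrative: number of slices the computation is split into */

static void work_chunk(int i)
{
    (void)i;   /* placeholder: one slice of the application's computation */
}

/* Interleave computation with MPI_Test() so the progression engine gets
   regular chances to advance the pending request. */
static void compute_and_progress(MPI_Request *req)
{
    int done = 0;

    for (int i = 0; i < NCHUNKS; i++) {
        work_chunk(i);
        if (!done) {
            MPI_Test(req, &done, MPI_STATUS_IGNORE);   /* dips into the progression engine */
        }
    }

    if (!done) {
        MPI_Wait(req, MPI_STATUS_IGNORE);   /* finish whatever is left */
    }
}

With the ob1 PML and no progress thread, dropping the MPI_Test() calls from this loop would defer most of the transfer to the final MPI_Wait(), which is exactly the scenario Martin was worried about.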
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users