On Jul 8, 2016, at 4:26 PM, Audet, Martin <martin.au...@cnrc-nrc.gc.ca> wrote:
> 
> Hi OMPI_Users and OMPI_Developers,

Sorry for the delay in answering, Martin.

> I would like someone to verify whether my understanding is correct concerning 
> Open MPI's ability to overlap communication and computation on InfiniBand when 
> using the non-blocking MPI_Isend() and MPI_Irecv() functions (i.e., the 
> computation is done between the non-blocking MPI_Isend() on the sender or 
> MPI_Irecv() on the receiver and the corresponding MPI_Wait()).
> 
> After reading the following FAQ entries:
> 
>   https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.2
>   https://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3
> 
> and the paper:
> 
>   https://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/
> 
> about the algorithms used over OpenFabrics to send large messages, my 
> understanding is that:
> 
>       • When the “RDMA Direct” message protocol is used, the communication is 
> done by an RDMA read on the receiver side. So if the receiver calls 
> MPI_Irecv() after it has received a matching message envelope (tag, 
> communicator) from the sender, the receiver can start the RDMA read, let the 
> InfiniBand HCA operate, and return from the MPI_Irecv() to let the receiving 
> process compute. Then the next time the MPI library is called

...and if the RDMA transaction has completed...

> on the receiver side (or maybe in the corresponding MPI_Wait() call), the 
> receiver sends a short ACK message to the sender to tell the sender that the 
> receive is complete and that it is now free to do whatever it wants with the 
> send buffer. When things happen this way (i.e., the sender's envelope is 
> received before MPI_Irecv() is called on the receiver side), it offers great 
> overlap potential on both the receiver and sender sides (because the sender's 
> MPI_Isend() only has to send the envelope eagerly and its MPI_Wait() only 
> waits for the ACK).

More or less, yes.  Note that Open MPI does send at least a little bit of data 
with the envelope, just because... well, why waste a message transfer?  :-)
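
To make the discussion concrete, here is a minimal sketch of the nonblocking
pattern being discussed (the buffer sizes, the peer rank, and the
compute_while_transferring() routine are all hypothetical; the point is only
where the computation sits relative to posting the requests and the matching
MPI_Waitall):

    #include <mpi.h>

    void compute_while_transferring(void);  /* hypothetical computation */

    void exchange(double *sendbuf, double *recvbuf, int count,
                  int peer, MPI_Comm comm)
    {
        MPI_Request reqs[2];

        /* Post both nonblocking operations up front. */
        MPI_Isend(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

        /* Computation that touches neither buffer.  Whether any of the
           transfer actually happens during this call is exactly the
           question in this thread. */
        compute_while_transferring();

        /* Complete both requests. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }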

> However, when the receiver calls MPI_Irecv() before the sender's envelope is 
> received, the RDMA read transfer cannot start until the envelope is received 
> and the MPI library realizes it can start the RDMA read. If the receiver only 
> realizes this in the corresponding MPI_Wait(),
> there will be no overlap on the receiver side.

This isn't entirely correct.

Many of Open MPI's MPI API calls will dip into the internal progression engine, 
which will attempt to progress any outstanding MPI requests.  E.g., even if you 
invoke an unrelated MPI_Isend, if the envelope arrives for your 
previously-executed MPI_Irecv, Open MPI will progress that MPI_Irecv's request, 
meaning that it will initiate the protocol to start receiving the full message.

Meaning: if you are dipping into the MPI library, it'll likely progress all 
your underlying message passing.  It's not "true" overlap (because you're 
dipping into the Open MPI progression engine -- it's not happening 
automatically); it's a compromise between having progression threads taking CPU 
cycles away from your main application and continuing to provide progress on 
long-running communication operations.
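
To illustrate (a sketch only; the peers, tags, and sizes here are made up, and
nothing below is an Open MPI-specific API), even an unrelated MPI call is
enough to let the progression engine advance a previously-posted large
receive:

    #include <mpi.h>

    void example(char *bigbuf, int bigcount, int producer, int other_peer,
                 MPI_Comm comm)
    {
        MPI_Request big_req, small_req;
        int token = 42;

        /* Large receive; under ob1 the rendezvous protocol cannot start
           until the sender's envelope has arrived and we re-enter the
           MPI library. */
        MPI_Irecv(bigbuf, bigcount, MPI_CHAR, producer, 1, comm, &big_req);

        /* ...compute for a while without calling MPI... */

        /* An unrelated send to a different peer: entering the library
           here also gives Open MPI a chance to progress big_req. */
        MPI_Isend(&token, 1, MPI_INT, other_peer, 2, comm, &small_req);

        MPI_Wait(&small_req, MPI_STATUS_IGNORE);
        MPI_Wait(&big_req, MPI_STATUS_IGNORE);
    }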

> The overlap potential is still good on the sender side for the same reason as 
> in the previous case.
>       • When the “RDMA Pipeline” protocol is used, both the sender and 
> receiver sides have to actively cooperate to transfer data using multiple 
> InfiniBand send/receives and RDMA writes. On the receiver side, as the article 
> says, the “protocol effectively overlaps the cost of 
> registration/deregistration with RDMA writes”. This allows communication to be 
> overlapped with registration overhead on the receiver side, but not with 
> computation. On the sender side, I don’t see how overlap with computation 
> could be possible either. In practice, when this protocol is used between a 
> pair of MPI_Isend() and MPI_Irecv(), I fear that all the communication will 
> happen when the sender and receiver reach their corresponding MPI_Wait() calls 
> (which means no overlap).

Not quite.  Open MPI's progression engine will kick in here, too.  I.e., any 
time you dip into the Open MPI progression engine, the state of all pending 
requests -- including multiple pending registration and/or communication 
operations for a single MPI request -- will be progressed.

Bluntly: if you call MPI_Isend()/MPI_Irecv() for a large message and then don't 
make any MPI calls for 10 minutes, no progress will occur.  However, if you 
call MPI_Isend()/MPI_Irecv() for a large message and periodically make other 
communication MPI API calls (including the various Test/Wait flavors), progress 
will continue to occur under the covers.
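
For example (again just a sketch; the amount of work per iteration and the
do_some_work() routine are made up), periodically dipping into the library
with MPI_Test keeps a long-running request moving while you compute:

    #include <mpi.h>

    void do_some_work(void);  /* hypothetical unit of computation */

    void recv_with_manual_progress(void *buf, int count, int src,
                                   MPI_Comm comm)
    {
        MPI_Request req;
        int done = 0;

        MPI_Irecv(buf, count, MPI_BYTE, src, 0, comm, &req);

        while (!done) {
            do_some_work();                            /* compute a bit    */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);  /* ...then progress */
        }
    }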

There is continuing experimentation on true asynchronous progress, but it's 
tricky to get just right: you don't want to steal too many cycles and/or cause 
too much jitter for the main application thread(s).  With MPI_THREAD_MULTIPLE, 
particularly when you might have multiple cores available in a single MPI 
process, the situation gets a bit more complex, and potentially more favorable 
for software-based progress.  It's something we're actively discussing in the 
developer community.
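
If you do have a core to spare, one thing applications sometimes do today
(this is an application-level workaround, not something Open MPI does for you)
is dedicate a thread to dipping into the library.  A sketch, assuming MPI was
initialized with MPI_THREAD_MULTIPLE; the stop flag and the use of MPI_Iprobe
as a cheap "enter the library" call are illustrative only:

    #include <mpi.h>
    #include <pthread.h>
    #include <stdatomic.h>

    static atomic_int stop_progress;

    static void *progress_thread(void *arg)
    {
        MPI_Comm comm = *(MPI_Comm *)arg;
        int flag;

        /* Repeatedly enter the MPI library so that pending requests are
           progressed while the main thread computes. */
        while (!atomic_load(&stop_progress))
            MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag,
                       MPI_STATUS_IGNORE);
        return NULL;
    }

    /* In main(): call MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE,
       &provided), check that provided == MPI_THREAD_MULTIPLE, and only then
       start this thread with pthread_create(). */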

> So if someone could tell me whether this is correct, I would appreciate it 
> greatly.
> 
> I guess that the two protocols above correspond to the basic BTL/openib 
> framework/component.

Yes, and other BTL components.  I.e., what you have described is how the ob1 
PML operates.

> When a more modern MTL/mxm or PML/yalla framework/component is used, I hope 
> things are different and result in more communication/computation overlap 
> potential.

Others will need to comment on that; the cm PML (i.e., all MTLs) and PML/yalla 
are super-thin shims to get to the underlying communication libraries.  So it's 
up to the progression models of those underlying libraries as to how the 
communication overlap occurs.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
