Yes, you're seeing more-or-less the expected behavior.  It's a complicated 
issue.

Short version: you might want to sprinkle MPI_Test calls throughout your compute 
stage to get true overlap (see the sketch below).
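
Something like this, for example -- just a minimal sketch, where 
compute_with_progress() and do_compute_chunk() are made-up names standing in 
for however your real compute stage is structured:

    #include <mpi.h>

    void do_compute_chunk(int i);   /* placeholder for your real work */

    /* Drive the compute stage in chunks, poking MPI's progress engine
       between chunks so the rendezvous (match, ACK, bulk transfer) can
       advance while you compute. */
    void compute_with_progress(MPI_Request *req, int nchunks)
    {
        int done = 0;
        for (int i = 0; i < nchunks; ++i) {
            do_compute_chunk(i);
            if (!done) {
                MPI_Test(req, &done, MPI_STATUS_IGNORE);
            }
        }
        /* If the transfer still isn't finished, block here.  (If MPI_Test
           already completed the request, this returns immediately.) */
        MPI_Wait(req, MPI_STATUS_IGNORE);
    }

The same idea applies on the receive side with the MPI_Irecv request.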

More detail: MPI implementations typically use a "rendezvous" protocol for large 
messages, meaning that the sender first sends a small fragment to the peer 
announcing the (communicator, tag, source) of the message.  When the receiver 
actually posts a matching receive, it sends back an ACK to the sender saying, 
"Ok, I have the buffer available now -- send the rest of the message".
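
Roughly, the handshake looks like this (just a schematic of what I described 
above; the exact fragment contents and timing vary by transport):

    Sender                              Receiver
    ------                              --------
    MPI_Isend(large buffer)
      -> small rendezvous fragment
         (communicator, tag, source)
                                        fragment is matched against the
                                        posted MPI_Irecv (this matching
                                        happens inside an MPI call on
                                        the receiver)
      <- ACK: "buffer is ready --
               send the rest"
      -> bulk of the message
         (possibly offloaded to the
          NIC or to KNEM)
    MPI_Wait / MPI_Test completes       MPI_Wait / MPI_Test completes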

So when you initiate a large send, the receiver still has to match that short 
initial frag, send back the ACK, and then the sender has to send the rest of 
the message.  I.e., the MPI layer has to be involved on both sides a few more 
times.  With a single-threaded MPI implementation like Open MPI, this means you 
need to dip into the MPI layer to keep the progress going.

This is currently true even with RDMA/hardware offload technologies.  So even 
though the bulk of the message transfer is offloaded to the NIC hardware, OMPI 
won't initiate that bulk transfer until the ACK has been received.

In a perfect MPI implementation, you can do exactly what you said -- MPI_Isend 
a large message and eventually an MPI_Wait, and the MPI_Wait basically does 
very little except notice that the transfer is already done.  

However, this is engineering/reality -- there's always a tradeoff.

You can, for example, increase OMPI's threshold between "small" and "large" 
messages so that everything is considered a "small" message -- meaning that 
messages are sent eagerly, not via a rendezvous protocol (and therefore you have 
a much better chance of MPI_Isend/MPI_Wait doing more of what you expect).  But 
this tends to consume more buffering at the receiver.
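
For example, something like this on the mpirun command line -- a sketch only: 
the exact parameter name depends on which BTL you're using (btl_sm_eager_limit 
here, since you mentioned shared memory / KNEM), and the value is just 
illustrative.  Something like "ompi_info --param btl all" will show the knobs 
your build actually has:

    # Raise the eager limit for the shared-memory BTL so that larger
    # messages are sent eagerly instead of via rendezvous.
    mpirun --mca btl_sm_eager_limit 65536 -np 2 ./your_app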

Make sense?



On Mar 7, 2014, at 9:49 AM, Velickovic Nikola <nikola.velicko...@epfl.ch> wrote:

> Dear all,
> 
> I have a simple MPI program with two processes using non-blocking 
> communication illustrated below:
> 
> process 0:         process 1:
> 
> MPI_Isend          MPI_Irecv
> 
> compute stage      compute stage
> 
> MPI_Wait           MPI_Wait
> 
> The actual communication is performed by offloading it to another thread, or 
> by using DMA (the KNEM module is used for this).
> Ideally, what should happen is that process 0 issues a non-blocking send, 
> process 1 receives the data, 
> and in the meantime (in parallel) the CPU cores where the processes run are 
> doing the compute stage.
> When the compute stage is completed, calling MPI_Wait wraps up the communication.
> 
> When I profile my application, it turns out that the actual communication is 
> initiated by MPI_Wait (a significant amount of time is spent there), which 
> prevents overlapping communication and computation since MPI_Wait is called 
> after the compute stage.
> Computation in my test case takes more time than communication, so MPI_Wait 
> should not be consuming a significant amount of time since the communication 
> should be over by then.
> 
> I also confirmed this by using MPI_Test instead of MPI_Wait.
> MPI_Test has the same effect as MPI_Wait (to the best of my knowledge) but is 
> non-blocking.
> When MPI_Test is placed strategically in the compute stage, it initiates the 
> communication and some communication-computation overlap is achieved.
> 
> Could you please shed some light on whether I am doing something wrong with 
> the library?
> Is it the way it should behave (MPI_Wait initiates the actual transfer)?
> How to achieve communication-computation overlap?
> 
> 
> Best,
> Nikola
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
