Hi all, I've been trying to get overlapping computation and data transfer to work, without much success so far. What I'm trying to achieve is:
NODE 1:
* Post nonblocking send (30MB data)

NODE 2:
1) Post nonblocking receive
2) Do local work while the data is being received
3) Complete the transfer posted in 1) (MPI_Wait)
4) Use the received data

In my first test, using a message size of 30MB, if I did nothing at step 2) above, completing the transfer in 3) took about 0.8s.

In my second test, I simply put a sleep(3) at step 2) and expected the MPI_Wait() call at 3) to finish almost instantly, since I assumed the message would have been transferred during the sleep. To my disappointment, though, the MPI_Wait took more or less the same time to finish as without any sleep.

After browsing the forums, I realized that to make any communication progress for these kinds of large messages, I usually need to block in MPI_Wait or repeatedly call MPI_Test. I guess that makes sense.

So, my question is: how would you get around this and achieve optimal computation/transfer overlap? Would you try to intersperse the local work code in 2) with calls to MPI_Test()? If yes, how frequently would these calls have to be made?

Another possible solution that comes to mind is to spawn a separate thread that does an MPI_Wait(). With Open MPI over Ethernet, would that mean that the MPI_Wait thread would busy-loop, and thus steal up to 50% of the CPU from the main thread doing the local computation work?

Lots of questions, but I think this is a pretty common scenario. Still, after a lot of browsing, I haven't been able to find any concrete advice.

Thanks,
Lars