Dear all,

I have a simple MPI program with two processes using non-blocking communication, illustrated below:

    process 0:          process 1:
    MPI_Isend           MPI_Irecv
    compute stage       compute stage
    MPI_Wait            MPI_Wait

The actual communication is performed by offloading it to another thread or by using DMA (the KNEM module is used for this). Ideally, process 0 issues a non-blocking send, process 1 receives the data, and in the meantime the CPU cores on which the processes run perform the compute stage in parallel. When the compute stage is complete, calling MPI_Wait wraps up the communication.

When I profile my application, it turns out that the actual communication is initiated by MPI_Wait (a significant amount of time is spent there). This prevents any overlap of communication and computation, since MPI_Wait is called after the compute stage. Computation in my test case takes more time than communication, so MPI_Wait should not consume a significant amount of time: the communication should be over by then.

I also confirmed this by using MPI_Test instead of MPI_Wait. To the best of my knowledge, MPI_Test has the same effect as MPI_Wait but is non-blocking. When MPI_Test is placed strategically in the compute stage, it initiates the communication and a certain communication-computation overlap is achieved.

Could you please shed some light on whether I am doing something wrong with the library? Is this the way it should behave (MPI_Wait initiates the actual transfer)? How can I achieve communication-computation overlap?

Best,
Nikola
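
P.S. For illustration, here is a minimal sketch of the MPI_Test placement described above. It is not the application code: the helper names (compute_with_progress, countdown_progress, test_request), the buffer sizes, and the chunk count are all hypothetical, and the compute stage is replaced by a simple sum. The idea is only that the compute loop is cut into slices and the pending request is polled between slices. The MPI part is guarded by USE_MPI so that the loop itself can be tried without an MPI installation.

```c
#include <stddef.h>

/* Hypothetical progress hook: returns nonzero once the transfer is complete. */
typedef int (*progress_fn)(void *ctx);

/* Stand-in for the compute stage: sums n doubles in `chunks` slices and
 * calls `progress` after each slice until it reports completion.
 * Returns the sum; *polls receives the number of progress calls made. */
double compute_with_progress(const double *data, size_t n, size_t chunks,
                             progress_fn progress, void *ctx, int *polls)
{
    double sum = 0.0;
    int done = 0;

    *polls = 0;
    if (chunks == 0)
        chunks = 1;
    for (size_t c = 0; c < chunks; ++c) {
        size_t begin = n * c / chunks;
        size_t end = n * (c + 1) / chunks;
        for (size_t i = begin; i < end; ++i)
            sum += data[i];
        if (!done) {            /* stop polling once the transfer is done */
            done = progress(ctx);
            ++*polls;
        }
    }
    return sum;
}

/* Hook for trying the loop without MPI: ctx points at an int counter and
 * completion is reported once that counter has been decremented to zero. */
int countdown_progress(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0)
        --*remaining;
    return *remaining == 0;
}

#ifdef USE_MPI
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hook backed by MPI_Test; ctx points at the pending MPI_Request. */
static int test_request(void *ctx)
{
    int flag = 0;
    MPI_Test((MPI_Request *)ctx, &flag, MPI_STATUS_IGNORE);
    return flag;
}

int main(int argc, char **argv)
{
    enum { MSG = 1 << 20, WORK = 1 << 22, CHUNKS = 64 }; /* arbitrary sizes */
    int rank, polls;
    MPI_Request req = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *msg = calloc(MSG, sizeof *msg);
    double *work = calloc(WORK, sizeof *work);
    if (msg == NULL || work == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0)
        MPI_Isend(msg, MSG, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(msg, MSG, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

    double sum = compute_with_progress(work, WORK, CHUNKS,
                                       test_request, &req, &polls);

    /* Time spent here shows how much communication was left after compute. */
    double t0 = MPI_Wtime();
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    printf("rank %d: sum=%g polls=%d wait=%.6f s\n",
           rank, sum, polls, MPI_Wtime() - t0);

    free(msg);
    free(work);
    MPI_Finalize();
    return 0;
}
#endif
```

To run the MPI variant, build with something like `mpicc -DUSE_MPI` and launch exactly two processes. Comparing the reported wait time for CHUNKS = 1 against a larger value should show whether the polling makes a difference on a given setup.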