Hi Yiannis,

On Fri, Dec 9, 2011 at 10:21 AM, Yiannis Papadopoulos
<giannis.papadopou...@gmail.com> wrote:
> Patrik Jonsson wrote:
>>
>> Hi all,
>>
>> I'm seeing performance issues I don't understand in my multithreaded
>> MPI code, and I was hoping someone could shed some light on this.
>>
>> The code structure is as follows: A computational domain is decomposed
>> into MPI tasks. Each MPI task has a "master thread" that receives
>> messages from the other tasks and puts them into a local, concurrent
>> queue. The tasks then have a few "worker threads" that process the
>> incoming messages and, when necessary, send them on to other tasks. So
>> for each task, there is one thread doing receives and N (typically
>> number of cores - 1) threads doing sends. All messages are nonblocking,
>> so the workers just post the sends and continue with computation, and
>> the master repeatedly does a number of test calls to check for incoming
>> messages (there are several flavors of message, so it does several
>> tests).
>
> When do you do the MPI_Test on the Isends? I have had performance
> issues on a number of systems when I used a single queue of
> MPI_Requests that held Isends to different ranks and tested them one
> by one. It appears that some messages are sent out more efficiently if
> you test them.
There are 3 classes of messages that may arrive. The requests for each
class are kept in a vector, and I use boost::mpi::test_some (which I
assume just calls MPI_Testsome) to test them in round-robin fashion.

> I found that either using MPI_Testsome, or having a map (key = rank,
> value = queue of MPI_Requests) and testing the first MPI_Request for
> each key, resolved this issue.

In my case, I know that the overwhelming majority of the traffic is one
kind of message. What I ended up doing was simply to repeat the test for
that message class immediately whenever the preceding test succeeded, up
to 1000 times, before again checking the other requests. This appears to
let the task keep up with the incoming traffic.

I guess another possibility would be to have several slots for incoming
messages. Right now I only post one irecv per source task. By posting a
couple, more messages could arrive and still find a matching recv
already posted, so one test call could match more of them. Since that
makes the logic more complicated, I didn't try it.