Hi all, I'm seeing performance issues I don't understand in my multithreaded MPI code, and I was hoping someone could shed some light on this.
The code structure is as follows. A computational domain is decomposed into MPI tasks. Each MPI task has a "master thread" that receives messages from the other tasks and puts them into a local concurrent queue. Each task also has a few "worker threads" that process the incoming messages and, when necessary, send them on to other tasks. So for each task there is one thread doing receives and N threads (typically number of cores - 1) doing sends. All messages are nonblocking: the workers just post the sends and continue with computation, and the master repeatedly makes test calls to check for incoming messages (there are different flavors of messages, so it does several tests).

Currently I'm just testing, so I'm running 2 tasks using the sm btl on one node, with 5 worker threads each. (The node has 12 cores.) What happens is that task 0 receives everything that task 1 sends (the numbers of sends and receives roughly match). However, task 1 only receives about 25% of the messages sent by task 0. Task 0 apparently has no problem keeping up with the messages from task 1, even though the throughput in that direction is actually a bit higher. In less than a minute there are hundreds of thousands of pending messages, but only in one direction. At that point, throughput drops by orders of magnitude to <1000 msg/s.

Using PAPI, I can see that the receiving threads are by then essentially stalled on MPI tests and receives, and stopping them in the debugger suggests they are trying to acquire a lock. However, the test/receive they stall on is NOT the test for the huge backlog of pending messages, but the one for another, much rarer class of message.

I realize it's hard to tell without seeing the code (it's difficult to whittle it down to a workable example), but does anyone have any ideas about what is happening and how it can be fixed? I don't know if there are any problems with the basic structure of the code.
For example, are simultaneous sends/receives in different threads bound to cause lock contention on the MPI side? How does the MPI library decide which thread does the actual message processing? Does every nonblocking MPI call just "steal" a time slice to work on communications, or does MPI have its own thread dedicated to message processing? What I would like is for the master thread to devote all its time to communication, while the sends posted by the worker threads return as fast as possible. Would it be better for the receiving thread to do one large wait instead of repeatedly testing different sets of requests, or would that acquire some lock and then block the threads trying to post a send?

I've looked around for information on how best to structure multithreaded MPI code, but haven't had much luck finding anything.

This is with Open MPI 1.5.3 using MPI_THREAD_MULTIPLE, on a Dell PowerEdge C6100 running Linux kernel 2.6.18-194.32.1.el5, compiled with Intel 12.3.174. I've attached the ompi_info output.

Thanks,
/Patrik J.
ompi_info.gz