I have a bunch of simulators communicating results to a single assembler. The results seem to take a long time to be received, and the delay increases as the system runs. Here are some timings:
    sent    received     delay
  70.679      94.776    24.097
  94.677     144.906    50.229
 122.082     238.713   116.631
 144.785     313.101   168.316
 167.918     350.037   182.119
 190.709     384.342   193.633

Times are wall-clock seconds since process launch, so there may be some skew between sender and receiver, but it will be consistent. (This tracks only sends from one simulator, and it ignores later sends that never arrived; my completion logic needs work.) The results are typically 500 kB. Sending is via Isend (non-blocking) and receiving via Recv (blocking). The simulators spend most of their time computing; in particular there may be significant delays, e.g. from 10 seconds to a minute, between calls into MPI (typically a mix of Isend, Recv, and Testsome). All processes are on the same machine (for now).

The interval between assembler receives (from all sources) is sometimes quite brief, under 2 seconds, and the time between receives is quite variable. Neither observation fits the theory that the receiver is saturated receiving messages, each of which takes a long time to transmit (I mean the active part of the transmission, when bits are flowing). I infer that actually transmitting a message does not take long, and that the longer gaps between receives have some other cause.

This is all from R, and the problem might lie with higher-level code. Can anyone explain what is going on, and what I might do to alleviate it?

My speculation is that the necessary handshaking can only take place while both processes are inside MPI calls, or perhaps only inside certain calls. The assembler spends most of its time sitting in a receive, but the simulators are mostly busy with other work, so I suspect the delay is on the simulator side, though I'm not sure what to do about it. I could wait on completion from the sender, but that rather defeats the purpose of doing an Isend. (A sketch of the periodic-progress alternative I have in mind is in the P.S. below.)

In a related thread about a similar issue, Jeff Squyres wrote (http://www.open-mpi.org/community/lists/users/2011/07/16928.php):

----------------------------------------------------
If so, it's because Open MPI does not do background progress on non-blocking sends in all cases. Specifically, if you're sending over TCP and the message is "long", the OMPI layer in the master doesn't actually send the whole message immediately because it doesn't want to unexpectedly consume a lot of resources in the slave. So the master only sends a small fragment of the message and the communicator,tag tuple suitable for matching at the receiver. When the receiver posts a corresponding MPI_Recv (time=C), it sends back an ACK to the master, enabling the master to send the rest of the message.

However, since OMPI doesn't support background progress in all situations, the master doesn't see this ACK until it goes into the MPI progression engine -- i.e., when you call MPI_Recv() at Time=E. Then the OMPI layer in the master sees the ACK and sends the rest of the message.
----------------------------------------------------

I'm not sending over TCP (yet), but maybe I'm running into something similar. I had thought the MPI traffic was handled in a separate layer or thread that would magically do all the work of moving messages around; the fact that top shows all the CPU going to the R processes suggests that's not the case.

Running OMPI 1.7.4.

Thanks for any help.
Ross Boylan
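P.S. For concreteness, here is a rough sketch of what I mean by periodically driving progress from the simulator side instead of waiting on the send. My real code is in R via Rmpi, but in plain C MPI terms the idea would be roughly the following; send_result(), poke_mpi(), and MAX_PENDING are made-up names for this example, and I haven't tried it.

    /* Sketch only: the real code is in R via Rmpi; these helper names
     * are placeholders. */
    #include <mpi.h>

    #define MAX_PENDING 16

    static MPI_Request pending[MAX_PENDING];
    static int npending = 0;

    /* Post the non-blocking send and keep the request around. */
    void send_result(void *buf, int count, int dest, int tag)
    {
        if (npending < MAX_PENDING)
            MPI_Isend(buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD,
                      &pending[npending++]);
    }

    /* Call this every few seconds from inside the long computation.
     * MPI_Testsome returns immediately, but entering the library gives
     * Open MPI a chance to see the receiver's ACK and push out the rest
     * of any long (rendezvous-mode) message. */
    void poke_mpi(void)
    {
        int outcount, indices[MAX_PENDING];
        int i, j;

        if (npending == 0)
            return;
        MPI_Testsome(npending, pending, &outcount, indices,
                     MPI_STATUSES_IGNORE);

        /* MPI_Testsome sets completed requests to MPI_REQUEST_NULL;
         * compact them out so the array doesn't fill up. */
        for (i = 0, j = 0; i < npending; i++)
            if (pending[i] != MPI_REQUEST_NULL)
                pending[j++] = pending[i];
        npending = j;
    }

The result buffer would of course have to stay untouched until its request completes. In Rmpi I assume this amounts to calling the test/testsome wrapper on the outstanding send requests at convenient points during the computation, rather than only when the next round of communication starts. Does that sound like the right approach?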