I have a bunch of simulators communicating results to a single
assembler.  The results seem to take a long time to be received, and the
delay increases as the system runs.  Here are some results:

 sent (s) received (s) delay (s)
   70.679       94.776    24.097
   94.677      144.906    50.229
  122.082      238.713   116.631
  144.785      313.101   168.316
  167.918      350.037   182.119
  190.709      384.342   193.633
Times are wall-clock seconds since process launch, so there may be some
skew between the sender's and receiver's clocks, but that skew should be
consistent.  (This tracks only the sends from one simulator, and it
ignores later sends that never arrived; my completion logic needs work.)

The results are typically about 500 kB.  Sending is via Isend
(non-blocking) and receiving via Recv (blocking).  The simulators spend
most of their time computing; in particular there may be significant
gaps, from 10 seconds to a minute, between calls into MPI (typically a
mix of Isend, Recv, and Testsome).  All processes are on the same
machine (for now).
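
In case it helps, here is roughly the shape of the exchange, reduced to
bare Rmpi calls.  The real code is wrapped in higher-level functions, so
result, asm.rank, and RESULT.TAG below are just placeholders for this
sketch:

library(Rmpi)

RESULT.TAG <- 33          # placeholder tag for result messages
asm.rank   <- 0           # placeholder rank of the assembler

## Simulator side: post a non-blocking send of a ~500 kB result object,
## then go back to computing.  Rmpi serializes the object; the request
## goes into Rmpi's default request slot 0.
send.result <- function(result) {
  mpi.isend.Robj(result, asm.rank, RESULT.TAG)
}

## Assembler side: block until the next result arrives from any simulator.
receive.result <- function() {
  mpi.recv.Robj(mpi.any.source(), RESULT.TAG)
}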

The interval between the assembler's receives (from all sources) is
sometimes quite brief, under 2 seconds, and it is quite variable.
Neither observation is consistent with the theory that the receiver is
saturated receiving messages, each of which takes a long time to
transmit (I mean the active part of the transmission, when bits are
flowing).  I infer that actually transmitting a message does not take
long, and that the longer gaps between receives have some other cause.

This is all from R, and the problem might lie with higher-level code.

Can anyone explain what is going on, and what I might do to alleviate
it?

My speculation is that the necessary handshaking can only take place
while both processes are inside an MPI call, or perhaps only inside
certain calls.  The assembler spends most of its time sitting in a
receive, but the simulators are mostly busy with other work.  So I
suspect the delay is on the simulator side, though I'm not sure what to
do about it.  I could wait for completion on the sending side, but that
rather defeats the purpose of doing an Isend.
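
Would it help to poke the progression engine from the simulators between
chunks of computation, using a non-blocking test instead of a wait?
Something like the following; the loop structure and compute.one.step
are made up, and the point is only the periodic mpi.test:

## Sketch: during a long computation, drop into MPI now and then so any
## pending handshake for the earlier isend can complete.
run.simulation <- function(n.steps) {
  for (step in seq_len(n.steps)) {
    compute.one.step(step)   # hypothetical unit of work
    ## Non-blocking test of request slot 0; returns TRUE once the isend
    ## has completed.  The side effect I care about is that it gives
    ## Open MPI's progress engine a chance to run.
    mpi.test(0)
  }
}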

In an earlier thread about a similar issue, Jeff Squyres wrote
(http://www.open-mpi.org/community/lists/users/2011/07/16928.php):
----------------------------------------------------
If so, it's because Open MPI does not do background progress on
non-blocking sends in all cases.  Specifically, if you're sending over
TCP and the message is "long", the OMPI layer in the master doesn't
actually send the whole message immediately because it doesn't want to
unexpectedly consume a lot of resources in the slave.  So the master
only sends a small fragment of the message and the communicator,tag
tuple suitable for matching at the receiver. When the receiver posts a
corresponding MPI_Recv (time=C), it sends back an ACK to the master,
enabling the master to send the rest of the message.

However, since OMPI doesn't support background progress in all
situations, the master doesn't see this ACK until it goes into the MPI
progression engine -- i.e., when you call MPI_Recv() at Time=E.  Then
the OMPI layer in the master sees the ACK and sends the rest of the
message.
----------------------------------------------------------------

I'm not sending over TCP (yet), but maybe I'm running into something
similar.
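
If that is what's happening, the isend's completion should line up with
the simulator's next entry into MPI rather than with the assembler
posting its Recv.  I could check by timestamping the isend and the first
successful test of it, roughly like this (again a sketch, reusing the
placeholder names from above):

## Timestamp when the isend is posted...
timed.isend <- function(result) {
  mpi.isend.Robj(result, asm.rank, RESULT.TAG)
  cat("isend posted at", proc.time()[["elapsed"]], "\n")
}

## ...and when it is first seen to complete.  Call this from the compute
## loop; once it reports completion, request slot 0 can be reused.
check.send <- function() {
  if (mpi.test(0)) {
    cat("isend complete by", proc.time()[["elapsed"]], "\n")
  }
}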

I had thought the MPI machinery was handled in a separate layer or
thread that would magically do all the work of moving messages around;
the fact that top shows all the CPU going to the R processes suggests
that's not the case.

Running OMPI 1.7.4.

Thanks for any help.
Ross Boylan
