Shaun Jackman wrote:
Eugene Loh wrote:
I'm no expert, but I think it's something like this:
1) If the messages are short, they're sent over to the receiver. If
the receiver does not expect them (no MPI_Irecv posted), it buffers
them up.
2) If the messages are long, only a little bit is sent over to the
receiver. The receiver will take in that little bit, but until an
MPI_Irecv is posted it will not signal the sender that any more can
be sent.
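
To see the two cases in practice, something like the rough sketch below works (the 2 KB / 2 MB sizes are just guesses on my part; the actual eager threshold is an Open MPI MCA parameter that differs per transport). Rank 1 deliberately delays posting its receives, so with typical settings the small MPI_Send on rank 0 returns almost immediately while the large one stalls until the receive is posted. Run it with something like "mpirun -np 2 ./a.out".

/* eager_vs_rendezvous.c -- a rough sketch, not a rigorous benchmark.
 * Rank 0 times MPI_Send for a small and a large message while rank 1
 * delays posting its receives.  With typical Open MPI settings the
 * small send returns almost at once (eager: data buffered at the
 * receiver), while the large send blocks until the matching receive
 * is posted (rendezvous).  The sizes below are assumptions; the real
 * eager limit is an MCA parameter and differs per transport. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void timed_send(void *buf, int count, int rank)
{
    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(buf, count, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        printf("MPI_Send of %d bytes returned after %.3f s\n",
               count, MPI_Wtime() - t0);
    } else if (rank == 1) {
        sleep(2);                     /* the receiver is "falling behind" */
        MPI_Recv(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(2 * 1024 * 1024);
    timed_send(buf, 2 * 1024, rank);        /* likely below the eager limit */
    MPI_Barrier(MPI_COMM_WORLD);
    timed_send(buf, 2 * 1024 * 1024, rank); /* likely above the eager limit */

    free(buf);
    MPI_Finalize();
    return 0;
}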
Are these messages being sent over TCP between nodes? How long are
they?
Each message is 2500 bytes. In this particular case, there are 8
processes on one host and 8 more processes on another host. So, on the
same host the communication will be shared memory, and between hosts
it will be TCP.
From your description, I'm guessing that either...
one process is falling behind the rest for whatever reason, and it is
buffering up received messages that haven't yet been matched by an
MPI_Irecv.
or...
one process is falling behind, and the messages that the other
processes are sending to it are queuing up in their transmit buffers.
Can statistics about the number of buffered messages (either tx or rx)
be collected and reported by Open MPI? I suppose it would have to be a
snapshot in time, triggered either programmatically or by a special
signal such as SIGHUP or SIGUSR1.
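
As far as I know, Open MPI doesn't expose its internal queue depths on a signal, but a rough application-level approximation is possible: keep your own per-rank (ideally per-peer) send/receive counters and dump them when SIGUSR1 arrives. A minimal sketch, where the 60-iteration sleep loop is just a stand-in for the real communication loop:

#include <mpi.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t dump_requested = 0;
static long msgs_sent = 0, msgs_received = 0;  /* maintained by the app */

static void on_sigusr1(int sig) { (void)sig; dump_requested = 1; }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    signal(SIGUSR1, on_sigusr1);

    /* Stand-in for the application's communication loop: increment
     * msgs_sent next to each MPI_Send, msgs_received next to each
     * completed receive, and check the flag once per iteration.  The
     * handler only sets a flag because printing from a signal handler
     * is not async-signal-safe. */
    for (int i = 0; i < 60; ++i) {
        if (dump_requested) {
            dump_requested = 0;
            fprintf(stderr, "[rank %d] sent=%ld received=%ld\n",
                    rank, msgs_sent, msgs_received);
        }
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}

Send the signal with "kill -USR1 <pid>" on the suspect rank (mpirun also forwards SIGUSR1/SIGUSR2 to the ranks, if I remember right). Comparing rank A's count of sends to B against rank B's count of receives from A gives the backlog between that pair.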
At 2500 bytes, all messages will presumably be sent "eagerly" -- without
waiting for the receiver to indicate that it's ready to receive that
particular message. This would suggest congestion, if any, is on the
receiver side. Some kind of congestion could, I suppose, still occur
and back up on the sender side.
On the other hand, I assume the memory imbalance we're talking about is
rather severe. It would have to be much more than 2500 bytes to be
noticeable, I would think. Is that really the situation you're imagining?
There are tracing tools to look at this sort of thing. The only one I
have much familiarity with is Sun Studio / Sun HPC ClusterTools. It is a
free download, available on Solaris or Linux, SPARC or x64, and it plays
with OMPI. You can see a timeline with message lines on it to give you
an idea of whether messages are being received/completed long after they
were sent. Another interesting view is a plot over time of how many
messages are in flight at any moment (including broken down by
receiver). There are lots of similar tools out there... VampirTrace
(tracing side only; you need to analyze the data yourself), Jumpshot,
etc. Again, though, there's a question in my mind whether you're really
backing up thousands of messages or more. (I'm assuming the memory
imbalances are at least megabytes.)
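
Those tracers generally sit on the PMPI profiling layer, and if all you want is the in-flight plot, a bare-bones hand-rolled interposer may be enough: write one timestamped line per send and per receive, then post-process the per-rank logs into a count of messages sent but not yet received over time. A sketch that only covers blocking MPI_Send/MPI_Recv (nonblocking and collective calls are ignored, and clocks across nodes are only approximately comparable):

/* mpi_log.c -- a bare-bones use of the MPI profiling (PMPI) layer:
 * wrap MPI_Send and MPI_Recv, write one timestamped line per call, and
 * post-process the per-rank logs into a count of messages sent but not
 * yet received over time.  Real tracers (VampirTrace, MPE/Jumpshot,
 * Sun HPC ClusterTools) do this far more completely.
 * Build it into the application, e.g.  mpicc app.c mpi_log.c  */
#include <mpi.h>
#include <stdio.h>

static FILE *logf;

static FILE *get_log(void)
{
    if (!logf) {
        int rank;
        char name[32];
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        snprintf(name, sizeof name, "mpilog.%d", rank);
        logf = fopen(name, "w");
    }
    return logf;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type, int dest,
             int tag, MPI_Comm comm)
{
    /* MPI_Wtime is not synchronized across nodes; cross-node deltas
     * are approximate. */
    fprintf(get_log(), "send %.6f dest=%d count=%d tag=%d\n",
            PMPI_Wtime(), dest, count, tag);
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int source, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    int rc = PMPI_Recv(buf, count, type, source, tag, comm, status);
    fprintf(get_log(), "recv %.6f source=%d tag=%d\n",
            PMPI_Wtime(), source, tag);
    return rc;
}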
Shaun Jackman wrote:
Each message is 2500 bytes, as I mentioned previously. In fact, each
message is composed of one hundred 25-byte operations that have been
queued up at the application level and sent with a single MPI_Send. It
would depend on the nature of the application, of course, but is there
any reason to believe that 100 individual MPI_Send calls would be any
faster? Or is there a better way to queue up messages for a batch
transmission?
Common MPI wisdom would say that if your messages are 25 bytes each and
you have already gone to the trouble of batching them up, you are in
good shape: the overhead on 25-byte messages would be high. That said, I
can think of counterarguments (e.g., having to wait a long time for the
last few messages before you can send a 100-message batch off, or
something like that), so your mileage will vary.
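
On the "better way to queue up messages" question, the standard-MPI way to batch is MPI_Pack/MPI_Unpack (or a derived datatype) rather than hand-rolled copying; whether it beats what the application already does is something to measure. A sketch using the 100 x 25-byte numbers from this thread, run with "mpirun -np 2":

/* batch_sketch.c -- batch small operations with MPI_Pack / MPI_Unpack.
 * The 25-byte "operation" and the batch size of 100 mirror the numbers
 * in this thread; the payload is a stand-in. */
#include <mpi.h>
#include <string.h>

#define OPS_PER_BATCH 100
#define OP_BYTES      25

int main(int argc, char **argv)
{
    int rank, pos = 0;
    char op[OP_BYTES], buf[OPS_PER_BATCH * OP_BYTES + 1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Pack 100 operations into one contiguous buffer, send once. */
        for (int i = 0; i < OPS_PER_BATCH; ++i) {
            memset(op, i, sizeof op);            /* stand-in payload */
            MPI_Pack(op, OP_BYTES, MPI_BYTE, buf, (int)sizeof buf, &pos,
                     MPI_COMM_WORLD);
        }
        MPI_Send(buf, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive the batch and unpack it one operation at a time. */
        MPI_Status st;
        int nbytes;
        MPI_Recv(buf, (int)sizeof buf, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_PACKED, &nbytes);
        while (pos < nbytes) {
            MPI_Unpack(buf, nbytes, &pos, op, OP_BYTES, MPI_BYTE,
                       MPI_COMM_WORLD);
            /* handle one 25-byte operation here */
        }
    }

    MPI_Finalize();
    return 0;
}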