Eugene Loh wrote:
I'm no expert, but I think it's something like this:
1) If the messages are short, they're sent over to the receiver. If the
receiver does not expect them (no MPI_Irecv posted), it buffers them up.
2) If the messages are long, only a little bit is sent over to the
receiver. The receiver will take in that little bit, but until an
MPI_Irecv is posted it will not signal the sender that any more can be sent.
Are these messages being sent over TCP between nodes? How long are they?
Each message is 2500 bytes. In this particular case, there are 8
processes on one host and 8 more processes on another host. So, on the
same host the communication will be shared memory, and between hosts
it will be TCP.
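
To make the short/long distinction above concrete for 2500-byte messages, here is a toy model in Python. It is not Open MPI code, only a sketch of the behaviour as described. The names ToyReceiver, arrive and post_recv are made up for illustration, and EAGER_LIMIT is an assumed threshold (the real cut-off depends on the transport and its settings). Messages at or below the limit have their payload buffered if no receive is posted; larger ones leave only a placeholder until a receive shows up.

```python
from collections import deque

EAGER_LIMIT = 4096  # assumed threshold in bytes; real limits vary by transport


class ToyReceiver:
    """Toy model of a receiver's unexpected-message queue (not Open MPI code)."""

    def __init__(self):
        self.posted = 0            # receives posted but not yet matched
        self.unexpected = deque()  # arrivals that found no posted receive
        self.buffered_bytes = 0    # payload held for unexpected short messages
        self.delivered = 0         # messages handed to the application

    def arrive(self, size):
        """A message of `size` bytes arrives from some sender."""
        if self.posted > 0:
            # A receive was already waiting, so nothing needs buffering.
            self.posted -= 1
            self.delivered += 1
            return
        eager = size <= EAGER_LIMIT
        if eager:
            # Short message: the whole payload is already here and must be kept.
            self.buffered_bytes += size
        # Long message: only a small header is kept; the payload stays at the sender.
        self.unexpected.append((size, eager))

    def post_recv(self):
        """The application posts a receive (the analogue of MPI_Irecv)."""
        if not self.unexpected:
            self.posted += 1
            return
        size, eager = self.unexpected.popleft()
        if eager:
            self.buffered_bytes -= size
        # For a long message, this is the point where the sender would be
        # told that it may send the rest.
        self.delivered += 1


# One receiver that has fallen behind by 1000 messages of 2500 bytes each.
rx = ToyReceiver()
for _ in range(1000):
    rx.arrive(2500)
print(len(rx.unexpected), rx.buffered_bytes)
```

Under these assumptions a receiver that is 1000 messages behind holds 2,500,000 bytes of payload. If the messages were above the limit, the same backlog would sit on the senders' side instead.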
From your description, I'm guessing that either...
one process is falling behind the rest for whatever reason, and that
it's buffering up received messages that haven't been handled by an
MPI_Irecv.
or...
one process is falling behind and the messages that the other processes
have to send to it are being queued up in a transmit buffer.
Can statistics about the number of buffered messages (either tx or rx)
be collected and reported by Open MPI? I suppose it would have to be a
snapshot in time triggered either programmatically or by a special kill
signal, like SIGHUP or SIGUSR1.
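
Something like the following is what I have in mind, sketched in Python on a Unix-like system. This is only the application-level version of the idea: the counters in stats are hypothetical and would have to be updated by the application itself, and nothing here reads Open MPI's internal queues. The signal handler only sets a flag, and the main loop takes the snapshot when it next checks.

```python
import os
import signal

# Hypothetical application-level counters; the application updates these itself.
stats = {"sends_pending": 0, "recvs_unmatched": 0}
snapshot_requested = False


def request_snapshot(signum, frame):
    """Signal handler: only record that a snapshot was asked for."""
    global snapshot_requested
    snapshot_requested = True


def take_snapshot_if_requested():
    """Call periodically from the main loop.

    Returns a copy of the counters tagged with the process id if a
    snapshot was requested since the last call, otherwise None.
    """
    global snapshot_requested
    if not snapshot_requested:
        return None
    snapshot_requested = False
    snapshot = dict(stats)
    snapshot["pid"] = os.getpid()
    return snapshot


# Ask for a snapshot from outside with: kill -USR1 <pid>
signal.signal(signal.SIGUSR1, request_snapshot)
```

The main loop would call take_snapshot_if_requested() once per iteration and print whatever it returns. Keeping the handler down to setting a flag avoids doing I/O from inside a signal handler.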
Cheers,
Shaun