Shaun Jackman wrote:

Eugene Loh wrote:

I'm no expert, but I think it's something like this:

1) If the messages are short, they're sent over to the receiver. If the receiver does not expect them (no MPI_Irecv posted), it buffers them up.

2) If the messages are long, only a little bit is sent over to the receiver. The receiver will take in that little bit, but until an MPI_Irecv is posted it will not signal the sender that any more can be sent.
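For illustration only (this isn't Open MPI's internals, just an application-level sketch with a made-up tag and size): the way to keep case 1 from piling up in the library's unexpected-message buffers is to post the receive before the message arrives, e.g.

#include <mpi.h>

/* Sketch only: if the receive below is posted before an eager message
 * arrives, the payload lands directly in recv_buf; otherwise the MPI
 * library has to hold it in an internal unexpected-message buffer until
 * a matching receive shows up.  The tag and 2500-byte size are just
 * examples. */
#define MSG_TAG 42
#define MSG_LEN 2500

void post_receive_early(char recv_buf[MSG_LEN], MPI_Request *req)
{
    MPI_Irecv(recv_buf, MSG_LEN, MPI_CHAR, MPI_ANY_SOURCE, MSG_TAG,
              MPI_COMM_WORLD, req);
    /* ... do other work; complete later with
     *     MPI_Wait(req, MPI_STATUS_IGNORE); ... */
}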

Are these messages being sent over TCP between nodes? How long are they?

Each message is 2500 bytes. In this particular case, there are 8 processes on one host and 8 more processes on another host. So, on the same host the communication will be shared memory, and between hosts it will be TCP.

From your description, I'm guessing that either...
one process is falling behind the rest for whatever reason, and that it's buffering up received messages that haven't been handled by an MPI_Irecv.

or...
one process is falling behind and the messages that the other processes have to send to it are being queued up in their transmit buffers.

Can statistics about the number of buffered messages (either tx or rx) be collected and reported by Open MPI? I suppose it would have to be a snapshot in time, triggered either programmatically or by a special kill signal, like SIGHUP or SIGUSR1.
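Failing that, I suppose something like the following would give a crude picture (a sketch only; the counter names and dump format are invented, and only MPI_Send/MPI_Irecv are wrapped): use the standard PMPI profiling interface to count sends and posted receives, and dump the counters when SIGUSR1 arrives. Comparing the dumps across ranks only gives a proxy for what is actually buffered inside the library, but it would show which rank has fallen behind in posting receives.

#include <mpi.h>
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t dump_requested = 0;
static long sends_started = 0;   /* MPI_Send calls made by this rank  */
static long recvs_posted  = 0;   /* MPI_Irecv calls made by this rank */

static void on_sigusr1(int sig) { (void)sig; dump_requested = 1; }

static void maybe_dump(void)
{
    if (dump_requested) {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        fprintf(stderr, "rank %d: %ld sends started, %ld receives posted\n",
                rank, sends_started, recvs_posted);
        dump_requested = 0;
    }
}

int MPI_Init(int *argc, char ***argv)
{
    signal(SIGUSR1, on_sigusr1);
    return PMPI_Init(argc, argv);
}

/* Note: MPI-3 and later declare buf as "const void *"; adjust to match
 * your mpi.h if the compiler complains about a conflicting declaration. */
int MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag,
             MPI_Comm comm)
{
    sends_started++;
    maybe_dump();
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Irecv(void *buf, int count, MPI_Datatype type, int source, int tag,
              MPI_Comm comm, MPI_Request *req)
{
    recvs_posted++;
    maybe_dump();
    return PMPI_Irecv(buf, count, type, source, tag, comm, req);
}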

At 2500 bytes, all messages will presumably be sent "eagerly" -- without waiting for the receiver to indicate that it's ready to receive that particular message. That would suggest the congestion, if any, is on the receiver side. Some kind of congestion could, I suppose, still occur and back up on the sender side.

On the other hand, I assume the memory imbalance we're talking about is rather severe; it would have to be much more than 2500 bytes to be noticeable, I would think. Is that really the situation you're imagining?

There are tracing tools to look at this sort of thing. The only one I have much familiarity with is Sun Studio / Sun HPC ClusterTools: free download, available on Solaris or Linux, SPARC or x64, and it works with Open MPI. You can see a timeline with message lines on it, which gives you an idea whether messages are being received/completed long after they were sent. Another interesting view is a plot over time of how many messages are in flight at any moment (including broken down by receiver). There are lots of similar tools out there: VampirTrace (tracing side only, you need to analyze the data yourself), Jumpshot, etc. Again, though, I question whether you're really backing up thousands or more of messages. (I'm assuming the memory imbalances are at least megabytes.)

Shaun Jackman wrote:

Each message is 2500 bytes, as I mentioned previously. In fact, each message is composed of one hundred 25-byte operations that have been queued up at the application level and sent with a single MPI_Send. It would depend on the nature of the application, of course, but is there any reason to believe that 100 individual MPI_Send calls would be any faster? Or is there a better way to queue up messages for a batch transmission?
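For concreteness, the batching looks roughly like this (a sketch with invented names; the real code differs, but the shape is the same): operations are copied into a buffer and one MPI_Send goes out when it fills.

#include <mpi.h>
#include <string.h>

#define OP_SIZE   25    /* bytes per operation                  */
#define BATCH_LEN 100   /* operations per message (2500 bytes)  */
#define OP_TAG    7     /* tag is arbitrary for this sketch     */

typedef struct {
    char buf[OP_SIZE * BATCH_LEN];
    int  count;                      /* operations queued so far */
} op_batch;

/* Queue one 25-byte operation; send the whole batch once it is full. */
static void queue_op(op_batch *b, const char op[OP_SIZE], int dest)
{
    memcpy(b->buf + b->count * OP_SIZE, op, OP_SIZE);
    if (++b->count == BATCH_LEN) {
        MPI_Send(b->buf, OP_SIZE * BATCH_LEN, MPI_BYTE, dest, OP_TAG,
                 MPI_COMM_WORLD);
        b->count = 0;
    }
}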

Common MPI wisdom would say that if your messages are only 25 bytes each, you did well to go to the pain of batching them up: the per-message overhead on 25-byte sends would be high. That said, I can think of counterarguments (e.g., having to wait a long time for the last few operations before a 100-message batch can be sent off), so your mileage may vary.
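Continuing the hypothetical op_batch sketch above: one way around waiting for those last few operations is to flush a partial batch explicitly at natural points (end of a phase, before a barrier, or on a timeout), e.g.

/* Send whatever has been queued so the last few operations are not
 * stuck waiting for the batch to fill up.  The receiver can use
 * MPI_Get_count to see how many bytes actually arrived. */
static void flush_batch(op_batch *b, int dest)
{
    if (b->count > 0) {
        MPI_Send(b->buf, b->count * OP_SIZE, MPI_BYTE, dest, OP_TAG,
                 MPI_COMM_WORLD);
        b->count = 0;
    }
}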
