Shaun Jackman wrote:
Eugene Loh wrote:
I'm no expert, but I think it's something like this:
1) If the messages are short, they're sent over to the receiver. If
the receiver does not expect them (no MPI_Irecv posted), it buffers
them up.
2) If the messages are long, only a little bit is sent over to the
receiver. The receiver will take in that little bit, but until an
MPI_Irecv is posted it will not signal the sender that any more can
be sent.
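
To see the two cases in practice, something like the rough sketch below works (the 2 KB / 2 MB sizes are just guesses on my part; the actual eager threshold is an Open MPI MCA parameter that differs per transport). Rank 1 deliberately delays posting its receives, so with typical settings the small MPI_Send on rank 0 returns almost immediately while the large one stalls until the receive is posted. Run it with something like "mpirun -np 2 ./a.out".

/* eager_vs_rendezvous.c -- a rough sketch, not a rigorous benchmark.
 * Rank 0 times MPI_Send for a small and a large message while rank 1
 * delays posting its receives.  With typical Open MPI settings the
 * small send returns almost at once (eager: data buffered at the
 * receiver), while the large send blocks until the matching receive
 * is posted (rendezvous).  The sizes below are assumptions; the real
 * eager limit is an MCA parameter and differs per transport. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void timed_send(void *buf, int count, int rank)
{
    if (rank == 0) {
        double t0 = MPI_Wtime();
        MPI_Send(buf, count, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        printf("MPI_Send of %d bytes returned after %.3f s\n",
               count, MPI_Wtime() - t0);
    } else if (rank == 1) {
        sleep(2);                     /* the receiver is "falling behind" */
        MPI_Recv(buf, count, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(2 * 1024 * 1024);
    timed_send(buf, 2 * 1024, rank);        /* likely below the eager limit */
    MPI_Barrier(MPI_COMM_WORLD);
    timed_send(buf, 2 * 1024 * 1024, rank); /* likely above the eager limit */

    free(buf);
    MPI_Finalize();
    return 0;
}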
Are these messages being sent over TCP between nodes? How long are
they?
Each message is 2500 bytes. In this particular case, there are 8
processes on one host and 8 more processes on another host. So, on the
same host the communication will be shared memory, and between hosts
it will be TCP.
From your description, I'm guessing that either...
one process is falling behind the rest for whatever reason, and it is
buffering up received messages that haven't yet been matched by an
MPI_Irecv.
or...
one process is falling behind, and the messages that the other
processes are sending to it are queuing up in their transmit buffers.
Can statistics about the number of buffered messages (either tx or rx)
be collected and reported by Open MPI? I suppose it would have to be a
snapshot in time, triggered either programmatically or by a special
signal such as SIGHUP or SIGUSR1.
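
As far as I know, Open MPI doesn't expose its internal queue depths on a signal, but a rough application-level approximation is possible: keep your own per-rank (ideally per-peer) send/receive counters and dump them when SIGUSR1 arrives. A minimal sketch, where the 60-iteration sleep loop is just a stand-in for the real communication loop:

#include <mpi.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t dump_requested = 0;
static long msgs_sent = 0, msgs_received = 0;  /* maintained by the app */

static void on_sigusr1(int sig) { (void)sig; dump_requested = 1; }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    signal(SIGUSR1, on_sigusr1);

    /* Stand-in for the application's communication loop: increment
     * msgs_sent next to each MPI_Send, msgs_received next to each
     * completed receive, and check the flag once per iteration.  The
     * handler only sets a flag because printing from a signal handler
     * is not async-signal-safe. */
    for (int i = 0; i < 60; ++i) {
        if (dump_requested) {
            dump_requested = 0;
            fprintf(stderr, "[rank %d] sent=%ld received=%ld\n",
                    rank, msgs_sent, msgs_received);
        }
        sleep(1);
    }

    MPI_Finalize();
    return 0;
}

Send the signal with "kill -USR1 <pid>" on the suspect rank (mpirun also forwards SIGUSR1/SIGUSR2 to the ranks, if I remember right). Comparing rank A's count of sends to B against rank B's count of receives from A gives the backlog between that pair.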
At 2500 bytes, all messages will presumably be sent "eagerly" -- without
waiting for the receiver to indicate that it's ready to receive that
particular message. This would suggest congestion, if any, is on the
receiver side. Some kind of congestion could, I suppose, still occur
and back up on the sender side.
On the other hand, I assume the memory imbalance we're talking about is
rather severe. It would have to be much more than 2500 bytes to be
noticeable, I would think. Is that really the situation you're imagining?
There are tracing tools to look at this sort of thing. The only one I
have much familiarity with is Sun Studio / Sun HPC ClusterTools. It is a
free download, available on Solaris or Linux, SPARC or x64, and it plays
with OMPI. You can see a timeline with message lines on it to give you
an idea of whether messages are being received/completed long after they
were sent. Another interesting view is a plot over time of how many
messages are in flight at any moment (including broken down by
receiver). There are lots of similar tools out there... VampirTrace
(tracing side only; you need to analyze the data yourself), Jumpshot,
etc. Again, though, there's a question in my mind whether you're really
backing up thousands of messages or more. (I'm assuming the memory
imbalances are at least megabytes.)
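
Those tracers generally sit on the PMPI profiling layer, and if all you want is the in-flight plot, a bare-bones hand-rolled interposer may be enough: write one timestamped line per send and per receive, then post-process the per-rank logs into a count of messages sent but not yet received over time. A sketch that only covers blocking MPI_Send/MPI_Recv (nonblocking and collective calls are ignored, and clocks across nodes are only approximately comparable):

/* mpi_log.c -- a bare-bones use of the MPI profiling (PMPI) layer:
 * wrap MPI_Send and MPI_Recv, write one timestamped line per call, and
 * post-process the per-rank logs into a count of messages sent but not
 * yet received over time.  Real tracers (VampirTrace, MPE/Jumpshot,
 * Sun HPC ClusterTools) do this far more completely.
 * Build it into the application, e.g.  mpicc app.c mpi_log.c  */
#include <mpi.h>
#include <stdio.h>

static FILE *logf;

static FILE *get_log(void)
{
    if (!logf) {
        int rank;
        char name[32];
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        snprintf(name, sizeof name, "mpilog.%d", rank);
        logf = fopen(name, "w");
    }
    return logf;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type, int dest,
             int tag, MPI_Comm comm)
{
    /* MPI_Wtime is not synchronized across nodes; cross-node deltas
     * are approximate. */
    fprintf(get_log(), "send %.6f dest=%d count=%d tag=%d\n",
            PMPI_Wtime(), dest, count, tag);
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Recv(void *buf, int count, MPI_Datatype type, int source, int tag,
             MPI_Comm comm, MPI_Status *status)
{
    int rc = PMPI_Recv(buf, count, type, source, tag, comm, status);
    fprintf(get_log(), "recv %.6f source=%d tag=%d\n",
            PMPI_Wtime(), source, tag);
    return rc;
}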
Shaun Jackman wrote:
Each message is 2500 bytes, as I mentioned previously. In fact, each
message is composed of one hundred 25-byte operations that have been
queued up at the application level and sent with a single MPI_Send. It
would depend on the nature of the application, of course, but is there
any reason to believe that 100 individual MPI_Send calls would be any
faster? Or is there a better way to queue up messages for a batch
transmission?
Common MPI wisdom would say that if your messages are 25 bytes each and
you have already gone to the trouble of batching them up, you are in
good shape: the overhead on 25-byte messages would be high. That said, I
can think of counterarguments (e.g., having to wait a long time for the
last few messages before you can send a 100-message batch off, or
something like that), so your mileage will vary.
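
On the "better way to queue up messages" question, the standard-MPI way to batch is MPI_Pack/MPI_Unpack (or a derived datatype) rather than hand-rolled copying; whether it beats what the application already does is something to measure. A sketch using the 100 x 25-byte numbers from this thread, run with "mpirun -np 2":

/* batch_sketch.c -- batch small operations with MPI_Pack / MPI_Unpack.
 * The 25-byte "operation" and the batch size of 100 mirror the numbers
 * in this thread; the payload is a stand-in. */
#include <mpi.h>
#include <string.h>

#define OPS_PER_BATCH 100
#define OP_BYTES      25

int main(int argc, char **argv)
{
    int rank, pos = 0;
    char op[OP_BYTES], buf[OPS_PER_BATCH * OP_BYTES + 1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Pack 100 operations into one contiguous buffer, send once. */
        for (int i = 0; i < OPS_PER_BATCH; ++i) {
            memset(op, i, sizeof op);            /* stand-in payload */
            MPI_Pack(op, OP_BYTES, MPI_BYTE, buf, (int)sizeof buf, &pos,
                     MPI_COMM_WORLD);
        }
        MPI_Send(buf, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive the batch and unpack it one operation at a time. */
        MPI_Status st;
        int nbytes;
        MPI_Recv(buf, (int)sizeof buf, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &st);
        MPI_Get_count(&st, MPI_PACKED, &nbytes);
        while (pos < nbytes) {
            MPI_Unpack(buf, nbytes, &pos, op, OP_BYTES, MPI_BYTE,
                       MPI_COMM_WORLD);
            /* handle one 25-byte operation here */
        }
    }

    MPI_Finalize();
    return 0;
}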