On Apr 14, 2009, at 12:02 PM, Shaun Jackman wrote:

Hi Eugene,

Eugene Loh wrote:
At 2500 bytes, all messages will presumably be sent "eagerly" -- that is, without waiting for the receiver to indicate that it is ready to receive that particular message. That would suggest any congestion is on the receiver side, though some kind of congestion could, I suppose, still occur and back up on the sender side.

Can anyone chime in as to what the message size limit is for an `eager' transmission?

On the other hand, I assume the memory imbalance we're talking about is rather severe. Much more than 2500 bytes to be noticeable, I would think. Is that really the situation you're imagining?

The memory imbalance is drastic. I'm expecting 2 GB of memory use per process. The well-behaved processes (13/16) use the expected amount of memory; the remaining misbehaving processes (3/16) use more than twice as much. The specifics vary from run to run, of course. So, yes, there are gigabytes of unexpected memory use to track down.

There are tracing tools for looking at this sort of thing. The only one I have much familiarity with is Sun Studio / Sun HPC ClusterTools. It's a free download, available on Solaris or Linux, SPARC or x64, and it plays with OMPI. You can see a timeline with message lines on it, which gives you an idea of whether messages are being received/completed long after they were sent. Another interesting view is a plot over time of how many messages are in flight at any moment (including as a function of receiver). There are lots of similar tools out there: VampirTrace (tracing side only; you need to analyze the data yourself), Jumpshot, etc. Again, though, I question whether you're really backing up thousands of messages or more. (I'm assuming the memory imbalances are at least megabytes.)

I'll check out Sun HPC ClusterTools. Thanks for the tip.

Assuming the problem is congestion and that messages are backing up, is there an accepted method of dealing with this situation? It seems to me the general approach would be something like:

if (number of outstanding messages > high water mark)
   wait until (number of outstanding messages < low water mark)

where I suppose the `number of outstanding messages' is defined as the number of messages that have been sent and not yet received by the other side. Is there a way to get this number from MPI without having to code it at the application level?
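
For concreteness (this sketch is mine, not part of the original exchange): here is roughly what that high-water-mark idea could look like if coded at the application level. MAX_INFLIGHT and throttled_send are made-up names. MPI_Issend is used deliberately: a synchronous send does not complete until the matching receive has started, so the number of incomplete requests approximates "sent but not yet received by the other side", which an ordinary eager send would not give you.

#include <mpi.h>

#define MAX_INFLIGHT 64                 /* high-water mark; arbitrary choice */

static MPI_Request reqs[MAX_INFLIGHT];  /* sends still in flight */
static int         nreq = 0;

/* Post one message, first draining if too many sends are outstanding.
 * Note: buf must not be reused until its request eventually completes. */
void throttled_send(void *buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm)
{
    if (nreq == MAX_INFLIGHT) {
        int ndone, i, j = 0;
        int indices[MAX_INFLIGHT];
        /* Block until at least one earlier send has been matched. */
        MPI_Waitsome(nreq, reqs, &ndone, indices, MPI_STATUSES_IGNORE);
        /* MPI_Waitsome sets completed requests to MPI_REQUEST_NULL;
         * compact the array to keep only the live ones. */
        for (i = 0; i < nreq; ++i)
            if (reqs[i] != MPI_REQUEST_NULL)
                reqs[j++] = reqs[i];
        nreq = j;
    }
    MPI_Issend(buf, count, type, dest, tag, comm, &reqs[nreq++]);
}

A low-water-mark variant would simply keep calling MPI_Waitsome until nreq falls below the lower threshold.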


It isn't quite that simple. The problem is that these are typically "unexpected" messages - i.e., some processes are running faster than this one, so this one keeps falling behind, which means it has to "stockpile" messages for later processing.

It is impossible to predict who is going to send the next unexpected message, so attempting to say "wait" means sending a broadcast to all procs - a very expensive operation, especially since it can be any number of procs that feel overloaded.

We had the same problem when working with collectives, where memory was being overwhelmed by stockpiled messages. The solution in that case (available in the 1.3 series) was the "sync" collective system. It monitors how many times a collective that can cause this type of problem has been executed, and then inserts an MPI_Barrier to give the processes time to "drain" all pending messages. You can control how frequently this happens, and whether the barrier occurs before or after the specified number of operations.

If you are using collectives, or can reframe the algorithm so that you do, you might give that a try - it has solved similar problems here. If it helps, then you should "tune" it by increasing the specified number (thus decreasing the frequency of the inserted barrier) until you find a value that works for you - this minimizes the performance impact of the inserted barriers on your job.
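
As a concrete illustration (my addition, and the parameter name is from memory, so verify it with ompi_info --param coll sync on your installation): the sync collective is enabled and tuned via MCA parameters on the mpirun command line, for example

    mpirun --mca coll_sync_barrier_before 1000 -np 16 ./your_app

where ./your_app is just a placeholder for your own executable and 1000 is the number of collective operations between inserted barriers; a larger number means fewer barriers.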

If you are not using collectives and/or cannot do so, then perhaps we need to consider a similar approach for simple send/recv operations. It would probably have to be done inside the MPI library, but it may be hard to implement. The collective version works because we know everyone has to participate in it. That isn't true for send/recv, so the barrier approach won't work there - we would need some other method of stopping procs to let things catch up.

Not sure what that would be offhand....but perhaps some other wiser head will think of something!
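
One blunt application-level workaround, offered here as an editorial sketch rather than anything from the thread: make every Nth point-to-point send synchronous. MPI_Ssend does not complete until the matching receive has begun, so a fast sender is periodically held back to the receiver's pace. SYNC_EVERY and paced_send are made-up names.

#include <mpi.h>

#define SYNC_EVERY 100      /* arbitrary: force a rendezvous on every 100th send */

void paced_send(void *buf, int count, MPI_Datatype type,
                int dest, int tag, MPI_Comm comm)
{
    static int n = 0;       /* per-process counter (not per-destination) */
    if (++n % SYNC_EVERY == 0)
        MPI_Ssend(buf, count, type, dest, tag, comm);  /* waits for the matching recv to start */
    else
        MPI_Send(buf, count, type, dest, tag, comm);   /* small messages may complete eagerly */
}

The cost is an occasional full round-trip stall once per SYNC_EVERY messages, so the knob trades sender throughput for a bound on receiver-side buildup.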

HTH
Ralph


Thanks,
Shaun