On Apr 14, 2009, at 12:02 PM, Shaun Jackman wrote:

Hi Eugene,

Eugene Loh wrote:
At 2500 bytes, all messages will presumably be sent "eagerly" -- that is, without waiting for the receiver to indicate that it is ready to receive that particular message. That would suggest any congestion is on the receiver side, though some kind of congestion could, I suppose, still occur and back up on the sender side.

Can anyone chime in as to what the message size limit is for an `eager' transmission?

On the other hand, I assume the memory imbalance we're talking about is rather severe. Much more than 2500 bytes to be noticeable, I would think. Is that really the situation you're imagining?

The memory imbalance is drastic. I'm expecting 2 GB of memory use per process. The well-behaved processes (13/16) use the expected amount of memory; the remaining misbehaving processes (3/16) use more than twice as much. The specifics vary from run to run, of course. So, yes, there are gigabytes of unexpected memory use to track down.

There are tracing tools for looking at this sort of thing. The only one I have much familiarity with is Sun Studio / Sun HPC ClusterTools. It's a free download, available on Solaris or Linux, SPARC or x64, and it plays with OMPI. You can see a timeline with message lines on it, which gives you an idea of whether messages are being received/completed long after they were sent. Another interesting view is a plot over time of how many messages are in flight at any moment (including as a function of receiver). There are lots of similar tools out there: VampirTrace (tracing side only; you need to analyze the data yourself), Jumpshot, etc. Again, though, I question whether you're really backing up thousands of messages or more. (I'm assuming the memory imbalances are at least megabytes.)

I'll check out Sun HPC ClusterTools. Thanks for the tip.

Assuming the problem is congestion and that messages are backing up, is there an accepted method of dealing with this situation? It seems to me the general approach would be something like:

if (number of outstanding messages > high water mark)
   wait until (number of outstanding messages < low water mark)

where I suppose the `number of outstanding messages' is defined as the number of messages that have been sent and not yet received by the other side. Is there a way to get this number from MPI without having to code it at the application level?
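
For concreteness (this sketch is mine, not part of the original exchange): here is roughly what that high-water-mark idea could look like if coded at the application level. MAX_INFLIGHT and throttled_send are made-up names. MPI_Issend is used deliberately: a synchronous send does not complete until the matching receive has started, so the number of incomplete requests approximates "sent but not yet received by the other side", which an ordinary eager send would not give you.

#include <mpi.h>

#define MAX_INFLIGHT 64                 /* high-water mark; arbitrary choice */

static MPI_Request reqs[MAX_INFLIGHT];  /* sends still in flight */
static int         nreq = 0;

/* Post one message, first draining if too many sends are outstanding.
 * Note: buf must not be reused until its request eventually completes. */
void throttled_send(void *buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm)
{
    if (nreq == MAX_INFLIGHT) {
        int ndone, i, j = 0;
        int indices[MAX_INFLIGHT];
        /* Block until at least one earlier send has been matched. */
        MPI_Waitsome(nreq, reqs, &ndone, indices, MPI_STATUSES_IGNORE);
        /* MPI_Waitsome sets completed requests to MPI_REQUEST_NULL;
         * compact the array to keep only the live ones. */
        for (i = 0; i < nreq; ++i)
            if (reqs[i] != MPI_REQUEST_NULL)
                reqs[j++] = reqs[i];
        nreq = j;
    }
    MPI_Issend(buf, count, type, dest, tag, comm, &reqs[nreq++]);
}

A low-water-mark variant would simply keep calling MPI_Waitsome until nreq falls below the lower threshold.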


It isn't quite that simple. The problem is that these are typically "unexpected" messages - i.e., some processes are running faster than this one, so this one keeps falling behind, which means it has to "stockpile" messages for later processing.

It is impossible to predict who is going to send the next unexpected message, so attempting to say "wait" means sending a broadcast to all procs - a very expensive operation, especially since it can be any number of procs that feel overloaded.

We had the same problem when working with collectives, where memory was being overwhelmed by stockpiled messages. The solution in that case (available in the 1.3 series) was the "sync" collective system. It monitors how many times a collective that can cause this type of problem has been executed, and then inserts an MPI_Barrier to give the processes time to "drain" all pending messages. You can control how frequently this happens, and whether the barrier occurs before or after the specified number of operations.

If you are using collectives, or can reframe the algorithm so that you do, you might give that a try - it has solved similar problems here. If it helps, then you should "tune" it by increasing the specified number (thus decreasing the frequency of the inserted barrier) until you find a value that works for you - this minimizes the performance impact of the inserted barriers on your job.
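
As a concrete illustration (my addition, and the parameter name is from memory, so verify it with ompi_info --param coll sync on your installation): the sync collective is enabled and tuned via MCA parameters on the mpirun command line, for example

    mpirun --mca coll_sync_barrier_before 1000 -np 16 ./your_app

where ./your_app is just a placeholder for your own executable and 1000 is the number of collective operations between inserted barriers; a larger number means fewer barriers.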

If you are not using collectives and/or cannot do so, then perhaps we need to consider a similar approach for simple send/recv operations. It would probably have to be done inside the MPI library, but it may be hard to implement. The collective version works because we know everyone has to participate in it. That isn't true for send/recv, so the barrier approach won't work there - we would need some other method of stopping procs to let things catch up.

Not sure what that would be offhand....but perhaps some other wiser head will think of something!
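
One blunt application-level workaround, offered here as an editorial sketch rather than anything from the thread: make every Nth point-to-point send synchronous. MPI_Ssend does not complete until the matching receive has begun, so a fast sender is periodically held back to the receiver's pace. SYNC_EVERY and paced_send are made-up names.

#include <mpi.h>

#define SYNC_EVERY 100      /* arbitrary: force a rendezvous on every 100th send */

void paced_send(void *buf, int count, MPI_Datatype type,
                int dest, int tag, MPI_Comm comm)
{
    static int n = 0;       /* per-process counter (not per-destination) */
    if (++n % SYNC_EVERY == 0)
        MPI_Ssend(buf, count, type, dest, tag, comm);  /* waits for the matching recv to start */
    else
        MPI_Send(buf, count, type, dest, tag, comm);   /* small messages may complete eagerly */
}

The cost is an occasional full round-trip stall once per SYNC_EVERY messages, so the knob trades sender throughput for a bound on receiver-side buildup.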

HTH
Ralph


Thanks,
Shaun