On Apr 14, 2009, at 12:02 PM, Shaun Jackman wrote:
Hi Eugene,
Eugene Loh wrote:
At 2500 bytes, all messages will presumably be sent "eagerly" --
without waiting for the receiver to indicate that it's ready to
receive that particular message. This would suggest congestion, if
any, is on the receiver side. Some kind of congestion could, I
suppose, still occur and back up on the sender side.
Can anyone chime in as to what the message size limit is for an
`eager' transmission?
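For reference, the eager limit in Open MPI is per transport (BTL) rather
than a single number, and it is exposed as MCA parameters. On a 1.3-era
installation something like the following should list them (the parameter
names, e.g. btl_sm_eager_limit or btl_tcp_eager_limit, depend on which
BTLs are built, so treat this as a sketch):

  ompi_info --all | grep eager_limit

The values are in bytes; messages below the limit for the transport in
use are sent eagerly.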
On the other hand, I assume the memory imbalance we're talking
about is rather severe. Much more than 2500 bytes to be
noticeable, I would think. Is that really the situation you're
imagining?
The memory imbalance is drastic. I'm expecting 2 GB of memory use
per process. The behaving processes (13/16) use the expected amount
of memory; the remaining misbehaving processes (3/16) use more than
twice as much. The specifics vary from run to run, of course.
So, yes, there are gigabytes of unexpected memory use to track down.
There are tracing tools to look at this sort of thing. The only
one I have much familiarity with is Sun Studio / Sun HPC
ClusterTools. Free download, available on Solaris or Linux, SPARC
or x64, plays with OMPI. You can see a timeline with message lines
on it to give you an idea if messages are being received/completed
long after they were sent. Another interesting view is
constructing a plot vs time of how many messages are in-flight at
any moment (including as a function of receiver). Lots of similar
tools out there... VampirTrace (tracing side only, need to analyze
the data), Jumpshot, etc. Again, though, there's a question in my
mind whether you're really backing up thousands of messages or more.
(I'm assuming the memory imbalances are at least megabytes.)
I'll check out Sun HPC ClusterTools. Thanks for the tip.
Assuming the problem is congestion and that messages are backing up,
is there an accepted method of dealing with this situation? It seems
to me the general approach would be
if (number of outstanding messages > high water mark)
    wait until (number of outstanding messages < low water mark)
where I suppose the `number of outstanding messages' is defined as
the number of messages that have been sent and not yet received by
the other side. Is there a way to get this number from MPI without
having to code it at the application level?
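MPI does not expose such a count directly, so here is a rough
application-level sketch of the water-mark idea above. The helper names
and water marks are illustrative, not anything built into the library;
MPI_Issend is used because it does not complete until the matching
receive has started, so the number of incomplete requests approximates
how many of this sender's messages the receiver has not yet picked up.

/* Sketch only: throttle a sender by capping its outstanding
 * synchronous sends. */
#include <mpi.h>

#define HIGH_WATER 64   /* stop posting new sends above this count  */
#define LOW_WATER  16   /* resume once we have drained down to this */

static MPI_Request reqs[HIGH_WATER];
static int nreq = 0;

static void drain(int target)
{
    /* Block until at most 'target' sends remain outstanding. */
    while (nreq > target) {
        int idx;
        MPI_Waitany(nreq, reqs, &idx, MPI_STATUS_IGNORE);
        reqs[idx] = reqs[--nreq];   /* compact the request array */
    }
}

/* Caller must leave 'buf' untouched until the send completes. */
void throttled_send(void *buf, int count, MPI_Datatype type,
                    int dest, int tag, MPI_Comm comm)
{
    if (nreq >= HIGH_WATER)
        drain(LOW_WATER);
    MPI_Issend(buf, count, type, dest, tag, comm, &reqs[nreq++]);
}

With plain MPI_Isend the requests would complete as soon as the data left
the sender (eagerly, for small messages), so the cap would not reflect the
backlog on the receiver. A final drain(0) is also needed before
MPI_Finalize, or before reusing the buffers.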
It isn't quite that simple. The problem is that these are typically
"unexpected" messages - i.e., some processes are running faster than
this one, so this one keeps falling behind, which means it has to
"stockpile" messages for later processing.
It is impossible to predict who is going to send the next unexpected
message, so attempting to say "wait" means sending a broadcast to all
procs - a very expensive operation, especially since it can be any
number of procs that feel overloaded.
We had the same problem when working with collectives, where memory
was being overwhelmed by stockpiled messages. The solution (available
in the 1.3 series) in that case was to use the "sync" collective
system. This counts how many times a collective that can cause this
type of problem has been executed, and then inserts an MPI_Barrier to
give the processes time to "drain" all pending messages. You can
control how frequently this happens, and whether the barrier occurs
before or after the specified number of operations.
If you are using collectives, or can reframe the algorithm so you do,
you might give that a try - it has solved similar problems here. If
it helps, then you should "tune" it by increasing the provided number
(thus decreasing the frequency of the inserted barrier) until you find
a value that works for you - this minimizes the performance impact of
the inserted barriers on your job.
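For reference, that number is exposed through MCA parameters of the
"sync" collective component in the 1.3 series; the parameter names below
are from memory (verify with ompi_info --param coll sync on your
installation), but the idea is something like:

  mpirun --mca coll_sync_barrier_before 1000 -np 16 ./your_app

Here 1000 and ./your_app are placeholders; coll_sync_barrier_after is the
corresponding knob for putting the barrier after every Nth operation, and
raising the value makes the inserted MPI_Barrier less frequent.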
If you are not using collectives and/or cannot do so, then perhaps we
need to consider a similar approach for simple send/recv operations.
It would probably have to be done inside the MPI library, but may be
hard to implement. The collective works because we know everyone has
to be in it. That isn't true for send/recv, so the barrier approach
won't work there - we would need some other method of stopping procs
to allow things to catch up.
Not sure what that would be offhand....but perhaps some other wiser
head will think of something!
HTH
Ralph
Thanks,
Shaun