On Apr 14, 2009, at 12:02 PM, Shaun Jackman wrote:
> Assuming the problem is congestion and that messages are backing up, ...
I'd check this assumption first before going too far down that path.
You might be able to instrument your code to spit out sends and
receives. VampirTrace and PERUSE instrumentation is already in OMPI,
but any of these instrumentation approaches requires that you then
analyze the data you generate to see how many messages get caught "in
flight" at any time. Again, there are the various tools I mentioned
earlier. If I understand correctly, the problem you're looking for is
*millions* of messages backing up (in order to induce memory imbalances
of Gbytes). Should be easy to spot.
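
For what it's worth, here is a rough sketch of that kind of analysis in
Python. It assumes you have logged every send and receive as a
(timestamp, kind, src, dst) tuple, where src and dst are the sender and
receiver ranks of the message. That log format and the name
peak_in_flight are made up for illustration; they are not something
OMPI, VampirTrace, or PERUSE produces. It also assumes timestamps from
different ranks are comparable, which clock skew may well break.

```python
from collections import defaultdict


def peak_in_flight(events):
    """Return (overall_peak, per_destination_peak) for a message log.

    events: iterable of (timestamp, kind, src, dst) tuples, where kind is
    "send" or "recv". This tuple layout is a hypothetical log format.
    """
    in_flight = defaultdict(int)  # messages sent to dst but not yet received
    peak = defaultdict(int)       # highest in-flight count seen per dst
    total = 0
    total_peak = 0

    # Replay the log in timestamp order (assumes comparable clocks).
    for _, kind, _src, dst in sorted(events, key=lambda e: e[0]):
        delta = 1 if kind == "send" else -1
        in_flight[dst] += delta
        total += delta
        peak[dst] = max(peak[dst], in_flight[dst])
        total_peak = max(total_peak, total)

    return total_peak, dict(peak)


if __name__ == "__main__":
    log = [
        (0.0, "send", 0, 1),
        (0.1, "send", 0, 1),
        (0.2, "recv", 0, 1),
        (0.3, "recv", 0, 1),
    ]
    print(peak_in_flight(log))
```

A per-destination peak in the millions would support the congestion
theory; a small peak would suggest looking elsewhere.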
Maybe the real tool to use is some memory-tracing tool. I don't know
much about these. Sun Studio? Valgrind? Sorry, but I'm really
clueless about what tools to use there.