On Apr 14, 2009, at 12:02 PM, Shaun Jackman wrote:

Assuming the problem is congestion and that messages are backing up, ...

I'd check this assumption first before going too far down that path. You might be able to instrument your code to spit out sends and receives. VampirTrace and PERUSE instrumentation is already in OMPI, but any of these instrumentation approaches requires that you then analyze the data you generate to see how many messages are "in flight" at any time. Again, there are the various tools I mentioned earlier. If I understand correctly, the problem you're looking for is *millions* of messages backing up (enough to induce memory imbalances of Gbytes). That should be easy to spot.
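For what it's worth, the analysis step could be as simple as the sketch below. It assumes you have already dumped your sends and receives to some log of (timestamp, kind, source rank, destination rank) records; that record layout and the function name `max_in_flight` are made up for illustration and are not the output format of VampirTrace, PERUSE, or any other tool. It just walks the events in time order and tracks how many messages have been sent but not yet received.

```python
from collections import Counter


def max_in_flight(events):
    """Return (peak_total, peak_per_dst) for a hypothetical event log.

    events: iterable of (timestamp, kind, src, dst) tuples, where kind is
    "send" or "recv". This layout is an assumption for the sketch, not a
    format produced by any particular tracing tool.
    """
    in_flight = Counter()     # messages sent to each rank, not yet received
    peak_per_dst = Counter()  # highest backlog seen for each rank
    total = 0
    peak_total = 0

    # Process events in timestamp order (ties keep their input order).
    for _ts, kind, _src, dst in sorted(events, key=lambda e: e[0]):
        if kind == "send":
            in_flight[dst] += 1
            total += 1
        elif kind == "recv":
            in_flight[dst] -= 1
            total -= 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        peak_total = max(peak_total, total)
        peak_per_dst[dst] = max(peak_per_dst[dst], in_flight[dst])

    return peak_total, dict(peak_per_dst)


if __name__ == "__main__":
    # Tiny made-up log: rank 1 falls two messages behind at its worst.
    log = [
        (0.0, "send", 0, 1),
        (0.1, "send", 0, 1),
        (0.2, "recv", 0, 1),
        (0.3, "send", 2, 1),
        (0.4, "recv", 0, 1),
        (0.5, "recv", 2, 1),
    ]
    print(max_in_flight(log))
```

If congestion really is the problem, the peak for the overloaded ranks should be in the millions rather than a handful. Note that timestamps from different nodes may not be synchronized, so small negative or off-by-a-few counts would not mean much.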

Maybe the real tool to use is some memory-tracing tool. I don't know much about these. Sun Studio? Valgrind? Sorry, but I'm really clueless about what tools to use there.
