On Wed, 2009-11-25 at 12:36 +0100, Atle Rudshaug wrote:
> I got a similar error when using non-blocking communication on large
> datasets. I could not figure out why this was happening, since it seemed
> sort of random. I eventually bypassed the problem by switching to
> blocking communication, which felt kind of sad... If anyone knows if this
> is a bug in OpenMPI or connected to hardware somehow, please share.

You could easily be running out of memory on one node by saturating it
with messages, all of which may need to be buffered. Have you checked
the offending nodes for messages from the OOM killer?

Ashley,

-- 
Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
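
P.S. On the OOM killer check suggested above: a rough sketch of how one might scan a node's kernel log for it. The function name `find_oom_lines` and the two search phrases are assumptions for illustration; the phrases are ones Linux kernels commonly log when the OOM killer fires, but the exact wording varies between kernel versions, so treat any result as a hint rather than proof.

```python
import re

# Assumed patterns: phrases Linux kernels typically log on OOM events.
# The wording differs between kernel versions, so adjust as needed.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the lines of log_text that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]
```

One would feed it the text of the kernel log from each suspect node (for example the captured output of `dmesg`) and look at whether the timestamps of any hits line up with the job failures.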