On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
> On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
> > On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
> >> The attached code is an example where openmpi/1.3.2 will lock up if
> >> run on 48 cores over IB (4 cores per node).
> >> The code loops over recv from all processors on rank 0 and sends from
> >> all other ranks. As far as I know this should work, and I can't see
> >> why not.
> >> Note: yes, I know we can do the same thing with a gather; this is a
> >> simple case to demonstrate the issue.
> >> Note that if I increase the openib eager limit, the program runs,
> >> which normally means improper MPI, but I can't figure out on my own
> >> what the problem with this code is.
> >
> > What are you increasing the eager limit from and to?
> 
> To the same value as ethernet on our system:
> mpirun --mca btl_openib_eager_limit 655360 --mca btl_openib_max_send_size 655360 ./a.out
> 
> Huge values compared to the defaults, but it works.

My understanding of the code is that each message will be 256k long and
the code pretty much guarantees that at some point there will be 46
messages in the queue in front of the one you are looking to receive,
which makes a total of 11.5MB (46 x 256k), slightly less if you take
shared memory into account.

If the MPI_SEND isn't blocking then each rank will send 50 messages to
rank zero and you'll have around 2000 messages and 500MB of data being
received, with the message you want being somewhere towards the end of
the queue.
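
As a rough check of the two figures above, here is a minimal sketch of the
arithmetic. It assumes "256k" means 256 * 1024 bytes, and the function and
constant names are made up for illustration; they are not taken from the
attached code.

```python
# Rough check of the backlog sizes quoted above.
# Assumption: "256k" is taken as 256 * 1024 bytes; all names are illustrative.
MESSAGE_BYTES = 256 * 1024


def backlog_mib(n_messages, message_bytes=MESSAGE_BYTES):
    """Total size in MiB of n_messages messages of message_bytes each."""
    return n_messages * message_bytes / (1024 * 1024)


# 46 messages queued ahead of the one rank 0 is waiting for.
print(f"46 messages:   {backlog_mib(46):.1f} MiB")
# Around 2000 messages if every send completes without blocking.
print(f"2000 messages: {backlog_mib(2000):.1f} MiB")
```

Either total is orders of magnitude larger than a single 64k eager-limit
sized buffer, which is the comparison that matters here.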

These numbers are far from huge, but compared to an eager limit of 64k
they aren't small either.

I suspect the eager limit is being reached on COMM_WORLD rank 0 and it's
not pulling any more messages off the network until some of the existing
ones leave the queue. They never will, because the message being waited
for is one that's stuck on the network.  As I say, the message queue for
rank 0 when it's deadlocked would be interesting to look at.
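
To make the suspected mechanism concrete, here is a toy model of it. This is
only a sketch under my own assumptions, not how Open MPI or the openib BTL
actually does flow control: the "wire" is a single FIFO, the buffer is
counted in message slots rather than bytes, and the function name
completes() is hypothetical.

```python
from collections import deque


def completes(arrival_order, recv_order, buffer_slots):
    """Toy model: True if every receive finishes, False if rank 0 stalls.

    arrival_order: source ranks in the order their messages reach rank 0.
    recv_order:    source ranks in the order rank 0 posts its receives.
    buffer_slots:  how many unexpected messages rank 0 can hold (assumed).
    """
    wire = deque(arrival_order)   # messages still "on the network"
    unexpected = []               # messages buffered ahead of their receive
    for wanted in recv_order:
        if wanted in unexpected:
            unexpected.remove(wanted)   # already buffered, receive completes
            continue
        while True:
            if not wire:
                return False            # nothing left that could match
            head = wire.popleft()
            if head == wanted:
                break                   # matches the posted receive
            if len(unexpected) >= buffer_slots:
                return False            # buffer full, wanted message is stuck
            unexpected.append(head)     # buffer it as an unexpected message
    return True


# Four senders, three rounds; rank 0 receives from 1, 2, 3, 4 in turn,
# but the messages happen to arrive highest rank first.
recv_order = [1, 2, 3, 4] * 3
arrival_order = [4] * 3 + [3] * 3 + [2] * 3 + [1] * 3

print(completes(arrival_order, recv_order, buffer_slots=4))   # small buffer
print(completes(arrival_order, recv_order, buffer_slots=9))   # large buffer
```

In this model the small buffer stalls and the large one completes, which is
the same shape as the behaviour seen when the eager limit is raised.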

In summary, this code makes heavy use of unexpected messages and network
buffering, so it's not surprising to me that it only works with eager
limits set fairly high.

Ashley,
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
