Ashley Pittman wrote:
On Wed, 2009-12-02 at 13:11 -0500, Brock Palen wrote:
On Dec 1, 2009, at 11:15 AM, Ashley Pittman wrote:
On Tue, 2009-12-01 at 10:46 -0500, Brock Palen wrote:
The attached code is an example where openmpi/1.3.2 will lock up if
run on 48 cores over IB (4 cores per node).
The code loops over receives from all processors on rank 0 and sends
from all other ranks; as far as I know this should work, and I can't
see why it wouldn't.
Note, yes, I know we could do the same thing with a gather; this is a
simple case to demonstrate the issue.
Note that if I increase the openib eager limit, the program runs,
which normally indicates improper MPI, but I can't figure out the
problem with this code on my own.
What are you increasing the eager limit from and to?
The same value as Ethernet on our system:
mpirun --mca btl_openib_eager_limit 655360 --mca
btl_openib_max_send_size 655360 ./a.out
These are huge values compared to the defaults, but it works.
My understanding of the code is that each message will be 256 KB long.
Yes. Brock's Fortran code has each nonzero rank send 50 messages, each
256 KB, via standard sends to rank 0. Rank 0 uses standard receives for
them all, pulling in all 50 messages in order from rank 1, then from
rank 2, and so on.
http://www.open-mpi.org/community/lists/users/2009/12/11311.php
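For reference, a minimal C sketch of that pattern (Brock's original is
Fortran; the element type, message tag, and buffer contents here are
assumptions for illustration) would look roughly like this:

    #include <mpi.h>
    #include <stdlib.h>

    #define NMSG  50
    #define COUNT 32768   /* 32768 doubles * 8 bytes = 256 KB per message */

    int main(int argc, char **argv)
    {
        int rank, nprocs, src, m;
        double *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        buf = calloc(COUNT, sizeof(double));

        if (rank == 0) {
            /* Rank 0 drains all 50 messages from rank 1, then rank 2, ... */
            for (src = 1; src < nprocs; src++)
                for (m = 0; m < NMSG; m++)
                    MPI_Recv(buf, COUNT, MPI_DOUBLE, src, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            /* Every other rank pushes 50 standard-mode sends at rank 0. */
            for (m = 0; m < NMSG; m++)
                MPI_Send(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

Because these are standard-mode sends, the program may not assume any
particular amount of internal buffering, which is why behaviour can
change once the message size crosses the eager limit.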
John Cary sent out a C++ code on this same e-mail thread. It sends
256*8 = 2048-byte messages. Each nonzero rank sends one message, and
rank 0 receives them in rank order. Then there is a barrier. The
program iterates on this pattern.
http://www.open-mpi.org/community/lists/users/2009/12/11348.php
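Again only as a sketch, that pattern rendered in C (John Cary's
original is C++; the iteration count and message tag here are
assumptions) looks roughly like:

    #include <mpi.h>

    #define COUNT 256    /* 256 doubles * 8 bytes = 2048 bytes per message */
    #define NITER 1000   /* number of iterations is an assumption */

    int main(int argc, char **argv)
    {
        int rank, nprocs, iter, src;
        double buf[COUNT] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        for (iter = 0; iter < NITER; iter++) {
            if (rank == 0) {
                /* Receive one message from each nonzero rank, in rank order. */
                for (src = 1; src < nprocs; src++)
                    MPI_Recv(buf, COUNT, MPI_DOUBLE, src, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Send(buf, COUNT, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
            /* Everyone synchronizes before the next iteration. */
            MPI_Barrier(MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Note that these 2048-byte messages are far smaller than the 256 KB
messages in Brock's code, and the barrier bounds how many can be in
flight per iteration.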
I can imagine the two programs are illustrating two different problems.