Is this a TCP-based cluster?
If so, do you have multiple IP addresses on your frontend machine?
Check out these two FAQ entries to see if they help:
http://www.open-mpi.org/faq/?category=tcp#tcp-routability
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
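For example, if the frontend has multiple interfaces, you can tell the TCP BTL which one(s) to use via the btl_tcp_if_include MCA parameter (the interface name below is just an illustration; use whichever interface on the frontend actually routes to the compute nodes):

  mpirun --mca btl_tcp_if_include eth1 -np 3 -H frontend,compute-0-0,compute-0-1 ./test1

There is also a btl_tcp_if_exclude parameter if it is easier to name the interface you want Open MPI to avoid.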
On Mar 21, 2007, at 4:51 PM, tim gunter wrote:
I am experiencing some issues with Open MPI 1.2 running on a Rocks 4.2.1 cluster (the issues also appear with Open MPI 1.1.5 and 1.1.4).
When I run my program with the frontend in the list of nodes, the processes deadlock.
When I run it without the frontend in the list of nodes, they run to completion.
The simplest test program that triggers this (test1.c, sketched below) calls MPI_Init, then MPI_Barrier, then MPI_Finalize.
So the following deadlocks:
/users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test1
host:compute-0-1.local made it past the barrier, ret:0
mpirun: killing job...
mpirun noticed that job rank 0 with PID 15384 on node frontend
exited on signal 15 (Terminated).
2 additional processes aborted (not shown)
This runs to completion:
/users/gunter $ mpirun -np 3 -H compute-0-0,compute-0-1,compute-0-2 ./test1
host:compute-0-1.local made it past the barrier, ret:0
host:compute-0-0.local made it past the barrier, ret:0
host:compute-0-2.local made it past the barrier, ret:0
If I have the compute nodes send a message to the frontend prior to the barrier, it runs to completion (a sketch of test2.c follows the output below):
/users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 0
host: frontend.domain node: 0 is the master
host: compute-0-0.local node: 1 sent: 1 to: 0
host: compute-0-1.local node: 2 sent: 2 to: 0
host: frontend.domain node: 0 recv: 1 from: 1
host: frontend.domain node: 0 recv: 2 from: 2
host: frontend.domain made it past the barrier, ret:0
host: compute-0-1.local made it past the barrier, ret:0
host: compute-0-0.local made it past the barrier, ret:0
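For reference, a hypothetical sketch of test2.c reconstructed from the output above (the attachment is not reproduced here): argv[1] picks the master rank, and every other rank sends its own rank number to the master before the barrier.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char host[MPI_MAX_PROCESSOR_NAME];
    int len, rank, size, master, i, val, ret;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    master = (argc > 1) ? atoi(argv[1]) : 0;   /* master rank chosen on the command line */

    if (rank == master) {
        printf("host: %s node: %d is the master\n", host, rank);
        for (i = 0; i < size - 1; ++i) {       /* one message from every other rank */
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
            printf("host: %s node: %d recv: %d from: %d\n",
                   host, rank, val, status.MPI_SOURCE);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, master, 0, MPI_COMM_WORLD);
        printf("host: %s node: %d sent: %d to: %d\n", host, rank, rank, master);
    }

    ret = MPI_Barrier(MPI_COMM_WORLD);
    printf("host: %s made it past the barrier, ret:%d\n", host, ret);
    MPI_Finalize();
    return 0;
}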
If I have a different node function as the master, it deadlocks:
/users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 1
host: compute-0-0.local node: 1 is the master
host: compute-0-1.local node: 2 sent: 2 to: 1
mpirun: killing job...
mpirun noticed that job rank 0 with PID 15411 on node frontend
exited on signal 15 (Terminated).
2 additional processes aborted (not shown)
How is it that, in the first example, one node makes it past the barrier while the rest deadlock?
Both programs run to completion under two other MPI implementations.
Is there something misconfigured on my cluster, or is this potentially an Open MPI bug?
What is the best way to debug this?
Any help would be appreciated!
--tim
<test1.c>
<test2.c>
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems