Is this a TCP-based cluster?
If so, do you have multiple IP addresses on your frontend machine?
Check out these two FAQ entries to see if they help:
http://www.open-mpi.org/faq/?category=tcp#tcp-routability
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
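For example, if the frontend has multiple interfaces, you can tell the TCP BTL which one(s) to use via the btl_tcp_if_include MCA parameter (the interface name below is just an illustration; use whichever interface on the frontend actually routes to the compute nodes):

  mpirun --mca btl_tcp_if_include eth1 -np 3 -H frontend,compute-0-0,compute-0-1 ./test1

There is also a btl_tcp_if_exclude parameter if it is easier to name the interface you want Open MPI to avoid.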
On Mar 21, 2007, at 4:51 PM, tim gunter wrote:
I am experiencing some issues with Open MPI 1.2 running on a Rocks 4.2.1 cluster (the issues also appear with Open MPI 1.1.5 and 1.1.4).
When I run my program with the frontend in the list of nodes, the processes deadlock.
When I run it without the frontend in the list of nodes, they run to completion.
The simplest test program that triggers this (test1.c, sketched below) calls MPI_Init, then MPI_Barrier, then MPI_Finalize.
So the following deadlocks:
/users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test1
host:compute-0-1.local made it past the barrier, ret:0
mpirun: killing job...
mpirun noticed that job rank 0 with PID 15384 on node frontend
exited on signal 15 (Terminated).
2 additional processes aborted (not shown)
This runs to completion:
/users/gunter $ mpirun -np 3 -H compute-0-0,compute-0-1,compute-0-2 ./test1
host:compute-0-1.local made it past the barrier, ret:0
host:compute-0-0.local made it past the barrier, ret:0
host:compute-0-2.local made it past the barrier, ret:0
If I have the compute nodes send a message to the frontend prior to the barrier, it runs to completion (a sketch of test2.c follows the output below):
/users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 0
host: frontend.domain node: 0 is the master
host: compute-0-0.local node: 1 sent: 1 to: 0
host: compute-0-1.local node: 2 sent: 2 to: 0
host: frontend.domain node: 0 recv: 1 from: 1
host: frontend.domain node: 0 recv: 2 from: 2
host: frontend.domain made it past the barrier, ret:0
host: compute-0-1.local made it past the barrier, ret:0
host: compute-0-0.local made it past the barrier, ret:0
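For reference, a hypothetical sketch of test2.c reconstructed from the output above (the attachment is not reproduced here): argv[1] picks the master rank, and every other rank sends its own rank number to the master before the barrier.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char host[MPI_MAX_PROCESSOR_NAME];
    int len, rank, size, master, i, val, ret;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    master = (argc > 1) ? atoi(argv[1]) : 0;   /* master rank chosen on the command line */

    if (rank == master) {
        printf("host: %s node: %d is the master\n", host, rank);
        for (i = 0; i < size - 1; ++i) {       /* one message from every other rank */
            MPI_Recv(&val, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
            printf("host: %s node: %d recv: %d from: %d\n",
                   host, rank, val, status.MPI_SOURCE);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, master, 0, MPI_COMM_WORLD);
        printf("host: %s node: %d sent: %d to: %d\n", host, rank, rank, master);
    }

    ret = MPI_Barrier(MPI_COMM_WORLD);
    printf("host: %s made it past the barrier, ret:%d\n", host, ret);
    MPI_Finalize();
    return 0;
}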
If I have a different node function as the master, it deadlocks:
/users/gunter $ mpirun -np 3 -H frontend,compute-0-0,compute-0-1 ./test2 1
host: compute-0-0.local node: 1 is the master
host: compute-0-1.local node: 2 sent: 2 to: 1
mpirun: killing job...
mpirun noticed that job rank 0 with PID 15411 on node frontend
exited on signal 15 (Terminated).
2 additional processes aborted (not shown)
How is it that, in the first example, one node makes it past the barrier while the rest deadlock?
Both programs run to completion under two other MPI implementations.
Is there something misconfigured on my cluster, or is this potentially an Open MPI bug?
What is the best way to debug this?
Any help would be appreciated!
--tim
<test1.c>
<test2.c>
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jeff Squyres
Cisco Systems