Hi,
Our cluster is probably misconfigured, but I have not been able to work out what the problem is. I am just starting to use OpenMPI; I have used LAM/MPI in the past, and I realize this is a problem with my cluster, not with OpenMPI. I would be grateful for any suggestion you could offer.
LAM/MPI works fine on this same cluster, and my problem only shows up when I try to use the headed cluster node _and_ another one with OpenMPI. There is no problem if I use just one node (headed or not), or if I use two headless nodes.
My sysadmin would be willing to help me, but I need to tell him what to check. Since LAM/MPI is working fine and I could easily avoid using the headed node, I cannot really complain, but I'd like to have OpenMPI working as nicely as LAM/MPI.
I have been able to reduce my problem to the simple session shown below, using the "hello" source code from the LAM distribution.
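In case it helps, hello.c is essentially the standard MPI hello world; I am reconstructing it from memory here rather than pasting it verbatim from the LAM tarball, but it is roughly:
--------------------------------------
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* Initialize MPI and find out who we are. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world! I am %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
--------------------------------------
And this is the session on our cluster: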
--------------------------------------
$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
#The following was added by scance. Do not remove:
150.214.<blah>.<blah> clusteri
192.168.2.1 n0
192.168.2.2 n1
192.168.2.3 n2
192.168.2.4 n3
192.168.2.5 n4
192.168.2.6 n5
192.168.2.7 n6
192.168.2.8 n7
#End scance-section
$ env
HOSTNAME=clusteri
...
LD_LIBRARY_PATH=/home/javier/openmpi-1.0.2/lib:
...
PATH=/home/javier/octave-2.1.73/bin:/home/javier/openmpi-1.0.2/bin:/opt/netbeans-4.1/bin:/usr/local/bin:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/scali/bin:/opt/scali/sbin:/opt/scali/contrib/pbs/bin:/opt/scali/contrib/torque/bin:/home/javier/bin
...
$ ls
cxxhello.cc hello hello.c README
$ which mpicc
~/openmpi-1.0.2/bin/mpicc
$ mpicc -o hello hello.c
$ ldd hello
libmpi.so.0 => /home/javier/openmpi-1.0.2/lib/libmpi.so.0 (0x00849000)
liborte.so.0 => /home/javier/openmpi-1.0.2/lib/liborte.so.0 (0x00b07000)
libopal.so.0 => /home/javier/openmpi-1.0.2/lib/libopal.so.0 (0x00d27000)
libutil.so.1 => /lib/libutil.so.1 (0x009f4000)
libnsl.so.1 => /lib/libnsl.so.1 (0x00111000)
libdl.so.2 => /lib/libdl.so.2 (0x002b4000)
libm.so.6 => /lib/tls/libm.so.6 (0x0028f000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x002ba000)
libc.so.6 => /lib/tls/libc.so.6 (0x00164000)
/lib/ld-linux.so.2 (0x00147000)
$ which mpirun
~/openmpi-1.0.2/bin/mpirun
$ mpirun -c 2 -H n0 hello
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
$ mpirun -c 2 -H clusteri hello
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
$ mpirun -c 2 -H n1 hello
Hello, world! I am 1 of 2
Hello, world! I am 0 of 2
$ mpirun -c 2 -H n2 hello
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
$ mpirun -c 2 -H n1,n2 hello
Hello, world! I am 0 of 2
Hello, world! I am 1 of 2
$ mpirun -c 2 -H n0,n2 hello
^C
mpirun: killing job...
$ mpirun -c 2 -H clusteri,n2 hello
^C
mpirun: killing job...
$ mpirun -c 2 -H clusteri,n1 hello
^C
mpirun: killing job...
$ mpirun -c 2 -H n0,n1 hello
^C
mpirun: killing job...
--------------------------------------
I have to press ^C because the mpirun command hangs. (In case it isn't obvious from the hosts file: clusteri and n0 are the same headed machine, reachable both via the public 150.214.<blah>.<blah> address and via the internal 192.168.2.1, so every combination that hangs involves the headed node plus a headless one.)
I have compared the -d (debug) outputs of a working run and a hanging run, and they look rather similar up to this point:
--------------------------------------
$ mpirun -d -c 2 -H n0,n1 hello
...
[clusteri:11701] connect_uni: contact info read
[clusteri:11701] connect_uni: connection not allowed
[clusteri:11701] [0,0,0] setting up session dir with
...
[clusteri:11701] pls:rsh: final template argv:
[clusteri:11701] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy
1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template>
--universe javier@clusteri:default-universe-11701 --nsreplica
"0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --gprreplica
"0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --mpi-call-yield 0
[clusteri:11701] pls:rsh: launching on node n0
[clusteri:11701] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[clusteri:11701] pls:rsh: n0 is a LOCAL node
[clusteri:11701] pls:rsh: changing to directory /home/javier
[clusteri:11701] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1
--num_procs 3 --vpid_start 0 --nodename n0 --universe
javier@clusteri:default-universe-11701 --nsreplica
"0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --gprreplica
"0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --mpi-call-yield 0
[clusteri:11703] [0,0,1] setting up session dir with
...
[clusteri:11701] pls:rsh: launching on node n1
[clusteri:11701] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[clusteri:11701] pls:rsh: n1 is a REMOTE node
[clusteri:11701] pls:rsh: executing: /usr/bin/ssh n1 orted --debug --bootproxy 1
--name 0.0.2 --num_procs 3 --vpid_start 0 --nodename n1 --universe
javier@clusteri:default-universe-11701 --nsreplica
"0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --gprreplica
"0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --mpi-call-yield 0
[n1:08763] [0,0,2] setting up session dir with
...
[clusteri:11710] [0,1,0] setting up session dir with
...
[n1:08789] [0,1,1] setting up session dir with
...
[clusteri:11701] spawn: in job_state_callback(jobid = 1, state = 0x3)
[clusteri:11701] Info: Setting up debugger process table for applications
[clusteri:11703] orted: job_state_callback(jobid = 1, state = 158080832)
MPIR_being_debugged = 0
[n1:08763] orted: job_state_callback(jobid = 1, state = 144698272)
MPIR_debug_gate = 0
MPIR_debug_state = 1
MPIR_acquired_pre_main = 0
MPIR_i_am_starter = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, n1, hello, 8789)
(i, host, exe, pid) = (1, n0, hello, 11710)
<blocked here>
--------------------------------------
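In case it matters, this is roughly how I captured and compared the two -d outputs (the log file names are just what I chose locally):
--------------------------------------
$ mpirun -d -c 2 -H n1,n2 hello > ok_n1_n2.log 2>&1     # headless pair, works
$ mpirun -d -c 2 -H n0,n1 hello > hang_n0_n1.log 2>&1   # headed + headless, hangs; ^C after a while
$ diff ok_n1_n2.log hang_n0_n1.log | less
--------------------------------------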
I can also log in to n1 and run the same tests from there, and the behaviour is exactly the same: it hangs whenever I use the headed node together with a headless one. Should I send the complete -d output, for both the working and the hanging cases? Is there a checklist I could follow to look for misconfigurations in our cluster?
Thanks a lot for your help.
-javier