Hi,

Our cluster is probably misconfigured, but I haven't been able to work out what the problem is. I am just starting to use OpenMPI; I have used LAM/MPI in the past. I realize this is a problem with my cluster, not with OpenMPI, and I would be grateful for any suggestion you could offer.

LAM/MPI works fine on this same cluster, and my problem only shows up with OpenMPI when I try to use the headed cluster node _and_ another one. There is no problem if I use just one node (headed or not), or if I use two headless nodes.

My sysadmin would be willing to help me, but I need to tell him what to check. Since LAM/MPI is working fine and I could easily avoid using the headed node, I cannot really complain, but I'd like to have OpenMPI working as nicely as LAM/MPI.

I have been able to reduce my problem to this simple example, using the "hello" source code from the LAM distribution:
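hello.c is just the standard MPI hello-world, roughly like this (written here from memory, so it may differ slightly from the exact LAM source):
--------------------------------------
/* roughly the LAM "hello" example; reconstructed, not the exact source */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    /* initialize MPI and query this process's rank and the world size */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello, world!  I am %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
--------------------------------------

And this is the full session on our cluster: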
--------------------------------------
$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain   localhost
#The following was added by scance. Do not remove:
150.214.<blah>.<blah> clusteri
192.168.2.1 n0
192.168.2.2 n1
192.168.2.3 n2
192.168.2.4 n3
192.168.2.5 n4
192.168.2.6 n5
192.168.2.7 n6
192.168.2.8 n7
#End scance-section

$ env
HOSTNAME=clusteri
...
LD_LIBRARY_PATH=/home/javier/openmpi-1.0.2/lib:
...
PATH=/home/javier/octave-2.1.73/bin:/home/javier/openmpi-1.0.2/bin:/opt/netbeans-4.1/bin:/usr/local/bin:/usr/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/opt/scali/bin:/opt/scali/sbin:/opt/scali/contrib/pbs/bin:/opt/scali/contrib/torque/bin:/home/javier/bin
...

$ ls
cxxhello.cc  hello  hello.c  README

$ which mpicc
~/openmpi-1.0.2/bin/mpicc

$ mpicc -o hello hello.c
$ ldd hello
        libmpi.so.0 => /home/javier/openmpi-1.0.2/lib/libmpi.so.0 (0x00849000)
        liborte.so.0 => /home/javier/openmpi-1.0.2/lib/liborte.so.0 (0x00b07000)
        libopal.so.0 => /home/javier/openmpi-1.0.2/lib/libopal.so.0 (0x00d27000)
        libutil.so.1 => /lib/libutil.so.1 (0x009f4000)
        libnsl.so.1 => /lib/libnsl.so.1 (0x00111000)
        libdl.so.2 => /lib/libdl.so.2 (0x002b4000)
        libm.so.6 => /lib/tls/libm.so.6 (0x0028f000)
        libpthread.so.0 => /lib/tls/libpthread.so.0 (0x002ba000)
        libc.so.6 => /lib/tls/libc.so.6 (0x00164000)
        /lib/ld-linux.so.2 (0x00147000)

$ which mpirun
~/openmpi-1.0.2/bin/mpirun
$ mpirun -c 2 -H n0 hello
Hello, world!  I am 0 of 2
Hello, world!  I am 1 of 2
$ mpirun -c 2 -H clusteri hello
Hello, world!  I am 0 of 2
Hello, world!  I am 1 of 2
$ mpirun -c 2 -H n1 hello
Hello, world!  I am 1 of 2
Hello, world!  I am 0 of 2
$ mpirun -c 2 -H n2 hello
Hello, world!  I am 0 of 2
Hello, world!  I am 1 of 2
$ mpirun -c 2 -H n1,n2 hello
Hello, world!  I am 0 of 2
Hello, world!  I am 1 of 2

$ mpirun -c 2 -H n0,n2 hello
^C
mpirun: killing job...
$ mpirun -c 2 -H clusteri,n2 hello
^C
mpirun: killing job...
$ mpirun -c 2 -H clusteri,n1 hello
^C
mpirun: killing job...
$ mpirun -c 2 -H n0,n1 hello
^C
mpirun: killing job...
--------------------------------------

I must press ^C because the mpirun command just blocks. Note that the four hanging cases are exactly the ones that combine the headed node (clusteri/n0) with a headless one.
I have compared the -d outputs of a working case and a blocking case, and they seem rather similar up to this point:
--------------------------------------
$ mpirun -d -c 2 -H n0,n1 hello
...
[clusteri:11701] connect_uni: contact info read
[clusteri:11701] connect_uni: connection not allowed
[clusteri:11701] [0,0,0] setting up session dir with
...

[clusteri:11701] pls:rsh: final template argv:
[clusteri:11701] pls:rsh:     /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe javier@clusteri:default-universe-11701 --nsreplica "0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --gprreplica "0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --mpi-call-yield 0
[clusteri:11701] pls:rsh: launching on node n0
[clusteri:11701] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[clusteri:11701] pls:rsh: n0 is a LOCAL node
[clusteri:11701] pls:rsh: changing to directory /home/javier
[clusteri:11701] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename n0 --universe javier@clusteri:default-universe-11701 --nsreplica "0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --gprreplica "0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --mpi-call-yield 0

[clusteri:11703] [0,0,1] setting up session dir with
...

[clusteri:11701] pls:rsh: launching on node n1
[clusteri:11701] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[clusteri:11701] pls:rsh: n1 is a REMOTE node
[clusteri:11701] pls:rsh: executing: /usr/bin/ssh n1 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename n1 --universe javier@clusteri:default-universe-11701 --nsreplica "0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --gprreplica "0.0.0;tcp://150.214.<>.<>:52220;tcp://192.168.2.1:52220" --mpi-call-yield 0

[n1:08763] [0,0,2] setting up session dir with

...
[clusteri:11710] [0,1,0] setting up session dir with
...
[n1:08789] [0,1,1] setting up session dir with
...
[clusteri:11701] spawn: in job_state_callback(jobid = 1, state = 0x3)
[clusteri:11701] Info: Setting up debugger process table for applications
[clusteri:11703] orted: job_state_callback(jobid = 1, state = 158080832)
  MPIR_being_debugged = 0
[n1:08763] orted: job_state_callback(jobid = 1, state = 144698272)
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, n1, hello, 8789)
    (i, host, exe, pid) = (1, n0, hello, 11710)
<blocked here>
--------------------------------------
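One thing I notice in that output is that the --nsreplica/--gprreplica URIs list both the public address (150.214.<>.<>) and the internal one (192.168.2.1), and the headed node is the only machine with both interfaces. I don't know whether that is related, but would it make sense to try restricting OpenMPI's TCP traffic to the internal network? Something like the following, assuming the btl_tcp_if_include parameter applies to 1.0.2 and that eth1 is the internal interface (that interface name is just a guess for our setup):
--------------------------------------
# eth1 is assumed to be the 192.168.2.x interface on the headed node
$ mpirun -c 2 -H n0,n1 --mca btl_tcp_if_include eth1 hello
--------------------------------------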

I can also log in to n1 and run from there, and it is all the same: it blocks whenever I use the headed node together with a headless one. Should I send the complete -d output, for both the working and the blocking cases? Is there a checklist I could follow to look for misconfigurations in our cluster?
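For instance, I could collect something like this on the headed node and on one of the headless nodes and compare the results, if that would help (just my guess at useful checks):
--------------------------------------
# run on clusteri and on n1, then compare
$ hostname
$ /sbin/ifconfig
$ getent hosts clusteri n0 n1 n2
$ ssh n1 hostname
--------------------------------------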

Thanks a lot for your help.

-javier
