Hi,
I have tried this on 2 different clusters, and both times I have problems using the head node together with one of the headless (compute) nodes. There is no problem if I run 2 processes on n0, or 2 processes on n1, or on n2. There is no problem either using n1 and n2. The problem appears when I try to use n0 and n1, or n0 and n2.
Previously I thought it was a cluster configuration problem (http://www.open-mpi.org/community/lists/users/2006/04/1094.php), but now I am the sysadmin myself and still don't know why this happens. I'm attaching config.log, the ompi_info output, and the mpirun -d output for both a working command (n1,n2) and one that blocks (n0,n2). The hello program is the typical "I am n/m" example.
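For reference, these are the two mpirun invocations, copied from the attached transcripts (n0/n1/n2 correspond to the hosts ox0/ox1/ox2 that appear in the logs):

  [javier@oxigeno hello]$ mpirun -d -c 2 -H ox1,ox2 hello   (n1 + n2: runs and exits normally)
  [javier@oxigeno hello]$ mpirun -d -c 2 -H ox0,ox2 hello   (n0 + n2: blocks, and I have to interrupt it with Ctrl-C)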
Thanks for your help -javier
Attachment: config.log.bz2 (BZip2 compressed data)
Open MPI: 1.0.2
Open MPI SVN revision: r9571
Open RTE: 1.0.2
Open RTE SVN revision: r9571
OPAL: 1.0.2
OPAL SVN revision: r9571
Prefix: /home/javier/openmpi-1.0.2
Configured architecture: i686-pc-linux-gnu
Configured by: javier
Configured on: Mon Apr 24 20:47:06 CEST 2006
Configure host: oxigeno.ugr.es
Built by: javier
Built on: lun abr 24 21:24:27 CEST 2006
Built host: oxigeno.ugr.es
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: no
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fortran77 compiler: g77
Fortran77 compiler abs: /usr/bin/g77
Fortran90 compiler: none
Fortran90 compiler abs: none
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: no
C++ exceptions: no
Thread support: posix (mpi: yes, progress: yes)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: 1
MCA memory: malloc_interpose (MCA v1.0, API v1.0, Component v1.0.2)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0.2)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0.2)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.0.2)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.0.2)
MCA coll: self (MCA v1.0, API v1.0, Component v1.0.2)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.0.2)
MCA io: romio (MCA v1.0, API v1.0, Component v1.0.2)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0.2)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0.2)
MCA pml: teg (MCA v1.0, API v1.0, Component v1.0.2)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.0.2)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.0.2)
MCA ptl: self (MCA v1.0, API v1.0, Component v1.0.2)
MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0.2)
MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0.2)
MCA btl: self (MCA v1.0, API v1.0, Component v1.0.2)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.0.2)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.0.2)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.0.2)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0.2)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0.2)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0.2)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.0.2)
MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0.2)
MCA ns: replica (MCA v1.0, API v1.0, Component v1.0.2)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0.2)
MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0.2)
MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0.2)
MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0.2)
MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0.2)
MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0.2)
MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0.2)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0.2)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0.2)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.0.2)
MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0.2)
MCA pls: fork (MCA v1.0, API v1.0, Component v1.0.2)
MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0.2)
MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0.2)
MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0.2)
MCA sds: env (MCA v1.0, API v1.0, Component v1.0.2)
MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0.2)
MCA sds: seed (MCA v1.0, API v1.0, Component v1.0.2)
MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0.2)
MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0.2)
Script iniciado (Tue Apr 25 11:23:37 2006 )
[javier@oxigeno hello]$ mpirun -d -c 2 -H ox1,ox2 hello
[oxigeno.ugr.es:29960] [0,0,0] setting up session dir with
[oxigeno.ugr.es:29960] universe default-universe
[oxigeno.ugr.es:29960] user javier
[oxigeno.ugr.es:29960] host oxigeno.ugr.es
[oxigeno.ugr.es:29960] jobid 0
[oxigeno.ugr.es:29960] procid 0
[oxigeno.ugr.es:29960] procdir: /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe/0/0
[oxigeno.ugr.es:29960] jobdir: /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe/0
[oxigeno.ugr.es:29960] unidir: /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe
[oxigeno.ugr.es:29960] top: openmpi-sessions-jav...@oxigeno.ugr.es_0
[oxigeno.ugr.es:29960] tmp: /tmp
[oxigeno.ugr.es:29960] [0,0,0] contact_file /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe/universe-setup.txt
[oxigeno.ugr.es:29960] [0,0,0] wrote setup file
[oxigeno.ugr.es:29960] spawn: in job_state_callback(jobid = 1, state = 0x1)
[oxigeno.ugr.es:29960] pls:rsh: local csh: 0, local bash: 1
[oxigeno.ugr.es:29960] pls:rsh: assuming same remote shell as local shell
[oxigeno.ugr.es:29960] pls:rsh: remote csh: 0, remote bash: 1
[oxigeno.ugr.es:29960] pls:rsh: final template argv:
[oxigeno.ugr.es:29960] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe jav...@oxigeno.ugr.es:default-universe --nsreplica "0.0.0;tcp://150.214.191.217:1220;tcp://192.168.1.9:1220" --gprreplica "0.0.0;tcp://150.214.191.217:1220;tcp://192.168.1.9:1220" --mpi-call-yield 0
[oxigeno.ugr.es:29960] pls:rsh: launching on node ox1
[oxigeno.ugr.es:29960] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[oxigeno.ugr.es:29960] pls:rsh: ox1 is a REMOTE node
[oxigeno.ugr.es:29960] pls:rsh: executing: /usr/bin/ssh ox1 orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename ox1 --universe jav...@oxigeno.ugr.es:default-universe --nsreplica "0.0.0;tcp://150.214.191.217:1220;tcp://192.168.1.9:1220" --gprreplica "0.0.0;tcp://150.214.191.217:1220;tcp://192.168.1.9:1220" --mpi-call-yield 0
[oxigeno.ugr.es:29960] pls:rsh: launching on node ox2
[oxigeno.ugr.es:29960] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[oxigeno.ugr.es:29960] pls:rsh: ox2 is a REMOTE node
[oxigeno.ugr.es:29960] pls:rsh: executing: /usr/bin/ssh ox2 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename ox2 --universe jav...@oxigeno.ugr.es:default-universe --nsreplica "0.0.0;tcp://150.214.191.217:1220;tcp://192.168.1.9:1220" --gprreplica "0.0.0;tcp://150.214.191.217:1220;tcp://192.168.1.9:1220" --mpi-call-yield 0
[ox1:21646] [0,0,1] setting up session dir with
[ox1:21646] universe default-universe
[ox1:21646] user javier
[ox1:21646] host ox1
[ox1:21646] jobid 0
[ox1:21646] procid 1
[ox1:21646] procdir: /tmp/openmpi-sessions-javier@ox1_0/default-universe/0/1
[ox1:21646] jobdir: /tmp/openmpi-sessions-javier@ox1_0/default-universe/0
[ox1:21646] unidir: /tmp/openmpi-sessions-javier@ox1_0/default-universe
[ox1:21646] top: openmpi-sessions-javier@ox1_0
[ox1:21646] tmp: /tmp
[ox1:21735] [0,1,0] setting up session dir with
[ox1:21735] universe default-universe
[ox1:21735] user javier
[ox1:21735] host ox1
[ox1:21735] jobid 1
[ox1:21735] procid 0
[ox1:21735] procdir: /tmp/openmpi-sessions-javier@ox1_0/default-universe/1/0
[ox1:21735] jobdir: /tmp/openmpi-sessions-javier@ox1_0/default-universe/1
[ox1:21735] unidir: /tmp/openmpi-sessions-javier@ox1_0/default-universe
[ox1:21735] top: openmpi-sessions-javier@ox1_0
[ox1:21735] tmp: /tmp
[ox2:22211] [0,0,2] setting up session dir with
[ox2:22211] universe default-universe
[ox2:22211] user javier
[ox2:22211] host ox2
[ox2:22211] jobid 0
[ox2:22211] procid 2
[ox2:22211] procdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/0/2
[ox2:22211] jobdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/0
[ox2:22211] unidir: /tmp/openmpi-sessions-javier@ox2_0/default-universe
[ox2:22211] top: openmpi-sessions-javier@ox2_0
[ox2:22211] tmp: /tmp
[ox2:22300] [0,1,1] setting up session dir with
[ox2:22300] universe default-universe
[ox2:22300] user javier
[ox2:22300] host ox2
[ox2:22300] jobid 1
[ox2:22300] procid 1
[ox2:22300] procdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/1/1
[ox2:22300] jobdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/1
[ox2:22300] unidir: /tmp/openmpi-sessions-javier@ox2_0/default-universe
[ox2:22300] top: openmpi-sessions-javier@ox2_0
[ox2:22300] tmp: /tmp
[oxigeno.ugr.es:29960] spawn: in job_state_callback(jobid = 1, state = 0x3)
[oxigeno.ugr.es:29960] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, ox2, hello, 22300)
    (i, host, exe, pid) = (1, ox1, hello, 21735)
[ox1:21646] orted: job_state_callback(jobid = 1, state = 134643904)
[ox2:22211] orted: job_state_callback(jobid = 1, state = 134640784)
[oxigeno.ugr.es:29960] spawn: in job_state_callback(jobid = 1, state = 0x4)
[ox2:22211] orted: job_state_callback(jobid = 1, state = 134640784)
[ox1:21646] orted: job_state_callback(jobid = 1, state = 134643312)
Hello, world! I am 1 of 2
Hello, world! I am 0 of 2
[ox1:21735] [0,1,0] ompi_mpi_init completed
[ox2:22300] [0,1,1] ompi_mpi_init completed
[oxigeno.ugr.es:29960] spawn: in job_state_callback(jobid = 1, state = 0x7)
[oxigeno.ugr.es:29960] spawn: in job_state_callback(jobid = 1, state = 0x8)
[ox2:22211] orted: job_state_callback(jobid = 1, state = 134640408)
[ox2:22211] orted: job_state_callback(jobid = 1, state = 134611224)
[ox1:21646] orted: job_state_callback(jobid = 1, state = 134643320)
[ox1:21646] orted: job_state_callback(jobid = 1, state = 134611424)
[ox2:22300] sess_dir_finalize: found proc session dir empty - deleting
[ox1:21735] sess_dir_finalize: found proc session dir empty - deleting
[ox2:22300] sess_dir_finalize: job session dir not empty - leaving
[ox1:21735] sess_dir_finalize: job session dir not empty - leaving
[ox1:21646] sess_dir_finalize: proc session dir not empty - leaving
[ox2:22211] sess_dir_finalize: proc session dir not empty - leaving
[ox2:22211] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[ox1:21646] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[ox1:21646] sess_dir_finalize: found job session dir empty - deleting
[ox2:22211] sess_dir_finalize: found job session dir empty - deleting
[ox1:21646] sess_dir_finalize: univ session dir not empty - leaving
[ox2:22211] sess_dir_finalize: univ session dir not empty - leaving
[ox1:21646] sess_dir_finalize: found proc session dir empty - deleting
[ox2:22211] sess_dir_finalize: found proc session dir empty - deleting
[ox1:21646] sess_dir_finalize: found job session dir empty - deleting
[ox2:22211] sess_dir_finalize: found job session dir empty - deleting
[ox2:22211] sess_dir_finalize: found univ session dir empty - deleting
[ox1:21646] sess_dir_finalize: found univ session dir empty - deleting
[ox2:22211] sess_dir_finalize: found top session dir empty - deleting
[ox1:21646] sess_dir_finalize: found top session dir empty - deleting
[javier@oxigeno hello]$ exit
Script terminado (Tue Apr 25 11:24:10 2006 )
Script iniciado (Tue Apr 25 11:24:26 2006 )
[javier@oxigeno hello]$ mpirun -d -c 2 -H ox0,ox2 hello
[oxigeno.ugr.es:30067] [0,0,0] setting up session dir with
[oxigeno.ugr.es:30067] universe default-universe
[oxigeno.ugr.es:30067] user javier
[oxigeno.ugr.es:30067] host oxigeno.ugr.es
[oxigeno.ugr.es:30067] jobid 0
[oxigeno.ugr.es:30067] procid 0
[oxigeno.ugr.es:30067] procdir: /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe/0/0
[oxigeno.ugr.es:30067] jobdir: /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe/0
[oxigeno.ugr.es:30067] unidir: /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe
[oxigeno.ugr.es:30067] top: openmpi-sessions-jav...@oxigeno.ugr.es_0
[oxigeno.ugr.es:30067] tmp: /tmp
[oxigeno.ugr.es:30067] [0,0,0] contact_file /tmp/openmpi-sessions-jav...@oxigeno.ugr.es_0/default-universe/universe-setup.txt
[oxigeno.ugr.es:30067] [0,0,0] wrote setup file
[oxigeno.ugr.es:30067] spawn: in job_state_callback(jobid = 1, state = 0x1)
[oxigeno.ugr.es:30067] pls:rsh: local csh: 0, local bash: 1
[oxigeno.ugr.es:30067] pls:rsh: assuming same remote shell as local shell
[oxigeno.ugr.es:30067] pls:rsh: remote csh: 0, remote bash: 1
[oxigeno.ugr.es:30067] pls:rsh: final template argv:
[oxigeno.ugr.es:30067] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe jav...@oxigeno.ugr.es:default-universe --nsreplica "0.0.0;tcp://150.214.191.217:1223;tcp://192.168.1.9:1223" --gprreplica "0.0.0;tcp://150.214.191.217:1223;tcp://192.168.1.9:1223" --mpi-call-yield 0
[oxigeno.ugr.es:30067] pls:rsh: launching on node ox0
[oxigeno.ugr.es:30067] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[oxigeno.ugr.es:30067] pls:rsh: ox0 is a LOCAL node
[oxigeno.ugr.es:30067] pls:rsh: changing to directory /home/javier
[oxigeno.ugr.es:30067] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename ox0 --universe jav...@oxigeno.ugr.es:default-universe --nsreplica "0.0.0;tcp://150.214.191.217:1223;tcp://192.168.1.9:1223" --gprreplica "0.0.0;tcp://150.214.191.217:1223;tcp://192.168.1.9:1223" --mpi-call-yield 0
[oxigeno.ugr.es:30070] [0,0,1] setting up session dir with
[oxigeno.ugr.es:30070] universe default-universe
[oxigeno.ugr.es:30070] user javier
[oxigeno.ugr.es:30070] host ox0
[oxigeno.ugr.es:30070] jobid 0
[oxigeno.ugr.es:30070] procid 1
[oxigeno.ugr.es:30070] procdir: /tmp/openmpi-sessions-javier@ox0_0/default-universe/0/1
[oxigeno.ugr.es:30070] jobdir: /tmp/openmpi-sessions-javier@ox0_0/default-universe/0
[oxigeno.ugr.es:30070] unidir: /tmp/openmpi-sessions-javier@ox0_0/default-universe
[oxigeno.ugr.es:30070] top: openmpi-sessions-javier@ox0_0
[oxigeno.ugr.es:30070] tmp: /tmp
[oxigeno.ugr.es:30067] pls:rsh: launching on node ox2
[oxigeno.ugr.es:30067] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[oxigeno.ugr.es:30067] pls:rsh: ox2 is a REMOTE node
[oxigeno.ugr.es:30067] pls:rsh: executing: /usr/bin/ssh ox2 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename ox2 --universe jav...@oxigeno.ugr.es:default-universe --nsreplica "0.0.0;tcp://150.214.191.217:1223;tcp://192.168.1.9:1223" --gprreplica "0.0.0;tcp://150.214.191.217:1223;tcp://192.168.1.9:1223" --mpi-call-yield 0
[oxigeno.ugr.es:30076] [0,1,0] setting up session dir with
[oxigeno.ugr.es:30076] universe default-universe
[oxigeno.ugr.es:30076] user javier
[oxigeno.ugr.es:30076] host ox0
[oxigeno.ugr.es:30076] jobid 1
[oxigeno.ugr.es:30076] procid 0
[oxigeno.ugr.es:30076] procdir: /tmp/openmpi-sessions-javier@ox0_0/default-universe/1/0
[oxigeno.ugr.es:30076] jobdir: /tmp/openmpi-sessions-javier@ox0_0/default-universe/1
[oxigeno.ugr.es:30076] unidir: /tmp/openmpi-sessions-javier@ox0_0/default-universe
[oxigeno.ugr.es:30076] top: openmpi-sessions-javier@ox0_0
[oxigeno.ugr.es:30076] tmp: /tmp
[ox2:22311] [0,0,2] setting up session dir with
[ox2:22311] universe default-universe
[ox2:22311] user javier
[ox2:22311] host ox2
[ox2:22311] jobid 0
[ox2:22311] procid 2
[ox2:22311] procdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/0/2
[ox2:22311] jobdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/0
[ox2:22311] unidir: /tmp/openmpi-sessions-javier@ox2_0/default-universe
[ox2:22311] top: openmpi-sessions-javier@ox2_0
[ox2:22311] tmp: /tmp
[ox2:22400] [0,1,1] setting up session dir with
[ox2:22400] universe default-universe
[ox2:22400] user javier
[ox2:22400] host ox2
[ox2:22400] jobid 1
[ox2:22400] procid 1
[ox2:22400] procdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/1/1
[ox2:22400] jobdir: /tmp/openmpi-sessions-javier@ox2_0/default-universe/1
[ox2:22400] unidir: /tmp/openmpi-sessions-javier@ox2_0/default-universe
[ox2:22400] top: openmpi-sessions-javier@ox2_0
[ox2:22400] tmp: /tmp
[oxigeno.ugr.es:30067] spawn: in job_state_callback(jobid = 1, state = 0x3)
[oxigeno.ugr.es:30070] orted: job_state_callback(jobid = 1, state = 134601784)
[oxigeno.ugr.es:30067] Info: Setting up debugger process table for applications
[ox2:22311] orted: job_state_callback(jobid = 1, state = 134643904)
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, ox2, hello, 22400)
    (i, host, exe, pid) = (1, ox0, hello, 30076)
Killed by signal 2.
[oxigeno.ugr.es:30070] sess_dir_finalize: found job session dir empty - deleting
[oxigeno.ugr.es:30070] sess_dir_finalize: univ session dir not empty - leaving
mpirun: killing job...
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: oxigeno.ugr.es
PID: 30076
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[oxigeno.ugr.es:30070] sess_dir_finalize: proc session dir not empty - leaving
[oxigeno.ugr.es:30070] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: oxigeno.ugr.es
PID: 30076
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: oxigeno.ugr.es
PID: 30076
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[javier@oxigeno hello]$ exit
Script terminado (Tue Apr 25 11:25:41 2006 )
#include <mpi.h>    // for MPI_Init()
#include <stdio.h>  // for printf()
#include <stdlib.h> // for exit()

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    exit(0);
}
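In case it is useful for reproducing, the example is built and launched roughly like this (a sketch; it assumes the mpicc wrapper from this Open MPI install is on the PATH, and reuses the host list from the working run above):

  $ mpicc hello.c -o hello
  $ mpirun -c 2 -H ox1,ox2 hello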