Hello All,
I am trying to run an Open MPI application across two physical machines. It fails with the error 'Returned "Unreachable" (-12) instead of "Success" (0)'. I have looked through the logs (attached), but I cannot find the cause or how to fix it. I see a lot of (communication) components being loaded and then unloaded, and I cannot tell which node picks up which kind of communication interface.

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

  Process 1 ([[10782,1],6]) is on host: tik34x
  Process 2 ([[10782,1],0]) is on host: tik33x
  BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.

The "mpirun" line is:

mpirun --mca btl self,sm,tcp --mca btl_base_verbose 30 -report-pid -display-map -report-bindings -hostfile hostfile -np 7 -v --rankfile rankfile.txt -v --timestamp-output --tag-output ./xstartwrapper.sh ./run_gdb.sh

where the .sh files are fixes for forwarding X windows from the multiple machines to the machine where I am logged in.

Can anyone help? Thanks a lot.

Best,
Devendra
--- Begin Message ---
reset: standard error: Invalid argument
Destination is: r...@tik33x.ethz.ch
Host is: tik33x
Destination is: r...@tik34x.ethz.ch
Host is: tik34x
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host tik33x.
This indicates that the LAM/MPI runtime environment is not operating. The LAM/MPI runtime environment is necessary for the "lamhalt" command.
Please run the "lamboot" command the start the LAM/MPI runtime environment. See the LAM/MPI documentation for how to invoke "lamboot" across multiple machines.
----------------------------------------------------------------------------- LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University n-1<11170> ssi:boot:open: opening n-1<11170> ssi:boot:open: opening boot module globus n-1<11170> ssi:boot:open: opened boot module globus n-1<11170> ssi:boot:open: opening boot module rsh n-1<11170> ssi:boot:open: opened boot module rsh n-1<11170> ssi:boot:open: opening boot module slurm n-1<11170> ssi:boot:open: opened boot module slurm n-1<11170> ssi:boot:select: initializing boot module slurm n-1<11170> ssi:boot:slurm: not running under SLURM n-1<11170> ssi:boot:select: boot module not available: slurm n-1<11170> ssi:boot:select: initializing boot module rsh n-1<11170> ssi:boot:rsh: module initializing n-1<11170> ssi:boot:rsh:agent: /usr/bin/rsh n-1<11170> ssi:boot:rsh:username: <same> n-1<11170> ssi:boot:rsh:verbose: 1000 n-1<11170> ssi:boot:rsh:algorithm: linear n-1<11170> ssi:boot:rsh:no_n: 0 n-1<11170> ssi:boot:rsh:no_profile: 0 n-1<11170> ssi:boot:rsh:fast: 0 n-1<11170> ssi:boot:rsh:ignore_stderr: 0 n-1<11170> ssi:boot:rsh:priority: 10 n-1<11170> ssi:boot:select: boot module available: rsh, priority: 10 n-1<11170> ssi:boot:select: initializing boot module globus n-1<11170> ssi:boot:globus: globus-job-run not found, globus boot will not run n-1<11170> ssi:boot:select: boot module not available: globus n-1<11170> ssi:boot:select: finalizing boot module slurm n-1<11170> ssi:boot:slurm: finalizing n-1<11170> ssi:boot:select: closing boot module slurm n-1<11170> ssi:boot:select: finalizing boot module globus n-1<11170> ssi:boot:globus: finalizing n-1<11170> ssi:boot:select: closing boot module globus n-1<11170> ssi:boot:select: selected boot module rsh n-1<11170> ssi:boot:base: looking for boot schema in following directories: n-1<11170> ssi:boot:base: <current directory> n-1<11170> ssi:boot:base: $TROLLIUSHOME/etc n-1<11170> ssi:boot:base: $LAMHOME/etc n-1<11170> ssi:boot:base: /usr/lib/lam/etc n-1<11170> ssi:boot:base: looking for boot 
schema file: n-1<11170> ssi:boot:base: TIK_lamboot_hostfile.txt n-1<11170> ssi:boot:base: found boot schema: TIK_lamboot_hostfile.txt n-1<11170> ssi:boot:rsh: found the following hosts: n-1<11170> ssi:boot:rsh: n0 tik33x.ethz.ch (cpu=1) n-1<11170> ssi:boot:rsh: n1 tik34x.ethz.ch (cpu=1) n-1<11170> ssi:boot:rsh: resolved hosts: n-1<11170> ssi:boot:rsh: n0 tik33x.ethz.ch --> 129.132.67.174 (origin) n-1<11170> ssi:boot:rsh: n1 tik34x.ethz.ch --> 129.132.67.175 n-1<11170> ssi:boot:rsh: starting RTE procs n-1<11170> ssi:boot:base:linear: starting n-1<11170> ssi:boot:base:server: opening server TCP socket n-1<11170> ssi:boot:base:server: opened port 41544 n-1<11170> ssi:boot:base:linear: booting n0 (tik33x.ethz.ch) n-1<11170> ssi:boot:rsh: starting lamd on (tik33x.ethz.ch) n-1<11170> ssi:boot:rsh: starting on n0 (tik33x.ethz.ch): hboot -t -c lam-conf.lamd -d -v -I -H tik33x.ethz.ch -P 41544 -n 0 -o 0 n-1<11170> ssi:boot:rsh: launching locally tkill: setting prefix to (null) tkill: setting suffix to (null) tkill: got killname back: /tmp/lam-raid@tik33x/lam-killfile tkill: f_kill = "/tmp/lam-raid@tik33x/lam-killfile" tkill: nothing to kill: "/tmp/lam-raid@tik33x/lam-killfile" hboot: performing tkill hboot: tkill -d hboot: booting... 
hboot: fork /usr/bin/X11/lamd [1] 11200 lamd -H tik33x.ethz.ch -P 41544 -n 0 -o 0 -d n-1<11170> ssi:boot:rsh: successfully launched on n0 (tik33x.ethz.ch) n-1<11170> ssi:boot:base:server: expecting connection from finite list n-1<11200> ssi:boot:open: opening n-1<11200> ssi:boot:open: opening boot module globus n-1<11200> ssi:boot:open: opened boot module globus n-1<11200> ssi:boot:open: opening boot module rsh n-1<11200> ssi:boot:open: opened boot module rsh n-1<11200> ssi:boot:open: opening boot module slurm n-1<11200> ssi:boot:open: opened boot module slurm n-1<11200> ssi:boot:select: initializing boot module slurm n-1<11200> ssi:boot:slurm: not running under SLURM n-1<11200> ssi:boot:select: boot module not available: slurm n-1<11200> ssi:boot:select: initializing boot module rsh n-1<11200> ssi:boot:rsh: module initializing n-1<11200> ssi:boot:rsh:agent: /usr/bin/rsh n-1<11200> ssi:boot:rsh:username: <same> n-1<11200> ssi:boot:rsh:verbose: 1000 n-1<11200> ssi:boot:rsh:algorithm: linear n-1<11200> ssi:boot:rsh:no_n: 0 n-1<11200> ssi:boot:rsh:no_profile: 0 n-1<11200> ssi:boot:rsh:fast: 0 n-1<11200> ssi:boot:rsh:ignore_stderr: 0 n-1<11200> ssi:boot:rsh:priority: 10 n-1<11200> ssi:boot:select: boot module available: rsh, priority: 10 n-1<11200> ssi:boot:select: initializing boot module globus n-1<11200> ssi:boot:globus: globus-job-run not found, globus boot will not run n-1<11200> ssi:boot:select: boot module not available: globus n-1<11200> ssi:boot:select: finalizing boot module slurm n-1<11200> ssi:boot:slurm: finalizing n-1<11200> ssi:boot:select: closing boot module slurm n-1<11200> ssi:boot:select: finalizing boot module globus n-1<11200> ssi:boot:globus: finalizing n-1<11200> ssi:boot:select: closing boot module globus n-1<11200> ssi:boot:select: selected boot module rsh n-1<11200> ssi:boot:send_lamd: getting node ID from command line n-1<11200> ssi:boot:send_lamd: getting agent haddr from command line n-1<11200> ssi:boot:send_lamd: getting agent port from 
command line n-1<11200> ssi:boot:send_lamd: getting node ID from command line n-1<11200> ssi:boot:send_lamd: connecting to 129.132.67.174:41544, node id 0 n-1<11170> ssi:boot:base:server: got connection from 129.132.67.174 n-1<11170> ssi:boot:base:server: this connection is expected (n0) n-1<11200> ssi:boot:send_lamd: sending dli_port 41973 n-1<11170> ssi:boot:base:server: remote lamd is at 129.132.67.174:41973 n-1<11170> ssi:boot:base:linear: booting n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:rsh: starting lamd on (tik34x.ethz.ch) n-1<11170> ssi:boot:rsh: starting on n1 (tik34x.ethz.ch): hboot -t -c lam-conf.lamd -d -v -s -I "-H tik33x.ethz.ch -P 41544 -n 1 -o 0" n-1<11170> ssi:boot:rsh: launching remotely n-1<11170> ssi:boot:rsh: attempting to execute: /usr/bin/rsh tik34x.ethz.ch -n -l raid 'echo $SHELL' n-1<11170> ssi:boot:rsh: remote shell /bin/tcsh n-1<11170> ssi:boot:rsh: attempting to execute: /usr/bin/rsh tik34x.ethz.ch -n -l raid hboot -t -c lam-conf.lamd -d -v -s -I '"-H tik33x.ethz.ch -P 41544 -n 1 -o 0"' tkill: setting prefix to (null) tkill: setting suffix to (null) LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University tkill: got killname back: /tmp/lam-raid@tik34x/lam-killfile tkill: f_kill = "/tmp/lam-raid@tik34x/lam-killfile" tkill: nothing to kill: "/tmp/lam-raid@tik34x/lam-killfile" hboot: performing tkill hboot: tkill -d hboot: booting... 
hboot: fork /usr/bin/lamd [1] 3246 lamd -H tik33x.ethz.ch -P 41544 -n 1 -o 0 -d n-1<11170> ssi:boot:rsh: successfully launched on n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:base:server: expecting connection from finite list n-1<11170> ssi:boot:base:server: got connection from 129.132.67.175 n-1<11170> ssi:boot:base:server: this connection is expected (n1) n-1<11170> ssi:boot:base:server: remote lamd is at 129.132.67.175:37523 n-1<11170> ssi:boot:base:server: closing server socket n-1<11170> ssi:boot:base:server: connecting to lamd at 129.132.67.174:51093 n-1<11170> ssi:boot:base:server: connected n-1<11170> ssi:boot:base:server: sending number of links (2) n-1<11170> ssi:boot:base:server: sending info: n0 (tik33x.ethz.ch) n-1<11170> ssi:boot:base:server: sending info: n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:base:server: finished sending n-1<11170> ssi:boot:base:server: disconnected from 129.132.67.174:51093 n-1<11170> ssi:boot:base:server: connecting to lamd at 129.132.67.175:53828 n-1<11170> ssi:boot:base:server: connected n-1<11170> ssi:boot:base:server: sending number of links (2) n-1<11170> ssi:boot:base:server: sending info: n0 (tik33x.ethz.ch) n-1<11170> ssi:boot:base:server: sending info: n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:base:server: finished sending n-1<11170> ssi:boot:base:server: disconnected from 129.132.67.175:53828 n-1<11170> ssi:boot:base:linear: finished n-1<11170> ssi:boot:rsh: all RTE procs started n-1<11170> ssi:boot:rsh: finalizing n-1<11170> ssi:boot: Closing n-1<11200> ssi:boot:rsh: finalizing n-1<11200> ssi:boot: Closing rm: cannot remove `setuplog.*': No such file or directory ======================== JOB MAP ======================== Data for node: Name: tik33x Num procs: 4 Process OMPI jobid: [10782,1] Process rank: 0 Process OMPI jobid: [10782,1] Process rank: 1 Process OMPI jobid: [10782,1] Process rank: 2 Process OMPI jobid: [10782,1] Process rank: 3 Data for node: Name: tik34x.ethz.ch Num procs: 3 Process OMPI jobid: [10782,1] 
Process rank: 4 Process OMPI jobid: [10782,1] Process rank: 5 Process OMPI jobid: [10782,1] Process rank: 6 ============================================================= mpirun pid: 11303 Wed May 16 12:09:45 2012[1,0]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],0] to slot_list 0:0 Wed May 16 12:09:45 2012[1,1]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],1] to slot_list 0:1 Wed May 16 12:09:45 2012[1,2]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],2] to slot_list 1:0 Wed May 16 12:09:45 2012[1,3]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],3] to slot_list 1:1 Wed May 16 12:09:45 2012[1,0]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,1]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,2]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,4]<stdout>:Running DAL on tik34x Wed May 16 12:09:45 2012[1,5]<stdout>:Running DAL on tik34x Wed May 16 12:09:45 2012[1,6]<stdout>:Running DAL on tik34x Wed May 16 12:09:45 2012[1,3]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component self 
open function successful Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 
2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 
2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: initializing 
btl component self Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: init of component self returned success Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: initializing btl component sm Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: init of component sm returned success Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: initializing btl component self Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: init of component self returned success Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: initializing btl component sm Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: init of component sm returned success Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: initializing btl component self Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of component self returned success Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: initializing btl component sm Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of component sm returned success Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: initializing btl component self Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: init of component self returned success Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: initializing btl component sm Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: init of component sm returned success Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: opening btl components Wed May 16 12:09:45 
2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 
2012[1,1]<stddiag>:[tik33x:11430] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: initializing btl component self Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of component self returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: initializing btl component sm Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of component sm returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: initializing btl component self Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: init of component self returned success Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: initializing btl component sm Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: init of component sm returned success Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: initializing btl component self Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: init of component self returned success Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: initializing btl component sm Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: init of component sm returned success Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of 
component tcp returned success
Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: init of component tcp returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

  Process 1 ([[10782,1],6]) is on host: tik34x
  Process 2 ([[10782,1],0]) is on host: tik33x
  BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
Wed May 16 12:09:45 2012[1,6]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,6]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,6]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,6]<stderr>:[tik34x:3331] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,1]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,1]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,1]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,1]<stderr>:[tik33x:11430] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,2]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,2]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,2]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,2]<stderr>:[tik33x:11449] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,3]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,3]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,3]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,3]<stderr>:[tik33x:11452] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,5]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,5]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,5]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,5]<stderr>:[tik34x:3333] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 3270 on node tik34x.ethz.ch exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[tik33x:11303] 4 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[tik33x:11303] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[tik33x:11303] 4 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
-----------------FINISHED------------------
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
--- End Message ---
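Since it is hard to see from the raw btl_base_verbose output which rank ended up with which component, here is a minimal sketch that tallies the "init of component ... returned ..." entries per rank and host. It is an illustration only, not part of Open MPI: the function name summarize_btl_init and the regular expression are my own assumptions, written against the --tag-output/--timestamp-output line format seen in the log above. It tolerates entries that wrap across lines.

```python
import re
from collections import defaultdict

# Matches entries such as:
#   2012[1,5]<stddiag>:[tik34x:03333] select: init of component tcp returned success
# \s+ is used between words because the archived log wraps lines mid-entry.
# This pattern is an assumption based on the log format above, not an Open MPI API.
INIT_RE = re.compile(
    r"\[\d+,(?P<rank>\d+)\]<stddiag>:\[(?P<host>[^:\]]+):\d+\]\s+"
    r"select:\s+init\s+of\s+component\s+(?P<btl>\w+)\s+returned\s+(?P<status>\w+)"
)


def summarize_btl_init(log_text):
    """Return {(rank, host): {btl_name: status}} for each 'init of component' entry."""
    summary = defaultdict(dict)
    for m in INIT_RE.finditer(log_text):
        summary[(int(m["rank"]), m["host"])][m["btl"]] = m["status"]
    return dict(summary)


if __name__ == "__main__":
    # Two entries copied from the log above; the second wraps across a line break.
    sample = (
        "Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of "
        "component self returned success "
        "Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of\n"
        "component tcp returned success"
    )
    for (rank, host), btls in sorted(summarize_btl_init(sample).items()):
        print(f"rank {rank} on {host}: {btls}")
```

In the log above every rank reports success for self, sm and tcp, so a summary like this mainly shows that the components initialized on both hosts. A successful init only says the component loaded; it does not say that the two hosts can actually reach each other over it.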
command line n-1<11200> ssi:boot:send_lamd: getting node ID from command line n-1<11200> ssi:boot:send_lamd: connecting to 129.132.67.174:41544, node id 0 n-1<11170> ssi:boot:base:server: got connection from 129.132.67.174 n-1<11170> ssi:boot:base:server: this connection is expected (n0) n-1<11200> ssi:boot:send_lamd: sending dli_port 41973 n-1<11170> ssi:boot:base:server: remote lamd is at 129.132.67.174:41973 n-1<11170> ssi:boot:base:linear: booting n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:rsh: starting lamd on (tik34x.ethz.ch) n-1<11170> ssi:boot:rsh: starting on n1 (tik34x.ethz.ch): hboot -t -c lam-conf.lamd -d -v -s -I "-H tik33x.ethz.ch -P 41544 -n 1 -o 0" n-1<11170> ssi:boot:rsh: launching remotely n-1<11170> ssi:boot:rsh: attempting to execute: /usr/bin/rsh tik34x.ethz.ch -n -l raid 'echo $SHELL' n-1<11170> ssi:boot:rsh: remote shell /bin/tcsh n-1<11170> ssi:boot:rsh: attempting to execute: /usr/bin/rsh tik34x.ethz.ch -n -l raid hboot -t -c lam-conf.lamd -d -v -s -I '"-H tik33x.ethz.ch -P 41544 -n 1 -o 0"' tkill: setting prefix to (null) tkill: setting suffix to (null) LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University tkill: got killname back: /tmp/lam-raid@tik34x/lam-killfile tkill: f_kill = "/tmp/lam-raid@tik34x/lam-killfile" tkill: nothing to kill: "/tmp/lam-raid@tik34x/lam-killfile" hboot: performing tkill hboot: tkill -d hboot: booting... 
hboot: fork /usr/bin/lamd [1] 3246 lamd -H tik33x.ethz.ch -P 41544 -n 1 -o 0 -d n-1<11170> ssi:boot:rsh: successfully launched on n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:base:server: expecting connection from finite list n-1<11170> ssi:boot:base:server: got connection from 129.132.67.175 n-1<11170> ssi:boot:base:server: this connection is expected (n1) n-1<11170> ssi:boot:base:server: remote lamd is at 129.132.67.175:37523 n-1<11170> ssi:boot:base:server: closing server socket n-1<11170> ssi:boot:base:server: connecting to lamd at 129.132.67.174:51093 n-1<11170> ssi:boot:base:server: connected n-1<11170> ssi:boot:base:server: sending number of links (2) n-1<11170> ssi:boot:base:server: sending info: n0 (tik33x.ethz.ch) n-1<11170> ssi:boot:base:server: sending info: n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:base:server: finished sending n-1<11170> ssi:boot:base:server: disconnected from 129.132.67.174:51093 n-1<11170> ssi:boot:base:server: connecting to lamd at 129.132.67.175:53828 n-1<11170> ssi:boot:base:server: connected n-1<11170> ssi:boot:base:server: sending number of links (2) n-1<11170> ssi:boot:base:server: sending info: n0 (tik33x.ethz.ch) n-1<11170> ssi:boot:base:server: sending info: n1 (tik34x.ethz.ch) n-1<11170> ssi:boot:base:server: finished sending n-1<11170> ssi:boot:base:server: disconnected from 129.132.67.175:53828 n-1<11170> ssi:boot:base:linear: finished n-1<11170> ssi:boot:rsh: all RTE procs started n-1<11170> ssi:boot:rsh: finalizing n-1<11170> ssi:boot: Closing n-1<11200> ssi:boot:rsh: finalizing n-1<11200> ssi:boot: Closing rm: cannot remove `setuplog.*': No such file or directory ======================== JOB MAP ======================== Data for node: Name: tik33x Num procs: 4 Process OMPI jobid: [10782,1] Process rank: 0 Process OMPI jobid: [10782,1] Process rank: 1 Process OMPI jobid: [10782,1] Process rank: 2 Process OMPI jobid: [10782,1] Process rank: 3 Data for node: Name: tik34x.ethz.ch Num procs: 3 Process OMPI jobid: [10782,1] 
Process rank: 4 Process OMPI jobid: [10782,1] Process rank: 5 Process OMPI jobid: [10782,1] Process rank: 6 ============================================================= mpirun pid: 11303 Wed May 16 12:09:45 2012[1,0]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],0] to slot_list 0:0 Wed May 16 12:09:45 2012[1,1]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],1] to slot_list 0:1 Wed May 16 12:09:45 2012[1,2]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],2] to slot_list 1:0 Wed May 16 12:09:45 2012[1,3]<stderr>:[tik33x:11303] [[10782,0],0] odls:default:fork binding child [[10782,1],3] to slot_list 1:1 Wed May 16 12:09:45 2012[1,0]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,1]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,2]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,4]<stdout>:Running DAL on tik34x Wed May 16 12:09:45 2012[1,5]<stdout>:Running DAL on tik34x Wed May 16 12:09:45 2012[1,6]<stdout>:Running DAL on tik34x Wed May 16 12:09:45 2012[1,3]<stdout>:Running DAL on tik33x Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component self 
open function successful Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 
2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 
2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: Looking for btl components Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: opening btl components Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: initializing 
btl component self Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: init of component self returned success Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: initializing btl component sm Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: init of component sm returned success Wed May 16 12:09:45 2012[1,1]<stddiag>:[tik33x:11430] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: initializing btl component self Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: init of component self returned success Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: initializing btl component sm Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: init of component sm returned success Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: initializing btl component self Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of component self returned success Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: initializing btl component sm Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of component sm returned success Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: initializing btl component self Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: init of component self returned success Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: initializing btl component sm Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: init of component sm returned success Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: opening btl components Wed May 16 12:09:45 
2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: found loaded component self Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component self has no register function Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component self open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: found loaded component sm Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component sm has no register function Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component sm open function successful Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: found loaded component tcp Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component tcp has no register function Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] mca: base: components_open: component tcp open function successful Wed May 16 12:09:45 
2012[1,1]<stddiag>:[tik33x:11430] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,2]<stddiag>:[tik33x:11449] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,3]<stddiag>:[tik33x:11452] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: initializing btl component self Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of component self returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: initializing btl component sm Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of component sm returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: initializing btl component self Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: init of component self returned success Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: initializing btl component sm Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: init of component sm returned success Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: initializing btl component self Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: init of component self returned success Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: initializing btl component sm Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: init of component sm returned success Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: initializing btl component tcp Wed May 16 12:09:45 2012[1,4]<stddiag>:[tik34x:03332] select: init of component tcp returned success Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] select: init of 
component tcp returned success
Wed May 16 12:09:45 2012[1,5]<stddiag>:[tik34x:03333] select: init of component tcp returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

Process 1 ([[10782,1],6]) is on host: tik34x
Process 2 ([[10782,1],0]) is on host: tik33x
BTLs attempted: self sm tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

PML add procs failed
--> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
Wed May 16 12:09:45 2012[1,6]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,6]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,6]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,6]<stderr>:[tik34x:3331] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,1]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,1]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,1]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,1]<stderr>:[tik33x:11430] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,2]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,2]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,2]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,2]<stderr>:[tik33x:11449] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,3]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,3]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,3]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,3]<stderr>:[tik33x:11452] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
Wed May 16 12:09:45 2012[1,5]<stderr>:*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
Wed May 16 12:09:45 2012[1,5]<stderr>:*** This is disallowed by the MPI standard.
Wed May 16 12:09:45 2012[1,5]<stderr>:*** Your MPI job will now abort.
Wed May 16 12:09:45 2012[1,5]<stderr>:[tik34x:3333] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 3270 on
node tik34x.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[tik33x:11303] 4 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[tik33x:11303] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[tik33x:11303] 4 more processes have sent help message help-mpi-runtime / mpi_init:startup:internal-failure
-----------------FINISHED------------------
LAM 7.1.2/MPI 2 C++/ROMIO - Indiana University
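
Since the interleaved btl_base_verbose output above makes it hard to see which rank on which host initialized which BTL, here is a minimal Python sketch that condenses the "select: init of component ..." records into one entry per rank. It assumes the --tag-output/--timestamp-output line format shown in this log; the names INIT_RE and summarize_btl_init are hypothetical helpers, not part of Open MPI. It only reports component initialization and does not by itself explain the "Unreachable" error.

```python
import re
from collections import defaultdict

# Matches tagged btl_base_verbose records such as:
#   [1,6]<stddiag>:[tik34x:03331] select: init of component tcp returned success
# (format assumed from the log above; hypothetical helper, not an Open MPI tool)
INIT_RE = re.compile(
    r"\[\d+,(\d+)\]<stddiag>:\[([^:\]]+):\d+\] "
    r"select: init of component (\w+) returned (\w+)"
)

def summarize_btl_init(log_text):
    """Return {(rank, host): {component: result}} from tagged mpirun output."""
    summary = defaultdict(dict)
    # findall scans the whole text, so it also works when line breaks were lost.
    for rank, host, component, result in INIT_RE.findall(log_text):
        summary[(int(rank), host)][component] = result
    return dict(summary)

# Three records copied from the log above, joined on one line as in the attachment.
sample = (
    "Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] "
    "select: init of component self returned success "
    "Wed May 16 12:09:45 2012[1,6]<stddiag>:[tik34x:03331] "
    "select: init of component tcp returned success "
    "Wed May 16 12:09:45 2012[1,0]<stddiag>:[tik33x:11429] "
    "select: init of component tcp returned success"
)

for (rank, host), components in sorted(summarize_btl_init(sample).items()):
    print(rank, host, components)
```

To use it on a full run, read the saved mpirun output into a string and pass it to summarize_btl_init in place of sample.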