Hello all,

I recently got OpenMPI 1.0.2 (rev 9571) compiled and running on a small EM64T-based cluster. Everything works fine when running on a single host, or when running simple commands or test scripts on multiple hosts. But when I try to run a major program (cosmomc), I get the following error:
[alis@darwin cosmomc_mpi]$ mpirun -np 2 cosmomc params.ini
Number of MPI processes: 2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

I do not have more than one network interface (just eth0 and lo), and I tried the various options suggested in the FAQ for disabling interfaces. My machines have only one IP address each. It does not seem to matter whether I use short hostnames, fully qualified hostnames, or IP addresses in the host list.

Curiously, even though it reports this error, the processes still seem to start up on the remote machines, though they do not produce output properly. The relevant ps lines on the remote machine:

alis 4393 0.0 0.0 37124 2896 ? S 05:10 0:00 sshd: alis@notty
alis 4394 0.1 0.0 36396 1964 ? Ss 05:10 0:00 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
alis 4411 99.9 0.1 628872 5520 ? R 05:10 0:14 cosmomc params.ini

Any suggestions? A copy of the mpirun output with --debug is included below.
-----
[alis@darwin cosmomc_mpi]$ mpirun --debug -np 2 cosmomc params.ini
[darwin.phsx.ku.edu:25140] procdir: (null)
[darwin.phsx.ku.edu:25140] jobdir: (null)
[darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-a...@darwin.phsx.ku.edu_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] connect_uni: contact info read
[darwin.phsx.ku.edu:25140] connect_uni: connection not allowed
[darwin.phsx.ku.edu:25140] [0,0,0] setting up session dir with
[darwin.phsx.ku.edu:25140] tmpdir /tmp
[darwin.phsx.ku.edu:25140] universe default-universe-25140
[darwin.phsx.ku.edu:25140] user alis
[darwin.phsx.ku.edu:25140] host darwin.phsx.ku.edu
[darwin.phsx.ku.edu:25140] jobid 0
[darwin.phsx.ku.edu:25140] procid 0
[darwin.phsx.ku.edu:25140] procdir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/0/0
[darwin.phsx.ku.edu:25140] jobdir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/0
[darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-a...@darwin.phsx.ku.edu_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] [0,0,0] contact_file /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/universe-setup.txt
[darwin.phsx.ku.edu:25140] [0,0,0] wrote setup file
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x1)
[darwin.phsx.ku.edu:25140] pls:rsh: local csh: 0, local bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: assuming same remote shell as local shell
[darwin.phsx.ku.edu:25140] pls:rsh: remote csh: 0, remote bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: final template argv:
[darwin.phsx.ku.edu:25140] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.242
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.242 is a LOCAL node
[darwin.phsx.ku.edu:25140] pls:rsh: changing to directory /home/alis
[darwin.phsx.ku.edu:25140] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename 129.237.98.242 --universe a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25141] [0,0,1] setting up session dir with
[darwin.phsx.ku.edu:25141] universe default-universe-25140
[darwin.phsx.ku.edu:25141] user alis
[darwin.phsx.ku.edu:25141] host 129.237.98.242
[darwin.phsx.ku.edu:25141] jobid 0
[darwin.phsx.ku.edu:25141] procid 1
[darwin.phsx.ku.edu:25141] procdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/0/1
[darwin.phsx.ku.edu:25141] jobdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/0
[darwin.phsx.ku.edu:25141] unidir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25141] top: openmpi-sessions-alis@129.237.98.242_0
[darwin.phsx.ku.edu:25141] tmp: /tmp
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.243
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.243 is a REMOTE node
[darwin.phsx.ku.edu:25140] pls:rsh: executing: /usr/bin/ssh 129.237.98.243 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename 129.237.98.243 --universe a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[fisher.phsx.ku.edu:04445] [0,0,2] setting up session dir with
[fisher.phsx.ku.edu:04445] universe default-universe-25140
[fisher.phsx.ku.edu:04445] user alis
[fisher.phsx.ku.edu:04445] host 129.237.98.243
[fisher.phsx.ku.edu:04445] jobid 0
[fisher.phsx.ku.edu:04445] procid 2
[fisher.phsx.ku.edu:04445] procdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/0/2
[fisher.phsx.ku.edu:04445] jobdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/0
[fisher.phsx.ku.edu:04445] unidir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04445] top: openmpi-sessions-alis@129.237.98.243_0
[fisher.phsx.ku.edu:04445] tmp: /tmp
[darwin.phsx.ku.edu:25143] [0,1,0] setting up session dir with
[darwin.phsx.ku.edu:25143] universe default-universe-25140
[darwin.phsx.ku.edu:25143] user alis
[darwin.phsx.ku.edu:25143] host 129.237.98.242
[darwin.phsx.ku.edu:25143] jobid 1
[darwin.phsx.ku.edu:25143] procid 0
[darwin.phsx.ku.edu:25143] procdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/1/0
[darwin.phsx.ku.edu:25143] jobdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/1
[darwin.phsx.ku.edu:25143] unidir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25143] top: openmpi-sessions-alis@129.237.98.242_0
[darwin.phsx.ku.edu:25143] tmp: /tmp
[fisher.phsx.ku.edu:04462] [0,1,1] setting up session dir with
[fisher.phsx.ku.edu:04462] universe default-universe-25140
[fisher.phsx.ku.edu:04462] user alis
[fisher.phsx.ku.edu:04462] host 129.237.98.243
[fisher.phsx.ku.edu:04462] jobid 1
[fisher.phsx.ku.edu:04462] procid 1
[fisher.phsx.ku.edu:04462] procdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/1/1
[fisher.phsx.ku.edu:04462] jobdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/1
[fisher.phsx.ku.edu:04462] unidir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04462] top: openmpi-sessions-alis@129.237.98.243_0
[fisher.phsx.ku.edu:04462] tmp: /tmp
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x3)
[darwin.phsx.ku.edu:25140] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, 129.237.98.243, cosmomc, 4462)
    (i, host, exe, pid) = (1, 129.237.98.242, cosmomc, 25143)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5453392)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x4)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5389856)
[darwin.phsx.ku.edu:25143] [0,1,0] ompi_mpi_init completed
[fisher.phsx.ku.edu:04462] [0,1,1] ompi_mpi_init completed
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5449344)
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5379136)
Number of MPI processes: 2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

--- At this point I have to kill the proc with Ctrl-C. ---

[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: univ session dir not empty - leaving
Killed by signal 2.
[darwin.phsx.ku.edu:25140] sess_dir_finalize: proc session dir not empty - leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0xa)
[darwin.phsx.ku.edu:25140] ERROR: A daemon on node 129.237.98.243 failed to start as expected.
[darwin.phsx.ku.edu:25140] ERROR: There may be more information available from
[darwin.phsx.ku.edu:25140] ERROR: the remote shell (see above).
[darwin.phsx.ku.edu:25140] ERROR: The daemon exited unexpectedly with status 255.
mpirun: killing job...
[darwin.phsx.ku.edu:25140] [0,0,0]-[0,0,2] mca_oob_tcp_msg_send_handler: writev failed with errno=104
[darwin.phsx.ku.edu:25140] [0,0,0] ORTE_ERROR_LOG: Connection failed in file pls_base_proxy.c at line 140
forrtl: error (69): process interrupted (SIGINT)
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: darwin.phsx.ku.edu
PID: 25143

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: darwin.phsx.ku.edu
PID: 25143

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: darwin.phsx.ku.edu
PID: 25143

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[darwin.phsx.ku.edu:25141] sess_dir_finalize: proc session dir not empty - leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found proc session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found univ session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: top session dir not empty - leaving
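
-----
For anyone reading the output above: the two errno values it reports (113 from connect() and 104 from writev) are raw numbers, and their meaning depends on the platform. On Linux, 113 is EHOSTUNREACH ("No route to host") and 104 is ECONNRESET ("Connection reset by peer"). A minimal sketch for checking this on the machines themselves, assuming Python is available there; describe_errno is a hypothetical helper name, not part of Open MPI:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform.

    The numeric values are platform-specific, so run this on the same
    machine (or at least the same OS) that produced the error.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 113 and 104 are the values reported in the mpirun output above.
    for code in (113, 104):
        name, message = describe_errno(code)
        print(f"errno={code}: {name} ({message})")
```

Running the script on both darwin and fisher should confirm that the two hosts agree on what these numbers mean.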