Hello all,

I recently got OpenMPI 1.0.2 (rev 9571) compiled and running on a small
EM64T-based cluster. Everything works fine when running on a single
host, or when running simple commands or test scripts on multiple hosts.
But when I try to run a major program (cosmomc), I get the following
error:

[alis@darwin cosmomc_mpi]$ mpirun -np 2 cosmomc params.ini
Number of MPI processes: 2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
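
For reference, the errno values in this output can be translated into symbolic names. The sketch below assumes the usual Linux numbering, under which 113 is EHOSTUNREACH ("No route to host") and 104, which shows up later in the --debug output, is ECONNRESET. The table is hard-coded so the result does not depend on the machine running the script, and the helper name describe_errno is made up for illustration.

```python
import os

# Linux errno numbers seen in the output above. Hard-coded assumption:
# other operating systems number these errors differently.
LINUX_ERRNO = {
    104: ("ECONNRESET", "Connection reset by peer"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def describe_errno(code):
    """Return 'NAME (message)' for a known Linux errno, else a fallback."""
    if code in LINUX_ERRNO:
        name, message = LINUX_ERRNO[code]
        return f"{name} ({message})"
    # Codes outside the table fall back to the local platform's message.
    return f"errno {code} ({os.strerror(code)})"


if __name__ == "__main__":
    for code in (113, 104):
        print(code, "->", describe_errno(code))
```
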
I do not have more than one network interface (just eth0 and lo), and I
tried the various options suggested in the FAQ for disabling interfaces.
My machines have only one IP address each. It does not seem to matter
whether I use unqualified hostnames, fully-qualified hostnames, or IP
addresses in the host list.
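
One way to narrow this down independently of Open MPI might be to check whether an arbitrary TCP port on one node can be reached from the other. The following is a minimal sketch, assuming Python is available on the nodes; the function name probe and the script name probe.py are placeholders, not part of Open MPI. The reasoning behind it is a guess rather than a diagnosis: a host firewall that permits ssh but rejects other ports could plausibly produce "No route to host" even with a single interface.

```python
import socket
import sys


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and describe the outcome as a string."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((host, port))
        except socket.timeout:
            # No answer at all within the timeout (e.g. packets dropped).
            return "timed out"
        except OSError as exc:
            # Covers refused connections, unreachable hosts, and so on.
            return f"failed: errno={exc.errno} ({exc.strerror})"
        return "connected"


if __name__ == "__main__":
    # Usage: python probe.py <host> <port>
    if len(sys.argv) == 3:
        print(probe(sys.argv[1], int(sys.argv[2])))
    else:
        print("usage: python probe.py <host> <port>")
```

If nothing is listening on the chosen port, a "connection refused" result would still indicate that the host itself is reachable, whereas a "No route to host" result would reproduce the failure outside of MPI.
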

Curiously, even though it reports this error, the processes still seem
to start up on the remote machines, though they do not produce output
properly. The relevant ps lines on the remote machine:

alis 4393 0.0 0.0 37124 2896 ? S 05:10 0:00 sshd: alis@notty
alis 4394 0.1 0.0 36396 1964 ? Ss 05:10 0:00 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
alis 4411 99.9 0.1 628872 5520 ? R 05:10 0:14 cosmomc params.ini

Any suggestions? A copy of the mpirun output with --debug is included
below.
-----
[alis@darwin cosmomc_mpi]$ mpirun --debug -np 2 cosmomc params.ini
[darwin.phsx.ku.edu:25140] procdir: (null)
[darwin.phsx.ku.edu:25140] jobdir: (null)
[darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-a...@darwin.phsx.ku.edu_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] connect_uni: contact info read
[darwin.phsx.ku.edu:25140] connect_uni: connection not allowed
[darwin.phsx.ku.edu:25140] [0,0,0] setting up session dir with
[darwin.phsx.ku.edu:25140] tmpdir /tmp
[darwin.phsx.ku.edu:25140] universe default-universe-25140
[darwin.phsx.ku.edu:25140] user alis
[darwin.phsx.ku.edu:25140] host darwin.phsx.ku.edu
[darwin.phsx.ku.edu:25140] jobid 0
[darwin.phsx.ku.edu:25140] procid 0
[darwin.phsx.ku.edu:25140] procdir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/0/0
[darwin.phsx.ku.edu:25140] jobdir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/0
[darwin.phsx.ku.edu:25140] unidir: /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-a...@darwin.phsx.ku.edu_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] [0,0,0] contact_file /tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/universe-setup.txt
[darwin.phsx.ku.edu:25140] [0,0,0] wrote setup file
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x1)
[darwin.phsx.ku.edu:25140] pls:rsh: local csh: 0, local bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: assuming same remote shell as local shell
[darwin.phsx.ku.edu:25140] pls:rsh: remote csh: 0, remote bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: final template argv:
[darwin.phsx.ku.edu:25140] pls:rsh: /usr/bin/ssh <template> orted --debug --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.242
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.242 is a LOCAL node
[darwin.phsx.ku.edu:25140] pls:rsh: changing to directory /home/alis
[darwin.phsx.ku.edu:25140] pls:rsh: executing: orted --debug --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename 129.237.98.242 --universe a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25141] [0,0,1] setting up session dir with
[darwin.phsx.ku.edu:25141] universe default-universe-25140
[darwin.phsx.ku.edu:25141] user alis
[darwin.phsx.ku.edu:25141] host 129.237.98.242
[darwin.phsx.ku.edu:25141] jobid 0
[darwin.phsx.ku.edu:25141] procid 1
[darwin.phsx.ku.edu:25141] procdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/0/1
[darwin.phsx.ku.edu:25141] jobdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/0
[darwin.phsx.ku.edu:25141] unidir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25141] top: openmpi-sessions-alis@129.237.98.242_0
[darwin.phsx.ku.edu:25141] tmp: /tmp
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.243
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.243 is a REMOTE node
[darwin.phsx.ku.edu:25140] pls:rsh: executing: /usr/bin/ssh 129.237.98.243 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename 129.237.98.243 --universe a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica "0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[fisher.phsx.ku.edu:04445] [0,0,2] setting up session dir with
[fisher.phsx.ku.edu:04445] universe default-universe-25140
[fisher.phsx.ku.edu:04445] user alis
[fisher.phsx.ku.edu:04445] host 129.237.98.243
[fisher.phsx.ku.edu:04445] jobid 0
[fisher.phsx.ku.edu:04445] procid 2
[fisher.phsx.ku.edu:04445] procdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/0/2
[fisher.phsx.ku.edu:04445] jobdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/0
[fisher.phsx.ku.edu:04445] unidir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04445] top: openmpi-sessions-alis@129.237.98.243_0
[fisher.phsx.ku.edu:04445] tmp: /tmp
[darwin.phsx.ku.edu:25143] [0,1,0] setting up session dir with
[darwin.phsx.ku.edu:25143] universe default-universe-25140
[darwin.phsx.ku.edu:25143] user alis
[darwin.phsx.ku.edu:25143] host 129.237.98.242
[darwin.phsx.ku.edu:25143] jobid 1
[darwin.phsx.ku.edu:25143] procid 0
[darwin.phsx.ku.edu:25143] procdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/1/0
[darwin.phsx.ku.edu:25143] jobdir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/1
[darwin.phsx.ku.edu:25143] unidir: /tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25143] top: openmpi-sessions-alis@129.237.98.242_0
[darwin.phsx.ku.edu:25143] tmp: /tmp
[fisher.phsx.ku.edu:04462] [0,1,1] setting up session dir with
[fisher.phsx.ku.edu:04462] universe default-universe-25140
[fisher.phsx.ku.edu:04462] user alis
[fisher.phsx.ku.edu:04462] host 129.237.98.243
[fisher.phsx.ku.edu:04462] jobid 1
[fisher.phsx.ku.edu:04462] procid 1
[fisher.phsx.ku.edu:04462] procdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/1/1
[fisher.phsx.ku.edu:04462] jobdir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/1
[fisher.phsx.ku.edu:04462] unidir: /tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04462] top: openmpi-sessions-alis@129.237.98.243_0
[fisher.phsx.ku.edu:04462] tmp: /tmp
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x3)
[darwin.phsx.ku.edu:25140] Info: Setting up debugger process table for applications
MPIR_being_debugged = 0
MPIR_debug_gate = 0
MPIR_debug_state = 1
MPIR_acquired_pre_main = 0
MPIR_i_am_starter = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, 129.237.98.243, cosmomc, 4462)
(i, host, exe, pid) = (1, 129.237.98.242, cosmomc, 25143)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5453392)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x4)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5389856)
[darwin.phsx.ku.edu:25143] [0,1,0] ompi_mpi_init completed
[fisher.phsx.ku.edu:04462] [0,1,1] ompi_mpi_init completed
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5449344)
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5379136)
Number of MPI processes: 2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
---
At this point I have to kill the process with Ctrl-C.
---
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: univ session dir not empty - leaving
Killed by signal 2.
[darwin.phsx.ku.edu:25140] sess_dir_finalize: proc session dir not empty - leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_ABORTED)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0xa)
[darwin.phsx.ku.edu:25140] ERROR: A daemon on node 129.237.98.243 failed to start as expected.
[darwin.phsx.ku.edu:25140] ERROR: There may be more information available from
[darwin.phsx.ku.edu:25140] ERROR: the remote shell (see above).
[darwin.phsx.ku.edu:25140] ERROR: The daemon exited unexpectedly with status 255.
mpirun: killing job...
[darwin.phsx.ku.edu:25140] [0,0,0]-[0,0,2] mca_oob_tcp_msg_send_handler: writev failed with errno=104
[darwin.phsx.ku.edu:25140] [0,0,0] ORTE_ERROR_LOG: Connection failed in file pls_base_proxy.c at line 140
forrtl: error (69): process interrupted (SIGINT)
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: darwin.phsx.ku.edu
PID: 25143
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: darwin.phsx.ku.edu
PID: 25143
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!
Host: darwin.phsx.ku.edu
PID: 25143
This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[darwin.phsx.ku.edu:25141] sess_dir_finalize: proc session dir not empty - leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = ORTE_PROC_STATE_TERMINATED)
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found proc session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found univ session dir empty - deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: top session dir not empty - leaving