Hello all,

        I recently got OpenMPI 1.0.2 (rev 9571) compiled and running on a
small EM64T-based cluster.  Everything works fine when running on a single
host, or when running simple commands or test scripts on multiple hosts.  But
when I try to run a major program (cosmomc), I get the following error:


[alis@darwin cosmomc_mpi]$ mpirun  -np 2 cosmomc params.ini
Number of MPI processes:           2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
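
        For reference, the numeric errno values in this output can be turned
into symbolic names.  The snippet below is only a small sketch using Python's
standard errno and os modules; the helper name describe_errno is made up for
illustration.  On Linux, 113 is EHOSTUNREACH ("No route to host") and 104 is
ECONNRESET, but the numbering is platform-specific, so it is best run on the
cluster nodes themselves.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    # errno.errorcode maps numeric values to names such as 'EHOSTUNREACH'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 113 and 104 are the two errno values reported in the mpirun output.
    for code in (113, 104):
        name, message = describe_errno(code)
        print(f"errno={code}: {name} ({message})")
```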


        I do not have more than one network interface (just eth0 and lo), and I
tried the various options suggested in the FAQ for disabling interfaces.  My
machines have only one IP address each.  It does not seem to matter whether I
use short hostnames, fully-qualified hostnames, or IP addresses in the host
list.
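
        One way to check plain TCP reachability between the nodes,
independently of MPI, might be a small probe like the one below.  This is only
a sketch: probe_tcp is a made-up helper name, and it assumes that some test
listener has been started on the other node on a port other than the ssh port.
The idea is that ssh working (which is how orted gets launched) does not by
itself show that other TCP ports are reachable between the hosts.

```python
import errno
import socket
import sys


def probe_tcp(host, port, timeout=3.0):
    """Attempt a plain TCP connect and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # exc.errno may be None (timeout) or a resolver code rather than an errno.
        name = errno.errorcode.get(exc.errno, "no errno")
        return False, f"{name}: {exc}"


if __name__ == "__main__":
    # Usage (hypothetical): python probe.py <host> <port>
    if len(sys.argv) == 3:
        print(probe_tcp(sys.argv[1], int(sys.argv[2])))
```

If the probe reports EHOSTUNREACH on a port where a listener is known to be
running, that would point at something between the hosts (for example packet
filtering) rather than at the hostnames in the host list.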
        Curiously, even though it reports this error, the processes still seem
to start up on the remote machines, though they do not produce output
properly.  The relevant ps lines on the remote machine:

alis      4393  0.0  0.0 37124 2896 ?        S    05:10   0:00 sshd: alis@notty
alis      4394  0.1  0.0 36396 1964 ?        Ss   05:10   0:00 orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0
alis      4411 99.9  0.1 628872 5520 ?       R    05:10   0:14 cosmomc params.ini

        Any suggestions?  A copy of the mpirun output with --debug is
included below.


-----


[alis@darwin cosmomc_mpi]$ mpirun --debug -np 2 cosmomc params.ini
[darwin.phsx.ku.edu:25140] procdir: (null)
[darwin.phsx.ku.edu:25140] jobdir: (null)
[darwin.phsx.ku.edu:25140] unidir: 
/tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-a...@darwin.phsx.ku.edu_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] connect_uni: contact info read
[darwin.phsx.ku.edu:25140] connect_uni: connection not allowed
[darwin.phsx.ku.edu:25140] [0,0,0] setting up session dir with
[darwin.phsx.ku.edu:25140]      tmpdir /tmp
[darwin.phsx.ku.edu:25140]      universe default-universe-25140
[darwin.phsx.ku.edu:25140]      user alis
[darwin.phsx.ku.edu:25140]      host darwin.phsx.ku.edu
[darwin.phsx.ku.edu:25140]      jobid 0
[darwin.phsx.ku.edu:25140]      procid 0
[darwin.phsx.ku.edu:25140] procdir: 
/tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/0/0
[darwin.phsx.ku.edu:25140] jobdir: 
/tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/0
[darwin.phsx.ku.edu:25140] unidir: 
/tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140
[darwin.phsx.ku.edu:25140] top: openmpi-sessions-a...@darwin.phsx.ku.edu_0
[darwin.phsx.ku.edu:25140] tmp: /tmp
[darwin.phsx.ku.edu:25140] [0,0,0] contact_file 
/tmp/openmpi-sessions-a...@darwin.phsx.ku.edu_0/default-universe-25140/universe-setup.txt
[darwin.phsx.ku.edu:25140] [0,0,0] wrote setup file
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x1)
[darwin.phsx.ku.edu:25140] pls:rsh: local csh: 0, local bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: assuming same remote shell as local shell
[darwin.phsx.ku.edu:25140] pls:rsh: remote csh: 0, remote bash: 1
[darwin.phsx.ku.edu:25140] pls:rsh: final template argv:
[darwin.phsx.ku.edu:25140] pls:rsh:     /usr/bin/ssh <template> orted --debug 
--bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename 
<template> --universe a...@darwin.phsx.ku.edu:default-universe-25140 
--nsreplica "0.0.0;tcp://129.237.98.242:37853" --gprreplica 
"0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.242
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting 
mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.242 is a LOCAL node
[darwin.phsx.ku.edu:25140] pls:rsh: changing to directory /home/alis
[darwin.phsx.ku.edu:25140] pls:rsh: executing: orted --debug --bootproxy 1 
--name 0.0.1 --num_procs 3 --vpid_start 0 --nodename 129.237.98.242 --universe 
a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica 
"0.0.0;tcp://129.237.98.242:37853" --gprreplica 
"0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[darwin.phsx.ku.edu:25141] [0,0,1] setting up session dir with
[darwin.phsx.ku.edu:25141]      universe default-universe-25140
[darwin.phsx.ku.edu:25141]      user alis
[darwin.phsx.ku.edu:25141]      host 129.237.98.242
[darwin.phsx.ku.edu:25141]      jobid 0
[darwin.phsx.ku.edu:25141]      procid 1
[darwin.phsx.ku.edu:25141] procdir: 
/tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/0/1
[darwin.phsx.ku.edu:25141] jobdir: 
/tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/0
[darwin.phsx.ku.edu:25141] unidir: 
/tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25141] top: openmpi-sessions-alis@129.237.98.242_0
[darwin.phsx.ku.edu:25141] tmp: /tmp
[darwin.phsx.ku.edu:25140] pls:rsh: launching on node 129.237.98.243
[darwin.phsx.ku.edu:25140] pls:rsh: not oversubscribed -- setting 
mpi_yield_when_idle to 0
[darwin.phsx.ku.edu:25140] pls:rsh: 129.237.98.243 is a REMOTE node
[darwin.phsx.ku.edu:25140] pls:rsh: executing: /usr/bin/ssh 129.237.98.243 
orted --debug --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 
--nodename 129.237.98.243 --universe 
a...@darwin.phsx.ku.edu:default-universe-25140 --nsreplica 
"0.0.0;tcp://129.237.98.242:37853" --gprreplica 
"0.0.0;tcp://129.237.98.242:37853" --mpi-call-yield 0
[fisher.phsx.ku.edu:04445] [0,0,2] setting up session dir with
[fisher.phsx.ku.edu:04445]      universe default-universe-25140
[fisher.phsx.ku.edu:04445]      user alis
[fisher.phsx.ku.edu:04445]      host 129.237.98.243
[fisher.phsx.ku.edu:04445]      jobid 0
[fisher.phsx.ku.edu:04445]      procid 2
[fisher.phsx.ku.edu:04445] procdir: 
/tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/0/2
[fisher.phsx.ku.edu:04445] jobdir: 
/tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/0
[fisher.phsx.ku.edu:04445] unidir: 
/tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04445] top: openmpi-sessions-alis@129.237.98.243_0
[fisher.phsx.ku.edu:04445] tmp: /tmp
[darwin.phsx.ku.edu:25143] [0,1,0] setting up session dir with
[darwin.phsx.ku.edu:25143]      universe default-universe-25140
[darwin.phsx.ku.edu:25143]      user alis
[darwin.phsx.ku.edu:25143]      host 129.237.98.242
[darwin.phsx.ku.edu:25143]      jobid 1
[darwin.phsx.ku.edu:25143]      procid 0
[darwin.phsx.ku.edu:25143] procdir: 
/tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/1/0
[darwin.phsx.ku.edu:25143] jobdir: 
/tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140/1
[darwin.phsx.ku.edu:25143] unidir: 
/tmp/openmpi-sessions-alis@129.237.98.242_0/default-universe-25140
[darwin.phsx.ku.edu:25143] top: openmpi-sessions-alis@129.237.98.242_0
[darwin.phsx.ku.edu:25143] tmp: /tmp
[fisher.phsx.ku.edu:04462] [0,1,1] setting up session dir with
[fisher.phsx.ku.edu:04462]      universe default-universe-25140
[fisher.phsx.ku.edu:04462]      user alis
[fisher.phsx.ku.edu:04462]      host 129.237.98.243
[fisher.phsx.ku.edu:04462]      jobid 1
[fisher.phsx.ku.edu:04462]      procid 1
[fisher.phsx.ku.edu:04462] procdir: 
/tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/1/1
[fisher.phsx.ku.edu:04462] jobdir: 
/tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140/1
[fisher.phsx.ku.edu:04462] unidir: 
/tmp/openmpi-sessions-alis@129.237.98.243_0/default-universe-25140
[fisher.phsx.ku.edu:04462] top: openmpi-sessions-alis@129.237.98.243_0
[fisher.phsx.ku.edu:04462] tmp: /tmp
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x3)
[darwin.phsx.ku.edu:25140] Info: Setting up debugger process table for 
applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
    (i, host, exe, pid) = (0, 129.237.98.243, cosmomc, 4462)
    (i, host, exe, pid) = (1, 129.237.98.242, cosmomc, 25143)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5453392)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0x4)
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 5389856)
[darwin.phsx.ku.edu:25143] [0,1,0] ompi_mpi_init completed
[fisher.phsx.ku.edu:04462] [0,1,1] ompi_mpi_init completed
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5449344)
[fisher.phsx.ku.edu:04445] orted: job_state_callback(jobid = 1, state = 5379136)
 Number of MPI processes:           2
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

---
At this point I have to kill the proc with Ctrl-C.
---

[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - 
deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: univ session dir not empty - 
leaving
Killed by signal 2.
[darwin.phsx.ku.edu:25140] sess_dir_finalize: proc session dir not empty - 
leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_ABORTED)
[darwin.phsx.ku.edu:25140] spawn: in job_state_callback(jobid = 1, state = 0xa)
[darwin.phsx.ku.edu:25140] ERROR: A daemon on node 129.237.98.243 failed to 
start as expected.
[darwin.phsx.ku.edu:25140] ERROR: There may be more information available from
[darwin.phsx.ku.edu:25140] ERROR: the remote shell (see above).
[darwin.phsx.ku.edu:25140] ERROR: The daemon exited unexpectedly with status 
255.
mpirun: killing job...
[darwin.phsx.ku.edu:25140] [0,0,0]-[0,0,2] mca_oob_tcp_msg_send_handler: writev 
failed with errno=104
[darwin.phsx.ku.edu:25140] [0,0,0] ORTE_ERROR_LOG: Connection failed in file 
pls_base_proxy.c at line 140
forrtl: error (69): process interrupted (SIGINT)
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: darwin.phsx.ku.edu
PID:  25143

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: darwin.phsx.ku.edu
PID:  25143

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: A process refused to die!

Host: darwin.phsx.ku.edu
PID:  25143

This process may still be running and/or consuming resources.
--------------------------------------------------------------------------
[darwin.phsx.ku.edu:25141] sess_dir_finalize: proc session dir not empty - 
leaving
[darwin.phsx.ku.edu:25141] orted: job_state_callback(jobid = 1, state = 
ORTE_PROC_STATE_TERMINATED)
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found proc session dir empty - 
deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found job session dir empty - 
deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: found univ session dir empty - 
deleting
[darwin.phsx.ku.edu:25141] sess_dir_finalize: top session dir not empty - 
leaving
