Interestingly, I can start mpirun from any of the remote machines
and run processes on other remote machines and on the local machine.
But from the local machine I cannot start a process on a remote machine;
it just shows the behavior detailed in the previous mail.

remote1 -> remote1 ok
remote1 -> remote2 ok
remote1 -> local      ok

remote2 -> remote1 ok
remote2 -> remote2 ok
remote2 -> local      ok

local      -> local      ok
local      -> remote1 fails
local      -> remote2 fails
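
For reference, one row group of the table above can be reproduced from a single machine with a small sh sketch like the following. It assumes mpirun is on the PATH and that timeout(1) from coreutils is available. LAUNCHER is a hypothetical hook of this sketch (not an Open MPI setting) that lets the loop be exercised without a cluster, and the host names are whatever is passed in.

```shell
# probe_hosts SOURCE_LABEL TARGET...
# Tries to launch one trivial process (hostname) on each target host and
# prints a "source -> target ok/fails" row per target.
probe_hosts() {
    src=$1
    shift
    for target in "$@"; do
        # timeout(1) turns a hang into a failure after 20 seconds.
        # LAUNCHER (hypothetical) replaces the default launch command if set.
        if ${LAUNCHER:-timeout 20 mpirun -np 1 --host} "$target" hostname >/dev/null 2>&1; then
            echo "$src -> $target ok"
        else
            echo "$src -> $target fails"
        fi
    done
}
```

A call such as `probe_hosts local remote1 remote2`, run on each machine in turn, would then fill in the table row by row.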

My remote machines are freshly updated Gentoo machines (AMD),
my local machine is a freshly installed Fedora 8 (Intel Quadro).
All use a freshly installed Open MPI 1.2.5.
Before my Fedora machine crashed it ran Fedora 6,
and everything worked great (with 1.2.2 on all machines).

Does anybody have a suggestion where I should look?

Thanks
  Jody


On Tue, Jun 10, 2008 at 12:59 PM, jody <jody....@gmail.com> wrote:
> Hi
> After a crash I reinstalled Open MPI 1.2.5 on my machines,
> used
>  ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default
> and set PATH and LD_LIBRARY_PATH in .bashrc:
>  PATH=/opt/openmpi/bin:$PATH
>  export PATH
>  LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
>  export LD_LIBRARY_PATH
>
> First problem:
>  ssh nano_00 printenv
> does not contain the correct paths (and no LD_LIBRARY_PATH at all),
> but with a normal ssh-login the two are set correctly.
>
> When I run a test application on one computer, it works.
>
> As soon as an additional computer is involved, there is no more output,
> and everything just hangs.
>
> Adding the prefix doesn't change anything, even though Open MPI is
> installed in the same directory (/opt/openmpi) on every computer.
>
> The --debug-daemons option doesn't help very much:
>
> $ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
> Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch
>
> (and after that nothing more happens)
>
> On the remote host, I see the following three processes coming up
> after I start mpirun on the local machine:
> 30603 ?        S      0:00 sshd: jody@notty
> 30604 ?        Ss     0:00 bash -c  PATH=/opt/openmpi/bin:$PATH ;
> export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons
> --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
> 30605 ?        S      0:00 /opt/openmpi/bin/orted --debug-daemons
> --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename
> nano_00 --universe j...@aim-plankton.uzh.ch:default-universe-14934
> --nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl
>
> So it looks as if the correct paths are set (probably the doing of
> --enable-mpirun-prefix-by-default).
>
> If I interrupt mpirun on the local machine (Ctrl-C):
>
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1166
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> errmgr_hnp.c at line 90
> [aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start
> as expected.
> [aim-plankton:14982] ERROR: There may be more information available from
> [aim-plankton:14982] ERROR: the remote shell (see above).
> [aim-plankton:14982] ERROR: The daemon exited unexpectedly with status 255.
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1166
> --------------------------------------------------------------------------
> WARNING: mpirun has exited before it received notification that all
> started processes had terminated.  You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
> [aim-plankton:14983] OOB: Connection to HNP lost
>
> On the remote machine, the "sshd: jody@notty" process is gone, but the
> other two stay.
> I would be grateful for any suggestions!
>
> Jody
>
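
The environment mismatch described in the quoted message (the output of "ssh nano_00 printenv" versus a normal ssh login) can be checked with a small sh sketch like the one below. It only inspects printenv-style text on stdin. The function name check_env is made up for this sketch, and the two directories are the ones from the .bashrc lines quoted above.

```shell
# check_env reads printenv-style lines (NAME=value) on stdin and reports
# whether the Open MPI directories appear in PATH and LD_LIBRARY_PATH.
check_env() {
    env_dump=$(cat)
    for pair in PATH:/opt/openmpi/bin LD_LIBRARY_PATH:/opt/openmpi/lib; do
        var=${pair%%:*}   # variable name to look up
        dir=${pair#*:}    # directory expected in that variable
        # Extract the value of the variable (empty if it is not set at all).
        value=$(printf '%s\n' "$env_dump" | sed -n "s/^$var=//p")
        case ":$value:" in
            *":$dir:"*) echo "$var ok" ;;
            *)          echo "$var missing $dir" ;;
        esac
    done
}
```

Piping the non-interactive environment into it, for example `ssh nano_00 printenv | check_env`, shows what a remotely started command sees without any login-shell setup.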
