Interestingly, I can start mpirun from any of the remote machines, running processes on the other remote machines and on the local machine. But from the local machine I cannot start a process on any remote machine; it just shows the behavior detailed in the previous mail.

remote1 -> remote1   ok
remote1 -> remote2   ok
remote1 -> local     ok
remote2 -> remote1   ok
remote2 -> remote2   ok
remote2 -> local     ok
local   -> local     ok
local   -> remote1   fails
local   -> remote2   fails

My remote machines are freshly updated Gentoo machines (AMD); my local machine is a freshly installed Fedora 8 (Intel Quadro). All use a freshly installed Open MPI 1.2.5. Before my Fedora machine crashed it ran Fedora 6, and everything worked great (with 1.2.2 on all machines).

Does anybody have a suggestion where I should look?

Thanks
  Jody

On Tue, Jun 10, 2008 at 12:59 PM, jody <jody....@gmail.com> wrote:
> Hi
> after a crash i reinstalled open-mpi 1.2.5 on my machines,
> used
> ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default
> and set PATH and LD_LIBRARY_PATH in .bashrc:
> PATH=/opt/openmpi/bin:$PATH
> export PATH
> LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
> export LD_LIBRARY_PATH
>
> First problem:
> ssh nano_00 printenv
> does not contain the correct paths (and no LD_LIBRARY_PATH at all),
> but with a normal ssh-login the two are set correctly.
>
> When i run a test application on one computer, it works.
>
> As soon as an additional computer is involved, there is no more output,
> and everything just hangs.
>
> Adding the prefix doesn't change anything, even though openmpi is
> installed in the same
> directory (/opt/openmpi) on every computer.
>
> The debug-daemon doesn't help very much:
>
> $ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
> Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch
>
> (and nothing happens anymore)
>
> On the remote host, i see the following three processes coming up
> after i do the mpirun on the local machine:
> 30603 ? S 0:00 sshd: jody@notty
> 30604 ? Ss 0:00 bash -c PATH=/opt/openmpi/bin:$PATH ;
> export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons
> --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
> 30605 ? S 0:00 /opt/openmpi/bin/orted --debug-daemons
> --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename
> nano_00 --universe j...@aim-plankton.uzh.ch:default-universe-14934
> --nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl
>
> So it looks as if the correct paths are set (probably the doing of
> --enable-mpirun-prefix-by-default)
>
> If i interrupt on the local machine (Ctrl-C):
>
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
> [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1166
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> errmgr_hnp.c at line 90
> [aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start
> as expected.
> [aim-plankton:14982] ERROR: There may be more information available from
> [aim-plankton:14982] ERROR: the remote shell (see above).
> [aim-plankton:14982] ERROR: The daemon exited unexpectedly with status 255.
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> base/pls_base_orted_cmds.c at line 275
> [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file
> pls_rsh_module.c at line 1166
> --------------------------------------------------------------------------
> WARNING: mpirun has exited before it received notification that all
> started processes had terminated. You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
> [aim-plankton:14983] OOB: Connection to HNP lost
>
> On the remote machine, the "sshd: jody@notty" process is gone, but the
> other two stay.
> I would be grateful for any suggestions!
>
> Jody
>
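
The "first problem" in the quoted mail (the paths missing from `ssh nano_00 printenv` but present after a normal ssh login) can be checked mechanically on each host. The following is only a rough sketch, not part of Open MPI: `check_env` is a hypothetical helper name, and the default prefix `/opt/openmpi` is taken from the configure line quoted above. It reads `printenv` output on standard input and reports whether the Open MPI directories appear in PATH and LD_LIBRARY_PATH.

```shell
# check_env: hypothetical helper (not an Open MPI tool).
# Reads `printenv` output on stdin; $1 is the install prefix
# (defaults to /opt/openmpi, as in the quoted configure line).
check_env() {
    ce_prefix=${1:-/opt/openmpi}
    ce_rc=0
    ce_dump=$(cat)

    # Pull out the two variables of interest from the environment dump.
    ce_path=$(printf '%s\n' "$ce_dump" | sed -n 's/^PATH=//p')
    ce_lib=$(printf '%s\n' "$ce_dump" | sed -n 's/^LD_LIBRARY_PATH=//p')

    # Wrap each value in colons so that a plain substring match
    # only hits whole path entries.
    case ":$ce_path:" in
        *":$ce_prefix/bin:"*) echo "PATH ok" ;;
        *) echo "PATH missing $ce_prefix/bin"; ce_rc=1 ;;
    esac
    case ":$ce_lib:" in
        *":$ce_prefix/lib:"*) echo "LD_LIBRARY_PATH ok" ;;
        *) echo "LD_LIBRARY_PATH missing $ce_prefix/lib"; ce_rc=1 ;;
    esac
    return "$ce_rc"
}

# Usage (host name taken from the quoted mail):
#   ssh nano_00 printenv            | check_env   # non-interactive shell
#   ssh nano_00 'bash -lc printenv' | check_env   # login shell
```

If the two invocations in the usage comment give different results, the settings are being applied only for login or interactive shells on that host. That alone would not explain the hang, since the process listing in the quoted mail shows the paths being set explicitly on the orted command line.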