[OMPI users] no reaction of remote hosts after ompi reinstall
Hi,
After a crash I reinstalled Open MPI 1.2.5 on my machines, configured with

    ./configure --prefix /opt/openmpi --enable-mpirun-prefix-by-default

and set PATH and LD_LIBRARY_PATH in .bashrc:

    PATH=/opt/openmpi/bin:$PATH
    export PATH
    LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH

First problem: the output of

    ssh nano_00 printenv

does not contain the correct paths (and no LD_LIBRARY_PATH at all), but with a normal ssh login the two are set correctly.

When I run a test application on one computer, it works. As soon as an additional computer is involved, there is no more output and everything just hangs. Adding the prefix doesn't change anything, even though Open MPI is installed in the same directory (/opt/openmpi) on every computer.

The debug daemons don't help very much:

    $ mpirun -np 4 --hostfile testhosts --debug-daemons MPITest
    Daemon [0,0,1] checking in as pid 14927 on host aim-plankton.uzh.ch

(and nothing happens anymore)

On the remote host, I see the following three processes coming up after I do the mpirun on the local machine:

    30603 ?  S   0:00 sshd: jody@notty
    30604 ?  Ss  0:00 bash -c PATH=/opt/openmpi/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /opt/openmpi/bin/orted --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --
    30605 ?  S   0:00 /opt/openmpi/bin/orted --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename nano_00 --universe j...@aim-plankton.uzh.ch:default-universe-14934 --nsreplica 0.0.0;tcp://130.60.126.111:52562 --gprrepl

So it looks as if the correct paths are set (probably the doing of --enable-mpirun-prefix-by-default).

If I interrupt on the local machine (Ctrl-C):

    [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
    [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
    [aim-plankton:14983] [0,0,1] orted_recv_pls: received message from [0,0,0]
    [aim-plankton:14983] [0,0,1] orted_recv_pls: received kill_local_procs
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
    [aim-plankton:14982] ERROR: A daemon on node nano_00 failed to start as expected.
    [aim-plankton:14982] ERROR: There may be more information available from
    [aim-plankton:14982] ERROR: the remote shell (see above).
    [aim-plankton:14982] ERROR: The daemon exited unexpectedly with status 255.
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
    [aim-plankton:14982] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
    --------------------------------------------------------------------------
    WARNING: mpirun has exited before it received notification that all
    started processes had terminated. You should double check and ensure
    that there are no runaway processes still executing.
    --------------------------------------------------------------------------
    [aim-plankton:14983] OOB: Connection to HNP lost

On the remote machine, the "sshd: jody@notty" process is gone, but the other two stay.

I would be grateful for any suggestions!

Jody
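The missing paths in the "ssh nano_00 printenv" output are a classic symptom: for a non-interactive command, many distributions' ~/.bashrc returns early, so exports placed below that point never reach the orted that mpirun starts over ssh. A minimal sketch of a ~/.bashrc layout that avoids this, assuming the remote shells are bash and the stock file contains such an interactive-only guard:

    # Open MPI paths first, so that non-interactive ssh commands
    # (including the orted launch) still pick them up.
    PATH=/opt/openmpi/bin:$PATH
    export PATH
    LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH

    # Typical distribution guard: everything below runs only in
    # interactive shells, so it must come after the exports above.
    [ -z "$PS1" ] && return

After the change, "ssh nano_00 printenv LD_LIBRARY_PATH" should print /opt/openmpi/lib followed by any previous value.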
Re: [OMPI users] Different CC for orte and ompi?
Sorry, I'll try and fill in the background. I'm attempting to package Open MPI for a number of our customers; whenever possible on our clusters we use modules to provide users with a choice of MPI environment. I'm using the 1.2.6 stable release and have built the code twice, once to /opt/openmpi-1.2.6/gnu and once to /opt/openmpi-1.2.6/intel. I have created two module environments called openmpi-gnu and openmpi-intel and am also using an existing one called intel-compiler. The build was successful in both cases.

If I load the openmpi-gnu module I can compile and run code using mpicc/mpirun as expected. If I load openmpi-intel and intel-compiler I can compile code, but I get an error about a missing libimf.so when I try to run it (reproduced below). The application *will* run if I add the line "module load intel-compiler" to my .bashrc, as this allows orted to link.

What I think I want to do is to compile the actual library with icc but to compile orted with gcc, so that I don't need to load the intel environment by default. I'm assuming that the link problems only exist with orted and not with the actual application, as LD_LIBRARY_PATH is set correctly in the shell which launches the program.

Ashley Pittman.

    sccomp@demo4-sles-10-1-fe:~/benchmarks/IMB_3.0/src> mpirun -H comp00,comp01 ./IMB-MPI1
    /opt/openmpi-1.2.6/intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
    /opt/openmpi-1.2.6/intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
    [demo4-sles-10-1-fe:29303] ERROR: A daemon on node comp01 failed to start as expected.
    [demo4-sles-10-1-fe:29303] ERROR: There may be more information available from
    [demo4-sles-10-1-fe:29303] ERROR: the remote shell (see above).
    [demo4-sles-10-1-fe:29303] ERROR: The daemon exited unexpectedly with status 127.
    [demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
    [demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
    [demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
    [demo4-sles-10-1-fe:29303] ERROR: A daemon on node comp00 failed to start as expected.
    [demo4-sles-10-1-fe:29303] ERROR: There may be more information available from
    [demo4-sles-10-1-fe:29303] ERROR: the remote shell (see above).
    [demo4-sles-10-1-fe:29303] ERROR: The daemon exited unexpectedly with status 127.
    [demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
    [demo4-sles-10-1-fe:29303] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
    --------------------------------------------------------------------------
    mpirun was unable to cleanly terminate the daemons for this job.
    Returned value Timeout instead of ORTE_SUCCESS.
    --------------------------------------------------------------------------

    $ ldd /opt/openmpi-1.2.6/intel/bin/orted
        linux-vdso.so.1 => (0x7fff877fe000)
        libopen-rte.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-rte.so.0 (0x7fe97f3ac000)
        libopen-pal.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-pal.so.0 (0x7fe97f239000)
        libdl.so.2 => /lib64/libdl.so.2 (0x7fe97f135000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe97f01f000)
        libutil.so.1 => /lib64/libutil.so.1 (0x7fe97ef1c000)
        libm.so.6 => /lib64/libm.so.6 (0x7fe97edc7000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe97ecba000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe97eba3000)
        libc.so.6 => /lib64/libc.so.6 (0x7fe97e972000)
        libimf.so => /opt/intel/compiler_10.1/x86_64/lib/libimf.so (0x7fe97e61)
        libsvml.so => /opt/intel/compiler_10.1/x86_64/lib/libsvml.so (0x7fe97e489000)
        libintlc.so.5 => /opt/intel/compiler_10.1/x86_64/lib/libintlc.so.5 (0x7fe97e35)
        /lib64/ld-linux-x86-64.so.2 (0x7fe97f525000)

    $ ssh comp00 ldd /opt/openmpi-1.2.6/intel/bin/orted
        libopen-rte.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-rte.so.0 (0x2b1f0c0c5000)
        libopen-pal.so.0 => /opt/openmpi-1.2.6/intel/lib/libopen-pal.so.0 (0x2b1f0c23e000)
        libdl.so.2 => /lib64/libdl.so.2 (0x2b1f0c3bc000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x2b1f0c4c)
        libutil.so.1 => /lib64/libutil.so.1 (0x2b1f0c5d7000)
        libm.so.6 => /lib64/libm.so.6 (0x2b1f0c6da000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2b1f0c82f000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x2b1f0c93d000)
        libc.so.6 => /lib64/libc.so.6 (0x2b1f0ca54000)
        /lib64/ld-linux-x86-64.so.2 (0x2b1f0bfa9000)
        libimf.so => not found
        libsvml.so => not found
        libintlc.so.5 => not found
        libimf.so => not found
        libsvml.so => not found
        libi
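Beyond rebuilding, there are at least two ways to make libimf.so visible to orted without loading the intel-compiler module in every shell. A hedged sketch, assuming root access on the compute nodes and the Intel 10.1 path shown in the ldd output above - register the runtime directory with the dynamic linker cache:

    # Run once per compute node (requires root); the .conf file name is arbitrary.
    echo /opt/intel/compiler_10.1/x86_64/lib > /etc/ld.so.conf.d/intel-compiler.conf
    ldconfig

Without root, exporting LD_LIBRARY_PATH near the top of ~/.bashrc (before any interactive-only guard) has the same effect for ssh-launched daemons, which is essentially what the "module load intel-compiler in .bashrc" workaround does.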
Re: [OMPI users] no reaction of remote hosts after ompi reinstall [follow up]
Interestingly, I can start mpirun from any of the remote machines, running processes on other remote machines and on the local machine. But from the local machine I cannot start any process on a remote machine - it just shows the behavior detailed in the previous mail.

    remote1 -> remote1  ok
    remote1 -> remote2  ok
    remote1 -> local    ok
    remote2 -> remote1  ok
    remote2 -> remote2  ok
    remote2 -> local    ok
    local   -> local    ok
    local   -> remote1  fails
    local   -> remote2  fails

My remote machines are freshly updated Gentoo machines (AMD); my local machine is a freshly installed Fedora 8 (Intel Quadro). All use a freshly installed Open MPI 1.2.5. Before my Fedora machine crashed it had Fedora 6, and everything worked great (with 1.2.2 on all machines).

Does anybody have a suggestion where I should look?

Thanks
Jody

On Tue, Jun 10, 2008 at 12:59 PM, jody wrote:
> [...]
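Given that every remote -> local launch works while local -> remote hangs, one plausible suspect is the freshly installed Fedora 8 machine itself: each remote orted must open a TCP connection back to mpirun (the HNP, reachable here at 130.60.126.111:52562 according to the orted command line in the previous mail), and Fedora's default firewall will silently drop that incoming connection, producing exactly this kind of hang. A hedged diagnostic, assuming the stock iptables service:

    # On the Fedora machine: look for DROP/REJECT rules on the INPUT chain.
    /sbin/iptables -L -n

    # Temporarily disable the firewall and retry the mpirun; if the job now
    # starts, re-enable the firewall and open the required ports instead.
    service iptables stop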
Re: [OMPI users] Different CC for orte and ompi?
Ashley,

I had a similar situation linking to the Intel libraries and used the following in the link step:

    -L/opt/intel/compiler_10.1/x86_64/lib -Wl,-non_shared -limf -lsvml -lintlc -Wl,-call_shared

This created binaries statically linked to the Intel compiler libraries, so I didn't have to push the Intel libraries out to the nodes or worry about LD_LIBRARY_PATH.

Doug Reeder

On Jun 10, 2008, at 4:28 AM, Ashley Pittman wrote:
> [...]
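To apply Doug's flags to everything Open MPI builds, including orted, one hedged option is to pass them as LDFLAGS when configuring the Intel build. The library path below is the one from Doug's message; the compiler variable settings are assumptions for a typical icc/ifort toolchain:

    ./configure --prefix=/opt/openmpi-1.2.6/intel CC=icc CXX=icpc F77=ifort FC=ifort \
        LDFLAGS="-L/opt/intel/compiler_10.1/x86_64/lib -Wl,-non_shared -limf -lsvml -lintlc -Wl,-call_shared"
    make all install

Here -Wl,-non_shared tells GNU ld to resolve the -l options that follow it statically, and -Wl,-call_shared switches back to dynamic linking for the system libraries, so only the Intel runtime is baked into the binaries.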