Ralph,

Unfortunately, I didn't see the ssh output. The output I got was pretty much the same as before.
The fact that the error message is not prefixed with a host name makes me think it could be happening on the host where the job is placed by PBS. If there were something wrong in the user environment prior to mpirun, that would not be an Open MPI problem. And yet, in one of the jobs that failed, I also printed the output of 'ldd' on the mpirun executable just prior to executing the command, and all the shared libraries were resolved:

ldd /release/cfd/openmpi-intel/bin/mpirun
        linux-vdso.so.1 =>  (0x00007fffbbb39000)
        libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x00002abdf75d2000)
        libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 (0x00002abdf7887000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002abdf7b39000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002abdf7d3d000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002abdf7f56000)
        libm.so.6 => /lib64/libm.so.6 (0x00002abdf8159000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002abdf83af000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002abdf85c7000)
        libc.so.6 => /lib64/libc.so.6 (0x00002abdf87e4000)
        libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so (0x00002abdf8b42000)
        libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so (0x00002abdf8ed7000)
        libintlc.so.5 => /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x00002abdf90ed000)
        /lib64/ld-linux-x86-64.so.2 (0x00002abdf73b1000)

Hence my initial assumption that the shared-library problem was happening with one of the child processes on a remote node. So at this point I have more questions than answers. I still don't know whether this message comes from the main mpirun process or from one of the child processes, although given the ldd output above it seems it should not be the main process.

Any more suggestions are welcome, of course.

Thanks

/release/cfd/openmpi-intel/bin/mpirun --machinefile /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 -ro /tmp/fv420804.maruhpc4-mgt/restart.5000

[c6n38:16219] mca:base:select:( plm) Querying component [rsh]
[c6n38:16219] mca:base:select:( plm) Query of component [rsh] set priority to 10
[c6n38:16219] mca:base:select:( plm) Selected component [rsh]
Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.
Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.
Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.
Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.
Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.
Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.
Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.
Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.
/release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 16227) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to the
"orte-clean" tool for assistance.
--------------------------------------------------------------------------
        c6n39 - daemon did not report back when launched
        c6n40 - daemon did not report back when launched
        c6n41 - daemon did not report back when launched
        c6n42 - daemon did not report back when launched
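As a possible next step (a minimal sketch only, reusing the machinefile and install paths from the run above, so adjust as needed), it might help to run ldd on orted itself on each node over non-interactive ssh and print the LD_LIBRARY_PATH such a shell actually sees. As far as I understand, -x LD_LIBRARY_PATH is applied to the application processes; the remote orted itself is started over ssh before those exports take effect, so it only sees what the non-interactive shell provides:

# Sketch only - machinefile and orted paths are the ones from the job above.
# Check orted's library resolution and the non-interactive LD_LIBRARY_PATH
# on every node listed in the PBS machinefile.
for host in $(sort -u /var/spool/PBS/aux/20804.maruhpc4-mgt); do
    echo "=== $host ==="
    ssh "$host" 'ldd /release/cfd/openmpi-intel/bin/orted | grep "not found";
                 echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'
done

Any node that reports "libimf.so => not found", or prints an LD_LIBRARY_PATH without the Intel directory, would be a candidate for the one killing the daemon launch.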
From: users-boun...@open-mpi.org On Behalf Of Ralph Castain
Sent: Friday, December 14, 2012 2:25 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries while launching jobs

Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that will show the ssh command being used to start each orted.

On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:

I am having a weird problem launching cases with OpenMPI 1.4.3. It is most likely a problem with a particular node of our cluster, as the jobs will run fine on some submissions but not others. It seems to depend on the node list. I am just having trouble diagnosing which node it is and what the nature of its problem is.

One or perhaps more of the orted daemons are indicating they cannot find an Intel math library. The error is:

/release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: libimf.so: cannot open shared object file: No such file or directory

I've checked the environment just before launching mpirun, and LD_LIBRARY_PATH includes the necessary component to point to where the Intel shared libraries are located. Furthermore, my mpirun command line says to export the LD_LIBRARY_PATH variable:

Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile /var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', '-cycles', '10000', '-ri', 'restart.1', '-ro', '/tmp/fv420761.maruhpc4-mgt/restart.1']

My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH. OpenMPI is built explicitly --without-torque and should be using ssh to launch the orted.

What options can I add to get more debugging of problems launching orted?

Thanks,

Ed

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
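A further check, given the original report above that LD_LIBRARY_PATH looks correct before mpirun and that .bashrc does not overwrite it (a minimal sketch; c6n39 and the Intel path are simply taken from the output earlier in the thread): compare what a non-interactive ssh shell on a suspect node sees against a login shell, since the non-interactive environment is, as far as I understand, the one the remote orted inherits.

# Sketch only - node name and Intel library path are from the output above.
ssh c6n39 'echo "non-interactive: $LD_LIBRARY_PATH"'
ssh c6n39 'bash -lc "echo login: \$LD_LIBRARY_PATH"'
ssh c6n39 'ls -l /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so'

If the first command shows a path without the Intel directory while the login shell shows it, the variable is being set only for interactive or login shells on that node, which would explain orted failing there even though mpirun itself resolves libimf.so on the head node.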