Ralph,

Unfortunately I didn't see the ssh output.  The output I got was pretty much as 
before.

You know, the fact that the error message is not prefixed with a host name 
makes me think it could be happening on the host where PBS places the job. If 
something is wrong in the user environment prior to mpirun, that is not an 
Open MPI problem. And yet, in one of the jobs that failed, I also printed the 
output of 'ldd' on the mpirun executable just before executing the command, 
and all the shared libraries were resolved:

ldd /release/cfd/openmpi-intel/bin/mpirun
        linux-vdso.so.1 =>  (0x00007fffbbb39000)
        libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 (0x00002abdf75d2000)
        libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 (0x00002abdf7887000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002abdf7b39000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002abdf7d3d000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002abdf7f56000)
        libm.so.6 => /lib64/libm.so.6 (0x00002abdf8159000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002abdf83af000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002abdf85c7000)
        libc.so.6 => /lib64/libc.so.6 (0x00002abdf87e4000)
        libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so (0x00002abdf8b42000)
        libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so (0x00002abdf8ed7000)
        libintlc.so.5 => /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x00002abdf90ed000)
        /lib64/ld-linux-x86-64.so.2 (0x00002abdf73b1000)

Hence my initial assumption that the shared-library problem was happening with 
one of the child processes on a remote node.
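
One thing I can try next (just a sketch, assuming passwordless ssh from the head 
node works, as the known-hosts messages in the output below suggest, and that the 
install paths are the same on every node) is to loop over the nodes in the PBS 
machinefile and run ldd on the orted executable on each one, to see which node 
fails to resolve libimf.so:

# rough check: ask every node of the failed job whether orted resolves its libraries
# (machinefile path taken from the failed job shown below)
for node in $(sort -u /var/spool/PBS/aux/20804.maruhpc4-mgt); do
    echo "=== $node ==="
    ssh "$node" 'ldd /release/cfd/openmpi-intel/bin/orted | grep "not found"'
done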

So at this point I have more questions than answers.  I still don't know if 
this message comes from the main mpirun process or one of the child processes, 
although it seems that it should not be the main process because of the output 
of ldd above.
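
If I understand the launch correctly, the orted on each remote node is started by 
ssh, so the dynamic linker sees whatever LD_LIBRARY_PATH a non-interactive ssh 
shell ends up with on that node, not the value mpirun later exports with -x. A 
quick check along those lines (again just a sketch, using one of the node names 
from the output below) would be:

# what LD_LIBRARY_PATH does a non-interactive ssh shell get on a remote node?
ssh c6n39 'echo "$LD_LIBRARY_PATH"'
# and does orted resolve the Intel math library in that environment?
ssh c6n39 'ldd /release/cfd/openmpi-intel/bin/orted | grep libimf'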

Any more suggestions are welcome, of course.

Thanks


/release/cfd/openmpi-intel/bin/mpirun --machinefile 
/var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x 
MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached 
/tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 10000 -ri restart.5000 -ro 
/tmp/fv420804.maruhpc4-mgt/restart.5000

[c6n38:16219] mca:base:select:(  plm) Querying component [rsh]
[c6n38:16219] mca:base:select:(  plm) Query of component [rsh] set priority to 
10
[c6n38:16219] mca:base:select:(  plm) Selected component [rsh]
Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.
Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.
Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.
Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.
Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.
Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.
Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.
Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.
/release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
--------------------------------------------------------------------------
A daemon (pid 16227) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        c6n39 - daemon did not report back when launched
        c6n40 - daemon did not report back when launched
        c6n41 - daemon did not report back when launched
        c6n42 - daemon did not report back when launched

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Friday, December 14, 2012 2:25 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries while 
launching jobs

Add -mca plm_base_verbose 5 --leave-session-attached to the cmd line - that 
will show the ssh command being used to start each orted.

On Dec 14, 2012, at 12:17 PM, "Blosch, Edwin L" <edwin.l.blo...@lmco.com> wrote:


I am having a weird problem launching cases with Open MPI 1.4.3.  It is most 
likely a problem with a particular node of our cluster, as the jobs will run 
fine on some submissions but not others; it seems to depend on the node list. 
I am just having trouble diagnosing which node it is and what the nature of 
its problem is.

One or perhaps more of the orted processes are indicating that they cannot find 
an Intel math library.  The error is:
/release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory

I've checked the environment just before launching mpirun, and LD_LIBRARY_PATH 
includes the necessary component to point to where the Intel shared libraries 
are located.  Furthermore, my mpirun command line says to export the 
LD_LIBRARY_PATH variable:
Executing ['/release/cfd/openmpi-intel/bin/mpirun', '--machinefile 
/var/spool/PBS/aux/20761.maruhpc4-mgt', '-np 160', '-x LD_LIBRARY_PATH', '-x 
MPI_ENVIRONMENT=1', '/tmp/fv420761.maruhpc4-mgt/falconv4_openmpi_jsgl', '-v', 
'-cycles', '10000', '-ri', 'restart.1', '-ro', 
'/tmp/fv420761.maruhpc4-mgt/restart.1']

My shell-initialization script (.bashrc) does not overwrite LD_LIBRARY_PATH.  
Open MPI was built explicitly --without-torque, so it should be using ssh to 
launch the orted.

What options can I add to get more debugging of problems launching orted?

Thanks,

Ed
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
