Hi,

I have looked through the list archives and Google but could not find anything 
related to what I am seeing. I am simply trying to run the basic cpi.c example 
under SGE with tight integration.
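
For context, the job goes through SGE with a submit script roughly along these 
lines (just a sketch; the parallel environment name "orte" and the slot count 
are placeholders, not necessarily what is configured here):

#!/bin/sh
#$ -cwd
#$ -pe orte 2
# With tight integration, mpiexec picks the hosts up from SGE, so no machinefile:
mpiexec -np $NSLOTS a.out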

If I run outside of SGE, my jobs run just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out 
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.032325


If I submit to SGE I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register 
function
[hpcp7781:08527] mca: base: components_open: component rsh open function 
successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register 
function
[hpcp7781:08527] mca: base: components_open: component slurm open function 
successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using 
/hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:(  plm) Skipping component [slurm]. Query 
failed to return a module
[hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed: 
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh

It seems to me that orted is not starting on the remote node. I have 
LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on orted I see this:

hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
        libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 
(0x00002ac5b14e2000)
        libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 
(0x00002ac5b1628000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
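
As a further check, I was going to verify that the libraries also resolve from a 
non-interactive shell on the remote node, along these lines (assuming plain ssh 
access to hpcp7782):

ssh hpcp7782 'echo $LD_LIBRARY_PATH; ldd /bphpc7/vol0/salmr0/ompi/bin/orted'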


It looks like Grid Engine is using qrsh to start orted on the remote nodes, and 
qrsh may not be reading my shell startup files, so LD_LIBRARY_PATH never gets 
set there.
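
If that is the cause, would something along these lines be the right workaround? 
(Just a sketch; I have not verified that either variant fixes the 
tight-integration launch.)

# In the SGE job script: set the path and ask mpiexec to forward it
export LD_LIBRARY_PATH=/bphpc7/vol0/salmr0/ompi/lib:$LD_LIBRARY_PATH
mpiexec -x LD_LIBRARY_PATH -np $NSLOTS a.out

# Or have mpiexec set PATH/LD_LIBRARY_PATH on the remote nodes via the install prefix
mpiexec --prefix /bphpc7/vol0/salmr0/ompi -np $NSLOTS a.out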

Thanks for any help with this.

Rene

