Hi, I have looked through the list archives and Google but could not find anything related to what I am seeing. I am simply trying to run the basic cpi.c code under SGE with tight integration.
If run outside SGE, I can run my jobs just fine:

hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.032325

If I submit to SGE I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register function
[hpcp7781:08527] mca: base: components_open: component rsh open function successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register function
[hpcp7781:08527] mca: base: components_open: component slurm open function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
[hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh

It seems to me that orted is not starting on the remote node. I have LD_LIBRARY_PATH set in my shell startup files, and if I do an ldd on orted, all of its libraries resolve:

hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
        libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
        libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)

It looks like Grid Engine is using qrsh to start orted on the remote nodes, and qrsh might not be reading my shell startup file, so LD_LIBRARY_PATH would never get set for orted.
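In case the launch setup matters, below is a rough sketch of the kind of submit script I have in mind, not my exact script: the PE name "orte", the job name, and the slot count are placeholders. The --prefix option is only a guess at a workaround: as I understand it, it tells mpirun where the Open MPI installation lives so it can set PATH and LD_LIBRARY_PATH for the orted it launches on the remote nodes, instead of relying on shell startup files being read. I have not tried it yet.

#!/bin/sh
# Sketch of an SGE submit script -- the PE name "orte" and the slot count
# are placeholders for whatever the real parallel environment is called.
#$ -S /bin/sh
#$ -N cpi_test
#$ -cwd
#$ -pe orte 2
# -V exports the submission environment to the job script on the master node.
#$ -V

# --prefix points mpirun at the Open MPI install so it can set PATH and
# LD_LIBRARY_PATH for the orted it starts (via qrsh under tight integration)
# on the remote nodes, rather than depending on shell startup files there.
/bphpc7/vol0/salmr0/ompi/bin/mpirun --prefix /bphpc7/vol0/salmr0/ompi -np $NSLOTS ./a.out

If that turns out to be the problem, I believe configuring Open MPI with --enable-orterun-prefix-by-default would make mpirun behave as if --prefix were always given, which might be the cleaner long-term fix.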
Thanks for any help with this.

Rene