Hi,

It shouldn't be necessary to supply a machinefile, as the one generated by SGE is picked up automatically (i.e. the granted nodes are honored). Did you submit the job requesting a PE?
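With a PE in place, a submission along these lines is all that should be needed; the PE name "orte", the slot count and the script name below are only placeholders for whatever is configured on your cluster:

    $ cat job.sh
    #!/bin/sh
    #$ -pe orte 2     # request 2 slots from the parallel environment
    #$ -cwd
    # no -machinefile needed: Open MPI reads the SGE allocation ($NSLOTS, $PE_HOSTFILE) itself
    mpiexec -np $NSLOTS ./a.out

    $ qsub job.sh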

-- Reuti


On 18.03.2009, at 04:51, Salmon, Rene wrote:


Hi,

I have looked through the list archives and Google but could not find anything related to what I am seeing. I am simply trying to run the basic cpi.c code under SGE with tight integration.

If I run outside of SGE, my jobs run just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.032325


If I submit to SGE I get this:

[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register function
[hpcp7781:08527] mca: base: components_open: component rsh open function successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register function
[hpcp7781:08527] mca: base: components_open: component slurm open function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh




Seems to me orted is not starting on the remote node. I have LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on orted I see this:

hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
        libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
        libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)


It looks like Grid Engine is using qrsh to start orted on the remote nodes, and qrsh is probably not reading my shell startup files, so LD_LIBRARY_PATH never gets set for orted.
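If that is the cause, I suppose I could either check what environment a qrsh-started shell actually gets, or point mpirun at my install explicitly. Something like the following is what I had in mind (the paths are from my install above, the script name is a placeholder, and I haven't verified this is the recommended fix):

    # check what a qrsh-started (non-interactive) shell sees
    qrsh 'echo $LD_LIBRARY_PATH'

    # have mpirun forward the install location to the remote orted
    mpirun --prefix /bphpc7/vol0/salmr0/ompi -np 2 a.out

    # or export the submission environment through SGE
    qsub -V myjob.sh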

Thanks for any help with this.

Rene


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
