Hi,
it shouldn't be necessary to supply a machinefile, as the one
generated by SGE is picked up automatically (i.e. the granted nodes
are honored). Did you submit the job requesting a PE?
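For tight integration the job script just requests the PE and calls
mpirun without any host list; a minimal sketch (the PE name "orte" and
the slot count here are only placeholders for whatever PE your admin
configured for tight integration):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd -j y
# request the parallel environment; SGE then grants the nodes
#$ -pe orte 2

# no -machinefile needed: Open MPI picks up the SGE-granted node list
# itself; $NSLOTS is filled in by SGE
mpirun -np $NSLOTS ./a.out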
-- Reuti
On 18.03.2009, at 04:51, Salmon, Rene wrote:
Hi,
I have looked through the list archives and Google but could not
find anything related to what I am seeing. I am simply trying to
run the basic cpi.c code under SGE with tight integration.
If I run outside SGE, the job runs just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.032325
If I submit the job to SGE, I get this:
[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register function
[hpcp7781:08527] mca: base: components_open: component rsh open function successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register function
[hpcp7781:08527] mca: base: components_open: component slurm open function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
[hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh
It seems to me that orted is not starting on the remote node. I have
LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
orted I see this:
hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
        libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
        libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
It looks like Grid Engine is using qrsh to start orted on the remote
nodes, and qrsh might not be reading my shell startup files, so
LD_LIBRARY_PATH never gets set there.
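In case it matters, this is the kind of workaround I was going to try
next; just a rough sketch, with the install path taken from the ldd
output above and the PE name again only a placeholder (I have mpirun's
--prefix and -x options in mind, but I may be misreading their intent):

#!/bin/sh
#$ -pe orte 2
#$ -cwd -j y

# point mpirun at the Open MPI install tree so the remote orted can
# find its shared libraries (--prefix sets PATH/LD_LIBRARY_PATH for it)
/bphpc7/vol0/salmr0/ompi/bin/mpirun --prefix /bphpc7/vol0/salmr0/ompi -np $NSLOTS ./a.out

# alternatively, export the variable explicitly to the launched processes
# export LD_LIBRARY_PATH=/bphpc7/vol0/salmr0/ompi/lib:$LD_LIBRARY_PATH
# /bphpc7/vol0/salmr0/ompi/bin/mpirun -x LD_LIBRARY_PATH -np $NSLOTS ./a.out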
Thanks for any help with this.
Rene