On 03/18/09 09:52, Reuti wrote:
Hi,
On 18.03.2009, at 14:25, Rene Salmon wrote:
Thanks for the help. I only use the machinefile to run outside of SGE,
just to test/prove that things work outside of SGE.
Aha. Did you compile Open MPI 1.3 with the SGE option?
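If not, something along these lines should do it (the prefix is just taken from your setup, adjust as needed):

# rebuild Open MPI 1.3 with Grid Engine support
./configure --prefix=/bphpc7/vol0/salmr0/ompi --with-sge
make all install
# afterwards ompi_info should mention the gridengine component
ompi_info | grep gridengine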
When I run within SGE, here is what the job script looks like:
hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh
-S will only work if the queue's shell_start_mode is set to posix_compliant.
If it's set to unix_behavior, the first line of the script is already
sufficient.
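You can check which one your queue uses with something like this (the queue name is just an example):

qconf -sq all.q | grep shell_start_mode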
setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's known
automatically on the nodes.
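For example, something like this in your ~/.cshrc (using the path from your script):

# make the Open MPI libraries visible to non-interactive shells on the nodes
if ( $?LD_LIBRARY_PATH ) then
    setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib:${LD_LIBRARY_PATH}
else
    setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
endif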
mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
Do you use --mca... only for debugging or why is it added here?
-- Reuti
We are using PEs. Here is what the PE looks like:
hpcp7781(salmr0)129:qconf -sp pavtest
pe_name pavtest
slots 16
user_lists NONE
xuser_lists NONE
start_proc_args /bin/true
stop_proc_args /bin/true
allocation_rule 8
control_slaves FALSE
job_is_first_task FALSE
urgency_slots min
This FAQ shows an example of a parallel environment setup:
http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
I am wondering if the control_slaves needs to be TRUE.
And double check that the PE (pavtest) is in the pe_list of the queue
(also mentioned in the FAQ). And perhaps start by trying to run hostname
first.
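Untested, but following the FAQ I would expect the PE to look roughly like this, with the queue referencing it (the queue name below is only an example):

# qconf -mp pavtest -- enable control_slaves so orted can be started via qrsh -inherit
pe_name            pavtest
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min

# verify the PE is attached to the queue
qconf -sq all.q | grep pe_list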
Rolf
Here is the qsub line to submit the job:
qsub -pe pavtest 16 simple-job.sh
The job seems to run fine with no problems within SGE if I contain the
job within one node. As soon as the job has to use more than one node,
things stop working with the LD_LIBRARY_PATH message I posted about, and
orted does not seem to start on the remote nodes.
Thanks
Rene
On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
Hi,
it shouldn't be necessary to supply a machinefile, as the one
generated by SGE is taken automatically (i.e. the granted nodes are
honored). Did you submit the job requesting a PE?
-- Reuti
On 18.03.2009, at 04:51, Salmon, Rene wrote:
Hi,
I have looked through the list archives and Google but could not
find anything related to what I am seeing. I am simply trying to
run the basic cpi.c code using SGE and tight integration.
If run outside SGE, I can run my jobs just fine:
hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
Process 0 on hpcp7781
Process 1 on hpcp7782
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.032325
If I submit to SGE I get this:
[hpcp7781:08527] mca: base: components_open: Looking for plm components
[hpcp7781:08527] mca: base: components_open: opening plm components
[hpcp7781:08527] mca: base: components_open: found loaded component rsh
[hpcp7781:08527] mca: base: components_open: component rsh has no register function
[hpcp7781:08527] mca: base: components_open: component rsh open function successful
[hpcp7781:08527] mca: base: components_open: found loaded component slurm
[hpcp7781:08527] mca: base: components_open: component slurm has no register function
[hpcp7781:08527] mca: base: components_open: component slurm open function successful
[hpcp7781:08527] mca:base:select: Auto-selecting plm components
[hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
[hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
[hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
[hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
[hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
[hpcp7781:08527] mca: base: close: component slurm closed
[hpcp7781:08527] mca: base: close: unloading component slurm
Starting server daemon at host "hpcp7782"
error: executing task of job 1702026 failed:
--------------------------------------------------------------------------
A daemon (pid 8528) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
[hpcp7781:08527] mca: base: close: component rsh closed
[hpcp7781:08527] mca: base: close: unloading component rsh
It seems to me that orted is not starting on the remote node. I have
LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
orted, I see this:
hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
        libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
        libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
        libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
        /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
It looks like Grid Engine is using qrsh to start orted on the remote
nodes, and qrsh might not be reading my shell startup file, so
LD_LIBRARY_PATH never gets set there.
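I suppose I could check what qrsh actually sees with something like this (not sure these are exactly the right options):

# does LD_LIBRARY_PATH survive a qrsh to the remote node?
qrsh -l hostname=hpcp7782 env | grep LD_LIBRARY_PATH
# or have mpirun export it explicitly to the remote daemons
mpirun -x LD_LIBRARY_PATH -np 16 /bphpc7/vol0/salmr0/SGE/a.out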
Thanks for any help with this.
Rene
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
=========================
rolf.vandeva...@sun.com
781-442-3043
=========================