Hi,

Thanks for the help.  I only use the machinefile when running outside of SGE,
just to test/prove that things work there.

When I run within SGE, here is what the job script looks like:

hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh
setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
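
Just as a data point (this is only a sketch on my part, not something I have
verified helps here): since it is orted on the remote nodes that cannot find
its libraries, the mpirun line in the script could also export the library
path explicitly with -x, in addition to --prefix:

mpirun -x LD_LIBRARY_PATH --mca plm_base_verbose 20 \
    --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out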


We are using PEs.  Here is what the PE looks like:

hpcp7781(salmr0)129:qconf -sp pavtest
pe_name           pavtest
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   8
control_slaves    FALSE
job_is_first_task FALSE
urgency_slots     min
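
For comparison (just a sketch from what I have read about tight integration,
so it may not apply to our site), the PE examples I have seen for letting the
MPI library start its daemons through qrsh set control_slaves to TRUE rather
than FALSE:

pe_name           pavtest
slots             16
user_lists        NONE
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   8
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

I have not changed our PE yet; I am only noting the difference.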


Here is the qsub line used to submit the job:

>>qsub -pe pavtest 16 simple-job.sh


The job runs fine within SGE as long as it stays within a single node.  As
soon as the job has to span more than one node, things stop working with the
LD_LIBRARY_PATH message I posted earlier, and orted does not seem to start on
the remote nodes.
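
As a further check (purely hypothetical on my side; the node name is just
taken from the output above and the resource request syntax may need
adjusting for our queues), I was going to compare what a qrsh-started shell
and an ssh-started shell see for the library path on the second node:

qrsh -l hostname=hpcp7782 env | grep LD_LIBRARY_PATH
ssh hpcp7782 'echo $LD_LIBRARY_PATH'

If the qrsh one comes back empty while ssh shows the path, that would fit the
idea that qrsh is not reading my shell startup files.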

Thanks
Rene




On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
> Hi,
> 
> it shouldn't be necessary to supply a machinefile, as the one 
> generated by SGE is taken automatically (i.e. the granted nodes are 
> honored). You submitted the job requesting a PE?
> 
> -- Reuti
> 
> 
> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
> 
> >
> > Hi,
> >
> > I have looked through the list archives and google but could not 
> > find anything related to what I am seeing. I am simply trying to 
> > run the basic cpi.c code using SGE and tight integration.
> >
> > If run outside SGE i can run my jobs just fine:
> > hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> > Process 0 on hpcp7781
> > Process 1 on hpcp7782
> > pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> > wall clock time = 0.032325
> >
> >
> > If I submit to SGE I get this:
> >
> > [hpcp7781:08527] mca: base: components_open: Looking for plm 
> > components
> > [hpcp7781:08527] mca: base: components_open: opening plm components
> > [hpcp7781:08527] mca: base: components_open: found loaded component 
> > rsh
> > [hpcp7781:08527] mca: base: components_open: component rsh has no 
> > register function
> > [hpcp7781:08527] mca: base: components_open: component rsh open 
> > function successful
> > [hpcp7781:08527] mca: base: components_open: found loaded component 
> > slurm
> > [hpcp7781:08527] mca: base: components_open: component slurm has no 
> > register function
> > [hpcp7781:08527] mca: base: components_open: component slurm open 
> > function successful
> > [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> > [hpcp7781:08527] mca:base:select:(  plm) Querying component [rsh]
> > [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
> > [hpcp7781:08527] mca:base:select:(  plm) Query of component [rsh] 
> > set priority to 10
> > [hpcp7781:08527] mca:base:select:(  plm) Querying component [slurm]
> > [hpcp7781:08527] mca:base:select:(  plm) Skipping component 
> > [slurm]. Query failed to return a module
> > [hpcp7781:08527] mca:base:select:(  plm) Selected component [rsh]
> > [hpcp7781:08527] mca: base: close: component slurm closed
> > [hpcp7781:08527] mca: base: close: unloading component slurm
> > Starting server daemon at host "hpcp7782"
> > error: executing task of job 1702026 failed:
> >
> > --------------------------------------------------------------------------
> > A daemon (pid 8528) died unexpectedly with status 1 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed 
> > shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to 
> > have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> >
> > --------------------------------------------------------------------------
> >
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> >
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished
> >
> > [hpcp7781:08527] mca: base: close: component rsh closed
> > [hpcp7781:08527] mca: base: close: unloading component rsh
> >
> >
> >
> >
> > Seems to me orted is not starting on the remote node.  I have 
> > LD_LIBRARY_PATH set on my shell startup files.  If I do an ldd on 
> > orted i see this:
> >
> > hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
> >         libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
> >         libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
> >         libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
> >         libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
> >         libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
> >         libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
> >         libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
> >         libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
> >         /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
> >
> >
> > Looks like gridengine is using qrsh to start orted on the remote 
> > nodes. qrsh might not be reading my shell startup file and setting 
> > LD_LIBRARY_PATH.
> >
> > Thanks for any help with this.
> >
> > Rene
> >
> >
> 
