Hi,

Thanks for the help. I only use the machinefile when running outside of SGE, just to test/prove that things work outside of SGE.
When I run within SGE, here is what the job script looks like:

hpcp7781(salmr0)128:cat simple-job.sh
#!/bin/csh
#
#$ -S /bin/csh
setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out

We are using PEs. Here is what the PE looks like:

hpcp7781(salmr0)129:qconf -sp pavtest
pe_name            pavtest
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    8
control_slaves     FALSE
job_is_first_task  FALSE
urgency_slots      min

Here is the qsub line to submit the job:

>>qsub -pe pavtest 16 simple-job.sh

The job runs fine with no problems within SGE as long as I contain it within one node. As soon as the job has to use more than one node, things stop working with the LD_LIBRARY_PATH message I posted, and orted seems not to start on the remote nodes.

Thanks
Rene
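One thing that stands out in the PE above is control_slaves FALSE. Open MPI's qrsh-based launcher starts the remote orted daemons with qrsh -inherit, and sge_execd only accepts those tasks when the PE has control_slaves TRUE; the "error: executing task of job 1702026 failed" line in the output quoted below looks consistent with that. A minimal sketch of flipping that setting, assuming it is indeed the blocker (the sed edit plus a qconf -Mp reload is just one way to apply it):

    qconf -sp pavtest > pavtest.pe                 # dump the current PE definition to a file
    sed -i 's/^control_slaves.*/control_slaves  TRUE/' pavtest.pe   # assumption: TRUE is what is missing here
    qconf -Mp pavtest.pe                           # load the modified PE back into SGE

With control_slaves TRUE the execution daemons let the job start tasks on the slave nodes under SGE's control, which is what tight integration relies on; with FALSE they refuse the qrsh -inherit request outright.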
On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
> Hi,
>
> it shouldn't be necessary to supply a machinefile, as the one
> generated by SGE is taken automatically (i.e. the granted nodes are
> honored). You submitted the job requesting a PE?
>
> -- Reuti
>
> Am 18.03.2009 um 04:51 schrieb Salmon, Rene:
>
> > Hi,
> >
> > I have looked through the list archives and Google but could not
> > find anything related to what I am seeing. I am simply trying to
> > run the basic cpi.c code using SGE and tight integration.
> >
> > If I run outside SGE I can run my jobs just fine:
> >
> > hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> > Process 0 on hpcp7781
> > Process 1 on hpcp7782
> > pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> > wall clock time = 0.032325
> >
> > If I submit to SGE I get this:
> >
> > [hpcp7781:08527] mca: base: components_open: Looking for plm components
> > [hpcp7781:08527] mca: base: components_open: opening plm components
> > [hpcp7781:08527] mca: base: components_open: found loaded component rsh
> > [hpcp7781:08527] mca: base: components_open: component rsh has no register function
> > [hpcp7781:08527] mca: base: components_open: component rsh open function successful
> > [hpcp7781:08527] mca: base: components_open: found loaded component slurm
> > [hpcp7781:08527] mca: base: components_open: component slurm has no register function
> > [hpcp7781:08527] mca: base: components_open: component slurm open function successful
> > [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> > [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
> > [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
> > [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
> > [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
> > [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> > [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
> > [hpcp7781:08527] mca: base: close: component slurm closed
> > [hpcp7781:08527] mca: base: close: unloading component slurm
> > Starting server daemon at host "hpcp7782"
> > error: executing task of job 1702026 failed:
> > --------------------------------------------------------------------------
> > A daemon (pid 8528) died unexpectedly with status 1 while attempting
> > to launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > mpirun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished
> >
> > [hpcp7781:08527] mca: base: close: component rsh closed
> > [hpcp7781:08527] mca: base: close: unloading component rsh
> >
> > Seems to me orted is not starting on the remote node. I have
> > LD_LIBRARY_PATH set in my shell startup files. If I do an ldd on
> > orted I see this:
> >
> > hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
> >     libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
> >     libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
> >     libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
> >     libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
> >     libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
> >     libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
> >     libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
> >     libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
> >     /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
> >
> > Looks like gridengine is using qrsh to start orted on the remote
> > nodes. qrsh might not be reading my shell startup files and setting
> > LD_LIBRARY_PATH.
> >
> > Thanks for any help with this.
> >
> > Rene
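On the LD_LIBRARY_PATH side: qrsh does not necessarily read the shell startup files on the remote nodes, so here is a minimal sketch of the job script that instead exports the variable explicitly through mpirun's -x option (paths and the PE name are the ones from the script above; whether -x alone is enough in this case is an assumption):

    #!/bin/csh
    #$ -S /bin/csh
    #$ -pe pavtest 16
    # Set the library path for the local mpirun, then push it to the remote
    # nodes with -x instead of relying on remote shell startup files.
    setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
    mpirun -x LD_LIBRARY_PATH --prefix /bphpc7/vol0/salmr0/ompi \
        -np $NSLOTS /bphpc7/vol0/salmr0/SGE/a.out

The --prefix option already asks mpirun to point the remote orted daemons at the /bphpc7/vol0/salmr0/ompi tree, so -x LD_LIBRARY_PATH mainly covers the application processes themselves; $NSLOTS is the slot count SGE grants the job, which avoids hard-coding -np 16.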