> > aha. Did you compile Open MPI 1.3 with the SGE option?
>
> Yes I did.
> hpcp7781(salmr0)142:ompi_info |grep grid
>                  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.3)
>
> > > setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib
> >
> > Maybe you have to set this LD_LIBRARY_PATH in your .cshrc, so it's
> > known automatically on the nodes.
>
> Yes. I also have "setenv LD_LIBRARY_PATH /bphpc7/vol0/salmr0/ompi/lib" in
> my .cshrc as well. I just wanted to make double sure that it was there. I
> even tried putting "/bphpc7/vol0/salmr0/ompi/lib" in /etc/ld.so.conf
> system-wide just to see if that would help, but I still get the same
> results.
>
> > > mpirun --mca plm_base_verbose 20 --prefix /bphpc7/vol0/salmr0/ompi -np 16 /bphpc7/vol0/salmr0/SGE/a.out
> >
> > Do you use --mca... only for debugging or why is it added here?
>
> I only put that there for debugging. Is there a different flag I should
> use to get more debug info?
>
> Thanks
> Rene
>
> > -- Reuti
> >
> > We are using PEs. Here is what the PE looks like:
> >
> > hpcp7781(salmr0)129:qconf -sp pavtest
> > pe_name           pavtest
> > slots             16
> > user_lists        NONE
> > xuser_lists       NONE
> > start_proc_args   /bin/true
> > stop_proc_args    /bin/true
> > allocation_rule   8
> > control_slaves    FALSE
> > job_is_first_task FALSE
> > urgency_slots     min
> >
> > Here is the qsub line to submit the job:
> >
> >   qsub -pe pavtest 16 simple-job.sh
> >
> > The job seems to run fine with no problems within SGE if I contain the
> > job within one node. As soon as the job has to use more than one node,
> > things stop working with the message I posted about LD_LIBRARY_PATH,
> > and orted seems not to start on the remote nodes.
> >
> > Thanks
> > Rene
> >
> > On Wed, 2009-03-18 at 07:45 +0000, Reuti wrote:
> >> Hi,
> >>
> >> it shouldn't be necessary to supply a machinefile, as the one
> >> generated by SGE is taken automatically (i.e. the granted nodes are
> >> honored). You submitted the job requesting a PE?
> >>
> >> -- Reuti
> >>
> >> On 18.03.2009, at 04:51, Salmon, Rene wrote:
> >>
> >>> Hi,
> >>>
> >>> I have looked through the list archives and Google but could not
> >>> find anything related to what I am seeing. I am simply trying to
> >>> run the basic cpi.c code using SGE and tight integration.
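The qsub line quoted above submits a script called simple-job.sh that never appears in the thread. Below is a minimal sketch of what such a job script could look like for an Open MPI program under SGE; only the PE name, slot count, install prefix, and a.out path are taken from the thread, while everything else (the shell, the #$ directives, the use of $NSLOTS) is an illustrative assumption rather than something the poster confirmed:

    #!/bin/sh
    # Hypothetical reconstruction of simple-job.sh; only the paths and the
    # PE come from the thread, the rest is illustrative.
    #$ -S /bin/sh
    #$ -cwd
    #$ -j y

    # Make the Open MPI installation visible inside the job itself.
    PATH=/bphpc7/vol0/salmr0/ompi/bin:$PATH
    LD_LIBRARY_PATH=/bphpc7/vol0/salmr0/ompi/lib:$LD_LIBRARY_PATH
    export PATH LD_LIBRARY_PATH

    # With an SGE-aware Open MPI build, mpirun reads the granted host list
    # from $PE_HOSTFILE on its own, so no -machinefile is needed here.
    # SGE sets $NSLOTS to the number of slots granted to the PE job.
    mpirun -np $NSLOTS /bphpc7/vol0/salmr0/SGE/a.out

Submitted as in the thread with "qsub -pe pavtest 16 simple-job.sh". One detail worth noting from the PE definition quoted above: Open MPI's gridengine launcher starts orted on the other nodes through "qrsh -inherit", and SGE only allows that when the parallel environment has control_slaves set to TRUE; the pavtest PE shows control_slaves FALSE.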
> >>>
> >>> If run outside SGE, I can run my jobs just fine:
> >>>
> >>> hpcp7781(salmr0)132:mpiexec -np 2 --machinefile x a.out
> >>> Process 0 on hpcp7781
> >>> Process 1 on hpcp7782
> >>> pi is approximately 3.1416009869231241, Error is 0.0000083333333309
> >>> wall clock time = 0.032325
> >>>
> >>> If I submit to SGE I get this:
> >>>
> >>> [hpcp7781:08527] mca: base: components_open: Looking for plm components
> >>> [hpcp7781:08527] mca: base: components_open: opening plm components
> >>> [hpcp7781:08527] mca: base: components_open: found loaded component rsh
> >>> [hpcp7781:08527] mca: base: components_open: component rsh has no register function
> >>> [hpcp7781:08527] mca: base: components_open: component rsh open function successful
> >>> [hpcp7781:08527] mca: base: components_open: found loaded component slurm
> >>> [hpcp7781:08527] mca: base: components_open: component slurm has no register function
> >>> [hpcp7781:08527] mca: base: components_open: component slurm open function successful
> >>> [hpcp7781:08527] mca:base:select: Auto-selecting plm components
> >>> [hpcp7781:08527] mca:base:select:( plm) Querying component [rsh]
> >>> [hpcp7781:08527] [[INVALID],INVALID] plm:rsh: using /hpc/SGE/bin/lx24-amd64/qrsh for launching
> >>> [hpcp7781:08527] mca:base:select:( plm) Query of component [rsh] set priority to 10
> >>> [hpcp7781:08527] mca:base:select:( plm) Querying component [slurm]
> >>> [hpcp7781:08527] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> >>> [hpcp7781:08527] mca:base:select:( plm) Selected component [rsh]
> >>> [hpcp7781:08527] mca: base: close: component slurm closed
> >>> [hpcp7781:08527] mca: base: close: unloading component slurm
> >>> Starting server daemon at host "hpcp7782"
> >>> error: executing task of job 1702026 failed:
> >>> --------------------------------------------------------------------------
> >>> A daemon (pid 8528) died unexpectedly with status 1 while attempting
> >>> to launch so we are aborting.
> >>>
> >>> There may be more information reported by the environment (see above).
> >>>
> >>> This may be because the daemon was unable to find all the needed shared
> >>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> >>> the location of the shared libraries on the remote nodes and this will
> >>> automatically be forwarded to the remote nodes.
> >>> --------------------------------------------------------------------------
> >>> --------------------------------------------------------------------------
> >>> mpirun noticed that the job aborted, but has no info as to the process
> >>> that caused that situation.
> >>> --------------------------------------------------------------------------
> >>> mpirun: clean termination accomplished
> >>>
> >>> [hpcp7781:08527] mca: base: close: component rsh closed
> >>> [hpcp7781:08527] mca: base: close: unloading component rsh
> >>>
> >>> Seems to me orted is not starting on the remote node. I have
> >>> LD_LIBRARY_PATH set in my shell startup files.
> >>> If I do an ldd on orted I see this:
> >>>
> >>> hpcp7781(salmr0)135:ldd /bphpc7/vol0/salmr0/ompi/bin/orted
> >>>     libopen-rte.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-rte.so.0 (0x00002ac5b14e2000)
> >>>     libopen-pal.so.0 => /bphpc7/vol0/salmr0/ompi/lib/libopen-pal.so.0 (0x00002ac5b1628000)
> >>>     libdl.so.2 => /lib64/libdl.so.2 (0x00002ac5b17a9000)
> >>>     libnsl.so.1 => /lib64/libnsl.so.1 (0x00002ac5b18ad000)
> >>>     libutil.so.1 => /lib64/libutil.so.1 (0x00002ac5b19c4000)
> >>>     libm.so.6 => /lib64/libm.so.6 (0x00002ac5b1ac7000)
> >>>     libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ac5b1c1c000)
> >>>     libc.so.6 => /lib64/libc.so.6 (0x00002ac5b1d34000)
> >>>     /lib64/ld-linux-x86-64.so.2 (0x00002ac5b13c6000)
> >>>
> >>> Looks like gridengine is using qrsh to start orted on the remote
> >>> nodes. qrsh might not be reading my shell startup file and setting
> >>> LD_LIBRARY_PATH.
> >>>
> >>> Thanks for any help with this.
> >>>
> >>> Rene
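Rene's closing hypothesis is that qrsh starts orted without the LD_LIBRARY_PATH from his shell startup files. The sketch below shows one way to probe and work around that from the submit host; it is illustrative only. The node name, install prefix, and binary paths are taken from the thread, while the ssh checks and the full-path mpirun invocation are assumptions about how one might test and sidestep the problem, not steps reported in the thread.

    # Check what a remote, non-interactive command actually sees; this is
    # closer to the environment a remotely launched orted would get than an
    # interactive login is.
    ssh hpcp7782 printenv LD_LIBRARY_PATH
    ssh hpcp7782 ldd /bphpc7/vol0/salmr0/ompi/bin/orted

    # Open MPI can also set PATH and LD_LIBRARY_PATH on the remote side
    # itself before starting orted: pass --prefix (as already done in the
    # thread), or invoke mpirun by its absolute path, which Open MPI treats
    # like --prefix.
    /bphpc7/vol0/salmr0/ompi/bin/mpirun -np 16 /bphpc7/vol0/salmr0/SGE/a.out

If the remote ldd resolves every library but the SGE-launched daemon still dies, the failure is more likely on the SGE side of the launch (the "error: executing task of job 1702026 failed:" line comes from qrsh) than in the dynamic linker.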