I have built Open MPI 1.3.3 without SGE support.
For now, I just want to launch jobs with loose integration.
Here is how I configured it:
./configure CC=pgcc CXX=pgCC F77=pgf90 F90=pgf90 FC=pgf90 \
    --prefix=/opt/openmpi/1.3.3-pgi --without-sge \
    --enable-io-romio --with-openib=/opt/hjet/ofed/1.4.1 \
    --with-io-romio-flags=--with-file-system=lustre \
    --enable-orterun-prefix-by-default
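One way to double-check that the resulting build really contains no SGE support is to search the ompi_info output for gridengine components. The following is only a sketch: the helper name has_gridengine is made up for illustration, and the usage line assumes ompi_info is installed under the prefix given above.

```shell
# Hypothetical helper: reads ompi_info output on stdin and reports
# whether any line mentions a gridengine component.
has_gridengine() {
    if grep -q gridengine; then
        echo "gridengine support present"
    else
        echo "no gridengine support"
    fi
}

# Usage (path assumed from the --prefix above):
#   /opt/openmpi/1.3.3-pgi/bin/ompi_info | has_gridengine
```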
I can start jobs from the command line just fine. When
I try to do the same thing inside an SGE job, I get
errors like the following:
error: executing task of job 5041155 failed:
--------------------------------------------------------------------------
A daemon (pid 13324) died unexpectedly with status 1 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished
I am starting mpirun with the following options:
$OMPI/bin/mpirun -mca btl openib,sm,self --mca pls ^sge \
-machinefile $MACHINE_FILE -x LD_LIBRARY_PATH -np 16 ./xhpl
The options are meant to ensure that IB is used, that SGE is not used,
and that LD_LIBRARY_PATH is exported to the remote processes so that
dynamic linking works correctly.
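For illustration only (the actual job script is not shown here), a machine file such as $MACHINE_FILE could be derived from SGE's $PE_HOSTFILE along the following lines. This sketch assumes that each line of $PE_HOSTFILE starts with a host name followed by its slot count, and it writes Open MPI's "host slots=N" hostfile syntax. The function name pe_to_machinefile is hypothetical.

```shell
# Hypothetical helper: convert an SGE PE hostfile into an Open MPI
# machine file. Each input line is assumed to look like:
#   <host> <slots> <queue> <processor range>
pe_to_machinefile() {
    # Keep only the host name and slot count from each line.
    awk '{ print $1 " slots=" $2 }' "$1"
}

# Usage inside an SGE job script:
#   pe_to_machinefile "$PE_HOSTFILE" > "$MACHINE_FILE"
```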
The same command line worked with 1.2.7 (except that the pls option was
set to gridengine instead of sge), but I can't get it to work with 1.3.3.
Am I missing something obvious for getting jobs with loose integration
started?
Thanks,
Craig