Hi,

Thanks to all who answered. The problem was indeed the max locked memory limit.
However, changing it in <SGE_ROOT>/default/common/settings.sh was not enough.
I also had to add ". <SGE_ROOT>/default/common/settings.sh" to <SGE_ROOT>/default/common/sgeexecd (and to /etc/init.d/sgeexecd on the compute nodes), because when sgeexecd was started at boot it ignored limits.conf.
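
For anyone finding this in the archive, a minimal sketch of what such a change could look like in the sgeexecd startup script. It assumes that settings.sh carries a line such as "ulimit -l unlimited" (the exact line is not stated in this thread), and source_sge_settings is a hypothetical helper name used only for illustration:

```shell
#!/bin/sh
# Sketch only: lines of this shape would sit near the top of the sgeexecd
# startup script, before the daemon itself is launched.
# Assumption: settings.sh contains a line such as "ulimit -l unlimited".

# Hypothetical helper: source the SGE settings file if it is readable.
source_sge_settings() {
    settings="$1/default/common/settings.sh"
    if [ -r "$settings" ]; then
        . "$settings"
        return 0
    fi
    echo "cannot read $settings" >&2
    return 1
}

# In a real startup script SGE_ROOT would be the actual install path.
if [ -n "$SGE_ROOT" ]; then
    source_sge_settings "$SGE_ROOT" || true
fi
```

Because the file is sourced rather than executed, any limit it sets applies to the shell that later starts the daemon, and so to the processes the daemon spawns.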

Best regards,
Noam Meltzer
Software Support Engineer & RHCE
E&M Computing

http://www.emet.co.il



Jeff Squyres wrote:
I suspect that your SGE daemons are not starting with the proper locked memory limits (and therefore jobs started under SGE get severely limited locked memory limits).

See these FAQ entries -- the issues described for SLURM apply to all resource managers (including SGE):

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages-more
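
A quick way to test this suspicion is to print the locked memory limit from inside a job. The following is a minimal sketch, not something taken from the FAQ; locked_limit and check_locked_limit are hypothetical helper names:

```shell
#!/bin/bash
# Sketch: report the max locked memory limit seen by this shell.

# Print the current soft limit ("unlimited" or a number, in kB under bash).
locked_limit() {
    ulimit -l
}

# Return 0 if the given limit is "unlimited", 1 otherwise.
check_locked_limit() {
    limit="$1"
    if [ "$limit" = "unlimited" ]; then
        echo "ok: locked memory is unlimited"
        return 0
    fi
    echo "warning: locked memory limited to ${limit} kB"
    return 1
}

check_locked_limit "$(locked_limit)" || true
```

Running the same script once from a login shell and once through the resource manager (for example via qrsh) should show whether the two environments see different limits.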


On Aug 22, 2007, at 8:31 AM, Noam Meltzer wrote:

Hi,

I am running openmpi-1.2.3 compiled for 64bit on RHEL4u4.
I also have a Voltaire InfiniBand interconnect.
When I manually run jobs using the following command:

/opt/local/openmpi-1.2.3-gcc4/bin/orterun -np 8 -hostfile ~/myHostList \
    -mca btl self,openib /tcc/eandm/performance/igor/main.exe.openmpi123

The job runs just fine.

However, when I run it through SGE I get the following error (on all
hosts in my list):
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to create an internal
queue.  This typically indicates a failed OpenFabrics installation or
faulty hardware.  The failure occured here:

    Host:        node4.grid.technion.ac.il
    OMPI source: btl_openib.c:828
    Function:    ibv_create_cq()
    Error:       Invalid argument (errno=22)
    Device:      mthca0

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------

To send a job to the grid I use the following command:
qrsh -cwd -q noam.q -pe orte 8 ./myScript

where "myScript" looks like:

#!/bin/bash
/opt/local/openmpi-1.2.3-gcc4/bin/orterun -np $NSLOTS \
    -mca btl self,openib /tcc/eandm/performance/igor/main.exe.openmpi123

If I change "openib" to "tcp" (in myScript) everything works just fine.

Any ideas?

--
Best regards,
Noam Meltzer
Software Support Engineer & RHCE
E&M Computing

http://www.emet.co.il

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

