Can you check what the locked memory limits are *inside of a job*? They can be different from what you see if you log in to the node independently, outside of an LSF job.

For example, write a quickie script that runs "ulimit -a", submit it through LSF, and see what results you get. Better yet, use something like this (typed off the top of my head -- not tested for correctness/typos at all):

runme.csh:

#!/bin/csh -f
# csh's "limit" takes a resource name, not a -l flag
set l=`limit memorylocked`
echo `hostname`: $l
exit 0

submitme.csh:

#!/bin/csh -f
mpirun ./runme.csh

That is, submit the submitme.csh script to LSF and have it mpirun the runme.csh script so that you can see the limits on all the nodes that you requested.
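Since you mention below that the jobs actually run under sh, a Bourne-shell version of runme might be closer to your real environment. Same caveat applies -- this is an untested sketch, not something I've run:

```shell
#!/bin/sh
# Report this node's locked-memory limit ("unlimited" or a size in KB).
# "ulimit -l" queries the max-locked-memory soft limit in sh/bash.
echo "`hostname`: memlock limit = `ulimit -l`"
exit 0
```

If the node that fails as the "lead" node reports 32768 here while the others report unlimited, that would match the error message you're seeing.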

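For reference, the fix in the FAQ you cite usually boils down to raising the memlock limits in /etc/security/limits.conf on every node and then restarting whatever daemons launch your jobs (sshd, the LSF daemons) so the new limits are inherited. Something along these lines -- adjust to your site's policy:

```
* soft memlock unlimited
* hard memlock unlimited
```

The key gotcha is the daemon restart: limits set in .bashrc/.cshrc don't help if the process that spawns your job was started under the old limits.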

On Jun 11, 2008, at 5:59 PM, twu...@goodyear.com wrote:


I get the locked memory error as follows:

--------------------------------------------------------------------------
   *** An error occurred in MPI_Init
   *** before MPI was initialized
   *** MPI_ERRORS_ARE_FATAL (goodbye)
[node10:10395] [0,0,0]-[0,1,6] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--------------------------------------------------------------------------
   The OpenIB BTL failed to initialize while trying to allocate some
   locked memory.  This typically can indicate that the memlock limits
   are set too low.  For most HPC installations, the memlock limits
   should be set to "unlimited".  The failure occured here:

       Host:          node10
       OMPI source:   btl_openib.c:830
       Function:      ibv_create_cq()
       Device:        mlx4_0
       Memlock limit: 32768

   You may need to consult with your system administrator to get this
   problem fixed.  This FAQ entry on the Open MPI web site may also be
   helpful:

       http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------

I've read the above FAQ and still have problems. Here is the scenario. All cluster nodes are supposed to be identical. I can run just fine on all except a few nodes. For testing, I have closed all the nodes, so when I submit the job, LSF puts it in the PENDING state.

Now if I use

brun -m "node1 node10" jobid

to release the job, it runs fine.

But if I use

brun -m "node10 node1" jobid

it fails with the above OPENMPI error.

I've checked ulimit -a on all nodes; it is set to unlimited. I've added a .bashrc file and set the ulimit there, as well as in my .cshrc file (I start in a csh shell and the jobs run in sh).

I've compared environment settings and everything else I can think of. Three nodes show the (bad) behaviour if they happen to be the lead node and run fine if they are not; the rest of the nodes run fine in either position.

Anyone have any ideas about this?

thanks!
tom

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
Cisco Systems
