I get the following locked memory error:

    --------------------------------------------------------------------------
    *** An error occurred in MPI_Init
    *** before MPI was initialized
    *** MPI_ERRORS_ARE_FATAL (goodbye)
    [node10:10395] [0,0,0]-[0,1,6] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
    --------------------------------------------------------------------------
    The OpenIB BTL failed to initialize while trying to allocate some
    locked memory.  This typically can indicate that the memlock limits
    are set too low.  For most HPC installations, the memlock limits
    should be set to "unlimited".  The failure occured here:

        Host:          node10
        OMPI source:   btl_openib.c:830
        Function:      ibv_create_cq()
        Device:        mlx4_0
        Memlock limit: 32768

    You may need to consult with your system administrator to get this
    problem fixed.  This FAQ entry on the Open MPI web site may also be
    helpful:

        http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
    --------------------------------------------------------------------------

I've read the above FAQ and still have problems.  Here is the scenario: all 
cluster nodes are (supposedly) identical.
I can run just fine on all except a few nodes. For testing, I closed all 
the nodes, so when I submit a job, LSF puts it in the PENDING state.

Now if I use

brun -m "node1 node10" jobid

to release the job, it runs fine.

But if I use

brun -m "node10 node1" jobid

it fails with the Open MPI error above.

I've checked ulimit -a on all nodes, and the locked-memory limit is set to 
unlimited.  I've also added a .bashrc file that sets the ulimit, and done the 
same in my .cshrc file (I start in a csh shell, but the jobs run under sh).
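For reference, here is roughly how I've been checking the limit (just a 
sketch; it assumes passwordless ssh to the nodes), since a shell spawned 
remotely may not read the same rc files as my interactive login:

    for n in node1 node10; do
        # Force /bin/sh on the remote side so csh rc files don't apply;
        # this shows the limit a non-interactive process actually inherits.
        printf '%s: ' "$n"
        ssh "$n" "/bin/sh -c 'ulimit -l'"
    done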

I've compared environment settings and everything else I can think of.  Three 
nodes show the (bad) behaviour if they happen to be the lead node but run 
fine if they are not; the rest of the nodes run fine in either position.
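One more check I'm considering (again a sketch; the --host syntax may vary 
across Open MPI versions): have mpirun launch a plain shell on each node, so 
it reports the memlock limit seen by processes Open MPI itself starts, which 
can differ from an interactive ulimit -a:

    # Prints the locked-memory limit from inside a process launched by
    # Open MPI on each listed node, lead node first.
    mpirun -np 2 --host node10,node1 /bin/sh -c 'ulimit -l'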

Anyone have any ideas about this?

thanks!
tom
