Hi Bill

Maybe you're missing the mlx4_core settings (log_num_mtt and
log_mtts_per_seg) in /etc/modprobe.d/mlx4_core.conf? See:

http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
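
For reference, that FAQ entry gives the registerable memory as
(2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE. Below is a rough sketch
of that arithmetic. The helper name max_reg_mib and the parameter values are
assumptions for illustration; they were picked only because they reproduce
the numbers in the warning, and were not read from the node in question.

```shell
#!/bin/bash
# Registerable memory per the FAQ formula:
#   (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
# max_reg_mib is a hypothetical helper; it prints the result in MiB.
max_reg_mib() {
    local log_num_mtt=$1 log_mtts_per_seg=$2 page_size=$3
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size / 1048576 ))
}

# Illustrative values only (assumed, not read from c2-31); 4 KiB pages.
max_reg_mib 20 3 4096   # prints 32768, the figure in the warning
max_reg_mib 22 3 4096   # prints 131072, more than twice the 64398 MiB of RAM

# A matching line in /etc/modprobe.d/mlx4_core.conf would look like:
#   options mlx4_core log_num_mtt=22 log_mtts_per_seg=3
```

The values currently in effect should be readable under
/sys/module/mlx4_core/parameters/, and a change in the modprobe file only
takes effect once the driver is reloaded (or the node rebooted).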

I hope this helps,
Gus Correa

On 10/21/2014 06:36 PM, Bill Broadley wrote:

I've set up several clusters over the years with OpenMPI, and I often get the
error below:

    WARNING: It appears that your OpenFabrics subsystem is configured to only
    allow registering part of your physical memory.  This can cause MPI jobs to
    run with erratic performance, hang, and/or crash.
    ...
    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

      Local host:              c2-31
      Registerable memory:     32768 MiB
      Total memory:            64398 MiB

I'm well aware of the normal fixes, and have implemented them in Puppet to
ensure compute nodes get the changes.  To be paranoid I've implemented all of
them, and they all worked under Ubuntu 13.10.

However, with Ubuntu 14.04 they no longer seem to work, hence the above
message.

As recommended by the FAQs, I've implemented:
1) ulimit -l unlimited in /etc/profile.d/slurm.sh
2) PropagateResourceLimitsExcept=MEMLOCK in slurm.conf
3) UsePAM=1 in slurm.conf
4) in /etc/security/limits.conf
    * hard memlock unlimited
    * soft memlock unlimited
    * hard stack unlimited
    * soft stack unlimited
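
One way to double-check that these limits actually reach a given process is
to read the kernel's view from /proc. This is a minimal sketch, assuming the
Linux /proc/<pid>/limits layout; memlock_limits is a hypothetical helper
name.

```shell
#!/bin/bash
# Print the soft and hard "Max locked memory" limits from a limits file
# in /proc/<pid>/limits format (fields 4 and 5 of that line).
memlock_limits() {
    awk '/^Max locked memory/ { print $4, $5 }' "$1"
}

# Limits of the current shell, if /proc is available.
[ -r /proc/self/limits ] && memlock_limits /proc/self/limits
```

Pointing it at the limits file of the running slurmd (for example
/proc/$(pgrep -o slurmd)/limits, assuming the daemon runs under that name)
shows what the job launcher itself was started with.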

My changes seem to be working.  If I submit this to slurm:
#!/bin/bash -l
ulimit -l
hostname
mpirun bash -c "ulimit -l"
mpirun ./relay 1 131072

I get:
    unlimited
    c2-31
    unlimited
    unlimited
    unlimited
    unlimited
    <above error message only 32GB of Registerable memory>
    <output of mpirun relay>

Is there some new kernel parameter, OFED parameter, or similar that controls
locked pages now?  The kernel is 3.13.0-36 and the libopenmpi-dev package is
1.6.5.

Since the ulimit -l setting is reaching both the slurm-launched script and the
mpirun-launched binaries, I'm pretty puzzled.

Any suggestions?
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/10/25544.php

