Hi Bill,

Maybe you're missing these settings in /etc/modprobe.d/mlx4_core.conf?
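
For what it's worth, here is a rough sketch of the arithmetic behind the FAQ entry linked just below, not a definitive recipe. The FAQ gives the registerable memory as 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE. The helper name reg_mem_mib, the 4 KiB page size, and the example parameter values are assumptions for illustration only; the values actually in effect on a node should be readable under /sys/module/mlx4_core/parameters/ once the module is loaded.

```shell
#!/bin/bash
# Sketch: estimate how much memory mlx4_core allows to be registered.
#   max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
# reg_mem_mib is a hypothetical helper name, not part of any tool.
# Arguments: log_num_mtt, log_mtts_per_seg, page size in bytes
# (the page size defaults to 4096, which is an assumption).
reg_mem_mib() {
    local log_num_mtt=$1
    local log_mtts_per_seg=$2
    local page_size=${3:-4096}
    # Shift-based powers of two, then convert bytes to MiB.
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size / 1024 / 1024 ))
}

# Illustrative values only (not read from a real node):
reg_mem_mib 20 3   # prints 32768
reg_mem_mib 24 1   # prints 131072
```

With 4 KiB pages, 20 and 3 give 32768 MiB, which matches the "Registerable memory" figure in the warning. A line such as "options mlx4_core log_num_mtt=24 log_mtts_per_seg=1" in the modprobe file would allow 131072 MiB, roughly twice the 64398 MiB of RAM. The driver has to be reloaded (or the node rebooted) before new values take effect.
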
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem

I hope this helps,
Gus Correa

On 10/21/2014 06:36 PM, Bill Broadley wrote:
I've set up several clusters over the years with OpenMPI. I often get the error below:

WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
...
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:          c2-31
  Registerable memory: 32768 MiB
  Total memory:        64398 MiB

I'm well aware of the normal fixes, and have implemented them in Puppet to ensure compute nodes get the changes. To be paranoid I've implemented all the changes, and they all worked under Ubuntu 13.10. However, with Ubuntu 14.04 they no longer seem to work, hence the above message.

As recommended by the FAQs, I've implemented:
1) ulimit -l unlimited in /etc/profile.d/slurm.sh
2) PropagateResourceLimitsExcept=MEMLOCK in slurm.conf
3) UsePAM=1 in slurm.conf
4) in /etc/security/limits.conf
   * hard memlock unlimited
   * soft memlock unlimited
   * hard stack unlimited
   * soft stack unlimited

My changes seem to be working. If I submit this to Slurm:

#!/bin/bash -l
ulimit -l
hostname
mpirun bash -c ulimit -l
mpirun ./relay 1 131072

I get:

unlimited
c2-31
unlimited
unlimited
unlimited
unlimited
<above error message only 32GB of Registerable memory>
<output of mpirun relay>

Is there some new kernel parameter, OFED parameter, or similar that controls locked pages now? The kernel is 3.13.0-36 and the libopenmpi-dev package is 1.6.5.

Since the ulimit -l setting is reaching both the Slurm-launched script and the mpirun-launched binaries, I'm pretty puzzled.

Any suggestions?

_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: http://www.open-mpi.org/community/lists/users/2014/10/25544.php
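
A side note on the per-rank check in the quoted job script, offered as a hedged sketch. If the command was run exactly as it appears here (the archive may have dropped quotes), "bash -c ulimit -l" passes only "ulimit" as the command string and "-l" as $0, so bash reports its default resource, the file size limit, rather than the locked-memory limit. The variable names below are illustrative, and the mpirun line in the final comment is the assumed corrected form of the check.

```shell
# Unquoted: bash -c takes "ulimit" as the command string and "-l" as $0,
# so this reports the default resource (file size, same as `ulimit -f`).
unquoted=$(bash -c ulimit -l)

# Quoted: the whole string "ulimit -l" is the command, so this reports
# the locked-memory limit.
quoted=$(bash -c 'ulimit -l')

# Reference value to compare the unquoted result against.
file_size=$(bash -c 'ulimit -f')

echo "unquoted: $unquoted (file size limit is $file_size)"
echo "quoted:   $quoted"

# The per-rank check under Open MPI would then be written as:
#   mpirun bash -c 'ulimit -l'
```

On many systems both limits happen to be "unlimited", so the two forms can print the same thing even though they query different resources.
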