Hi Bill,
Maybe you're missing the mlx4_core module settings (log_num_mtt and
log_mtts_per_seg) in /etc/modprobe.d/mlx4_core.conf?
http://www.open-mpi.org/faq/?category=openfabrics#ib-low-reg-mem
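That FAQ entry gives the registerable memory for mlx4 hardware as 2^log_num_mtt * 2^log_mtts_per_seg * page size, and recommends allowing about twice the physical memory to be registered. Here is a rough sketch of the arithmetic. The function names are made up for illustration, and the values 20 and 3 are only an assumption that happens to reproduce the 32768 MiB in the warning; the real values on the node should be readable under /sys/module/mlx4_core/parameters/.

```shell
#!/bin/bash
# Registerable memory in MiB for the given mlx4_core parameters:
#   2^log_num_mtt * 2^log_mtts_per_seg * page_size
reg_mem_mib() {
    local log_num_mtt=$1 log_mtts_per_seg=$2 page_size=$3
    echo $(( (1 << (log_num_mtt + log_mtts_per_seg)) * page_size / 1024 / 1024 ))
}

# Smallest log_num_mtt that allows at least target_mib MiB to be registered,
# keeping log_mtts_per_seg and the page size fixed.
required_log_num_mtt() {
    local target_mib=$1 log_mtts_per_seg=$2 page_size=$3 n=0
    while (( $(reg_mem_mib "$n" "$log_mtts_per_seg" "$page_size") < target_mib )); do
        n=$(( n + 1 ))
    done
    echo "$n"
}

# Assumed current settings (20 and 3, with 4 KiB pages): matches the warning.
reg_mem_mib 20 3 4096                 # prints 32768

# Twice the physical memory of a 64 GiB node is 131072 MiB.
required_log_num_mtt 131072 3 4096    # prints 22
```

Under those assumptions, a line such as "options mlx4_core log_num_mtt=22 log_mtts_per_seg=3" in /etc/modprobe.d/mlx4_core.conf would raise the limit to 128 GiB once the module is reloaded.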
I hope this helps,
Gus Correa
On 10/21/2014 06:36 PM, Bill Broadley wrote:
I've set up several clusters over the years with OpenMPI. I often get the
error below:
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.
...
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Local host: c2-31
Registerable memory: 32768 MiB
Total memory: 64398 MiB
I'm well aware of the normal fixes, and have implemented them in Puppet to
ensure compute nodes get the changes. To be paranoid I've implemented all the
changes, and they all worked under Ubuntu 13.10.
However, with Ubuntu 14.04 they no longer seem to work, hence the above
message.
As recommended by the FAQs, I've implemented:
1) ulimit -l unlimited in /etc/profile.d/slurm.sh
2) PropagateResourceLimitsExcept=MEMLOCK in slurm.conf
3) UsePAM=1 in slurm.conf
4) in /etc/security/limits.conf:
   * hard memlock unlimited
   * soft memlock unlimited
   * hard stack unlimited
   * soft stack unlimited
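A minimal sketch for double-checking the limits.conf part on a compute node. The helper names are hypothetical, and it only looks at entries for the `*` domain in a single file; files under /etc/security/limits.d/ can also set limits and are not examined here.

```shell
#!/bin/bash
# Print "<type> <value>" for every memlock entry that applies to all users ("*")
# in a limits.conf-style file. A type of "-" sets both hard and soft limits.
memlock_entries() {
    awk '$1 == "*" && $3 == "memlock" {
        if ($2 == "-") { print "hard", $4; print "soft", $4 }
        else           { print $2, $4 }
    }' "$1"
}

# Succeed (exit status 0) only if both the hard and the soft memlock limit
# for "*" are set to "unlimited" in the given file.
memlock_unlimited() {
    local entries
    entries=$(memlock_entries "$1")
    grep -qx 'hard unlimited' <<<"$entries" && grep -qx 'soft unlimited' <<<"$entries"
}

# Example use on a node, if the file is readable.
conf=/etc/security/limits.conf
if [ -r "$conf" ]; then
    if memlock_unlimited "$conf"; then
        echo "memlock: hard and soft are unlimited for *"
    else
        echo "memlock: not unlimited for * in $conf"
    fi
fi
```

This checks only what is configured; the limit actually in effect for a job is what `ulimit -l` reports inside that job.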
My changes seem to be working. If I submit this to slurm:
#!/bin/bash -l
ulimit -l
hostname
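# Quoted so that -l is passed to ulimit; unquoted, bash -c runs a bare "ulimit" with $0 set to -l.
mpirun bash -c 'ulimit -l'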
mpirun ./relay 1 131072
I get:
unlimited
c2-31
unlimited
unlimited
unlimited
unlimited
<above error message only 32GB of Registerable memory>
<output of mpirun relay>
Is there some new kernel parameter, OFED parameter, or similar that controls
locked pages now? The kernel is 3.13.0-36 and the libopenmpi-dev package is
1.6.5.
Since the ulimit -l setting is reaching both the Slurm-launched script and the
mpirun-launched binaries, I'm pretty puzzled.
Any suggestions?
Link to this post:
http://www.open-mpi.org/community/lists/users/2014/10/25544.php