Dear Open MPI experts,

I have updated my little cluster from Scientific Linux 6.5 to 6.6; this included extensive changes to the InfiniBand drivers and a newer Open MPI version (1.8.1). Now I'm getting this message on all nodes with more than 32 GB of RAM:
  WARNING: It appears that your OpenFabrics subsystem is configured to only
  allow registering part of your physical memory. This can cause MPI jobs to
  run with erratic performance, hang, and/or crash.

  This may be caused by your OpenFabrics vendor limiting the amount of
  physical memory that can be registered. You should investigate the
  relevant Linux kernel module parameters that control how much physical
  memory can be registered, and increase them to allow registering all
  physical memory on your machine.

  See this Open MPI FAQ item for more information on these Linux kernel
  module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              pax98
  Registerable memory:     32768 MiB
  Total memory:            49106 MiB

  Your MPI job will continue, but may be behave poorly and/or hang.

The issue is similar to the one described in a previous thread about Ubuntu nodes:

  http://www.open-mpi.org/community/lists/users/2014/08/25090.php

The InfiniBand driver is different here, though: the parameters log_num_mtt and log_mtts_per_seg both still exist, but they cannot be changed and have the same values on all configurations:

[pax52] /root # cat /sys/module/mlx4_core/parameters/log_num_mtt
0
[pax52] /root # cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
3

The kernel changelog says that Red Hat has included this commit:

  mlx4: Scale size of MTT table with system RAM (Doug Ledford)

so everything should be fine: the MTT table now scales automatically with system RAM. However, as far as I can see, Open MPI still uses the wrong value calculated by calculate_max_reg() from these module parameters, so I don't think I can simply ignore the warning (see the P.S. below). A user has also reported a problem with a job, but I cannot confirm that this is the cause.

My workaround was to simply load the mlx5_core kernel module, as its presence is what calculate_max_reg() uses to detect OFED 2.0; the commands are in the P.S. as well.

Regards,
Götz Waschk
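
P.S. For reference, this is roughly the calculation from the FAQ item that calculate_max_reg() appears to be based on; it is only an approximation of the actual code, checked by hand with the values from pax52:

  # Approximate registerable memory per the Open MPI FAQ:
  #   max_reg_mem = (2^log_num_mtt) * (2^log_mtts_per_seg) * page_size
  log_num_mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
  log_mtts_per_seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
  page_size=$(getconf PAGE_SIZE)
  echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))

With log_num_mtt reported as 0 this yields a meaningless 32 KiB, so the 32768 MiB in the warning presumably comes from a default substituted inside calculate_max_reg(); either way, the number no longer reflects what the patched driver can actually register.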
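
And this is the workaround itself. Loading the module once is enough to silence the warning; the /etc/rc.modules part assumes the stock SL6 init scripts, so adjust it if you load modules some other way:

  # Load the module now; its presence alone makes calculate_max_reg()
  # assume OFED 2.0 and skip the mlx4 parameter arithmetic.
  modprobe mlx5_core
  # Load it again on every boot (RHEL/SL 6 convention).
  echo "modprobe mlx5_core" >> /etc/rc.modules
  chmod +x /etc/rc.modules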