Hi there,

We're intermittently seeing messages (below) about failing to register memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 / 24 core 126G RAM Broadwell nodes and the vanilla IB stack as shipped by centos.

(We previously saw similar messages for the "ud" oob component but, as recommended in this thread, we stopped oob from using openib via an MCA parameter.)

I've checked to see what the registered memory limit is (by setting mlx4_core's debug_level, rebooting and examining kernel messages) and it's double the system RAM - which I understand is the recommended setting.
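For reference, the arithmetic behind the "double the system RAM" check can be sketched as below. This is a rough illustration, not output from these nodes: it assumes the formula given in the Open MPI FAQ for mlx4 hardware (2^log_num_mtt * 2^log_mtts_per_seg * page size), and the parameter values are made up for the example. Those module parameters may not be exposed by the centos-shipped driver at all.

```python
def max_registered_memory(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Registered-memory ceiling in bytes, per the Open MPI FAQ formula (assumed)."""
    return (2 ** log_num_mtt) * (2 ** log_mtts_per_seg) * page_size


GIB = 2 ** 30

# Illustrative values only; not read from the nodes described in this thread.
limit = max_registered_memory(log_num_mtt=23, log_mtts_per_seg=3)
ram_gib = 126  # system RAM quoted above

print(f"registerable: {limit // GIB} GiB, target (2 x RAM): {2 * ram_gib} GiB")
print("meets 2 x RAM:", limit >= 2 * ram_gib * GIB)
```

With these example values the ceiling is 256 GiB, just above twice the 126G of RAM.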

Any ideas about what might be going on, please?

Thanks,

Mark


--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to initialize while trying to
allocate some locked memory.  This typically can indicate that the
memlock limits are set too low.  For most HPC installations, the
memlock limits should be set to "unlimited".  The failure occured
here:

  Local host:    dc1s0b1a
  OMPI source:   btl_openib.c:752
  Function:      opal_free_list_init()
  Device:        mlx4_0
  Memlock limit: unlimited

You may need to consult with your system administrator to get this
problem fixed.  This FAQ entry on the Open MPI web site may also be
helpful:

http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
--------------------------------------------------------------------------
[dc1s0b1a][[59067,1],0][btl_openib.c:1035:mca_btl_openib_add_procs] could not prepare openib device for use
[dc1s0b1a][[59067,1],0][btl_openib.c:1186:mca_btl_openib_get_ep] could not prepare openib device for use
[dc1s0b1a][[59067,1],0][connect/btl_openib_connect_udcm.c:1522:udcm_find_endpoint] could not find endpoint with port: 1, lid: 69, msg_type: 100
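
The "Memlock limit: unlimited" line in the message above corresponds to the per-process RLIMIT_MEMLOCK. A minimal sketch for confirming it from inside a job, assuming Python is available on the compute nodes; the helper name memlock_limit is made up for this example. Launching it under mpirun, one copy per rank, would show what each process actually inherits, which can differ from an interactive shell.

```python
import resource
import socket


def memlock_limit():
    """Return (soft, hard) RLIMIT_MEMLOCK as strings; 'unlimited' for RLIM_INFINITY."""
    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)


if __name__ == "__main__":
    soft, hard = memlock_limit()
    # getrlimit reports bytes, whereas 'ulimit -l' reports kilobytes.
    print(f"{socket.gethostname()}: memlock soft={soft} hard={hard}")
```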


On Thu, 19 Oct 2017, Mark Dixon wrote:

Thanks Ralph, will do.

Cheers,

Mark

On Wed, 18 Oct 2017, [email protected] wrote:

 Put “oob=tcp” in your default MCA param file

 On Oct 18, 2017, at 9:00 AM, Mark Dixon <[email protected]> wrote:

 Hi,

 We're intermittently seeing messages (below) about failing to register
 memory with openmpi 2.0.2 on centos7 / Mellanox FDR Connect-X 3 and the
 vanilla IB stack as shipped by centos.

 We're not using any mlx4_core module tweaks at the moment. On earlier
 machines we used to set registered memory as per the FAQ, but neither
 log_num_mtt nor num_mtt seem to exist these days (according to
 /sys/module/mlx4_*/parameters/*), which makes it somewhat difficult to
 follow the FAQ.

 The output of 'ulimit -l' shows as unlimited for every rank.

 Does anyone have any advice, please?

 Thanks,

 Mark

 -------------------------------------------------------------------------
 Failed to register memory region (MR):

 Hostname: dc1s0b1c
 Address:  ec5000
 Length:   20480
 Error:    Cannot allocate memory
 --------------------------------------------------------------------------
 --------------------------------------------------------------------------
 Open MPI has detected that there are UD-capable Verbs devices on your
 system, but none of them were able to be setup properly.  This may
 indicate a problem on this system.

 You job will continue, but Open MPI will ignore the "ud" oob component
 in this run.
 _______________________________________________
 users mailing list
 [email protected]
 https://lists.open-mpi.org/mailman/listinfo/users

