Hi - we’re having a weird problem with OpenMPI on our newish infiniband EDR
(mlx5) nodes. We're running CentOS 7.6, with all the infiniband and ucx
libraries as provided by CentOS, i.e.
ucx-1.4.0-1.el7.x86_64
libibverbs-utils-17.2-3.el7.x86_64
libibverbs-17.2-3.el7.x86_64
libibumad-17.2-3.el7.x8
Noam, it may be a stupid question. Could you try running slabtop as
the program executes?
'watch cat /proc/meminfo' is also a good diagnostic.
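
If watching /proc/meminfo by eye is too noisy, the same check can be scripted. The following is only a minimal sketch, not anything from the Open MPI or UCX tooling: the function names (parse_meminfo, growth_kb, watch_meminfo) are made up for illustration, and the choice of the MemFree, Slab and SUnreclaim fields is an assumption about which counters are most useful for a kernel-side (slab) leak.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def growth_kb(before, after, fields=("MemFree", "Slab", "SUnreclaim")):
    """Return after - before for each field present in both samples."""
    return {f: after[f] - before[f] for f in fields if f in before and f in after}


def watch_meminfo(path="/proc/meminfo", interval=1.0, samples=10):
    """Print per-interval deltas for the selected fields (Linux only)."""
    with open(path) as fh:
        previous = parse_meminfo(fh.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as fh:
            current = parse_meminfo(fh.read())
        print(growth_kb(previous, current))
        previous = current
```

Calling watch_meminfo() in a second terminal on a compute node while the MPI job runs prints one dict of deltas per second. A steadily growing Slab or SUnreclaim value alongside a falling MemFree would suggest the memory is going to kernel slab caches rather than to the process heap.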
On Wed, 19 Jun 2019 at 18:32, Noam Bernstein via users <
users@lists.open-mpi.org> wrote:
> Hi - we’re having a weird problem with OpenMPI on ou
> On Jun 19, 2019, at 2:00 PM, John Hearns via users
> wrote:
>
> Noam, it may be a stupid question. Could you try running slabtop as the
> program executes
The top SIZE usage is this line
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
5937540 5937540 100%
I tried to disable ucx (successfully, I think - I replaced the "--mca btl ucx
--mca btl ^vader,tcp,openib" with "--mca btl_openib_allow_ib 1", and attaching
gdb to a running process shows no ucx-related routines active). It still has
the same fast-growing (1 GB/s) memory usage problem.
To completely disable UCX you need to disable the UCX MTL and not only the
BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".
As you have a gdb session on the processes you can try to break on some of
the memory allocation functions (malloc, realloc, calloc).
George.
> On Jun 19, 2019, at 2:44 PM, George Bosilca wrote:
>
> To completely disable UCX you need to disable the UCX MTL and not only the
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".
Thanks for the pointer. Disabling ucx this way _does_ seem to fix the memory
issue.
Hi, Noam
Can you try your original command line with the following addition:
mpirun --mca pml ucx --mca btl ^vader,tcp,openib -mca osc ucx
I think we're seeing some conflict between UCX PML and UCT OSC.
Josh
On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <
users@lists.open-mpi.org>
> On Jun 19, 2019, at 5:05 PM, Joshua Ladd wrote:
>
> Hi, Noam
>
> Can you try your original command line with the following addition:
>
> mpirun --mca pml ucx --mca btl ^vader,tcp,openib -mca osc ucx
>
> I think we're seeing some conflict between UCX PML and UCT OSC.
I did this, although me