> On Jun 19, 2019, at 2:44 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> To completely disable UCX you need to disable the UCX MTL and not only the 
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".

Thanks for the pointer.  Disabling ucx this way _does_ seem to fix the memory 
issue.  That’s a very helpful workaround, if nothing else.
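
For anyone who wants to try the same workaround without editing a job's mpirun line, here is a minimal sketch that sets the same three MCA parameters through the environment. It relies on Open MPI reading variables named OMPI_MCA_<param>; the application name and rank count are placeholders, not taken from this thread.

```shell
# Same settings as the quoted mpirun flags, expressed as environment variables.
# Open MPI treats any variable named OMPI_MCA_<param> as an MCA parameter.
export OMPI_MCA_pml=ob1                  # use the ob1 PML instead of the UCX one
export OMPI_MCA_btl='^ucx'               # exclusion list copied from the quoted suggestion
export OMPI_MCA_btl_openib_allow_ib=1    # allow the openib BTL to use InfiniBand ports

# Hypothetical launch line; ./my_app and the rank count are placeholders.
LAUNCH_CMD="mpirun -np 4 ./my_app"
echo "Would run: $LAUNCH_CMD"
```

Parameters given on the mpirun command line take precedence over these variables, so the two forms should not be mixed with conflicting values.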

Swapping in ucx 1.5.1 (downloaded from the ucx web site) at runtime, just by 
putting it in LD_LIBRARY_PATH without recompiling openmpi, doesn’t seem to fix 
the problem.
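
A rough sketch of that kind of runtime swap, with a check of which UCX libraries the Open MPI component actually resolves to. The UCX install prefix is a hypothetical placeholder; the mca_pml_ucx.so path is the one that appears in the backtraces in this message.

```shell
# Hypothetical install prefix for the downloaded UCX 1.5.1 build.
UCX_PREFIX="${UCX_PREFIX:-/opt/ucx-1.5.1}"

# Prepend so the dynamic loader searches this directory first.
export LD_LIBRARY_PATH="$UCX_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Open MPI's UCX component, path as shown in the backtraces.
PML_UCX=/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so

# List the UCX libraries (libucp, libuct, libucs, libucm) the component resolves to.
if [ -e "$PML_UCX" ]; then
    ldd "$PML_UCX" | grep -E 'libuc[mpst]' || echo "no UCX libraries listed by ldd"
else
    echo "mca_pml_ucx.so not found at $PML_UCX; adjust the path"
fi
```

If the component was linked with an RPATH, the loader may still pick the original UCX regardless of LD_LIBRARY_PATH, which is why the ldd check is worth doing before drawing conclusions.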

> 
> As you have a gdb session on the processes you can try to break on some of 
> the memory allocation functions (malloc, realloc, calloc).

Good idea.  I set breakpoints on all 3 of those, then did “c” 3 times; the 
backtraces from the three stops are below.  Does this mean anything to anyone?  
I’m investigating the upstream calls (not included below) that generate these 
calls to mpi_bcast, but given that the same code works on other types of 
nodes, I doubt those are problematic.

#0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1  0x00002b9e651f358a in ucs_rcache_create_region (region_p=0x7fff82806da0, 
arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, 
rcache=0xb341a50) at sys/rcache.c:500
#2  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, 
prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, 
region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#3  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, 
address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at 
ib/base/ib_md.c:990
#4  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, 
reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, 
uct_flags=uct_flags@entry=96, 
    alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, 
alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, 
md_map_p=md_map_p@entry=0xbc409a8)
    at core/ucp_mm.c:100
#5  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, 
buffer=0x2b9e76102070, length=131072, datatype=128, 
state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, 
    req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, 
uct_flags@entry=0) at core/ucp_request.c:218
#6  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, 
req=0xbc40940) at 
/home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#7  ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#8  0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, 
proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 
<mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, 
    rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, 
req=<optimized out>) at tag/tag_send.c:78
#9  ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, 
datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 
<mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#10 0x00002b9e64465fa6 in mca_pml_ucx_isend () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#11 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#12 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#13 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#14 0x00002b9e521dbb79 in PMPI_Bcast () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e51f623df in pmpi_bcast__ () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40


#0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1  0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at 
datastruct/pgtable.c:69
#2  ucs_pgtable_insert_page (region=0xc6919d0, order=12, 
address=47959585718272, pgtable=0xb341ab8) at datastruct/pgtable.c:299
#3  ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8, 
region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
#4  0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0, 
arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, 
rcache=0xb341a50) at sys/rcache.c:511
#5  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, 
prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, 
region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#6  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, 
address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at 
ib/base/ib_md.c:990
#7  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, 
reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, 
uct_flags=uct_flags@entry=96, 
    alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, 
alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, 
md_map_p=md_map_p@entry=0xbc409a8)
    at core/ucp_mm.c:100
#8  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, 
buffer=0x2b9e76102070, length=131072, datatype=128, 
state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, 
    req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, 
uct_flags@entry=0) at core/ucp_request.c:218
#9  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, 
req=0xbc40940) at 
/home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, 
proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 
<mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, 
    rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, 
req=<optimized out>) at tag/tag_send.c:78
#12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, 
datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 
<mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#17 0x00002b9e521dbb79 in PMPI_Bcast () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#18 0x00002b9e51f623df in pmpi_bcast__ () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40

#0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1  0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at 
datastruct/pgtable.c:69
#2  ucs_pgtable_insert_page (region=0xc6919d0, order=4, address=47959585726464, 
pgtable=0xb341ab8) at datastruct/pgtable.c:299
#3  ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8, 
region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
#4  0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0, 
arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, 
rcache=0xb341a50) at sys/rcache.c:511
#5  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, 
prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, 
region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#6  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, 
address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at 
ib/base/ib_md.c:990
#7  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, 
reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, 
uct_flags=uct_flags@entry=96, 
    alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, 
alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, 
md_map_p=md_map_p@entry=0xbc409a8)
    at core/ucp_mm.c:100
#8  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, 
buffer=0x2b9e76102070, length=131072, datatype=128, 
state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, 
    req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, 
uct_flags@entry=0) at core/ucp_request.c:218
#9  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, 
req=0xbc40940) at 
/home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, 
proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 
<mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, 
    rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, 
req=<optimized out>) at tag/tag_send.c:78
#12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, 
datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 
<mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#17 0x00002b9e521dbb79 in PMPI_Bcast () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#18 0x00002b9e51f623df in pmpi_bcast__ () from 
/share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
#19 0x000000000040d442 in m_bcast_z_from (comm=..., vec=..., n=55826, inode=2) 
at mpi.F:1781
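
For reference, a session like the one that produced the three backtraces above could be scripted roughly as follows. This is only a sketch: RANK_PID is a placeholder for the PID of one MPI rank, and the script file name is made up for illustration.

```shell
# Write a gdb command file that breaks on the three allocators and
# prints a backtrace at each of the first three stops.
GDB_SCRIPT="${TMPDIR:-/tmp}/alloc_breaks.gdb"
cat > "$GDB_SCRIPT" <<'EOF'
break malloc
break realloc
break calloc
continue
backtrace
continue
backtrace
continue
backtrace
detach
quit
EOF

# RANK_PID is a placeholder for the PID of one MPI rank; unset by default.
if [ -n "${RANK_PID:-}" ]; then
    gdb -batch -p "$RANK_PID" -x "$GDB_SCRIPT"
else
    echo "Set RANK_PID to attach; commands written to $GDB_SCRIPT"
fi
```

Attaching to just one rank keeps the output manageable; the other ranks will simply wait in the collective while that rank is stopped.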


        Noam

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users