> On Jun 19, 2019, at 2:44 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> To completely disable UCX you need to disable the UCX MTL and not only the
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".

Thanks for the pointer. Disabling ucx this way _does_ seem to fix the memory issue. That's a very helpful workaround, if nothing else. A sketch of the launch line is just below.
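Something along these lines (the executable name and rank count here are placeholders, not my actual job script):

    mpirun --mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1 \
           -np 16 ./my_mpi_app

As I understand it, this pins the ob1 PML (so the ucx PML is never selected) and lets the openib BTL keep using the IB ports.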
Using ucx 1.5.1 downloaded from the ucx web site at runtime (just by inserting it into LD_LIBRARY_PATH, without recompiling openmpi) doesn't seem to fix the problem. What I did is essentially the snippet below.
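(The install prefix here is a placeholder for wherever I unpacked and built the 1.5.1 tarball; the launch line itself is unchanged, with the ucx pml still enabled.)

    export LD_LIBRARY_PATH=/path/to/ucx-1.5.1/lib:$LD_LIBRARY_PATH
    mpirun ... ./my_mpi_app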
> As you have a gdb session on the processes you can try to break on some of
> the memory allocations function (malloc, realloc, calloc).

Good idea. I set breakpoints on all 3 of those, then did "c" 3 times; the commands are sketched below, and the three backtraces follow. Does this mean anything to anyone? I'm investigating the upstream calls (not included below) that generate these calls to mpi_bcast, but given that it works on other types of nodes, I doubt those are problematic.
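For completeness, the gdb side was roughly this, in the already-attached session:

    (gdb) break malloc
    (gdb) break realloc
    (gdb) break calloc
    (gdb) continue

with "continue" repeated after each stop, and a plain "bt" at each of the three stops producing the traces below.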
#0 0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1 0x00002b9e651f358a in ucs_rcache_create_region (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:500
#2 ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#3 0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at ib/base/ib_md.c:990
#4 0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, uct_flags=uct_flags@entry=96, alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8) at core/ucp_mm.c:100
#5 0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128, state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, uct_flags@entry=0) at core/ucp_request.c:218
#6 0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, req=0xbc40940) at /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#7 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#8 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, req=<optimized out>) at tag/tag_send.c:78
#9 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#10 0x00002b9e64465fa6 in mca_pml_ucx_isend () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#11 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#12 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#13 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#14 0x00002b9e521dbb79 in PMPI_Bcast () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e51f623df in pmpi_bcast__ () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40

#0 0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1 0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at datastruct/pgtable.c:69
#2 ucs_pgtable_insert_page (region=0xc6919d0, order=12, address=47959585718272, pgtable=0xb341ab8) at datastruct/pgtable.c:299
#3 ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8, region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
#4 0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:511
#5 ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#6 0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at ib/base/ib_md.c:990
#7 0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, uct_flags=uct_flags@entry=96, alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8) at core/ucp_mm.c:100
#8 0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128, state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, uct_flags@entry=0) at core/ucp_request.c:218
#9 0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, req=0xbc40940) at /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, req=<optimized out>) at tag/tag_send.c:78
#12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#17 0x00002b9e521dbb79 in PMPI_Bcast () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#18 0x00002b9e51f623df in pmpi_bcast__ () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40

#0 0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
#1 0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at datastruct/pgtable.c:69
#2 ucs_pgtable_insert_page (region=0xc6919d0, order=4, address=47959585726464, pgtable=0xb341ab8) at datastruct/pgtable.c:299
#3 ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8, region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
#4 0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:511
#5 ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
#6 0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at ib/base/ib_md.c:990
#7 0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, uct_flags=uct_flags@entry=96, alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8) at core/ucp_mm.c:100
#8 0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128, state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, uct_flags@entry=0) at core/ucp_request.c:218
#9 0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, req=0xbc40940) at /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
#10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
#11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, req=<optimized out>) at tag/tag_send.c:78
#12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
#13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
#14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
#17 0x00002b9e521dbb79 in PMPI_Bcast () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
#18 0x00002b9e51f623df in pmpi_bcast__ () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
#19 0x000000000040d442 in m_bcast_z_from (comm=..., vec=..., n=55826, inode=2) at mpi.F:1781

Noam

____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628
F +1 202 404 7546
https://www.nrl.navy.mil
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users