Hello Dave,

There's an issue open about this:
https://github.com/open-mpi/ompi/issues/8252

However, I'm not observing failures with IMB RMA on an IB/aarch64 system with UCX 1.9.0, using OMPI 4.0.x at 6ea9d98. This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0.

Howard

On 12/7/20, 7:21 AM, "users on behalf of Dave Love via users" <users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

After seeing several failures with RMA with the change needed to get 4.0.5 through IMB, I looked for simple tests. So I built the mpich 3.4b1 tests -- or the ones that would build; I haven't checked why some fail to compile -- and ran the rma set. Three out of 180 passed. Many (most?) aborted in ucx, as I saw with production code, with a backtrace like the one below; the others at least reported an MPI error. This was on two nodes of a ppc64le RHEL7 IB system with 4.0.5, UCX 1.9, and the MCA parameters from the UCX FAQ (though I got the same result without those parameters). I haven't tried to reproduce this on x86_64, but it seems unlikely to be CPU-specific.

Is there anything we can do to run RMA without just moving to mpich? Do releases actually get tested on run-of-the-mill IB+Lustre systems?

+ mpirun -n 2 winname
[gpu005:50906:0:50906] ucp_worker.c:183  Fatal: failed to set active message handler id 1: Invalid parameter
==== backtrace (tid:  50906) ====
 0 0x000000000005453c ucs_debug_print_backtrace()  .../src/ucs/debug/debug.c:656
 1 0x0000000000028218 ucp_worker_set_am_handlers()  .../src/ucp/core/ucp_worker.c:182
 2 0x0000000000029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:816
 3 0x0000000000029ae0 ucp_worker_iface_check_events()  .../src/ucp/core/ucp_worker.c:766
 4 0x0000000000029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:819
 5 0x0000000000029ae0 ucp_worker_iface_unprogress_ep()  .../src/ucp/core/ucp_worker.c:841
 6 0x00000000000582a8 ucp_wireup_ep_t_cleanup()  .../src/ucp/wireup/wireup_ep.c:381
 7 0x0000000000068124 ucs_class_call_cleanup_chain()  .../src/ucs/type/class.c:56
 8 0x0000000000057420 ucp_wireup_ep_t_delete()  .../src/ucp/wireup/wireup_ep.c:28
 9 0x0000000000013de8 uct_ep_destroy()  .../src/uct/base/uct_iface.c:546
10 0x00000000000252f4 ucp_proxy_ep_replace()  .../src/ucp/core/ucp_proxy_ep.c:236
11 0x0000000000057b88 ucp_wireup_ep_progress()  .../src/ucp/wireup/wireup_ep.c:89
12 0x0000000000049820 ucs_callbackq_slow_proxy()  .../src/ucs/datastruct/callbackq.c:400
13 0x000000000002ca04 ucs_callbackq_dispatch()  .../src/ucs/datastruct/callbackq.h:211
14 0x000000000002ca04 uct_worker_progress()  .../src/uct/api/uct.h:2346
15 0x000000000002ca04 ucp_worker_progress()  .../src/ucp/core/ucp_worker.c:2040
16 0x000000000000c144 progress_callback()  osc_ucx_component.c:0
17 0x00000000000374ac opal_progress()  ???:0
18 0x000000000006cc74 ompi_request_default_wait()  ???:0
19 0x00000000000e6fcc ompi_coll_base_sendrecv_actual()  ???:0
20 0x00000000000e5530 ompi_coll_base_allgather_intra_two_procs()  ???:0
21 0x0000000000006c44 ompi_coll_tuned_allgather_intra_dec_fixed()  ???:0
22 0x000000000000dc20 component_select()  osc_ucx_component.c:0
23 0x0000000000115b90 ompi_osc_base_select()  ???:0
24 0x0000000000075264 ompi_win_create()  ???:0
25 0x00000000000cb4e8 PMPI_Win_create()  ???:0
26 0x0000000010006ecc MTestGetWin()  .../mpich-3.4b1/test/mpi/util/mtest.c:1173
27 0x0000000010002e40 main()  .../mpich-3.4b1/test/mpi/rma/winname.c:25
28 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
29 0x00000000000253f4 __libc_start_main()  ???:0

followed by the abort backtrace.
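For what it's worth, the mpich harness shouldn't be needed to trigger this: the backtrace dies inside PMPI_Win_create, before the test body ever runs, so a program that only creates and frees a window ought to hit the same path. A minimal sketch along those lines (untested; the file name and comments are mine, not from the mpich suite):

    /* win_create_min.c: create and free an MPI window, nothing else. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int buf = 0;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        /* osc component selection happens inside this call (see frames
         * 22-25 above), so the abort should fire here, before any RMA
         * operation is even attempted. */
        MPI_Win_create(&buf, sizeof buf, sizeof buf,
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);
        MPI_Win_free(&win);
        MPI_Finalize();
        printf("window created and freed OK\n");
        return 0;
    }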
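I'd run it the same way as the tests, i.e. with the settings the UCX FAQ suggests (--mca pml ucx --mca osc ucx, if I'm reading it right), and then again with the defaults for comparison:

    mpicc win_create_min.c -o win_create_min
    mpirun -n 2 --mca pml ucx --mca osc ucx ./win_create_min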