Hello Dave,

There's an open issue about this:

https://github.com/open-mpi/ompi/issues/8252

However, I'm not observing failures with IMB RMA on an IB/aarch64 system with
UCX 1.9.0 and OMPI 4.0.x at 6ea9d98.
This cluster is running RHEL 7.6 and MLNX_OFED_LINUX-4.5-1.0.1.0.

Howard

On 12/7/20, 7:21 AM, "users on behalf of Dave Love via users" 
<users-boun...@lists.open-mpi.org on behalf of users@lists.open-mpi.org> wrote:

    After seeing several RMA failures, even with the change needed to get
    4.0.5 through IMB, I looked for simple tests.  So I built the mpich
    3.4b1 tests -- or the ones that would build; I haven't checked why some
    fail to build -- and ran the rma set.
    
    Three out of 180 passed.  Many (most?) aborted in UCX, as I saw with
    production code, with a backtrace like the one below; others at least
    reported an MPI error.  This was on two nodes of a ppc64le RHEL 7 IB
    system with Open MPI 4.0.5, UCX 1.9, and the MCA parameters from the
    UCX FAQ (though I got the same result without those parameters).  I
    haven't tried to reproduce it on x86_64, but it seems unlikely to be
    CPU-specific.
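    
    (For reference, the UCX FAQ parameters I mean are along the lines of
    --mca pml ucx --mca osc ucx on the mpirun command line, or the
    OMPI_MCA_pml=ucx / OMPI_MCA_osc=ucx environment equivalents; I'm quoting
    those from memory, so the exact set may differ.)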
    
    Is there anything we can do to run RMA without just moving to mpich?  Do
    releases actually get tested on run-of-the-mill IB+Lustre systems?
    
    + mpirun -n 2 winname
    [gpu005:50906:0:50906]  ucp_worker.c:183  Fatal: failed to set active message handler id 1: Invalid parameter
    ==== backtrace (tid:  50906) ====
     0 0x000000000005453c ucs_debug_print_backtrace()  .../src/ucs/debug/debug.c:656
     1 0x0000000000028218 ucp_worker_set_am_handlers()  .../src/ucp/core/ucp_worker.c:182
     2 0x0000000000029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:816
     3 0x0000000000029ae0 ucp_worker_iface_check_events()  .../src/ucp/core/ucp_worker.c:766
     4 0x0000000000029ae0 ucp_worker_iface_deactivate()  .../src/ucp/core/ucp_worker.c:819
     5 0x0000000000029ae0 ucp_worker_iface_unprogress_ep()  .../src/ucp/core/ucp_worker.c:841
     6 0x00000000000582a8 ucp_wireup_ep_t_cleanup()  .../src/ucp/wireup/wireup_ep.c:381
     7 0x0000000000068124 ucs_class_call_cleanup_chain()  .../src/ucs/type/class.c:56
     8 0x0000000000057420 ucp_wireup_ep_t_delete()  .../src/ucp/wireup/wireup_ep.c:28
     9 0x0000000000013de8 uct_ep_destroy()  .../src/uct/base/uct_iface.c:546
    10 0x00000000000252f4 ucp_proxy_ep_replace()  .../src/ucp/core/ucp_proxy_ep.c:236
    11 0x0000000000057b88 ucp_wireup_ep_progress()  .../src/ucp/wireup/wireup_ep.c:89
    12 0x0000000000049820 ucs_callbackq_slow_proxy()  .../src/ucs/datastruct/callbackq.c:400
    13 0x000000000002ca04 ucs_callbackq_dispatch()  .../src/ucs/datastruct/callbackq.h:211
    14 0x000000000002ca04 uct_worker_progress()  .../src/uct/api/uct.h:2346
    15 0x000000000002ca04 ucp_worker_progress()  .../src/ucp/core/ucp_worker.c:2040
    16 0x000000000000c144 progress_callback()  osc_ucx_component.c:0
    17 0x00000000000374ac opal_progress()  ???:0
    18 0x000000000006cc74 ompi_request_default_wait()  ???:0
    19 0x00000000000e6fcc ompi_coll_base_sendrecv_actual()  ???:0
    20 0x00000000000e5530 ompi_coll_base_allgather_intra_two_procs()  ???:0
    21 0x0000000000006c44 ompi_coll_tuned_allgather_intra_dec_fixed()  ???:0
    22 0x000000000000dc20 component_select()  osc_ucx_component.c:0
    23 0x0000000000115b90 ompi_osc_base_select()  ???:0
    24 0x0000000000075264 ompi_win_create()  ???:0
    25 0x00000000000cb4e8 PMPI_Win_create()  ???:0
    26 0x0000000010006ecc MTestGetWin()  .../mpich-3.4b1/test/mpi/util/mtest.c:1173
    27 0x0000000010002e40 main()  .../mpich-3.4b1/test/mpi/rma/winname.c:25
    28 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
    29 0x00000000000253f4 __libc_start_main()  ???:0
    
    followed by the abort backtrace
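    
    In case it helps to take the mpich harness out of the picture: the
    crash is inside the MPI_Win_create call that MTestGetWin makes, so
    something as small as the sketch below exercises the same code path.
    (This is my own rough equivalent of what winname.c does, not the actual
    test source.)
    
    #include <mpi.h>
    #include <stdio.h>
    
    int main(int argc, char **argv)
    {
        int rank, len;
        char buf[1024];
        char name[MPI_MAX_OBJECT_NAME];
        MPI_Win win;
    
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    
        /* The backtrace above dies in here, during one-sided component
         * selection (ompi_osc_base_select -> osc/ucx). */
        MPI_Win_create(buf, (MPI_Aint) sizeof(buf), 1, MPI_INFO_NULL,
                       MPI_COMM_WORLD, &win);
    
        MPI_Win_set_name(win, "winname-test");
        MPI_Win_get_name(win, name, &len);
        if (rank == 0)
            printf("window name: %s\n", name);
    
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }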
    
