I would suggest posting the error to the UCX issue tracker:
https://github.com/openucx/ucx/issues
It is a typical IB error complaining about access to unregistered memory.
It is usually caused by pointer corruption in OMPI/UCX or in the
application code.
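
For context, a minimal sketch of the mechanism at the verbs level (raw
libibverbs calls; the QP/CQ setup is omitted and the helper names are
illustrative, so treat this as a sketch rather than a complete program):
every send work request must reference memory covered by a live memory
region registered with ibv_reg_mr(), and a stale or corrupted
{addr, length, lkey} triple is exactly what surfaces as
IBV_WC_LOC_PROT_ERR, the "local protection error" in the log below.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Post a signaled send. The lkey must belong to a live MR that covers
     * [buf, buf + len); otherwise the HCA completes the work request with
     * a local protection error instead of IBV_WC_SUCCESS. */
    static void post_send(struct ibv_qp *qp, void *buf, uint32_t len,
                          uint32_t lkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = lkey,
        };
        struct ibv_send_wr wr = {0}, *bad_wr = NULL;

        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;

        if (ibv_post_send(qp, &wr, &bad_wr))
            perror("ibv_post_send");
    }

    /* Completion path, analogous to the check that fires in
     * uct_rc_verbs_iface_poll_tx() in the backtraces below. */
    static void poll_tx(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status != IBV_WC_SUCCESS)
            fprintf(stderr, "send completion with error: %s\n",
                    ibv_wc_status_str(wc.status)); /* IBV_WC_LOC_PROT_ERR */
    }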

Best,
Pasha



On Thu, Sep 20, 2018 at 11:22 PM Ben Menadue <ben.mena...@nci.org.au> wrote:

> Hi,
>
> A couple of our users have reported issues using UCX in Open MPI 3.1.2.
> It’s failing with this message:
>
> [r1071:27563:0:27563] rc_verbs_iface.c:63   FATAL: send completion with
> error: local protection error
>
> The actual MPI calls provoking this differ between the two applications
> (one is an MPI_Bcast, the other an MPI_Waitany), but in both cases
> execution ends up in the request-wait path (ompi_request_default_wait_all
> and ompi_request_default_wait_any, respectively) and from there in the
> progress engine:
>
>  0 0x00000000000373dc ucs_log_dispatch()  
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
>  1 0x00000000000368ff uct_rc_verbs_iface_poll_tx()  
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
>  2 0x00000000000368ff uct_rc_verbs_iface_progress()  
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
>  3 0x00000000000179d2 ucs_callbackq_dispatch()  
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
>  4 0x0000000000018e0a uct_worker_progress()  
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
>  5 0x00000000000050a9 mca_pml_ucx_progress()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
>  6 0x000000000002b554 opal_progress()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
>  7 0x000000000004a7fa sync_wait_st()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../opal/threads/wait_sync.h:83
>  8 0x000000000004b073 ompi_request_default_wait_all()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:237
>
>  9 0x00000000000ce548 ompi_coll_base_bcast_intra_generic()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:98
> 10 0x00000000000ced08 ompi_coll_base_bcast_intra_pipeline()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:280
> 11 0x0000000000004f28 ompi_coll_tuned_bcast_intra_dec_fixed()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:303
> 12 0x0000000000067b60 PMPI_Bcast()  
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pbcast.c:111
>
>
> and
>
>  0 0x00000000000373dc ucs_log_dispatch()
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
>  1 0x00000000000368ff uct_rc_verbs_iface_poll_tx()
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
>  2 0x00000000000368ff uct_rc_verbs_iface_progress()
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
>  3 0x00000000000179d2 ucs_callbackq_dispatch()
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
>  4 0x0000000000018e0a uct_worker_progress()
> /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
>  5 0x0000000000005099 mca_pml_ucx_progress()
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
>  6 0x000000000002b554 opal_progress()
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
>  7 0x00000000000331cc ompi_sync_wait_mt()
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
>  8 0x000000000004ad0b ompi_request_default_wait_any()
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:131
>  9 0x00000000000b91ab PMPI_Waitany()
> /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pwaitany.c:83
>
> I’m not sure if it’s an issue with the ucx PML or with UCX itself, though.
> In both cases, disabling ucx and using yalla or ob1 works fine. Has anyone
> else seen this?
>
> Thanks,
> Ben
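
For reference, the workaround Ben describes above (avoiding the ucx PML)
can be selected explicitly at launch time. A sketch, assuming a standard
Open MPI 3.1 build; ./app stands in for the user's binary:

    # Force the ob1 PML instead of ucx:
    mpirun --mca pml ob1 ./app

    # Or exclude ucx and let Open MPI pick among the remaining PMLs:
    mpirun --mca pml ^ucx ./app
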
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
