I would suggest posting the error in the UCX issue tracker - https://github.com/openucx/ucx/issues. It is a typical IB error complaining about an access to unregistered memory. It is usually caused by pointer corruption in OMPI/UCX or in the application code.
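For reference, here is a minimal sketch (hypothetical names, error handling omitted) of one way this completion status arises at the verbs level: a work request whose scatter/gather entry is not fully covered by the memory region its lkey refers to. If a registration is missed, or a pointer is corrupted so the buffer no longer matches the MR, the HCA completes the send with IBV_WC_LOC_PROT_ERR, which ibv_wc_status_str() renders as exactly the "local protection error" in the log above.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Illustrative only: post a send whose SGE runs past the end of
     * the registered region. ibv_post_send() itself succeeds; the
     * HCA flags the violation asynchronously, and ibv_poll_cq()
     * later returns a completion with
     * wc.status == IBV_WC_LOC_PROT_ERR. */
    static void post_bad_send(struct ibv_pd *pd, struct ibv_qp *qp)
    {
        size_t reg_len = 4096;
        char *buf = malloc(2 * reg_len);

        /* Only the first half of the buffer is registered. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, reg_len,
                                       IBV_ACCESS_LOCAL_WRITE);

        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = (uint32_t)(2 * reg_len), /* past the MR end */
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_SEND,
            .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad_wr;

        ibv_post_send(qp, &wr, &bad_wr);
    }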
Best,
Pasha

On Thu, Sep 20, 2018 at 11:22 PM Ben Menadue <ben.mena...@nci.org.au> wrote:
> Hi,
>
> A couple of our users have reported issues using UCX in OpenMPI 3.1.2.
> It’s failing with this message:
>
> [r1071:27563:0:27563] rc_verbs_iface.c:63 FATAL: send completion with
> error: local protection error
>
> The actual MPI calls provoking this are different between the two
> applications — one is an MPI_Bcast and the other is an MPI_Waitany — but in
> both cases it ends up in ompi_request_default_wait_all and then into the
> progress engines:
>
>  0 0x00000000000373dc ucs_log_dispatch()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
>  1 0x00000000000368ff uct_rc_verbs_iface_poll_tx()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
>  2 0x00000000000368ff uct_rc_verbs_iface_progress()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
>  3 0x00000000000179d2 ucs_callbackq_dispatch()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
>  4 0x0000000000018e0a uct_worker_progress()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
>  5 0x00000000000050a9 mca_pml_ucx_progress()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
>  6 0x000000000002b554 opal_progress()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
>  7 0x000000000004a7fa sync_wait_st()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../opal/threads/wait_sync.h:83
>  8 0x000000000004b073 ompi_request_default_wait_all()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:237
>  9 0x00000000000ce548 ompi_coll_base_bcast_intra_generic()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:98
> 10 0x00000000000ced08 ompi_coll_base_bcast_intra_pipeline()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_bcast.c:280
> 11 0x0000000000004f28 ompi_coll_tuned_bcast_intra_dec_fixed()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:303
> 12 0x0000000000067b60 PMPI_Bcast()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pbcast.c:111
>
> and
>
>  0 0x00000000000373dc ucs_log_dispatch()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/ucs/../../../src/ucs/debug/log.c:169
>  1 0x00000000000368ff uct_rc_verbs_iface_poll_tx()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:88
>  2 0x00000000000368ff uct_rc_verbs_iface_progress()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/src/uct/../../../src/uct/ib/rc/verbs/rc_verbs_iface.c:116
>  3 0x00000000000179d2 ucs_callbackq_dispatch()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/ucs/datastruct/callbackq.h:208
>  4 0x0000000000018e0a uct_worker_progress()
>    /short/z00/bjm900/build/ucx/ucx-1.3.1/build/../src/uct/api/uct.h:1631
>  5 0x0000000000005099 mca_pml_ucx_progress()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mca/pml/ucx/../../../../../../../ompi/mca/pml/ucx/pml_ucx.c:466
>  6 0x000000000002b554 opal_progress()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/runtime/opal_progress.c:228
>  7 0x00000000000331cc ompi_sync_wait_mt()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/opal/../../../../opal/threads/wait_sync.c:85
>  8 0x000000000004ad0b ompi_request_default_wait_any()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/../../../../ompi/request/req_wait.c:131
>  9 0x00000000000b91ab PMPI_Waitany()
>    /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-0/ompi/mpi/c/profile/pwaitany.c:83
>
> I’m not sure if it’s an issue with the ucx PML or with UCX itself, though.
> In both cases, disabling ucx and using yalla or ob1 works fine. Has anyone
> else seen this?
>
> Thanks,
> Ben
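For anyone who hits the same failure, the workaround Ben mentions (switching PMLs away from ucx) can be forced on the mpirun command line. A minimal sketch, assuming a stock Open MPI 3.x build in which the ob1 and yalla components are available; ./my_app stands in for the real binary:

    # Select a specific PML instead of letting Open MPI pick ucx:
    mpirun --mca pml ob1 ./my_app      # generic point-to-point PML
    mpirun --mca pml yalla ./my_app    # MXM-based PML, if built

    # Or exclude just the ucx component and keep the normal selection:
    mpirun --mca pml ^ucx ./my_app

The same setting can also be made through the environment (OMPI_MCA_pml=ob1) or an MCA parameter file.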
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users