Some extra information: it turns out that if we disable rc_verbs, it works on the machines that were failing.

Works:
-x UCX_TLS=sm,ud

Fails:
-x UCX_TLS=sm,rc_v

I also found that the machines that are not working have mixed IB card generations:
Mixed ConnectX-3 & 4: fails
All ConnectX-3: works

Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
Office: (734)936-1985 (not in use during Covid)
Cell: (989)277-6075

On Tue, Jan 12, 2021 at 10:24 PM Brock Palen <bro...@umich.edu> wrote:

> We have an odd behavior after an update. The most severe one is that if UCX is
> allowed to use IB, it fails for anything except small messages.
>
> Using PingPong from IMB:
> OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
>
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:130 mca_pml_ucx_open
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:130 mca_pml_ucx_open
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:194 mca_pml_ucx_init
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:194 mca_pml_ucx_init
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:247 created ucp context 0x1d00de0, worker 0x2ad833eba010
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:247 created ucp context 0xc1fe40, worker 0x2ada73f80010
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 0
> [1610507831.995613] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:286 connecting to proc. 1
> [1610507832.007926] [lh1112:9551 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 1
> [1610507832.008479] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
> [1610507832.045284] [lh1112:9551 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
> <snip>
>           256         1000         3.38        75.74
>           512         1000         3.41       150.05
>          1024         1000         3.79       270.43
> [lh1111:9155 :0:9155] rc_verbs_iface.c:65 send completion with error: remote invalid request error qpn 0x2aa wrid 0x28 vendor_err 0x8a
> ==== backtrace (tid: 9155) ====
> <snip>
>
> So you can see it starts up and runs until the message size goes over 1K,
> and then fails.
> Other nodes don't do this. Any idea what is causing this? The UCX logging
> isn't helping much.
>
> We have ucx 1.8.0 provided by MOFED.
>
> Thanks. Here is a working pair of nodes; all were rebuilt from scratch
> using automation.
>
> [brockp@lh0057 src]$ OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:130 mca_pml_ucx_open
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:130 mca_pml_ucx_open
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:194 mca_pml_ucx_init
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:194 mca_pml_ucx_init
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:247 created ucp context 0x18a1e00, worker 0x2b3ef3e92010
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:247 created ucp context 0x1606e30, worker 0x2b6703edd010
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 0
> [1610508044.471312] [lh0057:4804 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:286 connecting to proc. 1
> [1610508044.471757] [lh0058:5126 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 1
> [1610508044.472420] [lh0057:4804 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
> [1610508044.520315] [lh0058:5126 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
>
> <snip>
>
>       1048576           40        92.21     11371.43
>       2097152           20       179.31     11695.83
>       4194304           10       352.87     11886.16
>
> # All processes entering MPI_Finalize
>
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 0
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 0
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 1 disconnect requests
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 1
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 1
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 1 disconnect requests
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 0 disconnect requests
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 0 disconnect requests
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:253 mca_pml_ucx_cleanup
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:253 mca_pml_ucx_cleanup
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:178 mca_pml_ucx_close
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:178 mca_pml_ucx_close
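
For anyone comparing logs like the two runs quoted above, here is a minimal Python sketch that pulls the transport/device pair out of each `ep_cfg` line and reports whether the peers ended up on different lanes. The regular expression and the function names `network_lanes` and `peers_disagree` are illustrative assumptions, not part of UCX or Open MPI, and the sketch assumes each `ep_cfg` entry sits on one unwrapped line without the `> ` quote prefix.

```python
import re
from collections import defaultdict

# Matches UCX INFO lines of the form seen in the logs above, e.g.
#   [1610507832.008479] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
# The pattern is an assumption based on those lines only.
EP_CFG = re.compile(
    r"\[(?P<host>[^:\]\s]+):\d+\s*:\d+\]\s+\S+\s+UCX\s+INFO\s+"
    r"ep_cfg\[\d+\]:\s+tag\((?P<lanes>[^)]*)\)"
)


def network_lanes(log_text):
    """Map each short hostname to its sorted (transport, device) lanes.

    Lanes whose device is "memory" (self, knem) are skipped, since they
    are intra-node and say nothing about the IB hardware.
    """
    lanes = defaultdict(set)
    for match in EP_CFG.finditer(log_text):
        for lane in match.group("lanes").split():
            transport, _, device = lane.partition("/")
            if device != "memory":
                lanes[match.group("host")].add((transport, device))
    return {host: sorted(found) for host, found in lanes.items()}


def peers_disagree(lanes):
    """True when the hosts did not all select the same lanes."""
    return len({tuple(found) for found in lanes.values()}) > 1
```

Feeding it the failing run would give `rc_verbs/mlx4_0:1` on one host and `rc_verbs/mlx5_0:1` on the other, so `peers_disagree` returns `True`; the working run has `rc_mlx5/mlx5_0:1` on both hosts and returns `False`.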