We are seeing odd behavior after an update. The most severe problem is that if UCX is allowed to use IB, it fails for anything except small messages.
Using PingPong from IMB:

OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:130 mca_pml_ucx_open
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:130 mca_pml_ucx_open
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:194 mca_pml_ucx_init
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:194 mca_pml_ucx_init
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:247 created ucp context 0x1d00de0, worker 0x2ad833eba010
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:247 created ucp context 0xc1fe40, worker 0x2ada73f80010
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 0
[1610507831.995613] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:286 connecting to proc. 1
[1610507832.007926] [lh1112:9551 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 1
[1610507832.008479] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
[1610507832.045284] [lh1112:9551 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
<snip>
256 1000 3.38 75.74
512 1000 3.41 150.05
1024 1000 3.79 270.43
[lh1111:9155 :0:9155] rc_verbs_iface.c:65 send completion with error: remote invalid request error qpn 0x2aa wrid 0x28 vendor_err 0x8a
==== backtrace (tid: 9155) ====
<snip>

As you can see, it starts up and runs until the message size goes over 1K, then fails. Other nodes don't do this. Any idea what is causing it? The UCX logging isn't helping much.
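One detail that is easy to miss in the output above: for ep_cfg[2] the two ranks report different devices (mlx4_0:1 on lh1111, mlx5_0:1 on lh1112). Whether that is related to the failure is not established here. As a minimal sketch for pulling the selected transports out of a log like this and flagging such differences, something along these lines could be used. The function names and the regular expression are hypothetical helpers written for this log format, not part of UCX or Open MPI, and the sketch assumes that the same ep_cfg index on both ranks refers to the corresponding connection, as it appears to in this run.

```python
import re
from collections import defaultdict

# Hypothetical pattern for UCX INFO lines of the form seen above, e.g.
# [1610507832.008479] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
EP_CFG_RE = re.compile(
    r"\[(?P<host>[^:\]\s]+):\d+\s*:\d+\]\s+\S+\s+UCX\s+INFO\s+"
    r"ep_cfg\[(?P<idx>\d+)\]:\s+tag\((?P<lanes>[^)]*)\)"
)


def selected_transports(log_text):
    """Return {ep_cfg index: {host: [lane, ...]}} for every ep_cfg line found."""
    result = defaultdict(dict)
    for m in EP_CFG_RE.finditer(log_text):
        result[int(m.group("idx"))][m.group("host")] = m.group("lanes").split()
    return dict(result)


def mismatched(configs):
    """Return the ep_cfg indexes where the hosts do not all report the same lanes.

    Assumption: an ep_cfg index means the same connection on every host.
    """
    return sorted(
        idx
        for idx, hosts in configs.items()
        if len({tuple(lanes) for lanes in hosts.values()}) > 1
    )


# The four ep_cfg lines from the failing run, copied from the log above.
FAILING = """\
[1610507831.995613] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[1610507832.007926] [lh1112:9551 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[1610507832.008479] [lh1111:9155 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
[1610507832.045284] [lh1112:9551 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
"""

if __name__ == "__main__":
    configs = selected_transports(FAILING)
    for idx in mismatched(configs):
        print(f"ep_cfg[{idx}] differs between hosts: {configs[idx]}")
```

Running the same helper over the mpirun output from a known-good pair of nodes gives a quick side-by-side of which transport and device each rank ended up with.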
We have ucx 1.8.0 provided by MOFED.

Thanks, here is a working pair of nodes. All were rebuilt from scratch using automation.

[brockp@lh0057 src]$ OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:130 mca_pml_ucx_open
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:130 mca_pml_ucx_open
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:194 mca_pml_ucx_init
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:194 mca_pml_ucx_init
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:247 created ucp context 0x18a1e00, worker 0x2b3ef3e92010
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:247 created ucp context 0x1606e30, worker 0x2b6703edd010
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 0
[1610508044.471312] [lh0057:4804 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:286 connecting to proc. 1
[1610508044.471757] [lh0058:5126 :0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory knem/memory);
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 1
[1610508044.472420] [lh0057:4804 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
[1610508044.520315] [lh0058:5126 :0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
<snip>
1048576 40 92.21 11371.43
2097152 20 179.31 11695.83
4194304 10 352.87 11886.16
# All processes entering MPI_Finalize
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 0
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 0
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 1 disconnect requests
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 1
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 1
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 1 disconnect requests
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 0 disconnect requests
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 0 disconnect requests
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:253 mca_pml_ucx_cleanup
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:253 mca_pml_ucx_cleanup
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:178 mca_pml_ucx_close
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:178 mca_pml_ucx_close

Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
Office: (734)936-1985 (not in use during Covid)
Cell: (989)277-6075