We are seeing odd behavior after an update. The most severe problem is that
if UCX is allowed to use IB, it fails for anything except small messages.

This is PingPong from IMB on a failing pair of nodes:
OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong

[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:130 mca_pml_ucx_open
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:130 mca_pml_ucx_open
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:194 mca_pml_ucx_init
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:194 mca_pml_ucx_init
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:247 created ucp context
0x1d00de0, worker 0x2ad833eba010
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:247 created ucp context 0xc1fe40,
worker 0x2ada73f80010
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 0
[1610507831.995613] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[1]: tag(self/memory knem/memory);
[lh1112.arc-ts.umich.edu:09551] pml_ucx.c:286 connecting to proc. 1
[1610507832.007926] [lh1112:9551 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[1]: tag(self/memory knem/memory);
[lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 1
[1610507832.008479] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
[1610507832.045284] [lh1112:9551 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
<snip>
          256         1000         3.38        75.74
          512         1000         3.41       150.05
         1024         1000         3.79       270.43
[lh1111:9155 :0:9155] rc_verbs_iface.c:65   send completion with error:
remote invalid request error qpn 0x2aa wrid 0x28 vendor_err 0x8a
==== backtrace (tid:   9155) ====
<snip>

As you can see, it starts up and runs until the message size goes over 1K,
and then it fails.
Other nodes don't do this. Any idea what is causing it? The UCX logging
isn't helping much.
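
One thing the logs do show is the transport each endpoint selected. On the
failing pair, lh1111 reports rc_verbs/mlx4_0:1 while lh1112 reports
rc_verbs/mlx5_0:1; the logs alone do not establish whether that difference
matters. Below is a minimal Python sketch for pulling the ep_cfg lines out of
a saved log so node pairs can be compared side by side. The function name
endpoint_configs and both regexes are my own illustration (not part of UCX or
Open MPI) and assume the log layout shown above, including the mail-wrapped
lines.

```python
import re

# Assumed layout of a UCX log prefix: "[<timestamp>] [<host>:<pid> :<n>]"
HOST_RE = re.compile(r"\[\d+\.\d+\]\s+\[([^:\]]+):\d+\s*:\d+\]")
# Assumed layout of an endpoint summary: "ep_cfg[<n>]: tag(<tl/dev> ...);"
CFG_RE = re.compile(r"ep_cfg\[(\d+)\]:\s*tag\(([^)]*)\)")


def endpoint_configs(log_text):
    """Map host -> {ep_cfg index -> list of 'transport/device' strings}."""
    configs = {}
    host = None
    for line in log_text.splitlines():
        # The ep_cfg text may sit on the line after its prefix (mail wrap),
        # so remember the most recent host seen in a UCX prefix.
        m = HOST_RE.search(line)
        if m:
            host = m.group(1)
        m = CFG_RE.search(line)
        if m and host is not None:
            configs.setdefault(host, {})[int(m.group(1))] = m.group(2).split()
    return configs


# Lines copied from the failing pair's log above.
SAMPLE = """\
[1610507831.995613] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[1]: tag(self/memory knem/memory);
[1610507832.008479] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
[1610507832.045284] [lh1112:9551 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
"""

if __name__ == "__main__":
    for node, cfgs in sorted(endpoint_configs(SAMPLE).items()):
        for idx, transports in sorted(cfgs.items()):
            print(node, idx, " ".join(transports))
```

Running the same function over the log from a working pair gives a direct
comparison of which transport and device each side chose.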

We have UCX 1.8.0, as provided by MOFED.

Thanks. Here is a working pair of nodes; all nodes were rebuilt from scratch
using automation.


[brockp@lh0057 src]$ OMPI_MCA_pml_ucx_verbose=100 mpirun -x
UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:130 mca_pml_ucx_open
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:130 mca_pml_ucx_open
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:194 mca_pml_ucx_init
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:194 mca_pml_ucx_init
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:247 created ucp context
0x18a1e00, worker 0x2b3ef3e92010
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:247 created ucp context
0x1606e30, worker 0x2b6703edd010
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 0
[1610508044.471312] [lh0057:4804 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[1]: tag(self/memory knem/memory);
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:286 connecting to proc. 1
[1610508044.471757] [lh0058:5126 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[1]: tag(self/memory knem/memory);
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 1
[1610508044.472420] [lh0057:4804 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
[1610508044.520315] [lh0058:5126 :0]     ucp_worker.c:1543 UCX  INFO
 ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);

<snip>

      1048576           40        92.21     11371.43
      2097152           20       179.31     11695.83
      4194304           10       352.87     11886.16


# All processes entering MPI_Finalize

[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 0
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 0
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 1 disconnect
requests
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 1
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 1
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 1 disconnect
requests
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 0 disconnect
requests
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 0 disconnect
requests
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:253 mca_pml_ucx_cleanup
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:253 mca_pml_ucx_cleanup
[lh0058.arc-ts.umich.edu:05126] pml_ucx.c:178 mca_pml_ucx_close
[lh0057.arc-ts.umich.edu:04804] pml_ucx.c:178 mca_pml_ucx_close

Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
Office: (734)936-1985   (not in use during Covid)
Cell:  (989)277-6075
