Some extra information: it turns out that if we disable rc_verbs, it works
on the machines that otherwise fail.

Works
-x UCX_TLS=sm,ud
-x UCX_TLS=sm,rc_v
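
For reference, a rough sketch of how the two settings above slot into the same IMB PingPong command line used in the original report quoted below. The `run_pingpong` wrapper and the `DRY_RUN` variable are made up for illustration; only `mpirun -x UCX_TLS=...` and `IMB-MPI1 PingPong` come from this thread. By default it only prints the command:

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or UCX): print, or run, the
# PingPong test with a restricted UCX transport list.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually launch mpirun.
run_pingpong() {
    tls="$1"
    # Rebuild the argument list as the full mpirun command line.
    set -- mpirun -x "UCX_TLS=$tls" IMB-MPI1 PingPong
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_pingpong sm,ud
run_pingpong sm,rc_v
```

Running it with `DRY_RUN=0` on a failing pair and on a working pair would give a side-by-side comparison of each transport list.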

Fails
Also found that the machines that are not working have mixed IB card
eras:
Mixed ConnectX-3 & 4 fail
All ConnectX-3 works
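
The mixed-card pairing is also visible in the UCX INFO lines of the quoted log: the failing pair reports mlx4_0 on one host and mlx5_0 on the other (mlx4 is the driver for ConnectX-3, mlx5 for ConnectX-4 and later). A small sketch for pulling host, transport and device out of such a log; the `list_transports` name is made up, and it assumes each log record has been joined back onto one line (the email wrapping undone):

```shell
#!/bin/sh
# Hypothetical helper: read UCX log text on stdin and print
# "<host> <transport/device>" for every ep_cfg line.
# Assumes the line layout seen in this thread:
#   [timestamp] [host:pid :tid] file.c:line UCX  INFO  ep_cfg[N]: tag(...);
list_transports() {
    sed -n 's/.*\[\([^]:]*\):[0-9]* *:[0-9]*\].*ep_cfg\[[0-9]*\]: tag(\([^)]*\)).*/\1 \2/p'
}

# Two records from the failing pair, unwrapped onto single lines.
list_transports <<'EOF'
[1610507832.008479] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
[1610507832.045284] [lh1112:9551 :0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
EOF
```

For the two sample records this prints `lh1111 rc_verbs/mlx4_0:1` and `lh1112 rc_verbs/mlx5_0:1`, which makes a device mismatch between the two ends easy to spot.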


Brock Palen
IG: brockpalen1984
www.umich.edu/~brockp
Director Advanced Research Computing - TS
bro...@umich.edu
Office: (734)936-1985   (not in use during Covid)
Cell:  (989)277-6075


On Tue, Jan 12, 2021 at 10:24 PM Brock Palen <bro...@umich.edu> wrote:

> We have an odd behavior after an update. The most severe problem is that
> if UCX is allowed to use IB, it fails for anything except small messages.
>
> Using PingPong from IMB
> OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x
> UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
>
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:130 mca_pml_ucx_open
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:130 mca_pml_ucx_open
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:194 mca_pml_ucx_init
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:194 mca_pml_ucx_init
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:247 created ucp context
> 0x1d00de0, worker 0x2ad833eba010
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:247 created ucp context
> 0xc1fe40, worker 0x2ada73f80010
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 0
> [1610507831.995613] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[1]: tag(self/memory knem/memory);
> [lh1112.arc-ts.umich.edu:09551] pml_ucx.c:286 connecting to proc. 1
> [1610507832.007926] [lh1112:9551 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[1]: tag(self/memory knem/memory);
> [lh1111.arc-ts.umich.edu:09155] pml_ucx.c:286 connecting to proc. 1
> [1610507832.008479] [lh1111:9155 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[2]: tag(rc_verbs/mlx4_0:1);
> [1610507832.045284] [lh1112:9551 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[2]: tag(rc_verbs/mlx5_0:1);
> <snip>
>           256         1000         3.38        75.74
>           512         1000         3.41       150.05
>          1024         1000         3.79       270.43
> [lh1111:9155 :0:9155] rc_verbs_iface.c:65   send completion with error:
> remote invalid request error qpn 0x2aa wrid 0x28 vendor_err 0x8a
> ==== backtrace (tid:   9155) ====
> <snip>
>
> So you can see it starts up and runs until the message size goes over 1K,
> and then fails.
> Other nodes don't do this. Any idea what is causing it? The UCX logging
> isn't helping much.
>
> We have UCX 1.8.0, provided by MOFED.
>
> Thanks. Here is a working pair of nodes; all were rebuilt from scratch
> using automation.
>
>
> [brockp@lh0057 src]$ OMPI_MCA_pml_ucx_verbose=100 mpirun -x
> UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:130 mca_pml_ucx_open
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:130 mca_pml_ucx_open
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:194 mca_pml_ucx_init
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:194 mca_pml_ucx_init
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:247 created ucp context
> 0x18a1e00, worker 0x2b3ef3e92010
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:247 created ucp context
> 0x1606e30, worker 0x2b6703edd010
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 0
> [1610508044.471312] [lh0057:4804 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[1]: tag(self/memory knem/memory);
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:286 connecting to proc. 1
> [1610508044.471757] [lh0058:5126 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[1]: tag(self/memory knem/memory);
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:286 connecting to proc. 1
> [1610508044.472420] [lh0057:4804 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
> [1610508044.520315] [lh0058:5126 :0]     ucp_worker.c:1543 UCX  INFO
>  ep_cfg[2]: tag(rc_mlx5/mlx5_0:1);
>
> <snip>
>
>       1048576           40        92.21     11371.43
>       2097152           20       179.31     11695.83
>       4194304           10       352.87     11886.16
>
>
> # All processes entering MPI_Finalize
>
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 0
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 0
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 1 disconnect
> requests
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:423 disconnecting from rank 1
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:423 disconnecting from rank 1
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 1 disconnect
> requests
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:373 waiting for 0 disconnect
> requests
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:373 waiting for 0 disconnect
> requests
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:253 mca_pml_ucx_cleanup
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:253 mca_pml_ucx_cleanup
> [lh0058.arc-ts.umich.edu:05126] pml_ucx.c:178 mca_pml_ucx_close
> [lh0057.arc-ts.umich.edu:04804] pml_ucx.c:178 mca_pml_ucx_close
>
