Re: [OMPI users] Some nodes have ucx over IB failures

2021-01-12 Thread Brock Palen via users
Some extra information: it turns out that if we disable rc_verbs, it works on the machines that otherwise fail.

Works: -x UCX_TLS=sm,ud
Fails: -x UCX_TLS=sm,rc_v

We also found that the machines that are not working have IB cards from mixed eras:

Mixed ConnectX-3 & 4: fails
All ConnectX-3: works

Brock Palen
IG: brockp
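
The two transport selections reported above can be compared side by side with a small dry-run sketch. It assumes the same IMB PingPong benchmark used in the original post; the helper name `ucx_tls_cmd` is hypothetical, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the mpirun command line for a given
# UCX_TLS transport list instead of executing it.
ucx_tls_cmd() {
    printf 'mpirun -x UCX_TLS=%s IMB-MPI1 PingPong\n' "$1"
}

# Transport lists taken from the message above.
ucx_tls_cmd sm,ud     # reported to work
ucx_tls_cmd sm,rc_v   # reported to fail on the affected nodes
```

Running each printed command on one affected node and one healthy node should show whether the failure follows the rc_verbs transport.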

[OMPI users] Some nodes have ucx over IB failures

2021-01-12 Thread Brock Palen via users
We are seeing odd behavior after an update. The most severe problem is that if UCX is allowed to use IB, it fails for anything except small messages. Using PingPong from IMB:

OMPI_MCA_pml_ucx_verbose=100 mpirun -x UCX_LOG_LEVEL=DEBUG -x UCX_MODULE_LOG_LEVEL=DEBUG IMB-MPI1 PingPong

[lh.arc-ts.umich.edu:

Re: [OMPI users] Timeout in MPI_Bcast/MPI_Barrier?

2021-01-12 Thread Daniel Torres via users
Hi George and Gilles. Thanks a lot for taking the time to test the code I sent. Since Gilles mentioned that all the tests he ran worked perfectly, I decided to install a completely fresh *OMPI 4.1.0* and test again. Happily, the OOM killer is no longer killing any processes, and all my experiments worked perfectly.