[OMPI users] MPI_Init sometimes fails when using UCX/GPUDirect RDMA

2020-07-21 Thread Oskar Lappi via users
Hi again, and thanks to Florent for answering my questions last time; the answers were very helpful. We are seeing strange errors that occur randomly when running MPI jobs. We are using Open MPI 4.0.3 with UCX and GPUDirect RDMA, and we run multi-node applications under SLURM on a cluster.
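Not part of the original post, but a common first step when UCX-backed jobs fail intermittently at startup is to raise UCX's log level and pin the UCX PML so the failure surfaces in UCX's own diagnostics. The environment variables and MCA parameters below are standard UCX/Open MPI knobs; the transport list and application name are placeholders chosen for illustration:

```shell
# Illustrative sketch: surface UCX diagnostics during MPI_Init.
# UCX_LOG_LEVEL and UCX_TLS are standard UCX environment variables;
# "./my_app" is a placeholder for the actual application.
export UCX_LOG_LEVEL=debug              # verbose UCX logging
export UCX_TLS=rc,cuda_copy,gdr_copy    # restrict transports to exercise the GPUDirect path

# Force the UCX PML and disable the openib BTL so the two layers don't compete
mpirun --mca pml ucx --mca btl ^openib ./my_app
```

If the failure only reproduces with `gdr_copy` in `UCX_TLS`, that points at the GPUDirect RDMA path specifically rather than UCX in general.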

Re: [OMPI users] choosing network: infiniband vs. ethernet

2020-07-21 Thread Lana Deere via users
I'm using the InfiniBand drivers in the CentOS 7 distribution, not the Mellanox drivers. The version of Lustre we're using is built against the distro drivers and breaks if the Mellanox drivers are installed. Is there a particular version of UCX that should be used with Open MPI 4.0.4? I download
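Not from the original thread, but a way to answer the version-pairing question on one's own cluster is to check which UCX the Open MPI build sees and, if needed, rebuild Open MPI against a specific UCX install. The commands and configure flags below are standard `ompi_info`/`ucx_info`/Open MPI options; the install path is a placeholder:

```shell
# Check the installed UCX version and whether Open MPI was built with UCX support
ucx_info -v                        # prints the installed UCX version
ompi_info --parsable | grep -i ucx # shows UCX-related components in the build

# Illustrative build of Open MPI 4.0.4 against a specific UCX tree
# ("/opt/ucx" is a placeholder path); --without-verbs avoids the
# deprecated openib path in favor of UCX.
./configure --with-ucx=/opt/ucx --without-verbs
make -j && make install
```

Building against the same UCX that is installed on the compute nodes avoids ABI mismatches between what `configure` found and what the job loads at run time.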