On Jun 19, 2020, at 6:59 PM, Thomas M. Payerle via users 
<users@lists.open-mpi.org> wrote:
> 
> We are upgrading a cluster from RHEL6 to RHEL8, and have migrated some nodes 
> to a new partition and reimaged with RHEL8.  I am having some issues getting 
> openmpi to work with infiniband on the nodes upgraded to RHEL8.

FWIW, Mellanox recommends using UCX these days (i.e., the UCX PML, not the 
openib BTL).  If you're changing your underlying OS stack, it might be worth 
upgrading to a) Open MPI v4.0.x, and b) the UCX PML.
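
For example, something like this (just a sketch; the UCX install prefix is a 
placeholder for wherever UCX lives on your system):

bash> ./configure --with-ucx=/path/to/ucx ...
bash> make -j 8 install
bash> ompi_info | grep -i ucx    # check that the UCX components show up
bash> mpirun -H localhost --mca pml ucx -n 1 ./hello-world-mpi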

> For testing purposes, I am trying to run a simple MPI "hello world" code on 
> the local RHEL8 host (I am also having issues across multiple nodes, but am 
> trying to simplify).
> 
> If I run with the BTL set to vader,self or tcp,self on the command line, the 
> MPI code runs as expected.  If I set it to openib,self (or leave it unset), 
> the job just hangs indefinitely, e.g. 
> bash> mpirun -H localhost -v --mca mpi_cuda_support 0 --mca 
> btl_openib_verbose 1 --mca btl openib,self -n 1 --show-progress -d 
> --debug-daemons ./hello-world-mpi
[snip]

> At this point the code just hangs indefinitely.  I see a PID 30387 named 
> hello-world-mpi with 3 threads, which is consuming ~100% of a CPU core, but 
> strace just shows it making epoll_wait calls.

If upgrading to the UCX PML is not a possibility for some reason, can you send 
the stack trace to see exactly where it is hung?  E.g., if this is just hello 
world, is it stuck in MPI_INIT or MPI_FINALIZE?
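
Something like the following (assuming gdb is installed; 30387 is the PID from 
your note) will dump a backtrace from every thread of the hung process:

bash> gdb -p 30387 -batch -ex "thread apply all bt"

or, equivalently on RHEL:

bash> pstack 30387

That should show whether the main thread is spinning in opal_progress() under 
MPI_Init, or stuck somewhere else entirely.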

-- 
Jeff Squyres
jsquy...@cisco.com
