Hi all - I'm trying to get openmpi with ucx working on a new Rocky Linux 8 + OpenHPC machine. I'm used to running with

    mpirun --mca pml ucx --mca osc ucx --mca btl ^vader,tcp,openib --bind-to core --map-by core --rank-by core

However, now it complains that it can't start the pml, with the message

--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host: tin2
  Framework: pml
--------------------------------------------------------------------------

I thought maybe there were infiniband issues ("ucx_info -d" shows no active IB interface), so I removed the "--mca btl" option, but I still get the following error

--------------------------------------------------------------------------
WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

  Local host: tin2
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No components were able to be opened in the pml framework.

This typically means that either no components of this type were
installed, or none of the installed components can be loaded.
Sometimes this means that shared libraries required by these
components are unable to be found/loaded.

  Host: tin2
  Framework: pml
--------------------------------------------------------------------------
[tin2:924804] PML ucx cannot be selected

I would have expected it to work over some sort of shared memory, since I'm just running on a single node. The ucx library is in LD_LIBRARY_PATH. However, I did notice that "ompi_info --all" does not show the "uct" btl, which does show up on an older machine where this works.

Is there any way to figure out where the initialization process is failing?

thanks,
Noam
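
P.S. For reference, here are the two invocations described above written out as a small shell sketch. It is a dry run that only prints the commands and does not execute anything. The program name (`./my_mpi_app`) and the helper function names are hypothetical placeholders, since the actual executable is not given above.

```shell
#!/bin/sh
# Dry run: prints the command lines from the post; nothing is launched.
# APP is a hypothetical placeholder for the real MPI program.
APP="${APP:-./my_mpi_app}"

# Options shared by both attempts (copied from the post).
COMMON="--mca pml ucx --mca osc ucx"
BIND="--bind-to core --map-by core --rank-by core"

# Attempt 1: the usual command, with the BTL exclusion list.
cmd_with_btl() {
    echo "mpirun $COMMON --mca btl ^vader,tcp,openib $BIND $1"
}

# Attempt 2: the same command with "--mca btl" removed.
cmd_without_btl() {
    echo "mpirun $COMMON $BIND $1"
}

cmd_with_btl "$APP"
cmd_without_btl "$APP"

# The two checks mentioned in the post.
echo "ucx_info -d"
echo "ompi_info --all"
```

Both attempts end with the same "No components were able to be opened in the pml framework" message; the second one adds the openib warning.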