As much as I hate to reply to myself, I'm going to in this case. Digging deeper into the old OS image (I found a couple of nodes that I had forgotten to re-image), it looks like libibverbs and librdmacm were, in fact, installed. That explains how the previous image was able to avoid the "cannot open shared object file" messages.
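In case it's useful to anyone else chasing a similar problem: the quickest check I know of for whether these libraries are actually resolvable is to ask the dynamic linker directly. This is just a sketch against our local install prefix (and I'm assuming the plugin file carries the usual ".so" suffix); adjust the paths for your own site:

  # Is librdmacm.so.1 (or libibverbs) registered with the dynamic linker at all?
  ldconfig -p | grep -E 'librdmacm|libibverbs'

  # Which shared libraries does the openib BTL plugin itself want,
  # and are any of them missing on this node?
  ldd /apps/openmpi/1.6.3_gnu-4.4/lib/openmpi/mca_btl_openib.so | grep -i 'not found'

If the second command reports anything missing, that lines up with the "(ignored)" component_find warnings shown below.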
My current theory is that somewhere between the (very) old version of librdmacm on the old image and the new version on the new image, there was a change that started to emit the "librdmacm: Fatal: no RDMA devices found" messages. All of this implies that the difference is related to something that happened in librdmacm, not something that changed in OpenMPI.

Sorry for the list noise.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 03/02/2015 02:42 PM, Lloyd Brown wrote:
> I hope this isn't too basic of a question, but is there a document
> somewhere that describes how the selection of which BTL components (e.g.
> openib, tcp) to use occurs when mpirun/mpiexec is launched? I know it
> can be influenced by conf files, parameters, and env variables. But
> lacking those, how does it choose which components to use?
>
> I'm trying to diagnose an issue involving OpenMPI, OFED, and an OS
> upgrade. I'm hoping that a better understanding of how components are
> selected will help me figure out what changed with the OS upgrade.
>
> Here's a longer explanation.
>
> We recently upgraded our HPC cluster from RHEL 6.2 to 6.6. We have
> several versions of OpenMPI available from a central NFS store. Our
> cluster has some nodes with IB hardware, and some without.
>
> On the old OS image, we did not install any of the OFED components on
> the non-IB nodes, and OpenMPI was somehow able to figure out that it
> shouldn't even try the openib BTL, without any runtime warnings. We got
> the speeds we were expecting when running osu_bw tests from the OMB
> test suite, for both the IB nodes (about 3800 MB/s for 4xQDR IB) and
> the non-IB nodes (about 115 MB/s for 1GbE).
>
> Since the OS upgrade, we started to get warnings like this on non-IB
> nodes without OFED installed:
>
>> $ mpirun -np 2 hello_world
>> [m7stage-1-1:09962] mca: base: component_find: unable to open
>> /apps/openmpi/1.6.3_gnu-4.4/lib/openmpi/mca_btl_ofud: librdmacm.so.1: cannot
>> open shared object file: No such file or directory (ignored)
>> [m7stage-1-1:09961] mca: base: component_find: unable to open
>> /apps/openmpi/1.6.3_gnu-4.4/lib/openmpi/mca_btl_ofud: librdmacm.so.1: cannot
>> open shared object file: No such file or directory (ignored)
>> [m7stage-1-1:09961] mca: base: component_find: unable to open
>> /apps/openmpi/1.6.3_gnu-4.4/lib/openmpi/mca_btl_openib: librdmacm.so.1:
>> cannot open shared object file: No such file or directory (ignored)
>> [m7stage-1-1:09962] mca: base: component_find: unable to open
>> /apps/openmpi/1.6.3_gnu-4.4/lib/openmpi/mca_btl_openib: librdmacm.so.1:
>> cannot open shared object file: No such file or directory (ignored)
>> Hello from process # 0 of 2 on host m7stage-1-1
>> Hello from process # 1 of 2 on host m7stage-1-1
>
> Obviously these are references to software components associated with
> OFED. We can install OFED on the non-IB nodes, but then we get warnings
> more like this:
>
>> $ mpirun -np 2 hello_world
>> librdmacm: Fatal: no RDMA devices found
>> librdmacm: Fatal: no RDMA devices found
>> --------------------------------------------------------------------------
>> [[63448,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: m7stage-1-1
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> Hello from process # 0 of 2 on host m7stage-1-1
>> Hello from process # 1 of 2 on host m7stage-1-1
>> [m7stage-1-1:18753] 1 more process has sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [m7stage-1-1:18753] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>
> Obviously we can work around this by using "--mca btl ^openib" or similar
> on the non-IB nodes, and we're pursuing that option.
>
> But I'm struggling to understand what happened to cause OpenMPI on a
> non-IB node, without OFED installed, to no longer be able to figure out
> that it shouldn't use the openib BTL. That's why I'm asking for more
> information about how that decision is made; maybe it will clue me in
> as to what changed.
>
> Thanks,
>
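For the archives, the workaround we're pursuing is simply to disable the openib BTL on the non-IB nodes through the usual MCA mechanisms. A rough sketch (the conf-file path assumes a default install prefix, and hello_world is just our trivial test program):

  # 1. Per job, on the mpirun command line:
  mpirun --mca btl ^openib -np 2 hello_world

  # 2. Per environment, e.g. from a module file or profile script:
  export OMPI_MCA_btl=^openib

  # 3. System-wide on the non-IB nodes, in <prefix>/etc/openmpi-mca-params.conf:
  btl = ^openib

As I understand it, the caret excludes the listed component and lets the normal selection logic choose among whatever remains.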