Hi Jeff!

> What does "ompi_info | grep openib" show?

$ ompi_info | grep openib
MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)

> Additionally, Mellanox provides alternate support through their MXM
> libraries, if you want to try that.

Yes, I know. But we already have a hybrid cluster with OpenMPI, OpenMP, CUDA, Torque and many other libraries installed. Because it works perfectly over the Ethernet interconnect, my idea was to add InfiniBand support with a minimum of changes, mainly because we already have some custom-written software for OpenMPI.

> If that shows that you have the openib BTL plugin loaded, try running with
> "mpirun --mca btl_base_verbose 100 ..." That will provide additional
> output about why / why not each point-to-point plugin is chosen.

Yes, I have already tried to get this info, and I saw in the log that rdmacm wants an IP address on the port. So my question in the first message of this thread was: is RDMA alone enough for OpenMPI, or does IPoIB also have to be installed?

The mpirun output is:

[node1:02674] mca: base: components_register: registering btl components
[node1:02674] mca: base: components_register: found loaded component openib
[node1:02674] mca: base: components_register: component openib register function successful
[node1:02674] mca: base: components_register: found loaded component sm
[node1:02674] mca: base: components_register: component sm register function successful
[node1:02674] mca: base: components_register: found loaded component self
[node1:02674] mca: base: components_register: component self register function successful
[node1:02674] mca: base: components_open: opening btl components
[node1:02674] mca: base: components_open: found loaded component openib
[node1:02674] mca: base: components_open: component openib open function successful
[node1:02674] mca: base: components_open: found loaded component sm
[node1:02674] mca: base: components_open: component sm open function successful
[node1:02674] mca: base: components_open: found loaded component self
[node1:02674] mca: base: components_open: component self open function successful
[node1:02674] select: initializing btl component openib
[node1:02674] openib BTL: rdmacm IP address not found on port
[node1:02674] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped
[node1:02674] select: init of component openib returned failure
[node1:02674] mca: base: close: component openib closed
[node1:02674] mca: base: close: unloading component openib
[node1:02674] select: initializing btl component sm
[node1:02674] select: init of component sm returned failure
[node1:02674] mca: base: close: component sm closed
[node1:02674] mca: base: close: unloading component sm
[node1:02674] select: initializing btl component self
[node1:02674] select: init of component self returned success
[node1:02674] mca: bml: Using self btl to [[16642,1],0] on node node1
[node1:02674] mca: base: close: component self closed
[node1:02674] mca: base: close: unloading component self

Best regards,
Sergei.