Hi everyone,

I have a cluster of 32 nodes with InfiniBand; four of them additionally have a 10G Mellanox Ethernet card for faster I/O. If my job, built with openmpi 1.10.6, ends up on one of these nodes, it crashes:
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           pax10-28
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[pax10-28:08830] *** Process received signal ***
[pax10-28:08830] Signal: Segmentation fault (11)
[pax10-28:08830] Signal code: Address not mapped (1)
[pax10-28:08830] Failing at address: 0x1a0
[pax10-28:08830] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x2b843325d5e0]
[pax10-28:08830] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/openmpi/mca_btl_openib.so(+0x133a0)[0x2b843752e3a0]
[pax10-28:08830] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/libopen-pal.so.13(opal_progress+0x2a)[0x2b8433ad1dca]
[pax10-28:08830] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/libmpi.so.12(ompi_mpi_init+0x957)[0x2b8432fb4a57]
[pax10-28:08830] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/libmpi.so.12(MPI_Init+0x13d)[0x2b8432fd723d]
[pax10-28:08830] [ 5] IMB-MPI1[0x402079]
[pax10-28:08830] [ 6] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b843348bc05]
[pax10-28:08830] [ 7] IMB-MPI1[0x401f79]

The same test works fine with openmpi 3.0.0; there I get a similar warning but no crash:

No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           pax10-29
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm

I'm using the openmpi packages included in OpenHPC for CentOS 7. The node producing the error message shows two IB devices:

[pax10-28] /root # ibv_devinfo
hca_id: mlx4_1
        transport:                      InfiniBand (0)
        fw_ver:                         2.33.5100
        node_guid:                      0cc4:7aff:ff5f:93fc
        sys_image_guid:                 0cc4:7aff:ff5f:93ff
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       SM_2271000001000
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               33
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.36.5000
        node_guid:                      7cfe:9003:0093:8db0
        sys_image_guid:                 7cfe:9003:0093:8db0
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       MT_1080120023
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

The other nodes look like this:

[pax10-27] /root # ibv_devinfo
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.33.5100
        node_guid:                      0cc4:7aff:ff5f:957c
        sys_image_guid:                 0cc4:7aff:ff5f:957f
        vendor_id:                      0x02c9
        vendor_part_id:                 4099
        hw_ver:                         0x0
        board_id:                       SM_2271000001000
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               16
                        port_lmc:               0x00
                        link_layer:             InfiniBand

So is there a way to tell openmpi to use mlx4_1 on the machines with the 10G Ethernet card and mlx4_0 on all the other nodes?

Regards,
Götz
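P.S. If all nodes looked the same, I would simply pin the device on the mpirun command line, e.g. something like

    mpirun --mca btl_openib_if_include mlx4_0 ... ./IMB-MPI1

but since mlx4_1 is the IB HCA on the four Ethernet nodes, a single value cannot be right everywhere. The only idea I have so far is to select the device per node, roughly as sketched below; this is just a guess on my part (the file name is made up, and I have not checked whether an environment variable set locally on each node actually reaches the MPI processes started there):

    # /etc/profile.d/ompi-ib-device.sh  (hypothetical per-node snippet)
    # Pick the HCA that carries the InfiniBand link on this particular node.
    if [ -d /sys/class/infiniband/mlx4_1 ]; then
        # nodes with the extra 10G Mellanox card: the IB HCA is mlx4_1
        export OMPI_MCA_btl_openib_if_include=mlx4_1
    else
        # plain IB nodes: the IB HCA is mlx4_0
        export OMPI_MCA_btl_openib_if_include=mlx4_0
    fi

Is that a reasonable direction, or is there a cleaner way?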