Hi everyone,

I have a cluster of 32 nodes with InfiniBand; four of them
additionally have a 10G Mellanox Ethernet card for faster I/O. If my
job, built against Open MPI 1.10.6, ends up on one of these nodes, it
crashes:

No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           pax10-28
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[pax10-28:08830] *** Process received signal ***
[pax10-28:08830] Signal: Segmentation fault (11)
[pax10-28:08830] Signal code: Address not mapped (1)
[pax10-28:08830] Failing at address: 0x1a0
[pax10-28:08830] [ 0] /usr/lib64/libpthread.so.0(+0xf5e0)[0x2b843325d5e0]
[pax10-28:08830] [ 1] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/openmpi/mca_btl_openib.so(+0x133a0)[0x2b843752e3a0]
[pax10-28:08830] [ 2] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/libopen-pal.so.13(opal_progress+0x2a)[0x2b8433ad1dca]
[pax10-28:08830] [ 3] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/libmpi.so.12(ompi_mpi_init+0x957)[0x2b8432fb4a57]
[pax10-28:08830] [ 4] /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/lib/libmpi.so.12(MPI_Init+0x13d)[0x2b8432fd723d]
[pax10-28:08830] [ 5] IMB-MPI1[0x402079]
[pax10-28:08830] [ 6] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b843348bc05]
[pax10-28:08830] [ 7] IMB-MPI1[0x401f79]


The same test works fine with Open MPI 3.0.0; there I get a similar
warning, but no crash:
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           pax10-29
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm


I'm using the Open MPI packages included in OpenHPC for CentOS 7.

The node producing the error message shows two devices:

[pax10-28] /root # ibv_devinfo
hca_id:    mlx4_1
    transport:            InfiniBand (0)
    fw_ver:                2.33.5100
    node_guid:            0cc4:7aff:ff5f:93fc
    sys_image_guid:            0cc4:7aff:ff5f:93ff
    vendor_id:            0x02c9
    vendor_part_id:            4099
    hw_ver:                0x0
    board_id:            SM_2271000001000
    phys_port_cnt:            1
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            1
            port_lid:        33
            port_lmc:        0x00
            link_layer:        InfiniBand

hca_id:    mlx4_0
    transport:            InfiniBand (0)
    fw_ver:                2.36.5000
    node_guid:            7cfe:9003:0093:8db0
    sys_image_guid:            7cfe:9003:0093:8db0
    vendor_id:            0x02c9
    vendor_part_id:            4099
    hw_ver:                0x0
    board_id:            MT_1080120023
    phys_port_cnt:            2
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        1024 (3)
            sm_lid:            0
            port_lid:        0
            port_lmc:        0x00
            link_layer:        Ethernet

        port:    2
            state:            PORT_DOWN (1)
            max_mtu:        4096 (5)
            active_mtu:        1024 (3)
            sm_lid:            0
            port_lid:        0
            port_lmc:        0x00
            link_layer:        Ethernet



The other nodes look like this:
[pax10-27] /root # ibv_devinfo
hca_id:    mlx4_0
    transport:            InfiniBand (0)
    fw_ver:                2.33.5100
    node_guid:            0cc4:7aff:ff5f:957c
    sys_image_guid:            0cc4:7aff:ff5f:957f
    vendor_id:            0x02c9
    vendor_part_id:            4099
    hw_ver:                0x0
    board_id:            SM_2271000001000
    phys_port_cnt:            1
        port:    1
            state:            PORT_ACTIVE (4)
            max_mtu:        4096 (5)
            active_mtu:        4096 (5)
            sm_lid:            1
            port_lid:        16
            port_lmc:        0x00
            link_layer:        InfiniBand



So, is there a way to tell Open MPI to use mlx4_1 on the machines with
the 10G Ethernet card and mlx4_0 on all other nodes?
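
What I have in mind is something along these lines. This is only a
sketch: I am assuming that btl_openib_if_include is the right
parameter for the openib BTL, and that the etc/ directory of the
Open MPI installation is node-local rather than shared.

  # Only on the four nodes with the extra 10G card: restrict the
  # openib BTL to the InfiniBand HCA so it ignores the Ethernet
  # adapter (system-wide MCA parameter file under the install prefix).
  echo "btl_openib_if_include = mlx4_1" >> \
      /opt/ohpc/pub/mpi/openmpi-gnu/1.10.6/etc/openmpi-mca-params.conf

For a single run confined to the 10G nodes I suppose the same thing
could go on the mpirun command line,

  mpirun --mca btl_openib_if_include mlx4_1 ./IMB-MPI1

but that would obviously not be right for jobs spanning both kinds of
nodes, since the other nodes only have mlx4_0.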

Regards, Götz