Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's
going on. On a node that's working fine (w2), under port 1 there is a line:

link_layer: InfiniBand

On a node that is having trouble (w3), that line is not present. The
question is why this inconsistency occurs.
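
In case it helps anyone reproduce this, the way I spotted it was simply to
diff the two dumps (node names here are ours; adjust to taste):

    $ ssh w2 ibv_devinfo -v > w2.txt
    $ ssh w3 ibv_devinfo -v > w3.txt
    $ diff w2.txt w3.txt

Apart from the expected per-node differences (GUIDs, LIDs, GIDs), the only
difference is that link_layer line, which is entirely absent on w3.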

I don't seem to have ofed_info installed on my system -- not sure what
magical package Red Hat decided to hide that in. The InfiniBand stack I am
running is stock with our version of Scientific Linux (6.2). I am beginning
to wonder if this isn't some bug with the Red Hat/SL-provided InfiniBand
stack. I'll do some more poking, but at least now I've got something
semi-solid to poke at. Thanks for all of your help; I've attached the
results of "ibv_devinfo -v" for both systems, so if you see anything else
that jumps out at you, please let me know.
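
(For the record: if ofed_info lived in a stock repository, something like

    $ yum provides '*/ofed_info'

ought to turn up the owning package; my understanding is that it actually
ships with Mellanox's OFED distribution rather than with the Red Hat/SL
inbox stack, which would explain why I can't find it.)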

Tim




On Sat, Jun 7, 2014 at 2:21 AM, Mike Dubman <mi...@dev.mellanox.co.il>
wrote:

> Could you please attach the output of "ibv_devinfo -v" and "ofed_info -s"?
> Thx
>
>
> On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller <btamil...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> I asked one of our more advanced users to add the
>> "-mca btl_openib_if_include mlx4_0:1" argument to his job script.
>> Unfortunately, the same error occurred as before.
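>>
>> For reference, the line in his job script ended up looking roughly like
>> this (the process count and binary name here are placeholders):
>>
>>   mpirun -np 64 -hostfile $PBS_NODEFILE -mca btl_openib_if_include mlx4_0:1 ./our_app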
>>
>> We'll keep digging on our end; if you have any other suggestions, please
>> let us know.
>>
>> Tim
>>
>>
>> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamil...@gmail.com> wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for attempting to sort this out. In answer to your questions:
>>>
>>> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
>>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>>> mpirun uses the ssh launcher to communicate with the remote nodes and
>>> launch the processes there (rough sketch below this list).
>>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>>> motherboard on most of our nodes, including all that have this issue). They
>>> are all configured to use InfiniBand (no IPoIB or other protocols).
>>> 3. No, we don't explicitly ask for a device port pair. We will try your
>>> suggestion and report back.
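>>>
>>> To be concrete about point 1, a launch looks roughly like this (host
>>> names, slot counts, and the binary are placeholders):
>>>
>>>   $ cat hosts.txt
>>>   w1 slots=16
>>>   w16 slots=16
>>>   $ mpirun -np 32 -hostfile hosts.txt ./our_app
>>>
>>> with mpirun ssh-ing to the remote nodes itself, outside of TORQUE's TM
>>> API.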
>>>
>>> Thanks again!
>>>
>>> Tim
>>>
>>>
>>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.m...@gmail.com>
>>> wrote:
>>>
>>>> Strange indeed. This info (the remote adapter info) is passed around in
>>>> the modex, and the struct is populated locally during add_procs.
>>>>
>>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>>> 2. How many active ports do you have on each HCA? Are they all
>>>> configured to use IB?
>>>> 3. Do you explicitly ask for a device:port pair with the "if include"
>>>> mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
>>>> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
>>>> IB.)
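>>>>
>>>> (If you're unsure what device:port pairs a node has, the standard tools
>>>> will list them, e.g.:
>>>>
>>>>   $ ibstat -l
>>>>   mlx4_0
>>>>   $ ibv_devinfo | grep -E 'hca_id|port:|state'
>>>>
>>>> and whatever shows up there is what btl_openib_if_include expects.)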
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to revive this thread, since I am still periodically getting
>>>>> errors of this type. I have built 1.8.1 with --enable-debug and run with
>>>>> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide
>>>>> any additional information that I can find useful. I've gone ahead and
>>>>> attached a dump of the output under 1.8.1. The key lines are:
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>> same Infiniband network.
>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>
>>>>>   Local host:            w1
>>>>>   Local adapter:         mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>
>>>>>   Remote host:           w16
>>>>>   Remote Adapter:        (vendor 0x2c9, part ID 26428)
>>>>>   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>
>>>>> -------------------------------------------------------------------------
>>>>>
>>>>> Note that the vendor and part IDs are the same. If I immediately run
>>>>> on the same two nodes using MVAPICH2, everything is fine.
>>>>>
>>>>> I'm really very befuddled by this. Open MPI sees that the two cards are
>>>>> identical and made by the same vendor, yet it thinks the transport types
>>>>> are different (and one is unknown). I'm hoping someone with experience
>>>>> of how the openib BTL works can shed some light on this problem...
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
>>>>>> wondering if this is an issue with the OOB. If you have a debug build,
>>>>>> you can run with -mca btl_openib_verbose 10.
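>>>>>>
>>>>>> i.e., something along these lines (the install prefix, host list, and
>>>>>> binary are placeholders):
>>>>>>
>>>>>>   $ ./configure --prefix=$HOME/ompi-1.8.1-dbg --enable-debug
>>>>>>   $ make -j install
>>>>>>   $ mpirun -np 2 -hostfile hosts.txt -mca btl_openib_verbose 10 ./a.out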
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, Tim
>>>>>>>
>>>>>>> Run "ibstat" on each host:
>>>>>>>
>>>>>>> 1. Make sure the adapters are alive and active.
>>>>>>>
>>>>>>> 2. Look at the Link Layer setting for host w34. Does it match host
>>>>>>> w4's?
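>>>>>>>
>>>>>>> For example (the CA name is assumed to be mlx4_0; recent ibstat
>>>>>>> versions print a "Link layer" line):
>>>>>>>
>>>>>>>   $ ibstat mlx4_0 1 | grep -E 'State|Link layer'
>>>>>>>   State: Active
>>>>>>>   Link layer: InfiniBand
>>>>>>>
>>>>>>> Both hosts should show Active and the same link layer.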
>>>>>>>
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>>
>>>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamil...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We're using Open MPI 1.7.3 with Mellanox ConnectX InfiniBand
>>>>>>>> adapters, and periodically our jobs abort at start-up with the
>>>>>>>> following error:
>>>>>>>>
>>>>>>>> ===
>>>>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>>>>> same Infiniband network.
>>>>>>>> Such mixed network trasport configuration is not supported by Open
>>>>>>>> MPI.
>>>>>>>>
>>>>>>>>   Local host:            w4
>>>>>>>>   Local adapter:         mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>>>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>>>
>>>>>>>>   Remote host:           w34
>>>>>>>>   Remote Adapter:        (vendor 0x2c9, part ID 26428)
>>>>>>>>   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>>>> ===
>>>>>>>>
>>>>>>>> I've done a bit of googling and not found very much. We do not see
>>>>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>>>
>>>>>>>> Any advice or thoughts would be very welcome, as I am stumped by
>>>>>>>> what causes this. The nodes are all running Scientific Linux 6 with
>>>>>>>> Mellanox drivers installed via the SL-provided RPMs.
>>>>>>>>
>>>>>>>> Tim
>>>>>>>>

=== Attachment: "ibv_devinfo -v" on w2 (working node) ===

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.200
        node_guid:                      0025:90ff:ff1c:42e4
        sys_image_guid:                 0025:90ff:ff1c:42e7
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       SM_2092000001000
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         65464
        max_qp_wr:                      16384
        device_cap_flags:               0x006c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         131056
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                1047424
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               55
                        port_lmc:               0x00
                        link_layer:             InfiniBand
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         17
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:0025:90ff:ff1c:42e5

=== Attachment: "ibv_devinfo -v" on w3 (problem node) ===

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.200
        node_guid:                      0025:90ff:ff1b:988c
        sys_image_guid:                 0025:90ff:ff1b:988f
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       SM_2092000001000
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         65464
        max_qp_wr:                      16384
        device_cap_flags:               0x006c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         131056
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                1047424
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               64
                        port_lmc:               0x00
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         17
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[  0]:               fe80:0000:0000:0000:0025:90ff:ff1b:988d
