Could you please attach the output of "ibv_devinfo -v" and "ofed_info -s"?
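
For example, something like this on each affected node would collect what we
need (the output file names are just a suggestion):

    ibv_devinfo -v > ibv_devinfo.$(hostname).txt
    ofed_info -s   > ofed_info.$(hostname).txt
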
Thx


On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller <btamil...@gmail.com> wrote:

> Hi Josh,
>
> I asked one of our more advanced users to add the "-mca btl_openib_if_include
> mlx4_0:1" argument to his job script. Unfortunately, the same error
> occurred as before.
>
> We'll keep digging on our end; if you have any other suggestions, please
> let us know.
>
> Tim
>
>
> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamil...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> Thanks for attempting to sort this out. In answer to your questions:
>>
>> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>> mpirun uses the ssh launcher to communicate with the remote nodes and launch
>> the processes there.
>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>> motherboard on most of our nodes, including all that have this issue). They
>> are all configured to use InfiniBand (no IPoIB or other protocols).
>> 3. No, we don't explicitly ask for a device port pair. We will try your
>> suggestion and report back.
>>
>> Thanks again!
>>
>> Tim
>>
>>
>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>> Strange indeed. This info (remote adapter info) is passed around in the
>>> modex and the struct is locally populated during add procs.
>>>
>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>> 2. How many active ports do you have on each HCA? Are they all
>>> configured to use IB?
>>> 3. Do you explicitly ask for a device:port pair with the "if_include" MCA
>>> param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
>>> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
>>> IB)?
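>>>
>>> For example, a full launch line might look like this (the executable name
>>> and process count are just placeholders):
>>>
>>>   mpirun -np 16 -mca btl_openib_if_include mlx4_0:1 ./your_app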
>>>
>>> Josh
>>>
>>>
>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamil...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to revive this thread, since I am still periodically getting
>>>> errors of this type. I have built 1.8.1 with --enable-debug and run with
>>>> -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
>>>> additional information that I can find useful. I've gone ahead and attached
>>>> a dump of the output under 1.8.1. The key lines are:
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>> Open MPI detected two different OpenFabrics transport types in the same
>>>> Infiniband network.
>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>
>>>>   Local host:            w1
>>>>   Local adapter:         mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>>>
>>>>   Remote host:           w16
>>>>   Remote Adapter:        (vendor 0x2c9, part ID 26428)
>>>>   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>
>>>> -------------------------------------------------------------------------
>>>>
>>>> Note that the vendor and part IDs are the same. If I immediately run on
>>>> the same two nodes using MVAPICH2, everything is fine.
>>>>
>>>> I'm really very befuddled by this. OpenMPI sees that the two cards are
>>>> the same and made by the same vendor, yet it thinks the transport types are
>>>> different (and one is unknown). I'm hoping someone with some experience
>>>> with how the OpenIB BTL works can shed some light on this problem...
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> Have you tried the latest stable OMPI, 1.8.1? I'm wondering if this is
>>>>> an issue with the OOB. If you have a debug build, you can run with -mca
>>>>> btl_openib_verbose 10.
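>>>>>
>>>>> For example (the install prefix is just a placeholder):
>>>>>
>>>>>   ./configure --enable-debug --prefix=/opt/openmpi-1.8.1-dbg
>>>>>   make install
>>>>>   mpirun -mca btl_openib_verbose 10 ./your_app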
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.m...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, Tim
>>>>>>
>>>>>> Run "ibstat" on each host:
>>>>>>
>>>>>> 1. Make sure the adapters are alive and active.
>>>>>>
>>>>>> 2. Look at the Link Layer setting for host w34. Does it match host
>>>>>> w4's?
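>>>>>>
>>>>>> For example, on each host:
>>>>>>
>>>>>>   ibstat | grep -i 'link layer'
>>>>>>
>>>>>> If one side reports Ethernet (RoCE) rather than InfiniBand, that sort of
>>>>>> mismatch could produce the mixed-transport error you're seeing.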
>>>>>>
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>>
>>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamil...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand
>>>>>>> adapters, and periodically our jobs abort at start-up with the following
>>>>>>> error:
>>>>>>>
>>>>>>> ===
>>>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>>>> same Infiniband network.
>>>>>>> Such mixed network trasport configuration is not supported by Open
>>>>>>> MPI.
>>>>>>>
>>>>>>>   Local host:            w4
>>>>>>>   Local adapter:         mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>>
>>>>>>>   Remote host:           w34
>>>>>>>   Remote Adapter:        (vendor 0x2c9, part ID 26428)
>>>>>>>   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>>> ===
>>>>>>>
>>>>>>> I've done a bit of googling and not found very much. We do not see
>>>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>>
>>>>>>> Any advice or thoughts would be very welcome, as I am stumped by
>>>>>>> what causes this. The nodes are all running Scientific Linux 6 with
>>>>>>> Mellanox drivers installed via the SL-provided RPMs.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
