Hi Josh,

I asked one of our more advanced users to add the "-mca btl_openib_if_include mlx4_0:1" argument to his job script. Unfortunately, the same error occurred as before.
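For reference, the launch line in his job script now looks roughly like the following. This is only a sketch: the process count and executable name are placeholders, and the TORQUE-generated $PBS_NODEFILE stands in here for whatever hostfile the script actually passes to mpirun.

    mpirun -np 64 -hostfile $PBS_NODEFILE \
        -mca btl_openib_if_include mlx4_0:1 ./his_application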
We'll keep digging on our end; if you have any other suggestions, please let us know.

Tim

On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamil...@gmail.com> wrote:
> Hi Josh,
>
> Thanks for attempting to sort this out. In answer to your questions:
>
> 1. Node allocation is done by TORQUE; however, we don't use the TM API to
> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
> mpirun uses the ssh launcher to communicate with the remote nodes and
> launch the processes there.
> 2. We have only one port per HCA (the HCA silicon is integrated with the
> motherboard on most of our nodes, including all that have this issue).
> They are all configured to use InfiniBand (no IPoIB or other protocols).
> 3. No, we don't explicitly ask for a device:port pair. We will try your
> suggestion and report back.
>
> Thanks again!
>
> Tim
>
>
> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>
>> Strange indeed. This info (the remote adapter info) is passed around in
>> the modex, and the struct is locally populated during add_procs.
>>
>> 1. How do you launch jobs? mpirun, srun, or something else?
>> 2. How many active ports do you have on each HCA? Are they all
>> configured to use IB?
>> 3. Do you explicitly ask for a device:port pair with the "if_include"
>> MCA param? If not, can you please add "-mca btl_openib_if_include
>> mlx4_0:1" (assuming you have a ConnectX-3 HCA and port 1 is configured
>> to run over IB)?
>>
>> Josh
>>
>>
>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'd like to revive this thread, since I am still periodically getting
>>> errors of this type. I have built 1.8.1 with --enable-debug and run
>>> with -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to
>>> provide any additional information that I can find useful. I've gone
>>> ahead and attached a dump of the output under 1.8.1. The key lines are:
>>>
>>> --------------------------------------------------------------------------
>>> Open MPI detected two different OpenFabrics transport types in the same
>>> Infiniband network.
>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>
>>>   Local host:             w1
>>>   Local adapter:          mlx4_0 (vendor 0x2c9, part ID 26428)
>>>   Local transport type:   MCA_BTL_OPENIB_TRANSPORT_IB
>>>
>>>   Remote host:            w16
>>>   Remote Adapter:         (vendor 0x2c9, part ID 26428)
>>>   Remote transport type:  MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>> --------------------------------------------------------------------------
>>>
>>> Note that the vendor and part IDs are the same. If I immediately run on
>>> the same two nodes using MVAPICH2, everything is fine.
>>>
>>> I'm really quite befuddled by this. Open MPI sees that the two cards
>>> are the same and made by the same vendor, yet it thinks the transport
>>> types are different (and one is unknown). I'm hoping someone with
>>> experience of how the openib BTL works can shed some light on this
>>> problem...
>>>
>>> Tim
>>>
>>>
>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>
>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
>>>> wondering if this is an issue with the OOB. If you have a debug build,
>>>> you can run with -mca btl_openib_verbose 10.
>>>>
>>>> Josh
>>>>
>>>>
>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>
>>>>> Hi, Tim
>>>>>
>>>>> Run "ibstat" on each host:
>>>>>
>>>>> 1. Make sure the adapters are alive and active.
>>>>>
>>>>> 2. Look at the Link Layer settings for host w34. Do they match host
>>>>> w4's?
>>>>>
>>>>> Josh
>>>>>
>>>>>
>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamil...@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We're using Open MPI 1.7.3 with Mellanox ConnectX InfiniBand
>>>>>> adapters, and periodically our jobs abort at start-up with the
>>>>>> following error:
>>>>>>
>>>>>> ===
>>>>>> Open MPI detected two different OpenFabrics transport types in the
>>>>>> same Infiniband network.
>>>>>> Such mixed network trasport configuration is not supported by Open
>>>>>> MPI.
>>>>>>
>>>>>>   Local host:             w4
>>>>>>   Local adapter:          mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>   Local transport type:   MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>
>>>>>>   Remote host:            w34
>>>>>>   Remote Adapter:         (vendor 0x2c9, part ID 26428)
>>>>>>   Remote transport type:  MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>> ===
>>>>>>
>>>>>> I've done a bit of googling and not found very much. We do not see
>>>>>> this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>
>>>>>> Any advice or thoughts would be very welcome, as I am stumped by
>>>>>> what causes this. The nodes are all running Scientific Linux 6 with
>>>>>> Mellanox drivers installed via the SL-provided RPMs.
>>>>>>
>>>>>> Tim
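For reference, the "ibstat" comparison Josh suggests above can be scripted along these lines. This is only a rough sketch: it assumes passwordless ssh to the compute nodes and that the ibstat shipped with this OFED prints "State" and "Link layer" fields; the hostnames are the ones from the original error message.

    # print the port state and link layer for the mlx4_0 adapter on each host
    for h in w4 w34; do
        echo "=== $h ==="
        ssh "$h" ibstat mlx4_0 | grep -E 'State:|Link layer:'
    done

If the two hosts report different link layers (for example, one port configured for Ethernet rather than InfiniBand), that would at least be consistent with the mixed-transport error quoted above.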