Could you please attach the output of "ibv_devinfo -v" and "ofed_info -s"? Thanks.
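If it's easier, something like the sketch below would collect everything in one pass. It's only a suggestion and makes a couple of assumptions: that you have passwordless ssh to the nodes, and that w1 and w16 (the hosts named in your latest error output) are the pair to compare -- substitute your actual node names and output file as needed:

    # gather OFED version, port state/link layer, and full device attributes from both hosts
    for h in w1 w16; do
        echo "==== $h ===="
        ssh "$h" 'ofed_info -s; ibstat; ibv_devinfo -v'
    done > ib_diag.txt

The ibstat section is also worth a second look for the Link Layer field mentioned earlier in this thread; a port running Ethernet/RoCE on one side and InfiniBand on the other could produce exactly this kind of transport-type mismatch.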
On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller <btamil...@gmail.com> wrote:

> Hi Josh,
>
> I asked one of our more advanced users to add the "-mca btl_openib_if_include mlx4_0:1" argument to his job script. Unfortunately, the same error occurred as before.
>
> We'll keep digging on our end; if you have any other suggestions, please let us know.
>
> Tim
>
> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamil...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> Thanks for attempting to sort this out. In answer to your questions:
>>
>> 1. Node allocation is done by TORQUE, however we don't use the TM API to launch jobs (long story). Instead, we just pass a hostfile to mpirun, and mpirun uses the ssh launcher to actually communicate and launch the processes on remote nodes.
>> 2. We have only one port per HCA (the HCA silicon is integrated with the motherboard on most of our nodes, including all that have this issue). They are all configured to use InfiniBand (no IPoIB or other protocols).
>> 3. No, we don't explicitly ask for a device port pair. We will try your suggestion and report back.
>>
>> Thanks again!
>>
>> Tim
>>
>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>> Strange indeed. This info (remote adapter info) is passed around in the modex and the struct is locally populated during add procs.
>>>
>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>> 2. How many active ports do you have on each HCA? Are they all configured to use IB?
>>> 3. Do you explicitly ask for a device:port pair with the "if include" mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1" (assuming you have a ConnectX-3 HCA and port 1 is configured to run over IB.)
>>>
>>> Josh
>>>
>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamil...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'd like to revive this thread, since I am still periodically getting errors of this type. I have built 1.8.1 with --enable-debug and run with -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any additional information that I can find useful. I've gone ahead and attached a dump of the output under 1.8.1. The key lines are:
>>>>
>>>> --------------------------------------------------------------------------
>>>> Open MPI detected two different OpenFabrics transport types in the same
>>>> Infiniband network.
>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>
>>>> Local host: w1
>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>
>>>> Remote host: w16
>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>> --------------------------------------------------------------------------
>>>>
>>>> Note that the vendor and part IDs are the same. If I immediately run on the same two nodes using MVAPICH2, everything is fine.
>>>>
>>>> I'm really very befuddled by this. OpenMPI sees that the two cards are the same and made by the same vendor, yet it thinks the transport types are different (and one is unknown). I'm hoping someone with some experience with how the OpenIB BTL works can shed some light on this problem...
>>>>
>>>> Tim
>>>>
>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>
>>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm wondering if this is an issue with the OOB. If you have a debug build, you can run -mca btl_openib_verbose 10
>>>>>
>>>>> Josh
>>>>>
>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>
>>>>>> Hi, Tim
>>>>>>
>>>>>> Run "ibstat" on each host:
>>>>>>
>>>>>> 1. Make sure the adapters are alive and active.
>>>>>> 2. Look at the Link Layer settings for host w34. Does it match host w4's?
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamil...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adapters, and periodically our jobs abort at start-up with the following error:
>>>>>>>
>>>>>>> ===
>>>>>>> Open MPI detected two different OpenFabrics transport types in the same
>>>>>>> Infiniband network.
>>>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>>>
>>>>>>> Local host: w4
>>>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>>
>>>>>>> Remote host: w34
>>>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>>> ===
>>>>>>>
>>>>>>> I've done a bit of googling and not found very much. We do not see this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>>
>>>>>>> Any advice or thoughts would be very welcome, as I am stumped by what causes this. The nodes are all running Scientific Linux 6 with Mellanox drivers installed via the SL-provided RPMs.
>>>>>>>
>>>>>>> Tim
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
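When you re-test, a full invocation that combines the pieces discussed so far might look something like the sketch below. The hostfile name, process count, application binary, and log file name are placeholders for your real job; the two MCA parameters are the ones already mentioned in this thread (the if_include value assumes mlx4_0 port 1 is the IB port, and the verbose flag is the one that needs your --enable-debug build to produce extra output):

    mpirun -np 16 -hostfile hosts \
           -mca btl_openib_if_include mlx4_0:1 \
           -mca btl_openib_verbose 10 \
           ./your_app 2>&1 | tee openib_verbose.log

Attaching that log together with the ibv_devinfo/ofed_info output from both hosts should give us enough to compare what the two sides are reporting during the modex.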