Aha ... looking at "ibv_devinfo -v" got me my first concrete hint of what's going on. On a node that's working fine (w2), under port 1 there is a line:

    LinkLayer: InfiniBand

On a node that is having trouble (w3), that line is not present. The question is why this inconsistency occurs. I don't seem to have ofed_info installed on my system -- not sure what magical package Red Hat decided to hide that in. The InfiniBand stack I am running is stock with our version of Scientific Linux (6.2). I am beginning to wonder if this isn't some bug with the Red Hat/SL-provided InfiniBand stack.

I'll do some more poking, but at least now I've got something semi-solid to poke at. Thanks for all of your help; I've attached the results of "ibv_devinfo -v" for both systems, so if you see anything else that jumps out at you, please let me know.

Tim

On Sat, Jun 7, 2014 at 2:21 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
> could you please attach output of "ibv_devinfo -v" and "ofed_info -s"
> Thx
>
> On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller <btamil...@gmail.com> wrote:
>
>> Hi Josh,
>>
>> I asked one of our more advanced users to add the "-mca btl_openib_if_include mlx4_0:1" argument to his job script. Unfortunately, the same error occurred as before.
>>
>> We'll keep digging on our end; if you have any other suggestions, please let us know.
>>
>> Tim
>>
>> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller <btamil...@gmail.com> wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for attempting to sort this out. In answer to your questions:
>>>
>>> 1. Node allocation is done by TORQUE, however we don't use the TM API to launch jobs (long story). Instead, we just pass a hostfile to mpirun, and mpirun uses the ssh launcher to actually communicate and launch the processes on remote nodes.
>>> 2. We have only one port per HCA (the HCA silicon is integrated with the motherboard on most of our nodes, including all that have this issue). They are all configured to use InfiniBand (no IPoIB or other protocols).
>>> 3. No, we don't explicitly ask for a device port pair. We will try your suggestion and report back.
>>>
>>> Thanks again!
>>>
>>> Tim
>>>
>>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>
>>>> Strange indeed. This info (remote adapter info) is passed around in the modex and the struct is locally populated during add procs.
>>>>
>>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>>> 2. How many active ports do you have on each HCA? Are they all configured to use IB?
>>>> 3. Do you explicitly ask for a device:port pair with the "if include" mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1" (assuming you have a ConnectX-3 HCA and port 1 is configured to run over IB.)
>>>>
>>>> Josh
>>>>
>>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller <btamil...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'd like to revive this thread, since I am still periodically getting errors of this type. I have built 1.8.1 with --enable-debug and run with -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any additional information that I can find useful. I've gone ahead and attached a dump of the output under 1.8.1. The key lines are:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Open MPI detected two different OpenFabrics transport types in the same Infiniband network.
>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>
>>>>> Local host: w1
>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>
>>>>> Remote host: w16
>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Note that the vendor and part IDs are the same. If I immediately run on the same two nodes using MVAPICH2, everything is fine.
>>>>>
>>>>> I'm really very befuddled by this. OpenMPI sees that the two cards are the same and made by the same vendor, yet it thinks the transport types are different (and one is unknown). I'm hoping someone with some experience with how the OpenIB BTL works can shed some light on this problem...
>>>>>
>>>>> Tim
>>>>>
>>>>> On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>
>>>>>> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm wondering if this is an issue with the OOB. If you have a debug build, you can run -mca btl_openib_verbose 10
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, Tim
>>>>>>>
>>>>>>> Run "ibstat" on each host:
>>>>>>>
>>>>>>> 1. Make sure the adapters are alive and active.
>>>>>>> 2. Look at the Link Layer settings for host w34. Does it match host w4's?
>>>>>>>
>>>>>>> Josh
>>>>>>>
>>>>>>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller <btamil...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand adapters, and periodically our jobs abort at start-up with the following error:
>>>>>>>>
>>>>>>>> ===
>>>>>>>> Open MPI detected two different OpenFabrics transport types in the same Infiniband network.
>>>>>>>> Such mixed network trasport configuration is not supported by Open MPI.
>>>>>>>>
>>>>>>>> Local host: w4
>>>>>>>> Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>>>>>> Local transport type: MCA_BTL_OPENIB_TRANSPORT_IB
>>>>>>>>
>>>>>>>> Remote host: w34
>>>>>>>> Remote Adapter: (vendor 0x2c9, part ID 26428)
>>>>>>>> Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN
>>>>>>>> ===
>>>>>>>>
>>>>>>>> I've done a bit of googling and not found very much. We do not see this issue when we run with MVAPICH2 on the same sets of nodes.
>>>>>>>>
>>>>>>>> Any advice or thoughts would be very welcome, as I am stumped by what causes this. The nodes are all running Scientific Linux 6 with Mellanox drivers installed via the SL-provided RPMs.
>>>>>>>>
>>>>>>>> Tim
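Since ofed_info isn't available on the stock Scientific Linux stack, one quick way to cross-check the discrepancy described above is to ask the kernel driver and the package database directly on a working node (w2) and a problem node (w3) and compare. This is only a sketch under assumptions: the device is named mlx4_0 with a single port 1 on both nodes, and the sysfs link_layer attribute is exported only where the installed kernel/driver stack supports it.

    # Run on both w2 and w3, then compare the output side by side.

    # 1. Does the kernel driver export a link_layer attribute for the port?
    #    If the file exists on w2 but not on w3, the two nodes are running
    #    different versions of the IB driver stack.
    cat /sys/class/infiniband/mlx4_0/ports/1/link_layer 2>/dev/null \
        || echo "no link_layer attribute exported by this kernel/driver"

    # 2. Compare the distro-provided user-space verbs packages; ofed_info
    #    ships with Mellanox OFED, not with the stock RHEL/SL RPMs.
    rpm -qa | egrep -i 'libibverbs|libmlx4|librdmacm|rdma|opensm' | sort

    # 3. Kernel versions, in case the nodes were updated at different times.
    uname -r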
Attachment 1 -- "ibv_devinfo -v" on w2 (the working node):

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.200
        node_guid:                      0025:90ff:ff1c:42e4
        sys_image_guid:                 0025:90ff:ff1c:42e7
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       SM_2092000001000
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         65464
        max_qp_wr:                      16384
        device_cap_flags:               0x006c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         131056
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                1047424
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               55
                        port_lmc:               0x00
                        link_layer:             InfiniBand
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         17
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[ 0]:                fe80:0000:0000:0000:0025:90ff:ff1c:42e5
Attachment 2 -- "ibv_devinfo -v" on w3 (the node having trouble):

hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         2.7.200
        node_guid:                      0025:90ff:ff1b:988c
        sys_image_guid:                 0025:90ff:ff1b:988f
        vendor_id:                      0x02c9
        vendor_part_id:                 26428
        hw_ver:                         0xB0
        board_id:                       SM_2092000001000
        phys_port_cnt:                  1
        max_mr_size:                    0xffffffffffffffff
        page_size_cap:                  0xfffffe00
        max_qp:                         65464
        max_qp_wr:                      16384
        device_cap_flags:               0x006c9c76
        max_sge:                        32
        max_sge_rd:                     0
        max_cq:                         65408
        max_cqe:                        4194303
        max_mr:                         131056
        max_pd:                         32764
        max_qp_rd_atom:                 16
        max_ee_rd_atom:                 0
        max_res_rd_atom:                1047424
        max_qp_init_rd_atom:            128
        max_ee_init_rd_atom:            0
        atomic_cap:                     ATOMIC_HCA (1)
        max_ee:                         0
        max_rdd:                        0
        max_mw:                         0
        max_raw_ipv6_qp:                0
        max_raw_ethy_qp:                0
        max_mcast_grp:                  8192
        max_mcast_qp_attach:            56
        max_total_mcast_qp_attach:      458752
        max_ah:                         0
        max_fmr:                        0
        max_srq:                        65472
        max_srq_wr:                     16383
        max_srq_sge:                    31
        max_pkeys:                      128
        local_ca_ack_delay:             15
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               64
                        port_lmc:               0x00
                        max_msg_sz:             0x40000000
                        port_cap_flags:         0x02510868
                        max_vl_num:             8 (4)
                        bad_pkey_cntr:          0x0
                        qkey_viol_cntr:         0x0
                        sm_sl:                  0
                        pkey_tbl_len:           128
                        gid_tbl_len:            128
                        subnet_timeout:         17
                        init_type_reply:        0
                        active_width:           4X (2)
                        active_speed:           10.0 Gbps (4)
                        phys_state:             LINK_UP (5)
                        GID[ 0]:                fe80:0000:0000:0000:0025:90ff:ff1b:988d
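For reference, apart from per-node identifiers (node/system GUIDs, port LID, and GID), the two dumps above are identical except for the one line Tim points out: under port 1, w2 reports "link_layer: InfiniBand" while w3's output has no link_layer field at all. A simple way to confirm that nothing else differs (the file paths here are only illustrative):

    # Collect the verbose device info from both nodes and diff it; the only
    # non-identifier difference should be the missing link_layer line on w3.
    ssh w2 ibv_devinfo -v > /tmp/devinfo.w2
    ssh w3 ibv_devinfo -v > /tmp/devinfo.w3
    diff -u /tmp/devinfo.w2 /tmp/devinfo.w3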