It looks like you have two dual-port Mellanox VPI cards in this
machine. These cards can be set to run InfiniBand or Ethernet on a
port-by-port basis, and all four of your ports are set to Ethernet
mode. Two of your ports have active 100 gigabit Ethernet links, and
the other two have no link up at all.

With no InfiniBand links on the machine, you will, of course, not be
able to run your OpenMPI job over InfiniBand.
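
If you do want one or more of those ports in InfiniBand mode, the port
type on ConnectX-4 (MT4115) cards can normally be changed with the
mlxconfig tool from the Mellanox Firmware Tools (MFT). The following is
only a rough sketch: I have not run it on your hardware, the device
name is just a placeholder you would get from "mst status", and a
reboot is generally required afterward:

    mst start
    mst status    # note the device name, e.g. /dev/mst/mt4115_pciconf0
    mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep LINK_TYPE
    mlxconfig -d /dev/mst/mt4115_pciconf0 set LINK_TYPE_P1=1   # 1 = IB, 2 = ETH

Note that the ports would also need to be cabled to an InfiniBand
fabric (switch plus subnet manager) for the links to come up as IB.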

If your machines and network are set up for it, you might be able to
run your job over RoCE (RDMA over Converged Ethernet) using one or
both of those 100 GbE links. I have never used RoCE myself, but one
starting point for gathering more information on it might be the
following section of the Open MPI FAQ:

https://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
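
If RoCE does turn out to be an option at your site, my understanding
from that FAQ entry is that you have to tell the openib BTL to use the
RDMA CM connection manager. Untested by me, and reusing the hostfile
and executable names from your original command, that would look
something like:

    mpirun --mca btl openib,self,sm \
           --mca btl_openib_cpc_include rdmacm \
           --hostfile hostfile5 -n 200 DoWork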

Sincerely,
Rusty Dekema
University of Michigan
Advanced Research Computing - Technology Services


On Fri, Jul 14, 2017 at 12:34 PM, Boris M. Vulovic
<boris.m.vulo...@gmail.com> wrote:
> Gus, Gilles and John,
>
> Thanks for the help. Let me first post (below) the output from checkouts of
> the IB network:
> ibdiagnet
> ibhosts
> ibstat  (for login node, for now)
>
> What do you think?
> Thanks
> --Boris
>
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibdiagnet
> ----------
> Load Plugins from:
> /usr/share/ibdiagnet2.1.1/plugins/
> (You can specify more paths to be looked in with "IBDIAGNET_PLUGINS_PATH"
> env variable)
>
> Plugin Name                                   Result     Comment
> libibdiagnet_cable_diag_plugin-2.1.1          Succeeded  Plugin loaded
> libibdiagnet_phy_diag_plugin-2.1.1            Succeeded  Plugin loaded
>
> ---------------------------------------------
> Discovery
> -E- Failed to initialize
>
> -E- Fabric Discover failed, err=IBDiag initialize wasn't done
> -E- Fabric Discover failed, MAD err=Failed to register SMI class
>
> ---------------------------------------------
> Summary
> -I- Stage                     Warnings   Errors     Comment
> -I- Discovery                                       NA
> -I- Lids Check                                      NA
> -I- Links Check                                     NA
> -I- Subnet Manager                                  NA
> -I- Port Counters                                   NA
> -I- Nodes Information                               NA
> -I- Speed / Width checks                            NA
> -I- Partition Keys                                  NA
> -I- Alias GUIDs                                     NA
> -I- Temperature Sensing                             NA
>
> -I- You can find detailed errors/warnings in:
> /var/tmp/ibdiagnet2/ibdiagnet2.log
>
> -E- A fatal error occurred, exiting...
> -bash-4.1$
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> -bash-4.1$ ibhosts
> ibwarn: [168221] mad_rpc_open_port: client_register for mgmt 1 failed
> src/ibnetdisc.c:766; can't open MAD port ((null):0)
> /usr/sbin/ibnetdiscover: iberror: failed: discover failed
> -bash-4.1$
>
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> -bash-4.1$ ibstat
> CA 'mlx5_0'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb1c
>         System image GUID: 0x248a0703005abb1c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x268a07fffe5abb1c
>                 Link layer: Ethernet
> CA 'mlx5_1'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb1d
>         System image GUID: 0x248a0703005abb1c
>         Port 1:
>                 State: Active
>                 Physical state: LinkUp
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x0000000000000000
>                 Link layer: Ethernet
> CA 'mlx5_2'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb30
>         System image GUID: 0x248a0703005abb30
>         Port 1:
>                 State: Down
>                 Physical state: Disabled
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x268a07fffe5abb30
>                 Link layer: Ethernet
> CA 'mlx5_3'
>         CA type: MT4115
>         Number of ports: 1
>         Firmware version: 12.17.2020
>         Hardware version: 0
>         Node GUID: 0x248a0703005abb31
>         System image GUID: 0x248a0703005abb30
>         Port 1:
>                 State: Down
>                 Physical state: Disabled
>                 Rate: 100
>                 Base lid: 0
>                 LMC: 0
>                 SM lid: 0
>                 Capability mask: 0x3c010000
>                 Port GUID: 0x268a07fffe5abb31
>                 Link layer: Ethernet
> -bash-4.1$
> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
> On Fri, Jul 14, 2017 at 12:37 AM, John Hearns via users
> <users@lists.open-mpi.org> wrote:
>>
>> Boris, as Gilles says - first do some lower-level checks of your
>> InfiniBand network.
>> I suggest running:
>> ibdiagnet
>> ibhosts
>> and then as Gilles says 'ibstat' on each node
>>
>>
>>
>> On 14 July 2017 at 03:58, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>> Boris,
>>>
>>>
>>> Open MPI should automatically detect the InfiniBand hardware, and use
>>> openib (and *not* tcp) for inter-node communications, and a
>>> shared-memory-optimized btl (e.g. sm or vader) for intra-node
>>> communications.
>>>
>>>
>>> note that if you use "-mca btl openib,self", you tell Open MPI to use
>>> the openib btl between all tasks, including tasks running on the same
>>> node (which is less efficient than using sm or vader).
>>>
>>>
>>> first, I suggest you make sure InfiniBand is up and running on all
>>> your nodes (just run ibstat; at least one port should be listed, its
>>> state should be Active, and all nodes should have the same SM lid).
>>>
>>>
>>> then try to run two tasks on two nodes.
>>>
>>>
>>> if this does not work, you can
>>>
>>> mpirun --mca btl_base_verbose 100 ...
>>>
>>> and post the logs so we can investigate from there.
>>>
>>>
>>> Cheers,
>>>
>>>
>>> Gilles
>>>
>>>
>>>
>>> On 7/14/2017 6:43 AM, Boris M. Vulovic wrote:
>>>>
>>>>
>>>> I would like to know how to invoke the InfiniBand hardware on a CentOS
>>>> 6.x cluster with OpenMPI (static libs.) for running my C++ code. This
>>>> is how I compile and run:
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpic++ -L/usr/local/open-mpi/1.10.7/lib
>>>> -Bstatic main.cpp -o DoWork
>>>>
>>>> /usr/local/open-mpi/1.10.7/bin/mpiexec -mca btl tcp,self --hostfile
>>>> hostfile5 -host node01,node02,node03,node04,node05 -n 200 DoWork
>>>>
>>>> Here, "*-mca btl tcp,self*" reveals that *TCP* is used, and the cluster
>>>> has InfiniBand.
>>>>
>>>> What should be changed in compiling and running commands for InfiniBand
>>>> to be invoked? If I just replace "*-mca btl tcp,self*" with "*-mca btl
>>>> openib,self*" then I get plenty of errors with relevant one saying:
>>>>
>>>> /At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications. This means that no Open MPI device has indicated that 
>>>> it
>>>> can be used to communicate between these processes. This is an error; Open
>>>> MPI requires that all MPI processes be able to reach each other. This error
>>>> can sometimes be the result of forgetting to specify the "self" BTL./
>>>>
>>>> Thanks very much!!!
>>>>
>>>>
>>>> *Boris *
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
>
>
> --
>
> Boris M. Vulovic
>
>
>