Below are the results from the ibnetdiscover command. The command was run from node smd.

#
# Topology file: generated on Fri May 19 15:59:47 2017
#
# Initiated from node 0002c903000a0a32 port 0002c903000a0a34

vendid=0x8f1
devid=0x5a5a
sysimgguid=0x8f105001094d3
switchguid=0x8f105001094d2(8f105001094d2)
Switch 36 "S-0008f105001094d2"  # "Voltaire 4036 # SWITCH-IB-1" enhanced port 0 lid 1 lmc 0
[1]   "H-0002c903000a09c2"[1](2c903000a09c3)   # "dl580 mlx4_0" lid 2 4xQDR
[2]   "H-00117500007986e4"[1](117500007986e4)  # "sm4 qib0" lid 6 4xQDR
[3]   "H-00117500007990f6"[1](117500007990f6)  # "sm3 qib0" lid 5 4xQDR
[4]   "H-0011750000797a12"[1](11750000797a12)  # "sm2 qib0" lid 4 4xDDR
[5]   "H-0011750000797a68"[1](11750000797a68)  # "sm1 qib0" lid 3 4xDDR
[36]  "H-0002c903000a0a32"[2](2c903000a0a34)   # "MT25408 ConnectX Mellanox Technologies" lid 7 4xQDR

vendid=0x1175
devid=0x7322
sysimgguid=0x11750000797a68
caguid=0x11750000797a68
Ca    1 "H-0011750000797a68"        # "sm1 qib0"
[1](11750000797a68) "S-0008f105001094d2"[5] # lid 3 lmc 0 "Voltaire 4036 # SWITCH-IB-1" lid 1 4xDDR

vendid=0x1175
devid=0x7322
sysimgguid=0x11750000797a12
caguid=0x11750000797a12
Ca    1 "H-0011750000797a12"        # "sm2 qib0"
[1](11750000797a12) "S-0008f105001094d2"[4] # lid 4 lmc 0 "Voltaire 4036 # SWITCH-IB-1" lid 1 4xDDR

vendid=0x1175
devid=0x7322
sysimgguid=0x117500007990f6
caguid=0x117500007990f6
Ca    1 "H-00117500007990f6"        # "sm3 qib0"
[1](117500007990f6) "S-0008f105001094d2"[3] # lid 5 lmc 0 "Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR

vendid=0x1175
devid=0x7322
sysimgguid=0x117500007986e4
caguid=0x117500007986e4
Ca    1 "H-00117500007986e4"        # "sm4 qib0"
[1](117500007986e4) "S-0008f105001094d2"[2] # lid 6 lmc 0 "Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR

vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000a09c5
caguid=0x2c903000a09c2
Ca    2 "H-0002c903000a09c2"        # "dl580 mlx4_0"
[1](2c903000a09c3) "S-0008f105001094d2"[1] # lid 2 lmc 0 "Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR

vendid=0x2c9
devid=0x673c
sysimgguid=0x2c903000a0a35
caguid=0x2c903000a0a32
Ca    2 "H-0002c903000a0a32"        # "MT25408 ConnectX Mellanox Technologies"
[2](2c903000a0a34)   "S-0008f105001094d2"[36]   # lid 7 lmc 0 "Voltaire 4036 # SWITCH-IB-1" lid 1 4xQDR


On 05/19/2017 03:26 AM, John Hearns via users wrote:
Allan,
remember that InfiniBand is not Ethernet. You don't NEED to set up IPoIB interfaces.

Two diagnostics please for you to run:

ibnetdiscover

ibdiagnet


Let us please have the results of    ibnetdiscover




On 19 May 2017 at 09:25, John Hearns <hear...@googlemail.com> wrote:

    Giles, Allan,

    if the host 'smd' is acting as a cluster head node it is not a
    must for it to have an Infiniband card.
    So you should be able to run jobs across the other nodes, which
    have Qlogic cards.
    I may have something mixed up here, if so I am sorry.

    If you want also to run jobs on the smd host, you should take note
    of what Giles says.
    You may be out of luck in that case.

    On 19 May 2017 at 09:15, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

        Allan,


        i just noted smd has a Mellanox card, while other nodes have
        QLogic cards.

        mtl/psm works best for QLogic while btl/openib (or mtl/mxm)
        work best for Mellanox,

        but these are not interoperable. also, i do not think
        btl/openib can be used with QLogic cards

        (please someone correct me if i am wrong)


        from the logs, i can see that smd (Mellanox) is not even able
        to use the infiniband port.

        if you run with 2 MPI tasks, both run on smd and hence
        btl/vader is used, that is why it works

        if you run with more than 2 MPI tasks, then smd and other
        nodes are used, and every MPI task falls back to btl/tcp
        for inter-node communication.

        
        [smd][[41971,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.1.196 failed: No route to host (113)

        this usually indicates a firewall, but since both ssh and
        oob/tcp are fine, this puzzles me.


        what if you

        mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 192.168.1.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self ring

        that should work with no error messages, and then you can try
        with 12 MPI tasks

        (note internode MPI communications will use tcp only)


        if you want optimal performance, i am afraid you cannot run
        any MPI task on smd (so mtl/psm can be used )

        (btw, make sure PSM support was built in Open MPI)

        a suboptimal option is to force MPI communications on IPoIB with

        /* make sure all nodes can ping each other via IPoIB first */

        mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self



        Cheers,


        Gilles


        On 5/19/2017 3:50 PM, Allan Overstreet wrote:

            Gilles,

            On which node is mpirun invoked ?

                The mpirun command was invoked on node smd.

            Are you running from a batch manager?

                No.

            Is there any firewall running on your nodes ?

                No. CentOS Minimal does not have a firewall installed,
            and Ubuntu MATE's firewall is disabled.

            All three of your commands appear to have run
            successfully. The outputs of the three commands are attached.

            mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true &> cmd1

            mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true &> cmd2

            mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 ring &> cmd3

            If I increase the number of processes in the ring
            program, mpirun does not succeed.

            mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 ring &> cmd4


            On 05/19/2017 02:18 AM, Gilles Gouaillardet wrote:

                Allan,


                - on which node is mpirun invoked ?

                - are you running from a batch manager ?

                - is there any firewall running on your nodes ?


                the error is likely occurring when wiring up mpirun/orted

                what if you

                mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true

                then (if the previous command worked)

                mpirun -np 12 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 true

                and finally (if both previous commands worked)

                mpirun -np 2 --hostfile nodes --mca oob_tcp_if_include 192.168.1.0/24 --mca oob_base_verbose 100 ring


                Cheers,

                Gilles

                On 5/19/2017 3:07 PM, Allan Overstreet wrote:

                    I am experiencing many different errors with Open MPI
                    version 2.1.1. I have had a suspicion that this
                    might be related to the way the servers are
                    connected and configured. Regardless, below is a
                    diagram of how the servers are configured.

                    All six hosts hang off both a Gb Ethernet switch
                    (bonded dual 1Gb, Bond0) and a Voltaire 4036 QDR
                    InfiniBand switch:

                    HOST   BOND0 IP       INFINIBAND CARD   IB0 IP     OS
                    smd    192.168.1.200  MHQH29B-XTR       10.1.0.1   Ubuntu MATE
                    sm1    192.168.1.196  QLogic QLE7340    10.1.0.2   CentOS 7 Minimal
                    sm2    192.168.1.199  QLogic QLE7340    10.1.0.3   CentOS 7 Minimal
                    sm3    192.168.1.203  QLogic QLE7340    10.1.0.4   CentOS 7 Minimal
                    sm4    192.168.1.204  QLogic QLE7340    10.1.0.5   CentOS 7 Minimal
                    dl580  192.168.1.201  QLogic QLE7340    10.1.0.6   CentOS 7 Minimal

                    I have ensured that the InfiniBand adapters can
                    ping each other and that every node can ssh into
                    every other node without a password. Every node
                    has the same /etc/hosts file,

                    cat /etc/hosts

                    127.0.0.1    localhost
                    192.168.1.200    smd
                    192.168.1.196    sm1
                    192.168.1.199    sm2
                    192.168.1.203    sm3
                    192.168.1.204    sm4
                    192.168.1.201    dl580

                    10.1.0.1    smd-ib
                    10.1.0.2    sm1-ib
                    10.1.0.3    sm2-ib
                    10.1.0.4    sm3-ib
                    10.1.0.5    sm4-ib
                    10.1.0.6    dl580-ib

                    I have been using a simple ring test program to
                    test openmpi. The code for this program is attached.

                    The hostfile used in all the commands is,

                    cat ./nodes

                    smd slots=2
                    sm1 slots=2
                    sm2 slots=2
                    sm3 slots=2
                    sm4 slots=2
                    dl580 slots=2

                    When running the following command on smd,

                    mpirun -mca btl openib,self -np 2 --hostfile nodes ./ring

                    I obtain the following error,

                    ------------------------------------------------------------
                    A process or daemon was unable to complete a TCP
                    connection
                    to another process:
                      Local host:    sm1
                      Remote host:   192.168.1.200
                    This is usually caused by a firewall on the remote
                    host. Please
                    check that any firewall (e.g., iptables) has been
                    disabled and
                    try again.
                    ------------------------------------------------------------
                    
                    --------------------------------------------------------------------------

                    No OpenFabrics connection schemes reported that
                    they were able to be
                    used on a specific port.  As such, the openib BTL
                    (OpenFabrics
                    support) will be disabled for this port.

                      Local host:           smd
                      Local device:         mlx4_0
                      Local port:           1
                      CPCs attempted:       rdmacm, udcm
                    
                    --------------------------------------------------------------------------

                    Process 1 received token -1 from process 0
                    Process 0 received token -1 from process 1
                    [smd:12800] 1 more process has sent help message
                    help-mpi-btl-openib-cpc-base.txt / no cpcs for port
                    [smd:12800] Set MCA parameter
                    "orte_base_help_aggregate" to 0 to see all help /
                    error messages

                    When increasing the number of processes, no
                    program output is produced.

                    mpirun -mca btl openib,self -np 4 --hostfile nodes ./ring
                    ------------------------------------------------------------
                    A process or daemon was unable to complete a TCP
                    connection
                    to another process:
                      Local host:    sm2
                      Remote host:   192.168.1.200
                    This is usually caused by a firewall on the remote
                    host. Please
                    check that any firewall (e.g., iptables) has been
                    disabled and
                    try again.
                    ------------------------------------------------------------
                    *** An error occurred in MPI_Init
                    *** on a NULL communicator
                    *** MPI_ERRORS_ARE_FATAL (processes in this
                    communicator will now abort,
                    ***    and potentially your MPI job)
                    *** An error occurred in MPI_Init
                    *** on a NULL communicator
                    *** MPI_ERRORS_ARE_FATAL (processes in this
                    communicator will now abort,
                    ***    and potentially your MPI job)
                    
                    --------------------------------------------------------------------------

                    A requested component was not found, or was unable
                    to be opened. This
                    means that this component is either not installed
                    or is unable to be
                    used on your system (e.g., sometimes this means
                    that shared libraries
                    that the component requires are unable to be
                    found/loaded). Note that
                    Open MPI stopped checking at the first component
                    that it did not find.

                    Host:      sm1.overst.local
                    Framework: btl
                    Component: openib
                    
                    --------------------------------------------------------------------------

                    --------------------------------------------------------------------------

                    It looks like MPI_INIT failed for some reason;
                    your parallel process is
                    likely to abort.  There are many reasons that a
                    parallel process can
                    fail during MPI_INIT; some of which are due to
                    configuration or environment
                    problems.  This failure appears to be an internal
                    failure; here's some
                    additional information (which may only be relevant
                    to an Open MPI
                    developer):

                      mca_bml_base_open() failed
                      --> Returned "Not found" (-13) instead of
                    "Success" (0)
                    
                    --------------------------------------------------------------------------

                    --------------------------------------------------------------------------

                    No OpenFabrics connection schemes reported that
                    they were able to be
                    used on a specific port.  As such, the openib BTL
                    (OpenFabrics
                    support) will be disabled for this port.

                      Local host:           smd
                      Local device:         mlx4_0
                      Local port:           1
                      CPCs attempted:       rdmacm, udcm
                    
                    --------------------------------------------------------------------------

                    [smd:12953] 1 more process has sent help message
                    help-mca-base.txt / find-available:not-valid
                    [smd:12953] Set MCA parameter
                    "orte_base_help_aggregate" to 0 to see all help /
                    error messages
                    [smd:12953] 1 more process has sent help message
                    help-mpi-runtime.txt /
                    mpi_init:startup:internal-failure
                    [smd:12953] 1 more process has sent help message
                    help-mpi-btl-openib-cpc-base.txt / no cpcs for port

                    Running mpirun from other nodes does not resolve
                    the issue. I have checked that none of the nodes
                    is running a firewall that would be blocking tcp
                    connections.

                    The error with the mlx4_0 adapter is expected, as
                    that device is used as a 10Gb Ethernet adapter to
                    another network. The InfiniBand adapter on smd
                    that is being used for QDR InfiniBand is mlx4_1.

                    Any help would be appreciated.

                    Sincerely,

                    Allan Overstreet



                    _______________________________________________
                    users mailing list
                    users@lists.open-mpi.org
                    https://rfd.newmexicoconsortium.org/mailman/listinfo/users








