Allan,

the "No route to host" error indicates there is something going wrong with IPoIB on your cluster

(and Open MPI is not involved whatsoever in that)

on sm3 and sm4, you can run

/sbin/ifconfig

brctl show

iptables -L

iptables -t nat -L

We might be able to figure out what is going wrong from that output.
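
Also, since the compute nodes run CentOS 7, firewalld may be active even if no iptables rules were ever configured by hand. If firewalld is installed, something like

systemctl status firewalld

firewall-cmd --state

firewall-cmd --list-all

should show whether it is running and which services/ports it allows on each interface.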


If there is no mca_btl_openib.so component, the InfiniBand headers were likely not available on the node where you compiled Open MPI.

If you configure Open MPI with

--with-verbs

configure should abort if the headers are not found.

In that case, simply install the headers and rebuild Open MPI.
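
For example, on a CentOS 7 node the verbs headers usually come from the libibverbs-devel package, so a rough sketch (the package name and the --prefix are assumptions based on the path you posted) would be

sudo yum install libibverbs-devel

./configure --prefix=/home/allan/software/openmpi/install --with-verbs
make -j 4 all
make install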

If you are unsure about that part, please compress and post your config.log so we can have a look at it.
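
For example

gzip -c config.log > config.log.gz

and attach the resulting config.log.gz to your reply.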


Cheers,


gilles


On 5/29/2017 1:03 PM, Allan Overstreet wrote:
Gilles,

I was able to ping sm4 from sm3 and sm3 from sm4. However, running netcat between sm4 and sm3 with the following commands fails:

[allan@sm4 ~]$ nc -l 1234

and

[allan@sm3 ~]$ echo hello | nc 10.1.0.5 1234
Ncat: No route to host.

Testing this between other nodes gives the same result:

[allan@sm2 ~]$ nc -l 1234

and

[allan@sm1 ~]$ echo hello | nc 10.1.0.3 1234
Ncat: No route to host.

These nodes do not have firewalls installed, so I am confused why this traffic isn't getting through.

I am compiling Open MPI from source, and the shared library /home/allan/software/openmpi/install/lib/openmpi/mca_btl_openib.so does not exist.


On 05/27/2017 11:25 AM, gil...@rist.or.jp wrote:
Allan,

About IPoIB, the error message (No route to host) is very puzzling.
Did you double-check that IPoIB is working between all nodes?
The error message suggests IPoIB is not working between sm3 and sm4;
this could be caused by the subnet manager or by a firewall.
ping is the first tool you should use to test that; then you can use nc
(netcat).
For example, on sm4:
nc -l 1234
and on sm3:
echo hello | nc 10.1.0.5 1234
(expected result: "hello" should be displayed on sm4)

About openib, you first need to double-check that the btl/openib component was built.
Assuming you did not configure with --disable-dlopen, you should have a
mca_btl_openib.so
file in /.../lib/openmpi. It should be accessible by the user, and
ldd /.../lib/openmpi/mca_btl_openib.so
should not have any unresolved dependencies on *all* your nodes.
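A quick way to check that on every node at once (assuming passwordless ssh and the same install prefix everywhere; the prefix below is only an example, substitute yours) is something like

PREFIX=/home/allan/software/openmpi/install
for h in sm1 sm2 sm3 sm4; do
    echo "== $h =="
    # no output from grep means all dependencies were resolved on that node
    ssh "$h" "ldd $PREFIX/lib/openmpi/mca_btl_openib.so | grep 'not found'"
done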

Cheers,

Gilles

----- Original Message -----
I have been having some issues using Open MPI with TCP over IPoIB and
with openib. The problems arise when I run a program that uses basic
collective communication. The two programs that I have been using are
attached.

*** IPoIB ***

The mpirun command I am using to run MPI over IPoIB is

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_include 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes -np 8 ./avg 8000

This program appears to run on the nodes, but sits at 100% CPU and
uses no memory. On the host node, the following error is printed:

[sm1][[58411,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.3 failed: No route to host (113)

Using another program,

mpirun --mca oob_tcp_if_include 192.168.1.0/24 --mca btl_tcp_if_include 10.1.0.0/24 --mca pml ob1 --mca btl tcp,sm,vader,self -hostfile nodes -np 8 ./congrad 800

produces the following result. This program will also run on the nodes sm1, sm2, sm3, and sm4 at 100% CPU and use no memory.
[sm3][[61383,1],4][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],6][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],3][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.2 failed: No route to host (113)
[sm3][[61383,1],5][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.5 failed: No route to host (113)
[sm4][[61383,1],7][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.4 failed: No route to host (113)
[sm2][[61383,1],2][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.2 failed: No route to host (113)
[sm1][[61383,1],0][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.3 failed: No route to host (113)
[sm1][[61383,1],1][btl_tcp_endpoint.c:803:mca_btl_tcp_endpoint_complete_connect] connect() to 10.1.0.3 failed: No route to host (113)

*** openib ***

Running the ./avg program over openib,

mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include 10.1.0.0/24 -hostfile nodes -np 8 ./avg 800

produces the following result:
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host:      sm2.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
    mca_bml_base_open() failed
    --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() to socket 29 failed: Broken pipe (32)
[sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],1] usock_peer_accept: usock_peer_send_connect_ack failed
[sm1.overst.local:32239] [[57506,0],1] usock_peer_send_blocking: send() to socket 27 failed: Broken pipe (32)
[sm1.overst.local:32239] [[57506,0],1] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32239] [[57506,0],1]-[[57506,1],0] usock_peer_accept: usock_peer_send_connect_ack failed
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[smd:31760] 4 more processes have sent help message help-mca-base.txt / find-available:not-valid
[smd:31760] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[smd:31760] 4 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
=== Later errors printed out on the host node ===
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
    Local host:    sm3
    Remote host:   10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
    Local host:    sm1
    Remote host:   10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
    Local host:    sm2
    Remote host:   10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
    Local host:    sm4
    Remote host:   10.1.0.1
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
The ./avg process was not created on any of the nodes.
Running the ./congrad program,

mpirun --mca btl self,sm,openib --mca mtl ^psm --mca btl_tcp_if_include 10.1.0.0/24 -hostfile nodes -np 8 ./congrad 800

results in the following errors:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host:      sm3.overst.local
Framework: btl
Component: openib
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
    mca_bml_base_open() failed
    --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send() to socket 29 failed: Broken pipe (32)
[sm1.overst.local:32271] [[57834,0],1] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32271] [[57834,0],1]-[[57834,1],0] usock_peer_accept: usock_peer_send_connect_ack failed
[sm1.overst.local:32271] [[57834,0],1] usock_peer_send_blocking: send() to socket 27 failed: Broken pipe (32)
[sm1.overst.local:32271] [[57834,0],1] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 316
[sm1.overst.local:32271] [[57834,0],1]-[[57834,1],1] usock_peer_accept: usock_peer_send_connect_ack failed
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[smd:32088] 5 more processes have sent help message help-mca-base.txt / find-available:not-valid
[smd:32088] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[smd:32088] 5 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure

These same mpirun commands run successfully with a test program that
uses only point-to-point communication.

The nodes are interconnected in the following way: every host's Bond0 interface (dual 1Gb Ethernet, bonded) connects to a Gb Ethernet switch, and every host's Ib0 interface connects to a Voltaire 4036 QDR switch.

Host    Bond0 IP        Infiniband Card   Ib0 IP     OS
smd     192.168.1.200   MHQH29B-XTR       10.1.0.1   Ubuntu Mate
sm1     192.168.1.196   QLOGIC QLE7340    10.1.0.2   CentOS 7 Minimal
sm2     192.168.1.199   QLOGIC QLE7340    10.1.0.3   CentOS 7 Minimal
sm3     192.168.1.203   QLOGIC QLE7340    10.1.0.4   CentOS 7 Minimal
sm4     192.168.1.204   QLOGIC QLE7340    10.1.0.5   CentOS 7 Minimal
dl580   192.168.1.201   QLOGIC QLE7340    10.1.0.6   CentOS 7 Minimal

Thanks for the help again.

Sincerely,

Allan Overstreet



_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
