Date: Fri, 17 Jan 2014 19:24:50 -0800
From: Ralph Castain <r...@open-mpi.org>

The most common cause of this problem is a firewall between the
nodes - you can ssh across, but not communicate. Have you checked
to see that the firewall is turned off?

Turns out some iptables rules (typical on our clusters) were active.
They are now turned off for continued testing as suggested. I have
rerun the mpi_test code, this time using a debug-enabled build of
openmpi/1.6.5, again built with the Intel compiler.

As shown below, the problem is still present. I'm including some gdb
output this time. The job succeeds using only eth0 over 1g but hangs
almost immediately when the eth2 10g interface is included. Any
further suggestions would be greatly appreciated.
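As a sanity check independent of MPI, I also probed plain TCP reachability over each interface with a small script. (The addresses below are the eth0/eth2 addresses of bro128 from the verbose log further down; port 22 is just a convenient listening service, since Open MPI itself uses ephemeral ports.)

```python
import socket

def tcp_reachable(host, port, timeout=3.0):
    """Attempt a plain TCP connect to host:port; True on success."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder probes against bro128's two interfaces:
# print(tcp_reachable("10.27.2.128", 22))  # eth0 (1g) path
# print(tcp_reachable("10.29.4.128", 22))  # eth2 (10g) path
```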

[roberpj@bro127:~/samples/mpi_test] mpicc -g mpi_test.c

o Using eth0 only:

[roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 --host bro127,bro128 ./a.out
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from P0
P0: Received from P1
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P1: Sending to P0
P1: Waiting to receive from P0
P0: Received from P1
Run 3 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P1: Sending to P0
P1: Waiting to receive from P0
P1: Sending to P0
P1: Done
P0: Received from P1
P0: Done

o Using eth0,eth2:

[roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --host bro127,bro128 ./a.out
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from P0
^Cmpirun: killing job...
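A further isolation test I can still run is to restrict the include list to the 10g interface alone, to see whether eth2 passes any MPI traffic at all; same flags as above, only the interface list changes:

```shell
# Follow-up (not yet run): force the TCP BTL onto eth2 only
/opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 \
    --mca btl tcp,sm,self --mca btl_tcp_if_include eth2 \
    --host bro127,bro128 ./a.out
```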

o Using eth0,eth2 with verbosity:

[roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 100 --host bro127,bro128 ./a.out
[bro127:20157] mca: base: components_open: Looking for btl components
[bro127:20157] mca: base: components_open: opening btl components
[bro127:20157] mca: base: components_open: found loaded component self
[bro127:20157] mca: base: components_open: component self has no register function
[bro127:20157] mca: base: components_open: component self open function successful
[bro127:20157] mca: base: components_open: found loaded component sm
[bro127:20157] mca: base: components_open: component sm has no register function
[bro128:23354] mca: base: components_open: Looking for btl components
[bro127:20157] mca: base: components_open: component sm open function successful
[bro127:20157] mca: base: components_open: found loaded component tcp
[bro127:20157] mca: base: components_open: component tcp register function successful
[bro127:20157] mca: base: components_open: component tcp open function successful
[bro128:23354] mca: base: components_open: opening btl components
[bro128:23354] mca: base: components_open: found loaded component self
[bro128:23354] mca: base: components_open: component self has no register function
[bro128:23354] mca: base: components_open: component self open function successful
[bro128:23354] mca: base: components_open: found loaded component sm
[bro128:23354] mca: base: components_open: component sm has no register function
[bro128:23354] mca: base: components_open: component sm open function successful
[bro128:23354] mca: base: components_open: found loaded component tcp
[bro128:23354] mca: base: components_open: component tcp register function successful
[bro128:23354] mca: base: components_open: component tcp open function successful
[bro127:20157] select: initializing btl component self
[bro127:20157] select: init of component self returned success
[bro127:20157] select: initializing btl component sm
[bro127:20157] select: init of component sm returned success
[bro127:20157] select: initializing btl component tcp
[bro127:20157] select: init of component tcp returned success
[bro128:23354] select: initializing btl component self
[bro128:23354] select: init of component self returned success
[bro128:23354] select: initializing btl component sm
[bro128:23354] select: init of component sm returned success
[bro128:23354] select: initializing btl component tcp
[bro128:23354] select: init of component tcp returned success
[bro127:20157] btl: tcp: attempting to connect() to address 10.27.2.128 on port 4
Number of processes = 2
Test repeated 3 times for reliability
[bro128:23354] btl: tcp: attempting to connect() to address 10.27.2.127 on port 4
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
[bro127:20157] btl: tcp: attempting to connect() to address 10.29.4.128 on port 4
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from to P0
[bro127][[9184,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
^C mpirun: killing job...
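For what it's worth, errno 110 in that readv failure is ETIMEDOUT on Linux, i.e. the kernel gave up retransmitting on the already-established eth2 connection, which looks consistent with packets being silently dropped on the 10g path. A quick check:

```python
import errno
import os

# On Linux, errno 110 is ETIMEDOUT ("Connection timed out").
print(errno.errorcode[110])  # symbolic name for errno 110
print(os.strerror(110))      # human-readable message
```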

o Master node bro127 debugging info:

[roberpj@bro127:~] gdb -p 21067
(gdb) bt
#0  0x00002ac7ae4a86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00002ac7acc3dedc in epoll_dispatch (base=0x3, arg=0x1916850, tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
#2  0x00002ac7acc3f276 in opal_event_base_loop (base=0x3, flags=26306640) at ../../../../openmpi-1.6.5/opal/event/event.c:838
#3  0x00002ac7acc3f122 in opal_event_loop (flags=3) at ../../../../openmpi-1.6.5/opal/event/event.c:766
#4  0x00002ac7acc82c14 in opal_progress () at ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
#5  0x00002ac7b21a8c40 in mca_pml_ob1_recv (addr=0x3, count=26306640, datatype=0x20, src=-1, tag=0, comm=0x80000, status=0x7fff15ad5f38) at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
#6  0x00002ac7acb830f7 in PMPI_Recv (buf=0x3, count=26306640, type=0x20, source=-1, tag=0, comm=0x80000, status=0x4026e0) at precv.c:78
#7  0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
(gdb) frame 7
#7  0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
72              MPI_Recv(&A[0], M, MPI_DOUBLE, procs-1, msgid, MPI_COMM_WORLD, &stat);
(gdb)

confirming ...
[root@bro127:~] iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

o Slave node bro128 debugging info:

[roberpj@bro128:~]  top -u roberpj
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24334 roberpj   20   0  115m 5208 3216 R 100.0  0.0   2:32.12 a.out

[roberpj@bro128:~] gdb -p 24334
(gdb) bt
#0  0x00002b7475cc86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00002b747445dedc in epoll_dispatch (base=0x3, arg=0x9b6850, tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
#2  0x00002b747445f276 in opal_event_base_loop (base=0x3, flags=10184784) at ../../../../openmpi-1.6.5/opal/event/event.c:838
#3  0x00002b747445f122 in opal_event_loop (flags=3) at ../../../../openmpi-1.6.5/opal/event/event.c:766
#4  0x00002b74744a2c14 in opal_progress () at ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
#5  0x00002b74799c8c40 in mca_pml_ob1_recv (addr=0x3, count=10184784, datatype=0x20, src=-1, tag=10899040, comm=0x0, status=0x7fff1ce5e778) at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
#6  0x00002b74743a30f7 in PMPI_Recv (buf=0x3, count=10184784, type=0x20, source=-1, tag=10899040, comm=0x0, status=0x4026e0) at precv.c:78
#7  0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
(gdb) frame 7
#7  0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
76              MPI_Recv(&A[0], M, MPI_DOUBLE, myid-1, msgid, MPI_COMM_WORLD, &stat);
(gdb)

confirming ...
[root@bro128:~] iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
