On Fri, 17 Jan 2014 19:24:50 -0800, Ralph Castain <r...@open-mpi.org> wrote:

> The most common cause of this problem is a firewall between the
> nodes - you can ssh across, but not communicate. Have you checked
> to see that the firewall is turned off?
Turns out some iptables rules (typical on our clusters) were active.
They are now turned off for continued testing, as suggested. I have
rerun the mpi_test code, this time using a debug-enabled build of
openmpi/1.6.5, still with the Intel compiler.
As shown below, the problem is still there; I'm including some gdb
output this time. The job succeeds when using only eth0 over 1G, but
hangs almost immediately when the eth2 10G interface is included.
Any further suggestions would be greatly appreciated.
[roberpj@bro127:~/samples/mpi_test] mpicc -g mpi_test.c
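For reference, mpi_test.c itself is not attached; the following is only a
rough sketch of the kind of ring-style exchange it appears to perform,
reconstructed from the program output and the gdb backtraces further below.
The names A, M, procs, myid and msgid are taken from those backtraces; the
message size and printf wording are assumptions.

/* Rough sketch only -- not the actual mpi_test.c.  Each rank sends an
 * array of M doubles around a ring, repeated three times. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define M 100000   /* assumed number of doubles per message */

int main(int argc, char *argv[])
{
    int myid, procs, run, len, msgid = 0;
    char node[MPI_MAX_PROCESSOR_NAME];
    double *A;
    MPI_Status stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &procs);
    MPI_Get_processor_name(node, &len);

    A = malloc(M * sizeof(double));

    if (myid == 0) {
        printf("Number of processes = %d\n", procs);
        printf("Test repeated 3 times for reliability\n");
    }
    printf("I am process %d on node %s\n", myid, node);

    for (run = 1; run <= 3; run++) {
        if (myid == 0) {
            printf("Run %d of 3\n", run);
            printf("P0: Sending to P1\n");
            MPI_Send(&A[0], M, MPI_DOUBLE, 1, msgid, MPI_COMM_WORLD);
            printf("P0: Waiting to receive from P%d\n", procs - 1);
            /* corresponds to the MPI_Recv at mpi_test.c:72 in the gdb output */
            MPI_Recv(&A[0], M, MPI_DOUBLE, procs - 1, msgid,
                     MPI_COMM_WORLD, &stat);
            printf("P0: Received from P%d\n", procs - 1);
        } else {
            printf("P%d: Waiting to receive from P%d\n", myid, myid - 1);
            /* corresponds to the MPI_Recv at mpi_test.c:76 in the gdb output */
            MPI_Recv(&A[0], M, MPI_DOUBLE, myid - 1, msgid,
                     MPI_COMM_WORLD, &stat);
            printf("P%d: Sending to P%d\n", myid, (myid + 1) % procs);
            MPI_Send(&A[0], M, MPI_DOUBLE, (myid + 1) % procs, msgid,
                     MPI_COMM_WORLD);
        }
    }
    printf("P%d: Done\n", myid);

    free(A);
    MPI_Finalize();
    return 0;
}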
o Using eth0 only:
[roberpj@bro127:~/samples/mpi_test]
/opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self \
    --mca btl_tcp_if_include eth0 --host bro127,bro128 ./a.out
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from to P0
P0: Received from to P1
Run 2 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P1: Sending to to P0
P1: Waiting to receive from to P0
P0: Received from to P1
Run 3 of 3
P0: Sending to P1
P0: Waiting to receive from P1
P1: Sending to to P0
P1: Waiting to receive from to P0
P1: Sending to to P0
P1: Done
P0: Received from to P1
P0: Done
o Using eth0,eth2:
[roberpj@bro127:~/samples/mpi_test]
/opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self \
    --mca btl_tcp_if_include eth0,eth2 --host bro127,bro128 ./a.out
Number of processes = 2
Test repeated 3 times for reliability
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from to P0
^Cmpirun: killing job...
o Using eth0,eth2 with verbosity:
[roberpj@bro127:~/samples/mpi_test]
/opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self \
    --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 100 \
    --host bro127,bro128 ./a.out
[bro127:20157] mca: base: components_open: Looking for btl components
[bro127:20157] mca: base: components_open: opening btl components
[bro127:20157] mca: base: components_open: found loaded component self
[bro127:20157] mca: base: components_open: component self has no register function
[bro127:20157] mca: base: components_open: component self open function successful
[bro127:20157] mca: base: components_open: found loaded component sm
[bro127:20157] mca: base: components_open: component sm has no register function
[bro128:23354] mca: base: components_open: Looking for btl components
[bro127:20157] mca: base: components_open: component sm open function successful
[bro127:20157] mca: base: components_open: found loaded component tcp
[bro127:20157] mca: base: components_open: component tcp register function successful
[bro127:20157] mca: base: components_open: component tcp open function successful
[bro128:23354] mca: base: components_open: opening btl components
[bro128:23354] mca: base: components_open: found loaded component self
[bro128:23354] mca: base: components_open: component self has no register function
[bro128:23354] mca: base: components_open: component self open function successful
[bro128:23354] mca: base: components_open: found loaded component sm
[bro128:23354] mca: base: components_open: component sm has no register function
[bro128:23354] mca: base: components_open: component sm open function successful
[bro128:23354] mca: base: components_open: found loaded component tcp
[bro128:23354] mca: base: components_open: component tcp register function successful
[bro128:23354] mca: base: components_open: component tcp open function successful
[bro127:20157] select: initializing btl component self
[bro127:20157] select: init of component self returned success
[bro127:20157] select: initializing btl component sm
[bro127:20157] select: init of component sm returned success
[bro127:20157] select: initializing btl component tcp
[bro127:20157] select: init of component tcp returned success
[bro128:23354] select: initializing btl component self
[bro128:23354] select: init of component self returned success
[bro128:23354] select: initializing btl component sm
[bro128:23354] select: init of component sm returned success
[bro128:23354] select: initializing btl component tcp
[bro128:23354] select: init of component tcp returned success
[bro127:20157] btl: tcp: attempting to connect() to address 10.27.2.128 on port 4
Number of processes = 2
Test repeated 3 times for reliability
[bro128:23354] btl: tcp: attempting to connect() to address 10.27.2.127 on port 4
I am process 0 on node bro127
Run 1 of 3
P0: Sending to P1
[bro127:20157] btl: tcp: attempting to connect() to address 10.29.4.128 on port 4
P0: Waiting to receive from P1
I am process 1 on node bro128
P1: Waiting to receive from to P0
[bro127][[9184,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
^C mpirun: killing job...
o Master node bro127 debugging info:
[roberpj@bro127:~] gdb -p 21067
(gdb) bt
#0 0x00002ac7ae4a86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00002ac7acc3dedc in epoll_dispatch (base=0x3, arg=0x1916850,
tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
#2 0x00002ac7acc3f276 in opal_event_base_loop (base=0x3, flags=26306640)
at ../../../../openmpi-1.6.5/opal/event/event.c:838
#3 0x00002ac7acc3f122 in opal_event_loop (flags=3) at
../../../../openmpi-1.6.5/opal/event/event.c:766
#4 0x00002ac7acc82c14 in opal_progress () at
../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
#5 0x00002ac7b21a8c40 in mca_pml_ob1_recv (addr=0x3, count=26306640,
datatype=0x20, src=-1, tag=0, comm=0x80000, status=0x7fff15ad5f38)
    at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
#6 0x00002ac7acb830f7 in PMPI_Recv (buf=0x3, count=26306640, type=0x20,
source=-1, tag=0, comm=0x80000, status=0x4026e0) at precv.c:78
#7 0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at
mpi_test.c:72
(gdb) frame 7
#7 0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at
mpi_test.c:72
72 MPI_Recv(&A[0], M, MPI_DOUBLE, procs-1, msgid,
MPI_COMM_WORLD, &stat);
(gdb)
Confirming that iptables is off on bro127:
[root@bro127:~] iptables --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
o Slave node bro128 debugging info:
[roberpj@bro128:~] top -u roberpj
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24334 roberpj 20 0 115m 5208 3216 R 100.0 0.0 2:32.12 a.out
[roberpj@bro128:~] gdb -p 24334
(gdb) bt
#0 0x00002b7475cc86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x00002b747445dedc in epoll_dispatch (base=0x3, arg=0x9b6850, tv=0x20)
at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
#2 0x00002b747445f276 in opal_event_base_loop (base=0x3, flags=10184784)
at ../../../../openmpi-1.6.5/opal/event/event.c:838
#3 0x00002b747445f122 in opal_event_loop (flags=3) at
../../../../openmpi-1.6.5/opal/event/event.c:766
#4 0x00002b74744a2c14 in opal_progress () at
../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
#5 0x00002b74799c8c40 in mca_pml_ob1_recv (addr=0x3, count=10184784,
datatype=0x20, src=-1, tag=10899040, comm=0x0, status=0x7fff1ce5e778)
    at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
#6 0x00002b74743a30f7 in PMPI_Recv (buf=0x3, count=10184784, type=0x20,
source=-1, tag=10899040, comm=0x0, status=0x4026e0) at precv.c:78
#7 0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at
mpi_test.c:76
(gdb) frame 7
#7 0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at
mpi_test.c:76
76 MPI_Recv(&A[0], M, MPI_DOUBLE, myid-1, msgid,
MPI_COMM_WORLD, &stat);
(gdb)
Confirming that iptables is off on bro128:
[root@bro128:~] iptables --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination