I am new to Open MPI and have a problem that I cannot seem to solve:
I am trying to run the hello_c example and cannot get it to work.
I compiled Open MPI with:
./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6 --with-openib
The hostfile contains the local host (max14) and one other node (max15).
When I run it I get:
[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
[max14:31465] [0,0,0] accepting connections via event library
[max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1] accepting connections via event library
[max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
[max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
[max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
[max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
Daemon [0,0,1] checking in as pid 31466 on host max14
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
[max15:28222] OOB: Connection to HNP lost
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[max14:31465] ERROR: A daemon on node max15 failed to start as expected.
[max14:31465] ERROR: There may be more information available from
[max14:31465] ERROR: the remote shell (see above).
[max14:31465] ERROR: The daemon exited unexpectedly with status 1.
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received exit
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
[max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons for this job.
Returned value Timeout instead of ORTE_SUCCESS.
--------------------------------------------------------------------------
I can see that the orted daemon is starting on both computers, but it
looks to me like the daemons can't talk to each other: every attempt by
the daemon on max15 to connect to 192.168.1.14:38852 fails.
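To make that pattern easier to see, here is a rough Python sketch that counts the mca_oob_tcp_peer_try_connect entries per host and target address. It is not part of Open MPI; the function name summarize_connect_attempts and the SAMPLE text are my own, with the entries copied from the output above. It assumes the entries keep the wording shown in this log, and it collapses whitespace first so that wrapped lines still match.

```python
import re
from collections import defaultdict

# Matches the two forms of mca_oob_tcp_peer_try_connect entries seen in the log:
#   "... connecting port 55152 to: 192.168.2.14:38852"
#   "... connect to 192.168.1.14:38852 failed ..."
ATTEMPT = re.compile(
    r"\[(?P<host>[\w.-]+):\d+\] \[\d+,\d+,\d+\]-\[\d+,\d+,\d+\] "
    r"mca_oob_tcp_peer_try_connect: "
    r"(?:connecting port \d+ to:|connect to) "
    r"(?P<addr>\d+\.\d+\.\d+\.\d+):(?P<port>\d+)"
    r"(?P<failed> failed)?"
)


def summarize_connect_attempts(log_text):
    """Return {(host, address): {"attempted": n, "failed": m}}.

    "attempted" counts entries that only report a connection being started;
    "failed" counts entries that report a failure.
    """
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    summary = defaultdict(lambda: {"attempted": 0, "failed": 0})
    for m in ATTEMPT.finditer(flat):
        key = (m.group("host"), m.group("addr"))
        summary[key]["failed" if m.group("failed") else "attempted"] += 1
    return dict(summary)


# Entries copied from the mpirun output above (wrapped as in the mail).
SAMPLE = """\
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152
to: 192.168.2.14:38852
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
to
192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect
to
192.168.1.14:38852 failed, connecting over all interfaces failed!
"""

if __name__ == "__main__":
    for (host, addr), counts in sorted(summarize_connect_attempts(SAMPLE).items()):
        print(host, addr, counts)
```

On these entries the local daemon on max14 starts its connection to 192.168.2.14, while all three failure entries from max15 are for 192.168.1.14.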
Here is the output from ifconfig on one of the nodes; the other node
is similar.
[root@max14 ~]# /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr 00:17:31:9C:93:A1
inet addr:192.168.2.14 Bcast:192.168.2.255 Mask:255.255.255.0
inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:188125 (183.7 KiB) TX bytes:1500567 (1.4 MiB)
Interrupt:17
eth1 Link encap:Ethernet HWaddr 00:17:31:9C:93:A2
inet addr:192.168.1.14 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
TX packets:49368158 errors:0 dropped:0 overruns:0