If you wanted it to use eth1, your other option would be simply to tell it to do so using the MCA params. I believe it is something like

  -mca oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0
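For example, dropping those into the mpirun command line from your output (just a sketch on my part, not verified against your setup):

  /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons \
      -mca oob_tcp_debug 1000 -mca oob_tcp_if_include eth1 \
      -mca oob_tcp_if_exclude eth0 -machinefile hostfile -np 2 hello_c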
You may only need the exclude param, since you only have the two interfaces.

Ralph


On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

> Interesting. Would you mind sharing your patch?
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Bob Soliday
> Sent: Thursday, November 29, 2007 11:35 AM
> To: Ralph H Castain
> Cc: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
>
> I solved the problem by making a change to
> orte/mca/oob/tcp/oob_tcp_peer.c
>
> On Linux 2.6 I have read that after a failed connect system call, the
> next call to connect can immediately return ECONNABORTED and not
> actually try to connect; the call after that will then work. So I
> changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then
> call connect again (a generic sketch of this retry pattern is appended
> at the end of this message). The hello_c example is now working.
>
> I don't think this has solved the underlying cause as to why connect is
> failing in the first place, but at least now I can move on to the next
> step. My best guess at the moment is that it is using eth0 initially
> when I want it to use eth1. This fails, and then when it moves on to
> eth1 I run into the "can't call connect after it just failed" bug.
>
> --Bob
>
>
> Ralph H Castain wrote:
>> Hi Bob
>>
>> I'm afraid the person most familiar with the oob subsystem recently
>> left the project, so we are somewhat hampered at the moment. I don't
>> recognize the "Software caused connection abort" error message - it
>> doesn't appear to be one of ours (at least, I couldn't find it
>> anywhere in our code base, though I can't swear it isn't there in
>> some dark corner), and I don't find it in my own sys/errno.h file.
>>
>> With those caveats, all I can say is that something appears to be
>> blocking the connection from your remote node back to the head node.
>> Are you sure both nodes are available on IPv4 (since you disabled
>> IPv6)? Can you try ssh'ing to the remote node and doing a ping to the
>> head node using the IPv4 interface?
>>
>> Do you have another method you could use to check and see if max14
>> will accept connections from max15? If I interpret the error message
>> correctly, it looks like something in the connect handshake is being
>> aborted. We try a couple of times, but then give up and try other
>> interfaces - since no other interface is available, you get that
>> other error message and we abort.
>>
>> Sorry I can't be more help - like I said, this is now a weak spot in
>> our coverage that needs to be rebuilt.
>>
>> Ralph
>>
>>
>>
>> On 11/28/07 2:41 PM, "Bob Soliday" <soli...@aps.anl.gov> wrote:
>>
>>> I am new to Open MPI and have a problem that I cannot seem to solve.
>>> I am trying to run the hello_c example and I can't get it to work.
>>> I compiled Open MPI with:
>>>
>>> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
>>> --with-openib
>>>
>>> The hostname file contains the local host and one other node.
>>> When I run it I get:
>>>
>>> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
>>> [max14:31465] [0,0,0] accepting connections via event library
>>> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
>>> [max14:31466] [0,0,1] accepting connections via event library
>>> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
>>> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
>>> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
>>> [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>>> Daemon [0,0,1] checking in as pid 31466 on host max14
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
>>> [max15:28222] OOB: Connection to HNP lost
>>> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>> [max14:31465] ERROR: A daemon on node max15 failed to start as expected.
>>> [max14:31465] ERROR: There may be more information available from
>>> [max14:31465] ERROR: the remote shell (see above).
>>> [max14:31465] ERROR: The daemon exited unexpectedly with status 1.
>>> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [max14:31466] [0,0,1] orted_recv_pls: received exit
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
>>> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
>>> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons for this job.
>>> Returned value Timeout instead of ORTE_SUCCESS.
>>> --------------------------------------------------------------------------
>>>
>>>
>>> I can see that the orted daemon program is starting on both computers,
>>> but it looks to me like they can't talk to each other.
>>>
>>> Here is the output from ifconfig on one of the nodes; the other node
>>> is similar.
>>>
>>> [root@max14 ~]# /sbin/ifconfig
>>> eth0      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A1
>>>           inet addr:192.168.2.14  Bcast:192.168.2.255  Mask:255.255.255.0
>>>           inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:188125 (183.7 KiB)  TX bytes:1500567 (1.4 MiB)
>>>           Interrupt:17
>>>
>>> eth1      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A2
>>>           inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
>>>           inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:49368158 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:21844618928 (20.3 GiB)  TX bytes:16122676331 (15.0 GiB)
>>>           Interrupt:19
>>>
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:82191 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:82191 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:7383491 (7.0 MiB)  TX bytes:7383491 (7.0 MiB)
>>>
>>>
>>> These machines routinely run mpich2 and mvapich2 programs, so I don't
>>> suspect any problems with the gigabit or InfiniBand connections.
>>>
>>> Thanks,
>>> --Bob Soliday
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
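As for the patch Jeff asked about: the fix Bob describes amounts to retrying connect() once when it fails with ECONNABORTED. A minimal, generic sketch of that pattern (not the actual change to mca_oob_tcp_peer_try_connect, and untested) might look like:

  #include <errno.h>
  #include <sys/socket.h>

  /* Retry connect() once if it fails with ECONNABORTED.  As described in
   * the thread, on some Linux 2.6 kernels a connect() issued right after
   * a failed one can return ECONNABORTED without attempting the
   * connection; the following call then works.  Generic illustration
   * only, not the Open MPI code path. */
  static int connect_with_retry(int sd, const struct sockaddr *addr,
                                socklen_t addrlen)
  {
      if (connect(sd, addr, addrlen) == 0) {
          return 0;
      }
      if (errno == ECONNABORTED) {
          return connect(sd, addr, addrlen);  /* second attempt */
      }
      return -1;  /* any other error: leave errno for the caller */
  }

The real change would of course live inside mca_oob_tcp_peer_try_connect itself; this snippet only isolates the errno check described above.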