If you wanted it to use eth1, your other option would be simply to tell it to do so using the MCA params. I believe it is something like

  -mca oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0
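For example, dropping those into the mpirun command line from your output (just a sketch on my part, not verified against your setup):

  /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons \
      -mca oob_tcp_debug 1000 -mca oob_tcp_if_include eth1 \
      -mca oob_tcp_if_exclude eth0 -machinefile hostfile -np 2 hello_c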
You may only need the exclude param, since you only have the two interfaces.

Ralph


On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:

> Interesting. Would you mind sharing your patch?
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Bob Soliday
> Sent: Thursday, November 29, 2007 11:35 AM
> To: Ralph H Castain
> Cc: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
>
> I solved the problem by making a change to
> orte/mca/oob/tcp/oob_tcp_peer.c
>
> On Linux 2.6 I have read that after a failed connect system call, the
> next call to connect can immediately return ECONNABORTED and not
> actually try to connect; the call after that will then work. So I
> changed mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then
> call connect again (a generic sketch of this retry pattern is appended
> at the end of this message). The hello_c example is now working.
>
> I don't think this has solved the underlying cause as to why connect is
> failing in the first place, but at least now I can move on to the next
> step. My best guess at the moment is that it is using eth0 initially
> when I want it to use eth1. This fails, and then when it moves on to
> eth1 I run into the "can't call connect after it just failed" bug.
>
> --Bob
>
>
> Ralph H Castain wrote:
>> Hi Bob
>>
>> I'm afraid the person most familiar with the oob subsystem recently
>> left the project, so we are somewhat hampered at the moment. I don't
>> recognize the "Software caused connection abort" error message - it
>> doesn't appear to be one of ours (at least, I couldn't find it
>> anywhere in our code base, though I can't swear it isn't there in
>> some dark corner), and I don't find it in my own sys/errno.h file.
>>
>> With those caveats, all I can say is that something appears to be
>> blocking the connection from your remote node back to the head node.
>> Are you sure both nodes are available on IPv4 (since you disabled
>> IPv6)? Can you try ssh'ing to the remote node and doing a ping to the
>> head node using the IPv4 interface?
>>
>> Do you have another method you could use to check and see if max14
>> will accept connections from max15? If I interpret the error message
>> correctly, it looks like something in the connect handshake is being
>> aborted. We try a couple of times, but then give up and try other
>> interfaces - since no other interface is available, you get that
>> other error message and we abort.
>>
>> Sorry I can't be more help - like I said, this is now a weak spot in
>> our coverage that needs to be rebuilt.
>>
>> Ralph
>>
>>
>>
>> On 11/28/07 2:41 PM, "Bob Soliday" <soli...@aps.anl.gov> wrote:
>>
>>> I am new to Open MPI and have a problem that I cannot seem to solve.
>>> I am trying to run the hello_c example and I can't get it to work.
>>> I compiled Open MPI with:
>>>
>>> ./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
>>> --with-openib
>>>
>>> The hostname file contains the local host and one other node.
>>> When I run it I get:
>>>
>>> [soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
>>> [max14:31465] [0,0,0] accepting connections via event library
>>> [max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
>>> [max14:31466] [0,0,1] accepting connections via event library
>>> [max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
>>> [max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
>>> [max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
>>> [max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 00000802
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>>> Daemon [0,0,1] checking in as pid 31466 on host max14
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
>>> [max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
>>> [max15:28222] OOB: Connection to HNP lost
>>> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
>>> [max14:31465] ERROR: A daemon on node max15 failed to start as expected.
>>> [max14:31465] ERROR: There may be more information available from
>>> [max14:31465] ERROR: the remote shell (see above).
>>> [max14:31465] ERROR: The daemon exited unexpectedly with status 1.
>>> [max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
>>> [max14:31466] [0,0,1] orted_recv_pls: received exit
>>> [max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
>>> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_msg_recv: peer closed connection
>>> [max14:31465] [0,0,0]-[0,0,1] mca_oob_tcp_peer_close(0x523100) sd 6 state 4
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 188
>>> [max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1198
>>> --------------------------------------------------------------------------
>>> mpirun was unable to cleanly terminate the daemons for this job.
>>> Returned value Timeout instead of ORTE_SUCCESS.
>>> --------------------------------------------------------------------------
>>>
>>>
>>> I can see that the orted daemon program is starting on both computers,
>>> but it looks to me like they can't talk to each other.
>>>
>>> Here is the output from ifconfig on one of the nodes; the other node
>>> is similar.
>>>
>>> [root@max14 ~]# /sbin/ifconfig
>>> eth0      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A1
>>>           inet addr:192.168.2.14  Bcast:192.168.2.255  Mask:255.255.255.0
>>>           inet6 addr: fe80::217:31ff:fe9c:93a1/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:1353 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:9572 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:188125 (183.7 KiB)  TX bytes:1500567 (1.4 MiB)
>>>           Interrupt:17
>>>
>>> eth1      Link encap:Ethernet  HWaddr 00:17:31:9C:93:A2
>>>           inet addr:192.168.1.14  Bcast:192.168.1.255  Mask:255.255.255.0
>>>           inet6 addr: fe80::217:31ff:fe9c:93a2/64 Scope:Link
>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>           RX packets:49652796 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:49368158 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:1000
>>>           RX bytes:21844618928 (20.3 GiB)  TX bytes:16122676331 (15.0 GiB)
>>>           Interrupt:19
>>>
>>> lo        Link encap:Local Loopback
>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>           inet6 addr: ::1/128 Scope:Host
>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>           RX packets:82191 errors:0 dropped:0 overruns:0 frame:0
>>>           TX packets:82191 errors:0 dropped:0 overruns:0 carrier:0
>>>           collisions:0 txqueuelen:0
>>>           RX bytes:7383491 (7.0 MiB)  TX bytes:7383491 (7.0 MiB)
>>>
>>>
>>> These machines routinely run mpich2 and mvapich2 programs, so I don't
>>> suspect any problems with the gigabit or InfiniBand connections.
>>>
>>> Thanks,
>>> --Bob Soliday
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
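As for the patch Jeff asked about: the fix Bob describes amounts to retrying connect() once when it fails with ECONNABORTED. A minimal, generic sketch of that pattern (not the actual change to mca_oob_tcp_peer_try_connect, and untested) might look like:

  #include <errno.h>
  #include <sys/socket.h>

  /* Retry connect() once if it fails with ECONNABORTED.  As described in
   * the thread, on some Linux 2.6 kernels a connect() issued right after
   * a failed one can return ECONNABORTED without attempting the
   * connection; the following call then works.  Generic illustration
   * only, not the Open MPI code path. */
  static int connect_with_retry(int sd, const struct sockaddr *addr,
                                socklen_t addrlen)
  {
      if (connect(sd, addr, addrlen) == 0) {
          return 0;
      }
      if (errno == ECONNABORTED) {
          return connect(sd, addr, addrlen);  /* second attempt */
      }
      return -1;  /* any other error: leave errno for the caller */
  }

The real change would of course live inside mca_oob_tcp_peer_try_connect itself; this snippet only isolates the errno check described above.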