[OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-28 Thread Bob Soliday
  Metric:1
  RX packets:82191 errors:0 dropped:0 overruns:0 frame:0
  TX packets:82191 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:7383491 (7.0 MiB)  TX bytes:7383491 (7.0 MiB)


These machines routinely run mpich2 and mvapich2 programs, so I don't suspect
any problems with the gigabit or InfiniBand connections.

Thanks,
--Bob Soliday



Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Bob Soliday

I solved the problem by making a change to orte/mca/oob/tcp/oob_tcp_peer.c

I have read that on Linux 2.6, after a failed connect() system call, the next
call to connect() can immediately return ECONNABORTED without actually
attempting to connect; the call after that then works. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call connect()
again. The hello_c example program is now working.
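
To see the pattern outside of the ORTE code (the actual patch is further down
in this thread), here is a minimal standalone sketch using plain POSIX
sockets. The function name, address, and port below are placeholders for
illustration only, not anything taken from Open MPI:

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Start a non-blocking connect; if the kernel reports ECONNABORTED left
 * over from a previously aborted attempt on this socket, retry once. */
static int try_connect_retry(int sd, const struct sockaddr_in *addr)
{
    /* put the socket in non-blocking mode, as the OOB code does */
    fcntl(sd, F_SETFL, fcntl(sd, F_GETFL, 0) | O_NONBLOCK);

    if (connect(sd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
        return 0;                      /* connected immediately */

    if (errno == ECONNABORTED) {
        /* Linux 2.6 may report the abort of the previous attempt here;
         * the very next connect() proceeds normally, so retry once. */
        if (connect(sd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
            return 0;
    }
    if (errno == EINPROGRESS || errno == EWOULDBLOCK)
        return 1;                      /* in progress: wait for writability */

    fprintf(stderr, "connect failed: %s (%d)\n", strerror(errno), errno);
    return -1;
}

int main(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(38852);                  /* placeholder port */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); /* placeholder address */

    int rc = try_connect_retry(sd, &addr);
    close(sd);
    return rc < 0 ? 1 : 0;
}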

I don't think this has solved the underlying cause of why connect is failing
in the first place, but at least now I can move on to the next step. My best
guess at the moment is that it initially uses eth0 when I want it to use
eth1. That attempt fails, and when it then moves on to eth1 I run into the
"can't call connect right after it just failed" bug.

--Bob


Ralph H Castain wrote:

Hi Bob

I'm afraid the person most familiar with the oob subsystem recently left the
project, so we are somewhat hampered at the moment. I don't recognize the
"Software caused connection abort" error message - it doesn't appear to be
one of ours (at least, I couldn't find it anywhere in our code base, though
I can't swear it isn't there in some dark corner), and I don't find it in my
own sys/errno.h file.

With those caveats, all I can say is that something appears to be blocking
the connection from your remote node back to the head node. Are you sure
both nodes are available on IPv4 (since you disabled IPv6)? Can you try
ssh'ing to the remote node and doing a ping to the head node using the IPv4
interface?

Do you have another method you could use to check and see if max14 will
accept connections from max15? If I interpret the error message correctly,
it looks like something in the connect handshake is being aborted. We try a
couple of times, but then give up and try other interfaces - since no other
interface is available, you get that other error message and we abort.
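
As one illustration of such a check (the port 38852 comes from the debug
output below and changes on every run, so the last test is only meaningful
while mpirun is still up; telnet is just one tool that would do it):

[soliday@max15 ~]$ ping -c 3 192.168.1.14     # address max15's daemon tried and failed on
[soliday@max15 ~]$ ping -c 3 192.168.2.14     # address the max14-local daemon connected to
[soliday@max15 ~]$ telnet 192.168.1.14 38852  # does the HNP accept a raw TCP connection?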

Sorry I can't be more help - like I said, this is now a weak spot in our
coverage that needs to be rebuilt.

Ralph
 



On 11/28/07 2:41 PM, "Bob Soliday"  wrote:


I am new to openmpi and have a problem that I cannot seem to solve.
I am trying to run the hello_c example and I can't get it to work.
I compiled openmpi with:

./configure --prefix=/usr/local/software/openmpi-1.2.4 --disable-ipv6
--with-openib

The hostfile contains the local host and one other node. When I
run it I get:


[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun --debug-daemons -mca oob_tcp_debug 1000 -machinefile hostfile -np 2 hello_c
[max14:31465] [0,0,0] accepting connections via event library
[max14:31465] [0,0,0] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1] accepting connections via event library
[max14:31466] [0,0,1] mca_oob_tcp_init: calling orte_gpr.subscribe
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_try_connect: connecting port 55152 to: 192.168.2.14:38852
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: sending ack, 0
[max14:31465] [0,0,0] mca_oob_tcp_accept: 192.168.2.14:37255
[max14:31465] [0,0,0]-[0,0,1] accepted: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] connected: 192.168.2.14 - 192.168.2.14 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
Daemon [0,0,1] checking in as pid 31466 on host max14
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 2
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_recv: tag 2
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed: Software caused connection abort (103)
[max15:28222] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to 192.168.1.14:38852 failed, connecting over all interfaces failed!
[max15:28222] OOB: Connection to HNP lost
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls: received kill_local_procs
[max14:31466] [0,0,1]-[0,0,0] mca_oob_tcp_send: tag 15
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
[max14:31465] [0,0,0] ORTE_ERROR_LOG: Timeout in file errmgr_hnp.c at line 90
[max14:31465] ERROR: A daemon on node max15 failed to start as expected.
[max14:31465] ERROR: There may be more information available from
[max14:31465] ERROR: the remote shell (see above).
[max14:31465] ERROR: The daemon exited unexpectedly with status 1.
[max14:31466] [0,0,1] orted_recv_pls: received message from [0,0,0]
[max14:31466] [0,0,1] orted_recv_pls:

Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Bob Soliday

Jeff Squyres (jsquyres) wrote:
Interesting.  Would you mind sharing your patch? 






I changed oob_tcp_peer.c at line 289 from:


/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
(struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  /* non-blocking so wait for completion */
  if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
opal_event_add(&peer->peer_send_event, 0);
return ORTE_SUCCESS;
  }
  opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
  "connect to %s:%d failed: %s (%d)",
  ORTE_NAME_ARGS(orte_process_info.my_name),
  ORTE_NAME_ARGS(&(peer->peer_name)),
  inet_ntoa(inaddr.sin_addr),
  ntohs(inaddr.sin_port),
  strerror(opal_socket_errno),
  opal_socket_errno);
  continue;
}


to:


/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
(struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  /* non-blocking so wait for completion */
  if (opal_socket_errno == ECONNABORTED) {
if(connect(peer->peer_sd,
(struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
opal_event_add(&peer->peer_send_event, 0);
return ORTE_SUCCESS;
  }
  opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: 
"
  "connect to %s:%d failed: %s (%d)",
  ORTE_NAME_ARGS(orte_process_info.my_name),
  ORTE_NAME_ARGS(&(peer->peer_name)),
  inet_ntoa(inaddr.sin_addr),
  ntohs(inaddr.sin_port),
  strerror(opal_socket_errno),
  opal_socket_errno);
  continue;
}
  } else {
if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
  opal_event_add(&peer->peer_send_event, 0);
  return ORTE_SUCCESS;
}
opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
"connect to %s:%d failed: %s (%d)",
ORTE_NAME_ARGS(orte_process_info.my_name),
ORTE_NAME_ARGS(&(peer->peer_name)),
inet_ntoa(inaddr.sin_addr),
ntohs(inaddr.sin_port),
strerror(opal_socket_errno),
opal_socket_errno);
continue;
  }
}



Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

2007-11-29 Thread Bob Soliday

Thanks, this works. I have now removed my change to oob_tcp_peer.c.

--Bob Soliday

Ralph Castain wrote:

If you wanted it to use eth1, your other option would be to simply tell it
to do so using the mca param. I believe it is something like -mca
oob_tcp_if_include eth1 -mca oob_tcp_if_exclude eth0

You may only need the latter since you only have the two interfaces.
Ralph
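
As an illustration only (this exact command is not from the thread), the
original run could be pinned to eth1 along these lines, or equivalently with
-mca oob_tcp_if_exclude eth0 as Ralph suggests:

[soliday@max14 mpi-ex]$ /usr/local/software/openmpi-1.2.4/bin/mpirun -mca oob_tcp_if_include eth1 --debug-daemons -machinefile hostfile -np 2 hello_c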



On 11/29/07 9:47 AM, "Jeff Squyres (jsquyres)"  wrote:


Interesting.  Would you mind sharing your patch?
