Jeff Squyres (jsquyres) wrote:
Interesting. Would you mind sharing your patch?
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Bob Soliday
Sent: Thursday, November 29, 2007 11:35 AM
To: Ralph H Castain
Cc: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem
I solved the problem by making a change to
orte/mca/oob/tcp/oob_tcp_peer.c.
I have read that on Linux 2.6, after a failed connect system call, the
next call to connect can immediately return ECONNABORTED without actually
trying to connect; the call after that will then work. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call
connect again. The hello_c example is now working.
I don't think this has solved the underlying cause as to why connect is
failing in the first place, but at least now I can move on to the next step.
My best guess at the moment is that it is using eth0 initially when I
want it to use eth1. This fails, and then when it moves on to eth1 I run
into the "can't call connect after it just failed" bug.
--Bob
I changed oob_tcp_peer.c at line 289 from:
    /* start the connect - will likely fail with EINPROGRESS */
    if(connect(peer->peer_sd,
        (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
        /* non-blocking so wait for completion */
        if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
            opal_event_add(&peer->peer_send_event, 0);
            return ORTE_SUCCESS;
        }
        opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
            "connect to %s:%d failed: %s (%d)",
            ORTE_NAME_ARGS(orte_process_info.my_name),
            ORTE_NAME_ARGS(&(peer->peer_name)),
            inet_ntoa(inaddr.sin_addr),
            ntohs(inaddr.sin_port),
            strerror(opal_socket_errno),
            opal_socket_errno);
        continue;
    }
to:
    /* start the connect - will likely fail with EINPROGRESS */
    if(connect(peer->peer_sd,
        (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
        if (opal_socket_errno == ECONNABORTED) {
            /* ECONNABORTED right after an earlier failed connect: try once more */
            if(connect(peer->peer_sd,
                (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
                /* non-blocking so wait for completion */
                if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
                    opal_event_add(&peer->peer_send_event, 0);
                    return ORTE_SUCCESS;
                }
                opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                    "connect to %s:%d failed: %s (%d)",
                    ORTE_NAME_ARGS(orte_process_info.my_name),
                    ORTE_NAME_ARGS(&(peer->peer_name)),
                    inet_ntoa(inaddr.sin_addr),
                    ntohs(inaddr.sin_port),
                    strerror(opal_socket_errno),
                    opal_socket_errno);
                continue;
            }
        } else {
            /* non-blocking so wait for completion */
            if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
                opal_event_add(&peer->peer_send_event, 0);
                return ORTE_SUCCESS;
            }
            opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                "connect to %s:%d failed: %s (%d)",
                ORTE_NAME_ARGS(orte_process_info.my_name),
                ORTE_NAME_ARGS(&(peer->peer_name)),
                inet_ntoa(inaddr.sin_addr),
                ntohs(inaddr.sin_port),
                strerror(opal_socket_errno),
                opal_socket_errno);
            continue;
        }
    }
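
The branching in the patched code above can be summarised as a small decision function. This is only a sketch to make the three outcomes explicit: the enum and the function name classify_connect_errno are hypothetical and do not exist in Open MPI, and plain errno values stand in for opal_socket_errno.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical names, for illustration only. */
enum connect_action {
    CONNECT_WAIT,   /* non-blocking connect in progress: wait for the socket event */
    CONNECT_RETRY,  /* ECONNABORTED on the first attempt: call connect() once more */
    CONNECT_FAIL    /* anything else: log the error and continue with the next address */
};

/* err is the errno value left by a failed connect(); already_retried is
 * non-zero once the single ECONNABORTED retry has been used. */
enum connect_action classify_connect_errno(int err, int already_retried)
{
    assert(err != 0); /* only meaningful after connect() has failed */

    if (err == EINPROGRESS || err == EWOULDBLOCK)
        return CONNECT_WAIT;
    if (err == ECONNABORTED && !already_retried)
        return CONNECT_RETRY;
    return CONNECT_FAIL;
}
```

Written this way, the logging and the wait-for-completion handling would each need to appear only once in the caller, instead of being duplicated in both branches.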