Jeff Squyres (jsquyres) wrote:
Interesting. Would you mind sharing your patch?
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Bob Soliday
Sent: Thursday, November 29, 2007 11:35 AM
To: Ralph H Castain
Cc: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] mca_oob_tcp_peer_try_connect problem

I solved the problem by making a change to
orte/mca/oob/tcp/oob_tcp_peer.c

I have read that on Linux 2.6, after a failed connect system call, the
next call to connect can immediately return ECONNABORTED without actually
trying to connect; the call after that will then work. So I changed
mca_oob_tcp_peer_try_connect to test for ECONNABORTED and then call
connect again. The hello_c example program is now working.

I don't think this has solved the underlying cause as to why connect is
failing in the first place, but at least now I can move on to the next step.
My best guess at the moment is that it is using eth0 initially when I
want it to use eth1. This fails, and then when it moves on to eth1 I run
into the "can't call connect after it just failed" bug.

--Bob



I changed oob_tcp_peer.c at line 289 from:


/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
    (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  /* non-blocking so wait for completion */
  if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
    opal_event_add(&peer->peer_send_event, 0);
    return ORTE_SUCCESS;
  }
  opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
              "connect to %s:%d failed: %s (%d)",
              ORTE_NAME_ARGS(orte_process_info.my_name),
              ORTE_NAME_ARGS(&(peer->peer_name)),
              inet_ntoa(inaddr.sin_addr),
              ntohs(inaddr.sin_port),
              strerror(opal_socket_errno),
              opal_socket_errno);
  continue;
}


to:


/* start the connect - will likely fail with EINPROGRESS */
if(connect(peer->peer_sd,
    (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
  /* connect can report ECONNABORTED right after an earlier failed
     connect without actually trying; retry once in that case */
  if (opal_socket_errno == ECONNABORTED) {
    if(connect(peer->peer_sd,
        (struct sockaddr*)&inaddr, sizeof(struct sockaddr_in)) < 0) {
      if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
        opal_event_add(&peer->peer_send_event, 0);
        return ORTE_SUCCESS;
      }
      opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                  "connect to %s:%d failed: %s (%d)",
                  ORTE_NAME_ARGS(orte_process_info.my_name),
                  ORTE_NAME_ARGS(&(peer->peer_name)),
                  inet_ntoa(inaddr.sin_addr),
                  ntohs(inaddr.sin_port),
                  strerror(opal_socket_errno),
                  opal_socket_errno);
      continue;
    }
  } else {
    if(opal_socket_errno == EINPROGRESS || opal_socket_errno == EWOULDBLOCK) {
      opal_event_add(&peer->peer_send_event, 0);
      return ORTE_SUCCESS;
    }
    opal_output(0, "[%lu,%lu,%lu]-[%lu,%lu,%lu] mca_oob_tcp_peer_try_connect: "
                "connect to %s:%d failed: %s (%d)",
                ORTE_NAME_ARGS(orte_process_info.my_name),
                ORTE_NAME_ARGS(&(peer->peer_name)),
                inet_ntoa(inaddr.sin_addr),
                ntohs(inaddr.sin_port),
                strerror(opal_socket_errno),
                opal_socket_errno);
    continue;
  }
}
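
The patch above repeats the EINPROGRESS/EWOULDBLOCK handling in both branches. As a rough sketch only (this is not the Open MPI code), the same retry-once idea can be written as a small standalone helper. The names connect_fn, connect_retry_aborted and fake_connect are hypothetical and exist only for this illustration; the function-pointer argument is an assumption made so the logic can be exercised without a network.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical hook type: same signature as POSIX connect(), so the real
 * connect() can be passed in, or a stub when testing. */
typedef int (*connect_fn)(int sd, const struct sockaddr *addr, socklen_t len);

/*
 * Call do_connect once; if it fails with ECONNABORTED, call it exactly
 * one more time. Returns:
 *    0  connected immediately
 *    1  non-blocking connect in progress (EINPROGRESS / EWOULDBLOCK)
 *   -1  failed; errno holds the error from the last attempt
 */
static int connect_retry_aborted(connect_fn do_connect, int sd,
                                 const struct sockaddr *addr, socklen_t len)
{
    int rc = do_connect(sd, addr, len);

    /* Assumption taken from the description above: the first call may
     * report ECONNABORTED left over from an earlier failed connect. */
    if (rc < 0 && errno == ECONNABORTED) {
        rc = do_connect(sd, addr, len);
    }

    if (rc < 0 && (errno == EINPROGRESS || errno == EWOULDBLOCK)) {
        return 1; /* caller should wait for the socket to become writable */
    }
    return rc;
}

/* Stub for illustration: the first call fails with ECONNABORTED, every
 * later call reports a pending non-blocking connect. */
static int fake_calls = 0;

static int fake_connect(int sd, const struct sockaddr *addr, socklen_t len)
{
    (void)sd; (void)addr; (void)len;
    fake_calls++;
    errno = (fake_calls == 1) ? ECONNABORTED : EINPROGRESS;
    return -1;
}
```

With the real system call this would be invoked as connect_retry_aborted(connect, sd, addr, len). A return of 1 corresponds to the point where the patch calls opal_event_add and returns ORTE_SUCCESS, and -1 to the point where it logs the error and continues with the next address.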
