Hi,

I have upgraded my Open MPI installation to 1.2.6 (we run Gentoo, and emerge showed 1.2.6-r1 as the latest stable version of Open MPI).
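For reference, the test program is just a minimal MPI hello world along the
following lines (a sketch of what it does, not the exact source; the actual
binary is installed as /main/mpiHelloWorld):

    /* mpiHelloWorld sketch: init MPI, print hostname and rank, exit */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        MPI_Get_processor_name(name, &len);     /* hostname of this node */

        printf("Hello from %s: rank %d of %d\n", name, rank, size);

        MPI_Finalize();
        return 0;
    }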
I still get the following error message when running this test program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen on every run of the test program; it shows up only some of the time (errno 113 is EHOSTUNREACH, the "No route to host" in the subject line; a quick check of that mapping is sketched at the end of this message). How do I take care of this?

Regards,

Prasanna.


On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" <users-requ...@open-mpi.org> wrote:

>
> Message: 1
> Date: Mon, 8 Sep 2008 16:43:33 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Need help resolving No route to host error
>         with OpenMPI 1.1.2
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <af302d68-0d30-469e-afd3-566ff9628...@cisco.com>
> Content-Type: text/plain; charset=WINDOWS-1252; format=flowed; delsp=yes
>
> Are you able to upgrade to Open MPI v1.2.7?
>
> There were *many* bug fixes and changes in the 1.2 series compared to
> the 1.1 series; some, in particular, dealt with TCP socket timeouts
> (which are important when dealing with large numbers of MPI processes).
>
>
> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
>
>> Hi,
>>
>> I am trying to run a test mpiHelloWorld program that simply
>> initializes the MPI environment on all the nodes, prints the
>> hostname and rank of each node in the MPI process group, and exits.
>>
>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>> (nodes have 2 dual-core CPUs).
>>
>> I get the following error messages when I run my program as follows:
>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>>
>> .....
>> .....
>> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> .....
>> .....
>>
>> The main thing is that I get these error messages around 3-4 times
>> out of 10 attempts, with the rest all completing successfully. I have
>> looked into the FAQs in detail and also checked the tcp btl settings
>> but am not able to figure it out.
>>
>> All the 499 nodes have only eth0 active, and I get the error even
>> when I run the following:
>> mpirun -np 997 -bynode -hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
>>
>> I have attached the output of ompi_info --all.
>>
>> The following is the output of /sbin/ifconfig on the node where I
>> start the MPI process (it is one of the 499 nodes):
>>
>> eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>           inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>           TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>           Interrupt:22 Base address:0xc000
>>
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>>
>> Kindly help.
>>
>> Regards,
>>
>> Prasanna.
>>
>> <ompi_info.rtf>_______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
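P.S. For anyone searching the archives later: errno 113 on Linux is
EHOSTUNREACH, i.e. the "No route to host" in the subject line. A quick way to
confirm that mapping on a node is a small check along these lines (just a
sketch, not part of the test program above):

    /* print the C library's message for errno 113 */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* on Linux this prints "No route to host" (EHOSTUNREACH) */
        printf("errno 113: %s\n", strerror(113));
        return 0;
    }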