I do note that you are using an ancient version of Open MPI (1.2.8). Is there any way you can upgrade to a (much) later version, such as 1.4.3? That might improve your TCP connectivity -- we made improvements in those portions of the code over the years.
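
On Ralph's question in the quoted thread below about how many applications "procgroup" contains: one quick way to check is to count the entries in the file. Here is a minimal Python sketch, assuming the appfile lists one application per non-blank line and that lines starting with "#" are comments. The function name, the sample entries, and the use of 128 as a threshold are illustrative only; the 128 figure is just the guess from the thread, not a documented limit.

```python
# Hypothetical threshold, taken from the guess in the quoted thread.
SUSPECTED_APP_LIMIT = 128


def count_app_entries(appfile_text):
    """Count lines that look like application entries.

    Assumption: one application per line; blank lines and lines
    starting with '#' are ignored.
    """
    count = 0
    for line in appfile_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Made-up sample contents; a real check would read the actual "procgroup" file.
sample = "# master and slaves\n-np 1 ./master\n\n-np 238 ./slave\n"
entries = count_app_entries(sample)
print(entries, "application entries")
if entries > SUSPECTED_APP_LIMIT:
    print("More entries than the suspected limit; try a smaller appfile.")
```

If the count comes back well under 128, the appfile size is probably not the issue and the TCP connectivity to ln10 is the next thing to look at.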

On Apr 27, 2011, at 8:09 PM, Ralph Castain wrote:

> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>
>> No we do not have a firewall turned on. I can run smaller 96 slave cases
>> on ln10 and ln13 included on the slavelist.
>>
>> Could there be another reason for this to fail ?
>
> What is in "procgroup"? Is it a single application?
>
> Offhand, there is nothing in OMPI that would explain the problem. The only
> possibility I can think of would be if your "procgroup" file contains more
> than 128 applications in it.
>
>> Sincerely,
>>
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, April 27, 2011 2:18 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>
>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>> establish TCP communications with the daemon on ln10.
>>
>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>
>>> Hi,
>>> I am getting a "oob-tcp: Communication retries exceeded" error
>>> message when I run a 238 MPI slave code
>>>
>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>> --------------------------------------------------------------------------
>>> mpirun was unable to start the specified application as it encountered
>>> an error:
>>>
>>> Error name: Unknown error: 1
>>> Node: ln10
>>>
>>> when attempting to start process rank 234.
>>> --------------------------------------------------------------------------
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Sincerely,
>>>
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/