On Apr 28, 2011, at 6:04 AM, Jeff Squyres wrote:

> I do note that you are using an ancient version of Open MPI (1.2.8).

I don't think that is accurate - at least, the output doesn't match that old a version. The process name format is indicative of something 1.3 or more recent. What led you to conclude 1.2.8?

> Is there any way you can upgrade to a (much) later version, such as 1.4.3?
> That might improve your TCP connectivity -- we made improvements in those
> portions of the code over the years.
>
> On Apr 27, 2011, at 8:09 PM, Ralph Castain wrote:
>
>>
>> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>>
>>> No we do not have a firewall turned on. I can run smaller 96 slave cases
>>> on ln10 and ln13 included on the slavelist.
>>>
>>> Could there be another reason for this to fail ?
>>
>> What is in "procgroup"? Is it a single application?
>>
>> Offhand, there is nothing in OMPI that would explain the problem. The only
>> possibility I can think of would be if your "procgroup" file contains more
>> than 128 applications in it.
>>
>>>
>>>
>>> Sincerely,
>>>
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Ralph Castain
>>> Sent: Wednesday, April 27, 2011 2:18 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>>
>>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>>> establish TCP communications with the daemon on ln10.
>>>
>>>
>>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>>
>>>> Hi,
>>>> I am getting a "oob-tcp: Communication retries exceeded" error
>>>> message when I run a 238 MPI slave code
>>>>
>>>>
>>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to start the specified application as it encountered
>>>> an error:
>>>>
>>>> Error name: Unknown error: 1
>>>> Node: ln10
>>>>
>>>> when attempting to start process rank 234.
>>>>
>>>> --------------------------------------------------------------------------
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>
>>>> Any help would be greatly appreciated.
>>>>
>>>> Sincerely,
>>>>
>>>> Waris Sindhi
>>>> High Performance Computing, TechApps
>>>> Pratt & Whitney, UTC
>>>> (860)-565-8486
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
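
For anyone who wants to rule out the "more than 128 applications" possibility mentioned earlier in the thread, here is a minimal sketch in Python that counts the application entries in an appfile such as "procgroup". It assumes each non-blank line that does not start with "#" describes one application; that assumption, the helper name count_app_contexts, and the use of 128 as a threshold (which is only the guess stated above, not a confirmed limit) are illustrative and not taken from the Open MPI sources.

```python
import sys


def count_app_contexts(text):
    """Count lines that look like application entries.

    Assumption: one application per line; blank lines and lines
    starting with '#' are ignored.
    """
    count = 0
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


if __name__ == "__main__":
    # Default to the file name used in the thread; pass another path as argv[1].
    path = sys.argv[1] if len(sys.argv) > 1 else "procgroup"
    try:
        with open(path) as handle:
            n = count_app_contexts(handle.read())
    except OSError as exc:
        print(f"could not read {path}: {exc}")
    else:
        print(f"{path}: {n} application entries")
        # 128 is the figure guessed at in the thread, not a verified limit.
        if n > 128:
            print("more than 128 entries; worth testing with fewer")
```

Run it as "python count_apps.py procgroup" (the script name is arbitrary). If the count comes back well under 128, that possibility is probably not the cause.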