We can probably put it in 1.5.4. The 1.4 RMs (release managers) will have to speak for the 1.4 series.

On Apr 28, 2011, at 4:45 PM, Sindhi, Waris PW wrote:

> Do you know when this fix is slated for an official release ?
>
> Sincerely,
>
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Thursday, April 28, 2011 9:03 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>
> On Apr 28, 2011, at 6:56 AM, Sindhi, Waris PW wrote:
>
>> Yes the procgroup file has more than 128 applications in it.
>>
>> % wc -l procgroup
>> 239 procgroup
>>
>> Is 128 the max applications that can be in a procgroup file ?
>
> Yep - this limitation is lifted in the developer's trunk, but not yet in
> a release.
>
>> Sincerely,
>>
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Wednesday, April 27, 2011 8:09 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>
>> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>>
>>> No we do not have a firewall turned on. I can run smaller 96 slave cases
>>> on ln10 and ln13 included on the slavelist.
>>>
>>> Could there be another reason for this to fail ?
>>
>> What is in "procgroup"? Is it a single application?
>>
>> Offhand, there is nothing in OMPI that would explain the problem. The
>> only possibility I can think of would be if your "procgroup" file
>> contains more than 128 applications in it.
>>
>>> Sincerely,
>>>
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Wednesday, April 27, 2011 2:18 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>>
>>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>>> establish TCP communications with the daemon on ln10.
>>>
>>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>>
>>>> Hi,
>>>> I am getting a "oob-tcp: Communication retries exceeded" error
>>>> message when I run a 238 MPI slave code
>>>>
>>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to start the specified application as it encountered
>>>> an error:
>>>>
>>>> Error name: Unknown error: 1
>>>> Node: ln10
>>>>
>>>> when attempting to start process rank 234.
>>>> --------------------------------------------------------------------------
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>>>
>>>> Any help would be greatly appreciated.
>>>>
>>>> Sincerely,
>>>>
>>>> Waris Sindhi
>>>> High Performance Computing, TechApps
>>>> Pratt & Whitney, UTC
>>>> (860)-565-8486
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
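
The diagnosis in this thread (a "procgroup" file with 239 entries against a 128-application limit) can be checked before launching. The following is only a rough sketch, not anything shipped with Open MPI: the names `count_app_contexts`, `exceeds_limit`, and `MAX_APP_CONTEXTS` are made up for illustration. It assumes one application per non-blank line and that `#` starts a comment. The figure of 128 is taken from Ralph's reply and may not apply to other versions.

```python
# Limit quoted in the thread for releases of that era; an assumption elsewhere.
MAX_APP_CONTEXTS = 128


def count_app_contexts(appfile_text: str) -> int:
    """Count application entries, assuming one per non-blank, non-comment line."""
    count = 0
    for line in appfile_text.splitlines():
        # Assumption: '#' begins a comment that runs to the end of the line.
        content = line.split("#", 1)[0].strip()
        if content:
            count += 1
    return count


def exceeds_limit(appfile_text: str, limit: int = MAX_APP_CONTEXTS) -> bool:
    """Return True when the file holds more entries than the given limit."""
    return count_app_contexts(appfile_text) > limit


def report(path: str = "procgroup") -> str:
    """Read an appfile from disk and summarize it against the limit."""
    with open(path) as handle:
        text = handle.read()
    entries = count_app_contexts(text)
    status = "over" if entries > MAX_APP_CONTEXTS else "within"
    return f"{path}: {entries} entries ({status} the {MAX_APP_CONTEXTS}-entry limit)"
```

Calling `report("procgroup")` in the directory holding the file gives a one-line summary. Unlike `wc -l`, this count skips blank and comment lines, so the two numbers can differ.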