We can probably put it in 1.5.4.  The 1.4 RMs (release managers) will have to
speak for the 1.4 series...
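
Until a release carries the fix, the check discussed further down this thread (is the appfile over 128 applications?) can be scripted. What follows is only a minimal sketch in Python, not anything shipped with Open MPI: the 128 figure is taken from this thread, the function names are made up for illustration, and it assumes each non-blank line that does not start with '#' is one application.

```python
# Limit quoted in this thread; it is not read from Open MPI itself.
APP_CONTEXT_LIMIT = 128


def count_app_contexts(appfile_text):
    """Count lines that are neither blank nor '#' comments.

    Assumption: each such line describes exactly one application.
    """
    count = 0
    for line in appfile_text.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


def exceeds_limit(appfile_text, limit=APP_CONTEXT_LIMIT):
    """Return True when the appfile lists more applications than the limit."""
    return count_app_contexts(appfile_text) > limit
```

Reading the file passed to --app (here, procgroup) and handing its contents to exceeds_limit gives a quick answer before launching mpirun.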

On Apr 28, 2011, at 4:45 PM, Sindhi, Waris PW wrote:

> Do you know when this fix is slated for an official release?
> 
> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Thursday, April 28, 2011 9:03 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
> 
> 
> On Apr 28, 2011, at 6:56 AM, Sindhi, Waris PW wrote:
> 
>> Yes, the procgroup file has more than 128 applications in it.
>> 
>> % wc -l procgroup
>> 239 procgroup 
>> 
>> Is 128 the maximum number of applications that can be in a procgroup file?
> 
> Yep - this limitation is lifted in the developer's trunk, but not yet in
> a release.
> 
> 
>> 
>> Sincerely,
>> 
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486
>> 
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of Ralph Castain
>> Sent: Wednesday, April 27, 2011 8:09 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>> 
>> 
>> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>> 
>>> No, we do not have a firewall turned on. I can run smaller 96-slave cases
>>> with ln10 and ln13 included in the slavelist.
>>> 
>>> Could there be another reason for this to fail ? 
>> 
>> What is in "procgroup"? Is it a single application?
>> 
>> Offhand, there is nothing in OMPI that would explain the problem. The
>> only possibility I can think of would be if your "procgroup" file
>> contains more than 128 applications in it.
>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Ralph Castain
>>> Sent: Wednesday, April 27, 2011 2:18 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>> 
>>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>>> establish TCP communications with the daemon on ln10.
>>> 
>>> 
>>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>> 
>>>> Hi,
>>>>  I am getting an "oob-tcp: Communication retries exceeded" error
>>>> message when I run a code with 238 MPI slaves:
>>>> 
>>>> 
>>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>>> 
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to start the specified application as it encountered
>>>> an error:
>>>> 
>>>> Error name: Unknown error: 1
>>>> Node: ln10
>>>> 
>>>> when attempting to start process rank 234.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>>> orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
>>>> orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
>>>> exceeded.  Can not communicate with peer
>>>> 
>>>> Any help would be greatly appreciated.
>>>> 
>>>> Sincerely,
>>>> 
>>>> Waris Sindhi
>>>> High Performance Computing, TechApps
>>>> Pratt & Whitney, UTC
>>>> (860)-565-8486
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

