On Apr 28, 2011, at 6:04 AM, Jeff Squyres wrote:

> I do note that you are using an ancient version of Open MPI (1.2.8).

I don't think that is accurate - at least, the output doesn't match that old a 
version. The process name format is indicative of something 1.3 or more recent.

What led you to conclude 1.2.8?
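
As an illustration of the name format in question, here is a minimal Python sketch that pulls the fields out of one of the oob-tcp lines in the quoted output below. The function name parse_oob_line and the "local"/"peer" labels are assumptions made for this example; they are not Open MPI terminology or API, and the sketch only recognizes lines shaped like the ones in this report.

```python
import re

# One process name as it appears in the quoted log, e.g. [[61748,0],32]
NAME = r"\[\[(\d+),(\d+)\],(\d+)\]"

# A reporting prefix such as [ln13:27867], then "<name>-<name>"
LINE = re.compile(r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+" + NAME + "-" + NAME)


def parse_oob_line(line):
    """Return the host, pid and both process names, or None if no match."""
    m = LINE.search(line)
    if m is None:
        return None
    # groups() yields host, pid, then the six numeric name fields in order
    nums = [int(x) for x in m.groups()[2:]]
    return {
        "host": m.group("host"),
        "pid": int(m.group("pid")),
        "local": tuple(nums[:3]),  # name on the left of the "-"
        "peer": tuple(nums[3:]),   # name on the right of the "-"
    }


if __name__ == "__main__":
    sample = ("[ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: "
              "Communication retries exceeded.  Can not communicate with peer")
    print(parse_oob_line(sample))
```

Run against the quoted lines, this reports ln13 as the host that logged the message and a peer name whose last field is 32.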


>  Is there any way you can upgrade to a (much) later version, such as 1.4.3?  
> That might improve your TCP connectivity -- we made improvements in those 
> portions of the code over the years.
> 
> On Apr 27, 2011, at 8:09 PM, Ralph Castain wrote:
> 
>> 
>> On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:
>> 
>>> No, we do not have a firewall turned on. I can run smaller 96-slave cases
>>> with ln10 and ln13 included in the slavelist.
>>> 
>>> Could there be another reason for this to fail?
>> 
>> What is in "procgroup"? Is it a single application?
>> 
>> Offhand, there is nothing in OMPI that would explain the problem. The only 
>> possibility I can think of would be if your "procgroup" file contains more 
>> than 128 applications in it.
>> 
>>> 
>>> 
>>> Sincerely,
>>> 
>>> Waris Sindhi
>>> High Performance Computing, TechApps
>>> Pratt & Whitney, UTC
>>> (860)-565-8486
>>> 
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of Ralph Castain
>>> Sent: Wednesday, April 27, 2011 2:18 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>>> 
>>> Perhaps a firewall? All it is telling you is that mpirun couldn't
>>> establish TCP communications with the daemon on ln10.
>>> 
>>> 
>>> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>>> 
>>>> Hi,
>>>>  I am getting an "oob-tcp: Communication retries exceeded" error
>>>> message when I run a 238-slave MPI code:
>>>> 
>>>> 
>>>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>>>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>>>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>>> 
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to start the specified application as it encountered
>>>> an error:
>>>> 
>>>> Error name: Unknown error: 1
>>>> Node: ln10
>>>> 
>>>> when attempting to start process rank 234.
>>>> 
>>>> --------------------------------------------------------------------------
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
>>>> 
>>>> Any help would be greatly appreciated.
>>>> 
>>>> Sincerely,
>>>> 
>>>> Waris Sindhi
>>>> High Performance Computing, TechApps
>>>> Pratt & Whitney, UTC
>>>> (860)-565-8486
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 

