Perhaps a firewall? All the error tells you is that mpirun couldn't establish
TCP communications with the daemon on ln10.
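
One quick way to test the firewall theory is to check whether a plain TCP
connection can be opened between the two nodes at all. The following is only a
minimal sketch, not part of Open MPI: the helper name can_connect and the port
numbers are hypothetical. As far as I know the out-of-band channel picks its
ports dynamically by default, so a probe against one fixed port only shows
whether that particular port is reachable.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Hypothetical helper for illustration only. Returns False on refusal,
    timeout, or name-resolution failure (all are subclasses of OSError).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use, run on the node where mpirun was started (ln13 in the log):
#   can_connect("ln10", 22)      # ssh port, normally open if ssh launch works
#   can_connect("ln10", 50000)   # an arbitrary high port (assumed value)
```

If the ssh port is reachable but a listener you start yourself on a high port
on ln10 is not, a firewall rule between the nodes is the likely cause.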


On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:

> Hi,
>     I am getting an "oob-tcp: Communication retries exceeded" error
> message when I run an MPI code with 238 slaves:
> 
> 
> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered
> an error:
> 
> Error name: Unknown error: 1
> Node: ln10
> 
> when attempting to start process rank 234.
> --------------------------------------------------------------------------
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded.  Can not communicate with peer
> 
> Any help would be greatly appreciated.
> 
> Sincerely,
> 
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

