Perhaps a firewall? All the error is telling you is that mpirun couldn't establish TCP communication with the daemon on ln10.
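
One rough way to test the firewall theory is to check whether a plain TCP connection from the node running mpirun (ln13, going by the log) to ln10 goes through at all. Below is a minimal Python sketch, not anything Open MPI provides. It assumes you can start a temporary listener on ln10 on some unprivileged port. The port number 5000 and the helper name can_connect are made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Any socket-level failure (refused, timed out, unreachable host,
    name lookup failure) is reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage, run on the node where mpirun was started.
# It assumes a listener has been started on ln10, port 5000, beforehand:
#
#     print(can_connect("ln10", 5000))
```

If this returns False for a port where a listener is definitely running on ln10, a firewall or routing problem between the two nodes is a likely suspect. A True result only shows that this one port is reachable, so it does not rule out a firewall that blocks other ports.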

On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:

> Hi,
> I am getting a "oob-tcp: Communication retries exceeded" error
> message when I run a 238 MPI slave code
>
> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered
> an error:
>
> Error name: Unknown error: 1
> Node: ln10
>
> when attempting to start process rank 234.
> --------------------------------------------------------------------------
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded. Can not communicate with peer
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
> orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file
> orted/orted_comm.c at line 130
> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries
> exceeded. Can not communicate with peer
> [the same oob-tcp message repeats several more times]
>
> Any help would be greatly appreciated.
>
> Sincerely,
>
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486