Yes, the procgroup file has more than 128 applications in it.

% wc -l procgroup
239 procgroup
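
One caveat on the count: wc -l counts every line, including blank lines and comments, so 239 is an upper bound on the number of applications. The sketch below counts only the lines that would launch something. It assumes one application per non-blank line and that a line starting with '#' is a comment; that is an assumption about the appfile format, not something confirmed in this thread. The function name and the sample entries are hypothetical.

```python
def count_app_contexts(appfile_text: str) -> int:
    """Count lines that look like application entries in an appfile.

    Assumption (not confirmed in this thread): each non-blank line is one
    application, and a line whose first non-space character is '#' is a
    comment.
    """
    count = 0
    for line in appfile_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines; count everything else.
        if stripped and not stripped.startswith("#"):
            count += 1
    return count


# Hypothetical two-entry appfile, used only to illustrate the counting.
sample = """# master process
-np 1 -host ln13 ./master

-np 1 -host ln10 ./slave
"""
print(count_app_contexts(sample))  # prints 2
```

If this count and the wc -l count agree, the file has no blank or comment lines and every line is an application entry.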

Is 128 the maximum number of applications that can be in a procgroup file?

Sincerely,

Waris Sindhi
High Performance Computing, TechApps
Pratt & Whitney, UTC
(860)-565-8486

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, April 27, 2011 8:09 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded

On Apr 27, 2011, at 1:31 PM, Sindhi, Waris PW wrote:

> No we do not have a firewall turned on. I can run smaller 96 slave cases
> on ln10 and ln13 included on the slavelist.
>
> Could there be another reason for this to fail ?

What is in "procgroup"? Is it a single application?

Offhand, there is nothing in OMPI that would explain the problem. The only possibility I can think of would be if your "procgroup" file contains more than 128 applications in it.

>
>
> Sincerely,
>
> Waris Sindhi
> High Performance Computing, TechApps
> Pratt & Whitney, UTC
> (860)-565-8486
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, April 27, 2011 2:18 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI out of band TCP retry exceeded
>
> Perhaps a firewall? All it is telling you is that mpirun couldn't
> establish TCP communications with the daemon on ln10.
>
>
> On Apr 27, 2011, at 11:58 AM, Sindhi, Waris PW wrote:
>
>> Hi,
>> I am getting a "oob-tcp: Communication retries exceeded" error
>> message when I run a 238 MPI slave code
>>
>>
>> /opt/openmpi/i386/bin/mpirun -mca btl_openib_verbose 1 --mca btl ^tcp
>> --mca pls_ssh_agent ssh -mca oob_tcp_peer_retries 1000 --prefix
>> /usr/lib/openmpi/1.2.8-gcc/bin -np 239 --app procgroup
>>
>> --------------------------------------------------------------------------
>> mpirun was unable to start the specified application as it encountered
>> an error:
>>
>> Error name: Unknown error: 1
>> Node: ln10
>>
>> when attempting to start process rank 234.
>> --------------------------------------------------------------------------
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0] ORTE_ERROR_LOG: Unreachable in file orted/orted_comm.c at line 130
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>> [ln13:27867] [[61748,0],0]-[[61748,0],32] oob-tcp: Communication retries exceeded. Can not communicate with peer
>>
>> Any help would be greatly appreciated.
>>
>> Sincerely,
>>
>> Waris Sindhi
>> High Performance Computing, TechApps
>> Pratt & Whitney, UTC
>> (860)-565-8486

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users