Try removing the barrier.

On May 4, 2012, at 5:52 AM, Jorge Chiva Segura wrote:

> Hi all,
>
> I have a program that executes a communication loop similar to this one:
>
>  1: for(int p1=0; p1<np; ++p1) {
>  2:   for(int p2=0; p2<np; ++p2) {
>  3:     if(me==p1) {
>  4:       if(sendSize(p2)) MPI_Ssend(sendBuffer[p2],sendSize(p2),MPI_FLOAT,p2,0,myw);
>  5:       if(recvSize(p2)) MPI_Recv(recvBuffer[p2],recvSize(p2),MPI_FLOAT,p2,0,myw,&status);
>  6:     } else if(me==p2) {
>  7:       if(recvSize(p1)) MPI_Recv(recvBuffer[p1],recvSize(p1),MPI_FLOAT,p1,0,myw,&status);
>  8:       if(sendSize(p1)) MPI_Ssend(sendBuffer[p1],sendSize(p1),MPI_FLOAT,p1,0,myw);
>  9:     }
> 10:     MPI_Barrier(myw);
> 11:   }
> 12: }
>
> The program is an iterative process that performs some calculations,
> communicates, and then continues with the next iteration. The problem is
> that after 30 successful iterations the program hangs. With padb I have
> seen that one of the processors waits at line 5 to receive data that was
> already sent, and the rest of the processors are waiting at the barrier
> in line 10. The size of the messages and buffers is the same for all
> iterations.
>
> My real program uses asynchronous communications for obvious performance
> reasons, and it worked without problems when the case to solve was
> smaller (fewer processors and less memory). For this case, however, the
> program hung, which is why I changed the communication routine to use
> synchronous communications to see where the problem is. Now I know where
> the program hangs, but I don't understand what I am doing wrong.
>
> Any suggestions?
>
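
To illustrate what a barrier-free ordering of this exchange could look like, here is a minimal sketch. It is not necessarily a fix for this particular hang. It assumes a ring schedule in which, in round `step`, every rank sends to `(me + step) % np` and receives from `(me - step + np) % np`. The helper names `ring_partners` and `schedule_is_matched` are hypothetical and do not come from the original program. The code only computes and checks the partner schedule; the MPI call that would use it is shown in a comment, reusing the buffer and size names from the quoted loop.

```c
#include <assert.h>

/* Hypothetical helper: in exchange round `step` (0 <= step < np),
 * rank `me` sends to *send_to and receives from *recv_from. */
void ring_partners(int me, int np, int step, int *send_to, int *recv_from)
{
    assert(np > 0 && me >= 0 && me < np && step >= 0 && step < np);
    *send_to   = (me + step) % np;
    *recv_from = (me - step + np) % np;
}

/* Returns 1 if, in every round, the rank that `me` sends to
 * expects to receive from `me`, so every send has a matching receive. */
int schedule_is_matched(int np)
{
    for (int step = 0; step < np; ++step) {
        for (int me = 0; me < np; ++me) {
            int to, from, peer_to, peer_from;
            ring_partners(me, np, step, &to, &from);
            ring_partners(to, np, step, &peer_to, &peer_from);
            (void)peer_to;  /* only the peer's receive side is checked */
            if (peer_from != me)
                return 0;
        }
    }
    return 1;
}

/* Assumed usage inside the MPI program, one call per round, no barrier:
 *
 *   ring_partners(me, np, step, &to, &from);
 *   MPI_Sendrecv(sendBuffer[to],   sendSize(to),   MPI_FLOAT, to,   0,
 *                recvBuffer[from], recvSize(from), MPI_FLOAT, from, 0,
 *                myw, &status);
 */
```

Because `MPI_Sendrecv` posts the send and the receive together, no rank has to order its `MPI_Ssend` and `MPI_Recv` calls by hand, and zero-sized messages can be sent as-is instead of being skipped. Round 0 pairs each rank with itself; it can be skipped or replaced by a local copy.
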
> More specific data of the case and cluster:
>   Number of processors: 320
>   Max size of the message: 6800 floats (27200 bytes)
>   Number of cores per node: 32
>   File system: Lustre
>   Resource manager: SLURM
>   OMPI version: 1.4.4
>   Operating system: Ubuntu 10.04.4 LTS
>   Kernel: RHEL 6.2 2.6.32-220.4.2
>   InfiniBand: OFED 1.4.2
>   InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
>
> Thank you for your time,
> Jorge

-- 
Jeff Squyres
jsquy...@cisco.com