Try removing the barrier.
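
To spell that out: the same loop with the per-pair barrier dropped (a minimal sketch, untested, assuming the same sendSize()/recvSize() helpers, sendBuffer/recvBuffer arrays, "me", "np", "status", and the "myw" communicator from the quoted code):

    for (int p1 = 0; p1 < np; ++p1) {
        for (int p2 = 0; p2 < np; ++p2) {
            if (me == p1) {
                /* rank p1 exchanges with rank p2 */
                if (sendSize(p2)) MPI_Ssend(sendBuffer[p2], sendSize(p2), MPI_FLOAT, p2, 0, myw);
                if (recvSize(p2)) MPI_Recv(recvBuffer[p2], recvSize(p2), MPI_FLOAT, p2, 0, myw, &status);
            } else if (me == p2) {
                /* rank p2 does the matching receive/send in the opposite order */
                if (recvSize(p1)) MPI_Recv(recvBuffer[p1], recvSize(p1), MPI_FLOAT, p1, 0, myw, &status);
                if (sendSize(p1)) MPI_Ssend(sendBuffer[p1], sendSize(p1), MPI_FLOAT, p1, 0, myw);
            }
            /* no MPI_Barrier(myw) here */
        }
    }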

On May 4, 2012, at 5:52 AM, Jorge Chiva Segura wrote:

> Hi all,
> 
> I have a program that executes a communication loop similar to this one:
> 
> 1:    for(int p1=0; p1<np; ++p1) {
> 2:        for(int p2=0; p2<np; ++p2) {
> 3:            if(me==p1) {
> 4:                if(sendSize(p2)) MPI_Ssend(sendBuffer[p2],sendSize(p2),MPI_FLOAT,p2,0,myw);
> 5:                if(recvSize(p2)) MPI_Recv(recvBuffer[p2],recvSize(p2),MPI_FLOAT,p2,0,myw,&status);
> 6:            } else if(me==p2) {
> 7:                if(recvSize(p1)) MPI_Recv(recvBuffer[p1],recvSize(p1),MPI_FLOAT,p1,0,myw,&status);
> 8:                if(sendSize(p1)) MPI_Ssend(sendBuffer[p1],sendSize(p1),MPI_FLOAT,p1,0,myw);
> 9:            }
> 10:           MPI_Barrier(myw);
> 11:       }
> 12:   }
> 
> The program is an iterative process: it does some calculations, communicates,
> and then continues with the next iteration. The problem is that after 30
> successful iterations the program hangs. With padb I have seen that one of the
> processors waits at line 5 for the reception of data that has already been
> sent, while the rest of the processors wait at the barrier on line 10. The
> size of the messages and buffers is the same for all iterations.
> 
> My real program uses asynchronous communications for obvious performance
> reasons and worked without problems when the case to solve was smaller (fewer
> processors and less memory), but for this case the program hung, so I rewrote
> the communication routine with synchronous communications to see where the
> problem is. Now I know where the program hangs, but I don't understand what I
> am doing wrong.
> 
> Any suggestions?
> 
> More specific data about the case and cluster:
> Number of processors: 320
> Maximum message size: 6800 floats (27200 bytes)
> Number of cores per node: 32
> File system: Lustre
> Resource manager: SLURM
> Open MPI version: 1.4.4
> Operating system: Ubuntu 10.04.4 LTS
> Kernel: RHEL 6.2 2.6.32-220.4.2
> InfiniBand stack: OFED 1.4.2
> InfiniBand HCA: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB
> QDR / 10GigE] (rev b0)
> 
> Thank you for your time,
> Jorge 
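
For the nonblocking version of the exchange mentioned above, the usual pattern is to post all receives, then all sends, and complete everything with a single MPI_Waitall. A minimal sketch (not the poster's actual code; it assumes the same sendSize()/recvSize() helpers, buffers, "me", "np", and "myw" communicator as the quoted loop):

    MPI_Request reqs[2 * np];   /* at most one receive and one send per peer */
    int nreq = 0;

    /* post all receives first */
    for (int p = 0; p < np; ++p) {
        if (p != me && recvSize(p))
            MPI_Irecv(recvBuffer[p], recvSize(p), MPI_FLOAT, p, 0, myw, &reqs[nreq++]);
    }
    /* then post all sends */
    for (int p = 0; p < np; ++p) {
        if (p != me && sendSize(p))
            MPI_Isend(sendBuffer[p], sendSize(p), MPI_FLOAT, p, 0, myw, &reqs[nreq++]);
    }
    /* complete all outstanding requests */
    MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);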


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

