vasilis wrote:
Dear Open MPI users,
I am trying to develop a code that runs in parallel with Open MPI (version 1.3.2).
The code is written in Fortran 90, and I am running it on a cluster.
If I use 2 CPUs the program runs fine, but for a larger number of CPUs I get the
following error:
[compute-2-6.local:18491] *** An error occurred in MPI_Recv
[compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
[compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
[compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Here is the part of the code that this error refers to:
if( mumps_par%MYID .eq. 0 ) THEN
   res = res + res_cpu
   do iw = 1, total_elem_cpu*unique
      jacob(iw)        = jacob(iw)        + jacob_cpu(iw)
      position_col(iw) = position_col(iw) + col_cpu(iw)
      position_row(iw) = position_row(iw) + row_cpu(iw)
   end do
   do jw = 1, nsize-1
      call MPI_recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
      call MPI_recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
      call MPI_recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
      call MPI_recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
      res = res + res_cpu
      do iw = 1, total_elem_cpu*unique
         jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
            jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) + jacob_cpu(iw)
         position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
            position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) + col_cpu(iw)
         position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
            position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) + row_cpu(iw)
      end do
   end do
else
   call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request1, ierr)
   call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request2, ierr)
   call MPI_Isend(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request3, ierr)
   call MPI_Isend(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request4, ierr)
   call MPI_Wait(request1, status1, ierr)
   call MPI_Wait(request2, status2, ierr)
   call MPI_Wait(request3, status3, ierr)
   call MPI_Wait(request4, status4, ierr)
end if
I am also using the MUMPS library.
Could someone help me track this error down? It is really annoying to be limited
to only two processors.
The cluster has about 8 nodes, each with 4 dual-core CPUs. I tried to run the
code on a single node with more than 2 CPUs, but I got the same error!
I think the error message means that the received message was longer
than the receive buffer that was specified. If I look at your code and
try to reason about its correctness, I think of the message-passing
portion as looking like this:
if( mumps_par%MYID .eq. 0 ) THEN
   do jw = 1, nsize-1
      call MPI_recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
      call MPI_recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
      call MPI_recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
      call MPI_recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
   end do
else
   call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
   call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
   call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
   call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
end if
If you're running on two processes, the messages arrive in the order you expect.
With more than two processes, however, messages can start arriving "out of
order", and your indiscriminate use of MPI_ANY_SOURCE and MPI_ANY_TAG will get
them mixed up. You won't necessarily get all four messages from one rank, then
all four from another, and so on. Rather, messages from the different senders
can arrive interleaved, while you interpret them in a fixed order.
Here is what I mean. Let's say you have 3 processes. Rank 0 will receive
8 messages: 4 from rank 1 and 4 from rank 2 (that is, rank 1 and rank 2 each
send 4 messages to rank 0). Here is one possible order in which the messages
are received:
jacob_cpu from rank 1
jacob_cpu from rank 2
res_cpu from rank 1
row_cpu from rank 1
res_cpu from rank 2
row_cpu from rank 2
col_cpu from rank 2
col_cpu from rank 1
Rank 0, however, unpacks them in the fixed order prescribed in your code, so
data gets misinterpreted. More to the point here, some of the time you will be
receiving a message into a buffer of the wrong size: for example, the receive
posted with count total_unknowns (meant for res_cpu) can be matched by a
jacob_cpu message of total_elem_cpu*unique elements, and if that message is
longer than the posted count you get exactly the MPI_ERR_TRUNCATE you see.
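If it helps to see the failure in isolation, here is a minimal, hypothetical
reproducer (not taken from your code) of the same condition: a receive posted
with a smaller count than the message that happens to match it.

   ! Hypothetical reproducer: run with at least 2 processes,
   ! e.g.  mpif90 truncate_demo.f90 && mpirun -np 2 ./a.out
   program truncate_demo
     implicit none
     include 'mpif.h'
     integer :: ierr, rank, status(MPI_STATUS_SIZE)
     double precision :: sendbuf(8), recvbuf(4)

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

     if (rank == 0) then
        ! The posted receive allows at most 4 elements ...
        call MPI_Recv(recvbuf, 4, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                      MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        sendbuf = 1.0d0
        ! ... but the matching message carries 8, so the receive is truncated.
        call MPI_Send(sendbuf, 8, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
     end if

     call MPI_Finalize(ierr)
   end program truncate_demo

With the default MPI_ERRORS_ARE_FATAL handler this should abort with the same
MPI_ERR_TRUNCATE message you quoted.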
Maybe you should use tags to distinguish between the different types of
messages you're trying to send.
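For instance, here is a sketch of the message-passing portion using a fixed tag
per message type (the tag constants and the offset variable are names I made up;
everything else reuses your variables). After the first receive, the sender's
rank is read out of status1 and the remaining three receives are restricted to
that source, so all four arrays are attributed to the same rank:

   ! Sketch only; put these with your other declarations.
   integer, parameter :: TAG_JACOB = 1, TAG_RES = 2, TAG_ROW = 3, TAG_COL = 4
   integer :: src, offset

   if( mumps_par%MYID .eq. 0 ) THEN
      do jw = 1, nsize-1
         ! Take the Jacobian from whichever rank is ready first ...
         call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                       MPI_ANY_SOURCE, TAG_JACOB, MPI_COMM_WORLD, status1, ierr)
         src = status1(MPI_SOURCE)
         ! ... then insist the other three messages come from that same rank.
         call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                       src, TAG_RES, MPI_COMM_WORLD, status2, ierr)
         call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                       src, TAG_ROW, MPI_COMM_WORLD, status3, ierr)
         call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                       src, TAG_COL, MPI_COMM_WORLD, status4, ierr)

         res = res + res_cpu
         offset = src*total_elem_cpu*unique
         do iw = 1, total_elem_cpu*unique
            jacob(offset+iw)        = jacob(offset+iw)        + jacob_cpu(iw)
            position_col(offset+iw) = position_col(offset+iw) + col_cpu(iw)
            position_row(offset+iw) = position_row(offset+iw) + row_cpu(iw)
         end do
      end do
   else
      call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                    0, TAG_JACOB, MPI_COMM_WORLD, ierr)
      call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    0, TAG_RES, MPI_COMM_WORLD, ierr)
      call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    0, TAG_ROW, MPI_COMM_WORLD, ierr)
      call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    0, TAG_COL, MPI_COMM_WORLD, ierr)
   end if

With the source pinned after the first receive, it no longer matters how
messages from different ranks interleave: each pass through the loop drains
exactly one rank's four messages, and each receive is posted with the count
that matches the message it can accept.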