vasilis wrote:

Dear Open MPI users,

I am trying to develop a code that runs in parallel with Open MPI (version 1.3.2). The code is written in Fortran 90, and I am running it on a cluster.

If I use 2 CPUs the program runs fine, but for a larger number of CPUs I get the following error:

[compute-2-6.local:18491] *** An error occurred in MPI_Recv
[compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
[compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
[compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Here is the part of the code that this error refers to:
if( mumps_par%MYID .eq. 0 ) THEN
        res=res+res_cpu
        do iw=1,total_elem_cpu*unique
                jacob(iw)=jacob(iw)+jacob_cpu(iw)
                position_col(iw)=position_col(iw)+col_cpu(iw)
                position_row(iw)=position_row(iw)+row_cpu(iw)
        end do

        do jw=1,nsize-1
                call MPI_recv(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status1,ierr)
                call MPI_recv(res_cpu,total_unknowns,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status2,ierr)
                call MPI_recv(row_cpu,total_elem_cpu*unique,MPI_INTEGER,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status3,ierr)
                call MPI_recv(col_cpu,total_elem_cpu*unique,MPI_INTEGER,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status4,ierr)
                res=res+res_cpu
                do iw=1,total_elem_cpu*unique
                        jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw)= &
                                jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw)+jacob_cpu(iw)
                        position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw)= &
                                position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw)+col_cpu(iw)
                        position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw)= &
                                position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw)+row_cpu(iw)
                end do
        end do
else
        call MPI_Isend(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,0,mumps_par%MYID,MPI_COMM_WORLD,request1,ierr)
        call MPI_Isend(res_cpu,total_unknowns,MPI_DOUBLE_PRECISION,0,mumps_par%MYID,MPI_COMM_WORLD,request2,ierr)
        call MPI_Isend(row_cpu,total_elem_cpu*unique,MPI_INTEGER,0,mumps_par%MYID,MPI_COMM_WORLD,request3,ierr)
        call MPI_Isend(col_cpu,total_elem_cpu*unique,MPI_INTEGER,0,mumps_par%MYID,MPI_COMM_WORLD,request4,ierr)
        call MPI_Wait(request1, status1, ierr)
        call MPI_Wait(request2, status2, ierr)
        call MPI_Wait(request3, status3, ierr)
        call MPI_Wait(request4, status4, ierr)
end if


I am also using the MUMPS library.

Could someone help me track down this error? It is really annoying to be limited to only two processors. The cluster has about 8 nodes, each with 4 dual-core CPUs. I tried running the code on a single node with more than 2 CPUs, but I got the same error!

I think the error message means that the received message was longer than the receive buffer posted for it. Looking at your code and trying to reason about its correctness, I think of the message-passing portion as looking like this:

if( mumps_par%MYID .eq. 0 ) THEN
   do jw=1,nsize-1
      call MPI_recv(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status1,ierr)
      call MPI_recv(res_cpu,total_unknowns,MPI_DOUBLE_PRECISION,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status2,ierr)
      call MPI_recv(row_cpu,total_elem_cpu*unique,MPI_INTEGER,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status3,ierr)
      call MPI_recv(col_cpu,total_elem_cpu*unique,MPI_INTEGER,MPI_ANY_SOURCE,MPI_ANY_TAG,MPI_COMM_WORLD,status4,ierr)
   end do
else
   call MPI_Send(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,0,mumps_par%MYID,MPI_COMM_WORLD,ierr)
   call MPI_Send(res_cpu,total_unknowns,MPI_DOUBLE_PRECISION,0,mumps_par%MYID,MPI_COMM_WORLD,ierr)
   call MPI_Send(row_cpu,total_elem_cpu*unique,MPI_INTEGER,0,mumps_par%MYID,MPI_COMM_WORLD,ierr)
   call MPI_Send(col_cpu,total_elem_cpu*unique,MPI_INTEGER,0,mumps_par%MYID,MPI_COMM_WORLD,ierr)
end if

If you're running on two processes, the messages arrive in the order you expect. With more than two processes, however, messages will certainly start arriving "out of order", and your indiscriminate use of MPI_ANY_SOURCE and MPI_ANY_TAG will get them mixed up. You won't simply get all the messages from one rank, then all from another, and so on. Rather, the messages from the different senders will arrive interleaved, yet you interpret them in a fixed order.

Here is what I mean. Say you have 3 processes. Rank 0 will receive 8 messages: 4 from rank 1 and 4 from rank 2; correspondingly, rank 1 and rank 2 will each send 4 messages to rank 0. Here is one possible order in which the messages are received:

jacob_cpu from rank 1
jacob_cpu from rank 2
res_cpu from rank 1
row_cpu from rank 1
res_cpu from rank 2
row_cpu from rank 2
col_cpu from rank 2
col_cpu from rank 1

Rank 0, however, unpacks these in the fixed order you prescribed in your code, so data will get misinterpreted. More to the point here, some of the time you will be receiving a message into a buffer of the wrong size, which is exactly what MPI_ERR_TRUNCATE reports.
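
As a stripped-down illustration (a hypothetical toy program, not taken from your code), this is all it takes to reproduce that abort: the posted receive is smaller than the matching message.

   program truncate_demo
      implicit none
      include 'mpif.h'
      integer :: rank, ierr, status(MPI_STATUS_SIZE)
      double precision :: big(100), small(10)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      if (rank == 1) then
         big = 1.0d0
         ! Rank 1 sends 100 doubles ...
         call MPI_Send(big, 100, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
      else if (rank == 0) then
         ! ... but rank 0 only posted room for 10, so the message is longer
         ! than the receive buffer and the job aborts with MPI_ERR_TRUNCATE.
         call MPI_Recv(small, 10, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                       MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
      end if
      call MPI_Finalize(ierr)
   end program truncate_demo

Run that on two ranks (mpirun -np 2) and you get the same kind of MPI_ERR_TRUNCATE / MPI_ERRORS_ARE_FATAL abort you quoted. In your code the mismatch happens whenever a longer message (e.g. jacob_cpu, of length total_elem_cpu*unique) happens to match a receive posted with a smaller count (e.g. the res_cpu receive of length total_unknowns).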

Maybe you should use tags to distinguish between the different types of messages you're trying to send.
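
For example, something along these lines (untested, reusing your variable names and your accumulation loop; TAG_JACOB, TAG_RES, TAG_ROW, TAG_COL and src are new names introduced here for illustration, and the other declarations are assumed to exist) gives each message type its own tag and pins the last three receives to whichever rank supplied the matching jacob_cpu:

   integer, parameter :: TAG_JACOB=1, TAG_RES=2, TAG_ROW=3, TAG_COL=4
   integer :: src

   if( mumps_par%MYID .eq. 0 ) THEN
      do jw=1,nsize-1
         ! Take the Jacobian chunk from whichever rank is ready first ...
         call MPI_Recv(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION, &
                       MPI_ANY_SOURCE,TAG_JACOB,MPI_COMM_WORLD,status1,ierr)
         src = status1(MPI_SOURCE)
         ! ... then require the remaining pieces to come from that same rank,
         ! so every posted buffer matches the size of the message it receives.
         call MPI_Recv(res_cpu,total_unknowns,MPI_DOUBLE_PRECISION, &
                       src,TAG_RES,MPI_COMM_WORLD,status2,ierr)
         call MPI_Recv(row_cpu,total_elem_cpu*unique,MPI_INTEGER, &
                       src,TAG_ROW,MPI_COMM_WORLD,status3,ierr)
         call MPI_Recv(col_cpu,total_elem_cpu*unique,MPI_INTEGER, &
                       src,TAG_COL,MPI_COMM_WORLD,status4,ierr)
         res=res+res_cpu
         do iw=1,total_elem_cpu*unique
            jacob(src*total_elem_cpu*unique+iw)= &
                 jacob(src*total_elem_cpu*unique+iw)+jacob_cpu(iw)
            position_row(src*total_elem_cpu*unique+iw)= &
                 position_row(src*total_elem_cpu*unique+iw)+row_cpu(iw)
            position_col(src*total_elem_cpu*unique+iw)= &
                 position_col(src*total_elem_cpu*unique+iw)+col_cpu(iw)
         end do
      end do
   else
      call MPI_Send(jacob_cpu,total_elem_cpu*unique,MPI_DOUBLE_PRECISION,0, &
                    TAG_JACOB,MPI_COMM_WORLD,ierr)
      call MPI_Send(res_cpu,total_unknowns,MPI_DOUBLE_PRECISION,0, &
                    TAG_RES,MPI_COMM_WORLD,ierr)
      call MPI_Send(row_cpu,total_elem_cpu*unique,MPI_INTEGER,0, &
                    TAG_ROW,MPI_COMM_WORLD,ierr)
      call MPI_Send(col_cpu,total_elem_cpu*unique,MPI_INTEGER,0, &
                    TAG_COL,MPI_COMM_WORLD,ierr)
   end if

With fixed tags, and with a fixed source for the last three receives, each posted buffer can only match a message of the expected type and length, so both the truncation error and the silent mixing of data between ranks go away.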
