vasilis wrote:
Dear Open MPI users,
I am trying to develop a code that runs in parallel with Open MPI (version 1.3.2).
The code is written in Fortran 90, and I am running it on a cluster.
If I use 2 CPUs the program runs fine, but for a larger number of CPUs I get the
following error:
[compute-2-6.local:18491] *** An error occurred in MPI_Recv
[compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
[compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
[compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Here is the part of the code that this error refers to:
if( mumps_par%MYID .eq. 0 ) THEN
   res = res + res_cpu
   do iw = 1, total_elem_cpu*unique
      jacob(iw)        = jacob(iw)        + jacob_cpu(iw)
      position_col(iw) = position_col(iw) + col_cpu(iw)
      position_row(iw) = position_row(iw) + row_cpu(iw)
   end do
   do jw = 1, nsize-1
      call MPI_recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
      call MPI_recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
      call MPI_recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
      call MPI_recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
      res = res + res_cpu
      do iw = 1, total_elem_cpu*unique
         jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
            jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) + jacob_cpu(iw)
         position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
            position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) + col_cpu(iw)
         position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
            position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) + row_cpu(iw)
      end do
   end do
else
   call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request1, ierr)
   call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request2, ierr)
   call MPI_Isend(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request3, ierr)
   call MPI_Isend(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                  0, mumps_par%MYID, MPI_COMM_WORLD, request4, ierr)
   call MPI_Wait(request1, status1, ierr)
   call MPI_Wait(request2, status2, ierr)
   call MPI_Wait(request3, status3, ierr)
   call MPI_Wait(request4, status4, ierr)
end if
I am also using the MUMPS library.
Could someone help me track this error down? It is really annoying to be limited
to only two processors.
The cluster has about 8 nodes, each with 4 dual-core CPUs. I tried to run the
code on a single node with more than 2 CPUs, but I got the same error!
I think the error message means that the received message was longer
than the receive buffer that was specified. If I look at your code and
try to reason about its correctness, I think of the message-passing
portion as looking like this:
if( mumps_par%MYID .eq. 0 ) THEN
   do jw = 1, nsize-1
      call MPI_recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
      call MPI_recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
      call MPI_recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
      call MPI_recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
   end do
else
   call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
   call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
   call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
   call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                 0, mumps_par%MYID, MPI_COMM_WORLD, ierr)
end if
If you're running on two processes, the messages arrive in the order you expect.
With more than two processes, however, messages can start arriving "out of
order", and your indiscriminate use of MPI_ANY_SOURCE and MPI_ANY_TAG will get
them mixed up. You won't necessarily get all four messages from one rank, then
all four from another, and so on. Rather, messages from the different senders
can arrive interleaved, while you interpret them in a fixed order.
Here is what I mean. Let's say you have 3 processes. Rank 0 will receive
8 messages: 4 from rank 1 and 4 from rank 2 (that is, rank 1 and rank 2 each
send 4 messages to rank 0). Here is one possible order in which the messages
are received:
jacob_cpu from rank 1
jacob_cpu from rank 2
res_cpu from rank 1
row_cpu from rank 1
res_cpu from rank 2
row_cpu from rank 2
col_cpu from rank 2
col_cpu from rank 1
Rank 0, however, unpacks them in the fixed order prescribed in your code, so
data gets misinterpreted. More to the point here, some of the time you will be
receiving a message into a buffer of the wrong size: for example, the receive
posted with count total_unknowns (meant for res_cpu) can be matched by a
jacob_cpu message of total_elem_cpu*unique elements, and if that message is
longer than the posted count you get exactly the MPI_ERR_TRUNCATE you see.
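If it helps to see the failure in isolation, here is a minimal, hypothetical
reproducer (not taken from your code) of the same condition: a receive posted
with a smaller count than the message that happens to match it.

   ! Hypothetical reproducer: run with at least 2 processes,
   ! e.g.  mpif90 truncate_demo.f90 && mpirun -np 2 ./a.out
   program truncate_demo
     implicit none
     include 'mpif.h'
     integer :: ierr, rank, status(MPI_STATUS_SIZE)
     double precision :: sendbuf(8), recvbuf(4)

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

     if (rank == 0) then
        ! The posted receive allows at most 4 elements ...
        call MPI_Recv(recvbuf, 4, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, &
                      MPI_ANY_TAG, MPI_COMM_WORLD, status, ierr)
     else if (rank == 1) then
        sendbuf = 1.0d0
        ! ... but the matching message carries 8, so the receive is truncated.
        call MPI_Send(sendbuf, 8, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
     end if

     call MPI_Finalize(ierr)
   end program truncate_demo

With the default MPI_ERRORS_ARE_FATAL handler this should abort with the same
MPI_ERR_TRUNCATE message you quoted.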
Maybe you should use tags to distinguish between the different types of
messages you're trying to send.
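For instance, here is a sketch of the message-passing portion using a fixed tag
per message type (the tag constants and the offset variable are names I made up;
everything else reuses your variables). After the first receive, the sender's
rank is read out of status1 and the remaining three receives are restricted to
that source, so all four arrays are attributed to the same rank:

   ! Sketch only; put these with your other declarations.
   integer, parameter :: TAG_JACOB = 1, TAG_RES = 2, TAG_ROW = 3, TAG_COL = 4
   integer :: src, offset

   if( mumps_par%MYID .eq. 0 ) THEN
      do jw = 1, nsize-1
         ! Take the Jacobian from whichever rank is ready first ...
         call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                       MPI_ANY_SOURCE, TAG_JACOB, MPI_COMM_WORLD, status1, ierr)
         src = status1(MPI_SOURCE)
         ! ... then insist the other three messages come from that same rank.
         call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                       src, TAG_RES, MPI_COMM_WORLD, status2, ierr)
         call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                       src, TAG_ROW, MPI_COMM_WORLD, status3, ierr)
         call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                       src, TAG_COL, MPI_COMM_WORLD, status4, ierr)

         res = res + res_cpu
         offset = src*total_elem_cpu*unique
         do iw = 1, total_elem_cpu*unique
            jacob(offset+iw)        = jacob(offset+iw)        + jacob_cpu(iw)
            position_col(offset+iw) = position_col(offset+iw) + col_cpu(iw)
            position_row(offset+iw) = position_row(offset+iw) + row_cpu(iw)
         end do
      end do
   else
      call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                    0, TAG_JACOB, MPI_COMM_WORLD, ierr)
      call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                    0, TAG_RES, MPI_COMM_WORLD, ierr)
      call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    0, TAG_ROW, MPI_COMM_WORLD, ierr)
      call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                    0, TAG_COL, MPI_COMM_WORLD, ierr)
   end if

With the source pinned after the first receive, it no longer matters how
messages from different ranks interleave: each pass through the loop drains
exactly one rank's four messages, and each receive is posted with the count
that matches the message it can accept.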