Thank you, Eugene, for your suggestion. I used a different tag for each variable, and now I no longer get that error. The problem now is that I get a different solution when I use more than 2 CPUs. I checked the matrices and found that they differ by a very small amount, on the order of 10^(-10). In fact, the solution with 4 CPUs even differs from the solution with 16 CPUs! Do you have any idea what could cause this behavior?
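Could the order of the additions be the cause? As far as I understand, floating-point addition is not associative, and with MPI_ANY_SOURCE the contributions are accumulated in whatever order the messages happen to arrive, which changes with the number of CPUs. Here is a standalone sketch of the effect I mean (no MPI; the four values are made up for illustration):

    program sum_order
       implicit none
       double precision :: a(4), s_a, s_b

       ! Four made-up partial contributions, as if from four ranks.
       a = (/ 1.0d0, 1.0d-16, -1.0d0, 1.0d-16 /)

       ! "Arrival order" 1,2,3,4 versus 3,4,1,2: the same numbers,
       ! summed in a different order, round differently.
       s_a = ((a(1) + a(2)) + a(3)) + a(4)
       s_b = ((a(3) + a(4)) + a(1)) + a(2)

       print *, 's_a        =', s_a
       print *, 's_b        =', s_b
       print *, 'difference =', s_a - s_b   ! nonzero, around 1.0e-16
    end program sum_order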
Thank you,
Vasilis

On Tuesday 26 of May 2009 7:21:32 pm you wrote:
> vasilis wrote:
> > Dear Open MPI users,
> >
> > I am trying to develop a code that runs in parallel mode with Open MPI
> > (version 1.3.2). The code is written in Fortran 90, and I am running it
> > on a cluster.
> >
> > If I use 2 CPUs the program runs fine, but for a larger number of CPUs
> > I get the following error:
> >
> > [compute-2-6.local:18491] *** An error occurred in MPI_Recv
> > [compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
> > [compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
> > [compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Here is the part of the code that this error refers to:
> >
> > if( mumps_par%MYID .eq. 0 ) then
> >    res = res + res_cpu
> >    do iw = 1, total_elem_cpu*unique
> >       jacob(iw) = jacob(iw) + jacob_cpu(iw)
> >       position_col(iw) = position_col(iw) + col_cpu(iw)
> >       position_row(iw) = position_row(iw) + row_cpu(iw)
> >    end do
> >
> >    do jw = 1, nsize-1
> >       call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
> >       call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
> >       call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
> >       call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
> >
> >       res = res + res_cpu
> >       do iw = 1, total_elem_cpu*unique
> >          jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >             jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) + jacob_cpu(iw)
> >          position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >             position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) + col_cpu(iw)
> >          position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >             position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) + row_cpu(iw)
> >       end do
> >    end do
> > else
> >    call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request1, ierr)
> >    call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request2, ierr)
> >    call MPI_Isend(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request3, ierr)
> >    call MPI_Isend(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request4, ierr)
> >    call MPI_Wait(request1, status1, ierr)
> >    call MPI_Wait(request2, status2, ierr)
> >    call MPI_Wait(request3, status3, ierr)
> >    call MPI_Wait(request4, status4, ierr)
> > end if
> >
> > I am also using the MUMPS library.
> >
> > Could someone help me track this error down? It is really annoying to
> > be limited to only two processors.
> > The cluster has about 8 nodes, and each has 4 dual-core CPUs. I tried
> > to run the code on a single node with more than 2 CPUs, but I got the
> > same error!
>
> I think the error message means that the received message was longer
> than the receive buffer that was specified. If I look at your code and
> try to reason about its correctness, I think of the message-passing
> portion as looking like this:
>
> if( mumps_par%MYID .eq. 0 ) then
>    do jw = 1, nsize-1
>       call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
>       call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
>       call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
>       call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
>    end do
> else
>    call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
>    call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
>    call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
>    call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
> end if
>
> If you're running on two processes, then the messages you receive arrive
> in the order you expect. If there are more than two processes, however,
> messages will certainly start appearing "out of order", and your
> indiscriminate use of MPI_ANY_SOURCE and MPI_ANY_TAG will start getting
> them mixed up. You won't simply get all the messages from one rank, then
> all from another, and so on. Rather, the messages from all the other
> processes will arrive interwoven, but you interpret them in a fixed
> order.
>
> Here is what I mean. Let's say you have 3 processes. Rank 0 will
> receive 8 messages: 4 from rank 1 and 4 from rank 2. Correspondingly,
> rank 1 and rank 2 will each send 4 messages to rank 0. Here is one
> possible order in which the messages are received:
>
> jacob_cpu from rank 1
> jacob_cpu from rank 2
> res_cpu from rank 1
> row_cpu from rank 1
> res_cpu from rank 2
> row_cpu from rank 2
> col_cpu from rank 2
> col_cpu from rank 1
>
> Rank 0, however, tries to unpack these in the order prescribed in your
> code, so the data get misinterpreted. More to the point here, you will
> (some of the time) be trying to receive data into buffers of the wrong
> size.
>
> Maybe you should use tags to distinguish between the different types of
> messages you're trying to send.
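P.S. For the archives, here is roughly the shape of the change I made to fix the truncation error: one fixed tag per variable, so a receive can only match a send of the same type and size. The TAG_* names and values below are just illustrative, and the unpacking loop is elided:

    integer, parameter :: TAG_JACOB = 1, TAG_RES = 2, TAG_ROW = 3, TAG_COL = 4

    if( mumps_par%MYID .eq. 0 ) then
       do jw = 1, nsize-1
          ! Each receive now matches only one message type, so a
          ! jacob-sized message can no longer land in the res buffer.
          call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                        MPI_ANY_SOURCE, TAG_JACOB, MPI_COMM_WORLD, status1, ierr)
          call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                        MPI_ANY_SOURCE, TAG_RES, MPI_COMM_WORLD, status2, ierr)
          call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                        MPI_ANY_SOURCE, TAG_ROW, MPI_COMM_WORLD, status3, ierr)
          call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                        MPI_ANY_SOURCE, TAG_COL, MPI_COMM_WORLD, status4, ierr)
          ! ... accumulate as before, using statusN(MPI_SOURCE) for the offset ...
       end do
    else
       call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
                     TAG_JACOB, MPI_COMM_WORLD, ierr)
       call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
                     TAG_RES, MPI_COMM_WORLD, ierr)
       call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
                     TAG_ROW, MPI_COMM_WORLD, ierr)
       call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
                     TAG_COL, MPI_COMM_WORLD, ierr)
    end if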