Thank you, Eugene, for your suggestion. I used a different tag for each variable, and now I no longer get that error. The problem now is that I get a different solution when I use more than 2 CPUs. I checked the matrices and found that they differ by a very small amount, on the order of 10^(-10). In fact, the solution with 4 CPUs even differs from the solution with 16 CPUs! Do you have any idea what could cause this behavior?
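Could the order of the additions be the cause? As far as I understand, floating-point addition is not associative, and with MPI_ANY_SOURCE the contributions are accumulated in whatever order the messages happen to arrive, which changes with the number of CPUs. Here is a standalone sketch of the effect I mean (no MPI; the four values are made up for illustration):

    program sum_order
       implicit none
       double precision :: a(4), s_a, s_b

       ! Four made-up partial contributions, as if from four ranks.
       a = (/ 1.0d0, 1.0d-16, -1.0d0, 1.0d-16 /)

       ! "Arrival order" 1,2,3,4 versus 3,4,1,2: the same numbers,
       ! summed in a different order, round differently.
       s_a = ((a(1) + a(2)) + a(3)) + a(4)
       s_b = ((a(3) + a(4)) + a(1)) + a(2)

       print *, 's_a        =', s_a
       print *, 's_b        =', s_b
       print *, 'difference =', s_a - s_b   ! nonzero, around 1.0e-16
    end program sum_order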
Thank you,
Vasilis

On Tuesday 26 of May 2009 7:21:32 pm you wrote:
> vasilis wrote:
> > Dear Open MPI users,
> >
> > I am trying to develop a code that runs in parallel mode with Open MPI
> > (version 1.3.2). The code is written in Fortran 90, and I am running it
> > on a cluster.
> >
> > If I use 2 CPUs the program runs fine, but for a larger number of CPUs
> > I get the following error:
> >
> > [compute-2-6.local:18491] *** An error occurred in MPI_Recv
> > [compute-2-6.local:18491] *** on communicator MPI_COMM_WORLD
> > [compute-2-6.local:18491] *** MPI_ERR_TRUNCATE: message truncated
> > [compute-2-6.local:18491] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Here is the part of the code that this error refers to:
> >
> > if( mumps_par%MYID .eq. 0 ) then
> >    res = res + res_cpu
> >    do iw = 1, total_elem_cpu*unique
> >       jacob(iw) = jacob(iw) + jacob_cpu(iw)
> >       position_col(iw) = position_col(iw) + col_cpu(iw)
> >       position_row(iw) = position_row(iw) + row_cpu(iw)
> >    end do
> >
> >    do jw = 1, nsize-1
> >       call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
> >       call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
> >       call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
> >       call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
> >                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
> >
> >       res = res + res_cpu
> >       do iw = 1, total_elem_cpu*unique
> >          jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >             jacob(status1(MPI_SOURCE)*total_elem_cpu*unique+iw) + jacob_cpu(iw)
> >          position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >             position_col(status4(MPI_SOURCE)*total_elem_cpu*unique+iw) + col_cpu(iw)
> >          position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) = &
> >             position_row(status3(MPI_SOURCE)*total_elem_cpu*unique+iw) + row_cpu(iw)
> >       end do
> >    end do
> > else
> >    call MPI_Isend(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request1, ierr)
> >    call MPI_Isend(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request2, ierr)
> >    call MPI_Isend(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request3, ierr)
> >    call MPI_Isend(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
> >                   mumps_par%MYID, MPI_COMM_WORLD, request4, ierr)
> >    call MPI_Wait(request1, status1, ierr)
> >    call MPI_Wait(request2, status2, ierr)
> >    call MPI_Wait(request3, status3, ierr)
> >    call MPI_Wait(request4, status4, ierr)
> > end if
> >
> > I am also using the MUMPS library.
> >
> > Could someone help me track this error down? It is really annoying to
> > be limited to only two processors.
> > The cluster has about 8 nodes, and each has 4 dual-core CPUs. I tried
> > to run the code on a single node with more than 2 CPUs, but I got the
> > same error!
>
> I think the error message means that the received message was longer
> than the receive buffer that was specified. If I look at your code and
> try to reason about its correctness, I think of the message-passing
> portion as looking like this:
>
> if( mumps_par%MYID .eq. 0 ) then
>    do jw = 1, nsize-1
>       call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status1, ierr)
>       call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status2, ierr)
>       call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status3, ierr)
>       call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
>                     MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, status4, ierr)
>    end do
> else
>    call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
>    call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
>    call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
>    call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
>                  mumps_par%MYID, MPI_COMM_WORLD, ierr)
> end if
>
> If you're running on two processes, then the messages you receive arrive
> in the order you expect. If there are more than two processes, however,
> messages will certainly start appearing "out of order", and your
> indiscriminate use of MPI_ANY_SOURCE and MPI_ANY_TAG will start getting
> them mixed up. You won't simply get all the messages from one rank, then
> all from another, and so on. Rather, the messages from all the other
> processes will arrive interwoven, but you interpret them in a fixed
> order.
>
> Here is what I mean. Let's say you have 3 processes. Rank 0 will
> receive 8 messages: 4 from rank 1 and 4 from rank 2. Correspondingly,
> rank 1 and rank 2 will each send 4 messages to rank 0. Here is one
> possible order in which the messages are received:
>
> jacob_cpu from rank 1
> jacob_cpu from rank 2
> res_cpu from rank 1
> row_cpu from rank 1
> res_cpu from rank 2
> row_cpu from rank 2
> col_cpu from rank 2
> col_cpu from rank 1
>
> Rank 0, however, tries to unpack these in the order prescribed in your
> code, so the data get misinterpreted. More to the point here, you will
> (some of the time) be trying to receive data into buffers of the wrong
> size.
>
> Maybe you should use tags to distinguish between the different types of
> messages you're trying to send.
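P.S. For the archives, here is roughly the shape of the change I made to fix the truncation error: one fixed tag per variable, so a receive can only match a send of the same type and size. The TAG_* names and values below are just illustrative, and the unpacking loop is elided:

    integer, parameter :: TAG_JACOB = 1, TAG_RES = 2, TAG_ROW = 3, TAG_COL = 4

    if( mumps_par%MYID .eq. 0 ) then
       do jw = 1, nsize-1
          ! Each receive now matches only one message type, so a
          ! jacob-sized message can no longer land in the res buffer.
          call MPI_Recv(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, &
                        MPI_ANY_SOURCE, TAG_JACOB, MPI_COMM_WORLD, status1, ierr)
          call MPI_Recv(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, &
                        MPI_ANY_SOURCE, TAG_RES, MPI_COMM_WORLD, status2, ierr)
          call MPI_Recv(row_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                        MPI_ANY_SOURCE, TAG_ROW, MPI_COMM_WORLD, status3, ierr)
          call MPI_Recv(col_cpu, total_elem_cpu*unique, MPI_INTEGER, &
                        MPI_ANY_SOURCE, TAG_COL, MPI_COMM_WORLD, status4, ierr)
          ! ... accumulate as before, using statusN(MPI_SOURCE) for the offset ...
       end do
    else
       call MPI_Send(jacob_cpu, total_elem_cpu*unique, MPI_DOUBLE_PRECISION, 0, &
                     TAG_JACOB, MPI_COMM_WORLD, ierr)
       call MPI_Send(res_cpu, total_unknowns, MPI_DOUBLE_PRECISION, 0, &
                     TAG_RES, MPI_COMM_WORLD, ierr)
       call MPI_Send(row_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
                     TAG_ROW, MPI_COMM_WORLD, ierr)
       call MPI_Send(col_cpu, total_elem_cpu*unique, MPI_INTEGER, 0, &
                     TAG_COL, MPI_COMM_WORLD, ierr)
    end if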