Thanks, George. Are persistent sends/receives matched from the start of the calculation? If so, then I guess MPI_CANCEL won’t work.
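For reference, the exchange I am describing follows the basic pattern sketched below. This is a stripped-down, self-contained example rather than our actual code; the ring of neighbors, the message size, and the tag are made up for illustration. In the real code the MPI_WAITALL is replaced by the MPI_TESTALL timeout loop quoted further down in this thread.

PROGRAM PERSISTENT_EXCHANGE_SKETCH
USE MPI
IMPLICIT NONE
INTEGER, PARAMETER :: N = 1000000     ! roughly 8 MB of double precision data
INTEGER :: IERR, MYID, NPROC, NEXT, PREV, NREQ, STEP
INTEGER :: REQ(2), STATUSES(MPI_STATUS_SIZE,2)
DOUBLE PRECISION, ALLOCATABLE :: SEND_BUF(:), RECV_BUF(:)

CALL MPI_INIT(IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
ALLOCATE(SEND_BUF(N), RECV_BUF(N))

NEXT = MOD(MYID+1, NPROC)             ! simple ring of neighbors for illustration
PREV = MOD(MYID-1+NPROC, NPROC)

! The persistent requests are created once, before the time stepping begins.
CALL MPI_SEND_INIT(SEND_BUF, N, MPI_DOUBLE_PRECISION, NEXT, 1, MPI_COMM_WORLD, REQ(1), IERR)
CALL MPI_RECV_INIT(RECV_BUF, N, MPI_DOUBLE_PRECISION, PREV, 1, MPI_COMM_WORLD, REQ(2), IERR)
NREQ = 2

! Every time step the same requests are restarted and then completed.
DO STEP = 1, 100
   SEND_BUF = DBLE(STEP)
   CALL MPI_STARTALL(NREQ, REQ, IERR)
   CALL MPI_WAITALL(NREQ, REQ, STATUSES, IERR)
ENDDO

CALL MPI_REQUEST_FREE(REQ(1), IERR)
CALL MPI_REQUEST_FREE(REQ(2), IERR)
CALL MPI_FINALIZE(IERR)
END PROGRAM PERSISTENT_EXCHANGE_SKETCH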
I don’t think Open MPI is the problem. I think there is something wrong with our cluster in that it just seems to hang up on these big packages. The calculation successfully exchanges hundreds or thousands of them before hanging.

I’m not sure I completely understand your recommendation for dumping diagnostics. Is this documented somewhere? I have sketched my understanding of the steps at the bottom of this message, below the quoted thread.

Thanks,

Kevin

From: George Bosilca [mailto:bosi...@icl.utk.edu]
Sent: Monday, April 03, 2017 2:29 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov>
Subject: Re: [OMPI users] MPI_WAIT hangs after a call to MPI_CANCEL

Kevin,

In Open MPI we only support cancelling receives that have not yet been matched. So you cannot cancel sends, nor receive requests that have already been matched. While the latter are supposed to complete (otherwise they would not have been matched), the former are trickier to complete if the corresponding receive is never posted. To sum this up, the bad news is that there is no way to correctly cancel these MPI requests without hitting a deadlock.

That being said, I can hardly understand how Open MPI could drop a message. There might be something else going on here that is more difficult to spot.

We do have an internal way to dump all pending (or known) communications. Assuming you are using the OB1 PML, here is how you dump all known communications: attach to a process, find the communicator pointer (you will need to convert between the F90 communicator and the C pointer), and then call mca_pml.pml_dump(commptr, 1).

Also, would it be possible to check how one of the more recent versions of Open MPI (> 2.1) behaves with your code?

George.

On Sat, Apr 1, 2017 at 12:40 PM, McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov> wrote:

I am running a large computational fluid dynamics code on a Linux cluster (CentOS 6.8, Open MPI 1.8.4). The code is written in Fortran and compiled with Intel Fortran 16.0.3. The cluster has 36 nodes; each node has two sockets, and each socket has six cores.

I have noticed that the code hangs when the size of the packages exchanged using a persistent send and receive call becomes large. I cannot say exactly how large, but generally on the order of 10 MB. Rather than let the code just hang, I implemented a timing loop using MPI_TESTALL. If MPI_TESTALL fails to return successfully after, say, 10 minutes, I attempt to MPI_CANCEL the unsuccessful request(s) and continue on with the calculation, even if the communication(s) did not succeed. It would not necessarily cripple the calculation if a few MPI communications were unsuccessful.

This is a snippet of the code that tests whether the communications are successful and attempts to cancel them if not:

START_TIME = MPI_WTIME()
FLAG = .FALSE.
DO WHILE(.NOT.FLAG)
   CALL MPI_TESTALL(NREQ,REQ(1:NREQ),FLAG,ARRAY_OF_STATUSES,IERR)
   WAIT_TIME = MPI_WTIME() - START_TIME
   IF (WAIT_TIME>TIMEOUT) THEN
      WRITE(LU_ERR,'(A,I6,A,A)') 'Request timed out for MPI process ',MYID,' running on ',PNAME(1:PNAMELEN)
      DO NNN=1,NREQ
         IF (ARRAY_OF_STATUSES(1,NNN)==MPI_SUCCESS) CYCLE
         CALL MPI_CANCEL(REQ(NNN),IERR)
         WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_CANCEL'
         CALL MPI_WAIT(REQ(NNN),STATUS,IERR)
         WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_WAIT'
         CALL MPI_TEST_CANCELLED(STATUS,FLAG2,IERR)
         WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_TEST_CANCELLED'
      ENDDO
   ENDIF
ENDDO

The job still hangs, and when I look at the error file, I see that on MPI process A one of the sends has not completed, and on process B one of the receives has not completed.
The failed send and failed receive are consistent; that is, they match each other. What I do not understand is that for both the uncompleted send and the uncompleted receive, the code hangs in MPI_WAIT. That is, I do not get the printout saying that the process has returned from MPI_WAIT. I interpret this to mean that part of the large message has been sent or received, but not all of it. The MPI standard seems a bit vague on what is supposed to happen if part of the message simply disappears due to some network glitch.

These errors occur after hundreds or thousands of successful exchanges. They never happen at the same point in the calculation. They are random, but they occur only when the messages are large (on the order of MBs). When the messages are not large, the code can run for days or weeks without errors.

So why does MPI_WAIT hang? The MPI standard says “If a communication is marked for cancellation, then an MPI_Wait call for that communication is guaranteed to return, irrespective of the activities of other processes (i.e., MPI_Wait behaves as a local function)” (https://www.open-mpi.org/doc/v2.0/man3/MPI_Cancel.3.php).

Could the problem be with my cluster, in that the large message is broken up into smaller packets, one of these packets disappears, and there is no way to cancel it? That is really what I am looking for: a way to cancel the failed communication but still continue the calculation.
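P.S. Here is my rough understanding of the dump procedure described above, written out as a sketch. The PID and the Fortran communicator handle are placeholders, and I have not verified that MPI_Comm_f2c can be called from gdb against our Open MPI 1.8.4 build (it may need debug symbols or an explicit cast):

# attach to one of the hung ranks
gdb -p <pid_of_hung_rank>

# convert the Fortran (INTEGER) communicator handle to the C communicator pointer
(gdb) set $commptr = MPI_Comm_f2c(<fortran_comm_handle>)

# ask the OB1 PML to dump everything it knows about pending communications on that communicator
(gdb) call mca_pml.pml_dump($commptr, 1)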
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users