Kevin,

You are right, changing to MPI_TEST only hides the issue. A message is
dropped, and the corresponding request will therefore never finish.

Your error messages indicate that at least 2 processes have issues sending
data to burn005-ib. Has that node received messages earlier in your run?
If so, could the process on that node somehow have reached MPI_Finalize and
started to tear down connections to its peers?

  George.



On Wed, Apr 5, 2017 at 4:16 PM, McGrattan, Kevin B. Dr. (Fed) <
kevin.mcgrat...@nist.gov> wrote:

> George
>
>
>
> Thanks for the advice. I still don’t know what’s wrong with my cluster. I
> get errors like this:
>
>
>
> [[39827,1],182][btl_openib_component.c:3497:handle_wc] from burn001 to:
> burn005-ib error polling LP CQ with status RETRY EXCEEDED ERROR status
> number 12 for wr_id 2b428467dc00 opcode 1  vendor error 0 qp_idx 0
>
> [[39827,1],114][btl_openib_component.c:3497:handle_wc] from burn023 to:
> burn005-ib error polling HP CQ with status WORK REQUEST FLUSHED ERROR
> status number 5 for wr_id 8dd8080 opcode 128  vendor error 0 qp_idx 0
>
>
>
> I did some searching on these error messages, and I think they imply
> there’s something amiss with our IB fabric. But I am able to bypass some of
> the timeouts by doing this:
>
>
>
> CALL MPI_CANCEL
>
> CALL MPI_TEST
>
> CALL MPI_TEST_CANCELLED
>
>
>
> I don’t think that the calls to MPI_TEST or MPI_TEST_CANCELLED do
> anything, but at least they don’t block. I am going to see if I can just
> ignore a dropped packet now and again, or try to figure out what’s wrong
> with our IB.
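>
> For concreteness, here is roughly how that sequence slots into the timeout
> loop shown further down in this thread. This is only a sketch: COMPLETED and
> CANCELLED are two extra LOGICAL variables, and the other names are the ones
> from the snippet below.
>
>    CALL MPI_CANCEL(REQ(NNN),IERR)                 ! ask Open MPI to cancel the request
>    CALL MPI_TEST(REQ(NNN),COMPLETED,STATUS,IERR)  ! non-blocking, unlike MPI_WAIT
>    IF (COMPLETED) THEN                            ! STATUS is defined only when the test completes
>       CALL MPI_TEST_CANCELLED(STATUS,CANCELLED,IERR)
>       IF (CANCELLED) WRITE(LU_ERR,*) 'Request ',NNN,' was cancelled'
>    ELSE
>       WRITE(LU_ERR,*) 'Request ',NNN,' still pending after MPI_CANCEL'
>    ENDIF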
>
>
>
> Thanks
>
>
>
> Kevin
>
>
>
> *From:* George Bosilca [mailto:bosi...@icl.utk.edu]
> *Sent:* Monday, April 03, 2017 5:59 PM
> *To:* McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov>
> *Cc:* Open MPI Users <users@lists.open-mpi.org>
>
> *Subject:* Re: [OMPI users] MPI_WAIT hangs after a call to MPI_CANCEL
>
>
>
> On Mon, Apr 3, 2017 at 4:47 PM, McGrattan, Kevin B. Dr. (Fed) <
> kevin.mcgrat...@nist.gov> wrote:
>
> Thanks, George.
>
>
>
> Are persistent send/receives matched from the start of the calculation? If
> so, then I guess MPI_CANCEL won’t work.
>
>
>
> A persistent request is only matched when it is started. The MPI_Cancel on
> a persistent receive doesn't affect the persistent request itself, but
> instead only cancels the started instance of the request.
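>
> In Fortran the lifecycle looks roughly like this (a sketch with hypothetical
> BUF, N, PEER, TAG and COMM variables, not taken from your code):
>
> CALL MPI_RECV_INIT(BUF,N,MPI_DOUBLE_PRECISION,PEER,TAG,COMM,REQ,IERR)  ! create the persistent request; nothing is matched yet
> CALL MPI_START(REQ,IERR)                        ! this started instance can now be matched
> CALL MPI_CANCEL(REQ,IERR)                       ! cancels only this (still unmatched) instance
> CALL MPI_WAIT(REQ,STATUS,IERR)                  ! completes the cancelled instance; REQ stays valid
> CALL MPI_TEST_CANCELLED(STATUS,CANCELLED,IERR)  ! CANCELLED is .TRUE. if the cancel took effect
> CALL MPI_START(REQ,IERR)                        ! the persistent request itself is untouched and can be restarted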
>
>
>
> I don’t think Open MPI is the problem. I think there is something wrong
> with our cluster in that it just seems to hang up on these big messages.
> The calculation successfully exchanges hundreds or thousands of them before
> just hanging.
>
>
>
> While possible, it is highly unlikely that a message gets dropped by the
> network without some kind of warning (in the system log, at least). You
> might want to take a look at dmesg to see whether there is anything
> unexpected there.
>
>
>
> I’m not sure I completely understand your recommendation for dumping
> diagnostics. Is this documented somewhere?
>
>
>
> Unfortunately not; this is basically a developer trick to dump the state
> of the MPI library. It goes a little like this. Once you have attached a
> debugger to your process (let's assume gdb), you need to find the index of
> the communicator where you have posted your requests (I can't help here, as
> this is not part of the code you sent). With <communicator_index> set to
> this value:
>
>
>
> gdb$ p ompi_comm_f_to_c_table.addr[<communicator_index>]
>
>
>
> will give you the C pointer of the communicator.
>
>
>
> gdb$ call mca_pml.pml_dump( ompi_comm_f_to_c_table.addr[<communicator_index>], 1)
>
>
>
> should print all the messages known locally to the MPI library, including
> pending sends and receives. This will also print additional information
> (the status of the requests, the tag, the size, and so on) that can be
> understood by the developers. If you post the output here, we might be able
> to provide additional insight into the issue.
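>
> Putting the steps together, a session might look roughly like this (<pid> is
> the process you attach to, <communicator_index> the value discussed above):
>
> $ gdb -p <pid>
> gdb$ p ompi_comm_f_to_c_table.addr[<communicator_index>]
> gdb$ call mca_pml.pml_dump( ompi_comm_f_to_c_table.addr[<communicator_index>], 1)
> gdb$ detach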
>
>
>
> George.
>
>
>
>
>
>
>
> Thanks
>
>
>
> Kevin
>
>
>
>
>
>
>
> *From:* George Bosilca [mailto:bosi...@icl.utk.edu]
> *Sent:* Monday, April 03, 2017 2:29 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov>
> *Subject:* Re: [OMPI users] MPI_WAIT hangs after a call to MPI_CANCEL
>
>
>
> Kevin,
>
>
>
> In Open MPI we only support cancelling receives that have not yet been
> matched. So you cannot cancel sends, nor receive requests that have already
> been matched. While the latter are supposed to complete (otherwise they
> would not have been matched), the former are trickier to complete if the
> corresponding receive is never posted.
>
>
>
> To sum this up, the bad news is that there is no way to correctly cancel
> such MPI requests without hitting a deadlock.
>
>
>
> That being said, I can hardly see how Open MPI could drop a message.
> There might be something else going on here that is more difficult to spot.
> We do have an internal way to dump all pending (or known) communications.
> Assuming you are using the OB1 PML, here is how you dump all known
> communications: attach to a process, find the communicator pointer (you
> will need to convert the F90 communicator handle to the C pointer), and
> then call mca_pml.pml_dump( commptr, 1).
>
>
>
> Also, would it be possible to check how one of the more recent versions of
> Open MPI (> 2.1) behaves with your code?
>
>
>
>   George.
>
>
>
>
>
>
>
>
>
> On Sat, Apr 1, 2017 at 12:40 PM, McGrattan, Kevin B. Dr. (Fed) <
> kevin.mcgrat...@nist.gov> wrote:
>
> I am running a large computational fluid dynamics code on a Linux cluster
> (CentOS 6.8, Open MPI 1.8.4). The code is written in Fortran and compiled
> with Intel Fortran 16.0.3. The cluster has 36 nodes; each node has two
> sockets, and each socket has six cores. I have noticed that the code hangs
> when the size of the messages exchanged using persistent send and receive
> calls becomes large. I cannot say exactly how large, but generally on the
> order of 10 MB. Rather than let the code just hang, I implemented a timing
> loop using MPI_TESTALL. If MPI_TESTALL fails to return successfully after,
> say, 10 minutes, I attempt to MPI_CANCEL the unsuccessful request(s) and
> continue on with the calculation, even if the communication(s) did not
> succeed. It would not necessarily cripple the calculation if a few MPI
> communications were unsuccessful. This is a snippet of code that tests if
> the communications are successful and attempts to cancel if not:
>
>
>
>    START_TIME = MPI_WTIME()
>    FLAG = .FALSE.
>    DO WHILE(.NOT.FLAG)
>       CALL MPI_TESTALL(NREQ,REQ(1:NREQ),FLAG,ARRAY_OF_STATUSES,IERR)
>       WAIT_TIME = MPI_WTIME() - START_TIME
>       IF (WAIT_TIME>TIMEOUT) THEN
>          WRITE(LU_ERR,'(A,I6,A,A)') 'Request timed out for MPI process ',MYID,' running on ',PNAME(1:PNAMELEN)
>          DO NNN=1,NREQ
>             IF (ARRAY_OF_STATUSES(1,NNN)==MPI_SUCCESS) CYCLE
>             CALL MPI_CANCEL(REQ(NNN),IERR)
>             WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_CANCEL'
>             CALL MPI_WAIT(REQ(NNN),STATUS,IERR)
>             WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_WAIT'
>             CALL MPI_TEST_CANCELLED(STATUS,FLAG2,IERR)
>             WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_TEST_CANCELLED'
>          ENDDO
>       ENDIF
>    ENDDO
>
>
>
> The job still hangs, and when I look at the error file, I see that on MPI
> process A, one of the sends has not completed, and on process B, one of the
> receives has not completed. The failed send and failed receive are
> consistent; that is, they match. What I do not understand is that for both
> the uncompleted send and the uncompleted receive, the code hangs in
> MPI_WAIT. That is, I never get the printout that says that the process has
> returned from MPI_WAIT. I interpret this to mean that some of the large
> message has been sent or received, but not all of it. The MPI standard
> seems a bit vague on what is supposed to happen if part of the message
> simply disappears due to some network glitch. These errors occur after
> hundreds or thousands of successful exchanges. They never happen at the
> same point in the calculation. They are random, but they occur only when
> the messages are large (on the order of MBs). When the messages are not
> large, the code can run for days or weeks without errors.
>
>
>
> So why does MPI_WAIT hang? The MPI standard says:
>
>
>
> “If a communication is marked for cancellation, then an MPI_Wait call for
> that communication is guaranteed to return, irrespective of the activities
> of other processes (i.e., MPI_Wait behaves as a local function)”
> (https://www.open-mpi.org/doc/v2.0/man3/MPI_Cancel.3.php).
>
>
>
>
> Could the problem be with my cluster – in that the large message is broken
> up into smaller packets, and one of these packets disappears and there is
> no way to cancel it? That’s really what I am looking for – a way to cancel
> the failed communication but still continue the calculation.
>
>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
