Kevin,

You are right, changing to MPI_TEST only hides the issue. A message is dropped, and the corresponding request will therefore never finish.
Your error message indicates that at least two processes have trouble sending data to burn005-ib. Had that node received messages earlier in your run? If so, is the process somehow reaching MPI_Finalize and starting to tear down the connections to its peers?

George.

On Wed, Apr 5, 2017 at 4:16 PM, McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov> wrote:

> George,
>
> Thanks for the advice. I still don't know what's wrong with my cluster. I get errors like this:
>
> [[39827,1],182][btl_openib_component.c:3497:handle_wc] from burn001 to: burn005-ib error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 2b428467dc00 opcode 1 vendor error 0 qp_idx 0
> [[39827,1],114][btl_openib_component.c:3497:handle_wc] from burn023 to: burn005-ib error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 8dd8080 opcode 128 vendor error 0 qp_idx 0
>
> I did some searching on these error messages, and I think they imply there is something amiss with our IB fabric. But I am able to bypass some of the timeouts by doing this:
>
>    CALL MPI_CANCEL
>    CALL MPI_TEST
>    CALL MPI_TEST_CANCELLED
>
> I don't think that the calls to MPI_TEST or MPI_TEST_CANCELLED do anything, but at least they don't block. I am going to see if I can just ignore a dropped packet now and again, or try to figure out what's wrong with our IB.
>
> Thanks,
>
> Kevin
>
> *From:* George Bosilca [mailto:bosi...@icl.utk.edu]
> *Sent:* Monday, April 03, 2017 5:59 PM
> *To:* McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov>
> *Cc:* Open MPI Users <users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] MPI_WAIT hangs after a call to MPI_CANCEL
>
> On Mon, Apr 3, 2017 at 4:47 PM, McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov> wrote:
>
> Thanks, George.
>
> Are persistent send/receives matched from the start of the calculation? If so, then I guess MPI_CANCEL won't work.
>
> A persistent request is only matched when it is started. MPI_Cancel on a persistent receive does not affect the persistent request itself; it only cancels the started instance of the request.
>
> I don't think Open MPI is the problem. I think there is something wrong with our cluster, in that it just seems to hang on these big packages. The calculation successfully exchanges hundreds or thousands before just hanging.
>
> While possible, it is highly unlikely that a message gets dropped by the network without some kind of warning (a system log entry at least). You might want to take a look at dmesg to see if there is anything unexpected there.
>
> I'm not sure I completely understand your recommendation for dumping diagnostics. Is this documented somewhere?
>
> Unfortunately not; this is basically a developer trick to dump the state of the MPI library. It goes a little like this. Once you have attached a debugger to your process (let's assume gdb), you need to find the communicator where you have posted your requests (I can't help here, this is not part of the code you sent). With <communicator_index> set to this value,
>
>    gdb$ p ompi_comm_f_to_c_table.addr[<communicator_index>]
>
> will give you the C pointer of the communicator, and
>
>    gdb$ call mca_pml.pml_dump( ompi_comm_f_to_c_table.addr[<communicator_index>], 1 )
>
> should print all the messages known locally to the MPI library, including pending sends and receives. This will also print additional information (the status of the requests, the tag, the size, and so on) that can be understood by the developers. If you post the info here, we might be able to provide additional information on the issue.
>
> George.
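For reference, the dump procedure described above might look roughly like the following gdb session. The process ID (12345) and the communicator index (0) are placeholders only; they have to be replaced with the PID of the hung rank and the index of the communicator the application actually uses.

   $ gdb -p 12345
   (gdb) p ompi_comm_f_to_c_table.addr[0]
   (gdb) call mca_pml.pml_dump( ompi_comm_f_to_c_table.addr[0], 1 )
   (gdb) detach
   (gdb) quit

The p command prints the C pointer of the communicator; the pml_dump call then prints the pending sends and receives that the PML knows about for that communicator.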
> Thanks,
>
> Kevin
>
> *From:* George Bosilca [mailto:bosi...@icl.utk.edu]
> *Sent:* Monday, April 03, 2017 2:29 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Cc:* McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov>
> *Subject:* Re: [OMPI users] MPI_WAIT hangs after a call to MPI_CANCEL
>
> Kevin,
>
> In Open MPI we only support cancelling not-yet-matched receives. So you cannot cancel sends, nor receive requests that have already been matched. While the latter are supposed to complete (otherwise they would not have been matched), the former are trickier to complete if the corresponding receive is never posted.
>
> To sum this up, the bad news is that there is no way to correctly cancel MPI requests without hitting a deadlock.
>
> That being said, I can hardly understand how Open MPI could drop a message. There might be something else going on here that is more difficult to spot. We do have an internal way to dump all pending (or known) communications. Assuming you are using the OB1 PML, here is how you dump all known communications: attach to a process, find the communicator pointer (you will need to convert between the F90 communicator and the C pointer), and then call mca_pml.pml_dump( commptr, 1 ).
>
> Also, is it possible to check how one of the more recent versions of Open MPI (> 2.1) behaves with your code?
>
> George.
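To make the cancellation semantics discussed in the two replies above concrete, here is a minimal, self-contained Fortran sketch (not part of the original thread; the buffer size, tag, and source rank are arbitrary placeholders). It creates a persistent receive, starts one instance, cancels that instance, and checks the outcome with MPI_TEST_CANCELLED; the persistent request itself survives the cancellation and could be started again.

   PROGRAM CANCEL_PERSISTENT_RECV
      USE MPI
      IMPLICIT NONE
      INTEGER :: IERR, MY_RANK, REQ, STATUS(MPI_STATUS_SIZE)
      LOGICAL :: CANCELLED
      REAL :: BUF(1000)   ! placeholder buffer; size and type are arbitrary

      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)

      ! Create a persistent receive. Nothing is matched at this point.
      CALL MPI_RECV_INIT(BUF, SIZE(BUF), MPI_REAL, 0, 99, MPI_COMM_WORLD, REQ, IERR)

      ! Start one instance of the request; matching can only happen once it is started.
      CALL MPI_START(REQ, IERR)

      ! Cancel the started instance (not the persistent request itself), then
      ! complete it locally and check whether the cancellation took effect.
      CALL MPI_CANCEL(REQ, IERR)
      CALL MPI_WAIT(REQ, STATUS, IERR)
      CALL MPI_TEST_CANCELLED(STATUS, CANCELLED, IERR)
      IF (MY_RANK==0) WRITE(*,*) 'Receive instance cancelled: ', CANCELLED

      ! The persistent request is now inactive and could be restarted with
      ! MPI_START; free it once it is no longer needed.
      CALL MPI_REQUEST_FREE(REQ, IERR)

      CALL MPI_FINALIZE(IERR)
   END PROGRAM CANCEL_PERSISTENT_RECV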
> On Sat, Apr 1, 2017 at 12:40 PM, McGrattan, Kevin B. Dr. (Fed) <kevin.mcgrat...@nist.gov> wrote:
>
> I am running a large computational fluid dynamics code on a Linux cluster (CentOS 6.8, Open MPI 1.8.4). The code is written in Fortran and compiled with Intel Fortran 16.0.3. The cluster has 36 nodes; each node has two sockets, and each socket has six cores. I have noticed that the code hangs when the size of the packages exchanged using a persistent send and receive call becomes large. I cannot say exactly how large, but generally on the order of 10 MB. Rather than let the code just hang, I implemented a timing loop using MPI_TESTALL. If MPI_TESTALL fails to return successfully after, say, 10 minutes, I attempt to MPI_CANCEL the unsuccessful request(s) and continue on with the calculation, even if the communication(s) did not succeed. It would not necessarily cripple the calculation if a few MPI communications were unsuccessful. This is a snippet of code that tests whether the communications are successful and attempts to cancel them if not:
>
>    START_TIME = MPI_WTIME()
>    FLAG = .FALSE.
>    DO WHILE(.NOT.FLAG)
>       CALL MPI_TESTALL(NREQ,REQ(1:NREQ),FLAG,ARRAY_OF_STATUSES,IERR)
>       WAIT_TIME = MPI_WTIME() - START_TIME
>       IF (WAIT_TIME>TIMEOUT) THEN
>          WRITE(LU_ERR,'(A,I6,A,A)') 'Request timed out for MPI process ',MYID,' running on ',PNAME(1:PNAMELEN)
>          DO NNN=1,NREQ
>             IF (ARRAY_OF_STATUSES(1,NNN)==MPI_SUCCESS) CYCLE
>             CALL MPI_CANCEL(REQ(NNN),IERR)
>             WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_CANCEL'
>             CALL MPI_WAIT(REQ(NNN),STATUS,IERR)
>             WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_WAIT'
>             CALL MPI_TEST_CANCELLED(STATUS,FLAG2,IERR)
>             WRITE(LU_ERR,*) 'Request ',NNN,' returns from MPI_TEST_CANCELLED'
>          ENDDO
>       ENDIF
>    ENDDO
>
> The job still hangs, and when I look at the error file, I see that on MPI process A one of the sends has not completed, and on process B one of the receives has not completed. The failed send and failed receive are consistent; that is, they match. What I do not understand is that for both the uncompleted send and the uncompleted receive, the code hangs in MPI_WAIT. That is, I do not get the printout that says that the process has returned from MPI_WAIT. I interpret this to mean that some of the large message has been sent or received, but not all of it. The MPI standard seems a bit vague on what is supposed to happen if part of the message simply disappears due to some network glitch. These errors occur after hundreds or thousands of successful exchanges. They never happen at the same point in the calculation. They are random, but they occur only when the messages are large (on the order of MBs). When the messages are not large, the code can run for days or weeks without errors.
>
> So why does MPI_WAIT hang? The MPI standard says:
>
> "If a communication is marked for cancellation, then an MPI_Wait <https://www.open-mpi.org/doc/v2.0/man3/MPI_Wait.3.php> call for that communication is guaranteed to return, irrespective of the activities of other processes (i.e., MPI_Wait behaves as a local function)" (https://www.open-mpi.org/doc/v2.0/man3/MPI_Cancel.3.php).
>
> Could the problem be with my cluster, in that the large message is broken up into smaller packets, one of these packets disappears, and there is no way to cancel it? That is really what I am looking for: a way to cancel the failed communication but still continue the calculation.
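As a closing illustration, a minimal sketch of the workaround Kevin describes in his April 5 message above: replace the potentially blocking MPI_WAIT in the snippet with a bounded MPI_TEST loop. The names DONE, CANCELLED, and T0 are new placeholders; REQ, NNN, STATUS, TIMEOUT, and LU_ERR are borrowed from the snippet. As George notes at the top of the thread, this only hides a dropped message; the request itself never completes.

   CALL MPI_CANCEL(REQ(NNN),IERR)
   ! Poll for completion instead of blocking in MPI_WAIT, giving up after TIMEOUT seconds.
   DONE = .FALSE.
   T0   = MPI_WTIME()
   DO WHILE (.NOT.DONE .AND. MPI_WTIME()-T0 < TIMEOUT)
      CALL MPI_TEST(REQ(NNN),DONE,STATUS,IERR)
   ENDDO
   IF (DONE) THEN
      ! The request finished; find out whether it was actually cancelled.
      CALL MPI_TEST_CANCELLED(STATUS,CANCELLED,IERR)
      IF (.NOT.CANCELLED) WRITE(LU_ERR,*) 'Request ',NNN,' completed despite the cancel'
   ELSE
      ! Neither completed nor cancelled: the instance is abandoned, and the
      ! underlying persistent request cannot safely be restarted.
      WRITE(LU_ERR,*) 'Request ',NNN,' still pending after cancel; giving up'
   ENDIF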
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users