John,

On the ULFM mailing list thread you pointed out, we converged toward a hardware issue: resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit problems related to these lingering messages. Keep in mind, however, that the setup there was different, since that discussion was about losing a process.

The proposed solution of forcing the timeout to a large value did not fix the problem; it merely delayed it long enough for the application to run to completion.

  George.
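For reference, the timeout being discussed is typically the openib BTL's InfiniBand retransmit timeout. A minimal sketch of how one might raise it follows; the parameter name exists in the openib BTL, but the value shown is illustrative only and, as noted above, this merely postpones the hang rather than curing it:

    # Raise the IB local ACK timeout (an exponent: roughly 4.096 us * 2^N)
    mpirun --mca btl_openib_ib_timeout 30 -np 64 ./simulation

    # Equivalently, via the environment:
    export OMPI_MCA_btl_openib_ib_timeout=30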
On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users <users@lists.open-mpi.org> wrote:

> ps. Before you do the reboot of a compute node, have you run 'ibdiagnet'?
>
> On 28 September 2017 at 11:17, John Hearns <hear...@googlemail.com> wrote:
>
>> Google turns this up:
>> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>>
>> On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch> wrote:
>>
>>> Hi,
>>>
>>> we have an issue on our 32-node Linux cluster regarding the use of
>>> Open MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR
>>> single-port HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7).
>>>
>>> On long runs (over ~10 days) involving more than one node (usually 64 MPI
>>> processes distributed over 16 nodes [node01-node16]), we observe the
>>> simulation freeze due to an internal error: "error polling LP CQ
>>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>>> vendor error 136 qp_idx 0" (see attached file for the full output).
>>>
>>> The job hangs: no computation or communication occurs anymore, but the
>>> job neither exits nor releases the nodes. It can be killed normally, but
>>> the nodes involved do not fully recover afterwards. A relaunch of the
>>> simulation usually survives a couple of iterations (a few minutes of
>>> runtime) and then hangs again for similar reasons. The only workaround
>>> so far is to reboot the involved nodes.
>>>
>>> Since we didn't find any hints on the web regarding this strange
>>> behaviour, I am wondering whether this is a known issue. We don't know
>>> what causes it or why, so any hints on where to start investigating, or
>>> possible reasons for this to happen, are welcome.
>>>
>>> Ludovic
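Following up on the ibdiagnet suggestion above: a few standard InfiniBand diagnostics one might run on the affected nodes before resorting to a reboot (the exact tool set and options depend on the installed OFED / infiniband-diags version):

    ibstat           # port state, rate and LID for both HCAs (dual rail)
    ibdiagnet        # fabric-wide sweep for link and counter problems
    ibqueryerrors    # per-port error counters (symbol errors, link downs, ...)

Status number 10 in the reported message corresponds to IBV_WC_REM_ACCESS_ERR in the verbs completion codes, i.e. the remote end of the RDMA operation reported the access error, so it is worth checking the counters on both rails of both endpoints.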