John,

On the ULFM mailing list thread you pointed out, we converged toward a
hardware issue: resources associated with the dead process were not correctly
freed, and follow-up processes on the same setup would inherit problems
related to these lingering messages. Keep in mind, however, that the setup
there was different, as we were talking about losing a process.

The proposed solution of forcing the timeout to a large value did not fix the
problem; it merely delayed it long enough for the application to run to
completion.
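
For illustration, assuming the timeout in question is the openib BTL
transmit timeout (the ULFM thread may have referred to a different
parameter), such a workaround would typically be applied on the mpirun
command line along these lines:

    # Assumption: the relevant knob is btl_openib_ib_timeout; "./my_app" is
    # a placeholder for the actual binary. The value is an exponent: the
    # effective timeout is 4.096 us * 2^30, i.e. roughly 73 minutes per retry.
    mpirun --mca btl_openib_ib_timeout 30 ./my_app

Even at such a setting the underlying problem remains, which is why raising
the timeout only postpones the failure rather than curing it.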

  George.


On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users <
users@lists.open-mpi.org> wrote:

> ps. Before you do the reboot of a compute node, have you run 'ibdiagnet' ?
>
> On 28 September 2017 at 11:17, John Hearns <hear...@googlemail.com> wrote:
>
>>
>> Google turns this up:
>> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>>
>>
>> On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> we have an issue on our 32-node Linux cluster regarding the usage of
>>> Open MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR
>>> single-port HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).
>>>
>>>
>>> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
>>> processes distributed over 16 nodes [node01-node16]), we observe that the
>>> simulation freezes with an internal error: "error polling LP CQ with
>>> status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>>> vendor error 136 qp_idx 0" (see attached file for full output).
>>>
>>>
>>> The job hangs: no computation or communication occurs anymore, but the
>>> job neither exits nor releases the nodes. The job can be killed normally,
>>> but the affected nodes then do not fully recover. A relaunch of the
>>> simulation usually survives a couple of iterations (a few minutes of
>>> runtime) before hanging again for similar reasons. The only workaround
>>> so far is to reboot the involved nodes.
>>>
>>>
>>> Since we didn't find any hints on the web regarding this strange
>>> behaviour, I am wondering whether this is a known issue. We don't know
>>> what causes it or why, so any hints on where to start investigating, or
>>> possible explanations, are welcome.
>>>
>>>
>>> Ludovic
>>>
>>>
>>
>>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
