P.S. Before you reboot a compute node, have you run 'ibdiagnet'?
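
As a rough sketch (flags quoted from memory, so please check 'ibdiagnet --help' on your OFED version), something like the following run from a node with fabric access should sweep the subnet and flag bad links or ports with high error counters; the -ls/-lw values assume FDR 4x links and -P all=1 reports any port counter above zero:

    ibdiagnet -ls 14 -lw 4x -P all=1

It can also be worth looking at 'ibqueryerrors' after a few hours of load to see whether symbol or link-downed counters are climbing on the affected nodes.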

On 28 September 2017 at 11:17, John Hearns <hear...@googlemail.com> wrote:

>
> Google turns this up:
> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>
>
> On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch>
> wrote:
>
>> Hi,
>>
>>
>> we have an issue on our 32-node Linux cluster regarding the usage of Open
>> MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port
>> HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7).
>>
>>
>> On long runs (over ~10 days) involving more than one node (usually 64 MPI
>> processes distributed over 16 nodes [node01-node16]), the simulation freezes
>> with an internal error: "error polling LP CQ
>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>> vendor error 136 qp_idx 0" (see the attached file for the full output).
>>
>>
>> The job hangs: no computation or communication occurs anymore, but the job
>> neither exits nor releases the nodes. The job can be killed
>> normally, but the affected nodes then do not fully recover. A relaunched
>> simulation usually survives only a couple of iterations (a few minutes of
>> runtime) before it hangs again for similar reasons. The only
>> workaround so far is to reboot the involved nodes.
>>
>>
>> Since we didn't find any hints on the web regarding this
>> strange behaviour, I am wondering if this is a known issue. We
>> don't know what causes this to happen or why, so any hints on where to start
>> investigating, or possible reasons for this behaviour, are welcome.
>>
>>
>> Ludovic
>>
>>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users