ps. Before you reboot a compute node, have you run 'ibdiagnet'?

On 28 September 2017 at 11:17, John Hearns <hear...@googlemail.com> wrote:
> Google turns this up:
> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>
> On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch> wrote:
>> Hi,
>>
>> we have an issue on our 32-node Linux cluster regarding the use of Open
>> MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port
>> HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).
>>
>> On long runs (over ~10 days) involving more than one node (usually 64 MPI
>> processes distributed over 16 nodes [node01-node16]), we observe the
>> simulation freeze due to an internal error: "error polling LP CQ
>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>> vendor error 136 qp_idx 0" (see the attached file for the full output).
>>
>> The job hangs: no computation or communication occurs any more, but the
>> job neither exits nor is unloaded from the nodes. The job can be killed
>> normally, but the affected nodes then do not fully recover. A relaunch of
>> the simulation usually sustains a couple of iterations (a few minutes of
>> runtime), and then the job hangs again for similar reasons. The only
>> workaround so far is to reboot the involved nodes.
>>
>> Since we didn't find any hints on the web regarding this
>> strange behaviour, I am wondering whether this is a known issue. We don't
>> know what causes it or why, so any hints on where to start investigating,
>> or possible reasons for this to happen, are welcome.
>>
>> Ludovic
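For context on the message quoted above: "REMOTE ACCESS ERROR status number 10" appears to be Open MPI's verbs-based transport reporting a failed work completion; status 10 is IBV_WC_REM_ACCESS_ERR in libibverbs, i.e. the remote HCA rejected an RDMA access (typically a stale or invalid memory registration key). The following is only a minimal, self-contained sketch, not Open MPI's code, showing where such a status comes from: an application polls its completion queue with ibv_poll_cq() and decodes a non-success status with ibv_wc_status_str(). The device choice (first device found) and CQ size are illustrative assumptions.

/*
 * Sketch: poll a completion queue and decode a work-completion status.
 * Compile with: gcc poll_cq_example.c -o poll_cq_example -libverbs
 */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n",
                ibv_get_device_name(dev_list[0]));
        return 1;
    }

    /* A small completion queue; a real application would post work
       requests on a queue pair bound to this CQ. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!cq) {
        fprintf(stderr, "ibv_create_cq failed\n");
        return 1;
    }

    /* Poll once. With nothing posted this returns 0 completions; the
       error-handling pattern is the relevant part: a completion whose
       status is not IBV_WC_SUCCESS (e.g. IBV_WC_REM_ACCESS_ERR, value 10)
       means the QP has gone into the error state and must be torn down. */
    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);
    if (n > 0 && wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "work completion failed: %s (status %d), wr_id %llu\n",
                ibv_wc_status_str(wc.status), wc.status,
                (unsigned long long) wc.wr_id);
    } else {
        printf("polled %d completion(s)\n", n);
    }

    ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}

This will not reproduce the hang; it only illustrates the layer the quoted error string is coming from, which is why checking the fabric itself (ibdiagnet, as suggested above) is a reasonable next step.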
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users