Hi,
we have an issue on our 32-node Linux cluster regarding the use of Open MPI in an InfiniBand dual-rail configuration (two ConnectX FDR single-port IB HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7).

On long runs (over ~10 days) involving more than one node (usually 64 MPI processes distributed across 16 nodes [node01-node16]), the simulation freezes with an internal error: "error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1 vendor error 136 qp_idx 0" (see the attached file for the full output). The job hangs: no computation or communication takes place anymore, but the job neither exits nor releases the nodes. The job can be killed normally, but the nodes involved do not fully recover afterwards. Relaunching the simulation usually sustains only a couple of iterations (a few minutes of runtime) before the job hangs again for similar reasons. The only workaround so far is to reboot the nodes involved.

Since we did not find any hints on the web about this strange behaviour, I am wondering whether this is a known issue. We do not know what causes it or why, so any hints on where to start investigating, or possible explanations, are welcome.

Ludovic
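P.S. In case it helps frame the question: as far as I can tell, the message comes from the openib BTL polling a libibverbs completion queue, and "status number 10" corresponds to IBV_WC_REM_ACCESS_ERR, i.e. the remote HCA refused an RDMA access (for instance a bad or stale rkey, or an address outside the registered region). Below is a rough sketch of the kind of ibverbs poll loop that reports this status, purely for illustration (it is not Open MPI's actual code); corrections welcome if I am misreading where the error originates.

/* Illustrative only -- not Open MPI's code. Shows how a completion-queue
 * poll surfaces the status/wr_id/opcode/vendor_err fields seen in the
 * error message. IBV_WC_REM_ACCESS_ERR is status value 10. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* REMOTE ACCESS ERROR: the peer's HCA rejected an RDMA
             * read/write against its registered memory. */
            fprintf(stderr,
                    "CQ error: %s (status %d) wr_id 0x%llx opcode %d vendor_err %u\n",
                    ibv_wc_status_str(wc.status), (int) wc.status,
                    (unsigned long long) wc.wr_id, (int) wc.opcode,
                    wc.vendor_err);
        }
    }
}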
(Attachment: MPI_error_msg)