Hi,

we have an issue on our 32-node Linux cluster regarding the use of Open MPI 
in an InfiniBand dual-rail configuration (two ConnectX FDR single-port IB HCAs, 
CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7).


On long runs (over ~10 days) involving more than one node (usually 64 MPI 
processes distributed across 16 nodes [node01-node16]), the simulation freezes 
with an internal error: "error polling LP CQ with 
status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1  vendor 
error 136 qp_idx 0" (see attached file for full output).
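
For reference, the status in that message appears to correspond to the 
libibverbs completion code IBV_WC_REM_ACCESS_ERR, which has the numeric value 
10, i.e. the remote HCA rejected an RDMA access. Below is a minimal sketch of 
how such a status surfaces when polling a completion queue; it is illustrative 
only, not Open MPI's actual code, and the CQ is assumed to have been created 
elsewhere:

#include <stdio.h>
#include <infiniband/verbs.h>

/* Illustrative sketch: drain a completion queue and report failed work
 * requests similarly to the error message quoted above. The struct ibv_cq *cq
 * is assumed to have been created elsewhere (ibv_create_cq). */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* IBV_WC_REM_ACCESS_ERR has the numeric value 10, matching
             * "REMOTE ACCESS ERROR status number 10" in our output. */
            fprintf(stderr,
                    "error polling CQ with status %s (%d) for wr_id %llx "
                    "opcode %d vendor error %u\n",
                    ibv_wc_status_str(wc.status), wc.status,
                    (unsigned long long) wc.wr_id, wc.opcode,
                    wc.vendor_err);
        }
    }
}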


The job hangs: no computation or communication occurs any more, but the job 
neither exits nor releases the nodes. The job can be killed normally, but the 
affected nodes then do not fully recover. Relaunching the simulation usually 
sustains a couple of iterations (a few minutes of runtime) before the job hangs 
again for the same reason. The only workaround so far is to reboot the 
affected nodes.


Since we didn't find any hints on the web regarding this strange behaviour, I 
am wondering whether this is a known issue. We do not know what causes this or 
why it happens, so any hints on where to start investigating, or possible 
explanations, are welcome.


Ludovic

Attachment: MPI_error_msg
