Brett Pemberton <br...@vpac.org> wrote:
[[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
I've seen this error with Mellanox ConnectX cards and OFED 1.2.x with every version of OpenMPI that I have tried (1.2.x and pre-1.3) and with some MVAPICH versions, from which I concluded that the problem lies in the lower levels (OFED or IB card firmware) rather than in the MPI library. Indeed, after OFED 1.3.x was installed, possibly together with a firmware update (I'm not sure about the firmware, as I don't admin that cluster), these errors disappeared.
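To see whether the failures cluster on particular nodes or links, it can help to tally the error lines by host pair. What follows is only a rough sketch: the regular expression is modeled on the single log line quoted above, so other OpenMPI versions may well format the message differently, and the names parse_wc_error and count_pairs are made up for this illustration. As far as I know, status number 12 corresponds to IBV_WC_RETRY_EXC_ERR in the libibverbs ibv_wc_status enum.

```python
import re
from collections import Counter

# The error line quoted above, used here as sample input.
SAMPLE = (
    "[[1176,1],0][btl_openib_component.c:2905:handle_wc] "
    "from tango092.vpac.org to: tango090 error polling LP CQ "
    "with status RETRY EXCEEDED ERROR status number 12 "
    "for wr_id 38996224 opcode 0 qp_idx 0"
)

# Assumption: the message layout matches the quoted line; adjust the
# pattern if your OpenMPI version prints it differently.
WC_ERROR = re.compile(
    r"from (?P<src>\S+) to: (?P<dst>\S+) error polling \S+ CQ "
    r"with status (?P<status>.+?) status number (?P<num>\d+)"
)


def parse_wc_error(line):
    """Return source host, destination host, status text and status number,
    or None if the line is not a completion-queue error of this form."""
    m = WC_ERROR.search(line)
    if m is None:
        return None
    return {
        "src": m["src"],
        "dst": m["dst"],
        "status": m["status"],
        "num": int(m["num"]),
    }


def count_pairs(lines):
    """Count matching error lines per (source, destination) host pair."""
    counts = Counter()
    for line in lines:
        rec = parse_wc_error(line)
        if rec is not None:
            counts[(rec["src"], rec["dst"])] += 1
    return counts


if __name__ == "__main__":
    print(parse_wc_error(SAMPLE))
```

If one host shows up in most of the pairs, a cable, port or card on that node is worth checking before blaming the software stack as a whole.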
-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.coste...@iwr.uni-heidelberg.de