A "retry exceeded" error usually points to a network problem, such as a
bad cable or a bad connector. You can use the ibdiagnet tool to debug the
fabric: http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
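
If the errors keep coming back, it may help to count which node pairs show
up in them before pulling cables. Below is a minimal sketch in Python, not
part of Open MPI or OFED. It assumes the error lines look like the one
quoted below in this thread; the names ERR_RE, short_host and
tally_retry_errors are made up for illustration, and other Open MPI
versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the single error line quoted in this thread.
# This is an assumption: other Open MPI versions may word it differently.
ERR_RE = re.compile(
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+"
    r"error polling \S+ CQ with status\s+(?P<status>.+?)\s+"
    r"status number\s+(?P<num>\d+)"
)


def short_host(name):
    """Drop the domain part so 'tango092.vpac.org' and 'tango092' compare equal."""
    return name.split(".", 1)[0]


def tally_retry_errors(log_text):
    """Count RETRY EXCEEDED errors per unordered pair of hosts."""
    # The error is often wrapped over several lines, so collapse all
    # whitespace into single spaces before matching.
    flat = " ".join(log_text.split())
    pairs = Counter()
    for m in ERR_RE.finditer(flat):
        if m.group("status") != "RETRY EXCEEDED ERROR":
            continue
        hosts = sorted((short_host(m.group("src")), short_host(m.group("dst"))))
        pairs[tuple(hosts)] += 1
    return pairs


# The error text from the original report, wrapped as it was posted.
SAMPLE = """\
[[1176,1],0][btl_openib_component.c:2905:handle_wc] from
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
"""

for pair, count in tally_retry_errors(SAMPLE).most_common():
    print(pair, count)
```

A host that appears in many failing pairs is a reasonable first candidate
for a cable or connector check, though that is only a heuristic.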
Pasha
Brett Pemberton wrote:
Hey,
I've had a couple of errors recently, of the form:
[[1176,1],0][btl_openib_component.c:2905:handle_wc] from
tango092.vpac.org to: tango090 error polling LP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 38996224 opcode 0 qp_idx 0
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):
My first thought was to increase the retry count, but it is already at
maximum.
I've checked connections between the two nodes, and they seem OK:
[root@tango090 ~]# ibv_rc_pingpong
local address: LID 0x005f, QPN 0xe4045d, PSN 0xdd13f0
remote address: LID 0x005d, QPN 0xfe0425, PSN 0xc43fe2
8192000 bytes in 0.07 seconds = 996.93 Mbit/sec
1000 iters in 0.07 seconds = 65.74 usec/iter
How can I stop this happening in the future, without increasing the
retry count?
cheers,
/ Brett