You may try the ibdiagnet tool: http://linux.die.net/man/1/ibdiagnet
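For instance, a minimal sketch of a fabric scan (the -c flag and the default /tmp/ibdiagnet* output files are per the man page above; both may vary across OFED versions):

    # Run as root from any node on the fabric; a bare run scans the
    # whole subnet and reports bad links, error counters, and SM issues.
    ibdiagnet

    # Send more packets per link to shake out marginal cables
    # (-c sets the packet count; default output lands under /tmp).
    ibdiagnet -c 1000

After the run, check the generated log for reported errors before deciding whether to reboot or pull a node.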
The tool is part of OFED (http://www.openfabrics.org/).

Pasha

Prentice Bisbal wrote:
Several jobs on my cluster just died with the error below. Are there any IB/Open MPI diagnostics I should use to diagnose this, should I just reboot the nodes, or should I have the user who submitted these jobs simply increase the retry count/timeout parameters?

[0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
from node14.aurora to: node40.aurora error polling HP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 11119
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been exceeded.
"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this error
has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10). The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details.
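If ibdiagnet comes back clean, raising the ACK timeout at job launch is the usual first step, since the retry count already defaults to its maximum. A minimal sketch (the --mca syntax is standard Open MPI; the process count and executable name below are placeholders for the user's actual job):

    # Default timeout of 10 gives 4.096 us * 2^10 ~= 4.2 ms per retry;
    # raising it to 14 gives 4.096 us * 2^14 ~= 67 ms, which can ride
    # out transient fabric congestion rather than failing the job.
    mpirun --mca btl_openib_ib_timeout 14 \
           --mca btl_openib_ib_retry_count 7 \
           -np 64 ./my_mpi_app

Note that this only papers over intermittent errors; if the counters in the ibdiagnet log keep climbing, the underlying cable, port, or switch still needs attention.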