You can try the ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED (http://www.openfabrics.org/).
Pasha.
Prentice Bisbal wrote:
Several jobs on my cluster just died with the error below.
Are there any IB/Open MPI diagnostics I should use to diagnose this? Should I
just reboot the nodes, or should I have the user who submitted these
jobs just increase the retry count/timeout parameters?
[0,1,6][../../../../../ompi/mca/btl/openi