Greetings,
We are troubleshooting an IB network fabric issue that is causing some of our
MPI applications to fail with errors like this:
--
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry
application on half the nodes, then
the other half. My hunch is that you will find faulty cables.
I could of course be very wrong, and it may be something that this
application triggers.
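A minimal sketch of the halving idea, in case it helps. The node names are hypothetical placeholders; substitute the host list of the failing job. Each printed line is a comma-separated list that could, for example, be passed to mpirun's --host option for a separate run. If one half fails and the other does not, split the failing half again.

```python
def split_hosts(hosts):
    """Split a host list into two halves for separate test runs."""
    mid = len(hosts) // 2
    return hosts[:mid], hosts[mid:]


# Hypothetical node names; replace with the hosts of the failing job.
hosts = [f"node{i:02d}" for i in range(1, 9)]

first_half, second_half = split_hosts(hosts)

# One comma-separated host list per test run.
print(",".join(first_half))
print(",".join(second_half))
```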
On Wed, 16 Feb 2022 at 19:28, Shan-ho Tsai via users
<users@lists.open-mpi.org> wrote:
Greeting