Since it is happening on this cluster and not on others, have you checked the
InfiniBand counters to rule out a bad cable or something along those lines?
I believe the command is ibdiagnet (or something similar, such as
ibqueryerrors).
Collin
From: users On Behalf Of Bart Willems via users
Sent: Thursday, June
Hello,
I am attempting to run High Performance Linpack (2.3) across 2 nodes with
Open MPI 4.1.4 and MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64. Within a
minute or so, the run always crashes with:
[node002:04556] *** An error occurred in MPI_Recv
[node002:04556] *** reported by process [1007222785