Since it is happening on this cluster and not on others, have you checked the 
InfiniBand counters to rule out a bad cable or something along those lines? 
I believe the command is ibdiagnet (or something similar).
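For example, assuming the usual infiniband-diags / MLNX OFED tools are installed on the nodes (a rough sketch from memory, so adjust names and arguments as needed):

# Error counters for every port in the fabric (symbol errors, link downed, etc.)
ibqueryerrors

# Fabric-wide diagnostic report from the Mellanox/NVIDIA tools
ibdiagnet

# Link state, rate, and physical state of the local HCA ports
ibstat

# Raw counters for one LID/port (replace <lid> and <port> with real values)
perfquery <lid> <port>

A SymbolErrorCounter or LinkDownedCounter that keeps climbing on one port usually points at a bad cable or connector.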

Collin

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Bart Willems via 
users
Sent: Thursday, June 9, 2022 12:32 PM
To: users@lists.open-mpi.org
Cc: Bart Willems <bwi...@gmail.com>
Subject: [OMPI users] HPL: Error occurred in MPI_Recv


Hello,

I am attempting to run High Performance Linpack (HPL 2.3) across two nodes with Open 
MPI 4.1.4 and MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64. Within a minute or 
so, the run always crashes with:

[node002:04556] *** An error occurred in MPI_Recv
[node002:04556] *** reported by process [1007222785,24]
[node002:04556] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
[node002:04556] *** MPI_ERR_TRUNCATE: message truncated
[node002:04556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will 
now abort,
[node002:04556] ***    and potentially your MPI job)

I have reverted to Open MPI 4.1.2, with which I have had no issues on other 
systems, but the problem persists on this cluster.

Any suggestions on steps to diagnose this?

Thank you,
Bart
