No errors on any of the links. This is also not isolated to one or two nodes; it happens on all cluster nodes.
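(For anyone reading along, a minimal sketch of how the per-link error counters can be checked, assuming the infiniband-diags utilities bundled with MLNX_OFED are installed; the tool Collin had in mind may have been ibdiagnet:)

    # show the local HCA's port state, rate, and physical state
    ibstat

    # walk the fabric and report ports whose error counters exceed thresholds
    ibqueryerrors

    # dump the raw performance/error counters of the local port
    perfquery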
Bart

On Thu, Jun 9, 2022 at 11:42 AM Collin Strassburger via users
<users@lists.open-mpi.org> wrote:

> Since it is happening on this cluster and not on others, have you checked
> the InfiniBand counters to ensure it’s not a bad cable or something along
> those lines? I believe the command is ibdiag (or something similar).
>
> Collin
>
> *From:* users <users-boun...@lists.open-mpi.org> *On Behalf Of* Bart Willems via users
> *Sent:* Thursday, June 9, 2022 12:32 PM
> *To:* users@lists.open-mpi.org
> *Cc:* Bart Willems <bwi...@gmail.com>
> *Subject:* [OMPI users] HPL: Error occurred in MPI_Recv
>
> Hello,
>
> I am attempting to run High Performance Linpack (2.3) between 2 nodes with
> Open MPI 4.1.4 and MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64. Within a
> minute or so, the run always crashes with
>
> [node002:04556] *** An error occurred in MPI_Recv
> [node002:04556] *** reported by process [1007222785,24]
> [node002:04556] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [node002:04556] *** MPI_ERR_TRUNCATE: message truncated
> [node002:04556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [node002:04556] *** and potentially your MPI job)
>
> I have reverted back to Open MPI 4.1.2 with which I have had no issues on
> other systems, but the problem persists on this cluster.
>
> Any suggestions on steps to diagnose?
>
> Thank you,
>
> Bart
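(For context on the error itself: MPI_ERR_TRUNCATE is raised when a matched
receive was posted with a buffer smaller than the message that arrives. Below
is a minimal sketch of that mechanism, entirely unrelated to HPL's own code,
just to show what Open MPI is reporting.)

/* trunc.c - deliberately trigger MPI_ERR_TRUNCATE:
 * rank 0 sends four ints, rank 1 only posts room for two.
 * Build and run: mpicc trunc.c -o trunc && mpirun -np 2 ./trunc */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int sendbuf[4] = {1, 2, 3, 4};
    int recvbuf[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* a message of 4 ints ... */
        MPI_Send(sendbuf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ... matched by a receive that can only hold 2 ints:
         * MPI_Recv fails with MPI_ERR_TRUNCATE, and under the default
         * MPI_ERRORS_ARE_FATAL handler the job aborts, much like above. */
        MPI_Recv(recvbuf, 2, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}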