No errors on any of the links. This is also not isolated to one or two nodes;
it happens on all cluster nodes.
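
(For the archives: the counters were checked with the standard InfiniBand
diagnostics, something along the lines of

    ibqueryerrors
    ibdiagnet

with the exact tools depending on the OFED install, and all ports come back
clean.)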

Bart

On Thu, Jun 9, 2022 at 11:42 AM Collin Strassburger via users <
users@lists.open-mpi.org> wrote:

> Since it is happening on this cluster and not on others, have you checked
> the InfiniBand counters to ensure it’s not a bad cable or something along
> those lines? I believe the command is ibdiagnet (or something similar).
>
>
>
> Collin
>
>
>
> From: users <users-boun...@lists.open-mpi.org> On Behalf Of Bart
> Willems via users
> Sent: Thursday, June 9, 2022 12:32 PM
> To: users@lists.open-mpi.org
> Cc: Bart Willems <bwi...@gmail.com>
> Subject: [OMPI users] HPL: Error occurred in MPI_Recv
>
>
>
> Hello,
>
>
>
> I am attempting to run High Performance Linpack (2.3) between 2 nodes with
> Open MPI 4.1.4 and MLNX_OFED_LINUX-5.6-2.0.9.0-rhel8.6-x86_64. Within a
> minute or so, the run always crashes with
>
>
>
> [node002:04556] *** An error occurred in MPI_Recv
> [node002:04556] *** reported by process [1007222785,24]
> [node002:04556] *** on communicator MPI COMMUNICATOR 5 SPLIT FROM 3
> [node002:04556] *** MPI_ERR_TRUNCATE: message truncated
> [node002:04556] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [node002:04556] ***    and potentially your MPI job)
>
>
>
> I have reverted to Open MPI 4.1.2, with which I have had no issues on
> other systems, but the problem persists on this cluster.
>
>
>
> Any suggestions on steps to diagnose?
>
>
>
> Thank you,
>
> Bart
>
