[OMPI users] Fwd: Ompi-5.0.2 + ucx-1.17 on Infiniband fails to run

2024-07-07 Thread Sangam B via users
Hi,

The application compiled with OpenMPI-5.0.2 or 5.0.3 runs fine only if
the "mpirun -mca pml ob1" option is used.
If any other option is used, such as "-mca pml ucx" or some other btl option,
or if no options are given at all, it fails with the following error:

[n1:0] *** An error occurred in MPI_Isend
[n1:0] *** reported by process [2874540033,16]
[n1:0] *** on communicator MPI_COMM_WORLD
[n1:0] *** MPI_ERR_TAG: invalid tag
[n1:0] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n1:0] ***  and MPI will try to terminate your MPI job as well)

The job is running on only 1 node.

OpenMPI-5.0.2 was compiled against UCX-1.17.0 with the following configure options:

  --with-knem=/opt/knem-1.1.4.90mlnx2 \
  --with-ofi=/opt/libfabric/1.13.1 \
  --with-ucx=/openmpi/ucx/1.17.0/g131xpmt \
  --with-pbs=/opt/pbs \
  --with-threads=pthreads \
  --without-lsf --without-cuda \
  --with-libevent=/openmpi/libevent/2.1.12 \
  --with-libevent-libdir=/openmpi/libevent/2.1.12/lib \
  --with-hwloc=/openmpi/hwloc/2.11.0/g131 \
  --with-hwloc-libdir=/openmpi/hwloc/2.11.0/lib \
  --with-pmix=/openmpi/pmix/502/g131 \
  --with-pmix-libdir=/openmpi/pmix/502/lib \
  --enable-shared --enable-static --enable-mt \
  --enable-mca-no-build=btl-usnic

Each node has a 4X HDR card installed:

CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.33.1048
Hardware version: 0
Node GUID: 0x88e9a46f0680
System image GUID: 0x88e9a46f0680
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 37
LMC: 0
SM lid: 1
Capability mask: 0xa651e848
Port GUID: 0x88e9a46f0680
Link layer: InfiniBand

Can anybody help me understand why it works only with "-mca pml ob1" and not
with the other options?


Re: [OMPI users] Fwd: Ompi-5.0.2 + ucx-1.17 on Infiniband fails to run

2024-07-07 Thread Gilles Gouaillardet via users
Sangam,

A possible explanation is that you are using tags higher than MPI_TAG_UB.
The MPI standard only requires this value to be at least 32767, and it is
possible ob1 allows much higher tags than ucx does.
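
One quick way to check is to query the MPI_TAG_UB attribute of MPI_COMM_WORLD
at runtime and compare it against the tags your application uses. Below is only
a rough sketch (the tag value 100000 is a made-up example, not taken from your
application):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI_TAG_UB is a predefined attribute of MPI_COMM_WORLD; the standard
     * only guarantees it is at least 32767, so the actual value may differ
     * between ob1 and ucx. */
    int *tag_ub_ptr = NULL;
    int flag = 0;
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
    int tag_ub = flag ? *tag_ub_ptr : 32767;

    if (rank == 0)
        printf("MPI_TAG_UB reported by this pml: %d\n", tag_ub);

    int my_tag = 100000;  /* made-up example tag, replace with your own */
    if (my_tag > tag_ub) {
        if (rank == 0)
            fprintf(stderr, "tag %d exceeds MPI_TAG_UB (%d) -> MPI_ERR_TAG\n",
                    my_tag, tag_ub);
    } else {
        /* trivial self send/receive on each rank just to exercise the tag */
        int sendbuf = rank, recvbuf = -1;
        MPI_Request reqs[2];
        MPI_Irecv(&recvbuf, 1, MPI_INT, rank, my_tag, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&sendbuf, 1, MPI_INT, rank, my_tag, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Running it once with "-mca pml ob1" and once with "-mca pml ucx" should show
whether the advertised upper bound differs between the two pmls.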

Cheers,

Gilles
