I'm trying to run an MPI RMA application on an InfiniBand cluster and
found that Open MPI uses the pt2pt osc component instead of RDMA via
openib (or UCX). Here is the verbose output from Open MPI (current 3.1.x git):
```
$ mpirun -n 2 --mca btl_base_verbose 100 --mca osc_base_verbose 100 \
    --mca osc_rdma_verbose 100 ./a.out
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: opening osc components
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component sm
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: component sm open function successful
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component monitoring
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component pt2pt
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component rdma
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] rdmacm CPC only supported when the first QP is a PP QP; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] [rank=0] openib: using port mlx5_0:1
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: init of component openib returned success
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: initializing btl component tcp
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Found match: 127.0.0.1 (lo)
```
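Since the verbose output of both ranks is interleaved and may be wrapped, a small helper can make it easier to see what each process reports. The following is only a sketch in Python, not part of Open MPI: the function names `group_by_process` and `found_components` are made up for illustration, and the parser assumes every message starts with a `[host:pid]` prefix, as in the excerpt above.

```python
import re
from collections import defaultdict

# Matches the "[host:pid] message" prefix seen in the verbose output above.
PREFIX = re.compile(r"^\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s*(?P<msg>.*)$")


def group_by_process(log_text):
    """Group messages by (host, pid) and re-join wrapped continuation lines.

    Heuristic: any line without a "[host:pid]" prefix is treated as the
    continuation of the previous message. Multi-line help banners are
    therefore appended to the preceding message as well.
    """
    groups = defaultdict(list)
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = PREFIX.match(line)
        if match:
            current = (match.group("host"), match.group("pid"))
            groups[current].append(match.group("msg"))
        elif current is not None:
            groups[current][-1] += " " + line
    return dict(groups)


def found_components(messages):
    """Return the component names from 'found loaded component X' messages."""
    marker = "found loaded component"
    return [m.rsplit(" ", 1)[-1] for m in messages if marker in m]


# A few lines copied from the output above, in their wrapped form.
SAMPLE = """\
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base:
components_open: found loaded component pt2pt
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base:
components_open: found loaded component rdma
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: init of component
openib returned success
"""

if __name__ == "__main__":
    for (host, pid), msgs in group_by_process(SAMPLE).items():
        print(host, pid, found_components(msgs))
```

Saving the `mpirun` output to a file and passing its contents to `group_by_process` would give one message list per process, which makes it easier to compare what the two nodes report.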
Is there any information on what makes the "rdmacm CPC unavailable for
use"? I cannot make much sense of "rdmacm CPC only supported when the
first QP is a PP QP". Is this a configuration problem on the system, or
a problem with the software stack?
If I try the same with Open MPI 4.0.x, it reports:
```
[taurusi6607.taurus.hrsk.tu-dresden.de:21681] Process is not bound: distance to device is 0.000000
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: taurusi6606
Local device: mlx5_0
--------------------------------------------------------------------------
[taurusi6606.taurus.hrsk.tu-dresden.de:09069] select: init of component openib returned failure
```
The message about rdmacm does not show up.
The system has mlx5 devices:
```
$ ~/opt/openmpi-v3.1.x/bin/mpirun -n 2 ibv_devices
device node GUID
------ ----------------
mlx5_0 08003800013c7507
device node GUID
------ ----------------
mlx5_0 08003800013c773b
```
Any help would be much appreciated!
Thanks,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart
Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de