I'm trying to run an MPI RMA application on an IB cluster and find that Open MPI is using the pt2pt rdma component instead of openib (or UCX). I tried getting some logs from Open MPI (current 3.1.x git):

```
$ mpirun -n 2 --mca btl_base_verbose 100 --mca osc_base_verbose 100 --mca osc_rdma_verbose 100 ./a.out
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: opening osc components
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component sm
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: component sm open function successful
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component monitoring
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component pt2pt
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: components_open: found loaded component rdma
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] rdmacm CPC only supported when the first QP is a PP QP; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] openib BTL: rdmacm CPC unavailable for use on mlx5_0:1; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] [rank=0] openib: using port mlx5_0:1
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: init of component openib returned success
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: initializing btl component tcp
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Found match: 127.0.0.1 (lo)
```

Is there any information on what makes "rdmacm CPC unavailable for use"? I cannot make much sense of "rdmacm CPC only supported when the first QP is a PP QP". Is this a configuration problem on the system, or a problem with the software stack?

If I try the same using Open MPI 4.0.x, it reports:

```
[taurusi6607.taurus.hrsk.tu-dresden.de:21681] Process is not bound: distance to device is 0.000000
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   taurusi6606
  Local device: mlx5_0
--------------------------------------------------------------------------
[taurusi6606.taurus.hrsk.tu-dresden.de:09069] select: init of component openib returned failure
```

With 4.0.x, the message about rdmacm does not show up.

The system has mlx5 devices:

```
$ ~/opt/openmpi-v3.1.x/bin/mpirun -n 2 ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              08003800013c7507
    device                 node GUID
    ------              ----------------
    mlx5_0              08003800013c773b
```

Any help would be much appreciated!

Thanks,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de