I'd suggest opening a ticket on the UCX repo itself. This looks to me like UCX not recognizing a Mellanox device, or at least not initializing it correctly.
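In the meantime, here is a sketch pulling together the interim workaround discussed below (forcing the ob1 PML so the job doesn't depend on UCX) and the diagnostics that are typically useful to attach to a UCX ticket. The srun arguments are the ones used elsewhere in this thread; the exact set of diagnostics is a suggestion, not a requirement:

```shell
# Interim workaround (per Ralph's suggestion below): force the ob1 PML so
# Open MPI neither selects UCX nor initializes the deprecated openib BTL.
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=^openib
srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6

# Diagnostics worth attaching to a UCX issue: the exact UCX build info and a
# debug-level view of transport/device discovery on the affected node.
ucx_info -v
UCX_LOG_LEVEL=debug ucx_info -d > ucx_debug.txt 2>&1
```

One hedged observation: the `common_ucx.c:304 ... did not match transport list` lines suggest the PML's support check is rejecting the transport names UCX 1.5.2 reports (`rc`, `ud`), rather than UCX failing to see the device at all. If I recall correctly, Open MPI 4.0.x exposes `opal_common_ucx_tls`/`opal_common_ucx_devices` MCA parameters that control that list — `ompi_info --all | grep common_ucx` will show the exact names and defaults on your build — so widening the list, or testing a newer UCX than the EL7-packaged 1.5.2, may be worth trying before or alongside the ticket.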
> On Aug 11, 2021, at 8:21 AM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>
> Thanks. That /is/ one solution, and what I’ll do in the interim since this
> has to work in at least some fashion, but I would actually like to use UCX if
> OpenIB is going to be deprecated. How do I find out what’s actually wrong?
>
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,
> |---------------------------*O*---------------------------
> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> || \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
>> On Jul 29, 2021, at 11:35 AM, Ralph Castain via users
>> <users@lists.open-mpi.org> wrote:
>>
>> So it _is_ UCX that is the problem! Try using OMPI_MCA_pml=ob1 instead
>>
>>> On Jul 29, 2021, at 8:33 AM, Ryan Novosielski <novos...@rutgers.edu> wrote:
>>>
>>> Thanks, Ralph. This /does/ change things, but not very much. I was not
>>> under the impression that I needed to do that, since when I ran without
>>> having built against UCX, it warned me about the openib method being
>>> deprecated. By default, does OpenMPI not use either anymore, and I need to
>>> specifically call for UCX? Seems strange.
>>>
>>> Anyhow, I’ve got some variables defined still, in addition to your
>>> suggestion, for verbosity:
>>>
>>> [novosirj@amarel-test2 ~]$ env | grep ^OMPI
>>> OMPI_MCA_pml=ucx
>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>> OMPI_MCA_pml_ucx_verbose=100
>>>
>>> Here goes:
>>>
>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>>> srun: job 13995650 queued and waiting for resources
>>> srun: job 13995650 has been allocated resources
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>   Local host:    gpu004
>>>   Local device:  mlx4_0
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> WARNING: There was an error initializing an OpenFabrics device.
>>>
>>>   Local host:    gpu004
>>>   Local device:  mlx4_0
>>> --------------------------------------------------------------------------
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29823] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>> --------------------------------------------------------------------------
>>> No components were able to be opened in the pml framework.
>>>
>>> This typically means that either no components of this type were
>>> installed, or none of the installed components can be loaded.
>>> Sometimes this means that shared libraries required by these
>>> components are unable to be found/loaded.
>>>
>>>   Host:      gpu004
>>>   Framework: pml
>>> --------------------------------------------------------------------------
>>> [gpu004.amarel.rutgers.edu:29823] PML ucx cannot be selected
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>> [gpu004.amarel.rutgers.edu:29824] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>> --------------------------------------------------------------------------
>>> No components were able to be opened in the pml framework.
>>>
>>> This typically means that either no components of this type were
>>> installed, or none of the installed components can be loaded.
>>> Sometimes this means that shared libraries required by these
>>> components are unable to be found/loaded.
>>>
>>>   Host:      gpu004
>>>   Framework: pml
>>> --------------------------------------------------------------------------
>>> [gpu004.amarel.rutgers.edu:29824] PML ucx cannot be selected
>>> slurmstepd: error: *** STEP 13995650.0 ON gpu004 CANCELLED AT 2021-07-29T11:31:19 ***
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> srun: error: gpu004: tasks 0-1: Exited with exit code 1
>>>
>>> --
>>> #BlackLivesMatter
>>> ____
>>> || \\UTGERS,
>>> |---------------------------*O*---------------------------
>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>> || \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
>>>      `'
>>>
>>>> On Jul 29, 2021, at 8:34 AM, Ralph Castain via users
>>>> <users@lists.open-mpi.org> wrote:
>>>>
>>>> Ryan - I suspect what Sergey was trying to say was that you need to ensure
>>>> OMPI doesn't try to use the OpenIB driver, or at least that it doesn't
>>>> attempt to initialize it. Try adding
>>>>
>>>> OMPI_MCA_pml=ucx
>>>>
>>>> to your environment.
>>>>
>>>>> On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> This issue arrives from BTL OpenIB, not related to UCX
>>>>>
>>>>> From: users <users-boun...@lists.open-mpi.org> on behalf of Ryan Novosielski via users <users@lists.open-mpi.org>
>>>>> Date: Thursday, 29 July 2021, 08:25
>>>>> To: users@lists.open-mpi.org <users@lists.open-mpi.org>
>>>>> Cc: Ryan Novosielski <novos...@rutgers.edu>
>>>>> Subject: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."
>>>>>
>>>>> Hi there,
>>>>>
>>>>> New to using UCX, as a result of having built OpenMPI without it and
>>>>> running tests and getting warned. Installed UCX from the distribution:
>>>>>
>>>>> [novosirj@amarel-test2 ~]$ rpm -qa ucx
>>>>> ucx-1.5.2-1.el7.x86_64
>>>>>
>>>>> …and rebuilt OpenMPI. Built fine. However, I’m getting some pretty
>>>>> unhelpful messages about not using the IB card.
>>>>> I looked around the internet some and set a couple of environment
>>>>> variables to get a little more information:
>>>>>
>>>>> OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
>>>>> export OMPI_MCA_pml_ucx_verbose=100
>>>>>
>>>>> Here’s what happens:
>>>>>
>>>>> [novosirj@amarel-test2 ~]$ srun -n 2 --mpi=pmi2 -p oarc --reservation=UCX ./mpihello-gcc-8-openmpi-4.0.6
>>>>> srun: job 13993927 queued and waiting for resources
>>>>> srun: job 13993927 has been allocated resources
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>>
>>>>>   Local host:    gpu004
>>>>>   Local device:  mlx4_0
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: There was an error initializing an OpenFabrics device.
>>>>>
>>>>>   Local host:    gpu004
>>>>>   Local device:  mlx4_0
>>>>> --------------------------------------------------------------------------
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.5.2
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 self/self: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/eno1: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 rc/mlx4_0:1: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 ud/mlx4_0:1: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/sysv: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 mm/posix: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:304 cma/cma: did not match transport list
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:311 support level is none
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
>>>>> [gpu004.amarel.rutgers.edu:02326] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>>>> [gpu004.amarel.rutgers.edu:02327] ../../../../../openmpi-4.0.6/opal/mca/common/ucx/common_ucx.c:147 using OPAL memory hooks as external events
>>>>> Hello world from processor gpu004.amarel.rutgers.edu, rank 0 out of 2 processors
>>>>> Hello world from processor gpu004.amarel.rutgers.edu, rank 1 out of 2 processors
>>>>>
>>>>> Here’s the output of a couple more commands that seem to be recommended
>>>>> when looking into this:
>>>>>
>>>>> [novosirj@gpu004 ~]$ ucx_info -d
>>>>> #
>>>>> # Memory domain: self
>>>>> #     component:   self
>>>>> #     register:    unlimited, cost: 0 nsec
>>>>> #     remote key:  8 bytes
>>>>> #
>>>>> #   Transport: self
>>>>> #
>>>>> #      Device: self
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      6911.00 MB/sec
>>>>> #            latency:        0 nsec
>>>>> #            overhead:       10 nsec
>>>>> #            put_short:      <= 4294967295
>>>>> #            put_bcopy:      unlimited
>>>>> #            get_bcopy:      unlimited
>>>>> #            am_short:       <= 8k
>>>>> #            am_bcopy:       <= 8k
>>>>> #            domain:         cpu
>>>>> #            atomic_add:     32, 64 bit
>>>>> #            atomic_and:     32, 64 bit
>>>>> #            atomic_or:      32, 64 bit
>>>>> #            atomic_xor:     32, 64 bit
>>>>> #            atomic_fadd:    32, 64 bit
>>>>> #            atomic_fand:    32, 64 bit
>>>>> #            atomic_for:     32, 64 bit
>>>>> #            atomic_fxor:    32, 64 bit
>>>>> #            atomic_swap:    32, 64 bit
>>>>> #            atomic_cswap:   32, 64 bit
>>>>> #            connection:     to iface
>>>>> #            priority:       0
>>>>> #            device address: 0 bytes
>>>>> #            iface address:  8 bytes
>>>>> #            error handling: none
>>>>> #
>>>>> #
>>>>> # Memory domain: tcp
>>>>> #     component:   tcp
>>>>> #
>>>>> #   Transport: tcp
>>>>> #
>>>>> #      Device: eno1
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      113.16 MB/sec
>>>>> #            latency:        5776 nsec
>>>>> #            overhead:       50000 nsec
>>>>> #            am_bcopy:       <= 8k
>>>>> #            connection:     to iface
>>>>> #            priority:       1
>>>>> #            device address: 4 bytes
>>>>> #            iface address:  2 bytes
>>>>> #            error handling: none
>>>>> #
>>>>> #      Device: ib0
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      6239.81 MB/sec
>>>>> #            latency:        5210 nsec
>>>>> #            overhead:       50000 nsec
>>>>> #            am_bcopy:       <= 8k
>>>>> #            connection:     to iface
>>>>> #            priority:       1
>>>>> #            device address: 4 bytes
>>>>> #            iface address:  2 bytes
>>>>> #            error handling: none
>>>>> #
>>>>> #
>>>>> # Memory domain: ib/mlx4_0
>>>>> #     component:   ib
>>>>> #     register:    unlimited, cost: 90 nsec
>>>>> #     remote key:  16 bytes
>>>>> #     local memory handle is required for zcopy
>>>>> #
>>>>> #   Transport: rc
>>>>> #
>>>>> #      Device: mlx4_0:1
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      6433.22 MB/sec
>>>>> #            latency:        900 nsec + 1 * N
>>>>> #            overhead:       75 nsec
>>>>> #            put_short:      <= 88
>>>>> #            put_bcopy:      <= 8k
>>>>> #            put_zcopy:      <= 1g, up to 6 iov
>>>>> #            put_opt_zcopy_align: <= 512
>>>>> #            put_align_mtu:  <= 2k
>>>>> #            get_bcopy:      <= 8k
>>>>> #            get_zcopy:      33..1g, up to 6 iov
>>>>> #            get_opt_zcopy_align: <= 512
>>>>> #            get_align_mtu:  <= 2k
>>>>> #            am_short:       <= 87
>>>>> #            am_bcopy:       <= 8191
>>>>> #            am_zcopy:       <= 8191, up to 5 iov
>>>>> #            am_opt_zcopy_align: <= 512
>>>>> #            am_align_mtu:   <= 2k
>>>>> #            am header:      <= 127
>>>>> #            domain:         device
>>>>> #            connection:     to ep
>>>>> #            priority:       10
>>>>> #            device address: 3 bytes
>>>>> #            ep address:     4 bytes
>>>>> #            error handling: peer failure
>>>>> #
>>>>> #
>>>>> #   Transport: ud
>>>>> #
>>>>> #      Device: mlx4_0:1
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      6433.22 MB/sec
>>>>> #            latency:        910 nsec
>>>>> #            overhead:       105 nsec
>>>>> #            am_short:       <= 172
>>>>> #            am_bcopy:       <= 4088
>>>>> #            am_zcopy:       <= 4088, up to 7 iov
>>>>> #            am_opt_zcopy_align: <= 512
>>>>> #            am_align_mtu:   <= 4k
>>>>> #            am header:      <= 3984
>>>>> #            connection:     to ep, to iface
>>>>> #            priority:       10
>>>>> #            device address: 3 bytes
>>>>> #            iface address:  3 bytes
>>>>> #            ep address:     6 bytes
>>>>> #            error handling: peer failure
>>>>> #
>>>>> #
>>>>> # Memory domain: rdmacm
>>>>> #     component:   rdmacm
>>>>> #     supports client-server connection establishment via sockaddr
>>>>> #   < no supported devices found >
>>>>> #
>>>>> # Memory domain: sysv
>>>>> #     component:   sysv
>>>>> #     allocate:    unlimited
>>>>> #     remote key:  32 bytes
>>>>> #
>>>>> #   Transport: mm
>>>>> #
>>>>> #      Device: sysv
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      6911.00 MB/sec
>>>>> #            latency:        80 nsec
>>>>> #            overhead:       10 nsec
>>>>> #            put_short:      <= 4294967295
>>>>> #            put_bcopy:      unlimited
>>>>> #            get_bcopy:      unlimited
>>>>> #            am_short:       <= 92
>>>>> #            am_bcopy:       <= 8k
>>>>> #            domain:         cpu
>>>>> #            atomic_add:     32, 64 bit
>>>>> #            atomic_and:     32, 64 bit
>>>>> #            atomic_or:      32, 64 bit
>>>>> #            atomic_xor:     32, 64 bit
>>>>> #            atomic_fadd:    32, 64 bit
>>>>> #            atomic_fand:    32, 64 bit
>>>>> #            atomic_for:     32, 64 bit
>>>>> #            atomic_fxor:    32, 64 bit
>>>>> #            atomic_swap:    32, 64 bit
>>>>> #            atomic_cswap:   32, 64 bit
>>>>> #            connection:     to iface
>>>>> #            priority:       0
>>>>> #            device address: 8 bytes
>>>>> #            iface address:  16 bytes
>>>>> #            error handling: none
>>>>> #
>>>>> #
>>>>> # Memory domain: posix
>>>>> #     component:   posix
>>>>> #     allocate:    unlimited
>>>>> #     remote key:  37 bytes
>>>>> #
>>>>> #   Transport: mm
>>>>> #
>>>>> #      Device: posix
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      6911.00 MB/sec
>>>>> #            latency:        80 nsec
>>>>> #            overhead:       10 nsec
>>>>> #            put_short:      <= 4294967295
>>>>> #            put_bcopy:      unlimited
>>>>> #            get_bcopy:      unlimited
>>>>> #            am_short:       <= 92
>>>>> #            am_bcopy:       <= 8k
>>>>> #            domain:         cpu
>>>>> #            atomic_add:     32, 64 bit
>>>>> #            atomic_and:     32, 64 bit
>>>>> #            atomic_or:      32, 64 bit
>>>>> #            atomic_xor:     32, 64 bit
>>>>> #            atomic_fadd:    32, 64 bit
>>>>> #            atomic_fand:    32, 64 bit
>>>>> #            atomic_for:     32, 64 bit
>>>>> #            atomic_fxor:    32, 64 bit
>>>>> #            atomic_swap:    32, 64 bit
>>>>> #            atomic_cswap:   32, 64 bit
>>>>> #            connection:     to iface
>>>>> #            priority:       0
>>>>> #            device address: 8 bytes
>>>>> #            iface address:  16 bytes
>>>>> #            error handling: none
>>>>> #
>>>>> #
>>>>> # Memory domain: cma
>>>>> #     component:   cma
>>>>> #     register:    unlimited, cost: 9 nsec
>>>>> #
>>>>> #   Transport: cma
>>>>> #
>>>>> #      Device: cma
>>>>> #
>>>>> #      capabilities:
>>>>> #            bandwidth:      11145.00 MB/sec
>>>>> #            latency:        80 nsec
>>>>> #            overhead:       400 nsec
>>>>> #            put_zcopy:      unlimited, up to 16 iov
>>>>> #            put_opt_zcopy_align: <= 1
>>>>> #            put_align_mtu:  <= 1
>>>>> #            get_zcopy:      unlimited, up to 16 iov
>>>>> #            get_opt_zcopy_align: <= 1
>>>>> #            get_align_mtu:  <= 1
>>>>> #            connection:     to iface
>>>>> #            priority:       0
>>>>> #            device address: 8 bytes
>>>>> #            iface address:  4 bytes
>>>>> #            error handling: none
>>>>> #
>>>>>
>>>>> [novosirj@gpu004 ~]$ ucx_info -p -u t
>>>>> #
>>>>> # UCP context
>>>>> #
>>>>> #     md 0  :  self
>>>>> #     md 1  :  tcp
>>>>> #     md 2  :  ib/mlx4_0
>>>>> #     md 3  :  rdmacm
>>>>> #     md 4  :  sysv
>>>>> #     md 5  :  posix
>>>>> #     md 6  :  cma
>>>>> #
>>>>> #     resource 0  :  md 0  dev 0  flags -- self/self
>>>>> #     resource 1  :  md 1  dev 1  flags -- tcp/eno1
>>>>> #     resource 2  :  md 1  dev 2  flags -- tcp/ib0
>>>>> #     resource 3  :  md 2  dev 3  flags -- rc/mlx4_0:1
>>>>> #     resource 4  :  md 2  dev 3  flags -- ud/mlx4_0:1
>>>>> #     resource 5  :  md 3  dev 4  flags -s rdmacm/sockaddr
>>>>> #     resource 6  :  md 4  dev 5  flags -- mm/sysv
>>>>> #     resource 7  :  md 5  dev 6  flags -- mm/posix
>>>>> #     resource 8  :  md 6  dev 7  flags -- cma/cma
>>>>> #
>>>>> # memory: 0.84MB, file descriptors: 2
>>>>> # create time: 5.032 ms
>>>>> #
>>>>>
>>>>> Thanks for any help you can offer. What am I missing?
>>>>>
>>>>> --
>>>>> #BlackLivesMatter
>>>>> ____
>>>>> || \\UTGERS,
>>>>> |---------------------------*O*---------------------------
>>>>> ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
>>>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>>>> || \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
>>>>>      `'