Just to follow up for the email web archives: this issue was subsequently tracked in 
https://github.com/open-mpi/ompi/issues/10841.

--
Jeff Squyres
jsquy...@cisco.com
________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Rob Kudyba via 
users <users@lists.open-mpi.org>
Sent: Thursday, September 22, 2022 2:15 PM
To: users@lists.open-mpi.org <users@lists.open-mpi.org>
Cc: Rob Kudyba <rk3...@columbia.edu>
Subject: [OMPI users] --mca parameter explainer; mpirun WARNING: There was an 
error initializing an OpenFabrics device

We're using CUDA-aware Open MPI 4.1.1, loaded as a module, on a RHEL 8 cluster with 
an InfiniBand controller (Mellanox Technologies MT28908 Family ConnectX-6). We see 
this warning when running mpirun without any MCA options/parameters:
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   xxxx
  Local device: mlx5_0
---------------------------------------------

I added 0x02c9 to the Mellanox ConnectX6 stanza in our 
mca-btl-openib-device-params.ini file, because we were getting the following 
warning, which no longer appears:

WARNING: No preset parameters were found for the device that Open MPI detected:

  Local host:            xxxx
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123


I found this referenced in these 
comments<https://accserv.classe.cornell.edu/svn/packages/openmpi/opal/mca/btl/openib/mca-btl-openib-device-params.ini>:

# Note: Several vendors resell Mellanox hardware and put their own firmware
# on the cards, therefore overriding the default Mellanox vendor ID.
#
#     Mellanox      0x02c9
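
For reference, my edited stanza now looks roughly like the following. Apart from 
the 0x02c9 I added to vendor_id, the keys/values shown here are just illustrative 
of the file's format, not authoritative settings, so check your own copy:

  [Mellanox ConnectX6]
  # 0x02c9 added so cards reporting the default Mellanox vendor ID
  # match this stanza; the other keys follow the stock file's format.
  vendor_id = 0x02c9,0x15b3
  vendor_part_id = 4123
  use_eager_rdma = 1
  mtu = 4096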

Running ompi_info --param btl all, we have:
MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.1)
MCA btl: smcuda (MCA v2.1.0, API v3.1.0, Component v4.1.1)
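
To see the actual tunables behind each of those components (ompi_info only prints 
a few parameters by default), the verbosity level can be raised, e.g. for the tcp 
BTL:

  ompi_info --param btl tcp --level 9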

So I am trying to wrap my head around the various warnings, which of the available 
options/parameters can improve performance, and when to use them.

I've gone through the Open MPI run-time tuning 
documentation<https://www.open-mpi.org/faq/?category=tuning>, and I've used this 
STREAM benchmark<https://anilmaurya.wordpress.com/2016/10/12/stream-benchmarks/> 
as well as these OSU Micro-Benchmarks at 
https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/

With version 4.1.1, if I use --mca btl 'openib' I get seg faults, which I believe 
is expected as it's 
deprecated<https://docs.open-mpi.org/en/v5.0.x/release-notes/networks.html>. 
I've tried --mca btl '^openib' and --mca btl 'tcp' (or --mca btl 'tcp,self' for 
the OSU BMs), and the benchmark results are very similar even when I use multiple 
CPUs, threads, and/or nodes. They also run without the warning messages. If I 
don't use a --mca option at all, I get the WARNING: message.
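
One thing I plan to try is raising the selection verbosity to confirm which 
component actually carries the traffic, rather than inferring it from the numbers; 
something like this (osu_latency stands in for whichever binary is being run):

  mpirun -np 2 --map-by node \
      --mca btl_base_verbose 100 --mca pml_base_verbose 100 ./osu_latency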

Does anyone know of a tried-and-true way to run these benchmarks so I can tell 
whether these MCA parameters make a difference, or am I just not understanding how 
to use them? Perhaps running these benchmarks on a very active cluster with shared 
CPUs/nodes will affect the results?
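
For the shared-cluster concern, my assumption is that explicitly mapping and 
binding the ranks, and printing the bindings to verify, should at least make runs 
comparable; a minimal sketch, again with a stand-in binary name:

  mpirun -np 2 --map-by node --bind-to core --report-bindings ./osu_bw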

I can share any desired results if that helps the discussion.

Thanks!
