Hi everyone,

To be honest, as an MPI / IB noob, I don't know if this falls under OpenMPI or Mellanox....

I am running a small cluster of HP DL380 G6/G7 machines.
Each runs Ubuntu Server 20.04 and has a Mellanox ConnectX-3 card, connected by an IS dumb switch. When I start my MPI program (snappyHexMesh for OpenFOAM) I get a warning. It doesn't stop my programs or appear to cause any problems, so this request for help is more about understanding why it happens.

OMPI is compiled from source at v4.0.3, which is the default version for Ubuntu 20.04. It compiles and works. I did this because I wanted to understand the compilation process while using a known-working OMPI version.

The InfiniBand part is the Mellanox MLNX_OFED installer v4.9-0.1.7.0, which I run with --dkms --without-fw-update --hpc --with-nfsrdma

The actual error reported is:
Warning: There was an error initialising an OpenFabrics device.
  Local host:     of1
  Local device: mlx4_0

Then shortly after:
[of1:1015399] 19 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[of1:1015399] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Adding this MCA parameter to the mpirun line simply gives me 20 or so copies of the first warning.
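
For reference, a sketch of what that mpirun line looks like with the parameter added. The rank count (20, guessed from the "19 more processes" message) and the snappyHexMesh arguments are assumptions for illustration, not the exact command line used; the command is printed rather than executed so it can be checked first.

```shell
# Assumed values: 20 ranks and the usual OpenFOAM "-parallel" switch.
# Adjust both to match the real job.
NP=20

# --mca orte_base_help_aggregate 0 turns off aggregation of help messages,
# so every rank prints its own copy of the warning.
MPI_CMD="mpirun --mca orte_base_help_aggregate 0 -np ${NP} snappyHexMesh -parallel"

# Print the command for inspection; run it with: eval "${MPI_CMD}"
echo "${MPI_CMD}"
```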

Any ideas, anyone?
Cheers,
Bob.
