Hi everyone,
To be honest, as an MPI / IB noob, I don't know whether this falls under
Open MPI or Mellanox.
I am running a small cluster of HP DL380 G6/G7 machines.
Each runs Ubuntu server 20.04 and has a Mellanox ConnectX-3 card,
connected by an IS dumb switch.
When I start my MPI program (snappyHexMesh for OpenFOAM), an error is
reported.
The error doesn't stop my programs or appear to cause any problems, so
this request for help is more about delving into the why.
Open MPI is compiled from source using v4.0.3, which is the default version
for Ubuntu 20.04.
This compiles and works. I did this because I wanted to understand the
compilation process whilst using a known working OMPI version.
The InfiniBand part is the Mellanox MLNX_OFED installer v4.9-0.1.7.0, which
I install with --dkms --without-fw-update --hpc --with-nfsrdma
The actual error reported is:
Warning: There was an error initialising an OpenFabrics device.
Local host: of1
Local device: mlx4_0
Then shortly after:
[of1:1015399] 19 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[of1:1015399] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Adding this MCA parameter to the mpirun line simply gives me 20 or so
copies of the first warning.
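
To make clear what I mean by adding the parameter, here is a minimal sketch of the command line. The rank count (20) and the "-parallel" solver argument are placeholders for illustration, not my exact invocation; only the --mca option is the part in question. The snippet just builds and prints the command so it can be checked before running it.

```shell
# Placeholder values: adjust the rank count and solver arguments to the real case.
NP=20
# Turn off help-message aggregation so every rank prints its own warning.
MCA_ARGS="--mca orte_base_help_aggregate 0"
CMD="mpirun -np $NP $MCA_ARGS snappyHexMesh -parallel"
# Print the assembled command line for inspection.
echo "$CMD"
```

With aggregation turned off like this, each rank prints the warning itself, which matches the 20 or so copies I see.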
Any ideas, anyone?
Cheers,
Bob.