Hi Sean,
Thanks for the report! I have a few questions/suggestions:
1) What version of Open MPI are you using?
2) What is your network? It sounds like you are on an IB cluster using
btl/openib (which is essentially discontinued). Can you try the Open MPI
4.0.4 release with UCX instead of openi
Sean,
you might also want to confirm openib is (part of) the issue by running
your app on TCP only.
mpirun --mca pml ob1 --mca btl tcp,self ...
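
A minimal sketch of that TCP-only check as a script. The application name ./my_app and the rank count of 4 are placeholders, not taken from the report. It only prints the command so you can inspect it first, and it lists the btl/pml components of your build if ompi_info is on the PATH:

```shell
#!/bin/sh
# Placeholder values (assumptions): substitute your own binary and rank count.
APP=./my_app
NP=4

# Select the ob1 PML with only the tcp and self BTLs, so openib is not used.
TCP_ONLY_CMD="mpirun -np $NP --mca pml ob1 --mca btl tcp,self $APP"

# Optional: show which btl/pml components this Open MPI build provides.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -E "MCA (btl|pml):" || true
fi

# Print the command rather than running it; run it by hand once it looks right.
echo "$TCP_ONLY_CMD"
```

If the problem disappears with this run, openib is the likely culprit. If it persists, the issue is probably elsewhere.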
Cheers,
Gilles
- Original Message -
> Hi Sean,
>
> Thanks for the report! I have a few questions/suggestions:
>
> 1) What version of Open
Oops, I knew I forgot something!
I am using Open MPI 3.1.1.
I have tried loading an Open MPI 4.0.3 module but receive a repeating error at
runtime:
[tcn560.bullx:16698] pml_ucx.c:175 Error: Failed to receive UCX worker
address: Not found (-13)
[tcn560.bullx:16698] [[42671,6],29] ORTE_ERROR_LOG: