Oops, I knew I forgot something!
I am using OpenMPI 3.1.1.
I have tried loading an OpenMPI 4.0.3 module, but I get a repeating error at
runtime:
[tcn560.bullx:16698] pml_ucx.c:175 Error: Failed to receive UCX worker address: Not found (-13)
[tcn560.bullx:16698] [[42671,6],29] ORTE_ERROR_LOG:
Sean,
you might also want to confirm openib is (part of) the issue by running
your app on TCP only.
mpirun --mca pml ob1 --mca btl tcp,self ...
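
For a side-by-side comparison, something like the sketch below might help. It
only prints the two command lines unless RUN=1 is set. The application path
./my_app and the rank count of 4 are placeholders, not taken from your report,
and the UCX line assumes your Open MPI build was configured with UCX support.

```shell
#!/bin/sh
# Placeholders (assumptions): substitute your own binary and rank count.
APP=${APP:-./my_app}
NP=${NP:-4}
MPIRUN=${MPIRUN:-mpirun}

# TCP only: ob1 PML with the tcp and self BTLs, so openib is never selected.
TCP_CMD="$MPIRUN -np $NP --mca pml ob1 --mca btl tcp,self $APP"

# UCX PML: only meaningful if this Open MPI build includes UCX.
UCX_CMD="$MPIRUN -np $NP --mca pml ucx $APP"

# Dry run by default: show what would be executed.
echo "$TCP_CMD"
echo "$UCX_CMD"

# Set RUN=1 to actually launch both variants, one after the other.
if [ "${RUN:-0}" = 1 ]; then
    $TCP_CMD
    $UCX_CMD
fi
```

If the TCP-only run is clean and the other one is not, that points at the
interconnect layer rather than at the application.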
Cheers,
Gilles
- Original Message -
> Hi Sean,
>
> Thanks for the report! I have a few questions/suggestions:
>
> 1) What version of Open
Hi Sean,
Thanks for the report! I have a few questions/suggestions:
1) What version of Open MPI are you using?
2) What is your network? It sounds like you are on an IB cluster using
btl/openib (which is essentially discontinued). Can you try the Open MPI
4.0.4 release with UCX instead of openi