Oops, I knew I forgot something!
I am using OpenMPI 3.1.1.
I have tried loading an OpenMPI 4.0.3 module but receive a repeating error at
runtime:
[tcn560.bullx:16698] pml_ucx.c:175 Error: Failed to receive UCX worker address: Not found (-13)
[tcn560.bullx:16698] [[42671,6],29] ORTE_ERROR_LOG:
Sean,
you might also want to confirm openib is (part of) the issue by running
your app on TCP only.
mpirun --mca pml ob1 --mca btl tcp,self ...
Cheers,
Gilles
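
For reference, here is a rough sketch of the two runs being compared in this thread: TCP only (ob1 PML with the tcp and self BTLs, so openib is never selected) and the UCX PML. The process count and ./my_app are placeholders, not values from the original report, and the script only prints the commands rather than launching anything.

```shell
#!/bin/sh
# Placeholders (assumptions, not taken from the thread): process count and binary.
NP=4
APP=./my_app

# TCP only: force the ob1 PML with the tcp and self BTLs, which excludes openib.
TCP_CMD="mpirun --mca pml ob1 --mca btl tcp,self -np $NP $APP"

# UCX: select the UCX PML instead of ob1.
UCX_CMD="mpirun --mca pml ucx -np $NP $APP"

# Dry run: print the commands instead of launching them.
echo "$TCP_CMD"
echo "$UCX_CMD"
```

If the hang disappears with the TCP-only command, that points toward openib as (part of) the issue.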
- Original Message -
> Hi Sean,
>
> Thanks for the report! I have a few questions/suggestions:
>
> 1) What version of Open
Hi Sean,
Thanks for the report! I have a few questions/suggestions:
1) What version of Open MPI are you using?
2) What is your network? It sounds like you are on an IB cluster using
btl/openib (which is essentially discontinued). Can you try the Open MPI
4.0.4 release with UCX instead of openi
Hi all,
I am encountering a silent hang involving MPI_Ssend and MPI_Irecv. The
subroutine in question is called by each processor and is structured similarly to
the pseudo code below. The subroutine is successfully called several thousand
times before the silent hang behavior manifests and never