Re: [OMPI users] Silent hangs with MPI_Ssend and MPI_Irecv

2020-07-25 Thread Lewis,Sean via users
Oops, I knew I forgot something! I am using OpenMPI 3.1.1. I have tried loading an OpenMPI 4.0.3 module but receive a repeating error at runtime: [tcn560.bullx:16698] pml_ucx.c:175 Error: Failed to receive UCX worker address: Not found (-13) [tcn560.bullx:16698] [[42671,6],29] ORTE_ERROR_LOG:

Re: [OMPI users] Silent hangs with MPI_Ssend and MPI_Irecv

2020-07-25 Thread Gilles Gouaillardet via users
Sean, you might also want to confirm that openib is (part of) the issue by running your app over TCP only: mpirun --mca pml ob1 --mca btl tcp,self ... Cheers, Gilles - Original Message - > Hi Sean, > > Thanks for the report! I have a few questions/suggestions: > > 1) What version of Open

Re: [OMPI users] Silent hangs with MPI_Ssend and MPI_Irecv

2020-07-25 Thread Joseph Schuchart via users
Hi Sean, Thanks for the report! I have a few questions/suggestions: 1) What version of Open MPI are you using? 2) What is your network? It sounds like you are on an IB cluster using btl/openib (which is essentially discontinued). Can you try the Open MPI 4.0.4 release with UCX instead of openi

[OMPI users] Silent hangs with MPI_Ssend and MPI_Irecv

2020-07-24 Thread Lewis,Sean via users
Hi all, I am encountering a silent hang involving MPI_Ssend and MPI_Irecv. The subroutine in question is called by each processor and is structured similarly to the pseudocode below. The subroutine is successfully called several thousand times before the silent hang behavior manifests and never