Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-22 Thread Eva
>>You can try a more recent version of openmpi: >>1.10.2 was released recently, or try with a nightly snapshot of master. >>If all of these still fail, can you post a trimmed version of your program so we can investigate? Hi Gilles, I tried 1.10.2. My program has been running successfully without

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Jeff Squyres (jsquyres)
On Jan 21, 2016, at 7:40 AM, Eva wrote: > > Thanks Jeff. > > >>1. Can you create a small example to reproduce the problem? > > >>2. The TCP and verbs-based transports use different thresholds and > >>protocols, and can sometimes bring to light errors in the application > >>(e.g., the applica

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Eva
Thanks Jeff. >>1. Can you create a small example to reproduce the problem? >>2. The TCP and verbs-based transports use different thresholds and protocols, and can sometimes bring to light errors in the application (e.g., the application is making assumptions that just happen to be true for TCP, b

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Jeff Squyres (jsquyres)
Can you create a small example to reproduce the problem? The TCP and verbs-based transports use different thresholds and protocols, and can sometimes bring to light errors in the application (e.g., the application is making assumptions that just happen to be true for TCP, but not necessarily fo

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Gilles Gouaillardet
That could be a bug in openib, openmpi and/or your application. For example, memory corruption could go unnoticed with tcp, but might cause openib to hang. You can start by running your program under a memory debugger (valgrind, ddt or other) and confirm your application works fine. You can also up

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Eva
Gilles, Actually, there are some more strange things. With the same environment and MPI version, I wrote a simple program using the same communication logic as my hanging program. The simple program runs without hanging. So is there any possible reason? I can try them one by one. Or can I debug

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Gilles Gouaillardet
You can try a more recent version of openmpi: 1.10.2 was released recently, or try with a nightly snapshot of master. If all of these still fail, can you post a trimmed version of your program so we can investigate? Cheers, Gilles Eva wrote: >Gilles, > >>>Can you try to  >>>mpirun --mca btl t

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Eva
Gilles, >>Can you try to >>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ... >>and confirm it works fine with TCP *and* without eager ? I have tried this and it works. So what should I do next? 2016-01-21 16:25 GMT+08:00 Eva : > Thanks Gilles. > it works fine on tcp > So I use this to

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Gilles Gouaillardet
Can you try to mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ... and confirm it works fine with TCP *and* without eager ? Cheers, Gilles On 1/21/2016 5:25 PM, Eva wrote: Thanks Gilles. it works fine on tcp So I use this to disable eager: -mca btl_openib_use_eager_rdma 0 -mca btl_open
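
The two runs discussed in this thread can be lined up side by side as a small dry-run script. This is only a sketch: `./my_app`, the process count, and the `run` helper are hypothetical placeholders, and selecting the InfiniBand transport with `--mca btl openib,self` is an assumption about how the hanging job was launched. The MCA parameters themselves are the ones quoted in the messages.

```shell
#!/bin/sh
# Dry-run sketch: prints the mpirun command lines discussed in this thread.
# APP and NP are hypothetical placeholders; set RUN=1 to actually execute.
APP="${APP:-./my_app}"
NP="${NP:-4}"

# Print the command, and execute it only when RUN=1.
run() {
    echo "$@"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

# 1. TCP only, with the small eager limit suggested by Gilles,
#    to check that the program works on TCP *and* without eager.
run mpirun -np "$NP" --mca btl tcp,self --mca btl_tcp_eager_limit 56 "$APP"

# 2. openib with the eager RDMA settings Eva reported using.
#    ("--mca btl openib,self" is an assumption, not quoted in the thread.)
run mpirun -np "$NP" --mca btl openib,self \
    --mca btl_openib_use_eager_rdma 0 \
    --mca btl_openib_max_eager_rdma 0 "$APP"
```

By default the script only prints the commands, so it can be reviewed before anything is launched; running it with `RUN=1 APP=./your_program` executes them in turn.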

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Eva
Thanks Gilles. It works fine on TCP. So I use this to disable eager: -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 2016-01-21 13:10 GMT+08:00 Eva : > I run with two machines, 2 process per node: process0, process1, process2, > process3. > After some random rounds of communicat

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Gilles Gouaillardet
and by the way, you did mpirun --mca btl_tcp_eager_limit 56 in order to disable eager mode, right ? --mca btl_tcp_rndv_eager_limit 0 does something different Cheers, Gilles On 1/21/2016 2:10 PM, Eva wrote: I run with two machines, 2 process per node: process0, process1, process2, process3. Af

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Gilles Gouaillardet
Hi, can you post a trimmed version of your program so we can reproduce and analyze the hang ? Cheers, Gilles On 1/21/2016 2:10 PM, Eva wrote: I run with two machines, 2 process per node: process0, process1, process2, process3. After some random rounds of communications, the communication ha

Re: [OMPI users] MPI hangs on poll_device() with rdma

2016-01-21 Thread Eva
I run with two machines, 2 processes per node: process0, process1, process2, process3. After some random rounds of communication, the communication hangs. When I debugged the program, I found: process1 sent a message to process2; process2 received the message from process1 and then started to receiv

[OMPI users] MPI hangs on poll_device() with rdma

2016-01-20 Thread Eva
Running MPI_Send on Open MPI 1.8.5 without multithreading enabled, it hangs in mca_pml_ob1_send() -> opal_progress() -> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so -> cq -> pthread_spin_unlock. The program runs on TCP with no error.