You can try a more recent version of Open MPI: 1.10.2 was released recently,
or you can try a nightly snapshot of master.

If all of these still fail, can you post a trimmed version of your program so
we can investigate?
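
In case it helps when comparing runs, here is a minimal sketch that only prints
the mpirun command lines discussed in this thread, so each one can be reviewed
and then run by hand. The program name ./my_app, the process count of 4, and
the explicit "--mca btl openib,self" selection are assumptions made for
illustration; the other MCA parameters are the ones quoted below.

```shell
#!/bin/sh
# Hypothetical defaults: neither value comes from the thread, adjust as needed.
APP="${APP:-./my_app}"   # assumed program name
NP="${NP:-4}"            # 2 nodes x 2 processes per node, as in the report

# Print (do not execute) the mpirun command line for one named configuration.
mpirun_cmd() {
    case "$1" in
        tcp_no_eager)
            # TCP only, with the eager limit lowered to 56 bytes
            echo "mpirun -np $NP --mca btl tcp,self --mca btl_tcp_eager_limit 56 $APP" ;;
        openib_no_eager_rdma)
            # openib with eager RDMA switched off (btl selection is an assumption)
            echo "mpirun -np $NP --mca btl openib,self --mca btl_openib_use_eager_rdma 0 --mca btl_openib_max_eager_rdma 0 $APP" ;;
        openib_no_early_completion)
            # openib with early send completion switched off in the ob1 PML
            echo "mpirun -np $NP --mca btl openib,self --mca pml_ob1_use_early_completion 0 $APP" ;;
        *)
            echo "unknown configuration: $1" >&2
            return 1 ;;
    esac
}

# Dry run: list every command line, one per configuration.
for cfg in tcp_no_eager openib_no_eager_rdma openib_no_early_completion; do
    mpirun_cmd "$cfg"
done
```

Setting APP and NP in the environment overrides the defaults, for example
"APP=./reproducer NP=4 sh list_runs.sh" (the script name is also hypothetical).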

Cheers,

Gilles

Eva <wuzh...@gmail.com> wrote:
>Gilles,
>
>>>Can you try to 
>>>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ... 
>>>and confirm it works fine with TCP *and* without eager ? 
>
>
>I have tried this and it works. 
>
>So what should I do next?
>
>
>
>2016-01-21 16:25 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>Thanks Gilles.
>
>It works fine on TCP.
>
>So I use this to disable eager:
>
> -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
>
>
>2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>I run on two machines, 2 processes per node: process0, process1, process2, 
>process3.
>
>After some random number of communication rounds, the communication hangs. 
>When I debugged the program, I found:
>
>process1 sent a message to process2; 
>
>process2 received the message from process1 and then started to receive 
>messages from other processes. 
>
>But process1 is never notified that process2 has received its message, and 
>it hangs in MPI_Send->...->poll_device() of rdmav2.
>
>
>#0  0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
>#1  0x00007f6bacf1ed93 in poll_device () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>#2  0x00007f6bacf1f7ed in btl_openib_component_progress () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>#3  0x00007f6bb06539da in opal_progress () from /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
>#4  0x00007f6bab831f55 in mca_pml_ob1_send () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
>#5  0x00007f6bb0df33c2 in PMPI_Send () from /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>
>
>Some experiments I have tried:
>
>1. compile Open MPI without multi-thread support enabled
>
>2. --mca pml_ob1_use_early_completion 0
>
>3. disable eager mode
>
>4. use MPI_Ssend or MPI_Bsend instead of MPI_Send
>
>
>but it still hangs.
>
>
>The same program has worked fine on TCP for more than a year. After I moved 
>it to RDMA, it started to hang, and I can't debug into any of the RDMA details.
>
>
>2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>Running MPI_Send on Open MPI 1.8.5 without multithreading enabled:
>
>it hangs in mca_pml_ob1_send() -> opal_progress() -> 
>btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so -> cq -> 
>pthread_spin_unlock
>
>The program can run on TCP with no error.
>
>
>
>
