Gilles,
Actually, there are some more strange things.
With the same environment and MPI version, I wrote a simple program that
uses the same communication logic as my hanging program.
The simple program runs without hanging (sketched below).
So what else could be the cause? I can try the possibilities one by one.
Or could I debug into the openib source code myself to find the root cause,
with your instructions or guidance?
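
For reference, here is a minimal sketch of that test (the buffer size, round
count, and ring pattern are placeholders, not my real logic):

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, round;
    char buf[4096];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(buf, rank, sizeof(buf));

    /* repeated rounds of neighbor exchange over a ring; with an even
       number of ranks the send/recv ordering below cannot deadlock */
    for (round = 0; round < 10000; round++) {
        int dest = (rank + 1) % size;        /* next rank in the ring */
        int src  = (rank + size - 1) % size; /* previous rank in the ring */
        if (rank % 2 == 0) {
            MPI_Send(buf, sizeof(buf), MPI_BYTE, dest, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof(buf), MPI_BYTE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, sizeof(buf), MPI_BYTE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof(buf), MPI_BYTE, dest, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}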

2016-01-21 17:03 GMT+08:00 Eva <wuzh...@gmail.com>:

> Gilles,
> >>Can you try to
> >>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
> >>and confirm it works fine with TCP *and* without eager ?
>
> I have tried this and it works.
> So what should I do next?
>
>
> 2016-01-21 16:25 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>> Thanks Gilles.
>> It works fine on TCP.
>> So I used this to disable eager:
>>  -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
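>> e.g. the full command line (hostfile and program name are placeholders):
>> mpirun -np 4 --hostfile hosts -mca btl openib,self -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0 ./myprog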
>>
>> 2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
>>
>>> I run with two machines, 2 processes per node: process0, process1,
>>> process2, process3.
>>> After some random rounds of communication, the program hangs.
>>> When I debug into it, I find:
>>> process1 sent a message to process2;
>>> process2 received the message from process1 and then started receiving
>>> messages from other processes.
>>> But process1 never got notified that process2 had received its message;
>>> it hangs in MPI_Send -> ... -> poll_device() of rdmav2:
>>>
>>> #0  0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
>>> #1  0x00007f6bacf1ed93 in poll_device () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>> #2  0x00007f6bacf1f7ed in btl_openib_component_progress () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>> #3  0x00007f6bb06539da in opal_progress () from /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
>>> #4  0x00007f6bab831f55 in mca_pml_ob1_send () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
>>> #5  0x00007f6bb0df33c2 in PMPI_Send () from /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>>>
>>> Some experiments I have tried:
>>> 1. compiling Open MPI without multi-thread support enabled
>>> 2. --mca pml_ob1_use_early_completion 0
>>> 3. disabling eager mode
>>> 4. using MPI_Ssend / MPI_Bsend instead of MPI_Send (snippet below)
>>>
>>> but it still hangs.
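>>>
>>> For experiment 4 the change is only at the call sites, e.g. (buf, count,
>>> dest, and tag stand for my real arguments):
>>>   MPI_Ssend(buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD); /* was MPI_Send */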
>>>
>>> The same program worked fine on TCP for more than a year. After I moved
>>> it onto RDMA, it started to hang, and I can't debug into any RDMA details.
>>>
>>> 2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>
>>>> Running MPI_Send on Open MPI 1.8.5 without multithreading enabled:
>>>> it hangs in mca_pml_ob1_send() -> opal_progress() ->
>>>> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so ->
>>>> cq -> pthread_spin_unlock.
>>>> The program runs on TCP with no error.
>>>>
