That could be a bug in openib, Open MPI, and/or your application.
For example, memory corruption could go unnoticed with TCP, but might
cause openib to hang.

You can start by running your program under a memory debugger
(valgrind, DDT, or another) and confirming your application works fine.
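With Open MPI you can interpose the debugger on the mpirun command line,
for example (assuming valgrind is installed on every node; ./your_app
stands for your binary):

mpirun -np 4 valgrind ./your_app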
You can also update your application to use a unique tag per message, and
confirm that both the send and the recv hang on the same message; see the
sketch below.
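Something along these lines (illustrative only; MSG_MOD, send_msg, and
recv_msg are made-up names, and a real application would keep one counter
per peer):

#include <mpi.h>

/* Sketch: give every message a distinct tag, so that when the hang
 * occurs you can verify sender and receiver are blocked on the very
 * same message.  MSG_MOD keeps the tag small; the MPI standard only
 * guarantees tags up to the MPI_TAG_UB attribute. */
#define MSG_MOD 32768

static int send_seq = 0;   /* use one counter per peer in a real app */
static int recv_seq = 0;

void send_msg(const void *buf, int n, int dest)
{
    MPI_Send(buf, n, MPI_BYTE, dest, send_seq++ % MSG_MOD, MPI_COMM_WORLD);
}

void recv_msg(void *buf, int n, int src)
{
    MPI_Recv(buf, n, MPI_BYTE, src, recv_seq++ % MSG_MOD,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

When it hangs, attach a debugger on both sides and check that the tag
process1 is sending is the tag process2 is (or is not) waiting for.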

I assume you ran
ulimit -l unlimited
or an equivalent before invoking mpirun
(and that this is correctly propagated to the remote nodes).
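If the limit is not propagated (for example when processes are started
via a resource manager), a common fix (assuming a standard Linux PAM
setup; paths can differ per distribution) is to raise the locked-memory
limit in /etc/security/limits.conf on every node:

* soft memlock unlimited
* hard memlock unlimited

and then restart sshd and/or the resource manager daemons so they pick
up the new limits.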

btw, did you try another MPI library?
(MPICH/MVAPICH, Intel MPI, or other)
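For example, assuming MVAPICH2 is installed and its mpicc/mpirun come
first in your PATH, rebuilding and rerunning the same test would tell
whether the hang is specific to Open MPI:

mpicc -o your_app your_app.c
mpirun -np 4 ./your_app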

That being said, my preferred option is that you make your application
source available, so the hang can be reproduced on a different system and
then debugged.

Cheers,

Gilles

On Thursday, January 21, 2016, Eva <wuzh...@gmail.com> wrote:

> Gilles,
> Actually, there are some more strange things.
> With the same environment and MPI version, I write a simple program by
> using the same communication logic with my hang program.
> The simple program can work without hang.
> So is there any possible reason?  I can try them one by one.
> Or can I debug into the openib source code to find the root cause with
> your instructions or guide?
>
> 2016-01-21 17:03 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>> Gilles,
>> >>Can you try to
>> >>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
>> >>and confirm it works fine with TCP *and* without eager ?
>>
>> I have tried this and it works.
>> So what should I do next?
>>
>>
>> 2016-01-21 16:25 GMT+08:00 Eva <wuzh...@gmail.com>:
>>
>>> Thanks Gilles.
>>> it works fine on tcp
>>> So I use this to disable eager:
>>>  -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
>>>
>>> 2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>
>>>> I run with two machines, 2 processes per node: process0, process1,
>>>> process2, process3.
>>>> After some random number of rounds of communication, the communication
>>>> hangs.
>>>> When I debugged into the program, I found:
>>>> process1 sent a message to process2;
>>>> process2 received the message from process1 and then started to receive
>>>> messages from other processes.
>>>> But process1 never gets notified that process2 has received its message,
>>>> and it hangs in MPI_Send->...->poll_device() of rdmav2.
>>>>
>>>> #0  0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
>>>> #1  0x00007f6bacf1ed93 in poll_device () from
>>>> /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>>> #2  0x00007f6bacf1f7ed in btl_openib_component_progress () from
>>>> /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>>> #3  0x00007f6bb06539da in opal_progress () from
>>>> /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
>>>> #4  0x00007f6bab831f55 in mca_pml_ob1_send () from
>>>> /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
>>>> #5  0x00007f6bb0df33c2 in PMPI_Send () from
>>>> /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>>>>
>>>> Some experiments I have tried:
>>>> 1. compiling Open MPI without multi-threading enabled
>>>> 2. --mca pml_ob1_use_early_completion 0
>>>> 3. disabling eager mode
>>>> 4. MPI_Ssend, MPI_Bsend
>>>>
>>>> but it still hangs.
>>>>
>>>> The same program has worked fine on TCP for more than one year. After I
>>>> moved it onto RDMA, it started to hang, and I can't debug into any RDMA
>>>> details.
>>>>
>>>> 2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>>
>>>>> Running MPI_Send on Open MPI 1.8.5 without multithreading enabled:
>>>>> it hangs in mca_pml_ob1_send() -> opal_progress() ->
>>>>> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so ->
>>>>> cq -> pthread_spin_unlock
>>>>> The program can run on TCP with no error.
