>>You can try a more recent version of openmpi
>>1.10.2 was released recently, or try with a nightly snapshot of master.

>>If all of these still fail, can you post a trimmed version of your
>>program so we can investigate?

Hi Gilles,

I tried 1.10.2. My program has been running without any hang for 4 hours now.
I will continue to watch its status.

By the way, were any such hang issues fixed between 1.8.5 and 1.10.2?



2016-01-21 20:40 GMT+08:00 Eva <wuzh...@gmail.com>:

> Thanks Jeff.
>
> >>1. Can you create a small example to reproduce the problem?
>
> >>2. The TCP and verbs-based transports use different thresholds and
> protocols, and can sometimes bring to light errors in the application
> (e.g., the application is making assumptions that just happen to be true
> for TCP, but not necessarily for other transports).
>
> >>3. Is your program multi-threaded? If so, MPI_THREAD_MULTIPLE support in
> the v1.8 and v1.10 series is not fully baked.
>
> >>4. Additionally, if you have buffering / matching / progression
> assumptions in your application, you might accidentally block. An
> experiment to try to is to convert all MPI_SEND and MPI_ISEND to MPI_SSEND
> and MPI_ISSEND, respectively, and see if your program still functions
> properly on TCP.
>
>
> 1. I will try to create a small example to reproduce the problem.
>
> 2. I didn't quite follow this point. I didn't make any assumptions specific
> to TCP. Does MPI behave differently over TCP and RDMA?
>
> 3. My program doesn't enable MPI_THREAD_MULTIPLE.
>
> 4. What do you mean by buffering / matching / progression assumptions in my
> application?
>
> My program communicates like this:
>
> 4 processes: process0, process1, process2, process3
>
> process1 / process3:
>
>     foreach to_id in (process0, process2):
>         MPI_Send(send_buf, sendlen, to_id, TAG);
>         MPI_Recv(recv_buf, recvlen, to_id, TAG);
>
> process0 / process2:
>
>     while (true):
>         MPI_Recv(recv_buf, any_source, TAG);
>         MPI_Send(send_buf, source_id, TAG);
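>
> A minimal self-contained reproducer along those lines could look like the
> sketch below. It only mirrors the pattern above; the buffer size, round
> count, and message contents are placeholders, not my actual code. Ranks 1
> and 3 act as "clients", ranks 0 and 2 as "servers":
>
>     #include <mpi.h>
>     #include <string.h>
>
>     #define TAG     100
>     #define MSG_LEN 4096     /* placeholder message size */
>     #define ROUNDS  100000   /* placeholder number of request/response rounds */
>
>     int main(int argc, char **argv) {
>         int rank, size;
>         char send_buf[MSG_LEN], recv_buf[MSG_LEN];
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>         if (size != 4) MPI_Abort(MPI_COMM_WORLD, 1);  /* pattern assumes 4 ranks */
>         memset(send_buf, 'x', MSG_LEN);
>
>         if (rank == 1 || rank == 3) {
>             /* "client" ranks: request/response with each server rank */
>             int servers[2] = {0, 2};
>             for (int r = 0; r < ROUNDS; r++) {
>                 for (int i = 0; i < 2; i++) {
>                     MPI_Send(send_buf, MSG_LEN, MPI_CHAR, servers[i], TAG,
>                              MPI_COMM_WORLD);
>                     MPI_Recv(recv_buf, MSG_LEN, MPI_CHAR, servers[i], TAG,
>                              MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>                 }
>             }
>         } else {
>             /* "server" ranks: answer every request from any source */
>             MPI_Status status;
>             for (int r = 0; r < 2 * ROUNDS; r++) {
>                 MPI_Recv(recv_buf, MSG_LEN, MPI_CHAR, MPI_ANY_SOURCE, TAG,
>                          MPI_COMM_WORLD, &status);
>                 MPI_Send(send_buf, MSG_LEN, MPI_CHAR, status.MPI_SOURCE, TAG,
>                          MPI_COMM_WORLD);
>             }
>         }
>
>         MPI_Finalize();
>         return 0;
>     }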
>
>
>
> 2016-01-21 17:49 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>> Gilles,
>> Actually, there is something even stranger.
>> In the same environment and with the same MPI version, I wrote a simple
>> program that uses the same communication logic as my hanging program.
>> The simple program runs without hanging.
>> So what could the possible reasons be? I can try them one by one.
>> Or could I debug into the openib source code to find the root cause, with
>> some instructions or guidance from you?
>>
>> 2016-01-21 17:03 GMT+08:00 Eva <wuzh...@gmail.com>:
>>
>>> Gilles,
>>> >>Can you try to
>>> >>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
>>> >>and confirm it works fine with TCP *and* without eager ?
>>>
>>> I have tried this and it works.
>>> So what should I do next?
>>>
>>>
>>> 2016-01-21 16:25 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>
>>>> Thanks Gilles.
>>>> It works fine on TCP.
>>>> So I used these options to disable eager RDMA:
>>>>  -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
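>>>>
>>>> For reference, a full command line with those options might look roughly
>>>> like this (the host file name, process count, BTL list, and binary name
>>>> are placeholders, not my actual setup):
>>>>
>>>>     mpirun -np 4 -hostfile myhosts \
>>>>         --mca btl openib,sm,self \
>>>>         --mca btl_openib_use_eager_rdma 0 \
>>>>         --mca btl_openib_max_eager_rdma 0 \
>>>>         ./my_program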
>>>>
>>>> 2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>>
>>>>> I run on two machines with 2 processes per node: process0, process1,
>>>>> process2, process3.
>>>>> After some random number of rounds of communication, the communication
>>>>> hangs. When I debugged into the program, I found:
>>>>> process1 sent a message to process2;
>>>>> process2 received the message from process1 and then started to receive
>>>>> messages from other processes.
>>>>> But process1 never gets notified that process2 has received its message,
>>>>> and hangs in MPI_Send -> ... -> poll_device() of rdmav2.
>>>>>
>>>>> #0  0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
>>>>> #1  0x00007f6bacf1ed93 in poll_device () from
>>>>> /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>>>> #2  0x00007f6bacf1f7ed in btl_openib_component_progress () from
>>>>> /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>>>> #3  0x00007f6bb06539da in opal_progress () from
>>>>> /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
>>>>> #4  0x00007f6bab831f55 in mca_pml_ob1_send () from
>>>>> /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
>>>>> #5  0x00007f6bb0df33c2 in PMPI_Send () from
>>>>> /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>>>>>
>>>>> Some experiments I have tried:
>>>>> 1. compile Open MPI without multi-threading enabled
>>>>> 2. --mca pml_ob1_use_early_completion 0
>>>>> 3. disable eager RDMA
>>>>> 4. use MPI_Ssend / MPI_Bsend instead of MPI_Send (see the sketch below)
>>>>>
>>>>> But it still hangs in every case.
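>>>>>
>>>>> For experiment 4, the change was essentially the following (a sketch
>>>>> only; the buffer names, datatype, and communicator are placeholders for
>>>>> my actual code):
>>>>>
>>>>>     /* before: standard-mode MPI_Send, which may return as soon as the
>>>>>        message has been buffered internally */
>>>>>     MPI_Send(send_buf, sendlen, MPI_BYTE, to_id, TAG, MPI_COMM_WORLD);
>>>>>
>>>>>     /* after: synchronous MPI_Ssend, which only completes once the
>>>>>        matching receive has been posted on the other side */
>>>>>     MPI_Ssend(send_buf, sendlen, MPI_BYTE, to_id, TAG, MPI_COMM_WORLD);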
>>>>>
>>>>> The same program has been working fine on TCP for more than a year.
>>>>> After I moved it onto RDMA, it started to hang, and I can't debug into
>>>>> any of the RDMA details.
>>>>>
>>>>> 2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>>>
>>>>>> Running MPI_Send on Open MPI 1.8.5 without multi-threading enabled:
>>>>>> it hangs in mca_pml_ob1_send() -> opal_progress() ->
>>>>>> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so ->
>>>>>> cq -> pthread_spin_unlock.
>>>>>> The same program runs on TCP with no error.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
