Gilles,

Actually, there is something else strange. With the same environment and MPI version, I wrote a simple program that uses the same communication logic as the program that hangs. The simple program runs without hanging. What could be the reason? If there are several possible causes, I can test them one by one. Or could I debug into the openib source code to find the root cause, with your instructions or guidance?
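To make the one-by-one testing concrete, here is a rough sketch that lists one run per setting, using only the MCA options already mentioned in this thread. `./my_app` is a placeholder for the real binary, and the default of 4 processes just mirrors the two-node, two-process-per-node setup; the script only prints the command lines and does not run them.

```shell
#!/bin/sh
# APP is a hypothetical placeholder, not the real program name.
# NP=4 matches the 2 nodes x 2 processes setup described in the thread.
APP="${APP:-./my_app}"
NP="${NP:-4}"

# Print (do not execute) one mpirun command line per experiment.
# Each line of the here-document is one set of MCA options taken
# from the messages quoted below.
print_experiments() {
    while IFS= read -r flags; do
        echo "mpirun -np $NP $flags $APP"
    done <<EOF
--mca btl tcp,self --mca btl_tcp_eager_limit 56
-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
--mca pml_ob1_use_early_completion 0
EOF
}

print_experiments
```

Running the printed lines one at a time, against both the simple program and the hanging one, should at least show which setting changes the behaviour.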

2016-01-21 17:03 GMT+08:00 Eva <wuzh...@gmail.com>:

> Gilles,
> >>Can you try to
> >>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
> >>and confirm it works fine with TCP *and* without eager ?
>
> I have tried this and it works.
> So what should I do next?
>
>
> 2016-01-21 16:25 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>> Thanks Gilles.
>> it works fine on tcp
>> So I use this to disable eager:
>> -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
>>
>> 2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
>>
>>> I run with two machines, 2 process per node: process0, process1,
>>> process2, process3.
>>> After some random rounds of communications, the communication hangs.
>>> When I debug into the program, I found:
>>> process1 sent a message to process2;
>>> process2 received the message from process1 and then start to receive
>>> messages from other processes.
>>> But process1 doesn't get notice: process2 has received its message and
>>> then hang on MPI_Send->...->poll_device() of rdmav2.
>>>
>>> #0 0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
>>> #1 0x00007f6bacf1ed93 in poll_device () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>> #2 0x00007f6bacf1f7ed in btl_openib_component_progress () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>>> #3 0x00007f6bb06539da in opal_progress () from /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
>>> #4 0x00007f6bab831f55 in mca_pml_ob1_send () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
>>> #5 0x00007f6bb0df33c2 in PMPI_Send () from /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>>>
>>> Some experiments I have tried:
>>> 1. compile openmpi without multi-thread enable
>>> 2. --mca pml_ob1_use_early_completion 0
>>> 3. disable eager mode
>>> 4. ssend, Bsend
>>>
>>> but it still hangs.
>>>
>>> The same program works fine on TCP for more than one year. After I move
>>> it onto rdma, it starts to hang.
>>> And I can't debug into any rdma details
>>>
>>> 2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>>>
>>>> Run MPI_Send on MPI1.8.5 without multithread enabled:
>>>> it hangs on mca_pml_ob1_send() -> opal_progress() ->
>>>> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so -> cq
>>>> -> pthread_spin_unlock
>>>> The program can run on TCP with no error.
>>>>
>>>
>>>
>>
>