Thanks Gilles. It works fine on TCP, so I use these flags to disable eager RDMA:
-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
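The full command line looks roughly like this (a sketch only: the hostfile and
the binary name ./a.out are placeholders for my actual setup, and -np 4 matches
the 2-node / 2-ranks-per-node layout below):

mpirun -np 4 -hostfile hosts --map-by ppr:2:node \
    -mca btl_openib_use_eager_rdma 0 \
    -mca btl_openib_max_eager_rdma 0 \
    ./a.out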
2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
> I run with two machines, 2 processes per node: process0, process1,
> process2, process3.
> After some random rounds of communication, the program hangs. When I
> debugged it, I found:
> process1 sent a message to process2;
> process2 received the message from process1 and then started receiving
> messages from other processes.
> But process1 is never notified that process2 has received its message,
> and hangs in MPI_Send -> ... -> poll_device() of rdmav2:
>
> #0 0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
> #1 0x00007f6bacf1ed93 in poll_device () from
>    /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
> #2 0x00007f6bacf1f7ed in btl_openib_component_progress () from
>    /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
> #3 0x00007f6bb06539da in opal_progress () from
>    /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
> #4 0x00007f6bab831f55 in mca_pml_ob1_send () from
>    /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
> #5 0x00007f6bb0df33c2 in PMPI_Send () from
>    /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>
> Some experiments I have tried:
> 1. compiling Open MPI without multi-threading enabled
> 2. --mca pml_ob1_use_early_completion 0
> 3. disabling eager mode
> 4. MPI_Ssend and MPI_Bsend
>
> but it still hangs.
>
> The same program has worked fine on TCP for more than a year. After I
> moved it onto RDMA, it started to hang, and I can't debug into any RDMA
> details.
>
> 2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>> Running MPI_Send on Open MPI 1.8.5 without multithreading enabled:
>> it hangs in mca_pml_ob1_send() -> opal_progress() ->
>> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so ->
>> cq -> pthread_spin_unlock.
>> The program runs on TCP with no error.
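For reference, the communication pattern is roughly the following
(a stripped-down sketch, not the real program; the ranks, tag, and round
count are illustrative only):

/* Stripped-down sketch of the communication pattern described above.
 * The real program is not shown; ranks, tag, and round count here are
 * illustrative only. Intended for 4 ranks (2 per node) in a ring. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, round;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (round = 0; round < 100000; round++) {
        int buf  = round;
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        /* Alternate send/recv order so the ring cannot deadlock even
         * when MPI_Send blocks (rendezvous instead of eager path). */
        if (rank % 2 == 0) {
            MPI_Send(&buf, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&buf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&buf, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}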