You can try a more recent version of Open MPI (1.10.2 was released recently), or try a nightly snapshot of master.
If all of these still fail, can you post a trimmed version of your program so we can investigate?

Cheers,

Gilles

Eva <wuzh...@gmail.com> wrote:

>Gilles,
>
>>>Can you try to
>>>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
>>>and confirm it works fine with TCP *and* without eager ?
>
>I have tried this and it works.
>
>So what should I do next?
>
>
>2016-01-21 16:25 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>Thanks Gilles.
>
>It works fine on TCP.
>
>So I use this to disable eager RDMA:
>
>    -mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
>
>
>2016-01-21 13:10 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>I run on two machines with 2 processes per node: process0, process1, process2, process3.
>
>After some random number of rounds of communication, the program hangs. When I debug into it, I find:
>
>process1 sent a message to process2;
>
>process2 received the message from process1 and then started to receive messages from other processes.
>
>But process1 never gets notified that process2 has received its message, and hangs in MPI_Send -> ... -> poll_device() of rdmav2:
>
>#0  0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
>#1  0x00007f6bacf1ed93 in poll_device () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>#2  0x00007f6bacf1f7ed in btl_openib_component_progress () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
>#3  0x00007f6bb06539da in opal_progress () from /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
>#4  0x00007f6bab831f55 in mca_pml_ob1_send () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
>#5  0x00007f6bb0df33c2 in PMPI_Send () from /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
>
>Some experiments I have tried:
>
>1. compile Open MPI without multi-thread support enabled
>2. --mca pml_ob1_use_early_completion 0
>3. disable eager mode
>4. MPI_Ssend, MPI_Bsend
>
>But it still hangs.
>
>The same program has worked fine on TCP for more than a year.
>After I moved it onto RDMA, it started to hang, and I can't debug into any RDMA details.
>
>
>2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
>
>Running MPI_Send on Open MPI 1.8.5 without multithreading enabled:
>
>it hangs in mca_pml_ob1_send() -> opal_progress() -> btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so -> cq -> pthread_spin_unlock.
>
>The program runs on TCP with no error.
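
[Editor's note: the original program was never posted, so the following is only a hypothetical minimal sketch of the kind of pattern described above (send first, then receive from everyone). Such send-before-receive patterns rely on MPI eager buffering: they run fine when the message fits under a transport's eager limit (e.g. TCP's default of 64 KiB) but deadlock under rendezvous when every rank blocks in MPI_Send, which is consistent with "works on TCP, hangs on openib / without eager". All names, sizes, and the ROUNDS/MSG_COUNT parameters here are assumptions, not taken from the thread. This requires an MPI installation and is not runnable standalone.]

```c
/* Hypothetical reproducer sketch (NOT the original program):
 * every rank sends one message to each other rank, and only then
 * posts its receives.  If the message is too large for the eager
 * path, all ranks block in MPI_Send and the program deadlocks.
 *
 * Build:  mpicc -o repro repro.c
 * Run:    mpirun -np 4 --map-by node ./repro
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_COUNT 4096   /* 32 KiB of doubles: an assumed size chosen to sit
                            above a small eager limit but below a large one */
#define ROUNDS    10000

int main(int argc, char **argv)
{
    int rank, size, round, peer;
    double *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(MSG_COUNT * sizeof(double));
    recvbuf = malloc(MSG_COUNT * sizeof(double));

    for (round = 0; round < ROUNDS; round++) {
        /* unsafe: all sends are posted before any receive */
        for (peer = 0; peer < size; peer++)
            if (peer != rank)
                MPI_Send(sendbuf, MSG_COUNT, MPI_DOUBLE,
                         peer, 0, MPI_COMM_WORLD);

        /* ...then receive one message from every other rank */
        for (peer = 0; peer < size; peer++)
            if (peer != rank)
                MPI_Recv(recvbuf, MSG_COUNT, MPI_DOUBLE,
                         peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0 && round % 1000 == 0)
            printf("round %d done\n", round);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

If a pattern like this is the cause, replacing MPI_Send with MPI_Isend plus MPI_Waitall (or pairing sends and receives with MPI_Sendrecv) removes the dependence on eager buffering entirely; note that MPI_Ssend, which the thread already tried, makes such a deadlock deterministic rather than fixing it.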