And by the way, you did run
mpirun --mca btl_tcp_eager_limit 56
in order to disable eager mode, right?
--mca btl_tcp_rndv_eager_limit 0
does something different.
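If you want to double-check what these parameters do and which values are
currently in effect, ompi_info can list them; something like this should
work (the exact --level needed may depend on your build), and since your
backtrace is in the openib BTL, the analogous openib parameters may be the
ones that actually matter here:

ompi_info --param btl tcp --level 9 | grep eager
ompi_info --param btl openib --level 9 | grep eager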
Cheers,
Gilles
On 1/21/2016 2:10 PM, Eva wrote:
I run on two machines with 2 processes per node: process0, process1,
process2, process3.
After some random number of rounds of communication, the program hangs.
When I debug into the program, I find:
process1 sent a message to process2;
process2 received the message from process1 and then started to receive
messages from other processes.
But process1 never learns that process2 has received its message, and it
hangs in MPI_Send -> ... -> poll_device() of rdmav2 (a minimal sketch of
this exchange is included after the backtrace below).
#0 0x00007f6ba95f03e5 in ?? () from /usr/lib64/libmlx4-rdmav2.so
#1 0x00007f6bacf1ed93 in poll_device () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
#2 0x00007f6bacf1f7ed in btl_openib_component_progress () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_btl_openib.so
#3 0x00007f6bb06539da in opal_progress () from /home/openmpi-1.8.5-gcc4.8/lib/libopen-pal.so.6
#4 0x00007f6bab831f55 in mca_pml_ob1_send () from /home/openmpi-1.8.5-gcc4.8/lib/openmpi/mca_pml_ob1.so
#5 0x00007f6bb0df33c2 in PMPI_Send () from /home/openmpi-1.8.5-gcc4.8/lib/libmpi.so.1
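To make the pattern concrete, here is a minimal sketch of the kind of
exchange I mean (illustrative only, not the actual program; in the real
run the hang only appears after many rounds):

/* Rank 1 sends to rank 2; rank 2 receives from rank 1 and then from the
 * remaining ranks, which also send to rank 2. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, buf = 0, src;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 1) {
        /* In the real program, the send corresponding to this one is the
         * one that gets stuck in poll_device() over openib. */
        MPI_Send(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {
        MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (src = 0; src < size; src++) {
            if (src == 1 || src == 2)
                continue;
            MPI_Recv(&buf, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    } else {
        MPI_Send(&buf, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}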
Some experiments I have tried:
1. compile Open MPI without multi-threading enabled
2. --mca pml_ob1_use_early_completion 0 (see the example command line after this list)
3. disable eager mode
4. use MPI_Ssend and MPI_Bsend instead of MPI_Send
but it still hangs.
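For reference, settings like the one in item 2 go on the mpirun command
line; a rough example, with ./my_app standing in for the real application
(not the exact command I used):

mpirun -np 4 -npernode 2 --mca btl openib,self,sm \
       --mca pml_ob1_use_early_completion 0 ./my_app

A TCP-only run would use --mca btl tcp,self,sm instead of openib,self,sm.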
The same program has worked fine over TCP for more than a year. After I
moved it onto RDMA it started to hang, and I can't debug into any RDMA
details.
2016-01-21 11:24 GMT+08:00 Eva <wuzh...@gmail.com>:
Running MPI_Send on Open MPI 1.8.5 without multi-threading enabled:
it hangs in mca_pml_ob1_send() -> opal_progress() ->
btl_openib_component_progress() -> poll_device() ->
libmlx4-rdmav2.so -> cq -> pthread_spin_unlock.
The program runs on TCP with no error.
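For anyone who wants to see the same call chain, attaching gdb to the
stuck rank and dumping the backtraces of all threads works, for example:

gdb -p <pid of the hung rank>
(gdb) thread apply all bt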