Thanks Gilles. Got it. I will run it.
2016-02-26 16:10 GMT+08:00 Eva :
Thanks Gilles. What do you mean by "standard MPI benchmark"? Where can I
find it?
2016-02-26 14:47 GMT+08:00 Eva :
I measure communication time for MPI_Send and end-to-end training time
(including model training and communication time).
Open MPI 1.4.1 is faster than 1.10.2 by:
MPI_Send+MPI_Recv: 2.83%
end-to-end training time: 8.89%
2016-02-26 14:45 GMT+08:00 Eva :
I measure communication time for MPI_Send and end-to-end training time
(including model training and communication time).
Open MPI 1.4.1 is faster than 1.10.2 by:
MPI_Send+MPI_Recv: 2.83%
end-to-end training: 8.89%
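
For reference, a "standard MPI benchmark" here usually means something like
the OSU micro-benchmarks or the Intel MPI Benchmarks. The MPI_Send/MPI_Recv
part of the measurement above can also be isolated with a small ping-pong
sketch like the one below; the message size, iteration count, and use of
MPI_Wtime are my assumptions rather than the original measurement code.

/* Minimal ping-pong timing sketch (assumed sizes/iterations, not the
 * original measurement code). Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int len = 1 << 20;      /* 1 MiB message, illustrative */
    const int iters = 1000;
    char *buf = malloc(len);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("average round trip: %g us\n", (t1 - t0) / iters * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Compiling this with the mpicc from each installation and running it with
2 ranks gives a per-version number to compare against the 2.83% figure.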
2016-02-24 13:49 GMT+08:00 Eva :
I compile the same program with 1.4.1 and with 1.10.2rc3 and then run both
under the same environment. 1.4.1 is 8.89% faster than 1.10.2rc3. Is there
any official performance report for each version upgrade?
No, I didn't use MPI_Type_free.
Is there any other reason?
2016-01-26 13:35 GMT+08:00 Eva :
openmpi-1.10.2 cores at mca_coll_libnbc.so
My program was moved from 1.8.5 to 1.10.2, but when I run it, it dumps core
as below.
Program terminated with signal 11, Segmentation fault.
#0 0x7fa3550f51d2 in ompi_coll_libnbc_igather () from
/home/work/wuzhihua/install/openmpi-1.10.2rc3-gcc4.8/...
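
For what it's worth, below is a minimal sketch that exercises the same
MPI_Igather / libnbc path as the backtrace above; the counts, datatype, and
root are assumptions, so this is only a hypothetical reproduce attempt, not
the original program.

/* Hypothetical minimal exercise of MPI_Igather (the nonblocking-gather path
 * implemented by coll/libnbc); counts, datatype, and root are assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;
    int *sendbuf = malloc(count * sizeof(int));
    int *recvbuf = (rank == 0) ? malloc((size_t)size * count * sizeof(int)) : NULL;

    MPI_Request req;
    MPI_Igather(sendbuf, count, MPI_INT,
                recvbuf, count, MPI_INT,
                0, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}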
...successfully without any hang for 4 hours now. I will continue to watch
its status.
Btw, have you fixed any such hang issues from 1.8.5 to 1.10.2?
2016-01-21 20:40 GMT+08:00 Eva :
Thanks Jeff.

>> 1. Can you create a small example to reproduce the problem?

The communication pattern is:
process1/process3:
    foreach to_id in (process0, process2):
        MPI_Send(send_buf, sendlen, to_id, TAG);
        MPI_Recv(recv_buf, recvlen, to_id, TAG);

process0/process2:
    while (true):
        MPI_Recv(recv_buf, any_source, TAG);
        MPI_Send(send_buf, source_id, TAG);
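
Spelled out as a self-contained C sketch of the pattern above, under
assumptions that are not in the original: exactly 4 ranks, fixed-size char
messages, and a bounded iteration count in place of while(true).

/* Sketch of the request/reply pattern: ranks 1 and 3 send to ranks 0 and 2
 * and wait for an echo; ranks 0 and 2 receive from any source and reply.
 * Assumptions (not from the original program): 4 ranks, fixed message size,
 * bounded iterations. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define TAG   77
#define LEN   1024
#define ITERS 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size != 4) {                       /* this sketch assumes exactly 4 ranks */
        if (rank == 0) fprintf(stderr, "run with 4 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char *send_buf = malloc(LEN);
    char *recv_buf = malloc(LEN);

    if (rank == 1 || rank == 3) {                  /* "clients" */
        int servers[2] = {0, 2};
        for (int it = 0; it < ITERS; it++) {
            for (int s = 0; s < 2; s++) {
                MPI_Send(send_buf, LEN, MPI_CHAR, servers[s], TAG, MPI_COMM_WORLD);
                MPI_Recv(recv_buf, LEN, MPI_CHAR, servers[s], TAG, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
        }
    } else {                                       /* ranks 0 and 2: echo servers */
        /* each server answers ITERS requests from rank 1 and ITERS from rank 3 */
        for (int it = 0; it < 2 * ITERS; it++) {
            MPI_Status st;
            MPI_Recv(recv_buf, LEN, MPI_CHAR, MPI_ANY_SOURCE, TAG, MPI_COMM_WORLD, &st);
            MPI_Send(send_buf, LEN, MPI_CHAR, st.MPI_SOURCE, TAG, MPI_COMM_WORLD);
        }
    }

    free(send_buf);
    free(recv_buf);
    MPI_Finalize();
    return 0;
}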
Could I debug into the openib source code to find the root cause, with your
instructions or guidance?
2016-01-21 17:03 GMT+08:00 Eva :
Gilles,
>>Can you try to
>>mpirun --mca btl tcp,self --mca btl_tcp_eager_limit 56 ...
>>and confirm it works fine with TCP *and* without eager ?
I have tried this and it works.
So what should I do next?
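
As I understand the point of that test (my reading, not something stated
explicitly in this thread): below the eager limit, MPI_Send can complete by
copying the message into an eager buffer, while above it the send waits for
a matching receive. Code that implicitly depends on eager buffering
therefore only hangs once messages exceed the limit, which is why confirming
the behaviour with eager effectively disabled is informative. A minimal
sketch of such an unsafe pattern, not the original program:

/* Unsafe pattern: every rank sends before receiving. With eager delivery the
 * sends are buffered and return immediately; above the eager limit both
 * MPI_Send calls wait for a matching receive and the exchange deadlocks.
 * Run with an even number of ranks (e.g. 2); message size is illustrative. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int len = 1 << 22;                 /* 4 MiB, well above typical eager limits */
    char *buf = malloc(len);
    int peer = rank ^ 1;                     /* pair ranks 0<->1, 2<->3, ... */

    MPI_Send(buf, len, MPI_CHAR, peer, 0, MPI_COMM_WORLD);   /* both sides send first */
    MPI_Recv(buf, len, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}

Replacing the blocking pair with MPI_Sendrecv (or posting the receive first
on one side) removes the dependence on eager buffering.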
2016-01-21 16:25 GMT+08:00 Eva :
Thanks Gilles.
It works fine on TCP.
So I use this to disable eager RDMA:
-mca btl_openib_use_eager_rdma 0 -mca btl_openib_max_eager_rdma 0
2016-01-21 13:10 GMT+08:00 Eva :
> I run with two machines, 2 processes per node: process0, process1,
> process2, process3.
> After some random ...
...Bsend, but it still hangs.
The same program has worked fine on TCP for more than one year. After I
moved it onto RDMA, it started to hang, and I can't debug into any RDMA
details.
2016-01-21 11:24 GMT+08:00 Eva :
Running MPI_Send on Open MPI 1.8.5 without multithreading enabled:
it hangs in mca_pml_ob1_send() -> opal_progress() ->
btl_openib_component_progress() -> poll_device() -> libmlx4-rdmav2.so -> cq
-> pthread_spin_unlock.
The same program runs on TCP with no error.