Thanks for the reply, Ralph. It is now clearer to me why it could be so much slower: the blocking reduction algorithm apparently has an implementation that is very different from the non-blocking one.
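In the meantime, since, as you point out, the communication only progresses when I actually call into the library, I am considering splitting the local computation into slices and calling MPI_Test between them, so that the outstanding iallreduce has a chance to progress before I reach the wait. Something along these lines (just a sketch of the idea, not tested; the inner loop is a dummy stand-in for my real local computation, and the two-column ping-pong buffer is there so the buffer of a pending in-place reduction is never touched):

    ! Sketch only, not tested. The dummy inner loop stands in for one slice of
    ! my real local_computation(i); buf has two columns so that the buffer of
    ! the outstanding in-place iallreduce is never written while it is pending.
    program progress_sketch
       use mpi
       implicit none
       integer, parameter :: n = 1000, total_iter = 100, n_slices = 10
       real(4)            :: buf(n, 2)
       integer            :: i, s, cur, request, rank, ierror
       logical            :: flag

       call mpi_init(ierror)
       call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)
       buf     = 0.0
       request = MPI_REQUEST_NULL

       do i = 1, total_iter ! [
          cur = mod(i, 2) + 1                             ! buffer computed in this iteration

          do s = 1, n_slices ! [
             buf(:, cur) = buf(:, cur) + real(rank + s)   ! dummy slice of local work
             ! Poke the library so the outstanding iallreduce (on the other
             ! buffer) can make progress while I compute.
             call mpi_test(request, flag, MPI_STATUS_IGNORE, ierror)
          end do ! ]

          call mpi_wait(request, MPI_STATUS_IGNORE, ierror)  ! finish iteration i-1
          call mpi_iallreduce(MPI_IN_PLACE, buf(:, cur), n, MPI_REAL, MPI_SUM, &
                              MPI_COMM_WORLD, request, ierror)
       end do ! ]

       call mpi_wait(request, MPI_STATUS_IGNORE, ierror)     ! last outstanding reduction
       call mpi_finalize(ierror)
    end program progress_sketch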
Since there are lots of ways to implement the non-blocking reduction, are there options to tune its algorithm and parameters? Something like the ones we have for the blocking versions, for instance "coll_tuned_allreduce_algorithm", "coll_tuned_reduce_algorithm", etc. (see the P.S. at the end of this message for how I am setting those on the blocking side).

--
Felipe

2015-11-27 18:20 GMT-02:00 Ralph Castain <r...@open-mpi.org>:

> One thing you might want to keep in mind is that “non-blocking” doesn’t
> mean “asynchronous progress”. The API may not block, but the communications
> only progress whenever you actually call down into the library.
>
> So if you are calling a non-blocking collective, and then make additional
> calls into MPI only rarely, you should expect to see slower performance.
>
> We are working on providing async progress on all operations, but I don’t
> believe much (if any) of it is in the release branches so far.
>
> On Nov 27, 2015, at 11:37 AM, Felipe . <philip...@gmail.com> wrote:
>
> > Try and do a variable amount of work for every process, I see non-blocking
> > as a way to speed-up communication if they arrive individually to the call.
> > Please always have this at the back of your mind when doing this.
>
> I tried to simplify the problem in the explanation. The "local_computation"
> is variable among different processes, so there is load imbalance in the
> real problem. The microbenchmark was just a way to test the overhead, which
> was really much greater than expected.
>
> > Surely non-blocking has overhead, and if the communication time is low, so
> > will the overhead be much higher.
>
> Of course there is. But for my case, which is a real HPC application for
> seismic data processing, it was prohibitive and strangely high.
>
> > You haven't specified what nx*ny*nz is, and hence your "slower" and
> > "faster" makes "no sense"... And hence your questions are difficult to
> > answer, basically "it depends".
>
> In my tests, I used nx = 700, ny = 200, nz = 60, total_iter = 1000; val is
> a real(4) array. This is basically the same size as the real application.
> Since I used the same values for all tests, it is reasonable to analyze
> the results. What I meant with question 1 was: are overheads this high
> expected?
>
> The microbenchmark is attached to this e-mail.
>
> The detailed results were (using 11 nodes; times in seconds):
>
> Open MPI, blocking:
> ==================================
> [RESULT] Reduce time = 21.790411
> [RESULT] Total time = 24.977373
> ==================================
>
> Open MPI, non-blocking:
> ==================================
> [RESULT] Reduce time = 97.332792
> [RESULT] Total time = 100.470874
> ==================================
>
> Intel MPI, blocking:
> ==================================
> [RESULT] Reduce time = 17.587828
> [RESULT] Total time = 20.655875
> ==================================
>
> Intel MPI, non-blocking:
> ==================================
> [RESULT] Reduce time = 49.483195
> [RESULT] Total time = 52.642514
> ==================================
>
> Thanks in advance.
>
> 2015-11-27 14:57 GMT-02:00 Felipe . <philip...@gmail.com>:
>
>> Hello!
>>
>> I have a program that basically is (first implementation):
>>
>>     for i in N:
>>         local_computation(i)
>>         mpi_allreduce(in_place, i)
>>
>> In order to try to mitigate the implicit barrier of the mpi_allreduce, I
>> tried to start an mpi_Iallreduce instead, like this (second
>> implementation):
>>
>>     for i in N:
>>         local_computation(i)
>>         j = i
>>         if i is not first:
>>             mpi_wait(request)
>>         mpi_Iallreduce(in_place, j, request)
>>
>> The result was that the second was a lot worse.
>> The processes spent 3 times more time on the mpi_wait than on the
>> mpi_allreduce from the first implementation. I know it could be worse,
>> but not that much.
>>
>> So, I made a microbenchmark in Fortran to stress this. Here is the
>> implementation:
>>
>> Blocking:
>>
>>     do i = 1, total_iter ! [
>>        t_0 = mpi_wtime()
>>
>>        call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>>                           MPI_COMM_WORLD, ierror)
>>        if (ierror .ne. 0) then ! [
>>           write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>           call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>        end if ! ]
>>        t_reduce = t_reduce + (mpi_wtime() - t_0)
>>     end do ! ]
>>
>> Non-blocking:
>>
>>     do i = 1, total_iter ! [
>>        t_0 = mpi_wtime()
>>        call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>>                            MPI_COMM_WORLD, request, ierror)
>>        if (ierror .ne. 0) then ! [
>>           write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>           call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>        end if ! ]
>>        t_reduce = t_reduce + (mpi_wtime() - t_0)
>>
>>        t_0 = mpi_wtime()
>>        call mpi_wait(request, status, ierror)
>>        if (ierror .ne. 0) then ! [
>>           write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>           call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>        end if ! ]
>>        t_reduce = t_reduce + (mpi_wtime() - t_0)
>>     end do ! ]
>>
>> The non-blocking version was about five times slower. With Intel's MPI the
>> slowdown was about 3 times instead of 5.
>>
>> Question 1: Do you think that all this overhead makes sense?
>>
>> Question 2: Why is there so much overhead for non-blocking collective
>> calls?
>>
>> Question 3: Can I change the algorithm for the non-blocking allreduce to
>> improve this?
>>
>> Best regards,
>> --
>> Felipe
>
> <Teste_AllReduce.F90>
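P.S.: to make my question above concrete, this is the kind of tuning I mean on the blocking side. I am assuming the usual MCA parameters of the "tuned" coll component here; the parameter names and the value-to-algorithm numbering may differ between Open MPI releases, and "my_benchmark" is just a placeholder for the attached test:

    # List the tuning parameters exposed by the tuned coll component
    # (option spelling may vary with the Open MPI release)
    ompi_info --param coll tuned --level 9

    # Force a specific algorithm for the blocking allreduce
    # (the meaning of each index can be checked with ompi_info)
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm 4 \
           -np <nprocs> ./my_benchmark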