Try doing a variable amount of work on every process; I see non-blocking as a way to speed up communication when processes arrive at the call at different times. Please keep this at the back of your mind when benchmarking.
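For example, the pattern where MPI_Iallreduce can actually pay off looks roughly like this (a minimal, untested sketch; the dummy array "other" and the program name are just illustrations standing in for whatever independent work your application has between posting the reduction and needing its result):

    program overlap_demo
      use mpi
      implicit none
      integer, parameter :: n = 1024, total_iter = 100
      real :: val(n), other(n)
      integer :: request, ierror, i
      integer :: status(MPI_STATUS_SIZE)

      call mpi_init(ierror)
      val = 1.0
      other = 2.0

      do i = 1, total_iter
         ! Post the reduction on val...
         call mpi_iallreduce(MPI_IN_PLACE, val, n, MPI_REAL, MPI_SUM, &
                             MPI_COMM_WORLD, request, ierror)
         ! ...then do local work on data that is NOT part of the pending
         ! reduction while the library can make progress in the background.
         other = other * 1.0001
         ! Only block once the result in val is actually needed.
         call mpi_wait(request, status, ierror)
      end do

      call mpi_finalize(ierror)
    end program overlap_demo

If there is nothing useful to do between the mpi_iallreduce and the mpi_wait, you pay the extra setup cost of the non-blocking path and get nothing back.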
Non-blocking certainly has overhead, and when the communication time is short, the relative cost of that overhead is much higher. You haven't specified what nx*ny*nz is, so your "slower" and "faster" are hard to interpret, and your questions are difficult to answer beyond "it depends".

2015-11-27 17:57 GMT+01:00 Felipe . <philip...@gmail.com>:

> Hello!
>
> I have a program that basically is (first implementation):
>
>     for i in N:
>         local_computation(i)
>         mpi_allreduce(in_place, i)
>
> In order to mitigate the implicit barrier of the mpi_allreduce, I tried
> to start an mpi_Iallreduce instead, like this (second implementation):
>
>     for i in N:
>         local_computation(i)
>         j = i
>         if i is not first:
>             mpi_wait(request)
>         mpi_Iallreduce(in_place, j, request)
>
> The result was that the second was a lot worse: the processes spent 3
> times more time in mpi_wait than they had spent in the mpi_allreduce of
> the first implementation. I knew it could be worse, but not by that much.
>
> So I wrote a microbenchmark in Fortran to stress this. Here is the
> implementation:
>
> Blocking:
>
>     do i = 1, total_iter ! [
>         t_0 = mpi_wtime()
>         call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, &
>                            MPI_SUM, MPI_COMM_WORLD, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>     end do ! ]
>
> Non-blocking:
>
>     do i = 1, total_iter ! [
>         t_0 = mpi_wtime()
>         call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, &
>                             MPI_SUM, MPI_COMM_WORLD, request, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>
>         t_0 = mpi_wtime()
>         call mpi_wait(request, status, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>     end do ! ]
>
> The non-blocking version was about five times slower. With Intel's MPI
> the factor was 3 instead of 5.
>
> Question 1: Do you think all this overhead makes sense?
>
> Question 2: Why is there so much overhead for non-blocking collective
> calls?
>
> Question 3: Can I change the algorithm for the non-blocking allreduce
> to improve this?
>
> Best regards,
> --
> Felipe

--
Kind regards
Nick