Hello! I have a program whose structure is basically this (first implementation):

    for i in N:
        local_computation(i)
        mpi_allreduce(in_place, i)
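For reference, here is a minimal, self-contained Fortran sketch of that blocking pattern (the buffer val, its size, the iteration count, and the per-iteration work are just placeholders standing in for my real local_computation):

    program blocking_pattern
       use mpi
       implicit none
       integer, parameter :: buf_size = 1024   ! placeholder buffer size
       integer, parameter :: n_iter   = 100    ! placeholder iteration count
       integer :: i, ierror
       real    :: val(buf_size)

       call mpi_init(ierror)
       val = 0.0

       do i = 1, n_iter
          ! stand-in for the real local_computation(i)
          val = val + real(i)
          ! every rank contributes val and gets the global sum back in place
          call mpi_allreduce(MPI_IN_PLACE, val, buf_size, MPI_REAL, MPI_SUM, &
                             MPI_COMM_WORLD, ierror)
       end do

       call mpi_finalize(ierror)
    end program blocking_pattern

The real program does far more work per iteration, but the communication pattern is the same: one in-place allreduce per loop iteration.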
To try to mitigate the implicit barrier of the mpi_allreduce, I tried starting an mpi_Iallreduce instead, like this (second implementation):

    for i in N:
        local_computation(i)
        j = i
        if i is not first:
            mpi_wait(request)
        mpi_Iallreduce(in_place, j, request)

The result was that the second implementation was a lot worse: the processes spent about three times as long in mpi_wait as they had spent in the mpi_allreduce of the first implementation. I know it could be worse, but not by that much.

So I wrote a Fortran microbenchmark to stress this. Here is the implementation:

Blocking:

    do i = 1, total_iter
       t_0 = mpi_wtime()
       call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierror)
       if (ierror .ne. 0) then
          write(*,*) "Error in line ", __LINE__, " rank = ", rank
          call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
       end if
       t_reduce = t_reduce + (mpi_wtime() - t_0)
    end do

Non-blocking:

    do i = 1, total_iter
       t_0 = mpi_wtime()
       call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, request, ierror)
       if (ierror .ne. 0) then
          write(*,*) "Error in line ", __LINE__, " rank = ", rank
          call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
       end if
       t_reduce = t_reduce + (mpi_wtime() - t_0)

       t_0 = mpi_wtime()
       call mpi_wait(request, status, ierror)
       if (ierror .ne. 0) then
          write(*,*) "Error in line ", __LINE__, " rank = ", rank
          call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
       end if
       t_reduce = t_reduce + (mpi_wtime() - t_0)
    end do

The non-blocking version was about five times slower. I also tried Intel MPI, and there the slowdown was about 3x instead of 5x.

Question 1: Do you think all this overhead makes sense?
Question 2: Why is there so much overhead for non-blocking collective calls?
Question 3: Can I change the algorithm around the non-blocking allreduce to improve this?

Best regards,
--
Felipe