>Try to do a variable amount of work for every process; I see non-blocking
>as a way to speed up communication if the processes arrive at the call
>individually.
>Please always keep this at the back of your mind when doing this.

I tried to simplify the problem in the explanation. The "local_computation"
is variable among the different processes, so there is load imbalance in the
real problem.
The microbenchmark was just a way to measure the overhead, which turned out
to be much greater than expected.
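
Just to make the intended pattern concrete, in the real code I am aiming at
something like this (a minimal sketch only; the buffer names and the
placeholder computation are illustrative, not the actual application code):

program overlap_sketch
    use mpi
    implicit none
    integer, parameter :: nx = 700, ny = 200, nz = 60, total_iter = 1000
    real(4), allocatable, target :: buf_a(:,:,:), buf_b(:,:,:)
    real(4), pointer :: cur(:,:,:)
    integer :: i, request, ierror

    call mpi_init(ierror)
    allocate(buf_a(nx,ny,nz), buf_b(nx,ny,nz))
    request = MPI_REQUEST_NULL        ! mpi_wait on a null request returns immediately

    do i = 1, total_iter
        if (mod(i, 2) == 1) then      ! alternate buffers so the previous
            cur => buf_a              ! reduction can still be in flight
        else
            cur => buf_b
        end if
        cur = real(i)                 ! placeholder for the imbalanced local_computation(i)
        call mpi_wait(request, MPI_STATUS_IGNORE, ierror)  ! finish the previous reduction
        call mpi_iallreduce(MPI_IN_PLACE, cur, nx*ny*nz, MPI_REAL, MPI_SUM, &
                            MPI_COMM_WORLD, request, ierror)
    end do
    call mpi_wait(request, MPI_STATUS_IGNORE, ierror)      ! last outstanding reduction
    call mpi_finalize(ierror)
end program overlap_sketch

The double buffering is only there so the previous reduction can still be in
flight while the next local computation runs; the attached microbenchmark
deliberately drops the overlap and just measures the iallreduce + wait cost.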

>Surely non-blocking has overhead, and if the communication time is low,
>the relative overhead will be much higher.

Of course there is. But for my case, which is a real HPC application for
seismic data processing, it was prohibitive and strangely high.

>You haven't specified what nx*ny*nz is, and hence your "slower" and
>"faster" make "no sense"... And hence your questions are difficult to
>answer, basically "it depends".

In my tests I used nx = 700, ny = 200, nz = 60, total_iter = 1000; val is
a real(4) array. This is basically the same size as in the real application.
Since I used the same values for all tests, it is reasonable to compare the
results.
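That works out to roughly 33.6 MB of real(4) data reduced per iteration. For
reference, the buffer in the benchmark is set up along these lines (just a
sketch; the attached file is the authoritative version):

    integer, parameter :: nx = 700, ny = 200, nz = 60, total_iter = 1000
    real(4), allocatable :: val(:,:,:)
    allocate(val(nx, ny, nz))   ! 700*200*60 elements * 4 bytes ~ 33.6 MB per allreduce
    val = 1.0
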
What I meant with question 1 was: is overhead this high expected?

The microbenchmark is attached to this e-mail.

The detailed results were (using 11 nodes; times in seconds):

Open MPI + blocking:
 ==================================
 [RESULT] Reduce time =  21.790411
 [RESULT] Total  time =  24.977373
 ==================================

Open MPI + non-blocking:
 ==================================
 [RESULT] Reduce time =  97.332792
 [RESULT] Total  time = 100.470874
 ==================================

Intel MPI + blocking:
 ==================================
 [RESULT] Reduce time =  17.587828
 [RESULT] Total  time =  20.655875
 ==================================


Intel MPI + non-blocking:
 ==================================
 [RESULT] Reduce time =  49.483195
 [RESULT] Total  time =  52.642514
 ==================================
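
Per iteration (total_iter = 1000) that is roughly 22 ms vs. 97 ms per
reduction with Open MPI, and 18 ms vs. 49 ms with Intel MPI.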

Thanks in advance.

2015-11-27 14:57 GMT-02:00 Felipe . <philip...@gmail.com>:

> Hello!
>
> I have a program that basically is (first implementation):
> for i in N:
>   local_computation(i)
>   mpi_allreduce(in_place, i)
>
> In order to try to mitigate the implicit barrier of the mpi_allreduce, I
> tried to start an mpi_Iallreduce. Like this (second implementation):
> for i in N:
>   local_computation(i)
>   j = i
>   if i is not first:
>     mpi_wait(request)
>   mpi_Iallreduce(in_place, j, request)
>
> The result was that the second was a lot worse. The processes spent 3
> times more time in the mpi_wait than in the mpi_allreduce of the first
> implementation. I knew it could be worse, but not by that much.
>
> So, I made a microbenchmark to stress this, in Fortran. Here is the
> implementation:
> Blocking:
> do i = 1, total_iter ! [
>     t_0 = mpi_wtime()
>
>     call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>                        MPI_COMM_WORLD, ierror)
>     if (ierror .ne. 0) then ! [
>         write(*,*) "Error in line ", __LINE__, " rank = ", rank
>         call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>     end if ! ]
>     t_reduce = t_reduce + (mpi_wtime() - t_0)
> end do ! ]
>
> Non-Blocking:
> do i = 1, total_iter ! [
>     t_0 = mpi_wtime()
>     call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>                         MPI_COMM_WORLD, request, ierror)
>     if (ierror .ne. 0) then ! [
>         write(*,*) "Error in line ", __LINE__, " rank = ", rank
>         call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>     end if ! ]
>     t_reduce = t_reduce + (mpi_wtime() - t_0)
>
>     t_0 = mpi_wtime()
>     call mpi_wait(request, status, ierror)
>     if (ierror .ne. 0) then ! [
>         write(*,*) "Error in line ", __LINE__, " rank = ", rank
>         call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>     end if ! ]
>     t_reduce = t_reduce + (mpi_wtime() - t_0)
>
> end do ! ]
>
> The non-blocking version was about five times slower. I tried Intel's MPI
> and the slowdown was about 3 times instead of 5.
>
> Question 1: Do you think that all this overhead makes sense?
>
> Question 2: Why is there so much overhead for non-blocking collective
> calls?
>
> Question 3: Can I change the algorithm for the non-blocking allReduce to
> improve this?
>
>
> Best regards,
> --
> Felipe
>

Attachment: Teste_AllReduce.F90
Description: Binary data
