> Try and do a variable amount of work for every process, I see non-blocking
> as a way to speed-up communication if they arrive individually to the call.
> Please always have this at the back of your mind when doing this.
I tried to simplify the problem in the explanation. The "local_computation"
varies among the different processes, so there is load imbalance in the real
problem. The microbenchmark was just a way to measure the overhead, which was
much greater than expected.

> Surely non-blocking has overhead, and if the communication time is low, so
> will the overhead be much higher.

Of course there is. But in my case, which is a real HPC application for
seismic data processing, the overhead was prohibitive and strangely high.

> You haven't specified what nx*ny*nz is, and hence your "slower" and
> "faster" makes "no sense"... And hence your questions are difficult to
> answer, basically "it depends".

In my tests I used nx = 700, ny = 200, nz = 60 and total_iter = 1000; val is
a real(4) array. This is basically the same size as in the real application.
Since I used the same values for all tests, it is reasonable to compare the
results.

What I meant with question 1 was: are overheads this high expected? The
microbenchmark is attached to this e-mail (a self-contained sketch of the
pipelined pattern and of the benchmark loops is also included below the
quoted message). The detailed results, using 11 nodes, were:

Open MPI + blocking:
==================================
[RESULT] Reduce time = 21.790411
[RESULT] Total time = 24.977373
==================================

Open MPI + non-blocking:
==================================
[RESULT] Reduce time = 97.332792
[RESULT] Total time = 100.470874
==================================

Intel MPI + blocking:
==================================
[RESULT] Reduce time = 17.587828
[RESULT] Total time = 20.655875
==================================

Intel MPI + non-blocking:
==================================
[RESULT] Reduce time = 49.483195
[RESULT] Total time = 52.642514
==================================

Thanks in advance.

2015-11-27 14:57 GMT-02:00 Felipe . <philip...@gmail.com>:
> Hello!
>
> I have a program that basically is (first implementation):
>
>     for i in N:
>         local_computation(i)
>         mpi_allreduce(in_place, i)
>
> In order to try to mitigate the implicit barrier of the mpi_allreduce, I
> tried to start an mpi_Iallreduce, like this (second implementation):
>
>     for i in N:
>         local_computation(i)
>         j = i
>         if i is not first:
>             mpi_wait(request)
>         mpi_Iallreduce(in_place, j, request)
>
> The result was that the second was a lot worse. The processes spent 3
> times more time on the mpi_wait than on the mpi_allreduce from the first
> implementation. I know it could be worse, but not that much.
>
> So, I made a microbenchmark in Fortran to stress this. Here is the
> implementation:
>
> Blocking:
>
>     do i = 1, total_iter ! [
>         t_0 = mpi_wtime()
>         call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>                            MPI_COMM_WORLD, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>     end do ! ]
>
> Non-blocking:
>
>     do i = 1, total_iter ! [
>         t_0 = mpi_wtime()
>         call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>                             MPI_COMM_WORLD, request, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>
>         t_0 = mpi_wtime()
>         call mpi_wait(request, status, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>     end do ! ]
>
> The non-blocking was about five times slower.
> I tried Intel's MPI and there the slowdown was about 3 times instead of 5.
>
> Question 1: Do you think that all this overhead makes sense?
>
> Question 2: Why is there so much overhead for non-blocking collective
> calls?
>
> Question 3: Can I change the algorithm for the non-blocking allReduce to
> improve this?
>
> Best regards,
> --
> Felipe
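
For concreteness, here is a minimal, self-contained Fortran sketch of the
pipelined pattern described in the quoted message (the "second
implementation"): the reduction for one iteration is started with
mpi_iallreduce and only waited on while the next iteration's local work runs.
Everything beyond the quoted pseudocode (the program scaffolding, the double
buffering needed so an outstanding reduction's buffer is not overwritten, the
dummy local_computation, and names such as pipelined_allreduce) is an
assumption for illustration, not code from the original application.

program pipelined_allreduce
   use mpi
   implicit none

   integer, parameter :: nx = 700, ny = 200, nz = 60, total_iter = 1000
   real(4), allocatable :: val(:,:)      ! two buffers, val(:,1) and val(:,2)
   integer :: rank, ierror, i, cur, prev
   integer :: request(2)
   integer :: status(MPI_STATUS_SIZE)
   double precision :: t_0, t_reduce

   call mpi_init(ierror)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)

   allocate(val(nx*ny*nz, 2))            ! ~33.6 MB of real(4) per buffer
   val = 0.0_4
   request = MPI_REQUEST_NULL            ! makes the first waits no-ops,
                                         ! like "if i is not first" above
   t_reduce = 0.0d0

   do i = 1, total_iter
      cur  = mod(i, 2) + 1               ! buffer for this iteration
      prev = 3 - cur                     ! buffer of the previous iteration

      ! The buffer we are about to fill was last used two iterations ago;
      ! make sure that reduction has finished before overwriting it.
      call mpi_wait(request(cur), status, ierror)

      ! Stand-in for the real, load-imbalanced local computation.
      call local_computation(val(:, cur), rank, i)

      t_0 = mpi_wtime()
      ! Finish the reduction started in the previous iteration ...
      call mpi_wait(request(prev), status, ierror)
      ! ... and immediately start reducing this iteration's data.
      call mpi_iallreduce(MPI_IN_PLACE, val(:, cur), nx*ny*nz, MPI_REAL, &
                          MPI_SUM, MPI_COMM_WORLD, request(cur), ierror)
      t_reduce = t_reduce + (mpi_wtime() - t_0)
   end do

   ! Complete whatever is still in flight.
   call mpi_waitall(2, request, MPI_STATUSES_IGNORE, ierror)

   if (rank == 0) write(*,*) "[RESULT] Reduce time = ", t_reduce

   deallocate(val)
   call mpi_finalize(ierror)

contains

   subroutine local_computation(buf, rank, iter)
      real(4), intent(out) :: buf(:)
      integer, intent(in)  :: rank, iter
      ! Dummy work; the real application does variable-cost computation here.
      buf = real(mod(rank + iter, 7), 4)
   end subroutine local_computation

end program pipelined_allreduce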
Teste_AllReduce.F90
Description: Binary data
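
Since the attached file itself is not reproduced in the archive ("Binary
data" above), the following is a rough, self-contained harness in the spirit
of the quoted timing loops, switchable between the two variants. Only the
timed MPI calls mirror the quoted fragments; the scaffolding (initialisation,
the use_nonblocking flag, the [RESULT] reporting) and the program name are
assumptions, not the actual Teste_AllReduce.F90.

program allreduce_bench
   use mpi
   implicit none

   integer, parameter :: nx = 700, ny = 200, nz = 60, total_iter = 1000
   logical, parameter :: use_nonblocking = .true.   ! .false. = blocking variant
   real(4), allocatable :: val(:)
   integer :: rank, ierror, ierror2, i, request
   integer :: status(MPI_STATUS_SIZE)
   double precision :: t_0, t_start, t_reduce

   call mpi_init(ierror)
   call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)

   allocate(val(nx*ny*nz))     ! 700*200*60 real(4) elements, ~33.6 MB per rank
   val = 0.0_4
   t_reduce = 0.0d0
   t_start  = mpi_wtime()

   do i = 1, total_iter
      t_0 = mpi_wtime()
      if (use_nonblocking) then
         ! Non-blocking variant: start the reduction, then wait immediately,
         ! as in the quoted microbenchmark loop.
         call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
                             MPI_COMM_WORLD, request, ierror)
         if (ierror .ne. 0) call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
         call mpi_wait(request, status, ierror)
      else
         ! Blocking variant.
         call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
                            MPI_COMM_WORLD, ierror)
      end if
      if (ierror .ne. 0) call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
      t_reduce = t_reduce + (mpi_wtime() - t_0)
   end do

   if (rank == 0) then
      write(*,*) "=================================="
      write(*,*) "[RESULT] Reduce time = ", t_reduce
      write(*,*) "[RESULT] Total time = ", mpi_wtime() - t_start
      write(*,*) "=================================="
   end if

   deallocate(val)
   call mpi_finalize(ierror)
end program allreduce_bench

Compiled with an MPI wrapper (e.g. mpifort) and run across several nodes,
this gives the kind of blocking vs. non-blocking comparison shown in the
[RESULT] blocks above.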