Hello! I have a program whose structure is basically this (first implementation):

    for i in N:
        local_computation(i)
        mpi_allreduce(in_place, i)
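For reference, here is a minimal, self-contained Fortran sketch of that blocking pattern (the buffer val, its size, the iteration count, and the per-iteration work are just placeholders standing in for my real local_computation):

    program blocking_pattern
       use mpi
       implicit none
       integer, parameter :: buf_size = 1024   ! placeholder buffer size
       integer, parameter :: n_iter   = 100    ! placeholder iteration count
       integer :: i, ierror
       real    :: val(buf_size)

       call mpi_init(ierror)
       val = 0.0

       do i = 1, n_iter
          ! stand-in for the real local_computation(i)
          val = val + real(i)
          ! every rank contributes val and gets the global sum back in place
          call mpi_allreduce(MPI_IN_PLACE, val, buf_size, MPI_REAL, MPI_SUM, &
                             MPI_COMM_WORLD, ierror)
       end do

       call mpi_finalize(ierror)
    end program blocking_pattern

The real program does far more work per iteration, but the communication pattern is the same: one in-place allreduce per loop iteration.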
To try to mitigate the implicit barrier of the mpi_allreduce, I tried starting an mpi_Iallreduce instead, like this (second implementation):

    for i in N:
        local_computation(i)
        j = i
        if i is not first:
            mpi_wait(request)
        mpi_Iallreduce(in_place, j, request)

The result was that the second implementation was a lot worse: the processes spent about three times as long in mpi_wait as they had spent in the mpi_allreduce of the first implementation. I know it could be worse, but not by that much.

So I wrote a Fortran microbenchmark to stress this. Here is the implementation:

Blocking:

    do i = 1, total_iter
       t_0 = mpi_wtime()
       call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierror)
       if (ierror .ne. 0) then
          write(*,*) "Error in line ", __LINE__, " rank = ", rank
          call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
       end if
       t_reduce = t_reduce + (mpi_wtime() - t_0)
    end do

Non-blocking:

    do i = 1, total_iter
       t_0 = mpi_wtime()
       call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, request, ierror)
       if (ierror .ne. 0) then
          write(*,*) "Error in line ", __LINE__, " rank = ", rank
          call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
       end if
       t_reduce = t_reduce + (mpi_wtime() - t_0)

       t_0 = mpi_wtime()
       call mpi_wait(request, status, ierror)
       if (ierror .ne. 0) then
          write(*,*) "Error in line ", __LINE__, " rank = ", rank
          call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
       end if
       t_reduce = t_reduce + (mpi_wtime() - t_0)
    end do

The non-blocking version was about five times slower. I also tried Intel MPI, and there the slowdown was about 3x instead of 5x.

Question 1: Do you think all this overhead makes sense?
Question 2: Why is there so much overhead for non-blocking collective calls?
Question 3: Can I change the algorithm around the non-blocking allreduce to improve this?

Best regards,
--
Felipe