Try doing a variable amount of work on every process; I see non-blocking as a way to speed up communication when processes arrive at the call at different times. Please keep this at the back of your mind when benchmarking.
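For example, the pattern where MPI_Iallreduce can actually pay off looks roughly like this (a minimal, untested sketch; the dummy array "other" and the program name are just illustrations standing in for whatever independent work your application has between posting the reduction and needing its result):

    program overlap_demo
      use mpi
      implicit none
      integer, parameter :: n = 1024, total_iter = 100
      real :: val(n), other(n)
      integer :: request, ierror, i
      integer :: status(MPI_STATUS_SIZE)

      call mpi_init(ierror)
      val = 1.0
      other = 2.0

      do i = 1, total_iter
         ! Post the reduction on val...
         call mpi_iallreduce(MPI_IN_PLACE, val, n, MPI_REAL, MPI_SUM, &
                             MPI_COMM_WORLD, request, ierror)
         ! ...then do local work on data that is NOT part of the pending
         ! reduction while the library can make progress in the background.
         other = other * 1.0001
         ! Only block once the result in val is actually needed.
         call mpi_wait(request, status, ierror)
      end do

      call mpi_finalize(ierror)
    end program overlap_demo

If there is nothing useful to do between the mpi_iallreduce and the mpi_wait, you pay the extra setup cost of the non-blocking path and get nothing back.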
Non-blocking certainly has overhead, and when the communication time is short, the relative cost of that overhead is much higher. You haven't specified what nx*ny*nz is, so your "slower" and "faster" are hard to interpret, and your questions are difficult to answer beyond "it depends".

2015-11-27 17:57 GMT+01:00 Felipe . <philip...@gmail.com>:

> Hello!
>
> I have a program that basically is (first implementation):
>
>     for i in N:
>         local_computation(i)
>         mpi_allreduce(in_place, i)
>
> In order to mitigate the implicit barrier of the mpi_allreduce, I tried
> to start an mpi_Iallreduce instead, like this (second implementation):
>
>     for i in N:
>         local_computation(i)
>         j = i
>         if i is not first:
>             mpi_wait(request)
>         mpi_Iallreduce(in_place, j, request)
>
> The result was that the second was a lot worse: the processes spent 3
> times more time in mpi_wait than they had spent in the mpi_allreduce of
> the first implementation. I knew it could be worse, but not by that much.
>
> So I wrote a microbenchmark in Fortran to stress this. Here is the
> implementation:
>
> Blocking:
>
>     do i = 1, total_iter ! [
>         t_0 = mpi_wtime()
>         call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, &
>                            MPI_SUM, MPI_COMM_WORLD, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>     end do ! ]
>
> Non-blocking:
>
>     do i = 1, total_iter ! [
>         t_0 = mpi_wtime()
>         call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, &
>                             MPI_SUM, MPI_COMM_WORLD, request, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>
>         t_0 = mpi_wtime()
>         call mpi_wait(request, status, ierror)
>         if (ierror .ne. 0) then ! [
>             write(*,*) "Error in line ", __LINE__, " rank = ", rank
>             call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>         end if ! ]
>         t_reduce = t_reduce + (mpi_wtime() - t_0)
>     end do ! ]
>
> The non-blocking version was about five times slower. With Intel's MPI
> the factor was 3 instead of 5.
>
> Question 1: Do you think all this overhead makes sense?
>
> Question 2: Why is there so much overhead for non-blocking collective
> calls?
>
> Question 3: Can I change the algorithm for the non-blocking allreduce
> to improve this?
>
> Best regards,
> --
> Felipe

--
Kind regards
Nick