Thanks for the reply, Ralph. It is now clearer to me why it could be so much slower: the blocking reduction algorithm apparently has an implementation that is very different from the non-blocking one.
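In the meantime, since, as you point out, the communication only progresses when I actually call into the library, I am considering splitting the local computation into slices and calling MPI_Test between them, so that the outstanding iallreduce has a chance to progress before I reach the wait. Something along these lines (just a sketch of the idea, not tested; the inner loop is a dummy stand-in for my real local computation, and the two-column ping-pong buffer is there so the buffer of a pending in-place reduction is never touched):

    ! Sketch only, not tested. The dummy inner loop stands in for one slice of
    ! my real local_computation(i); buf has two columns so that the buffer of
    ! the outstanding in-place iallreduce is never written while it is pending.
    program progress_sketch
       use mpi
       implicit none
       integer, parameter :: n = 1000, total_iter = 100, n_slices = 10
       real(4)            :: buf(n, 2)
       integer            :: i, s, cur, request, rank, ierror
       logical            :: flag

       call mpi_init(ierror)
       call mpi_comm_rank(MPI_COMM_WORLD, rank, ierror)
       buf     = 0.0
       request = MPI_REQUEST_NULL

       do i = 1, total_iter ! [
          cur = mod(i, 2) + 1                             ! buffer computed in this iteration

          do s = 1, n_slices ! [
             buf(:, cur) = buf(:, cur) + real(rank + s)   ! dummy slice of local work
             ! Poke the library so the outstanding iallreduce (on the other
             ! buffer) can make progress while I compute.
             call mpi_test(request, flag, MPI_STATUS_IGNORE, ierror)
          end do ! ]

          call mpi_wait(request, MPI_STATUS_IGNORE, ierror)  ! finish iteration i-1
          call mpi_iallreduce(MPI_IN_PLACE, buf(:, cur), n, MPI_REAL, MPI_SUM, &
                              MPI_COMM_WORLD, request, ierror)
       end do ! ]

       call mpi_wait(request, MPI_STATUS_IGNORE, ierror)     ! last outstanding reduction
       call mpi_finalize(ierror)
    end program progress_sketch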
Since there are lots of ways to implement the non-blocking reduction, are there options to tune its algorithm and parameters? Something like the ones we have for the blocking versions, for instance "coll_tuned_allreduce_algorithm", "coll_tuned_reduce_algorithm", etc. (see the P.S. at the end of this message for how I am setting those on the blocking side).

--
Felipe

2015-11-27 18:20 GMT-02:00 Ralph Castain <r...@open-mpi.org>:

> One thing you might want to keep in mind is that “non-blocking” doesn’t
> mean “asynchronous progress”. The API may not block, but the communications
> only progress whenever you actually call down into the library.
>
> So if you are calling a non-blocking collective, and then make additional
> calls into MPI only rarely, you should expect to see slower performance.
>
> We are working on providing async progress on all operations, but I don’t
> believe much (if any) of it is in the release branches so far.
>
> On Nov 27, 2015, at 11:37 AM, Felipe . <philip...@gmail.com> wrote:
>
> > Try and do a variable amount of work for every process, I see non-blocking
> > as a way to speed-up communication if they arrive individually to the call.
> > Please always have this at the back of your mind when doing this.
>
> I tried to simplify the problem in the explanation. The "local_computation"
> is variable among different processes, so there is load imbalance in the
> real problem. The microbenchmark was just a way to test the overhead, which
> was really much greater than expected.
>
> > Surely non-blocking has overhead, and if the communication time is low, so
> > will the overhead be much higher.
>
> Of course there is. But for my case, which is a real HPC application for
> seismic data processing, it was prohibitive and strangely high.
>
> > You haven't specified what nx*ny*nz is, and hence your "slower" and
> > "faster" makes "no sense"... And hence your questions are difficult to
> > answer, basically "it depends".
>
> In my tests, I used nx = 700, ny = 200, nz = 60, total_iter = 1000; val is
> a real(4) array. This is basically the same size as the real application.
> Since I used the same values for all tests, it is reasonable to analyze
> the results. What I meant with question 1 was: are overheads this high
> expected?
>
> The microbenchmark is attached to this e-mail.
>
> The detailed results were (using 11 nodes; times in seconds):
>
> Open MPI, blocking:
> ==================================
> [RESULT] Reduce time = 21.790411
> [RESULT] Total time = 24.977373
> ==================================
>
> Open MPI, non-blocking:
> ==================================
> [RESULT] Reduce time = 97.332792
> [RESULT] Total time = 100.470874
> ==================================
>
> Intel MPI, blocking:
> ==================================
> [RESULT] Reduce time = 17.587828
> [RESULT] Total time = 20.655875
> ==================================
>
> Intel MPI, non-blocking:
> ==================================
> [RESULT] Reduce time = 49.483195
> [RESULT] Total time = 52.642514
> ==================================
>
> Thanks in advance.
>
> 2015-11-27 14:57 GMT-02:00 Felipe . <philip...@gmail.com>:
>
>> Hello!
>>
>> I have a program that basically is (first implementation):
>>
>>     for i in N:
>>         local_computation(i)
>>         mpi_allreduce(in_place, i)
>>
>> In order to try to mitigate the implicit barrier of the mpi_allreduce, I
>> tried to start an mpi_Iallreduce instead, like this (second
>> implementation):
>>
>>     for i in N:
>>         local_computation(i)
>>         j = i
>>         if i is not first:
>>             mpi_wait(request)
>>         mpi_Iallreduce(in_place, j, request)
>>
>> The result was that the second was a lot worse.
>> The processes spent 3 times more time on the mpi_wait than on the
>> mpi_allreduce from the first implementation. I know it could be worse,
>> but not that much.
>>
>> So, I made a microbenchmark in Fortran to stress this. Here is the
>> implementation:
>>
>> Blocking:
>>
>>     do i = 1, total_iter ! [
>>        t_0 = mpi_wtime()
>>
>>        call mpi_allreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>>                           MPI_COMM_WORLD, ierror)
>>        if (ierror .ne. 0) then ! [
>>           write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>           call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>        end if ! ]
>>        t_reduce = t_reduce + (mpi_wtime() - t_0)
>>     end do ! ]
>>
>> Non-blocking:
>>
>>     do i = 1, total_iter ! [
>>        t_0 = mpi_wtime()
>>        call mpi_iallreduce(MPI_IN_PLACE, val, nx*ny*nz, MPI_REAL, MPI_SUM, &
>>                            MPI_COMM_WORLD, request, ierror)
>>        if (ierror .ne. 0) then ! [
>>           write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>           call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>        end if ! ]
>>        t_reduce = t_reduce + (mpi_wtime() - t_0)
>>
>>        t_0 = mpi_wtime()
>>        call mpi_wait(request, status, ierror)
>>        if (ierror .ne. 0) then ! [
>>           write(*,*) "Error in line ", __LINE__, " rank = ", rank
>>           call mpi_abort(MPI_COMM_WORLD, ierror, ierror2)
>>        end if ! ]
>>        t_reduce = t_reduce + (mpi_wtime() - t_0)
>>     end do ! ]
>>
>> The non-blocking version was about five times slower. With Intel's MPI the
>> slowdown was about 3 times instead of 5.
>>
>> Question 1: Do you think that all this overhead makes sense?
>>
>> Question 2: Why is there so much overhead for non-blocking collective
>> calls?
>>
>> Question 3: Can I change the algorithm for the non-blocking allreduce to
>> improve this?
>>
>> Best regards,
>> --
>> Felipe
>
> <Teste_AllReduce.F90>
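P.S.: to make my question above concrete, this is the kind of tuning I mean on the blocking side. I am assuming the usual MCA parameters of the "tuned" coll component here; the parameter names and the value-to-algorithm numbering may differ between Open MPI releases, and "my_benchmark" is just a placeholder for the attached test:

    # List the tuning parameters exposed by the tuned coll component
    # (option spelling may vary with the Open MPI release)
    ompi_info --param coll tuned --level 9

    # Force a specific algorithm for the blocking allreduce
    # (the meaning of each index can be checked with ompi_info)
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm 4 \
           -np <nprocs> ./my_benchmark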