Thanks for the reply, Ralph.
Now I think it is clearer to me why it could be so much slower: the
blocking reduction algorithm has an implementation very different from
the non-blocking one.
Since there are lots of ways to implement it, are there options to tune
the non-blocking implementation?
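(For what it's worth, assuming this thread is about Open MPI: the blocking
collectives in the tuned component can be steered with MCA parameters, for
example

    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_allreduce_algorithm 4 ...

The non-blocking collectives are normally handled by the separate libnbc
component, so the coll_tuned knobs do not apply to MPI_Iallreduce.)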
One thing you might want to keep in mind is that “non-blocking” doesn’t mean
“asynchronous progress”. The API may not block, but the communications only
progress whenever you actually call down into the library.
So if you start a non-blocking collective and then make additional calls
into the library, the collective can progress; if you only compute and
never call into MPI, it may make little or no progress until you finally
wait on it.
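In the same pseudocode style as the example in the original question, that
means sprinkling test calls through the computation (a sketch; the "work"
chunks and the request handle are illustrative):

    mpi_iallreduce(in_place, data, request)
    for chunk in work:
        compute(chunk)
        mpi_test(request, done)   # each call lets the library progress the collective
    mpi_wait(request)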
>Try doing a variable amount of work on every process; I see non-blocking
>collectives as a way to speed up communication when the processes arrive
>at the call at different times.
>Please always have this at the back of your mind when doing this.
I tried to simplify the problem for the explanation; the actual
"local_computation" is more involved than the sketch suggests.
Try doing a variable amount of work on every process; I see non-blocking
collectives as a way to speed up communication when the processes arrive
at the call at different times.
Please always have this at the back of your mind when doing this.
Non-blocking certainly has overhead, and if the communication time is low,
so will the potential gain from overlapping it be.
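As a rough model, where T_comp and T_comm are the per-iteration computation
and communication times: with perfect overlap an iteration costs about

    max(T_comp, T_comm)     instead of     T_comp + T_comm

so the best possible saving is min(T_comp, T_comm), minus the extra cost of
the non-blocking machinery; when T_comm is small, the net effect can easily
be a slowdown.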
Hello!
I have a program that is basically this (first implementation):

    for i in N:
        local_computation(i)
        mpi_allreduce(in_place, i)
To mitigate the implicit barrier of the mpi_allreduce, I tried to start an
MPI_Iallreduce instead, like this (second implementation):

    for i in N:
        local_computation(i)
        mpi_iallreduce(in_place, i, request)
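A common way to structure this kind of overlap (a sketch; the request
handling and loop bounds here are illustrative, not necessarily what was
done in the program above) is to defer the wait so the reduction of step i
runs concurrently with the computation of step i+1:

    local_computation(0)
    for i in N:
        mpi_iallreduce(in_place, i, request)
        if i + 1 < N:
            local_computation(i + 1)   # overlaps with the reduction of step i
        mpi_wait(request)              # reduced value for step i is ready here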