Martin mentioned that you don't need to buffer/unbuffer if those array
sections are already contiguous in memory. Or, if they aren't
contiguous in memory but the "ghost" cells or gaps in memory aren't too
big, maybe it's worth including the gaps in the MPI_Allreduce call so
that the array sections effectively become contiguous. Or, you could
send one 1d "pencil" at a time (im elements) at the cost of having
more MPI calls (possibly a win if im is huge and jm*km is small). Or
construct derived datatypes and hope that the MPI implementation
(presumably OMPI) handles them well. Etc.

I think the short answer is that there are all sorts of things one
could try, but maybe one should first look at the 85% -- the allreduce
itself. Can you in any way estimate how long the collective operation
*should* take? Then, how is that step performing in comparison? What's
limiting performance -- a laggard process arriving late at the
collective? The bisection bandwidth of your cluster? The overhead of
the operation? A poor algorithm?

Greg Fischer wrote:
> It looks like the buffering operations consume about 15% as much time
> as the allreduce operations. Not huge, but not trivial, all the same.
> Is there any way to avoid the buffering step?
Thread: [OMPI users] best way to ALLREDUCE multi-dimensional arrays
(Greg Fischer, Eugene Loh, Martin Siegert)