Martin mentioned that you don't need to buffer/unbuffer if those array sections are already contiguous in memory.  Or, if they aren't contiguous but the "ghost" cells or gaps in memory aren't too big, it may be worth including the gaps in the MPI_Allreduce call so that the array sections effectively become contiguous.  Or, you could send one 1-d "pencil" (im elements) at a time, at the cost of more MPI calls (which can win if im is huge and jm*km is small).  Or construct derived datatypes and hope that the MPI implementation (presumably OMPI) handles them well.  And so on.

I think the short answer is that there are all sorts of things one could try, but maybe one should first look at the 85% -- the allreduce itself.  Can you in any way estimate how long the collective operation *should* take?  Then, how is that step performing in comparison?  What's limiting performance -- a laggard process arriving late at the collective?  The bisection bandwidth of your cluster?  The overhead of the operation?  A poor algorithm?
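That said, if you do want to eliminate the pack/unpack step, here is a rough sketch of the first and third options above.  It assumes (and only you can verify this) that array and phim are declared with extents exactly im, jm, km in their first three dimensions, so that the section being packed really is one contiguous block of reals; all of the names are taken from your snippet.

   ! Option A: the section is contiguous, so reduce straight from array into
   ! phim with no pack/unpack.  The pack loops move im*jm*km elements; use
   ! whichever count actually matches that (your call used im*jm*kmloc(coords(2)+1)).
   call mpi_allreduce(array(1,1,1,nl,0,ng),phim(1,1,1,nl,0,ng),im*jm*km,mpi_real,mpi_sum,ang_com,ierr)

   ! Option B: one contiguous 1-d "pencil" of im reals per call.  No packing,
   ! but jm*km times as many MPI calls; only attractive if im is large and
   ! jm*km is small.
   do ikl=1,km
     do ij=1,jm
       call mpi_allreduce(array(1,ij,ikl,nl,0,ng),phim(1,ij,ikl,nl,0,ng),im,mpi_real,mpi_sum,ang_com,ierr)
     enddo
   enddo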

Greg Fischer wrote:
It looks like the buffering operations consume about 15% as much time as the allreduce operations.  Not huge, but not trivial, all the same.  Is there any way to avoid the buffering step?

On Thu, Sep 24, 2009 at 6:03 PM, Eugene Loh <eugene....@sun.com> wrote:
Greg Fischer wrote:
(I apologize in advance for the simplistic/newbie question.)

I'm performing an ALLREDUCE operation on a multi-dimensional array.  This operation is the biggest bottleneck in the code, and I'm wondering if there's a way to do it more efficiently than what I'm doing now.  Here's a representative example of what's happening:

   ! step 1: pack the strided array section into a contiguous send buffer
   ir=1
   do ikl=1,km
     do ij=1,jm
       do ii=1,im
         albuf(ir)=array(ii,ij,ikl,nl,0,ng)
         ir=ir+1
       enddo
     enddo
   enddo
   ! step 2: sum the buffer across all ranks in ang_com
   agbuf=0.0
   call mpi_allreduce(albuf,agbuf,im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
   ! step 3: unpack the reduced result into phim
   ir=1
   do ikl=1,km
     do ij=1,jm
       do ii=1,im
         phim(ii,ij,ikl,nl,0,ng)=agbuf(ir)
         ir=ir+1
       enddo
     enddo
   enddo

Is there any way to just do this in one fell swoop, rather than buffering, transmitting, and unbuffering?  This operation is looped over many times.  Are there savings to be had here?
There are three steps here:  buffering, transmitting, and unbuffering.  Any idea how the run time is distributed among those three steps?  E.g., if most of the time is spent in the MPI call, then combining all three steps into one is unlikely to buy you much... and might even hurt; in that case, the thing to look at is probably tuning of the collective algorithm, though I don't have any experience doing that with OMPI.  I'm just saying it makes some sense to isolate the problem a little bit more.
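One simple way to get that breakdown is to wrap the three steps with mpi_wtime.  This is only a sketch: t0..t3 and the three accumulators are new, hypothetical double precision variables, and everything else is from your snippet.

   t0=mpi_wtime()
   ! ... existing pack loops filling albuf ...
   t1=mpi_wtime()
   call mpi_allreduce(albuf,agbuf,im*jm*kmloc(coords(2)+1),mpi_real,mpi_sum,ang_com,ierr)
   t2=mpi_wtime()
   ! ... existing unpack loops copying agbuf into phim ...
   t3=mpi_wtime()
   tpack=tpack+(t1-t0)
   treduce=treduce+(t2-t1)
   tunpack=tunpack+(t3-t2)

Print the per-rank totals (or reduce them with mpi_max) at the end of the run.  Inserting an mpi_barrier(ang_com,ierr) just before the timed allreduce, for measurement only, would further separate time spent waiting for a slow rank from time spent in the reduction itself.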
