On 9 Sep 2010, at 21:40, Richard Treumann wrote:

> 
> Ashley 
> 
> Can you provide an example of a situation in which these semantically 
> redundant barriers help? 

I'm not making the case for semantically redundant barriers; I'm making the
case for implicit synchronisation in every iteration of an application.  Many
applications have this already by nature of the data flow required: programs
that call MPI_Allgather or MPI_Allreduce are the easiest to verify, but there
are many other ways of achieving the same thing.  My point is about the subset
of programs which don't have this attribute and are therefore susceptible to
synchronisation problems.  It's my experience that for low iteration counts
these codes can run fine, but once they hit a problem they go over a cliff
edge performance-wise and there is no way back until the end of the job.
The email from Gabriele would appear to be a case that demonstrates this
problem, but I've seen it many times before.

Using your previous email as an example, I would describe adding barriers to a
program as a way of artificially reducing its "elasticity" to ensure balanced
use of resources.
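
To make the "elasticity" idea concrete, here is a toy model in plain Python.
It is not MPI code and not a measurement: it assumes one fast sender whose
eager sends never block, one slower receiver, and a hypothetical periodic
barrier. The function name, parameters and costs are all invented for
illustration. It reports the largest number of messages that have been sent
but not yet processed.

```python
from bisect import bisect_right


def peak_backlog(iterations, fast_cost, slow_cost, barrier_every=None):
    """Toy model: peak count of messages sent but not yet processed.

    Assumptions (illustrative only): the sender takes fast_cost time units
    per iteration and never blocks on a send; the receiver takes slow_cost
    per message; a barrier, if enabled, makes both wait for the slower one.
    """
    send_clock = 0
    recv_clock = 0
    send_times = []
    recv_times = []
    for i in range(iterations):
        # The sender finishes its iteration and sends without waiting.
        send_clock += fast_cost
        send_times.append(send_clock)
        # The receiver can only start on a message once it has arrived.
        recv_clock = max(recv_clock, send_clock) + slow_cost
        recv_times.append(recv_clock)
        # A barrier every N iterations pulls both clocks back together.
        if barrier_every and (i + 1) % barrier_every == 0:
            send_clock = recv_clock = max(send_clock, recv_clock)

    peak = 0
    for sent, now in enumerate(send_times, start=1):
        # recv_times is non-decreasing, so bisect counts completed receives.
        processed = bisect_right(recv_times, now)
        peak = max(peak, sent - processed)
    return peak


if __name__ == "__main__":
    print("no barrier:      ", peak_backlog(1000, 1, 2))
    print("barrier every 10:", peak_backlog(1000, 1, 2, barrier_every=10))
```

In this model the backlog without a barrier keeps growing with the iteration
count, whereas with a barrier it is bounded by the barrier interval. That is
the sense in which a barrier trades some idle time on the fast process for a
limit on how far it can run ahead.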

> I may be missing something but my statement for the text book would be 
> 
> "If adding a barrier to your MPI program makes it run faster, there is almost 
> certainly a flaw in it that is better solved another way." 
> 
> The only exception I can think of is some sort of one direction data 
> dependency with messages small enough to go eagerly.  A program that calls 
> MPI_Reduce with a small message and the same root every iteration and calls 
> no other collective would be an example. 
> 
> In that case, fast tasks at leaf positions would run free and a slow task 
> near the root could pile up early arrivals and end up with some additional 
> slowing. Unless it was driven into paging I cannot imagine the slowdown would 
> be significant though. 

I've diagnosed problems where the cause was a receive queue of tens of
thousands of messages.  In that case each and every receive performs slowly
unless its descriptor is near the front of the queue, so the concern is not
purely about memory usage at individual processes, although that can also be
a factor.
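
A rough way to see why queue position matters, again as a plain Python toy
rather than anything taken from a real MPI library: assume the queue is
searched linearly from the front (an assumption; actual implementations
differ) and count the comparisons needed to match every entry. The names
`match_cost` and `total_comparisons` are hypothetical.

```python
def match_cost(queue, tag):
    """Scan the queue from the front, remove the match, return comparisons."""
    for position, queued_tag in enumerate(queue):
        if queued_tag == tag:
            del queue[position]
            return position + 1
    raise LookupError(f"tag {tag} is not in the queue")


def total_comparisons(depth, newest_first):
    """Total comparisons to drain a queue holding tags 0..depth-1.

    newest_first=False matches from the front of the queue (cheap);
    newest_first=True always asks for the entry at the tail (expensive).
    """
    queue = list(range(depth))
    order = range(depth - 1, -1, -1) if newest_first else range(depth)
    return sum(match_cost(queue, tag) for tag in order)


if __name__ == "__main__":
    for depth in (10, 100, 1000):
        print(depth,
              total_comparisons(depth, newest_first=False),
              total_comparisons(depth, newest_first=True))
```

Under this assumption, matching from the front costs one comparison per
receive, while matching from the tail costs on the order of the queue depth
per receive, so the total grows quadratically with the depth of the queue.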

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

