Ashley's observation may apply to an application that iterates on many-to-one 
communication patterns. If the only collective used is MPI_Reduce, some 
non-root tasks can get ahead and keep pushing iteration results at tasks that 
are nearer the root. This could overload them and cause some extra slowdown.
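
To make that concrete, here is a minimal, hypothetical sketch of the kind of 
loop where this can happen; do_local_work is just a stand-in for the 
application's real per-iteration computation.

    #include <mpi.h>

    /* Stand-in for the application's per-iteration computation. */
    static double do_local_work(int iter) { return (double) iter; }

    int main(int argc, char **argv)
    {
        double local, global;

        MPI_Init(&argc, &argv);

        for (int iter = 0; iter < 1000; iter++) {
            local = do_local_work(iter);
            /* MPI_Reduce only obliges a non-root rank to hand off its
               contribution; it may return and race ahead to the next
               iteration while ranks nearer the root are still busy
               combining everyone else's results. */
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                       0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }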

In most parallel applications, there is some web of interdependency across 
tasks between iterations that keeps them roughly in step.  I find it hard 
to believe that there are many programs that need semantically redundant 
MPI_Barriers.

For example -

In a program that does neighbor communication, no task can get very far 
ahead of its neighbors.  It is possible for a task at one corner to be a 
few steps ahead of one at the opposite corner, but only a few steps. In 
that case, though, the distant task is not affected by the one that is out 
ahead anyway; it is only affected by its immediate neighbors.
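
A rough sketch of what I mean, assuming a simple 1-D chain in which each rank 
swaps one value with its left and right neighbors every iteration 
(MPI_PROC_NULL turns the exchanges at the two ends into no-ops):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double mine = 1.0, from_left = 0.0, from_right = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int iter = 0; iter < 1000; iter++) {
            /* A rank cannot finish an iteration until it has received this
               iteration's data from both neighbors, so adjacent ranks stay
               within about one step of each other. */
            MPI_Sendrecv(&mine, 1, MPI_DOUBLE, left,  0,
                         &from_right, 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&mine, 1, MPI_DOUBLE, right, 0,
                         &from_left, 1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            mine = 0.5 * (from_left + from_right);   /* stand-in update */
        }

        MPI_Finalize();
        return 0;
    }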

In a program that does an MPI_Bcast from root and an MPI_Reduce to root in 
each iteration, no task gets far ahead, because a task that finishes the 
Bcast early just waits longer at the Reduce.
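
Sketched out, with placeholder arithmetic standing in for the real work:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double params = 0.0, local, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int iter = 0; iter < 1000; iter++) {
            /* Root distributes this iteration's parameters; nobody can
               complete this call before the root has issued it. */
            MPI_Bcast(&params, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

            local = params + rank;          /* stand-in computation */

            /* Root collects the results.  A fast rank simply sits here
               (or at the next Bcast) until the others catch up, because
               the root cannot start the next Bcast until this Reduce has
               everyone's contribution. */
            MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                       0, MPI_COMM_WORLD);

            if (rank == 0)
                params = total;             /* root prepares the next step */
        }

        MPI_Finalize();
        return 0;
    }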

A program that makes a call to a non-rooted collective every iteration 
will stay in pretty tight synch.
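
For instance, if the only collective in the loop is a non-rooted 
MPI_Allreduce, no rank can leave the call until every rank has contributed, 
so the whole job stays within roughly one iteration of itself. A minimal 
sketch:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double local = 0.0, total;

        MPI_Init(&argc, &argv);

        for (int iter = 0; iter < 1000; iter++) {
            local += 1.0;                   /* stand-in computation */
            /* Every rank needs the full result, so every rank waits here
               for the slowest contributor each iteration. */
            MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }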

Think carefully before tossing in either MPI_Barrier or some non-blocking 
barrier.  Unless MPI_Bcast or MPI_Reduce is the only collective you call, 
your problem is likely not progress skew.


Dick Treumann  -  MPI Team 
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846         Fax (845) 433-8363




From:
Ashley Pittman <ash...@pittman.co.uk>
To:
Open MPI Users <us...@open-mpi.org>
Date:
09/09/2010 03:53 AM
Subject:
Re: [OMPI users] MPI_Reduce performance
Sent by:
users-boun...@open-mpi.org




On 9 Sep 2010, at 08:31, Terry Frankcombe wrote:

> On Thu, 2010-09-09 at 01:24 -0600, Ralph Castain wrote:
>> As people have said, these time values are to be expected. All they
>> reflect is the time difference spent in reduce waiting for the slowest
>> process to catch up to everyone else. The barrier removes that factor
>> by forcing all processes to start from the same place.
>> 
>> 
>> No mystery here - just a reflection of the fact that your processes
>> arrive at the MPI_Reduce calls at different times.
> 
> 
> Yes, however, it seems Gabriele is saying the total execution time
> *drops* by ~500 s when the barrier is put *in*.  (Is that the right way
> around, Gabriele?)
> 
> That's harder to explain as a sync issue.

Not really; you need some way of keeping processes in sync or else the 
slow ones get slower and the fast ones stay fast.  If you have an 
unbalanced algorithm then you can end up swamping certain ranks, and once 
they get behind they get even slower and performance goes off a cliff 
edge.

Adding sporadic barriers keeps everything in sync and running nicely; if 
things are performing well then the barrier only slows things down, but if 
there is a problem it'll bring all processes back together and destroy the 
positive feedback cycle.  This is why you often only need a 
synchronisation point every so often.  I'm also a huge fan of asynchronous 
barriers: a full sync is a blunt and slow operation, but using asynchronous 
barriers you can allow small differences in timing while preventing them 
from getting too large, with very little overhead in the common case where 
processes are synced already.  I'm thinking specifically of starting a 
sync-barrier on iteration N, waiting for it on iteration N+25 and 
immediately starting another one, again waiting for it 25 steps later.
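
A minimal sketch of that pattern, assuming the MPI-3 non-blocking barrier 
MPI_Ibarrier as the asynchronous barrier and an illustrative interval of 25 
iterations:

    #include <mpi.h>

    #define SYNC_INTERVAL 25   /* illustrative gap between sync points */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Request barrier_req = MPI_REQUEST_NULL;

        for (int iter = 0; iter < 1000; iter++) {
            /* ... per-iteration work and communication ... */

            if (iter % SYNC_INTERVAL == 0) {
                /* Wait for the barrier started one interval ago, then
                   immediately start another.  Ranks already in step pass
                   through the MPI_Wait almost for free; the spread between
                   fastest and slowest ranks stays bounded to a couple of
                   intervals at most. */
                if (barrier_req != MPI_REQUEST_NULL)
                    MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);
                MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
            }
        }

        if (barrier_req != MPI_REQUEST_NULL)
            MPI_Wait(&barrier_req, MPI_STATUS_IGNORE);

        MPI_Finalize();
        return 0;
    }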

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

