Hi Ashley,

I understand that the problem with descriptor flooding can be serious in an application with a unidirectional data dependency. Perhaps we have a different perception of how common that is.
It seems to me that such programs would be very rare, but if they are more common than I imagine, then discussion of how to modulate them is worthwhile. In many cases, I think that adding some flow control to the application is a better solution than a semantically redundant barrier. (A barrier that is there only to affect performance, not correctness, is what I mean by "semantically redundant".)

For example, a Master/Worker application could have each worker break after every 4th send to the master and post an MPI_Recv for an OK_to_continue token. If the token had already been sent, this would delay the worker a few microseconds. If it had not been sent, the worker would be kept waiting. The master would keep track of how many messages it had absorbed from each worker and, on message 3 from a particular worker, would send an OK_to_continue token to that worker. The master would keep sending OK_to_continue tokens every 4th receive from then on (7, 11, 15, ...). The descriptor queues would all remain short, and only a worker that the master could not keep up with would ever lose a chance to keep working. By sending the OK_to_continue token a bit early, the application would ensure that when there was no backlog, every worker would find a token waiting when it looked for one, so there would be no significant loss of compute time.

Even with a non-blocking barrier and a 10-step lag between the MPI_Ibarrier and the MPI_Wait, if some worker is slow for 12 steps, the fast workers end up being held in a non-productive MPI_Wait.
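Roughly, the shape I have in mind is the sketch below. It is untested and only meant to illustrate the token exchange; the tag values, the fixed iteration count, and the split into one master and N-1 workers are arbitrary choices for the sketch, and all error checking is omitted.

#include <mpi.h>
#include <stdlib.h>

#define WORK_TAG  1
#define TOKEN_TAG 2
#define WINDOW    4        /* a worker pauses after every 4th send         */
#define NSTEPS    1000     /* arbitrary iteration count for this sketch    */

static void worker(void)
{
    int result = 0, token;
    for (int step = 1; step <= NSTEPS; step++) {
        /* ... compute one step, producing result ... */
        MPI_Send(&result, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
        if (step % WINDOW == 0)   /* after sends 4, 8, 12, ... wait for a token */
            MPI_Recv(&token, 1, MPI_INT, 0, TOKEN_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}

static void master(int nworkers)
{
    int msg, token = 1;
    int *recvd = calloc(nworkers + 1, sizeof(int)); /* per-worker receive count */
    MPI_Status st;

    for (long total = 0; total < (long)nworkers * NSTEPS; total++) {
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, WORK_TAG,
                 MPI_COMM_WORLD, &st);
        recvd[st.MPI_SOURCE]++;
        /* Send the token one receive early (3, 7, 11, ...) so a worker the
         * master is keeping up with never actually blocks in its MPI_Recv. */
        if (recvd[st.MPI_SOURCE] % WINDOW == WINDOW - 1)
            MPI_Send(&token, 1, MPI_INT, st.MPI_SOURCE, TOKEN_TAG,
                     MPI_COMM_WORLD);
    }
    free(recvd);
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        master(size - 1);
    else
        worker();
    MPI_Finalize();
    return 0;
}

The point is just that the number of unreceived messages per worker stays bounded (at most WINDOW + 1 here) instead of growing with the imbalance.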
Dick

Dick Treumann - MPI Team
IBM Systems & Technology Group
Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601
Tele (845) 433-7846   Fax (845) 433-8363

users-boun...@open-mpi.org wrote on 09/09/2010 05:34:15 PM:

> Re: [OMPI users] MPI_Reduce performance
> Ashley Pittman to Open MPI Users, 09/09/2010 05:37 PM
>
> On 9 Sep 2010, at 21:40, Richard Treumann wrote:
>
> > Ashley
> >
> > Can you provide an example of a situation in which these semantically
> > redundant barriers help?
>
> I'm not making the case for semantically redundant barriers, I'm making
> a case for implicit synchronisation in every iteration of an
> application. Many applications have this already by nature of the
> data-flow required; anything that calls mpi_allgather or mpi_allreduce
> is the easiest to verify, but there are many other ways of achieving the
> same thing. My point is about the subset of programs which don't have
> this attribute and are therefore susceptible to synchronisation
> problems. It's my experience that for low iteration counts these codes
> can run fine, but once they hit a problem they go over a cliff edge
> performance-wise and there is no way back from there until the end of
> the job. The email from Gabriele would appear to be a case that
> demonstrates this problem, but I've seen it many times before.
>
> Using your previous email as an example, I would describe adding
> barriers to a program as a way of artificially reducing the "elasticity"
> of the program to ensure balanced use of resources.
>
> > I may be missing something but my statement for the text book would be
> >
> > "If adding a barrier to your MPI program makes it run faster, there is
> > almost certainly a flaw in it that is better solved another way."
> >
> > The only exception I can think of is some sort of one direction data
> > dependency with messages small enough to go eagerly. A program that
> > calls MPI_Reduce with a small message and the same root every
> > iteration and calls no other collective would be an example.
> >
> > In that case, fast tasks at leaf positions would run free and a slow
> > task near the root could pile up early arrivals and end up with some
> > additional slowing. Unless it was driven into paging I cannot imagine
> > the slowdown would be significant though.
>
> I've diagnosed problems where the cause was a receive queue of tens of
> thousands of messages; in this case each and every receive performs
> slowly unless the descriptor is near the front of the queue, so the
> concern is not purely about memory usage at individual processes,
> although that can also be a factor.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk