It would be very strange for nanosleep to cause a problem for Open MPI -- it 
shouldn't interfere with any of Open MPI's mechanisms.  Double check that your 
my_barrier() function is actually working properly -- removing the nanosleep() 
shouldn't affect the correctness of your barrier.  

If you've implemented your own barrier function, here are a few points:

1. If you want to re-implement the back-end of MPI_Barrier itself, it would 
likely be possible to wrap up your routine in an Open MPI plugin (remember that 
the back-ends of MPI_Barrier -- and the other collectives -- are driven by 
plugins; hence, you can actually replace the algorithms and whatnot that are 
used by MPI_Barrier without altering Open MPI's source code).  Let me know if 
you're interested in that.

2. MPI_Wait, as you surmised, is pretty much the same -- it aggressively polls, 
waiting for progress.  You *could* replace its behavior with a plugin, similar 
to MPI_Barrier, but it's a little harder (I can describe why, if you care).  

3. Your best bet might actually be to write a small profiling library that 
intercepts calls to MPI_Barrier and/or MPI_Wait and replaces them with 
non-aggressive versions.  E.g., your version of MPI_Wait can call MPI_Test, and 
if the request is not finished, call sleep() (or whatever).  Rinse, repeat.  
(See the sketch after this list.)

4. The mpi_yield_when_idle MCA parameter will simply call sched_yield() in 
OMPI's inner loops.  It'll still poll aggressively, but it'll call yield in the 
very core of those loops, thereby allowing other processes to pre-empt the MPI 
processes.  So it'll likely help your situation by allowing other processes to 
run, but the CPUs will still be pegged out at 100%.
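
As a concrete illustration of point 3, here is a minimal sketch of such an 
interception library using the standard PMPI profiling interface.  The file 
name and the 1 ms polling interval are my own arbitrary choices, not anything 
from this thread; build it as a shared library and link it (or LD_PRELOAD it) 
ahead of the MPI library:

    /* polite_wait.c -- illustrative sketch only.  Intercepts MPI_Wait via the
     * standard PMPI profiling layer and polls politely instead of spinning. */
    #include <mpi.h>
    #include <time.h>

    int MPI_Wait(MPI_Request *request, MPI_Status *status)
    {
        struct timespec ts = { 0, 1000000 };   /* sleep 1 ms between polls */
        int flag = 0;

        for (;;) {
            int rc = PMPI_Test(request, &flag, status);
            if (rc != MPI_SUCCESS || flag)
                return rc;                     /* completed, or an error */
            nanosleep(&ts, NULL);              /* give the CPU back instead of spinning */
        }
    }

An intercepted MPI_Barrier could follow the same pattern: implement the barrier 
itself with non-blocking point-to-point calls and the same test-and-sleep loop.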


On Dec 31, 2009, at 1:15 PM, Gijsbert Wiesenekker wrote:

> First of all, the reason that I have created a CPU-friendly version of 
> MPI_Barrier is that my program is asymmetric (so some of the nodes can easily 
> have to wait for several hours) and that it is I/O bound. My program uses MPI 
> mainly to synchronize I/O and to share some counters between the nodes, 
> followed by a gather/scatter of the files. MPI_Barrier (or any of the other 
> MPI calls) caused the four CPUs of my quad-core to continuously run at 100% 
> because of the aggressive polling, making the server almost unusable and also 
> slowing my program down because there was less CPU time available for I/O and 
> file synchronization. With this version of MPI_Barrier, CPU usage averages out 
> at about 25%. I only recently learned about the OMPI_MCA_mpi_yield_when_idle 
> variable; I still have to test whether that is an alternative to my workaround.
> Meanwhile I seem to have found the cause of the problem thanks to Ashley's 
> excellent padb tool. Following Eugene's recommendation, I have added the 
> MPI_Wait call: the same problem. Next I created a separate program that just 
> calls my_barrier repeatedly with randomized 1-2 second intervals. Again the 
> same problem (with 4 nodes), sometimes after a couple of iterations, 
> sometimes after 500, 1000 or 2000 iterations. Next I followed Ashley's 
> suggestion to use padb. I ran padb --all --mpi-queue and padb --all 
> --message-queue while the program was running fine and after the problem 
> occurred. When the problem occurred, padb said:
> 
> Warning, remote process state differs across ranks
> state : ranks
>     R : [2-3]
>     S : [0-1]
> 
> and
> 
> $ padb --all --stack-trace --tree
> Warning, remote process state differs across ranks
> state : ranks
>     R : [2-3]
>     S : [0-1]
> -----------------
> [0-1] (2 processes)
> -----------------
> main() at ?:?
>   barrier_util() at ?:?
>     my_sleep() at ?:?
>       __nanosleep_nocancel() at ?:?
> -----------------
> [2-3] (2 processes)
> -----------------
> ??() at ?:?
>   ??() at ?:?
>     ??() at ?:?
>       ??() at ?:?
>         ??() at ?:?
>           ompi_mpi_signed_char() at ?:?
>             ompi_request_default_wait_all() at ?:?
>               opal_progress() at ?:?
>                 -----------------
>                 2 (1 processes)
>                 -----------------
>                 mca_pml_ob1_progress() at ?:?
> 
> This suggests that, rather than OpenMPI being the problem, nanosleep is the 
> culprit because the call to it seems to hang.
> 
> Thanks for all the help.
> 
> Gijsbert
> 
> On Mon, Dec 14, 2009 at 8:22 PM, Ashley Pittman <ash...@pittman.co.uk> wrote:
> On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> > The following routine gives a problem after some (not reproducible)
> > time on Fedora Core 12. The routine is a CPU usage friendly version of
> > MPI_Barrier.
> 
> There are currently some proposals for non-blocking collectives before the 
> MPI Forum, and I believe there is a working implementation which can be used 
> as a plug-in for OpenMPI; I would urge you to look at these rather than try 
> to implement your own.
> 
> > My question is: is there a problem with this routine that I overlooked
> > that somehow did not show up until now
> 
> Your code both does all-to-all communication and uses probe; both of these 
> can easily be avoided when implementing a barrier.
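
For reference, here is a minimal sketch (not from the original mails; the 
function name is illustrative) of a barrier that avoids both all-to-all 
traffic and probe: the classic dissemination algorithm, which needs only 
ceil(log2(n)) rounds of paired zero-byte point-to-point exchanges:

    #include <mpi.h>

    /* Dissemination barrier: in round k, rank r signals rank (r + 2^k) mod n
     * and waits for a signal from rank (r - 2^k) mod n. */
    int my_dissemination_barrier(MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        for (int mask = 1; mask < size; mask <<= 1) {
            int to   = (rank + mask) % size;
            int from = (rank - mask + size) % size;
            MPI_Sendrecv(NULL, 0, MPI_BYTE, to,   0,
                         NULL, 0, MPI_BYTE, from, 0,
                         comm, MPI_STATUS_IGNORE);
        }
        return MPI_SUCCESS;
    }

A CPU-friendly variant would replace the MPI_Sendrecv with MPI_Isend/MPI_Irecv 
plus the same MPI_Test-and-sleep loop sketched earlier in this message.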
> 
> > Is there a way to see which messages have been sent/received/are
> > pending?
> 
> Yes, there is a message queue interface that allows tools to peek inside 
> the MPI library and see these queues.  As far as I know there are three 
> tools which use this: TotalView, DDT, and my own tool, padb.  TotalView and 
> DDT are both full-featured graphical debuggers and commercial products; padb 
> is an open-source, text-based tool.
> 
> Ashley,
> 
> --
> 
> Ashley Pittman, Bath, UK.
> 
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 


-- 
Jeff Squyres
jsquy...@cisco.com

