It would be very strange for nanosleep to cause a problem for Open MPI -- it shouldn't interfere with any of Open MPI's mechanisms. Double check that your my_barrier() function is actually working properly -- removing the nanosleep() shouldn't affect the correctness of your barrier.
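Just for reference, the kind of CPU-friendly barrier we're talking about usually looks something like the sketch below: empty nonblocking notifications to every peer, then a test-and-sleep loop. This is only an illustration of the general pattern, not your routine -- the function name, tag value, and 10 ms sleep interval are all made up:

  #include <mpi.h>
  #include <stdlib.h>
  #include <time.h>

  /* Sketch of a CPU-friendly barrier: every rank posts an empty
     nonblocking send/recv pair to every other rank, then polls with
     MPI_Testall and sleeps between polls instead of spinning. */
  static void my_barrier_sketch(MPI_Comm comm)
  {
      int rank, size, i, n = 0, done = 0;
      const int tag = 4242;                          /* arbitrary tag */
      struct timespec ts = { 0, 10 * 1000 * 1000 };  /* 10 ms between polls */
      MPI_Request *reqs;

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      if (size < 2) return;                          /* nothing to do */
      reqs = malloc(2 * (size - 1) * sizeof(MPI_Request));

      for (i = 0; i < size; ++i) {
          if (i == rank) continue;
          MPI_Irecv(NULL, 0, MPI_BYTE, i, tag, comm, &reqs[n++]);
          MPI_Isend(NULL, 0, MPI_BYTE, i, tag, comm, &reqs[n++]);
      }
      while (!done) {
          MPI_Testall(n, reqs, &done, MPI_STATUSES_IGNORE);
          if (!done)
              nanosleep(&ts, NULL);                  /* give up the CPU */
      }
      free(reqs);
  }

Removing the nanosleep() from a loop like that only changes how hard it spins, not whether the ranks synchronize.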
If you've implemented your own barrier function, here are a few points:

1. If you want to re-implement the back-end to MPI_Barrier itself, it would likely be possible to wrap up your routine in an Open MPI plugin (remember that the back-ends of MPI_Barrier -- and other collectives -- are driven by plugins; hence, you can actually replace the algorithms and whatnot that are used by MPI_Barrier without altering Open MPI's source code). Let me know if you're interested in that.

2. MPI_Wait, as you surmised, is pretty much the same -- it aggressively polls, waiting for progress. You *could* replace its behavior with a plugin, similar to MPI_Barrier, but it's a little harder (I can describe why, if you care).

3. Your best bet might actually be to write a small profiling library that intercepts calls to MPI_Barrier and/or MPI_Wait and replaces them with not-aggressive versions. E.g., your version of MPI_Wait can call MPI_Test, and if the request is not finished, call sleep() (or whatever). Rinse, repeat. (See the sketch after this list.)

4. The mpi_yield_when_idle MCA parameter will simply call sched_yield() in OMPI's inner loops. It'll still poll aggressively, but it'll call yield in the very core of those loops, thereby allowing other processes to pre-empt the MPI processes. So it'll likely help your situation by allowing other processes to run, but the CPUs will still be pegged at 100%.
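To make #3 a little more concrete, here's roughly what such an interposition library could look like. It's just an untested sketch of the standard PMPI profiling-interface trick -- the 1 ms sleep interval is arbitrary, and this is not code shipped with Open MPI:

  #include <mpi.h>
  #include <time.h>

  /* Intercept MPI_Wait via the PMPI profiling interface: poll the
     request with PMPI_Test and sleep between polls so the CPU isn't
     pegged at 100%.  Build this into a library and link it ahead of
     (or LD_PRELOAD it over) the MPI library. */
  int MPI_Wait(MPI_Request *request, MPI_Status *status)
  {
      struct timespec ts = { 0, 1 * 1000 * 1000 };   /* 1 ms */
      int flag = 0;

      while (1) {
          int rc = PMPI_Test(request, &flag, status);
          if (rc != MPI_SUCCESS || flag)
              return rc;
          nanosleep(&ts, NULL);                      /* be nice to the CPU */
      }
  }

MPI_Barrier can be wrapped the same way, e.g. by turning it into nonblocking notifications plus a PMPI_Testall/nanosleep loop.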
On Dec 31, 2009, at 1:15 PM, Gijsbert Wiesenekker wrote:

> First of all, the reason that I have created a CPU-friendly version of
> MPI_Barrier is that my program is asymmetric (so some of the nodes can easily
> have to wait for several hours) and that it is I/O bound. My program uses MPI
> mainly to synchronize I/O and to share some counters between the nodes,
> followed by a gather/scatter of the files. MPI_Barrier (or any of the other
> MPI calls) caused the four CPUs of my Quad Core to continuously run at 100%
> because of the aggressive polling, making the server almost unusable and also
> slowing my program down because there was less CPU time available for I/O and
> file synchronization. With this version of MPI_Barrier CPU usage averages out
> at about 25%. I only recently learned about the OMPI_MCA_mpi_yield_when_idle
> variable; I still have to test whether that is an alternative to my workaround.
> Meanwhile I seem to have found the cause of the problem thanks to Ashley's
> excellent padb tool. Following Eugene's recommendation, I added the
> MPI_Wait call: the same problem. Next I created a separate program that just
> calls my_barrier repeatedly with randomized 1-2 second intervals. Again the
> same problem (with 4 nodes), sometimes after a couple of iterations,
> sometimes after 500, 1000 or 2000 iterations. Next I followed Ashley's
> suggestion to use padb. I ran padb --all --mpi-queue and padb --all
> --message-queue while the program was running fine and after the problem
> occurred. When the problem occurred padb said:
>
> Warning, remote process state differs across ranks
> state : ranks
> R : [2-3]
> S : [0-1]
>
> and
>
> $ padb --all --stack-trace --tree
> Warning, remote process state differs across ranks
> state : ranks
> R : [2-3]
> S : [0-1]
> -----------------
> [0-1] (2 processes)
> -----------------
> main() at ?:?
> barrier_util() at ?:?
> my_sleep() at ?:?
> __nanosleep_nocancel() at ?:?
> -----------------
> [2-3] (2 processes)
> -----------------
> ??() at ?:?
> ??() at ?:?
> ??() at ?:?
> ??() at ?:?
> ??() at ?:?
> ompi_mpi_signed_char() at ?:?
> ompi_request_default_wait_all() at ?:?
> opal_progress() at ?:?
> -----------------
> 2 (1 processes)
> -----------------
> mca_pml_ob1_progress() at ?:?
>
> which suggests that rather than Open MPI being the problem, nanosleep is the
> culprit, because the call to it seems to hang.
>
> Thanks for all the help.
>
> Gijsbert
>
> On Mon, Dec 14, 2009 at 8:22 PM, Ashley Pittman <ash...@pittman.co.uk> wrote:
> On Sun, 2009-12-13 at 19:04 +0100, Gijsbert Wiesenekker wrote:
> > The following routine gives a problem after some (not reproducible)
> > time on Fedora Core 12. The routine is a CPU usage friendly version of
> > MPI_Barrier.
>
> There are some proposals for non-blocking collectives before the MPI
> forum currently, and I believe there is a working implementation which can
> be used as a plug-in for Open MPI; I would urge you to look at these rather
> than try to implement your own.
>
> > My question is: is there a problem with this routine that I overlooked
> > that somehow did not show up until now?
>
> Your code both does all-to-all communication and also uses probe, both
> of which can easily be avoided when implementing a barrier.
>
> > Is there a way to see which messages have been sent/received/are
> > pending?
>
> Yes, there is a message queue interface allowing tools to peek inside
> the MPI library and see these queues. As far as I know, there are three
> tools which use this: TotalView, DDT, and my own tool, padb. TotalView
> and DDT are both full-featured graphical debuggers and commercial
> products; padb is an open-source text-based tool.
>
> Ashley,
>
> --
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk

--
Jeff Squyres
jsquy...@cisco.com
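A footnote on Ashley's remark that the all-to-all traffic and the probe can be avoided when implementing a barrier: a dissemination barrier is one standard way to do it in roughly log2(P) rounds of point-to-point messages. The sketch below is illustrative only (the function name and tag are made up, and it is not Open MPI's algorithm); a CPU-friendly variant would replace the blocking MPI_Sendrecv with Isend/Irecv plus a test-and-sleep loop like the one earlier in the thread:

  #include <mpi.h>

  /* Dissemination barrier sketch: in each round, rank r signals rank
     (r + dist) % P and waits for a signal from (r - dist + P) % P,
     doubling dist each round.  No all-to-all traffic, no MPI_Probe. */
  void dissemination_barrier(MPI_Comm comm)
  {
      int rank, size, dist;
      const int tag = 54321;   /* arbitrary tag */

      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);

      for (dist = 1; dist < size; dist *= 2) {
          int to   = (rank + dist) % size;
          int from = (rank - dist + size) % size;
          MPI_Sendrecv(NULL, 0, MPI_BYTE, to,   tag,
                       NULL, 0, MPI_BYTE, from, tag,
                       comm, MPI_STATUS_IGNORE);
      }
  }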