However, there were some experiments with going into blocking mode, at least when only TCP was used. Unfortunately, this breaks some other things in Open MPI because of our progression model. We are component-based, and these components are allowed to register callbacks that are called periodically ... and here "periodically" means as often as possible. At least two components use this mechanism for their own progression: romio (mca/io/romio) and one-sided communications (mca/osc/*). Switching to blocking mode would break these two components completely. This is the reason why we're not blocking when only TCP is used.
Anyway, there is a solution. We have to move from poll-based progress for these components to event-based progress. There were some discussions, and if I remember correctly ... everybody's waiting for one of my patches :) A patch that allows a component to add a completion callback to MPI requests ... I don't have a clear deadline for this, and unfortunately I'm a little busy right now ... but I'll work on it asap.
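As a rough illustration of the difference between the two models, here is a minimal sketch in C. Every name in it (register_progress, progress_once, request_t, request_complete, and the example functions) is hypothetical and chosen for this illustration only; none of it is Open MPI's actual API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLBACKS 8

/* Poll-based model: a component registers a function that the progress
 * engine calls on every pass. If the engine blocks, these stop running. */
typedef int (*progress_cb_t)(void);

static progress_cb_t callbacks[MAX_CALLBACKS];
static int num_callbacks = 0;

static int register_progress(progress_cb_t cb)
{
    assert(cb != NULL);
    if (num_callbacks == MAX_CALLBACKS) {
        return -1;                      /* table full */
    }
    callbacks[num_callbacks++] = cb;
    return 0;
}

/* One pass of the (hypothetical) progress engine: call every registered
 * callback and return the number of events they report. */
static int progress_once(void)
{
    int events = 0;
    for (int i = 0; i < num_callbacks; i++) {
        events += callbacks[i]();
    }
    return events;
}

/* Event-based model: the request itself carries a completion callback,
 * so the component is notified exactly when something happens. */
typedef struct request {
    int complete;
    void (*on_complete)(struct request *req, void *ctx);
    void *ctx;
} request_t;

static void request_complete(request_t *req)
{
    req->complete = 1;
    if (req->on_complete != NULL) {
        req->on_complete(req, req->ctx);
    }
}

/* Example component code used by demo() below. */
static int polls_seen = 0;

static int example_progress(void)
{
    polls_seen++;                       /* runs only while someone polls */
    return 0;
}

static void example_on_complete(request_t *req, void *ctx)
{
    (void)req;
    *(int *)ctx = 1;                    /* runs once, at completion */
}

/* Returns 0 if both models behaved as described. */
static int demo(void)
{
    int done = 0;
    request_t req = { 0, example_on_complete, &done };

    if (register_progress(example_progress) != 0) return -1;
    progress_once();
    progress_once();
    request_complete(&req);
    return (polls_seen == 2 && req.complete == 1 && done == 1) ? 0 : -1;
}
```

In the first model the component makes progress only as long as progress_once() keeps being called, which is why a blocking wait starves it. In the second, nothing has to be polled: the callback fires when the request completes.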
  george.

On Apr 24, 2008, at 9:43 AM, Barry Rountree wrote:
> On Thu, Apr 24, 2008 at 12:56:03PM +0200, Ingo Josopait wrote:
> > I am using one of the nodes as a desktop computer. Therefore it is
> > most important for me that the mpi program is not so greedily
> > acquiring cputime.
>
> This is a kernel scheduling issue, not an OpenMPI issue. Busy waiting
> in one process should not cause noticable loss of responsiveness in
> another processes. Have you experimented with the "nice" command?
>
> > But I would imagine that the energy consumption is generally a big
> > issue, since energy is a major cost factor in a computer cluster.
>
> Yup.
>
> > When a cpu is idle, it uses considerably less energy. Last time I
> > checked my computer used 180W when both cpu cores were working and
> > 110W when both cores were idle.
>
> What processor is this?
>
> > I just made a small hack to solve the problem. I inserted a simple
> > sleep call into the function 'opal_condition_wait':
> >
> > --- orig/openmpi-1.2.6/opal/threads/condition.h
> > +++ openmpi-1.2.6/opal/threads/condition.h
> > @@ -78,6 +78,7 @@
> >  #endif
> >      } else {
> >          while (c->c_signaled == 0) {
> > +            usleep(1000);
> >              opal_progress();
> >          }
> >      }
>
> I expect this would lead to increased execution time for all programs
> and increased energy consumption for most programs. Recall that energy
> is power multiplied by time. You're reducing the power on some nodes
> and increasing time on all nodes.
>
> > The usleep call will let the program sleep for about 4 ms (it won't
> > sleep for a shorter time because of some timer granularity). But
> > that is good enough for me. The cpu usage is (almost) zero when the
> > tasks are waiting for one another.
>
> I think your mistake here is considering CPU load to be a useful
> metric. It isn't. Responsiveness is a useful metric, energy is a
> useful metric, but CPU load isn't a reliable guide to either of these.
>
> > For a proper implementation you would want to actively poll without
> > a sleep call for a few milliseconds, and then use some other method
> > that sleeps not for a fixed time, but until new messages arrive.
>
> Well, it sounds like you can get to this before I can. Post your patch
> here and I'll test it on the NAS suite, UMT2K, Paradis, and a few
> synthetic benchmarks I've written. The cluster I use has multimeters
> hooked up so I can also let you know how much energy is being saved.
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
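The two-phase wait described in the quoted thread (poll actively for a few milliseconds, then sleep until new messages arrive instead of for a fixed time) can be sketched for a single file descriptor as follows. This is a standalone POSIX illustration, not a patch to opal_condition_wait; wait_readable, now_ms and demo are hypothetical names, and the 5 ms spin budget in demo is arbitrary.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Wait until fd becomes readable.
 * Phase 1: busy-poll with a zero timeout for up to spin_ms milliseconds,
 *          so short waits keep their low latency.
 * Phase 2: block inside poll() with no timeout, so the CPU is idle
 *          until data actually arrive.
 * Returns 1 if readiness was seen while spinning, 2 if the call had to
 * block, and -1 on error. */
static int wait_readable(int fd, int spin_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    long long deadline;
    int rc;

    assert(spin_ms >= 0);
    deadline = now_ms() + spin_ms;

    do {
        rc = poll(&pfd, 1, 0);          /* returns immediately */
        if (rc > 0) return 1;
        if (rc < 0 && errno != EINTR) return -1;
    } while (now_ms() < deadline);

    do {
        rc = poll(&pfd, 1, -1);         /* sleeps until an event */
    } while (rc < 0 && errno == EINTR);

    return rc > 0 ? 2 : -1;
}

/* Usage example: a pipe that already holds one byte is detected during
 * the spinning phase, so this returns 1. */
static int demo(void)
{
    int fds[2];
    int how;

    if (pipe(fds) != 0) return -1;
    if (write(fds[1], "x", 1) != 1) return -1;
    how = wait_readable(fds[0], 5);
    close(fds[0]);
    close(fds[1]);
    return how;
}
```

As explained at the top of this message, blocking like this is only safe when everything that needs progress is visible through the descriptor being waited on. Components that rely on being polled would stall during phase 2.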