I am using one of the nodes as a desktop computer, so what matters most to me is that the MPI program does not acquire CPU time so greedily. But I would imagine that energy consumption is a big issue in general, since energy is a major cost factor in a computer cluster. An idle CPU uses considerably less energy: last time I checked, my computer drew 180 W with both CPU cores working and 110 W with both cores idle.
I just made a small hack to solve the problem. I inserted a simple sleep call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
 #endif
     } else {
         while (c->c_signaled == 0) {
+            usleep(1000);
             opal_progress();
         }
     }

The usleep call lets the program sleep for about 4 ms (it won't sleep for a shorter time because of the timer granularity), but that is good enough for me: CPU usage is (almost) zero while the tasks are waiting for one another. For a proper implementation you would want to poll actively, without a sleep call, for a few milliseconds, and then switch to some method that sleeps not for a fixed time but until new messages arrive (a rough sketch of this follows below).

Barry Rountree wrote:
> On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:
>> I can think of several advantages that using blocking or signals to
>> reduce the CPU load would have:
>>
>> - Reduced energy consumption
>
> Not necessarily. Any time the program ends up running longer, the
> cluster is up and running (and wasting electricity) for that amount of
> time. In the case where lots of tiny messages are being sent, you could
> easily end up using more energy.
>
>> - Running additional background programs could be done far more efficiently
>
> It's usually more efficient -- especially in terms of cache -- to batch
> up programs to run one after the other instead of running them
> simultaneously.
>
>> - It would be much simpler to examine the load balance.
>
> This is true, but it's still pretty trivial to measure load imbalance.
> MPI allows you to write a wrapper library that intercepts any MPI_*
> call. You can instrument the code however you like, then call PMPI_*,
> catch the return value, finish your instrumentation, and return
> control to your program. Here's some pseudocode:
>
> int MPI_Barrier(MPI_Comm comm){
>     gettimeofday(&start, NULL);
>     rc = PMPI_Barrier(comm);
>     gettimeofday(&stop, NULL);
>     fprintf(logfile, "Barrier on node %d took %lf seconds\n",
>             rank, delta(&stop, &start));
>     return rc;
> }
>
> I've got some code that does this for all of the MPI calls in Open MPI
> (ah, the joys of writing C code using Python scripts). Let me know if
> you'd find it useful.
>
>> It may depend on the type of program and the computational environment,
>> but there are certainly many cases in which putting the system in idle
>> mode would be advantageous. This is especially true for programs with
>> low network traffic and/or high load imbalances.
>
> <grin> I could use a few more benchmarks like that. Seriously, if
> you're mostly concerned about saving energy, a quick hack is to set a
> timer as soon as you enter an MPI call (say for 100 ms), and if the timer
> goes off while you're still in the call, use DVS to drop your CPU
> frequency to the lowest value it has. Then, when you exit the MPI call,
> pop it back up to the highest frequency. This can save a significant
> amount of energy, but even here there can be a performance penalty. For
> example, UMT2K schleps around very large messages, and you really need
> to be running as fast as possible during the MPI_Waitall calls or the
> program will slow down by 1% or so (thus using more energy).
>
> Doing this just for Barriers and Allreduces seems to speed up the
> program a tiny bit, but I haven't done enough runs to make sure this
> isn't an artifact.
>
> (This is my dissertation topic, so before asking any question, be advised
> that I WILL talk your ear off.)
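As a concrete illustration of the poll-then-sleep idea suggested above, a waiting loop could busy-poll for a few milliseconds and only then start sleeping between polls. This is only a rough sketch of the general technique, not Open MPI's actual progress loop: the names wait_for_condition, make_progress and condition_signaled, as well as the 5 ms threshold, are made up for the example, and it still wakes up periodically instead of truly blocking until a message arrives.

    #include <time.h>
    #include <unistd.h>

    /* Rough sketch: busy-poll first, then back off to short sleeps.
     * 'condition_signaled' and 'make_progress' are hypothetical
     * stand-ins for c->c_signaled and opal_progress(). */
    static void wait_for_condition(volatile int *condition_signaled,
                                   void (*make_progress)(void))
    {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        while (*condition_signaled == 0) {
            make_progress();

            clock_gettime(CLOCK_MONOTONIC, &now);
            double elapsed = (now.tv_sec - start.tv_sec)
                           + (now.tv_nsec - start.tv_nsec) / 1e9;

            /* Poll flat out for the first few milliseconds to keep
             * latency low, then sleep between polls so an idle rank
             * stops burning CPU. */
            if (elapsed > 0.005) {
                usleep(1000);
            }
        }
    }

A real implementation would replace the usleep with something that blocks until the network or shared-memory queue actually has data, which is exactly the part that is hard to do for shared memory.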
>
>> The "spin for a while and then block" method that you mentioned earlier
>> seems to be a good compromise. Just do polling for some time that is
>> long compared to the corresponding system call, and then go to sleep if
>> nothing happens. In this way, the latency would be only marginally
>> increased, while less CPU time is wasted in the polling loops, and I
>> would be much happier.
>>
>
> I'm interested in seeing what this does for energy savings. Are you
> volunteering to test a patch? (I've got four other papers I need to
> get finished up, so it'll be a few weeks before I start coding.)
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>
>>
>> Jeff Squyres wrote:
>>> On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
>>>
>>>> Do you really mean that Open MPI uses a busy loop in order to handle
>>>> incoming calls? That seems incorrect to me, since spinning is a very
>>>> bad and inefficient technique for this purpose.
>>>
>>> It depends on what you're optimizing for. :-) We're optimizing for
>>> minimum message-passing latency on hosts that are not oversubscribed;
>>> polling is very good at that. Polling is much better than blocking,
>>> particularly if the blocking involves a system call (which will be
>>> "slow"). Note that in a compute-heavy environment, the nodes are
>>> going to be running at 100% CPU anyway.
>>>
>>> Also keep in mind that you're only going to have "waste" spinning in
>>> MPI if you have a loosely/poorly synchronized application. Granted,
>>> some applications are this way by nature, but we have not chosen to
>>> optimize spare CPU cycles for them. As I said in a prior mail, adding
>>> a blocking strategy is on the to-do list, but it's fairly low in
>>> priority right now. Someone may care enough to improve the
>>> message-passing engine to include blocking, but it hasn't happened
>>> yet. Want to work on it? :-)
>>>
>>> And for reference: almost all MPIs do busy polling to minimize
>>> latency. Some of them will shift to blocking if nothing happens for a
>>> "long" time. This second piece is what OMPI is lacking.
>>>
>>>> Why don't you use blocking and/or signals instead of that?
>>>
>>> FWIW: I mentioned this in my other mail -- latency is quite definitely
>>> negatively impacted when you use such mechanisms. Blocking and
>>> signals are "slow" (in comparison to polling).
>>>
>>>> I think the priority of this task is very high because polling
>>>> just wastes resources of the system.
>>>
>>> In production HPC environments, the entire resource is dedicated to
>>> the MPI app anyway, so there's nothing else that really needs it. So
>>> we allow them to busy-spin.
>>>
>>> There is a mode to call yield() in the middle of every OMPI progress
>>> loop, but it's only helpful for loosely/poorly synchronized MPI apps
>>> and ones that use TCP or shared memory. Low-latency networks such as
>>> IB or Myrinet won't be as friendly to this setting because they're
>>> busy polling (i.e., they call yield() much less frequently, if at all).
>>>
>>>> On the other hand, what Alberto claims is not reasonable to me.
>>>>
>>>> Alberto,
>>>> - Are you oversubscribing one node, i.e. running your code on a
>>>> single-processor machine while pretending to have four CPUs?
>>>>
>>>> - Did you compile Open MPI yourself or install it from an RPM?
>>>>
>>>> A receiving process shouldn't be that expensive.
>>>>
>>>> Regards,
>>>>
>>>> Danesh
>>>>
>>>>
>>>> Jeff Squyres wrote:
>>>>> Because on-node communication typically uses shared memory, we
>>>>> currently have to poll. Additionally, when using mixed on/off-node
>>>>> communication, we have to alternate between polling shared memory and
>>>>> polling the network.
>>>>>
>>>>> Additionally, we actively poll because it's the best way to lower
>>>>> latency. MPI implementations are almost always judged first on their
>>>>> latency, not [usually] their CPU utilization. Going to sleep in a
>>>>> blocking system call will definitely hurt latency.
>>>>>
>>>>> We have plans for implementing the "spin for a while and then block"
>>>>> technique (as has been used in other MPIs and middleware layers), but
>>>>> it hasn't been a high priority.
>>>>>
>>>>>
>>>>> On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
>>>>>
>>>>>> Thanks Torje. I wonder what the benefit is of looping on the incoming
>>>>>> message-queue socket rather than using blocking system I/O, like
>>>>>> read() or select().
>>>>>>
>>>>>> On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:
>>>>>>
>>>>>>> Hi Alberto,
>>>>>>>
>>>>>>> The blocked processes are in fact spin-waiting. While they don't have
>>>>>>> anything better to do (waiting for that message), they will check
>>>>>>> their incoming message queues in a loop.
>>>>>>>
>>>>>>> So the MPI_Recv() operation is blocking, but that doesn't mean the
>>>>>>> processes are blocked by the OS scheduler.
>>>>>>>
>>>>>>> I hope that made some sense :)
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Torje
>>>>>>>
>>>>>>>
>>>>>>> On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:
>>>>>>>
>>>>>>>> I have a simple MPI program that sends data to processor rank 0. The
>>>>>>>> communication works well, but when I run the program on more than 2
>>>>>>>> processors (-np 4), the extra receivers waiting for data run at > 90%
>>>>>>>> CPU load. I understand MPI_Recv() is a blocking operation, but why
>>>>>>>> does it consume so much CPU compared to a regular system read()?
>>>>>>>>
>>>>>>>>
>>>>>>>> #include <sys/types.h>
>>>>>>>> #include <unistd.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> void process_sender(int);
>>>>>>>> void process_receiver(int);
>>>>>>>>
>>>>>>>>
>>>>>>>> int main(int argc, char* argv[])
>>>>>>>> {
>>>>>>>>   int rank;
>>>>>>>>
>>>>>>>>   MPI_Init(&argc, &argv);
>>>>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>
>>>>>>>>   printf("Processor %d (%d) initialized\n", rank, getpid());
>>>>>>>>
>>>>>>>>   if( rank == 1 )
>>>>>>>>     process_sender(rank);
>>>>>>>>   else
>>>>>>>>     process_receiver(rank);
>>>>>>>>
>>>>>>>>   MPI_Finalize();
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> void process_sender(int rank)
>>>>>>>> {
>>>>>>>>   int i, j, size;
>>>>>>>>   float data[100];
>>>>>>>>   MPI_Status status;
>>>>>>>>
>>>>>>>>   printf("Processor %d initializing data...\n", rank);
>>>>>>>>   for( i = 0; i < 100; ++i )
>>>>>>>>     data[i] = i;
>>>>>>>>
>>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>
>>>>>>>>   printf("Processor %d sending data...\n", rank);
>>>>>>>>   MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
>>>>>>>>   printf("Processor %d sent data\n", rank);
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> void process_receiver(int rank)
>>>>>>>> {
>>>>>>>>   int count;
>>>>>>>>   float value[200];
>>>>>>>>   MPI_Status status;
>>>>>>>>
>>>>>>>>   printf("Processor %d waiting for data...\n", rank);
>>>>>>>>   MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
>>>>>>>>     MPI_COMM_WORLD, &status);
>>>>>>>>   printf("Processor %d Got data from processor %d\n", rank,
>>>>>>>>     status.MPI_SOURCE);
>>>>>>>>   MPI_Get_count(&status, MPI_FLOAT, &count);
>>>>>>>>   printf("Processor %d, Got %d elements\n", rank, count);
>>>>>>>> }
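One user-level way to get the idle receivers in the program above off the CPU, without touching Open MPI itself, is to post a nonblocking receive and test it in a loop with a short sleep. This is only a sketch under the assumption that a few milliseconds of extra latency are acceptable; it keeps the original buffer size and tag, and the function name process_receiver_lowcpu is made up for the example:

    #include <unistd.h>
    #include <stdio.h>
    #include <mpi.h>

    /* Sketch of the receiver using MPI_Irecv/MPI_Test instead of a
     * blocking MPI_Recv, so the process sleeps between polls instead
     * of spinning at 100% CPU. */
    void process_receiver_lowcpu(int rank)
    {
        int count, flag = 0;
        float value[200];
        MPI_Request request;
        MPI_Status status;

        printf("Processor %d waiting for data...\n", rank);
        MPI_Irecv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
                  MPI_COMM_WORLD, &request);

        while (!flag) {
            MPI_Test(&request, &flag, &status);   /* also drives progress */
            if (!flag)
                usleep(1000);                     /* adds a little latency */
        }

        printf("Processor %d got data from processor %d\n", rank,
               status.MPI_SOURCE);
        MPI_Get_count(&status, MPI_FLOAT, &count);
        printf("Processor %d, got %d elements\n", rank, count);
    }

The trade-off is the one discussed throughout the thread: every sleep adds a bit of latency to message arrival, which is fine for a desktop node but not for latency benchmarks.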
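Also, the yield mode that Jeff mentions can be enabled at run time through an MCA parameter (mpi_yield_when_idle, if I remember the name correctly in the 1.2 series). It makes the progress loop call yield() so that other runnable processes get the CPU, but the MPI ranks will still show 100% CPU when nothing else wants to run. The executable name below is just a placeholder:

    mpirun --mca mpi_yield_when_idle 1 -np 4 ./my_program

As far as I know, Open MPI also switches to this "degraded" mode automatically when it detects oversubscription, i.e. when more processes are started on a node than it has slots.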