On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:

> I can think of several advantages that using blocking or signals to
> reduce the cpu load would have:
>
> - Reduced energy consumption

Not necessarily.  Any time the program ends up running longer, the
cluster is up and running (and wasting electricity) for that amount of
time.  In the case where lots of tiny messages are being sent, you
could easily end up using more energy.

> - Running additional background programs could be done far more
>   efficiently

It's usually more efficient -- especially in terms of cache -- to batch
up programs and run them one after the other instead of running them
simultaneously.

> - It would be much simpler to examine the load balance.

This is true, but it's still pretty trivial to measure load imbalance.
MPI allows you to write a wrapper library that intercepts any MPI_*
call.  You can instrument the code however you like, then call the
matching PMPI_* routine, catch its return value, finish your
instrumentation, and return control to your program.  Here's some
pseudocode:

    int MPI_Barrier(MPI_Comm comm)
    {
        gettimeofday(&start, NULL);
        rc = PMPI_Barrier(comm);
        gettimeofday(&stop, NULL);
        fprintf(logfile, "Barrier on node %d took %lf seconds\n",
                rank, delta(&stop, &start));
        return rc;
    }

I've got some code that does this for all of the MPI calls in Open MPI
(ah, the joys of writing C code using Python scripts).  Let me know if
you'd find it useful.
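For reference, a self-contained version of that wrapper could look
something like the sketch below.  The log file name, the elapsed()
helper, and wrapping MPI_Init to open the log are placeholder choices
for illustration, not anything specific to the generated code:

    /* Build this into its own library and link it ahead of the MPI
     * library (or LD_PRELOAD a shared build of it).  The profiling
     * interface guarantees the PMPI_* entry points exist. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <mpi.h>

    static FILE *logfile = NULL;

    /* Seconds between two gettimeofday() samples. */
    static double elapsed(struct timeval *stop, struct timeval *start)
    {
        return (stop->tv_sec - start->tv_sec) +
               (stop->tv_usec - start->tv_usec) / 1.0e6;
    }

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);
        logfile = fopen("mpi_profile.log", "a");  /* placeholder name */
        return rc;
    }

    int MPI_Barrier(MPI_Comm comm)
    {
        struct timeval start, stop;
        int rc, rank;

        PMPI_Comm_rank(comm, &rank);
        gettimeofday(&start, NULL);
        rc = PMPI_Barrier(comm);
        gettimeofday(&stop, NULL);
        if (logfile != NULL)
            fprintf(logfile, "Barrier on rank %d took %lf seconds\n",
                    rank, elapsed(&stop, &start));
        return rc;
    }

(In a real run you'd want one log file per rank, or at least the rank
in the file name, so the output doesn't get interleaved.)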
> It may depend on the type of program and the computational
> environment, but there are certainly many cases in which putting the
> system in idle mode would be advantageous. This is especially true
> for programs with low network traffic and/or high load imbalances.

<grin>  I could use a few more benchmarks like that.

Seriously, if you're mostly concerned about saving energy, a quick hack
is to set a timer as soon as you enter an MPI call (say for 100ms), and
if the timer goes off while you're still in the call, use DVS to drop
your CPU frequency to the lowest value it has.  Then, when you exit the
MPI call, pop it back up to the highest frequency.  This can save a
significant amount of energy, but even here there can be a performance
penalty.  For example, UMT2K schleps around very large messages, and
you really need to be running as fast as possible during the
MPI_Waitall calls or the program will slow down by 1% or so (thus using
more energy).  Doing this just for Barriers and Allreduces seems to
speed up the program a tiny bit, but I haven't done enough runs to make
sure this isn't an artifact.  (This is my dissertation topic, so before
asking any questions be advised that I WILL talk your ear off.)
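On Linux the hack only takes a few lines if the "userspace" cpufreq
governor is available, since the frequency can then be set by writing
to sysfs.  The sysfs path, the two frequency values, and the fact that
only MPI_Barrier is wrapped are all simplifications for illustration:

    /* Sketch of the timer-plus-DVS trick, wrapped around MPI_Barrier.
     * Assumes the Linux "userspace" cpufreq governor and the usual
     * sysfs layout; paths, frequencies, and error handling are all
     * placeholders. */
    #include <fcntl.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <mpi.h>

    #define SETSPEED "/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed"
    #define FREQ_LOW  "1000000"   /* kHz: slowest step on this machine */
    #define FREQ_HIGH "2400000"   /* kHz: fastest step on this machine */

    static int speed_fd = -1;

    static void set_freq(const char *khz)
    {
        if (speed_fd >= 0)
            pwrite(speed_fd, khz, strlen(khz), 0);  /* signal-safe */
    }

    static void on_timer(int sig)
    {
        (void)sig;
        set_freq(FREQ_LOW);  /* still inside MPI after 100ms: slow down */
    }

    int MPI_Init(int *argc, char ***argv)
    {
        int rc = PMPI_Init(argc, argv);
        speed_fd = open(SETSPEED, O_WRONLY);
        signal(SIGALRM, on_timer);
        return rc;
    }

    int MPI_Barrier(MPI_Comm comm)
    {
        struct itimerval arm = { {0, 0}, {0, 100000} };  /* 100ms one-shot */
        struct itimerval off = { {0, 0}, {0, 0} };
        int rc;

        setitimer(ITIMER_REAL, &arm, NULL);
        rc = PMPI_Barrier(comm);
        setitimer(ITIMER_REAL, &off, NULL);  /* cancel the timer */
        set_freq(FREQ_HIGH);                 /* back to full speed */
        return rc;
    }

The SIGALRM can interrupt system calls inside the MPI library, so a
real version has to cope with EINTR (and set the frequency on every
core, not just cpu0).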
> The "spin for a while and then block" method that you mentioned
> earlier seems to be a good compromise. Just do polling for some time
> that is long compared to the corresponding system call, and then go
> to sleep if nothing happens. In this way, the latency would be only
> marginally increased, while less cpu time is wasted in the polling
> loops, and I would be much happier.

I'm interested in seeing what this does for energy savings.  Are you
volunteering to test a patch?  (I've got four other papers I need to
get finished up, so it'll be a few weeks before I start coding.)
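For what it's worth, the shape of the thing is roughly the sketch
below: poll the fast path for a bounded amount of time, then fall back
to blocking on a file descriptor.  check_queues(), progress_fd, and the
1ms spin budget are stand-ins for illustration -- this is not Open
MPI's actual progress engine:

    #include <poll.h>
    #include <stdbool.h>
    #include <sys/time.h>

    extern bool check_queues(void);  /* poll shared memory / NIC queues */
    extern int  progress_fd;         /* becomes readable on new events  */

    static double now_sec(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1.0e6;
    }

    void wait_for_message(void)
    {
        const double spin_budget = 0.001;  /* 1ms of pure polling */
        double start = now_sec();

        /* Phase 1: spin.  Lowest latency, burns a core. */
        while (now_sec() - start < spin_budget)
            if (check_queues())
                return;

        /* Phase 2: block.  Frees the CPU, but waking up costs a
         * system call plus a trip through the scheduler. */
        for (;;) {
            struct pollfd pfd = { progress_fd, POLLIN, 0 };
            poll(&pfd, 1, -1);             /* sleep until an event */
            if (check_queues())
                return;
        }
    }

The blocking half only works if the transport can raise an event on a
descriptor (TCP can; shared memory and the low-latency NICs need extra
help), which is part of why this hasn't been done yet.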
Barry Rountree
Ph.D. Candidate, Computer Science
University of Georgia


> Jeff Squyres wrote:
> > On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
> >
> >> Do you really mean that Open MPI uses a busy loop in order to
> >> handle incoming calls? That seems incorrect, since spinning is a
> >> very bad and inefficient technique for this purpose.
> >
> > It depends on what you're optimizing for.  :-)  We're optimizing
> > for minimum message passing latency on hosts that are not
> > oversubscribed; polling is very good at that.  Polling is much
> > better than blocking, particularly if the blocking involves a
> > system call (which will be "slow").  Note that in a compute-heavy
> > environment, the nodes are going to be running at 100% CPU anyway.
> >
> > Also keep in mind that you're only going to have "waste" spinning
> > in MPI if you have a loosely/poorly synchronized application.
> > Granted, some applications are this way by nature, but we have not
> > chosen to optimize spare CPU cycles for them.  As I said in a prior
> > mail, adding a blocking strategy is on the to-do list, but it's
> > fairly low in priority right now.  Someone may care enough to
> > improve the message passing engine to include blocking, but it
> > hasn't happened yet.  Want to work on it?  :-)
> >
> > And for reference: almost all MPIs do busy polling to minimize
> > latency.  Some of them will shift to blocking if nothing happens
> > for a "long" time.  This second piece is what OMPI is lacking.
> >
> >> Why don't you use blocking and/or signals instead of that?
> >
> > FWIW: I mentioned this in my other mail -- latency is quite
> > definitely negatively impacted when you use such mechanisms.
> > Blocking and signals are "slow" (in comparison to polling).
> >
> >> I think the priority of this task is very high because polling
> >> just wastes resources of the system.
> >
> > In production HPC environments, the entire resource is dedicated to
> > the MPI app anyway, so there's nothing else that really needs it.
> > So we allow them to busy-spin.
> >
> > There is a mode to call yield() in the middle of every OMPI
> > progress loop, but it's only helpful for loosely/poorly
> > synchronized MPI apps and ones that use TCP or shared memory.  Low
> > latency networks such as IB or Myrinet won't be as friendly to this
> > setting because they're busy polling (i.e., they call yield() much
> > less frequently, if at all).
> >
> >> On the other hand, what Alberto reports does not seem reasonable
> >> to me.
> >>
> >> Alberto,
> >> - Are you oversubscribing one node, i.e. running your code on a
> >>   single-processor machine while pretending to have four CPUs?
> >>
> >> - Did you compile Open MPI yourself or install it from an RPM?
> >>
> >> Receiving a message shouldn't be that expensive.
> >>
> >> Regards,
> >>
> >> Danesh
> >>
> >> Jeff Squyres wrote:
> >>> Because on-node communication typically uses shared memory, so we
> >>> currently have to poll.  Additionally, when using mixed
> >>> on/off-node communication, we have to alternate between polling
> >>> shared memory and polling the network.
> >>>
> >>> Additionally, we actively poll because it's the best way to lower
> >>> latency.  MPI implementations are almost always first judged on
> >>> their latency, not [usually] their CPU utilization.  Going to
> >>> sleep in a blocking system call will definitely negatively impact
> >>> latency.
> >>>
> >>> We have plans for implementing the "spin for a while and then
> >>> block" technique (as has been used in other MPIs and middleware
> >>> layers), but it hasn't been a high priority.
> >>>
> >>> On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
> >>>
> >>>> Thanks, Torje.  I wonder what the benefit is of looping on the
> >>>> incoming message-queue socket rather than using blocking system
> >>>> I/O like read() or select().
> >>>>
> >>>> On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:
> >>>>
> >>>>> Hi Alberto,
> >>>>>
> >>>>> The blocked processes are in fact spin-waiting.  While they
> >>>>> don't have anything better to do (waiting for that message),
> >>>>> they will check their incoming message queues in a loop.
> >>>>>
> >>>>> So the MPI_Recv() operation is blocking, but that doesn't mean
> >>>>> the processes are blocked by the OS scheduler.
> >>>>>
> >>>>> I hope that made some sense :)
> >>>>>
> >>>>> Best regards,
> >>>>>
> >>>>> Torje
> >>>>>
> >>>>> On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:
> >>>>>
> >>>>>> I have a simple MPI program that sends data to processor rank
> >>>>>> 0.  The communication works well, but when I run the program
> >>>>>> on more than 2 processors (-np 4) the extra receivers waiting
> >>>>>> for data run at over 90% CPU load.  I understand MPI_Recv() is
> >>>>>> a blocking operation, but why does it consume so much CPU
> >>>>>> compared to a regular system read()?
> >>>>>>
> >>>>>> #include <sys/types.h>
> >>>>>> #include <unistd.h>
> >>>>>> #include <stdio.h>
> >>>>>> #include <stdlib.h>
> >>>>>> #include <mpi.h>
> >>>>>>
> >>>>>> void process_sender(int);
> >>>>>> void process_receiver(int);
> >>>>>>
> >>>>>> int main(int argc, char* argv[])
> >>>>>> {
> >>>>>>   int rank;
> >>>>>>
> >>>>>>   MPI_Init(&argc, &argv);
> >>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>>>
> >>>>>>   printf("Processor %d (%d) initialized\n", rank, getpid());
> >>>>>>
> >>>>>>   if( rank == 1 )
> >>>>>>     process_sender(rank);
> >>>>>>   else
> >>>>>>     process_receiver(rank);
> >>>>>>
> >>>>>>   MPI_Finalize();
> >>>>>> }
> >>>>>>
> >>>>>> void process_sender(int rank)
> >>>>>> {
> >>>>>>   int i, j, size;
> >>>>>>   float data[100];
> >>>>>>   MPI_Status status;
> >>>>>>
> >>>>>>   printf("Processor %d initializing data...\n", rank);
> >>>>>>   for( i = 0; i < 100; ++i )
> >>>>>>     data[i] = i;
> >>>>>>
> >>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>>>>
> >>>>>>   printf("Processor %d sending data...\n", rank);
> >>>>>>   MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
> >>>>>>   printf("Processor %d sent data\n", rank);
> >>>>>> }
> >>>>>>
> >>>>>> void process_receiver(int rank)
> >>>>>> {
> >>>>>>   int count;
> >>>>>>   float value[200];
> >>>>>>   MPI_Status status;
> >>>>>>
> >>>>>>   printf("Processor %d waiting for data...\n", rank);
> >>>>>>   MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
> >>>>>>            MPI_COMM_WORLD, &status);
> >>>>>>   printf("Processor %d Got data from processor %d\n", rank,
> >>>>>>          status.MPI_SOURCE);
> >>>>>>   MPI_Get_count(&status, MPI_FLOAT, &count);
> >>>>>>   printf("Processor %d, Got %d elements\n", rank, count);
> >>>>>> }
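P.S.  Alberto: if that node really is oversubscribed, telling Open MPI
to yield the processor while it's idle should bring those CPU numbers
way down (at the cost of some latency).  If I'm remembering the
parameter name correctly, it's something like

    mpirun --mca mpi_yield_when_idle 1 -np 4 ./your_program

with ./your_program standing in for your executable.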