I am using one of the nodes as a desktop computer, so what matters most to me is that the MPI program does not acquire CPU time so greedily. But I would imagine that energy consumption is a big issue in general, since energy is a major cost factor in a computer cluster. An idle CPU uses considerably less energy: last time I checked, my computer drew 180 W with both CPU cores working and 110 W with both cores idle.
I just made a small hack to solve the problem. I inserted a simple sleep call into the function 'opal_condition_wait':

--- orig/openmpi-1.2.6/opal/threads/condition.h
+++ openmpi-1.2.6/opal/threads/condition.h
@@ -78,6 +78,7 @@
 #endif
     } else {
         while (c->c_signaled == 0) {
+            usleep(1000);
             opal_progress();
         }
     }

The usleep call lets the program sleep for about 4 ms (it won't sleep for a shorter time because of the timer granularity), but that is good enough for me: CPU usage is (almost) zero while the tasks are waiting for one another. For a proper implementation you would want to poll actively, without a sleep call, for a few milliseconds, and then switch to some method that sleeps not for a fixed time but until new messages arrive (a rough sketch of this follows below).

Barry Rountree wrote:
> On Wed, Apr 23, 2008 at 11:38:41PM +0200, Ingo Josopait wrote:
>> I can think of several advantages that using blocking or signals to
>> reduce the CPU load would have:
>>
>> - Reduced energy consumption
>
> Not necessarily. Any time the program ends up running longer, the
> cluster is up and running (and wasting electricity) for that amount of
> time. In the case where lots of tiny messages are being sent, you could
> easily end up using more energy.
>
>> - Running additional background programs could be done far more efficiently
>
> It's usually more efficient -- especially in terms of cache -- to batch
> up programs to run one after the other instead of running them
> simultaneously.
>
>> - It would be much simpler to examine the load balance.
>
> This is true, but it's still pretty trivial to measure load imbalance.
> MPI allows you to write a wrapper library that intercepts any MPI_*
> call. You can instrument the code however you like, then call PMPI_*,
> catch the return value, finish your instrumentation, and return
> control to your program. Here's some pseudocode:
>
> int MPI_Barrier(MPI_Comm comm){
>     gettimeofday(&start, NULL);
>     rc = PMPI_Barrier(comm);
>     gettimeofday(&stop, NULL);
>     fprintf(logfile, "Barrier on node %d took %lf seconds\n",
>             rank, delta(&stop, &start));
>     return rc;
> }
>
> I've got some code that does this for all of the MPI calls in Open MPI
> (ah, the joys of writing C code using Python scripts). Let me know if
> you'd find it useful.
>
>> It may depend on the type of program and the computational environment,
>> but there are certainly many cases in which putting the system in idle
>> mode would be advantageous. This is especially true for programs with
>> low network traffic and/or high load imbalances.
>
> <grin> I could use a few more benchmarks like that. Seriously, if
> you're mostly concerned about saving energy, a quick hack is to set a
> timer as soon as you enter an MPI call (say for 100 ms), and if the timer
> goes off while you're still in the call, use DVS to drop your CPU
> frequency to the lowest value it has. Then, when you exit the MPI call,
> pop it back up to the highest frequency. This can save a significant
> amount of energy, but even here there can be a performance penalty. For
> example, UMT2K schleps around very large messages, and you really need
> to be running as fast as possible during the MPI_Waitall calls or the
> program will slow down by 1% or so (thus using more energy).
>
> Doing this just for Barriers and Allreduces seems to speed up the
> program a tiny bit, but I haven't done enough runs to make sure this
> isn't an artifact.
>
> (This is my dissertation topic, so before asking any question, be advised
> that I WILL talk your ear off.)
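As a concrete illustration of the poll-then-sleep idea suggested above, a waiting loop could busy-poll for a few milliseconds and only then start sleeping between polls. This is only a rough sketch of the general technique, not Open MPI's actual progress loop: the names wait_for_condition, make_progress and condition_signaled, as well as the 5 ms threshold, are made up for the example, and it still wakes up periodically instead of truly blocking until a message arrives.

    #include <time.h>
    #include <unistd.h>

    /* Rough sketch: busy-poll first, then back off to short sleeps.
     * 'condition_signaled' and 'make_progress' are hypothetical
     * stand-ins for c->c_signaled and opal_progress(). */
    static void wait_for_condition(volatile int *condition_signaled,
                                   void (*make_progress)(void))
    {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);

        while (*condition_signaled == 0) {
            make_progress();

            clock_gettime(CLOCK_MONOTONIC, &now);
            double elapsed = (now.tv_sec - start.tv_sec)
                           + (now.tv_nsec - start.tv_nsec) / 1e9;

            /* Poll flat out for the first few milliseconds to keep
             * latency low, then sleep between polls so an idle rank
             * stops burning CPU. */
            if (elapsed > 0.005) {
                usleep(1000);
            }
        }
    }

A real implementation would replace the usleep with something that blocks until the network or shared-memory queue actually has data, which is exactly the part that is hard to do for shared memory.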
>
>> The "spin for a while and then block" method that you mentioned earlier
>> seems to be a good compromise. Just do polling for some time that is
>> long compared to the corresponding system call, and then go to sleep if
>> nothing happens. In this way, the latency would be only marginally
>> increased, while less CPU time is wasted in the polling loops, and I
>> would be much happier.
>>
>
> I'm interested in seeing what this does for energy savings. Are you
> volunteering to test a patch? (I've got four other papers I need to
> get finished up, so it'll be a few weeks before I start coding.)
>
> Barry Rountree
> Ph.D. Candidate, Computer Science
> University of Georgia
>
>>
>> Jeff Squyres wrote:
>>> On Apr 23, 2008, at 3:49 PM, Danesh Daroui wrote:
>>>
>>>> Do you really mean that Open MPI uses a busy loop in order to handle
>>>> incoming calls? That seems incorrect to me, since spinning is a very
>>>> bad and inefficient technique for this purpose.
>>>
>>> It depends on what you're optimizing for. :-) We're optimizing for
>>> minimum message-passing latency on hosts that are not oversubscribed;
>>> polling is very good at that. Polling is much better than blocking,
>>> particularly if the blocking involves a system call (which will be
>>> "slow"). Note that in a compute-heavy environment, the nodes are
>>> going to be running at 100% CPU anyway.
>>>
>>> Also keep in mind that you're only going to have "waste" spinning in
>>> MPI if you have a loosely/poorly synchronized application. Granted,
>>> some applications are this way by nature, but we have not chosen to
>>> optimize spare CPU cycles for them. As I said in a prior mail, adding
>>> a blocking strategy is on the to-do list, but it's fairly low in
>>> priority right now. Someone may care enough to improve the
>>> message-passing engine to include blocking, but it hasn't happened
>>> yet. Want to work on it? :-)
>>>
>>> And for reference: almost all MPIs do busy polling to minimize
>>> latency. Some of them will shift to blocking if nothing happens for a
>>> "long" time. This second piece is what OMPI is lacking.
>>>
>>>> Why don't you use blocking and/or signals instead of that?
>>>
>>> FWIW: I mentioned this in my other mail -- latency is quite definitely
>>> negatively impacted when you use such mechanisms. Blocking and
>>> signals are "slow" (in comparison to polling).
>>>
>>>> I think the priority of this task is very high because polling
>>>> just wastes resources of the system.
>>>
>>> In production HPC environments, the entire resource is dedicated to
>>> the MPI app anyway, so there's nothing else that really needs it. So
>>> we allow them to busy-spin.
>>>
>>> There is a mode to call yield() in the middle of every OMPI progress
>>> loop, but it's only helpful for loosely/poorly synchronized MPI apps
>>> and ones that use TCP or shared memory. Low-latency networks such as
>>> IB or Myrinet won't be as friendly to this setting because they're
>>> busy polling (i.e., they call yield() much less frequently, if at all).
>>>
>>>> On the other hand, what Alberto claims is not reasonable to me.
>>>>
>>>> Alberto,
>>>> - Are you oversubscribing one node, i.e. running your code on a
>>>> single-processor machine while pretending to have four CPUs?
>>>>
>>>> - Did you compile Open MPI yourself or install it from an RPM?
>>>>
>>>> A receiving process shouldn't be that expensive.
>>>>
>>>> Regards,
>>>>
>>>> Danesh
>>>>
>>>>
>>>> Jeff Squyres wrote:
>>>>> Because on-node communication typically uses shared memory, we
>>>>> currently have to poll. Additionally, when using mixed on/off-node
>>>>> communication, we have to alternate between polling shared memory and
>>>>> polling the network.
>>>>>
>>>>> Additionally, we actively poll because it's the best way to lower
>>>>> latency. MPI implementations are almost always judged first on their
>>>>> latency, not [usually] their CPU utilization. Going to sleep in a
>>>>> blocking system call will definitely hurt latency.
>>>>>
>>>>> We have plans for implementing the "spin for a while and then block"
>>>>> technique (as has been used in other MPIs and middleware layers), but
>>>>> it hasn't been a high priority.
>>>>>
>>>>>
>>>>> On Apr 23, 2008, at 12:19 PM, Alberto Giannetti wrote:
>>>>>
>>>>>> Thanks Torje. I wonder what the benefit is of looping on the incoming
>>>>>> message-queue socket rather than using blocking system I/O, like
>>>>>> read() or select().
>>>>>>
>>>>>> On Apr 23, 2008, at 12:10 PM, Torje Henriksen wrote:
>>>>>>
>>>>>>> Hi Alberto,
>>>>>>>
>>>>>>> The blocked processes are in fact spin-waiting. While they don't have
>>>>>>> anything better to do (waiting for that message), they will check
>>>>>>> their incoming message queues in a loop.
>>>>>>>
>>>>>>> So the MPI_Recv() operation is blocking, but that doesn't mean the
>>>>>>> processes are blocked by the OS scheduler.
>>>>>>>
>>>>>>> I hope that made some sense :)
>>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Torje
>>>>>>>
>>>>>>>
>>>>>>> On Apr 23, 2008, at 5:34 PM, Alberto Giannetti wrote:
>>>>>>>
>>>>>>>> I have a simple MPI program that sends data to processor rank 0. The
>>>>>>>> communication works well, but when I run the program on more than 2
>>>>>>>> processors (-np 4), the extra receivers waiting for data run at > 90%
>>>>>>>> CPU load. I understand MPI_Recv() is a blocking operation, but why
>>>>>>>> does it consume so much CPU compared to a regular system read()?
>>>>>>>>
>>>>>>>>
>>>>>>>> #include <sys/types.h>
>>>>>>>> #include <unistd.h>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>> #include <mpi.h>
>>>>>>>>
>>>>>>>> void process_sender(int);
>>>>>>>> void process_receiver(int);
>>>>>>>>
>>>>>>>>
>>>>>>>> int main(int argc, char* argv[])
>>>>>>>> {
>>>>>>>>   int rank;
>>>>>>>>
>>>>>>>>   MPI_Init(&argc, &argv);
>>>>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>>>>
>>>>>>>>   printf("Processor %d (%d) initialized\n", rank, getpid());
>>>>>>>>
>>>>>>>>   if( rank == 1 )
>>>>>>>>     process_sender(rank);
>>>>>>>>   else
>>>>>>>>     process_receiver(rank);
>>>>>>>>
>>>>>>>>   MPI_Finalize();
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> void process_sender(int rank)
>>>>>>>> {
>>>>>>>>   int i, j, size;
>>>>>>>>   float data[100];
>>>>>>>>   MPI_Status status;
>>>>>>>>
>>>>>>>>   printf("Processor %d initializing data...\n", rank);
>>>>>>>>   for( i = 0; i < 100; ++i )
>>>>>>>>     data[i] = i;
>>>>>>>>
>>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>>>>
>>>>>>>>   printf("Processor %d sending data...\n", rank);
>>>>>>>>   MPI_Send(data, 100, MPI_FLOAT, 0, 55, MPI_COMM_WORLD);
>>>>>>>>   printf("Processor %d sent data\n", rank);
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> void process_receiver(int rank)
>>>>>>>> {
>>>>>>>>   int count;
>>>>>>>>   float value[200];
>>>>>>>>   MPI_Status status;
>>>>>>>>
>>>>>>>>   printf("Processor %d waiting for data...\n", rank);
>>>>>>>>   MPI_Recv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
>>>>>>>>     MPI_COMM_WORLD, &status);
>>>>>>>>   printf("Processor %d Got data from processor %d\n", rank,
>>>>>>>>     status.MPI_SOURCE);
>>>>>>>>   MPI_Get_count(&status, MPI_FLOAT, &count);
>>>>>>>>   printf("Processor %d, Got %d elements\n", rank, count);
>>>>>>>> }
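One user-level way to get the idle receivers in the program above off the CPU, without touching Open MPI itself, is to post a nonblocking receive and test it in a loop with a short sleep. This is only a sketch under the assumption that a few milliseconds of extra latency are acceptable; it keeps the original buffer size and tag, and the function name process_receiver_lowcpu is made up for the example:

    #include <unistd.h>
    #include <stdio.h>
    #include <mpi.h>

    /* Sketch of the receiver using MPI_Irecv/MPI_Test instead of a
     * blocking MPI_Recv, so the process sleeps between polls instead
     * of spinning at 100% CPU. */
    void process_receiver_lowcpu(int rank)
    {
        int count, flag = 0;
        float value[200];
        MPI_Request request;
        MPI_Status status;

        printf("Processor %d waiting for data...\n", rank);
        MPI_Irecv(value, 200, MPI_FLOAT, MPI_ANY_SOURCE, 55,
                  MPI_COMM_WORLD, &request);

        while (!flag) {
            MPI_Test(&request, &flag, &status);   /* also drives progress */
            if (!flag)
                usleep(1000);                     /* adds a little latency */
        }

        printf("Processor %d got data from processor %d\n", rank,
               status.MPI_SOURCE);
        MPI_Get_count(&status, MPI_FLOAT, &count);
        printf("Processor %d, got %d elements\n", rank, count);
    }

The trade-off is the one discussed throughout the thread: every sleep adds a bit of latency to message arrival, which is fine for a desktop node but not for latency benchmarks.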
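Also, the yield mode that Jeff mentions can be enabled at run time through an MCA parameter (mpi_yield_when_idle, if I remember the name correctly in the 1.2 series). It makes the progress loop call yield() so that other runnable processes get the CPU, but the MPI ranks will still show 100% CPU when nothing else wants to run. The executable name below is just a placeholder:

    mpirun --mca mpi_yield_when_idle 1 -np 4 ./my_program

As far as I know, Open MPI also switches to this "degraded" mode automatically when it detects oversubscription, i.e. when more processes are started on a node than it has slots.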