On 01/22/2013 02:55 PM, Mike Galbraith wrote:
> On Tue, 2013-01-22 at 11:20 +0800, Alex Shi wrote:
>>>>>>
>>>>>> I just looked into the aim9 benchmark: in this case it forks 2000 tasks,
>>>>>> and after all tasks are ready, aim9 gives a signal, then all tasks burst
>>>>>> awake and run until all are finished.
>>>>>> Since each task finishes very quickly, an imbalanced empty cpu
>>>>>> may go to sleep until a regular balancing gives it some new tasks. That
>>>>>> causes the performance drop and more idle entering.
>>>>>
>>>>> Sounds like for AIM (and possibly for other really bursty loads), we
>>>>> might want to do some load-balancing at wakeup time by *just* looking
>>>>> at the number of running tasks, rather than at the load average. Hmm?
>>>>>
>>>>> The load average is fundamentally always going to run behind a bit,
>>>>> and while you want to use it for long-term balancing, in the short
>>>>> term you might want to do just an "if we have a huge number of runnable
>>>>> processes, do a load balancing *now*". Where "huge number" should
>>>>> probably be relative to the long-term load balancing (ie comparing the
>>>>> number of runnable processes on this CPU right *now* with the load
>>>>> average over the last second or so would show a clear spike, and a
>>>>> reason for quick action).
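
(For illustration, the check Linus describes might look something like
the sketch below: compare the instantaneous runnable count against the
decayed average, and balance immediately on a clear spike. All names
here, struct rq_stats, should_balance_now, BURST_RATIO, are made up for
the example and are not existing kernel interfaces.)

#include <stdbool.h>

/* Hypothetical per-runqueue snapshot, just for the example. */
struct rq_stats {
	unsigned int nr_running;	/* runnable tasks right now */
	unsigned int load_avg;		/* decayed average over roughly
					 * the last second, in whole-task
					 * units */
};

#define BURST_RATIO	2	/* "huge" = 2x the long-term average */

static bool should_balance_now(const struct rq_stats *rq)
{
	/* Far more runnable tasks now than history predicts: treat it
	 * as a burst and balance now rather than waiting for the
	 * periodic balance tick.  The +1 keeps the threshold sane
	 * when the average is near zero. */
	return rq->nr_running > BURST_RATIO * (rq->load_avg + 1);
}

int main(void)
{
	struct rq_stats rq = { .nr_running = 8, .load_avg = 2 };

	return should_balance_now(&rq) ? 0 : 1;	/* spike: 8 > 2*(2+1) */
}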
>>>>
>>>> Sorry for the late response!
>>>>
>>>> I just wrote a patch following your suggestion, but there is no clear
>>>> improvement for this case.
>>>> I also tried changing the burst checking interval, also with no clear
>>>> help.
>>>>
>>>> If I totally give up runnable load in periodic balancing, the
>>>> performance can recover 60% of the loss.
>>>>
>>>> I will try to optimize wake up balancing over the weekend.
>>>>
>>> (btw, the time for the runnable avg to accumulate to 100% is 345ms; to
>>> 50%, 32ms)
>>>
>>> I have tried some tuning in both wake up balancing and regular
>>> balancing. Yes, when using the instant load weight (without the
>>> runnable avg engaged), both at wake up and in regular balance, the
>>> performance recovered.
>>>
>>> But with per_cpu nr_running tracking, it's hard to find an elegant way
>>> to detect a burst, whether at wake up or in regular balance.
>>> At wake up, all cpus in the sd_llc domain are candidates, so just
>>> checking this_cpu is not enough.
>>> In regular balance, this_cpu is the migration destination cpu, so
>>> checking for a burst on that cpu is not useful. Instead, we need to
>>> check the whole domain's increase in task number.
>>>
>>> So, I guess there are 2 solutions for this issue.
>>> 1, for quick wake up, use the instant load (same as current balancing)
>>> to do the balance; and for regular balance, record both instant load
>>> and runnable load data for the whole domain, then decide which one to
>>> use according to the task number increase in the domain once tracking
>>> of the whole domain is done.
>>>
>>> 2, keep the current instant load balancing as the performance balance
>>> policy, and use runnable load balancing in the power-friendly policy.
>>> Since none of us has found a performance benefit from runnable load
>>> balancing on the hackbench/kbuild/aim9/tbench/specjbb etc. benchmarks,
>>> I prefer the 2nd.
>>
>> 3, On the other hand, considering that the aim9 testing scenario is rare
>> in real life (prepare thousands of tasks and then wake them all up at
>> the same time), and that the runnable load avg includes useful running
>> history info, losing only 5~7% on aim9 is not unacceptable.
>> (kbuild/hackbench/tbench/specjbb show no clear performance change)
>>
>> So we can let this drop stand, with a reminder in the code. Any comments?
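
(As an aside, the 32ms / 345ms figures quoted above fall straight out of
the decay series the runnable avg uses: the tracked sum decays by y per
1ms period with y^32 = 0.5, so a continuously runnable task accumulates
a 1 - y^n fraction of full weight after n periods; 345 appears to match
LOAD_AVG_MAX_N in the tracking code. A standalone userspace check, not
kernel code:)

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* per-1ms-period decay */

	/* 32 periods: 1 - y^32 = 1 - 0.5, i.e. 50% */
	printf("32ms:  %5.2f%%\n", 100.0 * (1.0 - pow(y, 32)));
	/* 345 periods: ~99.94%, saturated for practical purposes */
	printf("345ms: %5.2f%%\n", 100.0 * (1.0 - pow(y, 345)));
	return 0;
}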
> Hm. A burst of thousands of tasks may be rare and perhaps even silly,
> but what about few-task bursts? History is useless for bursts, they
> live or die now: a modest gaggle of worker threads (NR_CPUS) for, say,
> a video encoding job wakes in parallel, each handed a chunk of data to
> chew up in parallel. Double the scheduler latency of one worker
> (workers get stacked because individuals don't historically fill a
> cpu), and you double the latency of the entire job, every time.
>
> I think 2 is mandatory: keep both, and the user picks his poison.
>
> If you want max burst performance, you care about the here-and-now
> reality the burst is waking into. If you're running a google freight
> train farm otoh, you may want some hysteresis so trains don't over-rev
> the electric meter on every microscopic spike. Both policies make
> sense, but you can't have both performance profiles with either metric,
> so choosing one seems doomed to failure.

Thanks for your suggestions and example, Mike! I just can't understand
your last words here, sorry. What is your detailed concern about 'both
performance profiles with either metric'? Would you like to give your
preferred solution?

> Case in point: tick skew. It was removed because synchronized ticking
> saves power.. and then promptly returned under user control because the
> power saving gain also inflicted serious latency pain.
>
> -Mike

--
Thanks
    Alex
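
(To make option 2 above concrete: with both metrics maintained, picking
one per policy is cheap, roughly as in the sketch below. balance_policy,
cpu_load_sample and balance_load are illustrative names only, not the
actual scheduler API.)

/* Performance policy reacts to the here-and-now load; power policy
 * smooths over transient spikes with the decayed runnable average. */
enum balance_policy { POLICY_PERFORMANCE, POLICY_POWERSAVE };

struct cpu_load_sample {
	unsigned long inst_load;	/* instantaneous weighted load */
	unsigned long runnable_avg;	/* decayed runnable average */
};

static unsigned long balance_load(const struct cpu_load_sample *s,
				  enum balance_policy policy)
{
	return policy == POLICY_PERFORMANCE ? s->inst_load
					    : s->runnable_avg;
}

int main(void)
{
	/* A burst looks heavy to the performance policy but still
	 * light to the power policy. */
	struct cpu_load_sample s = { .inst_load = 2048,
				     .runnable_avg = 512 };

	return balance_load(&s, POLICY_PERFORMANCE) >
	       balance_load(&s, POLICY_POWERSAVE) ? 0 : 1;
}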