Re: [RFC PATCH v2 1/8] cpuidle: menu: extract prediction functionality
On 2017/10/14 8:26, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:27 AM CEST Aubrey Li wrote: >> There are several factors in the menu governor to predict the next >> idle interval: >> - the next timer >> - the recent idle interval history >> - the corrected idle interval pattern >> These factors are common enough to be extracted to be one function. >> >> Signed-off-by: Aubrey Li > > This patch alone would break things AFAICS, because it removes code from > menu_select() without a replacement (and menu_predict() is never called > just yet). > > Please always do your best to ensure that things will work after *every* > patch in a series. okay, I'll correct this in the next version. Thanks, -Aubrey
Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
On 2017/10/14 8:35, Rafael J. Wysocki wrote:
> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
>> Record the overhead of idle entry in micro-second
>>
>
> What is this needed for?

We need to figure out how long an idle period has to be before it counts as a short idle, and recording the overhead is for this purpose. The short-idle threshold is based on this overhead.

>> +void cpuidle_entry_end(void)
>> +{
>> +	struct cpuidle_device *dev = cpuidle_get_device();
>> +	u64 overhead;
>> +	s64 diff;
>> +
>> +	if (dev) {
>> +		dev->idle_stat.entry_end = local_clock();
>> +		overhead = div_u64(dev->idle_stat.entry_end -
>> +			dev->idle_stat.entry_start, NSEC_PER_USEC);
>
> Is the conversion really necessary?
>
> If so, then why?

We can choose between nanoseconds and microseconds. Given the workloads that produce the short-idle pattern, I think microsecond resolution is good enough for real workloads.

Another reason is that the prediction from the idle governor is in microseconds, so I convert it for comparison purposes.

> And if there is a good reason, what about using right shift to do
> an approximate conversion to avoid the extra division here?

Sure, >> 10 works for me, as I don't think precision is a big deal here.

>> +		diff = overhead - dev->idle_stat.overhead;
>> +		dev->idle_stat.overhead += diff >> 3;
>
> Can you please explain what happens in the two lines above?

It's an online averaging algorithm, stolen from update_avg() @ kernel/sched/core.c.

>> +		/*
>> +		 * limit overhead to 1us
>> +		 */
>> +		if (dev->idle_stat.overhead == 0)
>> +			dev->idle_stat.overhead = 1;
>> +	}
>> +}
>> +
>>  /**
>>   * cpuidle_install_idle_handler - installs the cpuidle idle loop handler
>>   */
>> diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h
>> index fc1e5d7..cad9b71 100644
>> --- a/include/linux/cpuidle.h
>> +++ b/include/linux/cpuidle.h
>> @@ -72,6 +72,15 @@ struct cpuidle_device_kobj;
>>  struct cpuidle_state_kobj;
>>  struct cpuidle_driver_kobj;
>>
>> +struct cpuidle_stat {
>> +	u64		entry_start;	/* nanosecond */
>> +	u64		entry_end;	/* nanosecond */
>> +	u64		overhead;	/* nanosecond */
>> +	unsigned int	predicted_us;	/* microsecond */
>> +	bool		predicted;	/* ever predicted? */
>> +	bool		fast_idle;	/* fast idle? */
>
> Some of the fields here are never used in the code after this patch.
>
> Also it would be good to add a comment describing the meaning of the
> fields.

okay, will add in the next version.

Thanks,
-Aubrey
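For reference, a minimal user-space sketch of the update_avg()-style estimator and the >> 10 approximation discussed above; this is illustrative only, not the code from the patch:

#include <stdio.h>
#include <stdint.h>

/* 1/8-weight exponential moving average, same shape as update_avg(). */
static void update_overhead(int64_t *avg_ns, int64_t sample_ns)
{
	int64_t diff = sample_ns - *avg_ns;

	*avg_ns += diff >> 3;		/* new = old + (sample - old) / 8 */
}

int main(void)
{
	int64_t overhead_ns = 0;
	int64_t samples[] = { 9000, 11000, 10000, 12000, 8000 };

	for (unsigned int i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		update_overhead(&overhead_ns, samples[i]);
		/* ns >> 10 is the cheap, approximate ns -> us conversion. */
		printf("sample %lld ns -> avg %lld ns (~%lld us)\n",
		       (long long)samples[i], (long long)overhead_ns,
		       (long long)(overhead_ns >> 10));
	}
	return 0;
}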
Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
On 2017/10/14 8:51, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:30 AM CEST Aubrey Li wrote: >> If the next idle is expected to be a fast idle, we should keep tick >> on before going into idle >> >> Signed-off-by: Aubrey Li > > This also can be merged with the previous patch (and the [2/8]) IMO. > okay, will merge in the next version. >> --- >> drivers/cpuidle/cpuidle.c | 14 ++ >> include/linux/cpuidle.h | 2 ++ >> kernel/time/tick-sched.c | 4 >> 3 files changed, 20 insertions(+) >> >> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >> index ef6f7dd..6cb7e17 100644 >> --- a/drivers/cpuidle/cpuidle.c >> +++ b/drivers/cpuidle/cpuidle.c >> @@ -370,6 +370,20 @@ void cpuidle_predict(void) >> } >> >> /** >> + * cpuidle_fast_idle - predict whether or not the coming idle is a fast idle >> + * This function can be called in irq exit path, make it as soon as possible >> + */ >> +bool cpuidle_fast_idle(void) >> +{ >> +struct cpuidle_device *dev = cpuidle_get_device(); >> + >> +if (!dev) >> +return false; >> + >> +return dev->idle_stat.fast_idle; > > return dev && dev->idle_stat.fast_idle; Thanks! >> >> @@ -916,6 +917,9 @@ static bool can_stop_idle_tick(int cpu, struct >> tick_sched *ts) >> return false; >> } >> >> +if (cpuidle_fast_idle()) >> +return false; >> + >> return true; > > return !cpuidle_fast_idle(); And thanks! > >> } >> >> > > And IMO there is quite a bit too much marketing in the "fast_idle" name, > as it seems all about avoiding to stop the tick if the predicted idle > duration is short enough. > okay, and what's your suggestion? :) I'll try to move quiet_vmstat() into the normal idle branch if this patch series are reasonable. Is fast_idle a good indication for it? Thanks, -Aubrey
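Putting Rafael's two simplifications together, the affected hunks would read roughly as below. This is a sketch against drivers/cpuidle/cpuidle.c and kernel/time/tick-sched.c as patched by this series; the idle_stat fields come from patch 2/8.

bool cpuidle_fast_idle(void)
{
	struct cpuidle_device *dev = cpuidle_get_device();

	/* Collapse the NULL check and the flag read into one expression. */
	return dev && dev->idle_stat.fast_idle;
}

static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
{
	/* ... existing checks unchanged ... */

	/* Keep the tick when the coming idle is predicted to be short. */
	return !cpuidle_fast_idle();
}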
Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
On 2017/10/16 12:45, Mike Galbraith wrote:
> On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote:
>>
>> I'll try to move quiet_vmstat() into the normal idle branch if this patch
>> series are reasonable. Is fast_idle a good indication for it?
>
> see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code

It looks like this commit puts quiet_vmstat() on a critical path, since the tick-stop code can now be invoked on the interrupt exit path?

smp_apic_timer_interrupt
  -> irq_exit
    -> tick_nohz_irq_exit
      -> __tick_nohz_idle_enter
        -> tick_nohz_stop_sched_tick
          -> quiet_vmstat

Thanks,
-Aubrey
Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
On 2017/10/14 8:59, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote: >> Add a knob to make fast idle threshold tunable >> >> Signed-off-by: Aubrey Li > > I first of all am not sure about the need to add a tunable for this at all > in the first place. Actually I think a fixed value(10) might be good enough but not quite sure if there is a requirement to tune it for different scenario, for example even if the predicted idle interval is 100x overhead, I still want a fast path for a better benchmark score? >> @@ -1229,6 +1230,17 @@ static struct ctl_table kern_table[] = { >> .extra2 = &one, >> }, >> #endif >> +#ifdef CONFIG_CPU_IDLE >> +{ >> +.procname = "fast_idle_ratio", >> +.data = &sysctl_fast_idle_ratio, >> +.maxlen = sizeof(int), >> +.mode = 0644, >> +.proc_handler = proc_dointvec_minmax, >> +.extra1 = &one, >> +.extra2 = &one_hundred, >> +}, >> +#endif >> { } >> }; >> > > And if there is a good enough reason to add it, shouldn't the tunable be > there in the cpuidle framework? > sure, if it makes sense, I'll move it into cpuidle/sysfs.c Thanks, -Aubrey
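If the knob does move out of kernel/sysctl.c, one possible shape is sketched below, using the generic kobject attribute helpers rather than cpuidle's own sysfs machinery (which the real patch would hook into instead); treat the names and limits as assumptions.

static unsigned int fast_idle_ratio = 10;

static ssize_t fast_idle_ratio_show(struct kobject *kobj,
				    struct kobj_attribute *attr, char *buf)
{
	return sprintf(buf, "%u\n", fast_idle_ratio);
}

static ssize_t fast_idle_ratio_store(struct kobject *kobj,
				     struct kobj_attribute *attr,
				     const char *buf, size_t count)
{
	unsigned int val;

	/* Keep the same 1..100 range the sysctl version enforces. */
	if (kstrtouint(buf, 10, &val) || !val || val > 100)
		return -EINVAL;

	fast_idle_ratio = val;
	return count;
}

static struct kobj_attribute fast_idle_ratio_attr =
	__ATTR(fast_idle_ratio, 0644, fast_idle_ratio_show,
	       fast_idle_ratio_store);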
Re: [RFC PATCH v2 7/8] cpuidle: introduce irq timing to make idle prediction
On 2017/10/14 9:01, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:33 AM CEST Aubrey Li wrote: >> Introduce irq timings output as a factor to predict the duration >> of the coming idle >> >> @@ -342,13 +343,27 @@ void cpuidle_entry_end(void) >> void cpuidle_predict(void) >> { >> struct cpuidle_device *dev = cpuidle_get_device(); >> -unsigned int overhead_threshold; >> +unsigned int idle_interval, overhead_threshold; >> +u64 now, next_evt; >> >> if (!dev) >> return; >> >> overhead_threshold = dev->idle_stat.overhead * sysctl_fast_idle_ratio; >> >> +/* >> + * check irq timings if the next event is coming soon >> + */ >> +now = local_clock(); >> +local_irq_disable(); >> +next_evt = irq_timings_next_event(now); >> +local_irq_enable(); >> +idle_interval = div_u64(next_evt - now, NSEC_PER_USEC); > > Another division ... > okay, will replace with >>10 if micro-second is agreed to be the comparing unit Thanks, -Aubrey
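A sketch of the division-free version, assuming CONFIG_IRQ_TIMINGS is enabled and microseconds remain the unit of comparison (ns >> 10 approximates ns / 1000 closely enough for this heuristic):

static unsigned int next_irq_interval_us(void)
{
	u64 now, next_evt;

	now = local_clock();
	local_irq_disable();
	next_evt = irq_timings_next_event(now);
	local_irq_enable();

	if (next_evt <= now)
		return 0;

	/* >> 10 approximates division by NSEC_PER_USEC without a div. */
	return (unsigned int)((next_evt - now) >> 10);
}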
Re: [RFC PATCH v2 4/8] tick/nohz: keep tick on for a fast idle
On 2017/10/16 14:25, Mike Galbraith wrote: > On Mon, 2017-10-16 at 13:34 +0800, Li, Aubrey wrote: >> On 2017/10/16 12:45, Mike Galbraith wrote: >>> On Mon, 2017-10-16 at 11:26 +0800, Li, Aubrey wrote: >>>> >>>> I'll try to move quiet_vmstat() into the normal idle branch if this patch >>>> series >>>> are reasonable. Is fast_idle a good indication for it? >>> >>> see x86_tip 62cb1188ed86 sched/idle: Move quiet_vmstate() into the NOHZ code >> >> It looks like this commit makes tick stop critical as it can be invoked in >> interrupt >> exit path? > > do_idle() ain't critical? It is in my book. Hopefully, you're about > to make that idle_stat.fast_idle thingy liberal enough that we cease > mucking about with the tick on every microscopic idle (and I can then > trash my years old local patch to avoid needlessly eating that cost). > Welcome any feedback of other patch set, Mike, :) Thanks, -Aubrey
Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
On 2017/10/14 8:56, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote: >> sleep length indicates how long we'll be idle. Currently, it's updated >> only when tick nohz enters. These patch series make a new requirement >> with tick, so we should keep sleep length updated as needed > > So what exactly would be the problem with leaving things as they are? Previously ts->sleep_length is only updated when tick is stopped. As follows, in __tick_nohz_idle_enter() { if (can_stop_idle_tick() /* return true */) { tick_nohz_stop_sched_tick() | |-> update sleep_length } } Now ts->sleep_length is required out of tick_nohz_idle_enter(), so we want to update sleep_length every time we read it If we leave it unchanged, the prediction could read a sleep_length long time ago if the system keep ticking. > >> --- >> kernel/time/tick-sched.c | 3 +++ >> 1 file changed, 3 insertions(+) >> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >> index d663fab..94fb9b8 100644 >> --- a/kernel/time/tick-sched.c >> +++ b/kernel/time/tick-sched.c >> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void) >> */ >> ktime_t tick_nohz_get_sleep_length(void) >> { >> +struct clock_event_device *dev = >> __this_cpu_read(tick_cpu_device.evtdev); >> struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); >> >> +ts->sleep_length = ktime_sub(dev->next_event, ktime_get()); >> + >> return ts->sleep_length; >> } >> >> > > I probably wouldn't do it this way ... > > May I know the detailed thoughts? Thanks, -Aubrey
Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
On 2017/10/14 9:14, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:26 AM CEST Aubrey Li wrote: >> We found under some latency intensive workloads, short idle periods occurs >> very common, then idle entry and exit path starts to dominate, so it's >> important to optimize them. To determine the short idle pattern, we need >> to figure out how long of the coming idle and the threshold of the short >> idle interval. >> >> A cpu idle prediction functionality is introduced in this proposal to catch >> the short idle pattern. >> >> Firstly, we check the IRQ timings subsystem, if there is an event >> coming soon. >> -- https://lwn.net/Articles/691297/ >> >> Secondly, we check the idle statistics of scheduler, if it's likely we'll >> go into a short idle. >> -- https://patchwork.kernel.org/patch/2839221/ >> >> Thirdly, we predict the next idle interval by using the prediction >> fucntionality in the idle governor if it has. >> >> For the threshold of the short idle interval, we record the timestamps of >> the idle entry, and multiply by a tunable parameter at here: >> -- /proc/sys/kernel/fast_idle_ratio >> >> We use the output of the idle prediction to skip turning tick off if a >> short idle is determined in this proposal. Reprogramming hardware timer >> twice(off and on) is expensive for a very short idle. There are some >> potential optimizations can be done according to the same indicator. >> >> I observed when system is idle, the idle predictor reports 20/s long idle >> and ZERO fast idle on one CPU. And when the workload is running, the idle >> predictor reports 72899/s fast idle and ZERO long idle on the same CPU. >> >> Aubrey Li (8): >> cpuidle: menu: extract prediction functionality >> cpuidle: record the overhead of idle entry >> cpuidle: add a new predict interface >> tick/nohz: keep tick on for a fast idle >> timers: keep sleep length updated as needed >> cpuidle: make fast idle threshold tunable >> cpuidle: introduce irq timing to make idle prediction >> cpuidle: introduce run queue average idle to make idle prediction >> >> drivers/cpuidle/Kconfig | 1 + >> drivers/cpuidle/cpuidle.c| 109 >> +++ >> drivers/cpuidle/governors/menu.c | 69 - >> include/linux/cpuidle.h | 21 >> kernel/sched/idle.c | 14 - >> kernel/sysctl.c | 12 + >> kernel/time/tick-sched.c | 7 +++ >> 7 files changed, 209 insertions(+), 24 deletions(-) >> > > Overall, it looks like you could avoid stopping the tick every time the > predicted idle duration is not longer than the tick interval in the first > place. > > Why don't you do that? I didn't catch this. Are you suggesting? if(!cpu_stat.fast_idle) tick_nohz_idle_enter() Or you concern why the threshold can't simply be tick interval? For the first, can_stop_idle_tick() is a better place to skip tick-off IMHO. For the latter, if the threshold is close/equal to the tick, it's quite possible the next event is the tick and no other else event. Thanks, -Aubrey
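For context, the cover letter's three inputs combine roughly as below. This is a sketch only; the helper names next_irq_interval_us(), sched_avg_idle_us() and governor_predicted_us() are assumptions standing in for the code added in patches 7/8, 8/8 and 3/8 respectively.

static bool predict_fast_idle(struct cpuidle_device *dev)
{
	unsigned int predicted_us = UINT_MAX;

	/* 1. next event estimated by the IRQ timings subsystem */
	predicted_us = min(predicted_us, next_irq_interval_us());
	/* 2. scheduler idle statistics (rq->avg_idle) */
	predicted_us = min(predicted_us, sched_avg_idle_us());
	/* 3. the idle governor's own prediction, if it provides one */
	predicted_us = min(predicted_us, governor_predicted_us());

	/*
	 * "fast idle" == shorter than the measured entry/exit overhead
	 * multiplied by the tunable ratio.
	 */
	return predicted_us < dev->idle_stat.overhead * sysctl_fast_idle_ratio;
}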
Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
On 2017/10/14 8:45, Rafael J. Wysocki wrote: > On Saturday, September 30, 2017 9:20:29 AM CEST Aubrey Li wrote: >> For the governor has predict functionality, add a new predict >> interface in cpuidle framework to call and use it. > > Care to describe how it is intended to work? > > Also this patch uses data structures introduced in the previous one > (and the previous one didn't use them), so maybe merge the two? okay, will refine in the next version. > >> --- >> drivers/cpuidle/cpuidle.c| 34 ++ >> drivers/cpuidle/governors/menu.c | 7 +++ >> include/linux/cpuidle.h | 3 +++ >> kernel/sched/idle.c | 1 + >> 4 files changed, 45 insertions(+) >> >> diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c >> index 4066308..ef6f7dd 100644 >> --- a/drivers/cpuidle/cpuidle.c >> +++ b/drivers/cpuidle/cpuidle.c >> @@ -336,6 +336,40 @@ void cpuidle_entry_end(void) >> } >> >> /** >> + * cpuidle_predict - predict whether the coming idle is a fast idle or not >> + */ >> +void cpuidle_predict(void) >> +{ >> +struct cpuidle_device *dev = cpuidle_get_device(); >> +unsigned int overhead_threshold; >> + >> +if (!dev) >> +return; >> + >> +overhead_threshold = dev->idle_stat.overhead; >> + >> +if (cpuidle_curr_governor->predict) { >> +dev->idle_stat.predicted_us = cpuidle_curr_governor->predict(); >> +/* >> + * notify idle governor to avoid reduplicative >> + * prediction computation >> + */ > > This comment is hard to understand. > > What does it mean, really? > predict() does a lot of computation. If it's called in cpuidle_predict(), we don't want it to be called in menu governor again. So here I use a flag to tell menu governor if predict computation is already done for the coming idle this time. >> +dev->idle_stat.predicted = true; >> +if (dev->idle_stat.predicted_us < overhead_threshold) { >> +/* >> + * notify tick subsystem to keep ticking >> + * for the coming idle >> + */ >> +dev->idle_stat.fast_idle = true; >> +} else >> +dev->idle_stat.fast_idle = false; > > What about > > dev->idle_stat.fast_idle = dev->idle_stat.predicted_us < > overhead_threshold; Ack > > Also why don't you use dev->idle_stat.overhead directly here? Yeah, I should merge a few patches, because dev->idle_stat.overhead is referenced many times, for irq timings, for scheduler idle statistics and for idle governor. > > And what is the basic idea here? Why do we compare the predicted > idle duration with the entry/exit overhead from the previous cycle > (if I understood this correctly, that is)? entry/exit overhead is an average here, and the variance should not be big. > >> +} else { >> +dev->idle_stat.predicted = false; >> +dev->idle_stat.fast_idle = false; >> +} >> +} >> + >> +/** >> * cpuidle_install_idle_handler - installs the cpuidle idle loop handler >> */ >> void cpuidle_install_idle_handler(void) >> diff --git a/drivers/cpuidle/governors/menu.c >> b/drivers/cpuidle/governors/menu.c >> index 6bed197..90b2a10 100644 >> --- a/drivers/cpuidle/governors/menu.c >> +++ b/drivers/cpuidle/governors/menu.c >> @@ -344,6 +344,12 @@ static int menu_select(struct cpuidle_driver *drv, >> struct cpuidle_device *dev) >> if (unlikely(latency_req == 0)) >> return 0; >> >> +/*don't predict again if idle framework already did it */ >> +if (!dev->idle_stat.predicted) >> +menu_predict(); >> +else >> +dev->idle_stat.predicted = false; > > We provide the callback which is going to be used by the core if present, > so why would the core not use it after all? 
There is a case where, inside the loop

	while (!need_resched()) {
		cpuidle_idle_call();
	}

the CPU receives a timer interrupt, wakes up, finds nothing to schedule, and goes back through the menu governor and into mwait again. In this case cpuidle_predict() is not reached, but the idle data for the prediction still needs to be updated.

Thanks,
-Aubrey
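The shape being described, roughly (a simplified sketch; error handling and the tick handling are omitted):

static void do_idle_sketch(void)
{
	cpuidle_predict();		/* runs once per do_idle() pass */

	while (!need_resched()) {
		/*
		 * cpuidle_idle_call() -> menu_select(): when a timer wakes
		 * the CPU and nothing is runnable, we loop here without
		 * leaving do_idle(), so ->predicted is stale and the
		 * governor falls back to menu_predict() itself.
		 */
		cpuidle_idle_call();
	}
}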
Re: [RFC PATCH v2 3/8] cpuidle: add a new predict interface
On 2017/10/14 9:27, Rafael J. Wysocki wrote: >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c >> index 0951dac..8704f3c 100644 >> --- a/kernel/sched/idle.c >> +++ b/kernel/sched/idle.c >> @@ -225,6 +225,7 @@ static void do_idle(void) >> */ >> __current_set_polling(); >> quiet_vmstat(); >> +cpuidle_predict(); > > One more question here. > > This changes the cpuidle code ordering such that if the ->predict callback > is present, the governor prediction will run before tick_nohz_idle_enter(), > whereas without that callback it runs in cpuidle_idle_call(). > > Is that actually going to work correctly for the menu governor, in particular? > This depends on how prediction works, patch 1/8 has the details. The first factor we get is the next clock event from tick device, this is not related to cpuidle call. The other two factors are derived from idle interval history, the data is already maintained in menu_device data structure. I'm not sure if this can address your concern, or anything else I can provide to help this. Thanks, -Aubrey
Re: [RFC PATCH v2 5/8] timers: keep sleep length updated as needed
On 2017/10/17 7:58, Rafael J. Wysocki wrote: > On Monday, October 16, 2017 8:46:41 AM CEST Li, Aubrey wrote: >> On 2017/10/14 8:56, Rafael J. Wysocki wrote: >>> On Saturday, September 30, 2017 9:20:31 AM CEST Aubrey Li wrote: >>>> sleep length indicates how long we'll be idle. Currently, it's updated >>>> only when tick nohz enters. These patch series make a new requirement >>>> with tick, so we should keep sleep length updated as needed >>> >>> So what exactly would be the problem with leaving things as they are? >> >> Previously ts->sleep_length is only updated when tick is stopped. >> >> As follows, in >> >> __tick_nohz_idle_enter() >> { >> if (can_stop_idle_tick() /* return true */) { >> tick_nohz_stop_sched_tick() >> | >> |-> update sleep_length >> } >> } > > Which is logical, because the tick will get in the way if we don't stop it, > won't it? > The scenario here is, tick(4ms) is too long for a short idle(e.g. 4us). And there could be hundreds of short idle within one tick interval. >> >> Now ts->sleep_length is required out of tick_nohz_idle_enter(), so we want >> to update sleep_length every time we read it >> >> If we leave it unchanged, the prediction could read a sleep_length long time >> ago if the system keep ticking. > > Well, but does it make sense to estimate the sleep length without stopping > the tick? For example, for the first short idle, we just turned tick off last time, so we may get the sleep length = 3900us. Then we keep tick on, and the 100th short idle comes, then the original sleep length is still 3900us(not updated), but actually it should be e.g. 100us. > >>>> --- >>>> kernel/time/tick-sched.c | 3 +++ >>>> 1 file changed, 3 insertions(+) >>>> >>>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c >>>> index d663fab..94fb9b8 100644 >>>> --- a/kernel/time/tick-sched.c >>>> +++ b/kernel/time/tick-sched.c >>>> @@ -1008,8 +1008,11 @@ void tick_nohz_irq_exit(void) >>>> */ >>>> ktime_t tick_nohz_get_sleep_length(void) >>>> { >>>> + struct clock_event_device *dev = >>>> __this_cpu_read(tick_cpu_device.evtdev); >>>>struct tick_sched *ts = this_cpu_ptr(&tick_cpu_sched); >>>> >>>> + ts->sleep_length = ktime_sub(dev->next_event, ktime_get()); >>>> + >>>>return ts->sleep_length; >>>> } >>>> >>>> >>> >>> I probably wouldn't do it this way ... >>> >>> >> >> May I know the detailed thoughts? > > That depends on the answer above. :-) Does the above explanation address the concern? Thanks, -Aubrey
Re: [RFC PATCH v2 6/8] cpuidle: make fast idle threshold tunable
On 2017/10/17 8:01, Rafael J. Wysocki wrote: > On Monday, October 16, 2017 8:00:45 AM CEST Li, Aubrey wrote: >> On 2017/10/14 8:59, Rafael J. Wysocki wrote: >>> On Saturday, September 30, 2017 9:20:32 AM CEST Aubrey Li wrote: >>>> Add a knob to make fast idle threshold tunable >>>> >>>> Signed-off-by: Aubrey Li >>> >>> I first of all am not sure about the need to add a tunable for this at all >>> in the first place. >> >> Actually I think a fixed value(10) might be good enough but not quite sure >> if there is a requirement to tune it for different scenario, for example even >> if the predicted idle interval is 100x overhead, I still want a fast path for >> a better benchmark score? > > Any new tunables make the test matrix expand considerably, so it generally is > better to err on the conservative side with adding them. > Okay, it's fine for me without it. I'll remove in the next version. Thanks, -Aubrey
Re: [RFC PATCH v2 2/8] cpuidle: record the overhead of idle entry
On 2017/10/17 8:05, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 5:11:57 AM CEST Li, Aubrey wrote:
>> On 2017/10/14 8:35, Rafael J. Wysocki wrote:
>>> On Saturday, September 30, 2017 9:20:28 AM CEST Aubrey Li wrote:
>>>> Record the overhead of idle entry in micro-second
>>>>
>>>
>>> What is this needed for?
>>
>> We need to figure out how long an idle period has to be before it counts
>> as a short idle, and recording the overhead is for this purpose. The
>> short-idle threshold is based on this overhead.
>
> I don't really understand this statement.
>
> Pretend I'm not familiar with this stuff and try to explain it to me. :-)

Okay, let me try. :-) Today, what we do in the idle loop is as follows:

do_idle {
	idle_entry {
		- deferrable stuff like quiet_vmstat
		- turn off tick (without looking at the historical/predicted idle interval)
		- rcu idle enter, c-state selection, etc
	}
	idle_call {
		- poll or halt or mwait
	}
	idle_exit {
		- rcu idle exit
		- restore the tick if the tick was stopped before entering idle
	}
}

And we have already measured that idle_entry and idle_exit cost several microseconds, say 10us.

Now if idle_call is 1000us, much larger than idle_entry and idle_exit, we can ignore the time spent in idle_entry and idle_exit. But for some workloads with a short-idle pattern, like netperf, idle_call is 2us, and then idle_entry and idle_exit start to dominate. If we can reduce the time spent in idle_entry and idle_exit, we get significantly better workload performance. Modern high-speed networking and low-latency I/O like NVMe disks have this requirement.

Mike's patch was made several years ago, though I don't know the details. Here is a related article:
https://cacm.acm.org/magazines/2017/4/215032-attack-of-the-killer-microseconds/fulltext

>>>
>>>> +void cpuidle_entry_end(void)
>>>> +{
>>>> +	struct cpuidle_device *dev = cpuidle_get_device();
>>>> +	u64 overhead;
>>>> +	s64 diff;
>>>> +
>>>> +	if (dev) {
>>>> +		dev->idle_stat.entry_end = local_clock();
>>>> +		overhead = div_u64(dev->idle_stat.entry_end -
>>>> +			dev->idle_stat.entry_start, NSEC_PER_USEC);
>>>
>>> Is the conversion really necessary?
>>>
>>> If so, then why?
>>
>> We can choose between nanoseconds and microseconds. Given the workloads
>> that produce the short-idle pattern, I think microsecond resolution is
>> good enough for real workloads.
>>
>> Another reason is that the prediction from the idle governor is in
>> microseconds, so I convert it for comparison purposes.
>>>
>>> And if there is a good reason, what about using right shift to do
>>> an approximate conversion to avoid the extra division here?
>>
>> Sure, >> 10 works for me, as I don't think precision is a big deal here.
>>
>>>
>>>> +		diff = overhead - dev->idle_stat.overhead;
>>>> +		dev->idle_stat.overhead += diff >> 3;
>>>
>>> Can you please explain what happens in the two lines above?
>>
>> It's an online averaging algorithm, stolen from update_avg() @
>> kernel/sched/core.c.
>
> OK
>
> Maybe care to add a comment to that effect?

Sure, I'll add in the next version.

Thanks,
-Aubrey
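Where the "say 10us" comes from, roughly: local_clock() is sampled at the top of the idle_entry work and again just before the low-power call, and the difference feeds the 1/8-weight running average discussed earlier in this thread. A sketch, assuming the idle_stat fields from patch 2/8 and a cpuidle_entry_start() counterpart in the series:

void cpuidle_entry_start(void)
{
	struct cpuidle_device *dev = cpuidle_get_device();

	if (dev)
		dev->idle_stat.entry_start = local_clock();
}

void cpuidle_entry_end(void)
{
	struct cpuidle_device *dev = cpuidle_get_device();

	if (!dev)
		return;

	dev->idle_stat.entry_end = local_clock();
	/*
	 * entry_end - entry_start is this pass's idle_entry cost in ns,
	 * which is then folded into dev->idle_stat.overhead.
	 */
}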
Re: [RFC PATCH v2 0/8] Introduct cpu idle prediction functionality
On 2017/10/17 8:07, Rafael J. Wysocki wrote:
> On Monday, October 16, 2017 9:44:41 AM CEST Li, Aubrey wrote:
>>
>> Or you concern why the threshold can't simply be tick interval?
>
> That I guess.
>
>> For the latter, if the threshold is close/equal to the tick, it's quite possible
>> the next event is the tick and no other else event.
>
> Well, I don't quite get that.
>
> What's the reasoning here?

If we set the threshold to the tick interval, then once the system keeps the tick on, it never gets a chance to turn it off again. Because the tick stays on, and the time of idle entry and idle exit is counted in, the idle interval always ends up < the tick interval, which means every idle is classified as a short idle.

Thanks,
-Aubrey
Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
On 2018/3/5 21:53, Peter Zijlstra wrote: > On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote: >> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra wrote: >>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote: > IOW, the target residency of the selected state doesn't tell you how much time you should expect to be idle in general. >>> >>> Right, but I think that measure isn't of primary relevance. What we want >>> to know is: 'should I stop the tick' and 'what C state do I go to'. I understood the benefit of mapping duration to state number, is duration <-> state number mapping a generic solution to all arches? Back to the user's concern is, "I'm running a latency sensitive application, and I want idle switching ASAP". So I think the user may not care about what C state to go into, that is, even if a deeper state has chance to go, the user striving for a higher workload score may still not want it? >>> >>> In order to answer those questions we need durations as input, but I >>> don't think we should preserve durations throughout. The scheme from the >>> above link reduces to N states in order to deal with arbitrary >>> distributions, only the actual states -- ie boundaries where our answers >>> changes -- are relevant, anything inside those boundaries would lead to >>> the exact same answer anyway. >> >> I generally agree here, but I'm not convinced about flagging the >> states, splitting them and so on. > > I think linking them like that makes sense, but I can see room for > discussion... > >> Maybe just return a "nohz" indicator from cpuidle_select() in addition >> to the state index and make the decision in the governor? > > Much better option than returning a duration :-) > So what does "nohz = disable and state index = deepest" mean? This combination does not make sense for performance only purpose? Thanks, -Aubrey
Re: [RFC/RFT][PATCH 4/7] cpuidle: menu: Split idle duration prediction from state selection
On 2018/3/6 16:45, Peter Zijlstra wrote: > On Tue, Mar 06, 2018 at 10:15:10AM +0800, Li, Aubrey wrote: >> On 2018/3/5 21:53, Peter Zijlstra wrote: >>> On Mon, Mar 05, 2018 at 02:05:10PM +0100, Rafael J. Wysocki wrote: >>>> On Mon, Mar 5, 2018 at 1:50 PM, Peter Zijlstra >>>> wrote: >>>>> On Mon, Mar 05, 2018 at 12:47:23PM +0100, Rafael J. Wysocki wrote: >>> >>>>>> IOW, the target residency of the selected state doesn't tell you how >>>>>> much time you should expect to be idle in general. >>>>> >>>>> Right, but I think that measure isn't of primary relevance. What we want >>>>> to know is: 'should I stop the tick' and 'what C state do I go to'. >> >> I understood the benefit of mapping duration to state number, is duration <-> >> state number mapping a generic solution to all arches? > > Yes, all platforms have a limited set of possible idle states. > >> Back to the user's concern is, "I'm running a latency sensitive application, >> and >> I want idle switching ASAP". So I think the user may not care about what C >> state >> to go into, that is, even if a deeper state has chance to go, the user >> striving >> for a higher workload score may still not want it? > > The user caring about performance very much cares about the actual idle > state too, exit latency for deeper states is horrific and will screw > them up just as much as the whole nohz timer reprogramming does. > > We can basically view the whole nohz thing as an additional entry/exit > latency for the idle state, which is why I don't think its weird to > couple them. okay, I see your point now. Thanks! > >>>> Maybe just return a "nohz" indicator from cpuidle_select() in addition >>>> to the state index and make the decision in the governor? >>> >>> Much better option than returning a duration :-) >>> >> So what does "nohz = disable and state index = deepest" mean? This >> combination >> does not make sense for performance only purpose? > > I tend to agree with you that the state space allowed by a separate > variable is larger than required, but it's significantly smaller than > preserving 'time' so I can live with it. >
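One possible shape of the "return a nohz indicator in addition to the state index" idea is sketched below; the stop_tick parameter name and the changed ->select() signature are assumptions, not the final API. The idle loop would then consult *stop_tick before deciding whether to enter the nohz path at all.

int cpuidle_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
		   bool *stop_tick)
{
	/*
	 * The governor picks a state and also reports whether stopping
	 * the tick is expected to pay off for the predicted idle duration.
	 */
	return cpuidle_curr_governor->select(drv, dev, stop_tick);
}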
Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/11/18 22:03, Samuel Neves wrote: > On 11/17/18 12:36 AM, Li, Aubrey wrote: >> On 2018/11/17 7:10, Dave Hansen wrote: >>> Just to be clear: there are 3 AVX-512 XSAVE states: >>> >>> XFEATURE_OPMASK, >>> XFEATURE_ZMM_Hi256, >>> XFEATURE_Hi16_ZMM, >>> >>> I honestly don't know what XFEATURE_OPMASK does. It does not appear to >>> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't >>> looking too great). > > XFEATURE_OPMASK refers to the additional 8 mask registers used in > AVX512. These are more similar to general purpose registers than > vector registers, and should not be too relevant here. > >>> >>> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the >>> registers ZMM0-ZMM15. Those are AVX-512-only registers. The only way >>> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions. >>> >>> XFEATURE_Hi16_ZMM is the same. The only way to get state in there is >>> with AVX512 instructions. >>> >>> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and >>> XFEATURE_Hi16_ZMM. That's without question. >> >> No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's less >> interested to us. >> > > I think Dave is right, and it's easy enough to check this. See the > attached program. For the "high current" instruction vpmuludq > operating on zmm0--zmm3 registers, we have (on a Skylake-SP Xeon Gold > 5120) > >175,097 core_power.lvl0_turbo_license:u > ( +- 2.18% ) > 41,185 core_power.lvl1_turbo_license:u > ( +- 1.55% ) > 83,928,648 core_power.lvl2_turbo_license:u > ( +- 0.00% ) > > while for the same code operating on zmm28--zmm31 registers, we have > >163,507 core_power.lvl0_turbo_license:u > ( +- 6.85% ) > 47,390 core_power.lvl1_turbo_license:u > ( +- 12.25% ) > 83,927,735 core_power.lvl2_turbo_license:u > ( +- 0.00% ) > > In other words, the register index does not seem to matter at all for > turbo license purposes (this makes sense, considering these chips have > 168 vector registers internally; zmm15--zmm31 are simply newly exposed > architectural registers). > > We can also see that XFEATURE_Hi16_ZMM does not imply license 1 or 2; > we may be using xmm15--xmm31 purely for the convenient extra register > space. For example, cases 4 and 5 of the sample program: > > 84,064,239 core_power.lvl0_turbo_license:u > ( +- 0.00% ) > 0 core_power.lvl1_turbo_license:u > 0 core_power.lvl2_turbo_license:u > > 84,060,625 core_power.lvl0_turbo_license:u > ( +- 0.00% ) > 0 core_power.lvl1_turbo_license:u > 0 core_power.lvl2_turbo_license:u > > So what's most important is the width of the vectors being used, not > the instruction set or the register index. Second to that is the > instruction type, namely whether those are "heavy" instructions. > Neither of these things can be accurately captured by the XSAVE state. > okay, in terms of license 2 we only care about, AVX512 is a requirement. Do we have any exception of non-AVX512 producing license 2? If no, I'm gonna use #define XFEATURE_MASK_AVX512(XFEATURE_MASK_OPMASK \ | XFEATURE_MASK_ZMM_Hi256 \ | XFEATURE_MASK_Hi16_ZMM) to expose AVX512 component usage. Although AVX512 is not a sufficient condition of license 2, but the usage could be an useful hint to the user tool to further check PMU counter. (That is, if the hint is zero, no need further check). What do you think? Thanks, -Aubrey
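A sketch of the hint being proposed here: any of the three AVX-512 components leaving the init state marks the task, and user space can then confirm actual license-2 activity with the PMU events. task_used_avx512() is an illustrative helper name, not part of the patch.

#define XFEATURE_MASK_AVX512	(XFEATURE_MASK_OPMASK | \
				 XFEATURE_MASK_ZMM_Hi256 | \
				 XFEATURE_MASK_Hi16_ZMM)

static inline bool task_used_avx512(struct fpu *fpu)
{
	/* Non-init AVX-512 state => the task has executed AVX-512 code. */
	return fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512;
}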
Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/11/18 22:03, Samuel Neves wrote: > On 11/17/18 12:36 AM, Li, Aubrey wrote: >> On 2018/11/17 7:10, Dave Hansen wrote: >>> Just to be clear: there are 3 AVX-512 XSAVE states: >>> >>> XFEATURE_OPMASK, >>> XFEATURE_ZMM_Hi256, >>> XFEATURE_Hi16_ZMM, >>> >>> I honestly don't know what XFEATURE_OPMASK does. It does not appear to >>> be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't >>> looking too great). > > XFEATURE_OPMASK refers to the additional 8 mask registers used in > AVX512. These are more similar to general purpose registers than > vector registers, and should not be too relevant here. > >>> >>> But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the >>> registers ZMM0-ZMM15. Those are AVX-512-only registers. The only way >>> to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions. >>> >>> XFEATURE_Hi16_ZMM is the same. The only way to get state in there is >>> with AVX512 instructions. >>> >>> So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and >>> XFEATURE_Hi16_ZMM. That's without question. >> >> No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's less >> interested to us. >> > > I think Dave is right, and it's easy enough to check this. See the > attached program. For the "high current" instruction vpmuludq > operating on zmm0--zmm3 registers, we have (on a Skylake-SP Xeon Gold > 5120) > >175,097 core_power.lvl0_turbo_license:u > ( +- 2.18% ) > 41,185 core_power.lvl1_turbo_license:u > ( +- 1.55% ) > 83,928,648 core_power.lvl2_turbo_license:u > ( +- 0.00% ) > > while for the same code operating on zmm28--zmm31 registers, we have > >163,507 core_power.lvl0_turbo_license:u > ( +- 6.85% ) > 47,390 core_power.lvl1_turbo_license:u > ( +- 12.25% ) > 83,927,735 core_power.lvl2_turbo_license:u > ( +- 0.00% ) > > In other words, the register index does not seem to matter at all for > turbo license purposes (this makes sense, considering these chips have > 168 vector registers internally; zmm15--zmm31 are simply newly exposed > architectural registers). > > We can also see that XFEATURE_Hi16_ZMM does not imply license 1 or 2; > we may be using xmm15--xmm31 purely for the convenient extra register > space. For example, cases 4 and 5 of the sample program: > > 84,064,239 core_power.lvl0_turbo_license:u > ( +- 0.00% ) > 0 core_power.lvl1_turbo_license:u > 0 core_power.lvl2_turbo_license:u > > 84,060,625 core_power.lvl0_turbo_license:u > ( +- 0.00% ) > 0 core_power.lvl1_turbo_license:u > 0 core_power.lvl2_turbo_license:u > Thanks for your program, Samuel, it's very helpful. But I saw a different output on my side, May I have your glibc version? Thanks, -Aubrey > So what's most important is the width of the vectors being used, not > the instruction set or the register index. Second to that is the > instruction type, namely whether those are "heavy" instructions. > Neither of these things can be accurately captured by the XSAVE state. > >>> >>> It's probably *possible* to run AVX512 instructions by loading state >>> into the YMM register and then executing AVX512 instructions that only >>> write to memory and never to register state. That *might* allow >>> XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but >>> for the frequency to be affected since AVX512 instructions _are_ >>> executing. But, there's no way to detect this situation from XSAVE >>> states themselves. >>> >> >> Andi should have more details on this. 
FWICT, not all AVX512 instructions >> has high current, those only touching memory do not cause notable frequency >> drop. > > According to section 15.26 of the Intel optimization reference manual, > "heavy" instructions consist of floating-point and integer > multiplication. Moves, adds, logical operations, etc, will request at > most turbo license 1 when operating on zmm registers. > >> >> Thanks, >> -Aubrey >>
Re: [PATCH v3 2/2] proc: add /proc//arch_state
On 2018/11/20 1:39, Peter Zijlstra wrote:
> On Thu, Nov 15, 2018 at 07:00:07AM +0800, Aubrey Li wrote:
>> Add a /proc//arch_state interface to expose per-task cpu specific
>> state values.
>>
>> Exposing AVX-512 Hi16_ZMM registers usage is for the user space job
>> scheduler to cluster AVX-512 using tasks together, because these tasks
>> could cause core turbo frequency drop.
>
> I still don't much like the name; how about arch_simd_state ?

My intention is:

  # cat /proc//arch_state
  0 1 0

Here the first "0" denotes simd_state, the second "1" denotes another cpu-specific feature (which may come soon), and the format can be extended. But sure, I can limit it to simd_state and change the name in the patch.

> Also,
> since we're printing an integer, I still prefer we go print the turbo
> license level. I know level 1 isn't too interesting atm, but consider
> future hardware widening the thing again and us growing level 3 or
> something.

The problem is that, FWICT, the bits in the XSAVE buffer are not exactly mapped to the turbo license levels, so we can't print the turbo license level correctly; or else the first patch needs rework for that.

>
> Also; you were going to shop around with the other architectures to see
> what they want/need for this interface. I see nothing on that.
>

I'm open to your suggestions, :)

Thanks,
-Aubrey
Re: [PATCH v3 2/2] proc: add /proc//arch_state
On 2018/11/21 17:53, Peter Zijlstra wrote: > On Wed, Nov 21, 2018 at 09:19:36AM +0100, Peter Zijlstra wrote: >> On Wed, Nov 21, 2018 at 09:39:00AM +0800, Li, Aubrey wrote: >>>> Also; you were going to shop around with the other architectures to see >>>> what they want/need for this interface. I see nothing on that. >>>> >>> I'm open for your suggestion, :) >> >> Well, we have linux-arch and the various maintainers are also listed in >> MAINTAINERS. Go forth and ask.. > > Ok, so I googled a wee bit (you could have too). > > There's not that many architectures that build big hot chips > (powerpc,x86,arm64,s390) (mips, sparc64 and ia64 are pretty dead I > think, although the Fujitsu Sparc M10 X+/X SIMD looked like it could be > 'fun'). > > Of those, powerpc altivec doesn't seem to be very wide, but you'd have > to ask the power folks. Same for s390 z13. > > The Fujitsu/ARM64-SVE stuff looks like it can be big and hot. > > And RISC-V has was vector extention, but I don't think anybody is > actually building big hot versions of that just yet. > Thanks Peter. Add more maintainers here. On some x86 architectures, the tasks using simd instruction(AVX512 particularly) need to be dealt with specially against the tasks not using simd instruction. I proposed an interface to expose such CPU specific information for the user space tools to apply different scheduling policies. The interface can be refined to be the format as /proc//status. Not sure if it's useful to any other architectures. Welcome any comments. Thanks, -Aubrey
Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/11/15 23:40, Dave Hansen wrote: > On 11/14/18 3:00 PM, Aubrey Li wrote: >> AVX-512 component has 3 states, only Hi16_ZMM state causes notable >> frequency drop. Add per task Hi16_ZMM state tracking to context switch. > > Just curious, but is there any public documentation of this? It seems > really odd to me that something using the same AVX-512 instructions on > some low-numbered registers would behave differently than the same > instructions on some high-numbered registers. I'm not saying this is > wrong, but it's certainly counter-intuitive and I think that begs for > some more explanation. Yes, Intel 64 and IA-32 Architectures software developer's Manual mentioned this in performance event CORE_POWER.LVL2_TURBO_LICENSE. "Core cycles where the core was running with power delivery for license level 2 (introduced in Skylake Server microarchitecture). This includes high current AVX 512-bit instructions." I translated license level 2 to frequency drop. Thanks, -Aubrey
Re: [PATCH v3 2/2] proc: add /proc//arch_state
On 2018/11/15 23:18, Dave Hansen wrote: > On 11/14/18 3:00 PM, Aubrey Li wrote: >> +void arch_thread_state(struct seq_file *m, struct task_struct *task) >> +{ >> +/* >> + * Report AVX-512 Hi16_ZMM registers usage >> + */ >> +if (task->thread.fpu.hi16zmm_usage) >> +seq_putc(m, '1'); >> +else >> +seq_putc(m, '0'); >> +seq_putc(m, '\n'); >> +} > > Am I reading this right that this file just dumps out a plain 0 or 1 > based on the internal kernel state? BTW, there's no example of the > output in the changelog, so it's rather hard to tell if my guess is > right. (Hint, hint). > > If so, I'd really prefer you not do this. We have /proc/$pid/stat to > stand as a disaster in this regard. It is essentially > non-human-readable gibberish because it's impossible to tell what the > values mean without a decoder ring. Yes, I'm following /proc/$pid/stat format, as I think this interface is not for the end user, but for developer and user space job scheduler. So I guess this style might be okay. > > If we go down this road, we need a file along the lines of > /proc/$pid/status. I checked /proc/$pid/status, all common information to architectures. That's why I want to open a new interface to CPU specific state. > > But, either way, this is a new ABI that we need to consider carefully. > It needs documentation. For instance, will this really mean "Hi16_ZMM > user" from now until the end of time? Or, does it just mean "group me > with other tasks that have this bit set"? > I'm open to this interface. Let's wait to see if there are more comments and suggestions. Thanks, -Aubrey
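A sketch of what a /proc/$pid/status-style, self-describing line could look like instead of the bare 0/1. The field name is an assumption for illustration; hi16zmm_usage is the field added by patch 1/2.

void arch_thread_state(struct seq_file *m, struct task_struct *task)
{
	/* Named key/value output, following the /proc/$pid/status style. */
	seq_printf(m, "AVX512_Hi16_ZMM_in_use:\t%d\n",
		   task->thread.fpu.hi16zmm_usage ? 1 : 0);
}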
Re: [PATCH v3 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/11/17 7:10, Dave Hansen wrote: > On 11/15/18 4:21 PM, Li, Aubrey wrote: >> "Core cycles where the core was running with power delivery for license >> level 2 (introduced in Skylake Server microarchitecture). This includes >> high current AVX 512-bit instructions." >> >> I translated license level 2 to frequency drop. > > BTW, the "high" in that text: "high-current AVX 512-bit instructions" is > talking about high-current, not "high ... instructions" or high-numbered > registers. I think that might be the source of some of the confusion > about which XSAVE state needs to be examined. > > Just to be clear: there are 3 AVX-512 XSAVE states: > > XFEATURE_OPMASK, > XFEATURE_ZMM_Hi256, > XFEATURE_Hi16_ZMM, > > I honestly don't know what XFEATURE_OPMASK does. It does not appear to > be affected by VZEROUPPER (although VZEROUPPER's SDM documentation isn't > looking too great). > > But, XFEATURE_ZMM_Hi256 is used for the upper 256 bits of the > registers ZMM0-ZMM15. Those are AVX-512-only registers. The only way > to get data into XFEATURE_ZMM_Hi256 state is by using AVX512 instructions. > > XFEATURE_Hi16_ZMM is the same. The only way to get state in there is > with AVX512 instructions. > > So, first of all, I think you *MUST* check XFEATURE_ZMM_Hi256 and > XFEATURE_Hi16_ZMM. That's without question. No, XFEATURE_ZMM_Hi256 does not request turbo license 2, so it's less interested to us. > > It's probably *possible* to run AVX512 instructions by loading state > into the YMM register and then executing AVX512 instructions that only > write to memory and never to register state. That *might* allow > XFEATURE_Hi16_ZMM and XFEATURE_ZMM_Hi256 to stay in the init state, but > for the frequency to be affected since AVX512 instructions _are_ > executing. But, there's no way to detect this situation from XSAVE > states themselves. > Andi should have more details on this. FWICT, not all AVX512 instructions has high current, those only touching memory do not cause notable frequency drop. Thanks, -Aubrey
Re: [RFC PATCH v2 1/2] x86/fpu: detect AVX task
On 2018/11/9 19:21, Thomas Gleixner wrote: > Aubrey, > > On Thu, 8 Nov 2018, Aubrey Li wrote: > >> Subject: x86/fpu: detect AVX task > > What is an AVX task? I know what you mean, but for the casual reader this > is not very informative. So something like: > > x86/fpu: Track AVX usage of tasks > > would be more informative and precise. The mechanism you add is tracking > the state and not just detecting in. > >> XSAVES and its variants use init optimization to reduce the amount of >> data that they save to memory during context switch. Init optimization >> uses the state component bitmap to denote if a component is in its init >> configuration. We use this information to detect if a task contains AVX >> instructions. > > Please avoid 'We use..'. Changelogs should be factual and precise and not > be written as first-person narrative. > > Aside of that, you very well explained how XSAVES optimization works, but > that's only a small part of the picture. The changelog should also contain > a justification why this is necessary in the first place along with more > missing bits and pieces. And please use paragraphs to structure the > changelog instead of having one big lump. > > Something like this: > > User space tools which do automated task placement need information about > AVX usage of tasks, because of > > The extended control register (XCR) allows access to the XINUSE > state-component bitmap, which allows software to discover the state of > the init optimization used by XSAVEOPT and XSAVES. Set bits in the bitmap > denote the usage of AVX, SIMD, FPU and other components. The XSAVE > variants store only the state of used components to speed up the > operation. > > The XGETBV instruction, if supported, allows software to read the > state-component bitmap and use this information to detect the usage of > the tracked components. > > Add accessor functions and per task state tracking to context switch. > > The tracking turns on the usage flag immediately, but requires 3 > consecutive context switches with no usage to clear it. This decay is > required because of > > You surely had all this information in your head when writing the code and > the changelog, so you could have spared me to dig all this out of the SDM. Thanks Thomas, will try to refine in the next version. > >> +/* >> + * This function is called during context switch to update AVX component >> state >> + */ >> +static inline void update_avx_state(struct avx_state *avx) >> +{ >> +/* >> + * Check if XGETBV with ECX = 1 supported. XGETBV with ECX = 1 >> + * returns the logical-AND of XCR0 and XINUSE. XINUSE is a bitmap >> + * by which the processor tracks the status of various components. >> + */ >> +if (!use_xgetbv1()) { >> +avx->state = 0; > > avx->state should be initialized to 0 when the task starts, so why > does it need to be cleared when XGETBV is not supported? will fix in the next version. > > Also this is the only usage site of use_xgetbv1(). So please open code it > here and avoid the extra inline which does not provide any extra value in > this case. will fix in the next version > >> +return; >> +} >> +/* >> + * XINUSE is dynamic to track component state because VZEROUPPER >> + * happens on every function end and reset the bitmap to the >> + * initial configuration. >> + * >> + * State decay is introduced to solve the race condition between >> + * context switch and a function end. 
State is aggressively set >> + * once it's detected but need to be cleared by decay 3 context >> + * switches > > I have no idea what _the_ race condition between context switch and a > function end is about. Probably most readers wont have. > VZEROUPPER instruction resets the init state. If context switch happens to occur exactly after VZEROUPPER instruction, XINUSE bitmap is empty(all zeros), which indicates the task is not using AVX. That's why the state decay count is used here. >> + */ >> +if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) { > > In the changelog and also in the code you say AVX (avx->state), but this > actually looks only for Hi16_ZMM state, which are the upper 16 AVX512 > registers. > > So again, this wants to be documented in the changelog along with an > explanation WHY this check is restricted to Hi16_ZMM state. And please > rename the variable accordingly. This is confusing at best. okay, I'll try to address this in the next version. > >> +avx->state = 1; >> +avx->decay_count = AVX_STATE_DECAY_COUNT; >> +} else { >> +if (avx->decay_count) >> +avx->decay_count--; >> +else >> +avx->state = 0; >> +} > > Is there a reason why you need two variables for this? > > if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) > tsk-
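The single-variable variant Thomas seems to be hinting at could look roughly like this: the decay count itself encodes "recently used", so no separate state flag is needed. Names follow the patch where possible; treat the rest as illustration.

static inline void update_avx_state(struct fpu *fpu)
{
	if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM)
		fpu->avx_decay_count = AVX_STATE_DECAY_COUNT;
	else if (fpu->avx_decay_count)
		fpu->avx_decay_count--;
}

/* "Is this an AVX-512 task?" then simply becomes: */
static inline bool task_using_avx512(struct fpu *fpu)
{
	return fpu->avx_decay_count != 0;
}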
Re: [RFC PATCH v2 1/2] x86/fpu: detect AVX task
Hi Dave, Thanks for your comments! On 2018/11/12 10:32, Dave Hansen wrote: > On 11/7/18 9:16 AM, Aubrey Li wrote: >> XSAVES and its variants use init optimization to reduce the amount of >> data that they save to memory during context switch. Init optimization >> uses the state component bitmap to denote if a component is in its init >> configuration. We use this information to detect if a task contains AVX >> instructions. > > I'm a little uncomfortable with changelogs like this. Someone might > read this and think that this patch precisely detects AVX instructions. > It would be great is we could make this more precise to say that this > patch detects if there is valid state in the AVX registers, *not* if the > task contains or uses AVX instructions. Our intention is to figure out if the task contains AVX instructions, this kind of task causes frequency drop and need attention when dispatched. If there is a valid state in the AVX registers, we can say the tasks contains AVX instructions, can't we? >> >> +#define AVX_STATE_DECAY_COUNT 3 > > How was this number chosen? What does this mean? > > It appears that this is saying that after 3 non-AVX-512-using context > switches, the task is not considered to be using AVX512 any more. That > seems a bit goofy because the context switch rate is highly dependent on > HZ, and on how often the task yields. > > Do we want this, or do we want something more time-based? > This counter is introduced here to solve the race of context switch and VZEROUPPER. 3 context switches mean the same thread is on-off CPU 3 times. Due to scheduling latency, 3 jiffies could only happen AVX task on-off just 1 time. So IMHO the context switches number is better here. > >> +/* >> + * This function is called during context switch to update AVX component >> state >> + */ >> +static inline void update_avx_state(struct avx_state *avx) >> +{ >> +/* >> + * Check if XGETBV with ECX = 1 supported. XGETBV with ECX = 1 >> + * returns the logical-AND of XCR0 and XINUSE. XINUSE is a bitmap >> + * by which the processor tracks the status of various components. >> + */ >> +if (!use_xgetbv1()) { >> +avx->state = 0; >> +return; >> +} > > This is a place where we have conflated the implementation in the CPU > and the logical operation that we are performing. > > In this case, it appears that we want to know whether AVX state > detection is available, but we're doing that with a function that's > apparently asking if the kernel is using XGETBV1. > > I'd really like if this looked like this: > > if (!have_avx_state_detect()) { > avx->state = 0; > return; > } > > Then, in have_avx_state_detect(), explain what XGETBV1 does. BTW, I > don't think we *totally* need to duplicate the SDM definitions in kernel > code for each instruction. It's fine to just say that it set 1 for > features not in the init state. > Thomas suggested open this inline since this is only usage of xgetbv1. So I'll use static_cpu_has() here directly. Does it sound good? >> +/* >> + * XINUSE is dynamic to track component state because VZEROUPPER >> + * happens on every function end and reset the bitmap to the >> + * initial configuration. > > This is confusing to me because VZEROUPPER is not apparently involved > here. What is this trying to say? > VZERROUPPER instruction in the task resets the bitmap. >> + * State decay is introduced to solve the race condition between >> + * context switch and a function end. 
State is aggressively set >> + * once it's detected but need to be cleared by decay 3 context >> + * switches >> + */ > > I'd probably say: > > AVX512-using processes frequently set AVX512 back to the init > state themselves. Thus, this detection mechanism can miss. > The decay ensures that false-negatives do not immediately make > a task be considered as not using AVX512. Thanks, will refine it in the next version. And yes, AVX512-using processoes set AVX512 back to the init state themselves by VZEROUPPER instructions. > >> +if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) { > > This is *just* an AVX512 state, right? That isn't reflected in any > comments or naming. Also, why just this state, and not any of the other > AVX512-related states? Only this state causes significant frequency drop, other states not. > > This is also precisely the kind of thing that would be nice to wrap up > in a tiny helper. > > if (avx512_in_use()) > > is much more self-documenting, for instance. Thanks, will refine it in the next version. > >> +avx->state = 1; > > I'm not a huge fan of this naming. Could this be: > > avx->had_avx_512_state = true; > >> +avx->decay_count = AVX_STATE_DECAY_COUNT; >> +} else { >> +if (avx->decay_count) >> +avx->dec
Re: [RFC PATCH v1 2/2] proc: add /proc//thread_state
On 2018/11/12 13:31, Ingo Molnar wrote: > > * Peter Zijlstra wrote: > >> On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote: >>> >>> * Aubrey Li wrote: >>> Expose the per-task cpu specific thread state value, it's helpful for userland to classify and schedule the tasks by different policies >>> >>> That's pretty vague - what exactly would use this information? I'm sure >>> you have a usecase in mind - could you please describe it? >> >> Yeah, "thread_state" is a pretty terrible name for this. The use-case is >> detectoring which tasks use AVX3 such that a userspace component (think >> job scheduler) can cluster them together. > > I'd prefer the kernel to do such clustering... > Some userland application projects like Kubernetes request an interface of such information, we could do the clustering either in the kernel or from userland, does it make sense we expose the information via the proposed interface first? Thanks, -Aubrey
Re: [RFC PATCH v2 1/2] x86/fpu: detect AVX task
On 2018/11/12 23:46, Dave Hansen wrote: > On 11/11/18 9:38 PM, Li, Aubrey wrote: > >>> Do we want this, or do we want something more time-based? >>> >> This counter is introduced here to solve the race of context switch and >> VZEROUPPER. 3 context switches mean the same thread is on-off CPU 3 times. >> Due to scheduling latency, 3 jiffies could only happen AVX task on-off just >> 1 time. So IMHO the context switches number is better here. > > Imagine we have a HZ=1000 system where AVX_STATE_DECAY_COUNT=3. That > means that a task can be marked as a non-AVX-512-user after not using it > for ~3 ms. But, with HZ=250, that's ~12ms. >From the other side, if we set a 4ms decay, when HZ=1000, context switch count is 4, that means, we have 4 times of chance to maintain the AVX state, that is, we are able to filter 4 times init state reset out. But if HZ = 250, the context switch is 1, we only have 1 time of chance to filter init state reset out. > > Also, don't forget that we have context switches from the timer > interrupt, but also from normal old operations that sleep. > > Let's say our AVX-512 app was doing: > > while (foo) { > do_avx_512(); > read(pipe, buf, len); > read(pipe, buf, len); > read(pipe, buf, len); > } > > And all three pipe reads context-switched the task. That loop could > finish in way under 3HZ, but still end up in do_avx_512() each time with > fpu...avx->state=0. Yeah, we are trying to address a prediction according to the historical pattern, so you always can make a pattern to beat the prediction pattern. But in practice, I measured tensorflow with AVX512 enabled, linpack with AVX512, and a micro benchmark, the current 3 context switches decay works well enough. > > BTW, I don't have a great solution for this. I was just pointing out > one of the pitfalls from using context switch counts so strictly. I really don't think time-based is better than the count in this case. >>>> +/* >>>> * Highest level per task FPU state data structure that >>>> * contains the FPU register state plus various FPU >>>> * state fields: >>>> @@ -303,6 +312,14 @@ struct fpu { >>>>unsigned char initialized; >>>> >>>>/* >>>> + * @avx_state: >>>> + * >>>> + * This data structure indicates whether this context >>>> + * contains AVX states >>>> + */ >>> >>> Yeah, that's precisely what fpu->state.xsave.xfeatures does. :) >>> I see, will refine in the next version > > One other thought about the new 'avx_state': > > fxregs_state (which is a part of the XSAVE state) has some padding and > 'sw_reserved' areas. You *might* be able to steal some space there. > Not that this is a huge space eater, but why waste the space if we don't > have to? > IMHO, I prefer not adding any extra thing into a data structure associated with a hardware table. Let me try to work out a new version to see if it can satisfy you. Thanks, -Aubrey
Re: [RFC PATCH v2 1/2] x86/fpu: detect AVX task
On 2018/11/13 18:25, David Laight wrote: > From: Li, Aubrey >> Sent: 12 November 2018 01:41 > ... >> VZEROUPPER instruction resets the init state. If context switch happens >> to occur exactly after VZEROUPPER instruction, XINUSE bitmap is empty(all >> zeros), which indicates the task is not using AVX. That's why the state >> decay count is used here. > > Isn't there an obvious optimisation to execute VZEROALL during system call > entry? I'm not aware of this in the kernel; maybe you are talking about some optimization in glibc? > If that is done does any of this actually work? The bitmap is checked during context switch, not at system call entry. Also, the flag is turned on immediately once usage is detected, but requires 3 *consecutive* context switches with no usage to clear, so it can filter most jitter patterns out. I measured tensorflow (with AVX512), linpack (with AVX512) and a micro benchmark, and this works properly. If you have an AVX512 workload, I'd like to know whether this works for it. Thanks, -Aubrey > > David
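For reference, a rough, self-contained sketch of the XINUSE sampling that the decay above compensates for; the constant names are illustrative, and in the actual patch the availability of XGETBV with ECX=1 is checked first (use_xgetbv1()).

#include <stdint.h>

/*
 * XGETBV with ECX=1 returns XCR0 & XINUSE. VZEROUPPER puts the upper
 * AVX state back into its init configuration, so a single zero sample
 * taken at context switch is not proof the task stopped using AVX-512.
 */
#define XINUSE_LEAF		1
#define XFEATURE_MASK_HI16_ZMM	(1ULL << 7)	/* Hi16_ZMM component, per SDM */

static inline uint64_t xgetbv(uint32_t index)
{
	uint32_t eax, edx;

	/* xgetbv encoding: 0f 01 d0 */
	__asm__ volatile(".byte 0x0f, 0x01, 0xd0"
			 : "=a" (eax), "=d" (edx) : "c" (index));
	return eax | ((uint64_t)edx << 32);
}

/* Returns non-zero if the Hi16_ZMM component is currently in use. */
static int hi16_zmm_in_use(void)
{
	return !!(xgetbv(XINUSE_LEAF) & XFEATURE_MASK_HI16_ZMM);
}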
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/12/12 8:14, Arjan van de Ven wrote: > On 12/11/2018 3:46 PM, Li, Aubrey wrote: >> On 2018/12/12 1:18, Dave Hansen wrote: >>> On 12/10/18 4:24 PM, Aubrey Li wrote: >>>> The tracking turns on the usage flag at the next context switch of >>>> the task, but requires 3 consecutive context switches with no usage >>>> to clear it. This decay is required because well-written AVX-512 >>>> applications are expected to clear this state when not actively using >>>> AVX-512 registers. >>> >>> One concern about this: Given a HZ=1000 system, this means that the >>> flag needs to get scanned every ~3ms. That's a pretty good amount of >>> scanning on a system with hundreds or thousands of tasks running around. >>> >>> How many tasks does this scale to until you're eating up an entire CPU >>> or two just scanning /proc? >>> >> >> Do we have a real requirement to do this in practical environment? >> AFAIK, 1s or even 5s is good enough in some customers environment. > > maybe instead of a 1/0 bit, it's useful to store the timestamp of the last > time we found the task to use avx? (need to find a good time unit) > > Are you suggesting that the kernel does no translation and just provides the raw fact to the user space tool, letting the user space tool decide how to use this info? So how would the user space tool use this timestamp in your mind? Thanks, -Aubrey
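One way a userspace tool could consume a raw last-use timestamp, should the kernel go the route Arjan suggests: compute the age at sampling time and classify the task as "recently used AVX-512" against a tool-chosen threshold. A hypothetical userspace sketch; the proc path layout and the "key: value" field it reads are placeholders, not an ABI.

#include <stdio.h>

/* Hypothetical: returns 1 if the task used AVX-512 within threshold_ms. */
static int task_is_avx512_hot(const char *proc_path, long threshold_ms)
{
	FILE *f = fopen(proc_path, "r");
	char key[64];
	long elapsed_ms = -1;

	if (!f)
		return 0;
	/* Assume one "key: value" line reporting ms since last observed use. */
	if (fscanf(f, "%63[^:]: %ld", key, &elapsed_ms) != 2)
		elapsed_ms = -1;
	fclose(f);

	return elapsed_ms >= 0 && elapsed_ms <= threshold_ms;
}

A job scheduler could then cluster tasks for which this returns 1, without the kernel having to pick a decay interval at all.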
Re: [RESEND PATCH v5 1/3] x86/fpu: track AVX-512 usage of tasks
On 2018/12/18 16:33, Thomas Gleixner wrote: > Aubrey, > > On Tue, 18 Dec 2018, Aubrey Li wrote: > > RESEND > > Please don't do that. This is not a resend because you changed something, > so it's new version. Usually I ignore resends when I have the original > submission already lined up for review. oh, okay, I'll send another version, sorry for the trouble... Thanks, -Aubrey
Re: [PATCH v6 1/3] x86/fpu: track AVX-512 usage of tasks
On 2018/12/18 22:14, Thomas Gleixner wrote: > On Tue, 18 Dec 2018, Aubrey Li wrote: >> diff --git a/arch/x86/include/asm/fpu/internal.h >> b/arch/x86/include/asm/fpu/internal.h >> index a38bf5a1e37a..8778ac172255 100644 >> --- a/arch/x86/include/asm/fpu/internal.h >> +++ b/arch/x86/include/asm/fpu/internal.h >> @@ -411,6 +411,13 @@ static inline int copy_fpregs_to_fpstate(struct fpu >> *fpu) >> { >> if (likely(use_xsave())) { >> copy_xregs_to_kernel(&fpu->state.xsave); >> + >> +/* >> + * AVX512 state is tracked here because its use is >> + * known to slow the max clock speed of the core. >> + */ >> +if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512) >> +fpu->avx512_timestamp = jiffies_64; > > Even if unlikely this is incorrect when running a 32 bit kernel because > there jiffies_64 cannot be atomically loaded vs. a concurrent update. See > the comment in include/linux/jiffies.h right above the jiffies_64 > declaration. > Yeah, I noticed this, but since this is under the use_xsave() condition and also requires valid AVX512 state, a 32-bit kernel won't enter this branch. Thanks, -Aubrey
Re: [PATCH v6 1/3] x86/fpu: track AVX-512 usage of tasks
On 2018/12/18 23:32, Thomas Gleixner wrote: > On Tue, 18 Dec 2018, Li, Aubrey wrote: > >> On 2018/12/18 22:14, Thomas Gleixner wrote: >>> On Tue, 18 Dec 2018, Aubrey Li wrote: >>>> diff --git a/arch/x86/include/asm/fpu/internal.h >>>> b/arch/x86/include/asm/fpu/internal.h >>>> index a38bf5a1e37a..8778ac172255 100644 >>>> --- a/arch/x86/include/asm/fpu/internal.h >>>> +++ b/arch/x86/include/asm/fpu/internal.h >>>> @@ -411,6 +411,13 @@ static inline int copy_fpregs_to_fpstate(struct fpu >>>> *fpu) >>>> { >>>>if (likely(use_xsave())) { >>>>copy_xregs_to_kernel(&fpu->state.xsave); >>>> + >>>> + /* >>>> + * AVX512 state is tracked here because its use is >>>> + * known to slow the max clock speed of the core. >>>> + */ >>>> + if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512) >>>> + fpu->avx512_timestamp = jiffies_64; >>> >>> Even if unlikely this is incorrect when running a 32 bit kernel because >>> there jiffies_64 cannot be atomically loaded vs. a concurrent update. See >>> the comment in include/linux/jiffies.h right above the jiffies_64 >>> declaration. >>> >> Yeah, I noticed this, because this is under use_xsave() condition, also need >> valid AVX512 state, so a 32 bit kernel won't enter this branch. > > What exactly prevents a 32bit kernel from having the AVX512 feature bit > set? And if it cannot be set on 32bit, then why are you compiling that code > in at all? > I misunderstood: you mean a 32-bit kernel, not a 32-bit machine. Theoretically a 32-bit kernel can use AVX512, but I'm not sure anyone uses it like this. get_jiffies_64() involves jiffies_lock operations, so it's not good in the context switch path; that's why I want to use raw jiffies_64 here. jiffies is a good candidate but it has a wraparound overflow issue, and other time sources are expensive here. Should I limit the code to run only on a 64-bit kernel? Thanks, -Aubrey
Re: [PATCH v6 1/3] x86/fpu: track AVX-512 usage of tasks
On 2018/12/19 1:14, Dave Hansen wrote: > On 12/18/18 7:32 AM, Thomas Gleixner wrote: >> What exactly prevents a 32bit kernel from having the AVX512 feature bit >> set? And if it cannot be set on 32bit, then why are you compiling that code >> in at all? > > There are three different AVX-512 states (and three bits) which Aubrey's > patch checks. All three have different rules. Here's a summary along > with some relevant SDM quotes from Vol1-13.6. > > Opmask state: All opmask registers can be set in 32-bit mode. > ZMM_Hi256 state: "An execution of XRSTOR or XRSTORS outside 64-bit mode > does not update ZMM8_H–ZMM15_H." This implies that > ZMM0_H-ZMM7_H *are* updated in 32-bit mode. > Hi16_ZMM state: "Outside 64-bit mode, Hi16_ZMM state is always in its >initial configuration." > > All of Hi16_ZMM and *part* of ZMM_Hi256 can not be practically used in > 32-bit mode. But, even using part of ZMM_Hi256 means the xfeature bit > will be set. > > So, 2/3 of the features can be used in 32-bit mode. Nothing that I can > find _prevents_ those features from being used in 32-bit mode. > > Aubrey, do you have information to the contrary? > Thanks Dave for the summary, similar info I have from SDM, so yes, 32bit kernel can have xfeature bits set.
Re: [PATCH v6 1/3] x86/fpu: track AVX-512 usage of tasks
On 2018/12/19 5:38, Andi Kleen wrote: >> I misunderstood, you mean 32bit kernel, not 32bit machine. Theoretically >> 32bit >> kernel can use AVX512, but not sure if anyone use it like this. >> get_jiffies_64() >> includes jiffies_lock ops so not good in context switch. So I want to use raw >> jiffies_64 here. jiffies is a good candidate but it has wraparound overflow >> issue. >> Other time source are expensive here. >> >> Should I limit the code only running on 64bit kernel? > > Yes making it 64bit only should be fine. > > Other alternative would be to use 32bit jiffies on 32bit. I assume > wrapping is not that big a problem here. > Thomas, is this acceptable? Thanks, -Aubrey
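Either option Andi mentions is essentially a one-liner at the spot Thomas flagged. A hedged sketch of both, reusing the avx512_timestamp field from the v6 patch; the wrapper function itself is invented for illustration and kernel context is assumed.

/*
 * Two illustrative ways to avoid torn jiffies_64 reads/writes on 32-bit:
 * either compile the tracking only for 64-bit kernels, or store plain
 * 'jiffies' (an unsigned long), accepting the shorter wraparound period.
 */
static inline void record_avx512_use(struct fpu *fpu)
{
#ifdef CONFIG_X86_64
	fpu->avx512_timestamp = jiffies_64;	/* 64-bit load/store is atomic here */
#else
	fpu->avx512_timestamp = jiffies;	/* word-sized, tear-free on 32-bit */
#endif
}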
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/12/12 1:18, Dave Hansen wrote: > On 12/10/18 4:24 PM, Aubrey Li wrote: >> The tracking turns on the usage flag at the next context switch of >> the task, but requires 3 consecutive context switches with no usage >> to clear it. This decay is required because well-written AVX-512 >> applications are expected to clear this state when not actively using >> AVX-512 registers. > > One concern about this: Given a HZ=1000 system, this means that the > flag needs to get scanned every ~3ms. That's a pretty good amount of > scanning on a system with hundreds or thousands of tasks running around. > > How many tasks does this scale to until you're eating up an entire CPU > or two just scanning /proc? > Do we have a real requirement to do this in a practical environment? AFAIK, a 1s or even 5s scanning interval is good enough in some customers' environments. Thanks, -Aubrey
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/12/12 1:20, Dave Hansen wrote: > to update AVX512 state >> + */ >> +static inline void update_avx512_state(struct fpu *fpu) >> +{ >> +/* >> + * AVX512 state is tracked here because its use is known to slow >> + * the max clock speed of the core. >> + * >> + * However, AVX512-using tasks are expected to clear this state when >> + * not actively using these registers. Thus, this tracking mechanism >> + * can miss. To ensure that false-negatives do not immediately show >> + * up, decay the usage count over time. >> + */ >> +if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512) >> +fpu->avx512_usage = AVX512_STATE_DECAY_COUNT; >> +else if (fpu->avx512_usage) >> +fpu->avx512_usage--; >> +} >> + >> /* >> * This function is called only during boot time when x86 caps are not set >> * up and alternative can not be used yet. >> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu) >> { >> if (likely(use_xsave())) { >> copy_xregs_to_kernel(&fpu->state.xsave); >> +update_avx512_state(fpu); >> return 1; >> } > > > Is there a reason we shouldn't do: > > if (!cpu_feature_enabled(X86_FEATURE_AVX512F)) > update_avx512_state(fpu); > > ? > Why _!_ ?
Re: [RFC PATCH v1 1/2] x86/fpu: detect AVX task
On 2018/11/8 1:41, Tim Chen wrote: > On 11/06/2018 10:23 AM, Aubrey Li wrote: > >> +static inline void update_avx_state(struct avx_state *avx) >> +{ >> +/* >> + * Check if XGETBV with ECX = 1 supported. XGETBV with ECX = 1 >> + * returns the logical-AND of XCR0 and XINUSE. XINUSE is a bitmap >> + * by which the processor tracks the status of various components. >> + */ >> +if (!use_xgetbv1()) { >> +avx->state = 0; >> +return; >> +} >> +/* >> + * XINUSE is dynamic to track component state because VZEROUPPER >> + * happens on every function end and reset the bitmap to the >> + * initial configuration. >> + * >> + * State decay is introduced to solve the race condition between >> + * context switch and a function end. State is aggressively set >> + * once it's detected but need to be cleared by decay 3 context >> + * switches >> + */ >> +if (xgetbv(XINUSE_STATE_BITMAP_INDEX) & XFEATURE_MASK_Hi16_ZMM) { >> +avx->state = 1; >> +avx->decay_count = AVX_STATE_DECAY_COUNT; >> +} else { >> +if (!avx->decay_count) > > Seems like the check should be > > if (avx->decay_count) > > as we decrement the decay_count if it is non-zero. Right, thanks to point this out, will fix in v2 soon. Thanks, -Aubrey > >> +avx->decay_count--; >> +else >> +avx->state = 0; >> +} >> +} > > Tim >
Re: [RFC PATCH v1 2/2] proc: add /proc/<pid>/thread_state
On 2018/11/8 18:17, Peter Zijlstra wrote: > On Thu, Nov 08, 2018 at 07:32:46AM +0100, Ingo Molnar wrote: >> >> * Aubrey Li wrote: >> >>> Expose the per-task cpu specific thread state value, it's helpful >>> for userland to classify and schedule the tasks by different policies >> >> That's pretty vague - what exactly would use this information? I'm sure >> you have a usecase in mind - could you please describe it? > > Yeah, "thread_state" is a pretty terrible name for this. task_struct has a CPU-specific element "thread"; I borrowed that name here to create a generic interface exposing CPU-specific state of the task. Like /proc/<pid>/stat, I plan to use this "thread_state" to host any CPU-specific state, including the AVX state (the only one for now) and any other states that may come later. So this interface may be extended in the future like the following: #cat /proc/<pid>/thread_state 1 0 0 > The use-case is > detectoring which tasks use AVX3 such that a userspace component (think > job scheduler) can cluster them together. > > The 'problem' is that running AVX2+ code drops the max clock, because > you light up the massive wide (and thus large area) paths. Thanks for explaining this. > > So maybe something like "simd_active" is a better name, dunno. As above, maybe the specific name can be hidden behind thread_state? > > Or maybe "simd_class" and we can write out 0,1,2,3 depending on the AVX > class being used, dunno. It might make sense to look at what other arch > SIMD stuff looks like to form this interface. > A task may use classes 1, 2 and 3 simultaneously; as a scheduler hint it's just "cluster or not", so 0/1 may be good enough. Thanks, -Aubrey
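To make the proposed interface a bit more concrete, a hypothetical sketch of the show routine behind such a file; the thread_state name and the 0/1 output follow the discussion, while the exact field (task->thread.fpu.avx.state) and the wiring into the proc infrastructure are assumptions, not the actual patch.

#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/pid_namespace.h>

/*
 * Hypothetical /proc/<pid>/thread_state show routine: one 0/1 flag per
 * tracked CPU-specific state, AVX being the only one so far. Further
 * flags could be appended to the same line later without breaking readers.
 */
static int proc_pid_thread_state(struct seq_file *m, struct pid_namespace *ns,
				 struct pid *pid, struct task_struct *task)
{
	seq_printf(m, "%d\n", task->thread.fpu.avx.state ? 1 : 0);
	return 0;
}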
[PATCH v3]PM/Sleep: Timer quiesce in freeze state
The patch is based on v3.18. Freeze is a general power saving state that processes are frozen, devices are suspended and CPUs are in idle state. However, when the system enters freeze state, there are a few timers keep ticking and hence consumes more power unnecessarily. The observed timer events in freeze state are: - tick_sched_timer - watchdog lockup detector - realtime scheduler period timer The system power consumption in freeze state will be reduced significantly if we quiesce these timers. The patch is tested on: - Sandybrdige-EP system, both RTC alarm and power button are able to wake the system up from freeze state. - HP laptop EliteBook 8460p, both RTC alarm and power button are able to wake the system up from freeze state. - Baytrail-T(ASUS_T100) platform, power button is able to wake the system up from freeze state. Suggested-by: Thomas Gleixner Signed-off-by: Aubrey Li Cc: Peter Zijlstra Cc: Rafael J. Wysocki Cc: Len Brown Cc: Alan Cox --- drivers/cpuidle/cpuidle.c | 13 include/linux/clockchips.h | 4 +++ include/linux/suspend.h | 4 +++ include/linux/timekeeping.h | 2 ++ kernel/power/suspend.c | 50 ++--- kernel/sched/idle.c | 45 ++ kernel/softirq.c| 5 +-- kernel/time/clockevents.c | 13 kernel/time/tick-common.c | 53 +++ kernel/time/tick-internal.h | 3 ++ kernel/time/timekeeping.c | 77 + 11 files changed, 243 insertions(+), 26 deletions(-) diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index 125150d..b9a3ada 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -20,6 +20,7 @@ #include #include #include +#include #include "cpuidle.h" @@ -119,6 +120,18 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, ktime_t time_start, time_end; s64 diff; + /* +* under the freeze scenario, the timekeeping is suspended +* as well as the clock source device, so we bypass the idle +* counter update in freeze idle +*/ + if (in_freeze()) { + entered_state = target_state->enter(dev, drv, index); + if (!cpuidle_state_is_coupled(dev, drv, entered_state)) + local_irq_enable(); + return entered_state; + } + trace_cpu_idle_rcuidle(index, dev->cpu); time_start = ktime_get(); diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h index 2e4cb67..d118e0b 100644 --- a/include/linux/clockchips.h +++ b/include/linux/clockchips.h @@ -18,6 +18,9 @@ enum clock_event_nofitiers { CLOCK_EVT_NOTIFY_BROADCAST_EXIT, CLOCK_EVT_NOTIFY_SUSPEND, CLOCK_EVT_NOTIFY_RESUME, + CLOCK_EVT_NOTIFY_FREEZE_PREPARE, + CLOCK_EVT_NOTIFY_FREEZE, + CLOCK_EVT_NOTIFY_UNFREEZE, CLOCK_EVT_NOTIFY_CPU_DYING, CLOCK_EVT_NOTIFY_CPU_DEAD, }; @@ -95,6 +98,7 @@ enum clock_event_mode { */ struct clock_event_device { void(*event_handler)(struct clock_event_device *); + void(*real_handler)(struct clock_event_device *); int (*set_next_event)(unsigned long evt, struct clock_event_device *); int (*set_next_ktime)(ktime_t expires, diff --git a/include/linux/suspend.h b/include/linux/suspend.h index 3388c1b..86a651c 100644 --- a/include/linux/suspend.h +++ b/include/linux/suspend.h @@ -203,6 +203,8 @@ extern void suspend_set_ops(const struct platform_suspend_ops *ops); extern int suspend_valid_only_mem(suspend_state_t state); extern void freeze_set_ops(const struct platform_freeze_ops *ops); extern void freeze_wake(void); +extern bool in_freeze(void); +extern bool idle_should_freeze(void); /** * arch_suspend_disable_irqs - disable IRQs for suspend @@ -230,6 +232,8 @@ static inline void suspend_set_ops(const struct platform_suspend_ops *ops) {} static inline int 
pm_suspend(suspend_state_t state) { return -ENOSYS; } static inline void freeze_set_ops(const struct platform_freeze_ops *ops) {} static inline void freeze_wake(void) {} +static inline bool in_freeze(void) { return false; } +static inline bool idle_should_freeze(void) { return false; } #endif /* !CONFIG_SUSPEND */ /* struct pbe is used for creating lists of pages that should be restored diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h index 1caa6b0..07957a9 100644 --- a/include/linux/timekeeping.h +++ b/include/linux/timekeeping.h @@ -5,6 +5,8 @@ void timekeeping_init(void); extern int timekeeping_suspended; +extern void timekeeping_freeze(void); +extern void timekeeping_unfreeze(void); /* * Get and set timeofday diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c index c347e3c..6467fb8 100644 --
Re: [PATCH v2 4/4] PMC driver: Add Cherrytrail PMC interface
On 2015/1/22 17:26, Andy Shevchenko wrote: > On Thu, 2015-01-22 at 12:02 +0800, Li, Aubrey wrote: >> On 2015/1/21 5:50, Andy Shevchenko wrote: >>> The patch adds CHT PMC interface. This exposes all the South IP device power >>> states and S0ix states for CHT. The bit map of FUNC_DIS and D3_STS_0 >>> registers >>> for SoCs are consistent. The D3_STS_1 and FUNC_DIS_2 registers, however, are >>> not aligned. This is fixed by splitting a common mapping on per register >>> basis. >>> >> Should we define the bit map table completely separate for different >> platforms? My concern is, when D3_STS_0 and FUNC_DIS becomes not >> consistent in a new SoC, the implementation in this patch has to be >> rewritten completely. >> >> Defining entire bit map table for different platform introduces >> reduplicated bit definitions, but when we add a new platform in future, >> we don't need to consider the existing platforms definition, and no need >> to change code structure any longer. >> >> Thoughts? >> > > But this what I did by introducing pmc_reg_map structure per SoC. > You may or may not use previous definitions. > okay, it makes sense to me. Thanks, -Aubrey
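For readers not familiar with the driver, the per-SoC direction agreed on here roughly means one register-map structure per platform. A hypothetical sketch; the struct and field names are invented for illustration, not the driver's actual layout.

/* A named bit within a PMC status/function-disable register. */
struct pmc_bit_map {
	const char *name;
	unsigned int bit;
};

/* Per-SoC register map: shared tables can be reused, divergent ones split. */
struct pmc_reg_map {
	const struct pmc_bit_map *d3_sts_0;	/* consistent across SoCs */
	const struct pmc_bit_map *func_dis;	/* consistent across SoCs */
	const struct pmc_bit_map *d3_sts_1;	/* differs between SoCs */
	const struct pmc_bit_map *func_dis_2;	/* differs between SoCs */
};

A new SoC then only needs its own tables plus a new pmc_reg_map instance; existing platform definitions stay untouched, which addresses the concern raised above.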
Re: [PATCH v3]PM/Sleep: Timer quiesce in freeze state
Hi Thomas, Thanks for the comments, my feedback below: On 2015/1/22 18:15, Thomas Gleixner wrote: > On Tue, 9 Dec 2014, Li, Aubrey wrote: >> diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h >> index 2e4cb67..d118e0b 100644 >> --- a/include/linux/clockchips.h >> +++ b/include/linux/clockchips.h >> @@ -18,6 +18,9 @@ enum clock_event_nofitiers { >> CLOCK_EVT_NOTIFY_BROADCAST_EXIT, >> CLOCK_EVT_NOTIFY_SUSPEND, >> CLOCK_EVT_NOTIFY_RESUME, >> +CLOCK_EVT_NOTIFY_FREEZE_PREPARE, >> +CLOCK_EVT_NOTIFY_FREEZE, >> +CLOCK_EVT_NOTIFY_UNFREEZE, > > Can we please stop adding more crap to that notifier thing? I rather > see that go away than being expanded. Are you referring to FREEZE_PREPARE or remove all of FREEZE staff at all? What's the disadvantage of adding more notifier? > >> @@ -95,6 +98,7 @@ enum clock_event_mode { >> */ >> struct clock_event_device { >> void(*event_handler)(struct clock_event_device *); >> +void(*real_handler)(struct clock_event_device *); > > This is really not the place to put this. We have the hotpath > functions together at the beginning of the struct and not some random > stuff used once in a while. Think about cache lines. okay, will move to the tail. > > A proper explanation why you need this at all is missing. > Thanks, will add some comments here. >> /* >> + * cpu idle freeze function >> + */ > > How is that comment helpful? > > Not at all. But its simpler to add pointless comments than to document > the important and non obvious stuff, right? okay. > >> +static void cpu_idle_freeze(void) >> +{ >> +struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices); >> +struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev); >> + >> +/* >> + * suspend tick, the last CPU suspend timekeeping >> + */ >> +clockevents_notify(CLOCK_EVT_NOTIFY_FREEZE, NULL); >> +/* >> + * idle loop here if idle_should_freeze() >> + */ >> +while (idle_should_freeze()) { >> +int next_state; >> +/* >> + * interrupt must be disabled before cpu enters idle >> + */ >> +local_irq_disable(); >> + >> +next_state = cpuidle_select(drv, dev); >> +if (next_state < 0) { >> +arch_cpu_idle(); >> +continue; >> +} >> +/* >> + * cpuidle_enter will return with interrupt enabled >> + */ >> +cpuidle_enter(drv, dev, next_state); > > How is that supposed to work? > > If timekeeping is not yet unfrozen, then any interrupt handling code > which calls anything time related is going to hit lala land. > > You must guarantee that timekeeping is unfrozen before any interrupt > is handled. If you cannot guarantee that, you cannot freeze > timekeeping ever. > > The cpu local tick device is less critical, but it happens to work by > chance, not by design. There are two way to guarantee this: the first way is, disable interrupt before timekeeping frozen and enable interrupt after timekeeping is unfrozen. However, we need to handle wakeup handler before unfreeze timekeeping to wake freeze task up from wait queue. So we have to go the other way, the other way is, we ignore time related calls during freeze, like what I added in irq_enter below. Or, we need to re-implement freeze wait and wake up mechanism? > >> void irq_enter(void) >> { >> rcu_irq_enter(); >> -if (is_idle_task(current) && !in_interrupt()) { >> +if (is_idle_task(current) && !in_interrupt() && !in_freeze()) { >> /* >> * Prevent raise_softirq from needlessly waking up ksoftirqd >> * here, as softirq will be serviced on return from interrupt. 
>> @@ -364,7 +365,7 @@ static inline void tick_irq_exit(void) >> >> /* Make sure that timer wheel updates are propagated */ >> if ((idle_cpu(cpu) && !need_resched()) || tick_nohz_full_cpu(cpu)) { >> -if (!in_interrupt()) >> +if (!in_interrupt() && !in_freeze()) >> tick_nohz_irq_exit(); > > We keep adding conditional over conditionals to the irq_enter/exit > hotpath. Can we please find some more intelligent solution for this? I need more time to consider this, will get back to you if I have an idea. > >> @@ -375,15 +386,21 @@ void tick_shutdown(unsigned int *cpup) >&g
Re: [PATCH v3]PM/Sleep: Timer quiesce in freeze state
Happy New Year, can you please take a look at this patch now? Thanks, -Aubrey On 2014/12/9 11:01, Li, Aubrey wrote: > The patch is based on v3.18. > > Freeze is a general power saving state that processes are frozen, devices > are suspended and CPUs are in idle state. However, when the system enters > freeze state, there are a few timers keep ticking and hence consumes more > power unnecessarily. The observed timer events in freeze state are: > - tick_sched_timer > - watchdog lockup detector > - realtime scheduler period timer > > The system power consumption in freeze state will be reduced significantly > if we quiesce these timers. > > The patch is tested on: > - Sandybrdige-EP system, both RTC alarm and power button are able to wake > the system up from freeze state. > - HP laptop EliteBook 8460p, both RTC alarm and power button are able to > wake the system up from freeze state. > - Baytrail-T(ASUS_T100) platform, power button is able to wake the system > up from freeze state. > > Suggested-by: Thomas Gleixner > Signed-off-by: Aubrey Li > Cc: Peter Zijlstra > Cc: Rafael J. Wysocki > Cc: Len Brown > Cc: Alan Cox > --- > drivers/cpuidle/cpuidle.c | 13 > include/linux/clockchips.h | 4 +++ > include/linux/suspend.h | 4 +++ > include/linux/timekeeping.h | 2 ++ > kernel/power/suspend.c | 50 ++--- > kernel/sched/idle.c | 45 ++ > kernel/softirq.c| 5 +-- > kernel/time/clockevents.c | 13 > kernel/time/tick-common.c | 53 +++ > kernel/time/tick-internal.h | 3 ++ > kernel/time/timekeeping.c | 77 > + > 11 files changed, 243 insertions(+), 26 deletions(-) > > diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c > index 125150d..b9a3ada 100644 > --- a/drivers/cpuidle/cpuidle.c > +++ b/drivers/cpuidle/cpuidle.c > @@ -20,6 +20,7 @@ > #include > #include > #include > +#include > > #include "cpuidle.h" > > @@ -119,6 +120,18 @@ int cpuidle_enter_state(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > ktime_t time_start, time_end; > s64 diff; > > + /* > + * under the freeze scenario, the timekeeping is suspended > + * as well as the clock source device, so we bypass the idle > + * counter update in freeze idle > + */ > + if (in_freeze()) { > + entered_state = target_state->enter(dev, drv, index); > + if (!cpuidle_state_is_coupled(dev, drv, entered_state)) > + local_irq_enable(); > + return entered_state; > + } > + > trace_cpu_idle_rcuidle(index, dev->cpu); > time_start = ktime_get(); > > diff --git a/include/linux/clockchips.h b/include/linux/clockchips.h > index 2e4cb67..d118e0b 100644 > --- a/include/linux/clockchips.h > +++ b/include/linux/clockchips.h > @@ -18,6 +18,9 @@ enum clock_event_nofitiers { > CLOCK_EVT_NOTIFY_BROADCAST_EXIT, > CLOCK_EVT_NOTIFY_SUSPEND, > CLOCK_EVT_NOTIFY_RESUME, > + CLOCK_EVT_NOTIFY_FREEZE_PREPARE, > + CLOCK_EVT_NOTIFY_FREEZE, > + CLOCK_EVT_NOTIFY_UNFREEZE, > CLOCK_EVT_NOTIFY_CPU_DYING, > CLOCK_EVT_NOTIFY_CPU_DEAD, > }; > @@ -95,6 +98,7 @@ enum clock_event_mode { > */ > struct clock_event_device { > void(*event_handler)(struct clock_event_device *); > + void(*real_handler)(struct clock_event_device *); > int (*set_next_event)(unsigned long evt, > struct clock_event_device *); > int (*set_next_ktime)(ktime_t expires, > diff --git a/include/linux/suspend.h b/include/linux/suspend.h > index 3388c1b..86a651c 100644 > --- a/include/linux/suspend.h > +++ b/include/linux/suspend.h > @@ -203,6 +203,8 @@ extern void suspend_set_ops(const struct > platform_suspend_ops *ops); > extern int suspend_valid_only_mem(suspend_state_t state); > extern void 
freeze_set_ops(const struct platform_freeze_ops *ops); > extern void freeze_wake(void); > +extern bool in_freeze(void); > +extern bool idle_should_freeze(void); > > /** > * arch_suspend_disable_irqs - disable IRQs for suspend > @@ -230,6 +232,8 @@ static inline void suspend_set_ops(const struct > platform_suspend_ops *ops) {} > static inline int pm_suspend(suspend_state_t state) { return -ENOSYS; } > static inline void freeze_set_ops(const struct platform_freeze_ops *ops) {} > static inline void freeze_wake(void) {} > +static inline b
Re: [PATCH v3]PM/Sleep: Timer quiesce in freeze state
On 2015/1/26 22:45, Rafael J. Wysocki wrote: > On Monday, January 26, 2015 03:15:43 PM Thomas Gleixner wrote: >> On Mon, 26 Jan 2015, Rafael J. Wysocki wrote: >> >>> On Monday, January 26, 2015 10:40:24 AM Thomas Gleixner wrote: >>>> On Mon, 26 Jan 2015, Li, Aubrey wrote: >>>>> On 2015/1/22 18:15, Thomas Gleixner wrote: >>>>>> Can we please stop adding more crap to that notifier thing? I rather >>>>>> see that go away than being expanded. >>>>> >>>>> Are you referring to FREEZE_PREPARE or remove all of FREEZE staff at all? >>>>> >>>>> What's the disadvantage of adding more notifier? >>>> >>>> clockevents_notify() is not a notifier. Its a multiplex call and I >>>> want to get rid of it and replace it with explicit functions. >>> >>> OK, so perhaps we need to move _SUSPEND/_RESUME out of there to start with? >>> >>> As far as I can say, clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL) and >>> clockevents_notify(CLOCK_EVT_NOTIFY_RESUME, NULL) are each only called from >>> one place and moreover, since they are in syscore_ops, we don't need any >>> locking around them. >>> >>> So what about the patch below? >> >> I'm cleaning up the whole replacement of notify. The stuff below is >> part of it. >> >>> >>> - clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL); >>> + tick_suspend(); >>> + tick_suspend_broadcast(); >> >> That's exactly the stuff I don't want to see. Blind code >> move. > > At least it's clear what the patch does. :-) > >> tick_suspend_broadcast() wants to be called from tick_suspend(). > > OK > >> Still compiling and testing a gazillion of combinations. > > OK, so it looks like we need to wait with the suspend to idle changes until > this lands. Please cc the patches to me when you post. I'll refine the patch after that. Thanks, -Aubrey -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3]PM/Sleep: Timer quiesce in freeze state
On 2015/1/26 22:41, Rafael J. Wysocki wrote: > On Monday, January 26, 2015 10:40:24 AM Thomas Gleixner wrote: >> On Mon, 26 Jan 2015, Li, Aubrey wrote: >>> On 2015/1/22 18:15, Thomas Gleixner wrote: > > [...] > >>>>> + /* >>>>> + * cpuidle_enter will return with interrupt enabled >>>>> + */ >>>>> + cpuidle_enter(drv, dev, next_state); >>>> >>>> How is that supposed to work? >>>> >>>> If timekeeping is not yet unfrozen, then any interrupt handling code >>>> which calls anything time related is going to hit lala land. >>>> >>>> You must guarantee that timekeeping is unfrozen before any interrupt >>>> is handled. If you cannot guarantee that, you cannot freeze >>>> timekeeping ever. >>>> >>>> The cpu local tick device is less critical, but it happens to work by >>>> chance, not by design. >>> >>> There are two way to guarantee this: the first way is, disable interrupt >>> before timekeeping frozen and enable interrupt after timekeeping is >>> unfrozen. However, we need to handle wakeup handler before unfreeze >>> timekeeping to wake freeze task up from wait queue. >>> >>> So we have to go the other way, the other way is, we ignore time related >>> calls during freeze, like what I added in irq_enter below. >> >> Groan. You just do not call in irq_enter/exit(), but what prevents any >> interrupt handler or whatever to call into the time/timer code after >> interrupts got reenabled? >> >> Nothing. >> >>> Or, we need to re-implement freeze wait and wake up mechanism? >> >> You need to make sure in the low level idle implementation that this >> cannot happen. >> >> tick_freeze() >> { >> raw_spin_lock(&tick_freeze_lock); >> tick_frozen++; >> if (tick_frozen == num_online_cpus()) >> timekeeping_suspend(); >> else >> tick_suspend_local(); >> raw_spin_unlock(&tick_freeze_lock); >> } >> >> tick_unfreeze() >> { >> raw_spin_lock(&tick_freeze_lock); >> if (tick_frozen == num_online_cpus()) >> timekeeping_resume(); >> else >> tick_resume_local(); >> tick_frozen--; >> raw_spin_unlock(&tick_freeze_lock); >> } >> >> idle_freeze() >> { >> local_irq_disable(); >> >> tick_freeze(); >> >> /* Must keep interrupts disabled! */ >> go_deep_idle() >> >> tick_unfreeze(); >> >> local_irq_enable(); >> } >> >> That's the only way you can do it proper, everything else will just be >> a horrible mess of bandaids and duct tape. >> >> So that does not need any of the irq_enter/exit conditionals, it does >> not need the real_handler hack. It just works. > > As long as go_deep_idle() above does not enable interrupts. This means we > won't > be able to use some C-states for suspend-to-idle (hald-induced C1 on some x86 > for one example), but that's not a very big deal. Does the legacy ACPI system IO method to enter C2/C3 need interrupt enabled as well? Do we need some platform ops to cover those legacy platforms? Different platform go different branch here. Thanks, -Aubrey > >> The only remaining issue might be a NMI calling into >> ktime_get_mono_fast_ns() before timekeeping is resumed. Its probably a >> non issue on x86/tsc, but it might be a problem on other platforms >> which turn off devices, clocks, It's not rocket science to prevent >> that. > > I don't see any users of ktime_get_mono_fast_ns() at all, unless some > non-trivial > macros are involved. At least grepping for it only returns the definition, > declarations and the line in trace.c. 
> > Rafael
Re: [PATCH v3]PM/Sleep: Timer quiesce in freeze state
On 2015/1/27 23:10, Rafael J. Wysocki wrote: > On Tuesday, January 27, 2015 04:03:29 PM Li, Aubrey wrote: >> On 2015/1/26 22:41, Rafael J. Wysocki wrote: >>> On Monday, January 26, 2015 10:40:24 AM Thomas Gleixner wrote: >>>> On Mon, 26 Jan 2015, Li, Aubrey wrote: >>>>> On 2015/1/22 18:15, Thomas Gleixner wrote: >>> >>> [...] >>> >>>>>>> + /* >>>>>>> +* cpuidle_enter will return with interrupt enabled >>>>>>> +*/ >>>>>>> + cpuidle_enter(drv, dev, next_state); >>>>>> >>>>>> How is that supposed to work? >>>>>> >>>>>> If timekeeping is not yet unfrozen, then any interrupt handling code >>>>>> which calls anything time related is going to hit lala land. >>>>>> >>>>>> You must guarantee that timekeeping is unfrozen before any interrupt >>>>>> is handled. If you cannot guarantee that, you cannot freeze >>>>>> timekeeping ever. >>>>>> >>>>>> The cpu local tick device is less critical, but it happens to work by >>>>>> chance, not by design. >>>>> >>>>> There are two way to guarantee this: the first way is, disable interrupt >>>>> before timekeeping frozen and enable interrupt after timekeeping is >>>>> unfrozen. However, we need to handle wakeup handler before unfreeze >>>>> timekeeping to wake freeze task up from wait queue. >>>>> >>>>> So we have to go the other way, the other way is, we ignore time related >>>>> calls during freeze, like what I added in irq_enter below. >>>> >>>> Groan. You just do not call in irq_enter/exit(), but what prevents any >>>> interrupt handler or whatever to call into the time/timer code after >>>> interrupts got reenabled? >>>> >>>> Nothing. >>>> >>>>> Or, we need to re-implement freeze wait and wake up mechanism? >>>> >>>> You need to make sure in the low level idle implementation that this >>>> cannot happen. >>>> >>>> tick_freeze() >>>> { >>>>raw_spin_lock(&tick_freeze_lock); >>>>tick_frozen++; >>>>if (tick_frozen == num_online_cpus()) >>>>timekeeping_suspend(); >>>>else >>>>tick_suspend_local(); >>>>raw_spin_unlock(&tick_freeze_lock); >>>> } >>>> >>>> tick_unfreeze() >>>> { >>>>raw_spin_lock(&tick_freeze_lock); >>>>if (tick_frozen == num_online_cpus()) >>>>timekeeping_resume(); >>>>else >>>>tick_resume_local(); >>>>tick_frozen--; >>>>raw_spin_unlock(&tick_freeze_lock); >>>> } >>>> >>>> idle_freeze() >>>> { >>>>local_irq_disable(); >>>> >>>>tick_freeze(); >>>> >>>>/* Must keep interrupts disabled! */ >>>>go_deep_idle() >>>> >>>>tick_unfreeze(); >>>> >>>>local_irq_enable(); >>>> } >>>> >>>> That's the only way you can do it proper, everything else will just be >>>> a horrible mess of bandaids and duct tape. >>>> >>>> So that does not need any of the irq_enter/exit conditionals, it does >>>> not need the real_handler hack. It just works. >>> >>> As long as go_deep_idle() above does not enable interrupts. This means we >>> won't >>> be able to use some C-states for suspend-to-idle (hald-induced C1 on some >>> x86 >>> for one example), but that's not a very big deal. >> >> Does the legacy ACPI system IO method to enter C2/C3 need interrupt >> enabled as well? >> >> Do we need some platform ops to cover those legacy platforms? Different >> platform go different branch here. > > No, we don't. > > I think this needs to be addressed in a different way overall. If you don't > mind, I'd like to prepare my own version of the patch at this point. That > likely will be simpler than trying to explain what I'd like to do and I guess > I'll need a few iterations to get something acceptable anyway. Sure, please go ahead and just keep me posted. 
Thanks, -Aubrey
Re: [LKP] [x86/platform, acpi] 7486341a98f: genirq: Flags mismatch irq 8. 00000080 (mmc0) vs. 00000000 (rtc0)
On 2015/3/20 16:38, Huang Ying wrote: > FYI, we noticed the below changes on > > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > commit 7486341a98f26857f383aec88ffa10950087c3a1 ("x86/platform, acpi: Bypass > legacy PIC and PIT in ACPI hardware reduced mode") > > > +-+++ > | | 394838c960 | 7486341a98 | > +-+++ > | boot_successes | 10 | 10 | > | early-boot-hang | 1 | 1 | > +-+++ > > > [ 28.682462] microcode: CPU3 sig=0x30673, pf=0x2, revision=0x31e > [ 28.692739] microcode: Microcode Update Driver: v2.00 > , Peter Oruba > [ 28.739947] mmc0: Reset 0x1 never completed. > [ 28.745155] genirq: Flags mismatch irq 8. 0080 (mmc0) vs. > (rtc0) mmc driver requests IRQ 8? >From surface it looks like either a firmware issue or ACPI get the wrong IRQ resource. I failed to replicate this issue with the attached config file on my side, which platform is the test running? Thanks, -Aubrey > [ 28.753151] CPU: 1 PID: 138 Comm: systemd-udevd Not tainted > 4.0.0-rc4-wl-03426-gdba586f #1 > [ 28.753156] 880076683480 8800764ffa78 818b2eb3 > 0007 > [ 28.753159] 88006d006800 8800764ffad8 810d37af > 8800764ffad8 > [ 28.753162] 0246 810d394a 880053bdc400 > > [ 28.753163] Call Trace: > [ 28.753174] [] dump_stack+0x4c/0x65 > [ 28.753180] [] __setup_irq+0x57f/0x5d0 > [ 28.753183] [] ? request_threaded_irq+0xaa/0x1b0 > [ 28.753191] [] ? sdhci_request+0x200/0x200 [sdhci] > [ 28.753194] [] request_threaded_irq+0xf4/0x1b0 > [ 28.753199] [] sdhci_add_host+0x800/0xf90 [sdhci] > [ 28.753205] [] sdhci_acpi_probe+0x348/0x440 [sdhci_acpi] > [ 28.753210] [] platform_drv_probe+0x34/0xa0 > [ 28.753215] [] driver_probe_device+0x90/0x3e0 > [ 28.753218] [] __driver_attach+0x9b/0xa0 > [ 28.753221] [] ? __device_attach+0x40/0x40 > [ 28.753231] [] bus_for_each_dev+0x6b/0xb0 > [ 28.753234] [] driver_attach+0x1e/0x20 > [ 28.753237] [] bus_add_driver+0x180/0x250 > [ 28.753240] [] ? 0xa002e000 > [ 28.753243] [] driver_register+0x64/0xf0 > [ 28.753246] [] __platform_driver_register+0x4a/0x50 > [ 28.753250] [] sdhci_acpi_driver_init+0x17/0x1000 > [sdhci_acpi] > [ 28.753254] [] do_one_initcall+0xc0/0x1f0 > [ 28.753259] [] ? kmem_cache_alloc_trace+0x1cc/0x240 > [ 28.753262] [] ? do_init_module+0x28/0x1dd > [ 28.753266] [] do_init_module+0x61/0x1dd > [ 28.753270] [] load_module+0x189f/0x1be0 > [ 28.753274] [] ? store_uevent+0x40/0x40 > [ 28.753279] [] SyS_finit_module+0x86/0xb0 > [ 28.753283] [] system_call_fastpath+0x12/0x17 > [ 28.753700] mmc0: Failed to request IRQ 8: -16 > [ 28.975382] IOAPIC[0]: Set routing entry (8-16 -> 0x32 -> IRQ 13 Mode:1 > Active:1 Dest:15) > [ 28.975934] sdhci-acpi: probe of INT33BB:00 failed with error -16 > > > Thanks, > Ying Huang > > > > ___ > LKP mailing list > l...@linux.intel.com > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [LKP] [x86/platform, acpi] 7486341a98f: genirq: Flags mismatch irq 8. 00000080 (mmc0) vs. 00000000 (rtc0)
On 2015/3/30 16:37, Jiang Liu wrote: > On 2015/3/30 16:28, Li, Aubrey wrote: >> Ying, >> >> can you please try this patch to see if the problem is gone on your side? > Hi Aubrey, > I would be better if we could change RTC driver instead. Hey Gerry, IRQ8 for RTC is for history reason. If we dynamically assign IRQ to RTC, it might be a mess for legacy machine. If we still use IRQ8 for RTC but dynamically assign IRQ to other components, then there is a race IRQ8 being used before RTC driver is loaded. So, statically assigning IRQ is the best choice in my mind, this also matches the original IRQ assign policy. Thanks, -Aubrey > Thanks! > Gerry >> >> Thanks, >> -Aubrey >> >> >> On 2015/3/26 20:13, Li, Aubrey wrote: >>> On 2015/3/25 15:22, Huang Ying wrote: >>>> [ 28.745155] genirq: Flags mismatch irq 8. 0080 (mmc0) vs. >>>> (rtc0) >>> >>> okay, I replicated this on my side now. >>> >>> Firstly, I don't think the patch did anything wrong. However, the patch >>> exposes a few issues FWICT currently: >>> >>> - Should we enable RTC Alarm the kind of Fixed hardware event in >>> hardware-reduced ACPI mode? I found RTC required registers in ACPI PM >>> block are not valid(register address = 0) >>> >>> - I checked RTC device in ACPI table, there is no interrupt resource >>> under RTC(firmware bug?), So irq 8 should be a hardcoded number. The >>> question is, shouldn't we update bitmap of allocated_irqs here? Or we >>> assume irq0~15 is reserved? If we assume IRQ0~15 is reserved, then >>> requesting IRQ8 without updating bitmap of allocated_irqs is fine. >>> >>> - Because we don't update bitmap of allocated_irqs when RTC request >>> IRQ8, so when MMC driver allocate irq resource, it's possible it gets >>> irq8, so we saw "genirq: Flags mismatch irq 8. 0080 (mmc0) vs. >>> (rtc0)". So here is another question, when we dynamically >>> allocate irq from irq domain, shouldn't we start from IRQ16? Yes, if >>> allocated_irqs bitmap is updated, then it should be fine if we start >>> from IRQ1. >>> >>> What the patch does is, it changes the behavior of how we allocate irq >>> from irq domain. Previously we have legacy IRQs so we statically assign >>> IRQ numbers for IOAPICs to host legacy IRQs, and now we allocate every >>> IRQ dynamically. >>> >>> For me I think I can deliver a patch against RTC driver to update >>> allocated_irqs bitmap, also, we should free irq when we found RTC ACPI >>> registers are not valid. >>> >>> Certainly I'm open to any suggestions. >>> >>> Thanks, >>> -Aubrey >>> >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] x86/platform, acpi: Statically assign IRQ numbers in ACPI, hardware reduced mode
We have to be conservative about dynamically assigning IRQ numbers on a platform in ACPI Hardware-reduced mode. On the Bay Trail-T (ASUS-T100) platform, there is an RTC device still using the legacy hardcoded IRQ8, which can cause the following error if the RTC is configured: 7486341a98f: genirq: Flags mismatch irq 8. 00000080 (mmc0) vs. 00000000 (rtc0) So we statically assign IRQ numbers in ACPI hardware-reduced mode to fix this error; this also matches the original IRQ assignment policy. Signed-off-by: Li Aubrey Cc: Alan Cox Cc: Len Brown Cc: Rafael J. Wysocki Cc: Arjan van de Ven --- arch/x86/kernel/acpi/boot.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 803b684..4cd0761 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -460,8 +460,12 @@ acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end) acpi_table_print_madt_entry(header); - /* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */ - if (ioapic->global_irq_base < nr_legacy_irqs()) + /* +* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs, +* Or for the platform in Hardware-reduced ACPI model +*/ + if (ioapic->global_irq_base < nr_legacy_irqs() || + acpi_gbl_reduced_hardware) cfg.type = IOAPIC_DOMAIN_LEGACY; mp_register_ioapic(ioapic->id, ioapic->address, ioapic->global_irq_base, -- 1.9.1
Re: [PATCH v15 1/3] /proc/pid/status: Add support for architecture specific output
On 2019/4/17 7:01, Andrew Morton wrote: > On Tue, 16 Apr 2019 14:32:48 +0800 Aubrey Li > wrote: > >> The architecture specific information of the running processes could >> be useful to the userland. Add support to examine process architecture >> specific information externally. > > The implementation looks just fine to me. Have you had any feedback on > the overall desirability of adding this feature? > This is motivated by a customer complaint. When the customer colocated their latency-critical tasks with AVX512 tasks, the latency-critical tasks were affected because AVX512 use slows down the core frequency. So we propose this interface so that user space tools can identify AVX512-using tasks and apply a user space dispatching policy. We may have subsequent efforts based on this proposal. >> --- a/fs/proc/array.c >> +++ b/fs/proc/array.c >> @@ -96,6 +96,11 @@ >> #include >> #include "internal.h" >> >> +/* Add support for architecture specific output in /proc/pid/status */ >> +#ifndef arch_proc_pid_status >> +#define arch_proc_pid_status(m, task) >> +#endif > > To this I suggest adding > > /* arch_proc_pid_status() must be defined in asm/processor.h */ > > Because we've regularly had different architectures defining such things > in different headers, resulting in a mess. Thanks, I'll add it in the next version. -Aubrey
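For context, a hedged sketch of what the x86 hook could print once wired to the AVX-512 tracking from the fpu patch earlier in the series; kernel context is assumed, and the AVX512_elapsed_ms key is illustrative rather than the final ABI.

#include <linux/sched.h>
#include <linux/seq_file.h>
#include <linux/jiffies.h>
#include <asm/processor.h>

/* Illustrative x86 implementation of the arch-specific /proc output. */
void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
{
	unsigned long long last = task->thread.fpu.avx512_timestamp;

	if (!boot_cpu_has(X86_FEATURE_AVX512F))
		return;		/* nothing to report on this CPU */

	if (last)
		seq_printf(m, "AVX512_elapsed_ms:\t%u\n",
			   jiffies_to_msecs(get_jiffies_64() - last));
	else
		seq_printf(m, "AVX512_elapsed_ms:\t-1\n");
}

Userspace then only sees an age, never a raw timestamp, which sidesteps the "what time unit" question raised earlier in the thread.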
Re: [PATCH v17 1/3] proc: add /proc/<pid>/arch_status
On 2019/4/25 5:18, Thomas Gleixner wrote: > On Mon, 22 Apr 2019, Aubrey Li wrote: >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index 5ad92419be19..d5a9c5ddd453 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -208,6 +208,7 @@ config X86 >> select USER_STACKTRACE_SUPPORT >> select VIRT_TO_BUS >> select X86_FEATURE_NAMESif PROC_FS >> +select PROC_PID_ARCH_STATUS if PROC_FS > > Can you please stop mixing arch and proc code? There is no point in > enabling this on x86 right away. > >> + >> +config PROC_PID_ARCH_STATUS >> +bool "Enable /proc/<pid>/arch_status file" > > Why is this switchable? x86 selects it if PROC_FS is enabled and all other > architectures are absolutely not interested in this. Above and this, I was trying to avoid an empty arch_file on other architectures. In the previous proposal the entry only exists on platforms with AVX512. > >> +default n >> +help >> + Provides a way to examine process architecture specific information. >> + See for more information. > > Which contains zero information about this file when only this patch is > applied. > >> @@ -94,6 +94,7 @@ >> #include >> #include >> +#include > > That include is required because it does NOT contain anything useful for > this, right? > >> +/* >> + * Add support for task architecture specific output in >> /proc/pid/arch_status. >> + * task_arch_status() must be defined in asm/processor.h >> + */ >> +#ifdef CONFIG_PROC_PID_ARCH_STATUS >> +# ifndef task_arch_status >> +# define task_arch_status(m, task) >> +# endif > > What exactly is the point of this macro mess? If an architecture selects > CONFIG_PROC_PID_ARCH_STATUS then it has to provide proc_task_arch_status() > and the prototype should be in include/linux/proc_fs.h. I was trying to address Andy's last comments. If we have the prototype in include/linux/proc_fs.h, we'll need a weak function definition in fs/proc/array.c, which bloats other architectures. In that approach proc_task_arch_status() would be defined in asm/processor.h, but proc_task_arch_status() has four parameters, and I don't want the unnecessary "struct pid_namespace *ns" and "struct pid *pid" leaked into arch headers, so I defined task_arch_status(m, task) to avoid that. > >> +static int proc_pid_arch_status(struct seq_file *m, struct pid_namespace >> *ns, >> +struct pid *pid, struct task_struct *task) >> +{ >> +task_arch_status(m, task); >> +return 0; >> +} >> +#endif /* CONFIG_PROC_PID_ARCH_STATUS */ > > Thanks, > > tglx > Thanks, -Aubrey
Re: [PATCH v17 1/3] proc: add /proc/<pid>/arch_status
On 2019/4/25 15:20, Thomas Gleixner wrote: > On Thu, 25 Apr 2019, Li, Aubrey wrote: > >> On 2019/4/25 5:18, Thomas Gleixner wrote: >>> On Mon, 22 Apr 2019, Aubrey Li wrote: >>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >>>> index 5ad92419be19..d5a9c5ddd453 100644 >>>> --- a/arch/x86/Kconfig >>>> +++ b/arch/x86/Kconfig >>>> @@ -208,6 +208,7 @@ config X86 >>>>select USER_STACKTRACE_SUPPORT >>>>select VIRT_TO_BUS >>>>select X86_FEATURE_NAMESif PROC_FS >>>> + select PROC_PID_ARCH_STATUS if PROC_FS >>> >>> Can you please stop mixing arch and proc code? There is no point in >>> enabling this on x86 right away. >>> >>>> + >>>> +config PROC_PID_ARCH_STATUS >>>> + bool "Enable /proc//arch_status file" >>> >>> Why is this switchable? x86 selects it if PROC_FS is enabled and all other >>> architectures are absolutely not interested in this. >> >> Above and this, I was trying to avoid an empty arch_file on other >> architectures. >> In previous proposal the entry only exists on the platform with AVX512. > > What's the benefit of having a conditional enabled empty file for all other > architectures? Nothing AFAICT. > >>>> +/* >>>> + * Add support for task architecture specific output in >>>> /proc/pid/arch_status. >>>> + * task_arch_status() must be defined in asm/processor.h >>>> + */ >>>> +#ifdef CONFIG_PROC_PID_ARCH_STATUS >>>> +# ifndef task_arch_status >>>> +# define task_arch_status(m, task) >>>> +# endif >>> >>> What exactly is the point of this macro mess? If an architecture selects >>> CONFIG_PROC_PID_ARCH_STATUS then it has to provide proc_task_arch_status() >>> and the prototype should be in include/linux/proc_fs.h. >> >> I was trying to address Andy's last comments. If we have the prototype in >> include/linux/proc_fs.h, we'll have a weak function definition in >> fs/proc/array.c, >> which bloats other architectures. >> >> In that way proc_task_arch_status() should be defined in asm/processor.h, >> but proc_task_arch_status() has four parameters, I don't want unnecessary >> "struct pid_namespace *ns" and "struct pid *pid" leaked into arch headers, >> so I defined task_arch_status(m, task) to avoid that. > > This #define mess is ugly and pointless. > Let the arch select CONFIG_PROC_PID_ARCH_STATUS Sorry, I didn't get the point here, above you mentioned not mixing arch and proc code and not enabling this on x86 right away, then how to let x86 select it? Thanks, -Aubrey >and if it selects it it has to provide the function. No waek function required >at all. > > Thanks, > > tglx >
Re: [PATCH v17 1/3] proc: add /proc/<pid>/arch_status
On 2019/4/25 16:20, Thomas Gleixner wrote: > On Thu, 25 Apr 2019, Li, Aubrey wrote: >> On 2019/4/25 15:20, Thomas Gleixner wrote: >>> Let the arch select CONFIG_PROC_PID_ARCH_STATUS >> >> Sorry, I didn't get the point here, above you mentioned not mixing arch and >> proc code >> and not enabling this on x86 right away, then how to let x86 select it? >> > > By using 'select PROC_PID_ARCH_STATUS' in the arch//Kconfig file in the > patch which implements the required arch function perhaps? Oh,oh, I misunderstood, thanks for the clarification! Thanks, -Aubrey
Re: [PATCH v17 1/3] proc: add /proc/<pid>/arch_status
On 2019/4/25 18:11, Enrico Weigelt, metux IT consult wrote: > On 25.04.19 03:50, Li, Aubrey wrote: >>>> +config PROC_PID_ARCH_STATUS >>>> +bool "Enable /proc/<pid>/arch_status file" >>> Why is this switchable? x86 selects it if PROC_FS is enabled and all other >>> architectures are absolutely not interested in this. >> Above and this, I was trying to avoid an empty arch_file on other architectures. >> In previous proposal the entry only exists on the platform with AVX512. > Why not making it depend on those archs that actually support it ? Yep, I'll make it disabled by default and not switchable and let arch select it.
Re: [RFC][PATCH 00/16] sched: Core scheduling
The original patch seems missing the following change for 32bit. Thanks, -Aubrey diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c index 9fbb10383434..78de28ebc45d 100644 --- a/kernel/sched/cpuacct.c +++ b/kernel/sched/cpuacct.c @@ -111,7 +111,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, /* * Take rq->lock to make 64-bit read safe on 32-bit platforms. */ - raw_spin_lock_irq(&cpu_rq(cpu)->lock); + raw_spin_lock_irq(rq_lockp(cpu_rq(cpu))); #endif if (index == CPUACCT_STAT_NSTATS) { @@ -125,7 +125,7 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu, } #ifndef CONFIG_64BIT - raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu))); #endif return data; @@ -140,14 +140,14 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val) /* * Take rq->lock to make 64-bit write safe on 32-bit platforms. */ - raw_spin_lock_irq(&cpu_rq(cpu)->lock); + raw_spin_lock_irq(rq_lockp(cpu_rq(cpu))); #endif for (i = 0; i < CPUACCT_STAT_NSTATS; i++) cpuusage->usages[i] = val; #ifndef CONFIG_64BIT - raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu))); #endif } @@ -252,13 +252,13 @@ static int cpuacct_all_seq_show(struct seq_file *m, void *V) * Take rq->lock to make 64-bit read safe on 32-bit * platforms. */ - raw_spin_lock_irq(&cpu_rq(cpu)->lock); + raw_spin_lock_irq(rq_lockp(cpu_rq(cpu))); #endif seq_printf(m, " %llu", cpuusage->usages[index]); #ifndef CONFIG_64BIT - raw_spin_unlock_irq(&cpu_rq(cpu)->lock); + raw_spin_unlock_irq(rq_lockp(cpu_rq(cpu))); #endif } seq_puts(m, "\n");
Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
On 2019/4/10 9:58, Andy Lutomirski wrote: > On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li wrote: >> >> The architecture specific information of the running processes could >> be useful to the userland. Add support to examine process architecture >> specific information externally. >> >> Signed-off-by: Aubrey Li >> Cc: Peter Zijlstra >> Cc: Andi Kleen >> Cc: Tim Chen >> Cc: Dave Hansen >> Cc: Arjan van de Ven >> Cc: Linux API >> Cc: Alexey Dobriyan >> Cc: Andrew Morton >> --- >> fs/proc/array.c | 5 + >> include/linux/proc_fs.h | 2 ++ >> 2 files changed, 7 insertions(+) >> >> diff --git a/fs/proc/array.c b/fs/proc/array.c >> index 2edbb657f859..331592a61718 100644 >> --- a/fs/proc/array.c >> +++ b/fs/proc/array.c >> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file *m, >> struct mm_struct *mm) >> seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); >> } >> >> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct >> *task) >> +{ >> +} > > This pointlessly bloats other architectures. Do this instead in an > appropriate header: > > #ifndef arch_proc_pid_status > static inline void arch_proc_pid_status(...) > { > } > #endif > I saw a bunch of similar weak functions, is it not acceptable? fs/proc$ grep weak *.c cpuinfo.c:__weak void arch_freq_prepare_all(void) meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file *m) vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size) vmcore.c:void __weak elfcorehdr_free(unsigned long long addr) vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos) vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos) vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma, vmcore.c:ssize_t __weak > Or add /proc/PID/x86_status, which sounds better in most respects to me. > I didn't figure out how to make /proc/PID/x86_status invisible to other architectures in an appropriate way, do you have any suggestions? Thanks, -Aubrey
Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
On 2019/4/10 10:25, Andy Lutomirski wrote: > On Tue, Apr 9, 2019 at 7:20 PM Li, Aubrey wrote: >> >> On 2019/4/10 9:58, Andy Lutomirski wrote: >>> On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li wrote: >>>> >>>> The architecture specific information of the running processes could >>>> be useful to the userland. Add support to examine process architecture >>>> specific information externally. >>>> >>>> Signed-off-by: Aubrey Li >>>> Cc: Peter Zijlstra >>>> Cc: Andi Kleen >>>> Cc: Tim Chen >>>> Cc: Dave Hansen >>>> Cc: Arjan van de Ven >>>> Cc: Linux API >>>> Cc: Alexey Dobriyan >>>> Cc: Andrew Morton >>>> --- >>>> fs/proc/array.c | 5 + >>>> include/linux/proc_fs.h | 2 ++ >>>> 2 files changed, 7 insertions(+) >>>> >>>> diff --git a/fs/proc/array.c b/fs/proc/array.c >>>> index 2edbb657f859..331592a61718 100644 >>>> --- a/fs/proc/array.c >>>> +++ b/fs/proc/array.c >>>> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file >>>> *m, struct mm_struct *mm) >>>> seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); >>>> } >>>> >>>> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct >>>> *task) >>>> +{ >>>> +} >>> >>> This pointlessly bloats other architectures. Do this instead in an >>> appropriate header: >>> >>> #ifndef arch_proc_pid_status >>> static inline void arch_proc_pid_status(...) >>> { >>> } >>> #endif >>> >> >> I saw a bunch of similar weak functions, is it not acceptable? >> >> fs/proc$ grep weak *.c >> cpuinfo.c:__weak void arch_freq_prepare_all(void) >> meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file *m) >> vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long >> long *size) >> vmcore.c:void __weak elfcorehdr_free(unsigned long long addr) >> vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos) >> vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 >> *ppos) >> vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma, >> vmcore.c:ssize_t __weak > > I think they're acceptable, but I don't personally like them. > okay, let me try to see if I can refine it in an appropriate way. >> >>> Or add /proc/PID/x86_status, which sounds better in most respects to me. >>> >> >> I didn't figure out how to make /proc/PID/x86_status invisible to other >> architectures in an appropriate way, do you have any suggestions? > > #ifdef CONFIG_X86? > I'm not sure if everyone like adding an arch #ifdef in a common file, please allow me to wait a while to see if there are other comments. Thanks, -Aubrey
Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
On 2019/4/10 10:36, Li, Aubrey wrote: > On 2019/4/10 10:25, Andy Lutomirski wrote: >> On Tue, Apr 9, 2019 at 7:20 PM Li, Aubrey wrote: >>> >>> On 2019/4/10 9:58, Andy Lutomirski wrote: >>>> On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li wrote: >>>>> >>>>> The architecture specific information of the running processes could >>>>> be useful to the userland. Add support to examine process architecture >>>>> specific information externally. >>>>> >>>>> Signed-off-by: Aubrey Li >>>>> Cc: Peter Zijlstra >>>>> Cc: Andi Kleen >>>>> Cc: Tim Chen >>>>> Cc: Dave Hansen >>>>> Cc: Arjan van de Ven >>>>> Cc: Linux API >>>>> Cc: Alexey Dobriyan >>>>> Cc: Andrew Morton >>>>> --- >>>>> fs/proc/array.c | 5 + >>>>> include/linux/proc_fs.h | 2 ++ >>>>> 2 files changed, 7 insertions(+) >>>>> >>>>> diff --git a/fs/proc/array.c b/fs/proc/array.c >>>>> index 2edbb657f859..331592a61718 100644 >>>>> --- a/fs/proc/array.c >>>>> +++ b/fs/proc/array.c >>>>> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file >>>>> *m, struct mm_struct *mm) >>>>> seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); >>>>> } >>>>> >>>>> +void __weak arch_proc_pid_status(struct seq_file *m, struct task_struct >>>>> *task) >>>>> +{ >>>>> +} >>>> >>>> This pointlessly bloats other architectures. Do this instead in an >>>> appropriate header: >>>> >>>> #ifndef arch_proc_pid_status >>>> static inline void arch_proc_pid_status(...) >>>> { >>>> } >>>> #endif >>>> >>> >>> I saw a bunch of similar weak functions, is it not acceptable? >>> >>> fs/proc$ grep weak *.c >>> cpuinfo.c:__weak void arch_freq_prepare_all(void) >>> meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file *m) >>> vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned >>> long long *size) >>> vmcore.c:void __weak elfcorehdr_free(unsigned long long addr) >>> vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos) >>> vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 >>> *ppos) >>> vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma, >>> vmcore.c:ssize_t __weak >> >> I think they're acceptable, but I don't personally like them. >> > > okay, let me try to see if I can refine it in an appropriate way. Hi Andy, Is this what you want? 
Thanks, -Aubrey diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 2bb3a648fc12..82d77d3aefff 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -990,5 +990,8 @@ enum l1tf_mitigations { }; extern enum l1tf_mitigations l1tf_mitigation; +/* Add support for architecture specific output in /proc/pid/status */ +void arch_proc_pid_status(struct seq_file *m, struct task_struct *task); +#define arch_proc_pid_status arch_proc_pid_status #endif /* _ASM_X86_PROCESSOR_H */ diff --git a/fs/proc/array.c b/fs/proc/array.c index 2edbb657f859..fd65a6ba2864 100644 --- a/fs/proc/array.c +++ b/fs/proc/array.c @@ -401,6 +401,11 @@ static inline void task_thp_status(struct seq_file *m, struct mm_struct *mm) seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); } +/* Add support for architecture specific output in /proc/pid/status */ +#ifndef arch_proc_pid_status +#define arch_proc_pid_status(m, task) +#endif + int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, struct pid *pid, struct task_struct *task) { @@ -424,6 +429,7 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, task_cpus_allowed(m, task); cpuset_task_status_allowed(m, task); task_context_switch_counts(m, task); + arch_proc_pid_status(m, task); return 0; }
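For illustration only, an architecture that opts in through the override above could provide something like the sketch below; the "x86_example" key is a placeholder invented for this sketch, not what the x86 patches in this series actually print:

/* arch/x86 side, sketch only */
void arch_proc_pid_status(struct seq_file *m, struct task_struct *task)
{
	/*
	 * Emit "Key:\tvalue" lines, following the format of the other
	 * /proc/<pid>/status fields. The key and value below are purely
	 * illustrative placeholders.
	 */
	seq_printf(m, "x86_example:\t%d\n", task_pid_nr(task));
}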
Re: [PATCH v14 1/3] /proc/pid/status: Add support for architecture specific output
On 2019/4/10 22:54, Andy Lutomirski wrote: > On Tue, Apr 9, 2019 at 8:40 PM Li, Aubrey wrote: >> >> On 2019/4/10 10:36, Li, Aubrey wrote: >>> On 2019/4/10 10:25, Andy Lutomirski wrote: >>>> On Tue, Apr 9, 2019 at 7:20 PM Li, Aubrey >>>> wrote: >>>>> >>>>> On 2019/4/10 9:58, Andy Lutomirski wrote: >>>>>> On Tue, Apr 9, 2019 at 6:55 PM Aubrey Li >>>>>> wrote: >>>>>>> >>>>>>> The architecture specific information of the running processes could >>>>>>> be useful to the userland. Add support to examine process architecture >>>>>>> specific information externally. >>>>>>> >>>>>>> Signed-off-by: Aubrey Li >>>>>>> Cc: Peter Zijlstra >>>>>>> Cc: Andi Kleen >>>>>>> Cc: Tim Chen >>>>>>> Cc: Dave Hansen >>>>>>> Cc: Arjan van de Ven >>>>>>> Cc: Linux API >>>>>>> Cc: Alexey Dobriyan >>>>>>> Cc: Andrew Morton >>>>>>> --- >>>>>>> fs/proc/array.c | 5 + >>>>>>> include/linux/proc_fs.h | 2 ++ >>>>>>> 2 files changed, 7 insertions(+) >>>>>>> >>>>>>> diff --git a/fs/proc/array.c b/fs/proc/array.c >>>>>>> index 2edbb657f859..331592a61718 100644 >>>>>>> --- a/fs/proc/array.c >>>>>>> +++ b/fs/proc/array.c >>>>>>> @@ -401,6 +401,10 @@ static inline void task_thp_status(struct seq_file >>>>>>> *m, struct mm_struct *mm) >>>>>>> seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); >>>>>>> } >>>>>>> >>>>>>> +void __weak arch_proc_pid_status(struct seq_file *m, struct >>>>>>> task_struct *task) >>>>>>> +{ >>>>>>> +} >>>>>> >>>>>> This pointlessly bloats other architectures. Do this instead in an >>>>>> appropriate header: >>>>>> >>>>>> #ifndef arch_proc_pid_status >>>>>> static inline void arch_proc_pid_status(...) >>>>>> { >>>>>> } >>>>>> #endif >>>>>> >>>>> >>>>> I saw a bunch of similar weak functions, is it not acceptable? >>>>> >>>>> fs/proc$ grep weak *.c >>>>> cpuinfo.c:__weak void arch_freq_prepare_all(void) >>>>> meminfo.c:void __attribute__((weak)) arch_report_meminfo(struct seq_file >>>>> *m) >>>>> vmcore.c:int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned >>>>> long long *size) >>>>> vmcore.c:void __weak elfcorehdr_free(unsigned long long addr) >>>>> vmcore.c:ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 >>>>> *ppos) >>>>> vmcore.c:ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, >>>>> u64 *ppos) >>>>> vmcore.c:int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma, >>>>> vmcore.c:ssize_t __weak >>>> >>>> I think they're acceptable, but I don't personally like them. >>>> >>> >>> okay, let me try to see if I can refine it in an appropriate way. >> >> Hi Andy, >> >> Is this what you want? 
>> >> Thanks, >> -Aubrey >> >> >> diff --git a/arch/x86/include/asm/processor.h >> b/arch/x86/include/asm/processor.h >> index 2bb3a648fc12..82d77d3aefff 100644 >> --- a/arch/x86/include/asm/processor.h >> +++ b/arch/x86/include/asm/processor.h >> @@ -990,5 +990,8 @@ enum l1tf_mitigations { >> }; >> >> extern enum l1tf_mitigations l1tf_mitigation; >> +/* Add support for architecture specific output in /proc/pid/status */ >> +void arch_proc_pid_status(struct seq_file *m, struct task_struct *task); >> +#define arch_proc_pid_status arch_proc_pid_status >> >> #endif /* _ASM_X86_PROCESSOR_H */ >> diff --git a/fs/proc/array.c b/fs/proc/array.c >> index 2edbb657f859..fd65a6ba2864 100644 >> --- a/fs/proc/array.c >> +++ b/fs/proc/array.c >> @@ -401,6 +401,11 @@ static inline void task_thp_status(struct seq_file *m, >> struct mm_struct *mm) >> seq_printf(m, "THP_enabled:\t%d\n", thp_enabled); >> } >> >> +/* Add support for architecture specific output in /proc/pid/status */ >> +#ifndef arch_proc_pid_status >> +#define arch_proc_pid_status(m, task) >> +#endif >> + >> int proc_pid_status(struct seq_file *m, struct pid_namespace *ns, >> struct pid *pid, struct task_struct *task) >> { >> @@ -424,6 +429,7 @@ int proc_pid_status(struct seq_file *m, struct >> pid_namespace *ns, >> task_cpus_allowed(m, task); >> cpuset_task_status_allowed(m, task); >> task_context_switch_counts(m, task); >> + arch_proc_pid_status(m, task); >> return 0; >> } >> > > Yes. But I still think it would be nicer to separate the arch stuff > into its own file. Others might reasonably disagree with me. > I like arch_status, I proposed but no other arch shows interesting in it. I think the problem is similar for x86_status, it does not make sense for those x86 platform without AVX512 to have an empty arch file. I personally don't like [arch]_status because the code may become unclean if more arches added in future. Maybe it's too early to have a separated arch staff file for now. Thanks, -Aubrey
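For completeness, a separate per-architecture file along the lines Andy prefers would be wired up in fs/proc/base.c roughly as sketched below; the config symbol and the show-function name are placeholders for whatever a real series would define, and only the ONE() entry style follows the existing tgid_base_stuff[] table:

/* fs/proc/base.c, sketch only: show callback reusing the arch hook */
static int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
				struct pid *pid, struct task_struct *task)
{
	arch_proc_pid_status(m, task);
	return 0;
}

/* entry added to tgid_base_stuff[], guarded by a hypothetical symbol */
#ifdef CONFIG_ARCH_PROC_PID_STATUS
	ONE("arch_status", S_IRUGO, proc_pid_arch_status),
#endif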
Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
Hi Vincent, Sorry for the delay, I just returned from Chinese New Year holiday. On 2021/1/25 22:51, Vincent Guittot wrote: > On Mon, 25 Jan 2021 at 15:00, Li, Aubrey wrote: >> >> On 2021/1/25 18:56, Vincent Guittot wrote: >>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li wrote: >>>> >>>> A long-tail load balance cost is observed on the newly idle path, >>>> this is caused by a race window between the first nr_running check >>>> of the busiest runqueue and its nr_running recheck in detach_tasks. >>>> >>>> Before the busiest runqueue is locked, the tasks on the busiest >>>> runqueue could be pulled by other CPUs and nr_running of the busiest >>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED >>> >>> We should better detect that when trying to detach task like below >> >> This should be a compromise from my understanding. If we give up load balance >> this time due to the race condition, we do reduce the load balance cost on >> the >> newly idle path, but if there is an imbalance indeed at the same sched_domain > > Redo path is there in case, LB has found an imbalance but it can't > move some loads from this busiest rq to dest rq because of some cpu > affinity. So it tries to fix the imbalance by moving load onto another > rq of the group. In your case, the imbalance has disappeared because > it has already been pulled by another rq so you don't have to try to > find another imbalance. And I would even say you should not in order > to let other level to take a chance to spread the load > >> level, we have to wait the next softirq entry to handle that imbalance. This >> means the tasks on the second busiest runqueue have to stay longer, which >> could >> introduce tail latency as well. That's why I introduced a variable to control >> the redo loops. I'll send this to the benchmark queue to see if it makes any > > TBH, I don't like multiplying the number of knobs Sure, I can take your approach, :) >>> >>> --- a/kernel/sched/fair.c >>> +++ b/kernel/sched/fair.c >>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env) >>> >>> lockdep_assert_held(&env->src_rq->lock); >>> >>> + /* >>> +* Another CPU has emptied this runqueue in the meantime. >>> +* Just return and leave the load_balance properly. >>> +*/ >>> + if (env->src_rq->nr_running <= 1 && !env->loop) { May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked from LBF_NEED_BREAK, env->loop could be non-zero, but as long as src_rq's nr_running <=1, we should return immediately with LBF_ALL_PINNED flag cleared. How about the following change? diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 04a3ce20da67..1761d33accaa 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env) * We don't want to steal all, otherwise we may be treated likewise, * which could at worst lead to a livelock crash. */ - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) + if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= 1) { + /* Clear the flag as we will not test any task */ + env->flag &= ~LBF_ALL_PINNED; break; + } p = list_last_entry(tasks, struct task_struct, se.group_node); Thanks, -Aubrey
Re: [RFC PATCH v1] sched/fair: limit load balance redo times at the same sched_domain level
On 2021/2/24 1:33, Vincent Guittot wrote: > On Tue, 23 Feb 2021 at 06:41, Li, Aubrey wrote: >> >> Hi Vincent, >> >> Sorry for the delay, I just returned from Chinese New Year holiday. >> >> On 2021/1/25 22:51, Vincent Guittot wrote: >>> On Mon, 25 Jan 2021 at 15:00, Li, Aubrey wrote: >>>> >>>> On 2021/1/25 18:56, Vincent Guittot wrote: >>>>> On Mon, 25 Jan 2021 at 06:50, Aubrey Li wrote: >>>>>> >>>>>> A long-tail load balance cost is observed on the newly idle path, >>>>>> this is caused by a race window between the first nr_running check >>>>>> of the busiest runqueue and its nr_running recheck in detach_tasks. >>>>>> >>>>>> Before the busiest runqueue is locked, the tasks on the busiest >>>>>> runqueue could be pulled by other CPUs and nr_running of the busiest >>>>>> runqueu becomes 1, this causes detach_tasks breaks with LBF_ALL_PINNED >>>>> >>>>> We should better detect that when trying to detach task like below >>>> >>>> This should be a compromise from my understanding. If we give up load >>>> balance >>>> this time due to the race condition, we do reduce the load balance cost on >>>> the >>>> newly idle path, but if there is an imbalance indeed at the same >>>> sched_domain >>> >>> Redo path is there in case, LB has found an imbalance but it can't >>> move some loads from this busiest rq to dest rq because of some cpu >>> affinity. So it tries to fix the imbalance by moving load onto another >>> rq of the group. In your case, the imbalance has disappeared because >>> it has already been pulled by another rq so you don't have to try to >>> find another imbalance. And I would even say you should not in order >>> to let other level to take a chance to spread the load >>> >>>> level, we have to wait the next softirq entry to handle that imbalance. >>>> This >>>> means the tasks on the second busiest runqueue have to stay longer, which >>>> could >>>> introduce tail latency as well. That's why I introduced a variable to >>>> control >>>> the redo loops. I'll send this to the benchmark queue to see if it makes >>>> any >>> >>> TBH, I don't like multiplying the number of knobs >> >> Sure, I can take your approach, :) >> >>>>> >>>>> --- a/kernel/sched/fair.c >>>>> +++ b/kernel/sched/fair.c >>>>> @@ -7688,6 +7688,16 @@ static int detach_tasks(struct lb_env *env) >>>>> >>>>> lockdep_assert_held(&env->src_rq->lock); >>>>> >>>>> + /* >>>>> +* Another CPU has emptied this runqueue in the meantime. >>>>> +* Just return and leave the load_balance properly. >>>>> +*/ >>>>> + if (env->src_rq->nr_running <= 1 && !env->loop) { >> >> May I know why !env->loop is needed here? IIUC, if detach_tasks is invoked > > IIRC, my point was to do the test only when trying to detach the 1st > task. A lot of things can happen when a break is involved but TBH I > can't remember a precise UC. It may be over cautious When the break happens, rq unlock and local irq restored, so it's still possible the rq is emptied by another CPU. > >> from LBF_NEED_BREAK, env->loop could be non-zero, but as long as src_rq's >> nr_running <=1, we should return immediately with LBF_ALL_PINNED flag >> cleared. >> >> How about the following change? >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 04a3ce20da67..1761d33accaa 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -7683,8 +7683,11 @@ static int detach_tasks(struct lb_env *env) >> * We don't want to steal all, otherwise we may be treated >> likewise, >> * which could at worst lead to a livelock crash. 
>> */ >> - if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= >> 1) >> + if (env->idle != CPU_NOT_IDLE && env->src_rq->nr_running <= >> 1) { > > IMO, we must do the test before: while (!list_empty(tasks)) { > > because src_rq might have become empty if waiting tasks have been > pulled by another cpu and the running one became idle in the meantime Okay, after the running one became idle, it still has LBF_ALL_PINNED, which needs to be cleared as well. Thanks! > >> + /* Clear the flag as we will not test any task */ >> + env->flag &= ~LBF_ALL_PINNED; >> break; >> + } >> >> p = list_last_entry(tasks, struct task_struct, >> se.group_node); >> >> Thanks, >> -Aubrey
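Putting the two points together, the shape of the fix this thread converges on is roughly the fragment below (illustrative, not the patch that was eventually posted): test the busiest runqueue once before walking its task list, and clear LBF_ALL_PINNED so load_balance() does not start the redo loop for an imbalance that no longer exists.

static int detach_tasks(struct lb_env *env)
{
	lockdep_assert_held(&env->src_rq->lock);

	/*
	 * Another CPU emptied this runqueue between the unlocked
	 * nr_running check in load_balance() and taking src_rq->lock:
	 * there is nothing to detach, and clearing LBF_ALL_PINNED keeps
	 * load_balance() from treating this as an all-pinned failure
	 * and entering the redo path.
	 */
	if (env->src_rq->nr_running <= 1) {
		env->flags &= ~LBF_ALL_PINNED;
		return 0;
	}

	/* ... unchanged: walk the task list and detach what can be moved ... */
}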
Re: [PATCH v5 0/4] Scan for an idle sibling in a single pass
On 2021/1/27 21:51, Mel Gorman wrote: > Changelog since v4 > o Avoid use of intermediate variable during select_idle_cpu > > Changelog since v3 > o Drop scanning based on cores, SMT4 results showed problems > > Changelog since v2 > o Remove unnecessary parameters > o Update nr during scan only when scanning for cpus > > Changlog since v1 > o Move extern declaration to header for coding style > o Remove unnecessary parameter from __select_idle_cpu > > This series of 4 patches reposts three patches from Peter entitled > "select_idle_sibling() wreckage". It only scans the runqueues in a single > pass when searching for an idle sibling. > > Three patches from Peter were dropped. The first patch altered how scan > depth was calculated. Scan depth deletion is a random number generator > with two major limitations. The avg_idle time is based on the time > between a CPU going idle and being woken up clamped approximately by > 2*sysctl_sched_migration_cost. This is difficult to compare in a sensible > fashion to avg_scan_cost. The second issue is that only the avg_scan_cost > of scan failures is recorded and it does not decay. This requires deeper > surgery that would justify a patch on its own although Peter notes that > https://lkml.kernel.org/r/20180530143105.977759...@infradead.org is > potentially useful for an alternative avg_idle metric. > > The second patch dropped scanned based on cores instead of CPUs as it > rationalised the difference between core scanning and CPU scanning. > Unfortunately, Vincent reported problems with SMT4 so it's dropped > for now until depth searching can be fixed. > > The third patch dropped converted the idle core scan throttling mechanism > to SIS_PROP. While this would unify the throttling of core and CPU > scanning, it was not free of regressions and has_idle_cores is a fairly > effective throttling mechanism with the caveat that it can have a lot of > false positives for workloads like hackbench. > > Peter's series tried to solve three problems at once, this subset addresses > one problem. > > kernel/sched/fair.c | 151 +++- > kernel/sched/features.h | 1 - > 2 files changed, 70 insertions(+), 82 deletions(-) > 4 benchmarks measured on a x86 4s system with 24 cores per socket and 2 HTs per core, total 192 CPUs. The load level is [25%, 50%, 75%, 100%]. - hackbench almost has a universal win. - netperf high load has notable changes, as well as tbench 50% load. 
Details below:

hackbench: 10 iterations, 1 loops, 40 fds per group
===================================================
- pipe process
group   base   %std    v5       %std
3       1      19.18   1.0266   9.06
6       1      9.17    0.987    13.03
9       1      7.11    1.0195   4.61
12      1      1.07    0.9927   1.43
- pipe thread
group   base   %std    v5       %std
3       1      11.14   0.9742   7.27
6       1      9.15    0.9572   7.48
9       1      2.95    0.986    4.05
12      1      1.75    0.9992   1.68
- socket process
group   base   %std    v5       %std
3       1      2.9     0.9586   2.39
6       1      0.68    0.9641   1.3
9       1      0.64    0.9388   0.76
12      1      0.56    0.9375   0.55
- socket thread
group   base   %std    v5       %std
3       1      3.82    0.9686   2.97
6       1      2.06    0.9667   1.91
9       1      0.44    0.9354   1.25
12      1      0.54    0.9362   0.6

netperf: 10 iterations x 100 seconds, transactions rate / sec
=============================================================
- tcp request/response performance
thread  base   %std    v4       %std
25%     1      5.34    1.0039   5.13
50%     1      4.97    1.0115   6.3
75%     1      5.09    0.9257   6.75
100%    1      4.53    0.908    4.83
- udp request/response performance
thread  base   %std    v4       %std
25%     1      6.18    0.9896   6.09
50%     1      5.88    1.0198   8.92
75%     1      24.38   0.9236   29.14
100%    1      26.16   0.9063   22.16

tbench: 10 iterations x 100 seconds, throughput / sec
=====================================================
thread  base   %std    v4       %std
25%     1      0.45    1.003    1.48
50%     1      1.71    0.9286   0.82
75%     1      0.84    0.9928   0.94
100%    1      0.76    0.9762   0.59

schbench: 10 iterations x 100 seconds, 99th percentile latency
==============================================================
mthread base   %std    v4       %std
25%     1      2.89    0.9884   7.34
50%     1      40.38   1.0055   38.37
75%     1      4.76    1.0095   4.62
100%    1      10.09   1.0083   8.03

Thanks, -Aubrey
Re: [RFC PATCH 00/16] Core scheduling v6
Hi Joel, On 2020/8/10 0:44, Joel Fernandes wrote: > Hi Aubrey, > > Apologies for replying late as I was still looking into the details. > > On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote: > [...] >> +/* >> + * Core scheduling policy: >> + * - CORE_SCHED_DISABLED: core scheduling is disabled. >> + * - CORE_COOKIE_MATCH: tasks with same cookie can run >> + * on the same core concurrently. >> + * - CORE_COOKIE_TRUST: trusted task can run with kernel >> thread on the same core concurrently. >> + * - CORE_COOKIE_LONELY: tasks with cookie can run only >> + * with idle thread on the same core. >> + */ >> +enum coresched_policy { >> + CORE_SCHED_DISABLED, >> + CORE_SCHED_COOKIE_MATCH, >> +CORE_SCHED_COOKIE_TRUST, >> + CORE_SCHED_COOKIE_LONELY, >> +}; >> >> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind >> of performance regression. Not sure if this sounds attractive? > > Instead of this, I think it can be something simpler IMHO: > > 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the >core-scheduling patchset, such tasks will share a core and sniff on each >other. So let us not pretend that such tasks are not trusted). > > 2. All kernel threads and idle task would have a cookie 0 (so that will cover >ksoftirqd reported in your original issue). > > 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default >enable it. Setting this option would tag all tasks that are forked from a >cookie-0 task with their own cookie. Later on, such tasks can be added to >a group. This cover's PeterZ's ask about having 'default untrusted'). >(Users like ChromeOS that don't want to userspace system processes to be >tagged can disable this option so such tasks will be cookie-0). > > 4. Allow prctl/cgroup interfaces to create groups of tasks and override the >above behaviors. How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set uperf's cookie to be cookie-0 via prctl? Thanks, -Aubrey > > 5. Document everything clearly so the semantics are clear both to the >developers of core scheduling and to system administrators. > > Note that, with the concept of "system trusted cookie", we can also do > optimizations like: > 1. Disable STIBP when switching into trusted tasks. > 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching into >trusted tasks. > > At least #1 seems to be biting enabling HT on ChromeOS right now, and one > other engineer requested I do something like #2 already. > > Once we get full-syscall isolation working, threads belonging to a process > can also share a core so those can just share a core with the task-group > leader. > >>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ? >> >> This is a good question, from the data we measured by uperf, >> SMT+core-scheduling is 28.2% worse than no-SMT, :( > > This is worrying for sure. :-(. We ought to debug/profile it more to see what > is causing the overhead. Me/Vineeth added it as a topic for LPC as well. > > Any other thoughts from others on this? > > thanks, > > - Joel > > >>> thanks, >>> >>> - Joel >>> PS: I am planning to write a patch behind a CONFIG option that tags >>> all processes (default untrusted) so everything gets a cookie which >>> some folks said was how they wanted (have a whitelist instead of >>> blacklist). >>> >>
Re: [RFC PATCH 00/16] Core scheduling v6
On 2020/8/13 7:08, Joel Fernandes wrote: > On Wed, Aug 12, 2020 at 10:01:24AM +0800, Li, Aubrey wrote: >> Hi Joel, >> >> On 2020/8/10 0:44, Joel Fernandes wrote: >>> Hi Aubrey, >>> >>> Apologies for replying late as I was still looking into the details. >>> >>> On Wed, Aug 05, 2020 at 11:57:20AM +0800, Li, Aubrey wrote: >>> [...] >>>> +/* >>>> + * Core scheduling policy: >>>> + * - CORE_SCHED_DISABLED: core scheduling is disabled. >>>> + * - CORE_COOKIE_MATCH: tasks with same cookie can run >>>> + * on the same core concurrently. >>>> + * - CORE_COOKIE_TRUST: trusted task can run with kernel >>>>thread on the same core concurrently. >>>> + * - CORE_COOKIE_LONELY: tasks with cookie can run only >>>> + * with idle thread on the same core. >>>> + */ >>>> +enum coresched_policy { >>>> + CORE_SCHED_DISABLED, >>>> + CORE_SCHED_COOKIE_MATCH, >>>> + CORE_SCHED_COOKIE_TRUST, >>>> + CORE_SCHED_COOKIE_LONELY, >>>> +}; >>>> >>>> We can set policy to CORE_COOKIE_TRUST of uperf cgroup and fix this kind >>>> of performance regression. Not sure if this sounds attractive? >>> >>> Instead of this, I think it can be something simpler IMHO: >>> >>> 1. Consider all cookie-0 task as trusted. (Even right now, if you apply the >>>core-scheduling patchset, such tasks will share a core and sniff on each >>>other. So let us not pretend that such tasks are not trusted). >>> >>> 2. All kernel threads and idle task would have a cookie 0 (so that will >>> cover >>>ksoftirqd reported in your original issue). >>> >>> 3. Add a config option (CONFIG_SCHED_CORE_DEFAULT_TASKS_UNTRUSTED). Default >>>enable it. Setting this option would tag all tasks that are forked from a >>>cookie-0 task with their own cookie. Later on, such tasks can be added to >>>a group. This cover's PeterZ's ask about having 'default untrusted'). >>>(Users like ChromeOS that don't want to userspace system processes to be >>>tagged can disable this option so such tasks will be cookie-0). >>> >>> 4. Allow prctl/cgroup interfaces to create groups of tasks and override the >>>above behaviors. >> >> How does uperf in a cgroup work with ksoftirqd? Are you suggesting I set >> uperf's >> cookie to be cookie-0 via prctl? > > Yes, but let me try to understand better. There are 2 problems here I think: > > 1. ksoftirqd getting idled when HT is turned on, because uperf is sharing a > core with it: This should not be any worse than SMT OFF, because even SMT OFF > would also reduce ksoftirqd's CPU time just core sched is doing. Sure > core-scheduling adds some overhead with IPIs but such a huge drop of perf is > strange. Peter any thoughts on that? > > 2. Interface: To solve the performance problem, you are saying you want uperf > to share a core with ksoftirqd so that it is not forced into idle. Why not > just keep uperf out of the cgroup? I guess this is unacceptable for who runs their apps in container and vm. Thanks, -Aubrey > Then it will have cookie 0 and be able to > share core with kernel threads. About user-user isolation that you need, if > you tag any "untrusted" threads by adding it to CGroup, then there will > automatically isolated from uperf while allowing uperf to share CPU with > kernel threads. > > Please let me know your thoughts and thanks, > > - Joel > >> >> Thanks, >> -Aubrey >>> >>> 5. Document everything clearly so the semantics are clear both to the >>>developers of core scheduling and to system administrators. >>> >>> Note that, with the concept of "system trusted cookie", we can also do >>> optimizations like: >>> 1. 
Disable STIBP when switching into trusted tasks. >>> 2. Disable L1D flushing / verw stuff for L1TF/MDS issues, when switching >>> into >>>trusted tasks. >>> >>> At least #1 seems to be biting enabling HT on ChromeOS right now, and one >>> other engineer requested I do something like #2 already. >>> >>> Once we get full-syscall isolation working, threads belonging to a process >>> can also share a core so those can just share a core with the task-group >>> leader. >>> >>>>> Is the uperf throughput worse with SMT+core-scheduling versus no-SMT ? >>>> >>>> This is a good question, from the data we measured by uperf, >>>> SMT+core-scheduling is 28.2% worse than no-SMT, :( >>> >>> This is worrying for sure. :-(. We ought to debug/profile it more to see >>> what >>> is causing the overhead. Me/Vineeth added it as a topic for LPC as well. >>> >>> Any other thoughts from others on this? >>> >>> thanks, >>> >>> - Joel >>> >>> >>>>> thanks, >>>>> >>>>> - Joel >>>>> PS: I am planning to write a patch behind a CONFIG option that tags >>>>> all processes (default untrusted) so everything gets a cookie which >>>>> some folks said was how they wanted (have a whitelist instead of >>>>> blacklist). >>>>> >>>> >>
Re: [RFC PATCH 00/16] Core scheduling v6
On 2020/7/1 5:32, Vineeth Remanan Pillai wrote: > Sixth iteration of the Core-Scheduling feature. > > Core scheduling is a feature that allows only trusted tasks to run > concurrently on cpus sharing compute resources (eg: hyperthreads on a > core). The goal is to mitigate the core-level side-channel attacks > without requiring to disable SMT (which has a significant impact on > performance in some situations). Core scheduling (as of v6) mitigates > user-space to user-space attacks and user to kernel attack when one of > the siblings enters the kernel via interrupts. It is still possible to > have a task attack the sibling thread when it enters the kernel via > syscalls. > > By default, the feature doesn't change any of the current scheduler > behavior. The user decides which tasks can run simultaneously on the > same core (for now by having them in the same tagged cgroup). When a > tag is enabled in a cgroup and a task from that cgroup is running on a > hardware thread, the scheduler ensures that only idle or trusted tasks > run on the other sibling(s). Besides security concerns, this feature > can also be beneficial for RT and performance applications where we > want to control how tasks make use of SMT dynamically. > > This iteration is mostly a cleanup of v5 except for a major feature of > pausing sibling when a cpu enters kernel via nmi/irq/softirq. Also > introducing documentation and includes minor crash fixes. > > One major cleanup was removing the hotplug support and related code. > The hotplug related crashes were not documented and the fixes piled up > over time leading to complex code. We were not able to reproduce the > crashes in the limited testing done. But if they are reroducable, we > don't want to hide them. We should document them and design better > fixes if any. > > In terms of performance, the results in this release are similar to > v5. On a x86 system with N hardware threads: > - if only N/2 hardware threads are busy, the performance is similar > between baseline, corescheduling and nosmt > - if N hardware threads are busy with N different corescheduling > groups, the impact of corescheduling is similar to nosmt > - if N hardware threads are busy and multiple active threads share the > same corescheduling cookie, they gain a performance improvement over > nosmt. > The specific performance impact depends on the workload, but for a > really busy database 12-vcpu VM (1 coresched tag) running on a 36 > hardware threads NUMA node with 96 mostly idle neighbor VMs (each in > their own coresched tag), the performance drops by 54% with > corescheduling and drops by 90% with nosmt. > We found uperf(in cgroup) throughput drops by ~50% with corescheduling. The problem is, uperf triggered a lot of softirq and offloaded softirq service to *ksoftirqd* thread. - default, ksoftirqd thread can run with uperf on the same core, we saw 100% CPU utilization. - coresched enabled, ksoftirqd's core cookie is different from uperf, so they can't run concurrently on the same core, we saw ~15% forced idle. I guess this kind of performance drop can be replicated by other similar (a lot of softirq activities) workloads. Currently core scheduler picks cookie-match tasks for all SMT siblings, does it make sense we add a policy to allow cookie-compatible task running together? For example, if a task is trusted(set by admin), it can work with kernel thread. The difference from corescheduling disabled is that we still have user to user isolation. Thanks, -Aubrey
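As a rough illustration of the cookie-compatible policy floated here, the core-wide selection could relax the exact-match test along the lines below; core_cookie exists in the core-scheduling series, while the per-task "core_trusted" flag is purely hypothetical and would be set by the administrator:

/*
 * Sketch only: let a trusted task share a core with kernel threads
 * (which carry no cookie), while untrusted user tasks still require an
 * exact cookie match, preserving user-to-user isolation.
 */
static bool core_cookie_compatible(struct task_struct *a, struct task_struct *b)
{
	if (a->core_cookie == b->core_cookie)
		return true;

	if ((a->flags & PF_KTHREAD) && b->core_trusted)
		return true;
	if ((b->flags & PF_KTHREAD) && a->core_trusted)
		return true;

	return false;
}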
Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/12/11 23:07, Vincent Guittot wrote: > On Thu, 10 Dec 2020 at 02:44, Aubrey Li wrote: >> >> Add idle cpumask to track idle cpus in sched domain. Every time >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup >> target. And if the CPU is not in idle, the CPU is cleared in idle >> cpumask during scheduler tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has lower cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql and >> kbuild have been tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs, no regression >> found. >> >> v7->v8: >> - refine update_idle_cpumask, no functionality change >> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y >> >> v6->v7: >> - place the whole idle cpumask mechanism under CONFIG_SMP >> >> v5->v6: >> - decouple idle cpumask update from stop_tick signal, set idle CPU >> in idle cpumask every time the CPU enters idle >> >> v4->v5: >> - add update_idle_cpumask for s2idle case >> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >> idle_cpumask() everywhere >> >> v3->v4: >> - change setting idle cpumask from every idle entry to tickless idle >> if cpu driver is available >> - move clearing idle cpumask to scheduler_tick to decouple nohz mode >> >> v2->v3: >> - change setting idle cpumask to every idle entry, otherwise schbench >> has a regression of 99th percentile latency >> - change clearing idle cpumask to nohz_balancer_kick(), so updating >> idle cpumask is ratelimited in the idle exiting path >> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target >> >> v1->v2: >> - idle cpumask is updated in the nohz routines, by initializing idle >> cpumask with sched_domain_span(sd), nohz=off case remains the original >> behavior >> >> Cc: Peter Zijlstra >> Cc: Mel Gorman >> Cc: Vincent Guittot >> Cc: Qais Yousef >> Cc: Valentin Schneider >> Cc: Jiang Biao >> Cc: Tim Chen >> Signed-off-by: Aubrey Li > > This version looks good to me. I don't see regressions of v5 anymore > and see some improvements on heavy cases v5 or v8? > > Reviewed-by: Vincent Guittot > >> --- >> include/linux/sched/topology.h | 13 ++ >> kernel/sched/core.c| 2 ++ >> kernel/sched/fair.c| 45 +- >> kernel/sched/idle.c| 5 >> kernel/sched/sched.h | 4 +++ >> kernel/sched/topology.c| 3 ++- >> 6 files changed, 70 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h >> index 820511289857..b47b85163607 100644 >> --- a/include/linux/sched/topology.h >> +++ b/include/linux/sched/topology.h >> @@ -65,8 +65,21 @@ struct sched_domain_shared { >> atomic_tref; >> atomic_tnr_busy_cpus; >> int has_idle_cores; >> + /* >> +* Span of all idle CPUs in this domain. >> +* >> +* NOTE: this field is variable length. 
(Allocated dynamically >> +* by attaching extra space to the end of the structure, >> +* depending on how many CPUs the kernel has booted up with) >> +*/ >> + unsigned long idle_cpus_span[]; >> }; >> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds) >> +{ >> + return to_cpumask(sds->idle_cpus_span); >> +} >> + >> struct sched_domain { >> /* These fields must be setup */ >> struct sched_domain __rcu *parent; /* top domain must be null >> terminated */ >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index c4da7e17b906..b136e2440ea4 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -4011,6 +4011,7 @@ void scheduler_tick(void) >> >> #ifdef CONFIG_SMP >> rq->idle_balance = idle_cpu(cpu); >> + update_idle_cpumask(cpu, rq->idle_balance); >> trigger_load_balance(rq); >> #endif >> } >> @@ -7186,6 +7187,7 @@ void __init sched_init(void) >> rq->idle_stamp = 0; >> rq->avg_idle = 2*sysctl_sched_migration_cost; >> rq->max_idle_balance_cost = sysctl_sched_migration_cost; >> + rq->last_idle_state = 1; >> >> INIT_LIST_HEAD(&rq->cfs_tasks); >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index c0c4d9ad7da8..25f36ecfee54 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> >> time = cpu_clock(this); >> >> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); >> + /* >> +* sched_domain_shared is set only at shared cache level, >> +* this works only because select_idle_cpu is ca
Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/12/11 23:22, Vincent Guittot wrote: > On Fri, 11 Dec 2020 at 16:19, Li, Aubrey wrote: >> >> On 2020/12/11 23:07, Vincent Guittot wrote: >>> On Thu, 10 Dec 2020 at 02:44, Aubrey Li wrote: >>>> >>>> Add idle cpumask to track idle cpus in sched domain. Every time >>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup >>>> target. And if the CPU is not in idle, the CPU is cleared in idle >>>> cpumask during scheduler tick to ratelimit idle cpumask update. >>>> >>>> When a task wakes up to select an idle cpu, scanning idle cpumask >>>> has lower cost than scanning all the cpus in last level cache domain, >>>> especially when the system is heavily loaded. >>>> >>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql and >>>> kbuild have been tested on a x86 4 socket system with 24 cores per >>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression >>>> found. >>>> >>>> v7->v8: >>>> - refine update_idle_cpumask, no functionality change >>>> - fix a suspicious RCU usage warning with CONFIG_PROVE_RCU=y >>>> >>>> v6->v7: >>>> - place the whole idle cpumask mechanism under CONFIG_SMP >>>> >>>> v5->v6: >>>> - decouple idle cpumask update from stop_tick signal, set idle CPU >>>> in idle cpumask every time the CPU enters idle >>>> >>>> v4->v5: >>>> - add update_idle_cpumask for s2idle case >>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >>>> idle_cpumask() everywhere >>>> >>>> v3->v4: >>>> - change setting idle cpumask from every idle entry to tickless idle >>>> if cpu driver is available >>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode >>>> >>>> v2->v3: >>>> - change setting idle cpumask to every idle entry, otherwise schbench >>>> has a regression of 99th percentile latency >>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating >>>> idle cpumask is ratelimited in the idle exiting path >>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target >>>> >>>> v1->v2: >>>> - idle cpumask is updated in the nohz routines, by initializing idle >>>> cpumask with sched_domain_span(sd), nohz=off case remains the original >>>> behavior >>>> >>>> Cc: Peter Zijlstra >>>> Cc: Mel Gorman >>>> Cc: Vincent Guittot >>>> Cc: Qais Yousef >>>> Cc: Valentin Schneider >>>> Cc: Jiang Biao >>>> Cc: Tim Chen >>>> Signed-off-by: Aubrey Li >>> >>> This version looks good to me. I don't see regressions of v5 anymore >>> and see some improvements on heavy cases >> >> v5 or v8? > > the v8 looks good to me and I don't see the regressions that I have > seen with the v5 anymore > Sounds great, thanks, :) > >> >>> >>> Reviewed-by: Vincent Guittot >>> >>>> --- >>>> include/linux/sched/topology.h | 13 ++ >>>> kernel/sched/core.c| 2 ++ >>>> kernel/sched/fair.c| 45 +- >>>> kernel/sched/idle.c| 5 >>>> kernel/sched/sched.h | 4 +++ >>>> kernel/sched/topology.c| 3 ++- >>>> 6 files changed, 70 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/include/linux/sched/topology.h >>>> b/include/linux/sched/topology.h >>>> index 820511289857..b47b85163607 100644 >>>> --- a/include/linux/sched/topology.h >>>> +++ b/include/linux/sched/topology.h >>>> @@ -65,8 +65,21 @@ struct sched_domain_shared { >>>> atomic_tref; >>>> atomic_tnr_busy_cpus; >>>> int has_idle_cores; >>>> + /* >>>> +* Span of all idle CPUs in this domain. >>>> +* >>>> +* NOTE: this field is variable length. 
(Allocated dynamically >>>> +* by attaching extra space to the end of the structure, >>>> +* depending on how many CPUs the kernel has booted up with) >>>> +*/ >>>> + unsigned long idle_cpus_span[]; >>>> };
Re: [RFC PATCH 0/4] Reduce worst-case scanning of runqueues in select_idle_sibling
On 2020/12/7 23:42, Mel Gorman wrote: > On Mon, Dec 07, 2020 at 04:04:41PM +0100, Vincent Guittot wrote: >> On Mon, 7 Dec 2020 at 10:15, Mel Gorman wrote: >>> >>> This is a minimal series to reduce the amount of runqueue scanning in >>> select_idle_sibling in the worst case. >>> >>> Patch 1 removes SIS_AVG_CPU because it's unused. >>> >>> Patch 2 improves the hit rate of p->recent_used_cpu to reduce the amount >>> of scanning. It should be relatively uncontroversial >>> >>> Patch 3-4 scans the runqueues in a single pass for select_idle_core() >>> and select_idle_cpu() so runqueues are not scanned twice. It's >>> a tradeoff because it benefits deep scans but introduces overhead >>> for shallow scans. >>> >>> Even if patch 3-4 is rejected to allow more time for Aubrey's idle cpu mask >> >> patch 3 looks fine and doesn't collide with Aubrey's work. But I don't >> like patch 4 which manipulates different cpumask including >> load_balance_mask out of LB and I prefer to wait for v6 of Aubrey's >> patchset which should fix the problem of possibly scanning twice busy >> cpus in select_idle_core and select_idle_cpu >> > > Seems fair, we can see where we stand after V6 of Aubrey's work. A lot > of the motivation for patch 4 would go away if we managed to avoid calling > select_idle_core() unnecessarily. As it stands, we can call it a lot from > hackbench even though the chance of getting an idle core are minimal. > Sorry for the delay, I sent v6 out just now. Comparing to v5, v6 followed Vincent's suggestion to decouple idle cpumask update from stop_tick signal, that is, the CPU is set in idle cpumask every time the CPU enters idle, this should address Peter's concern about the facebook trail-latency workload, as I didn't see any regression in schbench workload 99.th latency report. However, I also didn't see any significant benefit so far, probably I should put more load on the system. I'll do more characterization of uperf workload to see if I can find anything. Thanks, -Aubrey
Re: [RFC PATCH v6] sched/fair: select idle cpu from idle cpumask for task wakeup
Hi Peter, Thanks for the comments. On 2020/12/8 22:16, Peter Zijlstra wrote: > On Tue, Dec 08, 2020 at 09:49:57AM +0800, Aubrey Li wrote: >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index c4da7e17b906..b8af602dea79 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -3999,6 +3999,7 @@ void scheduler_tick(void) >> rq_lock(rq, &rf); >> >> update_rq_clock(rq); >> +update_idle_cpumask(rq, false); > > Does that really need to be done with rq->lock held?> >> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); >> update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); > >> @@ -6808,6 +6813,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, >> struct rq_flags *rf) >> } >> #endif /* CONFIG_SMP */ >> >> +/* >> + * Update cpu idle state and record this information >> + * in sd_llc_shared->idle_cpus_span. >> + */ >> +void update_idle_cpumask(struct rq *rq, bool set_idle) >> +{ >> +struct sched_domain *sd; >> +int cpu = cpu_of(rq); >> +int idle_state; >> + >> +/* >> + * If called from scheduler tick, only update >> + * idle cpumask if the CPU is busy, as idle >> + * cpumask is also updated on idle entry. >> + * >> + */ >> +if (!set_idle && idle_cpu(cpu)) >> +return; > > scheduler_tick() already calls idle_cpu() when SMP. > >> +/* >> + * Also set SCHED_IDLE cpu in idle cpumask to >> + * allow SCHED_IDLE cpu as a wakeup target >> + */ >> +idle_state = set_idle || sched_idle_cpu(cpu); >> +/* >> + * No need to update idle cpumask if the state >> + * does not change. >> + */ >> +if (rq->last_idle_state == idle_state) >> +return; >> + >> +rcu_read_lock(); > > This is called with IRQs disabled, surely we can forgo rcu_read_lock() > here. > >> +sd = rcu_dereference(per_cpu(sd_llc, cpu)); >> +if (!sd || !sd->shared) >> +goto unlock; > > I don't think !sd->shared is possible here. > >> +if (idle_state) >> +cpumask_set_cpu(cpu, sds_idle_cpus(sd->shared)); >> +else >> +cpumask_clear_cpu(cpu, sds_idle_cpus(sd->shared)); >> + >> +rq->last_idle_state = idle_state; >> +unlock: >> +rcu_read_unlock(); >> +} >> + >> static unsigned long wakeup_gran(struct sched_entity *se) >> { >> unsigned long gran = sysctl_sched_wakeup_granularity; >> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c >> index f324dc36fc43..f995660edf2b 100644 >> --- a/kernel/sched/idle.c >> +++ b/kernel/sched/idle.c >> @@ -156,6 +156,11 @@ static void cpuidle_idle_call(void) >> return; >> } >> >> +/* >> + * The CPU is about to go idle, set it in idle cpumask >> + * to be a wake up target. >> + */ >> +update_idle_cpumask(this_rq(), true); > > This should be in do_idle(), right around arch_cpu_idle_enter(). > >> /* >> * The RCU framework needs to be told that we are entering an idle >> * section, so no more rcu read side critical sections and one more >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >> index 8d1ca65db3b0..db460b20217a 100644 >> --- a/kernel/sched/sched.h >> +++ b/kernel/sched/sched.h >> @@ -1004,6 +1004,7 @@ struct rq { >> /* This is used to determine avg_idle's max value */ >> u64 max_idle_balance_cost; >> #endif /* CONFIG_SMP */ >> +unsigned char last_idle_state; > > All of that is pointless for UP. Also, is this the best location? > Good point, I should put all of these under SMP. I'll refine the patch soon. Thanks, -Aubrey
Re: [PATCH 5/5] sched/fair: Merge select_idle_core/cpu()
On 2021/1/15 18:08, Mel Gorman wrote: > From: Peter Zijlstra (Intel) > > Both select_idle_core() and select_idle_cpu() do a loop over the same > cpumask. Observe that by clearing the already visited CPUs, we can > fold the iteration and iterate a core at a time. > > All we need to do is remember any non-idle CPU we encountered while > scanning for an idle core. This way we'll only iterate every CPU once. > > Signed-off-by: Peter Zijlstra (Intel) > Signed-off-by: Mel Gorman > --- > kernel/sched/fair.c | 97 +++-- > 1 file changed, 59 insertions(+), 38 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 12e08da90024..6c0f841e9e75 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6006,6 +6006,14 @@ static inline int find_idlest_cpu(struct sched_domain > *sd, struct task_struct *p > return new_cpu; > } > > +static inline int __select_idle_cpu(struct task_struct *p, int core, struct > cpumask *cpus) Sorry if I missed anything, why p and cpus are needed here? > +{ > + if (available_idle_cpu(core) || sched_idle_cpu(core)) > + return core; > + > + return -1; > +} > + > #ifdef CONFIG_SCHED_SMT > DEFINE_STATIC_KEY_FALSE(sched_smt_present); > EXPORT_SYMBOL_GPL(sched_smt_present); > @@ -6066,40 +6074,34 @@ void __update_idle_core(struct rq *rq) > * there are no idle cores left in the system; tracked through > * sd_llc->shared->has_idle_cores and enabled through update_idle_core() > above. > */ > -static int select_idle_core(struct task_struct *p, struct sched_domain *sd, > int target) > +static int select_idle_core(struct task_struct *p, int core, struct cpumask > *cpus, int *idle_cpu) > { > - struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); > - int core, cpu; > + bool idle = true; > + int cpu; > > if (!static_branch_likely(&sched_smt_present)) > - return -1; > - > - if (!test_idle_cores(target, false)) > - return -1; > - > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); > + return __select_idle_cpu(p, core, cpus); > > - for_each_cpu_wrap(core, cpus, target) { > - bool idle = true; > - > - for_each_cpu(cpu, cpu_smt_mask(core)) { > - if (!available_idle_cpu(cpu)) { > - idle = false; > - break; > + for_each_cpu(cpu, cpu_smt_mask(core)) { > + if (!available_idle_cpu(cpu)) { > + idle = false; > + if (*idle_cpu == -1) { > + if (sched_idle_cpu(cpu) && > cpumask_test_cpu(cpu, p->cpus_ptr)) { > + *idle_cpu = cpu; > + break; > + } > + continue; > } > + break; > } > - > - if (idle) > - return core; > - > - cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); > + if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr)) > + *idle_cpu = cpu; > } > > - /* > - * Failed to find an idle core; stop looking for one. 
> - */ > - set_idle_cores(target, 0); > + if (idle) > + return core; > > + cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); > return -1; > } > > @@ -6107,9 +6109,18 @@ static int select_idle_core(struct task_struct *p, > struct sched_domain *sd, int > > #define sched_smt_weight 1 > > -static inline int select_idle_core(struct task_struct *p, struct > sched_domain *sd, int target) > +static inline void set_idle_cores(int cpu, int val) > { > - return -1; > +} > + > +static inline bool test_idle_cores(int cpu, bool def) > +{ > + return def; > +} > + > +static inline int select_idle_core(struct task_struct *p, int core, struct > cpumask *cpus, int *idle_cpu) > +{ > + return __select_idle_cpu(p, core, cpus); > } > > #endif /* CONFIG_SCHED_SMT */ > @@ -6124,10 +6135,11 @@ static inline int select_idle_core(struct task_struct > *p, struct sched_domain *s > static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, > int target) > { > struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask); > + int i, cpu, idle_cpu = -1, nr = INT_MAX; > + bool smt = test_idle_cores(target, false); > + int this = smp_processor_id(); > struct sched_domain *this_sd; > u64 time; > - int this = smp_processor_id(); > - int cpu, nr = INT_MAX; > > this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc)); > if (!this_sd) > @@ -6135,7 +6147,7 @@ static int select_idle_cpu(struct task_struct *p, > struct sched_domain *sd, int t > > cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); > > - if (sched_feat(SIS_PROP))
Re: [PATCH 1/6] sched: migration changes for core scheduling
On 2021/3/22 20:56, Peter Zijlstra wrote: > On Mon, Mar 22, 2021 at 08:31:09PM +0800, Li, Aubrey wrote: >> Please let me know if I put cookie match check at the right position >> in task_hot(), if so, I'll obtain some performance data of it. >> >> Thanks, >> -Aubrey >> >> === >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 7f2fb08..d4bdcf9 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1912,6 +1912,13 @@ static void task_numa_find_cpu(struct task_numa_env >> *env, >> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) >> continue; >> >> +/* >> + * Skip this cpu if source task's cookie does not match >> + * with CPU's core cookie. >> + */ >> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) >> +continue; >> + >> env->dst_cpu = cpu; >> if (task_numa_compare(env, taskimp, groupimp, maymove)) >> break; > > This one might need a little help too, I've not fully considered NUMA > balancing though. > I dropped this numa change for now as it may be too strong, too. I'll do more experiment about this on the new iteration. The following patch is rebased on top of queue tree, cookie check is moved from can_migrate_task to task_hot. please let me know if any issues. Thanks, -Aubrey == >From 70d0ed9bab658b0bad60fda73f81b747f20975f0 Mon Sep 17 00:00:00 2001 From: Aubrey Li Date: Tue, 23 Mar 2021 03:26:34 + Subject: [PATCH] sched: migration changes for core scheduling - Don't migrate if there is a cookie mismatch Load balance tries to move task from busiest CPU to the destination CPU. When core scheduling is enabled, if the task's cookie does not match with the destination CPU's core cookie, this task may be skipped by this CPU. This mitigates the forced idle time on the destination CPU. - Select cookie matched idle CPU In the fast path of task wakeup, select the first cookie matched idle CPU instead of the first idle CPU. 
- Find cookie matched idlest CPU In the slow path of task wakeup, find the idlest CPU whose core cookie matches with task's cookie Signed-off-by: Aubrey Li Signed-off-by: Tim Chen Signed-off-by: Vineeth Remanan Pillai Signed-off-by: Joel Fernandes (Google) --- kernel/sched/fair.c | 29 ++ kernel/sched/sched.h | 73 2 files changed, 96 insertions(+), 6 deletions(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index efde8df2bc35..a74061484194 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -5877,11 +5877,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this /* Traverse only the allowed CPUs */ for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) { + struct rq *rq = cpu_rq(i); + + if (!sched_core_cookie_match(rq, p)) + continue; + if (sched_idle_cpu(i)) return i; if (available_idle_cpu(i)) { - struct rq *rq = cpu_rq(i); struct cpuidle_state *idle = idle_get_state(rq); if (idle && idle->exit_latency < min_exit_latency) { /* @@ -5967,9 +5971,10 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p return new_cpu; } -static inline int __select_idle_cpu(int cpu) +static inline int __select_idle_cpu(int cpu, struct task_struct *p) { - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) + if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && + sched_cpu_cookie_match(cpu_rq(cpu), p)) return cpu; return -1; @@ -6039,7 +6044,7 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu int cpu; if (!static_branch_likely(&sched_smt_present)) - return __select_idle_cpu(core); + return __select_idle_cpu(core, p); for_each_cpu(cpu, cpu_smt_mask(core)) { if (!available_idle_cpu(cpu)) { @@ -6077,7 +6082,7 @@ static inline bool test_idle_cores(int cpu, bool def) static inline int select_idle_core(struct task_struct *p, int core, struct cpumask *cpus, int *idle_cpu) { - return __select_idle_cpu(core); + return __select_idle_cpu(core, p); } #endif /* CONFIG_SCHED_SMT */ @@ -6130,7 +6135,7 @@ static int select_idle_cpu(struct ta
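The cookie-match helpers used above boil down to comparing the task's cookie with the cookie currently owning the core; a minimal sketch of the per-CPU check is below, simplified from this thread (the helper in the series additionally special-cases fully idle cores by walking the SMT siblings):

static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
{
	/* With core scheduling disabled on this CPU there is nothing to match. */
	if (!sched_core_enabled(rq))
		return true;

	/* Only place the task here if its cookie matches the core-wide cookie. */
	return rq->core->core_cookie == p->core_cookie;
}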
Re: [RFC PATCH v8] sched/fair: select idle cpu from idle cpumask for task wakeup
Hi Peter, On 2020/12/11 23:07, Vincent Guittot wrote: > On Thu, 10 Dec 2020 at 02:44, Aubrey Li wrote: >> >> Add idle cpumask to track idle cpus in sched domain. Every time >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup >> target. And if the CPU is not in idle, the CPU is cleared in idle >> cpumask during scheduler tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has lower cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql and >> kbuild have been tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs, no regression >> found. >> snip >> >> Cc: Peter Zijlstra >> Cc: Mel Gorman >> Cc: Vincent Guittot >> Cc: Qais Yousef >> Cc: Valentin Schneider >> Cc: Jiang Biao >> Cc: Tim Chen >> Signed-off-by: Aubrey Li > > This version looks good to me. I don't see regressions of v5 anymore > and see some improvements on heavy cases > > Reviewed-by: Vincent Guittot May I know your thoughts about this patch? Is it cpumask operation potentially too expensive to be here? Thanks, -Aubrey > >> --- >> include/linux/sched/topology.h | 13 ++ >> kernel/sched/core.c| 2 ++ >> kernel/sched/fair.c| 45 +- >> kernel/sched/idle.c| 5 >> kernel/sched/sched.h | 4 +++ >> kernel/sched/topology.c| 3 ++- >> 6 files changed, 70 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h >> index 820511289857..b47b85163607 100644 >> --- a/include/linux/sched/topology.h >> +++ b/include/linux/sched/topology.h >> @@ -65,8 +65,21 @@ struct sched_domain_shared { >> atomic_tref; >> atomic_tnr_busy_cpus; >> int has_idle_cores; >> + /* >> +* Span of all idle CPUs in this domain. >> +* >> +* NOTE: this field is variable length. (Allocated dynamically >> +* by attaching extra space to the end of the structure, >> +* depending on how many CPUs the kernel has booted up with) >> +*/ >> + unsigned long idle_cpus_span[]; >> }; >> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds) >> +{ >> + return to_cpumask(sds->idle_cpus_span); >> +} >> + >> struct sched_domain { >> /* These fields must be setup */ >> struct sched_domain __rcu *parent; /* top domain must be null >> terminated */ >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index c4da7e17b906..b136e2440ea4 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -4011,6 +4011,7 @@ void scheduler_tick(void) >> >> #ifdef CONFIG_SMP >> rq->idle_balance = idle_cpu(cpu); >> + update_idle_cpumask(cpu, rq->idle_balance); >> trigger_load_balance(rq); >> #endif >> } >> @@ -7186,6 +7187,7 @@ void __init sched_init(void) >> rq->idle_stamp = 0; >> rq->avg_idle = 2*sysctl_sched_migration_cost; >> rq->max_idle_balance_cost = sysctl_sched_migration_cost; >> + rq->last_idle_state = 1; >> >> INIT_LIST_HEAD(&rq->cfs_tasks); >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index c0c4d9ad7da8..25f36ecfee54 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6146,7 +6146,12 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> >> time = cpu_clock(this); >> >> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); >> + /* >> +* sched_domain_shared is set only at shared cache level, >> +* this works only because select_idle_cpu is called with >> +* sd_llc. 
>> +*/ >> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr); >> >> for_each_cpu_wrap(cpu, cpus, target) { >> if (!--nr) >> @@ -6806,6 +6811,44 @@ balance_fair(struct rq *rq, struct task_struct *prev, >> struct rq_flags *rf) >> >> return newidle_balance(rq, rf) != 0; >> } >> + >> +/* >> + * Update cpu idle state and record this information >> + * in sd_llc_shared->idle_cpus_span. >> + * >> + * This function is called with interrupts disabled. >> + */ >> +void update_idle_cpumask(int cpu, bool idle) >> +{ >> + struct sched_domain *sd; >> + struct rq *rq = cpu_rq(cpu); >> + int idle_state; >> + >> + /* >> +* Also set SCHED_IDLE cpu in idle cpumask to >> +* allow SCHED_IDLE cpu as a wakeup target. >> +*/ >> + idle_state = idle || sched_idle_cpu(cpu); >> + /* >> +* No need to update idle cpumask if the state >> +* does not change. >> +*/ >> + if (rq->last_idle_state == idle_state)
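The wakeup-side change above boils down to a single line; condensed from the quoted diff for clarity (the surrounding select_idle_cpu() context is unchanged):

	/* Before: scan every allowed CPU in the whole LLC domain. */
	cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

	/*
	 * After (this patch): scan only the CPUs currently marked idle in the
	 * shared idle cpumask. sd->shared is valid here because, as the comment
	 * in the diff notes, select_idle_cpu() is only called with sd_llc,
	 * where sched_domain_shared is set.
	 */
	cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);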
Re: [PATCH 1/6] sched: migration changes for core scheduling
Hi Peter, On 2021/3/20 23:34, Peter Zijlstra wrote: > On Fri, Mar 19, 2021 at 04:32:48PM -0400, Joel Fernandes (Google) wrote: >> @@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct >> lb_env *env) >> * We do not migrate tasks that are: >> * 1) throttled_lb_pair, or >> * 2) cannot be migrated to this CPU due to cpus_ptr, or >> - * 3) running (obviously), or >> - * 4) are cache-hot on their current CPU. >> + * 3) task's cookie does not match with this CPU's core cookie >> + * 4) running (obviously), or >> + * 5) are cache-hot on their current CPU. >> */ >> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) >> return 0; >> @@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct >> lb_env *env) >> return 0; >> } >> >> +/* >> + * Don't migrate task if the task's cookie does not match >> + * with the destination CPU's core cookie. >> + */ >> +if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p)) >> +return 0; >> + >> /* Record that we found atleast one task that could run on dst_cpu */ >> env->flags &= ~LBF_ALL_PINNED; >> > > This one is too strong.. persistent imbalance should be able to override > it. > IIRC, this change can avoid the following scenario: One sysbench cpu thread(cookieA) and sysbench mysql thread(cookieB) running on the two siblings of core_1, the other sysbench cpu thread(cookieA) and sysbench mysql thread(cookieB) running on the two siblings of core2, which causes 50% force idle. This is not an imbalance case. Thanks, -Aubrey
Re: [PATCH] sched/fair: remove redundant test_idle_cores for non-smt
Hi Barry, On 2021/3/21 6:14, Barry Song wrote: > update_idle_core() is only done for the case of sched_smt_present. > but test_idle_cores() is done for all machines even those without > smt. The patch looks good to me. May I know for what case we need to keep CONFIG_SCHED_SMT for non-smt machines? Thanks, -Aubrey > this could contribute to up 8%+ hackbench performance loss on a > machine like kunpeng 920 which has no smt. this patch removes the > redundant test_idle_cores() for non-smt machines. > > we run the below hackbench with different -g parameter from 2 to > 14, for each different g, we run the command 10 times and get the > average time: > $ numactl -N 0 hackbench -p -T -l 2 -g $1 > > hackbench will report the time which is needed to complete a certain > number of messages transmissions between a certain number of tasks, > for example: > $ numactl -N 0 hackbench -p -T -l 2 -g 10 > Running in threaded mode with 10 groups using 40 file descriptors each > (== 400 tasks) > Each sender will pass 2 messages of 100 bytes > > The below is the result of hackbench w/ and w/o this patch: > g=2 4 6 8 10 12 14 > w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929 > w/ : 1.8428 3.7436 5.4501 6.9522 8.2882 9.9535 11.3367 > +4.1% +8.3% +7.3% +6.3% > > Signed-off-by: Barry Song > --- > kernel/sched/fair.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 2e2ab1e..de42a32 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6038,9 +6038,11 @@ static inline bool test_idle_cores(int cpu, bool def) > { > struct sched_domain_shared *sds; > > - sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); > - if (sds) > - return READ_ONCE(sds->has_idle_cores); > + if (static_branch_likely(&sched_smt_present)) { > + sds = rcu_dereference(per_cpu(sd_llc_shared, cpu)); > + if (sds) > + return READ_ONCE(sds->has_idle_cores); > + } > > return def; > } >
Re: [PATCH 1/6] sched: migration changes for core scheduling
On 2021/3/22 15:48, Peter Zijlstra wrote: > On Sun, Mar 21, 2021 at 09:34:00PM +0800, Li, Aubrey wrote: >> Hi Peter, >> >> On 2021/3/20 23:34, Peter Zijlstra wrote: >>> On Fri, Mar 19, 2021 at 04:32:48PM -0400, Joel Fernandes (Google) wrote: >>>> @@ -7530,8 +7543,9 @@ int can_migrate_task(struct task_struct *p, struct >>>> lb_env *env) >>>> * We do not migrate tasks that are: >>>> * 1) throttled_lb_pair, or >>>> * 2) cannot be migrated to this CPU due to cpus_ptr, or >>>> - * 3) running (obviously), or >>>> - * 4) are cache-hot on their current CPU. >>>> + * 3) task's cookie does not match with this CPU's core cookie >>>> + * 4) running (obviously), or >>>> + * 5) are cache-hot on their current CPU. >>>> */ >>>>if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) >>>>return 0; >>>> @@ -7566,6 +7580,13 @@ int can_migrate_task(struct task_struct *p, struct >>>> lb_env *env) >>>>return 0; >>>>} >>>> >>>> + /* >>>> + * Don't migrate task if the task's cookie does not match >>>> + * with the destination CPU's core cookie. >>>> + */ >>>> + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p)) >>>> + return 0; >>>> + >>>>/* Record that we found atleast one task that could run on dst_cpu */ >>>>env->flags &= ~LBF_ALL_PINNED; >>>> >>> >>> This one is too strong.. persistent imbalance should be able to override >>> it. >>> >> >> IIRC, this change can avoid the following scenario: >> >> One sysbench cpu thread(cookieA) and sysbench mysql thread(cookieB) running >> on the two siblings of core_1, the other sysbench cpu thread(cookieA) and >> sysbench mysql thread(cookieB) running on the two siblings of core2, which >> causes 50% force idle. >> >> This is not an imbalance case. > > But suppose there is an imbalance; then this cookie crud can forever > stall balance. > > Imagine this cpu running a while(1); with a uniqie cookie on, then it > will _never_ accept other tasks == BAD. > How about putting the following check in sched_core_cookie_match()? + /* +* Ignore cookie match if there is a big imbalance between the src rq +* and dst rq. +*/ + if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1) + return true; This change has significant impact of my sysbench cpu+mysql colocation. - with this change, sysbench cpu tput = 2796 events/s, sysbench mysql = 1315 events/s - without it, sysbench cpu tput= 3513 events/s, sysbench mysql = 646 events. Do you have any suggestions before we drop it? Thanks, -Aubrey
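To make the proposal concrete, here is a minimal sketch of how the override could be wired into the helper, assuming sched_core_cookie_match() gains a src_rq parameter; the helper in the posted series takes only the destination rq and the task, and also special-cases an idle core, so treat this as an illustration rather than the actual patch:

	/* Hypothetical variant with the imbalance override discussed above. */
	static inline bool sched_core_cookie_match(struct rq *src_rq, struct rq *rq,
						   struct task_struct *p)
	{
		/* Cookie matching only matters when core scheduling is enabled. */
		if (!sched_core_enabled(rq))
			return true;

		/*
		 * Ignore the cookie match if there is a big imbalance between
		 * the source rq and the destination rq, so that a persistent
		 * imbalance can still be resolved by migration.
		 */
		if ((src_rq->cfs.h_nr_running - rq->cfs.h_nr_running) > 1)
			return true;

		return rq->core->core_cookie == p->core_cookie;
	}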
Re: [PATCH 1/6] sched: migration changes for core scheduling
On 2021/3/22 16:57, Peter Zijlstra wrote: > >> Do you have any suggestions before we drop it? > > Yeah, how about you make it part of task_hot() ? Have task_hot() refuse > migration it the cookie doesn't match. > > task_hot() is a hint and will get ignored when appropriate. > Please let me know if I put cookie match check at the right position in task_hot(), if so, I'll obtain some performance data of it. Thanks, -Aubrey === diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 7f2fb08..d4bdcf9 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1912,6 +1912,13 @@ static void task_numa_find_cpu(struct task_numa_env *env, if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) continue; + /* +* Skip this cpu if source task's cookie does not match +* with CPU's core cookie. +*/ + if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) + continue; + env->dst_cpu = cpu; if (task_numa_compare(env, taskimp, groupimp, maymove)) break; @@ -5847,11 +5854,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this /* Traverse only the allowed CPUs */ for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) { + struct rq *rq = cpu_rq(i); + + if (!sched_core_cookie_match(rq, p)) + continue; + if (sched_idle_cpu(i)) return i; if (available_idle_cpu(i)) { - struct rq *rq = cpu_rq(i); struct cpuidle_state *idle = idle_get_state(rq); if (idle && idle->exit_latency < min_exit_latency) { /* @@ -6109,7 +6120,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t for_each_cpu_wrap(cpu, cpus, target) { if (!--nr) return -1; - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) + + if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && + sched_cpu_cookie_match(cpu_rq(cpu), p)) break; } @@ -7427,6 +7440,14 @@ static int task_hot(struct task_struct *p, struct lb_env *env) if (sysctl_sched_migration_cost == -1) return 1; + + /* +* Don't migrate task if the task's cookie does not match +* with the destination CPU's core cookie. +*/ + if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p)) + return 1; + if (sysctl_sched_migration_cost == 0) return 0; @@ -8771,6 +8792,10 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu) p->cpus_ptr)) continue; + /* Skip over this group if no cookie matched */ + if (!sched_group_cookie_match(cpu_rq(this_cpu), p, group)) + continue; + local_group = cpumask_test_cpu(this_cpu, sched_group_span(group)); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index f094435..13254ea 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1093,6 +1093,7 @@ static inline int cpu_of(struct rq *rq) #ifdef CONFIG_SCHED_CORE DECLARE_STATIC_KEY_FALSE(__sched_core_enabled); +static inline struct cpumask *sched_group_span(struct sched_group *sg); static inline bool sched_core_enabled(struct rq *rq) { @@ -1109,6 +1110,61 @@ static inline raw_spinlock_t *rq_lockp(struct rq *rq) bool cfs_prio_less(struct task_struct *a, struct task_struct *b, bool fi); +/* + * Helpers to check if the CPU's core cookie matches with the task's cookie + * when core scheduling is enabled. + * A special case is that the task's cookie always matches with CPU's core + * cookie if the CPU is in an idle core. + */ +static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p) +{ + /* Ignore cookie match if core scheduler is not enabled on the CPU. 
*/ + if (!sched_core_enabled(rq)) + return true; + + return rq->core->core_cookie == p->core_cookie; +} + +static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p) +{ + bool idle_core = true; + int cpu; + + /* Ignore cookie match if core scheduler is not enabled on the CPU. */ + if (!sched_core_enabled(rq)) + return true; + + for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) { + if (!available_idle_cpu(cpu)) { + idle_core = false; + break; + } + } + + /* +* A CPU in an idle core is always the best choic
Re: [PATCH 1/6] sched: migration changes for core scheduling
On 2021/3/22 20:56, Peter Zijlstra wrote: > On Mon, Mar 22, 2021 at 08:31:09PM +0800, Li, Aubrey wrote: >> Please let me know if I put cookie match check at the right position >> in task_hot(), if so, I'll obtain some performance data of it. >> >> Thanks, >> -Aubrey >> >> === >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 7f2fb08..d4bdcf9 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1912,6 +1912,13 @@ static void task_numa_find_cpu(struct task_numa_env >> *env, >> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) >> continue; >> >> +/* >> + * Skip this cpu if source task's cookie does not match >> + * with CPU's core cookie. >> + */ >> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) >> +continue; >> + >> env->dst_cpu = cpu; >> if (task_numa_compare(env, taskimp, groupimp, maymove)) >> break; > > This one might need a little help too, I've not fully considered NUMA > balancing though. > >> @@ -6109,7 +6120,9 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> for_each_cpu_wrap(cpu, cpus, target) { >> if (!--nr) >> return -1; >> -if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) >> + >> +if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) && >> +sched_cpu_cookie_match(cpu_rq(cpu), p)) >> break; >> } >> > > This doesn't even apply... That code has changed. > >> @@ -7427,6 +7440,14 @@ static int task_hot(struct task_struct *p, struct >> lb_env *env) >> >> if (sysctl_sched_migration_cost == -1) >> return 1; >> + >> +/* >> + * Don't migrate task if the task's cookie does not match >> + * with the destination CPU's core cookie. >> + */ >> +if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p)) >> +return 1; >> + >> if (sysctl_sched_migration_cost == 0) >> return 0; >> > > Should work I think, but you've put it in a weird spot for breaking up > that sysctl_sched_migration_cost thing. I'd have put it either in front > or after that other SMT thing we have there. > I did it on purpose. If the migration cost is huge (== -1), the task should not be migrated at all, no matter whether the cookie matches or not, so the cookie check has to come after the sysctl_sched_migration_cost == -1 check. And if the migration cost is 0, or delta is less than the migration cost, the task can be migrated, but before migrating we need to check whether the cookie matches, so the cookie check has to come before the sysctl_sched_migration_cost == 0 check. Please correct me if I'm wrong. Thanks, -Aubrey
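To make the ordering concrete, this is the shape of task_hot() being described, reconstructed from the quoted diff (delta is the time since the task last ran, computed earlier in the function; this is a sketch, not the final upstream code):

	if (sysctl_sched_migration_cost == -1)
		return 1;	/* "infinite" migration cost: never migrate, cookie is irrelevant */

	/*
	 * The task may be migrated below this point, so refuse the migration
	 * here if its cookie does not match the destination CPU's core cookie.
	 */
	if (!sched_core_cookie_match(cpu_rq(env->dst_cpu), p))
		return 1;

	if (sysctl_sched_migration_cost == 0)
		return 0;	/* zero migration cost: never considered cache-hot */

	return delta < (s64)sysctl_sched_migration_cost;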
Re: [PATCH v2] sched/fair: reduce long-tail newly idle balance cost
On 2021/2/24 16:15, Aubrey Li wrote: > A long-tail load balance cost is observed on the newly idle path, > this is caused by a race window between the first nr_running check > of the busiest runqueue and its nr_running recheck in detach_tasks. > > Before the busiest runqueue is locked, the tasks on the busiest > runqueue could be pulled by other CPUs and nr_running of the busiest > runqueu becomes 1 or even 0 if the running task becomes idle, this > causes detach_tasks breaks with LBF_ALL_PINNED flag set, and triggers > load_balance redo at the same sched_domain level. > > In order to find the new busiest sched_group and CPU, load balance will > recompute and update the various load statistics, which eventually leads > to the long-tail load balance cost. > > This patch clears LBF_ALL_PINNED flag for this race condition, and hence > reduces the long-tail cost of newly idle balance. Ping... > > Cc: Vincent Guittot > Cc: Mel Gorman > Cc: Andi Kleen > Cc: Tim Chen > Cc: Srinivas Pandruvada > Cc: Rafael J. Wysocki > Signed-off-by: Aubrey Li > --- > kernel/sched/fair.c | 9 + > 1 file changed, 9 insertions(+) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 04a3ce2..5c67804 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -7675,6 +7675,15 @@ static int detach_tasks(struct lb_env *env) > > lockdep_assert_held(&env->src_rq->lock); > > + /* > + * Source run queue has been emptied by another CPU, clear > + * LBF_ALL_PINNED flag as we will not test any task. > + */ > + if (env->src_rq->nr_running <= 1) { > + env->flags &= ~LBF_ALL_PINNED; > + return 0; > + } > + > if (env->imbalance <= 0) > return 0; > >
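For context on why clearing the flag helps, here is a paraphrased sketch of the caller side in load_balance(), not verbatim kernel code: when detach_tasks() moves nothing and LBF_ALL_PINNED is still set, the busiest CPU is dropped and the balance is redone at the same sched_domain level, which is where the long tail comes from when the "all pinned" result really only meant the runqueue had already been emptied.

	if (!ld_moved && (env.flags & LBF_ALL_PINNED)) {
		/* Drop the busiest CPU and retry at the same level. */
		__cpumask_clear_cpu(cpu_of(busiest), cpus);
		if (!cpumask_subset(cpus, env.dst_grpmask)) {
			env.loop = 0;
			goto redo;	/* recompute group/CPU load statistics */
		}
	}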
Re: [RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/11/18 20:06, Valentin Schneider wrote: > > On 16/11/20 20:04, Aubrey Li wrote: >> From: Aubrey Li >> >> Add idle cpumask to track idle cpus in sched domain. When a CPU >> enters idle, if the idle driver indicates to stop tick, this CPU >> is set in the idle cpumask to be a wakeup target. And if the CPU >> is not in idle, the CPU is cleared in idle cpumask during scheduler >> tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has low cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks were tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and >> schbench have no notable change, uperf has: >> >> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90 >> >> threads baseline-avg%stdpatch-avg %std >> 961 0.831.233.27 >> 144 1 1.031.672.67 >> 192 1 0.691.813.59 >> 240 1 2.841.512.67 >> >> Cc: Mel Gorman >> Cc: Vincent Guittot >> Cc: Qais Yousef >> Cc: Valentin Schneider >> Cc: Jiang Biao >> Cc: Tim Chen >> Signed-off-by: Aubrey Li > > That's missing a v3 -> v4 change summary > okay, I'll add in the next version soon. Thanks, -Aubrey
Re: [RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup
Hi Vincent, On 2020/11/18 21:36, Vincent Guittot wrote: > On Wed, 18 Nov 2020 at 04:48, Aubrey Li wrote: >> >> From: Aubrey Li >> >> Add idle cpumask to track idle cpus in sched domain. When a CPU >> enters idle, if the idle driver indicates to stop tick, this CPU >> is set in the idle cpumask to be a wakeup target. And if the CPU >> is not in idle, the CPU is cleared in idle cpumask during scheduler >> tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has low cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks were tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and >> schbench have no notable change, uperf has: >> >> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90 >> >> threads baseline-avg%stdpatch-avg %std >> 961 0.831.233.27 >> 144 1 1.031.672.67 >> 192 1 0.691.813.59 >> 240 1 2.841.512.67 >> >> Cc: Mel Gorman >> Cc: Vincent Guittot >> Cc: Qais Yousef >> Cc: Valentin Schneider >> Cc: Jiang Biao >> Cc: Tim Chen >> Signed-off-by: Aubrey Li >> --- >> include/linux/sched/topology.h | 13 + >> kernel/sched/core.c| 2 ++ >> kernel/sched/fair.c| 52 +- >> kernel/sched/idle.c| 7 +++-- >> kernel/sched/sched.h | 2 ++ >> kernel/sched/topology.c| 3 +- >> 6 files changed, 74 insertions(+), 5 deletions(-) >> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h >> index 820511289857..b47b85163607 100644 >> --- a/include/linux/sched/topology.h >> +++ b/include/linux/sched/topology.h >> @@ -65,8 +65,21 @@ struct sched_domain_shared { >> atomic_tref; >> atomic_tnr_busy_cpus; >> int has_idle_cores; >> + /* >> +* Span of all idle CPUs in this domain. >> +* >> +* NOTE: this field is variable length. 
(Allocated dynamically >> +* by attaching extra space to the end of the structure, >> +* depending on how many CPUs the kernel has booted up with) >> +*/ >> + unsigned long idle_cpus_span[]; >> }; >> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds) >> +{ >> + return to_cpumask(sds->idle_cpus_span); >> +} >> + >> struct sched_domain { >> /* These fields must be setup */ >> struct sched_domain __rcu *parent; /* top domain must be null >> terminated */ >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index b1e0da56abca..c86ae0495163 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -3994,6 +3994,7 @@ void scheduler_tick(void) >> rq_lock(rq, &rf); >> >> update_rq_clock(rq); >> + update_idle_cpumask(rq, false); >> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); >> update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure); >> curr->sched_class->task_tick(rq, curr, 0); >> @@ -7192,6 +7193,7 @@ void __init sched_init(void) >> rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func); >> #endif >> #endif /* CONFIG_SMP */ >> + rq->last_idle_state = 1; >> hrtick_rq_init(rq); >> atomic_set(&rq->nr_iowait, 0); >> } >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 48a6d442b444..d67fba5e406b 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6145,7 +6145,12 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> >> time = cpu_clock(this); >> >> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); >> + /* >> +* sched_domain_shared is set only at shared cache level, >> +* this works only because select_idle_cpu is called with >> +* sd_llc. >> +*/ >> + cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr); >> >> for_each_cpu_wrap(cpu, cpus, target) { >> if (!--nr) >> @@ -6807,6 +6812,51 @@ balance_fair(struct rq *rq, struct task_struct *prev, >> struct rq_flags *rf) >> } >> #endif /* CONFIG_SMP */ >> >> +/* >> + * Update cpu idle state and record this information >> + * in sd_llc_shared->idle_cpus_span. >> + */ >> +void update_idle_cpumask(struct rq *rq, bool set_idle) >> +{ >> + struct sched_domain *sd; >> + int cpu = cpu_of(rq); >> + int idle_state; >> + >> + /* >> +* If called from scheduler tick, only update >> +* idle cpumask if the CPU is busy, as idle >> +* cpumask is also updated on idle entry. >> +* >> +*/ >> + if (!set_idle && idle
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On 2020/12/9 0:03, Vincent Guittot wrote: > On Tue, 8 Dec 2020 at 16:35, Mel Gorman wrote: >> >> As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP >> even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP >> check and while we are at it, exclude the cost of initialising the CPU >> mask from the average scan cost. >> >> Signed-off-by: Mel Gorman >> --- >> kernel/sched/fair.c | 14 -- >> 1 file changed, 8 insertions(+), 6 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index ac7b34e7372b..5c41875aec23 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> if (!this_sd) >> return -1; > > Just noticed while reviewing the patch that the above related to > this_sd can also go under sched_feat(SIS_PROP) > >> >> + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); >> + >> if (sched_feat(SIS_PROP)) { >> u64 avg_cost, avg_idle, span_avg; >> >> @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> nr = div_u64(span_avg, avg_cost); >> else >> nr = 4; >> - } >> - >> - time = cpu_clock(this); >> >> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); >> + time = cpu_clock(this); >> + } >> >> for_each_cpu_wrap(cpu, cpus, target) { >> if (!--nr) nr is the key of this throttling mechanism, need to be placed under sched_feat(SIS_PROP) as well. Thanks, -Aubrey
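For reference, one way to keep the nr throttling entirely under SIS_PROP, as suggested above (a sketch of the alternative, not what was merged; the cost is an extra sched_feat() test on every loop iteration):

	for_each_cpu_wrap(cpu, cpus, target) {
		if (sched_feat(SIS_PROP) && !--nr)
			return -1;
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}

The other option, which the existing code relies on, is simply leaving nr initialised to INT_MAX when SIS_PROP is disabled, so the !--nr test never fires in practice.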
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/12/9 16:15, Vincent Guittot wrote: > Le mercredi 09 déc. 2020 à 14:24:04 (+0800), Aubrey Li a écrit : >> Add idle cpumask to track idle cpus in sched domain. Every time >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup >> target. And if the CPU is not in idle, the CPU is cleared in idle >> cpumask during scheduler tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has lower cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql >> and kbuild were tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs, no regression >> found. >> >> v6->v7: >> - place the whole idle cpumask mechanism under CONFIG_SMP. >> >> v5->v6: >> - decouple idle cpumask update from stop_tick signal, set idle CPU >> in idle cpumask every time the CPU enters idle >> >> v4->v5: >> - add update_idle_cpumask for s2idle case >> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >> idle_cpumask() everywhere >> >> v3->v4: >> - change setting idle cpumask from every idle entry to tickless idle >> if cpu driver is available. >> - move clearing idle cpumask to scheduler_tick to decouple nohz mode. >> >> v2->v3: >> - change setting idle cpumask to every idle entry, otherwise schbench >> has a regression of 99th percentile latency. >> - change clearing idle cpumask to nohz_balancer_kick(), so updating >> idle cpumask is ratelimited in the idle exiting path. >> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target. >> >> v1->v2: >> - idle cpumask is updated in the nohz routines, by initializing idle >> cpumask with sched_domain_span(sd), nohz=off case remains the original >> behavior. >> >> Cc: Peter Zijlstra >> Cc: Mel Gorman >> Cc: Vincent Guittot >> Cc: Qais Yousef >> Cc: Valentin Schneider >> Cc: Jiang Biao >> Cc: Tim Chen >> Signed-off-by: Aubrey Li >> --- >> include/linux/sched/topology.h | 13 + >> kernel/sched/core.c| 2 ++ >> kernel/sched/fair.c| 51 +- >> kernel/sched/idle.c| 5 >> kernel/sched/sched.h | 4 +++ >> kernel/sched/topology.c| 3 +- >> 6 files changed, 76 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h >> index 820511289857..b47b85163607 100644 >> --- a/include/linux/sched/topology.h >> +++ b/include/linux/sched/topology.h >> @@ -65,8 +65,21 @@ struct sched_domain_shared { >> atomic_tref; >> atomic_tnr_busy_cpus; >> int has_idle_cores; >> +/* >> + * Span of all idle CPUs in this domain. >> + * >> + * NOTE: this field is variable length. 
(Allocated dynamically >> + * by attaching extra space to the end of the structure, >> + * depending on how many CPUs the kernel has booted up with) >> + */ >> +unsigned long idle_cpus_span[]; >> }; >> >> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared *sds) >> +{ >> +return to_cpumask(sds->idle_cpus_span); >> +} >> + >> struct sched_domain { >> /* These fields must be setup */ >> struct sched_domain __rcu *parent; /* top domain must be null >> terminated */ >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >> index c4da7e17b906..c4c51ff3402a 100644 >> --- a/kernel/sched/core.c >> +++ b/kernel/sched/core.c >> @@ -4011,6 +4011,7 @@ void scheduler_tick(void) >> >> #ifdef CONFIG_SMP >> rq->idle_balance = idle_cpu(cpu); >> +update_idle_cpumask(cpu, false); > > Test rq->idle_balance here instead of adding the test in update_idle_cpumask > which is only > relevant for this situation. If called from idle path, because !set_idle is false, rq->idle_balance won't be tested actually. if (!set_idle && rq->idle_balance) return; So is it okay to leave it here to keep scheduler_tick a bit concise? Thanks, -Aubrey
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On 2020/12/9 17:05, Mel Gorman wrote: > On Wed, Dec 09, 2020 at 01:28:11PM +0800, Li, Aubrey wrote: >>>> nr = div_u64(span_avg, avg_cost); >>>> else >>>> nr = 4; >>>> - } >>>> - >>>> - time = cpu_clock(this); >>>> >>>> - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); >>>> + time = cpu_clock(this); >>>> + } >>>> >>>> for_each_cpu_wrap(cpu, cpus, target) { >>>> if (!--nr) >> >> nr is the key of this throttling mechanism, need to be placed under >> sched_feat(SIS_PROP) as well. >> > > It isn't necessary as nr in initialised to INT_MAX if !SIS_PROP. >If !SIS_PROP, nr need to -1 then tested in the loop, instead of testing >directly. But with SIS_PROP, need adding a test in the loop. Since SIS_PROP is default true, I think it's okay to keep the current way. Thanks, -Aubrey
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/12/9 21:09, Vincent Guittot wrote: > On Wed, 9 Dec 2020 at 11:58, Li, Aubrey wrote: >> >> On 2020/12/9 16:15, Vincent Guittot wrote: >>> Le mercredi 09 déc. 2020 à 14:24:04 (+0800), Aubrey Li a écrit : >>>> Add idle cpumask to track idle cpus in sched domain. Every time >>>> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup >>>> target. And if the CPU is not in idle, the CPU is cleared in idle >>>> cpumask during scheduler tick to ratelimit idle cpumask update. >>>> >>>> When a task wakes up to select an idle cpu, scanning idle cpumask >>>> has lower cost than scanning all the cpus in last level cache domain, >>>> especially when the system is heavily loaded. >>>> >>>> Benchmarks including hackbench, schbench, uperf, sysbench mysql >>>> and kbuild were tested on a x86 4 socket system with 24 cores per >>>> socket and 2 hyperthreads per core, total 192 CPUs, no regression >>>> found. >>>> >>>> v6->v7: >>>> - place the whole idle cpumask mechanism under CONFIG_SMP. >>>> >>>> v5->v6: >>>> - decouple idle cpumask update from stop_tick signal, set idle CPU >>>> in idle cpumask every time the CPU enters idle >>>> >>>> v4->v5: >>>> - add update_idle_cpumask for s2idle case >>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >>>> idle_cpumask() everywhere >>>> >>>> v3->v4: >>>> - change setting idle cpumask from every idle entry to tickless idle >>>> if cpu driver is available. >>>> - move clearing idle cpumask to scheduler_tick to decouple nohz mode. >>>> >>>> v2->v3: >>>> - change setting idle cpumask to every idle entry, otherwise schbench >>>> has a regression of 99th percentile latency. >>>> - change clearing idle cpumask to nohz_balancer_kick(), so updating >>>> idle cpumask is ratelimited in the idle exiting path. >>>> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target. >>>> >>>> v1->v2: >>>> - idle cpumask is updated in the nohz routines, by initializing idle >>>> cpumask with sched_domain_span(sd), nohz=off case remains the original >>>> behavior. >>>> >>>> Cc: Peter Zijlstra >>>> Cc: Mel Gorman >>>> Cc: Vincent Guittot >>>> Cc: Qais Yousef >>>> Cc: Valentin Schneider >>>> Cc: Jiang Biao >>>> Cc: Tim Chen >>>> Signed-off-by: Aubrey Li >>>> --- >>>> include/linux/sched/topology.h | 13 + >>>> kernel/sched/core.c| 2 ++ >>>> kernel/sched/fair.c| 51 +- >>>> kernel/sched/idle.c| 5 >>>> kernel/sched/sched.h | 4 +++ >>>> kernel/sched/topology.c| 3 +- >>>> 6 files changed, 76 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/include/linux/sched/topology.h >>>> b/include/linux/sched/topology.h >>>> index 820511289857..b47b85163607 100644 >>>> --- a/include/linux/sched/topology.h >>>> +++ b/include/linux/sched/topology.h >>>> @@ -65,8 +65,21 @@ struct sched_domain_shared { >>>> atomic_tref; >>>> atomic_tnr_busy_cpus; >>>> int has_idle_cores; >>>> +/* >>>> + * Span of all idle CPUs in this domain. >>>> + * >>>> + * NOTE: this field is variable length. 
(Allocated dynamically >>>> + * by attaching extra space to the end of the structure, >>>> + * depending on how many CPUs the kernel has booted up with) >>>> + */ >>>> +unsigned long idle_cpus_span[]; >>>> }; >>>> >>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared >>>> *sds) >>>> +{ >>>> +return to_cpumask(sds->idle_cpus_span); >>>> +} >>>> + >>>> struct sched_domain { >>>> /* These fields must be setup */ >>>> struct sched_domain __rcu *parent; /* top domain must be null >>>> terminated */ >>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>>> index c4da7e17b906..c4c51ff3402a 100644 >>>> --- a/kernel/sched/core.c >>>> +++ b/kernel/sched/core.c >>>> @@ -4011,6 +4011,7 @@ void scheduler_tick(void) >>>> >>>> #ifdef CONFIG_SMP >>>> rq->idle_balance = idle_cpu(cpu); >>>> +update_idle_cpumask(cpu, false); >>> >>> Test rq->idle_balance here instead of adding the test in >>> update_idle_cpumask which is only >>> relevant for this situation. >> >> If called from idle path, because !set_idle is false, rq->idle_balance won't >> be tested actually. >> >> if (!set_idle && rq->idle_balance) >> return; >> >> So is it okay to leave it here to keep scheduler_tick a bit concise? > > I don't like having a tick specific condition in a generic function. > rq->idle_balance is only relevant in this case > > calling update_idle_cpumask(rq->idle_balance) in scheduler_tick() > should do the job and we can remove the check of rq->idle_balance in > update_idle_cpumask() > > In case of scheduler_tick() called when idle , we will only test if > (rq->last_idle_state == idle_state) and return > I see, will come up with a v8 soon. Thanks, -Aubrey
Re: [PATCH 2/4] sched/fair: Move avg_scan_cost calculations under SIS_PROP
On 2020/12/8 23:34, Mel Gorman wrote: > As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP > even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP > check and while we are at it, exclude the cost of initialising the CPU > mask from the average scan cost. > > Signed-off-by: Mel Gorman > --- > kernel/sched/fair.c | 14 -- > 1 file changed, 8 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index ac7b34e7372b..5c41875aec23 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6153,6 +6153,8 @@ static int select_idle_cpu(struct task_struct *p, > struct sched_domain *sd, int t > if (!this_sd) > return -1; > > + cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); > + > if (sched_feat(SIS_PROP)) { > u64 avg_cost, avg_idle, span_avg; > > @@ -6168,11 +6170,9 @@ static int select_idle_cpu(struct task_struct *p, > struct sched_domain *sd, int t > nr = div_u64(span_avg, avg_cost); > else > nr = 4; > - } > - > - time = cpu_clock(this); > > - cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr); > + time = cpu_clock(this); > + } > > for_each_cpu_wrap(cpu, cpus, target) { > if (!--nr) > return -1; I thought about this again and here seems not to be consistent: - even if nr reduces to 0, shouldn't avg_scan_cost be updated as well before return -1? - if avg_scan_cost is not updated because nr is throttled, the first time = cpu_clock(this); can be optimized. As nr is calculated and we already know which of the weight of cpumask and nr is greater. Thanks, -Aubrey
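On the first point, a sketch of what accounting the partial scan might look like (an illustration only, building on the quoted patch; update_avg() is the existing helper already used at the end of select_idle_cpu()):

	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr) {
			/* Budget exhausted: still charge the time spent scanning. */
			if (sched_feat(SIS_PROP)) {
				time = cpu_clock(this) - time;
				update_avg(&this_sd->avg_scan_cost, time);
			}
			return -1;
		}
		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
			break;
	}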
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
Hi Mel, On 2020/12/9 22:36, Mel Gorman wrote: > On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote: >> Add idle cpumask to track idle cpus in sched domain. Every time >> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup >> target. And if the CPU is not in idle, the CPU is cleared in idle >> cpumask during scheduler tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has lower cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks including hackbench, schbench, uperf, sysbench mysql >> and kbuild were tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs, no regression >> found. >> > > I ran this patch with tbench on top of of the schedstat patches that > track SIS efficiency. The tracking adds overhead so it's not a perfect > performance comparison but the expectation would be that the patch reduces > the number of runqueues that are scanned Thanks for the measurement! I don't play with tbench so may need a while to digest the data. > > tbench4 > 5.10.0-rc6 5.10.0-rc6 > schedstat-v1r1 idlemask-v7r1 > Hmean 1504.76 ( 0.00%) 500.14 * -0.91%* > Hmean 2 1001.22 ( 0.00%) 970.37 * -3.08%* > Hmean 4 1930.56 ( 0.00%) 1880.96 * -2.57%* > Hmean 8 3688.05 ( 0.00%) 3537.72 * -4.08%* > Hmean 16 6352.71 ( 0.00%) 6439.53 * 1.37%* > Hmean 32 10066.37 ( 0.00%)10124.65 * 0.58%* > Hmean 64 12846.32 ( 0.00%)11627.27 * -9.49%* > Hmean 12822278.41 ( 0.00%)22304.33 * 0.12%* > Hmean 25621455.52 ( 0.00%)20900.13 * -2.59%* > Hmean 32021802.38 ( 0.00%)21928.81 * 0.58%* > > Not very optimistic result. The schedstats indicate; How many client threads was the following schedstats collected? > > 5.10.0-rc6 5.10.0-rc6 > schedstat-v1r1 idlemask-v7r1 > Ops TTWU Count 5599714302.00 5589495123.00 > Ops TTWU Local 2687713250.00 2563662550.00 > Ops SIS Search 5596677950.00 5586381168.00 > Ops SIS Domain Search3268344934.00 3229088045.00 > Ops SIS Scanned 15909069113.00 16568899405.00 > Ops SIS Domain Scanned 13580736097.00 14211606282.00 > Ops SIS Failures 2944874939.00 2843113421.00 > Ops SIS Core Search 262853975.00 311781774.00 > Ops SIS Core Hit 185189656.00 216097102.00 > Ops SIS Core Miss 77664319.0095684672.00 > Ops SIS Recent Used Hit 124265515.00 146021086.00 > Ops SIS Recent Used Miss 338142547.00 403547579.00 > Ops SIS Recent Attempts 462408062.00 549568665.00 > Ops SIS Search Efficiency35.18 33.72 > Ops SIS Domain Search Eff24.07 22.72 > Ops SIS Fast Success Rate41.60 42.20 > Ops SIS Success Rate 47.38 49.11 > Ops SIS Recent Success Rate 26.87 26.57 > > The field I would expect to decrease is SIS Domain Scanned -- the number > of runqueues that were examined but it's actually worse and graphing over > time shows it's worse for the client thread counts. select_idle_cpu() > is definitely being called because "Domain Search" is 10 times higher than > "Core Search" and there "Core Miss" is non-zero. Why SIS Domain Scanned can be decreased? I thought SIS Scanned was supposed to be decreased but it seems not on your side. I printed some trace log on my side by uperf workload, and it looks properly. To make the log easy to read, I started a 4 VCPU VM to run 2-second uperf 8 threads. 
stage 1: system idle, update_idle_cpumask is called from idle thread, set cpumask to 0-3 -0 [002] d..1 137.408681: update_idle_cpumask: set_idle-1, cpumask: 2 -0 [000] d..1 137.408713: update_idle_cpumask: set_idle-1, cpumask: 0,2 -0 [003] d..1 137.408924: update_idle_cpumask: set_idle-1, cpumask: 0,2-3 -0 [001] d..1 137.409035: update_idle_cpumask: set_idle-1, cpumask: 0-3 stage 2: uperf ramp up, cpumask changes back and forth uperf-561 [003] d..3 137.410620: select_task_rq_fair: scanning: 0-3 uperf-560 [000] d..5 137.411384: select_task_rq_fair: scanning: 0-3 kworker/u8:3-110 [000] d..4 137.411436: select_task_rq_fair: scanning: 0-3 uperf-560 [000] d.h1 137.412562: update_idle_cpumask: set_idle-0, cpumask: 1-3 uperf-570 [002] d.h2 137.412580: update_idle_cpumask: set_idle-0, cpumask: 1,3
Re: [RFC PATCH v7] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/12/10 19:34, Mel Gorman wrote: > On Thu, Dec 10, 2020 at 04:23:47PM +0800, Li, Aubrey wrote: >>> I ran this patch with tbench on top of of the schedstat patches that >>> track SIS efficiency. The tracking adds overhead so it's not a perfect >>> performance comparison but the expectation would be that the patch reduces >>> the number of runqueues that are scanned >> >> Thanks for the measurement! I don't play with tbench so may need a while >> to digest the data. >> > > They key point is that it appears the idle mask was mostly equivalent to > the full domain mask, at least for this test. I'm more interested in how tbench with heavy load behaves. If the load is heavy enough and idle thread has no chance to switch in, idle cpumask will be empty at the first scheduler tick and remain empty before the load comes down, during this period of heavy load: - default select_idle_cpu still scan the entire sched domain(or throttled to 4) everytime - the patched select_idle_cpu does not scan at all > >>> >>> tbench4 >>> 5.10.0-rc6 5.10.0-rc6 >>> schedstat-v1r1 idlemask-v7r1 >>> Hmean 1504.76 ( 0.00%) 500.14 * -0.91%* >>> Hmean 2 1001.22 ( 0.00%) 970.37 * -3.08%* >>> Hmean 4 1930.56 ( 0.00%) 1880.96 * -2.57%* >>> Hmean 8 3688.05 ( 0.00%) 3537.72 * -4.08%* >>> Hmean 16 6352.71 ( 0.00%) 6439.53 * 1.37%* >>> Hmean 32 10066.37 ( 0.00%)10124.65 * 0.58%* >>> Hmean 64 12846.32 ( 0.00%)11627.27 * -9.49%* >>> Hmean 12822278.41 ( 0.00%)22304.33 * 0.12%* >>> Hmean 25621455.52 ( 0.00%)20900.13 * -2.59%* >>> Hmean 32021802.38 ( 0.00%)21928.81 * 0.58%* >>> >>> Not very optimistic result. The schedstats indicate; >> >> How many client threads was the following schedstats collected? >> > > That's the overall summary for all client counts. While proc-schedstat > was measured every few seconds over all client counts, presenting that > in text format is not easy to parse. However, looking at the graphs over > time, it did not appear that scan rates were consistently lower for any > client count for tbench. > >>> >>> 5.10.0-rc6 5.10.0-rc6 >>> schedstat-v1r1 idlemask-v7r1 >>> Ops TTWU Count 5599714302.00 5589495123.00 >>> Ops TTWU Local 2687713250.00 2563662550.00 >>> Ops SIS Search 5596677950.00 5586381168.00 >>> Ops SIS Domain Search3268344934.00 3229088045.00 >>> Ops SIS Scanned 15909069113.00 16568899405.00 >>> Ops SIS Domain Scanned 13580736097.00 14211606282.00 >>> Ops SIS Failures 2944874939.00 2843113421.00 >>> Ops SIS Core Search 262853975.00 311781774.00 >>> Ops SIS Core Hit 185189656.00 216097102.00 >>> Ops SIS Core Miss 77664319.0095684672.00 >>> Ops SIS Recent Used Hit 124265515.00 146021086.00 >>> Ops SIS Recent Used Miss 338142547.00 403547579.00 >>> Ops SIS Recent Attempts 462408062.00 549568665.00 >>> Ops SIS Search Efficiency35.18 33.72 >>> Ops SIS Domain Search Eff24.07 22.72 >>> Ops SIS Fast Success Rate41.60 42.20 >>> Ops SIS Success Rate 47.38 49.11 >>> Ops SIS Recent Success Rate 26.87 26.57 >>> >>> The field I would expect to decrease is SIS Domain Scanned -- the number >>> of runqueues that were examined but it's actually worse and graphing over >>> time shows it's worse for the client thread counts. select_idle_cpu() >>> is definitely being called because "Domain Search" is 10 times higher than >>> "Core Search" and there "Core Miss" is non-zero. >> >> Why SIS Domain Scanned can be decreased? >> > > Because if idle CPUs are being targetted and its a subset of the entire > domain then it follows that fewer runqueues should be examined when > scanning the domain. 
Sorry, I probably messed up "SIS Domain Scanned" and "SIS Domain Search". How is "SIS Domain Scanned" calculated? > >> I thought SIS Scanned was supposed to be decreased but it seems not on your >> side. >> > > It *should* have been decreased but it
Re: [RFC PATCH v4] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/11/19 16:19, Vincent Guittot wrote: > On Thu, 19 Nov 2020 at 02:34, Li, Aubrey wrote: >> >> Hi Vincent, >> >> On 2020/11/18 21:36, Vincent Guittot wrote: >>> On Wed, 18 Nov 2020 at 04:48, Aubrey Li wrote: >>>> >>>> From: Aubrey Li >>>> >>>> Add idle cpumask to track idle cpus in sched domain. When a CPU >>>> enters idle, if the idle driver indicates to stop tick, this CPU >>>> is set in the idle cpumask to be a wakeup target. And if the CPU >>>> is not in idle, the CPU is cleared in idle cpumask during scheduler >>>> tick to ratelimit idle cpumask update. >>>> >>>> When a task wakes up to select an idle cpu, scanning idle cpumask >>>> has low cost than scanning all the cpus in last level cache domain, >>>> especially when the system is heavily loaded. >>>> >>>> Benchmarks were tested on a x86 4 socket system with 24 cores per >>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and >>>> schbench have no notable change, uperf has: >>>> >>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90 >>>> >>>> threads baseline-avg%stdpatch-avg %std >>>> 961 0.831.233.27 >>>> 144 1 1.031.672.67 >>>> 192 1 0.691.813.59 >>>> 240 1 2.841.512.67 >>>> >>>> Cc: Mel Gorman >>>> Cc: Vincent Guittot >>>> Cc: Qais Yousef >>>> Cc: Valentin Schneider >>>> Cc: Jiang Biao >>>> Cc: Tim Chen >>>> Signed-off-by: Aubrey Li >>>> --- >>>> include/linux/sched/topology.h | 13 + >>>> kernel/sched/core.c| 2 ++ >>>> kernel/sched/fair.c| 52 +- >>>> kernel/sched/idle.c| 7 +++-- >>>> kernel/sched/sched.h | 2 ++ >>>> kernel/sched/topology.c| 3 +- >>>> 6 files changed, 74 insertions(+), 5 deletions(-) >>>> >>>> diff --git a/include/linux/sched/topology.h >>>> b/include/linux/sched/topology.h >>>> index 820511289857..b47b85163607 100644 >>>> --- a/include/linux/sched/topology.h >>>> +++ b/include/linux/sched/topology.h >>>> @@ -65,8 +65,21 @@ struct sched_domain_shared { >>>> atomic_tref; >>>> atomic_tnr_busy_cpus; >>>> int has_idle_cores; >>>> + /* >>>> +* Span of all idle CPUs in this domain. >>>> +* >>>> +* NOTE: this field is variable length. (Allocated dynamically >>>> +* by attaching extra space to the end of the structure, >>>> +* depending on how many CPUs the kernel has booted up with) >>>> +*/ >>>> + unsigned long idle_cpus_span[]; >>>> }; >>>> >>>> +static inline struct cpumask *sds_idle_cpus(struct sched_domain_shared >>>> *sds) >>>> +{ >>>> + return to_cpumask(sds->idle_cpus_span); >>>> +} >>>> + >>>> struct sched_domain { >>>> /* These fields must be setup */ >>>> struct sched_domain __rcu *parent; /* top domain must be null >>>> terminated */ >>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c >>>> index b1e0da56abca..c86ae0495163 100644 >>>> --- a/kernel/sched/core.c >>>> +++ b/kernel/sched/core.c >>>> @@ -3994,6 +3994,7 @@ void scheduler_tick(void) >>>> rq_lock(rq, &rf); >>>> >>>> update_rq_clock(rq); >>>> + update_idle_cpumask(rq, false); >>>> thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq)); >>>> update_thermal_load_avg(rq_clock_thermal(rq), rq, >>>> thermal_pressure); >>>> curr->sched_class->task_tick(rq, curr, 0); >>>> @@ -7192,6 +7193,7 @@ void __init sched_init(void) >>>> rq_csd_init(rq, &rq->nohz_csd, nohz_csd_func); >>>> #endif >>>> #endif /* CONFIG_SMP */ >>>> + rq->last_idle_state = 1; >>>> hrtick_rq_init(rq); >>>&g
Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
On 2020/11/30 17:33, Balbir Singh wrote: > On Thu, Nov 26, 2020 at 05:26:31PM +0800, Li, Aubrey wrote: >> On 2020/11/26 16:32, Balbir Singh wrote: >>> On Thu, Nov 26, 2020 at 11:20:41AM +0800, Li, Aubrey wrote: >>>> On 2020/11/26 6:57, Balbir Singh wrote: >>>>> On Wed, Nov 25, 2020 at 11:12:53AM +0800, Li, Aubrey wrote: >>>>>> On 2020/11/24 23:42, Peter Zijlstra wrote: >>>>>>> On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote: >>>>>>>>>> +#ifdef CONFIG_SCHED_CORE >>>>>>>>>> +/* >>>>>>>>>> + * Skip this cpu if source task's cookie does not match >>>>>>>>>> + * with CPU's core cookie. >>>>>>>>>> + */ >>>>>>>>>> +if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) >>>>>>>>>> +continue; >>>>>>>>>> +#endif >>>>>>>>>> + >>>>>>>>> >>>>>>>>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't >>>>>>>>> the check for sched_core_enabled() do the right thing even when >>>>>>>>> CONFIG_SCHED_CORE is not enabed?> >>>>>>>> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not >>>>>>>> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make >>>>>>>> sense to leave a core scheduler specific function here even at compile >>>>>>>> time. Also, for the cases in hot path, this saves CPU cycles to avoid >>>>>>>> a judgment. >>>>>>> >>>>>>> No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is >>>>>>> more. >>>>>>> >>>>>> >>>>>> Okay, I pasted the refined patch here. >>>>>> @Joel, please let me know if you want me to send it in a separated >>>>>> thread. >>>>>> >>>>> >>>>> You still have a bunch of #ifdefs, can't we just do >>>>> >>>>> #ifndef CONFIG_SCHED_CORE >>>>> static inline bool sched_core_enabled(struct rq *rq) >>>>> { >>>>> return false; >>>>> } >>>>> #endif >>>>> >>>>> and frankly I think even that is not needed because there is a jump >>>>> label __sched_core_enabled that tells us if sched_core is enabled or >>>>> not. >>>> >>>> Hmm..., I need another wrapper for CONFIG_SCHED_CORE specific variables. >>>> How about this one? >>>> >>> >>> Much better :) >>> >>>> Thanks, >>>> -Aubrey >>>> >>>> From 61dac9067e66b5b9ea26c684c8c8235714bab38a Mon Sep 17 00:00:00 2001 >>>> From: Aubrey Li >>>> Date: Thu, 26 Nov 2020 03:08:04 + >>>> Subject: [PATCH] sched: migration changes for core scheduling >>>> >>>> - Don't migrate if there is a cookie mismatch >>>> Load balance tries to move task from busiest CPU to the >>>> destination CPU. When core scheduling is enabled, if the >>>> task's cookie does not match with the destination CPU's >>>> core cookie, this task will be skipped by this CPU. This >>>> mitigates the forced idle time on the destination CPU. >>>> >>>> - Select cookie matched idle CPU >>>> In the fast path of task wakeup, select the first cookie matched >>>> idle CPU instead of the first idle CPU. >>>> >>>> - Find cookie matched idlest CPU >>>> In the slow path of task wakeup, find the idlest CPU whose core >>>> cookie matches with task's cookie >>>> >>>> - Don't migrate task if cookie not match >>>> For the NUMA load balance, don't migrate task to the CPU whose >>>> core cookie does not match with task's cookie >>>> >>>> Tested-by: Julien Desfossez >>>> Signed-off-by: Aubrey Li >>>> Signed-off-by: Tim Chen >>>> Signed-off-by: Vineeth Remanan Pillai >>>> Signed-off-by: Joel Fernandes (Google) >>>> --- >>>> kernel/sched
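The stub pattern being suggested, spelled out as a sketch (the second stub for the cookie helper is the extra wrapper alluded to above, added here for illustration):

	#ifndef CONFIG_SCHED_CORE
	static inline bool sched_core_enabled(struct rq *rq)
	{
		return false;
	}

	/*
	 * With core scheduling compiled out, every task trivially "matches",
	 * so call sites need no #ifdef at all.
	 */
	static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
	{
		return true;
	}
	#endif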
Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
On 2020/11/30 18:35, Vincent Guittot wrote: > On Wed, 18 Nov 2020 at 00:20, Joel Fernandes (Google) > wrote: >> >> From: Aubrey Li >> >> - Don't migrate if there is a cookie mismatch >> Load balance tries to move task from busiest CPU to the >> destination CPU. When core scheduling is enabled, if the >> task's cookie does not match with the destination CPU's >> core cookie, this task will be skipped by this CPU. This >> mitigates the forced idle time on the destination CPU. >> >> - Select cookie matched idle CPU >> In the fast path of task wakeup, select the first cookie matched >> idle CPU instead of the first idle CPU. >> >> - Find cookie matched idlest CPU >> In the slow path of task wakeup, find the idlest CPU whose core >> cookie matches with task's cookie >> >> - Don't migrate task if cookie not match >> For the NUMA load balance, don't migrate task to the CPU whose >> core cookie does not match with task's cookie >> >> Tested-by: Julien Desfossez >> Signed-off-by: Aubrey Li >> Signed-off-by: Tim Chen >> Signed-off-by: Vineeth Remanan Pillai >> Signed-off-by: Joel Fernandes (Google) >> --- >> kernel/sched/fair.c | 64 >> kernel/sched/sched.h | 29 >> 2 files changed, 88 insertions(+), 5 deletions(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index de82f88ba98c..ceb3906c9a8a 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -1921,6 +1921,15 @@ static void task_numa_find_cpu(struct task_numa_env >> *env, >> if (!cpumask_test_cpu(cpu, env->p->cpus_ptr)) >> continue; >> >> +#ifdef CONFIG_SCHED_CORE >> + /* >> +* Skip this cpu if source task's cookie does not match >> +* with CPU's core cookie. >> +*/ >> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) >> + continue; >> +#endif >> + >> env->dst_cpu = cpu; >> if (task_numa_compare(env, taskimp, groupimp, maymove)) >> break; >> @@ -5867,11 +5876,17 @@ find_idlest_group_cpu(struct sched_group *group, >> struct task_struct *p, int this >> >> /* Traverse only the allowed CPUs */ >> for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) { >> + struct rq *rq = cpu_rq(i); >> + >> +#ifdef CONFIG_SCHED_CORE >> + if (!sched_core_cookie_match(rq, p)) >> + continue; >> +#endif >> + >> if (sched_idle_cpu(i)) >> return i; >> >> if (available_idle_cpu(i)) { >> - struct rq *rq = cpu_rq(i); >> struct cpuidle_state *idle = idle_get_state(rq); >> if (idle && idle->exit_latency < min_exit_latency) { >> /* >> @@ -6129,8 +6144,18 @@ static int select_idle_cpu(struct task_struct *p, >> struct sched_domain *sd, int t >> for_each_cpu_wrap(cpu, cpus, target) { >> if (!--nr) >> return -1; >> - if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) >> - break; >> + >> + if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) { >> +#ifdef CONFIG_SCHED_CORE >> + /* >> +* If Core Scheduling is enabled, select this cpu >> +* only if the process cookie matches core cookie. >> +*/ >> + if (sched_core_enabled(cpu_rq(cpu)) && >> + p->core_cookie == cpu_rq(cpu)->core->core_cookie) >> +#endif >> + break; >> + } > > This makes code unreadable. > Put this coresched specific stuff in an inline function; You can have > a look at what is done with asym_fits_capacity() > This is done in a refined version. Sorry I pasted the version embedded in this thread, this is not the latest version. 
>> } >> >> time = cpu_clock(this) - time; >> @@ -7530,8 +7555,9 @@ int can_migrate_task(struct task_struct *p, struct >> lb_env *env) >> * We do not migrate tasks that are: >> * 1) throttled_lb_pair, or >> * 2) cannot be migrated to this CPU due to cpus_ptr, or >> -* 3) running (obviously), or >> -* 4) are cache-hot on their current CPU. >> +* 3) task's cookie does not match with this CPU's core cookie >> +* 4) running (obviously), or >> +* 5) are cache-hot on their current CPU. >> */ >> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu)) >> return 0; >> @@ -7566,6 +7592,15 @@ int can_migrate_task(struct task_struct *p, struct >> lb_env *env) >> return 0; >> } >> >> +#ifdef CONFIG_SCHED_CORE >>
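The inline-helper style being asked for looks roughly like this (illustrative; the refined version of this patch does it with a sched_cpu_cookie_match() helper, as seen elsewhere in the thread):

	/* Returns true if the task's cookie matches the CPU's core cookie. */
	static inline bool sched_cpu_cookie_match(struct rq *rq, struct task_struct *p)
	{
		/* Ignore cookie match if core scheduling is not enabled on the CPU. */
		if (!sched_core_enabled(rq))
			return true;

		return rq->core->core_cookie == p->core_cookie;
	}

	/* ... and in select_idle_cpu(), no #ifdef at the call site: */
	for_each_cpu_wrap(cpu, cpus, target) {
		if (!--nr)
			return -1;
		if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
		    sched_cpu_cookie_match(cpu_rq(cpu), p))
			break;
	}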
Re: [PATCH] sched/fair: Clear SMT siblings after determining the core is not idle
On 2020/11/30 22:47, Vincent Guittot wrote: > On Mon, 30 Nov 2020 at 15:40, Mel Gorman wrote: >> >> The clearing of SMT siblings from the SIS mask before checking for an idle >> core is a small but unnecessary cost. Defer the clearing of the siblings >> until the scan moves to the next potential target. The cost of this was >> not measured as it is borderline noise but it should be self-evident. > > Good point This is more reasonable, thanks Mel. > >> >> Signed-off-by: Mel Gorman > > Reviewed-by: Vincent Guittot > >> --- >> kernel/sched/fair.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c >> index 0d54d69ba1a5..d9acd55d309b 100644 >> --- a/kernel/sched/fair.c >> +++ b/kernel/sched/fair.c >> @@ -6087,10 +6087,11 @@ static int select_idle_core(struct task_struct *p, >> struct sched_domain *sd, int >> break; >> } >> } >> - cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); >> >> if (idle) >> return core; >> + >> + cpumask_andnot(cpus, cpus, cpu_smt_mask(core)); >> } >> >> /*
Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
On 2020/11/23 12:38, Balbir Singh wrote: > On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote: >> From: Peter Zijlstra >> >> When a sibling is forced-idle to match the core-cookie; search for >> matching tasks to fill the core. >> >> rcu_read_unlock() can incur an infrequent deadlock in >> sched_core_balance(). Fix this by using the RCU-sched flavor instead. >> > ... >> + >> +if (p->core_occupation > dst->idle->core_occupation) >> +goto next; >> + > > I am unable to understand this check, a comment or clarification in the > changelog will help. I presume we are looking at either one or two cpus > to define the core_occupation and we expect to match it against the > destination CPU. IIUC, this check prevents a task from bouncing among the cores forever. For example, on an SMT2 platform: - core0 runs taskA and taskB, so core_occupation is 2 - core1 runs taskC, so core_occupation is 1 Without this check, taskB could ping-pong between core0 and core1 via core load balancing. Thanks, -Aubrey
Re: [PATCH -tip 13/32] sched: Trivial forced-newidle balancer
On 2020/11/24 7:35, Balbir Singh wrote: > On Mon, Nov 23, 2020 at 11:07:27PM +0800, Li, Aubrey wrote: >> On 2020/11/23 12:38, Balbir Singh wrote: >>> On Tue, Nov 17, 2020 at 06:19:43PM -0500, Joel Fernandes (Google) wrote: >>>> From: Peter Zijlstra >>>> >>>> When a sibling is forced-idle to match the core-cookie; search for >>>> matching tasks to fill the core. >>>> >>>> rcu_read_unlock() can incur an infrequent deadlock in >>>> sched_core_balance(). Fix this by using the RCU-sched flavor instead. >>>> >>> ... >>>> + >>>> + if (p->core_occupation > dst->idle->core_occupation) >>>> + goto next; >>>> + >>> >>> I am unable to understand this check, a comment or clarification in the >>> changelog will help. I presume we are looking at either one or two cpus >>> to define the core_occupation and we expect to match it against the >>> destination CPU. >> >> IIUC, this check prevents a task from keeping jumping among the cores >> forever. >> >> For example, on a SMT2 platform: >> - core0 runs taskA and taskB, core_occupation is 2 >> - core1 runs taskC, core_occupation is 1 >> >> Without this check, taskB could ping-pong between core0 and core1 by core >> load >> balance. > > But the comparison is p->core_occuption (as in tasks core occuptation, > not sure what that means, can a task have a core_occupation of > 1?) > p->core_occupation is assigned to the core occupation in the last pick_next_task. (so yes, it can have a > 1 core_occupation). Thanks, -Aubrey
Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup
Hi Vincent, On 2020/11/23 17:27, Vincent Guittot wrote: > Hi Aubrey, > > On Thu, 19 Nov 2020 at 13:15, Aubrey Li wrote: >> >> Add idle cpumask to track idle cpus in sched domain. When a CPU >> enters idle, if the idle driver indicates to stop tick, this CPU >> is set in the idle cpumask to be a wakeup target. And if the CPU >> is not in idle, the CPU is cleared in idle cpumask during scheduler >> tick to ratelimit idle cpumask update. >> >> When a task wakes up to select an idle cpu, scanning idle cpumask >> has low cost than scanning all the cpus in last level cache domain, >> especially when the system is heavily loaded. >> >> Benchmarks were tested on a x86 4 socket system with 24 cores per >> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and >> schbench have no notable change, uperf has: >> >> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90 >> >> threads baseline-avg%stdpatch-avg %std >> 961 0.831.233.27 >> 144 1 1.031.672.67 >> 192 1 0.691.813.59 >> 240 1 2.841.512.67 >> >> v4->v5: >> - add update_idle_cpumask for s2idle case >> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >> idle_cpumask() everywhere >> >> v3->v4: >> - change setting idle cpumask from every idle entry to tickless idle >> if cpu driver is available. > > Could you remind me why you did this change ? Clearing the cpumask is > done during the tick to rate limit the number of updates of the > cpumask but It's not clear for me why you have associated the set with > the tick stop condition too. I found the current implementation has better performance at a more suitable load range. The two kinds of implementions(v4 and v5) have the same rate(scheduler tick) to shrink idle cpumask when the system is busy, but - Setting the idle mask everytime the cpu enters idle requires a much heavier load level to preserve the idle cpumask(not call into idle), otherwise the bits cleared in scheduler tick will be restored when the cpu enters idle. That is, idle cpumask is almost equal to the domain cpumask during task wakeup if the system load is not heavy enough. - Associating with tick stop tolerates idle to preserve the idle cpumask but only short idle, which causes tick retains. This is more fitable for the real workload. > > This change means that a cpu will not be part of the idle mask if the > tick is not stopped. On some arm/arm64 platforms, the tick stops only > if the idle duration is expected to be higher than 1-2ms which starts > to be significantly long. Also, the cpuidle governor can easily > mis-predict a short idle duration whereas it will be finally a long > idle duration; In this case, the next tick will correct the situation > and select a deeper state, but this can happen up to 4ms later on > arm/arm64. Yes this is intented. If the tick is not stopped, that indicates the CPU is very busy, cpu idle governor selected the polling idle state, and/or the expected idle duration is shorter than the tick period length. For example, uperf enters and exits 80 times between two ticks when utilizes 100% CPU, and the average idle residency < 50us. If this CPU is added to idle cpumask, the wakeup task likely needs to wait in the runqueue as this CPU will run its current task very soon. > > So I would prefer to keep trying to set the idle mask everytime the > cpu enters idle. If a tick has not happened between 2 idle phases, the > cpumask will not be updated and the overhead will be mostly testing if > (rq->last_idle_state == idle_state). 
I'm not sure if I addressed your concern; did you see any workloads or cases where v4 performs better than v5? Thanks, -Aubrey > > >> - move clearing idle cpumask to scheduler_tick to decouple nohz mode. >> >> v2->v3: >> - change setting idle cpumask to every idle entry, otherwise schbench >> has a regression of 99th percentile latency. >> - change clearing idle cpumask to nohz_balancer_kick(), so updating >> idle cpumask is ratelimited in the idle exiting path. >> - set SCHED_IDLE cpu in idle cpumask to allow it as a wakeup target. >> >> v1->v2: >> - idle cpumask is updated in the nohz routines, by initializing idle >> cpumask with sched_domain_span(sd), nohz=off case remains the original >> behavior. >>
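For reference, a minimal sketch of the idle-cpumask update being debated here, not the posted patch itself: the bit is set only when the idle governor decides to stop the tick, cleared again for a busy CPU, and the rq->last_idle_state test mentioned by Vincent short-circuits repeated cpumask writes. sds_idle_cpus() and the integer encoding of last_idle_state are assumptions made for illustration.

/* Sketch only: set the per-LLC idle bit on tickless idle, clear it otherwise. */
static void update_idle_cpumask(int cpu, bool tickless_idle)
{
        struct rq *rq = cpu_rq(cpu);
        struct sched_domain_shared *sds;
        int new_state = tickless_idle ? 1 : 0;

        /* Common case: nothing changed since the last update, no cpumask write. */
        if (rq->last_idle_state == new_state)
                return;

        rcu_read_lock();
        sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
        if (sds) {
                if (new_state)
                        cpumask_set_cpu(cpu, sds_idle_cpus(sds));   /* advertise as wakeup target */
                else
                        cpumask_clear_cpu(cpu, sds_idle_cpus(sds)); /* busy or short idle */
        }
        rcu_read_unlock();

        rq->last_idle_state = new_state;
}

In this scheme the idle entry path would call it with tickless_idle reflecting the governor's decision right where tick_nohz_idle_stop_tick() is ordered, and scheduler_tick() would call it with false on a non-idle CPU, which is what rate-limits the clearing.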
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On 2020/12/4 21:17, Vincent Guittot wrote: > On Fri, 4 Dec 2020 at 14:13, Vincent Guittot > wrote: >> >> On Fri, 4 Dec 2020 at 12:30, Mel Gorman wrote: >>> >>> On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote: > The intent was that the sibling might still be an idle candidate. In > the current draft of the series, I do not even clear this so that the > SMT sibling is considered as an idle candidate. The reasoning is that if > there are no idle cores then an SMT sibling of the target is as good an > idle CPU to select as any. Isn't the purpose of select_idle_smt ? >>> >>> Only in part. >>> select_idle_core() looks for an idle core and opportunistically saves an idle CPU candidate to skip select_idle_cpu. In this case this is useless loops for select_idle_core() because we are sure that the core is not idle >>> >>> If select_idle_core() finds an idle candidate other than the sibling, >>> it'll use it if there is no idle core -- it picks a busy sibling based >>> on a linear walk of the cpumask. Similarly, select_idle_cpu() is not >> >> My point is that it's a waste of time to loop the sibling cpus of >> target in select_idle_core because it will not help to find an idle >> core. The sibling cpus will then be check either by select_idle_cpu >> of select_idle_smt > > also, while looping the cpumask, the sibling cpus of not idle cpu are > removed and will not be check > IIUC, select_idle_core() and select_idle_cpu() share the same cpumask (select_idle_mask)? If the target's sibling is removed from select_idle_mask in select_idle_core(), will select_idle_cpu() lose the chance to pick it up? Thanks, -Aubrey
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On 2020/12/4 21:40, Li, Aubrey wrote: > On 2020/12/4 21:17, Vincent Guittot wrote: >> On Fri, 4 Dec 2020 at 14:13, Vincent Guittot >> wrote: >>> >>> On Fri, 4 Dec 2020 at 12:30, Mel Gorman wrote: >>>> >>>> On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote: >>>>>> The intent was that the sibling might still be an idle candidate. In >>>>>> the current draft of the series, I do not even clear this so that the >>>>>> SMT sibling is considered as an idle candidate. The reasoning is that if >>>>>> there are no idle cores then an SMT sibling of the target is as good an >>>>>> idle CPU to select as any. >>>>> >>>>> Isn't the purpose of select_idle_smt ? >>>>> >>>> >>>> Only in part. >>>> >>>>> select_idle_core() looks for an idle core and opportunistically saves >>>>> an idle CPU candidate to skip select_idle_cpu. In this case this is >>>>> useless loops for select_idle_core() because we are sure that the core >>>>> is not idle >>>>> >>>> >>>> If select_idle_core() finds an idle candidate other than the sibling, >>>> it'll use it if there is no idle core -- it picks a busy sibling based >>>> on a linear walk of the cpumask. Similarly, select_idle_cpu() is not >>> >>> My point is that it's a waste of time to loop the sibling cpus of >>> target in select_idle_core because it will not help to find an idle >>> core. The sibling cpus will then be check either by select_idle_cpu >>> of select_idle_smt >> >> also, while looping the cpumask, the sibling cpus of not idle cpu are >> removed and will not be check >> > > IIUC, select_idle_core and select_idle_cpu share the same > cpumask(select_idle_mask)? > If the target's sibling is removed from select_idle_mask from > select_idle_core(), > select_idle_cpu() will lose the chance to pick it up? aha, no, select_idle_mask will be re-assigned in select_idle_cpu() by: cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr); So, yes, I guess we can remove the cpu_smt_mask(target) from select_idle_core() safely. > > Thanks, > -Aubrey >
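To make the resolution above concrete, here is a sketch of the shape being discussed, assuming roughly the v5.9-era code plus the idle-cpumask series: select_idle_core() and select_idle_cpu() do share the per-CPU select_idle_mask scratch mask, but each one re-derives it on entry, so bits cleared by select_idle_core() cannot leak into select_idle_cpu(). sds_idle_cpus() is the helper from Aubrey's series; in mainline the source mask would be sched_domain_span(sd).

/* Sketch only: the scratch mask is rebuilt here, not inherited from select_idle_core(). */
static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
        int cpu, nr = INT_MAX;  /* scan-depth limiting left out for brevity */

        cpumask_and(cpus, sds_idle_cpus(sd->shared), p->cpus_ptr);

        for_each_cpu_wrap(cpu, cpus, target) {
                if (!--nr)
                        return -1;
                if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
                        return cpu;
        }

        return -1;
}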
Re: [PATCH 06/10] sched/fair: Clear the target CPU from the cpumask of CPUs searched
On 2020/12/4 21:47, Vincent Guittot wrote: > On Fri, 4 Dec 2020 at 14:40, Li, Aubrey wrote: >> >> On 2020/12/4 21:17, Vincent Guittot wrote: >>> On Fri, 4 Dec 2020 at 14:13, Vincent Guittot >>> wrote: >>>> >>>> On Fri, 4 Dec 2020 at 12:30, Mel Gorman >>>> wrote: >>>>> >>>>> On Fri, Dec 04, 2020 at 11:56:36AM +0100, Vincent Guittot wrote: >>>>>>> The intent was that the sibling might still be an idle candidate. In >>>>>>> the current draft of the series, I do not even clear this so that the >>>>>>> SMT sibling is considered as an idle candidate. The reasoning is that if >>>>>>> there are no idle cores then an SMT sibling of the target is as good an >>>>>>> idle CPU to select as any. >>>>>> >>>>>> Isn't the purpose of select_idle_smt ? >>>>>> >>>>> >>>>> Only in part. >>>>> >>>>>> select_idle_core() looks for an idle core and opportunistically saves >>>>>> an idle CPU candidate to skip select_idle_cpu. In this case this is >>>>>> useless loops for select_idle_core() because we are sure that the core >>>>>> is not idle >>>>>> >>>>> >>>>> If select_idle_core() finds an idle candidate other than the sibling, >>>>> it'll use it if there is no idle core -- it picks a busy sibling based >>>>> on a linear walk of the cpumask. Similarly, select_idle_cpu() is not >>>> >>>> My point is that it's a waste of time to loop the sibling cpus of >>>> target in select_idle_core because it will not help to find an idle >>>> core. The sibling cpus will then be check either by select_idle_cpu >>>> of select_idle_smt >>> >>> also, while looping the cpumask, the sibling cpus of not idle cpu are >>> removed and will not be check >>> >> >> IIUC, select_idle_core and select_idle_cpu share the same >> cpumask(select_idle_mask)? >> If the target's sibling is removed from select_idle_mask from >> select_idle_core(), >> select_idle_cpu() will lose the chance to pick it up? > > This is only relevant for patch 10 which is not to be included IIUC > what mel said in cover letter : "Patches 9 and 10 are stupid in the > context of this series." So the target's sibling can be removed from cpumask in select_idle_core in patch 6, and need to be added back in select_idle_core in patch 10, :)
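For readers following along, a simplified sketch of the select_idle_core() loop being discussed, assuming roughly its v5.9-era shape (the test_idle_cores()/set_idle_cores() bookkeeping is omitted): every scanned sibling is dropped from the shared scratch mask, which is the removal Vincent refers to and the reason patch 10 would have to add the target's siblings back.

static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
{
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
        int core, cpu;

        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

        for_each_cpu_wrap(core, cpus, target) {
                bool idle = true;

                for_each_cpu(cpu, cpu_smt_mask(core)) {
                        /* Drop every scanned sibling from the mask, idle or not. */
                        __cpumask_clear_cpu(cpu, cpus);
                        if (!available_idle_cpu(cpu))
                                idle = false;
                }

                if (idle)
                        return core;
        }

        return -1;
}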
Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/11/25 1:01, Vincent Guittot wrote: > Hi Aubrey, > > Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit : >> Hi Vincent, >> >> On 2020/11/23 17:27, Vincent Guittot wrote: >>> Hi Aubrey, >>> >>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li wrote: >>>> >>>> Add idle cpumask to track idle cpus in sched domain. When a CPU >>>> enters idle, if the idle driver indicates to stop tick, this CPU >>>> is set in the idle cpumask to be a wakeup target. And if the CPU >>>> is not in idle, the CPU is cleared in idle cpumask during scheduler >>>> tick to ratelimit idle cpumask update. >>>> >>>> When a task wakes up to select an idle cpu, scanning idle cpumask >>>> has low cost than scanning all the cpus in last level cache domain, >>>> especially when the system is heavily loaded. >>>> >>>> Benchmarks were tested on a x86 4 socket system with 24 cores per >>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and >>>> schbench have no notable change, uperf has: >>>> >>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90 >>>> >>>> threads  baseline-avg  %std  patch-avg  %std >>>> 96   1  0.83  1.23  3.27 >>>> 144  1  1.03  1.67  2.67 >>>> 192  1  0.69  1.81  3.59 >>>> 240  1  2.84  1.51  2.67 >>>> >>>> v4->v5: >>>> - add update_idle_cpumask for s2idle case >>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >>>> idle_cpumask() everywhere >>>> >>>> v3->v4: >>>> - change setting idle cpumask from every idle entry to tickless idle >>>> if cpu driver is available. >>> >>> Could you remind me why you did this change ? Clearing the cpumask is >>> done during the tick to rate limit the number of updates of the >>> cpumask but It's not clear for me why you have associated the set with >>> the tick stop condition too. >> >> I found the current implementation has better performance at a more >> suitable load range. >> >> The two kinds of implementions(v4 and v5) have the same rate(scheduler >> tick) to shrink idle cpumask when the system is busy, but > > I'm ok with the part above > >> >> - Setting the idle mask everytime the cpu enters idle requires a much >> heavier load level to preserve the idle cpumask(not call into idle), >> otherwise the bits cleared in scheduler tick will be restored when the >> cpu enters idle. That is, idle cpumask is almost equal to the domain >> cpumask during task wakeup if the system load is not heavy enough. > > But setting the idle cpumask is useful because it helps to select an idle > cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO, > the idle cpu mask is useful in heavy cases because a system, which is > already fully busy with work, doesn't want to waste time looking for an > idle cpu that doesn't exist. Yes, this is what v3 does. > But if there is an idle cpu, we should still looks for it. IMHO, this is a potential opportunity for improvement. The idle cpus could be in different idle states, and the idle duration could be long or very short. For example, if there are two idle cpus: - CPU1 is very busy, the pattern is 50us idle and 950us work. - CPU2 is idle for a tick length and wakes up to do the regular work. If both are added to the idle cpumask, we want the latter one; or rather, we can just add only the latter one into the idle cpumask. That's why I want to associate the tick stop signal with it. > >> >> >> - Associating with tick stop tolerates idle to preserve the idle cpumask >> but only short idle, which causes tick retains. This is more fitable for >> the real workload.
> > I don't agree with this and real use cases with interaction will probably > not agree as well as they want to run on an idle cpu if any but not wait > on an already busy one. The problem is scan overhead; scanning for an idle cpu takes time. If an idle cpu is in the short-idle mode, it's very likely that by the time it is picked for a wakeup task it has gone back to work again, so the wakeup task has to wait anyway, maybe longer, because the running task has only just started. One benefit of waiting on the previous cpu is a warm cache. > Also keep in mind that a tick can be up to 10ms long Right, but the point here is, if this 10ms tick retains, t
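As a rough back-of-envelope illustration of this point, using the CPU1 pattern quoted earlier in the thread and assuming wakeups arrive at random moments: a cpu that idles 50us out of every 1000us is idle only 50/1000 = 5% of the time, so roughly 95% of wakeups targeting it find it already running again; even when it is caught idle, the expected remaining idle residency is only about 25us, much of which can be consumed by the scan and wakeup path itself. A cpu that has stopped the tick, by contrast, is expected by the governor to stay idle for at least a tick period, so its bit in the idle cpumask is far more likely to still be valid when the wakeup lands.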
Re: [PATCH -tip 14/32] sched: migration changes for core scheduling
On 2020/11/24 23:42, Peter Zijlstra wrote: > On Mon, Nov 23, 2020 at 12:36:10PM +0800, Li, Aubrey wrote: >>>> +#ifdef CONFIG_SCHED_CORE >>>> + /* >>>> + * Skip this cpu if source task's cookie does not match >>>> + * with CPU's core cookie. >>>> + */ >>>> + if (!sched_core_cookie_match(cpu_rq(cpu), env->p)) >>>> + continue; >>>> +#endif >>>> + >>> >>> Any reason this is under an #ifdef? In sched_core_cookie_match() won't >>> the check for sched_core_enabled() do the right thing even when >>> CONFIG_SCHED_CORE is not enabed? >> Yes, sched_core_enabled works properly when CONFIG_SCHED_CORE is not >> enabled. But when CONFIG_SCHED_CORE is not enabled, it does not make >> sense to leave a core scheduler specific function here even at compile >> time. Also, for the cases in hot path, this saves CPU cycles to avoid >> a judgment. > > No, that's nonsense. If it works, remove the #ifdef. Less (#ifdef) is > more. > Okay, I pasted the refined patch here. @Joel, please let me know if you want me to send it in a separate thread. Thanks, -Aubrey

==
From 18e4f4592c2a159fcbae637f3a422e37ad24cb5a Mon Sep 17 00:00:00 2001
From: Aubrey Li
Date: Wed, 25 Nov 2020 02:43:46 +
Subject: [PATCH 14/33] sched: migration changes for core scheduling

- Don't migrate if there is a cookie mismatch
  Load balance tries to move task from busiest CPU to the destination
  CPU. When core scheduling is enabled, if the task's cookie does not
  match with the destination CPU's core cookie, this task will be
  skipped by this CPU. This mitigates the forced idle time on the
  destination CPU.

- Select cookie matched idle CPU
  In the fast path of task wakeup, select the first cookie matched idle
  CPU instead of the first idle CPU.

- Find cookie matched idlest CPU
  In the slow path of task wakeup, find the idlest CPU whose core cookie
  matches with task's cookie

- Don't migrate task if cookie not match
  For the NUMA load balance, don't migrate task to the CPU whose core
  cookie does not match with task's cookie

Tested-by: Julien Desfossez
Signed-off-by: Aubrey Li
Signed-off-by: Tim Chen
Signed-off-by: Vineeth Remanan Pillai
Signed-off-by: Joel Fernandes (Google)
---
 kernel/sched/fair.c  | 58
 kernel/sched/sched.h | 33 +
 2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de82f88ba98c..7eea5da6685a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1921,6 +1921,13 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, env->p->cpus_ptr))
 			continue;
 
+		/*
+		 * Skip this cpu if source task's cookie does not match
+		 * with CPU's core cookie.
+		 */
+		if (!sched_core_cookie_match(cpu_rq(cpu), env->p))
+			continue;
+
 		env->dst_cpu = cpu;
 		if (task_numa_compare(env, taskimp, groupimp, maymove))
 			break;
@@ -5867,11 +5874,15 @@ find_idlest_group_cpu(struct sched_group *group, struct task_struct *p, int this
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
+		struct rq *rq = cpu_rq(i);
+
+		if (!sched_core_cookie_match(rq, p))
+			continue;
+
 		if (sched_idle_cpu(i))
 			return i;
 
 		if (available_idle_cpu(i)) {
-			struct rq *rq = cpu_rq(i);
 			struct cpuidle_state *idle = idle_get_state(rq);
 			if (idle && idle->exit_latency < min_exit_latency) {
 				/*
@@ -6129,8 +6140,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	for_each_cpu_wrap(cpu, cpus, target) {
 		if (!--nr)
 			return -1;
-		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
-			break;
+
+		if (available_idle_cpu(cpu) || sched_idle_cpu(cpu)) {
+#ifdef CONFIG_SCHED_CORE
+			/*
+			 * If Core Scheduling is enabled, select this cpu
+			 * only if the process cookie matches core cookie.
+			 */
+			if (sched_core_enabled(cpu_rq(cpu)) &&
+			    p->core
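The patch is cut off above, and the kernel/sched/sched.h hunk that defines sched_core_cookie_match() never appears. Based only on the commit message and the call sites shown, a minimal sketch of what such a helper needs to do might look like the following; the idle-core special case and the rq->core->core_cookie / p->core_cookie fields are assumptions taken from the core scheduling series, not from this (truncated) mail.

/* Sketch only: return true when @p may run on @rq under core scheduling. */
static inline bool sched_core_cookie_match(struct rq *rq, struct task_struct *p)
{
        bool idle_core = true;
        int cpu;

        /* No core scheduling on this runqueue: nothing to match against. */
        if (!sched_core_enabled(rq))
                return true;

        for_each_cpu(cpu, cpu_smt_mask(cpu_of(rq))) {
                if (!available_idle_cpu(cpu)) {
                        idle_core = false;
                        break;
                }
        }

        /*
         * A fully idle core has no running cookie to conflict with, so it is
         * treated as an acceptable destination; otherwise the task's cookie
         * must match the core-wide cookie.
         */
        return idle_core || rq->core->core_cookie == p->core_cookie;
}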
Re: [RFC PATCH v5] sched/fair: select idle cpu from idle cpumask for task wakeup
On 2020/11/25 16:31, Vincent Guittot wrote: > On Wed, 25 Nov 2020 at 03:03, Li, Aubrey wrote: >> >> On 2020/11/25 1:01, Vincent Guittot wrote: >>> Hi Aubrey, >>> >>> Le mardi 24 nov. 2020 à 15:01:38 (+0800), Li, Aubrey a écrit : >>>> Hi Vincent, >>>> >>>> On 2020/11/23 17:27, Vincent Guittot wrote: >>>>> Hi Aubrey, >>>>> >>>>> On Thu, 19 Nov 2020 at 13:15, Aubrey Li wrote: >>>>>> >>>>>> Add idle cpumask to track idle cpus in sched domain. When a CPU >>>>>> enters idle, if the idle driver indicates to stop tick, this CPU >>>>>> is set in the idle cpumask to be a wakeup target. And if the CPU >>>>>> is not in idle, the CPU is cleared in idle cpumask during scheduler >>>>>> tick to ratelimit idle cpumask update. >>>>>> >>>>>> When a task wakes up to select an idle cpu, scanning idle cpumask >>>>>> has low cost than scanning all the cpus in last level cache domain, >>>>>> especially when the system is heavily loaded. >>>>>> >>>>>> Benchmarks were tested on a x86 4 socket system with 24 cores per >>>>>> socket and 2 hyperthreads per core, total 192 CPUs. Hackbench and >>>>>> schbench have no notable change, uperf has: >>>>>> >>>>>> uperf throughput: netperf workload, tcp_nodelay, r/w size = 90 >>>>>> >>>>>> threads baseline-avg%stdpatch-avg %std >>>>>> 961 0.831.233.27 >>>>>> 144 1 1.031.672.67 >>>>>> 192 1 0.691.813.59 >>>>>> 240 1 2.841.512.67 >>>>>> >>>>>> v4->v5: >>>>>> - add update_idle_cpumask for s2idle case >>>>>> - keep the same ordering of tick_nohz_idle_stop_tick() and update_ >>>>>> idle_cpumask() everywhere >>>>>> >>>>>> v3->v4: >>>>>> - change setting idle cpumask from every idle entry to tickless idle >>>>>> if cpu driver is available. >>>>> >>>>> Could you remind me why you did this change ? Clearing the cpumask is >>>>> done during the tick to rate limit the number of updates of the >>>>> cpumask but It's not clear for me why you have associated the set with >>>>> the tick stop condition too. >>>> >>>> I found the current implementation has better performance at a more >>>> suitable load range. >>>> >>>> The two kinds of implementions(v4 and v5) have the same rate(scheduler >>>> tick) to shrink idle cpumask when the system is busy, but >>> >>> I'm ok with the part above >>> >>>> >>>> - Setting the idle mask everytime the cpu enters idle requires a much >>>> heavier load level to preserve the idle cpumask(not call into idle), >>>> otherwise the bits cleared in scheduler tick will be restored when the >>>> cpu enters idle. That is, idle cpumask is almost equal to the domain >>>> cpumask during task wakeup if the system load is not heavy enough. >>> >>> But setting the idle cpumask is useful because it helps to select an idle >>> cpu at wake up instead of waiting ifor ILB to fill the empty CPU. IMO, >>> the idle cpu mask is useful in heavy cases because a system, which is >>> already fully busy with work, doesn't want to waste time looking for an >>> idle cpu that doesn't exist. >> >> Yes, this is what v3 does. >> >>> But if there is an idle cpu, we should still looks for it. >> >> IMHO, this is a potential opportunity can be improved. The idle cpu could be >> in different idle state, the idle duration could be long or could be very >> short. >> For example, if there are two idle cpus: >> >> - CPU1 is very busy, the pattern is 50us idle and 950us work. >> - CPU2 is in idle for a tick length and wake up to do the regular work >> >> If both added to the idle cpumask, we want the latter one, or we can just add >> the later one into the idle cpumask. 
That's why I want to associate tick stop >> signal with it. >> >>> >>>> >>>> >>>> - Associating with tick stop to