Re: [PATCH 1/3] sched: Fix nohz_kick_needed to consider the nr_busy of the parent domain's group

2013-10-23 Thread Vincent Guittot
Hi Preeti

On 23 October 2013 11:50, Preeti U Murthy  wrote:
> Hi Peter
>
> On 10/23/2013 03:41 AM, Peter Zijlstra wrote:
>> This nohz stuff really needs to be re-thought and made more scalable --
>> its a royal pain :/
>
> Why  not do something like the below instead? It does the following.
>
> This patch introduces sd_busy just like your suggested patch, except that
> it points to the parent of the highest level sched domain which has the
> SD_SHARE_PKG_RESOURCES set and initializes it in update_top_cache_domain().
> This is the sched domain that is relevant in nohz_kick_needed().
>
> sd_set_sd_state_busy(), sd_set_sd_state_idle() and nohz_kick_needed() query
> and update *only* this sched domain(sd_busy) for nr_busy_cpus. They are the
> only users of this parameter. While we are at it, we might as well change
> the nohz_idle parameter to be updated at the sd_busy domain level alone and
> not the base domain level of a CPU. This will unify the concept of busy cpus
> at just one level of sched domain.
>
> There is no need to iterate through all levels of sched domains of a cpu to
> update nr_busy_cpus since it is irrelevant at all other sched domains except
> at sd_busy level.
>
> De-couple asymmetric load balancing from the nr_busy parameter which the
> PATCH 2/3 anyway does. sd_busy therefore is irrelevant for asymmetric load
> balancing.
>
> Regards
> Preeti U Murthy
> START_PATCH---
>
> sched: Fix nohz_kick_needed()
>
> ---
>  kernel/sched/core.c  |4 
>  kernel/sched/fair.c  |   40 ++--
>  kernel/sched/sched.h |1 +
>  3 files changed, 27 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c06b8d3..c1dd11c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5271,6 +5271,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_llc);
>  DEFINE_PER_CPU(int, sd_llc_size);
>  DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(struct sched_domain *, sd_numa);
> +DEFINE_PER_CPU(struct sched_domain *, sd_busy);
>
>  static void update_top_cache_domain(int cpu)
>  {
> @@ -5290,6 +5291,9 @@ static void update_top_cache_domain(int cpu)
>
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> +
> +   sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)->parent;

highest_flag_domain() can return a NULL pointer here.
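Something like the below (untested) would handle that:

	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd)
		sd = sd->parent;

	rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);

so sd_busy simply stays NULL when there is no SD_SHARE_PKG_RESOURCES level.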

> +   rcu_assign_pointer(per_cpu(sd_busy, cpu), sd);
>  }
>
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 813dd61..71e6f14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6515,16 +6515,16 @@ static inline void nohz_balance_exit_idle(int cpu)
>  static inline void set_cpu_sd_state_busy(void)
>  {
> struct sched_domain *sd;
> +   int cpu = smp_processor_id();
>
> rcu_read_lock();
> -   sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> +   sd = per_cpu(sd_busy, cpu);

Don't you need to use rcu_dereference() when using sd_busy?
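i.e. something like (untested):

	sd = rcu_dereference(per_cpu(sd_busy, cpu));

since the pointer is published with rcu_assign_pointer() and only read
under rcu_read_lock().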

>
> if (!sd || !sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 0;
>
> -   for (; sd; sd = sd->parent)
> -   atomic_inc(&sd->groups->sgp->nr_busy_cpus);
> +   atomic_inc(&sd->groups->sgp->nr_busy_cpus);
>  unlock:
> rcu_read_unlock();
>  }
> @@ -6532,16 +6532,16 @@ unlock:
>  void set_cpu_sd_state_idle(void)
>  {
> struct sched_domain *sd;
> +   int cpu = smp_processor_id();
>
> rcu_read_lock();
> -   sd = rcu_dereference_check_sched_domain(this_rq()->sd);
> +   sd = per_cpu(sd_busy, cpu);
>
> if (!sd || sd->nohz_idle)
> goto unlock;
> sd->nohz_idle = 1;
>
> -   for (; sd; sd = sd->parent)
> -   atomic_dec(&sd->groups->sgp->nr_busy_cpus);
> +   atomic_dec(&sd->groups->sgp->nr_busy_cpus);
>  unlock:
> rcu_read_unlock();
>  }
> @@ -6748,6 +6748,9 @@ static inline int nohz_kick_needed(struct rq *rq, int 
> cpu)
>  {
> unsigned long now = jiffies;
> struct sched_domain *sd;
> +   struct sched_group *sg;
> +   struct sched_group_power *sgp;
> +   int nr_busy;
>
> if (unlikely(idle_cpu(cpu)))
> return 0;
> @@ -6773,22 +6776,23 @@ static inline int nohz_kick_needed(struct rq *rq, int 
> cpu)
> goto need_kick;
>
> rcu_read_lock();
> -   for_each_domain(cpu, sd) {
> -   struct sched_group *sg = sd->groups;
> -   struct sched_group_power *sgp = sg->sgp;
> -   int nr_busy = atomic_read(&sgp->nr_busy_cpus);
> +   sd = per_cpu(sd_busy, cpu);
>
> -   if (sd->flags & SD_SHARE_PKG_RESOURCES && nr_busy > 1)
> -   goto need_kick_unlock;
> +   if (sd) {
> +   sg = sd->groups;

sg is not needed anymore
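i.e. you could directly do (untested):

	nr_busy = atomic_read(&sd->groups->sgp->nr_busy_cpus);

like in set_cpu_sd_state_busy() above, and drop the sg/sgp locals.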

> +   sgp = sg->sgp;
> +   nr_busy = atomic_read(&sgp->nr_busy_cpus);
>
> -   i

Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug

2013-02-11 Thread Vincent Guittot
Hi Srivatsa,

I can try to run some of our stress tests on your patches. Have you
got a git tree that i can pull ?

Regards,
Vincent

On 8 February 2013 19:09, Srivatsa S. Bhat
 wrote:
> On 02/08/2013 10:14 PM, Srivatsa S. Bhat wrote:
>> On 02/08/2013 09:11 PM, Russell King - ARM Linux wrote:
>>> On Thu, Feb 07, 2013 at 11:41:34AM +0530, Srivatsa S. Bhat wrote:
 On 02/07/2013 09:44 AM, Rusty Russell wrote:
> "Srivatsa S. Bhat"  writes:
>> On 01/22/2013 01:03 PM, Srivatsa S. Bhat wrote:
>>  Avg. latency of 1 CPU offline (ms) [stop-cpu/stop-m/c latency]
>>
>>  # online CPUs    Mainline (with stop-m/c)    This patchset (no stop-m/c)
>>
>>        8                  17.04                        7.73
>>       16                  18.05                        6.44
>>       32                  17.31                        7.39
>>       64                  32.40                        9.28
>>      128                  98.23                        7.35
>
> Nice!

 Thank you :-)

>  I wonder how the ARM guys feel with their quad-cpu systems...
>

 That would be definitely interesting to know :-)
>>>
>>> That depends what exactly you'd like tested (and how) and whether you'd
>>> like it to be a test-chip based quad core, or an OMAP dual-core SoC.
>>>
>>
>> The effect of stop_machine() doesn't really depend on the CPU architecture
>> used underneath or the platform. It depends only on the _number_ of
>> _logical_ CPUs used.
>>
>> And stop_machine() has 2 noticeable drawbacks:
>> 1. It makes the hotplug operation itself slow
>> 2. and it causes disruptions to the workloads running on the other
>> CPUs by hijacking the entire machine for significant amounts of time.
>>
>> In my experiments (mentioned above), I tried to measure how my patchset
>> improves (reduces) the duration of hotplug (CPU offline) itself. Which is
>> also slightly indicative of the impact it has on the rest of the system.
>>
>> But what would be nice to test, is a setup where the workloads running on
>> the rest of the system are latency-sensitive, and measure the impact of
>> CPU offline on them, with this patchset applied. That would tell us how
>> far is this useful in making CPU hotplug less disruptive on the system.
>>
>> Of course, it would be nice to also see whether we observe any reduction
>> in hotplug duration itself (point 1 above) on ARM platforms with lot
>> of CPUs. [This could potentially speed up suspend/resume, which is used
>> rather heavily on ARM platforms].
>>
>> The benefits from this patchset over mainline (both in terms of points
>> 1 and 2 above) is expected to increase, with increasing number of CPUs in
>> the system.
>>
>
> Adding Vincent to CC, who had previously evaluated the performance and
> latency implications of CPU hotplug on ARM platforms, IIRC.
>
> Regards,
> Srivatsa S. Bhat
>


Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug

2013-02-18 Thread Vincent Guittot
On 15 February 2013 20:40, Srivatsa S. Bhat
 wrote:
> Hi Vincent,
>
> On 02/15/2013 06:58 PM, Vincent Guittot wrote:
>> Hi Srivatsa,
>>
>> I have run some tests with you branch (thanks Paul for the git tree)
>> and you will find results below.
>>
>
> Thank you very much for testing this patchset!
>
>> The tests condition are:
>> - 5 CPUs system in 2 clusters
>> - The test plugs/unplugs CPU2 and it increases the system load each 20
>> plug/unplug sequence with either more cyclictests threads
>> - The test is done with all CPUs online and with only CPU0 and CPU2
>>
>> The main conclusion is that there is no differences with and without
>> your patches with my stress tests. I'm not sure that it was the
>> expected results but the cpu_down is already quite low : 4-5ms in
>> average
>>
>
> Atleast my patchset doesn't perform _worse_ than mainline, with respect
> to cpu_down duration :-)

Yes, exactly, and it has passed more than 400 consecutive plug/unplug
cycles on an ARM platform.

>
> So, here is the analysis:
> Stop-machine() doesn't really slow down CPU-down operation, if the rest
> of the CPUs are mostly running in userspace all the time. Because, the
> CPUs running userspace workloads cooperate very eagerly with the stop-machine
> dance - they receive the resched IPI, and allow the per-cpu cpu-stopper
> thread to monopolize the CPU, almost immediately.
>
> The scenario where stop-machine() takes longer to take effect is when
> most of the online CPUs are running in kernelspace, because, then the
> probability that they call preempt_disable() frequently (and hence inhibit
> stop-machine) is higher. That's why, in my tests, I ran genload from LTP
> which generated a lot of system-time (system-time in 'top' indicates activity
> in kernelspace). Hence my patchset showed significant improvement over
> mainline in my tests.
>

ok, I hadn't noticed this important point for the test

> However, your test is very useful too, if we measure a different parameter:
> the latency impact on the workloads running on the system (cyclic test).
> One other important aim of this patchset is to make hotplug as less intrusive
> as possible, for other workloads running on the system. So if you measure
> the cyclictest numbers, I would expect my patchset to show better numbers
> than mainline, when you do cpu-hotplug in parallel (same test that you did).
> Mainline would run stop-machine and hence interrupt the cyclic test tasks
> too often. My patchset wouldn't do that, and hence cyclic test should
> ideally show better numbers.

In fact, I haven't looked at the results yet, as I was more interested
in the load that was generated.

>
> I'd really appreciate if you could try that out and let me know how it
> goes.. :-) Thank you very much!

ok, I'm going to try to run a test series

Vincent
>
> Regards,
> Srivatsa S. Bhat
>
>>
>>
>> On 12 February 2013 04:58, Srivatsa S. Bhat
>>  wrote:
>>> On 02/12/2013 12:38 AM, Paul E. McKenney wrote:
>>>> On Mon, Feb 11, 2013 at 05:53:41PM +0530, Srivatsa S. Bhat wrote:
>>>>> On 02/11/2013 05:28 PM, Vincent Guittot wrote:
>>>>>> On 8 February 2013 19:09, Srivatsa S. Bhat
>>>>>>  wrote:
>>>>
>>>> [ . . . ]
>>>>
>>>>>>> Adding Vincent to CC, who had previously evaluated the performance and
>>>>>>> latency implications of CPU hotplug on ARM platforms, IIRC.
>>>>>>>
>>>>>>
>>>>>> Hi Srivatsa,
>>>>>>
>>>>>> I can try to run some of our stress tests on your patches.
>>>>>
>>>>> Great!
>>>>>
>>>>>> Have you
>>>>>> got a git tree that i can pull ?
>>>>>>
>>>>>
>>>>> Unfortunately, no, none at the moment..  :-(
>>>>
>>>> You do need to create an externally visible git tree.
>>>
>>> Ok, I'll do that soon.
>>>
>>>>  In the meantime,
>>>> I have added your series at rcu/bhat.2013.01.21a on -rcu:
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>>>>
>>>> This should appear soon on a kernel.org mirror near you.  ;-)
>>>>
>>>
>>> Thank you very much, Paul! :-)
>>>
>>> Regards,
>>> Srivatsa S. Bhat
>>>
>
>


Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug

2013-02-18 Thread Vincent Guittot
On 18 February 2013 11:51, Srivatsa S. Bhat
 wrote:
> On 02/18/2013 04:04 PM, Srivatsa S. Bhat wrote:
>> On 02/18/2013 03:54 PM, Vincent Guittot wrote:
>>> On 15 February 2013 20:40, Srivatsa S. Bhat
>>>  wrote:
>>>> Hi Vincent,
>>>>
>>>> On 02/15/2013 06:58 PM, Vincent Guittot wrote:
>>>>> Hi Srivatsa,
>>>>>
>>>>> I have run some tests with you branch (thanks Paul for the git tree)
>>>>> and you will find results below.
>>>>>
>>>>
>>>> Thank you very much for testing this patchset!
>>>>
>>>>> The tests condition are:
>>>>> - 5 CPUs system in 2 clusters
>>>>> - The test plugs/unplugs CPU2 and it increases the system load each 20
>>>>> plug/unplug sequence with either more cyclictests threads
>>>>> - The test is done with all CPUs online and with only CPU0 and CPU2
>>>>>
>>>>> The main conclusion is that there is no differences with and without
>>>>> your patches with my stress tests. I'm not sure that it was the
>>>>> expected results but the cpu_down is already quite low : 4-5ms in
>>>>> average
>>>>>
>>>>
>>>> Atleast my patchset doesn't perform _worse_ than mainline, with respect
>>>> to cpu_down duration :-)
>>>
>>> yes exactly and it has pass  more than 400 consecutive plug/unplug on
>>> an ARM platform
>>>
>>
>> Great! However, did you turn on CPU_IDLE during your tests?
>>
>> In my tests, I had turned off cpu idle in the .config, like I had mentioned 
>> in
>> the cover letter. I'm struggling to get it working with CPU_IDLE/INTEL_IDLE
>> turned on, because it gets into a lockup almost immediately. It appears that
>> the lock-holder of clockevents_lock never releases it, for some reason..
>> See below for the full log. Lockdep has not been useful in debugging this,
>> unfortunately :-(
>>
>
> Ah, nevermind, the following diff fixes it :-) I had applied this fix on v5
> and tested but it still had races where I used to hit the lockups. Now after
> I fixed all the memory barrier issues that Paul and Oleg pointed out in v5,
> I applied this fix again and tested it just now - it works beautifully! :-)

My tests have been done without cpuidle because I have some issues
when the function tracer and cpuidle are both enabled.

But CPU hotplug and cpuidle work well when I run the tests without
enabling the function tracer.

Vincent

>
> I'll include this fix and post a v6 soon.
>
> Regards,
> Srivatsa S. Bhat
>
> --->
>
>
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 30b6de0..ca340fd 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "tick-internal.h"
>
> @@ -431,6 +432,7 @@ void clockevents_notify(unsigned long reason, void *arg)
> unsigned long flags;
> int cpu;
>
> +   get_online_cpus_atomic();
> raw_spin_lock_irqsave(&clockevents_lock, flags);
> clockevents_do_notify(reason, arg);
>
> @@ -459,6 +461,7 @@ void clockevents_notify(unsigned long reason, void *arg)
> break;
> }
> raw_spin_unlock_irqrestore(&clockevents_lock, flags);
> +   put_online_cpus_atomic();
>  }
>  EXPORT_SYMBOL_GPL(clockevents_notify);
>  #endif
>


Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug

2013-02-18 Thread Vincent Guittot
On 18 February 2013 16:30, Steven Rostedt  wrote:
> On Mon, 2013-02-18 at 11:58 +0100, Vincent Guittot wrote:
>
>> My tests have been done without cpuidle because i have some issues
>> with function tracer and cpuidle
>>
>> But the cpu hotplug and cpuidle work well when I run the tests without
>> enabling the function tracer
>>
>
> I know suspend and resume has issues with function tracing (because it
> makes things like calling smp_processor_id() crash the system), but I'm
> unaware of issues with hotplug itself. Could be some of the same issues.
>
> Can you give me more details, I'll try to investigate it.

Yes, for sure.
The problem is more in the combination of cpuidle and the function tracer.

CPU hotplug and the function tracer work when cpuidle is disabled.
CPU hotplug and cpuidle work if I don't enable the function tracer.
My platform dies as soon as I enable the function tracer while cpuidle
is enabled. It looks like some notrace annotations are missing in my
platform driver, but we haven't completely fixed the issue yet.
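The fix will probably end up being something like the below in the
driver (the function name is just a placeholder to illustrate what I
mean by a missing notrace annotation):

	/* keep the low-level idle entry point out of the function tracer */
	static int notrace mydriver_enter_idle(struct cpuidle_device *dev,
					       struct cpuidle_driver *drv,
					       int index)
	{
		...
	}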

Vincent

>
> Thanks,
>
> -- Steve
>
>


Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug

2013-02-19 Thread Vincent Guittot
On 18 February 2013 20:53, Steven Rostedt  wrote:
> On Mon, 2013-02-18 at 17:50 +0100, Vincent Guittot wrote:
>
>> yes for sure.
>> The problem is more linked to cpuidle and function tracer.
>>
>> cpu hotplug and function tracer work when cpuidle is disable.
>> cpu hotplug and cpuidle works if i don't enable function tracer.
>> my platform is dead as soon as I enable function tracer if cpuidle is
>> enabled. It looks like some notrace are missing in my platform driver
>> but we haven't completely fix the issue yet
>>
>
> You can bisect to find out exactly what function is the problem:
>
>  cat /debug/tracing/available_filter_functions > t
>
>  # bisect: filter on half of the functions at a time
>  f(t) {
>    num=`wc -l < t`
>    let num=num/2
>    sed -ne "1,${num}p" t > t1
>    let num=num+1
>    sed -ne "${num},\$p" t > t2
>
>    cat t1 > /debug/tracing/set_ftrace_filter
>    # note this may take a long time to finish
>
>    echo function > /debug/tracing/current_tracer
>
>    # run the test; if it survives, repeat the bisection on t2,
>    # otherwise on t1
>  }
>

Thanks, I'm going to have a look.

Vincent

> -- Steve
>
>
>


Re: [PATCH v6 00/46] CPU hotplug: stop_machine()-free CPU hotplug

2013-03-01 Thread Vincent Guittot
Hi Srivatsa,

I have run some tests with genload on my ARM platform, but even with
mainline the cpu_down duration is quite short and stable (around 4ms)
with 5 or 2 online cores. The duration is similar with your patches.

Maybe I have not used the right option for genload? I have used
"genload -m 10", which seems to generate the most system time. Which
command have you used for your tests?

Vincent

On 18 February 2013 13:38, Srivatsa S. Bhat
 wrote:
> Hi,
>
> This patchset removes CPU hotplug's dependence on stop_machine() from the CPU
> offline path and provides an alternative (set of APIs) to preempt_disable() to
> prevent CPUs from going offline, which can be invoked from atomic context.
> The motivation behind the removal of stop_machine() is to avoid its 
> ill-effects
> and thus improve the design of CPU hotplug. (More description regarding this
> is available in the patches).
>
> All the users of preempt_disable()/local_irq_disable() who used to use it to
> prevent CPU offline, have been converted to the new primitives introduced in 
> the
> patchset. Also, the CPU_DYING notifiers have been audited to check whether
> they can cope up with the removal of stop_machine() or whether they need to
> use new locks for synchronization (all CPU_DYING notifiers looked OK, without
> the need for any new locks).
>
> Applies on current mainline (v3.8-rc7+).
>
> This patchset is available in the following git branch:
>
> git://github.com/srivatsabhat/linux.git  stop-machine-free-cpu-hotplug-v6
>
>
> Overview of the patches:
> ---
>
> Patches 1 to 7 introduce a generic, flexible Per-CPU Reader-Writer Locking
> scheme.
>
> Patch 8 uses this synchronization mechanism to build the
> get/put_online_cpus_atomic() APIs which can be used from atomic context, to
> prevent CPUs from going offline.
>
> Patch 9 is a cleanup; it converts preprocessor macros to static inline
> functions.
>
> Patches 10 to 43 convert various call-sites to use the new APIs.
>
> Patch 44 is the one which actually removes stop_machine() from the CPU
> offline path.
>
> Patch 45 decouples stop_machine() and CPU hotplug from Kconfig.
>
> Patch 46 updates the documentation to reflect the new APIs.
>
>
> Changes in v6:
> --
>
> * Fixed issues related to memory barriers, as pointed out by Paul and Oleg.
> * Fixed the locking issue related to clockevents_lock, which was being
>   triggered when cpu idle was enabled.
> * Some code restructuring to improve readability and to enhance some fastpath
>   optimizations.
> * Randconfig build-fixes, reported by Fengguang Wu.
>
>
> Changes in v5:
> --
>   Exposed a new generic locking scheme: Flexible Per-CPU Reader-Writer locks,
>   based on the synchronization schemes already discussed in the previous
>   versions, and used it in CPU hotplug, to implement the new APIs.
>
>   Audited the CPU_DYING notifiers in the kernel source tree and replaced
>   usages of preempt_disable() with the new get/put_online_cpus_atomic() APIs
>   where necessary.
>
>
> Changes in v4:
> --
>   The synchronization scheme has been simplified quite a bit, which makes it
>   look a lot less complex than before. Some highlights:
>
> * Implicit ACKs:
>
>   The earlier design required the readers to explicitly ACK the writer's
>   signal. The new design uses implicit ACKs instead. The reader switching
>   over to rwlock implicitly tells the writer to stop waiting for that reader.
>
> * No atomic operations:
>
>   Since we got rid of explicit ACKs, we no longer have the need for a reader
>   and a writer to update the same counter. So we can get rid of atomic ops
>   too.
>
> Changes in v3:
> --
> * Dropped the _light() and _full() variants of the APIs. Provided a single
>   interface: get/put_online_cpus_atomic().
>
> * Completely redesigned the synchronization mechanism again, to make it
>   fast and scalable at the reader-side in the fast-path (when no hotplug
>   writers are active). This new scheme also ensures that there is no
>   possibility of deadlocks due to circular locking dependency.
>   In summary, this provides the scalability and speed of per-cpu rwlocks
>   (without actually using them), while avoiding the downside (deadlock
>   possibilities) which is inherent in any per-cpu locking scheme that is
>   meant to compete with preempt_disable()/enable() in terms of flexibility.
>
>   The problem with using per-cpu locking to replace preempt_disable()/enable
>   was explained here:
>   https://lkml.org/lkml/2012/12/6/290
>
>   Basically we use per-cpu counters (for scalability) when no writers are
>   active, and then switch to global rwlocks (for lock-safety) when a writer
>   becomes active. It is a slightly complex scheme, but it is based on
>   standard principles of distributed algorithms.
>
> Changes in v2:
> -
> * Completely redesigned the synchronization scheme to avoid using any extra
>   cpumasks.
>
> * Provided APIs for 2 types of atomic hot

Re: [PATCH v2] sched: Fix compiler warnings

2014-06-24 Thread Vincent Guittot
On 25 June 2014 03:05, Guenter Roeck  wrote:
> Commit 143e1e28cb (sched: Rework sched_domain topology definition)
> introduced a number of functions with a return value of 'const int'.
> gcc doesn't know what to do with that and, if the kernel is compiled
> with W=1, complains with the following warnings whenever sched.h
> is included.
>
> include/linux/sched.h:875:25: warning:
> type qualifiers ignored on function return type
> include/linux/sched.h:882:25: warning:
> type qualifiers ignored on function return type
> include/linux/sched.h:889:25: warning:
> type qualifiers ignored on function return type
> include/linux/sched.h:1002:21: warning:
> type qualifiers ignored on function return type
>
> Commits fb2aa855 (sched, ARM: Create a dedicated scheduler topology table)
> and 607b45e9a (sched, powerpc: Create a dedicated topology table) introduce
> the same warning in the arm and powerpc code.
>
> Drop 'const' from the function declarations to fix the problem.
>
> The fix for all three patches has to be applied together to avoid
> compilation failures for the affected architectures.
>
> Cc: Dietmar Eggemann 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Benjamin Herrenschmidt 
> Cc: Vincent Guittot 

Acked-by: Vincent Guittot 

> Signed-off-by: Guenter Roeck 
> ---
> v2: Fix problem in all affected architectures with a single patch
> to avoid compilation errors.
>
>  arch/arm/kernel/topology.c | 2 +-
>  arch/powerpc/kernel/smp.c  | 2 +-
>  include/linux/sched.h  | 8 
>  3 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 9d85318..e35d880 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -275,7 +275,7 @@ void store_cpu_topology(unsigned int cpuid)
> cpu_topology[cpuid].socket_id, mpidr);
>  }
>
> -static inline const int cpu_corepower_flags(void)
> +static inline int cpu_corepower_flags(void)
>  {
> return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
>  }
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 51a3ff7..1007fb8 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -747,7 +747,7 @@ int setup_profiling_timer(unsigned int multiplier)
>
>  #ifdef CONFIG_SCHED_SMT
>  /* cpumask of CPUs with asymetric SMT dependancy */
> -static const int powerpc_smt_flags(void)
> +static int powerpc_smt_flags(void)
>  {
> int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 306f4f0..0376b05 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -872,21 +872,21 @@ enum cpu_idle_type {
>  #define SD_NUMA0x4000  /* cross-node balancing */
>
>  #ifdef CONFIG_SCHED_SMT
> -static inline const int cpu_smt_flags(void)
> +static inline int cpu_smt_flags(void)
>  {
> return SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
>  }
>  #endif
>
>  #ifdef CONFIG_SCHED_MC
> -static inline const int cpu_core_flags(void)
> +static inline int cpu_core_flags(void)
>  {
> return SD_SHARE_PKG_RESOURCES;
>  }
>  #endif
>
>  #ifdef CONFIG_NUMA
> -static inline const int cpu_numa_flags(void)
> +static inline int cpu_numa_flags(void)
>  {
> return SD_NUMA;
>  }
> @@ -999,7 +999,7 @@ void free_sched_domains(cpumask_var_t doms[], unsigned 
> int ndoms);
>  bool cpus_share_cache(int this_cpu, int that_cpu);
>
>  typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
> -typedef const int (*sched_domain_flags_f)(void);
> +typedef int (*sched_domain_flags_f)(void);
>
>  #define SDTL_OVERLAP   0x01
>
> --
> 1.9.1
>

Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-07-05 Thread Vincent Guittot
Le lundi 05 juin 2023 à 10:07:16 (+0200), Tobias Huschle a écrit :
> On 2023-05-16 15:36, Vincent Guittot wrote:
> > On Mon, 15 May 2023 at 13:46, Tobias Huschle 
> > wrote:
> > > 
> > > The current load balancer implementation implies that scheduler
> > > groups,
> > > within the same domain, all host the same number of CPUs. This is
> > > reflected in the condition, that a scheduler group, which is load
> > > balancing and classified as having spare capacity, should pull work
> > > from the busiest group, if the local group runs less processes than
> > > the busiest one. This implies that these two groups should run the
> > > same number of processes, which is problematic if the groups are not
> > > of the same size.
> > > 
> > > The assumption that scheduler groups within the same scheduler domain
> > > host the same number of CPUs appears to be true for non-s390
> > > architectures. Nevertheless, s390 can have scheduler groups of unequal
> > > size.
> > > 
> > > This introduces a performance degredation in the following scenario:
> > > 
> > > Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> > > the remaining 2 are located on another socket:
> > > 
> > > Socket   -----1-----    --2--
> > > CPU      1 2 3 4 5 6    7 8
> > > 
> > > Placing some workload ( x = one task ) yields the following
> > > scenarios:
> > > 
> > > The first 5 tasks are distributed evenly across the two groups.
> > > 
> > > Socket   -----1-----    --2--
> > > CPU      1 2 3 4 5 6    7 8
> > >          x x x          x x
> > > 
> > > Adding a 6th task yields the following distribution:
> > > 
> > > Socket   -----1-----    --2--
> > > CPU      1 2 3 4 5 6    7 8
> > > SMT1     x x x          x x
> > > SMT2                    x
> > 
> > Your description is a bit confusing for me. What you name CPU above
> > should be named Core, doesn' it ?
> > 
> > Could you share with us your scheduler topology ?
> > 
> 
> You are correct, it should say core instead of CPU.
> 
> One actual configuration from one of my test machines (lscpu -e):
> 

[...]

> 
> So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

Thanks for the details.

> 
> If I run stress-ng with 8 cpu stressors on the original code I get a
> distribution
> like this:
> 
> 00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
>  x     x     x     x                     x  x  x  x
> 
> Which means that the two cores in the smaller group are running into SMT
> while two
> cores in the larger group are still idle. This is caused by the
> prefer_sibling path
> which really wants to see both groups run the same number of tasks.

Yes, and it assumes that all groups have the same number of CPUs.
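To put hypothetical numbers on the layout above (group weights of 12
and 4 CPUs, 6 tasks running in the big group, 2 in the small one):

	current:  busiest->sum_nr_running > local->sum_nr_running + 1
	          6 > 2 + 1         -> true, the small group pulls a task
	                               and runs into SMT

	proposed: busiest->sum_nr_running * local->group_weight >
	          local->sum_nr_running * busiest->group_weight + 1
	          6 * 4 > 2 * 12 + 1 -> false, no pull while the big group
	                               still has idle CPUs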

> 
> > > 
> > > The task is added to the 2nd scheduler group, as the scheduler has the
> > > assumption that scheduler groups are of the same size, so they should
> > > also host the same number of tasks. This makes CPU 7 run into SMT
> > > thread, which comes with a performance penalty. This means, that in
> > > the window of 6-8 tasks, load balancing is done suboptimally, because
> > > SMT is used although there is no reason to do so as fully idle CPUs
> > > are still available.
> > > 
> > > Taking the weight of the scheduler groups into account, ensures that
> > > a load balancing CPU within a smaller group will not try to pull tasks
> > > from a bigger group while the bigger group still has idle CPUs
> > > available.
> > > 
> > > Signed-off-by: Tobias Huschle 
> > > ---
> > >  kernel/sched/fair.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 48b6f0ca13ac..b1307d7e4065 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -10426,7 +10426,8 @@ static struct sched_group
> > > *find_busiest_group(struct lb_env *env)
> > >  * group's child domain.
> > >  */
> > > if (sds.prefer_sibling && local->group_type ==
> > > group_has_spare &&
> > > -   busiest->sum_nr_running > local->sum_nr_running + 1)
> > > +   busiest->sum_nr_running * local->group_weight >
> > > +   local->sum_nr_running *
> > > 

Re: [RFC][PATCH] sched: Rename DIE domain

2023-07-17 Thread Vincent Guittot
On Wed, 12 Jul 2023 at 16:11, Peter Zijlstra  wrote:
>
> Hi
>
> Thomas just tripped over the x86 topology setup creating a 'DIE' domain
> for the package mask :-)

Maybe a link to the change that triggered this patch would be useful.

>
> Since these names are SCHED_DEBUG only, rename them.
> I don't think anybody *should* be relying on this, but who knows.

Apart from the remaining references to DIE already mentioned by
others, this looks good to me.

>
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/powerpc/kernel/smp.c   | 2 +-
>  arch/s390/kernel/topology.c | 2 +-
>  arch/x86/kernel/smpboot.c   | 2 +-
>  kernel/sched/topology.c | 2 +-
>  4 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index fbbb695bae3d..5ed6b9fe5094 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1050,7 +1050,7 @@ static struct sched_domain_topology_level 
> powerpc_topology[] = {
>  #endif
> { shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) 
> },
> { cpu_mc_mask, SD_INIT_NAME(MC) },
> -   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +   { cpu_cpu_mask, SD_INIT_NAME(PKG) },
> { NULL, },
>  };
>
> diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
> index 68adf1de..c803f5e6ab46 100644
> --- a/arch/s390/kernel/topology.c
> +++ b/arch/s390/kernel/topology.c
> @@ -522,7 +522,7 @@ static struct sched_domain_topology_level s390_topology[] 
> = {
> { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
> { cpu_book_mask, SD_INIT_NAME(BOOK) },
> { cpu_drawer_mask, SD_INIT_NAME(DRAWER) },
> -   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +   { cpu_cpu_mask, SD_INIT_NAME(PKG) },
> { NULL, },
>  };
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index e1aa2cd7734b..09cc9d0aa358 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -653,7 +653,7 @@ static void __init build_sched_topology(void)
>  */
> if (!x86_has_numa_in_package) {
> x86_topology[i++] = (struct sched_domain_topology_level){
> -   cpu_cpu_mask, SD_INIT_NAME(DIE)
> +   cpu_cpu_mask, SD_INIT_NAME(PKG)
> };
> }
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d3a3b2646ec4..e9d9cf776b7a 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1670,7 +1670,7 @@ static struct sched_domain_topology_level 
> default_topology[] = {
>  #ifdef CONFIG_SCHED_MC
> { cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
>  #endif
> -   { cpu_cpu_mask, SD_INIT_NAME(DIE) },
> +   { cpu_cpu_mask, SD_INIT_NAME(PKG) },
> { NULL, },
>  };
>
>


Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-05-16 Thread Vincent Guittot
On Mon, 15 May 2023 at 13:46, Tobias Huschle  wrote:
>
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition, that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degredation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket   -----1-----    --2--
> CPU      1 2 3 4 5 6    7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket   -----1-----    --2--
> CPU      1 2 3 4 5 6    7 8
>          x x x          x x
>
> Adding a 6th task yields the following distribution:
>
> Socket   -----1-----    --2--
> CPU      1 2 3 4 5 6    7 8
> SMT1     x x x          x x
> SMT2                    x

Your description is a bit confusing to me. What you name CPU above
should be named Core, shouldn't it?

Could you share with us your scheduler topology ?

>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account, ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle 
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct 
> lb_env *env)
>  * group's child domain.
>  */
> if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -   busiest->sum_nr_running > local->sum_nr_running + 1)
> +   busiest->sum_nr_running * local->group_weight >
> +   local->sum_nr_running * busiest->group_weight + 1)

This is the prefer_sibling path. Could it be that you should disable
prefer_sibling between your sockets for such a topology? The default
path compares the number of idle CPUs when the groups have spare
capacity.
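Roughly (paraphrased, not the exact code), when the groups have spare
capacity the default path bails out along these lines:

	if (busiest->group_type != group_overloaded) {
		...
		if (busiest->group_weight > 1 &&
		    local->idle_cpus <= (busiest->idle_cpus + 1))
			goto out_balanced;
		...
	}

so the small group would not pull anything as long as the big group
has at least as many idle CPUs.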


> goto force_balance;
>
> if (busiest->group_type != group_overloaded) {
> --
> 2.34.1
>


Re: [RFC PATCH 00/14] Introducing TIF_NOTIFY_IPI flag

2024-03-06 Thread Vincent Guittot
> https://lore.kernel.org/lkml/cakftptc446lo9catpp7pexdklhhqfobuy-jmgc7agohy4hs...@mail.gmail.com/
>
> This series is based on tip:sched/core at tag "sched-core-2024-01-08".
> ---
> Gautham R. Shenoy (4):
>   thread_info: Add helpers to test and clear TIF_NOTIFY_IPI
>   sched: Define a need_resched_or_ipi() helper and use it treewide
>   sched/core: Use TIF_NOTIFY_IPI to notify an idle CPU in TIF_POLLING
> mode of pending IPI
>   x86/thread_info: Introduce TIF_NOTIFY_IPI flag
>
> K Prateek Nayak (10):
>   arm/thread_info: Introduce TIF_NOTIFY_IPI flag
>   alpha/thread_info: Introduce TIF_NOTIFY_IPI flag
>   openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag
>   powerpc/thread_info: Introduce TIF_NOTIFY_IPI flag
>   sh/thread_info: Introduce TIF_NOTIFY_IPI flag
>   sparc/thread_info: Introduce TIF_NOTIFY_IPI flag
>   csky/thread_info: Introduce TIF_NOTIFY_IPI flag
>   parisc/thread_info: Introduce TIF_NOTIFY_IPI flag
>   nios2/thread_info: Introduce TIF_NOTIFY_IPI flag
>   microblaze/thread_info: Introduce TIF_NOTIFY_IPI flag
> ---
> Cc: Richard Henderson 
> Cc: Ivan Kokshaysky 
> Cc: Matt Turner 
> Cc: Russell King 
> Cc: Guo Ren 
> Cc: Michal Simek 
> Cc: Dinh Nguyen 
> Cc: Jonas Bonn 
> Cc: Stefan Kristiansson 
> Cc: Stafford Horne 
> Cc: "James E.J. Bottomley" 
> Cc: Helge Deller 
> Cc: Michael Ellerman 
> Cc: Nicholas Piggin 
> Cc: Christophe Leroy 
> Cc: "Aneesh Kumar K.V" 
> Cc: "Naveen N. Rao" 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: John Paul Adrian Glaubitz 
> Cc: "David S. Miller" 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: Dave Hansen 
> Cc: "H. Peter Anvin" 
> Cc: "Rafael J. Wysocki" 
> Cc: Daniel Lezcano 
> Cc: Peter Zijlstra 
> Cc: Juri Lelli 
> Cc: Vincent Guittot 
> Cc: Dietmar Eggemann 
> Cc: Steven Rostedt 
> Cc: Ben Segall 
> Cc: Mel Gorman 
> Cc: Daniel Bristot de Oliveira 
> Cc: Valentin Schneider 
> Cc: Al Viro 
> Cc: Linus Walleij 
> Cc: Ard Biesheuvel 
> Cc: Andrew Donnellan 
> Cc: Nicholas Miehlbradt 
> Cc: Andrew Morton 
> Cc: Arnd Bergmann 
> Cc: Josh Poimboeuf 
> Cc: "Kirill A. Shutemov" 
> Cc: Rick Edgecombe 
> Cc: Tony Battersby 
> Cc: Brian Gerst 
> Cc: Tim Chen 
> Cc: David Vernet 
> Cc: x...@kernel.org
> Cc: linux-ker...@vger.kernel.org
> Cc: linux-al...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-c...@vger.kernel.org
> Cc: linux-openr...@vger.kernel.org
> Cc: linux-par...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> Cc: linux...@vger.kernel.org
> ---
>  arch/alpha/include/asm/thread_info.h  |  2 ++
>  arch/arm/include/asm/thread_info.h|  3 ++
>  arch/csky/include/asm/thread_info.h   |  2 ++
>  arch/microblaze/include/asm/thread_info.h |  2 ++
>  arch/nios2/include/asm/thread_info.h  |  2 ++
>  arch/openrisc/include/asm/thread_info.h   |  2 ++
>  arch/parisc/include/asm/thread_info.h |  2 ++
>  arch/powerpc/include/asm/thread_info.h|  2 ++
>  arch/sh/include/asm/thread_info.h |  2 ++
>  arch/sparc/include/asm/thread_info_32.h   |  2 ++
>  arch/sparc/include/asm/thread_info_64.h   |  2 ++
>  arch/x86/include/asm/mwait.h  |  2 +-
>  arch/x86/include/asm/thread_info.h|  2 ++
>  arch/x86/kernel/process.c |  2 +-
>  drivers/cpuidle/cpuidle-powernv.c |  2 +-
>  drivers/cpuidle/cpuidle-pseries.c |  2 +-
>  drivers/cpuidle/poll_state.c  |  2 +-
>  include/linux/sched.h |  5 +++
>  include/linux/sched/idle.h| 12 +++
>  include/linux/thread_info.h   | 43 +++
>  kernel/sched/core.c   | 41 -
>  kernel/sched/idle.c   | 23 
>  22 files changed, 133 insertions(+), 26 deletions(-)
>
> --
> 2.34.1
>


Re: [RFC PATCH 00/14] Introducing TIF_NOTIFY_IPI flag

2024-03-06 Thread Vincent Guittot
On Wed, 6 Mar 2024 at 11:18, K Prateek Nayak  wrote:
>
> Hello Vincent,
>
> Thank you for taking a look at the series.
>
> On 3/6/2024 3:29 PM, Vincent Guittot wrote:
> > Hi Prateek,
> >
> > Adding Julia who could be interested in this patchset. Your patchset
> > should trigger idle load balance instead of newly idle load balance
> > now when the polling is used. This was one reason for not migrating
> > task in idle CPU
>
> Thank you.
>
> >
> > On Tue, 20 Feb 2024 at 18:15, K Prateek Nayak  
> > wrote:
> >>
> >> Hello everyone,
> >>
> >> [..snip..]
> >>
> >>
> >> Skipping newidle_balance()
> >> ==
> >>
> >> In an earlier attempt to solve the challenge of the long IRQ disabled
> >> section, newidle_balance() was skipped when a CPU waking up from idle
> >> was found to have no runnable tasks, and was transitioning back to
> >> idle [2]. Tim [3] and David [4] had pointed out that newidle_balance()
> >> may be viable for CPUs that are idling with tick enabled, where the
> >> newidle_balance() has the opportunity to pull tasks onto the idle CPU.
> >>
> >> Vincent [5] pointed out a case where the idle load kick will fail to
> >> run on an idle CPU since the IPI handler launching the ILB will check
> >> for need_resched(). In such cases, the idle CPU relies on
> >> newidle_balance() to pull tasks towards itself.
> >
> > Calling newidle_balance() instead of the normal idle load balance
> > prevents the CPU to pull tasks from other groups
>
> Thank you for the correction.
>
> >
> >>
> >> Using an alternate flag instead of NEED_RESCHED to indicate a pending
> >> IPI was suggested as the correct approach to solve this problem on the
> >> same thread.
> >>
> >>
> >> Proposed solution: TIF_NOTIFY_IPI
> >> =
> >>
> >> Instead of reusing TIF_NEED_RESCHED bit to pull an TIF_POLLING CPU out
> >> of idle, TIF_NOTIFY_IPI is a newly introduced flag that
> >> call_function_single_prep_ipi() sets on a target TIF_POLLING CPU to
> >> indicate a pending IPI, which the idle CPU promises to process soon.
> >>
> >> On architectures that do not support the TIF_NOTIFY_IPI flag (this
> >> series only adds support for x86 and ARM processors for now),
> >
> > I'm surprised that you are mentioning ARM processors because they
> > don't use TIF_POLLING.
>
> Yup I just realised that after Linus Walleij pointed it out on the
> thread.
>
> >
> >> call_function_single_prep_ipi() will fallback to setting
> >> TIF_NEED_RESCHED bit to pull the TIF_POLLING CPU out of idle.
> >>
> >> Since the pending IPI handlers are processed before the call to
> >> schedule_idle() in do_idle(), schedule_idle() will only be called if the
> >> IPI handler have woken / migrated a new task on the idle CPU and has set
> >> TIF_NEED_RESCHED bit to indicate the same. This avoids running into the
> >> long IRQ disabled section in schedule_idle() unnecessarily, and any
> >> need_resched() check within a call function will accurately notify if a
> >> task is waiting for CPU time on the CPU handling the IPI.
> >>
> >> Following is the crude visualization of how the situation changes with
> >> the newly introduced TIF_NOTIFY_IPI flag:
> >> --
> >> CPU0CPU1
> >> 
> >> do_idle() {
> >> 
> >> __current_set_polling();
> >> ...
> >> 
> >> monitor(addr);
> >> if 
> >> (!need_resched_or_ipi())
> >> 
> >> mwait() {
> >> /* 
> >> Waiting */
> >> smp_call_function_single(CPU1, func, wait = 1) {   
> >>  ...
> >> ...
> >>  ...
> >>

Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-19 Thread Vincent Guittot
On Tue, 19 Mar 2024 at 10:08, Tobias Huschle  wrote:
>
> On 2024-03-18 15:45, Luis Machado wrote:
> > On 3/14/24 13:45, Tobias Huschle wrote:
> >> On Fri, Mar 08, 2024 at 03:11:38PM +, Luis Machado wrote:
> >>> On 2/28/24 16:10, Tobias Huschle wrote:
> 
>  Questions:
>  1. The kworker getting its negative lag occurs in the following
>  scenario
> - kworker and a cgroup are supposed to execute on the same CPU
> - one task within the cgroup is executing and wakes up the
>  kworker
> - kworker with 0 lag, gets picked immediately and finishes its
>   execution within ~5000ns
> - on dequeue, kworker gets assigned a negative lag
> Is this expected behavior? With this short execution time, I
>  would
> expect the kworker to be fine.
> >>>
> >>> That strikes me as a bit odd as well. Have you been able to determine
> >>> how a negative lag
> >>> is assigned to the kworker after such a short runtime?
> >>>
> >>
> >> I did some more trace reading though and found something.
> >>
> >> What I observed if everything runs regularly:
> >> - vhost and kworker run alternating on the same CPU
> >> - if the kworker is done, it leaves the runqueue
> >> - vhost wakes up the kworker if it needs it
> >> --> this means:
> >>   - vhost starts alone on an otherwise empty runqueue
> >>   - it seems like it never gets dequeued
> >> (unless another unrelated task joins or migration hits)
> >>   - if vhost wakes up the kworker, the kworker gets selected
> >>   - vhost runtime > kworker runtime
> >> --> kworker gets positive lag and gets selected immediately next
> >> time
> >>
> >> What happens if it does go wrong:
> >> From what I gather, there seem to be occasions where the vhost either
> >> executes suprisingly quick, or the kworker surprinsingly slow. If
> >> these
> >> outliers reach critical values, it can happen, that
> >>vhost runtime < kworker runtime
> >> which now causes the kworker to get the negative lag.
> >>
> >> In this case it seems like, that the vhost is very fast in waking up
> >> the kworker. And coincidentally, the kworker takes, more time than
> >> usual
> >> to finish. We speak of 4-digit to low 5-digit nanoseconds.
> >>
> >> So, for these outliers, the scheduler extrapolates that the kworker
> >> out-consumes the vhost and should be slowed down, although in the
> >> majority
> >> of other cases this does not happen.
> >
> > Thanks for providing the above details Tobias. It does seem like EEVDF
> > is strict
> > about the eligibility checks and making tasks wait when their lags are
> > negative, even
> > if just a little bit as in the case of the kworker.
> >
> > There was a patch to disable the eligibility checks
> > (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefes...@chromium.org/),
> > which would make EEVDF more like EVDF, though the deadline comparison
> > would
> > probably still favor the vhost task instead of the kworker with the
> > negative lag.
> >
> > I'm not sure if you tried it, but I thought I'd mention it.
>
> Haven't seen that one yet. Unfortunately, it does not help to ignore the
> eligibility.
>
> I'm inclined to rather propose propose a documentation change, which
> describes that tasks should not rely on woken up tasks being scheduled
> immediately.

Where do you see such an assumption? Even before EEVDF, there was
nothing that ensured such behavior. With CFS (legacy or EEVDF) tasks,
you can't know whether the newly woken task will run first or not.


>
> Changing things in the code to address for the specific scenario I'm
> seeing seems to mostly create unwanted side effects and/or would require
> the definition of some magic cut-off values.
>
>


Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-20 Thread Vincent Guittot
On Wed, 20 Mar 2024 at 08:04, Tobias Huschle  wrote:
>
> On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> > On Tue, 19 Mar 2024 at 10:08, Tobias Huschle  wrote:
> > >
...
> > >
> > > Haven't seen that one yet. Unfortunately, it does not help to ignore the
> > > eligibility.
> > >
> > > I'm inclined to rather propose propose a documentation change, which
> > > describes that tasks should not rely on woken up tasks being scheduled
> > > immediately.
> >
> > Where do you see such an assumption ? Even before eevdf, there were
> > nothing that ensures such behavior. When using CFS (legacy or eevdf)
> > tasks, you can't know if the newly wakeup task will run 1st or not
> >
>
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving it a slight boost, right?. For the scenario

It was rather the opposite: it was ensuring that long-sleeping tasks
will not get too much of a bonus because of a vruntime too far in the
past. This is similar, although not exactly the same intent, to the
lag. The bonus was up to 24ms previously, whereas it's not more than a
slice now.

> that I observed, that boost was enough to make sure, that the woken up tasks
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.
>
> Dismissing the lag on wakeup also does obviously not guarantee getting
> scheduled, as other tasks might still be involved.
>
> The question would be if it should be explicitly mentioned somewhere that,
> at this point, woken up tasks are not getting any special treatment and
> noone should rely on that boost for woken up tasks.
>
> > >
> > > Changing things in the code to address for the specific scenario I'm
> > > seeing seems to mostly create unwanted side effects and/or would require
> > > the definition of some magic cut-off values.
> > >
> > >


Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-03-22 Thread Vincent Guittot
On Thu, 21 Mar 2024 at 13:18, Tobias Huschle  wrote:
>
> On Wed, Mar 20, 2024 at 02:51:00PM +0100, Vincent Guittot wrote:
> > On Wed, 20 Mar 2024 at 08:04, Tobias Huschle  wrote:
> > > There was no guarantee of course. place_entity was reducing the vruntime 
> > > of
> > > woken up tasks though, giving it a slight boost, right?. For the scenario
> >
> > It was rather the opposite, It was ensuring that long sleeping tasks
> > will not get too much bonus because of vruntime too far in the past.
> > This is similar although not exactly the same intent as the lag. The
> > bonus was up to 24ms previously whereas it's not more than a slice now
> >
>
> I might have gotten this quite wrong then. I was looking at place_entity
> and saw that non-initial placements get their vruntime reduced via
>
> vruntime -= thresh;

and then
se->vruntime = max_vruntime(se->vruntime, vruntime)
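For context, the whole non-initial placement was roughly (quoting from
memory, pre-EEVDF):

	u64 vruntime = cfs_rq->min_vruntime;
	...
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}
	...
	/* ensure we never gain time by being placed backwards */
	se->vruntime = max_vruntime(se->vruntime, vruntime);

so the subtraction only lowers the floor below min_vruntime: a task
whose own vruntime is already ahead keeps it, and a long sleeper gets
at most that threshold as a bonus.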

>
> which would mean that the placed task would have a vruntime smaller than
> cfs_rq->min_vruntime, based on pre-EEVDF behavior, last seen at:
>
>af4cf40470c2 sched/fair: Add cfs_rq::avg_vruntime
>
> If there was no such benefit for woken up tasks. Then the scenario I observed
> is just conincidentally worse with EEVDF, which can happen when exchanging an
> algorithm I suppose. Or EEVDF just exposes a so far hidden problem in that
> scenario.


Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-12 Thread Vincent Guittot
On Mon, 12 Apr 2021 at 11:37, Mel Gorman  wrote:
>
> On Mon, Apr 12, 2021 at 11:54:36AM +0530, Srikar Dronamraju wrote:
> > * Gautham R. Shenoy  [2021-04-02 11:07:54]:
> >
> > >
> > > To remedy this, this patch proposes that the LLC be moved to the MC
> > > level which is a group of cores in one half of the chip.
> > >
> > >   SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE
> > >
> >
> > I think marking Hemisphere as a LLC in a P10 scenario is a good idea.
> >
> > > While there is no cache being shared at this level, this is still the
> > > level where some amount of cache-snooping takes place and it is
> > > relatively faster to access the data from the caches of the cores
> > > within this domain. With this change, we no longer see regressions on
> > > P10 for applications which require single threaded performance.
> >
> > Peter, Valentin, Vincent, Mel, etal
> >
> > On architectures where we have multiple levels of cache access latencies
> > within a DIE, (For example: one within the current LLC or SMT core and the
> > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > suggestions on how we could handle the same in the core scheduler?

I would say that SD_SHARE_PKG_RESOURCES is there for that, and it
doesn't only rely on cache.

> >
>
> Minimally I think it would be worth detecting when there are multiple
> LLCs per node and detecting that in generic code as a static branch. In
> select_idle_cpu, consider taking two passes -- first on the LLC domain
> and if no idle CPU is found then taking a second pass if the search depth

We have made a lot of changes to reduce and optimize the fast path,
and I don't think re-adding another layer in the fast path makes
sense, as you will end up unrolling the for_each_domain behind some
static branches.

SD_SHARE_PKG_RESOURCES should be set at the last level where we can
efficiently move tasks between CPUs at wakeup.
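For P10 that would mean something like the below in the powerpc
topology table (just a sketch, the flag function name is made up):

	static int powerpc_shared_proc_flags(void)
	{
		return SD_SHARE_PKG_RESOURCES;
	}

	...
	{ cpu_mc_mask, powerpc_shared_proc_flags, SD_INIT_NAME(MC) },
	...

so that the wakeup fast path scans the whole hemisphere instead of
only the SMT4 core.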

> allows within the node with the LLC CPUs masked out. While there would be
> a latency hit because cache is not shared, it would still be a CPU local
> to memory that is idle. That would potentially be beneficial on Zen*
> as well without having to introduce new domains in the topology hierarchy.

What is the current sched_domain topology description for Zen?

>
> --
> Mel Gorman
> SUSE Labs


Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-13 Thread Vincent Guittot
On Mon, 12 Apr 2021 at 17:24, Mel Gorman  wrote:
>
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and 
> > > > the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have 
> > > > any
> > > > suggestions on how we could handle the same in the core scheduler?
> >
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> >
>
> From topology.c
>
> SD_SHARE_PKG_RESOURCES - describes shared caches
>
> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
>
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.

But the SMT4 level is still present here, with select_idle_core()
taking care of the spreading.

>
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> >
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re adding another layer  in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_banches.
> >
>
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.

What I mean is that you should directly do for_each_sched_domain in
the fast path, because that's what you are proposing in the end. It no
longer looks like a fast path then, but like a traditional load
balance.

>
> But still, your comment reminded me that different architectures have
> different requirements
>
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
> spillover to L3 when selecting and idle CPU.
>
> X86 varies, it might want the Power10 approach for some families and prefer
> L3 spilling over to a CPU on the same node in others.
>
> S390 cares about something called books and drawers although I've no
> what it means as such and whether it has any preferences on
> search order.
>
> ARM has similar requirements again according to "scheduler: expose the
> topology of clusters and add cluster scheduler" and that one *does*
> add another domain.
>
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
>
> The only commonality I thought might exist is that architectures may
> like to define what the first domain to search for an idle CPU and a
> second domain. Alternatively, architectures could specify a domain to
> search primarily but also search the next domain in the hierarchy if
> search depth permits. The default would be the existing behaviour --
> search CPUs sharing a last-level-cache.
>
> > SD_SHARE_PKG_RESOURCES should be set to the last level where we can
> > efficiently move task between CPUs at wakeup
> >
>
> The definition of "efficiently" varies. Moving tasks between CPUs sharing
> a cache is most efficient but moving the task to a CPU that at least has
> local memory channels is a reasonable option if there are no idle CPUs
> sharing cache and preferable to stacking.

That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me.
This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to
be considered in the fast path.
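
Concretely, and as a simplified illustration only (mirroring how
update_top_cache_domain() derives sd_llc, not new code): the fast-path
cpumask is the span of the highest domain carrying SD_SHARE_PKG_RESOURCES,
so where that flag stops in the hierarchy directly decides which CPUs the
wakeup path will scan.

/*
 * Simplified sketch: sd_llc is the highest domain with
 * SD_SHARE_PKG_RESOURCES set, and its span is the cpumask scanned by
 * the wake-up fast path.
 */
static const struct cpumask *sketch_wakeup_scan_mask(int cpu)
{
	struct sched_domain *sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);

	return sd ? sched_domain_span(sd) : cpumask_of(cpu);
}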

>
> > > allows within the node with the LLC CPUs masked out. While there would be
> > > a latency hit because cache is not shared, it would still be a CPU local
> > > to memory that is idle. That would potentially be beneficial on Zen*
> > > as well without having to introduce new domains in th

Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Vincent Guittot
On Tue, 10 Aug 2021 at 16:41, Ricardo Neri
 wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Signed-off-by: Ricardo Neri 
> ---
> Changes since v3:
>   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> powerpc folks showed that this patch should not impact them. Also, more
> recent powerpc processor no longer use asym_packing. (PeterZ)
>   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
>   * Removed unnecessary check for local CPUs when the local group has zero
> utilization. (Joel)
>   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> the fact that it deals with SMT cases.
>   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> that callers can deal with non-SMT cases.
>
> Changes since v2:
>   * Reworded the commit message to reflect updates in code.
>   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
>   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
>   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
>   * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
>  kernel/sched/fair.c | 95 +
>  1 file changed, 95 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd411cefb63f..8a1a2a43732c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8531,10 +8531,99 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> tasks
> + * @dst_cpu:   Destination CPU of the load balancing
> + * @sds:   Load-balancing data with statistics of the local group
> + * @sgs:   Load-balancing statistics of the candidate busiest group
> + * @sg:The candidate busiet group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT 
> siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +   struct sg_lb_stats *sgs,
> +   struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +   bool local_is_smt, sg_is_smt;
> +   int sg_busy_cpus;
> +
> +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +   if (!local_is_smt) {
> +   /*
> +* If we are here, @dst_cpu is idle and does not have SMT
> +* siblings. Pull tasks if candidate group has two or more
> +* busy CPUs.
> +*/
> +   if (sg_is_smt && sg_busy_cpus >= 2)
> +   return true;
> +
> +   /*
> +* @dst_cpu does not have SMT siblings. @sg may have SMT
> +* siblings and only one is busy. In such case, @dst_cpu
> +* can help if it has higher priority and is idle.
> +*/
> +   return !sds->local_stat.group_util &&

sds->local_stat.group_util can't be used to decide if a CPU or group
of CPUs is idle. util_avg is usually not null when a CPU becomes idle
and you can have to wait more than 300ms before it becomes null.

Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-27 Thread Vincent Guittot
On Fri, 27 Aug 2021 at 16:50, Peter Zijlstra  wrote:
>
> On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > > +/**
> > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can 
> > > pull tasks
> > > + * @dst_cpu:   Destination CPU of the load balancing
> > > + * @sds:   Load-balancing data with statistics of the local group
> > > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > > + * @sg:The candidate busiet group
> > > + *
> > > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > > decide
> > > + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, 
> > > it can
> > > + * pull tasks if two or more of the SMT siblings of @sg are busy. If 
> > > only one
> > > + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> > > + *
> > > + * If both @dst_cpu and @sg have SMT siblings, even the number of idle 
> > > CPUs
> > > + * between @sds::local and @sg. Thus, pull tasks from @sg if the 
> > > difference
> > > + * between the number of busy CPUs is 2 or more. If the difference is of 
> > > 1,
> > > + * only pull if @dst_cpu has higher priority. If @sg does not have SMT 
> > > siblings
> > > + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and 
> > > @sg
> > > + * has lower priority.
> > > + */
> > > +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> > > +   struct sg_lb_stats *sgs,
> > > +   struct sched_group *sg)
> > > +{
> > > +#ifdef CONFIG_SCHED_SMT
> > > +   bool local_is_smt, sg_is_smt;
> > > +   int sg_busy_cpus;
> > > +
> > > +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> > > +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> > > +
> > > +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> > > +
> > > +   if (!local_is_smt) {
> > > +   /*
> > > +* If we are here, @dst_cpu is idle and does not have SMT
> > > +* siblings. Pull tasks if candidate group has two or more
> > > +* busy CPUs.
> > > +*/
> > > +   if (sg_is_smt && sg_busy_cpus >= 2)
> > > +   return true;
> > > +
> > > +   /*
> > > +* @dst_cpu does not have SMT siblings. @sg may have SMT
> > > +* siblings and only one is busy. In such case, @dst_cpu
> > > +* can help if it has higher priority and is idle.
> > > +*/
> > > +   return !sds->local_stat.group_util &&
> >
> > sds->local_stat.group_util can't be used to decide if a CPU or group
> > of CPUs is idle. util_avg is usually not null when a CPU becomes idle
> > and you can have to wait  more than 300ms before it becomes Null
> > At the opposite, the utilization of a CPU can be null but a task with
> > null utilization has just woken up on it.
> > Utilization is used to reflect the average work of the CPU or group of
> > CPUs but not the current state
>
> If you want immediate idle, sgs->nr_running == 0 or sgs->idle_cpus ==
> sgs->group_weight come to mind.

yes, I have the same in mind
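
As a small sketch of what that check could look like, using the field
names from struct sg_lb_stats (illustrative only, not the wording of the
eventual patch):

static inline bool sketch_local_group_is_idle(struct sd_lb_stats *sds)
{
	struct sg_lb_stats *local = &sds->local_stat;

	/*
	 * Instantaneous idle state, unlike group_util which decays slowly:
	 * either nothing is running, or every CPU in the group reports idle.
	 */
	return local->sum_nr_running == 0 ||
	       local->idle_cpus == local->group_weight;
}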

>
> > > +  sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> > > +   }
> > > +
> > > +   /* @dst_cpu has SMT siblings. */
> > > +
> > > +   if (sg_is_smt) {
> > > +   int local_busy_cpus = sds->local->group_weight -
> > > + sds->local_stat.idle_cpus;
> > > +   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> > > +
> > > +   /* Local can always help to even the number busy CPUs. */
> >
> > default behavior of the load balance already tries to even the number
> > of idle CPUs.
>
> Right, but I suppose this is because we're trapped here and have to deal
> with the SMT-SMT case too. Ricardo, can you clarify?

IIUC, this function is used in update_sg_lb_stats() to set
sgs->group_asym_packing, which is then used to set the group state to
group_asym_packing and force asym migration.
But if we only want to even the number of busy CPUs between the groups,
we should not n

Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-08-28 Thread Vincent Guittot
On Fri, 27 Aug 2021 at 21:45, Ricardo Neri
 wrote:
>
> On Fri, Aug 27, 2021 at 12:13:42PM +0200, Vincent Guittot wrote:
> > On Tue, 10 Aug 2021 at 16:41, Ricardo Neri
> >  wrote:
> > > @@ -9540,6 +9629,12 @@ static struct rq *find_busiest_queue(struct lb_env 
> > > *env,
> > > nr_running == 1)
> > > continue;
> > >
> > > +   /* Make sure we only pull tasks from a CPU of lower 
> > > priority */
> > > +   if ((env->sd->flags & SD_ASYM_PACKING) &&
> > > +   sched_asym_prefer(i, env->dst_cpu) &&
> > > +   nr_running == 1)
> > > +   continue;
> >
> > This really looks similar to the test above for SD_ASYM_CPUCAPACITY.
> > More generally speaking SD_ASYM_PACKING and SD_ASYM_CPUCAPACITY share
> > a lot of common policy and I wonder if at some point we could not
> > merge their behavior in LB
>
> I would like to confirm with you that you are not expecting this merge
> as part of this series, right?

Merging them will probably need more tests on both x86 and Arm so I
suppose that we could keep them separate for now

Regards,
Vincent

>
> Thanks and BR,
> Ricardo


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-15 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Signed-off-by: Ricardo Neri 
> ---
> Changes since v4:
>   * Use sg_lb_stats::sum_nr_running to check the idle state of a scheduling group.
> (Vincent, Peter)
>   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
>   * Updated function documentation and corrected a typo.
>
> Changes since v3:
>   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> powerpc folks showed that this patch should not impact them. Also, more
> recent powerpc processor no longer use asym_packing. (PeterZ)
>   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
>   * Removed unnecessary check for local CPUs when the local group has zero
> utilization. (Joel)
>   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> the fact that it deals with SMT cases.
>   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> that callers can deal with non-SMT cases.
>
> Changes since v2:
>   * Reworded the commit message to reflect updates in code.
>   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
>   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
>   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
>   * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
>  kernel/sched/fair.c | 94 +
>  1 file changed, 94 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26db017c14a3..8d763dd0174b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> tasks
> + * @dst_cpu:   Destination CPU of the load balancing
> + * @sds:   Load-balancing data with statistics of the local group
> + * @sgs:   Load-balancing statistics of the candidate busiest group
> + * @sg:The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks.
> + *
> + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more 
> of
> + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull 
> tasks
> + * only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
> + * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher 
> priority.
> + * Bigger imbalances in the number of busy CPUs will be dealt with in
> + * update_sd_pick_busiest().
> + *
> + * If @sg does not have SMT siblings, only pull tasks if all of the SMT 
> siblings
> + * of @dst_cpu are idle and @sg has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +   struct sg_lb_stats *sgs,
> +   struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +   bool local_is_smt, sg_is_smt;
> +   int sg_busy_cpus;
> +
> +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +   if (!local_is_smt) {
> +   /*
> +* If we are here, @dst_cpu is idle and does not have SMT
> +* siblings. Pull tasks if candidate group has two or more
> +* busy CPUs.
> +*/
> +   if (sg_is_smt && sg_busy_cpus >= 2)

Do you really need to test sg_is_smt? If sg_busy_cpus >= 2 then
sg_is_smt must be true?

Also, this is the default behavior, where we want to even the number of
busy CPUs. Shouldn't you return false and fall back to the default
behavior?

That being said, the default behavio

Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-17 Thread Vincent Guittot
On Fri, 17 Sept 2021 at 03:01, Ricardo Neri
 wrote:
>
> On Wed, Sep 15, 2021 at 05:43:44PM +0200, Vincent Guittot wrote:
> > On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
> >  wrote:
> > >
> > > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> > > check for the idle state of the destination CPU, dst_cpu, but also of
> > > its SMT siblings.
> > >
> > > If dst_cpu is idle but its SMT siblings are busy, performance suffers
> > > if it pulls tasks from a medium priority CPU that does not have SMT
> > > siblings.
> > >
> > > Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> > > siblings of both dst_cpu and the CPUs in the candidate busiest group.
> > >
> > > Cc: Aubrey Li 
> > > Cc: Ben Segall 
> > > Cc: Daniel Bristot de Oliveira 
> > > Cc: Dietmar Eggemann 
> > > Cc: Mel Gorman 
> > > Cc: Quentin Perret 
> > > Cc: Rafael J. Wysocki 
> > > Cc: Srinivas Pandruvada 
> > > Cc: Steven Rostedt 
> > > Cc: Tim Chen 
> > > Reviewed-by: Joel Fernandes (Google) 
> > > Reviewed-by: Len Brown 
> > > Signed-off-by: Ricardo Neri 
> > > ---
> > > Changes since v4:
> > >   * Use sg_lb_stats::sum_nr_running to check the idle state of a scheduling group.
> > > (Vincent, Peter)
> > >   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
> > >   * Updated function documentation and corrected a typo.
> > >
> > > Changes since v3:
> > >   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> > > powerpc folks showed that this patch should not impact them. Also, 
> > > more
> > > recent powerpc processor no longer use asym_packing. (PeterZ)
> > >   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
> > >   * Removed unnecessary check for local CPUs when the local group has zero
> > > utilization. (Joel)
> > >   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> > > the fact that it deals with SMT cases.
> > >   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> > > that callers can deal with non-SMT cases.
> > >
> > > Changes since v2:
> > >   * Reworded the commit message to reflect updates in code.
> > >   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> > > balancing. (PeterZ)
> > >   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> > > sched_asym().
> > >
> > > Changes since v1:
> > >   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> > > tasks. Instead, reclassify the candidate busiest group, as it
> > > may still be selected. (PeterZ)
> > >   * Avoid an expensive and unnecessary call to cpumask_weight() when
> > > determining if a sched_group is comprised of SMT siblings.
> > > (PeterZ).
> > > ---
> > >  kernel/sched/fair.c | 94 +
> > >  1 file changed, 94 insertions(+)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 26db017c14a3..8d763dd0174b 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int 
> > > imbalance_pct,
> > > return group_has_spare;
> > >  }
> > >
> > > +/**
> > > + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can 
> > > pull tasks
> > > + * @dst_cpu:   Destination CPU of the load balancing
> > > + * @sds:   Load-balancing data with statistics of the local group
> > > + * @sgs:   Load-balancing statistics of the candidate busiest group
> > > + * @sg:The candidate busiest group
> > > + *
> > > + * Check the state of the SMT siblings of both @sds::local and @sg and 
> > > decide
> > > + * if @dst_cpu can pull tasks.
> > > + *
> > > + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or 
> > > more of
> > > + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, 
> > > pull tasks
> > > + * only if @dst_cpu has higher priority.
> > > + *
> > > + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one 
> > > more
> > > + * busy CPU than @sds::local, let @dst_cpu pull tasks if

Re: [PATCH v5 2/6] sched/topology: Introduce sched_group::flags

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> There exist situations in which the load balance needs to know the
> properties of the CPUs in a scheduling group. When using asymmetric
> packing, for instance, the load balancer needs to know not only the
> state of dst_cpu but also of its SMT siblings, if any.
>
> Use the flags of the child scheduling domains to initialize scheduling
> group flags. This will reflect the properties of the CPUs in the
> group.
>
> A subsequent changeset will make use of these new flags. No functional
> changes are introduced.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Originally-by: Peter Zijlstra (Intel) 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * Clear the flags of the scheduling groups of a domain if its child is
> destroyed.
>   * Minor rewording of the commit message.
>
> Changes since v2:
>   * Introduced this patch.
>
> Changes since v1:
>   * N/A
> ---
>  kernel/sched/sched.h|  1 +
>  kernel/sched/topology.c | 21 ++---
>  2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3d3e5793e117..86ab33ce529d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1809,6 +1809,7 @@ struct sched_group {
> unsigned intgroup_weight;
> struct sched_group_capacity *sgc;
> int asym_prefer_cpu;/* CPU of highest 
> priority in group */
> +   int flags;
>
> /*
>  * The CPUs this group covers.
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 4e8698e62f07..c56faae461d9 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -716,8 +716,20 @@ cpu_attach_domain(struct sched_domain *sd, struct 
> root_domain *rd, int cpu)
> tmp = sd;
> sd = sd->parent;
> destroy_sched_domain(tmp);
> -   if (sd)
> +   if (sd) {
> +   struct sched_group *sg = sd->groups;
> +
> +   /*
> +* sched groups hold the flags of the child sched
> +* domain for convenience. Clear such flags since
> +* the child is being destroyed.
> +*/
> +   do {
> +   sg->flags = 0;
> +   } while (sg != sd->groups);
> +
> sd->child = NULL;
> +   }
> }
>
> for (tmp = sd; tmp; tmp = tmp->parent)
> @@ -916,10 +928,12 @@ build_group_from_child_sched_domain(struct sched_domain 
> *sd, int cpu)
> return NULL;
>
> sg_span = sched_group_span(sg);
> -   if (sd->child)
> +   if (sd->child) {
> cpumask_copy(sg_span, sched_domain_span(sd->child));
> -   else
> +   sg->flags = sd->child->flags;
> +   } else {
> cpumask_copy(sg_span, sched_domain_span(sd));
> +   }
>
> atomic_inc(&sg->ref);
> return sg;
> @@ -1169,6 +1183,7 @@ static struct sched_group *get_group(int cpu, struct 
> sd_data *sdd)
> if (child) {
> cpumask_copy(sched_group_span(sg), sched_domain_span(child));
> cpumask_copy(group_balance_mask(sg), sched_group_span(sg));
> +   sg->flags = child->flags;
> } else {
> cpumask_set_cpu(cpu, sched_group_span(sg));
> cpumask_set_cpu(cpu, group_balance_mask(sg));
> --
> 2.17.1
>
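
For context, these group flags are what the SMT-aware checks elsewhere in
this thread rely on. A minimal illustration (not part of the patch, helper
name made up) of the kind of test they enable:

/*
 * Illustration only: with sched_group::flags seeded from the child
 * domain, the load balancer can ask whether a group consists of SMT
 * siblings without walking any cpumask.
 */
static inline bool sketch_group_is_smt(struct sched_group *sg)
{
	return sg->flags & SD_SHARE_CPUCAPACITY;
}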


Re: [PATCH v5 3/6] sched/fair: Optimize checking for group_asym_packing

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> sched_asym_prefer() always returns false when called on the local group. By
> checking local_group, we can avoid additional checks and invoking
> sched_asym_prefer() when it is not needed. No functional changes are
> introduced.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * Further rewording of the commit message. (Len)
>
> Changes since v2:
>   * Reworded the commit message for clarity. (Peter Z)
>
> Changes since v1:
>   * None
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ff69f245b939..7a054f528bcc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8657,7 +8657,7 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
> }
>
> /* Check if dst CPU is idle and preferred to this group */
> -   if (env->sd->flags & SD_ASYM_PACKING &&
> +   if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
> env->idle != CPU_NOT_IDLE &&
> sgs->sum_h_nr_running &&
> sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> --
> 2.17.1
>


Re: [PATCH v5 4/6] sched/fair: Provide update_sg_lb_stats() with sched domain statistics

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> Before deciding to pull tasks when using asymmetric packing of tasks,
> on some architectures (e.g., x86) it is necessary to know not only the
> state of dst_cpu but also of its SMT siblings. The decision to classify
> a candidate busiest group as group_asym_packing is done in
> update_sg_lb_stats(). Give this function access to the scheduling domain
> statistics, which contains the statistics of the local group.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Originally-by: Peter Zijlstra (Intel) 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * None
>
> Changes since v2:
>   * Introduced this patch.
>
> Changes since v1:
>   * N/A
> ---
>  kernel/sched/fair.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7a054f528bcc..c5851260b4d8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8605,6 +8605,7 @@ group_type group_classify(unsigned int imbalance_pct,
>   * @sg_status: Holds flag indicating the status of the sched_group
>   */
>  static inline void update_sg_lb_stats(struct lb_env *env,
> + struct sd_lb_stats *sds,
>   struct sched_group *group,
>   struct sg_lb_stats *sgs,
>   int *sg_status)
> @@ -8613,7 +8614,7 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
>
> memset(sgs, 0, sizeof(*sgs));
>
> -   local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
> +   local_group = group == sds->local;
>
> for_each_cpu_and(i, sched_group_span(group), env->cpus) {
> struct rq *rq = cpu_rq(i);
> @@ -9176,7 +9177,7 @@ static inline void update_sd_lb_stats(struct lb_env 
> *env, struct sd_lb_stats *sd
> update_group_capacity(env->sd, env->dst_cpu);
> }
>
> -   update_sg_lb_stats(env, sg, sgs, &sg_status);
> +   update_sg_lb_stats(env, sds, sg, sgs, &sg_status);
>
> if (local_group)
> goto next_group;
> --
> 2.17.1
>


Re: [PATCH v5 5/6] sched/fair: Carve out logic to mark a group for asymmetric packing

2021-09-17 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> Create a separate function, sched_asym(). A subsequent changeset will
> introduce logic to deal with SMT in conjunction with asymmetric
> packing. Such logic will need the statistics of the scheduling
> group provided as argument. Update them before calling sched_asym().
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Co-developed-by: Peter Zijlstra (Intel) 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ricardo Neri 

Reviewed-by: Vincent Guittot 

> ---
> Changes since v4:
>   * None
>
> Changes since v3:
>   * Remove a redundant check for the local group in sched_asym().
> (Dietmar)
>   * Reworded commit message for clarity. (Len)
>
> Changes since v2:
>   * Introduced this patch.
>
> Changes since v1:
>   * N/A
> ---
>  kernel/sched/fair.c | 20 +---
>  1 file changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c5851260b4d8..26db017c14a3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8597,6 +8597,13 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
>  }
>
> +static inline bool
> +sched_asym(struct lb_env *env, struct sd_lb_stats *sds,  struct sg_lb_stats 
> *sgs,
> +  struct sched_group *group)
> +{
> +   return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> +}
> +
>  /**
>   * update_sg_lb_stats - Update sched_group's statistics for load balancing.
>   * @env: The load balancing environment.
> @@ -8657,18 +8664,17 @@ static inline void update_sg_lb_stats(struct lb_env 
> *env,
> }
> }
>
> +   sgs->group_capacity = group->sgc->capacity;
> +
> +   sgs->group_weight = group->group_weight;
> +
> /* Check if dst CPU is idle and preferred to this group */
> if (!local_group && env->sd->flags & SD_ASYM_PACKING &&
> -   env->idle != CPU_NOT_IDLE &&
> -   sgs->sum_h_nr_running &&
> -   sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu)) {
> +   env->idle != CPU_NOT_IDLE && sgs->sum_h_nr_running &&
> +   sched_asym(env, sds, sgs, group)) {
> sgs->group_asym_packing = 1;
> }
>
> -   sgs->group_capacity = group->sgc->capacity;
> -
> -   sgs->group_weight = group->group_weight;
> -
> sgs->group_type = group_classify(env->sd->imbalance_pct, group, sgs);
>
> /* Computing avg_load makes sense only when group is overloaded */
> --
> 2.17.1
>


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-18 Thread Vincent Guittot
On Fri, 17 Sept 2021 at 20:47, Peter Zijlstra  wrote:
>
> On Fri, Sep 17, 2021 at 05:25:02PM +0200, Vincent Guittot wrote:
>
> > With the removal of the condition !sds->local_stat.sum_nr_running
> > which seems useless because dst_cpu is idle and not SMT, this patch
> > looks good to me
>
> I've made it look like this. Thanks!

Reviewed-by: Vincent Guittot 

Thanks

>
> ---
> Subject: sched/fair: Consider SMT in ASYM_PACKING load balance
> From: Ricardo Neri 
> Date: Fri, 10 Sep 2021 18:18:19 -0700
>
> From: Ricardo Neri 
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Signed-off-by: Ricardo Neri 
> Signed-off-by: Peter Zijlstra (Intel) 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Link: 
> https://lkml.kernel.org/r/20210911011819.12184-7-ricardo.neri-calde...@linux.intel.com
> ---
>  kernel/sched/fair.c |   92 
> 
>  1 file changed, 92 insertions(+)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8538,10 +8538,96 @@ group_type group_classify(unsigned int i
> return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> tasks
> + * @dst_cpu:   Destination CPU of the load balancing
> + * @sds:   Load-balancing data with statistics of the local group
> + * @sgs:   Load-balancing statistics of the candidate busiest group
> + * @sg:The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks.
> + *
> + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more 
> of
> + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull 
> tasks
> + * only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
> + * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher 
> priority.
> + * Bigger imbalances in the number of busy CPUs will be dealt with in
> + * update_sd_pick_busiest().
> + *
> + * If @sg does not have SMT siblings, only pull tasks if all of the SMT 
> siblings
> + * of @dst_cpu are idle and @sg has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +   struct sg_lb_stats *sgs,
> +   struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +   bool local_is_smt, sg_is_smt;
> +   int sg_busy_cpus;
> +
> +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +   if (!local_is_smt) {
> +   /*
> +* If we are here, @dst_cpu is idle and does not have SMT
> +* siblings. Pull tasks if candidate group has two or more
> +* busy CPUs.
> +*/
> +   if (sg_busy_cpus >= 2) /* implies sg_is_smt */
> +   return true;
> +
> +   /*
> +* @dst_cpu does not have SMT siblings. @sg may have SMT
> +* siblings and only one is busy. In such case, @dst_cpu
> +* can help if it has higher priority and is idle (i.e.,
> +* it has no running tasks).
> +*/
> +   return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> +   }
> +
> +   /* @dst_cpu has SMT siblings. */
> +
> +   if (sg_is_smt) {
> +   int local_busy_cpus = sds->local->group_weight -
> + sds->local_stat.idle_cpus;
> +   int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> +
> +   if (busy_cpus_delta == 1)
> +   return sched_asym_prefer(dst_cpu, 
> sg->asym_prefer_cpu);
> +
> +   return false;
> +   }
> +
> +   /*
> +* @sg does not have SMT siblings. Ensure that @sds::local does not 
> end
> +* up with more than one busy SMT sibling and only pull tasks if there
&

Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-10-01 Thread Vincent Guittot
Hi Guillaume,

This patch, and the patchset which includes it, only impacts systems
with hyperthreading, which is not the case for rk3328-rock64 AFAICT. So
there is no reason for this code to be used by the board. The only
impact should be an increase in the binary size for this platform.
Could it be that it has reached a limit?

Regards,
Vincent

On Fri, 1 Oct 2021 at 11:33, Guillaume Tucker
 wrote:
>
> Hi Ricardo,
>
> Please see the bisection report below about a boot failure on
> rk3328-rock64.
>
> Reports aren't automatically sent to the public while we're
> trialing new bisection features on kernelci.org but this one
> looks valid.
>
> Some more details can be found here:
>
>   https://linux.kernelci.org/test/case/id/6155a4b0836c79a98f99a31d/
>
> It looks like some i.MX arm64 platforms also regressed with
> next-20210920 although it hasn't been verified yet whether that's
> due to the same commit:
>
>   
> https://linux.kernelci.org/test/job/next/branch/master/kernel/next-20210930/plan/baseline/
>
> The x86 platforms don't seem to be impacted though.
>
> Please let us know if you need help debugging the issue or if you
> have a fix to try.
>
> Best wishes,
> Guillaume
>
>
> GitHub: https://github.com/kernelci/kernelci-project/issues/65
>
> ---
>
>
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> * This automated bisection report was sent to you on the basis  *
> * that you may be involved with the breaking commit it has  *
> * found.  No manual investigation has been done to verify it,   *
> * and the root cause of the problem may be somewhere else.  *
> *   *
> * If you do send a fix, please include this trailer:*
> *   Reported-by: "kernelci.org bot"   *
> *   *
> * Hope this helps!  *
> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
>
> next/master bisection: baseline.login on rk3328-rock64
>
> Summary:
>   Start:  2d02a18f75fc Add linux-next specific files for 20210929
>   Plain log:  
> https://storage.kernelci.org/next/master/next-20210929/arm64/defconfig+CONFIG_RANDOMIZE_BASE=y/gcc-8/lab-baylibre/baseline-rk3328-rock64.txt
>   HTML log:   
> https://storage.kernelci.org/next/master/next-20210929/arm64/defconfig+CONFIG_RANDOMIZE_BASE=y/gcc-8/lab-baylibre/baseline-rk3328-rock64.html
>   Result: eac6f3841f1d sched/fair: Consider SMT in ASYM_PACKING load 
> balance
>
> Checks:
>   revert: PASS
>   verify: PASS
>
> Parameters:
>   Tree:   next
>   URL:
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
>   Branch: master
>   Target: rk3328-rock64
>   CPU arch:   arm64
>   Lab:lab-baylibre
>   Compiler:   gcc-8
>   Config: defconfig+CONFIG_RANDOMIZE_BASE=y
>   Test case:  baseline.login
>
> Breaking commit found:
>
> ---
> commit eac6f3841f1dac7b6f43002056b63f44cc1f1543
> Author: Ricardo Neri 
> Date:   Fri Sep 10 18:18:19 2021 -0700
>
> sched/fair: Consider SMT in ASYM_PACKING load balance
>
>
> On 11/09/2021 03:18, Ricardo Neri wrote:
> > When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> > check for the idle state of the destination CPU, dst_cpu, but also of
> > its SMT siblings.
> >
> > If dst_cpu is idle but its SMT siblings are busy, performance suffers
> > if it pulls tasks from a medium priority CPU that does not have SMT
> > siblings.
> >
> > Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> > siblings of both dst_cpu and the CPUs in the candidate busiest group.
> >
> > Cc: Aubrey Li 
> > Cc: Ben Segall 
> > Cc: Daniel Bristot de Oliveira 
> > Cc: Dietmar Eggemann 
> > Cc: Mel Gorman 
> > Cc: Quentin Perret 
> > Cc: Rafael J. Wysocki 
> > Cc: Srinivas Pandruvada 
> > Cc: Steven Rostedt 
> > Cc: Tim Chen 
> > Reviewed-by: Joel Fernandes (Google) 
> > Reviewed-by: Len Brown 
> > Signed-off-by: Ricardo Neri 
> > ---
> > Changes since v4:
> >   * Use sg_lb_stats::sum_nr_running to check the idle state of a scheduling group.
> > (Vincent, Peter)
> >   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
> >   * Updated function documentation and corrected a typo.
> >
> > Changes since v3:
> >   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> > powerpc folks showed that this patch should not impact them. Also, more
> > recent powerpc processor no longer use asym_packing. (PeterZ)
> >   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
> >   * Removed unnecessary check for local CPUs when the local group has zero
> > utilization. (Joel)
> >   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> > the fact that it deals with 

Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests

2021-06-21 Thread Vincent Guittot
On Mon, 21 Jun 2021 at 11:39, Odin Ugedal  wrote:
>
> On Mon, 21 Jun 2021 at 08:33, Sachin Sant wrote:
> >
> > While running LTP tests (cfs_bandwidth01) against 5.13.0-rc7 kernel on a 
> > powerpc box
> > following warning is seen
> >
> > [ 6611.331827] [ cut here ]
> > [ 6611.331855] rq->tmp_alone_branch != &rq->leaf_cfs_rq_list
> > [ 6611.331862] WARNING: CPU: 8 PID: 0 at kernel/sched/fair.c:401 
> > unthrottle_cfs_rq+0x4cc/0x590
> > [ 6611.331883] Modules linked in: nfsv3 nfs_acl nfs lockd grace fscache 
> > netfs tun brd overlay vfat fat btrfs blake2b_generic xor zstd_compress 
> > raid6_pq xfs loop sctp ip6_udp_tunnel udp_tunnel libcrc32c dm_mod bonding 
> > rfkill sunrpc pseries_rng xts vmx_crypto sch_fq_codel ip_tables ext4 
> > mbcache jbd2 sd_mod t10_pi sg ibmvscsi ibmveth scsi_transport_srp fuse 
> > [last unloaded: init_module]
> > [ 6611.331957] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G   OE 
> > 5.13.0-rc6-gcba5e97280f5 #1
> > [ 6611.331968] NIP:  c01b7aac LR: c01b7aa8 CTR: 
> > c0722d30
> > [ 6611.331976] REGS: c274f3a0 TRAP: 0700   Tainted: G   OE  
> > (5.13.0-rc6-gcba5e97280f5)
> > [ 6611.331985] MSR:  80029033   CR: 48000224  
> > XER: 0005
> > [ 6611.332002] CFAR: c014ca20 IRQMASK: 1
> > [ 6611.332002] GPR00: c01b7aa8 c274f640 c1abaf00 
> > 002d
> > [ 6611.332002] GPR04: 7fff c274f300 0027 
> > c00efdb07e08
> > [ 6611.332002] GPR08: 0023 0001 0027 
> > c1976680
> > [ 6611.332002] GPR12:  c00effc0be80 c00ef07b3f90 
> > 1eefe200
> > [ 6611.332002] GPR16:    
> > 
> > [ 6611.332002] GPR20: 0001 c0fa6c08 c0fa6030 
> > 0001
> > [ 6611.332002] GPR24:   c00efde12380 
> > 0001
> > [ 6611.332002] GPR28: 0001  c00efde12400 
> > 
> > [ 6611.332094] NIP [c01b7aac] unthrottle_cfs_rq+0x4cc/0x590
> > [ 6611.332104] LR [c01b7aa8] unthrottle_cfs_rq+0x4c8/0x590
> > [ 6611.332113] Call Trace:
> > [ 6611.332116] [c274f640] [c01b7aa8] 
> > unthrottle_cfs_rq+0x4c8/0x590 (unreliable)
> > [ 6611.332128] [c274f6e0] [c01b7e38] 
> > distribute_cfs_runtime+0x1d8/0x280
> > [ 6611.332139] [c274f7b0] [c01b81d0] 
> > sched_cfs_period_timer+0x140/0x330
> > [ 6611.332149] [c274f870] [c022a03c] 
> > __hrtimer_run_queues+0x17c/0x380
> > [ 6611.332158] [c274f8f0] [c022ac68] 
> > hrtimer_interrupt+0x128/0x2f0
> > [ 6611.332168] [c274f9a0] [c002940c] 
> > timer_interrupt+0x13c/0x370
> > [ 6611.332179] [c274fa00] [c0009c04] 
> > decrementer_common_virt+0x1a4/0x1b0
> > [ 6611.332189] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x24
> > [ 6611.332199] NIP:  c00f6af8 LR: c0a05f68 CTR: 
> > 
> > [ 6611.332206] REGS: c274fa70 TRAP: 0900   Tainted: G   OE  
> > (5.13.0-rc6-gcba5e97280f5)
> > [ 6611.332214] MSR:  8280b033   
> > CR: 28000224  XER: 
> > [ 6611.332234] CFAR: 0c00 IRQMASK: 0
> > [ 6611.332234] GPR00:  c274fd10 c1abaf00 
> > 
> > [ 6611.332234] GPR04: 00c0 0080 0001a91c68b80fa1 
> > 03dc
> > [ 6611.332234] GPR08: 0001f400 0001  
> > 
> > [ 6611.332234] GPR12:  c00effc0be80 c00ef07b3f90 
> > 1eefe200
> > [ 6611.332234] GPR16:    
> > 
> > [ 6611.332234] GPR20: 0001 0002 0010 
> > c19fe2f8
> > [ 6611.332234] GPR24: 0001 0603517d757e  
> > 
> > [ 6611.332234] GPR28: 0001  c1231f90 
> > c1231f98
> > [ 6611.332323] NIP [c00f6af8] plpar_hcall_norets_notrace+0x18/0x24
> > [ 6611.332332] LR [c0a05f68] check_and_cede_processor+0x48/0x60
> > [ 6611.332340] --- interrupt: 900
> > [ 6611.332345] [c274fd10] [c00efdb92380] 0xc00efdb92380 
> > (unreliable)
> > [ 6611.332355] [c274fd70] [c0a063bc] 
> > dedicated_cede_loop+0x9c/0x1b0
> > [ 6611.332364] [c274fdc0] [c0a02b04] 
> > cpuidle_enter_state+0x2e4/0x4e0
> > [ 6611.332375] [c274fe20] [c0a02da0] cpuidle_enter+0x50/0x70
> > [ 6611.332385] [c274fe60] [c01a883c] call_cpuidle+0x4c/0x80
> > [ 6611.332393] [c274fe80] [c01a8ee0] do_idle+0x380/0x3e0
> > [ 6611.332402] [c274ff00] [c01a91bc] 
> > cpu_startup_entry+0x3c/0x40
> > [ 6611.332411] [c2

Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests

2021-06-21 Thread Vincent Guittot
On Monday 21 June 2021 at 14:42:23 (+0200), Odin Ugedal wrote:
> Hi,
> 
> Did some more research, and it looks like this is what happens:
> 
> $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
> /sys/fs/cgroup/ltp/
> |-- drain
> `-- test-6851
> `-- level2
> |-- level3a
> |   |-- worker1
> |   `-- worker2
> `-- level3b
> `-- worker3
> 
> Timeline (ish):
> - worker3 gets throttled
> - level3b is decayed, since it has no more load
> - level2 get throttled
> - worker3 get unthrottled
> - level2 get unthrottled
>   - worker3 is added to list
>   - level3b is not added to list, since nr_running==0 and is decayed
> 
> 
> The attached diff (based on
> https://lore.kernel.org/lkml/20210518125202.78658-3-o...@uged.al/)
> fixes the issue for me. Not the most elegant solution, but the
> simplest one as of now, and to show what is wrong.
> 
> Any thoughts Vincent?


I would prefer that we use the reason for adding the cfs_rq to the list instead.

Something like the below should also fix the problem. It is based on a
proposal I made to Rik some time ago when he tried to flatten the rq:
https://lore.kernel.org/lkml/20190906191237.27006-6-r...@surriel.com/

This will ensure that a cfs_rq is added to the list whenever one of its
children is still in the list.

---
 kernel/sched/fair.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea7de54cb022..e751061a9449 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3272,6 +3272,31 @@ static inline void cfs_rq_util_change(struct cfs_rq 
*cfs_rq, int flags)

 #ifdef CONFIG_SMP
 #ifdef CONFIG_FAIR_GROUP_SCHED
+/*
+ * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
+ * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
+ * bottom-up, we only have to test whether the cfs_rq before us on the list
+ * is our child.
+ * If cfs_rq is not on the list, test whether a child needs it to be added to
+ * connect a branch to the tree (see list_add_leaf_cfs_rq() for details).
+ */
+static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
+{
+   struct cfs_rq *prev_cfs_rq;
+   struct list_head *prev;
+
+   if (cfs_rq->on_list) {
+   prev = cfs_rq->leaf_cfs_rq_list.prev;
+   } else {
+   struct rq *rq = rq_of(cfs_rq);
+
+   prev = rq->tmp_alone_branch;
+   }
+
+   prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
+
+   return (prev_cfs_rq->tg->parent == cfs_rq->tg);
+}

 static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 {
@@ -3287,6 +3312,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq 
*cfs_rq)
if (cfs_rq->avg.runnable_sum)
return false;

+   if (child_cfs_rq_on_list(cfs_rq))
+   return false;
+
return true;
 }

--
2.17.1



> 
> Thanks
> Odin
> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..aa32e9c29efd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct
> cfs_rq *cfs_rq)
> return false;
>  }
> 
> -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> +/* Returns 1 if cfs_rq was present in the list and removed */
> +static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  {
> if (cfs_rq->on_list) {
> struct rq *rq = rq_of(cfs_rq);
> @@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct
> cfs_rq *cfs_rq)
> 
> list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> cfs_rq->on_list = 0;
> +   return 1;
> }
> +   return 0;
>  }
> 
>  static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> @@ -3298,24 +3301,6 @@ static inline void cfs_rq_util_change(struct
> cfs_rq *cfs_rq, int flags)
> 
>  #ifdef CONFIG_SMP
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> -
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> -   if (cfs_rq->load.weight)
> -   return false;
> -
> -   if (cfs_rq->avg.load_sum)
> -   return false;
> -
> -   if (cfs_rq->avg.util_sum)
> -   return false;
> -
> -   if (cfs_rq->avg.runnable_sum)
> -   return false;
> -
> -   return true;
> -}
> -
>  /**
>   * update_tg_load_avg - update the tg's load avg
>   * @cfs_rq: the cfs_rq whose avg changed
> @@ -4109,11 +4094,6 @@ static inline void update_misfit_status(struct
> task_struct *p, struct rq *rq)
> 
>  #else /* CONFIG_SMP */
> 
> -static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> -{
> -   return true;
> -}
> -
>  #define UPDATE_TG  0x0
>  #define SKIP_AGE_LOAD  0x0
>  #define DO_ATTACH  0x0
> @@ -4771,10 +4751,11 @@ static int tg_unthrottle_up(struct task_group
> *tg, void *data)
> if (!cfs_rq->throttle_count) {
> cfs_rq->throttled_clock_task_time += rq_clock_task(r

Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests

2021-06-21 Thread Vincent Guittot
On Mon, 21 Jun 2021 at 18:45, Odin Ugedal  wrote:
>
> On Mon, 21 Jun 2021 at 18:22, Vincent Guittot wrote:
> > I would prefer that we use the reason of adding the cfs in the list instead.
> >
> > Something like the below should also fixed the problem. It is based on a
> > proposal I made to Rik sometimes ago when he tried to flatten the rq:
> > https://lore.kernel.org/lkml/20190906191237.27006-6-r...@surriel.com/
> >
> > This will ensure that a cfs is added in the list whenever one of its  child
> > is still in the list.
>
> Oh, yeah, that is a much more elegant solution! It fixes the issue as well!
>
> Feel free to add this when/if you submit it as a patch:
> Acked-by: Odin Ugedal 

Thanks

>
> Odin


Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests

2021-06-21 Thread Vincent Guittot
Hi Sachin,

On Mon, 21 Jun 2021 at 18:22, Vincent Guittot
 wrote:
>
> On Monday 21 June 2021 at 14:42:23 (+0200), Odin Ugedal wrote:
> > Hi,
> >
> > Did some more research, and it looks like this is what happens:
> >
> > $ tree /sys/fs/cgroup/ltp/ -d --charset=ascii
> > /sys/fs/cgroup/ltp/
> > |-- drain
> > `-- test-6851
> > `-- level2
> > |-- level3a
> > |   |-- worker1
> > |   `-- worker2
> > `-- level3b
> > `-- worker3
> >
> > Timeline (ish):
> > - worker3 gets throttled
> > - level3b is decayed, since it has no more load
> > - level2 get throttled
> > - worker3 get unthrottled
> > - level2 get unthrottled
> >   - worker3 is added to list
> >   - level3b is not added to list, since nr_running==0 and is decayed
> >
> >
> > The attached diff (based on
> > https://lore.kernel.org/lkml/20210518125202.78658-3-o...@uged.al/)
> > fixes the issue for me. Not the most elegant solution, but the
> > simplest one as of now, and to show what is wrong.
> >
> > Any thoughts Vincent?
>
>
> I would prefer that we use the reason of adding the cfs in the list instead.
>
> Something like the below should also fixed the problem. It is based on a
> proposal I made to Rik sometimes ago when he tried to flatten the rq:
> https://lore.kernel.org/lkml/20190906191237.27006-6-r...@surriel.com/
>
> This will ensure that a cfs is added in the list whenever one of its  child
> is still in the list.

Could you confirm that this patch fixes the problem for you too ?

>
> ---
>  kernel/sched/fair.c | 28 
>  1 file changed, 28 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ea7de54cb022..e751061a9449 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3272,6 +3272,31 @@ static inline void cfs_rq_util_change(struct cfs_rq 
> *cfs_rq, int flags)
>
>  #ifdef CONFIG_SMP
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> +/*
> + * Because list_add_leaf_cfs_rq always places a child cfs_rq on the list
> + * immediately before a parent cfs_rq, and cfs_rqs are removed from the list
> + * bottom-up, we only have to test whether the cfs_rq before us on the list
> + * is our child.
> + * If cfs_rq is not on the list, test whether a child needs it to be added to
> + * connect a branch to the tree (see list_add_leaf_cfs_rq() for details).
> + */
> +static inline bool child_cfs_rq_on_list(struct cfs_rq *cfs_rq)
> +{
> +   struct cfs_rq *prev_cfs_rq;
> +   struct list_head *prev;
> +
> +   if (cfs_rq->on_list) {
> +   prev = cfs_rq->leaf_cfs_rq_list.prev;
> +   } else {
> +   struct rq *rq = rq_of(cfs_rq);
> +
> +   prev = rq->tmp_alone_branch;
> +   }
> +
> +   prev_cfs_rq = container_of(prev, struct cfs_rq, leaf_cfs_rq_list);
> +
> +   return (prev_cfs_rq->tg->parent == cfs_rq->tg);
> +}
>
>  static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  {
> @@ -3287,6 +3312,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq 
> *cfs_rq)
> if (cfs_rq->avg.runnable_sum)
> return false;
>
> +   if (child_cfs_rq_on_list(cfs_rq))
> +   return false;
> +
> return true;
>  }
>
> --
> 2.17.1
>
>
>
> >
> > Thanks
> > Odin
> >
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bfaa6e1f6067..aa32e9c29efd 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -376,7 +376,8 @@ static inline bool list_add_leaf_cfs_rq(struct
> > cfs_rq *cfs_rq)
> > return false;
> >  }
> >
> > -static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > +/* Returns 1 if cfs_rq was present in the list and removed */
> > +static inline bool list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> >  {
> > if (cfs_rq->on_list) {
> > struct rq *rq = rq_of(cfs_rq);
> > @@ -393,7 +394,9 @@ static inline void list_del_leaf_cfs_rq(struct
> > cfs_rq *cfs_rq)
> >
> > list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > cfs_rq->on_list = 0;
> > +   return 1;
> > }
> > +   return 0;
> >  }
> >
> >  static inline void assert_list_leaf_cfs_rq(struct rq *rq)
> > @@ -3298,24 +3301,6 @@ static inline void cfs_rq_util_change(struct
> > cfs_rq *cfs_rq, int flags)
> >
> >  #ifdef CONFIG_SMP
> >  #ifdef CONFIG_FAIR_GROUP_SCHED
&

Re: [powerpc][5.13.0-rc7] Kernel warning (kernel/sched/fair.c:401) while running LTP tests

2021-06-21 Thread Vincent Guittot
On Mon, 21 Jun 2021 at 19:32, Sachin Sant  wrote:
>
> >>> Any thoughts Vincent?
> >>
> >>
> >> I would prefer that we use the reason of adding the cfs in the list 
> >> instead.
> >>
> >> Something like the below should also fixed the problem. It is based on a
> >> proposal I made to Rik sometimes ago when he tried to flatten the rq:
> >> https://lore.kernel.org/lkml/20190906191237.27006-6-r...@surriel.com/
> >>
> >> This will ensure that a cfs_rq is added to the list whenever one of its
> >> children is still on the list.
> >
> > Could you confirm that this patch fixes the problem for you too ?
> >
> Thanks for the fix.
>
> The patch fixes the reported problem. The test ran to completion without
> any failure.
>
> Reported-by: Sachin Sant 
> Tested-by: Sachin Sant 

Thanks

>
> -Sachin
>


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-22 Thread Vincent Guittot
Hi Sachin,

On Tue, 22 Jun 2021 at 09:39, Sachin Sant  wrote:
>
> While booting 5.13.0-rc7-next-20210621 on a PowerVM LPAR following warning
> is seen
>
> [   30.922154] [ cut here ]
> [   30.922201] cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || 
> cfs_rq->avg.runnable_avg
> [   30.922219] WARNING: CPU: 6 PID: 762 at kernel/sched/fair.c:3277 
> update_blocked_averages+0x758/0x780
> [   30.922259] Modules linked in: pseries_rng xts vmx_crypto uio_pdrv_genirq 
> uio sch_fq_codel ip_tables sd_mod t10_pi sg fuse
> [   30.922309] CPU: 6 PID: 762 Comm: augenrules Not tainted 
> 5.13.0-rc7-next-20210621 #1
> [   30.922329] NIP:  c01b27e8 LR: c01b27e4 CTR: 
> c07cfda0
> [   30.922344] REGS: c00023fcb660 TRAP: 0700   Not tainted  
> (5.13.0-rc7-next-20210621)
> [   30.922359] MSR:  80029033   CR: 48488224  
> XER: 0005
> [   30.922394] CFAR: c014d120 IRQMASK: 1
>GPR00: c01b27e4 c00023fcb900 c2a08400 
> 0048
>GPR04: 7fff c00023fcb5c0 0027 
> c00f6fdd7e18
>GPR08: 0023 0001 0027 
> c28a6650
>GPR12: 8000 c00f6fff7680 c00f6fe62600 
> 0032
>GPR16: 0007331a989a c00f6fe62600 c000238a6800 
> 0001
>GPR20:  c2a4dfe0  
> 0006
>GPR24:  c00f6fe63010 0001 
> c00f6fe62680
>GPR28: 0006 c000238a69c0  
> c00f6fe62600
> [   30.922569] NIP [c01b27e8] update_blocked_averages+0x758/0x780
> [   30.922599] LR [c01b27e4] update_blocked_averages+0x754/0x780
> [   30.922624] Call Trace:
> [   30.922631] [c00023fcb900] [c01b27e4] 
> update_blocked_averages+0x754/0x780 (unreliable)
> [   30.922653] [c00023fcba20] [c01bd668] 
> newidle_balance+0x258/0x5c0
> [   30.922674] [c00023fcbab0] [c01bdaac] 
> pick_next_task_fair+0x7c/0x4d0
> [   30.922692] [c00023fcbb10] [c0dcd31c] __schedule+0x15c/0x1780
> [   30.922708] [c00023fcbc50] [c01a5a04] do_task_dead+0x64/0x70
> [   30.922726] [c00023fcbc80] [c0156338] do_exit+0x848/0xcc0
> [   30.922743] [c00023fcbd50] [c0156884] do_group_exit+0x64/0xe0
> [   30.922758] [c00023fcbd90] [c0156924] sys_exit_group+0x24/0x30
> [   30.922774] [c00023fcbdb0] [c00310c0] 
> system_call_exception+0x150/0x2d0
> [   30.922792] [c00023fcbe10] [c000cc5c] 
> system_call_common+0xec/0x278
> [   30.922808] --- interrupt: c00 at 0x7fffb3acddcc
> [   30.922821] NIP:  7fffb3acddcc LR: 7fffb3a27f04 CTR: 
> 
> [   30.922833] REGS: c00023fcbe80 TRAP: 0c00   Not tainted  
> (5.13.0-rc7-next-20210621)
> [   30.922847] MSR:  8280f033   
> CR: 28444202  XER: 
> [   30.922882] IRQMASK: 0
>GPR00: 00ea 7fffc8f21780 7fffb3bf7100 
> 
>GPR04:  000155f142f0  
> 7fffb3d23740
>GPR08: fbad2a87   
> 
>GPR12:  7fffb3d2aeb0 000116be95e0 
> 0032
>GPR16:  7fffc8f21cd8 002d 
> 0024
>GPR20: 7fffc8f21cd4 7fffb3bf4f98 0001 
> 0001
>GPR24: 7fffb3bf0950   
> 0001
>GPR28:   7fffb3d23ec0 
> 
> [   30.923023] NIP [7fffb3acddcc] 0x7fffb3acddcc
> [   30.923035] LR [7fffb3a27f04] 0x7fffb3a27f04
> [   30.923045] --- interrupt: c00
> [   30.923052] Instruction dump:
> [   30.923061] 3863be48 9be97ae6 4bf9a8f9 6000 0fe0 4bfff980 e9210070 
> e8610088
> [   30.923088] 3941 99490003 4bf9a8d9 6000 <0fe0> 4bfffc24 
> 3d22fff5 89297ae3
> [   30.923113] ---[ end trace ed07974d2149c499 ]---
>
> This warning was introduced with commit 9e077b52d86a
> sched/pelt: Check that *_avg are null when *_sum are

Yes. That was exactly the purpose of the patch. There is one last
remaining part which could generate this. I'm going to prepare a patch.
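
For reference, the check that fires sits at the end of cfs_rq_is_decayed() and looks roughly like the sketch below; it is reconstructed from the hunks quoted later in this thread rather than copied verbatim from the tree:

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
	if (cfs_rq->avg.load_sum)
		return false;

	if (cfs_rq->avg.util_sum)
		return false;

	if (cfs_rq->avg.runnable_sum)
		return false;

	/*
	 * _avg must be null when _sum are null because _avg = _sum / divider.
	 * The warning fires when a *_sum has decayed to zero while the
	 * corresponding *_avg is still non-zero.
	 */
	SCHED_WARN_ON(cfs_rq->avg.load_avg ||
		      cfs_rq->avg.util_avg ||
		      cfs_rq->avg.runnable_avg);

	return true;
}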

Thanks

>
> next-20210618 was good.
>
> Thanks
> -Sachin


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-22 Thread Vincent Guittot
On Tuesday 22 June 2021 at 09:49:31 (+0200), Vincent Guittot wrote:
> Hi Sachin,
> 
> On Tue, 22 Jun 2021 at 09:39, Sachin Sant  wrote:
> >
> > While booting 5.13.0-rc7-next-20210621 on a PowerVM LPAR following warning
> > is seen
> >
> > [   30.922154] [ cut here ]
> > [   30.922201] cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || 
> > cfs_rq->avg.runnable_avg
> > [   30.922219] WARNING: CPU: 6 PID: 762 at kernel/sched/fair.c:3277 
> > update_blocked_averages+0x758/0x780
> > [   30.922259] Modules linked in: pseries_rng xts vmx_crypto 
> > uio_pdrv_genirq uio sch_fq_codel ip_tables sd_mod t10_pi sg fuse
> > [   30.922309] CPU: 6 PID: 762 Comm: augenrules Not tainted 
> > 5.13.0-rc7-next-20210621 #1
> > [   30.922329] NIP:  c01b27e8 LR: c01b27e4 CTR: 
> > c07cfda0
> > [   30.922344] REGS: c00023fcb660 TRAP: 0700   Not tainted  
> > (5.13.0-rc7-next-20210621)
> > [   30.922359] MSR:  80029033   CR: 48488224  
> > XER: 0005
> > [   30.922394] CFAR: c014d120 IRQMASK: 1
> >GPR00: c01b27e4 c00023fcb900 c2a08400 
> > 0048
> >GPR04: 7fff c00023fcb5c0 0027 
> > c00f6fdd7e18
> >GPR08: 0023 0001 0027 
> > c28a6650
> >GPR12: 8000 c00f6fff7680 c00f6fe62600 
> > 0032
> >GPR16: 0007331a989a c00f6fe62600 c000238a6800 
> > 0001
> >GPR20:  c2a4dfe0  
> > 0006
> >GPR24:  c00f6fe63010 0001 
> > c00f6fe62680
> >GPR28: 0006 c000238a69c0  
> > c00f6fe62600
> > [   30.922569] NIP [c01b27e8] update_blocked_averages+0x758/0x780
> > [   30.922599] LR [c01b27e4] update_blocked_averages+0x754/0x780
> > [   30.922624] Call Trace:
> > [   30.922631] [c00023fcb900] [c01b27e4] 
> > update_blocked_averages+0x754/0x780 (unreliable)
> > [   30.922653] [c00023fcba20] [c01bd668] 
> > newidle_balance+0x258/0x5c0
> > [   30.922674] [c00023fcbab0] [c01bdaac] 
> > pick_next_task_fair+0x7c/0x4d0
> > [   30.922692] [c00023fcbb10] [c0dcd31c] __schedule+0x15c/0x1780
> > [   30.922708] [c00023fcbc50] [c01a5a04] do_task_dead+0x64/0x70
> > [   30.922726] [c00023fcbc80] [c0156338] do_exit+0x848/0xcc0
> > [   30.922743] [c00023fcbd50] [c0156884] do_group_exit+0x64/0xe0
> > [   30.922758] [c00023fcbd90] [c0156924] 
> > sys_exit_group+0x24/0x30
> > [   30.922774] [c00023fcbdb0] [c00310c0] 
> > system_call_exception+0x150/0x2d0
> > [   30.922792] [c00023fcbe10] [c000cc5c] 
> > system_call_common+0xec/0x278
> > [   30.922808] --- interrupt: c00 at 0x7fffb3acddcc
> > [   30.922821] NIP:  7fffb3acddcc LR: 7fffb3a27f04 CTR: 
> > 
> > [   30.922833] REGS: c00023fcbe80 TRAP: 0c00   Not tainted  
> > (5.13.0-rc7-next-20210621)
> > [   30.922847] MSR:  8280f033   
> > CR: 28444202  XER: 
> > [   30.922882] IRQMASK: 0
> >GPR00: 00ea 7fffc8f21780 7fffb3bf7100 
> > 
> >GPR04:  000155f142f0  
> > 7fffb3d23740
> >GPR08: fbad2a87   
> > 
> >GPR12:  7fffb3d2aeb0 000116be95e0 
> > 0032
> >GPR16:  7fffc8f21cd8 002d 
> > 0024
> >GPR20: 7fffc8f21cd4 7fffb3bf4f98 0001 
> > 0001
> >GPR24: 7fffb3bf0950   
> > 0001
> >GPR28:   7fffb3d23ec0 
> > 
> > [   30.923023] NIP [7fffb3acddcc] 0x7fffb3acddcc
> > [   30.923035] LR [7fffb3a27f04] 0x7fffb3a27f04
> > [   30.923045] --- interrupt: c00
> > [   30.923052] Instruction dump:
> > [   30.923061] 3863be48 9be97ae6 4bf9a8f9 6000 0fe0 4bfff980 
> > e9210070 e8610088
> > [   30.923088] 3941 99490003 4bf9a8d9 6000 <0fe0> 4bfffc24 
> > 3d22fff5 89297ae3
> 

Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
Hi Sachin,

On Tuesday 22 June 2021 at 21:29:36 (+0530), Sachin Sant wrote:
> >> On Tue, 22 Jun 2021 at 09:39, Sachin Sant  
> >> wrote:
> >>> 
> >>> While booting 5.13.0-rc7-next-20210621 on a PowerVM LPAR following warning
> >>> is seen
> >>> 
> >>> [   30.922154] [ cut here ]
> >>> [   30.922201] cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || 
> >>> cfs_rq->avg.runnable_avg
> >>> [   30.922219] WARNING: CPU: 6 PID: 762 at kernel/sched/fair.c:3277 
> >>> update_blocked_averages+0x758/0x780
> >> 
> >> Yes. That was exactly the purpose of the patch. There is one last
> >> remaining part which could generate this. I'm going to prepare a patch
> > 
> > Could you try the patch below ? I have been able to reproduce the problem 
> > locally and this
> > fix it on my system:
> > 
> I can recreate the issue with this patch.

Ok, so your problem seems to be different from my assumption. Could you try
the patch below on top of the previous one ?

This will help us confirm that the problem comes from load_avg, that it is
linked to the cfs_rq load_avg, and that it is not a problem happening earlier
in the PELT update.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da91db1c137f..8a6566f945a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3030,8 +3030,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	u32 divider = get_pelt_divider(&se->avg);
 	cfs_rq->avg.load_avg += se->avg.load_avg;
-	cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
 }
 
 static inline void
@@ -3304,9 +3305,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	 * Make sure that rounding and/or propagation of PELT values never
 	 * break this.
 	 */
-	SCHED_WARN_ON(cfs_rq->avg.load_avg ||
-		      cfs_rq->avg.util_avg ||
-		      cfs_rq->avg.runnable_avg);
+	SCHED_WARN_ON(cfs_rq->avg.load_avg);
+	SCHED_WARN_ON(cfs_rq->avg.util_avg);
+	SCHED_WARN_ON(cfs_rq->avg.runnable_avg);
 
 	return true;
 }


> 
>  Starting Terminate Plymouth Boot Screen...
>  Starting Hold until boot process finishes up...
> [FAILED] Failed to start Crash recovery kernel arming.
> See 'systemctl status kdump.service' for details.
> [   10.737913] [ cut here ]
> [   10.737960] cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || 
> cfs_rq->avg.runnable_avg
> [   10.737976] WARNING: CPU: 27 PID: 146 at kernel/sched/fair.c:3279 
> update_blocked_averages+0x758/0x780
> [   10.738010] Modules linked in: stp llc rfkill sunrpc pseries_rng xts 
> vmx_crypto uio_pdrv_genirq uio sch_fq_codel ip_tables xfs libcrc32c sr_mod 
> sd_mod cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror 
> dm_region_hash dm_log dm_mod fuse
> [   10.738089] CPU: 27 PID: 146 Comm: ksoftirqd/27 Not tainted 
> 5.13.0-rc7-next-20210621-dirty #2
> [   10.738103] NIP:  c01b2768 LR: c01b2764 CTR: 
> c0729120
> [   10.738116] REGS: c00015973840 TRAP: 0700   Not tainted  
> (5.13.0-rc7-next-20210621-dirty)
> [   10.738130] MSR:  8282b033   CR: 
> 48000224  XER: 0005
> [   10.738161] CFAR: c014d120 IRQMASK: 1 
> [   10.738161] GPR00: c01b2764 c00015973ae0 c29bb900 
> 0048 
> [   10.738161] GPR04: fffe c000159737a0 0027 
> c0154f9f7e18 
> [   10.738161] GPR08: 0023 0001 0027 
> c0167f1d7fe8 
> [   10.738161] GPR12:  c0154ffd7e80 c0154fa82580 
> b78a 
> [   10.738161] GPR16: 00028007883c 02ed c00038d31000 
>  
> [   10.738161] GPR20:  c29fdfe0  
> 037b 
> [   10.738161] GPR24:  c0154fa82f90 0001 
> c0003d4ca400 
> [   10.738161] GPR28: 02ed c00038d311c0 c00038d31100 
>  
> [   10.738281] NIP [c01b2768] update_blocked_averages+0x758/0x780
> [   10.738290] LR [c01b2764] update_blocked_averages+0x754/0x780
> [   10.738299] Call Trace:
> [   10.738303] [c00015973ae0] [c01b2764] 
> update_blocked_averages+0x754/0x780 (unreliable)
> [   10.738315] [c00015973c00] [c01be720] 
> run_rebalance_domains+0xa0/0xd0
> [   10.738326] [c00015973c30] [c0cf9acc] __do_softirq+0x15c/0x3d4
> [   10.738337] [c00015973d20] [c0158464] run_ksoftirqd+0x64/0x90
> [   10.738346] [c00015973d40] [c018fd24] 
> smpboot_thread_fn+0x204/0x270
> [   10.738357] [c00015973da0] [c0189770] kthread+0x190/0x1a0
> [   10.738367] [c00015973e10] [c000ceec] 
> ret_from_kernel_thread+0x5c/0x70
> [   10.738381] Instruction dump:
> [   10.73838

Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wednesday 23 June 2021 at 15:52:59 (+0530), Sachin Sant wrote:
> 
> 
> > On 23-Jun-2021, at 1:28 PM, Sachin Sant  wrote:
> > 
> > 
>  Could you try the patch below ? I have been able to reproduce the
>  problem locally and this fixes it on my system:
>  
> >>> I can recreate the issue with this patch.
> >> 
> >> ok, so your problem seem to be different from my assumption. Could you try
> >> the patch below on top of the previous one ?
> >> 
> >> This will help us to confirm that the problem comes from load_avg and that
> >> it's linked to the cfs load_avg and it's not a problem happening earlier in
> >> the update of PELT.
> >> 
> > 
> > Indeed. With both the patches applied I see following warning related to 
> > load_avg
> 
> I left the machine running for sometime. Then attempted a kernel compile.
> I subsequently saw warnings triggered for util_avg as well as runnable_avg
> 
> [ 8371.964935] [ cut here ]
> [ 8371.964958] cfs_rq->avg.util_avg
> [ 8371.964969] WARNING: CPU: 16 PID: 479551 at kernel/sched/fair.c:3283 
> update_blocked_averages+0x700/0x830
> ……..
> ……..
> [ 8664.754506] [ cut here ]
> [ 8664.754569] cfs_rq->avg.runnable_avg
> [ 8664.754583] WARNING: CPU: 23 PID: 125 at kernel/sched/fair.c:3284 
> update_blocked_averages+0x730/0x830
> …….
>

Ok. This becomes even more weird. Could you share your config file and more
details about your setup ?

Have you applied the patch below ? 
https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/

Regarding the load_avg warning, I can see a possible problem during attach.
Could you add the patch below ? The load_avg warning seems to happen during
boot and sched_entity creation.

---
 kernel/sched/fair.c | 7 +++----
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a6566f945a0..5e86139524c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3753,11 +3753,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	se->avg.runnable_sum = se->avg.runnable_avg * divider;
 
-	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
-			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
-	}
+			div_u64(se->avg.load_avg * divider, se_weight(se));
+	} else {
+		se->avg.load_avg = 0;
+	}
 
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
-- 
2.17.1


> > 
> > Starting NTP client/server...
> > Starting VDO volume services...
> > [9.029054] [ cut here ]
> > [9.029084] cfs_rq->avg.load_avg
> > [9.029111] WARNING: CPU: 21 PID: 1169 at kernel/sched/fair.c:3282 
> > update_blocked_averages+0x760/0x830
> > [9.029151] Modules linked in: pseries_rng xts vmx_crypto 
> > uio_pdrv_genirq uio sch_fq_codel ip_tables xfs libcrc32c sr_mod sd_mod 
> > cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror 
> > dm_region_hash dm_log dm_mod fuse
> > [9.029233] CPU: 21 PID: 1169 Comm: grep Not tainted 
> > 5.13.0-rc7-next-20210621-dirty #3
> > [9.029246] NIP:  c01b6150 LR: c01b614c CTR: 
> > c0728f40
> > [9.029259] REGS: ce177650 TRAP: 0700   Not tainted  
> > (5.13.0-rc7-next-20210621-dirty)
> > [9.029271] MSR:  80029033   CR: 48088224  
> > XER: 0005
> > [9.029296] CFAR: c014d120 IRQMASK: 1 
> > [9.029296] GPR00: c01b614c ce1778f0 c29bb900 
> > 0014 
> > [9.029296] GPR04: fffe ce1775b0 0027 
> > c0154f637e18 
> > [9.029296] GPR08: 0023 0001 0027 
> > c0167f1d7fe8 
> > [9.029296] GPR12: 8000 c0154ffe0e80 b820 
> > 00021a2c6864 
> > [9.029296] GPR16: c000482cc000 c0154f6c2580 0001 
> >  
> > [9.029296] GPR20: c291a7f9 c000482cc100  
> > 020d 
> > [9.029296] GPR24:  c0154f6c2f90 0001 
> > c00030b84400 
> > [9.029296] GPR28: 020d c000482cc1c0 0338 
> >  
> > [9.029481] NIP [c01b6150] update_blocked_averages+0x760/0x830
> > [9.029494] LR [c01b614c] update_blocked_averages+0x75c/0x830
> > [9.029508] Call Trace:
> > [9.029515] [ce1778f0] [c01b614c] 
> > update_blocked_averages+0x75c/0x830 (unreliable)
> > [9.029533] [ce177a20] [c01bd388] 
> > newidle_balance+0x258/0x5c0
> > [9.029542] [ce177ab0] [c01bd7cc] 
> > pick_next_task_fair+0x7c/0x4c0
> > [9.029574] [ce177b10] [c0cee3dc] __schedule+0x15c/0x1780
> > [9.029599] [ce177c50] [c01a5984] do_t

Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 14:18, Odin Ugedal  wrote:
>
> Hi,
>
> Wouldn't the attached diff below also help when load is removed,
> Vincent? Isn't there a theoretical chance that x_sum ends up at zero
> while x_load ends up as a positive value (without this patch)? Can
> post as a separate patch if it works for Sachin.

In theory it should not, because _sum should always be larger than or equal
to _avg * divider. Otherwise, it means that we have something wrong somewhere
else.
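
Expressed as code, the invariant under discussion would be something like the sketch below. It assumes the 5.13-era get_pelt_divider(), i.e. LOAD_AVG_MAX - 1024 + period_contrib, and is illustrative only, not actual kernel code:

/* Sketch: _sum must never drop below _avg * divider */
static inline bool pelt_load_consistent(struct sched_avg *sa)
{
	u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;

	return sa->load_sum >= (u64)sa->load_avg * divider;
}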

>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..def48bc2e90b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3688,15 +3688,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> -   sub_positive(&sa->load_sum, r * divider);
> +   sa->load_sum = sa->load_avg * divider;
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> -   sub_positive(&sa->util_sum, r * divider);
> +   sa->util_sum = sa->util_avg * divider;
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> -   sub_positive(&sa->runnable_sum, r * divider);
> +   sa->runnable_sum = sa->runnable_avg * divider;
>
> /*
>  * removed_runnable is the unweighted version of
> removed_load so we


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 14:37, Odin Ugedal  wrote:
>
> On Wed, 23 Jun 2021 at 14:22, Vincent Guittot wrote:
> >
> > In theory it should not because _sum should be always larger or equal
> > to _avg * divider. Otherwise, it means that we have something wrong
> > somewhere else
>
> Yeah, that might be the case. Still trying to wrap my head around
> this. I might be wrong, but isn't there a possibility
> that avg->period_contrib is increasing in PELT's accumulate_sum(),
> without _sum increasing? This makes the pelt divider increase,
> making the statement "_sum should be always larger or equal to _avg *
> divider" false. Or am I missing something obvious here?

The PELT value of the sched_entity is synced with the cfs_rq and its contrib
before being removed.
Then, we start to remove this load in update_cfs_rq_load_avg() before
calling __update_load_avg_cfs_rq(), so contrib should not have changed and
we should be safe.
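
To spell out the sequence, a simplified sketch of the removal path being described (paraphrased, not the exact kernel code):

/*
 * 1) sync_entity_load_avg() brings the se's PELT signal, including its
 *    period_contrib, up to date with the cfs_rq clock,
 * 2) the se's *_avg values are stashed in cfs_rq->removed,
 * 3) update_cfs_rq_load_avg() later folds them out of the cfs_rq, as in
 *    the hunk quoted earlier in the thread.
 */
static void remove_entity_load_avg_sketch(struct sched_entity *se)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(se);
	unsigned long flags;

	sync_entity_load_avg(se);

	raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
	++cfs_rq->removed.nr;
	cfs_rq->removed.util_avg     += se->avg.util_avg;
	cfs_rq->removed.load_avg     += se->avg.load_avg;
	cfs_rq->removed.runnable_avg += se->avg.runnable_avg;
	raw_spin_unlock_irqrestore(&cfs_rq->removed.lock, flags);
}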

>
> Still unable to reproduce what Sachin is reporting tho.
>
> Odin


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 17:13, Odin Ugedal  wrote:
>
> On Wed, 23 Jun 2021 at 15:56, Vincent Guittot wrote:
> >
> >
> > The pelt value of sched_entity is synced with  cfs and its contrib
> > before being removed.
>
>
> Hmm. Not sure what you mean by sched_entity here, since this is only
> taking the "removed" load_avg
> and removing it from cfs_rq, together with (removed.load_avg *
> divider) from load_sum. (Although ".removed" comes from
> a sched entity)

The sched_entity's load_avg that is put in removed.load is synced with
the cfs_rq PELT signal, which includes contrib, before being added to
removed.load.

>
> > Then, we start to remove this load in update_cfs_rq_load_avg() before
> > calling __update_load_avg_cfs_rq so contrib should not have change and
> > we should be safe
>
> For what it is worth, I am now able to reproduce it (maybe
> CONFIG_HZ=300/250 is the thing) as reported by Sachin,
> and my patch makes it disappear. Without my patch I see situations
> where _sum is zero while _avg is eg. 1 or 2 or 14 (in that range).

hmm, so there is something wrong in the propagation

> This happens for both load, runnable and util.
>
> Lets see what Sachin reports back.
>
> Thanks
> Odin


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 18:46, Sachin Sant  wrote:
>
>
> > Ok. This becomes even more weird. Could you share your config file and more 
> > details about
> > you setup ?
> >
> > Have you applied the patch below ?
> > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> >
> > Regarding the load_avg warning, I can see possible problem during attach. 
> > Could you add
> > the patch below. The load_avg warning seems to happen during boot and 
> > sched_entity
> > creation.
> >
>
> Here is a summary of my testing.
>
> I have a POWER box with PowerVM hypervisor. On this box I have a logical 
> partition(LPAR) or guest
> (allocated with 32 cpus 90G memory) running linux-next.
>
> I started with a clean slate.
> Moved to linux-next 5.13.0-rc7-next-20210622 as base code.
> Applied patch #1 from Vincent which contains changes to dequeue_load_avg()
> Applied patch #2 from Vincent which contains changes to enqueue_load_avg()
> Applied patch #3 from Vincent which contains changes to 
> attach_entity_load_avg()
> Applied patch #4 from 
> https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
>
> With these changes applied I was still able to recreate the issue. I could 
> see kernel warning
> during boot.
>
> I then applied patch #5 from Odin which contains changes to 
> update_cfs_rq_load_avg()
>
> With all the 5 patches applied I was able to boot the kernel without any 
> warning messages.
> I also ran scheduler related tests from ltp (./runltp -f sched) . All tests 
> including cfs_bandwidth01
> ran successfully. No kernel warnings were observed.

Ok, so Odin's patch fixes the problem, which highlights that we
overestimate _sum or don't sync _avg and _sum correctly.

I'm going to look at this further

>
> Have also attached .config in case it is useful. config has CONFIG_HZ_100=y

Thanks, I will have a look

>
> Thanks
> -Sachin
>


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 18:55, Vincent Guittot
 wrote:
>
> On Wed, 23 Jun 2021 at 18:46, Sachin Sant  wrote:
> >
> >
> > > Ok. This becomes even more weird. Could you share your config file and 
> > > more details about
> > > you setup ?
> > >
> > > Have you applied the patch below ?
> > > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> > >
> > > Regarding the load_avg warning, I can see possible problem during attach. 
> > > Could you add
> > > the patch below. The load_avg warning seems to happen during boot and 
> > > sched_entity
> > > creation.
> > >
> >
> > Here is a summary of my testing.
> >
> > I have a POWER box with PowerVM hypervisor. On this box I have a logical 
> > partition(LPAR) or guest
> > (allocated with 32 cpus 90G memory) running linux-next.
> >
> > I started with a clean slate.
> > Moved to linux-next 5.13.0-rc7-next-20210622 as base code.
> > Applied patch #1 from Vincent which contains changes to dequeue_load_avg()
> > Applied patch #2 from Vincent which contains changes to enqueue_load_avg()
> > Applied patch #3 from Vincent which contains changes to 
> > attach_entity_load_avg()
> > Applied patch #4 from 
> > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> >
> > With these changes applied I was still able to recreate the issue. I could 
> > see kernel warning
> > during boot.
> >
> > I then applied patch #5 from Odin which contains changes to 
> > update_cfs_rq_load_avg()
> >
> > With all the 5 patches applied I was able to boot the kernel without any 
> > warning messages.
> > I also ran scheduler related tests from ltp (./runltp -f sched) . All tests 
> > including cfs_bandwidth01
> > ran successfully. No kernel warnings were observed.
>
> ok so Odin's patch fixes the problem which highlights that we
> overestimate _sum or don't sync _avg and _sum correctly
>
> I'm going to look at this further

The problem is that "_avg * divider" assumes that all pending contrib
periods are non-null contributions, whereas they can be null.

Odin's patch is the right way to fix this. The other patches should not be
needed for your problem.
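
To make the overestimate concrete, here is a toy, standalone calculation (all numbers invented, the divider value only of the right order of magnitude) contrasting the old "subtract r * divider from _sum" with Odin's "rebuild _sum from the remaining _avg":

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t divider = 47742;	/* illustrative, roughly LOAD_AVG_MAX-sized */
	uint64_t load_avg = 3;
	/* _sum lags _avg * divider because some pending periods contributed nothing */
	uint64_t load_sum = 3 * divider - 2000;
	uint64_t removed = 2;		/* removed entity's load_avg */

	/* old scheme: sub_positive(&sa->load_sum, removed * divider) */
	uint64_t sub = removed * divider;
	uint64_t old_sum = load_sum > sub ? load_sum - sub : 0;

	/* Odin's scheme: rebuild _sum from the remaining _avg */
	uint64_t new_avg = load_avg - removed;
	uint64_t new_sum = new_avg * divider;

	printf("remaining avg=%llu, old sum=%llu is below avg*divider=%llu\n",
	       (unsigned long long)new_avg, (unsigned long long)old_sum,
	       (unsigned long long)(new_avg * divider));
	printf("remaining avg=%llu, rebuilt sum=%llu\n",
	       (unsigned long long)new_avg, (unsigned long long)new_sum);
	return 0;
}

Repeat such removals and the old scheme can drive _sum to zero while the remaining _avg is still positive, which is the pattern reported above.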

>
> >
> > Have also attached .config in case it is useful. config has CONFIG_HZ_100=y
>
> Thanks, i will have a look
>
> >
> > Thanks
> > -Sachin
> >


Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag

2024-06-14 Thread Vincent Guittot
On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra  wrote:
>
> On Thu, Jun 13, 2024 at 06:15:59PM +, K Prateek Nayak wrote:
> > Effects of call_function_single_prep_ipi()
> > ==
> >
> > To pull a TIF_POLLING thread out of idle to process an IPI, the sender
> > sets the TIF_NEED_RESCHED bit in the idle task's thread info in
> > call_function_single_prep_ipi() and avoids sending an actual IPI to the
> > target. As a result, the scheduler expects a task to be enqueued when
> > exiting the idle path. This is not the case with non-polling idle states
> > where the idle CPU exits the non-polling idle state to process the
> > interrupt, and since need_resched() returns false, soon goes back to
> > idle again.
> >
> > When TIF_NEED_RESCHED flag is set, do_idle() will call schedule_idle(),
> > a large part of which runs with local IRQ disabled. In case of ipistorm,
> > when measuring IPI throughput, this large IRQ disabled section delays
> > processing of IPIs. Further auditing revealed that in absence of any
> > runnable tasks, pick_next_task_fair(), which is called from the
> > pick_next_task() fast path, will always call newidle_balance() in this
> > scenario, further increasing the time spent in the IRQ disabled section.
> >
> > Following is the crude visualization of the problem with relevant
> > functions expanded:
> > --
> > CPU0  CPU1
> >   
> >   do_idle() {
> >   
> > __current_set_polling();
> >   ...
> >   monitor(addr);
> >   if 
> > (!need_resched())
> >   
> > mwait() {
> >   /* 
> > Waiting */
> > smp_call_function_single(CPU1, func, wait = 1) {
> >   ...
> >   ...   
> >   ...
> >   set_nr_if_polling(CPU1) { 
> >   ...
> >   /* Realizes CPU1 is polling */
> >   ...
> >   try_cmpxchg(addr, 
> >   ...
> >   &val, 
> >   ...
> >   val | _TIF_NEED_RESCHED); 
> >   ...
> >   } /* Does not send an IPI */  
> >   ...
> >   ... } /* 
> > mwait exit due to write at addr */
> >   csd_lock_wait() {   }
> >   /* Waiting */   
> > preempt_set_need_resched();
> >   ... 
> > __current_clr_polling();
> >   ... 
> > flush_smp_call_function_queue() {
> >   ... 
> > func();
> >   } /* End of wait */ }
> > } 
> > schedule_idle() {
> >   ...
> >   
> > local_irq_disable();
> > smp_call_function_single(CPU1, func, wait = 1) {  ...
> >   ... ...
> >   arch_send_call_function_single_ipi(CPU1);   ...
> >   \   ...
> >\  
> > newidle_balance() {
> > \   
> >   ...
> > /* Delay */ 
> >   ...
> >   \   }
> >\  ...
> > \-->  
> > local_irq_enable();
> >   /* 
> > Processes the IPI */
> > --
> >
> >
> > Skipping newidle_balance()
> > ==
> >
> > In an earlier attempt to solve the challenge of the long IRQ disabled
> > section, newidle_balance() was skipped when a CPU waking up from idle
> > was found to have no runnable tasks, and was transitioning back to
> > idle [2]. Tim

Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag

2024-06-17 Thread Vincent Guittot
On Sat, 15 Jun 2024 at 03:28, Peter Zijlstra  wrote:
>
> On Fri, Jun 14, 2024 at 12:48:37PM +0200, Vincent Guittot wrote:
> > On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra  wrote:
>
> > > > Vincent [5] pointed out a case where the idle load kick will fail to
> > > > run on an idle CPU since the IPI handler launching the ILB will check
> > > > for need_resched(). In such cases, the idle CPU relies on
> > > > newidle_balance() to pull tasks towards itself.
> > >
> > > Is this the need_resched() in _nohz_idle_balance() ? Should we change
> > > this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> > > something along those lines?
> >
> > It's not only this but also in do_idle(), which exits the loop
> > to look for tasks to schedule
>
> Is that really a problem? Reading the initial email the problem seems to
> be newidle balance, not hitting schedule. Schedule should be fairly
> quick if there's nothing to do, no?

There are 2 problems:
- Because of NEED_RESCHED being set, we go through the full schedule
path for no reason and we finally do a sched_balance_newidle()
- Because of need_resched being set to wake up the cpu, we will not
kick the softirq to run the nohz idle load balance, nor get a chance to
pull a task onto an idle CPU (see the sketch below)
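
The second point refers to the ILB kick path: the handler that would raise the nohz-balance softirq on the idle CPU bails out when need_resched() is set, even when the flag was only set to pull a polling CPU out of mwait. A rough sketch of that logic (not the exact kernel code):

static void nohz_csd_func_sketch(void *info)
{
	struct rq *rq = info;

	rq->idle_balance = idle_cpu(cpu_of(rq));
	/*
	 * A NEED_RESCHED set merely to wake a TIF_POLLING CPU makes
	 * need_resched() true here, so the softirq that runs the nohz idle
	 * load balance is never raised and no task gets pulled.
	 */
	if (rq->idle_balance && !need_resched())
		raise_softirq_irqoff(SCHED_SOFTIRQ);
}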

>
> > > I mean, it's fairly trivial to figure out if there really is going to be
> > > work there.
> > >
> > > > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > > > IPI was suggested as the correct approach to solve this problem on the
> > > > same thread.
> > >
> > > So adding per-arch changes for this seems like something we shouldn't
> > > unless there really is no other sane options.
> > >
> > > That is, I really think we should start with something like the below
> > > and then fix any fallout from that.
> >
> > The main problem is that need_resched becomes somewhat meaningless
> > because it no longer only means "I need to resched a task", and we have
> > to add more tests around it even for archs not using polling
>
> True, however we already had some of that by having the wakeup list,
> that made nr_running less 'reliable'.
>
> The thing is, most architectures seem to have the TIF_POLLING_NRFLAG
> bit, even if their main idle routine isn't actually using it, much of

Yes, I'm surprised that the Arm arch has TIF_POLLING_NRFLAG whereas it
has never been supported by the arch

> the idle loop until it hits the arch idle will be having it set and will
> thus tickle these cases *sometimes*.
>
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0935f9d4bb7b..cfa45338ae97 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -5799,7 +5800,7 @@ static inline struct task_struct *
> > >  __pick_next_task(struct rq *rq, struct task_struct *prev, struct 
> > > rq_flags *rf)
> > >  {
> > > const struct sched_class *class;
> > > -   struct task_struct *p;
> > > +   struct task_struct *p = NULL;
> > >
> > > /*
> > >  * Optimization: we know that if all tasks are in the fair class 
> > > we can
> > > @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct 
> > > *prev, struct rq_flags *rf)
> > > if (likely(!sched_class_above(prev->sched_class, 
> > > &fair_sched_class) &&
> > >rq->nr_running == rq->cfs.h_nr_running)) {
> > >
> > > -   p = pick_next_task_fair(rq, prev, rf);
> > > -   if (unlikely(p == RETRY_TASK))
> > > -   goto restart;
> > > +   if (rq->nr_running) {
> >
> > How do you tell the difference between a spurious need_resched() because of
> > polling and a cpu becoming idle ? Isn't rq->nr_running null in both
> > cases ?
>
> Bah, true. It should also check current being idle, which then makes a
> mess of things again. Still, we shouldn't be calling newidle from idle,
> that's daft.
>
> I should probably not write code at 3am, but the below horror is what
> I came up with.
>
> ---
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..cfe8d3350819 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6343,19 +6344,12 @@ pick_next_task(struct rq *rq, struct task_struct 
> *prev, struct rq_flags *rf)
>   * Constants f