Hi Rafael,

On 04/07/2019 12:14, Rafael J. Wysocki wrote:
> On Thursday, June 20, 2019 1:58:08 PM CEST Daniel Lezcano wrote:
>> The objective is the same for all the governors: save energy, but at
>> the end the governors menu, ladder and teo aim to improve the
>> performances with an acceptable energy drop for some workloads which
>> are identified for servers and desktops (with the help of a firmware).
>>
>> The ladder governor is designed for server with a periodic tick
>> configuration.
>>
>> The menu governor does not behave nicely with the mobile platform and
>> the energy saving for the multimedia workloads is worst than picking
>> up randomly an idle state.
>>
>> The teo governor acts efficiently, it promotes shallower state for
>> performances which is perfect for the servers / desktop but inadequate
>> for mobile because the energy consumed is too high.
>>
>> It is very difficult to do changes in these governors for embedded
>> systems without impacting performances on servers/desktops or ruin the
>> optimizations for the workloads on these platforms.
>>
>> The mobile governor is a new governor targeting embedded systems
>> running on battery where the energy saving has a higher priority than
>> servers or desktops. This governor aims to save energy as much as
>> possible but with a performance degradation tolerance.
>>
>> In this way, we can optimize the governor for specific mobile workload
>> and more generally embedded systems without impacting other platforms.
>>
>> The mobile governor is built on top of the paradigm 'separate the wake
>> up sources signals and analyze them'. Three categories of wake up
>> signals are identified:
>>  - deterministic : timers
>>  - predictable : most of the devices interrupt
>>  - unpredictable : IPI rescheduling, random signals
>>
>> The latter needs an iterative approach and the help of the scheduler
>> to give more input to the governor.
>>
>> The governor uses the irq timings where we predict the next interrupt
>> occurrences on the current CPU and the next timer. It is well suited
>> for mobile and more generally embedded systems where the interrupts
>> are usually pinned on one CPU and where the power is more important
>> than the performances.
>>
>> The multimedia applications on the embedded system spawn multiple
>> threads which are migrated across the different CPUs and waking
>> between them up. In order to catch this situation we have also to
>> track the idle task rescheduling duration with a relative degree of
>> confidence as the scheduler is involved in the task migrations. The
>> resched information is in the scope of the governor via the reflect
>> callback.
>>
>> The governor begins with a clean foundation basing the prediction on
>> the irq behavior returned by the irq timings, the timers and the idle
>> task rescheduling. The advantage of the approach is we have a full
>> view of the wakeup sources as we identify them separately and then we
>> can control the situation without relying on biased heuristics.
>>
>> This first iteration provides a basic prediction but improves on some
>> mobile platforms better energy for better performance for multimedia
>> workloads.
>>
>> The scheduling aspect will be optimized iteratively with non
>> regression testing for previous identified workloads on an Android
>> reference platform.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezc...@linaro.org>
> 
> Note that there are build issues reported by 0-day that need to be fixed.
> Also, IMO this really should be documented better in the tree, not just in 
> the changelog.
> At least the use case to be covered by this governor should be clearly 
> documented and
> it would be good to describe the algorithm.

Ok, I will add some documentation.

>> ---
>>  drivers/cpuidle/Kconfig            |  11 ++-
>>  drivers/cpuidle/governors/Makefile |   1 +
>>  drivers/cpuidle/governors/mobile.c | 151 +++++++++++++++++++++++++++++
>>  3 files changed, 162 insertions(+), 1 deletion(-)
>>  create mode 100644 drivers/cpuidle/governors/mobile.c
>>
>> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
>> index a4ac31e4a58c..e2376d85e288 100644
>> --- a/drivers/cpuidle/Kconfig
>> +++ b/drivers/cpuidle/Kconfig
>> @@ -5,7 +5,7 @@ config CPU_IDLE
>>      bool "CPU idle PM support"
>>      default y if ACPI || PPC_PSERIES
>>      select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE)
>> -    select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO
>> +    select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO 
>> && !CPU_IDLE_GOV_MOBILE
>>      help
>>        CPU idle is a generic framework for supporting software-controlled
>>        idle processor power management.  It includes modular cross-platform
>> @@ -33,6 +33,15 @@ config CPU_IDLE_GOV_TEO
>>        Some workloads benefit from using it and it generally should be safe
>>        to use.  Say Y here if you are not happy with the alternatives.
>>  
>> +config CPU_IDLE_GOV_MOBILE
>> +    bool "Mobile governor"
>> +    select IRQ_TIMINGS
>> +    help
>> +      The mobile governor is based on irq timings measurements and
>> +      pattern research combined with the next timer. This governor
>> +      suits very well on embedded systems where the interrupts are
>> +      grouped on a single core and the power is the priority.
>> +
>>  config DT_IDLE_STATES
>>      bool
>>  
>> diff --git a/drivers/cpuidle/governors/Makefile 
>> b/drivers/cpuidle/governors/Makefile
>> index 42f44cc610dd..f09da7178670 100644
>> --- a/drivers/cpuidle/governors/Makefile
>> +++ b/drivers/cpuidle/governors/Makefile
>> @@ -6,3 +6,4 @@
>>  obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
>>  obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
>>  obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o
>> +obj-$(CONFIG_CPU_IDLE_GOV_MOBILE) += mobile.o
>> diff --git a/drivers/cpuidle/governors/mobile.c 
>> b/drivers/cpuidle/governors/mobile.c
>> new file mode 100644
>> index 000000000000..8fda0f9b960b
>> --- /dev/null
>> +++ b/drivers/cpuidle/governors/mobile.c
>> @@ -0,0 +1,151 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (C) 2019, Linaro Ltd
>> + * Author: Daniel Lezcano <daniel.lezc...@linaro.org>
>> + */
>> +#include <linux/cpuidle.h>
>> +#include <linux/kernel.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/tick.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/sched/clock.h>
>> +
>> +struct mobile_device {
>> +    u64 idle_ema_avg;
>> +    u64 idle_total;
>> +    unsigned long last_jiffies;
>> +};
>> +
>> +#define EMA_ALPHA_VAL               64
>> +#define EMA_ALPHA_SHIFT             7
>> +#define MAX_RESCHED_INTERVAL_MS     100
>> +
>> +static DEFINE_PER_CPU(struct mobile_device, mobile_devices);
>> +
>> +static int mobile_ema_new(s64 value, s64 ema_old)
>> +{
>> +    if (likely(ema_old))
>> +            return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >>
>> +                              EMA_ALPHA_SHIFT);
>> +    return value;
>> +}
>> +
>> +static void mobile_reflect(struct cpuidle_device *dev, int index)
>> +{
>> +        struct mobile_device *mobile_dev = this_cpu_ptr(&mobile_devices);
>> +    struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
>> +    struct cpuidle_state *s = &drv->states[index];
>> +    int residency;
>> +
>> +    /*
>> +     * The idle task was not rescheduled since
>> +     * MAX_RESCHED_INTERVAL_MS, let's consider the duration is
>> +     * long enough to clear our stats.
>> +     */
>> +    if (time_after(jiffies, mobile_dev->last_jiffies +
>> +                   msecs_to_jiffies(MAX_RESCHED_INTERVAL_MS)))
>> +            mobile_dev->idle_ema_avg = 0;
> 
> Why jiffies?  Any particular reason?

I used jiffies to not use the local_clock() call for the overhead. I
agree the resolution could be too low. Perhaps it makes more sense to
move idle start and idle end variable from the cpuidle_enter function to
the cpuidle device structure, so the information can be reused by the
subsequent users.

>> +
>> +    /*
>> +     * Sum all the residencies in order to compute the total
>> +     * duration of the idle task.
>> +     */
>> +    residency = dev->last_residency - s->exit_latency;
>> +    if (residency > 0)
>> +            mobile_dev->idle_total += residency;
>> +
>> +    /*
>> +     * We exited the idle state with the need_resched() flag, the
>> +     * idle task will be rescheduled, so store the duration the
>> +     * idle task was scheduled in an exponential moving average and
>> +     * reset the total of the idle duration.
>> +     */
>> +    if (need_resched()) {
>> +            mobile_dev->idle_ema_avg = 
>> mobile_ema_new(mobile_dev->idle_total,
>> +                                                  mobile_dev->idle_ema_avg);
>> +            mobile_dev->idle_total = 0;
>> +            mobile_dev->last_jiffies = jiffies;
>> +    }
>> +}
>> +
>> +static int mobile_select(struct cpuidle_driver *drv, struct cpuidle_device 
>> *dev,
>> +                   bool *stop_tick)
>> +{
>> +    struct mobile_device *mobile_dev = this_cpu_ptr(&mobile_devices);
>> +    int latency_req = cpuidle_governor_latency_req(dev->cpu);
>> +    int i, index = 0;
>> +    ktime_t delta_next;
>> +    u64 now, irq_length, timer_length;
>> +    u64 idle_duration_us;
>> +
>> +    /*
>> +     * Get the present time as reference for the next steps
>> +     */
>> +    now = local_clock();
>> +
>> +    /*
>> +     * Get the next interrupt event giving the 'now' as a
>> +     * reference, if the next event appears to have already
>> +     * expired then we get the 'now' returned which ends up with a
>> +     * zero duration.
>> +     */
>> +    irq_length = irq_timings_next_event(now) - now;
>> +
>> +    /*
>> +     * Get the timer duration before expiration.
>> +     */
> 
> This comment is rather redundant and the one below too. :-)

Right.

>> +    timer_length = ktime_to_ns(tick_nohz_get_sleep_length(&delta_next));
>> +
>> +    /*
>> +     * Get the smallest duration between the timer and the irq next event.
>> +     */
>> +    idle_duration_us = min_t(u64, irq_length, timer_length) / NSEC_PER_USEC;
>> +
>> +    /*
>> +     * Get the idle task duration average if the information is
>> +     * available.
> 
> IMO it would be good to explain this step in more detail, especially the 
> purpose of it.

Ok.

>> +     */
>> +    if (mobile_dev->idle_ema_avg)
>> +            idle_duration_us = min_t(u64, idle_duration_us,
>> +                                     mobile_dev->idle_ema_avg);
>> +
>> +    for (i = 0; i < drv->state_count; i++) {
>> +            struct cpuidle_state *s = &drv->states[i];
>> +            struct cpuidle_state_usage *su = &dev->states_usage[i];
>> +
>> +            if (s->disabled || su->disable)
>> +                    continue;
>> +
>> +            if (s->exit_latency > latency_req)
>> +                    break;
>> +
>> +            if (idle_duration_us > s->exit_latency)
>> +                    idle_duration_us = idle_duration_us - s->exit_latency;
> 
> Why do you want this?
> 
> It only causes you to miss an opportunity to select a deeper state sometimes,
> so what's the reason?

On the mobile platform the exit latencies are very high (with an order
of magnitude of several milliseconds) for a very limited number of idle
states. The real idle duration must be determined to compare to the
target residency. Without this test, the governor is constantly choosing
wrongly a deep idle state.

> Moreover, I don't think you should update idle_duration_us here, as the 
> updated
> value will go to the next step if the check below doesn't trigger.

Right, I spotted it also and fixed with:

+               if (s->exit_latency >= idle_duration_us)
+                       break;

+               if (s->target_residency > (idle_duration_us -
s->exit_latency))
                        break;

>> +
>> +            if (s->target_residency > idle_duration_us)
>> +                    break;
>> +
>> +            index = i;
>> +    }
>> +
>> +    if (!index)
>> +            *stop_tick = false;
> 
> Well, this means that the tick is stopped for all idle states deeper than 
> state 0.
> 
> If there are any states between state 0 and the deepest one and they are below
> the tick boundary, you may very well suffer the "powernightmares" problem
> because of this.

What would you suggest?

if (!index || ((idle_duration_us < TICK_USEC) &&
                !tick_nohz_tick_stopped()))
        *stop_tick = false;

?

There are too few idle states to restart a selection at this point, so
preventing stopping the tick is enough at this point IMO.


>> +    return index;
>> +}
>> +
>> +static struct cpuidle_governor mobile_governor = {
>> +    .name =         "mobile",
>> +    .rating =       20,
>> +    .select =       mobile_select,
>> +    .reflect =      mobile_reflect,
>> +};
>> +
>> +static int __init init_governor(void)
>> +{
>> +    irq_timings_enable();
>> +    return cpuidle_register_governor(&mobile_governor);
>> +}
>> +
>> +postcore_initcall(init_governor);
>>
> 
> 
> 
> 


-- 
 <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Reply via email to