Hi Rafael,
On 04/07/2019 12:14, Rafael J. Wysocki wrote: > On Thursday, June 20, 2019 1:58:08 PM CEST Daniel Lezcano wrote: >> The objective is the same for all the governors: save energy, but at >> the end the governors menu, ladder and teo aim to improve the >> performances with an acceptable energy drop for some workloads which >> are identified for servers and desktops (with the help of a firmware). >> >> The ladder governor is designed for server with a periodic tick >> configuration. >> >> The menu governor does not behave nicely with the mobile platform and >> the energy saving for the multimedia workloads is worst than picking >> up randomly an idle state. >> >> The teo governor acts efficiently, it promotes shallower state for >> performances which is perfect for the servers / desktop but inadequate >> for mobile because the energy consumed is too high. >> >> It is very difficult to do changes in these governors for embedded >> systems without impacting performances on servers/desktops or ruin the >> optimizations for the workloads on these platforms. >> >> The mobile governor is a new governor targeting embedded systems >> running on battery where the energy saving has a higher priority than >> servers or desktops. This governor aims to save energy as much as >> possible but with a performance degradation tolerance. >> >> In this way, we can optimize the governor for specific mobile workload >> and more generally embedded systems without impacting other platforms. >> >> The mobile governor is built on top of the paradigm 'separate the wake >> up sources signals and analyze them'. Three categories of wake up >> signals are identified: >> - deterministic : timers >> - predictable : most of the devices interrupt >> - unpredictable : IPI rescheduling, random signals >> >> The latter needs an iterative approach and the help of the scheduler >> to give more input to the governor. >> >> The governor uses the irq timings where we predict the next interrupt >> occurrences on the current CPU and the next timer. It is well suited >> for mobile and more generally embedded systems where the interrupts >> are usually pinned on one CPU and where the power is more important >> than the performances. >> >> The multimedia applications on the embedded system spawn multiple >> threads which are migrated across the different CPUs and waking >> between them up. In order to catch this situation we have also to >> track the idle task rescheduling duration with a relative degree of >> confidence as the scheduler is involved in the task migrations. The >> resched information is in the scope of the governor via the reflect >> callback. >> >> The governor begins with a clean foundation basing the prediction on >> the irq behavior returned by the irq timings, the timers and the idle >> task rescheduling. The advantage of the approach is we have a full >> view of the wakeup sources as we identify them separately and then we >> can control the situation without relying on biased heuristics. >> >> This first iteration provides a basic prediction but improves on some >> mobile platforms better energy for better performance for multimedia >> workloads. >> >> The scheduling aspect will be optimized iteratively with non >> regression testing for previous identified workloads on an Android >> reference platform. >> >> Signed-off-by: Daniel Lezcano <daniel.lezc...@linaro.org> > > Note that there are build issues reported by 0-day that need to be fixed. > Also, IMO this really should be documented better in the tree, not just in > the changelog. > At least the use case to be covered by this governor should be clearly > documented and > it would be good to describe the algorithm. Ok, I will add some documentation. >> --- >> drivers/cpuidle/Kconfig | 11 ++- >> drivers/cpuidle/governors/Makefile | 1 + >> drivers/cpuidle/governors/mobile.c | 151 +++++++++++++++++++++++++++++ >> 3 files changed, 162 insertions(+), 1 deletion(-) >> create mode 100644 drivers/cpuidle/governors/mobile.c >> >> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig >> index a4ac31e4a58c..e2376d85e288 100644 >> --- a/drivers/cpuidle/Kconfig >> +++ b/drivers/cpuidle/Kconfig >> @@ -5,7 +5,7 @@ config CPU_IDLE >> bool "CPU idle PM support" >> default y if ACPI || PPC_PSERIES >> select CPU_IDLE_GOV_LADDER if (!NO_HZ && !NO_HZ_IDLE) >> - select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO >> + select CPU_IDLE_GOV_MENU if (NO_HZ || NO_HZ_IDLE) && !CPU_IDLE_GOV_TEO >> && !CPU_IDLE_GOV_MOBILE >> help >> CPU idle is a generic framework for supporting software-controlled >> idle processor power management. It includes modular cross-platform >> @@ -33,6 +33,15 @@ config CPU_IDLE_GOV_TEO >> Some workloads benefit from using it and it generally should be safe >> to use. Say Y here if you are not happy with the alternatives. >> >> +config CPU_IDLE_GOV_MOBILE >> + bool "Mobile governor" >> + select IRQ_TIMINGS >> + help >> + The mobile governor is based on irq timings measurements and >> + pattern research combined with the next timer. This governor >> + suits very well on embedded systems where the interrupts are >> + grouped on a single core and the power is the priority. >> + >> config DT_IDLE_STATES >> bool >> >> diff --git a/drivers/cpuidle/governors/Makefile >> b/drivers/cpuidle/governors/Makefile >> index 42f44cc610dd..f09da7178670 100644 >> --- a/drivers/cpuidle/governors/Makefile >> +++ b/drivers/cpuidle/governors/Makefile >> @@ -6,3 +6,4 @@ >> obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o >> obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o >> obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o >> +obj-$(CONFIG_CPU_IDLE_GOV_MOBILE) += mobile.o >> diff --git a/drivers/cpuidle/governors/mobile.c >> b/drivers/cpuidle/governors/mobile.c >> new file mode 100644 >> index 000000000000..8fda0f9b960b >> --- /dev/null >> +++ b/drivers/cpuidle/governors/mobile.c >> @@ -0,0 +1,151 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Copyright (C) 2019, Linaro Ltd >> + * Author: Daniel Lezcano <daniel.lezc...@linaro.org> >> + */ >> +#include <linux/cpuidle.h> >> +#include <linux/kernel.h> >> +#include <linux/sched.h> >> +#include <linux/slab.h> >> +#include <linux/tick.h> >> +#include <linux/interrupt.h> >> +#include <linux/sched/clock.h> >> + >> +struct mobile_device { >> + u64 idle_ema_avg; >> + u64 idle_total; >> + unsigned long last_jiffies; >> +}; >> + >> +#define EMA_ALPHA_VAL 64 >> +#define EMA_ALPHA_SHIFT 7 >> +#define MAX_RESCHED_INTERVAL_MS 100 >> + >> +static DEFINE_PER_CPU(struct mobile_device, mobile_devices); >> + >> +static int mobile_ema_new(s64 value, s64 ema_old) >> +{ >> + if (likely(ema_old)) >> + return ema_old + (((value - ema_old) * EMA_ALPHA_VAL) >> >> + EMA_ALPHA_SHIFT); >> + return value; >> +} >> + >> +static void mobile_reflect(struct cpuidle_device *dev, int index) >> +{ >> + struct mobile_device *mobile_dev = this_cpu_ptr(&mobile_devices); >> + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev); >> + struct cpuidle_state *s = &drv->states[index]; >> + int residency; >> + >> + /* >> + * The idle task was not rescheduled since >> + * MAX_RESCHED_INTERVAL_MS, let's consider the duration is >> + * long enough to clear our stats. >> + */ >> + if (time_after(jiffies, mobile_dev->last_jiffies + >> + msecs_to_jiffies(MAX_RESCHED_INTERVAL_MS))) >> + mobile_dev->idle_ema_avg = 0; > > Why jiffies? Any particular reason? I used jiffies to not use the local_clock() call for the overhead. I agree the resolution could be too low. Perhaps it makes more sense to move idle start and idle end variable from the cpuidle_enter function to the cpuidle device structure, so the information can be reused by the subsequent users. >> + >> + /* >> + * Sum all the residencies in order to compute the total >> + * duration of the idle task. >> + */ >> + residency = dev->last_residency - s->exit_latency; >> + if (residency > 0) >> + mobile_dev->idle_total += residency; >> + >> + /* >> + * We exited the idle state with the need_resched() flag, the >> + * idle task will be rescheduled, so store the duration the >> + * idle task was scheduled in an exponential moving average and >> + * reset the total of the idle duration. >> + */ >> + if (need_resched()) { >> + mobile_dev->idle_ema_avg = >> mobile_ema_new(mobile_dev->idle_total, >> + mobile_dev->idle_ema_avg); >> + mobile_dev->idle_total = 0; >> + mobile_dev->last_jiffies = jiffies; >> + } >> +} >> + >> +static int mobile_select(struct cpuidle_driver *drv, struct cpuidle_device >> *dev, >> + bool *stop_tick) >> +{ >> + struct mobile_device *mobile_dev = this_cpu_ptr(&mobile_devices); >> + int latency_req = cpuidle_governor_latency_req(dev->cpu); >> + int i, index = 0; >> + ktime_t delta_next; >> + u64 now, irq_length, timer_length; >> + u64 idle_duration_us; >> + >> + /* >> + * Get the present time as reference for the next steps >> + */ >> + now = local_clock(); >> + >> + /* >> + * Get the next interrupt event giving the 'now' as a >> + * reference, if the next event appears to have already >> + * expired then we get the 'now' returned which ends up with a >> + * zero duration. >> + */ >> + irq_length = irq_timings_next_event(now) - now; >> + >> + /* >> + * Get the timer duration before expiration. >> + */ > > This comment is rather redundant and the one below too. :-) Right. >> + timer_length = ktime_to_ns(tick_nohz_get_sleep_length(&delta_next)); >> + >> + /* >> + * Get the smallest duration between the timer and the irq next event. >> + */ >> + idle_duration_us = min_t(u64, irq_length, timer_length) / NSEC_PER_USEC; >> + >> + /* >> + * Get the idle task duration average if the information is >> + * available. > > IMO it would be good to explain this step in more detail, especially the > purpose of it. Ok. >> + */ >> + if (mobile_dev->idle_ema_avg) >> + idle_duration_us = min_t(u64, idle_duration_us, >> + mobile_dev->idle_ema_avg); >> + >> + for (i = 0; i < drv->state_count; i++) { >> + struct cpuidle_state *s = &drv->states[i]; >> + struct cpuidle_state_usage *su = &dev->states_usage[i]; >> + >> + if (s->disabled || su->disable) >> + continue; >> + >> + if (s->exit_latency > latency_req) >> + break; >> + >> + if (idle_duration_us > s->exit_latency) >> + idle_duration_us = idle_duration_us - s->exit_latency; > > Why do you want this? > > It only causes you to miss an opportunity to select a deeper state sometimes, > so what's the reason? On the mobile platform the exit latencies are very high (with an order of magnitude of several milliseconds) for a very limited number of idle states. The real idle duration must be determined to compare to the target residency. Without this test, the governor is constantly choosing wrongly a deep idle state. > Moreover, I don't think you should update idle_duration_us here, as the > updated > value will go to the next step if the check below doesn't trigger. Right, I spotted it also and fixed with: + if (s->exit_latency >= idle_duration_us) + break; + if (s->target_residency > (idle_duration_us - s->exit_latency)) break; >> + >> + if (s->target_residency > idle_duration_us) >> + break; >> + >> + index = i; >> + } >> + >> + if (!index) >> + *stop_tick = false; > > Well, this means that the tick is stopped for all idle states deeper than > state 0. > > If there are any states between state 0 and the deepest one and they are below > the tick boundary, you may very well suffer the "powernightmares" problem > because of this. What would you suggest? if (!index || ((idle_duration_us < TICK_USEC) && !tick_nohz_tick_stopped())) *stop_tick = false; ? There are too few idle states to restart a selection at this point, so preventing stopping the tick is enough at this point IMO. >> + return index; >> +} >> + >> +static struct cpuidle_governor mobile_governor = { >> + .name = "mobile", >> + .rating = 20, >> + .select = mobile_select, >> + .reflect = mobile_reflect, >> +}; >> + >> +static int __init init_governor(void) >> +{ >> + irq_timings_enable(); >> + return cpuidle_register_governor(&mobile_governor); >> +} >> + >> +postcore_initcall(init_governor); >> > > > > -- <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook | <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-blog/> Blog