Abhishek Goel's on June 17, 2019 7:56 pm: > Currently, the cpuidle governors determine what idle state a idling CPU > should enter into based on heuristics that depend on the idle history on > that CPU. Given that no predictive heuristic is perfect, there are cases > where the governor predicts a shallow idle state, hoping that the CPU will > be busy soon. However, if no new workload is scheduled on that CPU in the > near future, the CPU may end up in the shallow state. > > This is problematic, when the predicted state in the aforementioned > scenario is a shallow stop state on a tickless system. As we might get > stuck into shallow states for hours, in absence of ticks or interrupts. > > To address this, We forcefully wakeup the cpu by setting the > decrementer. The decrementer is set to a value that corresponds with the > residency of the next available state. Thus firing up a timer that will > forcefully wakeup the cpu. Few such iterations will essentially train the > governor to select a deeper state for that cpu, as the timer here > corresponds to the next available cpuidle state residency. Thus, cpu will > eventually end up in the deepest possible state. > > Signed-off-by: Abhishek Goel <hunt...@linux.vnet.ibm.com> > --- > > Auto-promotion > v1 : started as auto promotion logic for cpuidle states in generic > driver > v2 : Removed timeout_needed and rebased the code to upstream kernel > Forced-wakeup > v1 : New patch with name of forced wakeup started > v2 : Extending the forced wakeup logic for all states. Setting the > decrementer instead of queuing up a hrtimer to implement the logic. > > drivers/cpuidle/cpuidle-powernv.c | 38 +++++++++++++++++++++++++++++++ > 1 file changed, 38 insertions(+) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c > b/drivers/cpuidle/cpuidle-powernv.c > index 84b1ebe212b3..bc9ca18ae7e3 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -46,6 +46,26 @@ static struct stop_psscr_table > stop_psscr_table[CPUIDLE_STATE_MAX] __read_mostly > static u64 default_snooze_timeout __read_mostly; > static bool snooze_timeout_en __read_mostly; > > +static u64 forced_wakeup_timeout(struct cpuidle_device *dev, > + struct cpuidle_driver *drv, > + int index) > +{ > + int i; > + > + for (i = index + 1; i < drv->state_count; i++) { > + struct cpuidle_state *s = &drv->states[i]; > + struct cpuidle_state_usage *su = &dev->states_usage[i]; > + > + if (s->disabled || su->disable) > + continue; > + > + return (s->target_residency + 2 * s->exit_latency) * > + tb_ticks_per_usec; > + } > + > + return 0; > +}
It would be nice to not have this kind of loop iteration in the idle fast path. Can we add a flag or something to the idle state? > + > static u64 get_snooze_timeout(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > int index) > @@ -144,8 +164,26 @@ static int stop_loop(struct cpuidle_device *dev, > struct cpuidle_driver *drv, > int index) > { > + u64 dec_expiry_tb, dec, timeout_tb, forced_wakeup; > + > + dec = mfspr(SPRN_DEC); > + timeout_tb = forced_wakeup_timeout(dev, drv, index); > + forced_wakeup = 0; > + > + if (timeout_tb && timeout_tb < dec) { > + forced_wakeup = 1; > + dec_expiry_tb = mftb() + dec; > + } The compiler probably can't optimise away the SPR manipulations so try to avoid them if possible. > + > + if (forced_wakeup) > + mtspr(SPRN_DEC, timeout_tb); This should just be put in the above 'if'. > + > power9_idle_type(stop_psscr_table[index].val, > stop_psscr_table[index].mask); > + > + if (forced_wakeup) > + mtspr(SPRN_DEC, dec_expiry_tb - mftb()); This will sometimes go negative and result in another timer interrupt. It also breaks irq work (which can be set here by machine check I believe. May need to implement some timer code to do this for you. static void reset_dec_after_idle(void) { u64 now; u64 *next_tb; if (test_irq_work_pending()) return; now = mftb; next_tb = this_cpu_ptr(&decrementers_next_tb); if (now >= *next_tb) return; set_dec(*next_tb - now); if (test_irq_work_pending()) set_dec(1); } Something vaguely like that. See timer_interrupt(). Thanks, Nick