On Mon, 17 Mar 2014, Stanislav Fomichev wrote:
> I think I'm hitting a particularly subtle issue with a NOHZ_IDLE kernel.
>
> The sequence is as follows:
> - CPU enters idle, we disable the tick
> - hrtimer interrupt fires (for hrtimer_wakeup)
> - for clock base #1 (REALTIME) we wake up a SCHED_RT thread and
>   start the RT period timer (from start_rt_bandwidth) for clock base #0
>   (MONOTONIC)
> - because we already checked the expiry time for clock base #0,
>   we end up programming the wrong wake up time (minutes away, from
>   tick_sched_timer)
> - then we exit the idle loop and restart the tick;
>   but because tick_sched_timer is not the leftmost timer (the leftmost
>   one is the RT period timer) we don't program it
>
> So in the end, I see a working CPU without a tick interrupt.
> This eventually leads to an RCU stall on that CPU: rcu_gp_kthread
> is not woken up because there is no tick (this is the reason
> I started digging into this).
Nice analysis!

> The proposed fix runs expired timers first and only then tries to find
> the next expiry time for all clocks.
>
> Signed-off-by: Stanislav Fomichev <stfomic...@yandex-team.ru>
> ---
>  kernel/hrtimer.c | 38 ++++++++++++++++++++++++--------------
>  1 file changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 09094361dce5..63805f9f9531 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1282,6 +1282,9 @@ static void __run_hrtimer(struct hrtimer *timer, ktime_t *now)
>   */
>  void hrtimer_interrupt(struct clock_event_device *dev)
>  {
> +	struct hrtimer_clock_base *base;
> +	struct timerqueue_node *node;
> +	struct hrtimer *timer;
>  	struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
>  	ktime_t expires_next, now, entry_time, delta;
>  	int i, retries = 0;
> @@ -1304,8 +1307,6 @@ retry:
>  	cpu_base->expires_next.tv64 = KTIME_MAX;
>
>  	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
> -		struct hrtimer_clock_base *base;
> -		struct timerqueue_node *node;
>  		ktime_t basenow;
>
>  		if (!(cpu_base->active_bases & (1 << i)))
> @@ -1315,8 +1316,6 @@ retry:
>  		basenow = ktime_add(now, base->offset);
>
>  		while ((node = timerqueue_getnext(&base->active))) {
> -			struct hrtimer *timer;
> -
>  			timer = container_of(node, struct hrtimer, node);
>
>  			/*
> @@ -1332,22 +1331,33 @@ retry:
>  			 * timer will have to trigger a wakeup anyway.
>  			 */
>
> -			if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer)) {
> -				ktime_t expires;
> -
> -				expires = ktime_sub(hrtimer_get_expires(timer),
> -						    base->offset);
> -				if (expires.tv64 < 0)
> -					expires.tv64 = KTIME_MAX;
> -				if (expires.tv64 < expires_next.tv64)
> -					expires_next = expires;
> +			if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer))
>  				break;
> -			}
>
>  			__run_hrtimer(timer, &basenow);
>  		}
>  	}
>
> +	for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
> +		ktime_t expires;
> +
> +		if (!(cpu_base->active_bases & (1 << i)))
> +			continue;
> +
> +		base = cpu_base->clock_base + i;
> +
> +		node = timerqueue_getnext(&base->active);
> +		if (!node)
> +			continue;
> +
> +		timer = container_of(node, struct hrtimer, node);
> +		expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
> +		if (expires.tv64 < 0)
> +			expires.tv64 = KTIME_MAX;
> +		if (expires.tv64 < expires_next.tv64)
> +			expires_next = expires;

Actually we should store the expiry time of the first timer in a base in
the base itself. That needs to be done at enqueue time and

> +			if (basenow.tv64 < hrtimer_get_softexpires_tv64(timer))
>  				break;

here, right before we break out of handling the base. When the last
timer of the base goes away, set the value to KTIME_MAX.

If we store it with the base offset already applied

	base->next = ktime_sub(hrtimer_get_expires(timer), base->offset);

then the loop to find the next expiry in the interrupt becomes trivial
and fast.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/