On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:
> On Sun, 20 Aug 2017 14:45:53 +1000
> Nicholas Piggin <npig...@gmail.com> wrote:
> 
> > On Wed, 16 Aug 2017 09:27:31 -0700
> > "Paul E. McKenney" <paul...@linux.vnet.ibm.com> wrote:
> > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > 
> > > Thomas, John, am I misinterpreting the timer trace event messages?  
> > 
> > So I did some digging, and what you find is that rcu_sched seems to do a
> > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > The process_timeout timer never fires (when it finally does wake after
> > one of these events, it usually removes the timer with del_timer_sync).
> > 
> > So this patch seems to fix it. Testing, comments welcome.
> 
> Okay this had a problem of trying to forward the timer from a timer
> callback function.
> 
> This was my other approach which also fixes the RCU warnings, but it's
> a little more complex. I reworked it a bit so the mod_timer fast path
> hopefully doesn't have much more overhead (actually by reading jiffies
> only when needed, it probably saves a load).

Giving this one a whirl!

                                                        Thanx, Paul

> Thanks,
> Nick
> 
> --
> [PATCH] timers: Fix excessive granularity of new timers after a nohz idle
> 
> When a timer base is idle, it is forwarded when a new timer is added to
> ensure that granularity does not become excessive. When not idle, the
> timer tick is expected to increment the base.
> 
> However there is a window after a timer is restarted from nohz, when it
> is marked not-idle, and before the timer tick on this CPU, where a timer
> may be added on an ancient base that does not get forwarded (beacause
> the timer appears not-idle).
> 
> This results in excessive granularity. So much so that a 1 jiffy timeout
> has blown out to 10s of seconds and triggered the RCU stall warning
> detector.
> 
> Fix this by keeping track of whether the timer has been idle since it was
> last run or forwarded, and allow forwarding in the case that is true (even
> if it is not currently idle).
> 
> Also add a comment noting a case where we could get an unexpectedly
> large granularity for a timer. I debugged this problem by adding
> warnings for such cases, but it seems we can't add them in general due
> to this corner case.
> 
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
> ---
>  kernel/time/timer.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 8f5d1bf18854..ee7b8b688b48 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -203,6 +203,7 @@ struct timer_base {
>       bool                    migration_enabled;
>       bool                    nohz_active;
>       bool                    is_idle;
> +     bool                    was_idle;
>       DECLARE_BITMAP(pending_map, WHEEL_SIZE);
>       struct hlist_head       vectors[WHEEL_SIZE];
>  } ____cacheline_aligned;
> @@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned 
> tflags)
> 
>  static inline void forward_timer_base(struct timer_base *base)
>  {
> -     unsigned long jnow = READ_ONCE(jiffies);
> +     unsigned long jnow;
> 
>       /*
> -      * We only forward the base when it's idle and we have a delta between
> -      * base clock and jiffies.
> +      * We only forward the base when we are idle or have just come out
> +      * of idle (was_idle logic), and have a delta between base clock
> +      * and jiffies. In the common case, run_timers will take care of it.
>        */
> -     if (!base->is_idle || (long) (jnow - base->clk) < 2)
> +     if (likely(!base->was_idle))
> +             return;
> +
> +     jnow = READ_ONCE(jiffies);
> +     base->was_idle = base->is_idle;
> +     if ((long)(jnow - base->clk) < 2)
>               return;
> 
>       /*
> @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long 
> expires, bool pending_only)
>        * same array bucket then just return:
>        */
>       if (timer_pending(timer)) {
> +             /*
> +              * The downside of this optimization is that it can result in
> +              * larger granularity than you would get from adding a new
> +              * timer with this expiry. Would a timer flag for networking
> +              * be appropriate, then we can try to keep expiry of general
> +              * timers within ~1/8th of their interval?
> +              */
>               if (timer->expires == expires)
>                       return 1;
> 
> @@ -1499,8 +1513,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 
> basem)
>               /*
>                * If we expect to sleep more than a tick, mark the base idle:
>                */
> -             if ((expires - basem) > TICK_NSEC)
> +             if ((expires - basem) > TICK_NSEC) {
> +                     base->was_idle = true;
>                       base->is_idle = true;
> +             }
>       }
>       raw_spin_unlock(&base->lock);
> 
> @@ -1587,6 +1603,12 @@ static inline void __run_timers(struct timer_base 
> *base)
>       struct hlist_head heads[LVL_DEPTH];
>       int levels;
> 
> +     /*
> +      * was_idle must be cleared before running timers so that any timer
> +      * functions that call mod_timer will not try to forward the base.
> +      */
> +     base->was_idle = false;
> +
>       if (!time_after_eq(jiffies, base->clk))
>               return;
> 
> -- 
> 2.13.3
> 

Reply via email to