Re: [patch] sched: don't use nutty scale_rt_power() output

Mike Galbraith Thu, 27 Feb 2014 02:40:16 -0800

On Thu, 2014-02-27 at 10:40 +0100, Peter Zijlstra wrote: 
> On Mon, Feb 24, 2014 at 09:06:51AM +0100, Mike Galbraith wrote:
> > Hi Peter,
> > 
> > I wonder if the below makes sense for mainline.
> > 
> > Background: I received some rather surprising news recently, a user of
> > old 2.6.32 kernels regularly receives log spam stemming from old 208 day
> > era warnings/protections inserted to prevent explosions from what was at
> > the time unknown bad juju happening (but don't report logs that look
> > like graffiti artist with an unlimited supply of spray paint gone mad).
> > 
> > The kernel that emitted the below does NOT contain..
> > 9993bc63 sched/x86: Fix overflow in cyc2ns_offset
> > ..though these folks use kexec fwtw.  They're one of those "You update
> > your kernel IFF world stops spinning" users, so will likely not be
> > terribly interested in me making their boxen say BUG(), and may even be
> > doing something naughty that induces it for all I know.
> > 
> > In any case, NOT using nutty output from the intentionally racy function
> > seems like a good plan no matter who or what makes weird unreproducible
> > (elsewhere) sh*t happen.  Wedging a bent 64 bit peg into 32 bit hole
> > could make boom, on top of doing funny things to balancing. 
> > 
> > sched: don't use nutty scale_rt_power() output
> > 
> > Boxen instructed to gripe if they see nutty cpu_power catch us
> > trashing it while seriously dazed and confused for an unknown reason.
> > 
> > Dec 18 05:50:56 kernel: [40091179.401405] update_group_power: cpu_power = 3148183471
> > Dec 18 05:51:01 /usr/sbin/cron[2279]: (root) CMD (/opt/blah/fix_cdr_bin.job >> /opt/blah/fix_cdr_bin.out 2>&1)
> > Dec 18 05:51:06 kernel: [40091189.455713] update_cpu_power: cpu_power = 19495027282; scale_rt = 19495027282
> > Dec 18 05:51:16 kernel: [22076800.665578] update_cpu_power: cpu_power = 2671067611; scale_rt = 18428729677871137243
> > Dec 18 05:51:16 kernel: [40091199.188773] update_cpu_power: cpu_power = 2675064501; scale_rt = 18428729677875134133
> > 
> > Don't do that, make a scary warning instead.
> > 
> 
> Yeah, I'm in two minds about that. Crappy clocks can make a whole lot of
> misery. Then again, we usually guard against them going backwards.
> 
> How about something like so? Most other sites don't complain about
> clocks going backwards either, they just deal with it.


Yeah, better to warp protect scale_rt_power() directly.

This small set of identical weird ass boxen should be reliable tsc.
They jump back and forth in time by _exactly 208 days_, and do that
straight from boot, and randomly thereafter.  Wish I could get my hands
on one of the things, but that ain't gonna happen.

Those boxen have long uptimes, which proves you can survive with a sched
clock that's going completely bonkers, which is kinda surprising to me.
On a busy box, I'd expect some poor victim to eat the mother of all
latency hits.

> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5564,6 +5564,7 @@ static unsigned long scale_rt_power(int
>  {
>       struct rq *rq = cpu_rq(cpu);
>       u64 total, available, age_stamp, avg;
> +     s64 delta;
>  
>       /*
>        * Since we're reading these variables without serialization make sure
> @@ -5572,7 +5573,11 @@ static unsigned long scale_rt_power(int
>       age_stamp = ACCESS_ONCE(rq->age_stamp);
>       avg = ACCESS_ONCE(rq->rt_avg);
>  
> -     total = sched_avg_period() + (rq_clock(rq) - age_stamp);
> +     delta = rq_clock(rq) - age_stamp;
> +     if (unlikely(delta < 0))
> +             delta = 0;
> +
> +     total = sched_avg_period() + delta;
>  
>       if (unlikely(total < avg)) {
>               /* Ensures that power won't end up being negative */

