On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does
> show that
>
>  (a) a flaky HPET can do some seriously bad stuff
>  (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test program to at least try to alleviate
> problem (b).
Ok, so after several false starts (ktime_get() is really, really fragile - it gets called from scheduler code, and does magic things at bootup), here is something that seems to alleviate the problem for me.

I still get a lot of RCU messages like "self-detected stall" etc, but that's to be expected. When the clock does odd things, crap *will* happen.

But what this does is:

 (a) make the error visible as a clock error, rather than as random failures in various downstream users

     IOW, it prints things out when it looks like we're getting odd clock read errors (arbitrary cut-off: we expect clock read-outs to be within 1/8th of the range of the expected clock value)

 (b) try to alleviate the horrible things that happen when the clock error is big

     The patch tries to "correct" for the huge time jump by basically undoing it. We'll still see time jumps (there really is no way to avoid them), but we limit the range of them.

With the attached patch, my machine seems to survive me writing to the HPET master counter register. It spews warnings, and it is noisy about the odd clock reads:

  ...
  Clocksource hpet had cycles off by 642817751
  Cutting it too close for hpet in update_wall_time (offset = 4034102337)
  INFO: rcu_sched self-detected stall on CPU { 0}  (t=281743 jiffies g=4722 c=4721 q=14)
  ...

and there may still be situations where it does horrible, horrible things due to the (smaller) time leaps, but it does seem a lot more robust.

NOTE! There's an (intentional) difference in how we handle the time leaps at read time vs write (wall-clock update) time.

At read time, we just refuse to believe the big delta, and we set the "cycle_error" value so that future time reads will be relative to the error we just got. We also don't print anything out, because we're possibly deep in the scheduler or in tracing, and we do not want to spam the console about our fixup.

At time *write* time, we first report any read-time errors, and then we report (but believe) overlarge clocksource delta errors as we update the time. This seems to be the best way to limit the damage.

Also note that the patch is entirely clock-agnostic. It's just that I can trivially screw up my HPET; I didn't look at other clocks.

One note: my current limit for clocksource delta errors is based on the range of the clock (1/8th of the range). I actually think that's bogus, and it should instead be based on the expected frequency of the clock (ie "we are guaranteed to update the wall clock at least once every second, so if the clock source delta read is larger than one second, we've done something wrong") - see the sketch below.

So this patch is meant very much as an RFC, rather than anything else. It's pretty hacky. But it does actually make a huge difference for me wrt the "mess up the HPET time on purpose" test. That used to crash my machine pretty hard, and pretty reliably. With this patch, I've done it ten+ times, and while it spews a lot of garbage, the machine stays up and _works_.

Making the sanity check tighter (ie the "one second" band rather than "1/8th of the clock range") would probably just improve it further.

Thomas, what do you think? Hate it? Any better ideas?

And again: this is not trying to make the kernel clock not jump. There is no way, even in theory, that I can come up with to really *fix* a fundamentally broken clock. So this is not meant to be a real "fix" for anything, but to make sure that if the clock is unreliable, we pinpoint the clock itself, and to mitigate the absolutely horrendously bad behavior we currently get with bad clocks.
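For concreteness, a frequency-based band might look something like this (a purely hypothetical helper, not part of the patch below; it just inverts the usual cycles-to-ns conversion that the timekeeping code does with mult/shift):

static inline cycle_t tkr_cycles_per_second(struct tk_read_base *tkr)
{
	/*
	 * timekeeping converts cycles to ns as (delta * mult) >> shift,
	 * so one second (NSEC_PER_SEC ns) corresponds to roughly
	 * (NSEC_PER_SEC << shift) / mult cycles.
	 */
	return ((cycle_t)NSEC_PER_SEC << tkr->shift) / tkr->mult;
}

and the cut-off in timekeeping_get_ns() would then become "delta > tkr_cycles_per_second(tkr)" instead of "delta > (tkr->mask >> 3)".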
So think of this as debug and band-aid rather than "this makes clocks magically reliable".

.. and we might still lock up under some circumstances. But at least in my limited testing, it is infinitely better, even if it might not be perfect.

Also note that my "testing" has been writing zero to the HPET clock (so the HPET clock difference tends to be pretty specific), while my next step is to see what happens when I write random values (and a lot of them). Since I expect that to cause more problems, I thought I'd send this RFC out before I start killing my machine again ;) A rough sketch of that test is appended below, before the patch.

                  Linus
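For reference, the "scribble on the HPET" test is more or less just this (a hypothetical sketch, not the exact program from the earlier mail: it assumes the usual x86 HPET MMIO base of 0xfed00000 and the main counter at offset 0xf0 per the HPET spec, and it needs root plus a kernel that allows /dev/mem access to that range):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPET_BASE	0xfed00000UL	/* usual x86 HPET MMIO base */
#define HPET_COUNTER	0xf0		/* main counter offset, per HPET spec */

int main(int argc, char **argv)
{
	/* value to write: zero by default, or whatever is on the command line */
	uint32_t val = argc > 1 ? (uint32_t)strtoul(argv[1], NULL, 0) : 0;
	volatile uint32_t *hpet;
	int fd;

	fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}
	hpet = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, HPET_BASE);
	if (hpet == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* clobber the low 32 bits of the main counter */
	hpet[HPET_COUNTER / 4] = val;
	printf("wrote 0x%08x to the HPET main counter\n", val);
	return 0;
}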
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c           | 25 ++++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
 	cycle_t			(*read)(struct clocksource *cs);
 	cycle_t			mask;
 	cycle_t			cycle_last;
+	cycle_t			cycle_error;
 	u32			mult;
 	u32			shift;
 	u64			xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a931852082f..1c842ddd567f 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr.read = clock->read;
 	tk->tkr.mask = clock->mask;
 	tk->tkr.cycle_last = tk->tkr.read(clock);
+	tk->tkr.cycle_error = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -197,11 +198,17 @@ static inline s64 timekeeping_get_ns(struct tk_read_base *tkr)
 	s64 nsec;
 
 	/* read clocksource: */
-	cycle_now = tkr->read(tkr->clock);
+	cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;
 
 	/* calculate the delta since the last update_wall_time: */
 	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);
 
+	/* Hmm? This is really not good, we're too close to overflowing */
+	if (unlikely(delta > (tkr->mask >> 3))) {
+		tkr->cycle_error = delta;
+		delta = 0;
+	}
+
 	nsec = delta * tkr->mult + tkr->xtime_nsec;
 	nsec >>= tkr->shift;
 
@@ -465,6 +472,16 @@ static void timekeeping_update(struct timekeeper *tk, unsigned int action)
 	update_fast_timekeeper(tk);
 }
 
+static void check_cycle_error(struct tk_read_base *tkr)
+{
+	cycle_t error = tkr->cycle_error;
+
+	if (unlikely(error)) {
+		tkr->cycle_error = 0;
+		pr_err("Clocksource %s had cycles off by %llu\n", tkr->clock->name, error);
+	}
+}
+
 /**
  * timekeeping_forward_now - update clock to the current time
  *
@@ -481,6 +498,7 @@ static void timekeeping_forward_now(struct timekeeper *tk)
 	cycle_now = tk->tkr.read(clock);
 	delta = clocksource_delta(cycle_now, tk->tkr.cycle_last, tk->tkr.mask);
 	tk->tkr.cycle_last = cycle_now;
+	check_cycle_error(&tk->tkr);
 
 	tk->tkr.xtime_nsec += delta * tk->tkr.mult;
@@ -1237,6 +1255,7 @@ static void timekeeping_resume(void)
 
 	/* Re-base the last cycle value */
 	tk->tkr.cycle_last = cycle_now;
+	tk->tkr.cycle_error = 0;
 	tk->ntp_error = 0;
 	timekeeping_suspended = 0;
 	timekeeping_update(tk, TK_MIRROR | TK_CLOCK_WAS_SET);
@@ -1591,11 +1610,15 @@ void update_wall_time(void)
 	if (unlikely(timekeeping_suspended))
 		goto out;
 
+	check_cycle_error(&real_tk->tkr);
+
 #ifdef CONFIG_ARCH_USES_GETTIMEOFFSET
 	offset = real_tk->cycle_interval;
 #else
 	offset = clocksource_delta(tk->tkr.read(tk->tkr.clock),
 				   tk->tkr.cycle_last, tk->tkr.mask);
+	if (unlikely(offset > (tk->tkr.mask >> 3)))
+		pr_err("Cutting it too close for %s in update_wall_time (offset = %llu)\n", tk->tkr.clock->name, offset);
 #endif
 
 	/* Check if there's really nothing to do */
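And just to see what the read-time clamp in timekeeping_get_ns() does in isolation, here is a small stand-alone user-space illustration of the same arithmetic (made-up values, obviously not kernel code):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t cycle_t;

static cycle_t mask = 0xffffffffULL;	/* a 32-bit, HPET-style counter */
static cycle_t cycle_last, cycle_error;

/* mirrors the delta handling in the patched timekeeping_get_ns() */
static cycle_t read_delta(cycle_t raw)
{
	cycle_t cycle_now = raw + cycle_error;
	cycle_t delta = (cycle_now - cycle_last) & mask;

	/* way too close to overflowing: remember the error, report no progress */
	if (delta > (mask >> 3)) {
		cycle_error = delta;
		delta = 0;
	}
	return delta;
}

int main(void)
{
	cycle_last = 1000;

	/* a sane read: the counter moved forward a little */
	printf("delta = %llu\n", (unsigned long long)read_delta(2000));

	/* someone scribbled on the counter: a huge apparent jump */
	printf("delta = %llu (cycle_error = %llu)\n",
	       (unsigned long long)read_delta(0x80000000ULL),
	       (unsigned long long)cycle_error);
	return 0;
}

The first read comes back as a normal 1000-cycle delta; the second is suppressed (delta 0) and the bogus delta is remembered in cycle_error, so that check_cycle_error() can report and clear it at the next wall-clock update instead of the huge value leaking into the returned nanoseconds.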