On Wed, May 2, 2018 at 6:38 PM, Nicholas Piggin <npig...@gmail.com> wrote: > On Tue, 01 May 2018 23:07:28 +1000 > Balbir Singh <bsinghar...@gmail.com> wrote: > >> On Tue, 2018-05-01 at 12:22 +1000, Nicholas Piggin wrote: >> > Provide timebase and timebase of last heartbeat in watchdog lockup >> > messages. Also provide a stack trace of when a CPU becomes un-stuck, >> > which can be useful -- it could be where irqs are re-enabled, so it >> > may be the end of the critical section which is responsible for the >> > latency. >> > >> > Signed-off-by: Nicholas Piggin <npig...@gmail.com> >> > --- >> > >> > A lockup + unstuck event now looks like this (with irqtrace enabled): >> > >> > watchdog: CPU 1 self-detected hard LOCKUP @ udelay+0x40/0x60 >> > watchdog: CPU 1 TB:82611697355, last heartbeat TB:75431975757 >> >> Can we divide TB with tb_ticks_per_sec, TB itself is not very useful, the >> delta maybe, but it needs more work on behalf of the person looking >> at the output. > > I kind of prefer being able to examine register values and compare > directly with these logs, e.g., in mambo or xmon. > > But maybe end user prefers something friendlier. What about like > > watchdog: CPU 1 self-detected hard LOCKUP @ udelay+0x40/0x60 > watchdog: CPU 1 no heartbeat for 14.02s (TB:82611697355, last TB:75431975757) > > ?
That's better, if you really need the TB values, assuming we've handled overflows, which should practically never happen right? :) > >> > @@ -245,8 +260,6 @@ void soft_nmi_interrupt(struct pt_regs *regs) >> > >> > tb = get_tb(); >> > if (tb - per_cpu(wd_timer_tb, cpu) >= wd_panic_timeout_tb) { >> > - per_cpu(wd_timer_tb, cpu) = tb; >> > - >> >> Is this related to the print improvements? It looks like you don't want >> to reset the tb, but I would split it out > > > Yeah there isn't any real reason to reset it since we get marked as > stuck which prevents further messages, and it clobbrs our last heartbeat > value. I'll put it into its own change. > Sounds good! Balbir Singh.