On Tue, 01 May 2018 23:07:28 +1000 Balbir Singh <bsinghar...@gmail.com> wrote:
> On Tue, 2018-05-01 at 12:22 +1000, Nicholas Piggin wrote: > > Provide timebase and timebase of last heartbeat in watchdog lockup > > messages. Also provide a stack trace of when a CPU becomes un-stuck, > > which can be useful -- it could be where irqs are re-enabled, so it > > may be the end of the critical section which is responsible for the > > latency. > > > > Signed-off-by: Nicholas Piggin <npig...@gmail.com> > > --- > > > > A lockup + unstuck event now looks like this (with irqtrace enabled): > > > > watchdog: CPU 1 self-detected hard LOCKUP @ udelay+0x40/0x60 > > watchdog: CPU 1 TB:82611697355, last heartbeat TB:75431975757 > > Can we divide TB with tb_ticks_per_sec, TB itself is not very useful, the > delta maybe, but it needs more work on behalf of the person looking > at the output. I kind of prefer being able to examine register values and compare directly with these logs, e.g., in mambo or xmon. But maybe end user prefers something friendlier. What about like watchdog: CPU 1 self-detected hard LOCKUP @ udelay+0x40/0x60 watchdog: CPU 1 no heartbeat for 14.02s (TB:82611697355, last TB:75431975757) ? > > @@ -245,8 +260,6 @@ void soft_nmi_interrupt(struct pt_regs *regs) > > > > tb = get_tb(); > > if (tb - per_cpu(wd_timer_tb, cpu) >= wd_panic_timeout_tb) { > > - per_cpu(wd_timer_tb, cpu) = tb; > > - > > Is this related to the print improvements? It looks like you don't want > to reset the tb, but I would split it out Yeah there isn't any real reason to reset it since we get marked as stuck which prevents further messages, and it clobbrs our last heartbeat value. I'll put it into its own change. Thanks, Nick