Hello Sumit, please Cc the kernel mailing list.

Cc Petr, Peter, Steven, kernel-list

On (05/18/17 12:22), Sumit Gemini wrote:
> Hi Sergey,
>
> I read your comment on https://lkml.org/lkml/2016/9/27/279. Here you were
> talking about using alt_printk instead of printk.
> I got this crash, could this crash be related to calling printk recursively?
> Could you please tell me why I got this crash? Do you see any suspicious
> entry here?
>
> [2324956.184374] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [2324956.184374] Pid: 0, comm: kworker/0:0 Tainted: P    3.1.10-gb20-default #1
> [2324956.184374] Call Trace:
> [2324956.184374]  [<ffffffff81003add>] dump_trace+0x98/0x223
> [2324956.184374]  [<ffffffff813dc659>] dump_stack+0x69/0x6f
> [2324956.184374]  [<ffffffff813def08>] panic+0xa4/0x1b4
> [2324956.184374]  [<ffffffff81094b48>] watchdog_overflow_callback+0x79/0x9e
> [2324956.184374]  [<ffffffff810b66b5>] __perf_event_overflow+0xf9/0x185
> [2324956.184374]  [<ffffffff81010a2c>] intel_pmu_handle_irq+0x15d/0x1be
> [2324956.184374]  [<ffffffff813f375b>] perf_event_nmi_handler+0x3e/0x87
> [2324956.184374]  [<ffffffff813f4c0f>] notifier_call_chain+0x2e/0x5b
> [2324956.184374]  [<ffffffff813f4c76>] __atomic_notifier_call_chain+0x3a/0x4d
> [2324956.184374]  [<ffffffff813f4cc1>] notify_die+0x2d/0x32
> [2324956.184374]  [<ffffffff813f2bbb>] default_do_nmi+0x29/0x1b5
> [2324956.184374]  [<ffffffff813f2eb7>] do_nmi+0x47/0x6f
> [2324956.184374]  [<ffffffff813f2a30>] nmi+0x20/0x30
> [2324956.184374]  [<ffffffff8106b725>] arch_local_irq_save+0x1/0x17
> [2324956.184374]  [<ffffffff813f1e9b>] _raw_spin_lock_irqsave+0xf/0x39
> [2324956.184374]  [<ffffffff8105d9b7>] down_trylock+0xb/0x29
> [2324956.184374]  [<ffffffff810409c2>] console_trylock+0xf/0x47
> [2324956.184374]  [<ffffffff81040dc9>] console_unlock+0x100/0x129
> [2324956.184374]  [<ffffffff8104120b>] vprintk+0x348/0x394
> [2324956.184374]  [<ffffffff813df06b>] printk+0x53/0x58
> [2324956.184374]  [<ffffffff81359902>] ip_handle_martian_source+0x6d/0xf6
> [2324956.184374]  [<ffffffff8135b980>] ip_route_input_slow+0x45d/0x462
> [2324956.184374]  [<ffffffff8135c02f>] ip_route_input_common+0x217/0x231
> [2324956.184374]  [<ffffffff8137e995>] arp_process+0x181/0x4ac
> [2324956.184374]  [<ffffffff81337bf5>] __netif_receive_skb+0x317/0x36b
> [2324956.184374]  [<ffffffff81337b44>] __netif_receive_skb+0x266/0x36b
> [2324956.184374]  [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
> [2324956.184374]  [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
> [2324956.184374]  [<ffffffffa02e1dee>] igb_clean_rx_irq+0x30d/0x39e [igb]
> [2324956.184374]  [<ffffffffa02e1ecd>] igb_poll+0x4e/0x74 [igb]
> [2324956.184374]  [<ffffffff81339c88>] net_rx_action+0x65/0x178
> [2324956.184374]  [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
> [2324956.184374]  [<ffffffff813f9aac>] call_softirq+0x1c/0x30
> [2324956.184374]  [<ffffffff81003931>] do_softirq+0x3c/0x7b
> [2324956.184374]  [<ffffffff81045f98>] irq_exit+0x3c/0xac
> [2324956.184374]  [<ffffffff81003655>] do_IRQ+0x82/0x98
> [2324956.184374]  [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e
> [2324956.184374]  [<ffffffff8123f91d>] intel_idle+0xdd/0x104
> [2324956.184374]  [<ffffffff8130b76b>] cpuidle_idle_call+0xdf/0x181
> [2324956.184374]  [<ffffffff81001201>] cpu_idle+0x56/0xa9
>
> I shall be very grateful to you.

your CPU1 couldn't acquire the console_sem spin_lock

	down_trylock()
	 raw_spin_lock_irqsave(&sem->lock, flags);

and _probably_ looped on it with local IRQs disabled long enough to
disappoint the NMI watchdog, which eventually declared a hard lockup
condition.
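
for reference, down_trylock() itself is tiny, roughly this
(kernel/semaphore.c, simplified sketch; check your 3.1 tree for the exact
version):

	/* try to acquire the semaphore without sleeping */
	int down_trylock(struct semaphore *sem)
	{
		unsigned long flags;
		int count;

		/* this is the spin_lock the NMI caught CPU1 on */
		raw_spin_lock_irqsave(&sem->lock, flags);
		count = sem->count - 1;
		if (count >= 0)
			sem->count = count;	/* got it */
		raw_spin_unlock_irqrestore(&sem->lock, flags);

		return (count < 0);		/* 1 -> failed to acquire */
	}

so even the "trylock" path briefly takes sem->lock with IRQs off; it only
spins there for a long time if somebody else is sitting on that lock.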
but I'm not entirely sure it got stuck looping on sem->lock; it could be
that your net IRQ simply took too much time, and by the time the NMI
declared a hard lockup we were just unlucky enough to be executing
down_trylock().

on the other hand, the fact that you see

	IRQ
	 ip_handle_martian_source()
	  printk()
	   console_unlock()

is already not really good. first of all, you hit the ip_route_input()
slow path; second, you hit pr_warn()/pr_err() or print_hex_dump(), so the
kernel was reporting something abnormal to you.

printk() invoked from IRQ context managed to acquire the console_sem and
spent some time printing the messages to the serial console (that's the
console_unlock() in the call trace), which can sometimes be a call for
trouble. printk() can't always be safely called from atomic contexts,
precisely because of cases like this. I don't know how big
->hard_header_len can be, but that print_hex_dump() can take some extra
time. if you also happened to have other printk messages coming from
other CPUs in the meantime, then CPU1 had to spend even more time in
console_unlock() printing those messages, from an IRQ handler.

hm, this doesn't look like printk recursion to me, at least after a quick
look. either

a) something was holding sem->lock, can't tell what exactly (unlikely)

or

b) your IRQ handler just happened to execute for way too long. given that
   there was printk()->console_unlock() and probably a hex dump of the skb
   link-layer header, this is the more likely root cause.

	-ss
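
p.s. for reference, the code that fired here is ip_handle_martian_source()
in net/ipv4/route.c. in recent trees it looks roughly like this (simplified
sketch; the 3.1 version differs in details but dumps the same data):

	static void ip_handle_martian_source(struct net_device *dev,
					     struct in_device *in_dev,
					     struct sk_buff *skb,
					     __be32 daddr, __be32 saddr)
	{
	#ifdef CONFIG_IP_ROUTE_VERBOSE
		if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit()) {
			/* RFC1812: if the source is martian, the only hint
			 * is the MAC header, so it gets logged */
			pr_warn("martian source %pI4 from %pI4, on dev %s\n",
				&daddr, &saddr, dev->name);
			if (dev->hard_header_len && skb_mac_header_was_set(skb))
				/* hex dump of ->hard_header_len bytes of
				 * link-layer header, straight to the console,
				 * still in atomic context */
				print_hex_dump(KERN_WARNING, "ll header: ",
					       DUMP_PREFIX_OFFSET, 16, 1,
					       skb_mac_header(skb),
					       dev->hard_header_len, true);
		}
	#endif
	}

it is rate-limited by net_ratelimit(), but a single invocation still does a
pr_warn() plus a hex dump from atomic context, and whoever owns console_sem
at that moment also flushes everybody else's pending messages in
console_unlock().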