I developed a small inotify-based tool to help users check whether their
watchdog device is being used. Instructions on how to run it are here:

https://github.com/inaddy/notifymydog

Small example:

inaddy@host:~$ wget https://raw.githubusercontent.com/inaddy/notifymydog/master/notifymydog.c
inaddy@host:~/notifymydog$ gcc -Wall -D_DEBUG=0 -D_SYSLOG=1 notifymydog.c -o notifymydog
inaddy@host:~/notifymydog$ sudo ./notifymydog &

inaddy@host:~$ sudo tail -f /var/log/syslog
Mar 16 17:36:26 inaddygueto WATCHMYDOG[15766]: OK: WATCHDOG UPDATED
Mar 16 17:36:40 inaddygueto WATCHMYDOG[15766]: OK: WATCHDOG UPDATED
Mar 16 17:36:44 inaddygueto WATCHMYDOG[15766]: WARNING: WATCHDOG WAS CLOSED
Mar 16 17:36:49 inaddygueto WATCHMYDOG[15766]: WARNING: WATCHDOG WAS OPENED

So if you ever get a kernel panic on an HP Proliant DL360 and/or DL380
server for no apparent reason, and the stack trace shows that NMIs were
generated, check whether any of your userland programs has opened
/dev/watchdog, either on purpose (but without updating it frequently
enough) or by accident. In both cases the watchdog hardware is triggered
and panics the machine after some time.

Workaround:

# echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf
# update-initramfs -k all -u
# update-grub
# reboot

** Summary changed:

- HP Proliant Servers should not have HPWDT module loaded automatically
+ HP Proliant Servers - Kernel Panic NMI - DL360 & DL380 - HPWDT module loaded

** Summary changed:

- HP Proliant Servers - Kernel Panic NMI - DL360 & DL380 - HPWDT module loaded
+ HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded

** Description changed:

  It was brought to me several situations where users where facing kernel
- panics when machine was apparently idling:
+ panics when machine was apparently idling (for some HP Proliant Servers
+ like DL 360, DL 380).

** Description changed:

  It was brought to me several situations where users where facing kernel
  panics when machine was apparently idling (for some HP Proliant Servers
  like DL 360, DL 380).
+ 
+ ILO:
+ 
+ "76 CriticalSystem Error03/12/2015 12:4203/12/2015 12:072 An
+ Unrecoverable System Error (NMI) has occurred (System error code
+ 0x0000002B, 0x00000000)"

https://bugs.launchpad.net/bugs/1432837

Title:
  HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Several situations were brought to me where users were facing kernel
  panics while the machine was apparently idling (on some HP Proliant
  servers such as the DL 360 and DL 380).

  ILO:

  "76 Critical System Error 03/12/2015 12:42 03/12/2015 12:07 2 An
  Unrecoverable System Error (NMI) has occurred (System error code
  0x0000002B, 0x00000000)"

  Examples:

  PID: 0      TASK: ffffffff81c1a480  CPU: 0   COMMAND: "swapper/0"
   #0 [ffff88085fc05c88] machine_kexec at ffffffff8104eac2
   #1 [ffff88085fc05cd8] crash_kexec at ffffffff810f26a3
   #2 [ffff88085fc05da0] panic at ffffffff8175b3f2
   #3 [ffff88085fc05e20] sched_clock at ffffffff8101c3b9
   #4 [ffff88085fc05e30] nmi_handle at ffffffff810170e8
   #5 [ffff88085fc05e90] io_check_error at ffffffff8101758e
   #6 [ffff88085fc05eb0] default_do_nmi at ffffffff810176a9
   #7 [ffff88085fc05ed8] do_nmi at ffffffff810177d8
   #8 [ffff88085fc05ef0] end_repeat_nmi at ffffffff8176da21
      [exception RIP: native_safe_halt+6]
      RIP: ffffffff81055186  RSP: ffffffff81c03e90  RFLAGS: 00000246
      RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000246
      RDX: ffffffff81c03e90  RSI: 0000000000000018  RDI: 0000000000000001
      RBP: ffffffff81055186   R8: ffffffff81055186   R9: 0000000000000018
      R10: ffffffff81c03e90  R11: 0000000000000246  R12: ffffffffffffffff
      R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
      ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
  --- <DOUBLEFAULT exception stack> ---
   #9 [ffffffff81c03e90] native_safe_halt at ffffffff81055186
  #10 [ffffffff81c03e98] default_idle at ffffffff8101d37f
  #11 [ffffffff81c03eb8] arch_cpu_idle at ffffffff8101dcaf
  #12 [ffffffff81c03ec8] cpu_startup_entry at ffffffff810b5325
  #13 [ffffffff81c03f40] rest_init at ffffffff81751a37
  #14 [ffffffff81c03f50] start_kernel at ffffffff81d320b7
  #15 [ffffffff81c03f90] x86_64_start_reservations at ffffffff81d315ee
  #16 [ffffffff81c03fa0] x86_64_start_kernel at ffffffff81d31733

  OR

  PID: 0      TASK: ffffffff81c14440  CPU: 0   COMMAND: "swapper/0"
   #0 [ffff880fffa07c40] machine_kexec at ffffffff8104b391
   #1 [ffff880fffa07cb0] crash_kexec at ffffffff810d5fb8
   #2 [ffff880fffa07d80] panic at ffffffff81730335
   #3 [ffff880fffa07e00] hpwdt_pretimeout at ffffffffa02378b5 [hpwdt]
   #4 [ffff880fffa07e20] nmi_handle at ffffffff8174a76a
   #5 [ffff880fffa07ea0] default_do_nmi at ffffffff8174aacd
   #6 [ffff880fffa07ed0] do_nmi at ffffffff8174abe0
   #7 [ffff880fffa07ef0] end_repeat_nmi at ffffffff81749c81
      [exception RIP: intel_idle+204]
      RIP: ffffffff813f07ec  RSP: ffffffff81c01d88  RFLAGS: 00000046
      RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000046
      RDX: ffffffff81c01d88  RSI: 0000000000000018  RDI: 0000000000000001
      RBP: ffffffff813f07ec   R8: ffffffff813f07ec   R9: 0000000000000018
      R10: ffffffff81c01d88  R11: 0000000000000046  R12: ffffffffffffffff
      R13: 0000000001c0d000  R14: ffffffff81c01fd8  R15: 0000000000000000
      ORIG_RAX: 0000000000000000  CS: 0010  SS: 0018
  --- <NMI exception stack> ---
   #8 [ffffffff81c01d88] intel_idle at ffffffff813f07ec
   #9 [ffffffff81c01dc0] cpuidle_enter_state at ffffffff815e76cf

  After investigating all of the idling situations and several kernel
  dump files, in which most of the CPUs were either MWAITing and/or
  "relaxing", we discovered that HPWDT was loaded and that corosync was
  opening the /dev/watchdog file. This triggered the ILO watchdog timer,
  which corosync then did not update as frequently as ILO expected.

  As described in /etc/modprobe.d/blacklist-watchdog.conf:

  """
  # Watchdog drivers should not be loaded automatically, but only if a
  # watchdog daemon is installed.
  """

  We should blacklist module "hpwdt" by default for all Ubuntu versions.
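
The two checks discussed in this report (is hpwdt blacklisted, and which
process is holding /dev/watchdog open) can be sketched in shell. This is
only a rough sketch and not part of notifymydog: the helper names
holders_of and is_blacklisted are made up for illustration, and the second
argument of each (an alternative proc root or modprobe.d directory) is an
assumption added so the helpers can be tried against a scratch directory.
The lookup only reads the /proc/<pid>/fd symlinks, so it should not open
the device itself and therefore should not arm the watchdog timer.

```shell
#!/bin/sh
# Sketch only: holders_of and is_blacklisted are hypothetical helper names.

# Print the PIDs that hold "$1" open, by reading the /proc/<pid>/fd symlinks.
# "$2" (default /proc) lets the proc root be overridden.
holders_of() {
    target=$1
    proc_root=${2:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            rest=${fd#"$proc_root"/}   # e.g. 1234/fd/7
            echo "${rest%%/*}"         # keep only the PID
        fi
    done | sort -un
}

# Succeed if module "$1" has a "blacklist" line in any *.conf file under
# "$2" (default /etc/modprobe.d).
is_blacklisted() {
    module=$1
    conf_dir=${2:-/etc/modprobe.d}
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+${module}([[:space:]]|\$)" \
        "$conf_dir"/*.conf
}

if is_blacklisted hpwdt; then
    echo "hpwdt: blacklisted"
else
    echo "hpwdt: not blacklisted"
fi

for pid in $(holders_of /dev/watchdog); do
    echo "/dev/watchdog is held open by PID $pid ($(cat "/proc/$pid/comm" 2>/dev/null))"
done
```

Run it as root, since the file descriptors of other users' processes are
not readable otherwise. If a PID such as corosync shows up while no
watchdog daemon is meant to be running, that matches the situation
described above.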