On Fri, Jul 07, 2023 at 01:11:54PM +0000, Taylor R Campbell wrote:
> FYI: In 10.99.5, I just added a new kernel diagnostic subsystem called
> heartbeat(9) that will make the system crash rather than hang when
> CPUs are stuck in certain ways that hardware watchdog timers can't
> detect (or on systems without hardware watchdog timers).
>
> It's optional for now, but it's small and I'd like to make it
> mandatory in the future.  If you'd like to try it out, add the
> following two lines to your kernel config:
>
> options         HEARTBEAT
> options         HEARTBEAT_MAX_PERIOD_DEFAULT=15
>
> You can disable it with `sysctl -w kern.heartbeat.max_period=0' at
> runtime, or use that knob to change the maximum period before the
> system will crash if not all (online) CPUs have made progress.
>
>
> Here are some manual tests that you can use to exercise it -- these
> are manual tests, not automatic tests, because some will deliberately
> crash the kernel to make sure the diagnostic works, and the others, if
> broken, will also crash the kernel.
>
> Notes:
> - The magic numbers for debug.crashme.spl_spinout are for evbarm.
>   On x86, use IPL_SCHED=7, IPL_VM=6, and IPL_SOFTCLOCK=1.
>   For other architectures, consult the source for the numbers to use.
> - If you're on a single-CPU system, skip the cpuctl offline/online
>   tests and just do (4) and (5).
> - If you're on a >2-CPU system, then for the cpuctl offline/online
>   tests, try offlining all CPUs but one at a time.
>
> 1. cpuctl offline 0
>    sleep 20
>    cpuctl online 0

With this I get a panic on Xen:
[ 225.4605386] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[ 225.4605386] cpu0: Begin traceback...
[ 225.4605386] vpanic() at netbsd:vpanic+0x163
[ 225.4605386] kern_assert() at netbsd:kern_assert+0x4b
[ 225.4705333] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[ 225.4705333] cpu_xc_online() at netbsd:cpu_xc_online+0x11
[ 225.4705333] xc_thread() at netbsd:xc_thread+0xc8
[ 225.4705333] cpu0: End traceback...
[ 225.4705333] fatal breakpoint trap in supervisor mode
[ 225.4705333] trap type 1 code 0 rip 0xffffffff8022e96d cs 0xe030 rflags 0x202 cr2 0xffff9b8030d32000 ilevel 0 rsp 0xffff9b8030985dd0
[ 225.4705333] curlwp 0xffff9b80007c6900 pid 0.7 lowest kstack 0xffff9b80309812c0
Stopped in pid 0.7 (system) at  netbsd:breakpoint+0x5:  leave
breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x163
kern_assert() at netbsd:kern_assert+0x4b
heartbeat_resume() at netbsd:heartbeat_resume+0x82
cpu_xc_online() at netbsd:cpu_xc_online+0x11
xc_thread() at netbsd:xc_thread+0xc8

Is it expected? Nothing looks Xen-specific here

>
> 2. cpuctl offline 1
>    sleep 20
>    cpuctl online 1

same panic

>
> 3. cpuctl offline 0
>    sysctl -w kern.heartbeat.max_period=5
>    sleep 10
>    sysctl -w kern.heartbeat.max_period=0
>    sleep 10
>    sysctl -w kern.heartbeat.max_period=15
>    sleep 20
>    cpuctl online 0

Here we have:
# sysctl -w kern.heartbeat.max_period=15
[  53.5704682] panic: kernel diagnostic assertion "kpreempt_disabled()" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_heartbeat.c", line 158
[  53.5704682] cpu0: Begin traceback...
[  53.5704682] vpanic() at netbsd:vpanic+0x163
[  53.5704682] kern_assert() at netbsd:kern_assert+0x4b
[  53.5704682] heartbeat_resume() at netbsd:heartbeat_resume+0x82
[  53.5704682] xc_thread() at netbsd:xc_thread+0xc8
[  53.5704682] cpu0: End traceback...

>
> 4. sysctl -w debug.crashme_enable=1
>    sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
>    # verify system panics after 15sec

my sysctl command did hang, but the system didn't panic

>
> 5. sysctl -w debug.crashme_enable=1
>    sysctl -w debug.crashme.spl_spinout=6   # IPL_SCHED
>    # verify system panics after 15sec

This one did panic

>
> 6. cpuctl offline 0
>    sysctl -w debug.crashme_enable=1
>    sysctl -w debug.crashme.spl_spinout=1   # IPL_SOFTCLOCK
>    # verify system panics after 15sec

my sysctl command did hang, but the system didn't panic

>
> 7. cpuctl offline 0
>    sysctl -w debug.crashme_enable=1
>    sysctl -w debug.crashme.spl_spinout=5   # IPL_VM
>    # verify system panics after 15sec

and this one did panic

-- 
Manuel Bouyer <bou...@antioche.eu.org>
     NetBSD: 26 ans d'experience feront toujours la difference
--