Re: How assign some logic to handle system-gone-totally-unresponsive events (if not else then to enable admin with differentiated failure tracking between userland and hardware failures)

Tinker Mon, 17 Oct 2016 17:45:13 -0700

On 2016-10-18 05:25, li...@wrant.com wrote:

Mon, 17 Oct 2016 18:00:39 +0200 Karel Gardas <gard...@gmail.com>
1) use machine with proper ECC support
Hello Karel,
Please explain this "proper ECC support" for every laptop user out there?

[..]

Mon, 17 Oct 2016 21:48:47 +0800 Tinker <ti...@openmailbox.org>
Sometimes a machine goes unresponsive. In this case, a non-ECC RAM
machine.
Hello Tinker,
This is one very intriguing problem with a very trivial solution: reboot. The idea to work around missing ECC support with software is as practical
[..]


Hi Anton,

You misread me -

What I asked about was not how to trigger some event logic on bit-flip errors (because on a non-ECC machine those will generally appear only as data corruption or undefined behavior) or on other hardware or kernel errors, but:

How to trigger some event logic when the system has become a vegetable because of overload by the userland?

My limited experience here says that system overload caused by user processes can lead to all processes dying or freezing, and to the system otherwise going unresponsive, except that terminal input is still echoed.

And for that I speculated that such event logic could be implemented as some in-kernel code, e.g. as a kernel thread, if those have some kind of higher execution guarantee than user process code.

E.g., when a userland watchdog/monitoring process hasn't sent any "I'm OK" signal to that thread for 60 seconds, that thread would dump the system's state to the console and reboot the machine.
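
To make the timeout logic concrete, here is a minimal sketch of it, written in Python purely for illustration; a real version would have to live in the kernel. The class name HeartbeatWatchdog, its methods, and the on_expire callback are all hypothetical, and the clock is passed in only so the behaviour can be checked without waiting 60 seconds.

```python
import time


class HeartbeatWatchdog:
    """Toy model of the proposed watchdog thread (hypothetical names)."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        # Treat creation time as the first heartbeat.
        self.last_beat = clock()

    def beat(self):
        # Called by the userland monitor to say "I'm OK".
        self.last_beat = self.clock()

    def expired(self):
        # True once timeout_s seconds have passed without a heartbeat.
        return self.clock() - self.last_beat >= self.timeout_s

    def poll(self, on_expire):
        # What the watchdog would run periodically: if the heartbeat is
        # overdue, call on_expire (standing in for "dump state to the
        # console and reboot") and report that it fired.
        if self.expired():
            on_expire()
            return True
        return False
```

The userland side would simply call beat() well inside the timeout, say every 10 seconds, so that only a userland that has truly stopped running lets the timer run out.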

This way I'd be able to distinguish userland-caused system crashes from hardware/kernel crashes, as the former always produce that output and reboot, whereas the latter don't (but instead reboot, crash to the kernel debug console, or just freeze the system altogether).


Do you see where I was heading now?

Tinker
