Anton,
On 2016-10-18 09:46, li...@wrant.com wrote:
Hi Tinker,
[..]
How can I trigger some event logic when the system has become a
vegetable because of overload by the userland?
You're referring here to a watchdog timer, as present in some (most)
BMCs. This usually requires an OS timer reset process; see these:
[..]
The watchdog is realised in HW with a BIOS option to enable its timeout.
When the timer is not cleared by the OS process, the BMC reboots the
system.
[..]
timer with a SW guard process.
This is an ARM SBC; it has no BMC and, AFAIK, no watchdog or other timer
that can be programmed to cause a reboot. If you are aware of anything
like that on ARM SBCs, let me know.
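For reference, the watchdog pattern described in the quoted text (a timer that must be cleared periodically, otherwise the system is rebooted) can be modelled roughly as below. This is only a toy sketch in Python to show the logic; `SoftwareWatchdog`, `kick`, `poll` and the `on_expire` callback are hypothetical names, not any real driver or BMC API. A real watchdog lives in hardware precisely because it keeps counting even when nothing in the OS gets scheduled.

```python
import time


class SoftwareWatchdog:
    """Toy model of a watchdog: fires unless kicked before the deadline."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire  # hypothetical action, e.g. "reboot"
        self.clock = clock          # injectable so the logic can be tested
        self.deadline = clock() + timeout_s
        self.fired = False

    def kick(self):
        """Called by the guard process while the system is still healthy."""
        if not self.fired:
            self.deadline = self.clock() + self.timeout_s

    def poll(self):
        """Called by the timer side; fires once when the deadline has passed."""
        if not self.fired and self.clock() >= self.deadline:
            self.fired = True
            self.on_expire()
        return self.fired
```

The guard process would call `kick()` at an interval shorter than the timeout; if it is starved by an overloaded userland, the next `poll()` after the deadline triggers the action.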
My limited experience here says that system overload caused by user
processes can lead to all processes dying or freezing, and to the system
otherwise becoming unresponsive, except that terminal input is still
echoed.
Well, what are the process limits used for, then? These should help
here.
Then, as difficult as it gets, the mission is to run the system reliably.
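As an illustration of the process limits mentioned above, here is a minimal sketch that caps the address space of a single child process before it starts. It uses Python's standard `resource` and `subprocess` modules; the helper name `run_with_memory_limit` is made up for this example, and it assumes a Unix system that provides `RLIMIT_AS` (not every Unix does, in which case another limit such as `RLIMIT_DATA` may be the closest equivalent).

```python
import resource
import subprocess
import sys


def run_with_memory_limit(argv, limit_bytes):
    """Run argv as a child whose address space is capped at limit_bytes."""

    def apply_limit():
        # Runs in the child, after fork() and before exec().
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = limit_bytes
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)  # the soft limit may not exceed the hard one
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(argv, preexec_fn=apply_limit,
                          capture_output=True, text=True)
```

A call such as `run_with_memory_limit([sys.executable, "worker.py"], 512 * 1024 * 1024)` (with `worker.py` standing in for the real program) would make allocations beyond the cap fail inside that process instead of draining the whole machine. It does not by itself guard against many processes collectively using up the RAM.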
RAM is limited on this machine, so it is scarce and under some pressure.
Running out of RAM is more likely on a limited-resources machine like
this, where one process may consume 50-90% of the system's RAM rather
than, say, the 10% that would be more typical on server hardware.
However, RAM exhaustion could also happen on a server if processes
collectively use up all of it. I also guess there are resources other
than RAM through which userland could exhaust the system.
And for that I speculated that such event logic could be implemented as
some in-kernel code, e.g. as a kernel thread, if those have some kind of
higher execution guarantee than user process code,
Most probably, you are well aware of kernel level tracing and debugging.
[..]
Debugging user programs, and the kernel, is well documented in manuals.
Maybe you have some idea or proposal, that I am not able to understand.
What I was looking for is some foolproof logic that, on system
exhaustion caused by the userland, dumps state, syncs filesystems, and
reboots. Kernel tracing and debugging functionality is perhaps involved
in some sense, but not in the ordinary sense of being used by an admin
via the console.
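To make the intended sequence concrete (dump state, sync filesystems, reboot), here is a rough userland sketch in Python. It is not the in-kernel logic I am asking about, just the ordering I have in mind. `emergency_shutdown`, `dump_state` and `reboot` are hypothetical names supplied by the caller; only `os.sync()` is a real standard-library call, and it is Unix-only.

```python
import os


def emergency_shutdown(dump_state, reboot, sync=os.sync):
    """Dump state, sync filesystems, then reboot; never skip a later step."""
    errors = []
    for name, step in (("dump", dump_state), ("sync", sync)):
        try:
            step()
        except Exception as exc:  # a failed dump must not block the reboot
            errors.append((name, exc))
    reboot()  # hypothetical callable; a real one would not return
    return errors  # only reachable with a fake reboot, e.g. in tests
```

The point of the structure is that each step is attempted regardless of whether the previous one failed, so a full disk or a broken dump routine still ends in a reboot.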
SoftECC (a bit-flip detection mechanism / an ECC emulator) wouldn't help
with this.
If you have any thoughts about how to make that happen, feel free to
share.
Anyhow, in the absence of any such logic, just doing a hardware reset is
fine. It's just a bit constrained, as it comes without automated
reporting and recording that could be used to distinguish hardware/kernel
issues from userland issues, which encourages hardware replacement and
userland software debugging beyond what's really necessary.
Tinker