On Sat, Feb 23, 2013 at 12:22 PM, John Abreau wrote:
> RAM going bad silently is an aggravating problem, and we often don't think
> to test the RAM when some mysterious error crops up. It would be great if
> Nagios was able to test RAM automatically.
>
> Is it possible to test RAM on a live system
I recall hearing something a few years ago about memtest functionality being
added to the Linux kernel. Seems to me that making this functionality visible
to something like Nagios would be an obvious goal.
On Feb 24, 2013, at 2:52 AM, Tom Metro wrote:
> Rich Pieri wrote:
>> John Abreau wrot
Maybe not every 5 minutes, the way most checks are configured in Nagios. But
running it once a day, or even once a week, to give Nagios a chance to detect
memory errors might be worth the overhead.
It would be sufficient just to detect that bad RAM exists. You have to power
off the server anyw
On Sun, Feb 24, 2013 at 3:39 PM, John Abreau wrote:
> I recall hearing something a few years ago about memtest functionality being
> added to the Linux kernel. Seems to me that making this functionality visible
> to something like Nagios would be an obvious goal.
I decided to look into this a l
I wonder if it could be automated? Perhaps a weekly or monthly cron job
that temporarily sets grub to default to the memtest config, then reboots,
runs the memtest and logs the results, and finally sets grub back to its
previous config?
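
A rough sketch of the first half of that idea, for illustration only. It
assumes GRUB 2 with GRUB_DEFAULT=saved, so that grub-reboot (grub2-reboot on
some distributions) can select an entry for the next boot only, which avoids
having to restore the old default afterwards. The menu entry title below is a
placeholder and must match whatever the local grub.cfg actually contains.
Getting memtest to log its results and reboot by itself is not addressed here.

```python
import shlex
import subprocess

# Assumption: placeholder title; it must match a menu entry in the local grub.cfg.
MEMTEST_ENTRY = "Memory test (memtest86+)"


def build_commands(entry=MEMTEST_ENTRY):
    """Return the commands that boot `entry` once and then restart the machine."""
    return [
        ["grub-reboot", entry],      # applies to the next boot only
        ["shutdown", "-r", "now"],   # restart into the selected entry
    ]


def schedule_memtest(entry=MEMTEST_ENTRY, dry_run=True):
    """Print the commands (dry run) or execute them; returns the command list."""
    commands = build_commands(entry)
    for cmd in commands:
        if dry_run:
            print("would run:", shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # needs root privileges
    return commands
```

Calling schedule_memtest() from a weekly or monthly cron job with
dry_run=False would perform the reboot; the default dry run only prints what
would be executed.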
I firmly believe that if a process can only be run manually,
On Sun, 24 Feb 2013 19:13:21 -0500
John Abreau wrote:
> I wonder if it could be automated? Perhaps a weekly or monthly cron
> job that temporarily sets grub to default to the memtest config, then
> reboots, runs the memtest and logs the results, and finally sets grub
> back to its previous config
My understanding is that ECC RAM merely makes a server crash on a memory
error, not detect the error and alert the sysadmin. Is that not the case?
And no, I'm not looking to implement this in the near future, I just prefer
automating routine sysadmin chores and relying on Nagios for routine system
Incorrect. ECC RAM lets the server repair a single bit error and
continue operating without interruption. (The error may be logged if
the motherboard supports that and a suitable daemon is active. See the
EDAC project: http://bluesmoke.sourceforge.net/ ) Parity memory will
crash the server with a m
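
For anyone who wants Nagios to see those logged errors: a minimal sketch of a
plugin-style check, assuming the kernel's EDAC driver for the memory
controller is loaded and exposes its counters under
/sys/devices/system/edac/mc/ (ce_count for corrected errors, ue_count for
uncorrected ones). The thresholds are an arbitrary choice for illustration:
any corrected error is a warning, any uncorrected error is critical.

```python
import glob
import os

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_counts(edac_root="/sys/devices/system/edac/mc"):
    """Sum ce_count/ue_count over all controllers; None if none are exposed."""
    controllers = glob.glob(os.path.join(edac_root, "mc[0-9]*"))
    if not controllers:
        return None
    totals = {"ce_count": 0, "ue_count": 0}
    for mc in controllers:
        for name in totals:
            try:
                with open(os.path.join(mc, name)) as fh:
                    totals[name] += int(fh.read().strip())
            except (OSError, ValueError):
                pass  # counter missing or unreadable: skip it
    return totals["ce_count"], totals["ue_count"]


def evaluate(counts):
    """Map (corrected, uncorrected) totals to a Nagios status and message."""
    if counts is None:
        return UNKNOWN, "EDAC UNKNOWN - no memory controllers found"
    ce, ue = counts
    if ue > 0:
        return CRITICAL, f"EDAC CRITICAL - {ue} uncorrected, {ce} corrected errors"
    if ce > 0:
        return WARNING, f"EDAC WARNING - {ce} corrected errors"
    return OK, "EDAC OK - no memory errors recorded"


def main():
    """Print the status line and return the exit code for Nagios."""
    status, message = evaluate(read_counts())
    print(message)
    return status
```

To use it as a plugin, end the script with sys.exit(main()) (after importing
sys) so that Nagios receives the status as the exit code.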
On Sun, 24 Feb 2013 19:34:19 -0500
John Abreau wrote:
> My understanding is that ECC RAM merely makes a server crash on a
> memory error, not detect the error and alert the sysadmin. Is that
> not the case?
It does neither. ECC (error-correcting code) RAM corrects single-bit
errors automatically