Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-24 Thread Borislav Petkov
On Thu, Mar 23, 2017 at 11:20:44AM -0700, Luck, Tony wrote:
> Keeping every PFN would be overkill (most of them should be taken
> offline with no issues). A fixed array of a few of them with timestamps
> to drop the oldest would likely be a good enough(TM) solution.

The reason being? Prevent the
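For readers following the thread, here is a minimal userspace sketch of the "fixed array with timestamps, drop the oldest" idea quoted above. It is not code from the patch: the names failed_log, failed_log_add and FAILED_SLOTS, the slot count of 4, and the caller-supplied timestamp are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FAILED_SLOTS 4	/* hypothetical size standing in for "a few of them" */

struct failed_pfn {
	uint64_t pfn;
	uint64_t stamp;	/* timestamp supplied by the caller */
	bool used;
};

struct failed_log {
	struct failed_pfn slot[FAILED_SLOTS];
};

/* Return true if pfn is currently remembered in the log. */
static bool failed_log_contains(const struct failed_log *log, uint64_t pfn)
{
	for (size_t i = 0; i < FAILED_SLOTS; i++)
		if (log->slot[i].used && log->slot[i].pfn == pfn)
			return true;
	return false;
}

/* Remember pfn at time now; when the array is full, overwrite the oldest entry. */
static void failed_log_add(struct failed_log *log, uint64_t pfn, uint64_t now)
{
	size_t victim = 0;

	/* Already present: only refresh the timestamp. */
	for (size_t i = 0; i < FAILED_SLOTS; i++) {
		if (log->slot[i].used && log->slot[i].pfn == pfn) {
			log->slot[i].stamp = now;
			return;
		}
	}

	/* Otherwise take the first free slot, or the slot with the oldest stamp. */
	for (size_t i = 0; i < FAILED_SLOTS; i++) {
		if (!log->slot[i].used) {
			victim = i;
			break;
		}
		if (log->slot[i].stamp < log->slot[victim].stamp)
			victim = i;
	}

	log->slot[victim].pfn = pfn;
	log->slot[victim].stamp = now;
	log->slot[victim].used = true;
}
```

With this shape the memory cost stays constant no matter how many pages fail to go offline, which is the property the quoted suggestion is after.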

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-23 Thread Luck, Tony
On Thu, Mar 23, 2017 at 06:28:39PM +0100, Borislav Petkov wrote:
> Meh, I don't like the idea of keeping an evergrowing list of PFNs we
> can't do anything about anyway.

Keeping every PFN would be overkill (most of them should be taken offline with no issues). A fixed array of a few of them with

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-23 Thread Borislav Petkov
On Thu, Mar 23, 2017 at 10:20:31AM -0700, Luck, Tony wrote:
> It can happen if Linux didn't actually take the page offline
> (because it was a kernel page). The CEC code only knows that
> it queued this page to be taken offline ... and has no way
> to know if that succeeded or not.

Right, that's t

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-23 Thread Luck, Tony
On Thu, Mar 23, 2017 at 04:22:28PM +0100, Borislav Petkov wrote:
> On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> > Lemme try to write a small script exercising exactly that scenario to
> > see whether I'm actually not talking crap here :-)
>
> Ok, here's a snapshot from the CE

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-23 Thread Borislav Petkov
On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> Lemme try to write a small script exercising exactly that scenario to
> see whether I'm actually not talking crap here :-)

Ok, here's a snapshot from the CEC after letting it run for a couple of hours in a guest with a script runni

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-22 Thread Borislav Petkov
On Wed, Mar 22, 2017 at 12:00:25PM -0700, Luck, Tony wrote:
> You also need to check that bit 61 of m->status is zero here.
> The collector is hiding uncorrected errors too.

Good catch. I think I wanna do something like this:

	if (memory_error(m) && !(m->status & MCI_STATUS_UC) ...

as we
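As a standalone illustration of the check being discussed, the sketch below tests bit 61 (the UC bit, MCI_STATUS_UC in the kernel headers) of a status word in userspace. The struct mce_sketch type and the cec_candidate() helper are hypothetical stand-ins, and the is_memory_error parameter merely substitutes for the kernel's memory_error(m), which is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 61 of the MCi_STATUS word: set when the error was NOT corrected. */
#define MCI_STATUS_UC (1ULL << 61)

/* Hypothetical stand-in for the fields of struct mce used here. */
struct mce_sketch {
	uint64_t status;
	uint64_t addr;
};

/*
 * Only corrected memory errors should reach the collector.
 * is_memory_error stands in for the kernel's memory_error(m).
 */
static bool cec_candidate(const struct mce_sketch *m, bool is_memory_error)
{
	if (!m)
		return false;

	return is_memory_error && !(m->status & MCI_STATUS_UC);
}
```

Without the UC test, an uncorrected memory error with a usable address would be absorbed by the collector and never logged, which is the problem pointed out in the quoted review comment.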

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-22 Thread Luck, Tony
On Thu, Mar 09, 2017 at 11:08:17AM +0100, Borislav Petkov wrote:
> +static bool cec_add_mce(struct mce *m)
> +{
> +	if (!m)
> +		return false;
> +
> +	if (memory_error(m) && mce_usable_address(m))
> +		if (!cec_add_elem(m->addr >> PAGE_SHIFT))
> +			r

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-22 Thread Borislav Petkov
On Mon, Mar 20, 2017 at 03:48:24PM -0700, Luck, Tony wrote:
> You added "count_threshold" for me ... so the condition isn't quite
> "overflows"
> like it was in the early versions.

It is a max count which, when reached, causes the soft offline attempt. What did you mean with "overflows" exactly t
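To make the "max count which, when reached, causes the soft offline attempt" wording concrete, here is a simplified userspace sketch. It is not the patch's data structure: the ce_table layout, MAX_ELEMS, and ce_table_add() are assumptions for illustration, and the policy of silently ignoring new PFNs when the table is full is an invented simplification. Only the name count_threshold comes from the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEMS 8	/* hypothetical capacity */

struct ce_elem {
	uint64_t pfn;
	unsigned int count;
};

struct ce_table {
	struct ce_elem e[MAX_ELEMS];
	size_t n;
	unsigned int count_threshold;	/* max count before offlining */
};

/*
 * Account one corrected error for pfn. Returns true when the count
 * reaches count_threshold, i.e. when the caller should attempt to
 * soft-offline the page; the element is dropped from the table then.
 */
static bool ce_table_add(struct ce_table *t, uint64_t pfn)
{
	size_t i;

	for (i = 0; i < t->n; i++)
		if (t->e[i].pfn == pfn)
			break;

	if (i == t->n) {
		if (t->n == MAX_ELEMS)
			return false;	/* sketch only: ignore when full */
		t->e[i].pfn = pfn;
		t->e[i].count = 0;
		t->n++;
	}

	t->e[i].count++;
	if (t->e[i].count < t->count_threshold)
		return false;

	/* Threshold reached: remove by moving the last element into the hole. */
	t->n--;
	t->e[i] = t->e[t->n];
	return true;
}
```

The trigger here is the comparison against count_threshold rather than a counter field wrapping around, which is the distinction the exchange above turns on.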

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-20 Thread Luck, Tony
On Thu, Mar 09, 2017 at 11:08:17AM +0100, Borislav Petkov wrote:
> +config RAS_CEC
> +	bool "Correctable Errors Collector"
> +	depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS
> +	---help---
> +	  This is a small cache which collects correctable memory errors per 4K
> +	  page P

Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-12 Thread Boris Petkov
On March 9, 2017 11:08:17 AM GMT+01:00, Borislav Petkov wrote:
>From: Borislav Petkov
...
>diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
>index 0bc60a308730..2a2d89d39af6 100644
>--- a/arch/x86/ras/Kconfig
>+++ b/arch/x86/ras/Kconfig
>@@ -7,3 +7,17 @@ config MCE_AMD_INJ
> asp

[PATCH 3/4] RAS: Add a Corrected Errors Collector

2017-03-09 Thread Borislav Petkov
From: Borislav Petkov

A simple data structure for collecting correctable errors along with accessors. More detailed description in the code itself. The error decoding is done with the decoding chain now and mce_first_notifier() gets to see the error first and the CEC decides whether to log it an