On Thu, Mar 23, 2017 at 11:20:44AM -0700, Luck, Tony wrote:
> Keeping every PFN would be overkill (most of them should be taken
> offline with no issues). A fixed array of a few of them with timestamps
> to drop the oldest would likely be a good enough(TM) solution.
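
A rough sketch of what such a fixed array could look like, purely for illustration. The names (stuck_table, stuck_record, stuck_contains), the capacity of 4 and the caller-supplied timestamp are all assumptions and are not taken from the patch under discussion:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STUCK_PFNS 4	/* hypothetical capacity: "a few of them" */

struct stuck_pfn {
	uint64_t pfn;
	uint64_t stamp;	/* caller-supplied monotonic timestamp */
	bool used;
};

struct stuck_table {
	struct stuck_pfn slot[STUCK_PFNS];
};

/*
 * Remember @pfn. If it is already present, only refresh its timestamp.
 * Otherwise use the first free slot, or overwrite the entry with the
 * oldest timestamp when the table is full. Returns the slot index used.
 */
static size_t stuck_record(struct stuck_table *t, uint64_t pfn, uint64_t now)
{
	size_t victim = 0;
	bool have_free = false;

	for (size_t i = 0; i < STUCK_PFNS; i++) {
		struct stuck_pfn *s = &t->slot[i];

		if (s->used && s->pfn == pfn) {
			s->stamp = now;
			return i;
		}

		if (!s->used) {
			/* Prefer the first free slot over any eviction. */
			if (!have_free) {
				victim = i;
				have_free = true;
			}
		} else if (!have_free && s->stamp < t->slot[victim].stamp) {
			/* Track the oldest used entry as eviction candidate. */
			victim = i;
		}
	}

	t->slot[victim].pfn = pfn;
	t->slot[victim].stamp = now;
	t->slot[victim].used = true;
	return victim;
}

/* Return true if @pfn is currently remembered in the table. */
static bool stuck_contains(const struct stuck_table *t, uint64_t pfn)
{
	for (size_t i = 0; i < STUCK_PFNS; i++)
		if (t->slot[i].used && t->slot[i].pfn == pfn)
			return true;
	return false;
}
```

With a zero-initialized table, the fifth distinct PFN pushes out whichever entry was touched least recently, so the memory cost stays bounded no matter how many pages fail to go offline.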
The reason being? Prevent the
On Thu, Mar 23, 2017 at 06:28:39PM +0100, Borislav Petkov wrote:
> Meh, I don't like the idea of keeping an evergrowing list of PFNs we
> can't do anything about anyway.
Keeping every PFN would be overkill (most of them should be taken
offline with no issues). A fixed array of a few of them with
On Thu, Mar 23, 2017 at 10:20:31AM -0700, Luck, Tony wrote:
> It can happen if Linux didn't actually take the page offline
> (because it was a kernel page). The CEC code only knows that
> it queued this page to be taken offline ... and has no way
> to know if that succeeded or not.
Right, that's t
On Thu, Mar 23, 2017 at 04:22:28PM +0100, Borislav Petkov wrote:
> On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> > Lemme try to write a small script exercising exactly that scenario to
> > see whether I'm actually not talking crap here :-)
>
> Ok, here's a snapshot from the CE
On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> Lemme try to write a small script exercising exactly that scenario to
> see whether I'm actually not talking crap here :-)
Ok, here's a snapshot from the CEC after letting it run for a couple of
hours in a guest with a script runni
On Wed, Mar 22, 2017 at 12:00:25PM -0700, Luck, Tony wrote:
> You also need to check that bit 61 of m->status is zero here.
> The collector is hiding uncorrected errors too.
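
For reference, a standalone sketch of the bit-61 test being asked for. MCI_STATUS_UC is the kernel's name for bit 61 of the MCi_STATUS value; struct mce_sketch and is_corrected() are hypothetical stand-ins so that the snippet compiles outside the kernel tree:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 61 of MCi_STATUS: set when the error was not corrected. */
#define MCI_STATUS_UC	(1ULL << 61)

/* Hypothetical stand-in for the one struct mce field used here. */
struct mce_sketch {
	uint64_t status;
};

/* True only for errors that the hardware reported as corrected. */
static bool is_corrected(const struct mce_sketch *m)
{
	return m && !(m->status & MCI_STATUS_UC);
}
```

Only records passing such a test would be candidates for the collector; uncorrected ones would continue down the normal reporting path.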
Good catch.
I think I wanna do something like this:
if (memory_error(m) && !(m->status & MCI_STATUS_UC) ...
as we
On Thu, Mar 09, 2017 at 11:08:17AM +0100, Borislav Petkov wrote:
> +static bool cec_add_mce(struct mce *m)
> +{
> + if (!m)
> + return false;
> +
> + if (memory_error(m) && mce_usable_address(m))
> + if (!cec_add_elem(m->addr >> PAGE_SHIFT))
> + r
On Mon, Mar 20, 2017 at 03:48:24PM -0700, Luck, Tony wrote:
> You added "count_threshold" for me ... so the condition isn't quite
> "overflows" like it was in the early versions.
It is a max count which, when reached, causes the soft offline attempt.
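
A minimal sketch of that "max count" behaviour, under assumptions: struct err_counter and count_error() are made-up names, and resetting the count once the threshold is hit is an illustrative choice, not necessarily what the patch does:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-page record: the PFN and its error count so far. */
struct err_counter {
	uint64_t pfn;
	unsigned int count;
};

/*
 * Account one more correctable error for this page. Returns true when
 * the count reaches @count_threshold, i.e. when the caller should try
 * to soft-offline the page. The count is reset at that point (assumed).
 */
static bool count_error(struct err_counter *e, unsigned int count_threshold)
{
	if (++e->count >= count_threshold) {
		e->count = 0;
		return true;
	}
	return false;
}
```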
What did you mean with "overflows" exactly t
On Thu, Mar 09, 2017 at 11:08:17AM +0100, Borislav Petkov wrote:
> +config RAS_CEC
> + bool "Correctable Errors Collector"
> + depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS
> + ---help---
> + This is a small cache which collects correctable memory errors per 4K
> + page P
On March 9, 2017 11:08:17 AM GMT+01:00, Borislav Petkov wrote:
>From: Borislav Petkov
...
>diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
>index 0bc60a308730..2a2d89d39af6 100644
>--- a/arch/x86/ras/Kconfig
>+++ b/arch/x86/ras/Kconfig
>@@ -7,3 +7,17 @@ config MCE_AMD_INJ
> asp
From: Borislav Petkov
A simple data structure for collecting correctable errors along with
accessors. More detailed description in the code itself.
The error decoding is done with the decoding chain now and
mce_first_notifier() gets to see the error first and the CEC decides
whether to log it an