Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

Borislav Petkov Sat, 20 Apr 2019 02:42:24 -0700

On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote:
> Now there isn't really anything better that CEC can do in
> this situation. It won't help to have a bigger array. Taking
> pages offline wouldn't solve the problem (though if that
> did happen at least it would break the silence).
> 
> Same situation for other DRAM failure modes that affect a
> wide range of pages (rank, bank, perhaps row ... though all
> the errors from a single row failure might fit in the CEC array).
> 
> Allowing the user to bypass CEC (without a reboot ... cloud folks
> hate to reboot their systems) would allow the sysadmin to see
> what is happening (either via /dev/mcelog, or via EDAC driver).


Err, this all sounds to me like the storm detection code should
*automatically* disable the CEC in such cases, I'd say. Because I
don't see a cloud admin going into the debugfs and turning it off.
Rather, if the detection heuristic we use is smart enough, disabling it
automatically should be a much better serviceability action.

Hmmm?
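
To make the idea concrete, here is a rough userspace sketch of such a
heuristic, not kernel code. Everything in it is an assumption made up for
illustration: the struct, the function names, the per-window threshold and
the rule for ending a storm. It counts corrected errors per time window,
starts bypassing the collector once a window reaches the threshold, and
stops bypassing after a window that stayed below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical storm detector; names and thresholds are illustrative only. */
struct storm_detector {
	uint64_t window_start;	/* start of the current window, in seconds */
	uint64_t window_len;	/* window length, in seconds */
	unsigned int count;	/* errors seen in the current window */
	unsigned int threshold;	/* errors per window that count as a storm */
	bool bypass;		/* true: skip the collector, log errors directly */
};

static void storm_init(struct storm_detector *sd, unsigned int threshold,
		       uint64_t window_len)
{
	assert(sd && threshold > 0 && window_len > 0);

	sd->window_start = 0;
	sd->window_len = window_len;
	sd->count = 0;
	sd->threshold = threshold;
	sd->bypass = false;
}

/*
 * Record one corrected error at time @now (seconds, monotonic).
 * Returns true if the collector should be bypassed for this error.
 */
static bool storm_note_error(struct storm_detector *sd, uint64_t now)
{
	uint64_t elapsed;

	assert(sd);

	elapsed = now - sd->window_start;
	if (elapsed >= sd->window_len) {
		/*
		 * Window rollover. The storm is over if the last window
		 * stayed below the threshold, or if at least one whole
		 * window passed without any error at all.
		 */
		if (sd->count < sd->threshold || elapsed >= 2 * sd->window_len)
			sd->bypass = false;

		sd->window_start = now;
		sd->count = 0;
	}

	sd->count++;
	if (sd->count >= sd->threshold)
		sd->bypass = true;

	return sd->bypass;
}
```

With a threshold of 3 errors per 10 second window, the third error in a
window flips the detector into bypass mode, and it stays there until a
later window sees fewer than 3 errors.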

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
