Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2019 at 10:44:15AM -0700, Luck, Tony wrote: > Yes. Automating this would be a very good idea. Yeah, in general integrating the CEC better with the rest of the error chain is something we still need to discuss and do. > In the case of many errors at different addresses we are delet

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Luck, Tony
On Mon, Apr 22, 2019 at 07:15:32PM +0200, Borislav Petkov wrote: > On Mon, Apr 22, 2019 at 03:59:16PM +, Luck, Tony wrote: > > > Err, this all sounds to me like the storm detection code should > > > *automatically* disable the CEC in such cases, I'd say. > > > > Sounds good. But we should dist

RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Luck, Tony
> I think we're talking past each other here: I mean disable the CEC > *forever* and *never* use it. Use only a userspace agent and log errors > with it. > > Makes sense? Not really. We want pretty much everyone to enable and use CEC. That way people don't bother us about the occasional neutron s

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2019 at 03:59:16PM +, Luck, Tony wrote: > > Err, this all sounds to me like the storm detection code should > > *automatically* disable the CEC in such cases, I'd say. > > Sounds good. But we should distinguish storms that have many different > addresses from storms that just p

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2019 at 04:43:58PM +, Luck, Tony wrote: > >> Rebooting isn't popular in many end user situations. Many CSP (cloud > >> service providers) vehemently hate the idea of rebooting. > > > > I meant disable in Kconfig - not build it in at all. > > If rebooting is bad, then re-compili

RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Luck, Tony
>> Rebooting isn't popular in many end user situations. Many CSP (cloud >> service providers) vehemently hate the idea of rebooting. > > I meant disable in Kconfig - not build it in at all. If rebooting is bad, then re-compiling and rebooting is 100x worse. :-) -Tony

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Borislav Petkov
On Mon, Apr 22, 2019 at 04:29:35PM +, Luck, Tony wrote: > > Now, if you still want to know how many errors and where they happened > > and when they happened and yadda yadda, you *disable* the CEC. > > Rebooting isn't popular in many end user situations. Many CSP (cloud > service providers) ve

RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Luck, Tony
> Now, if you still want to know how many errors and where they happened > and when they happened and yadda yadda, you *disable* the CEC. Rebooting isn't popular in many end user situations. Many CSP (cloud service providers) vehemently hate the idea of rebooting. -Tony

RE: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-22 Thread Luck, Tony
> Err, this all sounds to me like the storm detection code should > *automatically* disable the CEC in such cases, I'd say. Sounds good. But we should distinguish storms that have many different addresses from storms that just ping a few addresses. CEC will see counts hit the threshold in the lat

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-20 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 03:02:29PM -0700, Tony Luck wrote: > Useful when running error injection tests that want to > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog. > > Signed-off-by: Tony Luck > --- > drivers/ras/cec.c | 20 +++- > 1 file changed, 19 insertions(+),

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-20 Thread Cong Wang
On Sat, Apr 20, 2019 at 11:47 AM Borislav Petkov wrote: > IOW, when you have the CEC enabled, you don't need to log memory errors > with a userspace agent. The CEC collects them and discards them if they > don't repeat. So, you mean breaking mcelog is intentional? If so, why not break it loudly

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-20 Thread Borislav Petkov
On Sat, Apr 20, 2019 at 11:18:46AM -0700, Cong Wang wrote: > You didn't answer my question here, because I asked you whether > the following change (PoC only) makes sense: I answered it - the answer is to disable CONFIG_RAS_CEC. But let me do a more detailed answer, maybe that'll help. The PoC do

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-20 Thread Cong Wang
On Sat, Apr 20, 2019 at 2:13 AM Borislav Petkov wrote: > > On Fri, Apr 19, 2019 at 10:43:03PM -0700, Cong Wang wrote: > > With this change, although not even compiled, mcelog should still > > receive correctable memory errors like before, even when we have > > CONFIG_RAS_CEC=y. > > > > Does this m

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-20 Thread Borislav Petkov
On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote: > Now there isn't really anything better that CEC can do in > this situation. It won't help to have a bigger array. Taking > pages offline wouldn't solve the problem (though if that > did happen at least it would break the silence). > > S

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-20 Thread Borislav Petkov
On Fri, Apr 19, 2019 at 10:43:03PM -0700, Cong Wang wrote: > With this change, although not even compiled, mcelog should still > receive correctable memory errors like before, even when we have > CONFIG_RAS_CEC=y. > > Does this make any sense to you? Yes, the answer is in the mail you snipped. Di

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-19 Thread Cong Wang
On Thu, Apr 18, 2019 at 5:07 PM Luck, Tony wrote: > > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote: > > Which reminds me, Tony, I think all those debugging files "pfn" > > and "array" and the one you add now, should all be under a > > CONFIG_RAS_CEC_DEBUG which is default off an

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-19 Thread Cong Wang
On Thu, Apr 18, 2019 at 5:26 PM Borislav Petkov wrote: > > Now, if any of that above still doesn't make it clear, please state what > you're trying to achieve and I'll try to help. Sorry that I misled you to believe we don't even enable CONFIG_X86_MCELOG_LEGACY. Here is what we have and what we h

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-19 Thread Luck, Tony
On Fri, Apr 19, 2019 at 02:29:11AM +0200, Borislav Petkov wrote: > On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote: > > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote: > > > Which reminds me, Tony, I think all those debugging files "pfn" > > > and "array" and the one you

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote: > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote: > > Which reminds me, Tony, I think all those debugging files "pfn" > > and "array" and the one you add now, should all be under a > > CONFIG_RAS_CEC_DEBUG which is default

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 04:58:22PM -0700, Cong Wang wrote: > No, it is all about whether we should break users' expectation. What user expectation? > This doesn't sound like a valid reason for us to break users' > expectation. I think it is *you* who has some sort of "expectation" but that "exp

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Luck, Tony
On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote: > Which reminds me, Tony, I think all those debugging files "pfn" > and "array" and the one you add now, should all be under a > CONFIG_RAS_CEC_DEBUG which is default off and used only for development. > Mind adding that too pls? Pat

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Cong Wang
On Thu, Apr 18, 2019 at 4:29 PM Borislav Petkov wrote: > > On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote: > > On Thu, Apr 18, 2019 at 3:02 PM Tony Luck wrote: > > > > > > Useful when running error injection tests that want to > > > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/m

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote: > On Thu, Apr 18, 2019 at 3:02 PM Tony Luck wrote: > > > > Useful when running error injection tests that want to > > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog. > > > > Signed-off-by: Tony Luck > > We saw the same proble

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Cong Wang
On Thu, Apr 18, 2019 at 3:02 PM Tony Luck wrote: > > Useful when running error injection tests that want to > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog. > > Signed-off-by: Tony Luck We saw the same problem, CONFIG_RAS hijacks all the correctable memory errors, which leaves mcelo

[PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Tony Luck
Useful when running error injection tests that want to see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog. Signed-off-by: Tony Luck --- drivers/ras/cec.c | 20 +++- 1 file changed, 19 insertions(+), 1 deletion(-) diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c index 2