On Mon, Apr 22, 2019 at 10:44:15AM -0700, Luck, Tony wrote:
> Yes. Automating this would be a very good idea.
Yeah, in general integrating the CEC better with the rest of the error
chain is something we still need to discuss and do.
> In the case of many errors at different addresses we are delet
On Mon, Apr 22, 2019 at 07:15:32PM +0200, Borislav Petkov wrote:
> On Mon, Apr 22, 2019 at 03:59:16PM +, Luck, Tony wrote:
> > > Err, this all sounds to me like the storm detection code should
> > > *automatically* disable the CEC in such cases, I'd say.
> >
> > Sounds good. But we should dist
> I think we're talking past each other here: I mean disable the CEC
> *forever* and *never* use it. Use only a userspace agent and log errors
> with it.
>
> Makes sense?
Not really. We want pretty much everyone to enable and use CEC. That way
people don't bother us about the occasional neutron s
On Mon, Apr 22, 2019 at 03:59:16PM +, Luck, Tony wrote:
> > Err, this all sounds to me like the storm detection code should
> > *automatically* disable the CEC in such cases, I'd say.
>
> Sounds good. But we should distinguish storms that have many different
> addresses from storms that just p
On Mon, Apr 22, 2019 at 04:43:58PM +, Luck, Tony wrote:
> >> Rebooting isn't popular in many end user situations. Many CSP (cloud
> >> service providers) vehemently hate the idea of rebooting.
> >
> > I meant disable in Kconfig - not build it in at all.
>
> If rebooting is bad, then re-compili
>> Rebooting isn't popular in many end user situations. Many CSP (cloud
>> service providers) vehemently hate the idea of rebooting.
>
> I meant disable in Kconfig - not build it in at all.
If rebooting is bad, then re-compiling and rebooting is 100x worse. :-)
-Tony
On Mon, Apr 22, 2019 at 04:29:35PM +, Luck, Tony wrote:
> > Now, if you still want to know how many errors and where they happened
> > and when they happened and yadda yadda, you *disable* the CEC.
>
> Rebooting isn't popular in many end user situations. Many CSP (cloud
> service providers) ve
> Now, if you still want to know how many errors and where they happened
> and when they happened and yadda yadda, you *disable* the CEC.
Rebooting isn't popular in many end user situations. Many CSP (cloud
service providers) vehemently hate the idea of rebooting.
-Tony
> Err, this all sounds to me like the storm detection code should
> *automatically* disable the CEC in such cases, I'd say.
Sounds good. But we should distinguish storms that have many different
addresses from storms that just ping a few addresses. CEC will see counts
hit the threshold in the lat
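A minimal sketch of the distinction drawn above, between a storm spread over many different addresses and one that keeps pinging a few. It is not taken from the storm detection code or from drivers/ras/cec.c. The names `count_distinct`, `classify_storm` and the `few_limit` parameter are hypothetical, and classifying by the number of distinct page frames in a burst is only one assumed way to tell the two cases apart.

```c
#include <stddef.h>
#include <stdint.h>

enum storm_kind { STORM_FEW_ADDRESSES, STORM_MANY_ADDRESSES };

/* Count distinct page frame numbers in a burst (O(n^2), fine for a sketch). */
static size_t count_distinct(const uint64_t *pfn, size_t n)
{
	size_t distinct = 0;

	for (size_t i = 0; i < n; i++) {
		size_t j;

		for (j = 0; j < i; j++)
			if (pfn[j] == pfn[i])
				break;
		if (j == i)	/* first time this pfn appears */
			distinct++;
	}
	return distinct;
}

/* few_limit is a hypothetical tunable: at most this many distinct pages
 * counts as "a few addresses". */
static enum storm_kind classify_storm(const uint64_t *pfn, size_t n,
				      size_t few_limit)
{
	return count_distinct(pfn, n) <= few_limit ? STORM_FEW_ADDRESSES
						   : STORM_MANY_ADDRESSES;
}
```

With a burst such as {7, 9, 7, 9, 7, 9} and `few_limit` set to 2, the sketch reports a few-address storm, since the repeated hits land on the same two pages.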
On Thu, Apr 18, 2019 at 03:02:29PM -0700, Tony Luck wrote:
> Useful when running error injection tests that want to
> see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
>
> Signed-off-by: Tony Luck
> ---
> drivers/ras/cec.c | 20 +++-
> 1 file changed, 19 insertions(+),
On Sat, Apr 20, 2019 at 11:47 AM Borislav Petkov wrote:
> IOW, when you have the CEC enabled, you don't need to log memory errors
> with a userspace agent. The CEC collects them and discards them if they
> don't repeat.
So, you mean breaking mcelog is intentional? If so, why not break it
loudly
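A minimal sketch of the behaviour quoted above ("The CEC collects them and discards them if they don't repeat"), for readers who want to see why a userspace logger stops seeing isolated errors. This is not the implementation in drivers/ras/cec.c. The structure, `sketch_add`, the slot count and the threshold are all hypothetical, and evicting the least-hit entry when the table is full is an assumption made to keep the example small.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOTS     4	/* hypothetical capacity, tiny for illustration */
#define SKETCH_THRESHOLD 3	/* hypothetical per-page error limit */

struct sketch_entry {
	uint64_t pfn;
	unsigned int count;
	bool used;
};

struct sketch_cec {
	struct sketch_entry slot[SKETCH_SLOTS];
};

/*
 * Record one corrected error on page frame pfn.
 * Returns true only when that page has reached the threshold, i.e. the
 * error repeated often enough to be worth acting on. Errors that never
 * repeat are absorbed silently and eventually evicted.
 */
static bool sketch_add(struct sketch_cec *c, uint64_t pfn)
{
	struct sketch_entry *victim = NULL;

	for (size_t i = 0; i < SKETCH_SLOTS; i++) {
		struct sketch_entry *e = &c->slot[i];

		if (e->used && e->pfn == pfn) {
			if (++e->count >= SKETCH_THRESHOLD) {
				e->used = false;	/* hand off, free the slot */
				return true;
			}
			return false;
		}

		/* Prefer a free slot; otherwise remember the least-hit entry. */
		if (!e->used) {
			if (!victim || victim->used)
				victim = e;
		} else if (!victim ||
			   (victim->used && e->count < victim->count)) {
			victim = e;
		}
	}

	/* New page: take the free slot or overwrite the least-hit entry. */
	victim->pfn = pfn;
	victim->count = 1;
	victim->used = true;
	return false;
}
```

In this sketch a page that errors three times is reported on the third hit, while any number of one-off errors on distinct pages never produces a report.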
On Sat, Apr 20, 2019 at 11:18:46AM -0700, Cong Wang wrote:
> You didn't answer my question here, because I asked you whether
> the following change (PoC only) makes sense:
I answered it - the answer is to disable CONFIG_RAS_CEC. But let me do a
more detailed answer, maybe that'll help.
The PoC do
On Sat, Apr 20, 2019 at 2:13 AM Borislav Petkov wrote:
>
> On Fri, Apr 19, 2019 at 10:43:03PM -0700, Cong Wang wrote:
> > With this change, although not even compiled, mcelog should still
> > receive correctable memory errors like before, even when we have
> > CONFIG_RAS_CEC=y.
> >
> > Does this m
On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote:
> Now there isn't really anything better that CEC can do in
> this situation. It won't help to have a bigger array. Taking
> pages offline wouldn't solve the problem (though if that
> did happen at least it would break the silence).
>
> S
On Fri, Apr 19, 2019 at 10:43:03PM -0700, Cong Wang wrote:
> With this change, although not even compiled, mcelog should still
> receive correctable memory errors like before, even when we have
> CONFIG_RAS_CEC=y.
>
> Does this make any sense to you?
Yes, the answer is in the mail you snipped. Di
On Thu, Apr 18, 2019 at 5:07 PM Luck, Tony wrote:
>
> On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > Which reminds me, Tony, I think all those debugging files "pfn"
> > and "array" and the one you add now, should all be under a
> > CONFIG_RAS_CEC_DEBUG which is default off an
On Thu, Apr 18, 2019 at 5:26 PM Borislav Petkov wrote:
>
> Now, if any of that above still doesn't make it clear, please state what
> you're trying to achieve and I'll try to help.
Sorry that I misled you to believe we don't even enable
CONFIG_X86_MCELOG_LEGACY. Here is what we have and
what we h
On Fri, Apr 19, 2019 at 02:29:11AM +0200, Borislav Petkov wrote:
> On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > > Which reminds me, Tony, I think all those debugging files "pfn"
> > > and "array" and the one you
On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > Which reminds me, Tony, I think all those debugging files "pfn"
> > and "array" and the one you add now, should all be under a
> > CONFIG_RAS_CEC_DEBUG which is default
On Thu, Apr 18, 2019 at 04:58:22PM -0700, Cong Wang wrote:
> No, it is all about whether we should break users' expectation.
What user expectation?
> This doesn't sound like a valid reason for us to break users'
> expectation.
I think it is *you* who has some sort of "expectation" but that
"exp
On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> Which reminds me, Tony, I think all those debugging files "pfn"
> and "array" and the one you add now, should all be under a
> CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> Mind adding that too pls?
Pat
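A minimal sketch of the kind of compile-time gating requested above, where debug-only files exist only when a debug option is enabled. It is a userspace illustration, not the patch from this thread: `sketch_create_debug_files` is a hypothetical helper, and the real debug files would be created through kernel interfaces that are not shown here.

```c
#include <stdio.h>

/* Build with -DCONFIG_RAS_CEC_DEBUG to compile the debug path in. */
#ifdef CONFIG_RAS_CEC_DEBUG
static int sketch_create_debug_files(void)
{
	/* Stand-in for creating the "pfn" and "array" debug files. */
	puts("debug: would create pfn, array");
	return 2;	/* number of debug files set up */
}
#else
static inline int sketch_create_debug_files(void)
{
	return 0;	/* debug support compiled out */
}
#endif
```

Callers can invoke the helper unconditionally. With the option off, the stub compiles to nothing and no debug files appear.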
On Thu, Apr 18, 2019 at 4:29 PM Borislav Petkov wrote:
>
> On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> > On Thu, Apr 18, 2019 at 3:02 PM Tony Luck wrote:
> > >
> > > Useful when running error injection tests that want to
> > > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/m
On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> On Thu, Apr 18, 2019 at 3:02 PM Tony Luck wrote:
> >
> > Useful when running error injection tests that want to
> > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> >
> > Signed-off-by: Tony Luck
>
> We saw the same proble
On Thu, Apr 18, 2019 at 3:02 PM Tony Luck wrote:
>
> Useful when running error injection tests that want to
> see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
>
> Signed-off-by: Tony Luck
We saw the same problem, CONFIG_RAS hijacks all the
correctable memory errors, which leaves mcelo
Useful when running error injection tests that want to
see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
Signed-off-by: Tony Luck
---
drivers/ras/cec.c | 20 +++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 2