mce: Add sysctl control for recovery action on MCE.

Aneesh Kumar K.V Wed, 08 Aug 2018 08:56:15 -0700

On 08/08/2018 08:26 PM, Michael Ellerman wrote:

Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:

From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>


Introduce recovery action for recovered memory errors (MCEs). There are
soft memory errors like SLB Multihit, which can be a result of a bad
hardware OR software BUG. Kernel can easily recover from these soft errors
by flushing SLB contents. After the recovery kernel can still continue to
function without any issue. But in some scenarios we may keep getting
these soft errors until the root cause is fixed. To be able to analyze and
find the root cause, best way is to gather enough data and system state at
the time of MCE. Hence this patch introduces a sysctl knob where user can
decide either to continue after recovery or panic the kernel to capture the
dump.


I'm not convinced we want this.

As we've discovered it's often not possible to reconstruct what happened
based on a dump anyway.

The key thing you need is the content of the SLB and that's not included
in a dump.

So I think we should dump the SLB content when we get the MCE (which
this series does) and any other useful info, and then if we can recover
we should.

The reasoning there is: what if we got a multi-hit due to some corruption
in slb_cache_ptr, i.e. some part of the kernel is wrongly updating the paca
data structure through a wrong pointer? Now that is far-fetched, but then
possible, right? Hence the idea that, if we don't get much insight into
why an SLB multi-hit occurred from the dmesg output, which includes the SLB
contents, slb_cache contents etc., there should be an easy way to force a
dump that might assist in further debugging.


-aneesh
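
The behaviour under discussion (continue after a recovered MCE, or panic
so that a dump can be captured) can be sketched roughly as below. This is
only an illustrative userspace model, not code from the patch: the enum
names, the function mce_decide() and the knob values 0/1 are all
hypothetical, and treating an unrecovered error as fatal is an assumption
made to keep the sketch small.

```c
#include <stdbool.h>

/* Hypothetical knob values; the real sysctl may use other names/values. */
enum mce_recovery_action {
	MCE_ACTION_CONTINUE = 0,	/* keep running after a recovered MCE */
	MCE_ACTION_PANIC    = 1,	/* panic so that a dump can be captured */
};

/* What the handler would do next in this toy model. */
enum mce_outcome {
	MCE_OUTCOME_RESUME,
	MCE_OUTCOME_PANIC,
};

/*
 * Decide what to do after the MCE handler has attempted recovery
 * (e.g. flushing the SLB after a multi-hit).
 *
 * Assumption: an error that could not be recovered is treated as fatal
 * here, regardless of the knob. The knob only matters for recovered errors.
 */
static enum mce_outcome mce_decide(bool recovered,
				   enum mce_recovery_action knob)
{
	if (!recovered)
		return MCE_OUTCOME_PANIC;

	if (knob == MCE_ACTION_PANIC)
		return MCE_OUTCOME_PANIC;	/* user asked for a dump */

	return MCE_OUTCOME_RESUME;		/* default: carry on */
}
```

With the knob left at its (assumed) default of 0, a recovered SLB
multi-hit would simply resume; setting it to 1 would turn the same
recovered error into a panic, which is the "easy way to force a dump"
mentioned above.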

Re: [PATCH v7 8/9] powerpc/mce: Add sysctl control for recovery action on MCE.
