On Tue, 28 Feb 2017 11:27:29 +0530 Mahesh Jagannath Salgaonkar <mah...@linux.vnet.ibm.com> wrote:
> On 02/28/2017 07:30 AM, Nicholas Piggin wrote: > > A synchronous machine check is an exception raised by the attempt to > > execute the current instruction. If the error can't be corrected, it > > can make sense to SIGBUS the currently running process. > > > > In other cases, the error condition is not related to the current > > instruction, so killing the current process is not the right thing to > > do. > > > > Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no > > practical change. It will be used to handle POWER9 asynchronous > > machine checks. > > > > Signed-off-by: Nicholas Piggin <npig...@gmail.com> > > --- > > arch/powerpc/platforms/powernv/opal.c | 21 ++++++--------------- > > 1 file changed, 6 insertions(+), 15 deletions(-) > > > > diff --git a/arch/powerpc/platforms/powernv/opal.c > > b/arch/powerpc/platforms/powernv/opal.c > > index 86d9fde93c17..e0f856bfbfe8 100644 > > --- a/arch/powerpc/platforms/powernv/opal.c > > +++ b/arch/powerpc/platforms/powernv/opal.c > > @@ -395,7 +395,6 @@ static int opal_recover_mce(struct pt_regs *regs, > > struct machine_check_event *evt) > > { > > int recovered = 0; > > - uint64_t ea = get_mce_fault_addr(evt); > > > > if (!(regs->msr & MSR_RI)) { > > /* If MSR_RI isn't set, we cannot recover */ > > @@ -404,26 +403,18 @@ static int opal_recover_mce(struct pt_regs *regs, > > } else if (evt->disposition == MCE_DISPOSITION_RECOVERED) { > > /* Platform corrected itself */ > > recovered = 1; > > - } else if (ea && !is_kernel_addr(ea)) { > > + } else if (evt->severity == MCE_SEV_FATAL) { > > + /* Fatal machine check */ > > + pr_err("Machine check interrupt is fatal\n"); > > + recovered = 0; > > Setting recovered = 0 would trigger kernel panic. Should we panic the > kernel for asynchronous errors ? If it's not recoverable, I don't see what other option we have. SRR0 is meaningless for async machine checks. So it's much the same thing we do as if we don't have a process to kill or were running in kernel when a synchronous MCE occurred. Thanks, Nick