On Tue, 9 Oct 2018 14:01:37 +0200 Christophe LEROY <christophe.le...@c-s.fr> wrote:
> On 09/10/2018 at 13:16, Nicholas Piggin wrote:
> > On Tue, 9 Oct 2018 09:36:18 +0000
> > Christophe Leroy <christophe.le...@c-s.fr> wrote:
> >
> >> On 10/09/2018 05:30 AM, Nicholas Piggin wrote:
> >>> On Tue, 9 Oct 2018 06:46:30 +0200
> >>> Christophe LEROY <christophe.le...@c-s.fr> wrote:
> >>>
> >>>> On 09/10/2018 at 06:32, Nicholas Piggin wrote:
> >>>>> On Mon, 8 Oct 2018 17:39:11 +0200
> >>>>> Christophe LEROY <christophe.le...@c-s.fr> wrote:
> >>>>>
> >>>>>> Hi Nick,
> >>>>>>
> >>>>>> On 19/07/2017 at 08:59, Nicholas Piggin wrote:
> >>>>>>> Use nmi_enter similarly to system reset interrupts. This uses NMI
> >>>>>>> printk NMI buffers and turns off various debugging facilities that
> >>>>>>> helps avoid tripping on ourselves or other CPUs.
> >>>>>>>
> >>>>>>> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
> >>>>>>> ---
> >>>>>>>  arch/powerpc/kernel/traps.c | 9 ++++++---
> >>>>>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> >>>>>>> index 2849c4f50324..6d31f9d7c333 100644
> >>>>>>> --- a/arch/powerpc/kernel/traps.c
> >>>>>>> +++ b/arch/powerpc/kernel/traps.c
> >>>>>>> @@ -789,8 +789,10 @@ int machine_check_generic(struct pt_regs *regs)
> >>>>>>>
> >>>>>>>  void machine_check_exception(struct pt_regs *regs)
> >>>>>>>  {
> >>>>>>> -       enum ctx_state prev_state = exception_enter();
> >>>>>>>         int recover = 0;
> >>>>>>> +       bool nested = in_nmi();
> >>>>>>> +       if (!nested)
> >>>>>>> +               nmi_enter();
> >>>>>>
> >>>>>> This alters preempt_count, then when die() is called
> >>>>>> in_interrupt() returns true although the trap didn't happen in
> >>>>>> interrupt, so oops_end() panics for "fatal exception in interrupt"
> >>>>>> instead of gently sending SIGBUS to the faulting app.
> >>>>>
> >>>>> Thanks for tracking that down.
> >>>>>
> >>>>>> Any idea on how to fix this ?
> >>>>>
> >>>>> I would say we have to deliver the sigbus by hand.
> >>>>>
> >>>>>         if ((user_mode(regs)))
> >>>>>                 _exception(SIGBUS, regs, BUS_MCEERR_AR, regs->nip);
> >>>>>         else
> >>>>>                 die("Machine check", regs, SIGBUS);
> >>>>>
> >>>>
> >>>> And what about all the other things done by 'die()' ?
> >>>>
> >>>> And what if it is a kernel thread ?
> >>>>
> >>>> In one of my boards, I have a kernel thread regularly checking the HW,
> >>>> and if it gets a machine check I expect it to gently stop and the die
> >>>> notification to be delivered to all registered notifiers.
> >>>>
> >>>> Until before this patch, it was working well.
> >>>
> >>> I guess the alternative is we could check regs->trap for machine
> >>> check in the die test. Complication is having to account for MCE
> >>> in an interrupt handler.
> >>>
> >>>         if (in_interrupt()) {
> >>>                 if (!IS_MCHECK_EXC(regs) || (irq_count() - (NMI_OFFSET
> >>>                                 + HARDIRQ_OFFSET)))
> >>>                         panic("Fatal exception in interrupt");
> >>>         }
> >>>
> >>> Something like that might work for you? We need a ppc64 macro for the
> >>> MCE, and can probably add something like in_nmi_from_interrupt() for
> >>> the second part of the test.
> >>
> >> Don't know, I'm away from home on business trip so I won't be able to
> >> test anything before next week. However it looks more or less like a
> >> hack, doesn't it ?
> >
> > I thought it seemed okay (with the right functions added). Actually it
> > could be a bit nicer to do this, then it works generally :
> >
> >         if (in_interrupt()) {
> >                 if (!in_nmi() || in_nmi_from_interrupt())
> >                         panic("Fatal exception in interrupt");
> >         }
> >
> Yes looks nice, but:
> 1/ what is in_nmi_from_interrupt() ? Is it (in_nmi() && (in_irq() ||
> in_softirq())) ?

        return (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)) != 0;

(basically just in_interrupt() with the nmi_enter undone)

> 2/ what about in_nmi_from_nmi(), how do we detect that ?

Oh good point, I'm not sure. I guess we could irq_enter() in the
nested case, I think that would make in_nmi_from_interrupt() return
true.
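
To make the counting argument concrete, here is a rough userspace model of the proposed test. It is only a sketch, not kernel code: the bit layout (8 softirq bits at shift 8, 4 hardirq bits at shift 16, 1 NMI bit at shift 20) is an assumption modelled on include/linux/preempt.h of that period, model_count stands in for preempt_count, would_panic() and check_scenarios() are hypothetical helpers, and in_nmi_from_interrupt() does not exist in the tree yet.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout, modelled on include/linux/preempt.h of that era. */
#define SOFTIRQ_SHIFT   8
#define HARDIRQ_SHIFT   16
#define NMI_SHIFT       20

#define HARDIRQ_OFFSET  (1UL << HARDIRQ_SHIFT)
#define NMI_OFFSET      (1UL << NMI_SHIFT)

#define SOFTIRQ_MASK    (0xffUL << SOFTIRQ_SHIFT)
#define HARDIRQ_MASK    (0xfUL << HARDIRQ_SHIFT)
#define NMI_MASK        (0x1UL << NMI_SHIFT)

/* Stand-in for the per-task preempt_count (hypothetical name). */
static unsigned long model_count;

static unsigned long irq_count(void)
{
	return model_count & (SOFTIRQ_MASK | HARDIRQ_MASK | NMI_MASK);
}

static bool in_interrupt(void) { return irq_count() != 0; }
static bool in_nmi(void)       { return (model_count & NMI_MASK) != 0; }

/* Simplified: only the preempt_count side effects are modelled. */
static void irq_enter(void) { model_count += HARDIRQ_OFFSET; }
static void nmi_enter(void) { model_count += NMI_OFFSET + HARDIRQ_OFFSET; }

/* Proposed helper: in_interrupt() with the nmi_enter() undone. */
static bool in_nmi_from_interrupt(void)
{
	return (irq_count() - (NMI_OFFSET + HARDIRQ_OFFSET)) != 0;
}

/* The proposed test in the die path, returning instead of panicking. */
static bool would_panic(void)
{
	if (in_interrupt()) {
		if (!in_nmi() || in_nmi_from_interrupt())
			return true;
	}
	return false;
}

/* Walk through the cases discussed in this thread. */
static void check_scenarios(void)
{
	/* MCE taken from process context: no panic, die() can proceed. */
	model_count = 0;
	nmi_enter();
	assert(!would_panic());

	/* MCE taken while in a hard interrupt handler: panic. */
	model_count = 0;
	irq_enter();
	nmi_enter();
	assert(would_panic());

	/* Ordinary oops in a hard interrupt handler: panic as before. */
	model_count = 0;
	irq_enter();
	assert(would_panic());

	/* Nested MCE, with irq_enter() done in the nested case: panic. */
	model_count = 0;
	nmi_enter();
	irq_enter();
	assert(would_panic());
}
```

Without the extra irq_enter() in the nested case the count is identical to a first-level MCE from process context, which is why the nested case can't be told apart by this test alone.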

Thanks,
Nick