Re: Machine Check in P2010(e500v2)

Joakim Tjernlund Wed, 06 Sep 2017 03:17:21 -0700

On Wed, 2017-09-06 at 10:05 +0000, Laurentiu Tudor wrote:
> Hi Jocke,
> 
> On 09/01/2017 02:32 PM, Joakim Tjernlund wrote:
> > I am trying to debug a Machine Check for a P2010 (e500v2) CPU:
> > 
> > [   28.111816] Caused by (from MCSR=10008): Bus - Read Data Bus Error
> > [   28.117998] Oops: Machine check, sig: 7 [#1]
> > [   28.122263] P1010 RDB
> > [   28.124529] Modules linked in: linux_bcm_knet(PO) linux_user_bde(PO) 
> > linux_kernel_bde(PO)
> > [   28.132718] CPU: 0 PID: 470 Comm: emxp2_hw_bl Tainted: P           O    
> > 4.1.38+ #49
> > [   28.140376] task: db16cd10 ti: df128000 task.ti: df128000
> > [   28.145770] NIP: 00000000 LR: 10a4e404 CTR: 10046c38
> > [   28.150730] REGS: df129f10 TRAP: 0204   Tainted: P           O     
> > (4.1.38+)
> > [   28.157776] MSR: 0002d000 <CE,EE,PR,ME>  CR: 44002428  XER: 00000000
> > [   28.164140] DEAR: b7187000 ESR: 00000000
> > GPR00: 10a4e404 bf86ea30 b7ca94a0 132f9fa8 07006000 07000000 00000000 
> > 132f9fd8
> > GPR08: b7149000 b7159000 0003e000 bf86ea20 24004424 11d6cf7c 00000000 
> > 00000000
> > GPR16: 10f6e29c 10f6c872 10f6db01 0000b541 0000b541 11d92fcc 00000011 
> > 00000001
> > GPR24: 01a4d12d 132ffbf0 11d60000 00000000 07006000 00000000 132f9fa8 
> > 00000000
> > [   28.196375] NIP [00000000]   (null)
> > [   28.199859] LR [10a4e404] 0x10a4e404
> > [   28.203426] Call Trace:
> > [   28.205866] ---[ end trace f456255ddf9bee83 ]---
> > 
> > I cannot figure out why NIP is NULL. It looks like NIP is set to
> > MCSRR0 early on, but maybe it is lost somehow?
> > 
> > Anyhow, looking at entry_32.S:
> >     .globl  mcheck_transfer_to_handler
> > mcheck_transfer_to_handler:
> >     mfspr   r0,SPRN_DSRR0
> >     stw     r0,_DSRR0(r11)
> >     mfspr   r0,SPRN_DSRR1
> >     stw     r0,_DSRR1(r11)
> >     /* fall through */
> > 
> >     .globl  debug_transfer_to_handler
> > debug_transfer_to_handler:
> >     mfspr   r0,SPRN_CSRR0
> >     stw     r0,_CSRR0(r11)
> >     mfspr   r0,SPRN_CSRR1
> >     stw     r0,_CSRR1(r11)
> >     /* fall through */
> > 
> >     .globl  crit_transfer_to_handler
> > crit_transfer_to_handler:
> > 
> > It looks odd that DSRRx is saved in mcheck, CSRRx in debug, and
> > crit saves none. Should this assignment not be shifted down one level?
> > 
> 
> This does indeed look weird. Have you tried moving the SPRN_CSRR*
> saving into the crit section? Any results?


After looking at this somewhat, I think this is intentional and OK.
I sorted out NIP == NULL too:
@@ -996,7 +998,7 @@ int fsl_pci_mcheck_exception(struct pt_regs *regs)
        if (is_in_pci_mem_space(addr)) {
                if (user_mode(regs)) {
                        pagefault_disable();
-                       ret = get_user(regs->nip, &inst);
+                       ret = get_user(inst, (__u32 __user *)regs->nip);
                        pagefault_enable();
                } else {
                        ret = probe_kernel_address(regs->nip, inst);
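
The argument order matters because get_user(x, ptr) reads from ptr and
stores into x, and sets x to zero when the access fails. With the arguments
swapped, regs->nip becomes the destination, which would explain NIP == 0 in
the oops. The following is a user-space sketch of that behaviour only, not
kernel code: mock_get_user, mock_access_ok, struct mock_regs and
mock_user_text are hypothetical stand-ins invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pt_regs; only nip matters here. */
struct mock_regs {
	uintptr_t nip;
};

/* Hypothetical "user memory": one arbitrary fake instruction word. */
static const uint32_t mock_user_text[1] = { 0x7c0004ac };

/* Assumption: only addresses inside mock_user_text count as user space. */
static int mock_access_ok(const void *p)
{
	uintptr_t a = (uintptr_t)p;
	uintptr_t base = (uintptr_t)mock_user_text;

	return a >= base && a < base + sizeof(mock_user_text);
}

/*
 * Mimics the get_user(x, ptr) contract: x is the destination, ptr is the
 * source; on failure x is zeroed and -EFAULT is returned.
 */
#define mock_get_user(x, ptr) \
	(mock_access_ok(ptr) ? ((x) = *(ptr), 0) : ((x) = 0, -EFAULT))

/* Swapped arguments, as before the patch: nip is the destination. */
static int fetch_inst_buggy(struct mock_regs *regs, uint32_t *inst)
{
	uint32_t tmp = 0;
	int ret = mock_get_user(regs->nip, &tmp); /* &tmp is not "user" memory */

	*inst = tmp;
	return ret;
}

/* Patched order: inst is the destination, nip is the source address. */
static int fetch_inst_fixed(struct mock_regs *regs, uint32_t *inst)
{
	return mock_get_user(*inst, (const uint32_t *)regs->nip);
}

/* Runs both variants and checks the outcome; returns 0 on success. */
static int demo(void)
{
	struct mock_regs buggy = { (uintptr_t)mock_user_text };
	struct mock_regs fixed = { (uintptr_t)mock_user_text };
	uint32_t inst = 0;

	assert(fetch_inst_buggy(&buggy, &inst) == -EFAULT);
	assert(buggy.nip == 0);		/* NIP clobbered with zero */

	assert(fetch_inst_fixed(&fixed, &inst) == 0);
	assert(inst == mock_user_text[0]);
	assert(fixed.nip == (uintptr_t)mock_user_text);	/* NIP untouched */
	return 0;
}
```

Calling demo() from a test harness exercises both variants: in the swapped
case the mock returns -EFAULT and nip ends up as 0, while the patched order
leaves nip alone and fills in inst.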

But after this, the CPU is still locked up after a Machine Check. Is this
to be expected? I figured the user space process would get a SIGBUS and the
kernel would resume normal operation.

Scott, maybe you have some idea?

 Jocke
