On Wed, Jan 25, 2017 at 10:46 AM, Thomas De Schampheleire <patrickdeping...@gmail.com> wrote: > Hi, > > We are experiencing kernel panics of the type "Unable to handle kernel paging > request for instruction fetch" but are stuck in our analysis. We would > appreciate any help you can give. > > The problem occurs from time to time on different instances of a particular > embedded systems. The kernel is very old, 2.6.36.4, and runs on an 8-core > Freescale P4080 QorIQ processor (e500mc-based). Upgrading the kernel to a > newer > version is not an option at this point. > > Here is an example panic log: (Panic 1) > > <1>[2497591.165517] Unable to handle kernel paging request for instruction > fetch > <1>[2497591.248027] Faulting instruction address: 0x4837ad68 > <4>[2497591.309541] Oops: Kernel access of bad area, sig: 11 [#1] > <4>[2497591.376218] PREEMPT SMP NR_CPUS=8 P4080 DS > <0>[2497591.427290] last sysfs file: > /sys/module/ndps_b_reboot_helper/parameters/panic_counter > <4>[2497591.524180] Modules linked in: uio_generic_driver > reborn_macfilter ndps_b_reboot_helper ramoops cpld_watchdog physmap_of > sysfs_exports write_unlock ndps_a_cpld reborn_class generic_access > <4>[2497591.726423] NIP: 4837ad68 LR: 4837ad69 CTR: c0009c30 > <4>[2497591.787899] REGS: e3619cc0 TRAP: 0400 Tainted: G W > (2.6.36.4) > <4>[2497591.871243] MSR: 00029002 <EE,ME,CE> CR: 24002844 XER: 00000000 > <4>[2497591.946313] TASK = ec2dc600[444] 'watchdog_monito' THREAD: > e3618000 CPU: 0 > <4>[2497592.028624] GPR00: 4837ad69 e3619d70 ec2dc600 00000001 > 00260100 00000008 00000001 0000039d > <4>[2497592.130812] GPR08: ffffffff 00000000 00029002 c3389040 > 00000000 100d624c 3b9ac9ff 00000201 > <4>[2497592.233007] GPR16: 000003fb 1009d604 1424120d e3638000 > 00000200 00000678 00000001 0000039d > <4>[2497592.335209] GPR24: 1424120d ec1f3800 00000001 4800001c > 4819b729 7fe5fb78 7c641b78 3c60c042 > <4>[2497592.439505] NIP [4837ad68] 0x4837ad68 > <4>[2497592.485353] LR [4837ad69] 0x4837ad69 > <4>[2497592.530159] Call Trace: > <4>[2497592.561424] [e3619d70] [4837ad69] 0x4837ad69 (unreliable) > <4>[2497592.628123] Instruction dump: > <4>[2497592.665641] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX > XXXXXXXX XXXXXXXX XXXXXXXX > <4>[2497592.760511] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX > XXXXXXXX XXXXXXXX XXXXXXXX > <0>[2497592.855933] Kernel panic - not syncing: Fatal exception > <4>[2497592.920568] Call Trace: > <4>[2497592.951911] [e3619bf0] [c0006fe4] show_stack+0x78/0x18c (unreliable) > <4>[2497593.030177] [e3619c30] [c0389c34] panic+0xc0/0x1e8 > <4>[2497593.089648] [e3619c80] [c000ba9c] die+0x1f0/0x1fc > <4>[2497593.148080] [e3619ca0] [c001109c] bad_page_fault+0xb0/0xc8 > <4>[2497593.215872] [e3619cb0] [c000e3f8] handle_page_fault+0x7c/0x80 > <4>[2497593.286798] --- Exception: 400 at 0x4837ad68 > <4>[2497593.286803] LR = 0x4837ad69 > > Our analysis of the above panic log lead to the conclusion > that there is a kernel stack corruption, such that when restoring the link > register (LR) from the stack frame near the end of a function, an invalid > value > is restored, so that the fetching of that instruction upon return from that > function causes an instruction fetch exception (0x400). > This is based on the fact that registers R27-R31 contain values that are > actually coming from the vmlinux image itself, in this case they are parts > from the function smp_message_recv. In another case, we have seen opcodes from > generic_smp_call_function_interrupt. > > In a typical function epilogue (e.g. from doorbell_message_pass) we see: > > 0xc0009de8 <+440>: lwz r0,52(r1) # load value from memory > from address (r1+52) and store into r0 > 0xc0009dec <+444>: lwz r25,20(r1) > 0xc0009df0 <+448>: lwz r26,24(r1) > 0xc0009df4 <+452>: mtlr r0 # store value of r0 into > LR (link register) > 0xc0009df8 <+456>: lwz r27,28(r1) # load value from memory > from address (r1+28) and store into r27 > 0xc0009dfc <+460>: lwz r28,32(r1) # r28 <- (r1+32) > 0xc0009e00 <+464>: lwz r29,36(r1) # r29 <- (r1+36) > 0xc0009e04 <+468>: lwz r30,40(r1) # r30 <- (r1+40) > 0xc0009e08 <+472>: lwz r31,44(r1) # r31 <- (r1+44) > 0xc0009e0c <+476>: addi r1,r1,48 > 0xc0009e10 <+480>: blr # jump to value stored > in link register > > So, when the values at r1 + 52 (and other offsets nearby) are corrupted, then > after this code the registers r0, r25-31 contain bogus values. As r0 is used > to > provision the link register LR, an instruction fetch abort is unavoidable. > > Unfortunately, it is unclear what would cause the corruption of memory. > Moreover, in some of the other kernel panics, we do _not_ see this pattern of > r27-r31 containing values coming from the .text section of the kernel image > itself. There may be different corruption mechanisms at play here. > > While the signature of the different panics varies in several ways, the > recurring aspects are: > - always reported on CPU 0 (even though this is an SMP system with 8 running > cores) > - R0 always contains the same value as LR, confirming that LR is always > restored > from a corrupted R0 value > - the CTR register is always the same (0c0009c30) and points to the function > doorbell_message_pass > - the current userspace task is always the same ('watchdog_monito...') which > is > a simple shell script that reads from a fuse filesystem, but otherwise does > nothing funky. The shell interpreter is busybox ash. > > A developer also wondered whether there could be a problem with memory > coherency between cores. There are several errata existing in this area. > In particular, because some of the panic logs show references to functions in > the inter-core communication area, like smp_message_recv, > doorbell_message_pass > and generic_smp_call_function_interrupt. If data written to memory by one core > is read by another core, but there is an incoherency issue, then the other > core > may read bad data which can lead to other corruptions later. > > Does any of this ring a bell to you? Do you recall a problem that has been > fixed > that shows such symptoms? (I have searched in the commit logs on several > keywords and file changes, but did not find any hits that reminded me of what > we > see). > How should we interpret the fact that CTR is always pointing to > doorbell_message_pass? Could it lead us to the problem? > Do you have any suggestions on how to proceed? > > I include some other kernel panics below, in case it helps.
Any hint about this problem would be very appreciated. What strikes me is the seeming relation to inter-core communication or coherence. Do you know of any previous problems in this area? Thanks, Thomas