Public bug reported:

== Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsu...@in.ibm.com> - 2016-07-18 04:18:33 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]

Contact Information = pavsu...@in.ibm.com

---uname output---
Linux ltc-garri2 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea

Machine Type = P8

---Debugger---
A debugger is not configured

---Steps to Reproduce---
Install Ubuntu 16.10 on a P8 Open Power 8335-GTB machine. Then execute the Frozen PE error injection tests as shown below:

root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc
root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0
root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
0004:00:00.0 0604: 1014:03dc
0

The kernel immediately crashes with an Oops message.

Ubuntu Yakkety Yak (development branch) ltc-garri2 hvc0

ltc-garri2 login: [ 3231.801472] EEH: Frozen PE#0 on PHB#4 detected
[ 3231.801639] EEH: PE location: N/A, PHB location: N/A
[ 3231.802772] Unable to handle kernel paging request for data at address 0x00000010
[ 3231.802841] Faulting instruction address: [1718257727455,3] OPAL: Trying a CPU re-init with flags: 0x1
[1719041828975,3] OPAL: Trying a CPU re-init with flags: 0x2
0xc000000000083c7c
[ 3231.802898] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3231.802944] SMP NR_CPUS=2048 NUMA PowerNV
[ 3231.802994] Modules linked in: ipmi_devintf ip6table_filter ip6_tables iptable_fi[1721485466204,3] PCI-SLOT-0000000000000000 Invalid state 00000000
[1722574789913,3] PCI-SLOT-0000000000000001 Invalid state 00000000
[1725814514431,3] PCI-SLOT-0000000000000002 Invalid state 00000000
[1726903821324,3] PCI-SLOT-0000000000000003 Invalid state 00000000
[1727993145172,3] PCI-SLOT-0000000000000008 Invalid state 00000000

Stack trace output:
[ 3231.804789] Call Trace:
[ 3231.804812] [c000000feee8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 3231.804882] [c000000feee8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 3231.804944] [c000000feee8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 3231.805005] [c000000feee8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 3231.805073] [c000000feee8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 3231.805311] [c000000feee8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 3231.805391] [c000000feee8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 3231.805452] [c000000feee8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4

Oops output:
[ 3231.801472] EEH: Frozen PE#0 on PHB#4 detected
[ 3231.801639] EEH: PE location: N/A, PHB location: N/A
[ 3231.802463] EEH: This PCI device has failed 1 times in the last hour
[ 3231.802465] EEH: Notify device drivers to shutdown
[ 3231.802469] EEH: Collect temporary log
[ 3231.802496] EEH: of node=0004:00:00:0
[ 3231.802499] EEH: PCI device/vendor: 03dc1014
[ 3231.802502] EEH: PCI cmd/status register: 00100106
[ 3231.802505] EEH: Bridge secondary status: 0000
[ 3231.802508] EEH: Bridge control: 0002
[ 3231.802509] EEH: PCI-E capabilities and status follow:
[ 3231.802518] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 3231.802525] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 3231.802527] EEH: PCI-E 20: 00000000
[ 3231.802529] EEH: PCI-E AER capability register set follows:
[ 3231.802537] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 3231.802544] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 3231.802551] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 3231.802554] EEH: PCI-E AER 30: 00000000 00000000
[ 3231.802556] PHB3 PHB#4 Diag-data (Version: 1)
[ 3231.802558] brdgCtl: 00000002
[ 3231.802560] UtlSts: 00080000 00000000 00000000
[ 3231.802562] RootSts: 00000040 00000000 01010008 00100102 00000000
[ 3231.802564] PhbSts: 0000001c00000000 0000001c00000000
[ 3231.802567] Lem: 0000000000100000 42498e367f502eae 0000000000000000
[ 3231.802569] InAErr: 4000000000000000 4000000000000000 0202000000000000 0000000000000000
[ 3231.802571] PE[ 0] A/B: 8440002b00000000 8000000000000000
[ 3231.802574] EEH: Reset with hotplug activity
[ 3231.802590] pci_bus 0004:01: busn_res: [bus 01] is released
[ 3231.802772] Unable to handle kernel paging request for data at address 0x00000010
[ 3231.802841] Faulting instruction address: 0xc000000000083c7c
[ 3231.802898] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3231.802944] SMP NR_CPUS=2048 NUMA PowerNV
[ 3231.802994] Modules linked in: ipmi_devintf ip6table_filter ip6_tables iptable_filter ip_tables x_tables joydev input_leds mac_hid hid_generic usbhid hid at24 ofpart cmdlinepart powernv_flash ipmi_powernv uio_pdrv_genirq ipmi_msghandler mtd uio opal_prd powernv_rng ibmpowernv autofs4 uas usb_storage nouveau ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci libahci mlx5_core
[ 3231.803524] CPU: 10 PID: 611 Comm: eehd Not tainted 4.4.0-31-generic #50-Ubuntu
[ 3231.803583] task: c000000feee02a20 ti: c000000feee88000 task.ti: c000000feee88000
[ 3231.803644] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 3231.803707] REGS: c000000feee8b760 TRAP: 0300 Not tainted (4.4.0-31-generic)
[ 3231.803765] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008822 XER: 00000000
[ 3231.803915] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 SOFTE: 1
GPR00: c000000000083c78 c000000feee8b9e0 c0000000015b5d00 0000000000000000
GPR04: 0000000000000001 c000000feee8bac0 c000001e4ec732b0 0000000000000ff0
GPR08: 0000000000000000 0000000000000000 0000000000000000 000000000000001b
GPR12: c000000000083c20 c000000007b25f00 c0000000000e6318 c000001e4ecf0340
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000d42468
GPR24: c000000000d42440 0000000000000100 c000000000036460 0000000000000000
GPR28: c00000000161a3f0 0000000000000001 c000001fff766780 c000001e4ed44000
[ 3231.804708] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 3231.804749] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 3231.804789] Call Trace:
[ 3231.804812] [c000000feee8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 (unreliable)
[ 3231.804882] [c000000feee8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 3231.804944] [c000000feee8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 3231.805005] [c000000feee8bba0] [c00000000003c520] eeh_handle_normal_event+0x390/0x440
[ 3231.805073] [c000000feee8bc20] [c00000000003c9c4] eeh_handle_event+0x184/0x370
[ 3231.805311] [c000000feee8bcd0] [c00000000003cd88] eeh_event_handler+0x1d8/0x1e0
[ 3231.805391] [c000000feee8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 3231.805452] [c000000feee8be30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[ 3231.805519] Instruction dump:
[ 3231.805550] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 2f890002
[ 3231.805653] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc e9290010
[ 3231.811881] ---[ end trace e5268555486ccf38 ]---
[ 3231.890155]
[ 3231.890244] Sending IPI to other CPUs
[ 3231.891305] IPI complete
[ 3231.892479] kexec: waiting for cpu 1 (physical 17) to enter OPAL
[ 3231.894550] kexec: waiting for cpu 24 (physical 48) to enter OPAL

System Dump Info:
The system is not configured to capture a system dump.

*Additional Instructions for pavsu...@in.ibm.com:
-Post a private note with access information to the machine that the bug is occurring on.
-Attach sysctl -a output to the bug.

== Comment: #1 - PAVAMAN SUBRAMANIYAM <pavsu...@in.ibm.com> - 2016-07-18 04:21:50 ==
The two patches below need to be incorporated to fix this issue.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

== Comment: #6 - Guo Wen Shan <gws...@au1.ibm.com> - 2016-07-18 22:38:44 ==
Only one fix (below) can be applied to Ubuntu xenial.
The other one can't be applied because it depends on the EEH support for SRIOV, which isn't in Ubuntu xenial yet.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

Also, the above fix should be backported to Ubuntu xenial, as the EEH code in Ubuntu 16.04.1 and 16.10 should be almost the same. The backported fix can be obtained from bug 143706. I can also attach it to this bugzilla on request.

** Affects: kernel-package (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New

** Tags: architecture-ppc64le bugnameltc-143838 severity-critical targetmilestone-inin1610

** Tags added: architecture-ppc64le bugnameltc-143838 severity-critical targetmilestone-inin1610

** Changed in: ubuntu
     Assignee: (unassigned) => Taco Screen team (taco-screen-team)

** Package changed: ubuntu => kernel-package (Ubuntu)

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1604420

Title:
  [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Froze PE Error injection

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/kernel-package/+bug/1604420/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
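
A note on reading the Oops output in comment #0: the kernel brackets the word at the faulting address in the instruction dump, here <e9230010>. The Python sketch below decodes that word, assuming it is a DS-form load (primary opcode 58 in the Power ISA). The function name decode_ds_form_load and the returned dictionary layout are illustrative only and do not come from any kernel tool. Taking GPR03 as 0, as shown in the register dump, the computed address matches the reported DAR of 0x10, which is consistent with a NULL pointer being dereferenced at offset 16 in pnv_eeh_reset().

```python
def decode_ds_form_load(word: int) -> dict:
    """Decode a 32-bit PowerPC DS-form load (primary opcode 58: ld/ldu/lwa).

    Illustrative helper, not part of any kernel tooling.
    """
    opcode = (word >> 26) & 0x3F          # top 6 bits: primary opcode
    if opcode != 58:
        raise ValueError(f"not a DS-form load: primary opcode {opcode}")
    xo = word & 0x3                       # low 2 bits select ld/ldu/lwa
    mnemonics = {0: "ld", 1: "ldu", 2: "lwa"}
    if xo not in mnemonics:
        raise ValueError("reserved extended opcode")
    disp = word & 0xFFFC                  # DS field, already scaled by 4
    if disp & 0x8000:                     # sign-extend the 16-bit displacement
        disp -= 0x10000
    return {
        "mnemonic": mnemonics[xo],
        "rt": (word >> 21) & 0x1F,        # destination register
        "ra": (word >> 16) & 0x1F,        # base register
        "disp": disp,
    }


# Bracketed word from the "Instruction dump" lines of the Oops.
insn = decode_ds_form_load(0xE9230010)

# GPR03 from the register dump (fourth value on the GPR00 line).
# This sketch assumes ra != 0; for ra == 0 the ISA uses a base of zero.
gpr3 = 0x0000000000000000
fault_address = (gpr3 + insn["disp"]) & 0xFFFFFFFFFFFFFFFF

print(f"{insn['mnemonic']} r{insn['rt']}, {insn['disp']}(r{insn['ra']})"
      f" -> address {fault_address:#x}")
# prints: ld r9, 16(r3) -> address 0x10
```

The word just before the nop (60000000) that precedes the faulting load is 4bfb70c5, a branch-and-link, so r3 at this point is most likely the return value of the function just called.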