[Bug 1604420] [NEW] [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1] while executing Froze PE Error injection

bugproxy Tue, 19 Jul 2016 06:36:16 -0700

Public bug reported:

== Comment: #0 - PAVAMAN SUBRAMANIYAM <pavsu...@in.ibm.com> - 2016-07-18 
04:18:33 ==
---Problem Description---
Machine crashed with Oops: Kernel access of bad area, sig: 11 [#1]
 
Contact Information = pavsu...@in.ibm.com 
 
---uname output---
Linux ltc-garri2 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:05:18 UTC 2016 
ppc64le ppc64le ppc64le GNU/Linux
 
---Additional Hardware Info---
root@ltc-garri2:~# lspci
0000:00:00.0 PCI bridge: IBM Device 03dc
0000:01:00.0 Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
0001:00:00.0 PCI bridge: IBM Device 03dc
0002:00:00.0 PCI bridge: IBM Device 03dc
0002:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0003:00:00.0 PCI bridge: IBM Device 03dc
0004:00:00.0 PCI bridge: IBM Device 03dc
0005:00:00.0 PCI bridge: IBM Device 03dc
0005:01:00.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:01.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:02.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:03.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:02:04.0 PCI bridge: PLX Technology, Inc. PEX 8718 16-Lane, 5-Port PCI 
Express Gen 3 (8.0 GT/s) Switch (rev ab)
0005:03:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0 xHCI 
Host Controller (rev 02)
0005:04:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 
x2 4-port SATA 6 Gb/s Controller (rev 11)
0005:05:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 
03)
0005:06:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics 
Family (rev 30)
0005:07:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 
Gigabit Ethernet PCIe (rev 10)
0005:07:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5718 
Gigabit Ethernet PCIe (rev 10)
0006:00:00.0 PCI bridge: IBM Device 03dc
0006:01:00.0 3D controller: NVIDIA Corporation Device 15fe (rev a1)
0007:00:00.0 PCI bridge: IBM Device 03dc
0008:00:00.0 Bridge: IBM Device 04ea
0008:00:00.1 Bridge: IBM Device 04ea
0008:00:01.0 Bridge: IBM Device 04ea
0008:00:01.1 Bridge: IBM Device 04ea
0009:00:00.0 Bridge: IBM Device 04ea
0009:00:00.1 Bridge: IBM Device 04ea
0009:00:01.0 Bridge: IBM Device 04ea
0009:00:01.1 Bridge: IBM Device 04ea


 
Machine Type = P8 
 
---Debugger---
A debugger is not configured
 
---Steps to Reproduce---
Install Ubuntu 16.10 on P8 OpenPOWER 8335-GTB hardware.
Then execute the Frozen PE error injection tests as shown below:


root@ltc-garri2:~# lspci | grep -i 0004:00:00.0
0004:00:00.0 PCI bridge: IBM Device 03dc

root@ltc-garri2:~# cat /proc/powerpc/eeh | tail -n 1
eeh_slot_resets=0


root@ltc-garri2:~# echo 0:0:4:0:0 > /sys/kernel/debug/powerpc/PCI0004/err_injct && lspci -ns 0004:00:00.0; echo $?
0004:00:00.0 0604: 1014:03dc
0

The kernel immediately crashes with an Oops message.

Ubuntu Yakkety Yak (development branch) ltc-garri2 hvc0

ltc-garri2 login: [ 3231.801472] EEH: Frozen PE#0 on PHB#4 detected
[ 3231.801639] EEH: PE location: N/A, PHB location: N/A
[ 3231.802772] Unable to handle kernel paging request for data at address 
0x00000010
[ 3231.802841] Faulting instruction address: [1718257727455,3] OPAL: Trying a 
CPU re-init with flags: 0x1
[1719041828975,3] OPAL: Trying a CPU re-init with flags: 0x2
0xc000000000083c7c
[ 3231.802898] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3231.802944] SMP NR_CPUS=2048 NUMA PowerNV
[ 3231.802994] Modules linked in: ipmi_devintf ip6table_filter ip6_tables 
iptable_fi[1721485466204,3] PCI-SLOT-0000000000000000 Invalid state 00000000
[1722574789913,3] PCI-SLOT-0000000000000001 Invalid state 00000000
[1725814514431,3] PCI-SLOT-0000000000000002 Invalid state 00000000
[1726903821324,3] PCI-SLOT-0000000000000003 Invalid state 00000000
[1727993145172,3] PCI-SLOT-0000000000000008 Invalid state 00000000
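
The manual reproduction steps above (read the reset counter, write the
injection string, then touch the device) could be wrapped in two small shell
functions, roughly as sketched here. The function names and the optional path
arguments are hypothetical and exist only so the functions can be tried
against scratch files; the default paths and the injection string 0:0:4:0:0
are the ones used in this report.

```shell
#!/bin/sh
# Sketch only: helper names and optional path arguments are assumptions,
# not part of any existing tool.

# Print the eeh_slot_resets counter from the EEH statistics file.
# $1: statistics file (optional, defaults to the path used in the report).
eeh_slot_resets() {
    stats_file=${1:-/proc/powerpc/eeh}
    sed -n 's/^eeh_slot_resets=//p' "$stats_file"
}

# Write the injection string from the report into a PHB's err_injct file.
# $1: PHB domain as it appears in the directory name (e.g. 0004).
# $2: debugfs root (optional, defaults to the path used in the report).
inject_frozen_pe() {
    phb_domain=$1
    debugfs_root=${2:-/sys/kernel/debug/powerpc}
    inject_node="$debugfs_root/PCI$phb_domain/err_injct"

    if [ ! -w "$inject_node" ]; then
        echo "cannot write $inject_node" >&2
        return 1
    fi
    printf '%s\n' "0:0:4:0:0" > "$inject_node"
}
```

On the affected machine the sequence would then be `eeh_slot_resets`, followed
by `inject_frozen_pe 0004 && lspci -ns 0004:00:00.0`. On a kernel without the
fix this is expected to trigger the crash described in this report, so it
should only be run on a test system.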

 
Stack trace output:
 [ 3231.804789] Call Trace:
[ 3231.804812] [c000000feee8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 
(unreliable)
[ 3231.804882] [c000000feee8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 3231.804944] [c000000feee8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 3231.805005] [c000000feee8bba0] [c00000000003c520] 
eeh_handle_normal_event+0x390/0x440
[ 3231.805073] [c000000feee8bc20] [c00000000003c9c4] 
eeh_handle_event+0x184/0x370
[ 3231.805311] [c000000feee8bcd0] [c00000000003cd88] 
eeh_event_handler+0x1d8/0x1e0
[ 3231.805391] [c000000feee8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 3231.805452] [c000000feee8be30] [c000000000009538] 
ret_from_kernel_thread+0x5c/0xa4

 
Oops output:
 [ 3231.801472] EEH: Frozen PE#0 on PHB#4 detected
[ 3231.801639] EEH: PE location: N/A, PHB location: N/A
[ 3231.802463] EEH: This PCI device has failed 1 times in the last hour
[ 3231.802465] EEH: Notify device drivers to shutdown
[ 3231.802469] EEH: Collect temporary log
[ 3231.802496] EEH: of node=0004:00:00:0
[ 3231.802499] EEH: PCI device/vendor: 03dc1014
[ 3231.802502] EEH: PCI cmd/status register: 00100106
[ 3231.802505] EEH: Bridge secondary status: 0000
[ 3231.802508] EEH: Bridge control: 0002
[ 3231.802509] EEH: PCI-E capabilities and status follow:
[ 3231.802518] EEH: PCI-E 00: 00420010 00008002 00000040 00300103
[ 3231.802525] EEH: PCI-E 10: 01010008 00000000 00000000 00010010
[ 3231.802527] EEH: PCI-E 20: 00000000
[ 3231.802529] EEH: PCI-E AER capability register set follows:
[ 3231.802537] EEH: PCI-E AER 00: 14810001 00000000 0008d000 00000000
[ 3231.802544] EEH: PCI-E AER 10: 00000000 00000000 000001e0 00000000
[ 3231.802551] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 3231.802554] EEH: PCI-E AER 30: 00000000 00000000
[ 3231.802556] PHB3 PHB#4 Diag-data (Version: 1)
[ 3231.802558] brdgCtl:     00000002
[ 3231.802560] UtlSts:      00080000 00000000 00000000
[ 3231.802562] RootSts:     00000040 00000000 01010008 00100102 00000000
[ 3231.802564] PhbSts:      0000001c00000000 0000001c00000000
[ 3231.802567] Lem:         0000000000100000 42498e367f502eae 0000000000000000
[ 3231.802569] InAErr:      4000000000000000 4000000000000000 0202000000000000 
0000000000000000
[ 3231.802571] PE[  0] A/B: 8440002b00000000 8000000000000000
[ 3231.802574] EEH: Reset with hotplug activity
[ 3231.802590] pci_bus 0004:01: busn_res: [bus 01] is released
[ 3231.802772] Unable to handle kernel paging request for data at address 
0x00000010
[ 3231.802841] Faulting instruction address: 0xc000000000083c7c
[ 3231.802898] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3231.802944] SMP NR_CPUS=2048 NUMA PowerNV
[ 3231.802994] Modules linked in: ipmi_devintf ip6table_filter ip6_tables 
iptable_filter ip_tables x_tables joydev input_leds mac_hid hid_generic usbhid 
hid at24 ofpart cmdlinepart powernv_flash ipmi_powernv uio_pdrv_genirq 
ipmi_msghandler mtd uio opal_prd powernv_rng ibmpowernv autofs4 uas usb_storage 
nouveau ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt 
fb_sys_fops drm ahci libahci mlx5_core
[ 3231.803524] CPU: 10 PID: 611 Comm: eehd Not tainted 4.4.0-31-generic 
#50-Ubuntu
[ 3231.803583] task: c000000feee02a20 ti: c000000feee88000 task.ti: 
c000000feee88000
[ 3231.803644] NIP: c000000000083c7c LR: c000000000083c78 CTR: c000000000083c20
[ 3231.803707] REGS: c000000feee8b760 TRAP: 0300   Not tainted  
(4.4.0-31-generic)
[ 3231.803765] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 28008822  
XER: 00000000
[ 3231.803915] CFAR: c000000000008468 DAR: 0000000000000010 DSISR: 40000000 
SOFTE: 1
               GPR00: c000000000083c78 c000000feee8b9e0 c0000000015b5d00 
0000000000000000
               GPR04: 0000000000000001 c000000feee8bac0 c000001e4ec732b0 
0000000000000ff0
               GPR08: 0000000000000000 0000000000000000 0000000000000000 
000000000000001b
               GPR12: c000000000083c20 c000000007b25f00 c0000000000e6318 
c000001e4ecf0340
               GPR16: 0000000000000000 0000000000000000 0000000000000000 
0000000000000000
               GPR20: 0000000000000000 0000000000000000 0000000000000000 
c000000000d42468
               GPR24: c000000000d42440 0000000000000100 c000000000036460 
0000000000000000
               GPR28: c00000000161a3f0 0000000000000001 c000001fff766780 
c000001e4ed44000
[ 3231.804708] NIP [c000000000083c7c] pnv_eeh_reset+0x5c/0x170
[ 3231.804749] LR [c000000000083c78] pnv_eeh_reset+0x58/0x170
[ 3231.804789] Call Trace:
[ 3231.804812] [c000000feee8b9e0] [c000000000083c78] pnv_eeh_reset+0x58/0x170 
(unreliable)
[ 3231.804882] [c000000feee8ba60] [c000000000038250] eeh_reset_pe+0xb0/0x1c0
[ 3231.804944] [c000000feee8bb00] [c000000000af444c] eeh_reset_device+0xd8/0x228
[ 3231.805005] [c000000feee8bba0] [c00000000003c520] 
eeh_handle_normal_event+0x390/0x440
[ 3231.805073] [c000000feee8bc20] [c00000000003c9c4] 
eeh_handle_event+0x184/0x370
[ 3231.805311] [c000000feee8bcd0] [c00000000003cd88] 
eeh_event_handler+0x1d8/0x1e0
[ 3231.805391] [c000000feee8bd80] [c0000000000e6420] kthread+0x110/0x130
[ 3231.805452] [c000000feee8be30] [c000000000009538] 
ret_from_kernel_thread+0x5c/0xa4
[ 3231.805519] Instruction dump:
[ 3231.805550] 60000000 813f0000 ebdf0010 792affe3 408200d4 e95e0250 812a000c 
2f890002
[ 3231.805653] 419e0054 7fe3fb78 4bfb70c5 60000000 <e9230010> 2fa90000 419e00dc 
e9290010
[ 3231.811881] ---[ end trace e5268555486ccf38 ]---
[ 3231.890155]
[ 3231.890244] Sending IPI to other CPUs
[ 3231.891305] IPI complete
[ 3231.892479] kexec: waiting for cpu 1 (physical 17) to enter OPAL
[ 3231.894550] kexec: waiting for cpu 24 (physical 48) to enter OPAL
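
As a reading aid for the instruction dump above, the faulting word marked
<e9230010> can be decoded by hand. The sketch below is a minimal decoder for
PowerPC DS-form loads only (primary opcode 58); the function name is
hypothetical, and it is not a general disassembler.

```shell
#!/bin/sh
# Sketch only: decode one 32-bit PowerPC DS-form load given as a hex word.
# decode_ds_load is a hypothetical helper name.
decode_ds_load() {
    insn=$((0x$1))
    opcode=$(( (insn >> 26) & 0x3f ))   # primary opcode, top 6 bits
    rt=$(( (insn >> 21) & 0x1f ))       # target register
    ra=$(( (insn >> 16) & 0x1f ))       # base address register
    disp=$(( insn & 0xfffc ))           # displacement, low two bits cleared
    xo=$(( insn & 0x3 ))                # selects ld / ldu / lwa

    if [ "$opcode" -ne 58 ]; then
        echo "not a DS-form load: $1" >&2
        return 1
    fi
    case $xo in
        0) name=ld ;;
        1) name=ldu ;;
        2) name=lwa ;;
        *) echo "reserved extended opcode in: $1" >&2; return 1 ;;
    esac

    # Sign-extend the 16-bit displacement.
    if [ "$disp" -ge 32768 ]; then
        disp=$(( disp - 65536 ))
    fi
    echo "$name r$rt,$disp(r$ra)"
}

decode_ds_load e9230010   # prints: ld r9,16(r3)
```

GPR03 is 0000000000000000 in the register dump and DAR is 0x10, which is
consistent with this load reading offset 16 from a NULL pointer in r3.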

 
System Dump Info:
  The system is not configured to capture a system dump.
 
*Additional Instructions for pavsu...@in.ibm.com: 
-Post a private note with access information to the machine that the bug is 
occurring on. 
-Attach sysctl -a output to the bug.

== Comment: #1 - PAVAMAN SUBRAMANIYAM <pavsu...@in.ibm.com> - 2016-07-18 
04:21:50 ==
The following two patches need to be incorporated to fix this issue.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=cca0e542e02e48cce541a49c4046ec094ec27c1e
("powerpc/eeh: Fix wrong argument passed to eeh_rmv_device()")

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

== Comment: #6 - Guo Wen Shan <gws...@au1.ibm.com> - 2016-07-18 22:38:44 ==
Only one of the fixes (below) can be applied to Ubuntu xenial. The other one 
can't be applied because it depends on the EEH support for SR-IOV, which isn't 
in Ubuntu xenial yet.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a3aa256b7258b3d19f8b44557cc64525a993b941
("powerpc/eeh: Fix invalid cached PE primary bus")

The above fix should also be backported to Ubuntu xenial, as the EEH
code in Ubuntu 16.04.1 and 16.10 should be almost the same. The
backported fix is available from bug 143706. I can also attach it to
this bugzilla on request.

** Affects: kernel-package (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-143838 severity-critical 
targetmilestone-inin1610

** Tags added: architecture-ppc64le bugnameltc-143838 severity-critical
targetmilestone-inin1610

** Changed in: ubuntu
     Assignee: (unassigned) => Taco Screen team (taco-screen-team)

** Package changed: ubuntu => kernel-package (Ubuntu)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1604420

Title:
  [LTCTest][Opal][OP820] Machine crashed with Oops: Kernel access of bad
  area, sig: 11 [#1] while executing Froze PE Error injection

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/kernel-package/+bug/1604420/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
