Test results on QDF2400, showing a recoverable DDR error, a correctable
vendor-specific error, a correctable ARM cache error, and a fatal
vendor-specific error. All functionality appears to be working properly.

ubuntu@null-8cfdf006a3ef:~$ uname -a
Linux null-8cfdf006a3ef 4.10.0-29-generic #33~lp1706141+build.2-Ubuntu SMP Tue Jul 25 19:12:22 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux


ubuntu@null-8cfdf006a3ef:~$ dmesg | grep -i -E 'hest|ghes|edac|hardware'
[    0.000000] ACPI: HEST 0x0000000008A60000 000288 (v01 QCOM   QDF2400  00000001 INTL 20150515)
[    0.538984] HEST: Table parsing has been initialized.
[    3.854385] EDAC MC: Ver: 3.0.0
[    5.537078] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[    5.545952] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[    5.554123] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[    5.562555] ghes_edac: If you find incorrect reports, please contact your hardware vendor
[    5.570727] ghes_edac: to correct its BIOS.
[    5.574905] ghes_edac: This system has 6 DIMM sockets.
[    5.580205] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    5.589763] EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    5.599319] EDAC MC2: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    5.608867] EDAC MC3: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    5.618416] EDAC MC4: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    5.628018] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
[    6.573372] qcom-emac QCOM8070:00 eth0: hardware id 64.1, hardware version 1.3.0
[  224.669058] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  224.677330] {1}[Hardware Error]: event severity: recoverable
[  224.682992] {1}[Hardware Error]:  precise tstamp: 2017-07-26 15:58:19
[  224.689437] {1}[Hardware Error]:  Error 0, type: recoverable
[  224.695097] {1}[Hardware Error]:   section_type: memory error
[  224.700846] {1}[Hardware Error]:   error_status: 0x00000000000c0400
[  224.707113] {1}[Hardware Error]:   physical_address: 0x0000000000204e10
[  224.713726] {1}[Hardware Error]:   physical_address_mask: 0x00000fffffffffff
[  224.720776] {1}[Hardware Error]:   node: 0 card: 1 module: 0 rank: 0 bank: 0 device: 0 row: 4 column: 306
[  224.730427] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
[  224.736356] EDAC MC0: 1 UE Multi-bit ECC on unknown label (node:0 card:1 module:0 rank:0 bank:0 row:4 col:306 page:0x204 offset:0xe10 grain:-4096 - status(0x00000000000c0400): Storage error in DRAM memory)
[  224.736358] [Firmware Warn]: GHES: Invalid address in generic error data: 0x204e10
[  251.685322] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  251.685324] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[  251.685336] {2}[Hardware Error]: event severity: corrected
[  251.685341] {2}[Hardware Error]:  precise tstamp: 2017-07-26 15:58:30
[  251.685342] {2}[Hardware Error]:  Error 0, type: corrected
[  251.685348] {2}[Hardware Error]:   section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[  251.685349] {2}[Hardware Error]:   section length: 0x238
[  251.685355] {2}[Hardware Error]:   00000000: 4d415201 4d492031 453a4d45 435f4343  .RAM1 IMEM:ECC_C
[  251.685358] {2}[Hardware Error]:   00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........
[  251.685361] {2}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
[  251.685364] {2}[Hardware Error]:   00000030: 00000000 00000000 01010000 01010000  ................
[  251.685367] {2}[Hardware Error]:   00000040: 00000000 00000000 00000005 00000000  ................
[  251.685369] {2}[Hardware Error]:   00000050: 01010000 00000000 00000001 00010100  ................
[  251.685372] {2}[Hardware Error]:   00000060: 00000000 00000000 00000000 00000000  ................
[  251.685375] {2}[Hardware Error]:   00000070: 00000000 00000000 00000000 00000000  ................
[  251.685378] {2}[Hardware Error]:   00000080: 00000000 00000000 00000000 00000000  ................
[  251.685381] {2}[Hardware Error]:   00000090: 00000000 00000000 00000000 00000000  ................
[  251.685384] {2}[Hardware Error]:   000000a0: 00000000 00000000 00000000 00000000  ................
[  251.685387] {2}[Hardware Error]:   000000b0: 00000000 00000000 00000000 00000000  ................
[  251.685389] {2}[Hardware Error]:   000000c0: 00000000 00000000 00000000 00000000  ................
[  251.685392] {2}[Hardware Error]:   000000d0: 00000000 00000000 00000000 00000000  ................
[  251.685395] {2}[Hardware Error]:   000000e0: 00000000 00000000 00000000 00000000  ................
[  251.685398] {2}[Hardware Error]:   000000f0: 00000000 00000000 00000000 00000000  ................
[  251.685402] {2}[Hardware Error]:   00000100: 00000000 00000000 00000000 00000000  ................
[  251.685405] {2}[Hardware Error]:   00000110: 00000000 00000000 00000000 00000000  ................
[  251.685408] {2}[Hardware Error]:   00000120: 00000000 00000000 00000000 00000000  ................
[  251.685410] {2}[Hardware Error]:   00000130: 00000000 00000000 00000000 00000000  ................
[  251.685413] {2}[Hardware Error]:   00000140: 00000000 00000000 00000000 00000000  ................
[  251.685416] {2}[Hardware Error]:   00000150: 00000000 00000000 00000000 00000000  ................
[  251.685419] {2}[Hardware Error]:   00000160: 00000000 00000000 00000000 00000000  ................
[  251.685423] {2}[Hardware Error]:   00000170: 00000000 00000000 00000000 00000000  ................
[  251.685426] {2}[Hardware Error]:   00000180: 00000000 00000000 00000000 00000000  ................
[  251.685429] {2}[Hardware Error]:   00000190: 00000000 00000000 00000000 00000000  ................
[  251.685432] {2}[Hardware Error]:   000001a0: 00000000 00000000 00000000 00000000  ................
[  251.685434] {2}[Hardware Error]:   000001b0: 00000000 00000000 00000000 00000000  ................
[  251.685437] {2}[Hardware Error]:   000001c0: 00000000 00000000 00000000 00000000  ................
[  251.685440] {2}[Hardware Error]:   000001d0: 00000000 00000000 00000000 00000000  ................
[  251.685443] {2}[Hardware Error]:   000001e0: 00000000 00000000 00000000 00000000  ................
[  251.685446] {2}[Hardware Error]:   000001f0: 00000000 00000000 00000000 00000000  ................
[  251.685449] {2}[Hardware Error]:   00000200: 00000000 00000000 00000000 00000000  ................
[  251.685451] {2}[Hardware Error]:   00000210: 00000000 00000000 00000000 00000000  ................
[  251.685454] {2}[Hardware Error]:   00000220: 00000000 00000000 00000000 00000000  ................
[  251.685457] {2}[Hardware Error]:   00000230: 00000000 00000000                    ........
[  357.701494] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[  357.701496] {3}[Hardware Error]: event severity: info
[  357.701508] {3}[Hardware Error]:  precise tstamp: 2017-07-26 16:00:12
[  357.701510] {3}[Hardware Error]:  Error 0, type: info
[  357.701513] {3}[Hardware Error]:   section_type: ARM processor error
[  357.701515] {3}[Hardware Error]:   MIDR: 0x00000000510f8000
[  357.701518] {3}[Hardware Error]:   Multiprocessor Affinity Register (MPIDR): 0x0000000000000000
[  357.701520] {3}[Hardware Error]:   error affinity level: 2
[  357.701522] {3}[Hardware Error]:   running state: 0x1
[  357.701524] {3}[Hardware Error]:   Power State Coordination Interface state: 0
[  357.701527] {3}[Hardware Error]:   Error info structure 0:
[  357.701529] {3}[Hardware Error]:   num errors: 1
[  357.701531] {3}[Hardware Error]:    first error captured
[  357.701533] {3}[Hardware Error]:    last error captured
[  357.701535] {3}[Hardware Error]:    error_type: 0, cache error
[  357.701538] {3}[Hardware Error]:    error_info: 0x0000000000c20058
ubuntu@null-8cfdf006a3ef:~$
ubuntu@null-8cfdf006a3ef:~$ [  403.857832] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  403.866103] {4}[Hardware Error]: event severity: fatal
[  403.871244] {4}[Hardware Error]:  precise tstamp: 2017-07-26 16:01:18
[  403.877690] {4}[Hardware Error]:  Error 0, type: fatal
[  403.882831] {4}[Hardware Error]:   section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[  403.891445] {4}[Hardware Error]:   section length: 0x238
[  403.896762] {4}[Hardware Error]:   00000000: 4d415201 4d492031 453a4d45 555f4343  .RAM1 IMEM:ECC_U
[  403.905721] {4}[Hardware Error]:   00000010: 53515f45 44525f42 00000000 00000000  E_QSB_RD........
[  403.914682] {4}[Hardware Error]:   00000020: 00000000 00000000 00000000 00000000  ................
[  403.923644] {4}[Hardware Error]:   00000030: 00000000 00000000 01010000 01010000  ................
[  403.932605] {4}[Hardware Error]:   00000040: 00000000 00000000 00000005 00000000  ................
[  403.941566] {4}[Hardware Error]:   00000050: 02020000 00000000 00000001 00c6c600  ................
[  403.950531] {4}[Hardware Error]:   00000060: 00000000 00000000 00000000 00000000  ................
[  403.959489] {4}[Hardware Error]:   00000070: 00000000 00000000 00000000 00000000  ................
[  403.968450] {4}[Hardware Error]:   00000080: 00000000 00000000 00000000 00000000  ................
[  403.977413] {4}[Hardware Error]:   00000090: 00000000 00000000 00000000 00000000  ................
[  403.986374] {4}[Hardware Error]:   000000a0: 00000000 00000000 00000000 00000000  ................
[  403.995339] {4}[Hardware Error]:   000000b0: 00000000 00000000 00000000 00000000  ................
[  404.004302] {4}[Hardware Error]:   000000c0: 00000000 00000000 00000000 00000000  ................
[  404.013263] {4}[Hardware Error]:   000000d0: 00000000 00000000 00000000 00000000  ................
[  404.022223] {4}[Hardware Error]:   000000e0: 00000000 00000000 00000000 00000000  ................
[  404.031183] {4}[Hardware Error]:   000000f0: 00000000 00000000 00000000 00000000  ................
[  404.040143] {4}[Hardware Error]:   00000100: 00000000 00000000 00000000 00000000  ................
[  404.049104] {4}[Hardware Error]:   00000110: 00000000 00000000 00000000 00000000  ................
[  404.058064] {4}[Hardware Error]:   00000120: 00000000 00000000 00000000 00000000  ................
[  404.067025] {4}[Hardware Error]:   00000130: 00000000 00000000 00000000 00000000  ................
[  404.075986] {4}[Hardware Error]:   00000140: 00000000 00000000 00000000 00000000  ................
[  404.084946] {4}[Hardware Error]:   00000150: 00000000 00000000 00000000 00000000  ................
[  404.093907] {4}[Hardware Error]:   00000160: 00000000 00000000 00000000 00000000  ................
[  404.102867] {4}[Hardware Error]:   00000170: 00000000 00000000 00000000 00000000  ................
[  404.111828] {4}[Hardware Error]:   00000180: 00000000 00000000 00000000 00000000  ................
[  404.120788] {4}[Hardware Error]:   00000190: 00000000 00000000 00000000 00000000  ................
[  404.129752] {4}[Hardware Error]:   000001a0: 00000000 00000000 00000000 00000000  ................
[  404.138710] {4}[Hardware Error]:   000001b0: 00000000 00000000 00000000 00000000  ................
[  404.147673] {4}[Hardware Error]:   000001c0: 00000000 00000000 00000000 00000000  ................
[  404.156632] {4}[Hardware Error]:   000001d0: 00000000 00000000 00000000 00000000  ................
[  404.165593] {4}[Hardware Error]:   000001e0: 00000000 00000000 00000000 00000000  ................
[  404.174555] {4}[Hardware Error]:   000001f0: 00000000 00000000 00000000 00000000  ................
[  404.183516] {4}[Hardware Error]:   00000200: 00000000 00000000 00000000 00000000  ................
[  404.192476] {4}[Hardware Error]:   00000210: 00000000 00000000 00000000 00000000  ................
[  404.201438] {4}[Hardware Error]:   00000220: 00000000 00000000 00000000 00000000  ................
[  404.210398] {4}[Hardware Error]:   00000230: 00000000 00000000                    ........
[  404.218665] Kernel panic - not syncing: Fatal hardware error!
[  404.224406] CPU: 0 PID: 217 Comm: kworker/0:1 Not tainted 4.10.0-29-generic #33~lp1706141+build.2-Ubuntu
[  404.233876] Hardware name: Qualcomm Qualcomm Centriq(TM) 2400 Development Platform/ABW|SYS|CVR,1DPC|V3           , BIOS XBL.DF.2.0.R1-00512 QDF2400_REL CR
[  404.247695] Workqueue: kacpi_notify acpi_os_execute_deferred
[  404.253347] Call trace:
[  404.255790] [<ffff1e8f9e08b078>] dump_backtrace+0x0/0x2b0
[  404.261182] [<ffff1e8f9e08b34c>] show_stack+0x24/0x30
[  404.266230] [<ffff1e8f9e4da5e0>] dump_stack+0x9c/0xbc
[  404.271276] [<ffff1e8f9e208620>] panic+0x140/0x2b0
[  404.276061] [<ffff1e8f9e5ef8e0>] ghes_proc+0x1d8/0x568
[  404.281191] [<ffff1e8f9e5efcb4>] ghes_notify_sci+0x44/0x70
[  404.286670] [<ffff1e8f9e0f6424>] notifier_call_chain+0x5c/0xa0
[  404.292495] [<ffff1e8f9e0f6970>] __blocking_notifier_call_chain+0x58/0xa0
[  404.299274] [<ffff1e8f9e0f69f4>] blocking_notifier_call_chain+0x3c/0x50
[  404.305883] [<ffff1e8f9e5ea09c>] acpi_hed_notify+0x24/0x30
[  404.311361] [<ffff1e8f9e5b1710>] acpi_device_notify+0x30/0x40
[  404.317101] [<ffff1e8f9e5c8204>] acpi_ev_notify_dispatch+0x4c/0x70
[  404.323274] [<ffff1e8f9e5ac2e4>] acpi_os_execute_deferred+0x24/0x38
[  404.329535] [<ffff1e8f9e0ed330>] process_one_work+0x158/0x478
[  404.335273] [<ffff1e8f9e0ed6a0>] worker_thread+0x50/0x4a8
[  404.340665] [<ffff1e8f9e0f47a8>] kthread+0x108/0x138
[  404.345622] [<ffff1e8f9e0838a0>] ret_from_fork+0x10/0x30
[  404.350934] SMP: stopping secondary CPUs
[  404.356117] Starting crashdump kernel...
[  404.360034] Bye!
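
The two vendor-specific sections above ({2} and {4}) are printed as raw 32-bit words, so the embedded name is only readable in the ASCII column. A minimal sketch of how that column can be reproduced from the hex words, assuming the words are little-endian as printed on this aarch64 system; `decode_words` is a hypothetical helper name, not part of any kernel tool:

```shell
#!/bin/sh
# Decode 32-bit words from a kernel hex dump into their ASCII column.
# Assumption: words are little-endian, so the bytes in memory order are
# the byte pairs of each printed word read from right to left.
# decode_words is a hypothetical helper name used for illustration only.
decode_words() {
    for word in "$@"; do
        for pos in 7 5 3 1; do
            # Extract one byte (two hex digits) starting at character $pos.
            byte=$(printf '%s' "$word" | cut -c"$pos"-"$((pos + 1))")
            code=$((0x$byte))
            if [ "$code" -ge 32 ] && [ "$code" -le 126 ]; then
                # Printable ASCII: emit the character via an octal escape.
                printf "\\$(printf '%03o' "$code")"
            else
                # Non-printable bytes are shown as '.', as in the dump.
                printf '.'
            fi
        done
    done
    printf '\n'
}

# First six words of the corrected record {2}:
decode_words 4d415201 4d492031 453a4d45 435f4343 53515f45 44525f42
# prints: .RAM1 IMEM:ECC_CE_QSB_RD
```

Running the same helper on the first row of the fatal record {4} (last word `555f4343`) gives `.RAM1 IMEM:ECC_U`, matching the ASCII column shown in the log.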

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1706141

Title:
  [ARM64] config EDAC_GHES=y depends on EDAC_MM_EDAC=y

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  [Impact]
  In the Zesty and Artful configs, EDAC_MM_EDAC is set to =m, which disables
  EDAC_GHES. Customers using RAS on ARM64 may want this functionality.

  According to a RAS expert at QTI, EDAC_GHES is essential for ARMv8.0
  Server systems, as it enables firmware-first error handling of memory
  and CPU errors. Due to a lack of standard RAS architecture (or machine
  check architecture equivalent) on ARMv8.0 systems, APEI/GHES is the
  only mechanism available for reporting hardware errors (e.g. memory
  and CPU errors). This enables reporting of hardware errors, and also
  helps enable memory fault recovery mechanisms to extend the life of
  the system by offlining pages when recoverable uncorrected errors are
  encountered. Note that other ARM vendors will be going in this
  direction for hardware error handling.
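
Page offlining works on whole pages, so a reported physical address first has to be mapped to its page. As a rough illustration, the recoverable memory error in the test log above reported physical_address 0x204e10; assuming 4 KiB pages, the split below reproduces the page:0x204 offset:0xe10 values in the EDAC line. `split_phys_addr` is a hypothetical helper name, and the page size is an assumption rather than something stated in the bug:

```shell
#!/bin/sh
# Split a physical address into page number and in-page offset.
# Assumption: 4 KiB pages (page shift of 12) unless a shift is passed in.
# split_phys_addr is a hypothetical helper name used for illustration only.
split_phys_addr() {
    addr=$(($1))
    shift_bits=${2:-12}
    printf 'page:0x%x offset:0x%x\n' \
        "$((addr >> shift_bits))" \
        "$((addr & ((1 << shift_bits) - 1)))"
}

# Address taken from error record {1} in the test log.
split_phys_addr 0x204e10
# prints: page:0x204 offset:0xe10
```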

  [Test]
  A test kernel is available at https://launchpad.net/~centriq-team/+archive/ubuntu/lp1706141

  Boot the kernel and check dmesg for the following:
  $ dmesg | grep -i -E "edac|hest|ghes"
  [    0.000000] ACPI: HEST 0x0000000009160000 000288 (v01 QCOM   QDF2400  00000001 INTL 20150515)
  [    0.620278] HEST: Table parsing has been initialized.
  [    4.178298] EDAC MC: Ver: 3.0.0
  [    5.664499] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
  [    5.673371] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
  [    5.681542] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
  [    5.689972] ghes_edac: If you find incorrect reports, please contact your hardware vendor
  [    5.698142] ghes_edac: to correct its BIOS.
  [    5.702320] ghes_edac: This system has 12 DIMM sockets.
  [    5.707717] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
  [    5.717264] EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
  [    5.726806] EDAC MC2: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
  [    5.736344] EDAC MC3: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
  [    5.745883] EDAC MC4: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
  [    5.755469] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
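
The manual dmesg check above could also be scripted. A minimal sketch, assuming that the three messages below (taken from the expected output) are sufficient evidence that HEST parsing, ghes_edac registration and firmware-first GHES are all active; `check_ghes_boot` is a hypothetical helper name:

```shell
#!/bin/sh
# Check a boot log on stdin for the messages expected with EDAC_GHES=y.
# Assumption: these three substrings, copied from the expected dmesg
# output, are treated as the pass criteria.
# check_ghes_boot is a hypothetical helper name used for illustration only.
check_ghes_boot() {
    log=$(cat)
    status=0
    for marker in \
        'HEST: Table parsing has been initialized' \
        'Giving out device to module ghes_edac' \
        'GHES: APEI firmware first mode is enabled'
    do
        if printf '%s\n' "$log" | grep -qF -- "$marker"; then
            echo "ok: $marker"
        else
            echo "MISSING: $marker"
            status=1
        fi
    done
    return "$status"
}

# Demo on three lines copied from the expected output.
check_ghes_boot <<'EOF'
[    0.620278] HEST: Table parsing has been initialized.
[    5.707717] EDAC MC0: Giving out device to module ghes_edac.c controller
[    5.755469] GHES: APEI firmware first mode is enabled by APEI bit and WHEA
EOF
```

On a booted system the intended use would be `dmesg | check_ghes_boot`; a non-zero exit status means at least one message was not found.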

  [Fix]
  1. Apply RAS patch series submitted for SRU in Bug #1696570
  2. Set config option EDAC_MM_EDAC=y for ARM64; this will automatically set EDAC_GHES=y
  3. Remove edac_core from
  debian.master/abi/<ver>/arm64/generic.modules
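
Whether a given kernel build actually picked up the config change in step 2 can be checked against its config file. A minimal sketch, assuming the config is available as a plain-text file (on Ubuntu this is normally /boot/config-<version>); `check_edac_config` is a hypothetical helper name:

```shell
#!/bin/sh
# Report whether both EDAC options are built in (=y) in a kernel config file.
# Assumption: the config file is plain text with one CONFIG_*=value per line.
# check_edac_config is a hypothetical helper name used for illustration only.
check_edac_config() {
    cfg=$1
    rc=0
    for sym in CONFIG_EDAC_MM_EDAC CONFIG_EDAC_GHES; do
        if grep -q "^${sym}=y\$" "$cfg"; then
            echo "$sym=y"
        else
            echo "$sym is not built in"
            rc=1
        fi
    done
    return "$rc"
}

# Usage on a running system:
#   check_edac_config "/boot/config-$(uname -r)"
```

The function returns non-zero if either option is missing or set to =m, which is the situation described in [Impact].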

  [Regression Potential]
  The config change is limited to the ARM64 architecture and does not impact any
  other architecture. Potential for regressions is low.

