Test results on QDF2400 showing a recoverable DDR error, a correctable vendor-specific error, a correctable ARM cache error, and a fatal vendor-specific error. All functionality appears to be working properly.
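For anyone who wants to pull the individual error records out of a log like the one below without reading every line, here is a rough Python sketch. It assumes one kernel message per line and keys on the `{N}[Hardware Error]:` prefix; the function name `summarize` is made up for illustration, and the sample lines are copied from the log.

```python
import re
from collections import OrderedDict

# Matches the "{N}[Hardware Error]: ..." part of a kernel log line.
RECORD_LINE = re.compile(r"\{(\d+)\}\[Hardware Error\]:\s*(.*)")

def summarize(dmesg_text):
    """Map each GHES record number to its reported severity and section type."""
    records = OrderedDict()
    for line in dmesg_text.splitlines():
        m = RECORD_LINE.search(line)
        if not m:
            continue
        rec = records.setdefault(int(m.group(1)), {})
        body = m.group(2).strip()
        if body.startswith("event severity:"):
            rec["severity"] = body.split(":", 1)[1].strip()
        elif body.startswith(("section_type:", "section type:")):
            rec["section"] = body.split(":", 1)[1].strip()
    return records

sample = """\
[ 224.669058] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 224.677330] {1}[Hardware Error]: event severity: recoverable
[ 224.695097] {1}[Hardware Error]: section_type: memory error
[ 403.866103] {4}[Hardware Error]: event severity: fatal
[ 403.882831] {4}[Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
"""

for num, rec in summarize(sample).items():
    print(num, rec.get("severity"), "|", rec.get("section"))
```

Feeding it the full output of `dmesg` should give one entry per record number, which makes it easy to confirm that all four injected errors were reported.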
ubuntu@null-8cfdf006a3ef:~$ uname -a
Linux null-8cfdf006a3ef 4.10.0-29-generic #33~lp1706141+build.2-Ubuntu SMP Tue Jul 25 19:12:22 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@null-8cfdf006a3ef:~$ dmesg | grep -i -E 'hest|ghes|edac|hardware'
[ 0.000000] ACPI: HEST 0x0000000008A60000 000288 (v01 QCOM QDF2400 00000001 INTL 20150515)
[ 0.538984] HEST: Table parsing has been initialized.
[ 3.854385] EDAC MC: Ver: 3.0.0
[ 5.537078] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[ 5.545952] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[ 5.554123] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[ 5.562555] ghes_edac: If you find incorrect reports, please contact your hardware vendor
[ 5.570727] ghes_edac: to correct its BIOS.
[ 5.574905] ghes_edac: This system has 6 DIMM sockets.
[ 5.580205] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.589763] EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.599319] EDAC MC2: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.608867] EDAC MC3: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.618416] EDAC MC4: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.628018] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
[ 6.573372] qcom-emac QCOM8070:00 eth0: hardware id 64.1, hardware version 1.3.0
[ 224.669058] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 224.677330] {1}[Hardware Error]: event severity: recoverable
[ 224.682992] {1}[Hardware Error]: precise tstamp: 2017-07-26 15:58:19
[ 224.689437] {1}[Hardware Error]: Error 0, type: recoverable
[ 224.695097] {1}[Hardware Error]: section_type: memory error
[ 224.700846] {1}[Hardware Error]: error_status: 0x00000000000c0400
[ 224.707113] {1}[Hardware Error]: physical_address: 0x0000000000204e10
[ 224.713726] {1}[Hardware Error]: physical_address_mask: 0x00000fffffffffff
[ 224.720776] {1}[Hardware Error]: node: 0 card: 1 module: 0 rank: 0 bank: 0 device: 0 row: 4 column: 306
[ 224.730427] {1}[Hardware Error]: error_type: 3, multi-bit ECC
[ 224.736356] EDAC MC0: 1 UE Multi-bit ECC on unknown label (node:0 card:1 module:0 rank:0 bank:0 row:4 col:306 page:0x204 offset:0xe10 grain:-4096 - status(0x00000000000c0400): Storage error in DRAM memory)
[ 224.736358] [Firmware Warn]: GHES: Invalid address in generic error data: 0x204e10
[ 251.685322] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 251.685324] {2}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 251.685336] {2}[Hardware Error]: event severity: corrected
[ 251.685341] {2}[Hardware Error]: precise tstamp: 2017-07-26 15:58:30
[ 251.685342] {2}[Hardware Error]: Error 0, type: corrected
[ 251.685348] {2}[Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 251.685349] {2}[Hardware Error]: section length: 0x238
[ 251.685355] {2}[Hardware Error]: 00000000: 4d415201 4d492031 453a4d45 435f4343 .RAM1 IMEM:ECC_C
[ 251.685358] {2}[Hardware Error]: 00000010: 53515f45 44525f42 00000000 00000000 E_QSB_RD........
[ 251.685361] {2}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
[ 251.685364] {2}[Hardware Error]: 00000030: 00000000 00000000 01010000 01010000 ................
[ 251.685367] {2}[Hardware Error]: 00000040: 00000000 00000000 00000005 00000000 ................
[ 251.685369] {2}[Hardware Error]: 00000050: 01010000 00000000 00000001 00010100 ................
[ 251.685372] {2}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000 ................
[ 251.685375] {2}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000 ................
[ 251.685378] {2}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000 ................
[ 251.685381] {2}[Hardware Error]: 00000090: 00000000 00000000 00000000 00000000 ................
[ 251.685384] {2}[Hardware Error]: 000000a0: 00000000 00000000 00000000 00000000 ................
[ 251.685387] {2}[Hardware Error]: 000000b0: 00000000 00000000 00000000 00000000 ................
[ 251.685389] {2}[Hardware Error]: 000000c0: 00000000 00000000 00000000 00000000 ................
[ 251.685392] {2}[Hardware Error]: 000000d0: 00000000 00000000 00000000 00000000 ................
[ 251.685395] {2}[Hardware Error]: 000000e0: 00000000 00000000 00000000 00000000 ................
[ 251.685398] {2}[Hardware Error]: 000000f0: 00000000 00000000 00000000 00000000 ................
[ 251.685402] {2}[Hardware Error]: 00000100: 00000000 00000000 00000000 00000000 ................
[ 251.685405] {2}[Hardware Error]: 00000110: 00000000 00000000 00000000 00000000 ................
[ 251.685408] {2}[Hardware Error]: 00000120: 00000000 00000000 00000000 00000000 ................
[ 251.685410] {2}[Hardware Error]: 00000130: 00000000 00000000 00000000 00000000 ................
[ 251.685413] {2}[Hardware Error]: 00000140: 00000000 00000000 00000000 00000000 ................
[ 251.685416] {2}[Hardware Error]: 00000150: 00000000 00000000 00000000 00000000 ................
[ 251.685419] {2}[Hardware Error]: 00000160: 00000000 00000000 00000000 00000000 ................
[ 251.685423] {2}[Hardware Error]: 00000170: 00000000 00000000 00000000 00000000 ................
[ 251.685426] {2}[Hardware Error]: 00000180: 00000000 00000000 00000000 00000000 ................
[ 251.685429] {2}[Hardware Error]: 00000190: 00000000 00000000 00000000 00000000 ................
[ 251.685432] {2}[Hardware Error]: 000001a0: 00000000 00000000 00000000 00000000 ................
[ 251.685434] {2}[Hardware Error]: 000001b0: 00000000 00000000 00000000 00000000 ................
[ 251.685437] {2}[Hardware Error]: 000001c0: 00000000 00000000 00000000 00000000 ................
[ 251.685440] {2}[Hardware Error]: 000001d0: 00000000 00000000 00000000 00000000 ................
[ 251.685443] {2}[Hardware Error]: 000001e0: 00000000 00000000 00000000 00000000 ................
[ 251.685446] {2}[Hardware Error]: 000001f0: 00000000 00000000 00000000 00000000 ................
[ 251.685449] {2}[Hardware Error]: 00000200: 00000000 00000000 00000000 00000000 ................
[ 251.685451] {2}[Hardware Error]: 00000210: 00000000 00000000 00000000 00000000 ................
[ 251.685454] {2}[Hardware Error]: 00000220: 00000000 00000000 00000000 00000000 ................
[ 251.685457] {2}[Hardware Error]: 00000230: 00000000 00000000 ........
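A side note on reading the vendor-specific section dump above: the ASCII column can be reproduced by treating each 32-bit word as little-endian bytes. This is an observation from the dump itself, not a documented layout of the section, and the helper name `words_to_ascii` is made up for illustration. A minimal Python sketch:

```python
import struct

def words_to_ascii(words):
    """Render hex-dump words as the ASCII column shown next to them in the log."""
    # Each word is printed as a 32-bit value; pack it back to little-endian bytes.
    raw = b"".join(struct.pack("<I", int(w, 16)) for w in words)
    # Non-printable bytes are shown as '.', as in the dump.
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)

row0 = "4d415201 4d492031 453a4d45 435f4343".split()
row1 = "53515f45 44525f42 00000000 00000000".split()

print(words_to_ascii(row0))  # .RAM1 IMEM:ECC_C
print(words_to_ascii(row1))  # E_QSB_RD........
```

Read this way, the first two rows of the corrected record spell out "RAM1 IMEM:ECC_CE_QSB_RD" after a leading 0x01 byte.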
[ 357.701494] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 2
[ 357.701496] {3}[Hardware Error]: event severity: info
[ 357.701508] {3}[Hardware Error]: precise tstamp: 2017-07-26 16:00:12
[ 357.701510] {3}[Hardware Error]: Error 0, type: info
[ 357.701513] {3}[Hardware Error]: section_type: ARM processor error
[ 357.701515] {3}[Hardware Error]: MIDR: 0x00000000510f8000
[ 357.701518] {3}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x0000000000000000
[ 357.701520] {3}[Hardware Error]: error affinity level: 2
[ 357.701522] {3}[Hardware Error]: running state: 0x1
[ 357.701524] {3}[Hardware Error]: Power State Coordination Interface state: 0
[ 357.701527] {3}[Hardware Error]: Error info structure 0:
[ 357.701529] {3}[Hardware Error]: num errors: 1
[ 357.701531] {3}[Hardware Error]: first error captured
[ 357.701533] {3}[Hardware Error]: last error captured
[ 357.701535] {3}[Hardware Error]: error_type: 0, cache error
[ 357.701538] {3}[Hardware Error]: error_info: 0x0000000000c20058
ubuntu@null-8cfdf006a3ef:~$
ubuntu@null-8cfdf006a3ef:~$
[ 403.857832] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 403.866103] {4}[Hardware Error]: event severity: fatal
[ 403.871244] {4}[Hardware Error]: precise tstamp: 2017-07-26 16:01:18
[ 403.877690] {4}[Hardware Error]: Error 0, type: fatal
[ 403.882831] {4}[Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
[ 403.891445] {4}[Hardware Error]: section length: 0x238
[ 403.896762] {4}[Hardware Error]: 00000000: 4d415201 4d492031 453a4d45 555f4343 .RAM1 IMEM:ECC_U
[ 403.905721] {4}[Hardware Error]: 00000010: 53515f45 44525f42 00000000 00000000 E_QSB_RD........
[ 403.914682] {4}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
[ 403.923644] {4}[Hardware Error]: 00000030: 00000000 00000000 01010000 01010000 ................
[ 403.932605] {4}[Hardware Error]: 00000040: 00000000 00000000 00000005 00000000 ................
[ 403.941566] {4}[Hardware Error]: 00000050: 02020000 00000000 00000001 00c6c600 ................
[ 403.950531] {4}[Hardware Error]: 00000060: 00000000 00000000 00000000 00000000 ................
[ 403.959489] {4}[Hardware Error]: 00000070: 00000000 00000000 00000000 00000000 ................
[ 403.968450] {4}[Hardware Error]: 00000080: 00000000 00000000 00000000 00000000 ................
[ 403.977413] {4}[Hardware Error]: 00000090: 00000000 00000000 00000000 00000000 ................
[ 403.986374] {4}[Hardware Error]: 000000a0: 00000000 00000000 00000000 00000000 ................
[ 403.995339] {4}[Hardware Error]: 000000b0: 00000000 00000000 00000000 00000000 ................
[ 404.004302] {4}[Hardware Error]: 000000c0: 00000000 00000000 00000000 00000000 ................
[ 404.013263] {4}[Hardware Error]: 000000d0: 00000000 00000000 00000000 00000000 ................
[ 404.022223] {4}[Hardware Error]: 000000e0: 00000000 00000000 00000000 00000000 ................
[ 404.031183] {4}[Hardware Error]: 000000f0: 00000000 00000000 00000000 00000000 ................
[ 404.040143] {4}[Hardware Error]: 00000100: 00000000 00000000 00000000 00000000 ................
[ 404.049104] {4}[Hardware Error]: 00000110: 00000000 00000000 00000000 00000000 ................
[ 404.058064] {4}[Hardware Error]: 00000120: 00000000 00000000 00000000 00000000 ................
[ 404.067025] {4}[Hardware Error]: 00000130: 00000000 00000000 00000000 00000000 ................
[ 404.075986] {4}[Hardware Error]: 00000140: 00000000 00000000 00000000 00000000 ................
[ 404.084946] {4}[Hardware Error]: 00000150: 00000000 00000000 00000000 00000000 ................
[ 404.093907] {4}[Hardware Error]: 00000160: 00000000 00000000 00000000 00000000 ................
[ 404.102867] {4}[Hardware Error]: 00000170: 00000000 00000000 00000000 00000000 ................
[ 404.111828] {4}[Hardware Error]: 00000180: 00000000 00000000 00000000 00000000 ................
[ 404.120788] {4}[Hardware Error]: 00000190: 00000000 00000000 00000000 00000000 ................
[ 404.129752] {4}[Hardware Error]: 000001a0: 00000000 00000000 00000000 00000000 ................
[ 404.138710] {4}[Hardware Error]: 000001b0: 00000000 00000000 00000000 00000000 ................
[ 404.147673] {4}[Hardware Error]: 000001c0: 00000000 00000000 00000000 00000000 ................
[ 404.156632] {4}[Hardware Error]: 000001d0: 00000000 00000000 00000000 00000000 ................
[ 404.165593] {4}[Hardware Error]: 000001e0: 00000000 00000000 00000000 00000000 ................
[ 404.174555] {4}[Hardware Error]: 000001f0: 00000000 00000000 00000000 00000000 ................
[ 404.183516] {4}[Hardware Error]: 00000200: 00000000 00000000 00000000 00000000 ................
[ 404.192476] {4}[Hardware Error]: 00000210: 00000000 00000000 00000000 00000000 ................
[ 404.201438] {4}[Hardware Error]: 00000220: 00000000 00000000 00000000 00000000 ................
[ 404.210398] {4}[Hardware Error]: 00000230: 00000000 00000000 ........
[ 404.218665] Kernel panic - not syncing: Fatal hardware error!
[ 404.224406] CPU: 0 PID: 217 Comm: kworker/0:1 Not tainted 4.10.0-29-generic #33~lp1706141+build.2-Ubuntu
[ 404.233876] Hardware name: Qualcomm Qualcomm Centriq(TM) 2400 Development Platform/ABW|SYS|CVR,1DPC|V3 , BIOS XBL.DF.2.0.R1-00512 QDF2400_REL CR
[ 404.247695] Workqueue: kacpi_notify acpi_os_execute_deferred
[ 404.253347] Call trace:
[ 404.255790] [<ffff1e8f9e08b078>] dump_backtrace+0x0/0x2b0
[ 404.261182] [<ffff1e8f9e08b34c>] show_stack+0x24/0x30
[ 404.266230] [<ffff1e8f9e4da5e0>] dump_stack+0x9c/0xbc
[ 404.271276] [<ffff1e8f9e208620>] panic+0x140/0x2b0
[ 404.276061] [<ffff1e8f9e5ef8e0>] ghes_proc+0x1d8/0x568
[ 404.281191] [<ffff1e8f9e5efcb4>] ghes_notify_sci+0x44/0x70
[ 404.286670] [<ffff1e8f9e0f6424>] notifier_call_chain+0x5c/0xa0
[ 404.292495] [<ffff1e8f9e0f6970>] __blocking_notifier_call_chain+0x58/0xa0
[ 404.299274] [<ffff1e8f9e0f69f4>] blocking_notifier_call_chain+0x3c/0x50
[ 404.305883] [<ffff1e8f9e5ea09c>] acpi_hed_notify+0x24/0x30
[ 404.311361] [<ffff1e8f9e5b1710>] acpi_device_notify+0x30/0x40
[ 404.317101] [<ffff1e8f9e5c8204>] acpi_ev_notify_dispatch+0x4c/0x70
[ 404.323274] [<ffff1e8f9e5ac2e4>] acpi_os_execute_deferred+0x24/0x38
[ 404.329535] [<ffff1e8f9e0ed330>] process_one_work+0x158/0x478
[ 404.335273] [<ffff1e8f9e0ed6a0>] worker_thread+0x50/0x4a8
[ 404.340665] [<ffff1e8f9e0f47a8>] kthread+0x108/0x138
[ 404.345622] [<ffff1e8f9e0838a0>] ret_from_fork+0x10/0x30
[ 404.350934] SMP: stopping secondary CPUs
[ 404.356117] Starting crashdump kernel...
[ 404.360034] Bye!

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1706141

Title:
[ARM64] config EDAC_GHES=y depends on EDAC_MM_EDAC=y

Status in linux package in Ubuntu:
Incomplete

Bug description:
[Impact]
In the Zesty and Artful configs, EDAC_MM_EDAC is set to =m, which disables EDAC_GHES. Customers using RAS on ARM64 may want this functionality. According to a RAS expert at QTI:
EDAC_GHES is essential for ARMv8.0 server systems, as it enables firmware-first handling of memory and CPU errors. Because ARMv8.0 systems lack a standard RAS architecture (or an equivalent of the machine check architecture), APEI/GHES is the only mechanism available for reporting hardware errors (e.g. memory and CPU errors). It enables reporting of hardware errors and also helps enable memory fault recovery mechanisms that extend the life of the system by offlining pages when recoverable uncorrected errors are encountered. Note that other ARM vendors will be going in this direction for hardware error handling.

[Test]
Test kernel available in https://launchpad.net/~centriq-team/+archive/ubuntu/lp1706141

Boot the kernel and check dmesg for the following:

$ dmesg | grep -i -E "edac|hest|ghes"
[ 0.000000] ACPI: HEST 0x0000000009160000 000288 (v01 QCOM QDF2400 00000001 INTL 20150515)
[ 0.620278] HEST: Table parsing has been initialized.
[ 4.178298] EDAC MC: Ver: 3.0.0
[ 5.664499] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[ 5.673371] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[ 5.681542] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[ 5.689972] ghes_edac: If you find incorrect reports, please contact your hardware vendor
[ 5.698142] ghes_edac: to correct its BIOS.
[ 5.702320] ghes_edac: This system has 12 DIMM sockets.
[ 5.707717] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.717264] EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.726806] EDAC MC2: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.736344] EDAC MC3: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.745883] EDAC MC4: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 5.755469] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.

[Fix]
1. Apply the RAS patch series submitted for SRU in Bug #1696570.
2. Set config option EDAC_MM_EDAC=y for ARM64; this will automatically set EDAC_GHES=y.
3. Remove edac_core from debian.master/abi/<ver>/arm64/generic.modules.

[Regression Potential]
The config change is limited to the ARM64 architecture and does not impact any other architecture. Potential for regressions is low.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1706141/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
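
To check whether a given kernel config satisfies step 2 of the [Fix] list above, something like the following could be used. This is only a sketch: the helper names `kconfig_values` and `ghes_edac_builtin` are hypothetical, the sample config snippets are illustrative rather than taken from a real build, and the code only inspects text passed to it (on Ubuntu the running kernel's config is normally at /boot/config-$(uname -r)).

```python
def kconfig_values(config_text):
    """Parse 'CONFIG_FOO=value' lines from kernel config text into a dict."""
    values = {}
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as '# CONFIG_FOO is not set' and blank lines.
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            values[key] = value
    return values

def ghes_edac_builtin(config_text):
    """True if both EDAC_MM_EDAC and EDAC_GHES are built in (=y)."""
    v = kconfig_values(config_text)
    return v.get("CONFIG_EDAC_MM_EDAC") == "y" and v.get("CONFIG_EDAC_GHES") == "y"

# Illustrative snippets only, not copied from an actual Ubuntu config.
before_fix = "CONFIG_EDAC=y\nCONFIG_EDAC_MM_EDAC=m\n"
after_fix = "CONFIG_EDAC=y\nCONFIG_EDAC_MM_EDAC=y\nCONFIG_EDAC_GHES=y\n"

print("before:", ghes_edac_builtin(before_fix))
print("after: ", ghes_edac_builtin(after_fix))
```

A config that passes this check should then produce the ghes_edac lines shown in the [Test] section at boot.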