** Description changed:

  [Impact]
  rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.

  [Test Case]
  On an AMD SMCA-capable system:

  #!/bin/bash
  modprobe mce-inject
  EINJ=/sys/kernel/debug/mce-inject
  # See /sys/kernel/debug/mce-inject/README
  echo hw > $EINJ/flags
  echo 0x9c2030000000011b > $EINJ/status
  echo 0x040000035dd8bfc0 > $EINJ/addr
  echo 0x0000c2030b404000 > $EINJ/synd
  echo 0 > $EINJ/bank
  # Wait for MCE to appear in dmesg
  sudo ras-mc-ctl --errors

  There should be a new MCE event in the output:

  1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10
-
  For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.

  git clone https://github.com/andikleen/mce-inject
  cd mce-inject
  make
  sudo ./mce-inject < test/corrected
  sudo ras-mc-ctl --errors

  No Memory errors.

  No PCIe AER errors.

  No Extlog errors.

  MCE events:
  1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
  2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002
-
  [Regression Risk]
- The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms.
+ The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.
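
As a cross-check on the test case, the injected status values can be decoded by hand to see why rasdaemon reports "Deferred error" and "mci UECC" on the AMD system and "Corrected_error Error_enabled" on Skylake. The sketch below is not part of rasdaemon; the helper names mci_bit and decode_mci_status are made up for illustration. Bits 57-63 are the common x86 MCA status flags; the positions used for SYNDV (53), CECC (46), UECC (45) and DEFERRED (44) are my understanding of the AMD SMCA layout and should be checked against the processor documentation.

```shell
#!/bin/bash
# Hypothetical helpers (not part of rasdaemon): decode a few MCA status flags.

# mci_bit STATUS BIT: print 1 if BIT is set in the 64-bit hex STATUS, else 0.
# The value is split into two 32-bit halves so the result does not depend on
# how the shell treats numbers above the signed 64-bit range.
mci_bit() {
    hex=$(printf '%16s' "${1#0x}" | tr ' ' '0')   # left-pad to 16 hex digits
    hi=$(printf '%s' "$hex" | cut -c1-8)          # bits 63..32
    lo=$(printf '%s' "$hex" | cut -c9-16)         # bits 31..0
    if [ "$2" -ge 32 ]; then
        echo $(( (0x$hi >> ($2 - 32)) & 1 ))
    else
        echo $(( (0x$lo >> $2) & 1 ))
    fi
}

# decode_mci_status STATUS: print the names of the flags that are set.
decode_mci_status() {
    out=""
    # name:bit pairs; bits 53 and 44-46 assume the AMD SMCA layout
    for pair in VAL:63 OVER:62 UC:61 EN:60 MISCV:59 ADDRV:58 PCC:57 \
                SYNDV:53 CECC:46 UECC:45 DEFERRED:44; do
        name=${pair%%:*}
        bit=${pair##*:}
        if [ "$(mci_bit "$1" "$bit")" = 1 ]; then
            out="$out$name "
        fi
    done
    echo "${out% }"
}

decode_mci_status 0x9c2030000000011b   # VAL EN MISCV ADDRV SYNDV UECC DEFERRED
decode_mci_status 0x9400000000000000   # VAL EN ADDRV
```

The first value has UECC and DEFERRED set with UC clear, which is consistent with the "Deferred error ... mci UECC" line above. The second has only VAL, EN and ADDRV set, which is consistent with a corrected, enabled error.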
** Description changed:

  [Impact]
  rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.

  [Test Case]
  On an AMD SMCA-capable system:

  #!/bin/bash
  modprobe mce-inject
  EINJ=/sys/kernel/debug/mce-inject
  # See /sys/kernel/debug/mce-inject/README
  echo hw > $EINJ/flags
  echo 0x9c2030000000011b > $EINJ/status
  echo 0x040000035dd8bfc0 > $EINJ/addr
  echo 0x0000c2030b404000 > $EINJ/synd
  echo 0 > $EINJ/bank
  # Wait for MCE to appear in dmesg
  sudo ras-mc-ctl --errors

  There should be a new MCE event in the output:

  1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10

  For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.

  git clone https://github.com/andikleen/mce-inject
  cd mce-inject
  make
  sudo ./mce-inject < test/corrected
  sudo ras-mc-ctl --errors

  No Memory errors.

  No PCIe AER errors.

  No Extlog errors.

  MCE events:
  1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
  2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002

+ [Fix]
+ https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
+ https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
+ https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
+ https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33
+
  [Regression Risk]
  The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.

** Description changed:

  [Impact]
  rasdaemon does not know how to decode MCE events from various new platforms, making it difficult to interpret errors reported up from the platform.

  [Test Case]
  On an AMD SMCA-capable system:

  #!/bin/bash
  modprobe mce-inject
  EINJ=/sys/kernel/debug/mce-inject
  # See /sys/kernel/debug/mce-inject/README
  echo hw > $EINJ/flags
  echo 0x9c2030000000011b > $EINJ/status
  echo 0x040000035dd8bfc0 > $EINJ/addr
  echo 0x0000c2030b404000 > $EINJ/synd
  echo 0 > $EINJ/bank
  # Wait for MCE to appear in dmesg
  sudo ras-mc-ctl --errors

  There should be a new MCE event in the output:

  1 2020-04-13 19:19:55 +0000 error: Deferred error, no action required., CPU 2, bank Load Store Unit (bank=0), mcg mcgstatus=0, mci UECC, mcgcap=0x0000011c, status=0x9c2030000000011b, addr=0x35dd8bfc0, walltime=0x5e94bb5d, cpuid=0x00830f10

  For Skylake, I regression tested by using mce-test w/ the "corrected" test, as I'm not sure how to inject a Skylake-specific event there.

  git clone https://github.com/andikleen/mce-inject
  cd mce-inject
  make
  sudo ./mce-inject < test/corrected
  sudo ras-mc-ctl --errors

  No Memory errors.

  No PCIe AER errors.

  No Extlog errors.

  MCE events:
  1 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x0000abcd, walltime=0x5e950014, cpuid=0x00050654, bank=0x00000001
  2 2020-04-14 00:13:07 +0000 error: No Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x0f000814, status=0x9400000000000000, addr=0x00001234, walltime=0x5e950014, cpu=0x00000001, cpuid=0x00050654, apicid=0x00000002, bank=0x00000002

  [Fix]
  https://github.com/mchehab/rasdaemon/commit/b30a7fd4e5df8c4e61c7441f79e52d8f5f115974
  https://github.com/mchehab/rasdaemon/commit/a16ca0711001957ee98f2c124abce0fa1f801529
  https://github.com/mchehab/rasdaemon/commit/8704a85d8dc3483423ec2934fee8132f85f8fdb6
  https://github.com/mchehab/rasdaemon/commit/22f2d8bb1d1065dede59b73b148ad4b4e2177c33

  [Regression Risk]
- The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause an issue on these systems such as a crash in rasdaemon, etc. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.
+ The new code added should only run on the newly supported systems, so regressions should be restricted to those systems. On those systems, a bug in the decoding code could cause e.g. a crash in rasdaemon. That is mitigated by testing on those newly supported platforms. Note that one code path I could not exercise is the Hygon Dhyana support as I don't have that hardware - that patch is a trivial "do the same thing as AMD Zen", as it is a derivative platform.

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1871965

Title:
  new platform support: Intel SkyLake, AMD Scalable MCA

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/rasdaemon/+bug/1871965/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs