** Also affects: linux (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: intel
       Status: New => Fix Released

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1801254

Title:
  [AEP]EDAC may report the wrong DIMM when patrol scrubber finds an error

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  Description

  Facebook reported that on Broadwell systems EDAC sometimes reports the
  wrong DIMM for a memory error found by the patrol scrubber. The issue is
  rooted in hardware that, in this case, only provides a 4KB page-aligned
  address for the error. The EDAC driver therefore points at the DIMM that
  matches offset 0x0 in the 4KB page, but because memory is interleaved
  across channels and ranks, the DIMM actually involved may be a different
  one if the error is in some other cache line within the page.

  Fix:

  We can't actually get EDAC to point to the right DIMM because we don't
  know the offset within the page. But we should fix EDAC to say "I don't
  know" instead of pointing to the wrong DIMM.

  We can check the MCi_MISC register to know whether the address was
  cache-line aligned or page aligned. Bits 5:0 give the least significant
  valid bit of the logged address. So a value of 6 means cache-line aligned
  (8 on Optane DC equipped systems, which bundle 4 processor cache lines
  into a single Optane DC cache line). It will be 12 (4KB page aligned)
  for errors reported by the patrol scrubber.

  Once we know we have a problem, we should see how much information we
  can provide just from the "mce" structure passed to the EDAC driver:

  1) The socket. We can get this from m->extcpu (the CMCI from the patrol
     scrubber will have been delivered to a logical CPU on the same socket).
  2) The memory controller number. I think m->bank will tell us this.
     Need to check in the EDS for Ivy Bridge, Haswell and Broadwell.
  3) The channel number. The low bits of MCi_STATUS.MCACOD should provide
     this.
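  The MCi_MISC check and the coarse fallback described above can be
  sketched in C. This is a minimal, hypothetical userspace sketch, not the
  driver code from the commit: struct mce_info is a simplified stand-in for
  a few fields of the kernel's struct mce, and the socket and memory
  controller numbers are passed in by the caller because the CPU-to-socket
  and bank-to-controller mappings are platform-specific assumptions (the
  latter still needs checking against the EDS). Bit positions follow the
  Intel SDM: ADDRV is MCi_STATUS bit 58, MISCV is bit 59, and memory
  controller MCACODs have the form 000F 0000 1MMM CCCC with the channel in
  CCCC (0xF meaning "not specified").

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified subset of the fields found in the kernel's struct mce. */
struct mce_info {
	uint64_t status;  /* MCi_STATUS */
	uint64_t misc;    /* MCi_MISC */
	uint64_t addr;    /* MCi_ADDR */
	unsigned bank;    /* MCA bank that logged the error */
	unsigned extcpu;  /* logical CPU that took the CMCI */
};

#define STATUS_ADDRV (1ULL << 58) /* MCi_ADDR holds a valid address */
#define STATUS_MISCV (1ULL << 59) /* MCi_MISC holds valid data */

#define LSB_CACHE_LINE  6 /* 64-byte processor cache line */
#define LSB_OPTANE_LINE 8 /* 256-byte Optane DC line (4 cache lines) */

/* MCi_MISC bits 5:0: lowest valid bit of MCi_ADDR (12 => 4KB page aligned). */
static unsigned addr_lsb(const struct mce_info *m)
{
	return (unsigned)(m->misc & 0x3f);
}

/*
 * Nonzero only if ADDRV and MISCV are both set and the address is at
 * least as fine-grained as max_lsb, i.e. it can be mapped to one DIMM.
 */
static int addr_is_precise(const struct mce_info *m, unsigned max_lsb)
{
	if (!(m->status & STATUS_ADDRV) || !(m->status & STATUS_MISCV))
		return 0;
	return addr_lsb(m) <= max_lsb;
}

/*
 * Channel from MCACOD (MCi_STATUS bits 15:0). Returns -1 if the code is
 * not a memory controller error (bit 12, the F bit, is ignored) or if the
 * channel field is 0xF.
 */
static int mcacod_channel(const struct mce_info *m)
{
	unsigned mcacod = (unsigned)(m->status & 0xffff);
	unsigned ch;

	if ((mcacod & 0xef80) != 0x0080)
		return -1;
	ch = mcacod & 0xf;
	return ch == 0xf ? -1 : (int)ch;
}

/*
 * "I don't know" report for an imprecise address: name the socket, memory
 * controller and channel, but not a DIMM. Returns snprintf's result.
 */
static int report_coarse(const struct mce_info *m, int socket, int mc,
			 char *buf, size_t len)
{
	int ch = mcacod_channel(m);

	assert(buf != NULL && len > 0);
	if (ch < 0)
		return snprintf(buf, len,
				"socket %d, MC %d, channel unknown, DIMM unknown",
				socket, mc);
	return snprintf(buf, len, "socket %d, MC %d, channel %d, DIMM unknown",
			socket, mc, ch);
}
```

  For a patrol scrubber error with MCi_MISC[5:0] = 12 and MCACOD 0x00c2
  (MMM = 100, memory scrubbing; CCCC = 2), addr_is_precise(m,
  LSB_CACHE_LINE) returns 0, so a driver following this approach would
  emit the coarse socket/controller/channel string instead of naming a
  DIMM.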

  Facebook said that for many of their systems the socket, memory
  controller and channel should be enough for them (as a lot of systems
  have only one DIMM populated per channel).

  Note that Skylake is allegedly unaffected, as its patrol scrubber should
  provide a cache-line-aligned address. We should test and confirm this.

  Commits: 8489b17ce29d9a35a36c08bbea93cdce4c98a6ad
  Target Kernel: 4.20
  Target Release: 19.04