** Also affects: linux (Ubuntu)
   Importance: Undecided
       Status: New

** Changed in: intel
       Status: New => Fix Released

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1801254

Title:
  [AEP]EDAC may report the wrong DIMM when patrol scrubber finds an
  error

Status in intel:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released

Bug description:
  Description
  Facebook reported that on Broadwell systems EDAC sometimes reports the
  wrong DIMM for a memory error found by the patrol scrubber.

  The issue is rooted in hardware that provides only a 4KB page-aligned
  address for the error in this case. This means that the EDAC driver
  will point at the DIMM matching offset 0x0 in the 4KB page, but
  because of interleaving across channels and ranks the actual DIMM
  involved may be different if the error is on some other cache line
  within the page.
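
  To illustrate why a page-aligned address is not enough, here is a toy
  model in Python. It assumes plain round-robin interleaving of 64-byte
  cache lines across channels. The real address decode on these
  platforms is more involved, and the names toy_channel and
  channels_in_page are hypothetical.

```python
CACHE_LINE = 64   # bytes per processor cache line
PAGE = 4096       # bytes per 4KB page


def toy_channel(addr, n_channels=3):
    """Toy decode (an assumption): consecutive cache lines rotate across channels."""
    return (addr // CACHE_LINE) % n_channels


def channels_in_page(page_addr, n_channels=3):
    """Return every channel that some cache line of the page maps to."""
    base = page_addr & ~(PAGE - 1)  # drop the offset within the page
    return {
        toy_channel(base + offset, n_channels)
        for offset in range(0, PAGE, CACHE_LINE)
    }
```

  In this model a single page touches every channel whenever more than
  one channel is interleaved, so decoding offset 0x0 alone picks an
  arbitrary one of them.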

  Fix: We can't actually get EDAC to point to the right DIMM because we
  don't know the offset within the page. But we should fix EDAC to say
  "I don't know" instead of pointing to the wrong DIMM.

  We can check the MCi_MISC register to know whether the address was
  cache-line aligned or page aligned. Bits 5:0 give the least
  significant bit that is valid. So a value of 6 is for cache line
  aligned (8 on Optane DC equipped systems that bundle 4 processor cache
  lines into a single Optane DC cache line). It will be 12 for patrol
  scrubber reported errors.
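
  The check described above can be sketched in Python as follows. This
  is only an illustration of the bit arithmetic, not the kernel patch.
  The function names are hypothetical, and the labels simply restate the
  values 6, 8 and 12 given in this description.

```python
MISC_ADDR_LSB_MASK = 0x3F  # bits 5:0 of MCi_MISC


def addr_lsb(misc):
    """Least significant valid bit of the reported address."""
    return misc & MISC_ADDR_LSB_MASK


def granularity_bytes(misc):
    """Size of the region the reported address can be narrowed down to."""
    return 1 << addr_lsb(misc)


def classify(misc):
    """Map the LSB value to the cases named in the bug description."""
    lsb = addr_lsb(misc)
    if lsb == 6:
        return "cache line"
    if lsb == 8:
        return "Optane DC cache line"
    if lsb == 12:
        return "page (patrol scrubber)"
    return "other"
```

  For example, a MISC value whose low six bits are 0x0C gives a
  granularity of 4096 bytes, which is the case where EDAC cannot name
  the DIMM.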

  Once we know we have a problem we should see how much information we
  can provide just from the "mce" structure passed to the EDAC driver.

  1) We can get the socket from looking at m->extcpu (the CMCI from the
  patrol scrubber will have been delivered to a logical CPU on the same
  socket)

  2) The memory controller number. I think m->bank will tell us this.
  Need to check in the EDS for IvyBridge, Haswell and Broadwell.

  3) The channel number. Low bits of MCi_STATUS.MCACOD should provide
  this.
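
  A rough Python sketch of step 3 and of the resulting partial report
  follows. It assumes that MCACOD is bits 15:0 of MCi_STATUS, that the
  channel is in the low four bits, and that the value 0xF means "channel
  not specified". Those widths are not stated in this report and should
  be checked against the SDM and EDS. The socket and bank are taken as
  inputs because mapping m->extcpu to a socket needs CPU topology data.
  All function names are hypothetical.

```python
MCACOD_MASK = 0xFFFF        # assumed: MCACOD is MCi_STATUS bits 15:0
CHANNEL_MASK = 0xF          # assumed: channel is the low 4 bits of MCACOD
CHANNEL_UNSPECIFIED = 0xF   # assumed: all ones means "channel not specified"


def channel_from_status(status):
    """Return the channel number, or None if the hardware did not give one."""
    channel = (status & MCACOD_MASK) & CHANNEL_MASK
    return None if channel == CHANNEL_UNSPECIFIED else channel


def partial_location(socket, bank, status):
    """Describe what is known without claiming a specific DIMM."""
    channel = channel_from_status(status)
    channel_text = "unknown" if channel is None else str(channel)
    return f"socket {socket}, bank {bank}, channel {channel_text}, DIMM unknown"
```

  With one DIMM populated per channel, a report such as "socket 0, bank
  7, channel 2, DIMM unknown" already identifies the part to replace.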

  Facebook said that for many of their systems this should be enough for
  them (as a lot of systems only have one DIMM populated per channel).

  Note that Skylake is allegedly unaffected as the patrol scrubber
  should provide a cacheline aligned address. We should test and confirm.

  Commits:
  8489b17ce29d9a35a36c08bbea93cdce4c98a6ad
  Target Kernel: 4.20
  Target Release: 19.04

To manage notifications about this bug go to:
https://bugs.launchpad.net/intel/+bug/1801254/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp