>> > Reported-by: Shawn Fan <shawn....@intel.com>
>>
>> Interesting. What did Shawn report? (Closes:!).
>
> Tony or Shawn, could you please point me to the original report? Thanks!

The original report is internal to Intel, so there is no useful link for the community (but I still wanted to give credit).

Recap of the original problem: some BIOSes keep track of an error threshold per rank and use this GHES mechanism to report that the threshold has been exceeded on that rank. Systems that stay up a long time can accumulate enough soft errors to trigger this threshold. But taking a page offline isn't going to help. For a 4K page this is merely annoying. For a 1G page it can mess things up badly.

My original patch for this just skipped the GHES->offline process for huge pages. But I wasn't aware of the sysctl control. That provides a better solution.

-Tony