On Tue, Sep 16, 2025 at 03:20:49PM +0000, Luck, Tony wrote:
> >> > Reported-by: Shawn Fan <shawn....@intel.com>
> >>
> >> Interesting. What did Shawn report? (Closes:!).
> >
> > Tony or Shawn, could you please point me to the original report? Thanks!
>
> Original report is internal to Intel, so no useful link for the community (but
> I still wanted to give credit).
>
> Recap of original problem is that some BIOS keep track of error threshold
> per-rank and use this GHES mechanism to report threshold exceeded on
> the rank.
>
> Systems that stay up a long time can accumulate enough soft errors
> to trigger this threshold. But the action of taking a page offline isn't
> going to help. For a 4K page this is merely annoying. For 1G page
> it can mess things up badly.
>
> My original patch for this just skipped the GHES->offline process
> for huge pages. But I wasn't aware of the sysctl control. That provides
> a better solution.

Tony, does that mean you're OK with using the existing sysctl interface? If so,
I'll just send a separate patch to update the sysfs-memory-page-offline
documentation and drop the rest.

Thanks,
Kyle Meyer