On Thu, Sep 18, 2025 at 1:34 AM Shuai Xue <xuesh...@linux.alibaba.com> wrote:
>
> On 2025/9/18 02:59, Kyle Meyer wrote:
> > On Wed, Sep 17, 2025 at 06:35:14AM +0000, Fan, Shawn wrote:
> >>>> My original patch for this just skipped the GHES->offline process
> >>>> for huge pages. But I wasn't aware of the sysctl control. That provides
> >>>> a better solution.
> >>>
> >>> Tony, does that mean you're OK with using the existing sysctl interface?
> >>> If so, I'll just send a separate patch to update the
> >>> sysfs-memory-page-offline documentation and drop the rest.
> >>
> >> Kyle,
> >>
> >> It depends on which camp the external customer that reported this
> >> falls into:
> >>
> >> 1) "I'm OK disabling all soft offline requests".
> >>
> >> or the:
> >>
> >> 2) "I'd like 4K pages to still go offline if the BIOS asks, just not any
> >> huge pages".
> >>
> >> Shawn: Can you please find out?
> >>
> >>
> >> -> Prefer the 2nd option, "4K pages still go offline if the BIOS asks,
> >> just not any huge pages."
> >
> > OK, thank you.
> >
> > Does that mean they want to avoid offlining transparent huge pages as well?
> >
> > Thanks,
> > Kyle Meyer
>
>
> Hi, Shawn,
>
> Memory access is typically interleaved between channels. When the
> per-rank threshold is exceeded, soft-offlining the last accessed address
> seems unreasonable - regardless of whether it's a 4KB page or a huge
> page. The error accumulation happens at the rank level, but the action
> is taken on a specific page that happened to trigger the threshold,
> which doesn't address the underlying issue.

Does that mean the soft offline action taken by the kernel is almost
useless from the hardware's point of view? Or is the information about
corrected errors that the kernel gets from firmware insufficient for the
kernel to do anything meaningful?

>
> I prefer the first option: disabling all soft offline requests from
> the GHES driver.
>
> Thanks.
> Shuai