On Thu, Sep 18, 2025 at 1:34 AM Shuai Xue <xuesh...@linux.alibaba.com> wrote:
>
>
>
> On 2025/9/18 02:59, Kyle Meyer wrote:
> > On Wed, Sep 17, 2025 at 06:35:14AM +0000, Fan, Shawn wrote:
> >>>> My original patch for this just skipped the GHES->offline process
> >>>> for huge pages. But I wasn't aware of the sysctl control. That provides
> >>>> a better solution.
> >>>
> >>> Tony, does that mean you're OK with using the existing sysctl interface? If
> >>> so, I'll just send a separate patch to update the sysfs-memory-page-offline
> >>> documentation and drop the rest.
> >>
> >> Kyle,
> >>
> >> It depends on which camp the external customer that reported this
> >> falls into:
> >>
> >> 1) "I'm OK disabling all soft offline requests".
> >>
> >> or the:
> >>
> >> 2) "I'd like 4K pages to still go offline if the BIOS asks, just not any
> >> huge pages".
> >>
> >> Shawn: Can you please find out?
> >>
> >>
> >> -> Prefer the 2nd option, "4K pages still go offline if the BIOS asks,
> >> just not any huge pages."
> >
> > OK, thank you.
> >
> > Does that mean they want to avoid offlining transparent huge pages as well?
> >
> > Thanks,
> > Kyle Meyer
>
>
> Hi, Shawn,
>
> Memory access is typically interleaved between channels, so when the
> per-rank threshold is exceeded, soft-offlining the last accessed address
> seems unreasonable, regardless of whether it's a 4KB page or a huge
> page. The error accumulation happens at the rank level, but the action
> is taken on the specific page that happened to trigger the threshold,
> which doesn't address the underlying issue.

Does that mean the soft offline action taken by the kernel is almost
useless from the hardware's point of view? Or is the current
signal/info about corrected errors that the kernel gets from firmware
insufficient for the kernel to do anything meaningful?

>
> I prefer the first option: disabling all soft offline requests from
> the GHES driver.
>
> Thanks.
> Shuai
