AER: Introduce ratelimit for error logs

Ilpo Järvinen Wed, 21 May 2025 03:05:37 -0700

On Tue, 20 May 2025, Bjorn Helgaas wrote:

> On Tue, May 20, 2025 at 02:55:32PM +0300, Ilpo Järvinen wrote:
> > On Mon, 19 May 2025, Bjorn Helgaas wrote:
> > 
> > > From: Jon Pan-Doh <pan...@google.com>
> > > 
> > > Spammy devices can flood kernel logs with AER errors and slow/stall
> > > execution. Add per-device ratelimits for AER correctable and uncorrectable
> > > errors that use the kernel defaults (10 per 5s).
> > > 
> > > There are two AER logging entry points:
> > > 
> > >   - aer_print_error() is used by DPC and native AER
> > > 
> > >   - pci_print_aer() is used by GHES and CXL
> > > 
> > > The native AER aer_print_error() case includes a loop that may log details
> > > from multiple devices.  This is ratelimited by the union of ratelimits for
> > > these devices, set by add_error_device(), which collects the devices.  If
> > > no such device is found, the Error Source message is ratelimited by the
> > > Root Port or RCEC that received the ERR_* message.
> > > 
> > > The DPC aer_print_error() case is currently not ratelimited.
> > > 
> > > The GHES and CXL pci_print_aer() cases are ratelimited by the Error Source
> > > device.
> 
> > >  static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
> > >  {
> > > + /*
> > > +  * Ratelimit AER log messages.  Generally we add the Error Source
> > > +  * device, but there are is_error_source() cases that can result in
> > > +  * multiple devices being added here, so we OR them all together.
> > 
> > I can see the code uses OR ;-) but that wasn't helpful because this comment
> > doesn't explain why at all. As this ratelimit thing is using reverse logic
> > to begin with, this is a very tricky bit.
> > 
> > Perhaps something less vague like:
> > 
> > ... we ratelimit if all devices have reached their ratelimit.
> > 
> > Assuming that was the intention here? (I'm not sure.)
> 
> My intention was that if there's any downstream device that has an
> unmasked error logged and it has not reached its ratelimit, we should
> log messages for all devices with errors logged.  Does something like
> this help?
> 
>   /*
>    * Ratelimit AER log messages.  "dev" is either the source
>    * identified by the root's Error Source ID or it has an unmasked
>    * error logged in its own AER Capability.  If any of these devices
>    * has not reached its ratelimit, log messages for all of them.
>    * Messages are emitted when e_info->ratelimit is non-zero.
>    *
>    * Note that e_info->ratelimit was already initialized to 1 for the
>    * ERR_FATAL case.
>    */


Yes, this makes the intent much clearer, thanks.
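
To make the "log for all if any device is under its limit" semantics
concrete, here is a minimal userspace sketch. It is not the kernel code:
struct toy_ratelimit, toy_ratelimit_ok(), struct toy_err_info and
toy_add_error_device() are hypothetical stand-ins for the per-device
ratelimit state, aer_ratelimit(), struct aer_err_info and
add_error_device(), and time is passed in explicitly instead of being read
from jiffies.

```c
#include <assert.h>

/* Hypothetical stand-in for a per-device ratelimit state. */
struct toy_ratelimit {
	long interval;	/* window length, in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages emitted in the current window */
};

/* Return 1 if a message may be emitted at time 'now', 0 if suppressed. */
static int toy_ratelimit_ok(struct toy_ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst >= 0);

	if (now - rl->begin >= rl->interval) {
		/* The window expired: start a new one. */
		rl->begin = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}
	return 0;
}

/* Hypothetical stand-in for the relevant part of struct aer_err_info. */
struct toy_err_info {
	int ratelimit;		/* non-zero: emit messages for all devices */
	int error_dev_num;	/* number of devices collected so far */
};

/*
 * Collect one device.  Every device is charged against its own limit,
 * and the results are ORed: one device that is still under its limit is
 * enough to enable logging for the whole set.
 */
static void toy_add_error_device(struct toy_err_info *info,
				 struct toy_ratelimit *dev_rl, long now)
{
	info->ratelimit |= toy_ratelimit_ok(dev_rl, now);
	info->error_dev_num++;
}
```

In this model, pre-setting info.ratelimit to 1 (the ERR_FATAL case in the
comment above) means the OR can never clear it, so those messages are
always emitted.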

> The ERR_FATAL case is from this post-v6 change that I haven't posted
> yet:
> 
>   aer_isr_one_error(...)
>   {
>     ...
>     if (status & PCI_ERR_ROOT_UNCOR_RCV) {
>       int fatal = status & PCI_ERR_ROOT_FATAL_RCV;
>       struct aer_err_info e_info = {
>         ...
>  +      .ratelimit = fatal ? 1 : 0,
> 
> 
> > > +  */
> > >   if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> > >           e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> > > +         e_info->ratelimit |= aer_ratelimit(dev, e_info->severity);
> > >           e_info->error_dev_num++;
> > >           return 0;
> > >   }
> 
> > > @@ -1147,9 +1183,10 @@ static void aer_recover_work_func(struct work_struct *work)
> > >           pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
> > >                                              entry.devfn);
> > >           if (!pdev) {
> > > -                 pr_err("no pci_dev for %04x:%02x:%02x.%x\n",
> > > -                        entry.domain, entry.bus,
> > > -                        PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
> > > +                 pr_err_ratelimited("%04x:%02x:%02x.%x: no pci_dev found\n",
> > 
> > This case was not mentioned in the changelog.
> 
> Sharp eyes!  What do you think of this commit log text?
> 
>   The CXL pci_print_aer() case is ratelimited by the Error Source device.
> 
>   The GHES pci_print_aer() case is via aer_recover_work_func(), which
>   searches for the Error Source device.  If the device is not found, there's
>   no per-device ratelimit, so we use a system-wide ratelimit that covers all
>   error types (correctable, non-fatal, and fatal).

Works for me as long as it is mentioned.
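
For readers following along, the fallback described in that text can be
modelled roughly as below. This is only a userspace sketch under stated
assumptions: struct toy_dev, toy_global_rl and toy_report_allowed() are
made-up names, and the single shared limiter merely mimics the shared
state behind a pr_err_ratelimited() call site, with the kernel defaults
(10 per 5s) assumed for it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ratelimit state; time is passed in as plain seconds. */
struct toy_ratelimit {
	long interval;	/* window length, in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* messages emitted in the current window */
};

/* Return 1 if a message may be emitted at time 'now', 0 if suppressed. */
static int toy_ratelimit_ok(struct toy_ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst >= 0);

	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}
	return 0;
}

/* Hypothetical device carrying its own limiter. */
struct toy_dev {
	struct toy_ratelimit rl;
};

/*
 * One system-wide limiter for reports whose device cannot be found.  It
 * is shared by every error type, since there is no device to hold
 * separate correctable/uncorrectable state.
 */
static struct toy_ratelimit toy_global_rl = { .interval = 5, .burst = 10 };

/* Use the device's limiter when there is a device, else the shared one. */
static int toy_report_allowed(struct toy_dev *dev, long now)
{
	struct toy_ratelimit *rl = (dev != NULL) ? &dev->rl : &toy_global_rl;

	return toy_ratelimit_ok(rl, now);
}
```

A consequence visible in the sketch is that unrelated missing devices
compete for the same budget, which is presumably acceptable if the
"no pci_dev" case is as rare as hoped.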

> This isn't really ideal because in pci_print_aer(), the struct
> aer_capability_regs has already been filled by firmware and the
> logging doesn't read any registers from the device at all.
> 
> However, pci_print_aer() *does* want the pci_dev for statistics and
> tracing (pci_dev_aer_stats_incr()) and, of course, for the aer_printks
> themselves.

While not a perfect solution, this looks like yet another case where it
would help to create a dummy pci_dev struct with minimal setup, which
allows calling functions that take a pci_dev.

That solution is not perfect because it arms a trap. Downstream
functions could get changed, and if the developer assumes they have a full
pci_dev at hand, it could cause issues with the dummy pci_dev. How likely
that is to happen is debatable, but in the many cases where the call chain
isn't overly complex, such as here, a dummy pci_dev seems helpful.

> We could leave this pr_err() completely alone; hopefully it's a rare
> case.  I think the CXL path just silently skips pci_print_aer() if
> this happens.
> 
> Eventually I would really like the native AER path to start by doing
> whatever firmware is doing, e.g., fill in struct aer_capability_regs,
> so the core of the AER handling could be identical between native AER
> and GHES/CXL.  If we could do that, maybe we could figure out a
> cleaner way to handle this corner case.


-- 
 i.

Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
