Re: [PATCH v4] PCI: Prevent power state transition of erroneous device

Raag Jadav Sat, 31 May 2025 09:59:40 -0700

On Fri, May 23, 2025 at 05:23:10PM +0200, Rafael J. Wysocki wrote:
> On Wed, May 21, 2025 at 1:27 PM Rafael J. Wysocki <raf...@kernel.org> wrote:
> > On Wed, May 21, 2025 at 10:54 AM Raag Jadav <raag.ja...@intel.com> wrote:
> > > On Tue, May 20, 2025 at 01:56:28PM -0500, Mario Limonciello wrote:
> > > > On 5/20/2025 1:42 PM, Raag Jadav wrote:
> > > > > On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello wrote:
> > > > > > On 5/20/2025 12:22 PM, Denis Benato wrote:
> > > > > > > On 5/20/25 17:49, Mario Limonciello wrote:
> > > > > > > > On 5/20/2025 10:47 AM, Raag Jadav wrote:
> > > > > > > > > On Tue, May 20, 2025 at 10:23:57AM -0500, Mario Limonciello wrote:
> > > > > > > > > > On 5/20/2025 4:48 AM, Raag Jadav wrote:
> > > > > > > > > > > On Mon, May 19, 2025 at 11:42:31PM +0200, Denis Benato wrote:
> > > > > > > > > > > > On 5/19/25 12:41, Raag Jadav wrote:
> > > > > > > > > > > > > On Mon, May 19, 2025 at 03:58:08PM +0530, Raag Jadav wrote:
> > > > > > > > > > > > > > If error status is set on an AER capable device, most likely either the
> > > > > > > > > > > > > > device recovery is in progress or has already failed. Neither of the
> > > > > > > > > > > > > > cases are well suited for power state transition of the device, since
> > > > > > > > > > > > > > this can lead to unpredictable consequences like resume failure, or in
> > > > > > > > > > > > > > worst case the device is lost because of it. Leave the device in its
> > > > > > > > > > > > > > existing power state to avoid such issues.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Raag Jadav <raag.ja...@intel.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > v2: Synchronize AER handling with PCI PM (Rafael)
> > > > > > > > > > > > > > v3: Move pci_aer_in_progress() to pci_set_low_power_state() (Rafael)
> > > > > > > > > > > > > >         Elaborate "why" (Bjorn)
> > > > > > > > > > > > > > v4: Rely on error status instead of device status
> > > > > > > > > > > > > >         Condense comment (Lukas)
> > > > > > > > > > > > > Since pci_aer_in_progress() is changed I've not included Rafael's tag with
> > > > > > > > > > > > > my understanding of this needing a revisit. If this was a mistake, please
> > > > > > > > > > > > > let me know.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Denis, Mario, does this fix your issue?
> > > > > > > > > > > > >
> > > > > > > > > > > > Hello,
> > > > > > > > > > > >
> > > > > > > > > > > > Unfortunately no. I have prepared a dmesg log but had to remove the
> > > > > > > > > > > > boot sequence because it was a few kB too long for pastebin:
> > > > > > > > > > > > https://pastebin.com/1uBEA1FL
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the test. It seems there's no hotplug event this time around
> > > > > > > > > > > and endpoint device is still intact without any PCI related failure.
> > > > > > > > > > >
> > > > > > > > > > > Also,
> > > > > > > > > > >
> > > > > > > > > > > amdgpu 0000:09:00.0: PCI PM: Suspend power state: D3hot
> > > > > > > > > > >
> > > > > > > > > > > Which means whatever you're facing is either not related to this patch,
> > > > > > > > > > > or at best exposed some nasty side-effect that's not handled correctly
> > > > > > > > > > > by the driver.
> > > > > > > > > > >
> > > > > > > > > > > I'd say amdgpu folks would be of better help for your case.
> > > > > > > > > > >
> > > > > > > > > > > Raag
> > > > > > > > > >
> > > > > > > > > > So according to the logs Denis shared with v4
> > > > > > > > > > (https://pastebin.com/1uBEA1FL) the GPU should have been going to BOCO. This
> > > > > > > > > > stands for "Bus off Chip Off"
> > > > > > > > > >
> > > > > > > > > > amdgpu 0000:09:00.0: amdgpu: Using BOCO for runtime pm
> > > > > > > > > >
> > > > > > > > > > If it's going to D3hot - that's not going to be BOCO, it should be going to
> > > > > > > > > > D3cold.
> > > > > > > > >
> > > > > > > > > Yes, because upstream port is in D0 for some reason (might be this patch
> > > > > > > > > but not sure) and so will be the root port.
> > > > > > > > >
> > > > > > > > > pcieport 0000:07:00.0: PCI PM: Suspend power state: D0
> > > > > > > > > pcieport 0000:07:00.0: PCI PM: Skipped
> > > > > > > > >
> > > > > > > > > and my best guess is the driver is not able to cope with the lack of D3cold.
> > > > > > > >
> > > > > > > > Yes; if the driver is configured to expect BOCO (D3cold) and it
> > > > > > > > doesn't get it, chaos ensues.
> > > > > > > >
> > > > > > > > I guess let's double check the behavior with CONFIG_PCI_DEBUG to
> > > > > > > > verify this patch is what is changing that upstream port behavior.
> > > > > > >
> > > > > > >
> > > > > > > This is the very same exact kernel, minus the patch in question:
> > > > > > > https://pastebin.com/rwMYgG7C
> > > > > > >
> > > > > > >
> > > > > > > Both previous kernel and this one have CONFIG_PCI_DEBUG=y.
> > > > > > >
> > > > > > > Removed the initial bootup sequence to be able to use pastebin.
> > > > > >
> > > > > > Thanks - this confirms that the problem is the root port not going to D3.
> > > > > > This new log shows:
> > > > > >
> > > > > > pcieport 0000:07:00.0: PCI PM: Suspend power state: D3hot
> > > > > >
> > > > > > So I feel we should fixate on solving that.
> > > > >
> > > > > Which means what you're looking for is error flag being set somewhere in
> > > > > the hierarchy that is preventing suspend.
> > > >
> > > > Is the issue perhaps that this is now gated on both correctable and
> > > > uncorrectable errors?
> > > >
> > > > Perhaps should *correctable errors* be emitted with a warning and the
> > > > *uncorrectable errors* be fatal?
> > >
> > > That'd be more or less in line with hiding the issue, and it can also race
> > > with err_handler callback if driver has registered it.
> > >
> > > > > But regardless of it, my understanding is that root port suspend depends
> > > > > on a lot of factors (now errors flags being one of them with this patch)
> > > > > and endpoint driver can't possibly enforce or guarantee it - the best it
> > > > > can do is try.
> > > > >
> > > > > What's probably needed is D3cold failure handling on driver side, but I'm
> > > > > no PCI PM expert and perhaps Rafael can comment on it.
> > > > >
> > > > > Raag
> > > >
> > > > From the driver perspective it does have expectations that the parts outside
> > > > the driver did the right thing.  If the driver was expecting the root port
> > > > to be powered down at suspend and it wasn't, there are hardware components
> > > > that didn't power cycle and that's what we're seeing here.
> > >
> > > Which means the expectation set by the driver is the opposite of the
> > > purpose of this patch, and it's going to fail if any kind of error is
> > > detected under root port during suspend.
> >
> > And IMV this driver's expectation is questionable at least.
> >
> > There is no promise whatsoever that the device will always be put into
> > D3cold during system suspend.
> 
> For instance, user space may disable D3cold for any PCI device via the
> d3cold_allowed attribute in sysfs.
> 
> If the driver cannot handle this, it needs to be fixed.
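
For reference, the attribute mentioned above can be inspected from user
space. This is only a minimal sketch, assuming the usual sysfs layout
(/sys/bus/pci/devices/<BDF>/d3cold_allowed). The helper name and the
sysfs_root parameter are hypothetical and exist only for illustration;
the device address is the one from the logs in this thread.

```python
from pathlib import Path
from typing import Optional


def d3cold_allowed(bdf: str, sysfs_root: str = "/sys") -> Optional[bool]:
    """Return True/False for the device's d3cold_allowed attribute.

    Returns None if the attribute does not exist (device absent or the
    attribute is not exposed). 'sysfs_root' is an illustrative parameter
    so the helper can be pointed at a fake tree for testing.
    """
    attr = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "d3cold_allowed"
    try:
        # The attribute holds "1" when D3cold is allowed and "0" otherwise.
        return attr.read_text().strip() == "1"
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    # Device address taken from the amdgpu log lines quoted in this thread.
    print(d3cold_allowed("0000:09:00.0"))
```

A driver that assumes D3cold would see different behavior on a system
where this reads 0, which is the case Rafael describes.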


Thanks for confirming. So should we consider this patch to be valid
and worth moving forward?

Raag
