On 5/20/2025 1:42 PM, Raag Jadav wrote:
On Tue, May 20, 2025 at 12:39:12PM -0500, Mario Limonciello wrote:
On 5/20/2025 12:22 PM, Denis Benato wrote:
On 5/20/25 17:49, Mario Limonciello wrote:
On 5/20/2025 10:47 AM, Raag Jadav wrote:
On Tue, May 20, 2025 at 10:23:57AM -0500, Mario Limonciello wrote:
On 5/20/2025 4:48 AM, Raag Jadav wrote:
On Mon, May 19, 2025 at 11:42:31PM +0200, Denis Benato wrote:
On 5/19/25 12:41, Raag Jadav wrote:
On Mon, May 19, 2025 at 03:58:08PM +0530, Raag Jadav wrote:
If error status is set on an AER capable device, most likely either device
recovery is in progress or it has already failed. Neither case is well
suited for a power state transition of the device, since that can lead to
unpredictable consequences such as resume failure or, in the worst case,
loss of the device. Leave the device in its existing power state to avoid
such issues.
Signed-off-by: Raag Jadav <raag.ja...@intel.com>
---
v2: Synchronize AER handling with PCI PM (Rafael)
v3: Move pci_aer_in_progress() to pci_set_low_power_state() (Rafael)
Elaborate "why" (Bjorn)
v4: Rely on error status instead of device status
Condense comment (Lukas)
Since pci_aer_in_progress() has changed, I've not included Rafael's tag, on
the understanding that this needs a revisit. If that was a mistake, please
let me know.
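
For reference, a rough sketch of the kind of check described above (untested
illustration, not the actual diff; the exact register accesses and placement
in the real patch may differ):

#include <linux/pci.h>

/* Illustrative only: does the device report pending AER error status? */
static bool pci_aer_in_progress(struct pci_dev *dev)
{
	u32 uncor, cor;
	u16 aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);

	if (!aer)
		return false;

	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &uncor);
	pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS, &cor);

	/* Any pending error status means recovery is in progress or failed. */
	return uncor || cor;
}

The idea is that pci_set_low_power_state() skips the D-state transition and
leaves the device in its current power state when this returns true.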
Denis, Mario, does this fix your issue?
Hello,
Unfortunately no. I have prepared a dmesg, but had to remove the boot-up
portion because it was a few KB over the limit: https://pastebin.com/1uBEA1FL
Thanks for the test. It seems there's no hotplug event this time around and
the endpoint device is still intact, without any PCI-related failure.
Also,
amdgpu 0000:09:00.0: PCI PM: Suspend power state: D3hot
Which means whatever you're facing is either not related to this patch, or
at best it has exposed some nasty side effect that's not handled correctly
by the driver.
I'd say the amdgpu folks would be better able to help with your case.
Raag
So according to the logs Denis shared with v4
(https://pastebin.com/1uBEA1FL) the GPU should have been going to BOCO,
which stands for "Bus Off, Chip Off":
amdgpu 0000:09:00.0: amdgpu: Using BOCO for runtime pm
If it's going to D3hot, that's not BOCO; it should be going to D3cold.
Yes, because the upstream port is in D0 for some reason (might be this
patch, but I'm not sure), and so will be the root port.
pcieport 0000:07:00.0: PCI PM: Suspend power state: D0
pcieport 0000:07:00.0: PCI PM: Skipped
and my best guess is that the driver is not able to cope with the lack of D3cold.
Yes; if the driver is configured to expect BOCO (D3cold) and it doesn't get
it, chaos ensues.
I guess let's double-check the behavior with CONFIG_PCI_DEBUG to verify that
this patch is what is changing that upstream port behavior.
This is the exact same kernel, minus the patch in question:
https://pastebin.com/rwMYgG7C
Both the previous kernel and this one have CONFIG_PCI_DEBUG=y.
I removed the initial boot-up sequence to be able to use pastebin.
Thanks - this confirms that the problem is the root port not going to D3.
This new log shows:
pcieport 0000:07:00.0: PCI PM: Suspend power state: D3hot
So I feel we should focus on solving that.
Which means what you're looking for is an error flag set somewhere in the
hierarchy that is preventing suspend.
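
Something along these lines could help confirm which port it is (hypothetical
debug helper, not part of this patch; pci_report_aer_in_hierarchy() is a
made-up name):

#include <linux/pci.h>

/* Walk up from the endpoint and report any port with AER status bits set. */
static void pci_report_aer_in_hierarchy(struct pci_dev *dev)
{
	struct pci_dev *p;

	for (p = dev; p; p = pci_upstream_bridge(p)) {
		u32 status;
		u16 aer = pci_find_ext_capability(p, PCI_EXT_CAP_ID_ERR);

		if (!aer)
			continue;

		pci_read_config_dword(p, aer + PCI_ERR_UNCOR_STATUS, &status);
		if (status)
			pci_info(p, "uncorrectable AER status %#010x\n", status);

		pci_read_config_dword(p, aer + PCI_ERR_COR_STATUS, &status);
		if (status)
			pci_info(p, "correctable AER status %#010x\n", status);
	}
}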
Is the issue perhaps that this is now gated on both correctable and
uncorrectable errors?
Should *correctable errors* perhaps just emit a warning, while
*uncorrectable errors* are treated as fatal?
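
i.e. something like this (untested sketch, just to illustrate the idea; the
helper name is made up):

#include <linux/pci.h>

/* Warn on correctable errors, but only block suspend on uncorrectable ones. */
static bool pci_aer_blocks_suspend(struct pci_dev *dev)
{
	u32 status;
	u16 aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);

	if (!aer)
		return false;

	pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS, &status);
	if (status)
		pci_warn(dev, "correctable AER status %#010x set before suspend\n",
			 status);

	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &status);
	return !!status;
}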
But regardless of that, my understanding is that root port suspend depends
on a lot of factors (error flags now being one of them with this patch), and
the endpoint driver can't possibly enforce or guarantee it - the best it can
do is try.
What's probably needed is D3cold failure handling on the driver side, but
I'm no PCI PM expert, so perhaps Rafael can comment on it.
Raag
From the driver's perspective, it does have expectations that the parts
outside the driver did the right thing. If the driver was expecting the
root port to be powered down at suspend and it wasn't, then there are
hardware components that didn't power cycle, and that's what we're seeing here.