Re: [PATCH 0/4] pci: implement "pci=aer_panic"

Hans Zhang Thu, 22 May 2025 02:34:52 -0700



On 2025/5/22 00:17, Sathyanarayanan Kuppuswamy wrote:

On 5/21/25 7:54 AM, Hans Zhang wrote:
On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
On 5/19/25 7:41 AM, Hans Zhang wrote:
On 2025/5/19 22:21, Hans Zhang wrote:
On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
On 5/16/25 9:55 AM, Hans Zhang wrote:
The following series introduces a new kernel command-line option, aer_panic,
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.
Why would a device recovery failure lead to a system hang? In the worst case,
that device may not be accessible, right? Is there a real use case?
Dear Sathyanarayanan,
With Synopsys and Cadence PCIe IP, AER interrupts are usually SPI interrupts, not INTx/MSI/MSI-X interrupts. (Some customers design them as MSI/MSI-X interrupts, e.g. RK3588, but not all customers have designed it this way.) For example, when many Qualcomm mobile phone SoCs handle an AER interrupt and there is a link down, that is, a fatal problem occurs on the current PCIe physical link, the system cannot recover. At this point, a system restart is needed to solve the problem.
Our company's SoC design, http://radxa.com/products/orion/o6/, has five PCIe ports and has the same problem. If there is a problem with one of the PCIe ports, the entire system hangs. So I hope Linux can offer an option that lets SoC manufacturers choose to restart the system when a fatal hardware error occurs in PCIe.
There are also products such as mobile phones and tablets. We don't want to wait until the battery is completely drained before restarting them.
For the specific code of Qualcomm, please refer to the email I sent.
Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
    /* Clear AXI link-down status */
    cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
If there has been a link down on this PCIe port, the register CDNS_PCIE_AT_LINKDOWN must be set to 0 for AXI transmission to continue. This is different from Synopsys.
If CPU Core0 reaches the code at L52 while CPU Core1 is saving files to an NVMe SSD, then because the CDNS_PCIE_AT_LINKDOWN register is still 1, CPU Core1 is unable to send TLPs and hangs. This is a very extreme situation. (The current Cadence code is for the legacy PCIe IP; the HPA IP is still in the upstream process at present.)
Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/
It sounds like a system-level issue to me. Why not rely on the watchdog to reboot in
this case?
Dear Sathyanarayanan,
Thank you for your reply. Yes, personally, I also think it's a system-level problem. I ran a local test: when I directly unplugged the EP device from the slot, the system hung. I have tested this many times. Since we don't have a bus timeout response mechanism for PCIe, it hangs easily.
Any comment on why watchdog is not used to reboot the unresponsive system?


Dear Sathyanarayanan,

Thank you very much for your reply.

In my testing, the watchdog doesn't work properly every time. There might be other reasons causing the entire system to hang.

Even if you want to add this support, I think it is more appropriate to add it to your specific PCIe controller driver. I don't see why you want to add it as part of the generic
AER driver.
Because we want to use the processing logic of the generic AER driver. If the recovery is successful, there will be no problem. If the recovery fails, my original intention was to restart the system.
If added to the specific PCIe controller driver, a lot of repetitive AER processing logic would have to be written. So I was wondering whether the AER driver could be changed to be built as a loadable kernel module (.ko).
Maybe you can rely on the err handler callbacks to get notified of fatal errors, or you can even use a uevent handler to detect the disconnected device event and handle it there.
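
For illustration only, the uevent route could look roughly like the user-space
sketch below. It is not part of the posted series. It assumes the kernel
broadcasts uevents on a NETLINK_KOBJECT_UEVENT socket (multicast group 1) as a
sequence of NUL-terminated "KEY=VALUE" strings, and that a removed PCI device
shows up with ACTION=remove and SUBSYSTEM=pci. The function names
(uevent_get, is_pci_remove_event, open_uevent_socket) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/*
 * Look up `key` in a raw uevent message of `len` bytes.
 * The message is a run of NUL-terminated strings: an "ACTION@DEVPATH"
 * header followed by "KEY=VALUE" entries. Returns a pointer to the
 * value, or NULL if the key is absent.
 */
const char *uevent_get(const char *buf, size_t len, const char *key)
{
    size_t klen = strlen(key);
    size_t off = 0;

    assert(buf != NULL);
    while (off < len) {
        const char *s = buf + off;
        const char *end = memchr(s, '\0', len - off);
        size_t slen;

        if (end == NULL)
            break;              /* unterminated tail: ignore it */
        slen = (size_t)(end - s);
        if (slen > klen && s[klen] == '=' && strncmp(s, key, klen) == 0)
            return s + klen + 1;
        off += slen + 1;
    }
    return NULL;
}

/* Returns 1 if the message reports removal of a PCI device, else 0. */
int is_pci_remove_event(const char *buf, size_t len)
{
    const char *action = uevent_get(buf, len, "ACTION");
    const char *subsys = uevent_get(buf, len, "SUBSYSTEM");

    return action != NULL && subsys != NULL &&
           strcmp(action, "remove") == 0 && strcmp(subsys, "pci") == 0;
}

/* Open a netlink socket subscribed to kernel uevents; -1 on failure. */
int open_uevent_socket(void)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_KOBJECT_UEVENT);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_groups = 1;         /* group 1: uevents sent by the kernel */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A daemon would recv() from the descriptor returned by open_uevent_socket()
into a buffer, pass the buffer and byte count to is_pci_remove_event(), and
apply whatever platform policy is wanted (logging, controlled reboot) when it
returns 1. This only helps while user space is still being scheduled, so it
does not cover the case where the CPU itself stalls on the bus.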


I will try the method you suggested.

If this series is not reasonable, I'll drop it.
Adding a new kernel parameter to solve a specific system issue is not recommended. Try to find a custom solution for your chip/controller.


Ok. Understood. Thank you again for your reply.

Best regards,
Hans
