On Thu, Aug 28, 2025 at 01:25:12PM -0700, Brian Norris wrote: > On the flip side: it's not clear > PCI_ERS_RESULT_NEED_RESET+pci_channel_io_normal works as documented > either. An endpoint might think it's requesting a slot reset, but > pcie_do_recovery() will ignore that and skip reset_subordinates() > (pci_host_reset_root_port()). > > All in all, the docs sound like endpoints _should_ have control over > whether we exercise a full port/slot reset for all types of errors. But > in practice, we do not actually give it that control. i.e., your commit > message is correct, and the docs are not. > > I have half a mind to suggest the appended change, so the behavior > matches (some of) the docs a little better [1].
A change similar to the one you're proposing is already queued on the pci/aer topic branch for v6.18: https://git.kernel.org/pci/pci/c/d0a2dee7d458 Here's the corresponding cover letter: https://lore.kernel.org/r/cover.1755008151.git.lu...@wunner.de There was a discussion why I didn't take the exact same approach you're proposing, but only a similar one: https://lore.kernel.org/r/aj2ue6v46zib3...@wunner.de https://lore.kernel.org/r/akhwf3l0ncl_c...@wunner.de > Specifically, I'm trying to see what's supposed to happen with > PCI_ERS_RESULT_CAN_RECOVER. I see that for pci_channel_io_frozen, almost > all endpoint drivers return PCI_ERS_RESULT_NEED_RESET, but if drivers > actually return PCI_ERS_RESULT_CAN_RECOVER, it's unclear what should > happen. > > Today, we don't actually respect it; pcie_do_recovery() just calls > reset_subordinates() (pci_host_reset_root_port()) unconditionally. The > only thing that return code affects is whether we call > report_mmio_enabled() vs report_slot_reset() afterward. This seems odd. In the series queued on pci/aer, I've only allowed drivers to opt in to a reset on Non-Fatal Errors. I didn't dare also letting them opt out of a reset on Fatal Errors. These changes of behavior are always risky, so it seemed prudent to not introduce too many changes at once. There was no urgent need to also change behavior for Fatal Errors for the use case at hand (the xe graphics driver). I went through all drivers with pci_error_handlers to avoid breaking any of them. It's very tedious work, takes weeks. It would be necessary to do that again when changing behavior for Fatal Errors. pcieaer-howto.rst justifies the unconditional reset on Fatal Errors by saying that the link is unreliable and that a reset is thus required. On the other hand, pci-error-recovery.rst (which is a few months older than pcieaer-howto.rst) says in section "STEP 3: Link Reset": "This is a PCIe specific step and is done whenever a fatal error has been detected" I'm wondering if the authors of pcieaer-howto.rst took that at face value and thought they'd *have* to reset the link on Fatal Errors. Looking through the Fatal Errors in PCIe r7.0 sec 6.2.7, I think a reset is justified for some of them, but optional for others. Which leads me to believe that the AER driver should actually enforce a reset only for certain Fatal Errors, not all of them. So this seems like something worth revisiting in the future. > All in all, the docs sound like endpoints _should_ have control over > whether we exercise a full port/slot reset for all types of errors. But > in practice, we do not actually give it that control. i.e., your commit > message is correct, and the docs are not. Indeed the documentation is no longer in sync with the code. I've just submitted a series to rectify that and cc'ed you: https://lore.kernel.org/r/cover.1756451884.git.lu...@wunner.de Thanks, Lukas