Hi Lukas, On Fri, Aug 29, 2025 at 10:35:20AM +0200, Lukas Wunner wrote: > On Thu, Aug 28, 2025 at 01:25:12PM -0700, Brian Norris wrote: > > On the flip side: it's not clear > > PCI_ERS_RESULT_NEED_RESET+pci_channel_io_normal works as documented > > either. An endpoint might think it's requesting a slot reset, but > > pcie_do_recovery() will ignore that and skip reset_subordinates() > > (pci_host_reset_root_port()). > > > > All in all, the docs sound like endpoints _should_ have control over > > whether we exercise a full port/slot reset for all types of errors. But > > in practice, we do not actually give it that control. i.e., your commit > > message is correct, and the docs are not. > > > > I have half a mind to suggest the appended change, so the behavior > > matches (some of) the docs a little better [1]. > > A change similar to the one you're proposing is already queued on the > pci/aer topic branch for v6.18: > > https://git.kernel.org/pci/pci/c/d0a2dee7d458
Wow, nice coincidence. It's a reminder I should work off the maintainer / -next branch, instead of just mainline... > Here's the corresponding cover letter: > > https://lore.kernel.org/r/cover.1755008151.git.lu...@wunner.de > > There was a discussion why I didn't take the exact same approach you're > proposing, but only a similar one: > > https://lore.kernel.org/r/aj2ue6v46zib3...@wunner.de > https://lore.kernel.org/r/akhwf3l0ncl_c...@wunner.de Wow, that's a ton of great background and explanation. Thanks! > > Specifically, I'm trying to see what's supposed to happen with > > PCI_ERS_RESULT_CAN_RECOVER. I see that for pci_channel_io_frozen, almost > > all endpoint drivers return PCI_ERS_RESULT_NEED_RESET, but if drivers > > actually return PCI_ERS_RESULT_CAN_RECOVER, it's unclear what should > > happen. > > > > Today, we don't actually respect it; pcie_do_recovery() just calls > > reset_subordinates() (pci_host_reset_root_port()) unconditionally. The > > only thing that return code affects is whether we call > > report_mmio_enabled() vs report_slot_reset() afterward. This seems odd. > > In the series queued on pci/aer, I've only allowed drivers to opt in > to a reset on Non-Fatal Errors. I didn't dare also letting them opt > out of a reset on Fatal Errors. Right, I can see where the latter is risky. Frankly, while I have endpoint drivers suggesting they should be able to do this, I'm not sure that's a great idea. Or at least, I can see how it would potentially break other clients, as you explain. > These changes of behavior are always risky, so it seemed prudent to not > introduce too many changes at once. There was no urgent need to also > change behavior for Fatal Errors for the use case at hand (the xe graphics > driver). I went through all drivers with pci_error_handlers to avoid > breaking any of them. It's very tedious work, takes weeks. It would > be necessary to do that again when changing behavior for Fatal Errors. > > pcieaer-howto.rst justifies the unconditional reset on Fatal Errors by > saying that the link is unreliable and that a reset is thus required. > > On the other hand, pci-error-recovery.rst (which is a few months older > than pcieaer-howto.rst) says in section "STEP 3: Link Reset": > "This is a PCIe specific step and is done whenever a fatal error has been > detected" > > I'm wondering if the authors of pcieaer-howto.rst took that at face value > and thought they'd *have* to reset the link on Fatal Errors. > > Looking through the Fatal Errors in PCIe r7.0 sec 6.2.7, I think a reset > is justified for some of them, but optional for others. Which leads me > to believe that the AER driver should actually enforce a reset only for > certain Fatal Errors, not all of them. So this seems like something > worth revisiting in the future. Hmm, possibly. I haven't looked so closely at the details on all Fatal Errors, but I may have a look eventually. > > All in all, the docs sound like endpoints _should_ have control over > > whether we exercise a full port/slot reset for all types of errors. But > > in practice, we do not actually give it that control. i.e., your commit > > message is correct, and the docs are not. > > Indeed the documentation is no longer in sync with the code. I've just > submitted a series to rectify that and cc'ed you: > > https://lore.kernel.org/r/cover.1756451884.git.lu...@wunner.de Thanks! I'll try to take a pass at reviewing, but it may not be prompt. Thanks again for all the info and work here. Brian