eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling

Narayana Murty N Tue, 06 Jan 2026 22:23:41 -0800


On 11/12/25 9:15 PM, Timothy Pearson wrote:


----- Original Message -----

From: "Narayana Murty N" <[email protected]>
To: "mahesh" <[email protected]>, "Oliver" <[email protected]>, "Madhavan Srinivasan" 
<[email protected]>, "Michael
Ellerman" <[email protected]>, "npiggin" <[email protected]>, "christophe leroy" 
<[email protected]>
Cc: "Bjorn Helgaas" <[email protected]>, "Timothy Pearson" 
<[email protected]>, "linuxppc-dev"
<[email protected]>, "linux-kernel" <[email protected]>, 
"vaibhav" <[email protected]>,
"Shivaprasad G Bhat" <[email protected]>, [email protected]
Sent: Wednesday, December 10, 2025 8:25:59 AM
Subject: [PATCH v2 1/1] powerpc/eeh: fix recursive pci_lock_rescan_remove 
locking in EEH event handling
The recent commit 1010b4c012b0 ("powerpc/eeh: Make EEH driver device
hotplug safe") restructured the EEH driver to improve synchronization
with the PCI hotplug layer.

However, it inadvertently moved pci_lock_rescan_remove() outside its
intended scope in eeh_handle_normal_event(), leading to broken PCI
error reporting and improper EEH event triggering. Specifically,
eeh_handle_normal_event() acquired pci_lock_rescan_remove() before
calling eeh_pe_bus_get(), but eeh_pe_bus_get() itself attempts to
acquire the same lock internally, causing nested locking and disrupting
normal EEH event handling paths.

This patch adds a boolean parameter do_lock to _eeh_pe_bus_get(),
with two public wrappers:
    eeh_pe_bus_get() with locking enabled.
    eeh_pe_bus_get_nolock() that skips locking.

Callers that already hold pci_lock_rescan_remove() now use
eeh_pe_bus_get_nolock() to avoid recursive lock acquisition.
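
The wrapper arrangement described above can be sketched in userspace C. This is only an analogy under stated assumptions, not the patch itself: a pthread mutex stands in for the kernel's rescan/remove lock, and struct pe, lock_depth, and the pe_bus_get*() and handle_event() names are hypothetical stand-ins for the EEH types and functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the PCI rescan/remove lock. */
static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Number of times the lock is currently held; for illustration only. */
int lock_depth;

/* Hypothetical, much simplified stand-in for a PE. */
struct pe {
	int bus_id;
};

void lock_rescan_remove(void)
{
	pthread_mutex_lock(&rescan_remove_lock);
	lock_depth++;
}

void unlock_rescan_remove(void)
{
	assert(lock_depth > 0);
	lock_depth--;
	pthread_mutex_unlock(&rescan_remove_lock);
}

/* Core helper: takes the lock only when the caller asks for it. */
int _pe_bus_get(struct pe *pe, bool do_lock)
{
	int bus;

	if (do_lock)
		lock_rescan_remove();
	bus = pe->bus_id;
	if (do_lock)
		unlock_rescan_remove();

	return bus;
}

/* Wrapper for callers that do not hold the lock. */
int pe_bus_get(struct pe *pe)
{
	return _pe_bus_get(pe, true);
}

/* Wrapper for callers that already hold the lock. */
int pe_bus_get_nolock(struct pe *pe)
{
	return _pe_bus_get(pe, false);
}

/* A caller that holds the lock and therefore uses the _nolock variant. */
int handle_event(struct pe *pe)
{
	int bus;

	lock_rescan_remove();
	bus = pe_bus_get_nolock(pe);	/* no second acquisition */
	unlock_rescan_remove();

	return bus;
}
```

Keeping one core helper with a flag means the lookup logic lives in a single place, while the two wrapper names make the locking expectation visible at each call site.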

Additionally, pci_lock_rescan_remove() calls are restored to the correct
position—after eeh_pe_bus_get() and immediately before iterating affected
PEs and devices. This ensures EEH-triggered PCI removes occur under proper
bus rescan locking without recursive lock contention.
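
The ordering issue can also be illustrated in userspace, again only as a sketch under assumptions. Here a POSIX error-checking mutex stands in for the rescan/remove lock, so a second acquisition by the same thread is reported as EDEADLK instead of hanging. The names demo_lock, lookup_locked(), nested_order(), and restored_order() are hypothetical and do not appear in the kernel.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for PTHREAD_MUTEX_ERRORCHECK */
#endif
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the rescan/remove lock. */
static pthread_mutex_t demo_lock;

/* Initialise demo_lock as an error-checking mutex; returns 0 on success. */
int demo_lock_init(void)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;
	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!ret)
		ret = pthread_mutex_init(&demo_lock, &attr);
	pthread_mutexattr_destroy(&attr);

	return ret;
}

/* Lookup that takes the lock itself, like the locking wrapper. */
int lookup_locked(int *bus)
{
	int ret = pthread_mutex_lock(&demo_lock);

	if (ret)
		return ret;	/* EDEADLK if this thread already holds it */
	*bus = 1;
	pthread_mutex_unlock(&demo_lock);

	return 0;
}

/* Problematic order: take the lock, then call the locking lookup. */
int nested_order(void)
{
	int bus = 0;
	int ret;

	pthread_mutex_lock(&demo_lock);
	ret = lookup_locked(&bus);	/* second acquisition is rejected */
	pthread_mutex_unlock(&demo_lock);

	return ret;
}

/* Restored order: look up first, then lock around the iteration. */
int restored_order(void)
{
	int bus = 0;
	int ret;

	ret = lookup_locked(&bus);
	if (ret)
		return ret;

	pthread_mutex_lock(&demo_lock);
	/* ... iterate over affected entries while holding the lock ... */
	pthread_mutex_unlock(&demo_lock);

	return 0;
}
```

After a successful demo_lock_init(), nested_order() returns EDEADLK while restored_order() returns 0, which mirrors the difference between the two call orders described above.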

The eeh_pe_loc_get() function has been split into two functions:
    eeh_pe_loc_get(struct eeh_pe *pe) which retrieves the loc for given PE.
    eeh_pe_loc_get_bus(struct pci_bus *bus) which retrieves the location
    code for given bus.

Conceptually the patch sounds OK, but given the complexity of these subsystems 
it's difficult to foresee all interactions.  Was the patch verified not to break 
NVMe hotplug on PowerNV systems using actual hardware?  If not, I will need to 
do so before sending an ack.  Thanks!

Hi Timothy,

Thanks for your suggestion. I have now verified the change on a PowerNV system with NVMe hotplug.


Test setup:

Platform: PowerNV (“Hardware name: 9105-22A POWER10 (raw) 0x800200 opal:v7.1-126-g9f16f2d9e PowerNV”).

Kernel: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, with this patch on top of commit d358e5254674.

Devices: two PCIe NVMe drives in hotpluggable slots.

Tests performed:
Basic hotplug:

Repeated NVMe add/remove cycles using the platform’s hotplug controls (slot power off/on or PCIe attention button, as applicable).

Confirmed that each add/remove cycle correctly created and removed /dev/nvme* nodes, and that nvme list/I/O (e.g. fio or dd) worked before removal and failed cleanly after removal.

Confirmed there were no lockdep splats, warnings, or stack traces related to pci_lock_rescan_remove() or EEH during these tests.


Regression checks:

With these tests, NVMe hotplug and EEH behaviour on PowerNV appear unchanged except for the intended fix (no recursive pci_lock_rescan_remove() acquisition and normal EEH event handling).


Thanks,

Narayana
