On 11/12/25 9:15 PM, Timothy Pearson wrote:

----- Original Message -----
From: "Narayana Murty N" <[email protected]>
To: "mahesh" <[email protected]>, "Oliver" <[email protected]>, "Madhavan Srinivasan" 
<[email protected]>, "Michael
Ellerman" <[email protected]>, "npiggin" <[email protected]>, "christophe leroy" 
<[email protected]>
Cc: "Bjorn Helgaas" <[email protected]>, "Timothy Pearson" 
<[email protected]>, "linuxppc-dev"
<[email protected]>, "linux-kernel" <[email protected]>, 
"vaibhav" <[email protected]>,
"Shivaprasad G Bhat" <[email protected]>, [email protected]
Sent: Wednesday, December 10, 2025 8:25:59 AM
Subject: [PATCH v2 1/1] powerpc/eeh: fix recursive pci_lock_rescan_remove locking in EEH event handling

The recent commit 1010b4c012b0 ("powerpc/eeh: Make EEH driver device
hotplug safe") restructured the EEH driver to improve synchronization
with the PCI hotplug layer.

However, it inadvertently moved pci_lock_rescan_remove() outside its
intended scope in eeh_handle_normal_event(), leading to broken PCI
error reporting and improper EEH event triggering. Specifically,
eeh_handle_normal_event() acquired pci_lock_rescan_remove() before
calling eeh_pe_bus_get(), but eeh_pe_bus_get() itself attempts to
acquire the same lock internally, causing nested locking and disrupting
normal EEH event handling paths.

This patch adds a boolean parameter do_lock to _eeh_pe_bus_get(),
with two public wrappers:
    eeh_pe_bus_get() with locking enabled.
    eeh_pe_bus_get_nolock() that skips locking.

Callers that already hold pci_lock_rescan_remove() now use
eeh_pe_bus_get_nolock() to avoid recursive lock acquisition.
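The wrapper arrangement described above can be modelled in user space as follows. This is only a sketch, not the kernel code from the patch: every model_* name is hypothetical, the lock is a depth counter standing in for pci_lock_rescan_remove(), and the helper called _eeh_pe_bus_get() in the description appears here as model_pe_bus_get_impl().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-in for the PCI rescan/remove lock (assumption: modelled as a depth
 * counter, so a nested acquisition trips an assert instead of deadlocking).
 */
static int rescan_remove_depth;

static void model_lock_rescan_remove(void)
{
	assert(rescan_remove_depth == 0);	/* nested acquisition is the bug */
	rescan_remove_depth++;
}

static void model_unlock_rescan_remove(void)
{
	assert(rescan_remove_depth == 1);
	rescan_remove_depth--;
}

/* Hypothetical, heavily simplified stand-ins for the PE and bus objects. */
struct model_bus { int number; };
struct model_pe  { struct model_bus *bus; };

/* Common helper: takes the lock only when the caller does not hold it. */
static struct model_bus *model_pe_bus_get_impl(struct model_pe *pe, bool do_lock)
{
	struct model_bus *bus;

	if (do_lock)
		model_lock_rescan_remove();
	bus = pe->bus;
	if (do_lock)
		model_unlock_rescan_remove();

	return bus;
}

/* Public wrapper with locking enabled. */
static struct model_bus *model_pe_bus_get(struct model_pe *pe)
{
	return model_pe_bus_get_impl(pe, true);
}

/* Public wrapper for callers that already hold the lock. */
static struct model_bus *model_pe_bus_get_nolock(struct model_pe *pe)
{
	return model_pe_bus_get_impl(pe, false);
}

/* A caller that holds the lock must use the _nolock variant. */
static struct model_bus *model_handle_event(struct model_pe *pe)
{
	struct model_bus *bus;

	model_lock_rescan_remove();
	bus = model_pe_bus_get_nolock(pe);
	model_unlock_rescan_remove();

	return bus;
}
```

In this model, calling model_pe_bus_get() from inside model_handle_event() would trip the assertion in model_lock_rescan_remove(), which corresponds to the nested acquisition the patch removes.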

Additionally, pci_lock_rescan_remove() calls are restored to the correct
position—after eeh_pe_bus_get() and immediately before iterating affected
PEs and devices. This ensures EEH-triggered PCI removes occur under proper
bus rescan locking without recursive lock contention.

The eeh_pe_loc_get() function has been split into two functions:
    eeh_pe_loc_get(struct eeh_pe *pe) which retrieves the loc for given PE.
    eeh_pe_loc_get_bus(struct pci_bus *bus) which retrieves the location
    code for given bus.
Conceptually the patch sounds OK, but given the complexity of these subsystems
it's difficult to foresee all interactions.  Was the patch verified not to break
NVMe hotplug on PowerNV systems using actual hardware?  If not, I will need to
do so before sending an ack.  Thanks!
Hi Timothy,

Thanks for your suggestion. I have now verified the change on a PowerNV system with NVMe hotplug.

Test setup:
Platform: PowerNV (“Hardware name: 9105-22A POWER10 (raw) 0x800200 opal:v7.1-126-g9f16f2d9e PowerNV”).
Kernel: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git, with this patch applied on top of commit d358e5254674.
Devices: two PCIe NVMe drives in hotpluggable slots.

Tests performed:
Basic hotplug:
Repeated NVMe add/remove cycles using the platform’s hotplug controls (slot power off/on or PCIe attention button, as applicable). Confirmed that each add/remove cycle correctly created and removed /dev/nvme* nodes, and that nvme list/I/O (e.g. fio or dd) worked before removal and failed cleanly after removal. Confirmed there were no lockdep splats, warnings, or stack traces related to pci_lock_rescan_remove() or EEH during these tests.

Regression checks:
Based on these tests, NVMe hotplug and EEH behaviour on PowerNV appear unchanged except for the intended fix (no recursive pci_lock_rescan_remove() acquisition and normal EEH event handling).

Thanks,

Narayana

