Hi Everyone,

It's currently possible to cause a kernel crash by being (un)lucky while adding or removing PCI devices during EEH recovery. The cause of the problem is that there are trees of objects (struct eeh_pe, implemented with pointers and kernel linked lists) that are manipulated concurrently by the PCI core and the EEH recovery system, without synchronization. Unsurprisingly, this allows the kernel to hit list poison values or freed memory, and crash.
Here's a first draft of a way to address this, and I would welcome feedback at this stage: Does it make sense? Is it simple enough? Is it maintainable? Are there other options I should explore? etc...

The design:

We have to consider that:
- eeh_dev_check_failure() can be called in any context, so the only methods that are safe to use there are a spinlock or atomics.
- Some recovery operations are blocking (and can take a long time), so they can't be protected by a spinlock. These operations are performed while traversing lists of PEs.

The simplest solution might initially seem to be to hold the pci_rescan_remove lock during EEH recovery, but that seems to have some significant problems:
- It's a mutex, so it can't be used in eeh_dev_check_failure().
- If there is a crash during recovery, either in the core kernel or in a driver's callback function, the lock would be held forever, and that would block a lot of the PCI core.
- The EEH code is quite recursive and it's very difficult to work out where to take the lock without causing deadlocks. If there are bugs, they would likely be deadlocks.

So instead, this proposal uses a spinlock (eeh_pe_lock) to protect access to the list-membership fields of struct eeh_pe, in combination with reference counting for blocking operations. Because the spinlock protects only the list fields (and no other state), it only needs to be held over small sections that can be inspected easily.

For blocking operations I'm using the standard kernel refcounting. It only involves one extra field on each PE, but it does require tweaks to maintain the refcount in most functions that use PEs, and sometimes that can be tricky. The way the refcount is used is fairly standard: it starts at 1 and is incremented for each pointer pointing to the PE. The initial value of 1 represents the existence of the PE and its membership in the 'tree' under its PHB, regardless of where in that tree it is, or whether it has children.
A PE is removed from those lists before its initial reference is dropped.

The interaction of the spinlock and refcounting provides this property: if the spinlock is held and a ref is held on a PE, then it is safe to start from that PE and traverse anywhere along the list fields or parent pointers to other PEs without taking references to them. (To keep a reference to a PE for use outside of the spinlock, a ref must be taken before the spinlock is released.)

That property can be used to perform blocking operations while traversing the lists by 'sliding' the reference along, only holding the spinlock while moving to the next iteration. It seems to work well but is not perfect: when the lock is released, the current PE may be removed, and there isn't any simple, safe way to continue the traversal. The situation can be detected, so there won't be a crash, but the traversal can be left incomplete. Perhaps this could be fixed by collecting a temporary list of struct eeh_pe while holding the spinlock (I've experimented with this, and it works but seems a bit complicated), or perhaps we can just work around it where it matters. It hasn't happened at all during my testing. I'll look at it if this proposal goes ahead.

For clarity, I have also kept the scope very tightly on the lists of struct eeh_pe and the parent pointer. This patchset does not protect the list of eeh_dev in each eeh_pe, or the pdev member of struct eeh_dev, both of which are also affected by races. I've left a few 'this is unsafe' comments in the code in those areas. I'll look more at them if this proposal goes ahead; I don't think they will be difficult compared to eeh_pe.

Lastly, I've included an "orphan tracking system" that I used during development to verify that reference counting was acting as expected, but I have no idea whether it should be included in the final version. It keeps track of PEs that have been removed from the PHB tree but not yet freed, and makes that list available in debugfs.
Any PEs that remain orphans for very long are going to be the result of bugs. It's extra risk because it itself could contain bugs, but it could also be useful during debugging.

Cheers,
Sam.

Sam Bobroff (15):
  powerpc/eeh: Introduce refcounting for struct eeh_pe
  powerpc/eeh: Rename eeh_pe_get() to eeh_pe_find()
  powerpc/eeh: Track orphaned struct eeh_pe
  powerpc/eeh: Sync eeh_pe_next(), eeh_pe_find() and early-out traversals
  powerpc/eeh: Sync eeh_pe_get_parent()
  powerpc/eeh: Sync eeh_phb_pe_get()
  powerpc/eeh: Sync eeh_add_to_parent_pe() and eeh_rmv_from_parent_pe()
  powerpc/eeh: Sync eeh_handle_normal_event()
  powerpc/eeh: Sync eeh_handle_special_event(), pnv_eeh_get_pe(), pnv_eeh_next_error()
  powerpc/eeh: Sync eeh_phb_check_failure()
  powerpc/eeh: Sync eeh_dev_check_failure()
  powerpc/eeh: Sync eeh_pe_get_state()
  powerpc/eeh: Sync pnv_eeh_ei_write()
  powerpc/eeh: Sync eeh_force_recover_write()
  powerpc/eeh: Sync pcibios_set_pcie_reset_state()

 arch/powerpc/include/asm/eeh.h               |  20 +-
 arch/powerpc/kernel/eeh.c                    |  65 +++++-
 arch/powerpc/kernel/eeh_driver.c             |  23 +-
 arch/powerpc/kernel/eeh_pe.c                 | 230 ++++++++++++++++---
 arch/powerpc/platforms/powernv/eeh-powernv.c |  43 +++-
 5 files changed, 329 insertions(+), 52 deletions(-)

-- 
2.22.0.216.g00a2a96fc9