On Wed, Jul 09, 2025 at 06:38:20PM -0500, Bjorn Helgaas wrote: > Housekeeping: Note subject line convention. Indent with spaces in > commit log. Remove spurious plus signs.
Thanks! > On Wed, Jul 09, 2025 at 04:55:26PM -0400, Michael S. Tsirkin wrote: > > At the moment, in case of a surprise removal, the regular remove > > callback is invoked, exclusively. This works well, because mostly, the > > cleanup would be the same. > > > > However, there's a race: imagine device removal was initiated by a user > > action, such as driver unbind, and it in turn initiated some cleanup and > > is now waiting for an interrupt from the device. If the device is now > > surprise-removed, that never arrives and the remove callback hangs > > forever. > > > > For example, this was reported for virtio-blk: > > > > 1. the graceful removal is ongoing in the remove() callback, where disk > > deletion del_gendisk() is ongoing, which waits for the requests +to > > complete, > > > > 2. Now few requests are yet to complete, and surprise removal started. > > > > At this point, virtio block driver will not get notified by the driver > > core layer, because it is likely serializing remove() happening by > > +user/driver unload and PCI hotplug driver-initiated device removal. So > > vblk driver doesn't know that device is removed, block layer is waiting > > for requests completions to arrive which it never gets. So > > del_gendisk() gets stuck. > > > > Drivers can artificially add timeouts to handle that, but it can be > > flaky. > > > > Instead, let's add a way for the driver to be notified about the > > disconnect. It can then do any necessary cleanup, knowing that the > > device is inactive. > > This relies on somebody (typically pciehp, I guess) calling > pci_dev_set_disconnected() when a surprise remove happens. > > Do you think it would be practical for the driver's .remove() method > to recognize that the device may stop responding at any point, even if > no hotplug driver is present to call pci_dev_set_disconnected()? > > Waiting forever for an interrupt seems kind of vulnerable in general. > Maybe "artificially adding timeouts" is alluding to *not* waiting > forever for interrupts? That doesn't seem artificial to me because > it's just a fact of life that devices can disappear at arbitrary > times. > > It seems a little fragile to me to depend on some other part of the > system to notice the surprise removal and tell you about it or > schedule your work function. I think it would be more robust for the > driver to check directly, i.e., assume writes to the device may be > lost, check for PCI_POSSIBLE_ERROR() after reads from the device, and > never wait for an interrupt without a timeout. virtio is ... kind of special, in that users already take it for granted that having a device as long as they want to respond still does not lead to errors and data loss. Makes it a bit harder as our timeout would have to check for presence and retry, we can't just fail as a normal hardware device does. And there's the overhead thing - poking at the device a lot puts a high load on the host. So I can imagine a very long timeout (minutes?), and then something like the WQ I am trying to propose here as a shortcut. > > Since cleanups can take a long time, this takes an approach > > of a work struct that the driver initiates and enables > > on probe, and tears down on remove. > > > > Signed-off-by: Michael S. Tsirkin <m...@redhat.com> > > --- > > drivers/pci/pci.h | 6 ++++++ > > include/linux/pci.h | 45 +++++++++++++++++++++++++++++++++++++++++++++ > > 2 files changed, 51 insertions(+) > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index 12215ee72afb..3ca4ebfd46be 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -553,6 +553,12 @@ static inline int pci_dev_set_disconnected(struct > > pci_dev *dev, void *unused) > > pci_dev_set_io_state(dev, pci_channel_io_perm_failure); > > pci_doe_disconnected(dev); > > > > + if (READ_ONCE(dev->disconnect_work_enable)) { > > + /* Make sure work is up to date. */ > > + smp_rmb(); > > + schedule_work(&dev->disconnect_work); > > + } > > + > > return 0; > > } > > > > diff --git a/include/linux/pci.h b/include/linux/pci.h > > index 05e68f35f392..723b17145b62 100644 > > --- a/include/linux/pci.h > > +++ b/include/linux/pci.h > > @@ -548,6 +548,10 @@ struct pci_dev { > > /* These methods index pci_reset_fn_methods[] */ > > u8 reset_methods[PCI_NUM_RESET_METHODS]; /* In priority order */ > > > > + /* Report disconnect events. 0x0 - disable, 0x1 - enable */ > > + u8 disconnect_work_enable; > > + struct work_struct disconnect_work; > > + > > #ifdef CONFIG_PCIE_TPH > > u16 tph_cap; /* TPH capability offset */ > > u8 tph_mode; /* TPH mode */ > > @@ -1993,6 +1997,47 @@ pci_release_mem_regions(struct pci_dev *pdev) > > pci_select_bars(pdev, IORESOURCE_MEM)); > > } > > > > +/* > > + * Run this first thing after getting a disconnect work, to prevent it from > > + * running multiple times. > > + * Returns: true if disconnect was enabled, proceed. false if disabled, > > abort. > > + */ > > +static inline bool pci_test_and_clear_disconnect_enable(struct pci_dev > > *pdev) > > +{ > > + u8 enable = 0x1; > > + u8 disable = 0x0; > > + return try_cmpxchg(&pdev->disconnect_work_enable, &enable, disable); > > +} > > + > > +/* > > + * Caller must initialize @pdev->disconnect_work before invoking this. > > + * The work function must run and check > > pci_test_and_clear_disconnect_enable. > > + * Note that device can go away right after this call. > > + */ > > +static inline void pci_set_disconnect_work(struct pci_dev *pdev) > > +{ > > + /* Make sure WQ has been initialized already */ > > + smp_wmb(); > > + > > + WRITE_ONCE(pdev->disconnect_work_enable, 0x1); > > + > > + /* check the device did not go away meanwhile. */ > > + mb(); > > + > > + if (!pci_device_is_present(pdev)) > > + schedule_work(&pdev->disconnect_work); > > +} > > + > > +static inline void pci_clear_disconnect_work(struct pci_dev *pdev) > > +{ > > + WRITE_ONCE(pdev->disconnect_work_enable, 0x0); > > + > > + /* Make sure to stop using work from now on. */ > > + smp_wmb(); > > + > > + cancel_work_sync(&pdev->disconnect_work); > > +} > > + > > #else /* CONFIG_PCI is not enabled */ > > > > static inline void pci_set_flags(int flags) { } > > -- > > MST > >