On Mon, 3 Apr 2023 15:24:43 +0200 Yu Zhang <yu.zh...@ionos.com> wrote:
> Dear Laurent,
>
> recently we run into an issue with the following error:
>
> command '{ "execute": "device_del", "arguments": { "id": "virtio-diskX" }
> }' for VM "id" failed ({ "return": {"class": "GenericError", "desc":
> "Device virtio-diskX is already in the process of unplug"} }).
>
> The issue is reproducible. With a few seconds delay before hot-unplug,
> hot-unplug just works fine.
>
> After a few digging, we found that the commit 9323f892b39 may incur the
> issue.
> ------------------
> failover: fix unplug pending detection
>
> Failover needs to detect the end of the PCI unplug to start migration
> after the VFIO card has been unplugged.
>
> To do that, a flag is set in pcie_cap_slot_unplug_request_cb() and
> reset in pcie_unplug_device().
>
> But since
> 17858a169508 ("hw/acpi/ich9: Set ACPI PCI hot-plug as default on Q35")
> we have switched to ACPI unplug and these functions are not called
> anymore and the flag not set. So failover migration is not able to
> detect if card is really unplugged and acts as it's done as soon as
> it's started. So it doesn't wait the end of the unplug to start the
> migration. We don't see any problem when we test that because ACPI
> unplug is faster than PCIe native hotplug and when the migration really
> starts the unplug operation is already done.
>
> See c000a9bd06ea ("pci: mark device having guest unplug request pending")
> a99c4da9fc2a ("pci: mark devices partially unplugged")
>
> Signed-off-by: Laurent Vivier <lviv...@redhat.com>
> Reviewed-by: Ani Sinha <a...@anisinha.ca>
> Message-Id: <20211118133225.324937-4-lviv...@redhat.com>
> Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ------------------
> The purpose is for detecting the end of the PCI device hot-unplug. However,

Unplug is an asynchronous process, and issuing multiple unplug requests
while waiting for a 'not found' error, as a means of detecting that the
device has been unplugged, is hardly a sane way to do that.
Instead of swamping the guest with unplug requests (which lead to hardware
interrupts), you should wait for the DEVICE_DELETED QMP event.

> we feel the error confusing. How is it possible that a disk "is already in
> the process of unplug" during the first hot-unplug attempt? So far as I
> know, the issue was also encountered by libvirt, but they simply ignored it:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1878659
>
> Hence, a question is: should we have the line below in
> acpi_pcihp_device_unplug_request_cb()?
>
> pdev->qdev.pending_deleted_event = true;

Comment 15 in the above BZ describes how we could get rid of this line,
but also see comment 17 (in a nutshell, you get the error because the
device hasn't been removed yet).

> It would be great if you as the author could give us a few hints.
>
> Thank you very much for your reply!
>
> Sincerely,
>
> Yu Zhang @ Compute Platform IONOS
> 03.04.2013
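
To illustrate the suggestion above (send device_del once, then wait for
DEVICE_DELETED instead of re-sending the request), here is a rough Python
sketch. It is only an illustration and has not been run against a real
QEMU. It assumes a QMP monitor exposed on a UNIX socket that delivers one
JSON message per line; the helper names wait_for_device_deleted() and
unplug_and_wait(), the socket path and the timeout value are all
hypothetical.

```python
import json
import socket


def wait_for_device_deleted(lines, device_id):
    """Scan QMP messages (one JSON object per line) for DEVICE_DELETED.

    Returns the matching event, or None if the stream ends first.
    Raises RuntimeError if QMP reports an error for one of our commands.
    """
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        msg = json.loads(raw)
        if "error" in msg:
            raise RuntimeError(msg["error"].get("desc", "QMP error"))
        data = msg.get("data") or {}
        if msg.get("event") == "DEVICE_DELETED" and data.get("device") == device_id:
            return msg
    return None


def unplug_and_wait(qmp_socket_path, device_id, timeout=30.0):
    """Issue a single device_del, then block until the device is gone.

    qmp_socket_path is an assumed location of the QMP UNIX socket.
    A socket timeout surfaces as an exception from the read loop.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(qmp_socket_path)
        stream = sock.makefile("rw", encoding="utf-8", newline="\n")
        stream.readline()  # discard the QMP greeting banner
        commands = (
            {"execute": "qmp_capabilities"},
            {"execute": "device_del", "arguments": {"id": device_id}},
        )
        for cmd in commands:
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
        # The {"return": {}} replies are skipped; only the event matters.
        return wait_for_device_deleted(stream, device_id)
```

The unplug request is sent exactly once, so the guest sees a single
hot-unplug notification, and completion is detected from the event rather
than from a 'not found' error on a repeated device_del.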