Paolo Bonzini <pbonz...@redhat.com> writes: > In short, this patch gets rid of blockdev_mark_auto_del and > blockdev_auto_del. > > With these patches, it is possible to create a new -drive with the same > id as soon as the DEVICE_DELETED event is delivered (which equals to > unrealize). > > I'm sorry I'm not able to explain the history (and probably do not > understand the full ramifications) of this.
I'm going to try, because I need it myself to understand the ramifications. > That's why this is just > an RFC. It's technically not an RFC anymore. > The idea here is that reference counting the BlockBackend is enough to > defer the deletion of the block device as much as necessary; anticipating > the destruction of the DriveInfo is not a problem, and has the desired > effect of freeing the QemuOpts. Let me explain how we got ourselves into this mess, how it works, and where it fails. In the beginning, block frontends and backends got created during startup and lived as long as the process. Life was simple, if a bit dull. Enter hot plug / unplug, commit 6f338c3 "qemu: PCI device, disk and host network hot-add / hot-remove (Marcelo Tosatti)", Feb 2009, v0.10. In this early form, the interface mushed frontends and backends together, similar to the command line at that time. The interface we use today appeared in commit 3418bd2 "qdev hotplug: infrastructure and monitor commands.", Sep 2009, v0.12. Here's how it works: * Dynamically create a block backend: drive_add "" if=none,id=BE-ID,... * Hot plug a block frontend: device_add driver=DRV,id=FE-ID,drive=BE-ID,... * Hot unplug a block frontend device_del FE-ID For virtio-blk and SCSI devices, this also destroys the block backend. Back then, these were the only devices that could be unplugged, and this was the only way to destroy a block backend. This automatic backend destruction was a mistake. No other kind of backend gets destroyed that way. The (still experimental) blockdev-add gives us the opportunity to correct the mistake: backends created with it are not destroyed by frontend unplug. For some kinds of devices, such as USB, hot unplug is synchronous. For others, such as PCI, it's not: device_del merely initiates the unplug, which then completes in its own sweet time. Or doesn't: PCI unplug via ACPI actually requires guest cooperation. While there's some complexity in orchestrating the PCI-ACPI-dance, the block code proper is still simple: frontend destruction triggers backend destruction. Not only is automatic backend destruction a mistake, it's also not quite trivial. The backend gets destroyed by the frontend's qdev exit() method (which has since become its QOM unrealize()). This leaves the frontend's pointer to the backend dangling until the frontend finishes dying. Works as long as it doesn't get dereferenced during that time, but relying on that is a bit brittle. When it got in the way, it led to commit 14bafc5 "blockdev: Clean up automatic drive deletion", Jun 2010, v0.13. This is where blockdev_mark_auto_del() and blockdev_auto_del() come from. Instead of destroying the backend, we merely mark it, and destroy it when it's safe. The next piece of the puzzle is drive_del. While an image is in use by QEMU, it's generally unsafe to use by anything else. If you need it for something else, you need to first make QEMU relinquish it. Easy enough: hot unplug the frontend, thus destroy the backend, done. Except when the unplug needs guest cooperation, and the guest doesn't cooperate. This motivated commit 9063f81 "Implement drive_del to decouple block removal from device removal", Nov 2010, v0.14. The guest gets no say, and afterwards sees a terminally broken block device. In its initial form, drive_del simply closed and destroyed the block backend right away. It tried to zap the pointers first, but failed. We fixed the resulting use-after-free in commit d22b2f4 "Do not delete BlockDriverState when deleting the drive", Mar 2011, v0.15: we hide instead of destroy the closed backend when it's being used by a frontend. Why hide it? Backward compatibility. v0.14's drive_del made the backend's ID available for reuse immediately. What we can't destroy immediately, we need to hide. The final piece of the puzzle is event DEVICE_DELETED, from commit 0402a5d "qdev: DEVICE_DELETED event", Mar 2013, v1.5.0. It gets emitted when unplug completes. Meanwhile and mostly independently, backend reference counting evolved. When backends still lived as long as the process, there was none, obviously. We probably should've introduced it for drive_del in v0.14, but that didn't happen, because we (incorrectly) thought drive_del could simply delete and be done. Instead, it got introduced for block migration, in commit 84fb392 "blockdev: add refcount to DriveInfo", Jan 2011, v0.15. Note that DriveInfo is *not* the block backend. It captures a -drive / drive_add. Reference counting block backend there pressed DriveInfo to use as owner of the block backend. It took a while to clean that up, until commit fa510eb "block: use BDS ref for block jobs", Aug 2013, v1.7. DriveInfo's reference count even lingered until commit ae60e8e, v2.1. Note that reference counting began only after the unplug work was done except for the bug fixes. When I worked on BlockBackend, I wanted to do what your series does: put the reference counting to use to clean up the unplug mess. But then the BlockBackend job became too complex and I had to cut that part. So, what do we learn from my lengthy history lesson? I think we can learn what to watch out for. Here's my list: * Block backend auto-destroy wart: we need to destroy block backends on frontend unplug exactly as before. * The "drive_del hides the backend" wart: what we can't destroy immediately, we need to hide. * drive_del is where dragons be. I've fixed enough bugs there, and spent enough hours wracking my brain how to change things without messing it up to be on my toes there. > Patches 1 and 3 are mostly similar to the version I had earlier sent as > RFC, but they now pass all unit tests. Patch 2 is new, but I don't know > of a test that fails it. I'll review the actual patches next, hopefully tomorrow.