Hi; I'm trying to find out what the UI layer's threading and locking strategy is, at least as far as it applies to display device models.
Specifically: * is the device's GraphicHwOps::gfx_update method always called from one specific thread, or might it be called from any thread? * is that method called with any locks guaranteed held? (eg the iothread lock) * is the caller of the gfx_update method OK if an implementation of the method drops the iothread lock temporarily while it is executing? (my guess would be "no") * for a gfx_update_async = true device, what are the requirements on calling graphic_hw_update_done()? Does the caller need to hold any particular lock? Does the call need to be done from any particular thread? The background to this is that I'm looking again at the race condition involving the memory_region_snapshot_and_clear_dirty() function, as described here: https://lore.kernel.org/qemu-devel/CAFEAcA9odnPo2LPip295Uztri7JfoVnQbkJ=wn+k8dqneb_...@mail.gmail.com/T/#u Having worked through what is going on, as far as I can see: (1) in order to be sure that we have the right data to match the snapshotted dirty bitmap state, we must wait for all TCG vCPUs to leave their current TB (2) a vCPU might block waiting for the iothread lock mid-TB (3) therefore we cannot wait for the TCG vCPUs without dropping the iothread lock one way or another (4) but none of the callers expect that and various things break My tentative idea for a fix is a bit of an upheaval: * have the display devices set gfx_update_async = true * instead of doing everything synchronously in their gfx_update method, they do the initial setup and call an 'async' version of memory_region_snapshot_and_clear_dirty() * that async version of the function will do what it does today, but without trying to wait for TCG vCPUs * instead the caller arranges (via call_rcu(), probably) a callback that will happen once all the TCG CPUs have finished executing their current TB * that callback does the actual copy-from-guest-ram-to-display and then calls graphic_hw_update_done() This seems like an awful pain in the neck but I couldn't see anything better :-( Paolo: what (if any) guarantee does call_rcu() make about which thread the callback function gets executed on, and what locks are/are not held when it's called? (I haven't looked at the migration code's use of memory_global_after_dirty_log_sync() but I suspect it's similarly broken.) thanks -- PMM