Peter Xu <pet...@redhat.com> writes:

> CI: https://gitlab.com/peterx/qemu/-/pipelines/1577280033
>  (note: it's a pipeline of two patchsets, to save CI credits and time)
>
> v1: https://lore.kernel.org/r/20241204005138.702289-1-pet...@redhat.com
>
> This is v2 of the series, dropping the RFC tag, because my goal is to
> have these patches (or some newer version of them) merged.
>
> The major change is that I merged the last three patches and made quite
> a few changes here and there, to make sure the global disk activation
> status is always consistent.  The whole idea is still the same, so a
> detailed changelog probably won't help.
>
> I also temporarily dropped Fabiano's ping-pong test cases, to avoid
> having different versions floating around on the list (I know a new
> version is coming at some point.  Fabiano: you're taking over the 10.0
> pull requests, so I assume you're aware of this and there's no concern
> about the order of merges).  I'll review the test cases separately when
> they're ready, but this series was still tested with that ping-pong
> test and it keeps working.
>
> I started looking at this problem as a whole when reviewing Fabiano's
> series, especially this patch (which fixes a QEMU crash [1]):
>
> https://lore.kernel.org/r/20241125144612.16194-5-faro...@suse.de
>
> The proposed patch could work, but adding such a side effect to
> migration is unwanted.  So I started to think about whether we could
> provide a cleaner approach, because migration doesn't need the disks to
> be active to work at all.  Hence we should try to avoid adding a
> migration ABI (which doesn't matter now, but may matter some day) on
> disk activation status in the prepare phase.  Migration should happen
> with the disks inactivated.
>
> It's also my pure wish that bdrv_inactivate_all() could be benign to
> call even when all disks are already inactive; then the bug would also
> be gone.  After all, the similar call to bdrv_activate_all() on
> all-active disks is fine.  I hope that wish is still fair, but I don't
> know the block layer well enough to say anything meaningful.
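>
> To illustrate what I mean, here is a rough sketch (the helper names and
> the global flag below are made up for this cover letter; the actual
> implementation in this series lives in migration/block-active.c):
> tracking one global activation status turns repeated
> activate/inactivate calls into benign no-ops instead of failing on
> already-inactive disks:
>
>     /* Sketch only: guard the bdrv_*_all() calls with one global flag. */
>     #include "qemu/osdep.h"
>     #include "block/block.h"
>     #include "qapi/error.h"
>
>     static bool disks_active = true;    /* disks start out active */
>
>     bool migration_block_inactivate(void)
>     {
>         if (!disks_active) {
>             return true;                /* already inactive: skip */
>         }
>         if (bdrv_inactivate_all() != 0) {
>             return false;
>         }
>         disks_active = false;
>         return true;
>     }
>
>     bool migration_block_activate(void)
>     {
>         Error *local_err = NULL;
>
>         if (disks_active) {
>             return true;                /* already active: skip */
>         }
>         bdrv_activate_all(&local_err);
>         if (local_err) {
>             error_report_err(local_err);
>             return false;
>         }
>         disks_active = true;
>         return true;
>     }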
>
> While looking at that, I found more disk activation logic spread all
> over the place.  I decided to clean all of it up, while hopefully
> fixing the QEMU crash [1] too.
>
> For this v2, I did some more testing.  I want to make sure all the past
> paths keep working, at least in failure or cancel races, and also in
> postcopy failure cases.  So I ran the tests below and they all pass
> (where I say "emulated" below, I mean I hacked something in to trigger
> those races / rare failures, because they aren't easy to trigger with a
> vanilla binary):
>
> - Tested generic migrate_cancel during precopy; disk activation isn't
>   affected.  Disk status reports correct values in tracepoints.
>
> - Ran Fabiano's ping-pong migration tests on a PAUSED-state VM.
>
> - Emulated a precopy failure before sending the non-iterable data: disk
>   inactivation doesn't happen, and activation also doesn't trigger
>   after migration cleanup (even though activation on top of
>   already-active disks is benign, I checked the traces to make sure it
>   reports consistent disk status and skips the activation).
>
> - Emulated a precopy failure right after sending the non-iterable data:
>   disks are inactivated, but can then be properly reactivated before
>   the VM starts.
>
> - Emulated a postcopy failure when sending the packed data (which
>   happens after the disks are inactivated), making sure the src VM gets
>   back to live properly and re-activates the disks before starting.
>
> - Emulated a concurrent migrate_cancel at the end of precopy's
>   migration_completion(), after the disks are inactivated.  Disks can
>   be properly reactivated.
>
>   NOTE: here, if the dest QEMU hasn't quit before migrate_cancel,
>   bdrv_activate_all() can crash the src QEMU.  This behavior should be
>   the same before/after this series.
>
> Comments welcomed, thanks.
>
> [1] https://gitlab.com/qemu-project/qemu/-/issues/2395
>
> Peter Xu (6):
>   migration: Add helper to get target runstate
>   qmp/cont: Only activate disks if migration completed
>   migration/block: Make late-block-active the default
>   migration/block: Apply late-block-active behavior to postcopy
>   migration/block: Fix possible race with block_inactive
>   migration/block: Rewrite disk activation
>
>  include/migration/misc.h |   4 ++
>  migration/migration.h    |   6 +-
>  migration/block-active.c |  94 +++++++++++++++++++++++++++
>  migration/colo.c         |   2 +-
>  migration/migration.c    | 136 +++++++++++++++------------------------
>  migration/savevm.c       |  46 ++++++-------
>  monitor/qmp-cmds.c       |  22 +++----
>  migration/meson.build    |   1 +
>  migration/trace-events   |   3 +
>  9 files changed, 188 insertions(+), 126 deletions(-)
>  create mode 100644 migration/block-active.c

Queued, thanks!
