On Fri, 14 Apr 2023 10:33:58 -0500 Eric Blake <ebl...@redhat.com> wrote:
> Consider what happens when performing a migration between two host
> machines connected to an NFS server serving multiple block devices to
> the guest, when the NFS server becomes unavailable. The migration
> attempts to inactivate all block devices on the source (a necessary
> step before the destination can take over); but if the NFS server is
> non-responsive, the attempt to inactivate can itself fail. When that
> happens, the destination fails to get the migrated guest (good,
> because the source wasn't able to flush everything properly):
>
> (qemu) qemu-kvm: load of migration failed: Input/output error
>
> at which point, our only hope for the guest is for the source to take
> back control. With the current code base, the host outputs a message,
> but then appears to resume:
>
> (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable:
> bdrv_inactivate_all() failed (-1)
>
> (src qemu) info status
> VM status: running
>
> but a second migration attempt now asserts:
>
> (src qemu) qemu-kvm: ../block.c:6738: int
> bdrv_inactivate_recurse(BlockDriverState *): Assertion
> `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
>
> Whether the guest is recoverable on the source after the first failure
> is debatable, but what we do not want is to have qemu itself fail due
> to an assertion. It looks like the problem is as follows:
>
> In migration.c:migration_completion(), the source sets 'inactivate' to
> true (since COLO is not enabled), then tries
> savevm.c:qemu_savevm_state_complete_precopy() with a request to
> inactivate block devices. In turn, this calls
> block.c:bdrv_inactivate_all(), which fails when flushing runs up
> against the non-responsive NFS server. With savevm failing, we are
> now left in a state where some, but not all, of the block devices
> have been inactivated; but migration_completion() then jumps to
> 'fail' rather than 'fail_invalidate' and skips an attempt to reclaim
> those disks by calling bdrv_activate_all(). Even if we do attempt to
> reclaim disks, we aren't taking note of failure there, either.
>
> Thus, we have reached a state where the migration engine has forgotten
> all state about whether a block device is inactive, because we did not
> set s->block_inactive in enough places; so migration allows the source
> to reach vm_start() and resume execution, violating the block layer
> invariant that the guest CPUs should not be restarted while a device
> is inactive. Note that the code in migration.c:migrate_fd_cancel()
> will also try to reactivate all block devices if s->block_inactive
> was set, but because we failed to set that flag after the first
> failure, the source assumes it has reclaimed all devices, even though
> it still has remaining inactivated devices and does not try again.
> Normally, qmp_cont() will also try to reactivate all disks (or
> correctly fail if the disks are not reclaimable because NFS is not
> yet back up), but the auto-resumption of the source after a migration
> failure does not go through qmp_cont(). And because we have left the
> block layer in an inconsistent state with devices still inactivated,
> the later migration attempt is hitting the assertion failure.
>
> Since it is important to not resume the source with inactive disks,
> this patch marks s->block_inactive before attempting inactivation,
> rather than after succeeding, in order to prevent any vm_start()
> until it has successfully reactivated all devices.
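
To make the failure sequence described above concrete, here is a small
self-contained model of it. This is an illustration only, not QEMU code:
the names mirror bdrv_inactivate_all(), bdrv_inactivate_recurse() and
BDRV_O_INACTIVE from the description, but the types and bodies are
simplified stubs invented for the example.

/*
 * Toy model of the partial-inactivation failure: the first attempt
 * fails partway (the "NFS" disk cannot be flushed), leaving an earlier
 * disk inactive; a second attempt then trips the assertion, just as the
 * second migration attempt does in the report above.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define BDRV_O_INACTIVE 0x1

typedef struct BlockDriverState {
    const char *name;
    int open_flags;
    bool flush_fails;            /* models the unresponsive NFS server */
} BlockDriverState;

static int bdrv_inactivate_recurse(BlockDriverState *bs)
{
    /* The real code asserts the device is not already inactive. */
    assert(!(bs->open_flags & BDRV_O_INACTIVE));
    if (bs->flush_fails) {
        return -1;               /* flush to the NFS server failed */
    }
    bs->open_flags |= BDRV_O_INACTIVE;
    return 0;
}

static int bdrv_inactivate_all(BlockDriverState *devs, int n)
{
    for (int i = 0; i < n; i++) {
        if (bdrv_inactivate_recurse(&devs[i]) < 0) {
            return -1;           /* devices before i stay inactive */
        }
    }
    return 0;
}

int main(void)
{
    BlockDriverState devs[] = {
        { "disk0", 0, false },
        { "disk1", 0, true },    /* NFS-backed disk that hangs */
    };

    /* First attempt: disk0 is inactivated, disk1 fails, so -1. */
    printf("first attempt: %d\n", bdrv_inactivate_all(devs, 2));

    /* Second attempt without reactivating first: disk0 is still
     * inactive, so the assertion fires. */
    printf("second attempt: %d\n", bdrv_inactivate_all(devs, 2));
    return 0;
}
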
>
> See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
>
> Signed-off-by: Eric Blake <ebl...@redhat.com>

Looks good to me from the COLO side of things.

Acked-by: Lukas Straub <lukasstra...@web.de>
Tested-by: Lukas Straub <lukasstra...@web.de>

Best Regards,
Lukas Straub

> ---
>
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                              MIGRATION_STATUS_DEVICE);
>              }
>              if (ret >= 0) {
> +                s->block_inactive = inactivate;
>                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                           inactivate);
>              }
> -            if (inactivate && ret >= 0) {
> -                s->block_inactive = true;
> -            }
>          }
>          qemu_mutex_unlock_iothread();
>
> @@ -3522,6 +3520,7 @@ fail_invalidate:
>          bdrv_activate_all(&local_err);
>          if (local_err) {
>              error_report_err(local_err);
> +            s->block_inactive = true;
>          } else {
>              s->block_inactive = false;
>          }
>
> base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6
--
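
As a follow-up illustration of why keeping s->block_inactive accurate
matters for the resume path: a self-contained sketch, again not QEMU
code; the struct, the stand-in helpers and their simplified signatures
are assumptions made only for the example. It shows a resume path that
refuses to restart CPUs while the flag says some disks may still be
inactive, which is the invariant the patch preserves.

#include <stdbool.h>
#include <stdio.h>

typedef struct MigrationState {
    bool block_inactive;         /* "some disks may still be inactive" */
} MigrationState;

/* Stand-in for reactivating disks; fails while "NFS" is unreachable. */
static bool nfs_is_back;
static int activate_all_disks(void)
{
    return nfs_is_back ? 0 : -1;
}

/* Stand-in for the source's resume path: only start CPUs once every
 * disk has been reactivated. */
static void resume_source(MigrationState *s)
{
    if (s->block_inactive) {
        if (activate_all_disks() < 0) {
            printf("disks still inactive, refusing to resume CPUs\n");
            return;
        }
        s->block_inactive = false;
    }
    printf("CPUs resumed\n");
}

int main(void)
{
    /* Flag set before the inactivation attempt, as in the patch. */
    MigrationState s = { .block_inactive = true };

    resume_source(&s);           /* NFS down: refuses to resume */
    nfs_is_back = true;
    resume_source(&s);           /* NFS back: reactivates, then resumes */
    return 0;
}
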