On 8/26/25 2:11 AM, Markus Armbruster wrote:
Jonah Palmer <jonah.pal...@oracle.com> writes:

On 8/25/25 8:44 AM, Markus Armbruster wrote:

[...]

Jonah Palmer <jonah.pal...@oracle.com> writes:

On 8/8/25 6:48 AM, Markus Armbruster wrote:

[...]

Jonah Palmer <jonah.pal...@oracle.com> writes:
Adds a new migration capability 'virtio-iterative' that will allow
virtio devices, where supported, to iteratively migrate configuration
changes that occur during the migration process.

Why is that desirable?

To be frank, I wasn't sure if having a migration capability, or even having
it toggleable at all, would be desirable or not. It appears, though, that this
might be better off as a per-device feature set via
--device virtio-net-pci,iterative-mig=on,..., for example.

See below.

And by "iteratively migrate configuration changes" I meant more along
the lines of the device's state as it continues running on the source.

Isn't that what migration does always?

Essentially yes, but today all of the state is only migrated at the end, once 
the source has been paused. So the final correct state is always sent to the 
destination.

As far as I understand (and ignoring lots of detail, including post
copy), we have three stages:

1. Source runs, migrate memory pages.  Pages that get dirtied after they
are migrated need to be migrated again.

2. Neither source nor destination runs, migrate remaining memory pages
and device state.

3. Destination starts to run.

If the duration of stage 2 (downtime) were of no concern, we'd switch to
it immediately, i.e. without migrating anything in stage 1.  This would
minimize I/O.

Of course, we actually care about limiting downtime.  We switch to stage 2
when "little enough" is left for stage 2 to migrate.

If we're no longer waiting until the source has been paused and the initial
state is sent early, then we need to make sure that any changes that happen are
still communicated to the destination.

So you're proposing to treat suitable parts of the device state more
like memory pages.  Correct?


Not in the sense of "something got dirtied so let's immediately re-send that" like we would with RAM. It's more along the lines of "something got dirtied so let's make sure that gets re-sent at the start of stage 2".

The entire state of a virtio-net device (even with vhost-net / vhost-vDPA) is <10KB, I believe. There isn't much to gain by "iteratively" re-sending changes for virtio-net. It should be sufficient to just re-send whatever changed during stage 1 (after the initial state was sent) at the start of stage 2.

This is why I'm currently looking into a solution that uses VMSD's .early_setup flag (that Peter recommended) rather than implementing a suite of SaveVMHandlers hooks (like this RFC does). We don't need this iterative capability as much as we need to start migrating the state earlier (and doing corresponding config/prep work) during stage 1.
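
For illustration, a rough sketch of that direction (the VMSD name and field
list here are placeholders, not the actual patch):

    /* Sketch only: a VMSD marked with .early_setup so its fields are
     * emitted during stage 1 instead of at switchover. */
    static const VMStateDescription vmstate_virtio_net_early = {
        .name = "virtio-net-device/early",
        .version_id = 1,
        .minimum_version_id = 1,
        .early_setup = true,   /* send during the live phase */
        .fields = (const VMStateField[]) {
            VMSTATE_UINT16(status, VirtIONet),
            VMSTATE_END_OF_LIST()
        }
    };

Anything that changes after this is sent would still be covered by the
re-send at the start of stage 2.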

Cover letter and commit message of PATCH 4 provide the motivation: you
observe a shorter downtime.  You speculate this is due to moving "heavy
allocations and page-fault latencies" from stage 2 to stage 1.  Correct?


Correct. But again I'd like to stress that this is just one part of reducing downtime during stage 2. The biggest reductions will come from the config/prep work that we're trying to move from stage 2 to stage 1, especially when vhost-vDPA is involved. And we can only do this early work once we have the state, which is why we're sending it earlier.

Is there anything that makes virtio-net particularly suitable?


Yes, especially with vhost-vDPA and configuring VQs. See Eugenio's comment here https://lore.kernel.org/qemu-devel/CAJaqyWdUutZrAWKy9d=ip+h+y3bnpturcl8xj06xfiznxpt...@mail.gmail.com/.

I think this patch's commit message should at least hint at the
motivation at a high level.  Details like measurements are best left to
PATCH 4.


You're right, this was my bad for not framing this RFC and its true motivations more clearly. I will certainly be more direct and descriptive in the next RFC for this effort.

This RFC handles this by just re-sending the entire state once the source
has been paused. But of course this isn't optimal, and I'm looking into how
to better handle this part.

How much is the entire state?


I'm not exactly sure how large it is but it should be <10KB even with vhost-vDPA. It could be slightly larger if we really up the number of queue pairs and/or have huge MAC/multicast lists.

But perhaps actual configuration changes (e.g. changing the number of
queue pairs) could also be supported mid-migration like this?

I don't know.

This capability is added to the validated capabilities list to ensure
both the source and destination support it before enabling.

What happens when only one side enables it?

The migration stream breaks if only one side enables it.

How does it break?  Error message pointing out the misconfiguration?


The destination VM is torn down and the source just reports that migration 
failed.

Exact same failure as for other misconfigurations, like missing a device
on the destination?


I hesitate to say "exact", but, for example, when a device is missing on one side you might see something like the following (I removed a serial device):

qemu-system-x86_64: Unknown ramblock "0000:00:03.0/virtio-net-pci.rom", cannot accept migration
qemu-system-x86_64: error while loading state for instance 0x0 of device 'ram'
qemu-system-x86_64: load of migration failed: Invalid argument
...

The expected order gets messed up and eventually the wrong data will end up somewhere else. In this case it was the RAM.

I don't believe the source/destination could be aware of the misconfiguration. 
IIUC the destination reads the migration stream and expects certain pieces of 
data in a certain order. If new data is added to the migration stream or the 
order has changed and the destination isn't expecting it, then the migration 
fails. It doesn't know exactly why, just that it read in data that it wasn't
expecting.

This was poor wording on my part, my apologies. I don't think it's even
possible for the source & destination to know each other's capabilities.

The capability defaults to off to maintain backward compatibility.

To enable the capability via HMP:
(qemu) migrate_set_capability virtio-iterative on

To enable the capability via QMP:
{"execute": "migrate-set-capabilities", "arguments": {
        "capabilities": [
           { "capability": "virtio-iterative", "state": true }
        ]
     }
}

Signed-off-by: Jonah Palmer <jonah.pal...@oracle.com>

[...]

diff --git a/qapi/migration.json b/qapi/migration.json
index 4963f6ca12..8f042c3ba5 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -479,6 +479,11 @@
  #     each RAM page.  Requires a migration URI that supports seeking,
  #     such as a file.  (since 9.0)
  #
+# @virtio-iterative: Enable iterative migration for virtio devices, if
+#     the device supports it. When enabled, and where supported, virtio
+#     devices will track and migrate configuration changes that may
+#     occur during the migration process. (Since 10.1)

When and why should the user enable this?

Well if all goes according to plan, always (at least for virtio-net).
This should improve the overall speed of live migration for a virtio-net
device (and vhost-net/vhost-vdpa).

So the only use for "disabled" would be when migrating to or from an
older version of QEMU that doesn't support this.  Fair?

Correct.

What's the default?

Disabled.

Awkward for something that should always be enabled.  But see below.

Please document defaults in the doc comment.


Ack.
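
Something along these lines, perhaps (exact wording TBD):

# @virtio-iterative: Enable iterative migration for virtio devices,
#     if the device supports it.  When enabled, and where supported,
#     virtio devices will track and migrate configuration changes
#     that may occur during the migration process.  This capability
#     is disabled by default.  (Since 10.1)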

What exactly do you mean by "where supported"?

I meant if both the source's QEMU and the destination's QEMU support it, as
well as other virtio devices in the future, should they decide to implement
iterative migration (e.g. a more general "enable iterative migration for
virtio devices").

But I think for now this is better left as a virtio-net configuration
rather than as a migration capability (e.g. --device
virtio-net-pci,iterative-mig=on/off,...)
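
i.e. roughly a single property in virtio-net's property list (the struct
field name below is just a placeholder):

    /* Sketch only: per-device opt-in instead of a migration capability.
     * The property name follows the suggestion above. */
    DEFINE_PROP_BOOL("iterative-mig", VirtIONet, iterative_mig, false),

which the user would then toggle with
--device virtio-net-pci,iterative-mig=on.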

Makes sense to me (but I'm not a migration expert).

A device property's default can depend on the machine type via compat
properties.  This is normally used to restrict a guest-visible change to
newer machine types.  Here, it's not guest-visible.  But it can get you
this:

* Migrate new machine type from new QEMU to new QEMU (old QEMU doesn't
   have the machine type): iterative is enabled by default.  Good.  User
   can disable it on both ends to not get the improvement.  Enabling it
   on just one breaks migration.

   All other cases go away with time.

* Migrate old machine type from new QEMU to new QEMU: iterative is
   disabled by default, which is sad, but no worse than before.  User can
   enable it on both ends to get the improvement.  Enabling it on just
   one breaks migration.

* Migrate old machine type from new QEMU to old QEMU or vice versa:
   iterative is off by default.  Good.  Enabling it on the new one breaks
   migration.

* Migrate old machine type from old QEMU to old QEMU: iterative is off

I figure almost all users could simply ignore this configuration knob
then.
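
(For example, the compat entry could be little more than

    /* Sketch only: pin the old default ("off") for older machine types,
     * assuming a per-device "iterative-mig" property as discussed above. */
    { "virtio-net-pci", "iterative-mig", "off" },

added to the appropriate hw_compat_* array in hw/core/machine.c.)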


Oh, that's interesting. I wasn't aware of this. But couldn't this potentially cause some headaches and confusion when attempting to migrate between two VMs where one is using a machine type that supports it and the other isn't?

For example, the source and destination VMs both specify '-machine q35,...' and the q35 alias resolves to, say, pc-q35-10.1 for the source VM and pc-q35-10.0 for the destination VM. And say this property is supported on >= pc-q35-10.1.

IIUC, this would mean that iterative is enabled by default on the source VM but disabled by default on the destination VM.

Then a user attempts the migration, the migration fails, and then they'd have to try and figure out why it's failing.

Furthermore, since it's a device property that's essentially set at VM creation time, either the source would have to be reset and explicitly set this property to off, or the destination would have to be reset and use a newer (>= pc-q35-10.1) machine type, before starting it back up and performing the migration.

Am I understanding this correctly?

[...]


