On Mon, Aug 11, 2025 at 08:30:19AM -0400, Jonah Palmer wrote:
> 
> On 8/7/25 12:31 PM, Peter Xu wrote:
> > On Thu, Aug 07, 2025 at 10:18:38AM -0400, Jonah Palmer wrote:
> > > 
> > > On 8/6/25 12:27 PM, Peter Xu wrote:
> > > > On Tue, Jul 22, 2025 at 12:41:26PM +0000, Jonah Palmer wrote:
> > > > > Iterative live migration for virtio-net sends an initial
> > > > > VMStateDescription while the source is still active. Because data
> > > > > continues to flow for virtio-net, the guest's avail index continues
> > > > > to increment after last_avail_idx had already been sent. This causes
> > > > > the destination to often see something like this from virtio_error():
> > > > > 
> > > > >   VQ 0 size 0x100 Guest index 0x0 inconsistent with Host index 0xc:
> > > > >   delta 0xfff4
> > > > 
> > > > This is pretty much understandable, as vmstate_save() / vmstate_load()
> > > > are, IMHO, not designed to be used while the VM is running.
> > > > 
> > > > To me, it's still illegal (per the previous patch) to use
> > > > vmstate_save_state() while the VM is running, in a save_setup() phase.
> > > 
> > > Yea I understand where you're coming from. It just seemed too good to
> > > pass up on as a way to send and receive the entire state of a device.
> > > 
> > > I felt that if I were to implement something similar for iterative
> > > migration only, I'd, more or less, be duplicating a lot of already
> > > existing code or vmstate logic.
> > > 
> > > > Some very high level questions from migration POV:
> > > > 
> > > >   - Have we figured out why the downtime can be shrunk just by
> > > >     sending the vmstate twice?
> > > > 
> > > >     If we suspect it's because memory got preheated, have we tried
> > > >     other ways to simply heat the memory up on the dest side? For
> > > >     example, some form of mlock[all]()? IMHO it's pretty important
> > > >     that we figure out the root of where such an optimization comes
> > > >     from.
> > > > 
> > > >     I do remember we had a downtime issue with the number of
> > > >     max_vqueues that may cause post_load() to be slow; I wonder
> > > >     whether there are other ways to improve it instead of
> > > >     vmstate_save(), especially in the setup phase.
> > > 
> > > Yea I believe that the downtime shrinks on the second
> > > vmstate_load_state due to preheated memory. But I'd like to stress
> > > that it's not my intention to resend the entire vmstate again during
> > > the stop-and-copy phase if iterative migration was used. A future
> > > iteration of this series will eventually include a more efficient
> > > approach to update the destination with any deltas since the vmstate
> > > was sent during the iterative portion (instead of just resending the
> > > entire vmstate again).
> > > 
> > > And yea there is an inefficiency regarding walking through
> > > VIRTIO_QUEUE_MAX (1024) VQs (twice with PCI) that I mentioned here in
> > > another comment:
> > > https://lore.kernel.org/qemu-devel/0f5b804d-3852-4159-b151-308a57f1e...@oracle.com/
> > > 
> > > This might be better handled in a separate series though rather than
> > > as part of this one.
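(Two asides on the exchange above. First, the delta in the quoted log is
16-bit wrap-around arithmetic: (0x0 - 0xc) & 0xffff = 0xfff4, i.e. the
guest's avail index appears 12 entries behind the host's. Second, the
mlock[all]() idea could be prototyped on the destination as in the minimal
sketch below. This is purely a diagnostic aid to test the preheating
theory, not code from the series; QEMU itself already exposes a comparable
knob via -overcommit mem-lock.)

  #include <stdio.h>
  #include <sys/mman.h>

  /* Sketch only: fault in and pin everything currently mapped
   * (MCL_CURRENT) plus any future mappings (MCL_FUTURE), so device load
   * on the destination does not pay first-touch page-fault costs.
   * Pinning all of the process's memory is expensive, so this is only
   * useful for testing the preheat hypothesis. */
  static int preheat_memory(void)
  {
      if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
          perror("mlockall");
          return -1;
      }
      return 0;
  }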
> > One thing to mention is I recall some other developer was trying to
> > optimize device load from the memory side:
> > 
> > https://lore.kernel.org/all/20230317081904.24389-1-xuchuangxc...@bytedance.com/
> > 
> > So maybe there's more than one way of doing this, and I'm not sure
> > which way is better, or both.
> 
> Ack. I'll take a look at this.
> 
> > > >   - Normally devices need an iterative phase because:
> > > > 
> > > >     (a) the device may contain a huge amount of data to transfer
> > > > 
> > > >         E.g. RAM and VFIO are good examples and fall into this
> > > >         category.
> > > > 
> > > >     (b) the device states are "iterable" in concept
> > > > 
> > > >         RAM is definitely true. VFIO somehow mimicked that even
> > > >         though it was a streamed binary protocol..
> > > > 
> > > >     What's the answer for virtio-net here? How large is the device
> > > >     state? Is this relevant to vDPA and real hardware (so virtio-net
> > > >     can look similar to VFIO at some point)?
> > > 
> > > The main motivation behind implementing iterative migration for
> > > virtio-net is really to improve the guest-visible downtime seen when
> > > migrating a vDPA device.
> > > 
> > > That is, by implementing iterative migration for virtio-net, we can
> > > see the state of the device early on and get a head start on work
> > > that's currently being done during the stop-and-copy phase. If we do
> > > this work before the stop-and-copy phase, we can further decrease the
> > > time spent in this window.
> > > 
> > > This would include work such as sending down the CVQ commands for
> > > queue-pair creation (even more beneficial for multiqueue), RSS,
> > > filters, etc.
> > > 
> > > I'm hoping to show this more explicitly in the next version of this
> > > RFC series that I'm working on now.
> > 
> > OK, thanks for the context. I can wait and read the new version.
> > 
> > In all cases, please note that since the migration thread does not take
> > the BQL, either the setup or the iterable phase may happen concurrently
> > with any of the vCPU threads. I think it means it may not be wise to
> > try to iterate everything: please be ready to see e.g. a 64-bit MMIO
> > register being partially updated when dumping it to the wire.
> 
> Gotcha. Some of the iterative hooks, though, like .save_setup,
> .load_state, etc., do hold the BQL, right?
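(For context on the hooks being discussed: a minimal sketch of how a
device registers iterative-migration callbacks through QEMU's
SaveVMHandlers, as declared in migration/register.h. The virtio_net_*
helper names and the "virtio-net-iter" id are hypothetical, and the exact
callback signatures vary across QEMU releases; this follows the older
pre-9.0 style.)

  /* Sketch only - not the series' implementation. */
  static int virtio_net_save_setup(QEMUFile *f, void *opaque)
  {
      /* Runs once before iterations start; today this is called with the
       * BQL held, though per the reply below that may not stay true. */
      return 0;
  }

  static int virtio_net_save_live_iterate(QEMUFile *f, void *opaque)
  {
      /* Runs repeatedly WITHOUT the BQL, concurrently with vCPU threads,
       * so any device field read here may be torn (e.g. a 64-bit MMIO
       * register updated by the guest mid-read). */
      return 1;  /* positive: nothing more to send in this round */
  }

  static int virtio_net_load_state(QEMUFile *f, void *opaque, int version_id)
  {
      /* Destination side; runs with the BQL held. */
      return 0;
  }

  static const SaveVMHandlers savevm_virtio_net_handlers = {
      .save_setup        = virtio_net_save_setup,
      .save_live_iterate = virtio_net_save_live_iterate,
      .load_state        = virtio_net_load_state,
  };

  /* Registered once at device realize time, e.g.:
   *   register_savevm_live("virtio-net-iter", 0, 1,
   *                        &savevm_virtio_net_handlers, vdev);
   */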
load_state() definitely needs the lock. save_setup(): yes, we have the
BQL, but I really wish we didn't depend on it, and I don't know whether
that will keep holding true - AFAIU, the majority of it really doesn't
need the lock.. and I always wanted to see whether I could remove it.

Normal iterations definitely run without the lock.

> > Do you have a rough estimation of the size of the device states to
> > migrate?
> 
> Do you have a method for how I might be able to estimate this? I've been
> trying to get some kind of rough estimation but failing to do so.

Could I ask why you started this "migrate virtio-net in the iteration
phase" effort? I thought it was because there's a lot of data to migrate,
and there should be a way to estimate the minimum. So is that not the
case?

How about vDPA devices? Do those devices have a lot of data to migrate?

We really need a good enough reason to have a device provide
save_iterate(). If it's only about "preheating some MMIO registers", we
should, IMHO, look at more generic ways first.

Thanks,

-- 
Peter Xu
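(One way the size question above could be answered empirically: serialize
the device's vmstate into a memory-backed QEMUFile and count the bytes.
QIOChannelBuffer, qemu_file_new_output(), and vmstate_save_state() are
existing QEMU APIs, but this helper itself is an illustrative sketch, not
code from the thread or the series.)

  #include "qemu/osdep.h"
  #include "io/channel-buffer.h"
  #include "migration/qemu-file.h"
  #include "migration/vmstate.h"

  /* Hypothetical helper: returns how many bytes @vmsd serializes to for
   * the device state @opaque, or -1 on failure. */
  static ssize_t vmstate_measure(const VMStateDescription *vmsd, void *opaque)
  {
      QIOChannelBuffer *bioc = qio_channel_buffer_new(4096);
      QEMUFile *f = qemu_file_new_output(QIO_CHANNEL(bioc));
      ssize_t size = -1;

      /* The QEMUFile holds its own reference to the channel. */
      object_unref(OBJECT(bioc));

      if (vmstate_save_state(f, vmsd, opaque, NULL) == 0) {
          qemu_fflush(f);
          size = bioc->usage;  /* bytes actually written to the buffer */
      }
      qemu_fclose(f);
      return size;
  }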