On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
> On 29.04.2024 17:09, Peter Xu wrote:
> > On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
> > > On 24.04.2024 00:35, Peter Xu wrote:
> > > > On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
> > > > > On 24.04.2024 00:20, Peter Xu wrote:
> > > > > > On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > On 19.04.2024 17:31, Peter Xu wrote:
> > > > > > > > On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
> > > > > > > > > On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
> > > > > > > > > > On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
> > > > > > > > > > > I think one of the reasons for these results is that mixed (RAM + device state) multifd channels participate in the RAM sync process (MULTIFD_FLAG_SYNC) whereas dedicated device state channels don't.
> > > > > > > > > > Firstly, I'm wondering whether we can have better names for these new hooks. Currently (only commenting on the async* stuff):
> > > > > > > > > >
> > > > > > > > > > - complete_precopy_async
> > > > > > > > > > - complete_precopy
> > > > > > > > > > - complete_precopy_async_wait
> > > > > > > > > >
> > > > > > > > > > But perhaps better:
> > > > > > > > > >
> > > > > > > > > > - complete_precopy_begin
> > > > > > > > > > - complete_precopy
> > > > > > > > > > - complete_precopy_end
> > > > > > > > > >
> > > > > > > > > > ?
> > > > > > > > > >
> > > > > > > > > > As I don't see why the device must do something async in such a hook. To me it's more like you're splitting one process into multiple, so begin/end sounds more generic.
> > > > > > > > > >
> > > > > > > > > > Then, with that in mind, IIUC we can already split ram_save_complete() into >1 phases too. For example, I would be curious whether the performance will go back to normal if we offload multifd_send_sync_main() into complete_precopy_end(), because we really only need one shot of that, and I am quite surprised it already greatly affects VFIO dumping its own things.
> > > > > > > > > >
> > > > > > > > > > I would even ask one step further, as Dan was asking: have you thought about dumping VFIO states via multifd even during iterations? Would that help even more than this series (which IIUC only helps during the blackout phase)?
> > > > > > > > > To dump during RAM iteration, the VFIO device will need to have dirty tracking and iterate on its state, because the guest CPUs will still be running, potentially changing VFIO state. That seems impractical in the general case.
> > > > > > > > We already do such iterations in vfio_save_iterate()?
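(For concreteness, here is a rough sketch of what the begin/end split discussed above could look like on the SaveVMHandlers side. The existing save_live_complete_precopy hook and its signature are real; the two *_begin/*_end hooks below merely follow the naming proposed in this thread, and their signatures are assumptions for illustration, not existing code.)

/*
 * Sketch only: fragment of SaveVMHandlers with the proposed split.
 * "begin" would kick off asynchronous device-state dumping (e.g. via
 * per-device threads), "end" would wait for it -- and would also be a
 * natural single place to call multifd_send_sync_main() once.
 */
struct SaveVMHandlers {
    /* ... existing hooks elided ... */
    int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);

    /* proposed: start asynchronous device-state dumping */
    int (*save_live_complete_precopy_begin)(QEMUFile *f, void *opaque);
    /* proposed: wait for the asynchronous work started above */
    int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
};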
> > > > > > > >
> > > > > > > > My understanding is the recent VFIO work is based on the fact that the VFIO device can track device state changes more or less (besides being able to save/load full states). E.g. I still remember in our QE tests some old devices reported many more dirty pages than expected during the iterations when we were looking into an issue where a huge amount of dirty pages was reported. But newer models seem to have fixed that and report much less.
> > > > > > > >
> > > > > > > > That issue was about GPUs, not NICs, though, and IIUC a major portion of such tracking used to be for GPU vRAMs. So maybe I was mixing these up, and maybe they work differently.
> > > > > > > The device which this series was developed against (Mellanox ConnectX-7) is already transferring its live state before the VM gets stopped (via the save_live_iterate SaveVMHandler).
> > > > > > >
> > > > > > > It's just that in addition to the live state it has more than 400 MiB of state that cannot be transferred while the VM is still running. And that fact hurts a lot with respect to the migration downtime.
> > > > > > >
> > > > > > > AFAIK it's a very similar story for (some) GPUs.
> > > > > > So during the iteration phase VFIO cannot yet leverage the multifd channels with this series, am I right?
> > > > > That's right.
> > > > >
> > > > > > Is it possible to extend that use case too?
> > > > > I guess so, but since this phase (iteration while the VM is still running) doesn't impact downtime it is much less critical.
> > > > But it affects the bandwidth, e.g. even with multifd enabled, the device iteration data will still bottleneck at ~15Gbps on a common system setup in the best case, even if the hosts are 100Gbps direct connected. Would that be a concern in the future too, or is it a known problem that won't be fixed anyway?
> > > I think any improvements to the migration performance are good, even if they don't impact downtime.
> > >
> > > It's just that this patch set focuses on the downtime phase as the more critical thing.
> > >
> > > After this gets improved there's no reason not to look at improving performance of the VM live phase too if it brings sensible improvements.
> > > > I remember Avihai used to have a plan to look into similar issues; I hope this is exactly what he is looking for. Otherwise changing the migration protocol from time to time is cumbersome; we always need to provide a flag to make sure old systems migrate in the old way and new systems run the new way, and for such a relatively major change I'd want to double check how far away we are from supporting offload of VFIO iteration data to multifd.
> > > The device state transfer is indicated by a new flag in the multifd header (MULTIFD_FLAG_DEVICE_STATE).
> > >
> > > If we are to use multifd channels for VM live phase transfers these could simply re-use the same flag type.
> > Right, and that's also the major purpose of my request to consider both issues.
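(For reference, a minimal sketch of how such a packet-header flag can be consumed on the receive side, so the same packet format could later carry VM live phase data as well. The flag value and both helper names below are hypothetical, chosen for illustration only; only the general flags-dispatch pattern mirrors the existing multifd code.)

/*
 * Sketch only; assumes the types from migration/multifd.h.  The bit
 * value and the two helpers are made up; MultiFDRecvParams::flags is
 * the field the existing multifd code already fills from the packet
 * header.
 */
#define MULTIFD_FLAG_DEVICE_STATE  (1u << 4)    /* hypothetical bit */

static int multifd_recv_dispatch(MultiFDRecvParams *p)
{
    if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
        /* device state payload: hand it to the device loading thread */
        return multifd_device_state_recv(p);    /* hypothetical helper */
    }
    /* otherwise this is ordinary RAM data (plus MULTIFD_FLAG_SYNC) */
    return multifd_ram_recv(p);                 /* hypothetical helper */
}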
> >
> > If supporting iterators can be easy on top of this, I am thinking whether we should do this in one shot. The problem is that even if the flag type can be reused, old/new QEMU binaries may not be compatible and may not migrate well when:
> >
> > - The old QEMU only supports the downtime optimizations
> > - The new QEMU supports both downtime + iteration optimizations
> I think the situation here will be the same as with any new flag affecting the migration wire protocol - if the old version of QEMU doesn't support that flag then it has to be kept at its backward-compatible setting for migration to succeed.
>
> > IIUC, at least the device threads are currently created only at the end of migration when switching over for the downtime-only optimization (aka, this series). Then it means it won't be compatible with a new QEMU, as the threads there will need to be created before iteration starts in order to take iteration data. So I believe we'll need yet another flag to tune such behavior, one for each optimization (downtime vs. data during iterations). If they work mostly similarly, I want to avoid two flags. It'll be chaos for users to see such similar flags and they'll be pretty confusing.
> The VFIO loading threads are created from vfio_load_setup(), which is called at the very beginning of the migration, so they should already be there.
>
> However, they aren't currently prepared to receive VM live phase data.
> > If possible, I wish we could spend some time looking into that if they're so close, and if it's low-hanging fruit on top of this series, maybe we can consider doing that in one shot.
> I'm still trying to figure out the complete explanation of why dedicated device state channels improve downtime, as there was a bunch of holidays here last week.
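(Going back to the backward-compatibility point above: one common way QEMU keeps a new migration behaviour at its old setting for pre-existing machine types is a machine-type compat entry. This is purely illustrative; the property name below is made up and not something this series defines.)

/*
 * Sketch only; GlobalProperty comes from hw/qdev-core.h.  This would
 * force a hypothetical "x-migration-multifd-iter" property of the
 * migration object back to its backward-compatible (disabled) value
 * for older machine types, so migration to/from older QEMU keeps
 * working.
 */
GlobalProperty hw_compat_9_0[] = {
    /* ... existing compat entries elided ... */
    { "migration", "x-migration-multifd-iter", "false" },  /* hypothetical */
};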
No rush. I am not sure whether it'll reduce downtime, but it may improve the total migration time when multiple devices are used.

> I will have a look later at what it would take to add VM live phase multifd device state transfer support, and also how invasive it would be, as I think it's better to keep the number of code conflicts in a patch set to a manageable size since that reduces the chance of accidentally introducing regressions when forward-porting the patch set to git master.

Yes, it makes sense. It'll be good to look one step further in this case, then:

- If it's easy to add support then we do it in one batch, or

- If it's not easy to add support, but we can find a compatible way so that the ABI can stay transparent when adding that later, that'll also be nice, or

- If we have a solid clue that it should be a major separate piece of work, and we definitely need a new flag, then we at least know we should simply split the effort due to that complexity

The hope is that option (1)/(2) would work out.

I hope Avihai can also chime in here (or please reach out to him) because I remember he used to consider proposing such a whole solution, but maybe I just misunderstood. I suppose there's no harm in checking with him.

Thanks,

--
Peter Xu