Jason Gunthorpe <j...@nvidia.com> wrote:
> On Tue, May 17, 2022 at 10:00:45AM -0600, Alex Williamson wrote:
>
>> > This is really intended to be a NOP from where things are now, as if
>> > you use mlx5 live migration without a patch like this then it causes a
>> > botched pre-copy since everything just ends up permanently dirty.
>> >
>> > If it makes more sense we could abort the pre-copy too - in the end
>> > there will be dirty tracking so I don't know if I'd invest in a big
>> > adventure to fully define non-dirty tracking migration.
>>
>> How is pre-copy currently "botched" without a patch like this?  If it's
>> simply that the pre-copy doesn't converge and the downtime constraints
>> don't allow the VM to enter stop-and-copy, that's the expected behavior
>> AIUI, and supports backwards compatibility with existing SLAs.
>
> It means it always fails - that certainly isn't working live
> migration. There is no point in trying to converge something that we
> already know will never converge.
Fully agree with you here.  But not with how this is being done.

I think we need a way to convince the migration code that it shouldn't
even try to migrate RAM.  That would fix the current use case, and your
use case.

>> I'm assuming that by setting this new skip_precopy flag that we're
>> forcing the VM to move to stop-and-copy, regardless of any other SLA
>> constraints placed on the migration.
>
> That does seem like a defect in this patch, any SLA constraints should
> still all be checked under the assumption all ram is dirty.

And how are we going to:
- detect the network link speed
- make sure that we are inside the downtime limit

I think that it is not possible, so basically we are skipping the
pre-copy stage and praying that the other bits are going to be ok.
(See the convergence sketch at the end of this mail.)

>> It seems like a better solution would be to expose to management
>> tools that the VM contains a device that does not support the
>> pre-copy phase so that downtime expectations can be adjusted.
>
> I don't expect this to be a real use case though..
>
> Remember, you asked for this patch when you wanted qemu to have good
> behavior when kernel support for legacy dirty tracking is removed
> before we merge v2 support.

I am ignorant on the subject.  Can I ask how the dirty memory is
tracked in this v2?

Thanks, Juan.
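
For illustration only, here is a rough sketch of the kind of convergence
check being discussed: pre-copy ends once the estimated stop-and-copy
downtime (remaining dirty bytes divided by measured link bandwidth) drops
below the configured downtime limit.  This is not the actual QEMU code;
the function and variable names (migration_should_converge, dirty_bytes,
bandwidth_bytes_per_ms, downtime_limit_ms) are made up for the sketch.
With a device that keeps all of RAM permanently dirty, the estimate never
shrinks, which is why such a migration can never converge.

/*
 * Hypothetical sketch of a pre-copy convergence check; the names are
 * illustrative, not the real QEMU symbols.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Estimate the stop-and-copy downtime from the data that is still dirty. */
static bool migration_should_converge(uint64_t dirty_bytes,
                                      uint64_t bandwidth_bytes_per_ms,
                                      uint64_t downtime_limit_ms)
{
    if (bandwidth_bytes_per_ms == 0) {
        return false;           /* link speed unknown: cannot decide */
    }
    uint64_t expected_downtime_ms = dirty_bytes / bandwidth_bytes_per_ms;
    return expected_downtime_ms <= downtime_limit_ms;
}

int main(void)
{
    /*
     * A device without dirty tracking keeps all 4 GiB of guest RAM
     * permanently dirty; measured bandwidth 1 GiB/s, downtime limit 300 ms.
     */
    uint64_t dirty = 4ULL << 30;
    uint64_t bw = (1ULL << 30) / 1000;  /* bytes per millisecond */

    printf("converges: %s\n",
           migration_should_converge(dirty, bw, 300) ? "yes" : "no");
    return 0;
}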