On Mon, 4 Nov 2024 at 23:19, Peter Xu <pet...@redhat.com> wrote: > Precopy keeps sending data even during postcopy, that's the background > stream (with/without preempt feature enabled). May need some amendment > when repost here.
* Okay. > > + if (channel == CH_POSTCOPY) { > > + return false; > > + } > > Please see my other reply, I think here it should never be POSTCOPY > channel, because postcopy (aka, preempt) channel is only created after the > main channel.. So I wonder whether this "if" will hit at all. * It does. migration_ioc_process_incoming() is called for each incoming connection. And every time it calls migration_should_start_incoming() to check if it should proceed with the migration or should wait for further connections, because multifd does not start until all its connections are established. * Similarly when the Postcopy channel is initiated towards the end of Precopy migration, migration_should_start_incoming() gets called, and returns 'false' because we don't want to start a new incoming migration at that time. Earlier it was receiving boolean value 'default_channel' as parameter, which was set to false while accepting 'postcopy' connection via => default_channel = !mis->from_src_file; > > + > > /* Multifd doesn't start unless all channels are established */ > > if (migrate_multifd()) { > > - return migration_has_all_channels(); > > - } > > - > > - /* Preempt channel only starts when the main channel is created */ > > - if (migrate_postcopy_preempt()) { > > - return main_channel; > > + return multifd_recv_all_channels_created(); > > I think this is incorrect.. We should also need to check main channel is > established before start incoming. The old code uses > migration_has_all_channels() which checks that. * Okay, will try to fix it better. But calling migration_has_all_channels() after checking migrate_multifd() does not seem/read right. > I don't yet see how you manage the multifd threads, etc, on both sides. Or > any logic to make sure multifd will properly flush the pages before > postcopy starts. IOW, any guarantee that all the pages will only be > installed using UFFDIO_COPY as long as vcpu starts on dest. Any comments? * There are no changes related to that. As this series only tries to enable the multifd and postcopy options together. > The most complicated part of this work would be testing, to make sure it > works in all previous cases, and that's majorly why we disabled it before: > it was because it randomly crashes, but not always; fundamentally it's > because when multifd was designed there wasn't enough consideration on > working together with postcopy. We didn't clearly know what's missing at > that time. * Right. I have tested this series with the following migration commands to confirm that migration works as expected and there were no crash(es) observed. === [Precopy] # virsh migrate --verbose --live --auto-converge f39vm qemu+ssh://<destination-host-name>/system [Precopy + Multifd] # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 02 \ f39vm qemu+ssh://<destination-host-name>/system # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 04 \ f39vm qemu+ssh://<destination-host-name>/system # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 08 \ f39vm qemu+ssh://<destination-host-name>/system # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 16 \ f39vm qemu+ssh://<destination-host-name>/system [Postcopy] # virsh migrate --verbose --live --auto-converge \ --postcopy --postcopy-after-precopy f39vm qemu+ssh://<destination-host-name>/system [Postcopy + Multifd] # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 02 \ --postcopy --postcopy-after-precopy f39vm qemu+ssh://<destination-host-name>/system # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 04 \ --postcopy --postcopy-after-precopy f39vm qemu+ssh://<destination-host-name>/system # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 08 \ --postcopy --postcopy-after-precopy f39vm qemu+ssh://<destination-host-name>/system # virsh migrate --verbose --live --auto-converge --parallel --parallel-connections 16 \ --postcopy --postcopy-after-precopy f39vm qemu+ssh://<destination-host-name>/system === > So we would definitely need to add test cases, just like whoever adds new > features to migration, to make sure at least it works for existing multifd > / postcopy test cases, but when both features enabled. ... > It will add quite a few tests to run here, but I don't see a good way > otherwise when we want to enable the two features, because it is indeed a > challenge to enable the two major features together here. > * Thank you for the hints about the tests, will surely look into them and try to learn about how to add new test cases. Thank you. --- - Prasad