On Tue, Nov 05, 2024 at 05:24:55PM +0530, Prasad Pandit wrote:
> On Mon, 4 Nov 2024 at 23:19, Peter Xu <pet...@redhat.com> wrote:
> > Precopy keeps sending data even during postcopy, that's the background
> > stream (with/without the preempt feature enabled). May need some
> > amendment when reposting here.
>
> * Okay.
>
> > > +    if (channel == CH_POSTCOPY) {
> > > +        return false;
> > > +    }
> >
> > Please see my other reply, I think here it should never be the POSTCOPY
> > channel, because the postcopy (aka, preempt) channel is only created
> > after the main channel.. so I wonder whether this "if" will hit at all.
>
> * It does. migration_ioc_process_incoming() is called for each incoming
> connection, and every time it calls migration_should_start_incoming() to
> check whether it should proceed with the migration or wait for further
> connections, because multifd does not start until all of its connections
> are established.
>
> * Similarly, when the postcopy channel is initiated towards the end of
> precopy migration, migration_should_start_incoming() gets called and
> returns 'false', because we don't want to start a new incoming migration
> at that time. Earlier it received the boolean 'default_channel' as a
> parameter, which was set to false while accepting the 'postcopy'
> connection via => default_channel = !mis->from_src_file;

Hmm yes, it will happen, but it should only happen after the main channel.

> > > +
> > >      /* Multifd doesn't start unless all channels are established */
> > >      if (migrate_multifd()) {
> > > -        return migration_has_all_channels();
> > > -    }
> > > -
> > > -    /* Preempt channel only starts when the main channel is created */
> > > -    if (migrate_postcopy_preempt()) {
> > > -        return main_channel;
> > > +        return multifd_recv_all_channels_created();
> >
> > I think this is incorrect.. we also need to check that the main channel
> > is established before starting incoming. The old code uses
> > migration_has_all_channels(), which checks that.
>
> * Okay, will try to fix it better. But calling
> migration_has_all_channels() after checking migrate_multifd() does not
> seem/read right.
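FWIW, I think it can still read fine if the postcopy check stays first and
multifd simply reuses the existing helper.  Something like this untested
sketch (I'm guessing CH_MAIN as the name of the main channel id here; use
whatever this series calls it):

    static bool migration_should_start_incoming(uint8_t channel)
    {
        /* The postcopy (preempt) channel never starts a new incoming */
        if (channel == CH_POSTCOPY) {
            return false;
        }

        /*
         * Multifd doesn't start unless all channels are established.
         * migration_has_all_channels() covers the main channel too, so
         * no separate main-channel check is needed on this path.
         */
        if (migrate_multifd()) {
            return migration_has_all_channels();
        }

        /* Without multifd, the main channel alone is enough to start */
        return channel == CH_MAIN;
    }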
> > I don't yet see how you manage the multifd threads, etc., on both
> > sides, or any logic to make sure multifd will properly flush the pages
> > before postcopy starts. IOW, any guarantee that all the pages will only
> > be installed using UFFDIO_COPY once vcpus start on dest. Any comments?
>
> * There are no changes related to that, as this series only tries to
> enable the multifd and postcopy options together.

In short, qemu_savevm_state_complete_precopy_iterable() is skipped in
postcopy.  I don't yet see anywhere that multifd flushes pages.  We need to
flush multifd pages before vcpus start on dest.

As we discussed off-list, we can add an assertion in the multifd recv
threads to make sure they won't touch any guest page during active
postcopy.  Maybe something like:

diff --git a/migration/multifd.c b/migration/multifd.c
index 4374e14a96..32425137bd 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1182,6 +1182,13 @@ static void *multifd_recv_thread(void *opaque)
         }
 
         if (has_data) {
+            /*
+             * Now we're going to receive some guest pages into iov
+             * buffers.  If it's during postcopy, it means vcpu can be
+             * running, so this can corrupt memory when modified
+             * concurrently by multifd threads!
+             */
+            assert(!migration_in_postcopy());
             ret = multifd_recv_state->ops->recv(p, &local_err);
             if (ret != 0) {
                 break;
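For the flush itself, I'd expect the series to need something along these
lines on the source side — a rough sketch only, to show the idea.  I'm
reusing multifd_send_sync_main() as the sync point here; the exact helper,
its signature, and the placement may well end up different:

    /* In postcopy_start(), before we send the postcopy-run command: */
    if (migrate_multifd()) {
        /*
         * Hypothetical flush point: push out and wait for everything
         * still queued on the multifd send channels, so that once dest
         * vcpus can run, every remaining page is installed strictly via
         * UFFDIO_COPY on the postcopy path.
         */
        if (multifd_send_sync_main() < 0) {
            return -1;
        }
    }

The dest side then needs to guarantee all multifd recv threads have drained
their queues before the listen phase starts — which is exactly what the
assertion above should catch if we get it wrong.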
> > The most complicated part of this work would be testing, to make sure
> > it works in all previous cases, and that's majorly why we disabled it
> > before: it was because it randomly crashed, but not always;
> > fundamentally it's because when multifd was designed there wasn't
> > enough consideration on working together with postcopy. We didn't
> > clearly know what was missing at that time.
>
> * Right. I have tested this series with the following migration commands
> to confirm that migration works as expected, and no crashes were
> observed.
>
> ===
> [Precopy]
> # virsh migrate --verbose --live --auto-converge f39vm \
>     qemu+ssh://<destination-host-name>/system
>
> [Precopy + Multifd]
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 02 f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 04 f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 08 f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 16 f39vm qemu+ssh://<destination-host-name>/system
>
> [Postcopy]
> # virsh migrate --verbose --live --auto-converge \
>     --postcopy --postcopy-after-precopy f39vm \
>     qemu+ssh://<destination-host-name>/system

I'd suggest we also try interrupting postcopy with migrate-pause, then go
through postcopy recover (see the rough QMP sequence in the PS below).  I
wonder how the current series works if a network failure fails all the
multifd iochannels, and how reconnect works.

> [Postcopy + Multifd]
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 02 --postcopy --postcopy-after-precopy \
>     f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 04 --postcopy --postcopy-after-precopy \
>     f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 08 --postcopy --postcopy-after-precopy \
>     f39vm qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel \
>     --parallel-connections 16 --postcopy --postcopy-after-precopy \
>     f39vm qemu+ssh://<destination-host-name>/system
> ===

> > So we would definitely need to add test cases, just like whoever adds
> > new features to migration, to make sure at least it works for existing
> > multifd / postcopy test cases, but with both features enabled.
...
> > It will add quite a few tests to run here, but I don't see a good way
> > otherwise when we want to enable the two features, because it is indeed
> > a challenge to enable the two major features together here.
>
> * Thank you for the hints about the tests, will surely look into them
> and try to learn how to add new test cases.

Thanks,

-- 
Peter Xu
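PS: For the migrate-pause / recover test mentioned above, the QMP sequence
I have in mind is roughly the following — written from memory, so please
double-check the arguments; <dest-host> and <port> are placeholders:

  (1) During active postcopy, on the source (simulating a network failure):

      { "execute": "migrate-pause" }

  (2) On the destination, set up a new incoming URI:

      { "execute": "migrate-recover",
        "arguments": { "uri": "tcp:<dest-host>:<port>" } }

  (3) On the source, resume the migration:

      { "execute": "migrate",
        "arguments": { "uri": "tcp:<dest-host>:<port>", "resume": true } }

The interesting part for this series is what happens to the multifd
iochannels across that pause/recover cycle.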