On Tue, Nov 05, 2024 at 05:24:55PM +0530, Prasad Pandit wrote:
> On Mon, 4 Nov 2024 at 23:19, Peter Xu <pet...@redhat.com> wrote:
> > Precopy keeps sending data even during postcopy, that's the background
> > stream (with/without preempt feature enabled).  May need some amendment
> > when repost here.
> 
> * Okay.
> 
> > > +    if (channel == CH_POSTCOPY) {
> > > +        return false;
> > > +    }
> >
> > Please see my other reply, I think here it should never be POSTCOPY
> > channel, because postcopy (aka, preempt) channel is only created after the
> > main channel.. So I wonder whether this "if" will hit at all.
> 
> * It does. migration_ioc_process_incoming() is called for each
> incoming connection. And every time it calls
> migration_should_start_incoming() to check if it should proceed with
> the migration or should wait for further connections, because multifd
> does not start until all its connections are established.
> 
> * Similarly when the Postcopy channel is initiated towards the end of
> Precopy migration, migration_should_start_incoming() gets called, and
> returns 'false' because we don't want to start a new incoming
> migration at that time. Earlier it was receiving boolean value
> 'default_channel' as parameter, which was set to false while accepting
> 'postcopy' connection via => default_channel = !mis->from_src_file;

Hmm yes, it will happen, but should only happen after the main channel.

> 
> > > +
> > >      /* Multifd doesn't start unless all channels are established */
> > >      if (migrate_multifd()) {
> > > -        return migration_has_all_channels();
> > > -    }
> > > -
> > > -    /* Preempt channel only starts when the main channel is created */
> > > -    if (migrate_postcopy_preempt()) {
> > > -        return main_channel;
> > > +        return multifd_recv_all_channels_created();
> >
> > I think this is incorrect.. We should also need to check main channel is
> > established before start incoming.  The old code uses
> > migration_has_all_channels() which checks that.
> 
> * Okay, will try to fix it better.  But calling
> migration_has_all_channels() after checking migrate_multifd() does not
> seem/read right.
> 
> > I don't yet see how you manage the multifd threads, etc, on both sides.  Or
> > any logic to make sure multifd will properly flush the pages before
> > postcopy starts.  IOW, any guarantee that all the pages will only be
> > installed using UFFDIO_COPY as long as vcpu starts on dest.  Any comments?
> 
> * There are no changes related to that. As this series only tries to
> enable the multifd and postcopy options together.

In short, qemu_savevm_state_complete_precopy_iterable() is skipped in
postcopy.  I don't yet see anywhere multifd flushes pages.  We need to
flush multifd pages before vcpu starts on dest.

As we discussed off-list, we can add an assertion in multifd recv threads
to make sure it won't touch any guest page during active postcopy.

Maybe something like:

diff --git a/migration/multifd.c b/migration/multifd.c
index 4374e14a96..32425137bd 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -1182,6 +1182,13 @@ static void *multifd_recv_thread(void *opaque)
         }
 
         if (has_data) {
+            /*
+             * Now we're going to receive some guest pages into iov
+             * buffers.  If it's during postcopy, it means vcpu can be
+             * running, so this can corrupt memory when modified
+             * concurrently by multifd threads!
+             */
+            assert(!migration_in_postcopy());
             ret = multifd_recv_state->ops->recv(p, &local_err);
             if (ret != 0) {
                 break;

> 
> > The most complicated part of this work would be testing, to make sure it
> > works in all previous cases, and that's majorly why we disabled it before:
> > it was because it randomly crashes, but not always; fundamentally it's
> > because when multifd was designed there wasn't enough consideration on
> > working together with postcopy.  We didn't clearly know what's missing at
> > that time.
> 
> * Right. I have tested this series with the following migration
> commands to confirm that migration works as expected and there were no
> crash(es) observed.
> ===
> [Precopy]
> # virsh migrate --verbose --live --auto-converge f39vm
> qemu+ssh://<destination-host-name>/system
> 
> [Precopy + Multifd]
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 02 \
>      f39vm  qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 04 \
>      f39vm  qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 08 \
>      f39vm  qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 16 \
>      f39vm  qemu+ssh://<destination-host-name>/system
> 
> [Postcopy]
> # virsh migrate --verbose --live --auto-converge \
>     --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system

I'd suggest we try interrupt postcopy with migrate-pause, then go with
postcopy recover.  I wonder how the current series work if the network
failure will fail all the multifd iochannels, and how reconnect works.

> 
> [Postcopy + Multifd]
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 02 \
>     --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 04 \
>     --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 08 \
>     --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> # virsh migrate --verbose --live --auto-converge --parallel
> --parallel-connections 16 \
>     --postcopy --postcopy-after-precopy f39vm
> qemu+ssh://<destination-host-name>/system
> ===
> 
> > So we would definitely need to add test cases, just like whoever adds new
> > features to migration, to make sure at least it works for existing multifd
> > / postcopy test cases, but when both features enabled.
> ...
> > It will add quite a few tests to run here, but I don't see a good way
> > otherwise when we want to enable the two features, because it is indeed a
> > challenge to enable the two major features together here.
> >
> 
> * Thank you for the hints about the tests, will surely look into them
> and try to learn about how to add new test cases.

Thanks,

-- 
Peter Xu


Reply via email to