On Mon, 4 Nov 2024 at 23:19, Peter Xu <pet...@redhat.com> wrote:
> Precopy keeps sending data even during postcopy, that's the background
> stream (with/without preempt feature enabled).  May need some amendment
> when repost here.

* Okay.

> > +    if (channel == CH_POSTCOPY) {
> > +        return false;
> > +    }
>
> Please see my other reply, I think here it should never be POSTCOPY
> channel, because postcopy (aka, preempt) channel is only created after the
> main channel.. So I wonder whether this "if" will hit at all.

* It does. migration_ioc_process_incoming() is called for each
incoming connection. And every time it calls
migration_should_start_incoming() to check if it should proceed with
the migration or should wait for further connections, because multifd
does not start until all its connections are established.

* Similarly when the Postcopy channel is initiated towards the end of
Precopy migration, migration_should_start_incoming() gets called, and
returns 'false' because we don't want to start a new incoming
migration at that time. Earlier it was receiving boolean value
'default_channel' as parameter, which was set to false while accepting
'postcopy' connection via => default_channel = !mis->from_src_file;

> > +
> >      /* Multifd doesn't start unless all channels are established */
> >      if (migrate_multifd()) {
> > -        return migration_has_all_channels();
> > -    }
> > -
> > -    /* Preempt channel only starts when the main channel is created */
> > -    if (migrate_postcopy_preempt()) {
> > -        return main_channel;
> > +        return multifd_recv_all_channels_created();
>
> I think this is incorrect.. We should also need to check main channel is
> established before start incoming.  The old code uses
> migration_has_all_channels() which checks that.

* Okay, will try to fix it better.  But calling
migration_has_all_channels() after checking migrate_multifd() does not
seem/read right.

> I don't yet see how you manage the multifd threads, etc, on both sides.  Or
> any logic to make sure multifd will properly flush the pages before
> postcopy starts.  IOW, any guarantee that all the pages will only be
> installed using UFFDIO_COPY as long as vcpu starts on dest.  Any comments?

* There are no changes related to that. As this series only tries to
enable the multifd and postcopy options together.

> The most complicated part of this work would be testing, to make sure it
> works in all previous cases, and that's majorly why we disabled it before:
> it was because it randomly crashes, but not always; fundamentally it's
> because when multifd was designed there wasn't enough consideration on
> working together with postcopy.  We didn't clearly know what's missing at
> that time.

* Right. I have tested this series with the following migration
commands to confirm that migration works as expected and there were no
crash(es) observed.
===
[Precopy]
# virsh migrate --verbose --live --auto-converge f39vm
qemu+ssh://<destination-host-name>/system

[Precopy + Multifd]
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 02 \
     f39vm  qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 04 \
     f39vm  qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 08 \
     f39vm  qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 16 \
     f39vm  qemu+ssh://<destination-host-name>/system

[Postcopy]
# virsh migrate --verbose --live --auto-converge \
    --postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system

[Postcopy + Multifd]
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 02 \
    --postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 04 \
    --postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 08 \
    --postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 16 \
    --postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
===

> So we would definitely need to add test cases, just like whoever adds new
> features to migration, to make sure at least it works for existing multifd
> / postcopy test cases, but when both features enabled.
...
> It will add quite a few tests to run here, but I don't see a good way
> otherwise when we want to enable the two features, because it is indeed a
> challenge to enable the two major features together here.
>

* Thank you for the hints about the tests, will surely look into them
and try to learn about how to add new test cases.

Thank you.
---
  - Prasad


Reply via email to