Re: [PATCH v5 3/5] migration: enable multifd and postcopy together

Prasad Pandit Fri, 07 Feb 2025 02:34:10 -0800

Hi,

On Fri, 7 Feb 2025 at 04:46, Peter Xu <pet...@redhat.com> wrote:
> > +/* Migration channel types */
> > +enum { CH_DEFAULT, CH_MULTIFD, CH_POSTCOPY };
>
> Maybe s/DEFAULT/MAIN/?


* Okay.

> > -    if (migrate_multifd() && !migrate_mapped_ram() &&
> > -        !migrate_postcopy_ram() &&
> > -        qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_READ_MSG_PEEK)) {
> > +    if (!migration_should_start_incoming(channel)) {
>
> This says "if we assume this is the main channel, and if we shouldn't start
> incoming migration, then we should peek at the buffers".
> Could you help explain?

* New migration starts only when the main channel and if 'multifd' is
enabled all multifd channels are established. So, if 'main' and
'multifd' channels are _not_ established then migration should _not_
start. And in that case, incoming connection is likely for one of
those channels and so we should peek at the buffers, because both
'main' and 'multifd' channels send magic values.

* migration_should_start_incoming() function returns 'true' only when
'main' and 'multifd' channels are being established. For 'postcopy'
channel it returns false.


> > +            } else if (!mis->from_src_file
> > +                        && mis->state == MIGRATION_STATUS_POSTCOPY_PAUSED) 
> > {
> > +                /* reconnect default channel for postcopy recovery */
> > +                channel = CH_DEFAULT;
>
> This is still in the big "peek buffer" if condition.
> IMHO we can skip peeking buffer when postcopy paused, because in this stage
> the channel must be (1) main channel first, then (2) preempt channel next.

* It is in the big 'peek buffer' condition because the 'main' channel
(= CH_DEFAULT) is being established here. Ideally, all channels should
send magic values to be consistent. The 'main' channel sends magic
value when it is established before starting migration, but the same
'main' channel does not send magic value when it is established during
postcopy recovery, that is an inconsistency (a bug) here. Ideal fix is
to send a magic value every time the 'main' channel is established,
irrespective of when it is established.

* Adding conditionals to check if it is _POSTCOPY_PAUSED state then
don't peek will only lead to complicated 'if' conditionals. This
channel handling code is already complex and non-intuitive enough.

> > +        } else if (mis->from_src_file
> > +            && (!strcmp(ioc->name, "migration-tls-incoming")
> > +                || !strcmp(ioc->name, "migration-file-incoming"))) {
> > +            channel = CH_MULTIFD;
>
> Confused here too.  Why do we need to check ioc name? Shouldn't multifd has
> the headers?

* Because they are not 'multifd' channels, tls/file channels don't
send magic values, but are still handled by
'multifd_recv_new_channel()' function.
===
    ...
    if (default_channel) {
        migration_incoming_setup(f);
    } else {
        if (migrate_multifd()) {
            multifd_recv_new_channel(ioc, &local_err);
        } else {
            postcopy_preempt_new_channel(mis, f);
        }
===
In the code above, if 'default_channel==false' and multifd() is
enabled, all incoming connections are handled by
'multifd_recv_new_channel()', irrespective of whether it is a
'multifd' channel or not. While creating multifd channels, there is no
check for channel type like: if(channel == CH_MULTIFD).

* IMHO, if we make all channels behave with consistency, ie. either
they all send magic value or none sends magic value, that'll simplify
this code a lot.

> > -        assert(migration_needs_multiple_sockets());
> Could I ask why removal?

* Because that function returns migrate_multifd() =>
migrate_multifd() || migrate_postcopy_preempt();
* And the very following check is also migrate_multifd(), as below:

> >          if (migrate_multifd()) {
> >              multifd_recv_new_channel(ioc, &local_err);


> It might be better to avoid such "ret && XXX" nested check.  E.g. do you
> think below easier to read?
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 74c50cc72c..9eb2f3fdeb 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1064,12 +1064,14 @@ bool migration_has_all_channels(void)
>          return false;
>      }
>
> -    if (migrate_multifd()) {
> -        return multifd_recv_all_channels_created();
> +    if (migrate_multifd() &&
> +        !multifd_recv_all_channels_created()) {
> +        return false;
>      }
>
> -    if (migrate_postcopy_preempt()) {
> -        return mis->postcopy_qemufile_dst != NULL;
> +    if (migrate_postcopy_preempt() &&
> +        mis->postcopy_qemufile_dst == NULL) {
> +        return false;
>      }
>
>      return true;

* Will try it.

> > -    if (!migrate_multifd()) {
> > +    if (!migrate_multifd() || migration_in_postcopy()) {
> >          return 0;
> >      }
>
> [1]
>
> >
> >      if (new_caps[MIGRATION_CAPABILITY_BACKGROUND_SNAPSHOT]) {
> > diff --git a/migration/ram.c b/migration/ram.c
> > index f2326788de..bdba7abe73 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1295,7 +1295,7 @@ static int find_dirty_block(RAMState *rs, 
> > PageSearchStatus *pss)
> >          pss->page = 0;
> >          pss->block = QLIST_NEXT_RCU(pss->block, next);
> >          if (!pss->block) {
> > -            if (multifd_ram_sync_per_round()) {
> > +            if (multifd_ram_sync_per_round() && !migration_in_postcopy()) {
>
> If you have above[1], why need this?

* True, I tried with just [1] above first, but it was failing for some
reason. Will try again.

> This patch still did nothing for multifd in postcopy_start().  I'm not sure
> it's safe.
>
> What happens if some multifd pages were sent, then we start postcopy, dest
> vcpu threads running, then during postcopy some multifd pages finally
> arrived and modifying the guest pages during vcpus running?

* ram_save_target_page() function saves multifd pages only when
(..!migration_in_postcopy()) not in postcopy mode. Case of 'multifd'
page arriving late on destination and 'postcopy' starting before that
is strange, because if multifd page is getting late, that network
latency should affect 'postcopy' channel too, no? But still if it is
possible, do we want to call - multifd_ram_flush_and_sync() before
postcopy_start()? Will that help?  I'll check if/how it works.

Thank you.
---
  - Prasad

Re: [PATCH v5 3/5] migration: enable multifd and postcopy together

Reply via email to