Hello Peter,

On Fri, 7 Feb 2025 at 21:16, Peter Xu <pet...@redhat.com> wrote:
> This is not easy to follow neither with the current name, nor that you
> "assumed this is main channel" and test it.

* It is not my doing, nor is there any assumption, but that is how
current implementation works.
===
static bool migration_should_start_incoming(bool main_channel)
{
    /* Multifd doesn't start unless all channels are established */
    if (migrate_multifd()) {
        return migration_has_all_channels();
    }

    /* Preempt channel only starts when the main channel is created */
    if (migrate_postcopy_preempt()) {
        return main_channel;
    }

    /*
     * For all the rest types of migration, we should only reach here when
     * it's the main channel that's being created, and we should always
     * proceed with this channel.
     */
    assert(main_channel);
    return true;
}
===
* Above code returns 'true' for 'multifd' and 'main_channel' cases.
When migrate_postcopy_preempt() is true, main_channel is 'false', so
it returns false. All I have done is reused the
migration_should_start_incoming() function to simplify the 'if'
conditional at the top.

> I think you may want to split
> migration_has_all_channels() into migration_has_essential_channels() which
> only covers main and multifd cases.  Then you can check if (!has_esential)
> here.  You'd better also add a comment that all "essential channels" can be
> peeked.
>
> You may also want to bypass a few things, e.g. "postcopy paused stage" here
> rather than inside, because postcopy-recover only happens:
>
>   - First with a main channel, that is not peekable as no header when resume
>   - Then with preempt channel, that is also not peekable
>
> [a]
>
> You may also need to keep the mapped-ram check.  They also don't support
> peek.

* Instead of adding specific conditions and splitting functions, my
request is, let's work towards consistent channel behaviour that will
automatically simplify these conditions and channel handling. Maybe we
can do that in a subsequent series.

> >
> > > > +        } else if (mis->from_src_file
> > > > +            && (!strcmp(ioc->name, "migration-tls-incoming")
> > > > +                || !strcmp(ioc->name, "migration-file-incoming"))) {
> > > > +            channel = CH_MULTIFD;
> > >
> > > Confused here too.  Why do we need to check ioc name? Shouldn't multifd 
> > > has
> > > the headers?
> >
> > * Because they are not 'multifd' channels, tls/file channels don't
> > send magic values, but are still handled by
>
> It might be because you have a bug where you removed mapped-ram check at
> [b] above.  I think we need to keep it.

* ie. Because I removed the mapped-ram check, so tls/file channels are
handled by multifd_recv_new_channel()? No, that's not the case.
Rather, that is how it works currently. I have not changed anything,
only made it more explicit to see that when it is tls/file channel,
handle it as a CH_MULTIFD type. Looking at the current code, one can
not see clearly how tls/file channels are handled.

> Why TLS channels don't send magic?

* Probably because they do TLS hand-shake while establishing a connection?

> > because if multifd page is getting late, that network
> > latency should affect 'postcopy' channel too, no? But still if it is
>
> I don't think so.  postcopy doesn't use any multifd channels.

* Yes, but it uses the same wire/network.

> > possible, do we want to call - multifd_ram_flush_and_sync() before
> > postcopy_start()? Will that help?  I'll check if/how it works.
>
> Note that all things flushed may or may not be enough, because IIUC the
> flush only makes sure all threads are synced.

* We are again using 'flush' and 'sync' interchangeably. What does -
flush only makes sure all threads are synced - mean really? Is it not
writing all socket data onto the wire/channel?

* Flush should write all socket data onto the network wire/channel.
The _order_ in which threads flush/write their socket data onto the
wire/channel is to synchronise them, maintaining/controlling that
_order_ is not flush.

> It may not make sure the order of things to happen in multifd threads and 
> postcopy thread.  The
> latter is what we need - we need to make sure no page land in postcopy 
> threads.

* Let's see:
   1) When migration is in Postcopy mode, ram_save_multifd_page() is
_not_ called on the source side. ie. no multifd data gets enqueued on
the multifd queue.
       1.1) multifd_queue_page() function also calls multifd_send() if
the queue is full, before enqueueing new pages.
   2) If a multifd page reaches the destination during Postcopy mode,
it must have been sent/written on the multifd channel before Postcopy
mode started, right?
   3) In this case, writing/flushing all multifd socket data onto the
wire/channel, before calling postcopy_start() function should help
IIUC.
       3.1) ie. calling multifd_send() before postcopy_start() should
help to clear the multifd queue, before Postcopy begins.
       3.2) Same can be done by - multifd_ram_flush_and_sync() ->
multifd_send() - sequence.

* If all multifd pages are sent/written/flushed onto the multifd
channels before postcopy_start() is called, then multifd pages should
not arrive at the destination after postcopy starts IIUC.  If that is
happening, we need a reproducer for such a case. Do we have such a
reproducer?

> That's why I was requesting to add an assert() in multifd recv thread to
> make sure we will never receive a page during postcopy.

* ie. Add  assert(!migrate_in_postcopy())  in multifd_recv_thread()
function?  Okay.


Thank you.
---
  - Prasad


Reply via email to