On Sat, Feb 08, 2025 at 04:06:56PM +0530, Prasad Pandit wrote:
> Hello Peter,
> 
> On Fri, 7 Feb 2025 at 21:16, Peter Xu <pet...@redhat.com> wrote:
> > This is not easy to follow neither with the current name, nor that you
> > "assumed this is main channel" and test it.
> 
> * It is not my doing, nor is there any assumption, but that is how
> current implementation works.
> ===
> static bool migration_should_start_incoming(bool main_channel)
> {
>     /* Multifd doesn't start unless all channels are established */
>     if (migrate_multifd()) {
>         return migration_has_all_channels();
>     }
> 
>     /* Preempt channel only starts when the main channel is created */
>     if (migrate_postcopy_preempt()) {
>         return main_channel;
>     }
> 
>     /*
>      * For all the rest types of migration, we should only reach here when
>      * it's the main channel that's being created, and we should always
>      * proceed with this channel.
>      */
>     assert(main_channel);
>     return true;
> }
> ===
> * Above code returns 'true' for 'multifd' and 'main_channel' cases.
> When migrate_postcopy_preempt() is true, main_channel is 'false', so
> it returns false. All I have done is reused the
> migration_should_start_incoming() function to simplify the 'if'
> conditional at the top.

Yes, and I suggest a rename or introduce a new helper, per previous reply.

> 
> > I think you may want to split
> > migration_has_all_channels() into migration_has_essential_channels() which
> > only covers main and multifd cases.  Then you can check if (!has_esential)
> > here.  You'd better also add a comment that all "essential channels" can be
> > peeked.
> >
> > You may also want to bypass a few things, e.g. "postcopy paused stage" here
> > rather than inside, because postcopy-recover only happens:
> >
> >   - First with a main channel, that is not peekable as no header when resume
> >   - Then with preempt channel, that is also not peekable
> >
> > [a]
> >
> > You may also need to keep the mapped-ram check.  They also don't support
> > peek.
> 
> * Instead of adding specific conditions and splitting functions, my
> request is, let's work towards consistent channel behaviour that will
> automatically simplify these conditions and channel handling. Maybe we
> can do that in a subsequent series.

I didn't follow, sorry - do you mean this patch is correct on dropping the
mapped-ram check? I don't yet understand how it can work if without.

> 
> > >
> > > > > +        } else if (mis->from_src_file
> > > > > +            && (!strcmp(ioc->name, "migration-tls-incoming")
> > > > > +                || !strcmp(ioc->name, "migration-file-incoming"))) {
> > > > > +            channel = CH_MULTIFD;
> > > >
> > > > Confused here too.  Why do we need to check ioc name? Shouldn't multifd 
> > > > has
> > > > the headers?
> > >
> > > * Because they are not 'multifd' channels, tls/file channels don't
> > > send magic values, but are still handled by
> >
> > It might be because you have a bug where you removed mapped-ram check at
> > [b] above.  I think we need to keep it.
> 
> * ie. Because I removed the mapped-ram check, so tls/file channels are
> handled by multifd_recv_new_channel()? No, that's not the case.
> Rather, that is how it works currently. I have not changed anything,
> only made it more explicit to see that when it is tls/file channel,
> handle it as a CH_MULTIFD type. Looking at the current code, one can
> not see clearly how tls/file channels are handled.
> 
> > Why TLS channels don't send magic?
> 
> * Probably because they do TLS hand-shake while establishing a connection?

I meant tls channels should have these magics too.  Do you mean they're
not?

> 
> > > because if multifd page is getting late, that network
> > > latency should affect 'postcopy' channel too, no? But still if it is
> >
> > I don't think so.  postcopy doesn't use any multifd channels.
> 
> * Yes, but it uses the same wire/network.
> 
> > > possible, do we want to call - multifd_ram_flush_and_sync() before
> > > postcopy_start()? Will that help?  I'll check if/how it works.
> >
> > Note that all things flushed may or may not be enough, because IIUC the
> > flush only makes sure all threads are synced.
> 
> * We are again using 'flush' and 'sync' interchangeably. What does -
> flush only makes sure all threads are synced - mean really? Is it not
> writing all socket data onto the wire/channel?
> 
> * Flush should write all socket data onto the network wire/channel.
> The _order_ in which threads flush/write their socket data onto the
> wire/channel is to synchronise them, maintaining/controlling that
> _order_ is not flush.
> 
> > It may not make sure the order of things to happen in multifd threads and 
> > postcopy thread.  The
> > latter is what we need - we need to make sure no page land in postcopy 
> > threads.
> 
> * Let's see:
>    1) When migration is in Postcopy mode, ram_save_multifd_page() is
> _not_ called on the source side. ie. no multifd data gets enqueued on
> the multifd queue.
>        1.1) multifd_queue_page() function also calls multifd_send() if
> the queue is full, before enqueueing new pages.
>    2) If a multifd page reaches the destination during Postcopy mode,
> it must have been sent/written on the multifd channel before Postcopy
> mode started, right?

Yes.

>    3) In this case, writing/flushing all multifd socket data onto the
> wire/channel, before calling postcopy_start() function should help
> IIUC.
>        3.1) ie. calling multifd_send() before postcopy_start() should
> help to clear the multifd queue, before Postcopy begins.
>        3.2) Same can be done by - multifd_ram_flush_and_sync() ->
> multifd_send() - sequence.

No I don't think so.

Flushing sending side makes sure send buffer is empty.  It doesn't
guarantee recv buffer is empty on the other side.

> 
> * If all multifd pages are sent/written/flushed onto the multifd
> channels before postcopy_start() is called, then multifd pages should
> not arrive at the destination after postcopy starts IIUC.  If that is
> happening, we need a reproducer for such a case. Do we have such a
> reproducer?

With or without a reproducer, we need to at least justify it in theory.  If
it doesn't even work in theory, it's a problem.

> 
> > That's why I was requesting to add an assert() in multifd recv thread to
> > make sure we will never receive a page during postcopy.
> 
> * ie. Add  assert(!migrate_in_postcopy())  in multifd_recv_thread()
> function?  Okay.

Yes, probably before multifd_recv_state->ops->recv().

Thanks,

-- 
Peter Xu


Reply via email to