On Wed, Feb 12, 2025 at 06:57:48PM +0530, Prasad Pandit wrote:
> Hi,
> 
> On Tue, 11 Feb 2025 at 20:50, Peter Xu <pet...@redhat.com> wrote:
> > > * Yes. AFAIU, tls/file channels don't send magic values.
> > Please double check whether TLS will send magics.  AFAICT, they should.
> ===
>   * ... Also tls live migration already does
>   * tls handshake while initializing main channel so with tls this
>   * issue is not possible.
>   */
>   if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_READ_MSG_PEEK)) {
>   } else if (mis->from_src_file
>            && (!strcmp(ioc->name, "migration-tls-incoming")
>             || !strcmp(ioc->name, "migration-file-incoming"))) {
>         channel = CH_MULTIFD;
>   }
> ===
> * From the comment and condition above, both the 'tls' and 'file' channels
> are not peekable, i.e. there is no magic value to peek. The
> 'migration-file-incoming' check also helps to cover the
> migrate_mapped_ram() case, IIUC.

I think it's not that TLS channels don't send magic; rather, TLS channels
are not prone to ordering issues.

In general, I'm not convinced we need to check names of iochannels.
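
For context, here is a minimal, self-contained sketch of what peek-based
channel classification looks like on a plain socket; the magic constants,
the classify_channel() helper and the CH_MAIN/CH_POSTCOPY names are
placeholders for illustration, not the actual QEMU implementation (only
CH_MULTIFD appears in the snippet above):

===
/*
 * Illustrative sketch only: classify an incoming channel by peeking at
 * its first 4 bytes without consuming them.  This only works for
 * transports that support MSG_PEEK (plain TCP/UNIX sockets), which is
 * why the quoted code falls back to matching ioc->name for the TLS and
 * file transports.  Constants and names below are placeholders.
 */
#include <stdint.h>
#include <sys/socket.h>
#include <arpa/inet.h>

#define MULTIFD_MAGIC   0x11223344u     /* placeholder, not the real value */
#define POSTCOPY_MAGIC  0x55667788u     /* placeholder, not the real value */

enum channel_type { CH_MAIN, CH_MULTIFD, CH_POSTCOPY };

static enum channel_type classify_channel(int fd)
{
    uint32_t magic;

    /* Peek, don't read: the real consumer must still see these bytes. */
    if (recv(fd, &magic, sizeof(magic), MSG_PEEK | MSG_WAITALL) !=
        (ssize_t)sizeof(magic)) {
        return CH_MAIN;                 /* short read: assume main channel */
    }
    magic = ntohl(magic);

    if (magic == MULTIFD_MAGIC) {
        return CH_MULTIFD;
    }
    if (magic == POSTCOPY_MAGIC) {
        return CH_POSTCOPY;
    }
    return CH_MAIN;
}
===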

> 
> > No.  We need to figure out a way to do that properly, and that's exactly
> > what I mentioned as one of the core changes we need in this series, which
> > is still missing.  We may or may not need an ACK message.  Please think
> > about it.
> 
> * First, we tried to call 'multifd_send_shutdown()' to close the multifd
> channels before calling postcopy_start(). That's the best-case
> scenario, wherein multifd channels are closed before postcopy starts,
> so that there's no confusion and/or jumbling of different data
> packets. It did not work, as qemu would crash during
> multifd_shutdown().

Have we debugged the crash?  I'm not saying we should go with this, but a
crash isn't a reason not to choose a design.

> 
> * Second, we push/flush all multifd pages before calling
> postcopy_start() and let the multifd channels stay until the
> migration ends, after which they are duly shut down. It is working well
> so far, passing all migration tests too.
> 
> * Third, if we want to confirm that multifd pages are received on the
> destination before calling postcopy_start(), then the best way is for
> the destination to send an acknowledgement to the source side that it
> has received and processed all multifd pages and nothing is pending on
> the multifd channels.
> 
> * Another option could be to define a multifd_recv_flush() function, which
> would process and clear the receive-side multifd queue, so that no
> multifd pages are pending there. Not sure how best to do this yet.
> Also, considering it lacks proper communication and synchronisation
> between the source and destination sides, it does not seem like the best
> solution.
> 
> * Do you have any other option/approach in mind?
> 
> > Please consider the case where multifd recv threads may get scheduled out
> > on dest host during precopy phase, not getting chance to be scheduled until
> > postcopy already started running on dst, then the recv thread can stumble
> > upon a page that was sent during precopy.  As long as that can be always
> > avoided, I think we should be good.
> 
> * TBH, this sounds like a remote corner case.

No, it is not.  As I have mentioned many times, copying a page from the socket
buffers directly into a guest page while vcpus are running during postcopy is
a very hard-to-debug issue.  If it is possible to happen, the design is flawed.

> 
> * I'm testing the revised patch series and will send it shortly.

I don't think passing the unit tests proves the series is correct and should
be merged.  We need to understand how it works, or we can't merge it.

I have felt very frustrated multiple times that you seem to ignore what I
comment on.  I don't know why you rush to repost things.

On second thought, I think a multifd flush and sync on the source side could
indeed work: when we flush and sync, a message (MULTIFD_FLUSH) goes out on
the main channel, and that should order against the rest of the postcopy
messages, which are also sent on the main channel (provided we do the multifd
flush before any of the postcopy processing).  That should guarantee that, by
the time postcopy starts on the destination, the destination's multifd recv
threads have flushed all messages and no further multifd message will arrive.
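
A rough sketch of that source-side ordering, just to make the argument
concrete; MigrationState and QEMUFile are existing QEMU types, but the
helper names multifd_ram_flush_and_sync() and switch_to_postcopy() are
assumptions here, not necessarily the real functions or signatures:

===
/* Sketch only: the ordering argument above, with assumed helper names. */
typedef struct MigrationState MigrationState;   /* opaque, as in QEMU */
typedef struct QEMUFile QEMUFile;               /* opaque, as in QEMU */

/* Assumed helpers; real signatures may differ. */
extern int multifd_ram_flush_and_sync(QEMUFile *main_chan);
extern int postcopy_start(MigrationState *s);

static int switch_to_postcopy(MigrationState *s, QEMUFile *main_chan)
{
    /*
     * 1. Flush all queued multifd pages and emit the MULTIFD_FLUSH
     *    marker on the main channel; the sync also waits for the
     *    multifd send threads to finish their in-flight packets.
     */
    int ret = multifd_ram_flush_and_sync(main_chan);
    if (ret < 0) {
        return ret;
    }

    /*
     * 2. Only then begin postcopy.  Its commands travel on the same
     *    main channel, so the destination sees them strictly after the
     *    flush marker, i.e. after its recv threads drained every page
     *    sent during precopy.
     */
    return postcopy_start(s);
}
===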

But again, if you want to postpone recycling of the multifd threads, we should
guard this with an assert() in the recv threads, just to double-check there
are no outliers we overlooked.
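
Something along these lines could serve as that guard; migration_in_postcopy()
exists in QEMU, but the function below and where exactly the check is placed
are illustrative assumptions:

===
#include <assert.h>
#include <stdbool.h>

extern bool migration_in_postcopy(void);

/*
 * Sketch: called by a multifd recv thread for each packet that carries
 * page payload.  By the ordering argument above, no such packet should
 * ever be seen once postcopy has started; if one is, the design
 * assumption was violated and we want to know immediately.
 */
static void multifd_recv_check_ordering(bool packet_has_pages)
{
    if (packet_has_pages) {
        assert(!migration_in_postcopy());
    }
}
===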

When proposing the patches, you need to justify the design before anything
else.  Please evaluate the above; these points are critical and need to appear
in either code comments or commit messages.  Please digest everything before
reposting.

Thanks,

-- 
Peter Xu

