Hi,

On Wed, 12 Feb 2025 at 20:08, Peter Xu <pet...@redhat.com> wrote:
> I think it's not because TLS channels don't send magic, but TLS channels
> are not prone to ordering issues.
>
> In general, I'm not convinced we need to check names of iochannels.
* If the channel does not set the '_READ_MSG_SEEK' flag, which magic value
  are we going to read? As for checking the names of the channels, it tells
  the reader how an incoming channel is processed.

>> It did not work, as qemu would crash during multifd_shutdown().
>
> Have we debugged the crash? I'm not saying we should go with this, but
> crash isn't a reason to not choose a design.

* Yes, I did try to debug it, but couldn't get to a conclusion in time.

> No this is not. As I mentioned tons of times.. copying page in socket
> buffers directly into guest page during vcpus running / postcopy is a very
> hard to debug issue. If it's possible to happen, the design is flawed.

* Sure, agreed. So far in my testing it does not seem to happen. The theory
  of multifd_recv_threads getting scheduled out and causing a guest page
  overwrite seems remote to me, but I see the possibility/probability. One
  possible solution is to have the destination side send an
  'acknowledgement' to the source side.

> I don't think passing the unit tests prove the series is correct and should
> be merged. We need to understand how it work, or we can't merge it.

* Well, passing the unit tests should confirm that it does not break
  existing functionality. Is there any metric/limit to such understanding?
  Everybody understands/sees things differently. Understanding is an
  ever-evolving thing. Saying that the merge should happen based on
  understanding sounds weird to me.

> I feel very frustrated multiple times that you seem to want to ignore what
> I comment. I don't know why you rush to repost things.

* I guess we are seeing/understanding things differently here.

> After a 2nd thought, I think maybe multifd flush and sync could work on src
> side indeed, because when flush and sync there'll be a message
> (MULTIFD_FLUSH) on main channel and that should order against the rest
> postcopy messages that will also be sent on the main channel (if we do
> multifd flush before all the postcopy processes). Then it should guarantee
> when postcopy starts on dest, dest multifd recv threads flushed all
> messages, and no multifd message will arrive anymore.

* After a 2nd thought - that's evolving understanding right there. :)

* To mention here, to flush the multifd channels with
  'multifd_ram_flush_and_sync()' I tried calling

    migration_iteration_run
        -> multifd_ram_flush_and_sync(s->to_dst_file)

  It does not work, it crashes. Is 's->to_dst_file' the right parameter
  there?

  But calling the function below works:

===
/* Send enqueued data pages onto next available multifd channel */
int multifd_send_flush(void)
{
    if (!multifd_payload_empty(multifd_ram_send)) {
        if (!multifd_send(&multifd_ram_send)) {
            error_report("%s: multifd_send fail", __func__);
            return -1;
        }
    }
    return 0;
}
===

    migration_iteration_run
        -> multifd_send_flush()

  This works, but it does not send the 'RAM_SAVE_FLAG_MULTIFD_FLUSH'
  message that 'multifd_ram_flush_and_sync()' sends.

> But again we should guard it with an assert() in recv threads if you want
> to postpone recycling of multifd threads, just to double check no outliers
> we overlooked.

* Yes, the revised patch works with the assert(!migrate_in_postcopy) in
  multifd_recv_thread.

* I was going to send a revised series with these changes, but will wait on
  that for now.


Thank you.
---
  - Prasad
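
PS: For completeness, here is a rough sketch of the direction I was
experimenting with on the source side: flush the queued multifd pages (as
multifd_send_flush() above does) and then put the RAM_SAVE_FLAG_MULTIFD_FLUSH
marker on the main channel directly. The per-channel sync that
multifd_ram_flush_and_sync() also performs is not shown, the helper name is
made up, and it assumes RAM_SAVE_FLAG_MULTIFD_FLUSH is visible at the call
site, so please treat it strictly as an untested illustration:

===
/*
 * Sketch only (hypothetical helper): flush queued multifd pages and emit
 * the RAM_SAVE_FLAG_MULTIFD_FLUSH marker on the main channel, so the
 * destination flushes its recv threads before postcopy begins. Assumes
 * RAM_SAVE_FLAG_MULTIFD_FLUSH is visible here; the per-channel sync done
 * by multifd_ram_flush_and_sync() is intentionally left out.
 */
static int multifd_send_flush_and_mark(QEMUFile *f)
{
    /* Push any pages still queued for the multifd channels */
    if (!multifd_payload_empty(multifd_ram_send)) {
        if (!multifd_send(&multifd_ram_send)) {
            error_report("%s: multifd_send fail", __func__);
            return -1;
        }
    }

    /* Tell the destination to flush/sync its recv side */
    qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
    qemu_fflush(f);
    return 0;
}
===

The idea would be to call it from migration_iteration_run with
s->to_dst_file, the same place where I tried multifd_ram_flush_and_sync()
above.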