Peter Xu <pet...@redhat.com> writes:

> On Mon, Mar 10, 2025 at 11:24:15AM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>> 
>> > On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
>> >> > I never tried vsock, would it be used in any use case?
>> >> >
>> >> 
>> >> I don't know, I'm going by what's in the code.
>> >> 
>> >> > It seems to be introduced by accident in 72a8192e225cea, but I'm not 
>> >> > sure.
>> >> > Maybe there's something I missed.
>> >> 
>> >> The code has always had some variation of:
>> >> 
>> >> static bool transport_supports_multi_channels(const char *uri)
>> >> {
>> >>     return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>> >>            strstart(uri, "vsock:", NULL);
>> >> }
>> >> 
>> >> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
>> >> only").
>> >
>> > Looks like a copy-paste error..  I found it, should be 9ba3b2baa1
>> > ("migration: add vsock as data channel support").
>> >
>> > https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d09750...@huawei.com/
>> > https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
>> > https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
>> >
>> 
>> Great, thanks for finding those. I'll add it to my little "lore"
>> folder. This kind of information is always good to know.
>> 
>> > So if I read it right.. the VM in this case is inside a container or
>> > something, then it talks to an "agent" on a PCIe device which understands
>> > the virtio-vsock protocol.  So maybe vsock just performs better than other ways
>> > to dig that tunnel for the container.
>> >
>> > In that case, mentioning vsock is at least ok.
>> >
>> > [...]
>> >
>> >> >> +After the channels have been put into a wait state by the sync
>> >> >> +functions, the client code may continue to transmit additional data by
>> >> >> +issuing ``multifd_send()`` once again.
>> >> >> +
>> >> >> +Note:
>> >> >> +
>> >> >> +- the RAM migration does, effectively, a global synchronization by
>> >> >> +  chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> >> >> +  flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>> >> >
>> >> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>> >> >
>> >> 
>> >> Eh.. big compatibility mess. I'd rather not mention it.
>> >
>> > It's not strictly a compatibility mess.  IIUC, it used to be designed
>> > to always work with EOS.  I think at that time Juan was still focused on
>> > making it work rather than on perf tuning, but then we found it can be a
>> > major perf issue if we flush too soon.  Then if we flush once per round,
>> > it may not always pair with an EOS.  That's why we needed a new message.
>> >
>> 
>> Being fully honest, at the time I got the impression the situation was
>> "random person inside RH decided to measure performance of random thing
>> and the upstream maintainer felt pressure to push a fix".
>> 
>> Whether that was the case or not, it doesn't matter now, but we can't
>> deny that this _has_ generated some headaches; just look at how many
>> issues arose from the introduction of that flag.
>
> I might be the "random person" here.  I remember I raised this question to
> Juan on why we need to flush for each iteration.
>

It's a good question indeed. It was the right call to address it, I just
wish we had come up with a more straightforward solution.

> I also remember we did perf test, we can redo it.  But we can discuss the
> design first.
>
> To me, this is a fairly important question to ask.  Fundamentally, the very
> initial question is why we need periodic flush and sync at all.  It's
> because we want to make sure new versions of pages land later than old
> versions.
>
> Note that we can achieve that in other ways too.  E.g., if we only enqueue a
> page to a specific multifd thread (e.g. page_index % n_multifd_threads),
> then it'll guarantee the ordering without flush and sync, because the new and
> old versions of the same page will only go via the same channel, which
> naturally guarantees time ordering of the packets.

Right.
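
Just so we're picturing the same thing, a rough sketch of that
alternative (the helper below is made up; the other names are existing
QEMU/multifd ones):

    /*
     * Sketch only: derive the channel from the page index, so every
     * version of a given page always travels on the same channel.
     * Packets on a single channel arrive in send order, hence no
     * cross-channel sync would be needed for correctness.
     */
    static MultiFDSendParams *channel_for_page(ram_addr_t offset)
    {
        uint64_t page_index = offset >> qemu_target_page_bits();
        int i = page_index % migrate_multifd_channels();

        return &multifd_send_state->params[i];
    }

With the obvious downside you describe below, that the hashing can
degenerate and leave most channels idle.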

> But that at least has the risk of not being able to fully leverage the
> bandwidth, e.g., the worst case is the guest has dirty pages that are
> accidentally always hashed to the same channel; consider a program that
> keeps dirtying every 32K on a 4K page-size system with 8 channels.  Or
> something like that.
>
> Not documenting the EOS part is ok too from that pov, because it's also
> confusing why we need to flush per EOS.  Per-round is more understandable
> from that POV, because we want to make sure the new version lands later, and
> the versioning boost for pages only happens per round, not per iteration.
>
>> 
>> > But hey, you're writing a doc that helps everyone.  You deserve to decide
>> > whether you'd like to mention it or not on this one. :)
>> 
>> My rant aside, I really want to avoid any readers having to think too
>> much about this flush thing. We're already seeing some confusion when
>> discussing it with Prasad in the other thread. The code itself and the
>> git log are more reliable to explain the compat situation IMO.
>> 
>> >
>> > IIRC we updated our compat rule so we maintain each machine type for only 6
>> > years.  It means the whole per-iteration + EOS stuff can be removed in 3.5
>> > years or so - we did that work in July 2022.  So it isn't that important
>> > to mention either, indeed.
>> >
>> 
>> Yep, that as well.
>> 
>> >> 
>> >> > Maybe we should also add a sentence on the relationship of
>> >> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS),
>> >> > in that they should always be sent together, and only if so would it
>> >> > provide ordering of multifd messages and what happens in the main migration
>> >> > thread.
>> >> >
>> >> 
>> >> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
>> >> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
>> >> driven by the source machine via the flags that are put on the
>> >> stream. IOW, this is a RAM migration design, not a multifd design. The
>> >> multifd design is (could be, we decide) that once sync packets are sent,
>> >> _something_ must do the following:
>> >> 
>> >>     for (i = 0; i < thread_count; i++) {
>> >>         trace_multifd_recv_sync_main_wait(i);
>> >>         qemu_sem_wait(&multifd_recv_state->sem_sync);
>> >>     }
>> >> 
>> >> ... which is already part of multifd_recv_sync_main(), but that just
>> >> _happens to be_ called by ram.c when it sees the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream; that's not a multifd
>> >> design requirement. The ram.c code could for instance do the sync when
>> >> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
>> >
>> > I still think it should be done in RAM code only.  One major goal (if not
>> > the only goal..) is that it wants to order different versions of pages and
>> > that's only what the RAM module is about, not migration in general.
>> >
>> > From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
>> > work for migration in general.
>> >
>> > That said, I agree we do violate it from time to time, for example, we have a
>> > bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
>> > QEMU_VM_COMMAND.  But IIUC that was either kind of ancient (so we need to
>> > stick with them now.. postcopy was there for 10 years) or it needs some
>> > ping-pong messages in which case QEMU_VM_COMMAND is the easiest.. IMHO we
>> > should still try to stick with the layering if possible.
>> 
>> All good points, but I was talking about something else:
>> 
>> I was just throwing out an example of how it could be done differently to
>> make the point clear that the recv_sync has nothing to do with the ram
>> flag; that's just an implementation detail. I was thinking specifically
>> about the multifd+postcopy work where we might need syncs but there are
>> no RAM_SAVE_FLAGs there.
>> 
>> We don't actually _need_ to sync with the migration thread on the
>> destination like that. The client could send control information in its
>> opaque packet (instead of in the migration thread) or in a completely
>> separate channel if it wanted. That sync is also not necessary if there
>> is no dependency around the data being transferred (i.e. mapped-ram just
>> takes data from the file and writes to guest memory).
>
> Mapped-ram is definitely different.
>
> For sockets, IIUC we do rely on the messages on the multifd channels _and_
> the message on the main channel.
>
> So I may not have fully gotten your points above, but..

My point is just a theoretical one. We _could_ make this work with a
different mechanism. And that's why I'm being careful in what to
document, that's all.
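
For instance (purely an illustration, these flags and hooks don't
exist): the client could carry a sync request in its own packet flags
and drive the wait loop itself, with nothing on the main channel:

    /* multifd recv thread, after parsing the client's opaque packet */
    if (flags & CLIENT_FLAG_SYNC) {
        qemu_sem_post(&multifd_recv_state->sem_sync);
    }

    /* wherever the client decides its sync point is */
    for (i = 0; i < migrate_multifd_channels(); i++) {
        qemu_sem_wait(&multifd_recv_state->sem_sync);
    }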

> See how it more or
> less implemented a remote memory barrier kind of thing _with_ the main
> channel message:
>
>      main channel    multifd channel 1        multifd channel 2      
>      ------------    -----------------        -----------------
>                        send page P v1
>   +------------------------------------------------------------------+
>   |  RAM_SAVE_FLAG_MULTIFD_FLUSH                                     |
>   |                    MULTIFD_FLAG_SYNC        MULTIFD_FLAG_SYNC    |
>   +------------------------------------------------------------------+
>                                                 send page P v2
>

This is a nice way of diagramming that!
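
For the doc I'd probably reduce the boxed part to something like this
(simplified, leaving the exact signatures aside):

    /* source, ram.c, at the sync point */
    multifd_send_sync_main();                       /* MULTIFD_FLAG_SYNC on every channel */
    qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);  /* marker on the main channel */
    qemu_fflush(f);

    /* destination, ram_load(), on RAM_SAVE_FLAG_MULTIFD_FLUSH */
    multifd_recv_sync_main();                       /* returns only after every recv channel
                                                       has seen MULTIFD_FLAG_SYNC */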

> Then v1 and v2 of the page P are ordered.
>
> If without the message on the main channel:
>
>      main channel    multifd channel 1        multifd channel 2      
>      ------------    -----------------        -----------------
>                        send page P v1
>                        MULTIFD_FLAG_SYNC
>                                                 MULTIFD_FLAG_SYNC
>                                                 send page P v2
>
> Then I don't see what protects reorder of arrival of messages like:
>
>      main channel    multifd channel 1        multifd channel 2      
>      ------------    -----------------        -----------------
>                                                 MULTIFD_FLAG_SYNC
>                                                 send page P v2
>                        send page P v1
>                        MULTIFD_FLAG_SYNC
>

That's all fine, as long as the recv part doesn't see them out of
order. I'll try to write some code to confirm, so I don't waste too much
of your time.
