Peter Xu <pet...@redhat.com> writes:

> On Mon, Mar 10, 2025 at 11:24:15AM -0300, Fabiano Rosas wrote:
>> Peter Xu <pet...@redhat.com> writes:
>>
>> > On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
>> >> > I never tried vsock, would it be used in any use case?
>> >> >
>> >>
>> >> I don't know, I'm going by what's in the code.
>> >>
>> >> > It seems to be introduced by accident in 72a8192e225cea, but I'm not
>> >> > sure. Maybe there's something I missed.
>> >>
>> >> The code has always had some variation of:
>> >>
>> >> static bool transport_supports_multi_channels(const char *uri)
>> >> {
>> >>     return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>> >>            strstart(uri, "vsock:", NULL);
>> >> }
>> >>
>> >> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
>> >> only").
>> >
>> > Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
>> > ("migration: add vsock as data channel support").
>> >
>> > https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d09750...@huawei.com/
>> > https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
>> > https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
>> >
>>
>> Great, thanks for finding those. I'll add it to my little "lore"
>> folder. This kind of information is always good to know.
>>
>> > So if I read it right.. the VM in this case is inside a container or
>> > something, then it talks to an "agent" on a PCIe device which understands
>> > the virtio-vsock protocol. So maybe vsock just performs better than other
>> > ways to dig that tunnel for the container.
>> >
>> > In that case, mentioning vsock is at least ok.
>> >
>> > [...]
>> >
>> >> >> +After the channels have been put into a wait state by the sync
>> >> >> +functions, the client code may continue to transmit additional data by
>> >> >> +issuing ``multifd_send()`` once again.
>> >> >> +
>> >> >> +Note:
>> >> >> +
>> >> >> +- the RAM migration does, effectively, a global synchronization by
>> >> >> +  chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> >> >> +  flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>> >> >
>> >> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>> >> >
>> >>
>> >> Eh.. big compatibility mess. I'd rather not mention it.
>> >
>> > It's not strictly a compatibility mess. IIUC, it was originally designed
>> > to always work with EOS. I think at that time Juan was still focused on
>> > making it work rather than on perf tuning, but then we found it can be a
>> > major perf issue if we flush too soon. Then, if we flush once per round,
>> > it may not always pair with an EOS. That's why we needed a new message.
>> >
>>
>> Being fully honest, at the time I got the impression the situation was
>> "random person inside RH decided to measure performance of random thing
>> and upstream maintainer felt pressure to push a fix".
>>
>> Whether that was the case or not, it doesn't matter now, but we can't
>> deny that this _has_ generated some headache; just look at how many
>> issues arose from the introduction of that flag.
>
> I might be the "random person" here. I remember I raised this question to
> Juan on why we need to flush for each iteration.
>

It's a good question indeed. It was the right call to address it; I just
wish we had come up with a more straightforward solution to it.

> I also remember we did perf tests; we can redo them. But we can discuss
> the design first.
>
> To me, this is a fairly important question to ask. Fundamentally, the very
> initial question is why we need periodic flush and sync at all. It's
> because we want to make sure new versions of pages land later than old
> versions.
>
> Note that we can achieve that in other ways too. E.g., if we only enqueue a
> page to a specific multifd thread (e.g. page_index % n_multifd_threads),
> then it'll guarantee the ordering without flush and sync, because new / old
> versions of the same page will only go via the same channel, which
> naturally guarantees that packets arrive in time order.

Right.
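
For anyone reading this from the archives, a strawman of that idea would
look something like the sketch below (none of these names exist in QEMU;
the point is only that channel affinity by page index lets the stream
itself provide the ordering):

/*
 * Strawman only, nothing like this exists in the tree: derive the channel
 * from the page index, so that every version of a given page travels on
 * the same stream and therefore arrives in send order, with no explicit
 * flush/sync message needed for correctness.
 */
static unsigned int channel_for_page(uint64_t page_index, unsigned int n_channels)
{
    return (unsigned int)(page_index % n_channels);
}

The obvious downside is the one you point out below: the hash, not the
load, decides which channel does the work.
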
> But that at least has the risk of not being able to fully leverage the
> bandwidth, e.g., the worst case is the guest has dirty pages that are
> accidentally always hashed to the same channel; consider a program that
> keeps dirtying every 32K on a 4K psize system with 8 channels. Or
> something like that.
>
> Not documenting the EOS part is ok too from that pov, because it's also
> confusing why we need to flush per EOS. Per-round is more understandable
> from that POV, because we want to make sure the new version lands later,
> and the version boost for pages only happens per round, not per iteration.
>
>>
>> > But hey, you're writing a doc that helps everyone. You deserve to decide
>> > whether you'd like to mention it or not on this one. :)
>>
>> My rant aside, I really want to avoid any readers having to think too
>> much about this flush thing. We're already seeing some confusion when
>> discussing it with Prasad in the other thread. The code itself and the
>> git log are more reliable to explain the compat situation IMO.
>>
>> >
>> > IIRC we updated our compat rule so we maintain each machine type for only 6
>> > years. It means the whole per-iteration + EOS stuff can be removed in 3.5
>> > years or so - we did that work in July 2022. So it isn't that important
>> > either to mention indeed.
>> >
>>
>> Yep, that as well.
>>
>> >>
>> >> > Maybe we should also add a sentence on the relationship of
>> >> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS),
>> >> > in that they should always be sent together, and only then do they
>> >> > provide ordering between the multifd messages and what happens in the
>> >> > main migration thread.
>> >> >
>> >>
>> >> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
>> >> for RAM_SAVE_FLAG_MULTIFD_FLUSH exists just because the RAM migration is
>> >> driven by the source machine via the flags that are put on the
>> >> stream. IOW, this is a RAM migration design, not a multifd design. The
>> >> multifd design is (could be, we decide) that once sync packets are sent,
>> >> _something_ must do the following:
>> >>
>> >>     for (i = 0; i < thread_count; i++) {
>> >>         trace_multifd_recv_sync_main_wait(i);
>> >>         qemu_sem_wait(&multifd_recv_state->sem_sync);
>> >>     }
>> >>
>> >> ... which is already part of multifd_recv_sync_main(), but that just
>> >> _happens to be_ called by ram.c when it sees the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream; that's not a multifd
>> >> design requirement. The ram.c code could for instance do the sync when
>> >> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
>> >
>> > I still think it should be done in RAM code only. One major goal (if not
>> > the only goal..) is that it wants to order different versions of pages,
>> > and that's only what the RAM module is about, not migration in general.
>> >
>> > From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
>> > work for migration in general.
>> >
>> > That said, I agree we do violate it from time to time; for example, we
>> > have a bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which are
>> > under QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we
>> > need to stick with them now.. postcopy has been there for 10 years) or it
>> > needs some ping-pong messages, in which case QEMU_VM_COMMAND is the
>> > easiest.. IMHO we should still try to stick with the layering if possible.
>>
>> All good points, but I was talking about something else:
>>
>> I was just throwing out an example of how it could be done differently to
>> make the point clear that the recv_sync has nothing to do with the ram
>> flag; that's just an implementation detail. I was thinking specifically
>> about the multifd+postcopy work, where we might need syncs but there are
>> no RAM_FLAGS there.
>>
>> We don't actually _need_ to sync with the migration thread on the
>> destination like that. The client could send control information in its
>> opaque packet (instead of on the migration thread) or in a completely
>> separate channel if it wanted. That sync is also not necessary if there
>> is no dependency around the data being transferred (i.e. mapped-ram just
>> takes data from the file and writes to guest memory).
>
> Mapped-ram is definitely different.
>
> For sockets, IIUC we do rely on the messages on the multifd channels _and_
> the message on the main channel.
>
> So I may not have fully gotten your points above, but..

My point is just a theoretical one. We _could_ make this work with a
different mechanism, and that's why I'm being careful about what to
document, that's all.
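
Just to make "a different mechanism" concrete, here is an entirely made-up
sketch (nothing like this exists, and it isn't a proposal): the client
carries its own ordering information in the opaque packets it already
owns, for instance an epoch counter bumped at every sync point:

/*
 * Entirely hypothetical: the client stamps every opaque packet with an
 * epoch it bumps at each of its own sync points; the destination applies
 * a packet only after every channel has drained the previous epoch.
 * No ram.c flag and no main-channel message involved.
 */
typedef struct {
    uint32_t epoch;      /* bumped by the client at every sync point */
    uint32_t num_pages;
    /* ... client-specific payload description ... */
} ClientPacketHdr;

static void client_recv_one_packet(ClientPacketHdr *hdr)
{
    /* both helpers are made up for the example */
    wait_all_channels_past_epoch(hdr->epoch - 1);
    apply_payload(hdr);
}

Not advocating for that; just noting that the ordering contract can live
entirely inside the client.
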
> See how it more or less implements a remote memory barrier kind of thing
> _with_ the main channel message:
>
> main channel             multifd channel 1        multifd channel 2
> ------------             -----------------        -----------------
>                          send page P v1
> +-------------------------------------------------------------------+
> |  RAM_SAVE_FLAG_MULTIFD_FLUSH                                       |
> |                        MULTIFD_FLAG_SYNC        MULTIFD_FLAG_SYNC  |
> +-------------------------------------------------------------------+
>                                                   send page P v2
>

This is a nice way of diagramming that!

> Then v1 and v2 of the page P are ordered.
>
> Without the message on the main channel:
>
> main channel             multifd channel 1        multifd channel 2
> ------------             -----------------        -----------------
>                          send page P v1
>                          MULTIFD_FLAG_SYNC
>                                                   MULTIFD_FLAG_SYNC
>                                                   send page P v2
>
> Then I don't see what protects against a reordered arrival of the
> messages, like:
>
> main channel             multifd channel 1        multifd channel 2
> ------------             -----------------        -----------------
>                          MULTIFD_FLAG_SYNC
>                                                   send page P v2
>                          send page P v1
>                                                   MULTIFD_FLAG_SYNC
>

That's all fine, as long as the recv part doesn't see them out of
order. I'll try to write some code to confirm so I don't waste too much of
your time.
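
For the record, the property I want to convince myself of is roughly the
one below. This is a hand-written sketch of the destination side with
made-up names (the real logic lives in multifd_recv_thread() and
multifd_recv_sync_main()):

/*
 * Destination-side sketch with made-up names; the real code is
 * multifd_recv_thread() plus multifd_recv_sync_main().
 *
 * Each recv thread applies pages as they arrive.  On MULTIFD_FLAG_SYNC it
 * signals the main thread and blocks.  The main thread, when it reads
 * RAM_SAVE_FLAG_MULTIFD_FLUSH, waits for that signal from every channel
 * before touching the stream again, then releases the recv threads.
 * Channel 2 therefore cannot apply the post-barrier v2 until channel 1
 * has already applied the pre-barrier v1.
 */
static void sketch_recv_thread(int ch)
{
    for (;;) {
        SketchPacket p = sketch_recv_packet(ch);     /* hypothetical */
        sketch_apply_pages(&p);                      /* hypothetical */
        if (p.flags & MULTIFD_FLAG_SYNC) {
            qemu_sem_post(&sketch_state.sem_sync);   /* reached the barrier */
            qemu_sem_wait(&sketch_state.sem[ch]);    /* wait for release */
        }
    }
}

static void sketch_main_channel_flush(int n_channels)
{
    for (int i = 0; i < n_channels; i++) {
        qemu_sem_wait(&sketch_state.sem_sync);       /* all channels at the barrier */
    }
    for (int i = 0; i < n_channels; i++) {
        qemu_sem_post(&sketch_state.sem[i]);         /* release them */
    }
}

If that matches the real code, then the first diagram above really is a
barrier, which is all I'd like the doc to convey without walking the
reader through the flags.
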