Prasad Pandit <ppan...@redhat.com> writes:

> On Thu, 20 Mar 2025 at 20:15, Fabiano Rosas <faro...@suse.de> wrote:
>> Technically both can happen. But that would just be the case of
>> file:fdset migration which requires an extra fd for O_DIRECT. So
>> "multiple" in the usual sense of "more is better" is only
>> fd-per-thread. IOW, using multiple fds is an implementation detail IMO,
>> what people really care about is medium saturation, which we can only
>> get (with multifd) via parallelization.
>
> * I see. Multifd is essentially multiple threads = thread pool then.
>
Yes, that's what I'm trying to convey with the first sentence.
Specifically to dispel any misconceptions that this is something
esoteric. It's not. We're just using multiple threads with some custom
locking and some callbacks.

`migrate-set-capability multiple-threads-with-some-custom-locking-and-some-callbacks true`

...doesn't work that well =)

>> > Because doing migration via QMP commands is not as
>> > straightforward, I wonder who might do that and why.
>> >
>>
>> All of QEMU developers, libvirt developers, cloud software developers,
>> kernel developers etc.
>
> * Really? That must be using QMP apis via libvirt/virsh kind of tools
> I guess. Otherwise how does one follow above instructions to enable
> 'multifd' and set number of channels on both source and destination
> machines? User has to open QMP shell on two machines and invoke QMP
> commands?
>

Well, I can't speak for everyone, of course, but generally the fewer
layers on top of the object of your work, the better. I don't even have
libvirt installed on my development machine, for instance.

It's convenient to deal directly with the QEMU command line and QMP
because that usually gives you a faster turnaround when experimenting
with various command lines/commands.

There are also setups that don't want to bring in too many
dependencies, so having a full libvirt installation is not wanted.
There's a bunch of little tools out there that invoke QEMU and give it
QMP commands directly.

There are several ways of accessing QMP, some examples I have lying
around:

==
$QEMU ... -qmp unix:${SRC_SOCK},server,wait=off

echo "
{ 'execute': 'qmp_capabilities' }
{ 'execute': 'migrate-set-capabilities','arguments':{ 'capabilities':[ \
{ 'capability': 'mapped-ram', 'state': true }, \
{ 'capability': 'multifd', 'state': true } \
] } }
{ 'execute': 'migrate-set-parameters','arguments':{ 'multifd-channels': 8 } }
{ 'execute': 'migrate-set-parameters','arguments':{ 'max-bandwidth': 0 } }
{ 'execute': 'migrate-set-parameters','arguments':{ 'direct-io': true } }
{ 'execute': 'migrate${incoming}','arguments':{ 'uri': 'file:$MIGFILE' } }
" | nc -NU $SRC_SOCK
==
(echo "migrate_set_capability x-ignore-shared on";
 echo "migrate_set_capability validate-uuid on";
 echo "migrate exec:cat>migfile-s390x";
 echo "quit") | ./qemu-system-s390x -bios /tmp/migration-test-16K1Z2/bootsect -monitor stdio
==
$QEMU ... -qmp unix:${DST_SOCK},server,wait=off

./qemu/scripts/qmp/qmp-shell $DST_SOCK
==
$QEMU ...

C-a c
(qemu) info migrate

>> > * So multifd mechanism can be used to transfer non-ram data as well? I
>> > thought it's only used for RAM migration. Are device/gpu states etc
>> > bits also transferred via multifd threads?
>> >
>> device state migration with multifd has been merged for 10.0
>>
>> <rant>
>> If it were up to me, we'd have a pool of multifd threads that transmit
>> everything migration-related.
>
> * Same my thought: If multifd is to be used for all data, why not use
> the existing QEMU thread pool OR make it a migration thread pool.
> IIRC, there is also some discussion about having a thread pool for
> VFIO or GPU state transfer. Having so many different thread pools does
> not seem right.
>

To be clear, multifd is not meant to transfer all data. It was designed
to transfer RAM pages and later got extended to deal with VFIO device
state. It _could_ be further extended for all device states (vmstate)
and it _could_ be further extended to handle control messages from the
main migration thread (QEMU_VM_*, postcopy commands, etc).
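Just to sketch what that could mean in practice (purely hypothetical,
this is not the current multifd interface and the names below are made
up): extending multifd to other payload types essentially means letting
each producer hand the send threads an opaque payload plus a pair of
hooks to pack/unpack it, something like:

==
/* Hypothetical sketch only, not the current multifd API */
#include <stddef.h>

typedef struct MultiFDSendJob MultiFDSendJob;

typedef struct {
    /* serialize the producer's payload into sub-header + data */
    int (*fill_packet)(MultiFDSendJob *job, void *buf, size_t *len);
    /* destination-side counterpart, hands the data back to the client */
    int (*handle_packet)(const void *buf, size_t len, void *opaque);
} MultiFDPayloadOps;

struct MultiFDSendJob {
    const MultiFDPayloadOps *ops;  /* RAM, device state, vmstate, ... */
    void *opaque;                  /* producer-owned payload */
};

/* producers (ram.c, VFIO, ...) would then only queue jobs */
int multifd_queue_job(MultiFDSendJob *job);
==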
My opinion is that it would be interesting to have this kind of
flexibility (at some point). But it might turn out that it doesn't make
sense technically, it's costly in terms of development time, etc. I
think we all agree that having different sets of threads managed in
different ways is not ideal.

The thing with multifd is that it's very important to keep the
performance and constraints of ram migration. If we manage to achieve
that with some generic thread pool, that's great. But it's exploration
work that will have to be done.

>> Unfortunately, that's not so
>> straight-forward to implement without rewriting a lot of code, multifd
>> requires too much entanglement from the data producer. We're constantly
>> dealing with details of data transmission getting in the way of data
>> production/consumption (e.g. try to change ram.c to produce multiple
>> pages at once and watch everything explode).
>
> * Ideally there should be separation between what the client is doing
> and how migration is working.
>
> * IMO, migration is a mechanism to transfer byte streams from one
> machine to another. And while doing so, facilitate writing (data) at
> specific addresses/offsets on the destination, not just append bytes
> at the tail end. This entails that each individual migration packet
> specifies where to write data on the destination. Let's say a
> migration stream is a train of packets. Each packet has a header and
> data.
>
> ( [header][...data...] )><><( [header][...data...] )><><(
> [header][data] )><>< ... ><><( [header][data] )
>

But then there's stuff like mapped-ram which wants its data free of any
metadata because it mirrors the RAM layout in the migration file.

> Header specifies:
> - Serial number
> - Header length
> - Data length/size (2MB/4MB/8MB etc.)

I generally like the idea of having the size of the header/data
specified in the header itself. It does seem like it would allow for
better extensibility over time.

I spent a lot of time looking at those "unused" bytes in
MultiFDPacket_t trying to figure out a way of embedding the size
information in a backward-compatible way. We ended up going with
Maciej's idea of isolating the common parts of the header in the
MultiFDPacketHdr_t and having each data type define its own specific
sub-header.

I don't know what this looks like in terms of type-safety and how we'd
keep compatibility (two separate issues), because a variable-size
header needs to end up in a well-defined structure at some point. It's
generally more difficult to maintain code that simply takes a buffer
and pokes at random offsets in there. Even with the length, an old QEMU
would still not know about extra fields.

> - Destination address <- offset where to write migration data, if
> it is zero(0) append that data
> - Data type (optional): Whether it is RAM/Device/GPU/CPU state etc.
> - Data iteration number <- version/iteration of the same RAM page
> ... more variables
> ... more variables

This is all in the end client-centric, which means it is "data" from
the migration perspective. So the question I put earlier still remains:
what determines the kind of data that goes in the header and the kind
of data that goes in the data part of the packet? It seems we cannot
escape from having the client bring its own header format.

> - Some reserved bytes
> Migration data is:
> - Just a data byte stream <= Data length/size above.
>
> * Such a train of packets is then transferred via 1 thread or 10
> threads is an operational change.
> * Such a packet is pushed (Precopy) from source to destination OR
> pulled (Postcopy) by destination from the source side is an
> operational difference. In Postcopy phase, it could send a message
> saying I need the next RAM packet for this offset and RAM module on
> the source side provides only relevant data. Again packaging and
> transmission is done by the migration module. Similarly the Postcopy
> phase could send a message saying I need the next GPU packet, and the
> GPU module on the source side would provide relevant data.
> * How long such a train of packets is, is also immaterial.
> * With such a separation, things like synchronisation of threads is
> not connected to the data (RAM/GPU/CPU/etc.) type.
> * It may also allow us to apply compression/encryption uniformly
> across all channels/threads, irrespective of the data type.
> * Since migration is a packet transport mechanism,
> creation/modification/destruction of packets could be done by one
> entity. Clients (like RAM/GPU/CPU/VFIO etc.) shall only supply 'data'
> to be packaged and sent. It shouldn't be like RAM.c writes its own
> packets as they like, GPU.c writes their own packets as they like,
> that does not seem right.
>

Right, so we'd need an extra abstraction layer with a well defined
interface to convert a raw packet into something that's useful for the
clients. The vmstate macros actually do that work kind of well. Device
emulation code does not need to care (too much) about how migration
works as long as the vmstate is written properly (a tiny sketch of what
that looks like from the device side is at the end of this mail).

>
>> >> +- A packet which is the final result of all the data aggregation
>> >> + and/or transformation. The packet contains: a *header* with magic and
>> >> + version numbers and flags that inform of special processing needed
>> >> + on the destination; a *payload-specific header* with metadata referent
>> >> + to the packet's data portion, e.g. page counts; and a variable-size
>> >> + *data portion* which contains the actual opaque payload data.
>
> * Thread synchronisation and other such control messages could/should
> be a separate packet of its own, to be sent on the main channel.

Remember that currently the control data is put raw on the stream; it
is not encapsulated by a packet. This would increase the amount of data
put on the stream, which might affect throughput.

> Thread synchronisation flags could/should not be combined with the
> migration data packets above. Control message packets may have _no
> data_ to be processed. (just sharing thoughts)
>

Yeah, the MULTIFD_FLAG_SYNC used to be part of a data packet and it was
utterly confusing to debug sync issues like that. Peter did the work to
make it a standalone (no data) packet.

> Thank you.
> ---
> - Prasad
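For completeness, since vmstate came up above: this is roughly what the
vmstate approach looks like from the device side. The device and its
fields below are made up for illustration, only the macros are the real
API. The device code just describes its state, it doesn't know or care
over which channel the bytes end up travelling:

==
/* Illustrative only: a made-up device described with vmstate macros */
#include "qemu/osdep.h"
#include "migration/vmstate.h"

typedef struct MyDevState {
    uint32_t ctrl;
    uint32_t regs[8];
} MyDevState;

static const VMStateDescription vmstate_mydev = {
    .name = "mydev",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (const VMStateField[]) {
        VMSTATE_UINT32(ctrl, MyDevState),
        VMSTATE_UINT32_ARRAY(regs, MyDevState, 8),
        VMSTATE_END_OF_LIST()
    }
};
==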