On Fri, Mar 07, 2025 at 10:42:02AM -0300, Fabiano Rosas wrote: > There's currently no documentation for multifd, we can at least > provide an overview of the feature.
We missed this for a long time indeed.. > > Signed-off-by: Fabiano Rosas <faro...@suse.de> > --- > Keep in mind the feature grew organically over the years and it has > had bugs that required reinventing some concepts, specially on the > sync part, so there's still some amount of inconsistency in the code > and that's not going to be fixed by documentation. > --- > docs/devel/migration/features.rst | 1 + > docs/devel/migration/multifd.rst | 254 ++++++++++++++++++++++++++++++ > 2 files changed, 255 insertions(+) > create mode 100644 docs/devel/migration/multifd.rst > > diff --git a/docs/devel/migration/features.rst > b/docs/devel/migration/features.rst > index 8f431d52f9..249d653124 100644 > --- a/docs/devel/migration/features.rst > +++ b/docs/devel/migration/features.rst > @@ -15,3 +15,4 @@ Migration has plenty of features to support different use > cases. > qpl-compression > uadk-compression > qatzip-compression > + multifd Considering that it's one of the main features (e.g. all compressors above are only sub-features of multifd), we could move this upper, maybe even the 1st one. > diff --git a/docs/devel/migration/multifd.rst > b/docs/devel/migration/multifd.rst > new file mode 100644 > index 0000000000..8f5ec840cb > --- /dev/null > +++ b/docs/devel/migration/multifd.rst > @@ -0,0 +1,254 @@ > +Multifd > +======= > + > +Multifd is the name given for the migration capability that enables > +data transfer using multiple threads. Multifd supports all the > +transport types currently in use with migration (inet, unix, vsock, > +fd, file). I never tried vsock, would it be used in any use case? It seems to be introduced by accident in 72a8192e225cea, but I'm not sure. Maybe there's something I missed. If we don't plan to obsolete rdma, we may also want to mention it.. in which case it doesn't support multifd. > + > +Usage > +----- > + > +On both source and destination, enable the ``multifd`` capability: > + > + ``migrate_set_capability multifd on`` > + > +Define a number of channels to use (default is 2, but 8 usually > +provides best performance). > + > + ``migrate_set_parameter multifd-channels 8`` > + > +Restrictions > +------------ > + > +For migration to a file, support is conditional on the presence of the > +mapped-ram capability, see `mapped-ram`. > + > +Snapshots are currently not supported. > + > +`postcopy` migration is currently not supported. > + > +Components > +---------- > + > +Multifd consists of: > + > +- A client that produces the data on the migration source side and > + consumes it on the destination. Currently the main client code is > + ram.c, which selects the RAM pages for migration; > + > +- A shared data structure (``MultiFDSendData``), used to transfer data > + between multifd and the client. On the source side, this structure > + is further subdivided into payload types (``MultiFDPayload``); > + > +- An API operating on the shared data structure to allow the client s/An API/A set of APIs/ > + code to interact with multifd; > + > + - ``multifd_send/recv()``: Transfers work to/from the channels. > + > + - ``multifd_*payload_*`` and ``MultiFDPayloadType``: Support > + defining an opaque payload. The payload is always wrapped by > + ``MultiFD*Data``. > + > + - ``multifd_send_data_*``: Used to manage the memory for the shared > + data structure. > + > + - ``multifd_*_sync_main()``: See :ref:`synchronization` below. When in doc, it might be helpful to list exact function names without asterisks, so that people can grep for them when reading. > + > +- A set of threads (aka channels, due to a 1:1 mapping to QIOChannels) > + responsible for doing I/O. Each multifd channel supports callbacks > + (``MultiFDMethods``) that can be used for fine-grained processing of > + the payload, such as compression and zero page detection. > + > +- A packet which is the final result of all the data aggregation > + and/or transformation. The packet contains: a *header* with magic and > + version numbers and flags that inform of special processing needed > + on the destination; a *payload-specific header* with metadata referent > + to the packet's data portion, e.g. page counts; and a variable-size > + *data portion* which contains the actual opaque payload data. > + > + Note that due to historical reasons, the terminology around multifd > + packets is inconsistent. > + > + The `mapped-ram` feature ignores packets entirely. If above "packet" section does not cover mapped-ram, while mapped-ram is part of multifd, maybe it means we should reword it? One option is we drop above paragraph completely, but enrich the previous section ("A set of threads.."), with: ... such as compression and zero page detection. Multifd threads can dump the results to different targets. For socket-based URIs, the data will be queued to the socket with multifd specific headers. For file-based URIs, the data may be applied directly on top of the target file at specific offset. Optionally, we may have another separate section to explain the socket headers. If so, we could have the header definition directly, and explain the fields. Might be more straightforward too. > + > +Operation > +--------- > + > +The multifd channels operate in parallel with the main migration > +thread. The transfer of data from a client code into multifd happens > +from the main migration thread using the multifd API. > + > +The interaction between the client code and the multifd channels > +happens in the ``multifd_send()`` and ``multifd_recv()`` > +methods. These are reponsible for selecting the next idle channel and > +making the shared data structure containing the payload accessible to > +that channel. The client code receives back an empty object which it > +then uses for the next iteration of data transfer. > + > +The selection of idle channels is simply a round-robin over the idle > +channels (``!p->pending_job``). Channels wait at a semaphore and once > +a channel is released it starts operating on the data immediately. The sender side is always like this indeed. For the recv side (and since you also mentioned it above), multifd treats it differently based on socket or file based. Maybe we should also discuss socket-based? Something like this? Multifd receive side relies on a proper ``MultiFDMethods.recv()`` method provided by the consumer of the pages to know how to load the pages. The recv threads can work in different ways depending on the channel type. For socket-based channels, multifd recv side is almost event-driven. Each multifd recv threads will be blocked reading the channels until a complete multifd packet header is received. With that, pages are loaded as they arrive on the ports with the ``MultiFDMethods.recv()`` method provided by the client, so as to post-process the data received. For file-based channels, multifd recv side works slightly differently. It works more like the sender side, that client can queue requests to multifd recv threads to load specific portion of file into corresponding portion of RAMs. The ``MultiFDMethods.recv()`` in this case simply always executes the load operation from file as requested. Feel free to take all or none. You can also mention it after the next paragraph on "client-specific handling". Anyway, some mentioning of event-driven model used in socket channels would be nice. > + > +Aside from eventually transmitting the data over the underlying > +QIOChannel, a channel's operation also includes calling back to the > +client code at pre-determined points to allow for client-specific > +handling such as data transformation (e.g. compression), creation of > +the packet header and arranging the data into iovs (``struct > +iovec``). Iovs are the type of data on which the QIOChannel operates. > + > +A high-level flow for each thread is: > + > +Migration thread: > + > +#. Populate shared structure with opaque data (e.g. ram pages) > +#. Call ``multifd_send()`` > + > + #. Loop over the channels until one is idle > + #. Switch pointers between client data and channel data > + #. Release channel semaphore > +#. Receive back empty object > +#. Repeat > + > +Multifd thread: > + > +#. Channel idle > +#. Gets released by ``multifd_send()`` > +#. Call ``MultiFDMethods`` methods to fill iov > + > + #. Compression may happen > + #. Zero page detection may happen > + #. Packet is written > + #. iov is written > +#. Pass iov into QIOChannel for transferring (I/O happens here) > +#. Repeat > + > +The destination side operates similarly but with ``multifd_recv()``, > +decompression instead of compression, etc. One important aspect is > +that when receiving the data, the iov will contain host virtual > +addresses, so guest memory is written to directly from multifd > +threads. > + > +About flags > +----------- > +The main thread orchestrates the migration by issuing control flags on > +the migration stream (``QEMU_VM_*``). > + > +The main memory is migrated by ram.c and includes specific control > +flags that are also put on the main migration stream > +(``RAM_SAVE_FLAG_*``). > + > +Multifd has its own set of flags (``MULTIFD_FLAG_*``) that are > +included into each packet. These may inform about properties such as > +the compression algorithm used if the data is compressed. I think I get your intention, on that we have different levels of flags and maybe it's not easy to know which is which. However since this is multifd specific doc, from that POV the first two paragraphs may be more suitable for some more high level doc to me. Meanwhile, I feel that reading the flag section without a quick packet header introduction is a tiny little abrupt to readers, as the flag is part of the packet but it came from nowhere yet. One option is we make this section "multifd packet header" then introduce all fields quickly including the flags. If you like keeping this it's ok too, we can work on top. > + > +.. _synchronization: > + > +Synchronization > +--------------- > + > +Data sent through multifd may arrive out of order and with different > +timing. Some clients may also have synchronization requirements to > +ensure data consistency, e.g. the RAM migration must ensure that > +memory pages received by the destination machine are ordered in > +relation to previous iterations of dirty tracking. > + > +Some cleanup tasks such as memory deallocation or error handling may > +need to happen only after all channels have finished sending/receiving > +the data. > + > +Multifd provides the ``multifd_send_sync_main()`` and > +``multifd_recv_sync_main()`` helpers to synchronize the main migration > +thread with the multifd channels. In addition, these helpers also > +trigger the emission of a sync packet (``MULTIFD_FLAG_SYNC``) which > +carries the synchronization command to the remote side of the > +migration. [1] > + > +After the channels have been put into a wait state by the sync > +functions, the client code may continue to transmit additional data by > +issuing ``multifd_send()`` once again. > + > +Note: > + > +- the RAM migration does, effectively, a global synchronization by > + chaining a call to ``multifd_send_sync_main()`` with the emission of a > + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``) ... or RAM_SAVE_FLAG_EOS ... depending on the machine type. Maybe we should also add a sentence on the relationship of MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ), in that they should always be sent together, and only if so would it provide ordering of multifd messages and what happens in the main migration thread. Maybe we can attach that sentence at the end of [1]. > + which in turn causes ``multifd_recv_sync_main()`` to be called on the > + destination. > + > + There are also backward compatibility concerns expressed by > + ``multifd_ram_sync_per_section()`` and > + ``multifd_ram_sync_per_round()``. See the code for detailed > + documentation. > + > +- the `mapped-ram` feature has different requirements because it's an > + asynchronous migration (source and destination not migrating at the > + same time). For that feature, only the sync between the channels is > + relevant to prevent cleanup to happen before data is completely > + written to (or read from) the migration file. > + > +Data transformation > +------------------- > + > +The ``MultiFDMethods`` structure defines callbacks that allow the > +client code to perform operations on the data at key points. These > +operations could be client-specific (e.g. compression), but also > +include a few required steps such as moving data into an iovs. See the > +struct's definition for more detailed documentation. > + > +Historically, the only client for multifd has been the RAM migration, > +so the ``MultiFDMethods`` are pre-registered in two categories, > +compression and no-compression, with the latter being the regular, > +uncompressed ram migration. > + > +Zero page detection > ++++++++++++++++++++ > + > +The migration without compression has a further specificity of Compressors also have zero page detection. E.g.: multifd_send_zero_page_detect() <- multifd_send_prepare_common() <- multifd_zstd_send_prepare() > +possibly doing zero page detection. It involves doing the detection of > +a zero page directly in the multifd channels instead of beforehand on > +the main migration thread (as it's been done in the past). This is the > +default behavior and can be disabled with: > + > + ``migrate_set_parameter zero-page-detection legacy`` > + > +or to disable zero page detection completely: > + > + ``migrate_set_parameter zero-page-detection none`` > + > +Error handling > +-------------- > + > +Any part of multifd code can be made to exit by setting the > +``exiting`` atomic flag of the multifd state. Whenever a multifd > +channel has an error, it should break out of its loop, set the flag to > +indicate other channels to exit as well and set the migration error > +with ``migrate_set_error()``. > + > +For clean exiting (triggered from outside the channels), the > +``multifd_send|recv_terminate_threads()`` functions set the > +``exiting`` flag and additionally release any channels that may be > +idle or waiting for a sync. > + > +Code structure > +-------------- > + > +Multifd code is divided into: > + > +The main file containing the core routines > + > +- multifd.c > + > +RAM migration > + > +- multifd-nocomp.c (nocomp, for "no compression") > +- multifd-zero-page.c > +- ram.c (also involved in non-multifd migrations & snapshots) > + > +Compressors > + > +- multifd-uadk.c > +- multifd-qatzip.c > +- multifd-zlib.c > +- multifd-qpl.c > +- multifd-zstd.c > -- > 2.35.3 > -- Peter Xu