Peter Xu <pet...@redhat.com> writes:

> On Fri, Mar 07, 2025 at 10:42:02AM -0300, Fabiano Rosas wrote:
>> There's currently no documentation for multifd, we can at least
>> provide an overview of the feature.
>
> We missed this for a long time indeed..
>
>> 
>> Signed-off-by: Fabiano Rosas <faro...@suse.de>
>> ---
>> Keep in mind the feature grew organically over the years and it has
>> had bugs that required reinventing some concepts, especially on the
>> sync part, so there's still some amount of inconsistency in the code
>> and that's not going to be fixed by documentation.
>> ---
>>  docs/devel/migration/features.rst |   1 +
>>  docs/devel/migration/multifd.rst  | 254 ++++++++++++++++++++++++++++++
>>  2 files changed, 255 insertions(+)
>>  create mode 100644 docs/devel/migration/multifd.rst
>> 
>> diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
>> index 8f431d52f9..249d653124 100644
>> --- a/docs/devel/migration/features.rst
>> +++ b/docs/devel/migration/features.rst
>> @@ -15,3 +15,4 @@ Migration has plenty of features to support different use cases.
>>     qpl-compression
>>     uadk-compression
>>     qatzip-compression
>> +   multifd
>
> Considering that it's one of the main features (e.g. all compressors above
> are only sub-features of multifd), we could move this higher up, maybe even
> make it the 1st one.
>
>> diff --git a/docs/devel/migration/multifd.rst b/docs/devel/migration/multifd.rst
>> new file mode 100644
>> index 0000000000..8f5ec840cb
>> --- /dev/null
>> +++ b/docs/devel/migration/multifd.rst
>> @@ -0,0 +1,254 @@
>> +Multifd
>> +=======
>> +
>> +Multifd is the name given for the migration capability that enables
>> +data transfer using multiple threads. Multifd supports all the
>> +transport types currently in use with migration (inet, unix, vsock,
>> +fd, file).
>
> I never tried vsock, would it be used in any use case?
>

I don't know, I'm going by what's in the code.

> It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
> Maybe there's something I missed.

The code has always had some variation of:

static bool transport_supports_multi_channels(const char *uri)
{
    return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
           strstart(uri, "vsock:", NULL);
}

Introduced by b7acd65707 ("migration: allow multifd for socket protocol
only").

> If we don't plan to obsolete rdma, we may also want to mention it.. in
> which case it doesn't support multifd.
>

ok.

>> +
>> +Usage
>> +-----
>> +
>> +On both source and destination, enable the ``multifd`` capability:
>> +
>> +    ``migrate_set_capability multifd on``
>> +
>> +Define a number of channels to use (default is 2, but 8 usually
>> +provides best performance).
>> +
>> +    ``migrate_set_parameter multifd-channels 8``
>> +
>> +Restrictions
>> +------------
>> +
>> +For migration to a file, support is conditional on the presence of the
>> +mapped-ram capability, see `mapped-ram`.
>> +
>> +Snapshots are currently not supported.
>> +
>> +`postcopy` migration is currently not supported.
>> +
>> +Components
>> +----------
>> +
>> +Multifd consists of:
>> +
>> +- A client that produces the data on the migration source side and
>> +  consumes it on the destination. Currently the main client code is
>> +  ram.c, which selects the RAM pages for migration;
>> +
>> +- A shared data structure (``MultiFDSendData``), used to transfer data
>> +  between multifd and the client. On the source side, this structure
>> +  is further subdivided into payload types (``MultiFDPayload``);
>> +
>> +- An API operating on the shared data structure to allow the client
>
> s/An API/A set of APIs/
>
>> +  code to interact with multifd;
>> +
>> +  - ``multifd_send/recv()``: Transfers work to/from the channels.
>> +
>> +  - ``multifd_*payload_*`` and ``MultiFDPayloadType``: Support
>> +    defining an opaque payload. The payload is always wrapped by
>> +    ``MultiFD*Data``.
>> +
>> +  - ``multifd_send_data_*``: Used to manage the memory for the shared
>> +    data structure.
>> +
>> +  - ``multifd_*_sync_main()``: See :ref:`synchronization` below.
>
> When in doc, it might be helpful to list exact function names without
> asterisks, so that people can grep for them when reading.
>
>> +
>> +- A set of threads (aka channels, due to a 1:1 mapping to QIOChannels)
>> +  responsible for doing I/O. Each multifd channel supports callbacks
>> +  (``MultiFDMethods``) that can be used for fine-grained processing of
>> +  the payload, such as compression and zero page detection.
>> +
>> +- A packet which is the final result of all the data aggregation
>> +  and/or transformation. The packet contains: a *header* with magic and
>> +  version numbers and flags that inform of special processing needed
>> +  on the destination; a *payload-specific header* with metadata about
>> +  the packet's data portion, e.g. page counts; and a variable-size
>> +  *data portion* which contains the actual opaque payload data.
>> +
>> +  Note that due to historical reasons, the terminology around multifd
>> +  packets is inconsistent.
>> +
>> +  The `mapped-ram` feature ignores packets entirely.
>
> If above "packet" section does not cover mapped-ram, while mapped-ram is
> part of multifd, maybe it means we should reword it?
>

I get your point. I just want to clearly point out the places where
mapped-ram is completely different. Maybe some of the suggestions you
made will be enough for that...

> One option is we drop above paragraph completely, but enrich the previous
> section ("A set of threads.."), with:
>

No, the packet is important. Mainly because it's a mess. We should have
*more* information about it in the docs.

>     ... such as compression and zero page detection.  Multifd threads can
>     dump the results to different targets.  For socket-based URIs, the data
>     will be queued to the socket with multifd specific headers.  For
>     file-based URIs, the data may be applied directly on top of the target
>     file at specific offset.
>
> Optionally, we may have another separate section to explain the socket
> headers.  If so, we could have the header definition directly, and explain
> the fields.  Might be more straightforward too.
>

Probably, yes.
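
To make the three parts of the packet concrete, here is a rough sketch of
how I'd picture them. Everything prefixed with Sketch/SKETCH is made up
for illustration, including the magic value and the exact set of fields;
network byte order on the wire is also an assumption here. This is not
the definition from multifd.h.

```c
#include <arpa/inet.h>  /* htonl(), ntohl() */
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values chosen for this sketch only. */
#define SKETCH_MAGIC     0x4d464453u
#define SKETCH_VERSION   1u
#define SKETCH_FLAG_SYNC (1u << 0)

/* Generic header: identifies the stream and carries the flags. */
typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t flags;
} SketchPacketHdr;

/* Payload-specific header for a RAM payload: metadata about the data. */
typedef struct {
    SketchPacketHdr hdr;
    uint32_t normal_pages;      /* pages present in the data portion */
    uint32_t zero_pages;        /* pages detected as all-zero, not sent */
    uint32_t next_packet_size;  /* bytes of data that follow the headers */
} SketchRamPacket;

/* Fill the headers, converting every field to network byte order. */
static void sketch_packet_fill(SketchRamPacket *p, uint32_t flags,
                               uint32_t normal, uint32_t zero,
                               uint32_t data_size)
{
    p->hdr.magic = htonl(SKETCH_MAGIC);
    p->hdr.version = htonl(SKETCH_VERSION);
    p->hdr.flags = htonl(flags);
    p->normal_pages = htonl(normal);
    p->zero_pages = htonl(zero);
    p->next_packet_size = htonl(data_size);
}

/* Destination side: reject anything that is not a packet of ours. */
static bool sketch_packet_check(const SketchRamPacket *p)
{
    return ntohl(p->hdr.magic) == SKETCH_MAGIC &&
           ntohl(p->hdr.version) == SKETCH_VERSION;
}
```

The variable-size data portion would follow these headers on the channel;
it's left out because it's opaque to multifd itself.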

>> +
>> +Operation
>> +---------
>> +
>> +The multifd channels operate in parallel with the main migration
>> +thread. The transfer of data from a client code into multifd happens
>> +from the main migration thread using the multifd API.
>> +
>> +The interaction between the client code and the multifd channels
>> +happens in the ``multifd_send()`` and ``multifd_recv()``
>> +methods. These are responsible for selecting the next idle channel and
>> +making the shared data structure containing the payload accessible to
>> +that channel. The client code receives back an empty object which it
>> +then uses for the next iteration of data transfer.
>> +
>> +The selection of idle channels is simply a round-robin over the idle
>> +channels (``!p->pending_job``). Channels wait at a semaphore and once
>> +a channel is released it starts operating on the data immediately.
>
> The sender side is always like this indeed.  For the recv side (and since
> you also mentioned it above), multifd treats it differently based on socket
> or file based.  Maybe we should also discuss socket-based?
>
> Something like this?
>
>   Multifd receive side relies on a proper ``MultiFDMethods.recv()`` method
>   provided by the consumer of the pages to know how to load the pages.  The
>   recv threads can work in different ways depending on the channel type.
>
>   For socket-based channels, multifd recv side is almost event-driven.
>   Each multifd recv thread will be blocked reading its channel until a
>   complete multifd packet header is received.  With that, pages are loaded
>   as they arrive on the ports with the ``MultiFDMethods.recv()`` method
>   provided by the client, so as to post-process the data received.
>
>   For file-based channels, multifd recv side works slightly differently.
>   It works more like the sender side, in that the client can queue requests
>   to multifd recv threads to load a specific portion of the file into the
>   corresponding portion of RAM.  The ``MultiFDMethods.recv()`` in this case simply
>   always executes the load operation from file as requested.
>
> Feel free to take all or none.  You can also mention it after the next
> paragraph on "client-specific handling".  Anyway, some mentioning of
> event-driven model used in socket channels would be nice.
>

ok.
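
For reference while we reword this, the send-side handoff the patch
describes (round-robin over idle channels, then switching pointers)
boils down to something like the sketch below. All names are
hypothetical, there is no threading, and where the real code waits for
a channel and releases its semaphore this version just returns false or
leaves a comment.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_CHANNELS 4

/* Hypothetical stand-in for the shared data structure. */
typedef struct {
    size_t num_pages;
} SketchSendData;

typedef struct {
    bool pending_job;       /* set while the channel owns work */
    SketchSendData *data;   /* object currently held by the channel */
} SketchChannel;

static SketchChannel sketch_channels[SKETCH_CHANNELS];
static size_t sketch_next_channel;

/*
 * Hand *send_data to the next idle channel. On success the caller gets
 * the channel's empty object back through *send_data. Returns false if
 * every channel is busy (the real code would wait instead).
 */
static bool sketch_send(SketchSendData **send_data)
{
    for (size_t n = 0; n < SKETCH_CHANNELS; n++) {
        size_t i = (sketch_next_channel + n) % SKETCH_CHANNELS;
        SketchChannel *p = &sketch_channels[i];
        SketchSendData *tmp;

        if (p->pending_job) {
            continue;
        }

        /* Start the next search right after the channel picked now. */
        sketch_next_channel = (i + 1) % SKETCH_CHANNELS;

        /* Switch pointers between client data and channel data. */
        tmp = p->data;
        p->data = *send_data;
        *send_data = tmp;

        p->pending_job = true;
        /* A real channel thread would be released from its semaphore here. */
        return true;
    }
    return false;
}
```

The point of the pointer switch is that the client never copies the
payload: it gives away a full object and immediately has an empty one
to fill for the next iteration.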

>> +
>> +Aside from eventually transmitting the data over the underlying
>> +QIOChannel, a channel's operation also includes calling back to the
>> +client code at pre-determined points to allow for client-specific
>> +handling such as data transformation (e.g. compression), creation of
>> +the packet header and arranging the data into iovs (``struct
>> +iovec``). Iovs are the type of data on which the QIOChannel operates.
>> +
>> +A high-level flow for each thread is:
>> +
>> +Migration thread:
>> +
>> +#. Populate shared structure with opaque data (e.g. ram pages)
>> +#. Call ``multifd_send()``
>> +
>> +   #. Loop over the channels until one is idle
>> +   #. Switch pointers between client data and channel data
>> +   #. Release channel semaphore
>> +#. Receive back empty object
>> +#. Repeat
>> +
>> +Multifd thread:
>> +
>> +#. Channel idle
>> +#. Gets released by ``multifd_send()``
>> +#. Call ``MultiFDMethods`` methods to fill iov
>> +
>> +   #. Compression may happen
>> +   #. Zero page detection may happen
>> +   #. Packet is written
>> +   #. iov is written
>> +#. Pass iov into QIOChannel for transferring (I/O happens here)
>> +#. Repeat
>> +
>> +The destination side operates similarly but with ``multifd_recv()``,
>> +decompression instead of compression, etc. One important aspect is
>> +that when receiving the data, the iov will contain host virtual
>> +addresses, so guest memory is written to directly from multifd
>> +threads.
>> +
>> +About flags
>> +-----------
>> +The main thread orchestrates the migration by issuing control flags on
>> +the migration stream (``QEMU_VM_*``).
>> +
>> +The main memory is migrated by ram.c and includes specific control
>> +flags that are also put on the main migration stream
>> +(``RAM_SAVE_FLAG_*``).
>> +
>> +Multifd has its own set of flags (``MULTIFD_FLAG_*``) that are
>> +included into each packet. These may inform about properties such as
>> +the compression algorithm used if the data is compressed.
>
> I think I get your intention, on that we have different levels of flags and
> maybe it's not easy to know which is which.  However since this is multifd
> specific doc, from that POV the first two paragraphs may be more suitable
> for some more high level doc to me.
>

This is just to avoid mentioning RAM_SAVE_FLAG_MULTIFD_FLUSH below out
of nowhere. I'll try to merge the relevant part into there.

> Meanwhile, I feel that reading the flag section without a quick packet
> header introduction is a little abrupt to readers, as the flag is part
> of the packet but it came from nowhere yet.  One option is we make this
> section "multifd packet header" then introduce all fields quickly including
> the flags.  If you like keeping this it's ok too, we can work on top.
>

I did mention the packet and the flags up there. It appears you missed
it, so I need to make it more explicit indeed. =)

>> +
>> +.. _synchronization:
>> +
>> +Synchronization
>> +---------------
>> +
>> +Data sent through multifd may arrive out of order and with different
>> +timing. Some clients may also have synchronization requirements to
>> +ensure data consistency, e.g. the RAM migration must ensure that
>> +memory pages received by the destination machine are ordered in
>> +relation to previous iterations of dirty tracking.
>> +
>> +Some cleanup tasks such as memory deallocation or error handling may
>> +need to happen only after all channels have finished sending/receiving
>> +the data.
>> +
>> +Multifd provides the ``multifd_send_sync_main()`` and
>> +``multifd_recv_sync_main()`` helpers to synchronize the main migration
>> +thread with the multifd channels. In addition, these helpers also
>> +trigger the emission of a sync packet (``MULTIFD_FLAG_SYNC``) which
>> +carries the synchronization command to the remote side of the
>> +migration.
>
> [1]
>
>> +
>> +After the channels have been put into a wait state by the sync
>> +functions, the client code may continue to transmit additional data by
>> +issuing ``multifd_send()`` once again.
>> +
>> +Note:
>> +
>> +- the RAM migration does, effectively, a global synchronization by
>> +  chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> +  flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>
> ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>

Eh.. big compatibility mess. I'd rather not mention it.

> Maybe we should also add a sentence on the relationship of
> MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
> in that they should always be sent together, and only if so would it
> provide ordering of multifd messages and what happens in the main migration
> thread.
>

The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
driven by the source machine through the flags that are put on the
stream. IOW, this is a RAM migration design, not a multifd design. The
multifd design is (could be, we decide) that once sync packets are sent,
_something_ must do the following:

    for (i = 0; i < thread_count; i++) {
        trace_multifd_recv_sync_main_wait(i);
        qemu_sem_wait(&multifd_recv_state->sem_sync);
    }

... which is already part of multifd_recv_sync_main(), but that just
_happens to be_ called by ram.c when it sees the
RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
design requirement. The ram.c code could for instance do the sync when
some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
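
To show what I mean by the wait loop being independent of who calls it,
here is a standalone toy version with POSIX threads and semaphores. The
names are invented for the example and the "channels" only bump a
counter before reporting the sync point, so take it as a sketch of the
pattern rather than of the QEMU code.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>

#define SKETCH_CHANNELS 4

static sem_t sketch_sem_sync;           /* posted once per channel */
static atomic_int sketch_packets_done;  /* work finished before the sync */

/* Channel thread: do the work, then report reaching the sync point. */
static void *sketch_recv_thread(void *opaque)
{
    (void)opaque;
    atomic_fetch_add(&sketch_packets_done, 1);  /* stand-in for loading data */
    sem_post(&sketch_sem_sync);
    return NULL;
}

/*
 * Start the channels, wait once per channel, and return how much work
 * was visible right after the wait loop. Returns -1 on setup failure.
 */
static int sketch_sync_demo(void)
{
    pthread_t threads[SKETCH_CHANNELS];
    int started = 0;
    int seen;

    if (sem_init(&sketch_sem_sync, 0, 0) != 0) {
        return -1;
    }
    atomic_store(&sketch_packets_done, 0);

    while (started < SKETCH_CHANNELS &&
           pthread_create(&threads[started], NULL,
                          sketch_recv_thread, NULL) == 0) {
        started++;
    }

    /* Same shape as the loop above: one wait per channel. */
    for (int i = 0; i < started; i++) {
        while (sem_wait(&sketch_sem_sync) != 0 && errno == EINTR) {
            /* interrupted by a signal, try again */
        }
    }

    /* Every started channel has finished its work by now. */
    seen = atomic_load(&sketch_packets_done);

    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
    }
    sem_destroy(&sketch_sem_sync);
    return seen;
}
```

Nothing in there cares what triggered the wait, which is the property I
would like the multifd side of the docs to describe.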

> Maybe we can attach that sentence at the end of [1].
>
>> +  which in turn causes ``multifd_recv_sync_main()`` to be called on the
>> +  destination.
>> +
>> +  There are also backward compatibility concerns expressed by
>> +  ``multifd_ram_sync_per_section()`` and
>> +  ``multifd_ram_sync_per_round()``. See the code for detailed
>> +  documentation.
>> +
>> +- the `mapped-ram` feature has different requirements because it's an
>> +  asynchronous migration (source and destination not migrating at the
>> +  same time). For that feature, only the sync between the channels is
>> +  relevant to prevent cleanup from happening before data is completely
>> +  written to (or read from) the migration file.
>> +
>> +Data transformation
>> +-------------------
>> +
>> +The ``MultiFDMethods`` structure defines callbacks that allow the
>> +client code to perform operations on the data at key points. These
>> +operations could be client-specific (e.g. compression), but also
>> +include a few required steps such as moving data into iovs. See the
>> +struct's definition for more detailed documentation.
>> +
>> +Historically, the only client for multifd has been the RAM migration,
>> +so the ``MultiFDMethods`` are pre-registered in two categories,
>> +compression and no-compression, with the latter being the regular,
>> +uncompressed ram migration.
>> +
>> +Zero page detection
>> ++++++++++++++++++++
>> +
>> +The migration without compression has a further specificity of
>
> Compressors also have zero page detection.  E.g.:
>
>   multifd_send_zero_page_detect()
>     <- multifd_send_prepare_common()
>       <- multifd_zstd_send_prepare()
>

Oops, I forgot. I was thinking that surely detecting zeros comes along
with the compression algorithm and we don't need to spell it out.
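
For the reworded section, the idea of detecting zero pages in the
channel, before any compressor sees the data, could be illustrated
along these lines. This is a simplified sketch with hypothetical names
and a fixed 4 KiB page size, not what multifd-zero-page.c does line by
line: it reorders a batch of page offsets so that non-zero pages come
first and returns how many there are.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096

/* Naive check; a real implementation would use an optimized routine. */
static bool sketch_page_is_zero(const uint8_t *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++) {
        if (page[i]) {
            return false;
        }
    }
    return true;
}

/*
 * Partition offsets[] in place: non-zero ("normal") pages first, zero
 * pages at the tail. Returns the number of normal pages.
 */
static size_t sketch_zero_page_detect(const uint8_t *base,
                                      uint64_t *offsets, size_t n)
{
    size_t i = 0;
    size_t j = n;

    while (i < j) {
        uint64_t tmp;

        if (!sketch_page_is_zero(base + offsets[i])) {
            i++;
            continue;
        }
        /* Move the zero page to the tail and re-check the swapped-in one. */
        j--;
        tmp = offsets[i];
        offsets[i] = offsets[j];
        offsets[j] = tmp;
    }
    return i;
}
```

Only the first part of the array would then be handed to the
compressor (or sent as-is), which is why the detection applies to both
cases.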

>> +possibly doing zero page detection. It involves doing the detection of
>> +a zero page directly in the multifd channels instead of beforehand on
>> +the main migration thread (as it's been done in the past). This is the
>> +default behavior and can be disabled with:
>> +
>> +    ``migrate_set_parameter zero-page-detection legacy``
>> +
>> +or to disable zero page detection completely:
>> +
>> +    ``migrate_set_parameter zero-page-detection none``
>> +
>> +Error handling
>> +--------------
>> +
>> +Any part of multifd code can be made to exit by setting the
>> +``exiting`` atomic flag of the multifd state. Whenever a multifd
>> +channel has an error, it should break out of its loop, set the flag to
>> +indicate other channels to exit as well and set the migration error
>> +with ``migrate_set_error()``.
>> +
>> +For clean exiting (triggered from outside the channels), the
>> +``multifd_send|recv_terminate_threads()`` functions set the
>> +``exiting`` flag and additionally release any channels that may be
>> +idle or waiting for a sync.
>> +
>> +Code structure
>> +--------------
>> +
>> +Multifd code is divided into:
>> +
>> +The main file containing the core routines
>> +
>> +- multifd.c
>> +
>> +RAM migration
>> +
>> +- multifd-nocomp.c (nocomp, for "no compression")
>> +- multifd-zero-page.c
>> +- ram.c (also involved in non-multifd migrations & snapshots)
>> +
>> +Compressors
>> +
>> +- multifd-uadk.c
>> +- multifd-qatzip.c
>> +- multifd-zlib.c
>> +- multifd-qpl.c
>> +- multifd-zstd.c
>> -- 
>> 2.35.3
>> 
