On 3/1/25 00:38, Fabiano Rosas wrote:
Cédric Le Goater <c...@redhat.com> writes:

On 2/27/25 23:01, Maciej S. Szmigiero wrote:
On 27.02.2025 07:59, Cédric Le Goater wrote:
On 2/19/25 21:34, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>

Update the VFIO documentation at docs/devel/migration describing the
changes brought by the multifd device state transfer.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
---
   docs/devel/migration/vfio.rst | 80 +++++++++++++++++++++++++++++++----
   1 file changed, 71 insertions(+), 9 deletions(-)

diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
index c49482eab66d..d9b169d29921 100644
--- a/docs/devel/migration/vfio.rst
+++ b/docs/devel/migration/vfio.rst
@@ -16,6 +16,37 @@ helps to reduce the total downtime of the VM. VFIO devices 
opt-in to pre-copy
   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
   VFIO_DEVICE_FEATURE_MIGRATION ioctl.

Please add a new "multifd" documentation subsection at the end of the file
with this part :

+Starting from QEMU version 10.0 there's a possibility to transfer VFIO device
+_STOP_COPY state via multifd channels. This helps reduce downtime - especially
+with multiple VFIO devices or with devices having a large migration state.
+As an additional benefit, setting the VFIO device to _STOP_COPY state and
+saving its config space is also parallelized (run in a separate thread) in
+such migration mode.
+
+The multifd VFIO device state transfer is controlled by
+"x-migration-multifd-transfer" VFIO device property. This property defaults to
+AUTO, which means that VFIO device state transfer via multifd channels is
+attempted in configurations that otherwise support it.
+

Done - I also moved the descriptions of x-migration-max-queued-buffers
and x-migration-load-config-after-iter there since obviously they
wouldn't make sense left alone in the top section.

I was expecting a much more detailed explanation of the design too:

   * in the cover letter
   * in the hw/vfio/migration-multifd.c
   * in some new file under docs/devel/migration/

I forgot to add  :

      * a guide on how to use this new feature from QEMU and libvirt,
        something we can refer to for tests. That's a must-have.
      * usage scenarios
        There are some benefits, but it is not obvious that a user would
        want to use multiple VFs in one VM; please explain.
        This is a major addition, which needs justification anyhow.
      * pros and cons

I'm not sure exactly what descriptions you want in these places,

Looking at this from the VFIO subsystem, the way this series works is very
opaque. There are a couple of new migration handlers, new threads, new
channels, etc. It has been discussed several times with migration folks;
please provide a summary for a new reader, who will be as ignorant as
everyone else looking at a new file.


but since
that's just documentation (not code) it could be added after the code freeze...

That's the risk of not getting any! And the initial proposal should be
discussed before code freeze.

For the general framework, I was expecting an extension of a "multifd"
subsection under :

    https://qemu.readthedocs.io/en/v9.2.0/devel/migration/features.html

but it doesn't exist :/

Hi, see if this helps. Let me know what can be improved and if
something needs to be more detailed. Please ignore the formatting;
I'll send a proper patch after Carnival.

This is very good! Thanks a lot, Fabiano, for providing this input.

@Maciej, it's probably better if you keep your docs separate anyway so
we don't add another dependency. I can merge them later.

Perfect. Maciej, we will adjust the file it applies to before merging.


Thanks,

C.




multifd.rst:

Multifd
=======

Multifd is the name given to the migration capability that enables
data transfer using multiple threads. Multifd supports all the
transport types currently in use with migration (inet, unix, vsock,
fd, file).

Restrictions
------------

For migration to a file, support is conditional on the presence of
the mapped-ram capability (see #mapped-ram).

Snapshots are currently not supported.

Postcopy migration is currently not supported.

Usage
-----

On both source and destination, enable the ``multifd`` capability:

     ``migrate_set_capability multifd on``

Define the number of channels to use (the default is 2, but 8 usually
provides the best performance):

     ``migrate_set_parameter multifd-channels 8``
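
The same can be configured via QMP; the equivalent commands look
roughly like this (the exact JSON accepted may vary between QEMU
versions)::

    {"execute": "migrate-set-capabilities",
     "arguments": {"capabilities": [
         {"capability": "multifd", "state": true}]}}

    {"execute": "migrate-set-parameters",
     "arguments": {"multifd-channels": 8}}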

Components
----------

Multifd consists of:

- A client that produces the data on the migration source side and
   consumes it on the destination. Currently the main client code is
   ram.c, which selects the RAM pages for migration;

- A shared data structure (MultiFDSendData), used to transfer data
   between multifd and the client. On the source side, this structure
   is further subdivided into payload types (MultiFDPayload); a
   simplified sketch of it is shown after this list;

- An API operating on the shared data structure to allow the client
   code to interact with multifd;

   - multifd_send/recv(): A dispatcher that transfers work to/from the
     channels.

   - multifd_*payload_* and MultiFDPayloadType: Support defining an
     opaque payload. The payload is always wrapped by
     MultiFDSend|RecvData.

   - multifd_send_data_*: Used to manage the memory for the shared data
     structure.

- The threads that process the data (aka channels, due to a 1:1
   mapping to QIOChannels). Each multifd channel supports callbacks
   that can be used for fine-grained processing of the payload, such as
   compression and zero page detection.

- A packet which is the final result of all the data aggregation
   and/or transformation. The packet contains a header, a
   payload-specific header and a variable-size data portion.

    - The packet header: contains a magic number, a version number and
      flags that inform of special processing needed on the
      destination.

    - The payload-specific header: contains metadata pertaining to the
      packet's data portion, such as page counts.

    - The data portion: contains the actual opaque payload data.

   Note that due to historical reasons, the terminology around multifd
   packets is inconsistent.

   The mapped-ram feature ignores packets entirely.
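
As an illustration, here is a simplified sketch of the shared data
structure; field and type names are abridged, the real definitions
live in migration/multifd.h::

    /* Which payload the shared structure currently carries. */
    typedef enum {
        MULTIFD_PAYLOAD_NONE,          /* empty, ready to be refilled */
        MULTIFD_PAYLOAD_RAM,           /* RAM pages queued by ram.c */
        MULTIFD_PAYLOAD_DEVICE_STATE,  /* opaque device state */
    } MultiFDPayloadType;

    /* MultiFDPages_t and MultiFDDeviceState_t stand in for the
     * payload-specific types. */
    typedef struct {
        MultiFDPayloadType type;
        union {
            MultiFDPages_t ram;                 /* page offsets/counts */
            MultiFDDeviceState_t device_state;  /* idstr + state buffer */
        } u;
    } MultiFDSendData;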

Theory of operation
-------------------

The multifd channels operate in parallel with the main migration
thread. The transfer of data from the client code into multifd happens
on the main migration thread using the multifd API.

The interaction between the client code and the multifd channels
happens in the multifd_send() and multifd_recv() methods. These are
responsible for selecting the next idle channel and making the shared
data structure containing the payload accessible to that channel. The
client code receives back an empty object which it then uses for the
next iteration of data transfer.

The selection of idle channels is simply a round-robin over the idle
channels (!p->pending_job). Channels wait at a semaphore; once a
channel is released, it starts operating on the data immediately.

Aside from eventually transmitting the data over the underlying
QIOChannel, a channel's operation also includes calling back to the
client code at pre-determined points to allow for client-specific
handling such as data transformation (e.g. compression), creation of
the packet header and arranging the data into iovs (struct
iovec). Iovs are the type of data on which the QIOChannel operates.

Client code (migration thread):
1. Populate shared structure with opaque data (ram pages, device state)
2. Call multifd_send()
    2a. Loop over the channels until one is idle
    2b. Switch pointers between client data and channel data
    2c. Release channel semaphore
3. Receive back empty object
4. Repeat
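
A hedged sketch of the client-side dispatch described above; the
round-robin helper and the field names are simplified and illustrative,
not the exact QEMU code::

    bool multifd_send(MultiFDSendData **send_data)
    {
        MultiFDSendParams *p;   /* per-channel state */
        MultiFDSendData *tmp;

        /* 2a. Round-robin until an idle channel (!p->pending_job)
         * is found. */
        do {
            p = next_channel_round_robin();   /* illustrative helper */
        } while (p->pending_job);

        /* 2b. Swap pointers: the channel takes ownership of the
         * filled structure and the client gets back an empty one to
         * refill on the next iteration. */
        tmp = p->data;
        p->data = *send_data;
        *send_data = tmp;
        p->pending_job = true;

        /* 2c. Release the channel thread waiting at its semaphore. */
        qemu_sem_post(&p->sem);
        return true;
    }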

Multifd channel (multifd thread):
1. Channel idle
2. Gets released by multifd_send()
3. Call multifd_ops methods to fill iov
    3a. Compression may happen
    3b. Zero page detection may happen
    3c. Packet is written
    3d. iov is written
4. Pass iov into QIOChannel for transferring
5. Repeat
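
And a matching sketch of a channel thread's loop; again simplified,
with multifd_ops_fill_iov() as an illustrative stand-in for the
multifd_ops callbacks::

    static void *multifd_send_thread(void *opaque)
    {
        MultiFDSendParams *p = opaque;
        Error *err = NULL;

        for (;;) {
            /* 1-2. Stay idle until released by multifd_send(). */
            qemu_sem_wait(&p->sem);
            if (p->quit) {
                break;
            }

            /* 3. Run the multifd_ops callbacks: compression and zero
             * page detection may happen, the packet and the iov are
             * written. */
            multifd_ops_fill_iov(p);   /* illustrative stand-in */

            /* 4. Hand the iov to the QIOChannel for transfer. */
            qio_channel_writev_all(p->c, p->iov, p->iovs_num, &err);

            p->pending_job = false;   /* back to idle */
        }
        return NULL;
    }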

The destination side operates similarly but with multifd_recv(),
decompression instead of compression, etc. One important aspect is
that when receiving the data, the iov will contain host virtual
addresses, so guest memory is written to directly from multifd
threads.

About flags
-----------

The main thread orchestrates the migration by issuing control flags on
the migration stream (QEMU_VM_*).

The main memory is migrated by ram.c and includes specific control
flags that are also put on the main migration stream
(RAM_SAVE_FLAG_*).

Multifd has its own set of MULTIFD_FLAGs that are included in each
packet. These may inform about properties such as the compression
algorithm used if the data is compressed.
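
As an example, the flags include a synchronization bit and a
compression-method field, roughly like this (an illustrative subset;
the exact values and the full list live in migration/multifd.h and may
change between versions)::

    #define MULTIFD_FLAG_SYNC   (1 << 0)   /* synchronization point */

    /* Compression method, encoded in the bits above the sync bit. */
    #define MULTIFD_FLAG_NOCOMP (0 << 1)
    #define MULTIFD_FLAG_ZLIB   (1 << 1)
    #define MULTIFD_FLAG_ZSTD   (2 << 1)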

Synchronization
---------------

Since the migration process is iterative due to RAM dirty tracking, it
is necessary to invalidate data that is no longer current (e.g. due to
the source VM touching the page). This is done by having a
synchronization point triggered by the migration thread at key points
during the migration. Data that's received after the synchronization
point is allowed to overwrite data received prior to that point.

To perform the synchronization, multifd provides the
multifd_send_sync_main() and multifd_recv_sync_main() helpers. These
are called whenever the client code wishes to ensure that all data
sent previously has been received by the destination.

The synchronization process involves flushing the remaining client
data still left to be transmitted and issuing a multifd packet
containing the MULTIFD_FLAG_SYNC flag. This flag informs the receiving
end that it should finish reading the data and wait for a
synchronization point.

To complete the sync, the main migration stream issues a
RAM_SAVE_FLAG_MULTIFD_FLUSH flag. When that flag is received by the
destination, it ensures all of its channels have seen the
MULTIFD_FLAG_SYNC and moves them to an idle state.

The client code can then continue with a second round of data by
issuing multifd_send() once again.
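
Putting the two flags together, a hedged sketch of a source-side sync
point as the client code might issue it; this is simplified, the real
call sites and signatures live in ram.c and multifd.c::

    static void sync_point(QEMUFile *f)
    {
        /* Flush the remaining client data and make every channel
         * emit a packet carrying MULTIFD_FLAG_SYNC. */
        multifd_send_sync_main();

        /* Tell the destination to drain its channels up to the sync
         * packets and idle them. */
        qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
    }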

The synchronization process also ensures that internal synchronization
happens, i.e. between the threads themselves. This is necessary to
avoid threads lagging behind in sending or receiving as the migration
approaches completion.

The mapped-ram feature has different synchronization requirements
because it's an asynchronous migration (the source and destination do
not run at the same time). For that feature, only the internal sync is
relevant.

Data transformation
-------------------

Each multifd channel executes a set of callbacks before transmitting
the data. These callbacks allow the client code to alter the data
format right before sending and after receiving.

Since the object of the RAM migration is always the memory page, and
the only processing done for memory pages is zero page detection,
which in a sense is already part of compression, the multifd_ops
functions are divided into two mutually exclusive sets: compression
and no-compression.

Migration without compression (i.e. regular RAM migration) has, as
mentioned, the further specificity of possibly doing zero page
detection (see the zero-page-detection migration parameter). This
consists of sending all pages to multifd and letting zero page
detection happen in the multifd channels, instead of doing it
beforehand on the main migration thread as was done in the past.
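
The callbacks are grouped in a per-method operations table; roughly
(simplified from migration/multifd.h)::

    typedef struct {
        /* Setup/cleanup on the sending side */
        int (*send_setup)(MultiFDSendParams *p, Error **errp);
        void (*send_cleanup)(MultiFDSendParams *p, Error **errp);
        /* Transform the data and prepare the packet for sending */
        int (*send_prepare)(MultiFDSendParams *p, Error **errp);
        /* Setup/cleanup on the receiving side */
        int (*recv_setup)(MultiFDRecvParams *p, Error **errp);
        void (*recv_cleanup)(MultiFDRecvParams *p);
        /* Read and transform (e.g. decompress) the incoming data */
        int (*recv)(MultiFDRecvParams *p, Error **errp);
    } MultiFDMethods;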

Code structure
--------------

Multifd code is divided into:

The main file containing the core routines

- multifd.c

RAM migration

- multifd-nocomp.c (nocomp, for "no compression")
- multifd-zero-page.c
- ram.c (also involved in non-multifd migrations + snapshots)

Compressors

- multifd-uadk.c
- multifd-qatzip.c
- multifd-zlib.c
- multifd-qpl.c
- multifd-zstd.c


