Hello everyone,

This series adds a new migration capability called "precopy initial
data". The purpose of this capability is to reduce migration downtime in
cases where loading of migration data in the destination can take a lot
of time, such as with VFIO migration data.

The series then adds precopy support and precopy initial data support
for VFIO migration.

Precopy initial data is used by VFIO migration, but other migration
users can add support for it and use it as well.

=== Background ===

Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migration users. For example, loading the data of a VFIO device in the
destination might require the device to allocate resources and prepare
internal data structures, which can take a significant amount of time.

This poses a problem, as the source may think that the remaining
migration data is small enough to meet the downtime limit, so it will
stop the VM and complete the migration, but in fact sending and loading
the data in the destination may take longer than the downtime limit.
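
To illustrate the gap, here is a minimal sketch of a bandwidth-only
estimate next to the downtime actually observed once destination load
time is added. It is not QEMU code: all function and parameter names are
hypothetical, and the numbers a caller passes in are its own.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: estimate downtime in milliseconds from the
 * remaining migration data and the measured bandwidth only.
 */
uint64_t estimated_downtime_ms(uint64_t remaining_bytes,
                               uint64_t bandwidth_bytes_per_ms)
{
    if (bandwidth_bytes_per_ms == 0) {
        /* No bandwidth measured: the estimate can never fit the limit. */
        return UINT64_MAX;
    }
    return remaining_bytes / bandwidth_bytes_per_ms;
}

/* The source stops the VM once the estimate fits the downtime limit. */
bool source_would_stop_vm(uint64_t remaining_bytes,
                          uint64_t bandwidth_bytes_per_ms,
                          uint64_t downtime_limit_ms)
{
    return estimated_downtime_ms(remaining_bytes, bandwidth_bytes_per_ms)
           <= downtime_limit_ms;
}

/*
 * Downtime actually observed: transfer time plus the time the
 * destination needs to load the data, which the estimate ignores.
 */
uint64_t actual_downtime_ms(uint64_t remaining_bytes,
                            uint64_t bandwidth_bytes_per_ms,
                            uint64_t dest_load_ms)
{
    uint64_t transfer_ms = estimated_downtime_ms(remaining_bytes,
                                                 bandwidth_bytes_per_ms);

    if (transfer_ms > UINT64_MAX - dest_load_ms) {
        return UINT64_MAX; /* saturate instead of wrapping around */
    }
    return transfer_ms + dest_load_ms;
}
```

With a 300ms limit, 1000000 remaining bytes and 10000 bytes/ms, the
estimate is 100ms and the source stops the VM, yet a 1500ms load time in
the destination would make the real downtime 1600ms.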

To solve this, VFIO migration uAPI defines "initial bytes" as part of
its precopy stream [1]. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, it can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.

The new precopy initial data migration capability helps us achieve this.
It allows the source to send initial precopy data and the destination to
ACK that this data has been loaded. Migration will not attempt to stop
the source VM and complete the migration until this ACK is received.

Note that this relies on the return path capability to communicate from
the destination back to the source.
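
The ACK gating described above can be sketched roughly as follows. This
is only an illustration under assumed names: the struct, its fields and
can_switchover() are hypothetical and do not reflect the actual code in
this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical source-side state; field names are illustrative only. */
struct src_state {
    bool initial_data_enabled; /* capability is in use for this migration */
    bool initial_data_acked;   /* ACK received over the return path */
    uint64_t pending_bytes;    /* remaining precopy data */
    uint64_t threshold_bytes;  /* data that fits within the downtime limit */
};

/* Decide whether the source may stop the VM and complete the migration. */
bool can_switchover(const struct src_state *s)
{
    if (s->initial_data_enabled && !s->initial_data_acked) {
        /* Destination has not confirmed loading the initial data yet. */
        return false;
    }
    return s->pending_bytes <= s->threshold_bytes;
}
```

Without the capability the decision depends only on the pending data, as
before. With it, the same check is simply deferred until the ACK arrives.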

=== Flow of operation ===

To use precopy initial data, the capability must be enabled in the
source.

As this capability must also be supported in the destination, a
handshake is performed during migration setup. The purpose of the
handshake is to notify the destination that precopy initial data is used
and to check whether the destination supports it.

The handshake is done at two levels. First, a general handshake is done
with the destination migration code to notify that precopy initial data
is used. Then, for each migration user in the source that supports
precopy initial data, a handshake is done with its counterpart in the
destination:
- If both support it, precopy initial data will be used for them.
- If the source doesn't support it, precopy initial data will not be used
  for them.
- If the source supports it and the destination doesn't, migration will
  fail.
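
The three per-user outcomes listed above can be summarized in a small
decision function. The enum and function names below are hypothetical
and only model the rule; they are not the interface added by this
series.

```c
#include <stdbool.h>

/* Hypothetical outcome of the per-user handshake. */
enum handshake_result {
    HANDSHAKE_USE_INITIAL_DATA,  /* both sides support it */
    HANDSHAKE_SKIP_INITIAL_DATA, /* source user doesn't support it */
    HANDSHAKE_FAIL_MIGRATION,    /* source supports it, destination doesn't */
};

enum handshake_result per_user_handshake(bool src_supports,
                                         bool dst_supports)
{
    if (!src_supports) {
        /* No handshake is attempted; the user migrates as usual. */
        return HANDSHAKE_SKIP_INITIAL_DATA;
    }
    return dst_supports ? HANDSHAKE_USE_INITIAL_DATA
                        : HANDSHAKE_FAIL_MIGRATION;
}
```

Note that destination-only support is harmless: the result is decided by
the source user first, so that combination simply skips initial data.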

Assuming the handshake succeeded, migration starts sending precopy data,
including the initial precopy data. Initial precopy data is
just like any other precopy data and as such, migration code is not
aware of it. Therefore, it's the responsibility of the migration users
(such as VFIO devices) to notify their counterparts in the destination
that their initial precopy data has been sent (for example, VFIO
migration does it when its initial bytes reach zero).
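
As a rough sketch of the source-side responsibility, a migration user
could track its remaining initial bytes and emit a notification once
they reach zero. The struct and function below are hypothetical, and
emitting the notification only once is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-user state in the source. */
struct src_user {
    uint64_t initial_bytes; /* initial bytes still to be sent */
    bool notified;          /* "initial data sent" already signalled */
};

/*
 * Account for bytes just sent. Returns true when the user should tell
 * its counterpart in the destination that all initial data was sent.
 */
bool account_sent(struct src_user *u, uint64_t sent)
{
    /* Clamp so the counter never goes below zero. */
    u->initial_bytes -= sent < u->initial_bytes ? sent : u->initial_bytes;

    if (u->initial_bytes == 0 && !u->notified) {
        u->notified = true;
        return true;
    }
    return false;
}
```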

In the destination, migration code will query each migration user that
supports precopy initial data and check if its initial data has been
loaded. If initial data has been loaded by all of them, an ACK will be
sent to the source which will now be able to complete migration when
appropriate.
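
The destination-side check can be pictured as below. Again this is a
sketch under assumptions: the names are hypothetical, and sending the
ACK at most once is a choice made here for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user state in the destination. */
struct dst_user {
    bool uses_initial_data;   /* handshake agreed on initial data */
    bool initial_data_loaded; /* user reports its initial data is loaded */
};

/* True if every user that uses initial data has finished loading it. */
bool all_initial_data_loaded(const struct dst_user *users, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (users[i].uses_initial_data && !users[i].initial_data_loaded) {
            return false;
        }
    }
    return true;
}

/*
 * Returns true exactly once: at the first call where all users are
 * done. The caller would then send the ACK over the return path.
 */
bool should_send_ack(const struct dst_user *users, size_t n, bool *ack_sent)
{
    if (*ack_sent || !all_initial_data_loaded(users, n)) {
        return false;
    }
    *ack_sent = true;
    return true;
}
```

Users that do not use initial data are ignored by the check, so they
never delay the ACK.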

=== Test results ===

The table below shows the downtime of two identical migrations. In the
first migration precopy initial data is disabled and in the second it is
enabled. The migrated VM is assigned an mlx5 VFIO device which has
300MB of device data to be migrated.

+----------------------+-----------------------+----------+
| Precopy initial data | VFIO device data size | Downtime |
+----------------------+-----------------------+----------+
|       Disabled       |         300MB         |  1900ms  |
|       Enabled        |         300MB         |  420ms   |
+----------------------+-----------------------+----------+

Precopy initial data gives a roughly 4.5 times improvement in downtime.
The 1480ms difference is the time used for resource allocation for the
VFIO device in the destination. Without precopy initial data, this time
is spent while the source VM is stopped and thus the downtime is much
higher. With precopy initial data, the time is spent while the source VM
is still running.
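
For reference, the two figures quoted above follow directly from the
table: 1900 - 420 = 1480 and 1900 / 420 is about 4.52. The helper names
below are hypothetical and exist only to spell out that arithmetic.

```c
/* Downtime values in milliseconds, taken from the table above. */
enum {
    DOWNTIME_DISABLED_MS = 1900,
    DOWNTIME_ENABLED_MS = 420,
};

/* Time moved out of the downtime window. */
int downtime_saved_ms(int disabled_ms, int enabled_ms)
{
    return disabled_ms - enabled_ms;
}

/* Improvement factor; enabled_ms must be non-zero. */
double downtime_improvement(int disabled_ms, int enabled_ms)
{
    return (double)disabled_ms / enabled_ms;
}
```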

=== Patch breakdown ===

- Patches 1-5 add the precopy initial data capability.
- Patches 6-7 add VFIO migration precopy support. A similar version of
  them was previously sent in [2].
- Patch 8 adds precopy initial data support for VFIO migration.

Thanks for reviewing!

[1]
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/vfio.h#L1048

[2]
https://lore.kernel.org/qemu-devel/20230222174915.5647-3-avih...@nvidia.com/

Avihai Horon (8):
  migration: Add precopy initial data capability
  migration: Add precopy initial data handshake
  migration: Add precopy initial data loaded ACK functionality
  migration: Enable precopy initial data capability
  tests: Add migration precopy initial data capability test
  vfio/migration: Refactor vfio_save_block() to return saved data size
  vfio/migration: Add VFIO migration pre-copy support
  vfio/migration: Add support for precopy initial data capability

 docs/devel/vfio-migration.rst |  35 ++++--
 qapi/migration.json           |   9 +-
 include/hw/vfio/vfio-common.h |   6 +
 include/migration/register.h  |  13 ++
 migration/migration.h         |  15 +++
 migration/options.h           |   1 +
 migration/savevm.h            |   1 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 218 +++++++++++++++++++++++++++++++---
 migration/migration.c         |  45 ++++++-
 migration/options.c           |  16 +++
 migration/savevm.c            | 141 ++++++++++++++++++++++
 tests/qtest/migration-test.c  |  23 ++++
 hw/vfio/trace-events          |   4 +-
 migration/trace-events        |   4 +
 15 files changed, 504 insertions(+), 33 deletions(-)

-- 
2.26.3

