On 2/22/2019 4:08 AM, Alex Williamson wrote:
> On Wed, 20 Feb 2019 02:53:18 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>
>> - Migration function are implemented for VFIO_DEVICE_TYPE_PCI device.
>> - Added SaveVMHandlers and implemented all basic functions required for live
>> migration.
>> - Added VM state change handler to know running or stopped state of VM.
>> - Added migration state change notifier to get notification on migration
>> state
>> change. This state is translated to VFIO device state and conveyed to
>> vendor
>> driver.
>> - VFIO device supports migration or not is decided based of migration region
>> query. If migration region query is successful then migration is supported
>> else migration is blocked.
>> - Structure vfio_device_migration_info is mapped at 0th offset of migration
>> region and should always trapped by VFIO device's driver. Added both type
>> of
>> access support, trapped or mmapped, for data section of the region.
>> - To save device state, read pending_bytes and data_offset using structure
>> vfio_device_migration_info, accordingly copy data from the region.
>> - To restore device state, write data_offset and data_size in the structure
>> and write data in the region.
>> - To get dirty page bitmap, write start address and pfn count then read
>> count of
>> pfns copied and accordingly read those from the rest of the region or
>> mmaped
>> part of the region. This copy is iterated till page bitmap for all
>> requested
>> pfns are copied.
>>
>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
>> Reviewed-by: Neo Jia <c...@nvidia.com>
>> ---
>> hw/vfio/Makefile.objs | 2 +-
>> hw/vfio/migration.c | 714
>> ++++++++++++++++++++++++++++++++++++++++++
>> include/hw/vfio/vfio-common.h | 20 ++
>> 3 files changed, 735 insertions(+), 1 deletion(-)
>> create mode 100644 hw/vfio/migration.c
>>
>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs
>> index abad8b818c9b..36033d1437c5 100644
>> --- a/hw/vfio/Makefile.objs
>> +++ b/hw/vfio/Makefile.objs
>> @@ -1,4 +1,4 @@
>> -obj-y += common.o spapr.o
>> +obj-y += common.o spapr.o migration.o
>> obj-$(CONFIG_VFIO_PCI) += pci.o pci-quirks.o display.o
>> obj-$(CONFIG_VFIO_CCW) += ccw.o
>> obj-$(CONFIG_VFIO_PLATFORM) += platform.o
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> new file mode 100644
>> index 000000000000..d7b6d972c043
>> --- /dev/null
>> +++ b/hw/vfio/migration.c
>> @@ -0,0 +1,714 @@
>> +/*
>> + * Migration support for VFIO devices
>> + *
>> + * Copyright NVIDIA, Inc. 2018
>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include <linux/vfio.h>
>> +
>> +#include "hw/vfio/vfio-common.h"
>> +#include "cpu.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include "migration/register.h"
>> +#include "migration/blocker.h"
>> +#include "migration/misc.h"
>> +#include "qapi/error.h"
>> +#include "exec/ramlist.h"
>> +#include "exec/ram_addr.h"
>> +#include "pci.h"
>> +
>> +/*
>> + * Flags used as delimiter:
>> + * 0xffffffff => MSB 32-bit all 1s
>> + * 0xef10 => emulated (virtual) function IO
>
> :^)
>
>> + * 0x0000 => 16-bits reserved for flags
>> + */
>> +#define VFIO_MIG_FLAG_END_OF_STATE (0xffffffffef100001ULL)
>> +#define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL)
>> +#define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>> +#define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>> +
>> +static void vfio_migration_region_exit(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> +
>> + if (!migration) {
>> + return;
>> + }
>> +
>> + if (migration->region.buffer.size) {
>> + vfio_region_exit(&migration->region.buffer);
>> + vfio_region_finalize(&migration->region.buffer);
>> + }
>> +}
>> +
>> +static int vfio_migration_region_init(VFIODevice *vbasedev)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + Object *obj = NULL;
>> + int ret = -EINVAL;
>> +
>> + if (!migration) {
>> + return ret;
>> + }
>> +
>> + /* Migration support added for PCI device only */
>> + if (vbasedev->type == VFIO_DEVICE_TYPE_PCI) {
>> + obj = vfio_pci_get_object(vbasedev);
>> + }
>> +
>> + if (!obj) {
>> + return ret;
>> + }
>> +
>> + ret = vfio_region_setup(obj, vbasedev, &migration->region.buffer,
>> + migration->region.index, "migration");
>> + if (ret) {
>> + error_report("Failed to setup VFIO migration region %d: %s",
>> + migration->region.index, strerror(-ret));
>> + goto err;
>> + }
>> +
>> + if (!migration->region.buffer.size) {
>> + ret = -EINVAL;
>> + error_report("Invalid region size of VFIO migration region %d: %s",
>> + migration->region.index, strerror(-ret));
>> + goto err;
>> + }
>> +
>> + if (migration->region.buffer.mmaps) {
>> + ret = vfio_region_mmap(&migration->region.buffer);
>> + if (ret) {
>> + error_report("Failed to mmap VFIO migration region %d: %s",
>> + migration->region.index, strerror(-ret));
>> + goto err;
>> + }
>
> In patch 5 this eventually gets called from our realize function, so
> we're forcing the kernel to allocate whatever backing store it might
> use for these mmaps for the entire life of the device, in case maybe we
> want to migrate it. Like my reply to Yan, unless I'm misunderstanding
> how drivers will back this mmap this seems really inefficient.
>
Backing storage can be allocated on access, that is from fault handler,
but will remain allocated for entire life of device.
This can be moved to .save_setup and unmmap from .save_cleanup,
similarly during resuming .load_setup and .load_cleanup can be used.
I'll update patch in next iteration.
>> + }
>> +
>> + return 0;
>> +
>> +err:
>> + vfio_migration_region_exit(vbasedev);
>> + return ret;
>> +}
>> +
>> +static int vfio_migration_set_state(VFIODevice *vbasedev, uint32_t state)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIORegion *region = &migration->region.buffer;
>> + int ret = 0;
>> +
>> + ret = pwrite(vbasedev->fd, &state, sizeof(state),
>> + region->fd_offset + offsetof(struct
>> vfio_device_migration_info,
>> + device_state));
>> + if (ret < 0) {
>> + error_report("Failed to set migration state %d %s",
>> + ret, strerror(errno));
>> + return ret;
>> + }
>> +
>> + vbasedev->device_state = state;
>> + return 0;
>> +}
>> +
>> +void vfio_get_dirty_page_list(VFIODevice *vbasedev,
>> + uint64_t start_pfn,
>> + uint64_t pfn_count,
>> + uint64_t page_size)
>> +{
>> + VFIOMigration *migration = vbasedev->migration;
>> + VFIORegion *region = &migration->region.buffer;
>> + uint64_t count = 0, copied_pfns = 0;
>> + int ret;
>> +
>> + ret = pwrite(vbasedev->fd, &start_pfn, sizeof(start_pfn),
>> + region->fd_offset + offsetof(struct
>> vfio_device_migration_info,
>> + start_pfn));
>> + if (ret < 0) {
>> + error_report("Failed to set dirty pages start address %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + ret = pwrite(vbasedev->fd, &page_size, sizeof(page_size),
>> + region->fd_offset + offsetof(struct
>> vfio_device_migration_info,
>> + page_size));
>> + if (ret < 0) {
>> + error_report("Failed to set dirty page size %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + ret = pwrite(vbasedev->fd, &pfn_count, sizeof(pfn_count),
>> + region->fd_offset + offsetof(struct
>> vfio_device_migration_info,
>> + total_pfns));
>> + if (ret < 0) {
>> + error_report("Failed to set dirty page total pfns %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + do {
>> + uint64_t bitmap_size;
>> + void *buf = NULL;
>> + bool buffer_mmaped = false;
>> +
>> + /* Read dirty_pfns.copied */
>> + ret = pread(vbasedev->fd, &copied_pfns, sizeof(copied_pfns),
>> + region->fd_offset + offsetof(struct
>> vfio_device_migration_info,
>> + copied_pfns));
>> + if (ret < 0) {
>> + error_report("Failed to get dirty pages bitmap count %d %s",
>> + ret, strerror(errno));
>> + return;
>> + }
>> +
>> + if (copied_pfns == 0) {
>> + /*
>> + * copied_pfns could be 0 if driver doesn't have any page to
>> report
>> + * dirty in given range
>> + */
>> + break;
>> + }
>> +
>> + bitmap_size = (BITS_TO_LONGS(copied_pfns) + 1) * sizeof(unsigned
>> long);
>> +
>> + if (region->mmaps) {
>> + int i;
>> +
>> + for (i = 0; i < region->nr_mmaps; i++) {
>> + if (region->mmaps[i].size >= bitmap_size) {
>> + buf = region->mmaps[i].mmap;
>> + buffer_mmaped = true;
>> + break;
>> + }
>
> Wait, the first sparse mmap region that's big enough to hold the bitmap
> is necessarily the one where we read it? Shouldn't we test that the
> mmap range covers the correct offset? This also makes me realize that
> if a driver wanted the device memory mmap'd but not the dirty memory
> backing, I don't think they can do it with this interface. Perhaps
> there should be separate data and dirty offset fields in the header to
> that a driver can place them at separate offsets with different mmap
> coverage if they desire. Thanks,
>
Makes sense. I'll add dirty_offset in header.
Thanks,
Kirti
> Alex
>