On Wed, 27 Feb 2019 01:35:01 +0530 Kirti Wankhede <kwankh...@nvidia.com> wrote:
> Alex,
> 
> On 2/22/2019 3:53 AM, Alex Williamson wrote:
> > On Wed, 20 Feb 2019 02:53:16 +0530
> > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > 
> >> - Defined MIGRATION region type and sub-type.
> >> - Used 2 bits to define VFIO device states.
> >>     Bit 0 => 0/1 => _STOPPED/_RUNNING
> >>     Bit 1 => 0/1 => _RESUMING/_SAVING
> >>     Combination of these bits defines VFIO device's state during migration
> >>     _RUNNING => Normal VFIO device running state.
> >>     _STOPPED => VFIO device stopped.
> >>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>                           start saving state of device, i.e. pre-copy state
> >>     _SAVING | _STOPPED => vCPUs are stopped, VFIO device should be
> >>                           stopped, and save device state, i.e. stop-and-copy
> >>                           state
> >>     _RESUMING => VFIO device resuming state.
> > 
> > Shouldn't we have a non-_RESUMING/_SAVING run state? If these are
> > indicating direction of flow, maybe we need two bits:
> > 
> >   00b - None, normal runtime
> >   01b - Saving
> >   10b - Resuming
> >   11b - Invalid/reserved (maybe a Failed state indicator)
> > 
> 
> There has to be 2 more states:
> _SAVING | _RUNNING => SAVING while device is RUNNING - pre-copy phase
> _SAVING | _STOPPED => SAVING while device is STOPPED - stop-and-copy phase

Sorry, this was unclear; a separate third bit would need to be the stop
bit. The above was only focusing on the data direction, because...

> So the 2 bits used in this patch are:
> 00b - _RESUMING

This is actually: _RESUMING | _STOPPED

> 01b - _RUNNING - Normal Running

And this is really: _RESUMING | _RUNNING

We're just selectively ignoring _RESUMING when we choose to. So my
question is really more like:

|0|00|
 ^ ^^
 | ||
 | |+-- Saving
 | +--- Resuming
 +----- Stopped

Where Saving|Resuming both set is either invalid or a failure
indicator.
> 10b - _SAVING | _STOPPED - stop-and-copy phase
> 11b - _SAVING | _RUNNING - pre-copy phase
> 
> >> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>   offset of migration region to get/set VFIO device related information.
> >>   Defined members of structure and usage on read/write access:
> >>     * device_state: (write only)
> >>         To convey VFIO device state to be transitioned to.
> > 
> > Seems trivial and potentially useful to support read here, we have 30
> > (or maybe 29) bits yet to define.
> > 
> 
> Ok, those bits can be used later.
> 
> >>     * pending bytes: (read only)
> >>         To get pending bytes yet to be migrated for VFIO device
> >>     * data_offset: (read/write)
> >>         To get or set data offset in migration region from where data
> >>         exists during _SAVING and _RESUMING state
> > 
> > What's the use case for writing this?
> > 
> 
> During resuming, user space application (QEMU) writes data to migration
> region. At that time QEMU writes data_offset from where data is written,
> so that vendor driver can read data from that offset. If data section is
> mmapped, data_offset is start of mmapped region, whereas if data section
> is trapped, data_offset is sizeof(struct vfio_device_migration_info) +
> 1, just after vfio_device_migration_info structure.

It doesn't make sense to me that this proposal allows the user to write
wherever they want; the vendor driver defines whether it supports mmap
for a given operation and the user can choose to make use of mmap or
not. Therefore I'd suggest that the vendor driver expose a read-only
data_offset that matches a sparse mmap capability entry should the
driver support mmap. The user should always read or write data at the
vendor-defined data_offset.
Otherwise we have scenarios like the other patch I commented on, where
the user implicitly decides that the first mmap region large enough is
the one to be used for a given operation, removing the vendor driver's
ability to decide whether it wants to support mmap for that operation.

> >>     * data_size: (write only)
> >>         To convey size of data copied in migration region during
> >>         _RESUMING state
> > 
> > How to know how much is available for read?
> 
> pending_bytes tells how much is still to be read.

But pending_bytes can be bigger than the region, right? Does the data
area necessarily extend to the end of the region?

> >>     * start_pfn, page_size, total_pfns: (write only)
> >>         To get bitmap of dirty pages from vendor driver from given
> >>         start address for total_pfns.
> > 
> > What would happen if a user wrote in 1MB for page size? Is the vendor
> > driver expected to support arbitrary page sizes? Are we only trying to
> > convey the page size, and would that page size ever be other than
> > getpagesize()?
> > 
> >>     * copied_pfns: (read only)
> >>         To get number of pfns bitmap copied in migration region.
> >>         Vendor driver should copy the bitmap with bits set only for
> >>         pages to be marked dirty in migration region. Vendor driver
> >>         should return 0 if there are 0 pages dirty in requested
> >>         range.
> > 
> > This is useful, but I wonder if it's really a required feature for the
> > vendor driver. For instance, with mdev IOMMU support we could wrap an
> > arbitrary PCI device as mdev, but we don't necessarily have dirty page
> > tracking. Would a device need to report -1 here if it wanted to
> > indicate any page could be dirty if we only know how to collect the
> > state of the device itself for migration (i.e. force the device to be
> > stopped first)?
> > 
> 
> Does that mean if returned -1 here, mark all pages in the section as
> dirty?
Yes

> >> Migration region looks like:
> >>  ------------------------------------------------------------------
> >> |vfio_device_migration_info|            data section               |
> >> |                          |   ///////////////////////////////     |
> >>  ------------------------------------------------------------------

^ what's this?

> >>  ^                           ^                                 ^
> >>  offset 0-trapped part       data.offset                       data.size
> > 
> > Isn't data.size above really (data.offset + data.size)? '.' vs '_'
> > inconsistency vs above.
> > 
> 
> Yes, it should be '_'. I'll correct that.
> 
> >> Data section is always followed by vfio_device_migration_info
> >> structure in the region, so data.offset will always be none-0.
> > 
> > This seems exactly backwards from the diagram, data section follows
> > vfio_device_migration_info. Also, "non-zero".
> > 
> >> Offset from where data is copied is decided by kernel driver, data
> > 
> > But data_offset is listed as read-write.
> > 
> 
> data_offset is read during _SAVING state so QEMU knows from where to
> read data.
> data_offset is written during _RESUMING state so that QEMU can convey
> the offset in migration region to vendor driver from where data should
> be considered as input data.

Please let's drop the idea that the user can write data to an arbitrary
offset within the region. Does it serve any value?

> >> section can be trapped or mapped depending on how kernel driver
> >> defines data section. If mmapped, then data.offset should be page
> >> aligned, whereas initial section which contains
> >> vfio_device_migration_info structure might not end at offset which
> >> is page aligned.
> >> 
> >> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> >> Reviewed-by: Neo Jia <c...@nvidia.com>
> >> ---
> >>  linux-headers/linux/vfio.h | 65 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 65 insertions(+)
> >> 
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 12a7b1dc53c8..1b12a9b95e00 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -368,6 +368,71 @@ struct vfio_region_gfx_edid {
> >>   */
> >>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD   (1)
> >>  
> >> +/* Migration region type and sub-type */
> >> +#define VFIO_REGION_TYPE_MIGRATION             (2)
> >> +#define VFIO_REGION_SUBTYPE_MIGRATION          (1)
> >> +
> >> +/**
> >> + * Structure vfio_device_migration_info is placed at 0th offset of
> >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related
> >> + * migration information. Field accesses from this structure are only
> >> + * supported at their native width and alignment, otherwise should
> >> + * return error.
> >> + *
> >> + * device_state: (write only)
> >> + *      To indicate vendor driver the state VFIO device should be
> >> + *      transitioned to. If device state transition fails, write to this
> >> + *      field returns error. It consists of 2 bits.
> >> + *      - If bit 0 set, indicates _RUNNING state. When it is reset, that
> >> + *        indicates _STOPPED state. When device is changed to _STOPPED,
> >> + *        driver should stop device before write returns.
> >> + *      - If bit 1 set, indicates _SAVING state. When it is reset, that
> >> + *        indicates _RESUMING state.
> >> + *
> >> + * pending bytes: (read only)
> >> + *      Read pending bytes yet to be migrated from vendor driver
> > 
> > Is this essentially a volatile value, changing based on data previously
> > copied and device activity?
> 
> Yes.
> 
> >> + *
> >> + * data_offset: (read/write)
> >> + *      User application should read data_offset in migration region from
> >> + *      where user application should read data during _SAVING state.
> >> + *      User application would write data_offset in migration region from
> >> + *      where user application had written data during _RESUMING state.
> > 
> > Why wouldn't data_offset simply be fixed? Do we really want to support
> > the user writing arbitrary offsets in the migration region? Each
> > chunk can simply start at the kernel provided data_offset.
> > 
> 
> data_offset differs based on whether the region is trapped or mmapped.
> In case the region is mmapped, QEMU writes data to the mapped region and
> then writes the data_offset field, indicating data is present in the
> mmapped buffer. Read below for more detailed steps.
> 
> >> + *
> >> + * data_size: (write only)
> >> + *      User application should write size of data copied in migration
> >> + *      region during _RESUMING state.
> > 
> > How much does the user read on _SAVING then?
> > 
> 
> pending_bytes tells bytes that should be read on _SAVING.

So pending_bytes is always less than or equal to the size of the
region, thus pending_bytes cannot be used to gauge the _total_ pending
device state?

> >> + *
> >> + * start_pfn: (write only)
> >> + *      Start address pfn to get bitmap of dirty pages from vendor driver
> >> + *      during _SAVING state.
> >> + *
> >> + * page_size: (write only)
> >> + *      User application should write the page_size of pfn.
> >> + *
> >> + * total_pfns: (write only)
> >> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > 
> > So effectively the area that begins at data_offset is dual purpose
> > (triple purpose vs Yan's: config, memory, and dirty bitmap) but the
> > protocol isn't entirely evident to me. I think we need to write it out
> > as Yan provided. If I'm in the _SAVING state, do I simply read from
> > data_offset to min(data_size, region.size - data_offset)?
> > If that area
> > is mmap'd, how does the user indicate to the kernel to prepare the data
> > or that X bytes were acquired? In the _RESUMING state, do I write from
> > data_offset to min(data_size, region.size - data_offset) and indicate
> > the write using data_size?
> > 
> 
> Let me list down the steps for each state, hope that helps to answer all
> the above questions.
> 
> In _SAVING|_RUNNING device state:
> - read pending_bytes
> - read data_offset - indicates kernel driver to write data to staging
>   buffer which is mmapped.
> - if data section is trapped, pread() from data_offset to
>   min(pending_bytes, remaining_region)
> - if data section is mmapped, read mmapped buffer of size
>   min(pending_bytes, remaining_region)

Ok, pending_bytes can describe more than the region currently holds. It
still seems worthwhile to have a data_size so that we can extend this
interface. For instance, what if a driver wanted to trap data accesses
but mmap a dirty bitmap? This interface doesn't allow for that since
they're both necessarily covered by the same sparse mmap offset.

> - Write data packet to file stream as below:
>   {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data,
>    VFIO_MIG_FLAG_END_OF_STATE}
> 
> 
> In _SAVING|_STOPPED device state:
> a. read config space of device and save to migration file stream. This
>    doesn't need to be from vendor driver. Any other special config state
>    from driver can be saved as data in following iteration.
> b. read pending_bytes - indicates kernel driver to write data to staging
>    buffer which is mmapped.
> c. if data section is trapped, pread() from data_offset to
>    min(pending_bytes, remaining_region)
> d. if data section is mmapped, read mmapped buffer of size
>    min(pending_bytes, remaining_region)
> e. Write data packet as below:
>    {VFIO_MIG_FLAG_DEV_DATA_STATE, data_size, actual data}
> f. iterate through steps b to e while (pending_bytes > 0)

What indicates to the vendor driver to deduct from pending_bytes and
advance the data?
It seems like it's assumed that a read of pending_bytes indicates this,
but I don't like that implicit interaction; a user could be interrupted
and read pending_bytes again to see if there's still data, and introduce
data loss. IMO, there should be an explicit "Ok, I read # bytes, advance
the data stream" type operation.

> g. Write {VFIO_MIG_FLAG_END_OF_STATE}
> 
> 
> Dirty page tracking (.log_sync) is part of RAM copying state, where
> vendor driver provides the bitmap of pages which are dirtied by vendor
> driver through migration region, and as part of RAM copy those pages
> get copied to file stream.
> 
> 
> In _RESUMING device state:
> - load config state.
> - For each data packet, till VFIO_MIG_FLAG_END_OF_STATE is reached:
>   - read data_size from packet, read buffer of data_size
>   - if region is mmapped, write data of data_size to mmapped region.
>   - write data_offset and data_size.
>     In case of mmapped region, write to data_size indicates to kernel
>     driver that data is written in staging buffer.
>   - if region is trapped, pwrite() data of data_size from data_offset.

I still think data_offset should be read-only and this should use the
same operation above to indicate how many bytes were written rather
than read. Thanks,

Alex

> >> + *
> >> + * copied_pfns: (read only)
> >> + *      pfn count for which dirty bitmap is copied to migration region.
> >> + *      Vendor driver should copy the bitmap with bits set only for pages
> >> + *      to be marked dirty in migration region.
> >> + *      Vendor driver should return 0 if there are 0 pages dirty in
> >> + *      requested range.
> >> + */
> >> +
> >> +struct vfio_device_migration_info {
> >> +        __u32 device_state;         /* VFIO device state */
> >> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> >> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >> +        __u32 reserved;
> >> +        __u64 pending_bytes;
> >> +        __u64 data_offset;
> >> +        __u64 data_size;
> >> +        __u64 start_pfn;
> >> +        __u64 page_size;
> >> +        __u64 total_pfns;
> >> +        __u64 copied_pfns;
> >> +} __attribute__((packed));
> >> +
> > 
> > As you mentioned, and with Yan's version, we still need to figure out
> > something with compatibility and versioning. Thanks,
> > 
> 
> Yes, we still need to figure out compatibility and versioning.
> 
> Thanks,
> Kirti
> 
> > Alex
> > 
> >> /*
> >>  * The MSIX mappable capability informs that MSIX data of a BAR can be
> >>  * mmapped which allows direct access to non-MSIX registers which
> >>  * happened to be within
> > 