> -----Original Message-----
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Sent: 28 January 2022 08:29
> To: Jag Raman <jag.ra...@oracle.com>
> Cc: John Levon <john.le...@nutanix.com>; Thanos Makatos
> <thanos.maka...@nutanix.com>; qemu-devel <qemu-devel@nongnu.org>;
> Marc-André Lureau <marcandre.lur...@gmail.com>; Philippe Mathieu-Daudé
> <f4...@amsat.org>; Paolo Bonzini <pbonz...@redhat.com>; Beraldo Leal
> <bl...@redhat.com>; Daniel P. Berrangé <berra...@redhat.com>;
> edua...@habkost.net; Michael S. Tsirkin <m...@redhat.com>; Marcel
> Apfelbaum <marcel.apfelb...@gmail.com>; Eric Blake <ebl...@redhat.com>;
> Markus Armbruster <arm...@redhat.com>; Juan Quintela
> <quint...@redhat.com>; Dr . David Alan Gilbert <dgilb...@redhat.com>; Elena
> Ufimtseva <elena.ufimts...@oracle.com>; John Johnson
> <john.g.john...@oracle.com>
> Subject: Re: [PATCH v5 17/18] vfio-user: register handlers to facilitate 
> migration
> 
> On Thu, Jan 27, 2022 at 05:04:26PM +0000, Jag Raman wrote:
> >
> >
> > > On Jan 25, 2022, at 10:48 AM, Stefan Hajnoczi <stefa...@redhat.com>
> wrote:
> > >
> > > On Wed, Jan 19, 2022 at 04:42:06PM -0500, Jagannathan Raman wrote:
> > >> +     * The client subsequetly asks the remote server for any data that
> > >
> > > subsequently
> > >
> > >> +static void vfu_mig_state_running(vfu_ctx_t *vfu_ctx)
> > >> +{
> > >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> > >> +    VfuObjectClass *k = VFU_OBJECT_GET_CLASS(OBJECT(o));
> > >> +    static int migrated_devs;
> > >> +    Error *local_err = NULL;
> > >> +    int ret;
> > >> +
> > >> +    /**
> > >> +     * TODO: move to VFU_MIGR_STATE_RESUME handler. Presently, the
> > >> +     * VMSD data from source is not available at RESUME state.
> > >> +     * Working on a fix for this.
> > >> +     */
> > >> +    if (!o->vfu_mig_file) {
> > >> +        o->vfu_mig_file = qemu_fopen_ops(o, &vfu_mig_fops_load, false);
> > >> +    }
> > >> +
> > >> +    ret = qemu_remote_loadvm(o->vfu_mig_file);
> > >> +    if (ret) {
> > >> +        VFU_OBJECT_ERROR(o, "vfu: failed to restore device state");
> > >> +        return;
> > >> +    }
> > >> +
> > >> +    qemu_file_shutdown(o->vfu_mig_file);
> > >> +    o->vfu_mig_file = NULL;
> > >> +
> > >> +    /* VFU_MIGR_STATE_RUNNING begins here */
> > >> +    if (++migrated_devs == k->nr_devs) {
> > >
> > > When is this counter reset so migration can be tried again if it
> > > fails/cancels?
> >
> > Detecting cancellation is a pending item. We will address it in the
> > next rev. Will check with you if  we get stuck during the process
> > of implementing it.
> >
> > >
> > >> +static ssize_t vfu_mig_read_data(vfu_ctx_t *vfu_ctx, void *buf,
> > >> +                                 uint64_t size, uint64_t offset)
> > >> +{
> > >> +    VfuObject *o = vfu_get_private(vfu_ctx);
> > >> +
> > >> +    if (offset > o->vfu_mig_buf_size) {
> > >> +        return -1;
> > >> +    }
> > >> +
> > >> +    if ((offset + size) > o->vfu_mig_buf_size) {
> > >> +        warn_report("vfu: buffer overflow - check pending_bytes");
> > >> +        size = o->vfu_mig_buf_size - offset;
> > >> +    }
> > >> +
> > >> +    memcpy(buf, (o->vfu_mig_buf + offset), size);
> > >> +
> > >> +    o->vfu_mig_buf_pending -= size;
> > >
> > > This assumes that the caller increments offset by size each time. If
> > > that assumption is okay, then we can just trust offset and don't need to
> > > do arithmetic on vfu_mig_buf_pending. If that assumption is not correct,
> > > then the code needs to be extended to safely update vfu_mig_buf_pending
> > > when offset jumps around arbitrarily between calls.
> >
> > Going by the definition of vfu_migration_callbacks_t in the library, I 
> > assumed
> > that read_data advances the offset by size bytes.
> >
> > Will add a comment a comment to explain that.

libvfio-user does not automatically increment offset by size each time, since
the vfio-user client can re-read the migration data multiple times. In
libvfio-user API we state:

    Function that is called to read migration data. offset and size can be
    any subrange on the offset and size previously returned by prepare_data.

Reading the pending_bytes register is what marks the end of the iteration, and
this is where you need to decrement vfu_mig_buf_pending.

I'll add more unit tests to libvfio-user to validate this behavior.

> >
> > >
> > >> +uint64_t vmstate_vmsd_size(PCIDevice *pci_dev)
> > >> +{
> > >> +    DeviceClass *dc = DEVICE_GET_CLASS(DEVICE(pci_dev));
> > >> +    const VMStateField *field = NULL;
> > >> +    uint64_t size = 0;
> > >> +
> > >> +    if (!dc->vmsd) {
> > >> +        return 0;
> > >> +    }
> > >> +
> > >> +    field = dc->vmsd->fields;
> > >> +    while (field && field->name) {
> > >> +        size += vmstate_size(pci_dev, field);
> > >> +        field++;
> > >> +    }
> > >> +
> > >> +    return size;
> > >> +}
> > >
> > > This function looks incorrect because it ignores subsections as well as
> > > runtime behavior during save(). Although VMStateDescription is partially
> > > declarative, there is still a bunch of imperative code that can write to
> > > the QEMUFile at save() time so there's no way of knowing the size ahead
> > > of time.
> >
> > I see your point, it would be a problem for any field which has the
> > (VMS_BUFFER | VMS_ALLOC) flags set.
> >
> > >
> > > I asked this in a previous revision of this series but I'm not sure if
> > > it was answered: is it really necessary to know the size of the vmstate?
> > > I thought the VFIO migration interface is designed to support
> > > streaming reads/writes. We could choose a fixed size like 64KB and
> > > stream the vmstate in 64KB chunks.
> >
> > The library exposes the migration data to the client as a device BAR with
> > fixed size - the size of which is fixed at boot time, even when using
> > vfu_migration_callbacks_t callbacks.
> >
> > I don’t believe the library supports streaming vmstate/migration-data - see
> > the following comment in migration_region_access() defined in the library:
> >
> > * Does this mean that partial reads are not allowed?
> >
> > Thanos or John,
> >
> >     Could you please clarify this?

libvfio-user does support streaming of migration data, this comment is based on
the VFIO documentation:

    d. Read data_size bytes of data from (region + data_offset) from the
        migration region.

It's not clear to me whether streaming should be allowed, I'd be surprised if
it didn't.

> >
> > Stefan,
> >     We attempted to answer the migration cancellation and vmstate size
> >     questions previously also, in the following email:
> >
> > https://lore.kernel.org/all/F48606B1-15A4-4DD2-9D71-
> 2fcafc0e6...@oracle.com/
> 
> >  libvfio-user has the vfu_migration_callbacks_t interface that allows the
> >  device to save/load more data regardless of the size of the migration
> >  region. I don't see the issue here since the region doesn't need to be
> >  sized to fit the savevm data?
> 
> The answer didn't make sense to me:
> 
> "In both scenarios at the server end - whether using the migration BAR or
> using callbacks, the migration data is transported to the other end using
> the BAR. As such we need to specify the BAR’s size during initialization.
> 
> In the case of the callbacks, the library translates the BAR access to 
> callbacks."
> 
> The BAR and the migration region within it need a size but my
> understanding is that VFIO migration is designed to stream the device
> state, allowing it to be broken up into multiple reads/writes with
> knowing the device state's size upfront. Here is the description from
> <linux/vfio.h>:
> 
>   * The sequence to be followed while in pre-copy state and stop-and-copy 
> state
>   * is as follows:
>   * a. Read pending_bytes, indicating the start of a new iteration to get 
> device
>   *    data. Repeated read on pending_bytes at this stage should have no side
>   *    effects.
>   *    If pending_bytes == 0, the user application should not iterate to get 
> data
>   *    for that device.
>   *    If pending_bytes > 0, perform the following steps.
>   * b. Read data_offset, indicating that the vendor driver should make data
>   *    available through the data section. The vendor driver should return 
> this
>   *    read operation only after data is available from (region + data_offset)
>   *    to (region + data_offset + data_size).
>   * c. Read data_size, which is the amount of data in bytes available through
>   *    the migration region.
>   *    Read on data_offset and data_size should return the offset and size of
>   *    the current buffer if the user application reads data_offset and
>   *    data_size more than once here.
>   * d. Read data_size bytes of data from (region + data_offset) from the
>   *    migration region.
>   * e. Process the data.
>   * f. Read pending_bytes, which indicates that the data from the previous
>   *    iteration has been read. If pending_bytes > 0, go to step b.
>   *
>   * The user application can transition from the _SAVING|_RUNNING
>   * (pre-copy state) to the _SAVING (stop-and-copy) state regardless of the
>   * number of pending bytes. The user application should iterate in _SAVING
>   * (stop-and-copy) until pending_bytes is 0.
> 
> This means you can report pending_bytes > 0 until the entire vmstate has
> been read and can pick a fixed chunk size like 64KB for the migration
> region. There's no need to size the migration region to fit the entire
> vmstate.
> 
> Stefan

Reply via email to