RE: [PATCH v8] introduce vfio-user protocol specification

Thanos Makatos Mon, 14 Jun 2021 03:08:41 -0700

> -----Original Message-----
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Sent: 04 May 2021 14:52
> To: Thanos Makatos <thanos.maka...@nutanix.com>
> Cc: qemu-devel@nongnu.org; John Levon <le...@movementarian.org>;
> John G Johnson <john.g.john...@oracle.com>;
> benjamin.wal...@intel.com; Elena Ufimtseva
> <elena.ufimts...@oracle.com>; jag.ra...@oracle.com;
> james.r.har...@intel.com; Swapnil Ingle <swapnil.in...@nutanix.com>;
> konrad.w...@oracle.com; alex.william...@redhat.com;
> yuvalkash...@gmail.com; tina.zh...@intel.com;
> marcandre.lur...@redhat.com; ism...@linux.com;
> kanth.ghatr...@oracle.com; Felipe Franciosi <fel...@nutanix.com>;
> xiuchun...@intel.com; tomassetti.and...@gmail.com; Raphael Norwitz
> <raphael.norw...@nutanix.com>; changpeng....@intel.com;
> dgilb...@redhat.com; Yan Zhao <yan.y.z...@intel.com>; Michael S . Tsirkin
> <m...@redhat.com>; Gerd Hoffmann <kra...@redhat.com>; Christophe de
> Dinechin <cdupo...@redhat.com>; Jason Wang <jasow...@redhat.com>;
> Cornelia Huck <coh...@redhat.com>; Kirti Wankhede
> <kwankh...@nvidia.com>; Paolo Bonzini <pbonz...@redhat.com>;
> mpiszc...@ddn.com; John Levon <john.le...@nutanix.com>
> Subject: Re: [PATCH v8] introduce vfio-user protocol specification
> 
> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be
> > emulated outside QEMU, in a separate process. vfio-user reuses the
> > existing VFIO defines, structs and concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson <john.g.john...@oracle.com>
> > Signed-off-by: Thanos Makatos <thanos.maka...@nutanix.com>
> > Signed-off-by: John Levon <john.le...@nutanix.com>
> >
> > ---
> >
> > Changed since v1:
> >   * fix coding style issues
> >   * update MAINTAINERS for VFIO-over-socket
> >   * add vfio-over-socket to ToC
> >
> > Changed since v2:
> >   * fix whitespace
> >
> > Changed since v3:
> >   * rename protocol to vfio-user
> >   * add table of contents
> >   * fix Unicode problems
> >   * fix typos and various reStructuredText issues
> >   * various stylistic improvements
> >   * add backend program conventions
> >   * rewrite part of intro, drop QEMU-specific stuff
> >   * drop QEMU-specific paragraph about implementation
> >   * explain that passing of FDs isn't necessary
> >   * minor improvements in the VFIO section
> >   * various text substitutions for the sake of consistency
> > >   * drop paragraph about client and server, already explained in
> > >     intro
> >   * drop device ID
> >   * drop type from version
> >   * elaborate on request concurrency
> >   * convert some inessential paragraphs into notes
> >   * explain why some existing VFIO defines cannot be reused
> >   * explain how to make changes to the protocol
> >   * improve text of DMA map
> >   * reword comment about existing VFIO commands
> >   * add reference to Version section
> >   * reset device on disconnection
> >   * reword live migration section
> >   * replace sys/vfio.h with linux/vfio.h
> >   * drop reference to iovec
> >   * use argz the same way it is used in VFIO
> >   * add type field in header for clarity
> >
> > Changed since v4:
> > >   * introduce support for live migration as defined in
> > >     include/uapi/linux/vfio.h
> >   * introduce 'max_fds' and 'migration' capabilities:
> >   * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
> >   * fix minor typos and reworded some text for clarity
> >
> > Changed since v5:
> >   * fix minor typos
> >   * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
> >   * clarify meaning of VFIO bitmap size field
> >   * move version major/minor outside JSON
> >   * client proposes version first
> >   * make Errno optional in message header
> >   * clarification about message ID uniqueness
> >   * clarify that server->client request can appear in between
> >     client->server request/reply
> >
> > Changed since v6:
> >   * put JSON strings in double quotes
> >   * clarify reply behavior on error
> >   * introduce max message size capability
> >   * clarify semantics when failing to map multiple DMA regions in a
> >     single command
> >
> > Changed since v7:
> >   * client proposes version instead of server
> >   * support ioeventfd and ioregionfd for unmapped regions
> >   * reword struct vfio_bitmap for clarity
> >   * clarify use of argsz in VFIO device info
> >   * allow individual IRQs to be disabled
> > ---
> >  MAINTAINERS              |    7 +
> >  docs/devel/index.rst     |    1 +
> > >  docs/devel/vfio-user.rst | 1854 ++++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 1862 insertions(+)
> >  create mode 100644 docs/devel/vfio-user.rst
> >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 36055f14c5..bd1194002b 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1849,6 +1849,13 @@ F: hw/vfio/ap.c
> >  F: docs/system/s390x/vfio-ap.rst
> >  L: qemu-s3...@nongnu.org
> >
> > +vfio-user
> > +M: John G Johnson <john.g.john...@oracle.com>
> > +M: Thanos Makatos <thanos.maka...@nutanix.com>
> > +M: John Levon <john.le...@nutanix.com>
> > +S: Supported
> > +F: docs/devel/vfio-user.rst
> > +
> >  vhost
> >  M: Michael S. Tsirkin <m...@redhat.com>
> >  S: Supported
> > > diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> > > index 6cf7e2d233..7d1ea63e02 100644
> > --- a/docs/devel/index.rst
> > +++ b/docs/devel/index.rst
> > @@ -42,3 +42,4 @@ Contents:
> >     qom
> >     block-coroutine-wrapper
> >     multi-process
> > +   vfio-user
> > > diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
> > > new file mode 100644
> > > index 0000000000..b3498eec02
> > --- /dev/null
> > +++ b/docs/devel/vfio-user.rst
> > @@ -0,0 +1,1854 @@
> > +.. include:: <isonum.txt>
> > +
> > +********************************
> > +vfio-user Protocol Specification
> > +********************************
> > +
> > +------------
> > +Version_ 0.1
> > +------------
> > +
> > +.. contents:: Table of Contents
> > +
> > +Introduction
> > +============
> > +vfio-user is a protocol that allows a device to be emulated in a
> > +separate process outside of a Virtual Machine Monitor (VMM).
> > +vfio-user devices consist of a generic VFIO device type, living
> > +inside the VMM, which we call the client, and the core device
> > +implementation, living outside the VMM, which we call the server.
> > +
> > +The `Linux VFIO ioctl interface
> > +<https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_
> > > +has been chosen as the base for this protocol for the following reasons:
> > +
> > > +1) It is a mature and stable API, backed by an extensively used framework.
> > > +2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
> > > +   largely reused.
> > +
> > +.. Note::
> > > +   In a proof of concept implementation it has been demonstrated that using
> > > +   VFIO over a UNIX domain socket is a viable option. vfio-user is designed
> > > +   with QEMU in mind, however it could be used by other client applications.
> > > +   The vfio-user protocol does not require that QEMU's VFIO client
> > > +   implementation is used in QEMU.
> > +
> > > +None of the VFIO kernel modules are required for supporting the
> > > +protocol, in either the client or the server; only the source header
> > > +files are used.
> > +
> > +The main idea is to allow a virtual device to function in a separate
> > +process in the same host over a UNIX domain socket. A UNIX domain
> > +socket (AF_UNIX) is chosen because file descriptors can be trivially
> > +sent over it, which in turn
> > +allows:
> > +
> > +* Sharing of client memory for DMA with the server.
> > +* Sharing of server memory with the client for fast MMIO.
> > +* Efficient sharing of eventfd's for triggering interrupts.
> > +
> > +Other socket types could be used which allow the server to run in a
> > +separate guest in the same host (AF_VSOCK) or remotely (AF_INET).
> > +Theoretically the underlying transport does not necessarily have to
> > +be a socket, however we do not examine such alternatives. In this
> > +protocol version we focus on using a UNIX domain socket and introduce
> > > +basic support for the other two types of sockets without considering
> > > +performance implications.
> > +
> > +While passing of file descriptors is desirable for performance
> > +reasons, it is not necessary neither for the client nor for the
> > +server to support it in order
> 
> Double negative. "not" can be removed.
> 
> > +to implement the protocol. There is always an in-band,
> > +message-passing fall back mechanism.
> > +
> > +VFIO
> > +====
> > +VFIO is a framework that allows a physical device to be securely
> > +passed through to a user space process; the device-specific kernel
> > +driver does not drive the device at all.  Typically, the user space
> > +process is a VMM and the device is passed through to it in order to
> > +achieve high performance. VFIO provides an API and the required
> > +functionality in the kernel. QEMU has adopted VFIO to allow a guest
> > +to directly access physical devices, instead of emulating them in software.
> > +
> > +vfio-user reuses the core VFIO concepts defined in its API, but
> > +implements them as messages to be sent over a socket. It does not
> > +change the kernel-based VFIO in any way, in fact none of the VFIO
> > +kernel modules need to be loaded to use vfio-user. It is also
> > > +possible for the client to concurrently use the current kernel-based VFIO
> > > +for one device, and vfio-user for another device.
> > +
> > +VFIO Device Model
> > +-----------------
> > +A device under VFIO presents a standard interface to the user
> > +process. Many of the VFIO operations in the existing interface use
> > +the ioctl() system call, and references to the existing interface are
> > +called the ioctl() implementation in this document.
> > +
> > +The following sections describe the set of messages that implement
> > +the VFIO interface over a socket. In many cases, the messages are
> > +direct translations of data structures used in the ioctl()
> > +implementation. Messages derived from ioctl()s will have a name
> > +derived from the ioctl() command name.  E.g., the VFIO_GET_INFO
> > > +ioctl() command becomes a VFIO_USER_GET_INFO message.  The purpose of
> > > +this reuse is to share as much code as feasible with the ioctl()
> > > +implementation.
> > +
> > +Connection Initiation
> > +^^^^^^^^^^^^^^^^^^^^^
> > +After the client connects to the server, the initial client message
> > +is VFIO_USER_VERSION to propose a protocol version and set of
> > +capabilities to apply to the session. The server replies with a
> > +compatible version and set of capabilities it supports, or closes the
> > +connection if it cannot support the advertised version.
> > +
> > +DMA Memory Configuration
> > +^^^^^^^^^^^^^^^^^^^^^^^^
> > > +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to
> > > +inform the server of the valid DMA ranges that the server can access
> > > +on behalf of a device. DMA memory may be accessed by the server via
> > > +VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages over the socket.
> > +
> > +An optimization for server access to client memory is for the client
> > +to provide file descriptors the server can mmap() to directly access
> > +client memory. Note that mmap() privileges cannot be revoked by the
> > +client, therefore file descriptors should only be exported in
> > > +environments where the client trusts the server not to corrupt guest
> > > +memory.
> > +
> > +Device Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the
> > +server for information about the device. This information includes:
> > +
> > > +* the device type and whether it supports reset
> > > +  (``VFIO_DEVICE_FLAGS_``),
> > > +* the number of device regions, and
> > > +* the number of interrupt types the device supports.
> > +
> > +Region Information
> > +^^^^^^^^^^^^^^^^^^
> > > +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query
> > > +the server for information about the device's memory regions. This
> > > +information describes:
> > > +
> > > +* Read and write permissions, whether it can be memory mapped, and
> > > +  whether it supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
> > > +* Region index, size, and offset.
> > +
> > +When a region can be mapped by the client, the server provides a file
> > +descriptor which the client can mmap(). The server is responsible for
> > +polling for client updates to memory mapped regions.
> > +
> > +Region Capabilities
> > +"""""""""""""""""""
> > +Some regions have additional capabilities that cannot be described
> > +adequately by the region info data structure. These capabilities are
> > +returned in the region info reply in a list similar to PCI
> > +capabilities in a PCI device's configuration space.
> > +
> > +Sparse Regions
> > +""""""""""""""
> > +A region can be memory-mappable in whole or in part. When only a
> > +subset of a region can be mapped by the client, a
> > > +VFIO_REGION_INFO_CAP_SPARSE_MMAP capability is included in the region
> > > +info reply. This capability describes which portions can be mapped by the
> > > +client.
> > +
> > +.. Note::
> > > +   For example, in a virtual NVMe controller, sparse regions can be used so
> > > +   that accesses to the NVMe registers (found in the beginning of BAR0) are
> > > +   trapped (an infrequent event), while allowing direct access to the
> > > +   doorbells (an extremely frequent event as every I/O submission requires a
> > > +   write to BAR0), found right after the NVMe registers in BAR0.
> > +
> > +Device-Specific Regions
> > +"""""""""""""""""""""""
> > +
> > +A device can define regions additional to the standard ones (e.g. PCI
> > +indexes 0-8). This is achieved by including a
> > +VFIO_REGION_INFO_CAP_TYPE capability in the region info reply of a
> > +device-specific region. Such regions are reflected in ``struct
> > > +vfio_device_info.num_regions``. Thus, for PCI devices this value can be
> > > +equal to, or higher than, VFIO_PCI_NUM_REGIONS.
> > +
> > +Region I/O via file descriptors
> > +-------------------------------
> > +
> > +For unmapped regions, region I/O from the client is done via
> > +VFIO_USER_REGION_READ/WRITE.  As an optimization, ioeventfds or
> > +ioregionfds may be configured for sub-regions of some regions. A
> > +client may request information on these sub-regions via
> > > +VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring the returned file
> > > +descriptors as ioeventfds or ioregionfds, the server can be directly
> > > +notified of I/O (for example, by KVM) without taking a trip through the
> > > +client.
> > +
> > +Interrupts
> > +^^^^^^^^^^
> > > +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the
> > > +server for the device's interrupt types. The interrupt types are
> > > +specific to the bus the device is attached to, and the client is
> > > +expected to know the capabilities of each interrupt type. The server
> > > +can signal an interrupt either with VFIO_USER_VM_INTERRUPT messages
> > > +over the socket, or can directly inject interrupts into the guest via
> > > +an event file descriptor. The client configures how the server signals an
> > > +interrupt with VFIO_USER_DEVICE_SET_IRQS messages.
> > +
> > +Device Read and Write
> > +^^^^^^^^^^^^^^^^^^^^^
> > +When the guest executes load or store operations to device memory,
> > +the client
> 
> <linux/vfio.h> calls it "device regions", not "device memory".
> s/device memory/unmapped device regions/?
> 
> > > +forwards these operations to the server with VFIO_USER_REGION_READ or
> > > +VFIO_USER_REGION_WRITE messages. The server will reply with data from
> > > +the device on read operations or an acknowledgement on write
> > > +operations.
> > +
> > +DMA
> > +^^^
> > +When a device performs DMA accesses to guest memory, the server will
> > > +forward them to the client with VFIO_USER_DMA_READ and
> > > +VFIO_USER_DMA_WRITE messages.
> > +These messages can only be used to access guest memory the client has
> > +configured into the server.
> > +
> > +Protocol Specification
> > +======================
> > +To distinguish from the base VFIO symbols, all vfio-user symbols are
> > +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in
> > > +the little-endian format, although this may be relaxed in a future
> > > +revision in cases where the client and server are both big-endian.
> > > +The messages are formatted for seamless reuse of the native VFIO
> > > +structs.
> > +
> > +Socket
> > +------
> > +
> > +A server can serve:
> > +
> > +1) one or more clients, and/or
> > +2) one or more virtual devices, belonging to one or more clients.
> > +
> > +The current protocol specification requires a dedicated socket per
> > +client/server connection. It is a server-side implementation detail
> > +whether a single server handles multiple virtual devices from the
> > +same or multiple clients. The location of the socket is
> > +implementation-specific. Multiplexing clients, devices, and servers
> > +over the same socket is not supported in this version of the protocol.
> > +
> > +Authentication
> > +--------------
> > +For AF_UNIX, we rely on OS mandatory access controls on the socket
> > > +files, therefore it is up to the management layer to set up the socket as
> > > +required.
> > > +Socket types that span guests or hosts will require a proper
> > +authentication mechanism. Defining that mechanism is deferred to a
> > +future version of the protocol.
> > +
> > +Command Concurrency
> > +-------------------
> > +A client may pipeline multiple commands without waiting for previous
> > +command replies.  The server will process commands in the order they
> > +are received.  A consequence of this is if a client issues a command
> > > +with the *No_reply* bit, then subsequently issues a command without
> > > +*No_reply*, the older command will have been processed before the
> > > +reply to the younger command is sent by the server.  The client must
> > > +be aware of the device's capability to process concurrent commands if
> > > +pipelining is used.  For example, pipelining allows multiple client
> > > +threads to concurrently access device memory; the client must ensure
> > > +these acceses obey device semantics.
> 
> s/acceses/accesses/
> 
> > +
> > +An example is a frame buffer device, where the device may allow
> > +concurrent access to different areas of video memory, but may have
> > > +indeterminate behavior if concurrent accesses are performed to command
> > > +or status registers.
> > +
> > +Note that unrelated messages sent from the sevrer to the client can
> > +appear in
> 
> s/sevrer/server/
> 
> > +between a client to server request/reply and vice versa.
> > +
> > +Socket Disconnection Behavior
> > +-----------------------------
> > +The server and the client can disconnect from each other, either
> > +intentionally or unexpectedly. Both the client and the server need to
> > +know how to handle such events.
> > +
> > +Server Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +A server disconnecting from the client may indicate that:
> > +
> > > +1) A virtual device has been restarted, either intentionally (e.g. because
> > > +   of a device update) or unintentionally (e.g. because of a crash).
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client
> > > +should reset the VFIO device when it detects the socket has been
> > > +disconnected.
> > +Error recovery will be driven by the guest's device error handling
> > +behavior.
> > +
> > +Client Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +The client disconnecting from the server primarily means that the
> > +client has exited. Currently, this means that the guest is shut down
> > +so the device is no longer needed therefore the server can
> > > +automatically exit. However, there can be cases where a client
> > > +disconnection should not result in a server exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which is not yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the
> > +server should take no action when the client disconnects. If anything
> > +happens to the client the control stack will know about it and can
> > +clean up resources accordingly.
> 
> Also, hot unplug?
> 
> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained for
> future reconnection?
> 
> > +
> > +Request Retry and Response Timeout
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +A failed command is a command that has been successfully sent and has
> > +been responded to with an error code. Failure to send the command in
> > +the first place (e.g. because the socket is disconnected) is a
> > +different type of error examined earlier in the disconnect section.
> > +
> > +.. Note::
> > > +   QEMU's VFIO retries certain operations if they fail. While this makes
> > > +   sense for real HW, we don't know for sure whether it makes sense for
> > > +   virtual devices.
> > +
> > +Defining a retry and timeout scheme is deferred to a future version
> > +of the protocol.
> > +
> > +.. _Commands:
> > +
> > +Commands
> > +--------
> > +The following table lists the VFIO message command IDs, and whether
> > +the message command is sent from the client or the server.
> > +
> > ++------------------------------------+---------+-------------------+
> > +| Name                               | Command | Request Direction |
> >
> ++====================================+=========+=========
> ==========+
> > +| VFIO_USER_VERSION                  | 1       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_MAP                  | 2       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_UNMAP                | 3       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_INFO          | 4       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_REGION_INFO   | 5       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_REGION_IO_FDS | 6       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_IRQ_INFO      | 7       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_SET_IRQS          | 8       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_READ              | 9       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_WRITE             | 10      | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_READ                 | 11      | server -> client  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_WRITE                | 12      | server -> client  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_VM_INTERRUPT             | 13      | server -> client  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_RESET             | 14      | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DIRTY_PAGES              | 15      | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +
> > +
> > +.. Note:: Some VFIO defines cannot be reused since their values are
> > +   architecture-specific (e.g. VFIO_IOMMU_MAP_DMA).
> 
> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends
> VFIO_USER_VM_INTERRUPT before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages are
> processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete before
> the next pipelined message starts. It imposes a maximum issue depth of 1.
> Still, it seems like it would be relatively easy to hit re-entrancy or 
> deadlock
> issues since both the client and the server can initiate messages and may
> need to wait for a response.
> 
> > +
> > +Header
> > +------
> > > +All messages, both command messages and reply messages, are preceded
> > > +by a header that contains basic information about the message. The
> > > +header is followed by message-specific data described in the sections
> > > +below.
> > +
> > ++----------------+--------+-------------+
> > +| Name           | Offset | Size        |
> > ++================+========+=============+
> > +| Message ID     | 0      | 2           |
> > ++----------------+--------+-------------+
> > +| Command        | 2      | 2           |
> > ++----------------+--------+-------------+
> > +| Message size   | 4      | 4           |
> > ++----------------+--------+-------------+
> > +| Flags          | 8      | 4           |
> > ++----------------+--------+-------------+
> > +|                | +-----+------------+ |
> > +|                | | Bit | Definition | |
> > +|                | +=====+============+ |
> > +|                | | 0-3 | Type       | |
> > +|                | +-----+------------+ |
> > +|                | | 4   | No_reply   | |
> > +|                | +-----+------------+ |
> > +|                | | 5   | Error      | |
> > +|                | +-----+------------+ |
> > ++----------------+--------+-------------+
> > +| Error          | 12     | 4           |
> > ++----------------+--------+-------------+
> > +| <message data> | 16     | variable    |
> > ++----------------+--------+-------------+
> > +
> > > +* *Message ID* identifies the message, and is echoed in the command's reply
> > > +  message. Message IDs belong entirely to the sender, can be re-used (even
> > > +  concurrently) and the receiver must not make any assumptions about their
> > > +  uniqueness.
> > > +* *Command* specifies the command to be executed, listed in Commands_.
> > > +* *Message size* contains the size of the entire message, including the
> > > +  header.
> > +* *Flags* contains attributes of the message:
> > +
> > +  * The *Type* bits indicate the message type.
> > +
> > +    *  *Command* (value 0x0) indicates a command message.
> > > +    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
> > > +       command with the same message ID.
> > > +  * *No_reply* in a command message indicates that no reply is needed for this
> > > +    command. This is commonly used when multiple commands are sent, and only
> > > +    the last needs acknowledgement.
> > > +  * *Error* in a reply message indicates the command being acknowledged had
> > > +    an error. In this case, the *Error* field will be valid.
> > > +
> > > +* *Error* in a reply message is an optional UNIX errno value. It may be zero
> > > +  even if the Error bit is set in Flags. It is reserved in a command message.
> > +
> > +Each command message in Commands_ must be replied to with a reply
> > > +message, unless the message sets the *No_reply* bit.  The reply
> > +consists of the header with the *Reply* bit set, plus any additional data.
> > +
> > +If an error occurs, the reply message must only include the reply header.
> > +
> > +VFIO_USER_VERSION
> > +-----------------
> > +
> > > +This is the initial message sent by the client after the socket
> > > +connection is established:
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+-------------------------------------------+
> > +| Name         | Value                                     |
> >
> ++==============+=========================================
> ==+
> > +| Message ID   | <ID>                                      |
> > ++--------------+-------------------------------------------+
> > +| Command      | 1                                         |
> > ++--------------+-------------------------------------------+
> > +| Message size | 16 + version header + version data length |
> > ++--------------+-------------------------------------------+
> > +| Flags        | Reply bit set in reply                    |
> > ++--------------+-------------------------------------------+
> > +| Error        | 0/errno                                   |
> > ++--------------+-------------------------------------------+
> > +| Version      | version header                            |
> > ++--------------+-------------------------------------------+
> > +
> > +Version Header Format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++---------------+--------+------------------------------------------------+
> > +| Name          | Offset | Size                                           |
> >
> ++===============+========+===============================
> ============
> > ++=====+
> > +| version major | 16     | 2                                              |
> > ++---------------+--------+------------------------------------------------+
> > +| version minor | 18     | 2                                              |
> > ++---------------+--------+------------------------------------------------+
> > > +| version data  | 20     | variable (including terminating NUL            |
> > +|               |        | character). Optional.                          |
> > ++---------------+--------+------------------------------------------------+
> > +
> > +Version Data Format
> > +^^^^^^^^^^^^^^^^^^^
> > +
> > +The version data is an optional JSON byte array with the following format:
> 
> RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding
> says:
> 
>   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
> 
> Please indicate the character encoding. I guess it is always UTF-8?
> 
> > +
> > ++--------------------+------------------+-----------------------------------+
> > > +| Name               | Type             | Description                       |
> > > ++====================+==================+===================================+
> > > +| ``"capabilities"`` | collection of    | Contains common capabilities that |
> > > +|                    | name/value pairs | the sender supports. Optional.    |
> > > ++--------------------+------------------+-----------------------------------+
> > +
> > +Capabilities:
> > +
> > ++--------------------+------------------+-------------------------------------+
> > > +| Name               | Type             | Description                         |
> > > ++====================+==================+=====================================+
> > > +| ``"max_fds"``      | number           | Maximum number of file descriptors  |
> > > +|                    |                  | that can be received by the sender. |
> > > +|                    |                  | Optional. If not specified then the |
> > > +|                    |                  | receiver must assume                |
> > > +|                    |                  | ``"max_fds"=1``.                    |
> 
> Maximum per message? Please clarify and consider renaming it to
> max_msg_fds (it's also more consistent with max_msg_size).
> 
> > ++--------------------+------------------+-------------------------------------+
> > +| ``"max_msg_size"`` | number           | Maximum message size in bytes that  |
> > +|                    |                  | the receiver can handle, including  |
> > +|                    |                  | the header. Optional. If not        |
> > +|                    |                  | specified then the receiver must    |
> > +|                    |                  | assume ``"max_msg_size"=4096``.     |
> > ++--------------------+------------------+-------------------------------------+
> > +| ``"migration"``    | collection of    | Migration capability parameters. If |
> > +|                    | name/value pairs | missing then migration is not       |
> > +|                    |                  | supported by the sender.            |
> > ++--------------------+------------------+-------------------------------------+
> > +
> > +The migration capability contains the following name/value pairs:
> > +
> > ++--------------+--------+-----------------------------------------------+
> > +| Name         | Type   | Description                                   |
> > ++==============+========+===============================================+
> > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest |
> > +|              |        | between the client and the server is used.    |
> > ++--------------+--------+-----------------------------------------------+
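
A minimal sketch of how the version data described above might be built and
parsed, in Python for brevity. The helper names are hypothetical, UTF-8 is an
assumption (the draft does not name an encoding yet), and the key names are the
ones used in this revision, so "max_fds" may still be renamed.

```python
import json

def build_version_data(max_fds=1, max_msg_size=4096, pgsize=None):
    """Return NUL-terminated version data for a VFIO_USER_VERSION message."""
    capabilities = {"max_fds": max_fds, "max_msg_size": max_msg_size}
    if pgsize is not None:
        # Leaving out "migration" signals that migration is not supported.
        capabilities["migration"] = {"pgsize": pgsize}
    text = json.dumps({"capabilities": capabilities}, separators=(",", ":"))
    # Assumption: UTF-8 encoding, followed by the terminating NUL.
    return text.encode("utf-8") + b"\x00"

def parse_version_data(data):
    """Parse version data, filling in the defaults from the tables above."""
    caps = {}
    if data:  # the version data field is optional
        doc = json.loads(data.rstrip(b"\x00").decode("utf-8"))
        caps = doc.get("capabilities", {})
    return {
        "max_fds": caps.get("max_fds", 1),
        "max_msg_size": caps.get("max_msg_size", 4096),
        "migration": caps.get("migration"),  # None means unsupported
    }
```

The parser applies the defaults on the receiving side, which is where the
tables place the "must assume" requirement.
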
> 
> "in bytes"?
> 
> > +
> > +
> > +.. _Version:
> > +
> > +Versioning and Feature Support
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +Upon establishing a connection, the client must send a
> > +VFIO_USER_VERSION message proposing a protocol version and a set of
> > +capabilities. The server compares these with the versions and
> > +capabilities it supports and sends a VFIO_USER_VERSION reply according
> > +to the following rules.
> > +
> > +* The major version in the reply must be the same as proposed. If the client
> > +  does not support the proposed major, it closes the connection.
> > +* The minor version in the reply must be equal to or less than the minor
> > +  version proposed.
> > +* The capability list must be a subset of those proposed. If the server
> > +  requires a capability the client did not include, it closes the connection.
> 
> Does the server echo back all capabilities it has accepted so the client can
> still close the connection if it sees the server didn't accept a capability?
> 
> > +
> > +The protocol major version will only change when incompatible
> > +protocol changes are made, such as changing the message format. The
> > +minor version may change when compatible changes are made, such as
> > +adding new messages or capabilities. Both the client and server must
> > +support all minor versions less than the maximum minor version it
> > +supports. E.g., an implementation that supports version 1.3 must also
> > +support 1.0 through 1.2.
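
To make the negotiation rules above concrete, here is a rough sketch of the
server side in Python. The function name and arguments are hypothetical,
capabilities are reduced to bare names, and picking the lower of the two minor
versions is an assumption that merely satisfies the "equal to or less than"
rule.

```python
def negotiate(server_major, server_minor, server_caps, required_caps,
              client_major, client_minor, client_caps):
    """Return (major, minor, caps) for the reply, or None to close."""
    if client_major != server_major:
        return None  # the major version in the reply must match the proposal
    if not set(required_caps) <= set(client_caps):
        return None  # server requires a capability the client did not offer
    # Assumption: reply with the highest minor both sides support.
    minor = min(server_minor, client_minor)
    # The reply may only contain capabilities the client proposed.
    caps = sorted(set(client_caps) & set(server_caps))
    return (client_major, minor, caps)
```

Because each side must support every minor version below its maximum, the
lower of the two minors is always usable by both.
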
> > +
> > +When making a change to this specification, the protocol version
> > +number must be included in the form "added in version X.Y"
> > +
> > +
> > +VFIO_USER_DMA_MAP
> > +-----------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 2                      |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size        |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error        | 0/errno                |
> > ++--------------+------------------------+
> > +| Table        | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This command message is sent by the client to the server to inform it
> > +of the memory regions the server can access. It must be sent before
> > +the server can perform any DMA to the client. It is normally sent
> > +directly after the version handshake is completed, but may also occur
> > +when memory is added to the client, or if the client uses a vIOMMU.
> > +If the client does not expect the server to perform DMA then it does
> > +not need to send to the server VFIO_USER_DMA_MAP commands. If the
> > +server does not need to perform DMA then it can ignore such commands
> > +but it must still reply to them. The table is an array of the following
> > +structure:
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+-------------+
> > +| Name        | Offset | Size        |
> > ++=============+========+=============+
> > +| Address     | 0      | 8           |
> > ++-------------+--------+-------------+
> > +| Size        | 8      | 8           |
> > ++-------------+--------+-------------+
> > +| Offset      | 16     | 8           |
> > ++-------------+--------+-------------+
> > +| Protections | 24     | 4           |
> > ++-------------+--------+-------------+
> > +| Flags       | 28     | 4           |
> > ++-------------+--------+-------------+
> > +|             | +-----+------------+ |
> > +|             | | Bit | Definition | |
> > +|             | +=====+============+ |
> > +|             | | 0   | Mappable   | |
> > +|             | +-----+------------+ |
> > ++-------------+--------+-------------+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> 
> "in bytes"?
> 
> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> 
> Please be more specific. Does it only include PROT_READ and PROT_WRITE?
> What about PROT_EXEC?
> 
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *Mappable* indicates that the region can be mapped via the mmap() system
> > +    call using the file descriptor provided in the message meta-data.
> > +
> > +This structure is 32 bytes in size, so the message size is:
> > +16 + (# of table entries * 32).
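
As a quick check of the 32-byte entry layout and the size formula above, a
small Python sketch using the struct module. Little-endian byte order and the
numeric PROT_* values (the Linux ones) are assumptions; the helper names are
hypothetical.

```python
import struct

HEADER_SIZE = 16
# Address, Size, Offset (8 bytes each), then Protections, Flags (4 bytes each).
# Assumption: little-endian; byte order is not stated in this section.
DMA_MAP_ENTRY = struct.Struct("<QQQII")
PROT_READ, PROT_WRITE = 0x1, 0x2  # Linux <sys/mman.h> values
FLAG_MAPPABLE = 1 << 0

def pack_dma_map_table(regions):
    """regions: iterable of (address, size, offset, prot, flags) tuples."""
    return b"".join(DMA_MAP_ENTRY.pack(*region) for region in regions)

def dma_map_message_size(n_entries):
    """16-byte header plus one 32-byte entry per region."""
    return HEADER_SIZE + n_entries * DMA_MAP_ENTRY.size

table = pack_dma_map_table([
    (0x0, 0x1000, 0, PROT_READ | PROT_WRITE, FLAG_MAPPABLE),
    (0x100000, 0x2000, 0x1000, PROT_READ, 0),
])
```

The "<" prefix disables padding, so the packed entry is exactly the sum of the
field sizes in the table.
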
> > +
> > +If a DMA region being added can be directly mapped by the server, an
> > +array of file descriptors must be sent as part of the message
> > +meta-data. Each mappable region entry must have a corresponding file
> > +descriptor. On AF_UNIX sockets, the file descriptors must be passed
> > +as SCM_RIGHTS type ancillary data. Otherwise, if a DMA region cannot
> > +be directly mapped by the server, it can be accessed by the server
> > +using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained
> > +in `Read and Write Operations`_. A command to map over an existing
> > +region must be failed by the server with ``EEXIST`` set in error field in the
> > +reply.
> 
> Does this mean a vIOMMU update, like a protections bits change requires an
> unmap command followed by a map command? That is not an atomic
> operation but hopefully guests don't try to update a vIOMMU mapping while
> accessing it.

Correct, it's not an atomic operation. We could consider adding such an
operation if you think it would be useful.

> 
> By the way, this DMA mapping design is the eager mapping approach.
> Another approach is the lazy mapping approach where the server requests
> translations as necessary. The advantage is that the client does not have to
> send each mapping to the server. In the case of
> VFIO_USER_DMA_READ/WRITE no mappings need to be sent at all. Only
> mmaps need mapping messages.
> 
> > +Adding multiple DMA regions can partially fail. The response does not
> > +indicate which regions were added and which were not, therefore it is
> > +a client implementation detail how to recover from the failure.
> > +
> > +.. Note::
> > +   The server can optionally remove succesfully added DMA regions making this
> 
> s/succesfully/successfully/
> 
> > +   operation atomic.
> > +   The client can recover by attempting to unmap one by one all the DMA regions
> > +   in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do not
> > +   exist.
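
The recovery procedure in the note could look roughly like the sketch below.
The unmap callable is hypothetical (it stands for sending one
VFIO_USER_DMA_UNMAP and returning the errno from the reply), and ENOENT as the
"region does not exist" error is an assumption, since the text does not name
the error code.

```python
import errno

def recover_partial_map(regions, unmap):
    """Unmap every region of a failed VFIO_USER_DMA_MAP, one at a time.

    unmap(region) is a hypothetical callable returning the reply's errno
    (0 on success). Returns the (region, errno) pairs that failed for any
    reason other than the region not existing.
    """
    unexpected = []
    for region in regions:
        err = unmap(region)
        # Regions that were never added are expected to fail; ignore those.
        # Assumption: the server reports them with ENOENT.
        if err not in (0, errno.ENOENT):
            unexpected.append((region, err))
    return unexpected
```
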
> > +
> > +VFIO_USER_DMA_UNMAP
> > +-------------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 3                      |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size        |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error        | 0/errno                |
> > ++--------------+------------------------+
> > +| Table        | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This command message is sent by the client to the server to inform it
> > +that a DMA region, previously made available via a VFIO_USER_DMA_MAP
> > +command message, is no longer available for DMA. It typically occurs
> > +when memory is subtracted from the client or if the client uses a
> > +vIOMMU. If the client does not expect the server to perform DMA then
> > +it does not need to send to the server VFIO_USER_DMA_UNMAP commands.
> > +If the server does not need to perform DMA then it can ignore such
> > +commands but it must still reply to them. The table is an
> 
> I'm a little confused by the last two sentences about not sending or ignoring
> VFIO_USER_DMA_UNMAP. Does it mean that VFIO_USER_DMA_MAP does
> not need to be sent either when the device is known never to need DMA?
> 
> > +array of the following structure:
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++--------------+--------+---------------------------------------+
> > +| Name         | Offset | Size                                  |
> > ++==============+========+=======================================+
> > +| Address      | 0      | 8                                     |
> > ++--------------+--------+---------------------------------------+
> > +| Size         | 8      | 8                                     |
> > ++--------------+--------+---------------------------------------+
> > +| Offset       | 16     | 8                                     |
> > ++--------------+--------+---------------------------------------+
> > +| Protections  | 24     | 4                                     |
> > ++--------------+--------+---------------------------------------+
> > +| Flags        | 28     | 4                                     |
> > ++--------------+--------+---------------------------------------+
> > +|              | +-----+--------------------------------------+ |
> > +|              | | Bit | Definition                           | |
> > +|              | +=====+======================================+ |
> > +|              | | 0   | VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP | |
> > +|              | +-----+--------------------------------------+ |
> > ++--------------+--------+---------------------------------------+
> > +| VFIO Bitmaps | 32     | variable                              |
> > ++--------------+--------+---------------------------------------+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> 
> Why are offset and protections required for the unmap command?
> 
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page bitmap
> > +    must be populated before unmapping the DMA region. The client must provide
> > +    a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, with
> > +    the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized.
> > +
> > +* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region (explained
> > +  below) if ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags.
> 
> I'm confused, it's 1 "VFIO Bitmaps" per "Table entry". Why does it contain
> one struct vfio_bitmap per region? What is a "region" in this context?
> 
> > +
> > +.. _VFIO bitmap format:
> > +
> > +VFIO bitmap format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > +If the VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP bit is set in the
> > +request, the server must append to the header the ``struct
> > +vfio_bitmap`` received in the command followed by the bitmap, for
> > +each region. ``struct vfio_bitmap`` has the following format:
> > +
> > ++--------+--------+------+
> > +| Name   | Offset | Size |
> > ++========+========+======+
> > +| pgsize | 0      | 8    |
> > ++--------+--------+------+
> > +| size   | 8      | 8    |
> > ++--------+--------+------+
> > +| data   | 16     | 8    |
> > ++--------+--------+------+
> > +
> > +* *pgsize* is the page size for the bitmap, in bytes.
> > +* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
> > +* *data* This field is unused in vfio-user.
> > +
> > +The VFIO bitmap structure is defined in ``<linux/vfio.h>`` (``struct
> > +vfio_bitmap``).
> > +
> > +Each ``struct vfio_bitmap`` entry is followed by the region's bitmap.
> > +Each bit in the bitmap represents one page of size ``struct vfio_bitmap.pgsize``.
> > +
> > +If ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is not set in Flags then
> > +the size of the message is: 16 + (# of table entries * 32).
> > +If ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags then the
> > +size of the message is: 16 + (# of table entries * 56) + size of all bitmaps.
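
The two size formulas above can be worked through as follows. This is only a
sketch: the helper names are hypothetical, and rounding each bitmap up to a
whole byte is an assumption (the text only says one bit per page; the kernel
VFIO interface works in 64-bit units).

```python
HEADER_SIZE = 16
UNMAP_ENTRY_SIZE = 32   # Address, Size, Offset, Protections, Flags
VFIO_BITMAP_SIZE = 24   # pgsize, size, data (8 bytes each)

def bitmap_bytes(region_size, pgsize):
    """Bytes needed to hold one bit per page of the region.

    Assumption: the bitmap is rounded up to a whole number of bytes.
    """
    pages = -(-region_size // pgsize)  # ceiling division
    return -(-pages // 8)

def dma_unmap_message_size(region_sizes, pgsize=None):
    """region_sizes: region sizes in bytes, one per table entry.

    Passing pgsize models VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP being set.
    """
    if pgsize is None:
        return HEADER_SIZE + len(region_sizes) * UNMAP_ENTRY_SIZE
    per_entry = UNMAP_ENTRY_SIZE + VFIO_BITMAP_SIZE  # the 56 in the formula
    bitmaps = sum(bitmap_bytes(size, pgsize) for size in region_sizes)
    return HEADER_SIZE + len(region_sizes) * per_entry + bitmaps
```

For a 1 MiB and a 4 KiB region with 4 KiB pages, the bitmaps take 32 and 1
bytes, so the total grows with the amount of memory being unmapped.
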
> 
> 
> > +
> > +Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor
> > +is mapped then the server must release all references to that DMA
> > +region before replying, which includes potentially in flight DMA
> > +transactions. Removing a portion of a DMA region is possible.
> 
> "Removing a portion of a DMA region is possible"
> -> doing so splits a larger DMA region into one or two smaller remaining
> regions?
> 
> How do potentially large messages work around max_msg_size? It is hard for
> the client/server to anticipate the maximum message size that will be
> required ahead of time, so they can't really know if they will hit a situation
> where max_msg_size is too low.
> 
> > +
> > +VFIO_USER_DEVICE_GET_INFO
> > +-------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+----------------------------+
> > +| Name         | Value                      |
> > ++==============+============================+
> > +| Message ID   | <ID>                       |
> > ++--------------+----------------------------+
> > +| Command      | 4                          |
> > ++--------------+----------------------------+
> > +| Message size | 32                         |
> > ++--------------+----------------------------+
> > +| Flags        | Reply bit set in reply     |
> > ++--------------+----------------------------+
> > +| Error        | 0/errno                    |
> > ++--------------+----------------------------+
> > +| Device info  | VFIO device info           |
> > ++--------------+----------------------------+
> > +
> > +This command message is sent by the client to the server to query for
> > +basic information about the device. The VFIO device info structure is
> > +defined in ``<linux/vfio.h>`` (``struct vfio_device_info``).
> 
> Wait, "VFIO device info format" below is missing the cap_offset field, so it's
> not exactly the same as <linux/vfio.h>?
> 
> > +
> > +VFIO device info format
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+--------------------------+
> > +| Name        | Offset | Size                     |
> > ++=============+========+==========================+
> > +| argsz       | 16     | 4                        |
> > ++-------------+--------+--------------------------+
> > +| flags       | 20     | 4                        |
> > ++-------------+--------+--------------------------+
> > +|             | +-----+-------------------------+ |
> > +|             | | Bit | Definition              | |
> > +|             | +=====+=========================+ |
> > +|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
> > +|             | +-----+-------------------------+ |
> > +|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
> > +|             | +-----+-------------------------+ |
> > ++-------------+--------+--------------------------+
> > +| num_regions | 24     | 4                        |
> > ++-------------+--------+--------------------------+
> > +| num_irqs    | 28     | 4                        |
> > ++-------------+--------+--------------------------+
> > +
> > +* *argsz* is the size of the VFIO device info structure. This is the
> > +only field that should be set to non-zero in the request, identifying
> > +the client's expected size. Currently this is a fixed value.
> > +* *flags* contains the following device attributes.
> > +
> > +  * VFIO_DEVICE_FLAGS_RESET indicates that the device supports the
> > +    VFIO_USER_DEVICE_RESET message.
> > +  * VFIO_DEVICE_FLAGS_PCI indicates that the device is a PCI device.
> > +
> > +* *num_regions* is the number of memory regions that the device exposes.
> > +* *num_irqs* is the number of distinct interrupt types that the device supports.
> > +
> > +This version of the protocol only supports PCI devices. Additional
> > +devices may be supported in future versions.
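
For reference, decoding the device info layout as tabulated above (four 32-bit
fields at offsets 16 to 31, so without cap_offset) might look like this in
Python. Little-endian byte order is an assumption and the helper names are
hypothetical; the flag bits follow the table.

```python
import struct

HEADER_SIZE = 16
# argsz, flags, num_regions, num_irqs: four 32-bit fields after the header.
# Assumption: little-endian; byte order is not stated in this section.
DEVICE_INFO = struct.Struct("<IIII")
VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1

def build_device_info_request_payload():
    """Request payload: only argsz is non-zero."""
    return DEVICE_INFO.pack(DEVICE_INFO.size, 0, 0, 0)

def parse_device_info(message):
    """Decode the device info that follows the 16-byte header in a reply."""
    argsz, flags, num_regions, num_irqs = DEVICE_INFO.unpack_from(
        message, HEADER_SIZE)
    return {
        "argsz": argsz,
        "supports_reset": bool(flags & VFIO_DEVICE_FLAGS_RESET),
        "is_pci": bool(flags & VFIO_DEVICE_FLAGS_PCI),
        "num_regions": num_regions,
        "num_irqs": num_irqs,
    }
```

The header plus these 16 bytes gives the fixed message size of 32 listed in the
message format table.
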
> 
> I've reviewed up to here so far.
> 
> Stefan