> -----Original Message-----
> From: Stefan Hajnoczi <stefa...@redhat.com>
> Sent: 04 May 2021 14:52
> To: Thanos Makatos <thanos.maka...@nutanix.com>
> Cc: qemu-devel@nongnu.org; John Levon <le...@movementarian.org>;
> John G Johnson <john.g.john...@oracle.com>;
> benjamin.wal...@intel.com; Elena Ufimtseva
> <elena.ufimts...@oracle.com>; jag.ra...@oracle.com;
> james.r.har...@intel.com; Swapnil Ingle <swapnil.in...@nutanix.com>;
> konrad.w...@oracle.com; alex.william...@redhat.com;
> yuvalkash...@gmail.com; tina.zh...@intel.com;
> marcandre.lur...@redhat.com; ism...@linux.com;
> kanth.ghatr...@oracle.com; Felipe Franciosi <fel...@nutanix.com>;
> xiuchun...@intel.com; tomassetti.and...@gmail.com; Raphael Norwitz
> <raphael.norw...@nutanix.com>; changpeng....@intel.com;
> dgilb...@redhat.com; Yan Zhao <yan.y.z...@intel.com>; Michael S . Tsirkin
> <m...@redhat.com>; Gerd Hoffmann <kra...@redhat.com>; Christophe de
> Dinechin <cdupo...@redhat.com>; Jason Wang <jasow...@redhat.com>;
> Cornelia Huck <coh...@redhat.com>; Kirti Wankhede
> <kwankh...@nvidia.com>; Paolo Bonzini <pbonz...@redhat.com>;
> mpiszc...@ddn.com; John Levon <john.le...@nutanix.com>
> Subject: Re: [PATCH v8] introduce vfio-user protocol specification
>
> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be
> > emulated outside QEMU, in a separate process. vfio-user reuses the
> > existing VFIO defines, structs and concepts.
> >
> > It has been earlier discussed as an RFC in:
> > "RFC: use VFIO over a UNIX domain socket to implement device offloading"
> >
> > Signed-off-by: John G Johnson <john.g.john...@oracle.com>
> > Signed-off-by: Thanos Makatos <thanos.maka...@nutanix.com>
> > Signed-off-by: John Levon <john.le...@nutanix.com>
> >
> > ---
> >
> > Changed since v1:
> > * fix coding style issues
> > * update MAINTAINERS for VFIO-over-socket
> > * add vfio-over-socket to ToC
> >
> > Changed since v2:
> > * fix whitespace
> >
> > Changed since v3:
> > * rename protocol to vfio-user
> > * add table of contents
> > * fix Unicode problems
> > * fix typos and various reStructuredText issues
> > * various stylistic improvements
> > * add backend program conventions
> > * rewrite part of intro, drop QEMU-specific stuff
> > * drop QEMU-specific paragraph about implementation
> > * explain that passing of FDs isn't necessary
> > * minor improvements in the VFIO section
> > * various text substitutions for the sake of consistency
> > * drop paragraph about client and server, already explained in
> >   intro
> > * drop device ID
> > * drop type from version
> > * elaborate on request concurrency
> > * convert some inessential paragraphs into notes
> > * explain why some existing VFIO defines cannot be reused
> > * explain how to make changes to the protocol
> > * improve text of DMA map
> > * reword comment about existing VFIO commands
> > * add reference to Version section
> > * reset device on disconnection
> > * reword live migration section
> > * replace sys/vfio.h with linux/vfio.h
> > * drop reference to iovec
> > * use argz the same way it is used in VFIO
> > * add type field in header for clarity
> >
> > Changed since v4:
> > * introduce support for live migration as defined in
> >   include/uapi/linux/vfio.h
> > * introduce 'max_fds' and 'migration' capabilities
> > * remove 'index' from VFIO_USER_DEVICE_GET_IRQ_INFO
> > * fix minor typos and reworded some text for clarity
> >
> > Changed since v5:
> > * fix minor typos
> > * separate VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP
> > * clarify meaning of VFIO bitmap size field
> > * move version major/minor outside JSON
> > * client proposes version first
> > * make Errno optional in message header
> > * clarification about message ID uniqueness
> > * clarify that server->client request can appear in between
> > client->server request/reply
> >
> > Changed since v6:
> > * put JSON strings in double quotes
> > * clarify reply behavior on error
> > * introduce max message size capability
> > * clarify semantics when failing to map multiple DMA regions in a
> > single command
> >
> > Changed since v7:
> > * client proposes version instead of server
> > * support ioeventfd and ioregionfd for unmapped regions
> > * reword struct vfio_bitmap for clarity
> > * clarify use of argsz in VFIO device info
> > * allow individual IRQs to be disabled
> > ---
> > MAINTAINERS | 7 +
> > docs/devel/index.rst | 1 +
> > docs/devel/vfio-user.rst | 1854 ++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 1862 insertions(+)
> > create mode 100644 docs/devel/vfio-user.rst
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 36055f14c5..bd1194002b 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1849,6 +1849,13 @@ F: hw/vfio/ap.c
> > F: docs/system/s390x/vfio-ap.rst
> > L: qemu-s3...@nongnu.org
> >
> > +vfio-user
> > +M: John G Johnson <john.g.john...@oracle.com>
> > +M: Thanos Makatos <thanos.maka...@nutanix.com>
> > +M: John Levon <john.le...@nutanix.com>
> > +S: Supported
> > +F: docs/devel/vfio-user.rst
> > +
> > vhost
> > M: Michael S. Tsirkin <m...@redhat.com>
> > S: Supported
> > diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> > index 6cf7e2d233..7d1ea63e02 100644
> > --- a/docs/devel/index.rst
> > +++ b/docs/devel/index.rst
> > @@ -42,3 +42,4 @@ Contents:
> > qom
> > block-coroutine-wrapper
> > multi-process
> > + vfio-user
> > diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
> > new file mode 100644
> > index 0000000000..b3498eec02
> > --- /dev/null
> > +++ b/docs/devel/vfio-user.rst
> > @@ -0,0 +1,1854 @@
> > +.. include:: <isonum.txt>
> > +
> > +********************************
> > +vfio-user Protocol Specification
> > +********************************
> > +
> > +------------
> > +Version_ 0.1
> > +------------
> > +
> > +.. contents:: Table of Contents
> > +
> > +Introduction
> > +============
> > +vfio-user is a protocol that allows a device to be emulated in a
> > +separate process outside of a Virtual Machine Monitor (VMM).
> > +vfio-user devices consist of a generic VFIO device type, living
> > +inside the VMM, which we call the client, and the core device
> > +implementation, living outside the VMM, which we call the server.
> > +
> > +The `Linux VFIO ioctl interface
> > +<https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_
> > +has been chosen as the base for this protocol for the following reasons:
> > +
> > +1) It is a mature and stable API, backed by an extensively used framework.
> > +2) The existing VFIO client implementation in QEMU (qemu/hw/vfio/) can be
> > +   largely reused.
> > +
> > +.. Note::
> > +   In a proof of concept implementation it has been demonstrated that using
> > +   VFIO over a UNIX domain socket is a viable option. vfio-user is designed
> > +   with QEMU in mind, however it could be used by other client applications.
> > +   The vfio-user protocol does not require that QEMU's VFIO client
> > +   implementation is used in QEMU.
> > +
> > +None of the VFIO kernel modules are required for supporting the
> > +protocol, neither in the client nor the server, only the source header
> > +files are used.
> > +
> > +The main idea is to allow a virtual device to function in a separate
> > +process in the same host over a UNIX domain socket. A UNIX domain
> > +socket (AF_UNIX) is chosen because file descriptors can be trivially
> > +sent over it, which in turn
> > +allows:
> > +
> > +* Sharing of client memory for DMA with the server.
> > +* Sharing of server memory with the client for fast MMIO.
> > +* Efficient sharing of eventfd's for triggering interrupts.
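
For readers who want to see the fd-passing idea in action, here is a rough sketch. It is in Python (3.9 or later, for socket.send_fds/recv_fds) purely for illustration and is not taken from the patch: socketpair() stands in for the vfio-user socket, a temporary file stands in for client memory, and share_memory_over_socket is a made-up name.

```python
import mmap
import os
import socket
import tempfile

PAGE = mmap.PAGESIZE


def share_memory_over_socket() -> bytes:
    """Pass a file descriptor over AF_UNIX and map it on both sides."""
    # socketpair() stands in for a connected client/server pair (assumption).
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as backing:
        backing.truncate(PAGE)  # one page standing in for client memory
        client_view = mmap.mmap(backing.fileno(), PAGE)

        # The descriptor travels as ancillary data next to a one-byte payload.
        socket.send_fds(client, [b"\0"], [backing.fileno()])
        _data, fds, _flags, _addr = socket.recv_fds(server, 1, 1)

        server_view = mmap.mmap(fds[0], PAGE)
        os.close(fds[0])  # the mapping stays usable after the fd is closed

        server_view[:4] = b"DMA!"      # the "server" writes directly...
        seen = bytes(client_view[:4])  # ...and the "client" observes the bytes

        server_view.close()
        client_view.close()
    client.close()
    server.close()
    return seen
```

Both mappings refer to the same file, so no data crosses the socket after the descriptor has been handed over.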
> > +
> > +Other socket types could be used which allow the server to run in a
> > +separate guest in the same host (AF_VSOCK) or remotely (AF_INET).
> > +Theoretically the underlying transport does not necessarily have to
> > +be a socket, however we do not examine such alternatives. In this
> > +protocol version we focus on using a UNIX domain socket and introduce
> > +basic support for the other two types of sockets without considering
> > +performance implications.
> > +
> > +While passing of file descriptors is desirable for performance
> > +reasons, it is not necessary neither for the client nor for the
> > +server to support it in order
>
> Double negative. "not" can be removed.
>
> > +to implement the protocol. There is always an in-band,
> > +message-passing fall back mechanism.
> > +
> > +VFIO
> > +====
> > +VFIO is a framework that allows a physical device to be securely
> > +passed through to a user space process; the device-specific kernel
> > +driver does not drive the device at all. Typically, the user space
> > +process is a VMM and the device is passed through to it in order to
> > +achieve high performance. VFIO provides an API and the required
> > +functionality in the kernel. QEMU has adopted VFIO to allow a guest
> > +to directly access physical devices, instead of emulating them in software.
> > +
> > +vfio-user reuses the core VFIO concepts defined in its API, but
> > +implements them as messages to be sent over a socket. It does not
> > +change the kernel-based VFIO in any way, in fact none of the VFIO
> > +kernel modules need to be loaded to use vfio-user. It is also
> > +possible for the client to concurrently use the current kernel-based VFIO
> > +for one device, and vfio-user for another device.
> > +
> > +VFIO Device Model
> > +-----------------
> > +A device under VFIO presents a standard interface to the user
> > +process. Many of the VFIO operations in the existing interface use
> > +the ioctl() system call, and references to the existing interface are
> > +called the ioctl() implementation in this document.
> > +
> > +The following sections describe the set of messages that implement
> > +the VFIO interface over a socket. In many cases, the messages are
> > +direct translations of data structures used in the ioctl()
> > +implementation. Messages derived from ioctl()s will have a name
> > +derived from the ioctl() command name. E.g., the VFIO_DEVICE_GET_INFO
> > +ioctl() command becomes a VFIO_USER_DEVICE_GET_INFO message. The
> > +purpose of this reuse is to share as much code as feasible with the
> > +ioctl() implementation.
> > +
> > +Connection Initiation
> > +^^^^^^^^^^^^^^^^^^^^^
> > +After the client connects to the server, the initial client message
> > +is VFIO_USER_VERSION to propose a protocol version and set of
> > +capabilities to apply to the session. The server replies with a
> > +compatible version and set of capabilities it supports, or closes the
> > +connection if it cannot support the advertised version.
> > +
> > +DMA Memory Configuration
> > +^^^^^^^^^^^^^^^^^^^^^^^^
> > +The client uses VFIO_USER_DMA_MAP and VFIO_USER_DMA_UNMAP messages to
> > +inform the server of the valid DMA ranges that the server can access
> > +on behalf of a device. DMA memory may be accessed by the server via
> > +VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages over the socket.
> > +
> > +An optimization for server access to client memory is for the client
> > +to provide file descriptors the server can mmap() to directly access
> > +client memory. Note that mmap() privileges cannot be revoked by the
> > +client, therefore file descriptors should only be exported in
> > +environments where the client trusts the server not to corrupt guest
> > +memory.
> > +
> > +Device Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses a VFIO_USER_DEVICE_GET_INFO message to query the
> > +server for information about the device. This information includes:
> > +
> > +* The device type and whether it supports reset
> > +  (``VFIO_DEVICE_FLAGS_``),
> > +* the number of device regions, and
> > +* the number of interrupt types the device supports.
> > +
> > +Region Information
> > +^^^^^^^^^^^^^^^^^^
> > +The client uses VFIO_USER_DEVICE_GET_REGION_INFO messages to query
> > +the server for information about the device's memory regions. This
> > +information describes:
> > +
> > +* Read and write permissions, whether it can be memory mapped, and
> > +  whether it supports additional capabilities
> > +  (``VFIO_REGION_INFO_CAP_``).
> > +* Region index, size, and offset.
> > +
> > +When a region can be mapped by the client, the server provides a file
> > +descriptor which the client can mmap(). The server is responsible for
> > +polling for client updates to memory mapped regions.
> > +
> > +Region Capabilities
> > +"""""""""""""""""""
> > +Some regions have additional capabilities that cannot be described
> > +adequately by the region info data structure. These capabilities are
> > +returned in the region info reply in a list similar to PCI
> > +capabilities in a PCI device's configuration space.
> > +
> > +Sparse Regions
> > +""""""""""""""
> > +A region can be memory-mappable in whole or in part. When only a
> > +subset of a region can be mapped by the client, a
> > +VFIO_REGION_INFO_CAP_SPARSE_MMAP capability is included in the region
> > +info reply. This capability describes which portions can be mapped by the
> > +client.
> > +
> > +.. Note::
> > +   For example, in a virtual NVMe controller, sparse regions can be used so
> > +   that accesses to the NVMe registers (found in the beginning of BAR0) are
> > +   trapped (an infrequent event), while allowing direct access to the
> > +   doorbells (an extremely frequent event as every I/O submission requires a
> > +   write to BAR0), found right after the NVMe registers in BAR0.
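
To make the sparse-region idea concrete, a client could decide per access whether it may go through an mmap()ed area or has to be sent as a message. The sketch below is illustrative only: the (offset, size) list is a simplified stand-in for what the sparse mmap capability reports, the BAR0 layout is hypothetical, and can_access_directly is not a name from the spec.

```python
from typing import List, Tuple

Area = Tuple[int, int]  # (offset, size) of one mmap-able area within a region


def can_access_directly(areas: List[Area], offset: int, length: int) -> bool:
    """True if [offset, offset + length) lies entirely inside one mappable area."""
    return any(start <= offset and offset + length <= start + size
               for start, size in areas)


# Hypothetical BAR0 layout: registers in the first 4 KiB (trapped),
# doorbells in the following 4 KiB (mappable).
BAR0_MAPPABLE = [(0x1000, 0x1000)]
```

With this layout a 4-byte access at 0x14 falls in the trapped register area, while one at 0x1000 can hit the mapping directly.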
> > +
> > +Device-Specific Regions
> > +"""""""""""""""""""""""
> > +
> > +A device can define regions additional to the standard ones (e.g. PCI
> > +indexes 0-8). This is achieved by including a
> > +VFIO_REGION_INFO_CAP_TYPE capability in the region info reply of a
> > +device-specific region. Such regions are reflected in ``struct
> > +vfio_device_info.num_regions``. Thus, for PCI devices this value can be
> > +equal to, or higher than, VFIO_PCI_NUM_REGIONS.
> > +
> > +Region I/O via file descriptors
> > +-------------------------------
> > +
> > +For unmapped regions, region I/O from the client is done via
> > +VFIO_USER_REGION_READ/WRITE. As an optimization, ioeventfds or
> > +ioregionfds may be configured for sub-regions of some regions. A
> > +client may request information on these sub-regions via
> > +VFIO_USER_DEVICE_GET_REGION_IO_FDS; by configuring the returned file
> > +descriptors as ioeventfds or ioregionfds, the server can be directly
> > +notified of I/O (for example, by KVM) without taking a trip through the
> > +client.
> > +
> > +Interrupts
> > +^^^^^^^^^^
> > +The client uses VFIO_USER_DEVICE_GET_IRQ_INFO messages to query the
> > +server for the device's interrupt types. The interrupt types are
> > +specific to the bus the device is attached to, and the client is
> > +expected to know the capabilities of each interrupt type. The server
> > +can signal an interrupt either with VFIO_USER_VM_INTERRUPT messages
> > +over the socket, or can directly inject interrupts into the guest via
> > +an event file descriptor. The client configures how the server signals an
> > +interrupt with VFIO_USER_DEVICE_SET_IRQS messages.
> > +
> > +Device Read and Write
> > +^^^^^^^^^^^^^^^^^^^^^
> > +When the guest executes load or store operations to device memory,
> > +the client
>
> <linux/vfio.h> calls it "device regions", not "device memory".
> s/device memory/unmapped device regions/?
>
> > +forwards these operations to the server with VFIO_USER_REGION_READ or
> > +VFIO_USER_REGION_WRITE messages. The server will reply with data from
> > +the device on read operations or an acknowledgement on write
> > +operations.
> > +
> > +DMA
> > +^^^
> > +When a device performs DMA accesses to guest memory, the server will
> > +forward them to the client with VFIO_USER_DMA_READ and
> > +VFIO_USER_DMA_WRITE messages.
> > +These messages can only be used to access guest memory the client has
> > +configured into the server.
> > +
> > +Protocol Specification
> > +======================
> > +To distinguish from the base VFIO symbols, all vfio-user symbols are
> > +prefixed with vfio_user or VFIO_USER. In revision 0.1, all data is in
> > +the little-endian format, although this may be relaxed in a future
> > +revision in cases where the client and server are both big-endian.
> > +The messages are formatted for seamless reuse of the native VFIO
> > +structs.
> > +
> > +Socket
> > +------
> > +
> > +A server can serve:
> > +
> > +1) one or more clients, and/or
> > +2) one or more virtual devices, belonging to one or more clients.
> > +
> > +The current protocol specification requires a dedicated socket per
> > +client/server connection. It is a server-side implementation detail
> > +whether a single server handles multiple virtual devices from the
> > +same or multiple clients. The location of the socket is
> > +implementation-specific. Multiplexing clients, devices, and servers
> > +over the same socket is not supported in this version of the protocol.
> > +
> > +Authentication
> > +--------------
> > +For AF_UNIX, we rely on OS mandatory access controls on the socket
> > +files, therefore it is up to the management layer to set up the socket as
> > +required. Socket types that span guests or hosts will require a proper
> > +authentication mechanism. Defining that mechanism is deferred to a
> > +future version of the protocol.
> > +
> > +Command Concurrency
> > +-------------------
> > +A client may pipeline multiple commands without waiting for previous
> > +command replies. The server will process commands in the order they
> > +are received. A consequence of this is if a client issues a command
> > +with the *No_reply* bit, then subsequently issues a command without
> > +*No_reply*, the older command will have been processed before the
> > +reply to the younger command is sent by the server. The client must
> > +be aware of the device's capability to process concurrent commands if
> > +pipelining is used. For example, pipelining allows multiple client
> > +threads to concurrently access device memory; the client must ensure
> > +these acceses obey device semantics.
>
> s/acceses/accesses/
>
> > +
> > +An example is a frame buffer device, where the device may allow
> > +concurrent access to different areas of video memory, but may have
> > +indeterminate behavior if concurrent accesses are performed to command
> > +or status registers.
> > +
> > +Note that unrelated messages sent from the sevrer to the client can
> > +appear in
>
> s/sevrer/server/
>
> > +between a client to server request/reply and vice versa.
> > +
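
One way to read the interleaving rule above is that a side blocked on a reply still has to service commands arriving from its peer. The following is only a sketch under simplifying assumptions: messages are already parsed into (type, message ID, payload) tuples, the deque stands in for the socket, and wait_for_reply is a hypothetical helper. It does not by itself answer the deadlock question raised further down.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple

TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1

Message = Tuple[int, int, bytes]  # (type, message ID, payload), already parsed


def wait_for_reply(incoming: Deque[Message], msg_id: int,
                   on_command: Callable[[int, bytes], None],
                   stray_replies: Dict[int, bytes]) -> bytes:
    """Return the reply to msg_id, servicing peer commands seen on the way."""
    while incoming:
        msg_type, mid, payload = incoming.popleft()
        if msg_type == TYPE_COMMAND:
            on_command(mid, payload)      # e.g. an interrupt sent by the server
        elif mid == msg_id:
            return payload                # the reply we were waiting for
        else:
            stray_replies[mid] = payload  # reply to another pipelined command
    raise ConnectionError("peer went away before replying")
```

Matching on the message ID is safe here because the waiting side chose that ID itself; it makes no assumption about IDs picked by the peer.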
> > +Socket Disconnection Behavior
> > +-----------------------------
> > +The server and the client can disconnect from each other, either
> > +intentionally or unexpectedly. Both the client and the server need to
> > +know how to handle such events.
> > +
> > +Server Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because
> > +   of a device update) or unintentionally (e.g. because of a crash).
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client
> > +should reset the VFIO device when it detects the socket has been
> > +disconnected. Error recovery will be driven by the guest's device error
> > +handling behavior.
> > +
> > +Client Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +The client disconnecting from the server primarily means that the
> > +client has exited. Currently, this means that the guest is shut down,
> > +so the device is no longer needed and the server can automatically
> > +exit. However, there can be cases where a client disconnection should
> > +not result in a server exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which is not yet
> > + implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the
> > +server should take no action when the client disconnects. If anything
> > +happens to the client the control stack will know about it and can
> > +clean up resources accordingly.
>
> Also, hot unplug?
>
> Does anything need to be said about mmaps and file descriptors on
> disconnect? I guess they need to be cleaned up and are not retained for
> future reconnection?
>
> > +
> > +Request Retry and Response Timeout
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +A failed command is a command that has been successfully sent and has
> > +been responded to with an error code. Failure to send the command in
> > +the first place (e.g. because the socket is disconnected) is a
> > +different type of error examined earlier in the disconnect section.
> > +
> > +.. Note::
> > +   QEMU's VFIO retries certain operations if they fail. While this makes
> > +   sense for real HW, we don't know for sure whether it makes sense for
> > +   virtual devices.
> > +
> > +Defining a retry and timeout scheme is deferred to a future version
> > +of the protocol.
> > +
> > +.. _Commands:
> > +
> > +Commands
> > +--------
> > +The following table lists the VFIO message command IDs, and whether
> > +the message command is sent from the client or the server.
> > +
> > ++------------------------------------+---------+-------------------+
> > +| Name                               | Command | Request Direction |
> > ++====================================+=========+===================+
> > +| VFIO_USER_VERSION                  | 1       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_MAP                  | 2       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_UNMAP                | 3       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_INFO          | 4       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_REGION_INFO   | 5       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_REGION_IO_FDS | 6       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_GET_IRQ_INFO      | 7       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_SET_IRQS          | 8       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_READ              | 9       | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_REGION_WRITE             | 10      | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_READ                 | 11      | server -> client  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DMA_WRITE                | 12      | server -> client  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_VM_INTERRUPT             | 13      | server -> client  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DEVICE_RESET             | 14      | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +| VFIO_USER_DIRTY_PAGES              | 15      | client -> server  |
> > ++------------------------------------+---------+-------------------+
> > +
> > +
> > +.. Note:: Some VFIO defines cannot be reused since their values are
> > + architecture-specific (e.g. VFIO_IOMMU_MAP_DMA).
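
As an illustration of how an implementation might carry the command table around, here is a small sketch. The numeric values and directions are copied from the table above; the Python names (VfioUserCommand, request_direction) are invented for this example and are not part of the specification.

```python
from enum import IntEnum


class VfioUserCommand(IntEnum):
    """Command IDs as listed in the table above."""
    VERSION = 1
    DMA_MAP = 2
    DMA_UNMAP = 3
    DEVICE_GET_INFO = 4
    DEVICE_GET_REGION_INFO = 5
    DEVICE_GET_REGION_IO_FDS = 6
    DEVICE_GET_IRQ_INFO = 7
    DEVICE_SET_IRQS = 8
    REGION_READ = 9
    REGION_WRITE = 10
    DMA_READ = 11
    DMA_WRITE = 12
    VM_INTERRUPT = 13
    DEVICE_RESET = 14
    DIRTY_PAGES = 15


# The only requests the table lists as originating from the server.
SERVER_INITIATED = frozenset({
    VfioUserCommand.DMA_READ,
    VfioUserCommand.DMA_WRITE,
    VfioUserCommand.VM_INTERRUPT,
})


def request_direction(cmd: VfioUserCommand) -> str:
    """Return the request direction in the table's own notation."""
    return "server -> client" if cmd in SERVER_INITIATED else "client -> server"
```

A receiver could use such a lookup to reject a command that arrives from the wrong side.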
>
> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends
> VFIO_USER_VM_INTERRUPT before replying to the write message.
>
> Multi-threaded clients and servers could end up deadlocking if messages are
> processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
>
> Pipelining has the nice effect that the oldest message must complete before
> the next pipelined message starts. It imposes a maximum issue depth of 1.
> Still, it seems like it would be relatively easy to hit re-entrancy or
> deadlock issues since both the client and the server can initiate messages
> and may need to wait for a response.
>
> > +
> > +Header
> > +------
> > +All messages, both command messages and reply messages, are preceded
> > +by a header that contains basic information about the message. The
> > +header is followed by message-specific data described in the sections
> > +below.
> > +
> > ++----------------+--------+-------------+
> > +| Name           | Offset | Size        |
> > ++================+========+=============+
> > +| Message ID     | 0      | 2           |
> > ++----------------+--------+-------------+
> > +| Command        | 2      | 2           |
> > ++----------------+--------+-------------+
> > +| Message size   | 4      | 4           |
> > ++----------------+--------+-------------+
> > +| Flags          | 8      | 4           |
> > ++----------------+--------+-------------+
> > +|                | +-----+------------+ |
> > +|                | | Bit | Definition | |
> > +|                | +=====+============+ |
> > +|                | | 0-3 | Type       | |
> > +|                | +-----+------------+ |
> > +|                | | 4   | No_reply   | |
> > +|                | +-----+------------+ |
> > +|                | | 5   | Error      | |
> > +|                | +-----+------------+ |
> > ++----------------+--------+-------------+
> > +| Error          | 12     | 4           |
> > ++----------------+--------+-------------+
> > +| <message data> | 16     | variable    |
> > ++----------------+--------+-------------+
> > +
> > +* *Message ID* identifies the message, and is echoed in the command's reply
> > +  message. Message IDs belong entirely to the sender, can be re-used (even
> > +  concurrently) and the receiver must not make any assumptions about their
> > +  uniqueness.
> > +* *Command* specifies the command to be executed, listed in Commands_.
> > +* *Message size* contains the size of the entire message, including the
> > +  header.
> > +* *Flags* contains attributes of the message:
> > +
> > +  * The *Type* bits indicate the message type.
> > +
> > +    * *Command* (value 0x0) indicates a command message.
> > +    * *Reply* (value 0x1) indicates a reply message acknowledging a previous
> > +      command with the same message ID.
> > +
> > +  * *No_reply* in a command message indicates that no reply is needed for
> > +    this command. This is commonly used when multiple commands are sent, and
> > +    only the last needs acknowledgement.
> > +  * *Error* in a reply message indicates the command being acknowledged had
> > +    an error. In this case, the *Error* field will be valid.
> > +
> > +* *Error* in a reply message is an optional UNIX errno value. It may be zero
> > +  even if the Error bit is set in Flags. It is reserved in a command message.
> > +
> > +Each command message in Commands_ must be replied to with a reply
> > +message, unless the message sets the *No_reply* bit. The reply
> > +consists of the header with the *Reply* bit set, plus any additional data.
> > +
> > +If an error occurs, the reply message must only include the reply header.
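
To check my reading of the header layout, here is how the 16-byte header could be packed and parsed. This is a sketch, not reference code: it assumes little-endian fields as stated for revision 0.1, assumes bit 0 of Flags is the least significant bit, and uses made-up names (pack_message, unpack_header, Header).

```python
import struct
from typing import NamedTuple

# Message ID (2), Command (2), Message size (4), Flags (4), Error (4)
HEADER = struct.Struct("<HHIII")

TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1
TYPE_MASK = 0xF          # Flags bits 0-3
FLAG_NO_REPLY = 1 << 4   # Flags bit 4
FLAG_ERROR = 1 << 5      # Flags bit 5


class Header(NamedTuple):
    msg_id: int
    command: int
    size: int
    flags: int
    error: int

    @property
    def msg_type(self) -> int:
        return self.flags & TYPE_MASK


def pack_message(msg_id: int, command: int, payload: bytes = b"", *,
                 msg_type: int = TYPE_COMMAND, no_reply: bool = False,
                 failed: bool = False, error: int = 0) -> bytes:
    """Prefix payload with a header; Message size covers header plus payload."""
    flags = msg_type & TYPE_MASK
    if no_reply:
        flags |= FLAG_NO_REPLY
    if failed:
        flags |= FLAG_ERROR
    return HEADER.pack(msg_id, command, HEADER.size + len(payload),
                       flags, error) + payload


def unpack_header(buf: bytes) -> Header:
    """Parse the first 16 bytes of a message."""
    return Header(*HEADER.unpack_from(buf))
```

The Error bit and the Error field are kept as separate arguments because the text allows the errno value to be zero even when the bit is set.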
> > +
> > +VFIO_USER_VERSION
> > +-----------------
> > +
> > +This is the initial message sent by the client after the socket
> > +connection is established:
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+-------------------------------------------+
> > +| Name         | Value                                     |
> > ++==============+===========================================+
> > +| Message ID   | <ID>                                      |
> > ++--------------+-------------------------------------------+
> > +| Command      | 1                                         |
> > ++--------------+-------------------------------------------+
> > +| Message size | 16 + version header + version data length |
> > ++--------------+-------------------------------------------+
> > +| Flags        | Reply bit set in reply                    |
> > ++--------------+-------------------------------------------+
> > +| Error        | 0/errno                                   |
> > ++--------------+-------------------------------------------+
> > +| Version      | version header                            |
> > ++--------------+-------------------------------------------+
> > +
> > +Version Header Format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++---------------+--------+------------------------------------------------+
> > +| Name          | Offset | Size                                           |
> > ++===============+========+================================================+
> > +| version major | 16     | 2                                              |
> > ++---------------+--------+------------------------------------------------+
> > +| version minor | 18     | 2                                              |
> > ++---------------+--------+------------------------------------------------+
> > +| version data  | 20     | variable (including terminating NUL            |
> > +|               |        | character). Optional.                          |
> > ++---------------+--------+------------------------------------------------+
> > +
> > +Version Data Format
> > +^^^^^^^^^^^^^^^^^^^
> > +
> > +The version data is an optional JSON byte array with the following format:
>
> RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding
> says:
>
> JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
>
> Please indicate the character encoding. I guess it is always UTF-8?
>
> > +
> > ++--------------------+------------------+-----------------------------------+
> > +| Name               | Type             | Description                       |
> > ++====================+==================+===================================+
> > +| ``"capabilities"`` | collection of    | Contains common capabilities that |
> > +|                    | name/value pairs | the sender supports. Optional.    |
> > ++--------------------+------------------+-----------------------------------+
> > +
> > +Capabilities:
> > +
> > ++--------------------+------------------+-------------------------------------+
> > +| Name               | Type             | Description                         |
> > ++====================+==================+=====================================+
> > +| ``"max_fds"``      | number           | Maximum number of file descriptors  |
> > +|                    |                  | that can be received by the sender. |
> > +|                    |                  | Optional. If not specified then the |
> > +|                    |                  | receiver must assume                |
> > +|                    |                  | ``"max_fds"=1``.                    |
>
> Maximum per message? Please clarify and consider renaming it to
> max_msg_fds (it's also more consistent with max_msg_size).
>
> > ++--------------------+------------------+-------------------------------------+
> > +| ``"max_msg_size"`` | number           | Maximum message size in bytes that  |
> > +|                    |                  | the receiver can handle, including  |
> > +|                    |                  | the header. Optional. If not        |
> > +|                    |                  | specified then the receiver must    |
> > +|                    |                  | assume ``"max_msg_size"=4096``.     |
> > ++--------------------+------------------+-------------------------------------+
> > +| ``"migration"``    | collection of    | Migration capability parameters. If |
> > +|                    | name/value pairs | missing then migration is not       |
> > +|                    |                  | supported by the sender.            |
> > ++--------------------+------------------+-------------------------------------+
> > +
> > +The migration capability contains the following name/value pairs:
> > +
> > ++--------------+--------+-----------------------------------------------+
> > +| Name         | Type   | Description                                   |
> > ++==============+========+===============================================+
> > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest |
> > +|              |        | between the client and the server is used.    |
> > ++--------------+--------+-----------------------------------------------+
>
> "in bytes"?
>
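
Putting the version tables together, a VFIO_USER_VERSION request could be assembled roughly as below. Several things here are assumptions rather than statements from the patch: the JSON is encoded as UTF-8 (the question raised above is still open), the version data is taken to follow the two 16-bit version fields directly, and build_version_request and parse_version are hypothetical names.

```python
import json
import struct

HEADER = struct.Struct("<HHIII")       # Message ID, Command, Message size, Flags, Error
VERSION_FIELDS = struct.Struct("<HH")  # version major, version minor
VFIO_USER_VERSION = 1


def build_version_request(msg_id, major, minor, capabilities=None) -> bytes:
    """Header + major/minor + optional NUL-terminated JSON version data."""
    body = VERSION_FIELDS.pack(major, minor)
    if capabilities is not None:
        text = json.dumps({"capabilities": capabilities})
        body += text.encode("utf-8") + b"\0"  # UTF-8 is an assumption
    size = HEADER.size + len(body)
    return HEADER.pack(msg_id, VFIO_USER_VERSION, size, 0, 0) + body


def parse_version(msg: bytes):
    """Return (major, minor, capabilities) with the documented defaults filled in."""
    major, minor = VERSION_FIELDS.unpack_from(msg, HEADER.size)
    raw = msg[HEADER.size + VERSION_FIELDS.size:]
    caps = {}
    if raw:
        caps = json.loads(raw.rstrip(b"\0").decode("utf-8")).get("capabilities", {})
    caps.setdefault("max_fds", 1)            # default when not specified
    caps.setdefault("max_msg_size", 4096)    # default when not specified
    return major, minor, caps
```

For example, build_version_request(1, 0, 1, {"max_fds": 8, "migration": {"pgsize": 4096}}) proposes version 0.1 with two capabilities, and parse_version() on the receiving side fills in the max_msg_size default.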
> > +
> > +
> > +.. _Version:
> > +
> > +Versioning and Feature Support
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +Upon establishing a connection, the client must send a
> > +VFIO_USER_VERSION message proposing a protocol version and a set of
> > +capabilities. The server compares these with the versions and
> > +capabilities it supports and sends a VFIO_USER_VERSION reply according
> > +to the following rules.
> > +
> > +* The major version in the reply must be the same as proposed. If the
> > +client
> > + does not support the proposed major, it closes the connection.
> > +* The minor version in the reply must be equal to or less than the
> > +minor
> > + version proposed.
> > +* The capability list must be a subset of those proposed. If the
> > +server
> > + requires a capability the client did not include, it closes the
> > connection.
>
> Does the server echo back all capabilities it has accepted so the client can
> still close the connection if it sees the server didn't accept a capability?
>
> > +
> > +The protocol major version will only change when incompatible
> > +protocol changes are made, such as changing the message format. The
> > +minor version may change when compatible changes are made, such as
> > +adding new messages or capabilities. Both the client and server must
> > +support all minor versions less than the maximum minor version they
> > +support. E.g., an implementation that supports version 1.3 must also
> > +support 1.0 through 1.2.
> > +
> > +When making a change to this specification, the protocol version
> > +number must be included in the form "added in version X.Y".
> > +
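To make the negotiation rules above concrete, here is a rough Python sketch of the server side (not part of the patch). The function `negotiate` and its parameters are hypothetical, capabilities are modelled as plain sets of names, and treating a major version mismatch as "the server closes the connection" is an assumption about how the first rule is meant to be applied.

```python
from typing import AbstractSet, Optional, Set, Tuple

Version = Tuple[int, int]  # (major, minor)


def negotiate(
    proposed: Version,
    proposed_caps: AbstractSet[str],
    supported: Version,
    supported_caps: AbstractSet[str],
    required_caps: AbstractSet[str] = frozenset(),
) -> Optional[Tuple[Version, Set[str]]]:
    """Return the (version, capabilities) for the reply, or None to close."""
    p_major, p_minor = proposed
    s_major, s_minor = supported
    # The major version in the reply must equal the proposed one.
    if p_major != s_major:
        return None
    # The server needs a capability that the client did not offer.
    if not set(required_caps) <= set(proposed_caps):
        return None
    # The reply minor must be equal to or less than the proposed minor.
    reply_minor = min(p_minor, s_minor)
    # The reply capability list must be a subset of those proposed.
    reply_caps = set(proposed_caps) & set(supported_caps)
    return (p_major, reply_minor), reply_caps
```

Taking the minimum of the two minor versions relies on the rule that each side supports every minor version below its maximum.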
> > +
> > +VFIO_USER_DMA_MAP
> > +-----------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name | Value |
> > ++==============+========================+
> > +| Message ID | <ID> |
> > ++--------------+------------------------+
> > +| Command | 2 |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size |
> > ++--------------+------------------------+
> > +| Flags | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error | 0/errno |
> > ++--------------+------------------------+
> > +| Table | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This command message is sent by the client to the server to inform it
> > +of the memory regions the server can access. It must be sent before
> > +the server can perform any DMA to the client. It is normally sent
> > +directly after the version handshake is completed, but may also occur
> > +when memory is added to the client, or if the client uses a vIOMMU.
> > +If the client does not expect the server to perform DMA then it does
> > +not need to send VFIO_USER_DMA_MAP commands to the server. If the
> > +server does not need to perform DMA then it can ignore such commands
> > +but it must still reply to them. The table is an array of the following
> > +structure:
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+-------------+
> > +| Name | Offset | Size |
> > ++=============+========+=============+
> > +| Address | 0 | 8 |
> > ++-------------+--------+-------------+
> > +| Size | 8 | 8 |
> > ++-------------+--------+-------------+
> > +| Offset | 16 | 8 |
> > ++-------------+--------+-------------+
> > +| Protections | 24 | 4 |
> > ++-------------+--------+-------------+
> > +| Flags | 28 | 4 |
> > ++-------------+--------+-------------+
> > +| | +-----+------------+ |
> > +| | | Bit | Definition | |
> > +| | +=====+============+ |
> > +| | | 0 | Mappable | |
> > +| | +-----+------------+ |
> > ++-------------+--------+-------------+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
>
> "in bytes"?
>
> > +* *Offset* is the file offset of the region with respect to the associated file
> > + descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > + ``<sys/mman.h>``.
>
> Please be more specific. Does it only include PROT_READ and PROT_WRITE?
> What about PROT_EXEC?
>
> > +* *Flags* contains the following region attributes:
> > +
> > + * *Mappable* indicates that the region can be mapped via the mmap() system
> > + call using the file descriptor provided in the message meta-data.
> > +
> > +This structure is 32 bytes in size, so the message size is:
> > +16 + (# of table entries * 32).
> > +
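As a sanity check of the 32-byte entry layout and the message size formula above, here is a minimal Python sketch (not part of the patch). Little-endian byte order is an assumption, as are the helper names; the PROT_* values are the ones Linux uses in `<sys/mman.h>`.

```python
import struct

HEADER_SIZE = 16  # size of the vfio-user message header, per the formula above

# Address (u64), Size (u64), Offset (u64), Protections (u32), Flags (u32).
# "<" selects little-endian with no padding; byte order is an assumption.
DMA_MAP_ENTRY = struct.Struct("<QQQII")

PROT_READ = 0x1   # value on Linux
PROT_WRITE = 0x2  # value on Linux
FLAG_MAPPABLE = 1 << 0  # bit 0 of the Flags field


def pack_dma_map_table(regions):
    """Pack (address, size, offset, protections, flags) tuples into a table."""
    return b"".join(DMA_MAP_ENTRY.pack(*region) for region in regions)


def dma_map_message_size(num_entries: int) -> int:
    """Message size is 16 + (# of table entries * 32)."""
    return HEADER_SIZE + num_entries * DMA_MAP_ENTRY.size


table = pack_dma_map_table([
    (0x0000_0000, 0x1000_0000, 0, PROT_READ | PROT_WRITE, FLAG_MAPPABLE),
    (0x4000_0000, 0x0010_0000, 0, PROT_READ, 0),
])
```

Here `len(table)` is 64 and `dma_map_message_size(2)` is 80, matching 16 + (2 * 32).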
> > +If a DMA region being added can be directly mapped by the server, an
> > +array of file descriptors must be sent as part of the message
> > +meta-data. Each mappable region entry must have a corresponding file
> > +descriptor. On AF_UNIX sockets, the file descriptors must be passed
> > +as SCM_RIGHTS type ancillary data. Otherwise, if a DMA region cannot
> > +be directly mapped by the server, it can be accessed by the server
> > +using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained
> > +in `Read and Write Operations`_. A command to map over an existing
> > +region must be failed by the server with ``EEXIST`` set in the error
> > +field in the reply.
>
> Does this mean a vIOMMU update, like a protections bits change requires an
> unmap command followed by a map command? That is not an atomic
> operation but hopefully guests don't try to update a vIOMMU mapping while
> accessing it.

Correct, it's not an atomic operation. We could consider adding such an
operation if you think it would be useful.
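
Separately, on the requirement that file descriptors for mappable regions travel as SCM_RIGHTS ancillary data on AF_UNIX sockets: a minimal Python sketch of that mechanism follows. The helper names are hypothetical and the payload is a placeholder, not a real vfio-user message; only the standard `socket` calls are used.

```python
import array
import socket


def send_with_fds(sock, payload: bytes, fds):
    """Send payload with the given file descriptors as SCM_RIGHTS data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                  array.array("i", fds).tobytes())]
    return sock.sendmsg([payload], ancillary)


def recv_with_fds(sock, bufsize: int, max_fds: int):
    """Receive up to bufsize bytes and up to max_fds file descriptors."""
    fd_size = array.array("i").itemsize
    data, ancdata, _flags, _addr = sock.recvmsg(
        bufsize, socket.CMSG_LEN(max_fds * fd_size))
    fds = array.array("i")
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Drop any trailing partial descriptor before decoding.
            fds.frombytes(cdata[:len(cdata) - (len(cdata) % fd_size)])
    return data, list(fds)
```

The receiver sizes its ancillary buffer from the number of descriptors it is willing to accept, which is where a limit such as `max_fds` would apply.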
>
> By the way, this DMA mapping design is the eager mapping approach.
> Another approach is the lazy mapping approach where the server requests
> translations as necessary. The advantage is that the client does not have to
> send each mapping to the server. In the case of
> VFIO_USER_DMA_READ/WRITE no mappings need to be sent at all. Only
> mmaps need mapping messages.
>
> > +Adding multiple DMA regions can partially fail. The response does not
> > +indicate which regions were added and which were not, therefore it is
> > +a client implementation detail how to recover from the failure.
> > +
> > +.. Note::
> > + The server can optionally remove succesfully added DMA regions
> > +making this
>
> s/succesfully/successfully/
>
> > + operation atomic.
> > + The client can recover by attempting to unmap one by one all the DMA regions
> > + in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do not
> > + exist.
> > +
> > +VFIO_USER_DMA_UNMAP
> > +-------------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name | Value |
> > ++==============+========================+
> > +| Message ID | <ID> |
> > ++--------------+------------------------+
> > +| Command | 3 |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size |
> > ++--------------+------------------------+
> > +| Flags | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error | 0/errno |
> > ++--------------+------------------------+
> > +| Table | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This command message is sent by the client to the server to inform it
> > +that a DMA region, previously made available via a VFIO_USER_DMA_MAP
> > +command message, is no longer available for DMA. It typically occurs
> > +when memory is removed from the client or if the client uses a
> > +vIOMMU. If the client does not expect the server to perform DMA then
> > +it does not need to send VFIO_USER_DMA_UNMAP commands to the server.
> > +If the server does not need to perform DMA then it can ignore such
> > +commands but it must still reply to them. The table is an
>
> I'm a little confused by the last two sentences about not sending or ignoring
> VFIO_USER_DMA_UNMAP. Does it mean that VFIO_USER_DMA_MAP does
> not need to be sent either when the device is known never to need DMA?
>
> > +array of the following structure:
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++--------------+--------+---------------------------------------+
> > +| Name | Offset | Size |
> > ++==============+========+=======================================+
> > +| Address | 0 | 8 |
> > ++--------------+--------+---------------------------------------+
> > +| Size | 8 | 8 |
> > ++--------------+--------+---------------------------------------+
> > +| Offset | 16 | 8 |
> > ++--------------+--------+---------------------------------------+
> > +| Protections | 24 | 4 |
> > ++--------------+--------+---------------------------------------+
> > +| Flags | 28 | 4 |
> > ++--------------+--------+---------------------------------------+
> > +| | +-----+--------------------------------------+ |
> > +| | | Bit | Definition | |
> > +| | +=====+======================================+ |
> > +| | | 0 | VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP | |
> > +| | +-----+--------------------------------------+ |
> > ++--------------+--------+---------------------------------------+
> > +| VFIO Bitmaps | 32 | variable |
> > ++--------------+--------+---------------------------------------+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated file
> > + descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > + ``<sys/mman.h>``.
>
> Why are offset and protections required for the unmap command?
>
> > +* *Flags* contains the following region attributes:
> > +
> > + * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page bitmap
> > + must be populated before unmapping the DMA region. The client must provide
> > + a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, with
> > + the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized.
> > +
> > +* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region (explained
> > + below) if ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags.
>
> I'm confused, it's 1 "VFIO Bitmaps" per "Table entry". Why does it contain
> one struct vfio_bitmap per region? What is a "region" in this context?
>
> > +
> > +.. _VFIO bitmap format:
> > +
> > +VFIO bitmap format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > +If the VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP bit is set in the
> > +request, the server must append to the header the ``struct
> > +vfio_bitmap`` received in the command followed by the bitmap, for
> > +each region. ``struct vfio_bitmap`` has the following format:
> > +
> > ++--------+--------+------+
> > +| Name | Offset | Size |
> > ++========+========+======+
> > +| pgsize | 0 | 8 |
> > ++--------+--------+------+
> > +| size | 8 | 8 |
> > ++--------+--------+------+
> > +| data | 16 | 8 |
> > ++--------+--------+------+
> > +
> > +* *pgsize* is the page size for the bitmap, in bytes.
> > +* *size* is the size for the bitmap, in bytes, excluding the VFIO bitmap header.
> > +* *data* This field is unused in vfio-user.
> > +
> > +The VFIO bitmap structure is defined in ``<linux/vfio.h>`` (``struct
> > +vfio_bitmap``).
> > +
> > +Each ``struct vfio_bitmap`` entry is followed by the region's bitmap.
> > +Each bit in the bitmap represents one page of size
> > +``struct vfio_bitmap.pgsize``.
> > +
> > +If ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is not set in Flags then
> > +the size of the message is: 16 + (# of table entries * 32).
> > +If ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags then the
> > +size of the message is: 16 + (# of table entries * 56) + size of all
> > +bitmaps.
>
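The two size formulas above can be written out as a small Python sketch (not part of the patch). The function name is hypothetical; the bitmap sizes are taken as given, i.e. the values a client would place in each `vfio_bitmap.size` field, so no assumption is made here about how a bitmap is rounded or aligned.

```python
HEADER_SIZE = 16        # vfio-user message header
UNMAP_ENTRY_SIZE = 32   # Address, Size, Offset, Protections, Flags
VFIO_BITMAP_SIZE = 24   # struct vfio_bitmap: pgsize, size, data (3 x 8 bytes)


def dma_unmap_message_size(num_entries, bitmap_sizes=None):
    """Size of a VFIO_USER_DMA_UNMAP message in bytes.

    Pass bitmap_sizes (one size in bytes per table entry) when
    VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP is set; leave it as None otherwise.
    """
    if bitmap_sizes is None:
        return HEADER_SIZE + num_entries * UNMAP_ENTRY_SIZE
    if len(bitmap_sizes) != num_entries:
        raise ValueError("need exactly one bitmap size per table entry")
    per_entry = UNMAP_ENTRY_SIZE + VFIO_BITMAP_SIZE  # 56 bytes
    return HEADER_SIZE + num_entries * per_entry + sum(bitmap_sizes)
```

For two entries with bitmaps of 8 and 16 bytes this gives 16 + (2 * 56) + 24 = 152 bytes, versus 80 bytes without the flag. Comparing the result against the peer's `max_msg_size` shows how quickly large regions could exceed the 4096-byte default.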
>
> > +
> > +Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor
> > +is mapped then the server must release all references to that DMA
> > +region before replying, which includes potentially in-flight DMA
> > +transactions. Removing a portion of a DMA region is possible.
>
> "Removing a portion of a DMA region is possible"
> -> doing so splits a larger DMA region into one or two smaller remaining
> regions?
>
> How do potentially large messages work around max_msg_size? It is hard for
> the client/server to anticipate the maximum message size that will be
> required ahead of time, so they can't really know if they will hit a situation
> where max_msg_size is too low.
>
> > +
> > +VFIO_USER_DEVICE_GET_INFO
> > +-------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+----------------------------+
> > +| Name | Value |
> > ++==============+============================+
> > +| Message ID | <ID> |
> > ++--------------+----------------------------+
> > +| Command | 4 |
> > ++--------------+----------------------------+
> > +| Message size | 32 |
> > ++--------------+----------------------------+
> > +| Flags | Reply bit set in reply |
> > ++--------------+----------------------------+
> > +| Error | 0/errno |
> > ++--------------+----------------------------+
> > +| Device info | VFIO device info |
> > ++--------------+----------------------------+
> > +
> > +This command message is sent by the client to the server to query for
> > +basic information about the device. The VFIO device info structure is
> > +defined in ``<linux/vfio.h>`` (``struct vfio_device_info``).
>
> Wait, "VFIO device info format" below is missing the cap_offset field, so it's
> not exactly the same as <linux/vfio.h>?
>
> > +
> > +VFIO device info format
> > +^^^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+--------------------------+
> > +| Name | Offset | Size |
> > ++=============+========+==========================+
> > +| argsz | 16 | 4 |
> > ++-------------+--------+--------------------------+
> > +| flags | 20 | 4 |
> > ++-------------+--------+--------------------------+
> > +| | +-----+-------------------------+ |
> > +| | | Bit | Definition | |
> > +| | +=====+=========================+ |
> > +| | | 0 | VFIO_DEVICE_FLAGS_RESET | |
> > +| | +-----+-------------------------+ |
> > +| | | 1 | VFIO_DEVICE_FLAGS_PCI | |
> > +| | +-----+-------------------------+ |
> > ++-------------+--------+--------------------------+
> > +| num_regions | 24 | 4 |
> > ++-------------+--------+--------------------------+
> > +| num_irqs | 28 | 4 |
> > ++-------------+--------+--------------------------+
> > +
> > +* *argsz* is the size of the VFIO device info structure. This is the
> > +only field that should be set to non-zero in the request, identifying
> > +the client's expected size. Currently this is a fixed value.
> > +* *flags* contains the following device attributes.
> > +
> > + * VFIO_DEVICE_FLAGS_RESET indicates that the device supports the
> > + VFIO_USER_DEVICE_RESET message.
> > + * VFIO_DEVICE_FLAGS_PCI indicates that the device is a PCI device.
> > +
> > +* *num_regions* is the number of memory regions that the device exposes.
> > +* *num_irqs* is the number of distinct interrupt types that the device supports.
> > +
> > +This version of the protocol only supports PCI devices. Additional
> > +devices may be supported in future versions.
>
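For reference, the device info layout in the table above (four 32-bit fields at offsets 16 to 31, with no cap_offset) can be sketched in Python as follows. This is not part of the patch: the helper names are hypothetical, little-endian byte order is assumed, and argsz = 16 is inferred from the 32-byte message size minus the 16-byte header.

```python
import struct

HEADER_SIZE = 16
# argsz, flags, num_regions, num_irqs: four u32 fields (byte order assumed).
DEVICE_INFO = struct.Struct("<IIII")

VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1


def build_request_payload() -> bytes:
    """Only argsz is non-zero in the request (assumed to be 16 here)."""
    return DEVICE_INFO.pack(DEVICE_INFO.size, 0, 0, 0)


def parse_reply(message: bytes) -> dict:
    """Decode the device info that follows the 16-byte header in a reply."""
    argsz, flags, num_regions, num_irqs = DEVICE_INFO.unpack_from(
        message, HEADER_SIZE)
    return {
        "argsz": argsz,
        "supports_reset": bool(flags & VFIO_DEVICE_FLAGS_RESET),
        "is_pci": bool(flags & VFIO_DEVICE_FLAGS_PCI),
        "num_regions": num_regions,
        "num_irqs": num_irqs,
    }
```

The header bytes themselves are not interpreted here; `parse_reply` only skips over them.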
> I've reviewed up to here so far.
>
> Stefan