On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote: > On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote: > > This patch introduces the vfio-user protocol specification (formerly > > known as VFIO-over-socket), which is designed to allow devices to be
Thanks for your review so far Stefan! We'll make sure to respond to all your comments as needed, so this is just a partial response. > > +Socket Disconnection Behavior > > +----------------------------- > > +The server and the client can disconnect from each other, either > > intentionally > > +or unexpectedly. Both the client and the server need to know how to handle > > such > > +events. > > + > > +Server Disconnection > > +^^^^^^^^^^^^^^^^^^^^ > > +A server disconnecting from the client may indicate that: > > + > > +1) A virtual device has been restarted, either intentionally (e.g. because > > of a > > + device update) or unintentionally (e.g. because of a crash). > > +2) A virtual device has been shut down with no intention to be restarted. > > + > > +It is impossible for the client to know whether or not a failure is > > +intermittent or innocuous and should be retried, therefore the client > > should > > +reset the VFIO device when it detects the socket has been disconnected. > > +Error recovery will be driven by the guest's device error handling > > +behavior. > > + > > +Client Disconnection > > +^^^^^^^^^^^^^^^^^^^^ > > +The client disconnecting from the server primarily means that the client > > +has exited. Currently, this means that the guest is shut down so the > > device is > > +no longer needed therefore the server can automatically exit. However, > > there > > +can be cases where a client disconnection should not result in a server > > exit: > > + > > +1) A single server serving multiple clients. > > +2) A multi-process QEMU upgrading itself step by step, which is not yet > > + implemented. > > + > > +Therefore in order for the protocol to be forward compatible the server > > should > > +take no action when the client disconnects. If anything happens to the > > client > > +the control stack will know about it and can clean up resources > > +accordingly. > > Also, hot unplug? > > Does anything need to be said about mmaps and file descriptors on > disconnected? I guess they need to be cleaned up and are not retained > for future reconnection? Yes. We're still iterating on the implications of these scenarios and trying to figure out the right approaches, so a lot of this is still very much under-specified. > Are there rules for avoiding deadlock between client->server and > server->client messages? For example, the client sends > VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT > before replying to the write message. > > Multi-threaded clients and servers could end up deadlocking if messages > are processed while polling threads handle other device activity (e.g. > I/O requests that cause DMA messages). > > Pipelining has the nice effect that the oldest message must complete > before the next pipelined message starts. It imposes a maximum issue > depth of 1. Still, it seems like it would be relatively easy to hit > re-entrancy or deadlock issues since both the client and the server can > initiate messages and may need to wait for a response. It's certainly the case that implementation-wise right now these are issues, at least on the library side. I think I'm probably OK with requiring responses to be provided prior to async messages like VM_INTERRUPT. > > +The version data is an optional JSON byte array with the following format: > > RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding > says: > > JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. > > Please indicate the character encoding. I guess it is always UTF-8? Yes. > > +| ``"max_fds"`` | number | Maximum number of file > > descriptors | > > +| | | the can be received by the > > sender. | > > +| | | Optional. If not specified then > > the | > > +| | | receiver must assume > > | > > +| | | ``"max_fds"=1``. > > | > > Maximum per message? Please clarify and consider renaming it to > max_msg_fds (it's also more consistent with max_msg_size). Yes, that's a much better name. > > +| Name | Type | Description | > > ++==============+========+===============================================+ > > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest | > > +| | | between the client and the server is used. | > > ++--------------+--------+-----------------------------------------------+ > > "in bytes"? We'll add to the prelude that all memory sizes are in byte units unless specified otherwise. > > +Versioning and Feature Support > > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > +Upon establishing a connection, the client must send a VFIO_USER_VERSION > > message > > +proposing a protocol version and a set of capabilities. The server compares > > +these with the versions and capabilities it supports and sends a > > +VFIO_USER_VERSION reply according to the following rules. > > + > > +* The major version in the reply must be the same as proposed. If the > > client > > + does not support the proposed major, it closes the connection. > > +* The minor version in the reply must be equal to or less than the minor > > + version proposed. > > +* The capability list must be a subset of those proposed. If the server > > + requires a capability the client did not include, it closes the > > connection. > > Does the server echo back all capabilities it has accepted so the client > can still close the connection if it sees the server didn't accept a > capability? Yes, if a client *requires* a cap that the server doesn't report back, it will be missing from the server response JSON. The spec needs more detail here. The response reflects the server's state. If a server doesn't support a cap, it won't appear in the server's response at all. If a client doesn't support a cap, it *will* appear in the server's response anyway. The values for each capability have cap-specific semantics. For example, for max_msg_fds, the client->server JSON lets a client give a maximum number of fd's supported in server->client messages. The server's response JSON, in turn, lets a server give a maximum number of fd's supported in client->server messages. migration::pgsize is a min() function instead as you can see above, etc. > By the way, this DMA mapping design is the eager mapping approach. Another > approach is the lazy mapping approach where the server requests translations > as necessary. The advantage is that the client does not have to send each > mapping to the server. In the case of VFIO_USER_DMA_READ/WRITE no mappings > need to be sent at all. Only mmaps need mapping messages. Are you arguing that we should implement this? It would non-trivially complicate the implementations on the server-side, where the library "owns" the mappings logic, but an API user is responsible for doing actual read/writes. > How do potentially large messages work around max_msg_size? It is hard > for the client/server to anticipate the maximum message size that will > be required ahead of time, so they can't really know if they will hit a > situation where max_msg_size is too low. Are there specific messages you're worried about? would it help to add a specification stipulation as to minimum size that clients and servers must support? Ultimately the max msg size exists solely to ease implementation: with a reasonable fixed size, we can always consume the entire data in one go, rather than doing partial reads. Obviously that needs a limit to avoid unbounded allocations. > > +VFIO_USER_DEVICE_GET_INFO > > +------------------------- > > + > > +Message format > > +^^^^^^^^^^^^^^ > > + > > ++--------------+----------------------------+ > > +| Name | Value | > > ++==============+============================+ > > +| Message ID | <ID> | > > ++--------------+----------------------------+ > > +| Command | 4 | > > ++--------------+----------------------------+ > > +| Message size | 32 | > > ++--------------+----------------------------+ > > +| Flags | Reply bit set in reply | > > ++--------------+----------------------------+ > > +| Error | 0/errno | > > ++--------------+----------------------------+ > > +| Device info | VFIO device info | > > ++--------------+----------------------------+ > > + > > +This command message is sent by the client to the server to query for basic > > +information about the device. The VFIO device info structure is defined in > > +``<linux/vfio.h>`` (``struct vfio_device_info``). > > Wait, "VFIO device info format" below is missing the cap_offset field, > so it's exactly not the same as <linux/vfio.h>? We had to move away from directly consuming struct vfio_device_info when cap_offset was added. Generally trying to use vfio.h at all seems like a bad idea. That's an implementation thing, but this was a dangling reference we need to clean up. regards john