On 8/30/23 10:41, Laszlo Ersek wrote:
> I'm adding Stefan to the CC list, and an additional piece of explanation below:
>
> On 8/27/23 20:29, Laszlo Ersek wrote:
>> (1) The virtio-1.0 specification <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>
>>> 3 General Initialization And Device Operation
>>> 3.1 Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1 Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3 Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES special virtio feature is negotiated, then virtio rings start in disabled state, according to <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>. In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane* operation, which travels from the guest through QEMU to the vhost-user backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which evades QEMU -- it travels from the guest to the vhost-user backend via eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels, and their relative order can be reversed, as perceived by the vhost-user backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs against the Rust-language virtiofsd version 1.7.2. (Which uses version 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the device initialization steps (i.e., control plane operations), and immediately sends a FUSE_INIT request too (i.e., performs a data plane operation). In the Rust-language virtiofsd, this creates a race between two components that run *concurrently*, i.e., in different threads or processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method [crates/vhost-user-backend/src/handler.rs] handles VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled" flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>>   The "VringEpollHandler::handle_event" method [crates/vhost-user-backend/src/event_loop.rs] handles the incoming virtio / FUSE request, consuming the virtio kick at the same time.
>>   If the vring's "enabled" flag is set, the virtio / FUSE request is processed genuinely. If the vring's "enabled" flag is clear, then the virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*. However, if the data plane processor in virtiofsd wins the race, then it sees the FUSE_INIT *before* the control plane processor has taken notice of VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane processor. Therefore the latter drops FUSE_INIT on the floor, and goes back to waiting for further virtio / FUSE requests with epoll_wait. Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
> I can explain why this issue has not been triggered by / witnessed with the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
>
> That driver registers *two* driver (callback) structures: a virtio driver, and a filesystem driver.
>
> (1) The virtio driver half initializes the virtio device, and takes note of the particular virtio filesystem, remembering its "tag". See virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe() -> virtio_fs_add_instance().
>
> Importantly, at this time, no FUSE_INIT request is sent.
>
> (2) The filesystem driver half has a totally independent entry point. The relevant parts (after the driver registration) are:
>
> (a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
>
> (b) if the "tag" was found, virtio_fs_get_tree() -> virtio_fs_fill_super() -> fuse_send_init().
>
> Importantly, this occurs when guest userspace (i.e., an interactive user, or a userspace automatism such as systemd) tries to mount a *concrete* virtio filesystem, identified by its tag (such as in "mount -t virtiofs TAG /mount/point").
>
> This means that there is an *arbitrarily long* delay between (1) VHOST_USER_SET_VRING_ENABLE (which QEMU sends to virtiofsd while the guest is inside virtio_fs_probe()) and (2) FUSE_INIT (which the guest kernel driver sends to virtiofsd while inside virtio_fs_get_tree()).
>
> That huge delay is plenty for masking the race.
>
> But the race is there nonetheless.
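To make the triggering sequence concrete, here is a minimal C sketch of the guest-side steps (1.1)-(1.3) quoted above -- the back-to-back sequence that OVMF's VirtioFsDxe effectively performs, and that the Linux driver separates by an arbitrarily long delay. This is not OVMF code; the virtio_cfg_* / virtio_kick / virtiofs_queue_fuse_init helpers are hypothetical stand-ins for a guest driver's virtio-pci transport accessors.

/*
 * Illustrative sketch only -- not OVMF / VirtioFsDxe code. The helper
 * declarations below are hypothetical stand-ins for a guest driver's
 * accessors to the virtio-pci common configuration structure and the
 * notification (doorbell) area.
 */
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK  0x04u   /* DRIVER_OK bit of the Device Status field */

/* hypothetical accessors provided by the guest's virtio transport layer */
extern void    virtio_cfg_select_queue(uint16_t queue);
extern void    virtio_cfg_write_queue_enable(uint16_t value);
extern uint8_t virtio_cfg_read_status(void);
extern void    virtio_cfg_write_status(uint8_t status);
extern void    virtiofs_queue_fuse_init(uint16_t queue);  /* place FUSE_INIT on the vring */
extern void    virtio_kick(uint16_t queue);               /* ring the doorbell -> eventfd */

void virtiofs_guest_init_sequence(uint16_t request_queue)
{
    /* step (1.1): final part of device initialization step 7 -- a register
     * write that QEMU forwards to the backend as VHOST_USER_SET_VRING_ENABLE
     * over the unix domain socket (control plane) */
    virtio_cfg_select_queue(request_queue);
    virtio_cfg_write_queue_enable(1);

    /* step (1.2): step 8 -- the device is now "live" */
    virtio_cfg_write_status(virtio_cfg_read_status() | VIRTIO_STATUS_DRIVER_OK);

    /* step (1.3): data plane -- the kick reaches the backend via eventfd,
     * bypassing QEMU, so it can overtake the SET_VRING_ENABLE message */
    virtiofs_queue_fuse_init(request_queue);
    virtio_kick(request_queue);
}

The two channels (unix socket for the enable, eventfd for the kick) are exactly why the backend may observe steps (1.1) and (1.3) in either order.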
Furthermore, the race was not seen in the C-language virtiofsd implementation (removed in QEMU commit e0dc2631ec4ac718ebe22ddea0ab25524eb37b0e) for the following reason:

The C-language virtiofsd *did not care* about VHOST_USER_SET_VRING_ENABLE at all:

- Upon VHOST_USER_GET_VRING_BASE, vu_get_vring_base_exec() in libvhost-user would call fv_queue_set_started() in virtiofsd, and the latter would start the data plane thread fv_queue_thread().

- Upon VHOST_USER_SET_VRING_ENABLE, vu_set_vring_enable_exec() in libvhost-user would set the "enable" field, but not call back into virtiofsd. And virtiofsd ("tools/virtiofsd/fuse_virtio.c") does not check the "enable" field anywhere.

In summary, the C-language virtiofsd didn't implement queue enablement in a conformant way. The Rust-language version does, but that exposes a race in how QEMU sends VHOST_USER_SET_VRING_ENABLE. The race is triggered by the OVMF guest driver, and not triggered by the Linux guest driver (since the latter introduces an unbounded delay between vring enablement and FUSE_INIT submission).

Laszlo

>
> Also note that this race does not exist for vhost-net. For vhost-net, AIUI, such queue operations are handled with ioctl()s, and ioctl()s are synchronous by nature. Cf. <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#vhost-user-protocol-f-reply-ack>:
>
> "The original vhost-user specification only demands replies for certain commands. This differs from the vhost protocol implementation where commands are sent over an ioctl() call and block until the back-end has completed."
>
> Laszlo
>
>> The deadlock is not deterministic. OVMF hangs infrequently during first boot. However, OVMF hangs almost certainly during reboots from the UEFI shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a single debug message -- at the top of "VringEpollHandler::handle_event", i.e., just before the data plane processor checks the "enabled" field of the vring. That delay suffices for the control plane processor to act upon VHOST_USER_SET_VRING_ENABLE.
>>
>> We can deterministically prevent the race in QEMU, by blocking OVMF inside step (1.1) -- i.e., in the write to the "queue_enable" register -- until VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU cannot advance to the FUSE_INIT submission before virtiofsd's control plane processor takes notice of the queue being enabled.
>>
>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>
>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature has been negotiated, or
>>
>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com> (supporter:vhost)
Tsirkin" <m...@redhat.com> (supporter:vhost) >> Cc: Eugenio Perez Martin <epere...@redhat.com> >> Cc: German Maglione <gmagli...@redhat.com> >> Cc: Liu Jiang <ge...@linux.alibaba.com> >> Cc: Sergio Lopez Pascual <s...@redhat.com> >> Cc: Stefano Garzarella <sgarz...@redhat.com> >> Signed-off-by: Laszlo Ersek <ler...@redhat.com> >> --- >> hw/virtio/vhost-user.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c >> index beb4b832245e..01e0ca90c538 100644 >> --- a/hw/virtio/vhost-user.c >> +++ b/hw/virtio/vhost-user.c >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct >> vhost_dev *dev, int enable) >> .num = enable, >> }; >> >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, >> false); >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, >> true); >> if (ret < 0) { >> /* >> * Restoring the previous state is likely infeasible, as well as >