On 8/30/23 10:41, Laszlo Ersek wrote:
> I'm adding Stefan to the CC list, and an additional piece of explanation below:
>
> On 8/27/23 20:29, Laszlo Ersek wrote:
>> (1) The virtio-1.0 specification <http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html> writes:
>>
>>> 3 General Initialization And Device Operation
>>> 3.1 Device Initialization
>>> 3.1.1 Driver Requirements: Device Initialization
>>>
>>> [...]
>>>
>>> 7. Perform device-specific setup, including discovery of virtqueues for the device, optional per-bus setup, reading and possibly writing the device’s virtio configuration space, and population of virtqueues.
>>>
>>> 8. Set the DRIVER_OK status bit. At this point the device is “live”.
>>
>> and
>>
>>> 4 Virtio Transport Options
>>> 4.1 Virtio Over PCI Bus
>>> 4.1.4 Virtio Structure PCI Capabilities
>>> 4.1.4.3 Common configuration structure layout
>>> 4.1.4.3.2 Driver Requirements: Common configuration structure layout
>>>
>>> [...]
>>>
>>> The driver MUST configure the other virtqueue fields before enabling the virtqueue with queue_enable.
>>>
>>> [...]
>>
>> These together mean that the following sub-sequence of steps is valid for a virtio-1.0 guest driver:
>>
>> (1.1) set "queue_enable" for the needed queues as the final part of device initialization step (7),
>>
>> (1.2) set DRIVER_OK in step (8),
>>
>> (1.3) immediately start sending virtio requests to the device.
>>
>> (2) When vhost-user is enabled, and the VHOST_USER_F_PROTOCOL_FEATURES special virtio feature is negotiated, then virtio rings start in disabled state, according to <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#ring-states>. In this case, explicit VHOST_USER_SET_VRING_ENABLE messages are needed for enabling vrings.
>>
>> Therefore setting "queue_enable" from the guest (1.1) is a *control plane* operation, which travels from the guest through QEMU to the vhost-user backend, using a unix domain socket.
>>
>> Whereas sending a virtio request (1.3) is a *data plane* operation, which evades QEMU -- it travels from the guest to the vhost-user backend via eventfd.
>>
>> This means that steps (1.1) and (1.3) travel through different channels, and their relative order can be reversed, as perceived by the vhost-user backend.
>>
>> That's exactly what happens when OVMF's virtiofs driver (VirtioFsDxe) runs against the Rust-language virtiofsd version 1.7.2. (Which uses version 0.10.1 of the vhost-user-backend crate, and version 0.8.1 of the vhost crate.)
>>
>> Namely, when VirtioFsDxe binds a virtiofs device, it goes through the device initialization steps (i.e., control plane operations), and immediately sends a FUSE_INIT request too (i.e., performs a data plane operation). In the Rust-language virtiofsd, this creates a race between two components that run *concurrently*, i.e., in different threads or processes:
>>
>> - Control plane, handling vhost-user protocol messages:
>>
>>   The "VhostUserSlaveReqHandlerMut::set_vring_enable" method [crates/vhost-user-backend/src/handler.rs] handles VHOST_USER_SET_VRING_ENABLE messages, and updates each vring's "enabled" flag according to the message processed.
>>
>> - Data plane, handling virtio / FUSE requests:
>>
>>   The "VringEpollHandler::handle_event" method [crates/vhost-user-backend/src/event_loop.rs] handles the incoming virtio / FUSE request, consuming the virtio kick at the same time.
>>   If the vring's "enabled" flag is set, the virtio / FUSE request is processed genuinely. If the vring's "enabled" flag is clear, then the virtio / FUSE request is discarded.
>>
>> Note that OVMF enables the queue *first*, and sends FUSE_INIT *second*. However, if the data plane processor in virtiofsd wins the race, then it sees the FUSE_INIT *before* the control plane processor has taken notice of VHOST_USER_SET_VRING_ENABLE and green-lit the queue for the data plane processor. Therefore the latter drops FUSE_INIT on the floor, and goes back to waiting for further virtio / FUSE requests with epoll_wait. Meanwhile OVMF is stuck waiting for the FUSE_INIT response -- a deadlock.
>
> I can explain why this issue has not been triggered by / witnessed with the Linux guest driver for virtiofs ("fs/fuse/virtio_fs.c").
>
> That driver registers *two* driver (callback) structures: a virtio driver, and a filesystem driver.
>
> (1) The virtio driver half initializes the virtio device, and takes note of the particular virtio filesystem, remembering its "tag". See virtio_fs_probe() -> virtio_device_ready(), and then virtio_fs_probe() -> virtio_fs_add_instance().
>
> Importantly, at this time, no FUSE_INIT request is sent.
>
> (2) The filesystem driver half has a totally independent entry point. The relevant parts (after the driver registration) are:
>
> (a) virtio_fs_get_tree() -> virtio_fs_find_instance(), and
>
> (b) if the "tag" was found, virtio_fs_get_tree() -> virtio_fs_fill_super() -> fuse_send_init().
>
> Importantly, this occurs when guest userspace (i.e., an interactive user, or a userspace automatism such as systemd) tries to mount a *concrete* virtio filesystem, identified by its tag (such as in "mount -t virtiofs TAG /mount/point").
>
> This means that there is an *arbitrarily long* delay between (1) VHOST_USER_SET_VRING_ENABLE (which QEMU sends to virtiofsd while the guest is inside virtio_fs_probe()) and (2) FUSE_INIT (which the guest kernel driver sends to virtiofsd while inside virtio_fs_get_tree()).
>
> That huge delay is plenty for masking the race.
>
> But the race is there nonetheless.
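To make the triggering sequence concrete, here is a minimal C sketch of the guest-side steps (1.1)-(1.3) quoted above -- the back-to-back sequence that OVMF's VirtioFsDxe effectively performs, and that the Linux driver separates by an arbitrarily long delay. This is not OVMF code; the virtio_cfg_* / virtio_kick / virtiofs_queue_fuse_init helpers are hypothetical stand-ins for a guest driver's virtio-pci transport accessors.

/*
 * Illustrative sketch only -- not OVMF / VirtioFsDxe code. The helper
 * declarations below are hypothetical stand-ins for a guest driver's
 * accessors to the virtio-pci common configuration structure and the
 * notification (doorbell) area.
 */
#include <stdint.h>

#define VIRTIO_STATUS_DRIVER_OK  0x04u   /* DRIVER_OK bit of the Device Status field */

/* hypothetical accessors provided by the guest's virtio transport layer */
extern void    virtio_cfg_select_queue(uint16_t queue);
extern void    virtio_cfg_write_queue_enable(uint16_t value);
extern uint8_t virtio_cfg_read_status(void);
extern void    virtio_cfg_write_status(uint8_t status);
extern void    virtiofs_queue_fuse_init(uint16_t queue);  /* place FUSE_INIT on the vring */
extern void    virtio_kick(uint16_t queue);               /* ring the doorbell -> eventfd */

void virtiofs_guest_init_sequence(uint16_t request_queue)
{
    /* step (1.1): final part of device initialization step 7 -- a register
     * write that QEMU forwards to the backend as VHOST_USER_SET_VRING_ENABLE
     * over the unix domain socket (control plane) */
    virtio_cfg_select_queue(request_queue);
    virtio_cfg_write_queue_enable(1);

    /* step (1.2): step 8 -- the device is now "live" */
    virtio_cfg_write_status(virtio_cfg_read_status() | VIRTIO_STATUS_DRIVER_OK);

    /* step (1.3): data plane -- the kick reaches the backend via eventfd,
     * bypassing QEMU, so it can overtake the SET_VRING_ENABLE message */
    virtiofs_queue_fuse_init(request_queue);
    virtio_kick(request_queue);
}

The two channels (unix socket for the enable, eventfd for the kick) are exactly why the backend may observe steps (1.1) and (1.3) in either order.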
Furthermore, the race was not seen in the C-language virtiofsd implementation (removed in QEMU commit e0dc2631ec4ac718ebe22ddea0ab25524eb37b0e) for the following reason:

The C-language virtiofsd *did not care* about VHOST_USER_SET_VRING_ENABLE at all:

- Upon VHOST_USER_GET_VRING_BASE, vu_get_vring_base_exec() in libvhost-user would call fv_queue_set_started() in virtiofsd, and the latter would start the data plane thread fv_queue_thread().

- Upon VHOST_USER_SET_VRING_ENABLE, vu_set_vring_enable_exec() in libvhost-user would set the "enable" field, but not call back into virtiofsd. And virtiofsd ("tools/virtiofsd/fuse_virtio.c") does not check the "enable" field anywhere.

In summary, the C-language virtiofsd didn't implement queue enablement in a conformant way. The Rust-language version does, but that exposes a race in how QEMU sends VHOST_USER_SET_VRING_ENABLE. The race is triggered by the OVMF guest driver, and not triggered by the Linux guest driver (since the latter introduces an unbounded delay between vring enablement and FUSE_INIT submission).

Laszlo

>
> Also note that this race does not exist for vhost-net. For vhost-net, AIUI, such queue operations are handled with ioctl()s, and ioctl()s are synchronous by nature. Cf. <https://qemu-project.gitlab.io/qemu/interop/vhost-user.html#vhost-user-protocol-f-reply-ack>:
>
> "The original vhost-user specification only demands replies for certain commands. This differs from the vhost protocol implementation where commands are sent over an ioctl() call and block until the back-end has completed."
>
> Laszlo
>
>> The deadlock is not deterministic. OVMF hangs infrequently during first boot. However, OVMF hangs almost certainly during reboots from the UEFI shell.
>>
>> The race can be "reliably masked" by inserting a very small delay -- a single debug message -- at the top of "VringEpollHandler::handle_event", i.e., just before the data plane processor checks the "enabled" field of the vring. That delay suffices for the control plane processor to act upon VHOST_USER_SET_VRING_ENABLE.
>>
>> We can deterministically prevent the race in QEMU, by blocking OVMF inside step (1.1) -- i.e., in the write to the "queue_enable" register -- until VHOST_USER_SET_VRING_ENABLE actually *completes*. That way OVMF's VCPU cannot advance to the FUSE_INIT submission before virtiofsd's control plane processor takes notice of the queue being enabled.
>>
>> Wait for VHOST_USER_SET_VRING_ENABLE completion by:
>>
>> - setting the NEED_REPLY flag on VHOST_USER_SET_VRING_ENABLE, and waiting for the reply, if the VHOST_USER_PROTOCOL_F_REPLY_ACK vhost-user feature has been negotiated, or
>>
>> - performing a separate VHOST_USER_GET_FEATURES *exchange*, which requires a backend response regardless of VHOST_USER_PROTOCOL_F_REPLY_ACK.
>>
>> Cc: "Michael S. Tsirkin" <m...@redhat.com> (supporter:vhost)
Tsirkin" <m...@redhat.com> (supporter:vhost) >> Cc: Eugenio Perez Martin <epere...@redhat.com> >> Cc: German Maglione <gmagli...@redhat.com> >> Cc: Liu Jiang <ge...@linux.alibaba.com> >> Cc: Sergio Lopez Pascual <s...@redhat.com> >> Cc: Stefano Garzarella <sgarz...@redhat.com> >> Signed-off-by: Laszlo Ersek <ler...@redhat.com> >> --- >> hw/virtio/vhost-user.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c >> index beb4b832245e..01e0ca90c538 100644 >> --- a/hw/virtio/vhost-user.c >> +++ b/hw/virtio/vhost-user.c >> @@ -1235,7 +1235,7 @@ static int vhost_user_set_vring_enable(struct >> vhost_dev *dev, int enable) >> .num = enable, >> }; >> >> - ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, >> false); >> + ret = vhost_set_vring(dev, VHOST_USER_SET_VRING_ENABLE, &state, >> true); >> if (ret < 0) { >> /* >> * Restoring the previous state is likely infeasible, as well as >