On 27.09.2020 at 15:04, Ying Fang wrote:
> A VM in the cloud environment may use a virtual disk as the backend storage,
> and there are usually filesystems on the virtual block device. When backend
> storage is temporarily down, any I/O issued to the virtual block device will
> cause an error. For example, an error occurred in ext4 filesystem would make
> the filesystem readonly. However a cloud backend storage can be soon
> recovered.
> For example, an IP-SAN may be down due to network failure and will be online
> soon after network is recovered. The error in the filesystem may not be
> recovered unless a device reattach or system restart. So an I/O rehandle is
> in need to implement a self-healing mechanism.
>
> This patch series propose a feature called I/O hang. It can rehandle AIOs
> with EIO error without sending error back to guest. From guest's perspective
> of view it is just like an IO is hanging and not returned. Guest can get
> back running smoothly when I/O is recovered with this feature enabled.

What is the problem with setting werror=stop and rerror=stop for the
device? Is it that QEMU won't automatically retry, but management tool
interaction is required to resume the guest?

I haven't checked your patches in detail yet, but implementing this
functionality in the backend means that blk_drain() will hang (or, if it
doesn't hang, it doesn't do what it's supposed to do), making the whole
QEMU process unresponsive until the I/O succeeds again. Among other
things, this would make it impossible to migrate away from a host with
storage problems.

Kevin
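
For illustration only, here is a rough sketch of the management-tool side of the werror=stop/rerror=stop approach mentioned above. It assumes the tool already receives QMP events as parsed JSON; the function name commands_for_event, the device name drive0 and the file name disk.img are hypothetical, and the decision of when to actually retry (for example after a delay, or once the storage is known to be back) is left to the tool.

```python
import json

# Assumed guest configuration (hypothetical file name), for example:
#   -drive file=disk.img,werror=stop,rerror=stop
# With this policy QEMU pauses the guest on an I/O error instead of
# reporting the error to it, and emits a BLOCK_IO_ERROR event on QMP.


def commands_for_event(event):
    """Return the QMP commands a management tool might send for one event.

    This is a pure decision function: it does not talk to QEMU itself.
    """
    if event.get("event") != "BLOCK_IO_ERROR":
        return []
    data = event.get("data", {})
    # Only a guest that was actually stopped by the error policy needs 'cont'.
    if data.get("action") != "stop":
        return []
    # 'cont' resumes the guest; the failed requests are then retried.
    return [{"execute": "cont"}]


# Example event, shaped like a BLOCK_IO_ERROR notification.
event = json.loads(
    '{"event": "BLOCK_IO_ERROR",'
    ' "data": {"device": "drive0", "operation": "write", "action": "stop"}}'
)
print(commands_for_event(event))
```

In this arrangement the retry loop lives outside the QEMU process, so QEMU itself stays responsive while the guest is paused.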