On 2020/10/27 0:53, Stefan Hajnoczi wrote:
> On Thu, Oct 22, 2020 at 09:02:54PM +0800, Jiahui Cen wrote:
>> A VM in the cloud environment may use a virtual disk as the backend storage,
>> and there are usually filesystems on the virtual block device. When backend
>> storage is temporarily down, any I/O issued to the virtual block device will
>> cause an error. For example, an error occurred in ext4 filesystem would make
>> the filesystem readonly. However a cloud backend storage can be soon
>> recovered.
>> For example, an IP-SAN may be down due to network failure and will be online
>> soon after network is recovered. The error in the filesystem may not be
>> recovered unless a device reattach or system restart. So an I/O rehandle is
>> in need to implement a self-healing mechanism.
>>
>> This patch series propose a feature called I/O hang. It can rehandle AIOs
>> with EIO error without sending error back to guest. From guest's perspective
>> of view it is just like an IO is hanging and not returned. Guest can get
>> back running smoothly when I/O is recovered with this feature enabled.
>
> Hi,
> This feature seems like an extension of the existing -drive
> rerror=/werror= parameters:
>
>   werror=action,rerror=action
>       Specify which action to take on write and read errors. Valid
>       actions are: "ignore" (ignore the error and try to continue),
>       "stop" (pause QEMU), "report" (report the error to the guest),
>       "enospc" (pause QEMU only if the host disk is full; report the
>       error to the guest otherwise). The default setting is
>       werror=enospc and rerror=report.
>
> That mechanism already has a list of requests to retry and live
> migration integration. Using the werror=/rerror= mechanism would avoid
> code duplication between these features. You could add a
> werror/rerror=retry error action for this feature.
>
> Does that sound good?
>
> Stefan
>

Hi Stefan,

Thanks for your reply.

Extending the rerror=/werror= mechanism is a feasible way to implement the
retry feature. However, AFAIK, the rerror=/werror= mechanism in the
block-backend layer only provides the ACTION; the actual error handler has
to be implemented separately in the device layer for each type of device.
Our I/O hang mechanism, by contrast, handles AIO errors directly, regardless
of the device type.

Would it be more general to implement the feature in the block-backend
layer? In particular, we could set the retry timeout in a common structure,
BlockBackend.

Besides, is there any reason that QEMU implements the rerror=/werror=
mechanism in the device layer rather than in the block-backend layer?

Jiahui
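
To make the idea of a device-independent retry with a common timeout more concrete, here is a minimal sketch in plain C. It is not QEMU code: RetryPolicy, FlakyDisk, flaky_io() and submit_with_retry() are hypothetical names invented for illustration, and the clock is simulated by adding the retry interval instead of arming a timer. It only shows the shape of the logic: an EIO result is retried in one common place until a timeout stored in a shared structure expires, and only then is the error passed on.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for retry settings kept in one common structure.
 * This is NOT QEMU's BlockBackend; the field names are assumptions. */
typedef struct RetryPolicy {
    int64_t retry_interval_ms;  /* wait between two attempts */
    int64_t retry_timeout_ms;   /* total time budget before giving up */
} RetryPolicy;

/* Generic I/O callback: returns 0 on success or a negative errno. */
typedef int (*IoFn)(void *opaque);

/* Hypothetical backend that fails with EIO a fixed number of times. */
typedef struct FlakyDisk {
    int failures_left;
    int calls;
} FlakyDisk;

static int flaky_io(void *opaque)
{
    FlakyDisk *d = opaque;

    d->calls++;
    if (d->failures_left > 0) {
        d->failures_left--;
        return -EIO;
    }
    return 0;
}

/* Retry a request on EIO until it succeeds or the time budget is used up.
 * The caller (the "device") never sees the intermediate EIO results.
 * Time is simulated: each retry advances `elapsed` by the interval. */
static int submit_with_retry(const RetryPolicy *p, IoFn fn, void *opaque)
{
    int64_t elapsed = 0;

    assert(p->retry_interval_ms > 0);  /* otherwise the loop may never end */

    for (;;) {
        int ret = fn(opaque);

        if (ret != -EIO) {
            return ret;                /* success, or an error we do not retry */
        }
        if (elapsed + p->retry_interval_ms > p->retry_timeout_ms) {
            return ret;                /* budget exhausted: report the EIO */
        }
        elapsed += p->retry_interval_ms;
    }
}
```

With a 1000 ms interval and a 5000 ms timeout, a backend that recovers after a few failures completes successfully from the caller's point of view, while one that stays down eventually gets its EIO reported. A real implementation would also need to cope with request ordering, drain and live migration, which this sketch ignores.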