Most importantly, start using a dedicated IO thread for the state file when doing a live snapshot.
Having the state file be in the iohandler context means that a
blk_drain_all() call in the main thread or a vCPU thread that happens
while the snapshot is running will result in a deadlock. This change
should also help in general to reduce load on the main thread and to
avoid it getting stuck on IO, i.e. the same benefits as using a
dedicated IO thread for regular drives. This is particularly
interesting when the VM state storage is network storage like NFS.

With some luck, it could also help with bug #6262 [0]. The failure
there happens while issuing/right after the savevm-start QMP command,
so the coroutine most likely involved is process_savevm_co(), which was
previously scheduled to the iohandler context. Likely something polls
the iohandler context and wants to enter the already scheduled
coroutine, leading to the abort():

> qemu_aio_coroutine_enter: Co-routine was already scheduled in
> 'aio_co_schedule'

With a dedicated iothread, there hopefully is no such race.

Additionally, fix up some edge cases in error handling and in setting
the state of the snapshot operation.

[0]: https://bugzilla.proxmox.com/show_bug.cgi?id=6262

Fiona Ebner (6):
  savevm-async: improve setting state of snapshot operation in
    savevm-end handler
  savevm-async: rename saved_vm_running to vm_needs_start
  savevm-async: improve runstate preservation
  savevm-async: cleanup error handling in savevm_start
  savevm-async: use dedicated iothread for state file
  savevm-async: treat failure to set iothread context as a hard failure

 migration/savevm-async.c | 119 +++++++++++++++++++++++----------------
 1 file changed, 69 insertions(+), 50 deletions(-)

-- 
2.39.5
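
P.S. For readers less familiar with QEMU's AioContext model, the idea
is roughly along the lines of the sketch below. This is a simplified
illustration only, not the actual diff: the helper name and error
handling are made up here, and header paths as well as the exact
signatures of iothread_create()/blk_set_aio_context() differ between
QEMU versions.

/*
 * Simplified illustration only -- not the actual diff. The helper name
 * and error handling are made up for this sketch; header paths and the
 * exact signatures of iothread_create()/blk_set_aio_context() differ
 * between QEMU versions.
 */
#include "qemu/osdep.h"
#include "qapi/error.h"
#include "qemu/coroutine.h"
#include "block/aio.h"
#include "sysemu/iothread.h"
#include "sysemu/block-backend.h"

static IOThread *savevm_iothread;

static int savevm_state_use_iothread(BlockBackend *state_blk,
                                     Coroutine *co, Error **errp)
{
    AioContext *ctx;

    /* Dedicated thread with its own AioContext for the state file, so
     * a blk_drain_all() from the main or a vCPU thread no longer
     * deadlocks against the snapshot writer. */
    savevm_iothread = iothread_create("savevm-state", errp);
    if (!savevm_iothread) {
        return -1;
    }
    ctx = iothread_get_aio_context(savevm_iothread);

    /* Bind the state file's BlockBackend to that context. Failure is
     * treated as a hard error (see the last patch of the series). */
    if (blk_set_aio_context(state_blk, ctx, errp) < 0) {
        iothread_destroy(savevm_iothread);
        savevm_iothread = NULL;
        return -1;
    }

    /* Run the save coroutine in the iothread instead of the iohandler
     * context, so nothing else polls and enters it concurrently. */
    aio_co_schedule(ctx, co);
    return 0;
}

Here, co would be the coroutine created for process_savevm_co(); the
point is simply that it runs in the dedicated iothread's context rather
than in the iohandler context shared with the main loop.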