On Wed, Oct 11, 2017 at 08:13:10PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilb...@redhat.com>
>
> Hi,
>   This set attempts to make a race condition between migration and
> drive-mirror (and other block users) soluble by allowing the migration
> to be paused after the source qemu releases the block devices but
> before the serialisation of the device state.
>
> The symptom of this failure, as reported by Wangjie, is a:
>    _co_do_pwritev: Assertion `!(bs->open_flags & 0x0800)' failed
>
> and the source qemu dieing; so the problem is pretty nasty.
> This has only been seen on 2.9 onwards, but the theory is that
> prior to 2.9 it might have been happening anyway and we were
> perhaps getting unreported corruptions (lost writes); so this
> really needs fixing.
>
> This flow came from discussions between Kevin and me, and we can't
> see a way of fixing it without exposing a new state to the management
> layer.
>
> The flow is now:
>
> (qemu) migrate_set_capability pause-before-device on
> (qemu) migrate -d ...
> (qemu) info migrate
> ...
> Migration status: pause-before-device
> ...
> << issue commands to clean up any block jobs>>
>
> (qemu) migrate_continue pause-before-device
> (qemu) info migrate
> ...
> Migration status: completed

I'm curious why QEMU doesn't have enough info to clean up the block
jobs automatically? What is the key thing that libvirt knows about
the block jobs that QEMU is lacking?

If QEMU had the right info it could do it automatically & avoid this
extra lock-step synchronization with libvirt.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|