I've spent the last week debugging an issue that is hitting OpenStack with drive-mirror/block job usage.
The OpenStack automated test system runs a massive integration test suite against every commit to any part of OpenStack. Amongst the many things tested is a live snapshot feature which relies on QEMU's drive-mirror functionality. When this was enabled in the test suite, we started seeing failures in the automated test harness. Specifically, a 'query-block-jobs' monitor command never gets a reply back to libvirt. After 3 minutes of waiting the test harness times out and kills the VM. When working normally the entire test completes in just a couple of seconds, so we don't think the 3 minute timeout is a false positive.

For each commit going into OpenStack the tests trigger about 200 uses of drive-mirror, and we have about a 50% chance of failing on any single commit. So the failure rate of this problem is only around 1 in 400 uses of drive-mirror (roughly one failure every two commits), but that's still enough to totally doom our automated testing rig, since a failing test blocks the commit from merging into git master.

Unfortunately it is a total PITA to debug this. No one has ever managed to reproduce the problem outside of our automated test system environment, even when running the same tests locally, and we can't log into the test systems to get GDB traces or install custom QEMU builds. The best I can do is to collect debug logs from libvirtd and get stdio from QEMU. The QEMU stderr/stdout shows nothing at all.

The libvirtd log shows the following sequence of monitor interactions with QEMU:

1. Libvirt tries to cancel any potential stale jobs:

  msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-9"}
  reply={"id": "libvirt-9", "error": {"class": "DeviceNotActive", "desc": "No active block job on device 'drive-virtio-disk0'"}}
  msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-10"}
  reply={"id": "libvirt-10", "error": {"class": "DeviceNotActive", "desc": "No active block job on device 'drive-virtio-disk0'"}}

2. Libvirt starts the drive-mirror operation:

  msg={"execute":"drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/opt/stack/data/nova/instances/snapshots/tmp79q8xw/9739fea13dfe4e4794f85b57d6572d81.delta","speed":0,"sync":"top","mode":"existing","format":"qcow2"},"id":"libvirt-11"}
  reply={"return": {}, "id": "libvirt-11"}

3. Libvirt checks progress of the block job until offset == len:

  msg={"execute":"query-block-jobs","id":"libvirt-12"}
  reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true, "len": 25165824, "offset": 0, "paused": false, "speed": 0, "type": "mirror"}], "id": "libvirt-12"}
  event={"timestamp": {"seconds": 1404903950, "microseconds": 273185}, "event": "BLOCK_JOB_READY", "data": {"device": "drive-virtio-disk0", "len": 25165824, "offset": 25165824, "speed": 0, "type": "mirror"}}
  msg={"execute":"query-block-jobs","id":"libvirt-13"}
  reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": false, "len": 25165824, "offset": 25165824, "paused": false, "speed": 0, "type": "mirror"}], "id": "libvirt-13"}

4. Libvirt cleans up the block job:

  msg={"execute":"block-job-cancel","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-14"}
  reply={"return": {}, "id": "libvirt-14"}
5. Libvirt waits for cleanup to complete:

  msg={"execute":"query-block-jobs","id":"libvirt-15"}
  reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true, "len": 25165824, "offset": 25165824, "paused": false, "speed": 0, "type": "mirror"}], "id": "libvirt-15"}
  msg={"execute":"query-block-jobs","id":"libvirt-16"}
  reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk0", "busy": true, "len": 25165824, "offset": 25165824, "paused": false, "speed": 0, "type": "mirror"}], "id": "libvirt-16"}
  msg={"execute":"query-block-jobs","id":"libvirt-17"}
  <...hang...>

So we can see this last 'query-block-jobs' command hangs. I've looked at the code for handling this monitor command and I'm struggling to come up with any idea of why it might hang. My best idea was that the bdrv_iterate() call it does might be running at the same time as another thread modifies the list, but debugging on a local QEMU shows no changes to the list at all as a result of drive-mirror/block jobs, so that doesn't seem to be the cause.

Can anyone conceive of a reason why this monitor command might hang? Our test systems run Ubuntu 14.04, which has QEMU version 2.0.0. I see a couple of changes in git master to blockjob.c since v2.0.0, but nothing that claims to be solving a hang.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
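For reference, the query-block-jobs handler I was looking at (blockdev.c in the 2.0-era tree) is roughly the following shape. This is paraphrased from memory rather than copied verbatim, so treat the exact helper names and details as approximate:

  static void do_qmp_query_block_jobs_one(void *opaque, BlockDriverState *bs)
  {
      BlockJobInfoList **prev = opaque;

      /* Only devices with an active block job contribute an entry */
      if (bs->job) {
          BlockJobInfoList *elem = g_new0(BlockJobInfoList, 1);
          elem->value = block_job_query(bs->job);
          (*prev)->next = elem;
          *prev = elem;
      }
  }

  BlockJobInfoList *qmp_query_block_jobs(Error **errp)
  {
      /* Dummy head element so the callback can simply append */
      BlockJobInfoList dummy = {};
      BlockJobInfoList *prev = &dummy;

      /* Walks the global list of BlockDriverState once, synchronously */
      bdrv_iterate(do_qmp_query_block_jobs_one, &prev);

      return dummy.next;
  }

Nothing in there blocks or loops by itself, which is what has me stumped; the only shared state is the global BDS list that bdrv_iterate() walks, hence concurrent modification of that list being my first suspicion.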