On Fri, Nov 11, 2011 at 01:03:20PM +0000, Daniel P. Berrange wrote:
> Libvirt recently introduced a change to the way it does 'save to file'
> with QEMU. Historically, QEMU has had a 32MB/s I/O limit on migration
> by default. When saving to file, we didn't want any artificial limit,
> but rather to max out the underlying storage. So when doing save to
> file, we set a very large bandwidth limit (INT64_MAX / (1024 * 1024))
> so it is effectively unlimited.
>
> After doing this, we discovered that the QEMU monitor was becoming
> entirely blocked: it did not even return from the 'migrate' command
> until migration was complete, despite the 'detach' flag being set.
> This was a bug in libvirt, because we passed a plain file descriptor,
> which does not support EAGAIN. Thank you, POSIX.
>
> Libvirt has another mode where it uses an I/O helper command to get
> O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring
> that this pipe FD really does have O_NONBLOCK set, we still saw some
> odd behaviour.
>
> I'm not sure whether what I describe can necessarily be called a QEMU
> bug, but I wanted to raise it for discussion anyway....
>
> The sequence of steps is:
>
>  - libvirt sets the QEMU migration bandwidth to "unlimited"
>  - libvirt opens a pipe() and sets O_NONBLOCK on the write end
>  - libvirt spawns libvirt-iohelper, giving it the target file on
>    disk and the read end of the pipe
>  - libvirt runs the 'getfd migfile' monitor command to give QEMU
>    the write end of the pipe
>  - libvirt runs 'migrate fd:migfile -d' to start the migration
>  - In parallel:
>     - QEMU writes to the pipe (which is non-blocking)
>     - libvirt-iohelper reads the pipe & writes to disk with O_DIRECT
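In code terms, the pipe setup in those steps boils down to something
like the following minimal sketch (illustrative only, not the actual
libvirt code; the function name is invented):

    #include <fcntl.h>
    #include <unistd.h>

    /* Illustrative sketch only, not the actual libvirt code.
     * fds[1] (the write end) is handed to QEMU via 'getfd';
     * fds[0] (the read end) goes to libvirt-iohelper. */
    static int make_migration_pipe(int fds[2])
    {
        int flags;

        if (pipe(fds) < 0)
            return -1;

        /* Make the write end non-blocking, so that when the pipe
         * fills up QEMU's write() fails with EAGAIN instead of
         * stalling the whole monitor. */
        flags = fcntl(fds[1], F_GETFL);
        if (flags < 0 ||
            fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;

        return 0;
    }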
I should have mentioned that the way I'm testing this is with libvirt
0.9.7, with both QEMU 0.14 and QEMU git master, using a guest with 2 GB
of RAM:

  $ virsh start l3
  Domain l3 started

  $ virsh dominfo l3
  Id:              17
  Name:            l3
  UUID:            c7a3edbd-edaf-9455-926a-d65c16db1803
  OS Type:         hvm
  State:           running
  CPU(s):          1
  CPU time:        1.1s
  Max memory:      2292000 kB
  Used memory:     2292736 kB
  Persistent:      yes
  Autostart:       disable
  Managed save:    no
  Security model:  selinux
  Security DOI:    0
  Security label:  system_u:system_r:unconfined_t:s0:c94,c700 (permissive)

To actually perform the save-to-file, I use libvirt's '--bypass-cache'
flag, which ensures we pass a pipe to QEMU and run our I/O helper for
O_DIRECT (sketched below, after the timing data), instead of giving
QEMU a plain file directly:

  $ virsh save --bypass-cache l3 l3.image
  Domain l3 saved to l3.image

> - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms
>
> - Reasonably often, a qemu_savevm_state_iterate() call takes 300-400 ms
>
> - Fairly rarely, a qemu_savevm_state_iterate() call takes 10-20 *seconds*

I use the attached systemtap script to measure these timings, e.g.
running this before starting the migration to disk:

  # stap qemu-mig.stp
  Begin
    0.000 Start
    5.198 > Begin
    5.220 < Begin   0.022
    5.220 > Iterate
    5.224 < Iterate   0.004
  ...snip...
    6.299 > Iterate
    6.314 < Iterate   0.015
    6.314 > Iterate
    6.319 < Iterate   0.005
    6.409 > Iterate
    8.139 < Iterate   1.730    <<< very slow iteration
    8.152 > Iterate
   13.078 < Iterate   4.926    <<< very slow iteration
   13.963 > Iterate
   14.248 < Iterate   0.285
   14.441 > Iterate
   14.448 < Iterate   0.007
  ...snip...
   24.171 > Iterate
   24.178 < Iterate   0.007
   24.178 > Complete
   24.588 < Complete   0.410
  <Ctrl-C>
  avg 79 = sum 8033 / count 101; min 3 max 4926
  value |-------------------------------------------------- count
      0 |                                                    0
      1 |                                                    0
      2 |                                                    1
      4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              74
      8 |@@@@@@@@@                                          19
     16 |@                                                   3
     32 |                                                    0
     64 |                                                    0
    128 |                                                    0
    256 |@                                                   2
    512 |                                                    0
   1024 |                                                    1
   2048 |                                                    0
   4096 |                                                    1
   8192 |                                                    0
  16384 |                                                    0

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
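For reference, the I/O helper side is conceptually just a loop that
drains the pipe into the target file with O_DIRECT. A rough sketch of
the idea (not the actual libvirt-iohelper source; names and sizes are
invented, and real code must also handle short writes and the
unaligned tail of the file):

    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Illustrative sketch of what libvirt-iohelper does: read
     * migration data from pipe_fd and write it to 'path' with
     * O_DIRECT, bypassing the host page cache. */
    static int drain_pipe_to_file(int pipe_fd, const char *path)
    {
        void *buf;
        ssize_t n;
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);

        if (fd < 0)
            return -1;

        /* O_DIRECT requires aligned buffers and I/O sizes; use a
         * 64k buffer in page-aligned memory. */
        if (posix_memalign(&buf, 4096, 65536) != 0) {
            close(fd);
            return -1;
        }

        while ((n = read(pipe_fd, buf, 65536)) > 0) {
            if (write(fd, buf, n) != n) {   /* sketch: no retry */
                n = -1;
                break;
            }
        }

        free(buf);
        close(fd);
        return n == 0 ? 0 : -1;
    }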
The attached qemu-mig.stp:

global then;
global deltas;
global start;

# Print elapsed time since the script started, as seconds.milliseconds
function print_ts(str) {
  now = gettimeofday_ns() / (1000*1000)
  delta = (now - start)

  printf("%3d.%03d %s\n", (delta / 1000), (delta % 1000), str);
}

probe begin {
  printf("Begin\n");
  then = 0;
  start = gettimeofday_ns() / (1000*1000);
  print_ts("Start");
}

# Each entry probe records when a savevm phase starts; the matching
# .return probe computes the elapsed milliseconds, prints them, and
# feeds them into the 'deltas' aggregate for the final histogram.

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin") {
  then = gettimeofday_ns() / (1000*1000);
  print_ts("> Begin");
}
probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin").return {
  now = gettimeofday_ns() / (1000*1000);
  if (then != 0) {
    delta = now - then;
    deltas <<< delta;
    print_ts(sprintf("< Begin %3d.%03d", (delta / 1000), (delta % 1000)));
  }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate") {
  then = gettimeofday_ns() / (1000*1000);
  print_ts("> Iterate");
}
probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate").return {
  now = gettimeofday_ns() / (1000*1000);
  if (then != 0) {
    delta = now - then;
    deltas <<< delta;
    print_ts(sprintf("< Iterate %3d.%03d", (delta / 1000), (delta % 1000)));
  }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete") {
  then = gettimeofday_ns() / (1000*1000);
  print_ts("> Complete");
}
probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete").return {
  now = gettimeofday_ns() / (1000*1000);
  if (then != 0) {
    delta = now - then;
    deltas <<< delta;
    print_ts(sprintf("< Complete %3d.%03d", (delta / 1000), (delta % 1000)));
  }
}

# On Ctrl-C, print summary statistics and a log2 histogram of all
# recorded phase durations (in milliseconds).
probe end {
  printf("avg %d = sum %d / count %d; min %d max %d\n",
         @avg(deltas), @sum(deltas), @count(deltas),
         @min(deltas), @max(deltas));
  print(@hist_log(deltas));
}