Hi! I was wondering why qemu-img consumes so much CPU when it converts one partially allocated qcow2 image to another qcow2 image and I've written a patch which improves the situation a little.
I have an image: > $ qemu-img info ubuntu.10.04.qcow2 > image: ubuntu.10.04.qcow2 > file format: qcow2 > virtual size: 20G (21474836480 bytes) > disk size: 2.2G > cluster_size: 65536 I create a new copy on write image: > $ qemu-img create -f qcow2 -o backing_file=ubuntu.10.04.qcow2 volume.qcow2 > 100G ... and use it for a while. Then I want to create a non-copy on write image from it to send it to someone: > qemu-img convert -O qcow2 volume.qcow2 snapshot.qcow2 The last operation consumes a lot of CPU, so I run qemu-img under profiler and realized, that most of CPU time is consumed by is_not_zero() function. I had made a couple of optimizations on it and got the following output for `time qemu-img convert -O qcow2 volume.qcow2 snapshot.qcow2`: x86_64 machine: Original qemu-img: real 0m56.159s user 0m34.670s sys 0m12.079s Patched qemu-img: real 0m34.805s user 0m18.445s sys 0m12.552s x86 machine: Original qemu-img: real 1m13.991s user 0m24.734s sys 0m6.604s Patched qemu-img: real 1m6.898s user 0m16.021s sys 0m6.700s Please, see on the consumed user CPU time. I think that the optimization worth it, so it will be awesome if you accept this patch (see the attachment). Thanks for your attention.
From 61d228c0ea0d518de48a08577cd6d282e2f97759 Mon Sep 17 00:00:00 2001 From: Dmitry Konishchev <konishc...@gmail.com> Date: Tue, 17 May 2011 16:29:48 +0400 Subject: [PATCH] is_not_zero() optimization in qemu-img --- qemu-img.c | 24 +++++++++++++++++++++--- 1 files changed, 21 insertions(+), 3 deletions(-) diff --git a/qemu-img.c b/qemu-img.c index e825123..41b4e32 100644 --- a/qemu-img.c +++ b/qemu-img.c @@ -498,12 +498,30 @@ static int img_commit(int argc, char **argv) static int is_not_zero(const uint8_t *sector, int len) { + /* + * Use long as the biggest available internal data type that fits into the + * CPU register and unroll the loop to smooth out the effect of memory + * latency. + */ + int i; - len >>= 2; - for(i = 0;i < len; i++) { - if (((uint32_t *)sector)[i] != 0) + len /= sizeof(long); + + long d0; + long d1; + long d2; + long d3; + + for(i = 0; i < len; i += 4) { + d0 = ((const long*) sector)[i + 0]; + d1 = ((const long*) sector)[i + 1]; + d2 = ((const long*) sector)[i + 2]; + d3 = ((const long*) sector)[i + 3]; + + if (d0 || d1 || d2 || d3) return 1; } + return 0; } -- 1.7.4.1