Hi! I was wondering why qemu-img consumes so much CPU when it converts
one partially allocated qcow2 image to another qcow2 image and I've
written a patch which improves the situation a little.

I have an image:
> $ qemu-img info ubuntu.10.04.qcow2
> image: ubuntu.10.04.qcow2
> file format: qcow2
> virtual size: 20G (21474836480 bytes)
> disk size: 2.2G
> cluster_size: 65536

I create a new copy on write image:
> $ qemu-img create -f qcow2 -o backing_file=ubuntu.10.04.qcow2 volume.qcow2 
> 100G
... and use it for a while.

Then I want to create a non-copy on write image from it to send it to someone:
> qemu-img convert -O qcow2 volume.qcow2 snapshot.qcow2


The last operation consumes a lot of CPU, so I run qemu-img under
profiler and realized, that most of CPU time is consumed by
is_not_zero() function. I had made a couple of optimizations on it and
got the following output for `time qemu-img convert -O qcow2
volume.qcow2 snapshot.qcow2`:

x86_64 machine:

Original qemu-img:
real 0m56.159s
user 0m34.670s
sys  0m12.079s

Patched qemu-img:
real 0m34.805s
user 0m18.445s
sys  0m12.552s


x86 machine:

Original qemu-img:
real 1m13.991s
user 0m24.734s
sys  0m6.604s

Patched qemu-img:
real 1m6.898s
user 0m16.021s
sys  0m6.700s

Please, see on the consumed user CPU time. I think that the
optimization worth it, so it will be awesome if you accept this patch
(see the attachment).

Thanks for your attention.
From 61d228c0ea0d518de48a08577cd6d282e2f97759 Mon Sep 17 00:00:00 2001
From: Dmitry Konishchev <konishc...@gmail.com>
Date: Tue, 17 May 2011 16:29:48 +0400
Subject: [PATCH] is_not_zero() optimization in qemu-img

---
 qemu-img.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index e825123..41b4e32 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -498,12 +498,30 @@ static int img_commit(int argc, char **argv)
 
 static int is_not_zero(const uint8_t *sector, int len)
 {
+    /*
+     * Use long as the biggest available internal data type that fits into the
+     * CPU register and unroll the loop to smooth out the effect of memory
+     * latency.
+     */
+
     int i;
-    len >>= 2;
-    for(i = 0;i < len; i++) {
-        if (((uint32_t *)sector)[i] != 0)
+    len /= sizeof(long);
+
+    long d0;
+    long d1;
+    long d2;
+    long d3;
+
+    for(i = 0; i < len; i += 4) {
+        d0 = ((const long*) sector)[i + 0];
+        d1 = ((const long*) sector)[i + 1];
+        d2 = ((const long*) sector)[i + 2];
+        d3 = ((const long*) sector)[i + 3];
+
+        if (d0 || d1 || d2 || d3)
             return 1;
     }
+
     return 0;
 }
 
-- 
1.7.4.1

Reply via email to