Hi Kevin,

On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
>> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
>> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> > > Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
>> > >> Hi,
>> > >>
>> > >> These patches bring up below 4 changes:
>> > >>         - introduce object allocation pool and apply it to
>> > >>         virtio-blk dataplane for improving its performance
>> > >>
>> > >>         - introduce selective coroutine bypass mechanism
>> > >>         for improving performance of virtio-blk dataplane with
>> > >>         raw format image
>> > >
>> > > Before applying any bypassing patches, I think we should understand in
>> > > detail where we are losing performance with coroutines enabled.
>> >
>> > From the below profiling data, CPU becomes slow to run instructions
>> > with coroutine, and CPU dcache miss is increased so it is very
>> > likely caused by switching stack frequently.
>> >
>> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
>> >
>> > http://pastebin.com/ae0vnQ6V
>>
>> I have been wondering how to prove that the root cause is the ucontext
>> coroutine mechanism (stack switching).  Here is an idea:
>>
>> Hack your "bypass" code path to run the request inside a coroutine.
>> That way you can compare "bypass without coroutine" against "bypass with
>> coroutine".
>>
>> Right now I think there are doubts because the bypass code path is
>> indeed a different (and not 100% correct) code path.  So this approach
>> might prove that the coroutines are adding the overhead and not
>> something that you bypassed.
>
> My doubts aren't only that the overhead might not come from the
> coroutines, but also whether any coroutine-related overhead is really
> unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> just that instead of introducing additional code paths.

OK, thank you for taking a look at the problem. I hope we can
figure out the root cause. :-)

>
> Another thought I had was this: If the performance difference is indeed
> only coroutines, then that is completely inside the block layer and we
> don't actually need a VM to test it. We could instead have something
> like a simple qemu-img based benchmark and should be observing the same.

It is even simpler to run a coroutine-only benchmark, so I wrote
a rough one, and it looks like coroutines do decrease performance
a lot; please see the attached patch. Thanks for your template,
which helped me add the 'co_bench' command to qemu-img.

From the profiling data in the link below:

    http://pastebin.com/YwH2uwbq

With coroutines, the running time for the same workload increases
by ~50% (1.325s vs. 0.903s), dcache load events increase by
~35% (693M vs. 512M), and instructions per cycle decrease by ~17%
(1.35 vs. 1.63), compared with bypassing coroutines (the -b parameter).

The bypass code in the benchmark is very similar to the approach
used in the bypass patch, since linux-aio with O_DIRECT seldom
blocks in the kernel I/O path.

Maybe the benchmark is a bit extreme, but given that modern storage
devices may reach millions of IOPS, it is very easy for coroutines
to slow down the I/O.

> I played a bit with the following, I hope it's not too naive. I couldn't
> see a difference with your patches, but at least one reason for this is
> probably that my laptop SSD isn't fast enough to make the CPU the
> bottleneck. Haven't tried ramdisk yet, that would probably be the next
> thing. (I actually wrote the patch up just for some profiling on my own,
> not for comparing throughput, but it should be usable for that as well.)

This might not be good for the test since it is basically a sequential
read test, which can be optimized a lot by the kernel. I always use
a randread benchmark instead.
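For reference, the kind of randread job I mean could look like the following fio sketch (my guess at reasonable settings; the filename, block size, and queue depth are placeholders, not values from this thread):

```ini
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=32
runtime=30
time_based=1

[randread-job]
rw=randread
filename=/dev/vdb
```

With direct=1 and random offsets, the kernel readahead and page cache effects that help a sequential test mostly disappear, so the per-request CPU cost (including any coroutine overhead) dominates sooner.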


Thanks,
diff --git a/Makefile b/Makefile
index d6b9dc1..a59523c 100644
--- a/Makefile
+++ b/Makefile
@@ -211,7 +211,7 @@ util/module.o-cflags = -D'CONFIG_BLOCK_MODULES=$(block-modules)'
 
 qemu-img.o: qemu-img-cmds.h
 
-qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
+qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a -lcrypt
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a
 
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ae64b3d..7601b9a 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -15,6 +15,12 @@ STEXI
 @item bench [-q] [-f @var{fmt]} [-n] [-t @var{cache}] filename
 ETEXI
 
+DEF("co_bench", co_bench,
+    "co_bench -c count -q -b")
+STEXI
+@item co_bench [-c] @var{count} [-b] [-q]
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt]  [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 92e9529..d73b171 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -366,6 +366,106 @@ static int add_old_style_options(const char *fmt, QemuOpts *opts,
     return 0;
 }
 
+struct crypt_data {
+    unsigned long sum;
+    bool bypass;
+};
+
+static unsigned long crypt_and_sum(const char *key, const char *salt)
+{
+    char *enc = crypt(key, salt);
+    int len = strlen(enc);
+    int i;
+    unsigned long sum = 0;
+
+    for (i = 0; i < len; i++)
+        sum += enc[i];
+
+    return sum;
+}
+
+static void gen_key(char *key, int len)
+{
+    char set[] = {
+        '0', '1', '2', '3', '4', '5', '6', '7', '8',
+        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
+        'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
+        's', 't', 'w', 'x', 'y', 'z', '_', '-', ' ',
+        '*', '$', '#', '%',
+    };
+    int cnt = sizeof(set) / sizeof(set[0]);
+    int i;
+
+    for (i = 0; i < len; i++) {
+        key[i] = set[rand() % cnt];
+    }
+}
+
+static void crypt_bench(void *opaque)
+{
+    struct crypt_data *data = opaque;
+    const int len = 8;
+    char key1[8];
+    char key2[8];
+    char salt[3];
+
+    gen_key(key1, len);
+    gen_key(key2, len);
+    salt[0] = key1[0];
+    salt[1] = key2[7];
+    salt[2] = '0';
+
+    data->sum += crypt_and_sum(key1, salt);
+    data->sum += crypt_and_sum(key2, salt);
+
+    if (!data->bypass) {
+        qemu_coroutine_yield();
+    }
+}
+
+static int co_bench(int argc, char **argv)
+{
+    int c;
+    bool bypass = false;
+    unsigned long cnt = 1;
+    int num = 1;
+    unsigned long i;
+    struct crypt_data data = {
+        .sum = 0,
+        .bypass = bypass,
+    };
+
+    for (;;) {
+        c = getopt(argc, argv, "bc:q");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case 'b':
+            bypass = true;
+            break;
+        case 'c':
+            num = atoi(optarg);
+            break;
+        }
+    }
+
+    data.bypass = bypass;
+
+    srand((unsigned int)(uintptr_t)&i);
+    srand(rand());
+    for (i = 0; i < num * cnt; i++) {
+        Coroutine *co;
+        if (!data.bypass) {
+            co = qemu_coroutine_create(crypt_bench);
+            qemu_coroutine_enter(co, &data);
+        } else {
+            crypt_bench(&data);
+        }
+    }
+    return (int)data.sum;
+}
+
 static int img_create(int argc, char **argv)
 {
     int c;
