Hi Kevin,

On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <kw...@redhat.com> wrote:
> Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
>> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
>> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <kw...@redhat.com> wrote:
>> > > Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
>> > >> Hi,
>> > >>
>> > >> These patches bring up below 4 changes:
>> > >> - introduce object allocation pool and apply it to
>> > >>   virtio-blk dataplane for improving its performance
>> > >>
>> > >> - introduce selective coroutine bypass mechanism
>> > >>   for improving performance of virtio-blk dataplane with
>> > >>   raw format image
>> > >
>> > > Before applying any bypassing patches, I think we should understand in
>> > > detail where we are losing performance with coroutines enabled.
>> >
>> > From the below profiling data, CPU becomes slow to run instructions
>> > with coroutine, and CPU dcache miss is increased so it is very
>> > likely caused by switching stack frequently.
>> >
>> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
>> >
>> > http://pastebin.com/ae0vnQ6V
>>
>> I have been wondering how to prove that the root cause is the ucontext
>> coroutine mechanism (stack switching). Here is an idea:
>>
>> Hack your "bypass" code path to run the request inside a coroutine.
>> That way you can compare "bypass without coroutine" against "bypass with
>> coroutine".
>>
>> Right now I think there are doubts because the bypass code path is
>> indeed a different (and not 100% correct) code path. So this approach
>> might prove that the coroutines are adding the overhead and not
>> something that you bypassed.
>
> My doubts aren't only that the overhead might not come from the
> coroutines, but also whether any coroutine-related overhead is really
> unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> just that instead of introducing additional code paths.
OK, thank you for taking a look at the problem, and I hope we can figure
out the root cause. :-)

> Another thought I had was this: If the performance difference is indeed
> only coroutines, then that is completely inside the block layer and we
> don't actually need a VM to test it. We could instead have something
> like a simple qemu-img based benchmark and should be observing the same.

Yes, a coroutine-only benchmark is even simpler to run, so I wrote a rough
one, and it looks like coroutines do decrease performance a lot. Please see
the attached patch, and thanks for your template, which helped me add the
'co_bench' command to qemu-img.

From the profiling data in the link below:

http://pastebin.com/YwH2uwbq

with coroutines, the running time for the same workload is increased by
~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
(693M vs. 512M), and instructions per cycle are decreased by ~17%
(1.35 vs. 1.63), compared with bypassing coroutines (the -b parameter).

The bypass code in the benchmark is very similar to the approach used in
the bypass patch, since linux-aio with O_DIRECT seldom blocks in the
kernel I/O path.

Maybe the benchmark is a bit extreme, but given that modern storage
devices may reach millions of IOPS, it is very easy for coroutines to
slow down the I/O.

> I played a bit with the following, I hope it's not too naive. I couldn't
> see a difference with your patches, but at least one reason for this is
> probably that my laptop SSD isn't fast enough to make the CPU the
> bottleneck. Haven't tried ramdisk yet, that would probably be the next
> thing. (I actually wrote the patch up just for some profiling on my own,
> not for comparing throughput, but it should be usable for that as well.)

This might not be a good test, since it is basically a sequential read
test, which the kernel can optimize a lot; I always use a randread
benchmark instead.

Thanks,
diff --git a/Makefile b/Makefile
index d6b9dc1..a59523c 100644
--- a/Makefile
+++ b/Makefile
@@ -211,7 +211,7 @@ util/module.o-cflags = -D'CONFIG_BLOCK_MODULES=$(block-modules)'
 
 qemu-img.o: qemu-img-cmds.h
 
-qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
+qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a -lcrypt
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a
 
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ae64b3d..7601b9a 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -15,6 +15,12 @@ STEXI
 @item bench [-q] [-f @var{fmt]} [-n] [-t @var{cache}] filename
 ETEXI
 
+DEF("co_bench", co_bench,
+    "co_bench -c count -q -b")
+STEXI
+@item co_bench [-c] @var{count} [-b] [-q]
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 92e9529..d73b171 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -366,6 +366,106 @@ static int add_old_style_options(const char *fmt, QemuOpts *opts,
     return 0;
 }
 
+struct crypt_data {
+    unsigned long sum;
+    bool bypass;
+};
+
+static unsigned long crypt_and_sum(const char *key, const char *salt)
+{
+    char *enc = crypt(key, salt);
+    int len = strlen(enc);
+    int i;
+    unsigned long sum = 0;
+
+    for (i = 0; i < len; i++) {
+        sum += enc[i];
+    }
+
+    return sum;
+}
+
+static void gen_key(char *key, int len)
+{
+    char set[] = {
+        '0', '1', '2', '3', '4', '5', '6', '7', '8',
+        'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
+        'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
+        's', 't', 'w', 'x', 'y', 'z', '_', '-', ' ',
+        '*', '$', '#', '%',
+    };
+    int cnt = sizeof(set) / sizeof(set[0]);
+    int i;
+
+    for (i = 0; i < len; i++) {
+        key[i] = set[rand() % cnt];
+    }
+}
+
+static void crypt_bench(void *opaque)
+{
+    struct crypt_data *data = opaque;
+    const int len = 8;
+    char key1[8];
+    char key2[8];
+    char salt[3];
+
+    gen_key(key1, len);
+    gen_key(key2, len);
+    salt[0] = key1[0];
+    salt[1] = key2[7];
+    salt[2] = '\0';
+
+    data->sum += crypt_and_sum(key1, salt);
+    data->sum += crypt_and_sum(key2, salt);
+
+    if (!data->bypass) {
+        qemu_coroutine_yield();
+    }
+}
+
+static int co_bench(int argc, char **argv)
+{
+    int c;
+    bool bypass = false;
+    unsigned long cnt = 1;
+    int num = 1;
+    unsigned long i;
+    struct crypt_data data = {
+        .sum = 0,
+        .bypass = bypass,
+    };
+
+    for (;;) {
+        c = getopt(argc, argv, "bc:q");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case 'b':
+            bypass = true;
+            break;
+        case 'c':
+            num = atoi(optarg);
+            break;
+        }
+    }
+
+    data.bypass = bypass;
+
+    srand((unsigned int)(uintptr_t)&i);
+    srand(rand());
+    for (i = 0; i < num * cnt; i++) {
+        Coroutine *co;
+        if (!data.bypass) {
+            co = qemu_coroutine_create(crypt_bench);
+            qemu_coroutine_enter(co, &data);
+        } else {
+            crypt_bench(&data);
+        }
+    }
+
+    return (int)data.sum;
+}
+
 static int img_create(int argc, char **argv)
 {
     int c;