On Thu, Jul 31, 2014 at 5:15 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 31/07/2014 10:59, Ming Lei wrote:
>>> > No guesses please. Actually that's also my guess, but since you are
>>> > submitting the patch you must do better and show profiles where stack
>>> > switching disappears after the patches.
>> Below are the hardware events reported by 'perf stat' when running the
>> fio randread benchmark for 2min in the VM (single vq, 2 jobs):
>>
>> sudo ~/bin/perf stat -e \
>>   L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses \
>>   ./nqemu-start-mq 4 1
>>
>> 1) without bypassing coroutine (by forcing 's->raw_format' to false,
>>    see patch 5/15):
>>
>> - throughput: 95K
>> 232,564,905,115 instructions
>> 161.991075781 seconds time elapsed
>>
>> 2) with bypassing coroutine:
>>
>> - throughput: 115K
>> 255,526,629,881 instructions
>> 162.333465490 seconds time elapsed
>
> Ok, so you are saving 10% of instructions per iop: before, 232G / 95K =
> 2.45M instructions/iop; after, 255G / 115K = 2.22M instructions/iop.
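To make that arithmetic concrete, here is a quick back-of-the-envelope check. It is only a sketch: it assumes the throughput figures are IOPS sustained over the whole measured wall-clock time, which the quoted output does not state explicitly.

# Reproduce the per-I/O instruction comparison from the quoted numbers.
# Assumption: the "95K"/"115K" throughputs are IOPS held for the full run.
runs = {
    "coroutine (1)": (232_564_905_115, 95_000, 161.991),
    "bypass (2)":    (255_526_629_881, 115_000, 162.333),
}

for name, (insns, iops, secs) in runs.items():
    total_ios = iops * secs  # I/O requests completed during the run
    print(f"{name}: {insns / total_ios:,.0f} host instructions per I/O")

# coroutine (1): ~15,100 instructions per I/O
# bypass (2):    ~13,700 instructions per I/O
# That is a ~9-10% saving, matching the 2.45M vs 2.22M ratio above.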
I am wondering how useful that ratio is, since IOPS is measured from the VM's point of view while the instruction count is collected on the host (qemu). If the qemu dataplane handles I/O quickly enough, it can save instructions through batch operations. But I will collect a 'perf report' on the 'instructions' event for you.

Also, with coroutines the L1-dcache-load-miss ratio is increased by 1%, and instructions per cycle is about 10% lower (0.88 vs. 0.97 with bypass), which is significant too: it means qemu becomes noticeably slower when running block I/O for the VM through the coroutine path.
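For reference, here is how those two derived ratios fall out of the raw counters that 'perf stat' prints. This is only a sketch: the cycle and cache counter values are made-up placeholders chosen to reproduce the quoted 0.88/0.97 IPC figures, since the full counter output is not pasted in this mail.

# Derived ratios from raw 'perf stat' counters.
# NOTE: cycle and L1-dcache counts below are placeholders, not real data.
def derived(instructions, cycles, l1_loads, l1_misses):
    return instructions / cycles, l1_misses / l1_loads  # IPC, miss ratio

ipc1, miss1 = derived(232.6e9, 264.3e9, 60e9, 3.0e9)  # coroutine run
ipc2, miss2 = derived(255.5e9, 263.4e9, 66e9, 2.6e9)  # bypass run
print(f"coroutine: IPC={ipc1:.2f}, L1d load-miss ratio={miss1:.1%}")
print(f"bypass:    IPC={ipc2:.2f}, L1d load-miss ratio={miss2:.1%}")

Thanks,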