On Thu, Jul 31, 2014 at 7:37 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 30/07/2014 19:15, Ming Lei wrote:
>> On Wed, Jul 30, 2014 at 9:45 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>> On 30/07/2014 13:39, Ming Lei wrote:
>>>> This patch introduces several APIs for supporting bypass qemu coroutine
>>>> in case of being not necessary and for performance's sake.
>>>
>>> No, this is wrong.  Dataplane *must* use the same code as non-dataplane,
>>> anything else is a step backwards.
>>
>> As we saw, coroutine has brought up performance regression
>> on dataplane, and it isn't necessary to use co in some cases, is it?
>
> Yes, and it's not necessary on non-dataplane either.  It's not necessary
> on virtio-scsi, and it will not be necessary on virtio-scsi dataplane
> either.
>
>>> If you want to bypass coroutines, bdrv_aio_readv/writev must detect the
>>> conditions that allow doing that and call the bdrv_aio_readv/writev
>>> directly.
>>
>> That is easy to detect, please see the 5th patch.
>
> No, that's not enough.  Dataplane right now prevents block jobs, but
> that's going to change and it could require coroutines even for raw devices.
>
>>> To begin with, have you benchmarked QEMU and can you provide a trace of
>>> *where* the coroutine overhead lies?
>>
>> I guess it may be caused by the stack switch, at least in one of
>> my box, bypassing co can improve throughput by ~7%, and by
>> ~15% in another box.
>
> No guesses please.  Actually that's also my guess, but since you are
> submitting the patch you must do better and show profiles where stack
> switching disappears after the patches.

Below are the hardware events reported by 'perf stat' when running the fio
randread benchmark for 2 minutes in the VM (single vq, 2 jobs):

sudo ~/bin/perf stat -e L1-dcache-loads,L1-dcache-load-misses,cpu-cycles,instructions,branch-instructions,branch-misses,branch-loads,branch-load-misses,dTLB-loads,dTLB-load-misses ./nqemu-start-mq 4 1

1) Without bypassing coroutines, by forcing 's->raw_format' to false
(see patch 5/15)

- throughput: 95K

 Performance counter stats for './nqemu-start-mq 4 1':

    69,231,035,842 L1-dcache-loads                                              [40.10%]
     1,909,978,930 L1-dcache-load-misses  #  2.76% of all L1-dcache hits        [39.98%]
   263,731,501,086 cpu-cycles                                                   [40.03%]
   232,564,905,115 instructions           #  0.88 insns per cycle               [50.23%]
    46,157,868,745 branch-instructions                                          [49.82%]
       785,618,591 branch-misses          #  1.70% of all branches              [49.99%]
    46,280,342,654 branch-loads                                                 [49.95%]
    34,934,790,140 branch-load-misses                                           [50.02%]
    69,447,857,237 dTLB-loads                                                   [40.13%]
       169,617,374 dTLB-load-misses       #  0.24% of all dTLB cache hits       [40.04%]

     161.991075781 seconds time elapsed

2) With bypassing coroutines

- throughput: 115K

 Performance counter stats for './nqemu-start-mq 4 1':

    76,784,224,509 L1-dcache-loads                                              [39.93%]
     1,334,036,447 L1-dcache-load-misses  #  1.74% of all L1-dcache hits        [39.91%]
   262,697,428,470 cpu-cycles                                                   [40.03%]
   255,526,629,881 instructions           #  0.97 insns per cycle               [50.01%]
    50,160,082,611 branch-instructions                                          [49.97%]
       564,407,788 branch-misses          #  1.13% of all branches              [50.08%]
    50,331,510,702 branch-loads                                                 [50.08%]
    35,760,766,459 branch-load-misses                                           [50.03%]
    76,706,000,951 dTLB-loads                                                   [40.00%]
       123,291,001 dTLB-load-misses       #  0.16% of all dTLB cache hits       [40.02%]

     162.333465490 seconds time elapsed
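
To make the two runs easier to compare, here is a small Python sketch that recomputes the derived ratios from the raw counters above (IPC and the L1-dcache, branch, and dTLB miss rates) along with the relative throughput change. The counter values are copied from the two 'perf stat' outputs; the dictionary layout and the names `RUNS`, `derived` and `throughput_gain_pct` are only illustrative and are not part of the patch series or of perf.

```python
# Raw counters copied from the two 'perf stat' runs in this mail.
# "throughput_k" is the reported throughput figure (95K / 115K).
RUNS = {
    "coroutine": {
        "throughput_k": 95,
        "L1-dcache-loads": 69_231_035_842,
        "L1-dcache-load-misses": 1_909_978_930,
        "cpu-cycles": 263_731_501_086,
        "instructions": 232_564_905_115,
        "branch-instructions": 46_157_868_745,
        "branch-misses": 785_618_591,
        "dTLB-loads": 69_447_857_237,
        "dTLB-load-misses": 169_617_374,
    },
    "bypass": {
        "throughput_k": 115,
        "L1-dcache-loads": 76_784_224_509,
        "L1-dcache-load-misses": 1_334_036_447,
        "cpu-cycles": 262_697_428_470,
        "instructions": 255_526_629_881,
        "branch-instructions": 50_160_082_611,
        "branch-misses": 564_407_788,
        "dTLB-loads": 76_706_000_951,
        "dTLB-load-misses": 123_291_001,
    },
}


def derived(counters):
    """Return the ratios that perf prints after '#' for one run."""
    return {
        "ipc": counters["instructions"] / counters["cpu-cycles"],
        "l1d_miss_pct": 100.0 * counters["L1-dcache-load-misses"]
        / counters["L1-dcache-loads"],
        "branch_miss_pct": 100.0 * counters["branch-misses"]
        / counters["branch-instructions"],
        "dtlb_miss_pct": 100.0 * counters["dTLB-load-misses"]
        / counters["dTLB-loads"],
    }


def throughput_gain_pct(base, new):
    """Relative throughput change of 'new' over 'base', in percent."""
    return 100.0 * (new["throughput_k"] / base["throughput_k"] - 1.0)


if __name__ == "__main__":
    for name, counters in RUNS.items():
        ratios = derived(counters)
        print(name, {key: round(value, 2) for key, value in ratios.items()})
    gain = throughput_gain_pct(RUNS["coroutine"], RUNS["bypass"])
    print("throughput gain: %.1f%%" % gain)
```

The recomputed ratios agree with the ones perf printed, and the cycle counts of the two runs are nearly identical while throughput differs by roughly 21%. Note that these are whole-process counters, so they show lower miss rates and higher IPC with the bypass but do not by themselves attribute the difference to stack switching.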