Alex Bennée <alex.ben...@linaro.org> writes:

> Alex Bennée <alex.ben...@linaro.org> writes:
>
>> This is the fourth iteration of the RFC patch set which aims to
>> provide the basic framework for MTTCG. I hope this will provide a
>> good base for discussion at KVM Forum later this month.
>>
> <snip>
>>
>> In practice the memory barrier problems don't show up with an x86
>> host. In fact I have created a tree which merges in Emilio's
>> cmpxchg atomics and which happily boots ARMv7 Debian systems
>> without any additional changes. You can find that at:
>>
>>   https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2
>>
> <snip>
>> Performance
>> ===========
>>
>> You can't do full work-load testing on this tree due to the lack of
>> atomic support (but I will run some numbers on
>> mttcg/base-patches-v4-with-cmpxchg-atomics-v2).
>
> So here is a more real-world workload run:
>
> retry.py called with
> ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm',
>  '-machine', 'type=virt', '-display', 'none', '-smp', '1', '-m',
>  '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444',
>  '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22',
>  '-device', 'virtio-net-device,netdev=unet', '-drive',
>  'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none',
>  '-device', 'virtio-blk-device,drive=myblock', '-append',
>  'console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1',
>  '-kernel',
>  '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img',
>  '-smp', '4', '-name', 'debug-threads=on', '-accel',
>  'tcg,thread=single']
> run 1: ret=0 (PASS), time=261.794911 (1/1)
> run 2: ret=0 (PASS), time=257.290045 (2/2)
> run 3: ret=0 (PASS), time=256.536991 (3/3)
> run 4: ret=0 (PASS), time=254.036260 (4/4)
> run 5: ret=0 (PASS), time=256.539165 (5/5)
> Results summary:
> 0: 5 times (100.00%), avg time 257.239 (8.00 variance/2.83 deviation)
> Ran command 5 times, 5 passes
>
> retry.py called with
> ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm',
>  '-machine', 'type=virt', '-display', 'none', '-smp', '1', '-m',
>  '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444',
>  '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22',
>  '-device', 'virtio-net-device,netdev=unet', '-drive',
>  'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none',
>  '-device', 'virtio-blk-device,drive=myblock', '-append',
>  'console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1',
>  '-kernel',
>  '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img',
>  '-smp', '4', '-name', 'debug-threads=on', '-accel',
>  'tcg,thread=multi']
> run 1: ret=0 (PASS), time=86.597459 (1/1)
> run 2: ret=0 (PASS), time=82.843904 (2/2)
> run 3: ret=0 (PASS), time=84.095910 (3/3)
> run 4: ret=0 (PASS), time=83.844595 (4/4)
> run 5: ret=0 (PASS), time=83.594768 (5/5)
> Results summary:
> 0: 5 times (100.00%), avg time 84.195 (2.02 variance/1.42 deviation)
> Ran command 5 times, 5 passes
>
> This shows about a 30% overhead over ideal 4x scaling when running
> multi-threaded, but still a decent improvement in wall-clock time.
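To make the 30% figure concrete: the ideal 4-vCPU time would be the
single-threaded average divided by four, and the overhead is how far
the measured multi-threaded average falls short of that. A quick
sketch of the arithmetic in shell (numbers taken from the summaries
above):

    #!/bin/sh
    # Compare the measured multi-threaded average against perfect 4x
    # scaling of the single-threaded average (figures from the runs
    # above).
    awk 'BEGIN {
        single = 257.239; multi = 84.195; vcpus = 4
        ideal = single / vcpus
        printf "ideal=%.3fs measured=%.3fs overhead=%.1f%% speedup=%.2fx\n",
            ideal, multi, (multi / ideal - 1) * 100, single / multi
    }'
    # prints: ideal=64.310s measured=84.195s overhead=30.9% speedup=3.06x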
> So the test itself is booting the system and running the
> benchmark-build.service:
>
> # A benchmark target
> #
> # This shuts down once the boot has completed
>
> [Unit]
> Description=Default
> Requires=basic.target
> After=basic.target
> AllowIsolate=yes
>
> [Service]
> Type=oneshot
> ExecStart=/root/mysrc/testcases.git/build-dir.sh /root/src/stress-ng.git/
> ExecStartPost=/sbin/poweroff
>
> [Install]
> WantedBy=multi-user.target
>
> And the build-dir script is simply:
>
> #!/bin/sh
> #
> NR_CPUS=$(grep -c ^processor /proc/cpuinfo)
> set -e
> cd "$1"
> make clean
> make -j${NR_CPUS}
> cd -
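For reproducing runs like the above without the retry.py harness, a
minimal stand-in is easy to sketch in shell (a hypothetical helper,
not the actual tool; it reports per-run wall-clock time but none of
the summary statistics):

    #!/bin/sh
    # bench.sh: time N runs of a command, in the spirit of retry.py.
    # Usage: ./bench.sh 5 qemu-system-arm -machine type=virt ...
    runs=$1; shift
    for i in $(seq "$runs"); do
        start=$(date +%s.%N)
        "$@"
        ret=$?
        printf "run %d: ret=%d, time=%s\n" "$i" "$ret" \
            "$(echo "$(date +%s.%N) - $start" | bc)"
    done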
Measuring this over increasing -smp:

 -smp   time (s)   ideal (smp1/N)   time as bar    x faster
 ------------------------------------------------------------
   1     238.184      238.184       WWWWWWWWWWWW    1.000
   2     133.402      119.092       WWWWWWh         1.785
   3      99.531       79.395       WWWWW           2.393
   4      82.760       59.546       WWWW.           2.878
   5      82.513       47.637       WWWW.           2.887
   6      78.922       39.697       WWWH            3.018
   7      87.181       34.026       WWWW;           2.732
   8      87.098       29.773       WWWW;           2.735

So a more complete analysis shows the benefits start to tail off as
we push past 4 vCPUs. However, on my machine, which has 4 cores plus
4 hyperthreads, that could be just as much an artefact of the host
system. Indeed the results start getting noisy at 7/8 vCPUs.

Interestingly, a perf run against -smp 6 shows gic_update topping the
graph (3.14% of total execution time). That function does have a big
TODO for optimisation on it ;-)
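For reference, a profile like that can be gathered from the host side
with perf (a sketch only; it assumes a QEMU build with symbols, and
the guest options are elided here for brevity):

    # Reuse the guest command line from the runs above and profile the
    # emulator from the host while the benchmark runs, then see which
    # QEMU functions dominate (gic_update in the -smp 6 case).
    QEMU_ARGS="-machine type=virt -smp 6 -accel tcg,thread=multi" # ...etc
    perf record -g -- ./arm-softmmu/qemu-system-arm $QEMU_ARGS
    perf report --sort symbol

--
Alex Bennée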