Alex Bennée <alex.ben...@linaro.org> writes:

> Alex Bennée <alex.ben...@linaro.org> writes:
>
>> This is the fourth iteration of the RFC patch set which aims to
>> provide the basic framework for MTTCG. I hope this will provide a
>> good base for discussion at KVM Forum later this month.
>>
> <snip>
>>
>> In practice the memory barrier problems don't show up with an x86
>> host. In fact I have created a tree which merges in Emilio's
>> cmpxchg atomics and which happily boots ARMv7 Debian systems
>> without any additional changes. You can find that at:
>>
>>   https://github.com/stsquad/qemu/tree/mttcg/base-patches-v4-with-cmpxchg-atomics-v2
>>
> <snip>
>> Performance
>> ===========
>>
>> You can't do full work-load testing on this tree due to the lack of
>> atomic support (but I will run some numbers on
>> mttcg/base-patches-v4-with-cmpxchg-atomics-v2).
>
> So here is a more real-world workload run:
>
> retry.py called with
> ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm',
>  '-machine', 'type=virt', '-display', 'none', '-smp', '1', '-m',
>  '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444',
>  '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22',
>  '-device', 'virtio-net-device,netdev=unet', '-drive',
>  'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none',
>  '-device', 'virtio-blk-device,drive=myblock', '-append',
>  'console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1',
>  '-kernel',
>  '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img',
>  '-smp', '4', '-name', 'debug-threads=on', '-accel',
>  'tcg,thread=single']
> run 1: ret=0 (PASS), time=261.794911 (1/1)
> run 2: ret=0 (PASS), time=257.290045 (2/2)
> run 3: ret=0 (PASS), time=256.536991 (3/3)
> run 4: ret=0 (PASS), time=254.036260 (4/4)
> run 5: ret=0 (PASS), time=256.539165 (5/5)
> Results summary:
> 0: 5 times (100.00%), avg time 257.239 (8.00 variance/2.83 deviation)
> Ran command 5 times, 5 passes
>
> retry.py called with
> ['/home/alex/lsrc/qemu/qemu.git/arm-softmmu/qemu-system-arm',
>  '-machine', 'type=virt', '-display', 'none', '-smp', '1', '-m',
>  '4096', '-cpu', 'cortex-a15', '-serial', 'telnet:127.0.0.1:4444',
>  '-monitor', 'stdio', '-netdev', 'user,id=unet,hostfwd=tcp::2222-:22',
>  '-device', 'virtio-net-device,netdev=unet', '-drive',
>  'file=/home/alex/lsrc/qemu/images/jessie-arm32.qcow2,id=myblock,index=0,if=none',
>  '-device', 'virtio-blk-device,drive=myblock', '-append',
>  'console=ttyAMA0 systemd.unit=benchmark-build.service root=/dev/vda1',
>  '-kernel',
>  '/home/alex/lsrc/qemu/images/aarch32-current-linux-kernel-only.img',
>  '-smp', '4', '-name', 'debug-threads=on', '-accel',
>  'tcg,thread=multi']
> run 1: ret=0 (PASS), time=86.597459 (1/1)
> run 2: ret=0 (PASS), time=82.843904 (2/2)
> run 3: ret=0 (PASS), time=84.095910 (3/3)
> run 4: ret=0 (PASS), time=83.844595 (4/4)
> run 5: ret=0 (PASS), time=83.594768 (5/5)
> Results summary:
> 0: 5 times (100.00%), avg time 84.195 (2.02 variance/1.42 deviation)
> Ran command 5 times, 5 passes
>
> This shows about a 30% overhead over ideal 4x scaling when running
> multi-threaded, but still a decent improvement in wall-clock time.
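To make the 30% figure concrete: the ideal 4-vCPU time would be the
single-threaded average divided by four, and the overhead is how far
the measured multi-threaded average falls short of that. A quick
sketch of the arithmetic in shell (numbers taken from the summaries
above):

    #!/bin/sh
    # Compare the measured multi-threaded average against perfect 4x
    # scaling of the single-threaded average (figures from the runs
    # above).
    awk 'BEGIN {
        single = 257.239; multi = 84.195; vcpus = 4
        ideal = single / vcpus
        printf "ideal=%.3fs measured=%.3fs overhead=%.1f%% speedup=%.2fx\n",
            ideal, multi, (multi / ideal - 1) * 100, single / multi
    }'
    # prints: ideal=64.310s measured=84.195s overhead=30.9% speedup=3.06x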
> So the test itself is booting the system and running the
> benchmark-build.service:
>
> # A benchmark target
> #
> # This shuts down once the boot has completed
>
> [Unit]
> Description=Default
> Requires=basic.target
> After=basic.target
> AllowIsolate=yes
>
> [Service]
> Type=oneshot
> ExecStart=/root/mysrc/testcases.git/build-dir.sh /root/src/stress-ng.git/
> ExecStartPost=/sbin/poweroff
>
> [Install]
> WantedBy=multi-user.target
>
> And the build-dir script is simply:
>
> #!/bin/sh
> #
> NR_CPUS=$(grep -c ^processor /proc/cpuinfo)
> set -e
> cd "$1"
> make clean
> make -j${NR_CPUS}
> cd -
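For reproducing runs like the above without the retry.py harness, a
minimal stand-in is easy to sketch in shell (a hypothetical helper,
not the actual tool; it reports per-run wall-clock time but none of
the summary statistics):

    #!/bin/sh
    # bench.sh: time N runs of a command, in the spirit of retry.py.
    # Usage: ./bench.sh 5 qemu-system-arm -machine type=virt ...
    runs=$1; shift
    for i in $(seq "$runs"); do
        start=$(date +%s.%N)
        "$@"
        ret=$?
        printf "run %d: ret=%d, time=%s\n" "$i" "$ret" \
            "$(echo "$(date +%s.%N) - $start" | bc)"
    done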
Measuring this over increasing -smp:

 -smp   time (s)   ideal (smp1/N)   time as bar    x faster
 ------------------------------------------------------------
   1     238.184      238.184       WWWWWWWWWWWW    1.000
   2     133.402      119.092       WWWWWWh         1.785
   3      99.531       79.395       WWWWW           2.393
   4      82.760       59.546       WWWW.           2.878
   5      82.513       47.637       WWWW.           2.887
   6      78.922       39.697       WWWH            3.018
   7      87.181       34.026       WWWW;           2.732
   8      87.098       29.773       WWWW;           2.735

So a more complete analysis shows the benefits start to tail off as
we push past 4 vCPUs. However, on my machine, which has 4 cores plus
4 hyperthreads, that could be just as much an artefact of the host
system. Indeed the results start getting noisy at 7/8 vCPUs.

Interestingly, a perf run against -smp 6 shows gic_update topping the
graph (3.14% of total execution time). That function does have a big
TODO for optimisation on it ;-)
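For reference, a profile like that can be gathered from the host side
with perf (a sketch only; it assumes a QEMU build with symbols, and
the guest options are elided here for brevity):

    # Reuse the guest command line from the runs above and profile the
    # emulator from the host while the benchmark runs, then see which
    # QEMU functions dominate (gic_update in the -smp 6 case).
    QEMU_ARGS="-machine type=virt -smp 6 -accel tcg,thread=multi" # ...etc
    perf record -g -- ./arm-softmmu/qemu-system-arm $QEMU_ARGS
    perf report --sort symbol

--
Alex Bennée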