We re-ran the numbers for a ppc64 VM, using the additional configuration details suggested. These results show the scalability gains much more clearly.
Speedup vs a single thread for kernel build

[ASCII chart omitted here; see the linked image below. X axis: Guest vCPUs
(0 to 35); Y axis: speedup over a single vCPU (1x to 7x); two series:
baseline (*) and cpu lock (#), with the cpu lock series scaling noticeably
higher than baseline.]

https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

Thanks & Regards,
-Rob

On Wed, 20 May 2020 at 11:01, Robert Foley <robert.fo...@linaro.org> wrote:
>
> On Wed, 20 May 2020 at 00:46, Emilio G. Cota <c...@braap.org> wrote:
> >
> > On Mon, May 18, 2020 at 09:46:36 -0400, Robert Foley wrote:
> >
> > Thanks for doing these tests. I know from experience that benchmarking
> > is hard and incredibly time consuming, so please do not be discouraged
> > by my comments below.
>
> Hi,
> Thanks for all the comments, and for including the script!
> These are all very helpful.
>
> We will work to replicate these results using a PPC VM,
> and will re-post them here.
>
> Thanks & Regards,
> -Rob
>
> > A couple of points:
> >
> > 1. I am not familiar with aarch64 KVM but I'd expect it to scale almost
> > like the native run. Are you assigning enough RAM to the guest? Also,
> > it can help to run the kernel build in a ramfs in the guest.
> >
> > 2. The build itself does not seem to impose a scaling limit, since
> > it scales very well when run natively (per-thread I presume aarch64 TCG
> > is still slower than native, even if TCG is run on a faster x86
> > machine). The limit here is probably aarch64 TCG. In particular, last
> > time I checked aarch64 TCG has room for improvement scalability-wise
> > handling interrupts and some TLB operations; this is likely to explain
> > why we see no benefit with per-CPU locks, i.e. the bottleneck is
> > elsewhere. This can be confirmed with the sync profiler.
> >
> > IIRC I originally used ppc64 for this test because ppc64 TCG does not
> > have any other big bottlenecks scalability-wise. I just checked but
> > unfortunately I can't find the ppc64 image I used :( What I can offer
> > is the script I used to run these benchmarks; see the appended.
> >
> > Thanks,
> >                 Emilio
> >
> > ---
> > #!/bin/bash
> >
> > set -eu
> >
> > # path to host files
> > MYHOME=/local/home/cota/src
> >
> > # guest image
> > QEMU_INST_PATH=$MYHOME/qemu-inst
> > IMG=$MYHOME/qemu/img/ppc64/ubuntu.qcow2
> >
> > ARCH=ppc64
> > COMMON_ARGS="-M pseries -nodefaults \
> >   -hda $IMG -nographic -serial stdio \
> >   -net nic -net user,hostfwd=tcp::2222-:22 \
> >   -m 48G"
> >
> > # path to this script's directory, where .txt output will be copied
> > # from the guest.
> > QELT=$MYHOME/qelt
> > HOST_PATH=$QELT/fig/kcomp
> >
> > # The guest must be able to SSH to the HOST without entering a password.
> > # The way I set this up is to have a passwordless SSH key in the guest's
> > # root user, and then copy that key's public key to the host.
> > # I used the root user because the guest runs on bootup (as root) a
> > # script that scp's run-guest.sh (see below) from the host, then
> > # executes it.
> > # This is done via a tiny script in the guest invoked from systemd once
> > # boot-up has completed.
> > HOST=f...@bar.edu
> >
> > # This is a script in the host to use an appropriate cpumask to
> > # use cores in the same socket if possible.
> > # See https://github.com/cota/cputopology-perl
> > CPUTOPO=$MYHOME/cputopology-perl
> >
> > # For each run we create this file that then the guest will SCP
> > # and execute. It is a quick and dirty way of passing arguments to the
> > # guest.
> > create_file () {
> >     TAG=$1
> >     CORES=$2
> >     NAME=$ARCH.$TAG-$CORES.txt
> >
> >     echo '#!/bin/bash' > run-guest.sh
> >     echo 'cp -r /home/cota/linux-4.18-rc7 /tmp2/linux' >> run-guest.sh
> >     echo "cd /tmp2/linux" >> run-guest.sh
> >     echo "{ time make -j $CORES vmlinux >/dev/null; } 2>>/home/cota/$NAME" >> run-guest.sh
> >     # Output with execution time is then scp'ed to the host.
> >     echo "ssh $HOST 'cat >> $HOST_PATH/$NAME' < /home/cota/$NAME" >> run-guest.sh
> >     echo "poweroff" >> run-guest.sh
> > }
> >
> > # Change here THREADS and also the TAGS that point to different QEMU
> > # installations.
> > for THREADS in 64 32 16; do
> >     for TAG in cpu-exclusive-work cputlb-no-bql per-cpu-lock cpu-has-work baseline; do
> >         QEMU=$QEMU_INST_PATH/$TAG/bin/qemu-system-$ARCH
> >         CPUMASK=$($CPUTOPO/list.pl --policy=compact-smt $THREADS)
> >
> >         create_file $TAG $THREADS
> >         time taskset -c $CPUMASK $QEMU $COMMON_ARGS -smp $THREADS
> >     done
> > done