We re-ran the numbers for a ppc64 VM, using the additional configuration details suggested. These results show the scalability gains much more clearly.
Speedup vs a single thread for kernel build

[ASCII chart omitted here; see the linked image below. X axis: Guest vCPUs
(0 to 35); Y axis: speedup over a single vCPU (1x to 7x); two series:
baseline (*) and cpu lock (#), with the cpu lock series scaling noticeably
higher than baseline.]

https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

Thanks & Regards,
-Rob

On Wed, 20 May 2020 at 11:01, Robert Foley <robert.fo...@linaro.org> wrote:
>
> On Wed, 20 May 2020 at 00:46, Emilio G. Cota <c...@braap.org> wrote:
> >
> > On Mon, May 18, 2020 at 09:46:36 -0400, Robert Foley wrote:
> >
> > Thanks for doing these tests. I know from experience that benchmarking
> > is hard and incredibly time consuming, so please do not be discouraged
> > by my comments below.
>
> Hi,
> Thanks for all the comments, and for including the script!
> These are all very helpful.
>
> We will work to replicate these results using a PPC VM,
> and will re-post them here.
>
> Thanks & Regards,
> -Rob
>
> > A couple of points:
> >
> > 1. I am not familiar with aarch64 KVM but I'd expect it to scale almost
> > like the native run. Are you assigning enough RAM to the guest? Also,
> > it can help to run the kernel build in a ramfs in the guest.
> >
> > 2. The build itself does not seem to impose a scaling limit, since
> > it scales very well when run natively (per-thread I presume aarch64 TCG
> > is still slower than native, even if TCG is run on a faster x86
> > machine). The limit here is probably aarch64 TCG. In particular, last
> > time I checked aarch64 TCG has room for improvement scalability-wise
> > handling interrupts and some TLB operations; this is likely to explain
> > why we see no benefit with per-CPU locks, i.e. the bottleneck is
> > elsewhere. This can be confirmed with the sync profiler.
> >
> > IIRC I originally used ppc64 for this test because ppc64 TCG does not
> > have any other big bottlenecks scalability-wise. I just checked but
> > unfortunately I can't find the ppc64 image I used :( What I can offer
> > is the script I used to run these benchmarks; see the appended.
> >
> > Thanks,
> >                 Emilio
> >
> > ---
> > #!/bin/bash
> >
> > set -eu
> >
> > # path to host files
> > MYHOME=/local/home/cota/src
> >
> > # guest image
> > QEMU_INST_PATH=$MYHOME/qemu-inst
> > IMG=$MYHOME/qemu/img/ppc64/ubuntu.qcow2
> >
> > ARCH=ppc64
> > COMMON_ARGS="-M pseries -nodefaults \
> >   -hda $IMG -nographic -serial stdio \
> >   -net nic -net user,hostfwd=tcp::2222-:22 \
> >   -m 48G"
> >
> > # path to this script's directory, where .txt output will be copied
> > # from the guest.
> > QELT=$MYHOME/qelt
> > HOST_PATH=$QELT/fig/kcomp
> >
> > # The guest must be able to SSH to the HOST without entering a password.
> > # The way I set this up is to have a passwordless SSH key in the guest's
> > # root user, and then copy that key's public key to the host.
> > # I used the root user because the guest runs on bootup (as root) a
> > # script that scp's run-guest.sh (see below) from the host, then
> > # executes it.
> > # This is done via a tiny script in the guest invoked from systemd once
> > # boot-up has completed.
> > HOST=f...@bar.edu
> >
> > # This is a script in the host to use an appropriate cpumask to
> > # use cores in the same socket if possible.
> > # See https://github.com/cota/cputopology-perl
> > CPUTOPO=$MYHOME/cputopology-perl
> >
> > # For each run we create this file that then the guest will SCP
> > # and execute. It is a quick and dirty way of passing arguments to the
> > # guest.
> > create_file () {
> >     TAG=$1
> >     CORES=$2
> >     NAME=$ARCH.$TAG-$CORES.txt
> >
> >     echo '#!/bin/bash' > run-guest.sh
> >     echo 'cp -r /home/cota/linux-4.18-rc7 /tmp2/linux' >> run-guest.sh
> >     echo "cd /tmp2/linux" >> run-guest.sh
> >     echo "{ time make -j $CORES vmlinux >/dev/null; } 2>>/home/cota/$NAME" >> run-guest.sh
> >     # Output with execution time is then scp'ed to the host.
> >     echo "ssh $HOST 'cat >> $HOST_PATH/$NAME' < /home/cota/$NAME" >> run-guest.sh
> >     echo "poweroff" >> run-guest.sh
> > }
> >
> > # Change here THREADS and also the TAGS that point to different QEMU
> > # installations.
> > for THREADS in 64 32 16; do
> >     for TAG in cpu-exclusive-work cputlb-no-bql per-cpu-lock cpu-has-work baseline; do
> >         QEMU=$QEMU_INST_PATH/$TAG/bin/qemu-system-$ARCH
> >         CPUMASK=$($CPUTOPO/list.pl --policy=compact-smt $THREADS)
> >
> >         create_file $TAG $THREADS
> >         time taskset -c $CPUMASK $QEMU $COMMON_ARGS -smp $THREADS
> >     done
> > done