> Date: Sat, 22 Aug 2020 22:05:44 -0500
> From: Scott Cheloha <[email protected]>
> 
> On Tue, Jul 28, 2020 at 10:02:07AM +0300, Paul Irofti wrote:
> > 
> > [...]
> > 
> > Is the issue with LFENCE slowing down the network stack settled? That was
> > the argument against it last time.
> 
> ... a month passes.  Nobody says anything.
> 
> This "it might slow down the network stack" thing keeps coming up, and
> yet nobody can point to (a) who expressed this concern or (b) what the
> penalty is in practice.
> 
> Note that the alternative is "your timecounter might not be monotonic
> between threads".  For me, that's already a dealbreaker.
> 
> But for sake of discussion let's look at some data.  For those of you
> watching from home, please follow along!  I would like to know what
> your results look like.
> 
> To start, here is a microbenchmarking program for clock_gettime(2) on
> amd64.  If you have the userspace timecounter, then
> 
>       clock_gettime(CLOCK_MONOTONIC, ...);
> 
> is a suitable surrogate for nanouptime(9), so this microbenchmark can
> actually tell us about how nanouptime(9) or nanotime(9) would be
> impacted by a comparable change in the kernel timecounter.
> 
> --
> 
> /*
>  * clock_gettime-bench.c
>  */
> #include <err.h>
> #include <limits.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
> 
> static uint64_t
> rdtsc_lfence(void)
> {
>       uint32_t hi, lo;
> 
>       __asm volatile("lfence; rdtsc; lfence" : "=d" (hi), "=a" (lo));
>       return ((uint64_t)hi << 32) | lo;
> }
> 
> int
> main(int argc, char *argv[])
> {
>       struct timespec now;
>       uint64_t begin, end;
>       long long count, i;
>       const char *errstr;
> 
>       if (argc != 2) {
>               fprintf(stderr, "usage: %s count\n", getprogname());
>               return 1;
>       }
>       count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
>       if (errstr != NULL)
>               errx(1, "count is %s: %s", errstr, argv[1]);
> 
>       begin = rdtsc_lfence();
>       for (i = 0; i < count; i++)
>               clock_gettime(CLOCK_MONOTONIC, &now);
>       end = rdtsc_lfence();
> 
>       printf("%lld\t%llu\n", count, end - begin);
> 
>       return 0;
> }
> 
> --
> 
> Now consider a benchmark of 100K clock_gettime(2) calls against the
> userspace timecounter.
> 
> $ clock_gettime-bench 100000
> 100000  15703664
> 
> Let's collect 10K of these benchmarks -- our samples -- atop an
> unpatched libc.  Use the shell script below.  Note that we throw out
> samples where we hit a context switch.
> 
> --
> 
> #! /bin/sh
> 
> [ $# -ne 1 ] && exit 1
> RESULTS=$1
> shift
> 
> TIME=$(mktemp) || exit 1
> TMP=$(mktemp) || exit 1
> 
> # Collect 10K samples.
> i=0
> while [ $i -lt 10000 ]; do
>       # Call clock_gettime(2) 100K times.
>       /usr/bin/time -l ~/scratch/clock_gettime-bench 100000 > $TMP 2> $TIME
>       # Ignore this sample if a context switch occurred.
>       if egrep -q '[1-9][0-9]* +(in)?voluntary context' $TIME; then
>               continue
>       fi
>       cat $TMP >> $RESULTS
>       i=$((i + 1))
> done
> 
> rm $TMP $TIME
> 
> --
> 
> Run it like this:
> 
> $ ksh bench.sh unpatched.out
> 
> That will take ~5-10 minutes at most.
> 
> Next, we'll patch libc to add the LFENCE to the userspace timecounter.
> 
> Index: usertc.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/arch/amd64/gen/usertc.c,v
> retrieving revision 1.2
> diff -u -p -r1.2 usertc.c
> --- usertc.c  8 Jul 2020 09:17:48 -0000       1.2
> +++ usertc.c  22 Aug 2020 22:18:47 -0000
> @@ -19,10 +19,10 @@
>  #include <sys/timetc.h>
>  
>  static inline u_int
> -rdtsc(void)
> +rdtsc_lfence(void)
>  {
>       uint32_t hi, lo;
> -     asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
> +     asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
>       return ((uint64_t)lo)|(((uint64_t)hi)<<32);
>  }
>  
> @@ -31,7 +31,7 @@ tc_get_timecount(struct timekeep *tk, u_
>  {
>       switch (tk->tk_user) {
>       case TC_TSC:
> -             *tc = rdtsc();
> +             *tc = rdtsc_lfence();
>               return 0;
>       }
>  
> --
> 
> Recompile and reinstall libc.
> 
> Then rerun the benchmark.  Be careful not to overwrite our results
> from the unpatched libc:
> 
> $ ksh bench.sh patched.out
> 
> --
> 
> Alright, now let's compare the results.  I'm not a mathemagician so I
> use ministat and trust it implicitly.  A stat jock could probably do
> this in R or with some python, but I am not that clever, so I will
> stick with ministat.
> 
> There is no ministat port for OpenBSD, but it is pretty trivial to
> clone this github repo and build it on -current:
> 
> https://github.com/thorduri/ministat
> 
> --
> 
> Okay, you have ministat?
> 
> Let's compare the results.  We want the 2nd column in the output
> (-C2).  I'm not interested in the graph (-q), given our population
> size.  We have N=10000, so let's push the CI up (-c 99.5).
> 
> $ ~/repo/ministat/ministat -C2 -q -c99.5 unpatched.out patched.out
> x unpatched.out
> + patched.out
>     N           Min           Max        Median           Avg        Stddev
> x 10000      13752102      18019218      14442398      14431918     237842.31
> + 10000      15196970      16992030      15721390      15779178     181623.5
> Difference at 99.5% confidence
>         1.34726e+06 +/- 9247.11
>         9.33528% +/- 0.064074%
>         (Student's t, pooled s = 211608)
> 
> So, in conclusion, prefixing RDTSC with LFENCE incurs a penalty.  It
> is unambiguously slower.
> 
> What is the penalty?  No more than ~10%, which translates to maybe
> ~10-15 cycles per call on my machine.
> 
> Something else worth noting is that with the LFENCE in place the
> stddev is smaller.  So, reading the clock, though slower, takes a more
> *consistent* amount of time with the LFENCE in place.
> 
> --
> 
> I don't think this penalty is so bad.
> 
> Thoughts?

Cool.  The kernel environment is different of course.  But this gives a
good idea of the potential impact.  I'd say this is ok.  I don't think
this will result in anything close to a 10% reduction in packet
forwarding.

I think you should go ahead with a change that makes both libc and the
kernel use rdtsc_lfence() for timekeeping.  I also think that
cpu_rnd_messybits() should continue to use a naked RDTSC instruction
since that gives us more jitter.

Cheers,

Mark
