> Date: Sat, 22 Aug 2020 22:05:44 -0500
> From: Scott Cheloha <[email protected]>
>
> On Tue, Jul 28, 2020 at 10:02:07AM +0300, Paul Irofti wrote:
> >
> > [...]
> >
> > Is the issue with LFENCE slowing down the network stack settled?  That
> > was the argument against it last time.
>
> ... a month passes.  Nobody says anything.
>
> This "it might slow down the network stack" thing keeps coming up, and
> yet nobody can point to (a) who expressed this concern or (b) what the
> penalty is in practice.
>
> Note that the alternative is "your timecounter might not be monotonic
> between threads".  For me, that's already a dealbreaker.
>
> But for the sake of discussion let's look at some data.  For those of
> you watching from home, please follow along!  I would like to know what
> your results look like.
>
> To start, here is a microbenchmarking program for clock_gettime(2) on
> amd64.  If you have the userspace timecounter, then
>
> 	clock_gettime(CLOCK_MONOTONIC, ...);
>
> is a suitable surrogate for nanouptime(9), so this microbenchmark can
> actually tell us how nanouptime(9) or nanotime(9) would be impacted by
> a comparable change in the kernel timecounter.
>
> --
>
> /*
>  * clock_gettime-bench.c
>  */
> #include <err.h>
> #include <limits.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <time.h>
>
> static uint64_t
> rdtsc_lfence(void)
> {
> 	uint32_t hi, lo;
>
> 	__asm volatile("lfence; rdtsc; lfence" : "=d" (hi), "=a" (lo));
> 	return ((uint64_t)hi << 32) | lo;
> }
>
> int
> main(int argc, char *argv[])
> {
> 	struct timespec now;
> 	uint64_t begin, end;
> 	long long count, i;
> 	const char *errstr;
>
> 	if (argc != 2) {
> 		fprintf(stderr, "usage: %s count\n", getprogname());
> 		return 1;
> 	}
> 	count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
> 	if (errstr != NULL)
> 		errx(1, "count is %s: %s", errstr, argv[1]);
>
> 	begin = rdtsc_lfence();
> 	for (i = 0; i < count; i++)
> 		clock_gettime(CLOCK_MONOTONIC, &now);
> 	end = rdtsc_lfence();
>
> 	printf("%lld\t%llu\n", count, end - begin);
>
> 	return 0;
> }
>
> --
>
> Now consider a benchmark of 100K clock_gettime(2) calls against the
> userspace timecounter.
>
> 	$ clock_gettime-bench 100000
> 	100000	15703664
>
> Let's collect 10K of these benchmarks -- our samples -- atop an
> unpatched libc.  Use the shell script below.  Note that we throw out
> samples where we hit a context switch.
>
> --
>
> #! /bin/sh
>
> [ $# -ne 1 ] && exit 1
> RESULTS=$1
> shift
>
> TIME=$(mktemp) || exit 1
> TMP=$(mktemp) || exit 1
>
> # Collect 10K samples.
> i=0
> while [ $i -lt 10000 ]; do
> 	# Call clock_gettime(2) 100K times.
> 	/usr/bin/time -l ~/scratch/clock_gettime-bench 100000 > $TMP 2> $TIME
> 	# Ignore this sample if a context switch occurred.
> 	if egrep -q '[1-9][0-9]* +(in)?voluntary context' $TIME; then
> 		continue
> 	fi
> 	cat $TMP >> $RESULTS
> 	i=$((i + 1))
> done
>
> rm $TMP $TIME
>
> --
>
> Run it like this:
>
> 	$ ksh bench.sh unpatched.out
>
> That will take ~5-10 minutes at most.
>
> Next, we'll patch libc to add the LFENCE to the userspace timecounter.
>
> Index: usertc.c
> ===================================================================
> RCS file: /cvs/src/lib/libc/arch/amd64/gen/usertc.c,v
> retrieving revision 1.2
> diff -u -p -r1.2 usertc.c
> --- usertc.c	8 Jul 2020 09:17:48 -0000	1.2
> +++ usertc.c	22 Aug 2020 22:18:47 -0000
> @@ -19,10 +19,10 @@
>  #include <sys/timetc.h>
>  
>  static inline u_int
> -rdtsc(void)
> +rdtsc_lfence(void)
>  {
>  	uint32_t hi, lo;
> -	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
> +	asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
>  	return ((uint64_t)lo)|(((uint64_t)hi)<<32);
>  }
>  
> @@ -31,7 +31,7 @@ tc_get_timecount(struct timekeep *tk, u_
>  {
>  	switch (tk->tk_user) {
>  	case TC_TSC:
> -		*tc = rdtsc();
> +		*tc = rdtsc_lfence();
>  		return 0;
>  	}
>  
> --
>
> Recompile and reinstall libc.
>
> Then rerun the benchmark.  Be careful not to overwrite our results
> from the unpatched libc:
>
> 	$ ksh bench.sh patched.out
>
> --
>
> Alright, now let's compare the results.  I'm not a mathemagician so I
> use ministat and trust it implicitly.  A stat jock could probably do
> this in R or with some python, but I am not that clever, so I will
> stick with ministat.
>
> There is no ministat port for OpenBSD, but it is pretty trivial to
> clone this github repo and build it on -current:
>
> https://github.com/thorduri/ministat
>
> --
>
> Okay, you have ministat?
>
> Let's compare the results.  We want the 2nd column in the output
> (-C2).  I'm not interested in the graph (-q), given our population
> size.  We have N=10000, so let's push the CI up (-c 99.5).
>
> $ ~/repo/ministat/ministat -C2 -q -c99.5 unpatched.out patched.out
> x unpatched.out
> + patched.out
>     N           Min           Max        Median           Avg        Stddev
> x 10000      13752102      18019218      14442398      14431918     237842.31
> + 10000      15196970      16992030      15721390      15779178      181623.5
> Difference at 99.5% confidence
> 	1.34726e+06 +/- 9247.11
> 	9.33528% +/- 0.064074%
> 	(Student's t, pooled s = 211608)
>
> So, in conclusion, prefixing RDTSC with LFENCE incurs a penalty.
> It is unambiguously slower.
>
> What is the penalty?  No more than ~10%, which translates to maybe
> ~10-15 cycles per call on my machine.
>
> Something else worth noting is that with the LFENCE in place the
> stddev is smaller.  So, reading the clock, though slower, takes a more
> *consistent* amount of time with the LFENCE in place.
>
> --
>
> I don't think this penalty is so bad.
>
> Thoughts?
Cool.  The kernel environment is different of course, but this gives a
good idea of the potential impact.  I'd say this is ok.  I don't think
this will result in anything close to a 10% reduction in packet
forwarding.

I think you should go ahead with a change that makes both libc and the
kernel use rdtsc_lfence() for timekeeping.

I also think that cpu_rnd_messybits() should continue to use a naked
RDTSC instruction since that gives us more jitter.

Cheers,

Mark
