On Tue, Jul 28, 2020 at 10:02:07AM +0300, Paul Irofti wrote:
>
> [...]
>
> Is the issue with LFENCE slowing down the network stack settled? That was
> the argument against it last time.
... a month passes. Nobody says anything.
This "it might slow down the network stack" thing keeps coming up, and
yet nobody can point to (a) who expressed this concern or (b) what the
penalty is in practice.
Note that the alternative is "your timecounter might not be monotonic
between threads". For me, that's already a dealbreaker.
But for sake of discussion let's look at some data. For those of you
watching from home, please follow along! I would like to know what
your results look like.
To start, here is a microbenchmarking program for clock_gettime(2) on
amd64. If you have the userspace timecounter, then
clock_gettime(CLOCK_MONOTONIC, ...);
is a suitable surrogate for nanouptime(9), so this microbenchmark can
actually tell us about how nanouptime(9) or nanotime(9) would be
impacted by a comparable change in the kernel timecounter.
--
/*
 * clock_gettime-bench.c
 */
#include <err.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint64_t
rdtsc_lfence(void)
{
	uint32_t hi, lo;

	__asm volatile("lfence; rdtsc; lfence" : "=d" (hi), "=a" (lo));
	return ((uint64_t)hi << 32) | lo;
}

int
main(int argc, char *argv[])
{
	struct timespec now;
	uint64_t begin, end;
	long long count, i;
	const char *errstr;

	if (argc != 2) {
		fprintf(stderr, "usage: %s count\n", getprogname());
		return 1;
	}
	count = strtonum(argv[1], 1, LLONG_MAX, &errstr);
	if (errstr != NULL)
		errx(1, "count is %s: %s", errstr, argv[1]);

	begin = rdtsc_lfence();
	for (i = 0; i < count; i++)
		clock_gettime(CLOCK_MONOTONIC, &now);
	end = rdtsc_lfence();

	printf("%lld\t%llu\n", count, end - begin);

	return 0;
}
--
Now consider a benchmark of 100K clock_gettime(2) calls against the
userspace timecounter.
$ clock_gettime-bench 100000
100000 15703664
Let's collect 10K of these benchmarks -- our samples -- atop an
unpatched libc. Use the shell script below. Note that we throw out
samples where we hit a context switch.
--
#! /bin/sh

[ $# -ne 1 ] && exit 1

RESULTS=$1
shift

TIME=$(mktemp) || exit 1
TMP=$(mktemp) || exit 1

# Collect 10K samples.
i=0
while [ $i -lt 10000 ]; do
	# Call clock_gettime(2) 100K times.
	/usr/bin/time -l ~/scratch/clock_gettime-bench 100000 > $TMP 2> $TIME

	# Ignore this sample if a context switch occurred.
	if egrep -q '[1-9][0-9]* +(in)?voluntary context' $TIME; then
		continue
	fi

	cat $TMP >> $RESULTS
	i=$((i + 1))
done

rm $TMP $TIME
--
Run it like this:
$ ksh bench.sh unpatched.out
That should take 5-10 minutes at most.
Next, we'll patch libc to add the LFENCE to the userspace timecounter.
Index: usertc.c
===================================================================
RCS file: /cvs/src/lib/libc/arch/amd64/gen/usertc.c,v
retrieving revision 1.2
diff -u -p -r1.2 usertc.c
--- usertc.c 8 Jul 2020 09:17:48 -0000 1.2
+++ usertc.c 22 Aug 2020 22:18:47 -0000
@@ -19,10 +19,10 @@
 #include <sys/timetc.h>
 
 static inline u_int
-rdtsc(void)
+rdtsc_lfence(void)
 {
 	uint32_t hi, lo;
 
-	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
+	asm volatile("lfence; rdtsc" : "=a"(lo), "=d"(hi));
 	return ((uint64_t)lo)|(((uint64_t)hi)<<32);
 }
 
@@ -31,7 +31,7 @@ tc_get_timecount(struct timekeep *tk, u_
 {
 	switch (tk->tk_user) {
 	case TC_TSC:
-		*tc = rdtsc();
+		*tc = rdtsc_lfence();
 		return 0;
 	}
--
Recompile and reinstall libc.
Then rerun the benchmark. Be careful not to overwrite our results
from the unpatched libc:
$ ksh bench.sh patched.out
--
Alright, now let's compare the results. I'm not a mathemagician so I
use ministat and trust it implicitly. A stat jock could probably do
this in R or with some python, but I am not that clever, so I will
stick with ministat.
There is no ministat port for OpenBSD, but it is pretty trivial to
clone this github repo and build it on -current:
https://github.com/thorduri/ministat
--
Okay, you have ministat?
Let's compare the results. We want the 2nd column in the output
(-C2). I'm not interested in the graph (-q), given our population
size. We have N=10000, so let's push the CI up (-c 99.5).
$ ~/repo/ministat/ministat -C2 -q -c99.5 unpatched.out patched.out
x unpatched.out
+ patched.out
    N           Min           Max        Median           Avg        Stddev
x 10000      13752102      18019218      14442398      14431918     237842.31
+ 10000      15196970      16992030      15721390      15779178      181623.5
Difference at 99.5% confidence
	1.34726e+06 +/- 9247.11
	9.33528% +/- 0.064074%
	(Student's t, pooled s = 211608)
So, in conclusion, prefixing RDTSC with LFENCE incurs a penalty. It
is unambiguously slower.
What is the penalty? About 9.3%, which works out to roughly 13-14
cycles per call on my machine.
Something else worth noting is that with the LFENCE in place the
stddev is smaller. So, reading the clock, though slower, takes a more
*consistent* amount of time with the LFENCE in place.
--
I don't think this penalty is so bad.
Thoughts?